Downloading Datasets#

Note

See the Glossary for the meaning of the acronyms used in this guide.

Running the examples or your own experiments requires datasets for training. The datasets used in the examples are described below, along with instructions for acquiring them with our download_data.py CLI tool.

Datasets#

Kaggle Datasets#

When downloading Kaggle datasets, whether through the scripts Dioptra provides or through the browser interface, users must sign in to Kaggle and agree to the rules of the competition, even if the competition has finished.

Furthermore, users of the CLI tool will need to set up a Kaggle API token to access some of these datasets. More information on Kaggle API tokens can be found here: https://www.kaggle.com/docs/api.
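Before running a Kaggle-backed download, it can help to confirm that credentials are in place. The sketch below is not part of Dioptra; the function name find_kaggle_credentials is hypothetical. It assumes the conventions described in the Kaggle API documentation: either the KAGGLE_USERNAME and KAGGLE_KEY environment variables, or a kaggle.json file in ~/.kaggle (or in the folder named by KAGGLE_CONFIG_DIR).

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def find_kaggle_credentials(env: Mapping[str, str], home: Path) -> Optional[str]:
    """Return a short description of where credentials were found, or None.

    Hypothetical helper: it only checks the locations assumed in the text above.
    """
    # Option 1: both environment variables are set.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment variables"

    # Option 2: a kaggle.json file in KAGGLE_CONFIG_DIR, else in ~/.kaggle.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR", home / ".kaggle"))
    token_file = config_dir / "kaggle.json"
    if token_file.is_file():
        try:
            token = json.loads(token_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            return None
        # The token file is assumed to hold "username" and "key" entries.
        if isinstance(token, dict) and token.get("username") and token.get("key"):
            return str(token_file)
    return None


if __name__ == "__main__":
    location = find_kaggle_credentials(os.environ, Path.home())
    print(location or "No Kaggle API credentials found")
```

If the check reports no credentials, follow the Kaggle documentation linked above to create a token before using the CLI tool.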

Dataset Placement#

The datasets should be downloaded into a single directory on the host machine that runs Dioptra. This folder can live anywhere on the host machine’s filesystem, but it must then be mounted into the worker containers. This documentation assumes that the datasets are stored in the /dioptra/data directory, which matches the file path used in the examples.

To work with the Dioptra examples, the datasets must be organized within the /dioptra/data folder in a specific way, which the download_data.py script handles automatically. The required directory structure for each dataset is described below.

For MNIST, the data needs to be in this format:

dioptra/
└── data/
    └── Mnist/
        ├── training/
        │   ├── 0/
        │   │   ├── 00002.png
        │   │   ├── 00005.png
        │   │   ...
        │   ├── 1/
        │   ├── 2/
        │   ...
        └── testing/
            ├── 0/
            │   ├── 00001.png
            │   ├── 00021.png
            │   ...
            ├── 1/
            ├── 2/
            ...

For Fruits360, the data needs to be in this format:

dioptra/
└── data/
    └── Fruits360/
        ├── training/
        │   ├── Apple Braeburn/
        │   │   ├── 0_100.png
        │   │   ├── 1_100.png
        │   │   ...
        │   ├── Apple Crimson Snow/
        │   ├── Apple Golden 1/
        │   ...
        └── testing/
            ├── Apple Braeburn/
            │   ├── 3_100.png
            │   ├── 4_100.png
            │   ...
            ├── Apple Crimson Snow/
            ├── Apple Golden 1/
            ...
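The MNIST and Fruits360 layouts above share the same shape: a training/ and a testing/ folder, each holding one subfolder per class. As a rough sanity check after a download, the sketch below counts the PNG files in each class folder. It is not part of Dioptra; the function name count_images_per_class is hypothetical, and it assumes every image is a .png file as shown in the trees.

```python
from pathlib import Path
from typing import Dict


def count_images_per_class(dataset_dir: Path) -> Dict[str, Dict[str, int]]:
    """Count PNG files per class folder in the training/ and testing/ splits."""
    counts: Dict[str, Dict[str, int]] = {}
    for split in ("training", "testing"):
        split_dir = dataset_dir / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"Missing split folder: {split_dir}")
        # Each immediate subfolder of a split is treated as one class.
        counts[split] = {
            class_dir.name: sum(1 for _ in class_dir.glob("*.png"))
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts
```

For example, count_images_per_class(Path("/dioptra/data/Mnist")) would return a dictionary keyed by split and then by class name. A class with a count of zero, or a missing split, suggests an incomplete download.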

For ImageNet, the data needs to be in this format:

dioptra/
└── data/
    └── ImageNet-Kaggle/
        ├── metadata/
        │   ├── image_sets/
        │   └── synset_mapping.txt
        ├── training/
        │   ├── annotations/
        │   │   ├── n01440764/
        │   │   │   ├── n01440764_10040.xml
        │   │   │   ├── n01440764_10048.xml
        │   │   │   ...
        │   │   ├── n01443537/
        │   │   ...
        │   └── images/
        │       ├── n01440764/
        │       │   ├── n01440764_10040.JPEG
        │       │   ├── n01440764_10048.JPEG
        │       │   ...
        │       ├── n01443537/
        │       ...
        └── testing/
            ├── annotations/
            │   ├── n01440764/
            │   │   ├── n01440764_10030.xml
            │   │   ├── n01440764_10031.xml
            │   │   ...
            │   ├── n01443537/
            │   ...
            └── images/
                ├── n01440764/
                │   ├── n01440764_10030.JPEG
                │   ├── n01440764_10031.JPEG
                │   ...
                ├── n01443537/
                ...

Note that the testing folder in the tree above corresponds to the val/ folder in the dataset on Kaggle, because the actual testing set does not come with labels.
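The class folders in the ImageNet tree are named by synset ID (for example n01440764), so a lookup table from ID to a readable label is often handy. The sketch below is not part of Dioptra; the function name load_synset_mapping is hypothetical. It assumes that metadata/synset_mapping.txt keeps the format of the mapping file distributed with the Kaggle competition, where each line holds a synset ID, a space, and a description. If the file is laid out differently, the parsing will need to be adjusted.

```python
from pathlib import Path
from typing import Dict


def load_synset_mapping(mapping_file: Path) -> Dict[str, str]:
    """Map synset IDs to descriptions.

    Assumes one "<synset_id> <description>" entry per line.
    """
    mapping: Dict[str, str] = {}
    for line in mapping_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first space only; descriptions may contain spaces.
        synset_id, _, description = line.partition(" ")
        mapping[synset_id] = description.strip()
    return mapping
```

Under these assumptions, load_synset_mapping(Path("/dioptra/data/ImageNet-Kaggle/metadata/synset_mapping.txt")) returns a dictionary that can translate folder names into labels.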

For the Road Signs Detection dataset, the data needs to be in this format:

dioptra/
└── data/
    └── Road-Sign-Detection-v2/
        ├── training/
        │   ├── annotations/
        │   │   ├── 00000_road2.xml
        │   │   ├── 00000_road3.xml
        │   │   ...
        │   └── images/
        │       ├── 00000_road2.png
        │       ├── 00000_road3.png
        │       ...
        └── testing/
            ├── annotations/
            │   ├── 00000_road0.xml
            │   ├── 00000_road1.xml
            │   ...
            └── images/
                ├── 00000_road0.png
                ├── 00000_road1.png
                ...
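In the Road Signs tree above, each annotation file appears to have an image with the same base name (for example 00000_road2.xml and 00000_road2.png). The sketch below is not part of Dioptra; the function name find_unpaired_files is hypothetical. Assuming that pairing by base name holds for the whole dataset, it lists any files in a split that lack a counterpart.

```python
from pathlib import Path
from typing import List, Tuple


def find_unpaired_files(split_dir: Path) -> Tuple[List[str], List[str]]:
    """Return (annotations without an image, images without an annotation).

    Files are matched by base name, which is an assumption drawn from the tree.
    """
    annotation_stems = {p.stem for p in (split_dir / "annotations").glob("*.xml")}
    image_stems = {p.stem for p in (split_dir / "images").glob("*.png")}
    return (
        sorted(annotation_stems - image_stems),
        sorted(image_stems - annotation_stems),
    )
```

Calling find_unpaired_files on the training/ or testing/ folder of the dataset should return two empty lists when every annotation has a matching image.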

Using the Download Script#

Dioptra provides the examples/scripts/download_data.py script to simplify the download and organization of these datasets. To get started, it is recommended that you create a virtual environment to manage the script’s dependencies. Open a terminal, clone the repository, navigate into the dioptra folder, then run the following:

# Move into the examples folder of cloned repo
cd ./examples

# Create a new virtual environment at ./examples/.venv
python -m venv .venv

# Activate the virtual environment
source .venv/bin/activate

# Install the dependencies
python -m pip install -r ./scripts/venvs/examples-setup-requirements.txt

Then, to run this script and download a dataset directly to the /dioptra/data directory, simply use the following:

python ./scripts/download_data.py --output /dioptra/data DATASET_NAME

For the full list of options and available datasets, run python ./scripts/download_data.py -h to display the script’s help message:

Usage: download_data.py [OPTIONS] COMMAND [ARGS]...

  Fetch a dataset used in Dioptra's examples and demos.

Options:
  --output DIRECTORY            The path to the folder where the example
                                datasets are stored. Defaults to the current
                                working directory.
  --overwrite / --no-overwrite  Fetch the data even if the target folder
                                already exists and overwrite any existing data
                                files. By default the program will exit early
                                if the target folder already exists.
  -h, --help                    Show this message and exit.

Commands:
  fruits360  Fetch the Fruits 360 dataset hosted on Kaggle.
  imagenet   Fetch the ImageNet Object Localization Challenge dataset...
  mnist      Fetch the MNIST dataset.
  roadsigns  Fetch the Road Signs Detection dataset hosted on Kaggle.

Some of the datasets have additional options, which can be viewed by running python ./scripts/download_data.py DATASET_NAME -h. For example, running python ./scripts/download_data.py fruits360 -h displays the help message for the Fruits 360 dataset shown below:

Usage: download_data.py fruits360 [OPTIONS]

  Fetch the Fruits 360 dataset hosted on Kaggle.

  This downloader uses the Kaggle API and requires the use of an API token.
  For instructions on how to obtain and use a Kaggle API token, see
  https://github.com/Kaggle/kaggle-api#api-credentials.

Options:
  --remove-zip / --no-remove-zip  Remove/keep the dataset zip file after
                                  extracting it. By default it will be
                                  removed.
  -h, --help                      Show this message and exit.
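When several datasets are needed, the command line can be assembled programmatically. The sketch below is not part of Dioptra; the function name build_download_command is hypothetical. It uses only the options shown in the help messages above, and it places the global options before the dataset name and any dataset-specific options after it. It assumes the current working directory is the examples folder.

```python
import shlex
import sys
from typing import List, Sequence


def build_download_command(
    dataset: str,
    output_dir: str,
    overwrite: bool = False,
    dataset_options: Sequence[str] = (),
    script: str = "./scripts/download_data.py",
) -> List[str]:
    """Assemble the argument list for one download_data.py invocation."""
    return [
        sys.executable,  # the interpreter of the active virtual environment
        script,
        "--output",
        output_dir,
        "--overwrite" if overwrite else "--no-overwrite",
        dataset,
        *dataset_options,  # e.g. ["--no-remove-zip"] for fruits360
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for name in ("mnist", "fruits360"):
        print(shlex.join(build_download_command(name, "/dioptra/data")))
```

Each returned list can be passed to subprocess.run(command, check=True) to perform the download once the printed commands look right.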

Some example usages are shown below.

Example usage: MNIST#

python ./scripts/download_data.py --output /dioptra/data --overwrite mnist

Downloads the MNIST dataset to /dioptra/data/Mnist, overwriting any existing dataset at that location.

Example usage: Fruits360#

python ./scripts/download_data.py --output /dioptra/data --no-overwrite fruits360 --no-remove-zip

Downloads the Fruits360 dataset to /dioptra/data/Fruits360 and keeps the downloaded zip file. If the target folder already exists, the script exits early without overwriting it.

Important

If you receive a 403 error when downloading the Fruits360 dataset, you most likely still need to accept the competition rules for that dataset on the Kaggle website.

Example usage: ImageNet#

Warning

The ImageNet downloader is currently under construction and does not yet function as described here.

python ./scripts/download_data.py --output /dioptra/data --overwrite imagenet --remove-zip

Downloads the ImageNet dataset to /dioptra/data/ImageNet-Kaggle, overwriting any existing dataset at that location and removing the downloaded zip file afterward.

Important

If you receive a 403 error when downloading the ImageNet dataset, you most likely still need to accept the competition rules for that dataset on the Kaggle website.

Example usage: Road Signs#

python ./scripts/download_data.py --output /dioptra/data --overwrite roadsigns --no-remove-zip

Downloads the Road Signs dataset to /dioptra/data/Road-Sign-Detection-v2, overwriting any existing dataset at that location but keeping the downloaded zip file.

Important

If you receive a 403 error when downloading the Road Signs dataset, you most likely still need to accept the competition rules for that dataset on the Kaggle website.