Downloading Datasets#
Note
See the Glossary for the meaning of the acronyms used in this guide.
When running any of the examples or custom experiments, it will be necessary to acquire datasets for training. A list of datasets used in the examples is presented below, along with instructions on how to use our download_data.py CLI tool to acquire them.
Datasets#
- MNIST Handwritten Digit Dataset
Kaggle Datasets#
When downloading Kaggle Datasets either through the custom scripts Dioptra provides or through the browser interface, users will need to sign in to Kaggle, and also agree to the rules of the competition, even if the competition has finished.
Furthermore, users of the CLI tool will need to setup a Kaggle API token to be able to access some of these datasets. More information on the Kaggle API tokens can be found here: https://www.kaggle.com/docs/api.
Dataset Placement#
The datasets should be downloaded and organized in the same directory on the host machine that is running Dioptra.
This folder can be stored anywhere on your host machine’s filesystem (this folder will then need to be mounted into the worker containers).
For the sake of this documentation, we assume that the datasets are stored in the /dioptra/data
directory so that it matches with the filepath also used in the examples.
To use the aforementioned datasets with the Dioptra examples, they will need to be organized in the /dioptra/data
folder in a specific way, which the download_data.py script will handle automatically for you.
The required directory structure for each of the datasets is described below.
For MNIST, the data needs to be in this format:
dioptra/
└── data/
└── Mnist/
├── training/
│ ├── 0/
│ │ ├── 00002.png
│ │ ├── 00005.png
│ │ ...
│ ├── 1/
│ ├── 2/
│ ...
└── testing/
├── 0/
│ ├── 00001.png
│ ├── 00021.png
│ ...
├── 1/
├── 2/
...
For Fruits360, the data needs to be in this format:
dioptra/
└── data/
└── Fruits360/
├── training/
│ ├── Apple Braeburn/
│ │ ├── 0_100.png
│ │ ├── 1_100.png
│ │ ...
│ ├── Apple Crimson Snow/
│ ├── Apple Golden 1/
│ ...
└── testing/
├── Apple Braeburn/
│ ├── 3_100.png
│ ├── 4_100.png
│ ...
├── Apple Crimson Snow/
├── Apple Golden 1/
...
For ImageNet, the data needs to be in this format:
dioptra/
└── data/
└── ImageNet-Kaggle/
├── metadata/
│ ├── image_sets/
│ └── synset_mapping.txt
├── training/
│ ├── annotations/
│ │ ├── n01440764/
│ │ │ ├── n01440764_10040.xml
│ │ │ ├── n01440764_10048.xml
│ │ │ ...
│ │ ├── n01443537/
│ │ ...
│ └── images/
│ ├── n01440764/
│ │ ├── n01440764_10040.JPEG
│ │ ├── n01440764_10048.JPEG
│ │ ...
│ ├── n01443537/
│ ...
└── testing/
├── annotations/
│ ├── n01440764/
│ │ ├── n01440764_10030.xml
│ │ ├── n01440764_10031.xml
│ │ ...
│ ├── n01443537/
│ ...
└── images/
├── n01440764/
│ ├── n01440764_10030.JPEG
│ ├── n01440764_10031.JPEG
│ ...
├── n01443537/
...
Please note that the testing folder in the above tree structure is actually the val/
folder in the dataset on Kaggle, as the actual testing set does not come with labels.
For the Road Signs Detection dataset, the data needs to be in this format:
dioptra/
└── data/
└── Road-Sign-Detection-v2/
├── training/
│ ├── annotations/
│ │ ├── 00000_road2.xml
│ │ ├── 00000_road3.xml
│ │ ...
│ └── images/
│ ├── 00000_road2.png
│ ├── 00000_road3.png
│ ...
└── testing/
├── annotations/
│ ├── 00000_road0.xml
│ ├── 00000_road1.xml
│ ...
└── images/
├── 00000_road0.png
├── 00000_road1.png
...
Using the Download Script#
Dioptra provides the examples/scripts/download_data.py script to simplify the download and organization of these datasets.
To get started, it is recommended that you create a virtual environment to manage the script’s dependencies.
Open a terminal, clone the repository, navigate into the dioptra
folder, then run the following:
# Move into the examples folder of cloned repo
cd ./examples
# Create a new virtual environment at ./examples/.venv
python -m venv .venv
# Activate the virtual environment
source .venv/bin/activate
# Install the dependencies
python -m pip install -r ./scripts/venvs/examples-setup-requirements.txt
Then, to run this script and download a dataset directly to the /dioptra/data
directory, simply use the following:
python ./scripts/download_data.py --output /dioptra/data DATASET_NAME
For the full list of options and available datasets, run python ./scripts/download_data.py -h
to display the script’s help message:
Usage: download_data.py [OPTIONS] COMMAND [ARGS]...
Fetch a dataset used in Dioptra's examples and demos.
Options:
--output DIRECTORY The path to the folder where the example
datasets are stored. Defaults to the current
working directory.
--overwrite / --no-overwrite Fetch the data even if the target folder
already exists and overwrite any existing data
files. By default the program will exit early
if the target folder already exists.
-h, --help Show this message and exit.
Commands:
fruits360 Fetch the Fruits 360 dataset hosted on Kaggle.
imagenet Fetch the ImageNet Object Localization Challenge dataset...
mnist Fetch the MNIST dataset.
roadsigns Fetch the Road Signs Detection dataset hosted on Kaggle.
Please note that some of the datasets have additional options that can be viewed by running python ./scripts/download_data.py DATASET -h
.
For example, running python ./scripts/download_data.py fruits360 -h
displays the help message for the Fruits 360 dataset shown below:
Usage: download_data.py fruits360 [OPTIONS]
Fetch the Fruits 360 dataset hosted on Kaggle.
This downloader uses the Kaggle API and requires the use of an API token.
For instructions on how to obtain and use a Kaggle API token, see
https://github.com/Kaggle/kaggle-api#api-credentials.
Options:
--remove-zip / --no-remove-zip Remove/keep the dataset zip file after
extracting it. By default it will be
removed.
-h, --help Show this message and exit.
Some example usages are shown below.
Example usage: MNIST#
python ./scripts/download_data.py --output /dioptra/data --overwrite mnist
Downloads the MNIST dataset to /dioptra/data/Mnist
, overwriting an existing dataset at that location if it exists.
Example usage: Fruits360#
python ./scripts/download_data.py --output /dioptra/data --no-overwrite fruits360 --no-remove-zip
Downloads the Fruits360 dataset to /dioptra/data/Fruits360
, without overwriting an existing dataset at that location if it exists.
Important
If you receive a 403 error when downloading the Fruits360 dataset, it is likely that you need to accept the rules of the competition for the dataset you are downloading on the Kaggle website.
Example usage: ImageNet#
Warning
The ImageNet downloader is currently under construction and does not yet function as described here.
python ./scripts/download_data.py --output /dioptra/data --overwrite imagenet --remove-zip
Downloads the ImageNet dataset to /dioptra/data/ImageNet-Kaggle
, overwriting the existing dataset at that location, and removing the zip file downloaded in the process.
Important
If you receive a 403 error when downloading the ImageNet dataset, it is likely that you need to accept the rules of the competition for the dataset you are downloading on the Kaggle website.
Example usage: Road Signs#
python ./scripts/download_data.py --output /dioptra/data --overwrite roadsigns --no-remove-zip
Downloads the Road Signs dataset to /dioptra/data/Road-Signs-Detection-v2
, overwriting the existing dataset at that location, but leaving the zip file downloaded in the process.
Important
If you receive a 403 error when downloading the Road Signs dataset, it is likely that you need to accept the rules of the competition for the dataset you are downloading on the Kaggle website.