Add a Dataset

To make datasets or other resources available when running experiments, they must be placed in a directory accessible to your workers. This directory can be stored anywhere on your host machine’s filesystem and then mounted into the worker containers.

Once the dataset directory is configured as part of the Dioptra deployment, adding data is as simple as placing the data into that directory. Data added to the datasets directory is immediately accessible to plugin tasks by referencing the /dioptra/data location.
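As a quick sanity check, a plugin task could confirm that the expected data is visible inside the worker container. The sketch below is only an illustration and not part of Dioptra's API: the helper name `list_dataset_dirs` is hypothetical, and it assumes each dataset lives in its own top-level subdirectory of the mounted location.

```python
from pathlib import Path


def list_dataset_dirs(data_root: str = "/dioptra/data") -> list[str]:
    """Return the sorted names of the top-level subdirectories under data_root.

    Hypothetical helper: assumes one subdirectory per dataset.
    """
    root = Path(data_root)
    if not root.is_dir():
        # The mount is missing or misconfigured in this container.
        raise FileNotFoundError(f"dataset directory not found: {root}")
    # Plain files at the top level are ignored; only directories are reported.
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Calling `list_dataset_dirs()` with no arguments inspects /dioptra/data; passing another path makes the helper easy to try on the host before the directory is mounted.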

Dioptra provides the examples/scripts/download_data.py script to simplify downloading and organizing datasets. The script uses tensorflow_datasets (tfds) both as a source of publicly available datasets and as the tool that downloads and prepares the data for use.

Important

The provided download_data.py script is not the only way to acquire datasets for use in Dioptra. It is simply a convenient tool to access a wide variety of publicly available datasets.

To list the available datasets, run:

uv run ./examples/scripts/download_data.py list

Then, to download and add a dataset directly to the /dioptra/data directory, run:

uv run ./examples/scripts/download_data.py download --data-dir /dioptra/data DATASET_NAME
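When several datasets are needed, the command above can be scripted. The following is a sketch under stated assumptions: `build_download_command` and `download_all` are hypothetical helpers rather than part of Dioptra, the script is assumed to be invoked from the repository root, and any dataset names passed in should be ones reported by the `list` subcommand.

```python
import subprocess

# Path taken from the command shown above; assumes the current working
# directory is the root of the Dioptra repository.
SCRIPT = "./examples/scripts/download_data.py"


def build_download_command(
    dataset_name: str, data_dir: str = "/dioptra/data"
) -> list[str]:
    """Build the argument list equivalent to the documented download command."""
    return ["uv", "run", SCRIPT, "download", "--data-dir", data_dir, dataset_name]


def download_all(dataset_names: list[str], data_dir: str = "/dioptra/data") -> None:
    """Run the download command once per dataset, stopping on the first failure."""
    for name in dataset_names:
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(build_download_command(name, data_dir), check=True)
```

Passing the arguments as a list avoids shell quoting issues, and separating command construction from execution lets the command be inspected or logged before anything is downloaded.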

For the full list of options, run the following to display the script’s help message:

uv run ./examples/scripts/download_data.py --help