Entry Points#

Note

See the Glossary for the meaning of the acronyms used in this guide.

Tip

Instructions for how to create your own entry points are available in the Creating a New Entry Point guide.

What is an Entry Point?#

[Figure: Components of an entry point (entry-point-components.svg)]

The term entry point, in the context of running experiment jobs, refers to an executable script or binary that is paired with information about its parameters. Entry points are the fundamental unit of work within the Testbed: each job submitted to the Testbed selects one entry point to run. Dioptra derives its modularity in part from a convention for composing new entry points: identify related units of work, for example applying one of many evasion attacks to generate a batch of adversarial images, and make those units interchangeable by implementing the corresponding executable scripts so that they share a common set of inputs and outputs. This convention guided the construction of all the example experiments distributed as part of this project. The SDK library and the task plugins system are both provided to help Testbed users apply this convention to their own experiments.

Note

This particular usage for the term entry point originates with the MLFlow library, and we have adopted it for this project since Dioptra uses MLFlow on the backend to provide job tracking capabilities.

Each implementation of a Testbed entry point requires, at minimum, two separate files: a YAML-formatted MLproject file and an executable Python script that can set its internal parameters via command-line options (e.g. using argparse or click). These files are the topics of the following sections.
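As a concrete illustration, the files for the example entry points used throughout this guide could be laid out as follows (the top-level directory name is hypothetical),

my_experiment/
├── MLproject      # declares the entry points and their parameters
└── src/
    ├── train.py   # executable script for the train entry point
    └── infer.py   # executable script for the infer entry point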

MLproject Specifications#

As was mentioned before, the MLproject file is a plain text file in YAML syntax that declares an entry point’s executable command and its available parameters. An example of an MLproject file is below,

name: My Project

entry_points:
  train:
    parameters:
      data_dir: { type: path, default: "/nfs/data" }
      image_size: { type: string, default: "28,28,1" }
    command: >
      python src/train.py
      --data-dir {data_dir}
      --image-size {image_size}
  infer:
    parameters:
      run_id: { type: string }
      image_size: { type: string, default: "28,28,1" }
    command: >
      python src/infer.py
      --run-id {run_id}
      --image-size {image_size}

As we can see, there are two entry points in this MLproject file, train and infer. If we submit a job that selects the train entry point and uses its default parameters, we will end up running the following command in the Testbed environment,

python src/train.py --data-dir /nfs/data --image-size 28,28,1
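Submitting a job that overrides a default substitutes the new value into the command template in the same way. For example, overriding image_size with the (hypothetical) value 32,32,3 would run,

python src/train.py --data-dir /nfs/data --image-size 32,32,3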

Users should note that each of the declared parameters must specify a data type. You can specify just the data type by writing the following in your MLproject file,

parameter_name: data_type

A default value, in contrast, is not required, but users are encouraged to provide one wherever possible. There are two equivalent ways to specify both a data type and a default value in the MLproject file,

# Short syntax
parameter_name: {type: data_type, default: value}

# Long syntax
parameter_name:
  type: data_type
  default: value

The MLproject file supports four parameter types, some of which are handled in a special way (for example, the path data type will download certain files to local storage). Any parameters passed to a job that are not declared in the MLproject file are treated as string. The parameter types are:

string

A text string.

float

A real number. The parameter is validated as a number at runtime.

path

A path on the local file system. Any relative path parameters will be converted to absolute paths. Any paths passed as distributed storage URIs (s3://, dbfs://, gs://, etc.) will be downloaded to local files. Use this type for programs that can only read local files.

uri

A URI for data either in a local or distributed storage system. Relative paths are converted to absolute paths, as in the path type. Use this type for programs that know how to read from distributed storage (e.g., programs that use the boto3 package to directly access S3 storage).
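To illustrate all four types together, a hypothetical parameters block might look like the following,

parameters:
  experiment_name: { type: string, default: "mnist-baseline" }  # plain text
  learning_rate: { type: float, default: 0.001 }                # validated as a number at runtime
  data_dir: { type: path, default: "s3://datasets/mnist" }      # downloaded to a local file path
  model_uri: { type: uri, default: "s3://models/le_net/1" }     # passed through for direct reads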

Executable Script#

The entry point script, in principle, is just an executable Python script that accepts command-line options, so Testbed users can get started quickly by using their pre-existing Python scripts. However, if users wish to make use of the Testbed’s powerful job tracking and task plugin capabilities, they will need to adopt the Testbed’s standard for writing entry point scripts outlined in this section.

Attention

The Testbed SDK, in a planned future release, will extend the MLproject specification to facilitate the templated generation of entry point scripts. Users will have an easier time migrating their scripts to this new approach if they follow the Testbed's standard for entry point scripts when creating their own entry points.

Setting Parameters#

Testbed users should use the click library to create command-line interfaces for their Python scripts and to convert data types that the MLproject file does not support (bool and list, for instance). The following is a short example based on the train entry point from the MLproject example we considered earlier in this guide,

# src/train.py
import os

import click
from prefect.utilities.logging import get_logger as get_prefect_logger

from dioptra.sdk.utilities.contexts import plugin_dirs
from dioptra.sdk.utilities.logging import (
    StderrLogStream,
    StdoutLogStream,
    attach_stdout_stream_handler,
    clear_logger_handlers,
    configure_structlog,
    set_logging_level,
)


def _coerce_comma_separated_ints(ctx, param, value):
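    # Click callback: convert a string like "28,28,1" into the tuple (28, 28, 1)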
    return tuple(int(x.strip()) for x in value.split(","))

@click.command()
@click.option(
    "--data-dir",
    type=click.Path(
        exists=True, file_okay=False, dir_okay=True, resolve_path=True, readable=True
    ),
    help="Root directory for shared datasets",
)
@click.option(
    "--image-size",
    type=click.STRING,
    callback=_coerce_comma_separated_ints,
    help="Dimensions for the input images",
)
def train(data_dir, image_size):
    ...


if __name__ == "__main__":
    log_level = os.getenv("DIOPTRA_JOB_LOG_LEVEL", default="INFO")
    as_json = True if os.getenv("DIOPTRA_JOB_LOG_AS_JSON") else False

    clear_logger_handlers(get_prefect_logger())
    attach_stdout_stream_handler(as_json)
    set_logging_level(log_level)
    configure_structlog()

    with plugin_dirs(), StdoutLogStream(as_json), StderrLogStream(as_json):
        _ = train()

Here, Click validates our inputs by checking that --image-size is passed a string and that --data-dir points to a directory that exists and is readable. We also define a callback function for --image-size that converts a string of comma-separated integers into a tuple, i.e. it transforms "28,28,1" into (28, 28, 1). The code underneath the if __name__ == "__main__": block at the end ensures that the python src/train.py command specified in the MLproject file will call the train() function with the values passed via the --data-dir and --image-size command-line options.

Important

While most of the code underneath the if __name__ == "__main__": block is for configuring the script’s logger, the context created by with plugin_dirs(): plays a different and very important role, which will be discussed in the following guide on task plugins.

This small example only scratches the surface of what Click can do. Testbed users are encouraged to peruse the Click documentation to learn more about its features: https://click.palletsprojects.com/en/7.x/
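As one more illustration of converting types the MLproject file cannot express, a boolean parameter can be declared as a string in the MLproject file and coerced by Click at the command-line boundary. The following is a minimal sketch using a hypothetical --shuffle option,

import click


@click.command()
@click.option(
    "--shuffle",
    type=click.BOOL,
    default=True,
    help="Shuffle the dataset (arrives as a string such as 'true' or 'false')",
)
def train(shuffle):
    # click.BOOL coerces strings like "true"/"false" and "1"/"0" into a bool
    ...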

MLFlow - Tracking Runs#

Every entry point script needs to invoke mlflow.start_run() to create an active run context for MLFlow, and this should be done near the top of the entry point function. This context is needed when logging results and artifacts to the MLFlow Tracking service. The following example shows how this context would be started in the train() function from the previous section.

import mlflow

# Truncated...
def train(data_dir, image_size):
    # Only use this when training a model
    mlflow.autolog()

    # Start the active run context for MLFlow
    with mlflow.start_run() as active_run:
        flow = init_flow()
        state = flow.run(parameters=dict(data_dir=data_dir, image_size=image_size))

    return state

Within this context block, the active_run variable will contain an mlflow.entities.Run object that provides useful metadata about the run. MLFlow functions like mlflow.log_param(), mlflow.log_metric(), and mlflow.log_artifact() will infer the current run automatically and log their data to the appropriate place. Please note that the init_flow() function is introduced in the following section.
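For illustration, a minimal sketch of these logging calls inside the active run context (the parameter, metric, and artifact names here are hypothetical),

import mlflow

with mlflow.start_run() as active_run:
    run_id = active_run.info.run_id  # metadata such as the run's unique identifier

    # Each call below infers the current run automatically
    mlflow.log_param("image_size", "28,28,1")
    mlflow.log_metric("accuracy", 0.97)
    mlflow.log_artifact("model_weights.h5")  # path to a file that must already exist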

Testbed users are encouraged to peruse the MLflow Tracking documentation to learn more about the tracking context and the kinds of things you can do when it’s active: https://mlflow.org/docs/latest/tracking.html.

Prefect - Task Execution#

The main work done within an entry point needs to use the Flow class from the Prefect library to create a context for assembling the entry point script's task workflow. Prefect is a modern workflow library aimed at helping data scientists set up task execution graphs with minimal changes to their existing code, and in Dioptra it provides the framework for wiring task plugins together. The following example shows the beginnings of a Flow context to be run by the train() function in the previous section.

from prefect import Flow, Parameter
from dioptra import pyplugs

_PLUGINS_IMPORT_PATH: str = "dioptra_builtins"


def init_flow() -> Flow:
    with Flow("Image Resizer") as flow:
        data_dir, image_size = Parameter("data_dir"), Parameter("image_size")
        resize_output = pyplugs.call_task(
            f"{_PLUGINS_IMPORT_PATH}.data",
            "images",
            "resize",
            data_dir=data_dir,
            image_size=image_size,
        )
        ...

This example illustrates the requirement that all of the input parameters for an entry point must be declared using the prefect.Parameter class. It also introduces our first task plugin call, pyplugs.call_task(). The anatomy of this call will be discussed in the next section of the user guide, so for now, users just need to know that this is how task plugins are used within the Testbed, and that the Testbed standard is for all function calls within the Flow context to be invocations of pyplugs.call_task().
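To give a sense of how pyplugs.call_task() invocations are wired together into a task graph, the following hypothetical continuation of the flow above feeds resize_output into a second task plugin (the "save" plugin name is invented for illustration),

from prefect import Flow, Parameter
from dioptra import pyplugs

_PLUGINS_IMPORT_PATH: str = "dioptra_builtins"


def init_flow() -> Flow:
    with Flow("Image Resizer") as flow:
        data_dir, image_size = Parameter("data_dir"), Parameter("image_size")
        resize_output = pyplugs.call_task(
            f"{_PLUGINS_IMPORT_PATH}.data",
            "images",
            "resize",
            data_dir=data_dir,
            image_size=image_size,
        )
        # Passing one task's output to another adds an edge to the graph, so
        # Prefect will run "resize" before the (hypothetical) "save" plugin
        pyplugs.call_task(
            f"{_PLUGINS_IMPORT_PATH}.data",
            "images",
            "save",
            images=resize_output,
        )

    return flow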