Entry Points
Note
See the Glossary for the meaning of the acronyms used in this guide.
Tip
Instructions for creating your own entry points are available in the Creating a New Entry Point guide.
What is an Entry Point?
The term entry point, in the context of running experiment jobs, refers to an executable script or binary that is paired with information about its parameters. Entry points are the fundamental unit of work within the Testbed: each job submitted to the Testbed selects one entry point to run. Dioptra derives its modularity in part from a convention for composing new entry points: identify related units of work (for example, applying one of many evasion attacks to generate a batch of adversarial images) and make them interchangeable with one another by implementing the corresponding executable scripts so that they share a common set of inputs and outputs. This convention guided the construction of all the example experiments distributed as part of this project, and both the SDK library and the task plugins system are provided to help Testbed users apply it to their own experiments.
Note
This particular usage of the term entry point originates with the MLflow library, and we have adopted it for this project since Dioptra uses MLflow on the backend to provide job tracking capabilities.
Each implementation of a Testbed entry point requires, at minimum, two separate files: a YAML-formatted MLproject file and an executable Python script that can set its internal parameters via command-line options (e.g. using argparse or click). These files are the topics of the following sections.
MLproject Specifications
As mentioned before, the MLproject file is a plain text file in YAML syntax that declares an entry point's executable command and its available parameters. An example of an MLproject file is shown below:
name: My Project

entry_points:
  train:
    parameters:
      data_dir: { type: path, default: "/nfs/data" }
      image_size: { type: string, default: "28,28,1" }
    command: >
      python src/train.py
      --data-dir {data_dir}
      --image-size {image_size}

  infer:
    parameters:
      run_id: { type: string }
      image_size: { type: string, default: "28,28,1" }
    command: >
      python src/infer.py
      --run-id {run_id}
      --image-size {image_size}
As we can see, there are two entry points in this MLproject file: train and infer. If we submit a job that selects the train entry point and uses its default parameters, we will end up running the following command in the Testbed environment:
python src/train.py --data-dir /nfs/data --image-size 28,28,1
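Since Dioptra delegates job execution to MLflow on the backend, an entry point declared this way can also be exercised directly with the MLflow CLI during local development. The invocation below is a hypothetical example: the -e flag selects the entry point and -P overrides one of the declared parameters.

mlflow run . -e train -P image_size=32,32,1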
Users should note that each declared parameter must specify a data type. You can specify just the data type by writing the following in your MLproject file:
parameter_name: data_type
A default value, in contrast, is not required, but users are encouraged to provide one wherever possible.
There are two equivalent ways to specify both a data type and a default value in the MLproject file:
# Short syntax
parameter_name: {type: data_type, default: value}

# Long syntax
parameter_name:
  type: data_type
  default: value
The MLproject file supports four parameter types, some of which are handled in a special way (for example, the path data type will download certain files to local storage). Any undeclared parameters are treated as string. The parameter types are:
- string: A text string.
- float: A real number. The parameter will be checked at runtime to confirm that it is a number.
- path: A path on the local file system. Any relative path parameters will be converted to absolute paths. Any paths passed as distributed storage URIs (s3://, dbfs://, gs://, etc.) will be downloaded to local files. Use this type for programs that can only read local files.
- uri: A URI for data in either a local or distributed storage system. Relative paths are converted to absolute paths, as in the path type. Use this type for programs that know how to read from distributed storage (e.g., programs that use the boto3 package to directly access S3 storage).
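To make the four types concrete, the parameters block below declares one parameter of each type; the parameter names and default values are made up purely for illustration.

parameters:
  random_seed: { type: string, default: "42" }
  learning_rate: { type: float, default: 0.01 }
  data_dir: { type: path, default: "/nfs/data" }
  dataset_uri: { type: uri, default: "s3://mybucket/datasets/mnist" }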
Executable Script
The entry point script, in principle, is just an executable Python script that accepts command-line options, so Testbed users can get started quickly by using their preexisting Python scripts. However, if users wish to make use of the Testbed’s powerful job tracking and task plugin capabilities, they will need to adopt the Testbed’s standard for writing entry point scripts outlined in this section.
Attention
The Testbed SDK, in a planned future release, will extend the MLproject specification to facilitate the templated generation of entry point scripts. Users will have an easier time migrating their scripts to this new approach if they follow the Testbed's standard for entry point scripts when creating their own entry points.
Setting Parameters
Users should use the click library to create the command-line interfaces for their Python scripts and to convert data types that aren't supported by the MLproject file (bool and list, for instance). The following is a short example based on the train entry point from the MLproject example we considered earlier in this guide:
# src/train.py
import os

import click
from prefect.utilities.logging import get_logger as get_prefect_logger

from dioptra.sdk.utilities.contexts import plugin_dirs
from dioptra.sdk.utilities.logging import (
    StderrLogStream,
    StdoutLogStream,
    attach_stdout_stream_handler,
    clear_logger_handlers,
    configure_structlog,
    set_logging_level,
)


def _coerce_comma_separated_ints(ctx, param, value):
    # Click callback: convert "28,28,1" into the tuple (28, 28, 1)
    return tuple(int(x.strip()) for x in value.split(","))


@click.command()
@click.option(
    "--data-dir",
    type=click.Path(
        exists=True, file_okay=False, dir_okay=True, resolve_path=True, readable=True
    ),
    help="Root directory for shared datasets",
)
@click.option(
    "--image-size",
    type=click.STRING,
    callback=_coerce_comma_separated_ints,
    help="Dimensions for the input images",
)
def train(data_dir, image_size):
    ...


if __name__ == "__main__":
    # Read the logging configuration from the job's environment variables
    log_level = os.getenv("DIOPTRA_JOB_LOG_LEVEL", default="INFO")
    as_json = True if os.getenv("DIOPTRA_JOB_LOG_AS_JSON") else False

    clear_logger_handlers(get_prefect_logger())
    attach_stdout_stream_handler(as_json)
    set_logging_level(log_level)
    configure_structlog()

    with plugin_dirs(), StdoutLogStream(as_json), StderrLogStream(as_json):
        _ = train()
Here, Click is validating our inputs by checking that --image-size is passed a string and that --data-dir points to a directory that exists and is readable. We also define a callback function for --image-size that converts a string of comma-separated integers into a tuple, i.e. it transforms "28,28,1" into (28, 28, 1).
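As a quick sanity check, the callback can be exercised on its own; the ctx and param arguments are unused by the function, so None placeholders suffice in this hypothetical snippet:

# Hypothetical standalone check of the callback's conversion logic
assert _coerce_comma_separated_ints(None, None, "28,28,1") == (28, 28, 1)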
The code underneath the if __name__ == "__main__": block at the end ensures that the python src/train.py command specified in the MLproject file will call the train() function and use the values passed via the --data-dir and --image-size command-line options.
Important
While most of the code underneath the if __name__ == "__main__": block is for configuring the script's logger, the context created by with plugin_dirs(): plays a different and very important role, which will be discussed in the following guide on task plugins.
This small example only scratches the surface of what Click can do. Testbed users are encouraged to peruse the Click documentation to learn more about its features: https://click.palletsprojects.com/en/7.x/
MLflow - Tracking Runs
Every entry point script needs to invoke mlflow.start_run() to create an active run context for MLflow, and this should be done near the top of the entry point function. The context is needed when logging results and artifacts to the MLflow Tracking service. The following example shows how this context would be started in the train() function from the previous section.
import mlflow

# Truncated...


def train(data_dir, image_size):
    # Only use this when training a model
    mlflow.autolog()

    # Start the active run context for MLflow
    with mlflow.start_run() as active_run:
        flow = init_flow()
        state = flow.run(parameters=dict(data_dir=data_dir, image_size=image_size))

    return state
Within this context block, the active_run variable will contain an mlflow.entities.Run object that provides useful metadata about the run. MLflow functions like mlflow.log_param(), mlflow.log_metric(), and mlflow.log_artifact() will infer the current run automatically and log their data to the appropriate place.
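For instance, parameters, metrics, and files can all be recorded while the run context is active; the names, values, and file path below are hypothetical placeholders used purely for illustration:

with mlflow.start_run() as active_run:
    mlflow.log_param("image_size", "28,28,1")  # record an input parameter
    mlflow.log_metric("accuracy", 0.97)        # record a numeric result
    mlflow.log_artifact("model_weights.h5")    # upload a file produced by the job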
Please note that the init_flow() function is introduced in the following section.
Testbed users are encouraged to peruse the MLflow Tracking documentation to learn more about the tracking context and the kinds of things you can do when it’s active: https://mlflow.org/docs/latest/tracking.html.
Prefect - Task Execution
The main work done within an entry point needs to use the Flow class from the Prefect library to create a context for assembling the entry point script's task workflow. Prefect is a modern workflow library aimed at helping data scientists set up task execution graphs with minimal changes to their existing code, and in Dioptra it provides a framework for wiring task plugins together. The following example shows the beginnings of a Flow context to be run by the train() function in the previous section.
from prefect import Flow, Parameter

from dioptra import pyplugs

_PLUGINS_IMPORT_PATH: str = "dioptra_builtins"


def init_flow() -> Flow:
    with Flow("Image Resizer") as flow:
        # Declare the entry point's input parameters
        data_dir, image_size = Parameter("data_dir"), Parameter("image_size")

        # Invoke the builtin "resize" task plugin
        resize_output = pyplugs.call_task(
            f"{_PLUGINS_IMPORT_PATH}.data",
            "images",
            "resize",
            data_dir=data_dir,
            image_size=image_size,
        )
        ...
This example illustrates the requirement that all of an entry point's input parameters must be declared using the prefect.Parameter class. It also introduces our first task plugin call, pyplugs.call_task(). The anatomy of this call will be discussed in the next section of the user guide, so for now, users just need to know that this is how task plugins are used within the Testbed, and that the Testbed standard is for all function calls within the Flow context to be invocations of pyplugs.call_task().
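To give a feel for how tasks chain together inside the Flow context, the sketch below adds a second pyplugs.call_task() invocation that consumes resize_output; the plugin module and function names here are hypothetical assumptions, not actual Testbed builtins. Because the first task's output is passed as an input to the second, Prefect infers the dependency between the two tasks automatically:

        # Hypothetical follow-on task; the module, plugin, and function
        # names below are illustrative assumptions
        train_output = pyplugs.call_task(
            f"{_PLUGINS_IMPORT_PATH}.estimators",
            "train",
            "fit_model",
            images=resize_output,
        )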