Manage Requirements¶
This page walks through the steps to manage requirements for your workflow.
It assumes you have followed the previous steps in the Getting Started page.
Make a New Workflow¶
Let's start a new project that depends on the results of the previous
one. Make a new folder called second-workflow, and copy the contents from
Make Your First Workflow into it. Then, rename first-workflow.rme.yml
to second-workflow.rme.yml and update the release title to second-workflow.
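From a shell, those steps could look something like the sketch below, assuming the project from Make Your First Workflow lives in a sibling folder called first-workflow; the release title still has to be edited by hand afterwards.
# create the new project folder and copy the previous workflow into it
mkdir second-workflow
cp -r first-workflow/. second-workflow/
# rename the workflow file, then edit its title field by hand
mv second-workflow/first-workflow.rme.yml second-workflow/second-workflow.rme.yml
The updated second-workflow.rme.yml should then look like this: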
release:
  # name of the release
  title: 'second-workflow'
  # version tag
  version: '0.1.0'
  # make custom include patterns
  includes:
    - '{my_result}'
  ignores:
    - 'inputs/'
datasets:
  my_input: 'inputs/input-file.txt'
  my_result: 'results/output-file.txt'
jobs:
  copy-file:
    inputs:
      - 'my_input'
    outputs:
      - 'my_result'
    commands:
      - ['cp', '{my_input}', '{my_result}']
Add Workflow Requirements¶
Requirements are declared by adding a requires-releases field to one of two places:
- Add a project-wide requirement by adding it to rmeproject.yml
- Add a workflow-specific requirement by adding it to second-workflow.rme.yml
A project is a folder that may contain more than one workflow. A project-wide requirement is shared by every workflow in the project. A workflow-specific requirement is used only by the workflow it is declared in.
Requirements use version specifiers to pin or constrain the versions that satisfy them, for example:
some-release==0.1.0
some-release>=0.1.0
some-release>=0.1.0, <0.2.0
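A project-wide requirement uses the same field and specifier syntax, just placed in rmeproject.yml instead. As a sketch (the release name some-release is only a placeholder):
# rmeproject.yml: requirements shared by every workflow in the project
requires-releases:
  - 'some-release>=0.1.0, <0.2.0'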
Make first-workflow a requirement of second-workflow by adding it
to the requires-releases field in second-workflow.rme.yml.
requires-releases:
  - 'first-workflow>=0.1.0'
release:
  # name of the release
  title: 'second-workflow'
  # version tag
  version: '0.1.0'
  # make custom include patterns
  includes:
    - '{my_result}'
  ignores:
    - 'inputs/'
datasets:
  my_input: 'inputs/input-file.txt'
  my_result: 'results/output-file.txt'
jobs:
  copy-file:
    inputs:
      - 'my_input'
    outputs:
      - 'my_result'
    commands:
      - ['cp', '{my_input}', '{my_result}']
Run the sync command to sync your new project.
rme sync second-workflow
This will create a data environment folder called data_env inside your project. The data environment
will store shortcuts to all the data you’ve declared in your requirements file.
Datasets exist within the context of a project folder, so that project folder structure is recreated and soft-linked inside the .packages folder. If you
go into data_env/.packages/first-workflow-v0.1.3 you will see the project we made previously. Individual files are soft links
to the actual files, which are stored in a cache that rme manages.
In the root folder of data_env/ are the registry folders. Each of these contains only the datasets released by a particular
workflow, renamed to match the dataset names defined in that workflow's file. This is the intended access point for data
within a particular workflow. The extensions of the original files (if they have one) are copied onto these registry links to make it easier
to tell what type of file each one is.
A requirements file can declare multiple versions of the same workflow, so folders in the registry and in the .packages folder are versioned. For convenience, however, the registry also duplicates the most recent version of each workflow under a name without a version identifier. This way, you can point your code at these "most recent" links: if the datasets are updated to newer versions, your code will automatically pull from the most recent release.
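To make the layout concrete, the data environment for this example might look roughly like the sketch below; the exact folder and link names depend on which versions of first-workflow have been released, so treat this as illustrative rather than exact.
data_env/
├── .packages/
│   └── first-workflow-v0.1.3/      # recreated project folder; files are soft links into the rme cache
│       └── results/
│           └── output-file.txt
├── first-workflow-v0.1.3/          # versioned registry folder
│   └── my_result.txt               # link named after the dataset, original extension kept
└── first-workflow/                 # duplicate of the most recent version, no version identifier
    └── my_result.txt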
Add File System Requirements¶
It’s also possible to declare dependencies on files stored on your local file system
(or on a network drive). You can do that with a requires-files field. You declare
a folder you want to group the files under, then assign the files to that folder.
This lets you organize files spread out across multiple folders and network drive locations in a way that is logical and coherent to your project, without having to duplicate the data.
While it's preferred to have external data dependencies derived from release packages, so their provenance can be inferred from the package metadata, in cases where that isn't possible this at least provides a way to track where your data comes from in a reproducible way.
This example creates a folder in data_env called my-files, and inside that folder a file
called my-file.ext and a subfolder called some-sub-folder that links to the files on
a network drive stored in ${network-drive-path}/path/to/folder.
requires-files:
  # group under a folder name space
  my-files:
    # link out to files
    my-file: 'path/to/external/file.ext'
    some-sub-folder: '${network-drive-path}/path/to/folder'
Using environment variables lets you keep sensitive information, such as network-drive paths, out of the config file.
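The resulting structure in the data environment would then look something like this sketch:
data_env/
└── my-files/
    ├── my-file.ext                 # links to path/to/external/file.ext
    └── some-sub-folder/            # links to ${network-drive-path}/path/to/folder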
Run the New Workflow¶
Update the datasets field of our second workflow so that my_input points to the my_result
dataset of our first workflow.
datasets:
  my_input: 'data_env/first-workflow/my_result.txt'
  my_result: 'results/output-file.txt'
If you run our second workflow,
rme run second-workflow
you will notice that the environment is synchronized. The core idea is that rme will always attempt to synchronize your data environment with the definition of your data requirements.
In addition, you will notice that after synchronizing your environment for the first time, there will
be a file called second-workflow.rme.lock. The lock file stores the solution to your data requirements
for a given workflow.
For example, our requirement first-workflow>=0.1.0 is satisfied by any first-workflow release with
a version greater than or equal to 0.1.0. To make it possible to recreate an analysis, the lock file stores the exact solution
to that requirement found at runtime. If you wish to update a dataset, you will need to update the lock file.