Manage Requirements

This page walks you through managing requirements for your workflow.

It assumes you have followed the previous steps in the Getting Started page.

Make a New Workflow

Let's start a new project that depends on the results of the previous one. Make a new folder called second-workflow and copy the contents from Make Your First Workflow into it. Then rename first-workflow.rme.yml to second-workflow.rme.yml and update the release title to second-workflow.
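
On a Unix-like system, that might look like the following (a sketch, assuming the first workflow lives in a sibling folder named first-workflow):

mkdir second-workflow
cp -r first-workflow/. second-workflow/
mv second-workflow/first-workflow.rme.yml second-workflow/second-workflow.rme.yml

The updated workflow file should look like this: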

release:
  # name of the release
  title: 'second-workflow'
  # version tag
  version: '0.1.0'
  # make custom include patterns
  includes:
    - '{my_result}'
  ignores:
    - 'inputs/'

datasets:
  my_input: 'inputs/input-file.txt'
  my_result: 'results/output-file.txt'

jobs:
  copy-file:
    inputs:
      - 'my_input'
    outputs:
      - 'my_result'
    commands:
      - ['cp', '{my_input}', '{my_result}']

Add Workflow Requirements

Requirements are declared by adding a requires-releases field in one of two places:

  1. Declare a project-wide requirement by adding it to rmeproject.yml

  2. Declare a workflow-specific requirement by adding it to second-workflow.rme.yml

A project is a folder that may contain more than one workflow. A project-wide requirement is shared by every workflow in the project, while a workflow-specific requirement applies only to the workflow it is declared in.

A requirement pairs a release name with a version specifier. For example:

  • some-release==0.1.0

  • some-release>=0.1.0

  • some-release>=0.1.0, <0.2.0
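
A project-wide requirement goes in rmeproject.yml, for example (a sketch; this assumes requires-releases takes the same form there as in a workflow file):

# rmeproject.yml
requires-releases:
  - 'some-release>=0.1.0'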

Make first-workflow a requirement of second-workflow by adding it to the requires-releases field in second-workflow.rme.yml.

requires-releases:
  - 'first-workflow>=0.1.0'

release:
  # name of the release
  title: 'second-workflow'
  # version tag
  version: '0.1.0'
  # make custom include patterns
  includes:
    - '{my_result}'
  ignores:
    - 'inputs/'

datasets:
  my_input: 'inputs/input-file.txt'
  my_result: 'results/output-file.txt'

jobs:
  copy-file:
    inputs:
      - 'my_input'
    outputs:
      - 'my_result'
    commands:
      - ['cp', '{my_input}', '{my_result}']

Run the sync command to synchronize your new project.

rme sync second-workflow

This will create a data environment folder called data_env inside your project. The data environment will store shortcuts to all the data you’ve declared in your requirements file.

Datasets exist within the context of a project folder, so that project folder structure is recreated and soft-linked inside the .packages folder. If you go into data_env/.packages/first-workflow-v0.1.0 you will see the project we made previously. Individual files are soft links to the actual files, which are stored in a cache that rme manages.

In the root folder of data_env/ are the registry folders. Each registry folder contains only the datasets released by a particular workflow, renamed to match the dataset names defined in that workflow's file. These registry links are the intended access point for data within a particular workflow. The extension of each original dataset (if it has one) is copied onto its registry link to make it easier to identify what types of files are there.

A requirements file can declare multiple versions of the same workflow, so folders in the registry and in .packages are versioned. For convenience, however, the registry also duplicates the most recent version of each workflow under a name without a version identifier. If you structure your code to use these “most recent” links, it will automatically pull from the newest version whenever the datasets are updated.
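
For this example, the data environment should look roughly like this (a sketch of the layout described above):

data_env/
  .packages/
    first-workflow-v0.1.0/   # recreated project folder, soft-linked into the rme cache
  first-workflow-v0.1.0/     # versioned registry folder
    my_result.txt
  first-workflow/            # “most recent” registry folder
    my_result.txt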

Add File System Requirements

It’s also possible to declare dependencies on files stored on your local file system (or on a network drive). You can do that with a requires-files field. You declare a folder you want to group the files under, then assign the files to that folder.

This lets you organize files spread out across multiple folders and network drive locations in a way that is logical and coherent to your project, without having to duplicate the data.

While it’s preferred that external data dependencies be derived from release packages, so their provenance can be inferred from the package metadata, in cases where that isn’t possible this at least provides a reproducible way to track where your data comes from.

This example creates a folder in data_env called my-files, and inside that folder a file called my-file.ext and a subfolder called some-sub-folder that links to the files on a network drive stored in ${network-drive-path}/path/to/folder.

requires-files:
  # group under a folder name space
  my-files:
    # link out to files
    my-file: 'path/to/external/file.ext'
    some-sub-folder: '${network-drive-path}/path/to/folder'
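
Based on the example above, the resulting layout in data_env would look roughly like this:

data_env/
  my-files/
    my-file.ext -> path/to/external/file.ext
    some-sub-folder -> ${network-drive-path}/path/to/folder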

Using environment variables keeps sensitive information, like network-drive paths, out of the config file.

Run the New Workflow

Update the datasets field of our second workflow so that my_input points to the my_result dataset of our first workflow.

datasets:
  my_input: 'data_env/first-workflow/my_result.txt'
  my_result: 'results/output-file.txt'

If we run our second workflow,

rme run second-workflow

you will notice that the environment is synchronized. The core idea is that rme will always attempt to synchronize your data environment with the definition of your data requirements.

In addition, you will notice that after synchronizing your environment for the first time, there will be a file called second-workflow.rme.lock. The lock file stores the solution to your data requirements for a given workflow.

For example, our requirement first-workflow>=0.1.0 is satisfied by any first-workflow release with a version greater than or equal to 0.1.0. To facilitate recreating analyses, the lock file stores the exact solution to that requirement found at runtime. If you wish to update a dataset, you will need to update the lock file.
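
To illustrate the idea, the lock file for this workflow might pin the solved version like so (a sketch only; the actual format rme writes may differ):

# second-workflow.rme.lock (illustrative sketch)
requires-releases:
  - 'first-workflow==0.1.0'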