Make Your First Workflow

This tutorial walks through setting up a project, writing a workflow, and releasing it to an archive (without any dependencies).

Create a Project

Make a folder (this is your project) and, in that folder, create a file called rmeproject.yml. This file stores project settings, which define options shared by every workflow in the project. For now, leave it empty.
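
As a minimal sketch, the step above could be done from a terminal like this. The folder name my-project is a hypothetical placeholder; any name should work.

```shell
# Hypothetical project folder name; pick whatever suits your project.
project_dir="my-project"
mkdir -p "$project_dir"

# Create the project settings file and leave it empty for now.
touch "$project_dir/rmeproject.yml"
```

The remaining steps in this tutorial assume commands are run from inside that folder.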

Make a Workflow File

Make a workflow file anywhere in your project with the extension .rme.yml. The rest of the filename can be anything. This example uses first-workflow.rme.yml.

Describe the Release

Add a release field in first-workflow.rme.yml.

release:
  # name of the release
  title: 'first-workflow'
  # version tag
  version: '0.1.0'

Add Datasets

Datasets are files and/or folders within a project that are intended to be bundled with a particular release. They are the inputs and outputs of jobs, which together make up the workflow.

Add a datasets field in first-workflow.rme.yml:

datasets:
  my_input: 'inputs/input-file.txt'
  my_output: 'results/output-file.txt'
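
The dataset paths above point at files that do not exist yet. A minimal sketch for creating them, run from the project root, might look like this. The file contents are placeholder text chosen for illustration.

```shell
# Run from the project root.
# Create the folders that the dataset paths refer to.
mkdir -p inputs results

# Write some placeholder text into the input file (contents are arbitrary).
printf 'hello from the input file\n' > inputs/input-file.txt
```

The output file is left for the workflow to create.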

Add Jobs

A job is a set of terminal commands associated with input and output datasets.

Add a jobs field and define a job called copy-file that takes my_input and produces my_output. The cp command copies the file given as its first argument to the path given as its second.

Wrapping a dataset name from the datasets field in curly braces inside a string substitutes that dataset's file path.

jobs:
  copy-file:
    inputs:
      - my_input
    outputs:
      - my_output
    commands:
      - ['cp', '{my_input}', '{my_output}']
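
To see what the cp command amounts to once the curly-brace names are replaced, here is a hand-written shell equivalent. It is an illustration only, not something rme generates. It assumes my_input resolves to inputs/input-file.txt and my_output to results/output-file.txt, and it mimics the substitution with ordinary shell variables.

```shell
# Illustration only: stand-ins for the dataset paths (assumed values).
my_input='inputs/input-file.txt'
my_output='results/output-file.txt'

# Make sure the folders and an input file exist so the copy can succeed.
mkdir -p inputs results
[ -f "$my_input" ] || printf 'placeholder\n' > "$my_input"

# Equivalent of ['cp', '{my_input}', '{my_output}'] after substitution.
cp "$my_input" "$my_output"
```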

Environment Variables

Environment variables can be defined for jobs and declared as requirements of a workflow. To declare them, add a requires-env field with a list of environment variables. They can then be used in jobs by writing ${VARIABLE} anywhere in a job configuration.

Environment variables can be defined in a .env file; rme loads them and uses them to expand variables in the workflow file. For rme to discover them, the file must be in one of these locations:

  1. A .env file at the project’s root directory, for project-level environment variables.

  2. A .env file in the same directory as the workflow, for workflow-level environment variables. These override project-level environment variables.
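
The override order can be pictured with a small shell sketch. This only mimics the precedence described above and is not how rme itself loads files. The folder layout, the OTHER variable, and the values are hypothetical, and sourcing a .env file this way works only for simple KEY=VALUE lines.

```shell
# Illustration of the override order only; all names and values are made up.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/workflows"
printf 'MY_NAME=PROJECT\nOTHER=kept\n' > "$demo_dir/.env"
printf 'MY_NAME=WORKFLOW\n' > "$demo_dir/workflows/.env"

set -a                         # export every variable assigned while sourcing
. "$demo_dir/.env"             # project-level values are read first
. "$demo_dir/workflows/.env"   # workflow-level values are read second, so they win
set +a

echo "MY_NAME=$MY_NAME OTHER=$OTHER"
# prints: MY_NAME=WORKFLOW OTHER=kept
```

Variables defined only at the project level stay available, while any name defined at both levels takes the workflow-level value.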

Create a file in the project directory called .env and add the following to it:

MY_NAME=READER

Then declare that environment variable as a requirement and add an echo command to the copy-file job.

requires-env:
# For something
- MY_NAME

jobs:
  copy-file:
    inputs:
      - my_input
    outputs:
      - my_output
    commands:
      - ['echo', 'hello ${MY_NAME}!']
      - ['cp', '{my_input}', '{my_output}']

Run Jobs

You don’t need to include the file extension when running workflows, but you do need to give the path to the workflow relative to the root directory of your project.

rme run first-workflow

The runner prints the status of each job and its datasets, but not the job's standard output, unless a command returns a failing status code. Commands within a job run sequentially, and each is checked for a successful status code. The standard output of each job is logged to a file.

You can view the log with the rme log command.

rme log copy-file

Mapping the Release

When you release a workflow, you release the datasets and the project structure that it was created in.

rme respects .gitignore rules when determining what files to include in the release, except for datasets which are included by default.

Create a file called .gitignore at the root of the project. The .env file may contain sensitive information, and the .rme folder contains runtime information that rme uses. Neither should be committed to version control or shared with a release, so add both to .gitignore:

.rme/
.env

Then add the following to results/.gitignore to ignore the contents of the results folder while keeping the folder itself under version control.

*
!.gitignore
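
As a sketch, both ignore files could be written from the project root as follows. The .env entry is written without a trailing slash because, under gitignore conventions, a pattern ending in a slash matches only directories, and .env is a file.

```shell
# Run from the project root. Appends, so an existing .gitignore is kept.
printf '.rme/\n.env\n' >> .gitignore

# Ignore everything inside results/ except the .gitignore file itself,
# which keeps the otherwise empty folder under version control.
mkdir -p results
printf '*\n!.gitignore\n' >> results/.gitignore
```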

Check the release mapping with:

rme map first-workflow

You should see a printout of everything that will be included in the release.

We can change which files are included in or excluded from the release by adding the release.ignores and release.includes fields. These fields can use curly braces to substitute dataset paths, and they follow the same glob pattern conventions as gitignore, relative to the project’s root directory.

In this case, we opt in to include the output file and choose to ignore the input file. Files matched by .gitignore are left out of the release by default, so if you want a gitignored dataset to be part of the release, make sure to add it here.

release:
  # name of the release
  title: 'first-workflow'
  # version tag
  version: '0.1.0'
  # make custom include patterns
  includes:
    - '{my_output}'
  ignores:
    - 'inputs/'

At this point your workflow file should look like this:

release:
  # name of the release
  title: 'first-workflow'
  # version tag
  version: '0.1.0'
  # make custom include patterns
  includes:
    - '{my_output}'
  ignores:
    - 'inputs/'

datasets:
  my_input: 'inputs/input-file.txt'
  my_output: 'results/output-file.txt'

requires-env:
# For something
- MY_NAME

jobs:
  copy-file:
    inputs:
      - my_input
    outputs:
      - my_output
    commands:
      - ['echo', 'hello ${MY_NAME}!']
      - ['cp', '{my_input}', '{my_output}']

Run the map command again and check that the input file is ignored, and the output file is included.

rme map first-workflow

Publish a Release

To publish a release, use the release command, replacing <archive_host> with the path to the archive you are publishing to. The default workspace is the Global Public Workspace.

rme release first-workflow <archive_host>