# Recipes

## Library generation

### Protein sequences to peptide library

To generate a library of peptides, which is typically the first step in generating a peptide
spectral library, use the program `fasta2peptides`. This program takes protein sequences
in fasta format and generates a peptide library in parquet format.

* to change the name of the input file, specify `input.file=myfilename.fasta` on the
command line.
  * note that `fasta2peptides` expects the fasta header lines to have the format `>db|UniqueIdentifier|EntryName`.
* to change the name of the output file, specify `output.file=myfilename.parquet` on
the command line.
* the program supports the following options:
  * `protein.cleavage.digest=tryptic` where the digest can be tryptic, semitryptic, or nonspecific
  * `protein.cleavage.max_missed=1` which is the number of missed cleavages allowed
  * `peptide.charge.min=2` and `peptide.charge.max=4` sets the range of charges generated
  * `peptide.mods.fixed=Carbamidomethyl` is a list of fixed modifications, using a [string format](#modification-specification).
  * `peptide.mods.variable=Phospho{S/T}#Oxidation#Acetyl{^}` is a list of variable modifications, using a [string format](#modification-specification).
  * `peptide.length.min=7` and `peptide.length.max=30` are the minimum and maximum sizes of
  the peptide generated.
  * `peptide.nce=[30]` is a list of NCE values to generate per peptide
  * `peptide.use_basic_limit=True` limits the max charge of a peptide to the number of basic residues

To get additional help on options for this program, run the program using the `-h` option.

#### An example command line to convert `uniprot.fasta` to `uniprot_peptides.parquet`

```bash
fasta2peptides input.file=uniprot.fasta output.file=uniprot_peptides.parquet
```

## Library import

Masskit computational pipelines operate on standardized parquet and arrow files.
These open source columnar data stores allow for high performance from vectorization
and parallelization and, by checking and correcting
data at import and placing it in well specified fields, modularizes and
simplifies computational tasks by avoiding data errors that can arise in
computational pipelines that depend on ill-specified file formats. The
command line program `batch_converter` is used to load different file formats
into standardized parquet and arrow files and to convert these standardized
files into common file formats. For performance, `batch_converter` is
parallelized and operates on batches so that it can handle any size of
file without exhausting memory.

### SDF Molfiles to small molecule libraries

To convert an SDF file (also known as a Molfile) into parquet format, use a
command line of the format:

```bash
batch_converter input.file.names=my_sdf.sdf output.file.name=my_sdf output.file.types=[parquet]
```

To get additional help on options for this program, run the program using the `-h` option.

#### SDF files with incorrect encoding or pre-v2000 format

Some SDF files include characters encoded using non-ASCII encodings,
such as Latin-1 (ISO-8859-1) while rdkit and python support ASCII and
UTF-8 (unicode).  Other SDF files are written in a pre version v2000
format that does not include 'M  END' section separators.  To address
these issues, use the command line program:

```bash
rewrite_sdf input.file.name=my_input.sdf output.file.name=my_output.sdf
```

If the encoding is not latin-1, set the `input.file.encoding` option
to the encoding used.

### CSV file to small molecule libraries

To convert an CSV file that includes SMILES molecular specifications into
parquet format, use a command line of the format:

```bash
batch_converter input.file.names=my_csv.csv output.file.name=my_csv output.file.types=[parquet]
```

Options for csv parsing:

* `conversion.csv.no_column_headers`, set this to true if the csv file does not have column headers. In this case, the columns will be named `f0`, `f1`, ...
* the default SMILES column name is `SMILES` if there is a header and `f0` if not. To change the column name, set `conversion.csv.smiles_column_name`
* use `conversion.csv.delimiter`, set to the column delimiter.  Tab delimited is `conversion.csv.delimiter="\t"`
* the default rdkit Mol column name is "mol". To change this, use `conversion.csv.mol_column_name`

To get additional help on options for this program, run the program using the `-h` option.

#### Example of reading in a headerless tab delimited file with the SMILES in the second column

```bash
batch_converter input.file.names=my_csv.csv output.file.name=my_csv output.file.types=[parquet] conversion.csv.no_column_headers=true conversion.csv.smiles_column_name=f1 conversion.csv.delimiter="\t"
```

## Modification specification

Modifications in Masskit are taken from [Unimod](https://www.unimod.org) and identified using either
the `Interim name` for naming by string or the `Accession #` for naming by integer.

Site encoding of a modification:

* `A`-`Y` amino acid

which can be appended with a modification position encoding:

* `0` peptide N-terminus
* `.` peptide C-terminus
* `^` protein N-terminus
* `$` protein C-terminus

So that `K.` means lysine at the C-terminus of the peptide.
The position encoding can be used separately, e.g. `^` means apply to any protein N-terminus,
regardless of amino acid

A list of modifications is separated by hashes:
`Phospho{S}#Methyl{0/I}#Carbamidomethyl#Deamidated{F^/Q/N}`

An optional list of sites is specified within the `{}` for each modification.
If there are no `{}` then a default set of sites is used.  
Multiple sites are separated by a `/`.

Note that this string may have be escaped when
using a command line like bash, e.g. `peptide.mods.fixed='"TMT6plex#Carbamidomethyl"'`