Recipes

Library generation

Protein sequences to peptide library

To generate a library of peptides, which is typically the first step in generating a peptide spectral library, use the program fasta2peptides. This program takes protein sequences in fasta format and generates a peptide library in parquet format.

  • to change the name of the input file, specify input.file=myfilename.fasta on the command line.

    • note that fasta2peptides expects the fasta header lines to have the format >db|UniqueIdentifier|EntryName.

  • to change the name of the output file, specify output.file=myfilename.parquet on the command line.

  • the program supports the following options:

    • protein.cleavage.digest=tryptic where the digest can be tryptic, semitryptic, or nonspecific

    • protein.cleavage.max_missed=1 which is the number of missed cleavages allowed

    • peptide.charge.min=2 and peptide.charge.max=4 set the range of charges generated

    • peptide.mods.fixed=Carbamidomethyl is a list of fixed modifications, written in the string format described under Modification specification.

    • peptide.mods.variable=Phospho{S/T}#Oxidation#Acetyl{^} is a list of variable modifications, written in the same string format.

    • peptide.length.min=7 and peptide.length.max=30 are the minimum and maximum lengths of the peptides generated.

    • peptide.nce=[30] is a list of normalized collision energy (NCE) values to generate per peptide

    • peptide.use_basic_limit=True limits the max charge of a peptide to the number of basic residues
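
To make the digest, missed-cleavage, and length options above concrete, here is a minimal Python sketch of a fully tryptic digest. It is not the fasta2peptides implementation: the function name tryptic_peptides is hypothetical, and the cleavage rule (cut after every K or R, with no proline exception) is an assumption, since the exact rule is not stated here.

```python
def tryptic_peptides(protein, max_missed=1, min_len=7, max_len=30):
    """Return fully tryptic peptides, ASSUMING cleavage after every K or R."""
    # Split the protein into fragments that end at each K/R (or at the C-terminus).
    fragments = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])

    # Join up to max_missed + 1 consecutive fragments, then apply the length filter.
    peptides = []
    for first in range(len(fragments)):
        for last in range(first, min(first + max_missed + 1, len(fragments))):
            peptide = "".join(fragments[first:last + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.append(peptide)
    return peptides


print(tryptic_peptides("MKTAYIAKQRQISFVK", max_missed=1))
# prints ['MKTAYIAK', 'TAYIAKQR', 'QRQISFVK']
```

With max_missed=0 and the default minimum length of 7, the same sequence yields no peptides, because every fragment between cleavage sites is shorter than 7 residues.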

To get additional help on options for this program, run the program using the -h option.

An example command line to convert uniprot.fasta to uniprot_peptides.parquet:

fasta2peptides input.file=uniprot.fasta output.file=uniprot_peptides.parquet
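
Because fasta2peptides expects header lines of the form >db|UniqueIdentifier|EntryName, it can be useful to check a FASTA file before running it. The sketch below is one way to do that in Python. The function parse_fasta_header is hypothetical and not part of Masskit, and it assumes that any free-text description after the first whitespace can be ignored.

```python
def parse_fasta_header(line):
    """Split a '>db|UniqueIdentifier|EntryName ...' header into its three fields."""
    if not line.startswith(">"):
        raise ValueError("FASTA header lines must start with '>'")
    # ASSUMPTION: anything after the first whitespace is a free-text description.
    fields = line[1:].split(None, 1)
    token = fields[0] if fields else ""
    parts = token.split("|")
    if len(parts) != 3:
        raise ValueError(f"expected db|UniqueIdentifier|EntryName, got {token!r}")
    db, unique_id, entry_name = parts
    return {"db": db, "id": unique_id, "entry_name": entry_name}


print(parse_fasta_header(">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha"))
# prints {'db': 'sp', 'id': 'P69905', 'entry_name': 'HBA_HUMAN'}
```

A header that does not contain exactly three pipe-separated fields raises ValueError, which makes malformed entries easy to spot before conversion.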

Library import

Masskit computational pipelines operate on standardized parquet and arrow files. These open source columnar data stores give high performance through vectorization and parallelization. Because data is checked and corrected at import and placed in well-specified fields, downstream computational tasks become simpler and more modular, and avoid the data errors that arise in pipelines that depend on ill-specified file formats. The command line program batch_converter loads different file formats into standardized parquet and arrow files and converts these standardized files into common file formats. For performance, batch_converter is parallelized and operates on batches, so it can handle files of any size without exhausting memory.

SDF Molfiles to small molecule libraries

To convert an SDF file (also known as a Molfile) into parquet format, use a command line of the format:

batch_converter input.file.names=my_sdf.sdf output.file.name=my_sdf output.file.types=[parquet]

To get additional help on options for this program, run the program using the -h option.

SDF files with incorrect encoding or pre-v2000 format

Some SDF files contain characters in non-ASCII encodings such as Latin-1 (ISO-8859-1), whereas RDKit and Python expect ASCII or UTF-8 (Unicode). Other SDF files are written in a pre-V2000 format that does not include the ‘M END’ section separators. To address both issues, use the command line program:

rewrite_sdf input.file.name=my_input.sdf output.file.name=my_output.sdf

If the input encoding is not Latin-1, set the input.file.encoding option to the encoding used.
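
To illustrate the encoding half of the problem only, the following Python sketch re-encodes a text file from Latin-1 to UTF-8 using the standard library. It is not how rewrite_sdf is implemented, it does not add missing ‘M END’ separators, and the function name reencode_text_file is hypothetical.

```python
def reencode_text_file(src, dst, src_encoding="latin-1"):
    """Read src in src_encoding and write the same text to dst as UTF-8."""
    # newline="" leaves the original line endings untouched on read and write.
    with open(src, "r", encoding=src_encoding, newline="") as infile:
        text = infile.read()
    with open(dst, "w", encoding="utf-8", newline="") as outfile:
        outfile.write(text)
```

Calling reencode_text_file("my_input.sdf", "my_output.sdf") would rewrite a Latin-1 file as UTF-8. Note that every byte sequence is valid Latin-1, so a wrong src_encoding produces garbled characters rather than an error.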

CSV file to small molecule libraries

To convert a CSV file that includes SMILES molecular specifications into parquet format, use a command line of the format:

batch_converter input.file.names=my_csv.csv output.file.name=my_csv output.file.types=[parquet]

Options for csv parsing:

  • set conversion.csv.no_column_headers=true if the csv file does not have column headers. In this case, the columns are named f0, f1, …

  • the default SMILES column name is SMILES if there is a header and f0 if not. To change the column name, set conversion.csv.smiles_column_name

  • set conversion.csv.delimiter to the column delimiter. For a tab-delimited file, use conversion.csv.delimiter="\t"

  • the default name of the column that holds the RDKit Mol is “mol”. To change this, set conversion.csv.mol_column_name

To get additional help on options for this program, run the program using the -h option.

Example of reading in a headerless, tab-delimited file with the SMILES in the second column:

batch_converter input.file.names=my_csv.csv output.file.name=my_csv output.file.types=[parquet] conversion.csv.no_column_headers=true conversion.csv.smiles_column_name=f1 conversion.csv.delimiter="\t"
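
The f0, f1, … naming is zero-based, which is why the second column in the example above is f1. The Python sketch below shows that mapping with the standard csv module. It only illustrates the naming convention and is not the batch_converter parser; the function read_headerless and the sample rows are made up for the example.

```python
import csv
import io


def read_headerless(text, delimiter="\t"):
    """Parse delimited text with no header row, naming columns f0, f1, ..."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        # Column names are generated from the zero-based column position.
        rows.append({f"f{i}": value for i, value in enumerate(fields)})
    return rows


# Hypothetical two-column file: compound name, then SMILES.
sample = "ethanol\tCCO\nbenzene\tc1ccccc1\n"
rows = read_headerless(sample)
print([row["f1"] for row in rows])
# prints ['CCO', 'c1ccccc1']
```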

Modification specification

Modifications in Masskit are taken from Unimod. A modification is identified either by string, using its Unimod Interim name, or by integer, using its Unimod Accession #.

Site encoding of a modification:

  • A-Y amino acid

which can be appended with a modification position encoding:

  • 0 peptide N-terminus

  • . peptide C-terminus

  • ^ protein N-terminus

  • $ protein C-terminus

For example, K. means lysine at the C-terminus of the peptide. The position encoding can also be used on its own, e.g. ^ applies to any protein N-terminus, regardless of amino acid.

A list of modifications is separated by hashes: Phospho{S}#Methyl{0/I}#Carbamidomethyl#Deamidated{F^/Q/N}

An optional list of sites is specified within the {} for each modification. If there are no {} then a default set of sites is used.
Multiple sites are separated by a /.

Note that this string may have to be escaped when using a shell such as bash, e.g. peptide.mods.fixed='"TMT6plex#Carbamidomethyl"'
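
As a rough illustration of the string format described above (hash-separated modifications, optional {} site lists, sites separated by /), here is a small Python parser. It is a sketch of the syntax only, not Masskit's own parser. The function parse_mod_string is hypothetical, it does not validate names against Unimod, and it returns None for the site list where Masskit would apply its default set of sites.

```python
import re

# A modification name, optionally followed by a {...} list of sites.
_MOD_PATTERN = re.compile(r"^(?P<name>[^{}#]+?)(?:\{(?P<sites>[^{}]*)\})?$")


def parse_mod_string(spec):
    """Split 'Name{site/site}#Name' into a list of (name, sites) tuples."""
    mods = []
    for item in spec.split("#"):
        match = _MOD_PATTERN.match(item.strip())
        if match is None:
            raise ValueError(f"cannot parse modification {item!r}")
        sites = match.group("sites")
        # None means no {} was given, i.e. the default sites would be used.
        site_list = [s for s in sites.split("/") if s] if sites is not None else None
        mods.append((match.group("name"), site_list))
    return mods


print(parse_mod_string("Phospho{S}#Methyl{0/I}#Carbamidomethyl#Deamidated{F^/Q/N}"))
# prints [('Phospho', ['S']), ('Methyl', ['0', 'I']), ('Carbamidomethyl', None), ('Deamidated', ['F^', 'Q', 'N'])]
```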