Recipes¶
Spectral library generation¶
Peptide library to spectral library¶
To generate a library of peptides, which is typically the first step in generating a peptide
spectral library, use the program predict
. This program takes a peptide library in parquet
format and generates a spectral library using an AI network. The peptide library can be generated using
fasta2peptide.
The default configuration for the predict
program is contained in
masskit_ai/src/masskit_ai/apps/ml/peptide/conf/config_predict.yaml
.
to change the name of the input file, specify
input.test.spectral_library=myfilename.parquet
on the command line.the prefix of the output file(s) is specified using
predict.output_prefix=myfilename
on the command line.the program supports the following options:
predict.min_intensity=0.1
is the minimum intensity to predict (out of a max of 999)predict.min_mz=28
is the minimum mz value for predicted ionspredict.num=0
is the number of spectra to predict, 0 = allpredict.model_ensemble=[https://github.com/usnistgov/masskit_ai/releases/download/v1.2.0/aiomics_model.tgz]
is a list of AI networks to use for predictionpredict.upres=True
perform upresolution on the spectra
To get additional help on options for these programs, run the program using the -h
option.
Example set of commands to predict spectra from a fasta file uniprot.fasta
¶
fasta2peptides input.file=uniprot.fasta output.file=uniprot_peptides.parquet
predict input.test.spectral_library=uniprot_peptides.parquet predict.output_prefix=uniprot_peptides predict.output_suffixes=[mgf,msp]
The predicted spectra are found in the files uniprot_peptides.msp
and uniprot_peptides.mgf
.
Predicting RI values using AIRI¶
The first step in prediction is to use batch_converter
to convert SDF molfiles or CSV files containing SMILES to parquet format,
which is the standard format Masskit uses for processing.
Once parquet files are generated, molecular bond path information, which is a feature used by the
AIRI model, should be calculated and added to the parquet file using the program shortest_path
.
Finally, the AIRI predictions can be performed using the predict
command line.
The output from this command is a CSV file, which has columns that correspond to either the columns
in the original csv file or the fields in the SDF file plus some computed molecular descriptors.
Each row corresponds to one molecular structure and has three added columns, predicted_ri
, predicted_ri_stddev
and predicted_ri_stddev_clip
, which correspond to the predicted RI value as well as the standard deviation of the predicted RI and the standard deviation clipped at a lower bound
to generate a more normal distribution of RI values.
Example set of commands to calculate AIRI values from a CSV file my_csv.csv
with SMILES in the molecules
column¶
batch_converter input.file.names=my_csv.csv output.file.name=my_csv output.file.types=[parquet] conversion.csv.smiles_column_name=molecules
reactor input.file.name=my_csv.parquet output.file.name=my_csv_derivatized.parquet conversion.num_tautomers=5 conversion.mass_range=[0,5000] conversion.reactant_names=[trimethylsilylation]
shortest_path input.file.name=my_csv_derivatized.parquet output.file.name=my_csv_path.parquet
predict --config-name config_predict_ri input.test.spectral_library=my_csv_path.parquet predict.output_prefix=my_csv_predicted predict.output_suffixes=[csv]
The AIRI values are found in the file my_csv_predicted.csv
. The use of reactor
to derivatize molecules is optional.
If reactor
is removed from the list of commands, use the output from batch_converter
as the input file to shortest_path
, e.g.shortest_path input.file.name=my_csv.parquet output.file.name=my_csv_path.parquet
.
To get additional help on options for these programs, run the program using the -h
option.
Example set of commands to calculate AIRI values from an SDF molfile my_sdf.sdf
¶
batch_converter input.file.names=my_sdf.sdf output.file.name=my_sdf output.file.types=[parquet]
shortest_path input.file.name=my_sdf.parquet output.file.name=my_sdf_path.parquet
predict --config-name config_predict_ri input.test.spectral_library=my_sdf_path.parquet predict.output_prefix=my_sdf_predicted predict.output_suffixes=[csv]
The AIRI values are found in the file my_sdf_predicted.csv
If the SDF file includes latin-1 encoded characters or is a pre v2000 version SDF file,
use the program rewrite_sdf
to create a corrected SDF file:
rewrite_sdf input.file.name=my_orginal_sdf.sdf output.file.name=my_sdf.sdf