masskit_ai.spectrum package¶
Subpackages¶
- masskit_ai.spectrum.peptide package
- Subpackages
- Submodules
- masskit_ai.spectrum.peptide.peptide_callbacks module
- masskit_ai.spectrum.peptide.peptide_constants module
- masskit_ai.spectrum.peptide.peptide_embed module
- masskit_ai.spectrum.peptide.peptide_prediction module
- masskit_ai.spectrum.peptide.peptide_samplers module
- Module contents
- masskit_ai.spectrum.small_mol package
- Subpackages
- Submodules
- masskit_ai.spectrum.small_mol.small_mol_datasets module
TandemArrowSearchDataset
TandemArrowSearchDataset.data
TandemArrowSearchDataset.data_search
TandemArrowSearchDataset.get_data_row()
TandemArrowSearchDataset.get_x()
TandemArrowSearchDataset.get_y()
TandemArrowSearchDataset.index
TandemArrowSearchDataset.row2id
TandemArrowSearchDataset.row2id_search
TandemArrowSearchDataset.spectrum2array()
- masskit_ai.spectrum.small_mol.small_mol_lightning module
- masskit_ai.spectrum.small_mol.small_mol_losses module
- Module contents
Submodules¶
masskit_ai.spectrum.spectrum_base_objects module¶
- class masskit_ai.spectrum.spectrum_base_objects.SpectrumModel(*args: Any, **kwargs: Any)¶
Bases:
SpectrumModule
base class for spectral prediction models
the “output” from the model is a dictionary: output['y_prime'] contains a batch of predicted spectra
“batch” is the input to the model: batch['y'] is a batch of experimental spectra corresponding to the predicted spectra
each batch of spectra is a float32 tensor of shape (batch, channel, mz_bins)
by convention, channel 0 holds the intensities, which are not necessarily scaled
channel 1 holds the standard deviations of the corresponding intensities
- property channels¶
calculate number of channels in the input
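The layout conventions described for SpectrumModel can be illustrated with plain nested lists standing in for float32 tensors. This is a minimal sketch: the sizes, values, and the helper make_spectra are made up for illustration, and only the keys 'y' and 'y_prime' and the (batch, channel, mz_bins) ordering come from the description above.

```python
# Plain nested lists stand in for float32 tensors of shape (batch, channel, mz_bins).
BATCH, CHANNELS, MZ_BINS = 2, 2, 4  # hypothetical sizes, for illustration only


def make_spectra(fill):
    """Build a (batch, channel, mz_bins) nested list filled with one value."""
    return [[[fill] * MZ_BINS for _ in range(CHANNELS)] for _ in range(BATCH)]


batch = {"y": make_spectra(1.0)}         # experimental spectra
output = {"y_prime": make_spectra(0.5)}  # predicted spectra

# Channel 0 holds intensities, channel 1 their standard deviations.
intensities = [spectrum[0] for spectrum in output["y_prime"]]
std_devs = [spectrum[1] for spectrum in output["y_prime"]]

# The channel count of a batch of spectra is the size of its second dimension.
num_channels = len(batch["y"][0])
```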
masskit_ai.spectrum.spectrum_datasets module¶
- class masskit_ai.spectrum.spectrum_datasets.SpectrumDataset(*args: Any, **kwargs: Any)¶
Bases:
BaseDataset
Base spectrum dataset
- get_y(data_row)¶
given the data row, return the target of the network
- class masskit_ai.spectrum.spectrum_datasets.TandemArrowDataset(*args: Any, **kwargs: Any)¶
Bases:
SpectrumDataset
class for accessing a tandem dataframe of spectra
How workers are set up requires some explanation:
if there is more than one gpu, each gpu has a corresponding main process.
the numbering of this gpu within a node is given by the environment variable LOCAL_RANK
if there is more than one node, the numbering of the node is given by the NODE_RANK environment variable
the total number of gpus, i.e. the number of nodes times the number of gpus per node, is given by the WORLD_SIZE environment variable
the number of gpus on the current node can be found by parsing the PL_TRAINER_GPUS environment variable
these environment variables are only available when doing ddp. Otherwise sharding should be done using id and num_workers in torch.utils.data.get_worker_info()
each main process creates an instance of Dataset. This instance is initialized only by its constructor, NOT by worker_init_fn.
each worker is created by forking the main process, giving each worker a copy of Dataset, already constructed.
- each of these forked Datasets is initialized by worker_init_fn
- torch.utils.data.get_worker_info() returns a reference to the forked Dataset and other info
- these workers then take turns feeding minibatches into the training process
- *important* since each worker is a copy, __init__() is only called once, only in the main process
the Dataset in each main process is used by other steps, such as the validation callback. This means that any important initialization done in worker_init_fn must also be done explicitly on the main process Dataset.
- alternative sources of parameters:
global_rank = trainer.node_rank * trainer.num_processes + process_idx
world_size = trainer.num_nodes * trainer.num_processes
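The rank arithmetic and sharding described for TandemArrowDataset can be sketched in plain Python. This is an illustration under stated assumptions, not the masskit_ai implementation: the function names, the gpus_per_node argument (standing in for a parsed PL_TRAINER_GPUS value), and the strided split are all hypothetical.

```python
import os


def global_rank_and_world_size(env=None, gpus_per_node=1):
    """Derive the ddp rank and world size from the environment variables.

    Returns (0, 1) when WORLD_SIZE is absent, i.e. when not running under ddp.
    gpus_per_node is a hypothetical argument, not a masskit_ai parameter.
    """
    env = os.environ if env is None else env
    if "WORLD_SIZE" not in env:
        return 0, 1
    node_rank = int(env.get("NODE_RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    return node_rank * gpus_per_node + local_rank, int(env["WORLD_SIZE"])


def shard_rows(num_rows, rank, world_size, worker_id=0, num_workers=1):
    """Pick the rows for one worker of one main process.

    Uses a strided split, which is an assumption made for this sketch.
    """
    num_shards = world_size * num_workers
    shard = rank * num_workers + worker_id
    return list(range(shard, num_rows, num_shards))


# Example: second node of two, two gpus per node, two workers per main process.
env = {"NODE_RANK": "1", "LOCAL_RANK": "0", "WORLD_SIZE": "4"}
rank, world_size = global_rank_and_world_size(env, gpus_per_node=2)
rows = shard_rows(16, rank, world_size, worker_id=1, num_workers=2)
```

Outside of ddp, worker_id and num_workers would come from the id and num_workers fields of torch.utils.data.get_worker_info(), with rank 0 and a world size of 1.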
- property data¶
- get_column(column)¶
retrieve a column from the parquet file
- Parameters:
column – the column to retrieve
- get_data_row(index)¶
given the index, return corresponding data for the index
- to_pandas()¶
return data as pandas dataframe
- Raises:
NotImplementedError – not implemented
- class masskit_ai.spectrum.spectrum_datasets.TandemDataframeDataset(*args: Any, **kwargs: Any)¶
Bases:
SpectrumDataset, DataframeDataset
masskit_ai.spectrum.spectrum_embed module¶
- class masskit_ai.spectrum.spectrum_embed.Embed1D(config)¶
Bases:
Embed
generic 1d embedding
- charge_channels()¶
the number of charge channels. no charge, which should be rare, is encoded as an empty vector
- Returns:
the number of charge channels
- charge_embed(row)¶
embed the charge as a one hot tensor
- Parameters:
row – data record
- Returns:
one hot tensor
- static charge_singleton_channels()¶
the number of charge_float channels
- Returns:
the number of charge channels
- charge_singleton_embed(row)¶
embed the charge as a float tensor ranging from 0 to 1
- Parameters:
row – data record
- Returns:
float tensor
- ev_channels()¶
the number of ev channels
- Returns:
the number of ev channels
- ev_embed(row)¶
embed the ev as a one hot tensor
- Parameters:
row – data record
- Returns:
one hot tensor
- static ev_singleton_channels()¶
the number of ev channels
- Returns:
the number of ev_float channels
- ev_singleton_embed(row)¶
embed the ev as a single float value from 0 to 1
- Parameters:
row – data record
- Returns:
FloatTensor
- nce_channels()¶
the number of nce channels
- Returns:
the number of nce channels
- nce_embed(row)¶
embed the nce as a one hot tensor
- Parameters:
row – data record
- Returns:
one hot tensor
- static nce_singleton_channels()¶
the number of nce channels
- Returns:
the number of nce_float channels
- nce_singleton_embed(row)¶
embed the nce as a single float value from 0 to 1
- Parameters:
row – data record
- Returns:
FloatTensor
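The two embedding styles offered by Embed1D, one hot and single float, can be contrasted with a small plain-Python sketch. The helper names, the list of charge states, and the scaling bound are hypothetical, and lists stand in for tensors. A missing charge yields an all-zero vector here, which is only one possible reading of the "empty vector" mentioned above.

```python
def one_hot(value, bins):
    """Return a list with 1.0 at the position of value in bins.

    A value that is missing or not listed in bins yields an all-zero vector.
    """
    return [1.0 if value == b else 0.0 for b in bins]


def singleton(value, max_value):
    """Scale value into a single float clamped to the range 0 to 1."""
    return [min(max(value / max_value, 0.0), 1.0)]


CHARGES = [1, 2, 3, 4]  # hypothetical charge states
MAX_NCE = 100.0         # hypothetical upper bound used for scaling

row = {"charge": 2, "nce": 35.0}  # hypothetical data record
charge_vec = one_hot(row["charge"], CHARGES)
nce_vec = singleton(row["nce"], MAX_NCE)
```

The one hot form uses one channel per allowed value, while the singleton form always uses a single channel, which is what the corresponding *_channels() methods report.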
masskit_ai.spectrum.spectrum_lightning module¶
- class masskit_ai.spectrum.spectrum_lightning.BaseSpectrumLightningModule(*args: Any, **kwargs: Any)¶
Bases:
LightningModule, ABC
base class for pytorch lightning module used to run the training
- calc_loss(output, batch, params=None)¶
overrideable loss function
- Parameters:
output – output from the model
batch – batch data, including input and true spectra
params – optional dictionary of parameters, such as epoch type
- Returns:
loss
- configure_optimizers()¶
- forward(x)¶
- on_test_epoch_end()¶
- on_train_epoch_end()¶
- on_validation_epoch_end()¶
- test_step(batch, batch_idx)¶
- abstract training_step(batch, batch_idx)¶
- training_step_end(outputs)¶
- validation_step(batch, batch_idx, dataloader_idx=None)¶
validation step
- Parameters:
batch – batch data tensor
batch_idx – the index of the batch
dataloader_idx – which dataloader is being used (None if just one)
- Returns:
loss
- validation_test_epoch_end(outputs, loop)¶
shared code between validation and test epoch ends. logs and prints losses, resets metrics
- Parameters:
outputs – output list from model
loop – ‘val’ or ‘test’
- abstract validation_test_step(batch, batch_idx, loop, outputs)¶
- class masskit_ai.spectrum.spectrum_lightning.SpectrumLightningModule(*args: Any, **kwargs: Any)¶
Bases:
BaseSpectrumLightningModule
pytorch lightning module used to run the training
- training_step(batch, batch_idx)¶
- validation_test_step(batch, batch_idx, loop, outputs)¶
step shared by the test and validation loops
- Parameters:
batch – batch
batch_idx – index into data for batch
loop – the name of the loop
outputs – the list containing outputs
- Returns:
loss
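The split between an abstract validation_test_step() and the concrete validation and test steps follows a template-method pattern. The sketch below shows that pattern with the standard library only. BaseStepper and ZeroPredictionStepper are hypothetical stand-ins rather than the masskit_ai or pytorch lightning classes, and the default loss is an arbitrary choice.

```python
from abc import ABC, abstractmethod


class BaseStepper(ABC):
    """Hypothetical stand-in for a base training module."""

    def __init__(self):
        # one list of per-batch losses for each loop
        self.outputs = {"val": [], "test": []}

    def calc_loss(self, output, batch, params=None):
        """Overridable loss; this default is a mean absolute error."""
        errors = [abs(p - y) for p, y in zip(output["y_prime"], batch["y"])]
        return sum(errors) / len(errors)

    def validation_step(self, batch, batch_idx):
        return self.validation_test_step(batch, batch_idx, "val", self.outputs["val"])

    def test_step(self, batch, batch_idx):
        return self.validation_test_step(batch, batch_idx, "test", self.outputs["test"])

    @abstractmethod
    def validation_test_step(self, batch, batch_idx, loop, outputs):
        """Step shared by the validation and test loops."""


class ZeroPredictionStepper(BaseStepper):
    """Concrete subclass whose 'model' always predicts zeros (a placeholder)."""

    def validation_test_step(self, batch, batch_idx, loop, outputs):
        output = {"y_prime": [0.0] * len(batch["y"])}
        loss = self.calc_loss(output, batch, params={"loop": loop})
        outputs.append(loss)
        return loss


stepper = ZeroPredictionStepper()
val_loss = stepper.validation_step({"y": [1.0, 3.0]}, batch_idx=0)
```

Keeping the shared logic in one method means a subclass implements the step once and both loops pick it up, with the loop name available for anything that must differ.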
masskit_ai.spectrum.spectrum_losses module¶
- class masskit_ai.spectrum.spectrum_losses.BaseSpectrumLoss(*args: Any, **kwargs: Any)¶
Bases:
BaseLoss
abstract base class for spectrum losses. Assumes spectra have dimensions (batch, channel, mz_array)
- extract_spectra(output, batch) → Tuple[torch.Tensor, torch.Tensor]¶
- extract_variance(input_tensor: torch.Tensor) → torch.Tensor¶
- abstract forward(output, batch, params=None) → torch.Tensor¶
calculate the loss
- Parameters:
output – output dictionary from the model, type ModelOutput
batch – batch data from the dataloader, type ModelInput
params – optional dictionary of parameters, such as epoch type
- Returns:
loss tensor
- class masskit_ai.spectrum.spectrum_losses.SpectrumCosineKLLoss(*args: Any, **kwargs: Any)¶
Bases:
BaseSpectrumLoss
cosine similarity of intensity channel and KL divergence
- forward(output, batch, params=None) → torch.Tensor¶
calculate the loss
- Parameters:
output – output dictionary from the model, type ModelOutput
batch – batch data from the dataloader, type ModelInput
params – optional dictionary of parameters, such as epoch type
- Returns:
loss tensor
- class masskit_ai.spectrum.spectrum_losses.SpectrumCosineLoss(*args: Any, **kwargs: Any)¶
Bases:
BaseSpectrumLoss
cosine similarity of intensity channel
- forward(output, batch, params=None) → torch.Tensor¶
calculate the loss
- Parameters:
output – output dictionary from the model, type ModelOutput
batch – batch data from the dataloader, type ModelInput
params – optional dictionary of parameters, such as epoch type
- Returns:
loss tensor
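A cosine loss on the intensity channel can be sketched in plain Python, with nested lists standing in for tensors of shape (batch, channel, mz_bins). How the library turns the similarity into a loss is not stated above, so the "1 minus mean similarity" form, the eps guard, and the function names are assumptions.

```python
import math


def cosine_similarity(predicted, observed, eps=1e-12):
    """Cosine of the angle between two intensity vectors."""
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    # eps avoids a division by zero for all-zero spectra
    return dot / max(norm_p * norm_o, eps)


def cosine_loss(y_prime, y):
    """1 minus the batch mean cosine similarity, using channel 0 only."""
    sims = [cosine_similarity(p[0], o[0]) for p, o in zip(y_prime, y)]
    return 1.0 - sum(sims) / len(sims)


# One spectrum per batch, one channel, two mz bins.
identical = cosine_loss([[[1.0, 2.0]]], [[[1.0, 2.0]]])
orthogonal = cosine_loss([[[1.0, 0.0]]], [[[0.0, 1.0]]])
```

Because cosine similarity ignores overall scale, this kind of loss suits intensities that are not necessarily scaled.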
- class masskit_ai.spectrum.spectrum_losses.SpectrumLogCosineKLLoss(*args: Any, **kwargs: Any)¶
Bases:
BaseSpectrumLoss
log of cosine similarity of intensity channel and KL divergence
- forward(output, batch, params=None) → torch.Tensor¶
calculate the loss
- Parameters:
output – output dictionary from the model, type ModelOutput
batch – batch data from the dataloader, type ModelInput
params – optional dictionary of parameters, such as epoch type
- Returns:
loss tensor
- class masskit_ai.spectrum.spectrum_losses.SpectrumMSEKLLoss(*args: Any, **kwargs: Any)¶
Bases:
BaseSpectrumLoss
mean square error of intensity channel plus KL divergence
- forward(output, batch, params=None) → torch.Tensor¶
calculate the loss
- Parameters:
output – output dictionary from the model, type ModelOutput
batch – batch data from the dataloader, type ModelInput
params – optional dictionary of parameters, such as epoch type
- Returns:
loss tensor
- class masskit_ai.spectrum.spectrum_losses.SpectrumMSELoss(*args: Any, **kwargs: Any)¶
Bases:
BaseSpectrumLoss
mean square error of intensity channel
- forward(output, batch, params=None) → torch.Tensor¶
calculate the loss
- Parameters:
output – output dictionary from the model, type ModelOutput
batch – batch data from the dataloader, type ModelInput
params – optional dictionary of parameters, such as epoch type
- Returns:
loss tensor
- class masskit_ai.spectrum.spectrum_losses.SpectrumNormalNLL(*args: Any, **kwargs: Any)¶
Bases:
BaseSpectrumLoss
negative log likelihood loss for a normal distribution, for a spectral model that emits predictions and the variance of those predictions. Omits constants in the log likelihood
- forward(output, batch, params=None) → torch.Tensor¶
calculate the loss
- Parameters:
output – output dictionary from the model, type ModelOutput
batch – batch data from the dataloader, type ModelInput
params – optional dictionary of parameters, such as epoch type
- Returns:
loss tensor
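For a normal distribution, the negative log likelihood of an observation y given a predicted mean mu and variance var is 0.5 * (log(2 * pi) + log(var) + (y - mu)^2 / var). A minimal sketch that drops the constant 0.5 * log(2 * pi) term follows. Whether the library keeps the factor of 0.5, averages or sums over bins, or clamps the variance is not stated above, so those choices are assumptions.

```python
import math


def normal_nll(mean, variance, observed, eps=1e-8):
    """Mean over bins of 0.5 * (log(var) + (y - mu)^2 / var).

    The constant 0.5 * log(2 * pi) is left out because it does not
    affect the gradients.
    """
    total = 0.0
    for mu, var, y in zip(mean, variance, observed):
        var = max(var, eps)  # guard against zero or negative variance
        total += 0.5 * (math.log(var) + (y - mu) ** 2 / var)
    return total / len(mean)


perfect = normal_nll([1.0, 2.0], [1.0, 1.0], [1.0, 2.0])
confident_miss = normal_nll([0.0], [1.0], [2.0])
hedged_miss = normal_nll([0.0], [4.0], [2.0])
```

In this sketch a wrong prediction with a larger predicted variance is penalized less than the same error with a small variance, which is what lets the model express uncertainty.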
masskit_ai.spectrum.spectrum_prediction module¶
- class masskit_ai.spectrum.spectrum_prediction.PeptideSpectrumPredictor(config=None, *args, **kwargs)¶
Bases:
Predictor
class used to predict multiple spectra per record, which are averaged into a single spectrum with a standard deviation
- add_item(item_idx, item)¶
add newly predicted item at index item_idx
- Parameters:
item_idx – index into items
item – item to add
- create_dataloaders(model)¶
Create dataloaders that contain experimental spectra.
- Parameters:
model – the model to use to predict spectrum
- Returns:
list of dataloader objects
- create_items(dataloader_idx, start)¶
for a given loader, return a batch of accumulators
- Parameters:
dataloader_idx – the index of the dataloader in self.dataloaders
start – the start row of the batch
- create_mz_tolerance(model)¶
generate mz array and mass tolerance for model
- Parameters:
model – the model to use
- Returns:
mz, tolerance
- finalize_items(dataloader_idx, start)¶
do final processing on a batch of predicted spectra
- Parameters:
dataloader_idx – the index of the dataloader in self.dataloaders
start – position of the start of the batch
- make_spectrum(precursor_mz)¶
- single_prediction(model, item_idx, dataloader_idx)¶
predict a single spectrum
- Parameters:
model – the prediction model
item_idx – the index of item in the current dataset
dataloader_idx – the index of the dataloader in self.dataloaders
- write_items(dataloader_idx, start)¶
write the spectra to files
- Parameters:
dataloader_idx – the index of the dataloader in self.dataloaders
start – position of the start of the batch
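The averaging described for PeptideSpectrumPredictor, where several predictions for one record are collapsed into a single spectrum with a standard deviation, can be sketched with the standard library. The function name and the use of the population standard deviation are assumptions, and lists of intensities stand in for the accumulator objects.

```python
import statistics


def average_spectra(predictions):
    """Collapse several predicted intensity vectors into per-bin mean and std."""
    # regroup so that each tuple holds the values of one mz bin across predictions
    bins = list(zip(*predictions))
    means = [statistics.fmean(values) for values in bins]
    std_devs = [statistics.pstdev(values) for values in bins]
    return means, std_devs


# Two hypothetical predictions for the same record, three mz bins each.
predictions = [
    [0.0, 2.0, 10.0],
    [0.0, 4.0, 10.0],
]
means, std_devs = average_spectra(predictions)
```

The means and standard deviations map naturally onto channel 0 and channel 1 of the spectrum layout used elsewhere in this package.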
- class masskit_ai.spectrum.spectrum_prediction.SinglePeptideSpectrumPredictor(config=None, *args, **kwargs)¶
Bases:
PeptideSpectrumPredictor
class used to predict a single spectrum per record
- add_item(idx, item)¶
add newly predicted item at index idx
- Parameters:
idx – index into items
item – item to add
- make_spectrum(precursor_mz)¶
- masskit_ai.spectrum.spectrum_prediction.finalize_spectrum(spectrum, min_intensity, mz_window, upres=False)¶
function to finalize a predicted spectrum. Separated from the class so it can be used in multiprocessing
- Parameters:
spectrum – spectrum to be finalized
min_intensity – minimum peak intensity for filtering
mz_window – size of window for filtering
upres – whether to upres the resulting spectrum
- Returns:
the finalized spectrum