masskit_ai.spectrum package¶

Subpackages¶

Submodules¶

masskit_ai.spectrum.spectrum_base_objects module¶

class masskit_ai.spectrum.spectrum_base_objects.SpectrumModel(*args: Any, **kwargs: Any)¶

Bases: SpectrumModule

base class for spectral prediction models

the "output" from the model is a dictionary: output['y_prime'] contains a batch of predicted spectra.
"batch" is the input to the model: batch['y'] is a batch of experimental spectra corresponding to the predicted spectra.
each batch of spectra is a float32 tensor of shape (batch, channel, mz_bins).
by convention, channel 0 holds the intensities, which are not necessarily scaled, and channel 1 holds the standard deviations of the corresponding intensities.

property channels¶: calculate number of channels in the input
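The tensor layout described for SpectrumModel can be illustrated with a small sketch. Nested Python lists stand in for the float32 tensor here so that the example has no dependencies; the values and the helper names `shape` and `channel` are hypothetical and are not part of masskit_ai.

```python
# Hypothetical toy batch: 2 spectra, 2 channels, 4 m/z bins.
# Nested lists stand in for a float32 tensor of shape (batch, channel, mz_bins).
batch_y = [
    [[0.0, 10.0, 5.0, 0.0],   # spectrum 0, channel 0: intensities
     [0.0, 1.0, 0.5, 0.0]],   # spectrum 0, channel 1: standard deviations
    [[3.0, 0.0, 0.0, 7.0],    # spectrum 1, channel 0: intensities
     [0.3, 0.0, 0.0, 0.7]],   # spectrum 1, channel 1: standard deviations
]


def shape(nested):
    """Return (batch, channel, mz_bins) for a rectangular nested list."""
    return (len(nested), len(nested[0]), len(nested[0][0]))


def channel(nested, index):
    """Select one channel from every spectrum in the batch."""
    return [spectrum[index] for spectrum in nested]


intensities = channel(batch_y, 0)  # channel 0 by convention
std_devs = channel(batch_y, 1)     # channel 1 by convention
```

With a real tensor the same selection would typically be a slice along the channel axis.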

class masskit_ai.spectrum.spectrum_base_objects.SpectrumModule(*args: Any, **kwargs: Any)¶

Bases: Module

base class for a spectrum module. contains the configuration object.

property bins¶: calculate number of bins in the output spectrum

masskit_ai.spectrum.spectrum_datasets module¶

class masskit_ai.spectrum.spectrum_datasets.SpectrumDataset(*args: Any, **kwargs: Any)¶

Bases: BaseDataset

Base spectrum dataset

get_y(data_row)¶: given the data row, return the target of the network

class masskit_ai.spectrum.spectrum_datasets.TandemArrowDataset(*args: Any, **kwargs: Any)¶

Bases: SpectrumDataset

class for accessing a tandem dataframe of spectra

How workers are set up requires some explanation:

- if there is more than one gpu, each gpu has a corresponding main process.
- the numbering of this gpu within a node is given by the LOCAL_RANK environment variable.
- if there is more than one node, the numbering of the node is given by the NODE_RANK environment variable.
- the total number of processes (the number of nodes times the number of gpus per node) is given by the WORLD_SIZE environment variable.
- the number of gpus on the current node can be found by parsing the PL_TRAINER_GPUS environment variable.
- these environment variables are only available when doing ddp. Otherwise, sharding should be done using id and num_workers from torch.utils.data.get_worker_info().
- each main process creates an instance of Dataset. Only the constructor runs on this instance; it is NOT initialized by worker_init_fn.
- each worker is created by forking the main process, giving each worker a copy of Dataset, already constructed:
  - each of these forked Datasets is initialized by worker_init_fn
  - the global torch.utils.data.get_worker_info() contains a reference to the forked Dataset and other info
  - these workers then take turns feeding minibatches into the training process
  - *important*: since each worker is a copy, __init__() is only called once, and only in the main process
- the Dataset in the main process is used by other steps, such as the validation callback. This means that any important initialization done in worker_init_fn must also be done explicitly on the main process Dataset.

alternative sources of parameters:
- global_rank = trainer.node_rank * trainer.num_processes + process_idx
- world_size = trainer.num_nodes * trainer.num_processes
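The rank arithmetic above can be turned into a sharding rule. The following is a minimal sketch, assuming a strided split of rows across every (process, worker) pair; the function names and the strided scheme are illustrative assumptions, not the scheme TandemArrowDataset necessarily uses.

```python
def global_rank(node_rank, num_processes, process_idx):
    """Mirror the formula above: node_rank * num_processes + process_idx."""
    return node_rank * num_processes + process_idx


def shard_rows(num_rows, rank, world_size, worker_id=0, num_workers=1):
    """Return the row indices owned by one dataloader worker.

    Hypothetical scheme: every (process, worker) pair gets a distinct shard id,
    and rows are dealt out round-robin so the shards are disjoint and complete.
    """
    total_shards = world_size * num_workers
    shard_id = rank * num_workers + worker_id
    return range(shard_id, num_rows, total_shards)


# Example: 2 nodes with 2 processes each, 2 workers per process.
rank = global_rank(node_rank=1, num_processes=2, process_idx=1)
rows = shard_rows(num_rows=20, rank=rank, world_size=4, worker_id=0, num_workers=2)
```

Outside ddp, `rank` and `world_size` would be 0 and 1, and `worker_id` and `num_workers` would come from torch.utils.data.get_worker_info().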

property data¶

get_column(column)¶

retrieve a column from the parquet file

Parameters: column – the column to retrieve

get_data_row(index)¶: given the index, return corresponding data for the index

to_pandas()¶

return data as pandas dataframe

Raises: NotImplementedError – not implemented

class masskit_ai.spectrum.spectrum_datasets.TandemDataframeDataset(*args: Any, **kwargs: Any)¶: Bases: SpectrumDataset, DataframeDataset

masskit_ai.spectrum.spectrum_embed module¶

class masskit_ai.spectrum.spectrum_embed.Embed1D(config)¶

Bases: Embed

generic 1d embedding

charge_channels()¶

the number of charge channels. no charge, which should be rare, is encoded as an empty vector

Returns: the number of charge channels

charge_embed(row)¶

embed the charge as a one hot tensor

Parameters: row – data record
Returns: one hot tensor

static charge_singleton_channels()¶

the number of channels used by the singleton (float) charge embedding

Returns: the number of singleton charge channels

charge_singleton_embed(row)¶

embed the charge as a float tensor ranging from 0 to 1

Parameters: row – data record
Returns: float tensor
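The difference between the one hot and singleton charge embeddings can be sketched in plain Python. This is only an illustration: `MAX_CHARGE`, the function names, and the division by the maximum charge are assumptions, and "empty vector" is read here as an all-zero vector.

```python
MAX_CHARGE = 4  # hypothetical upper bound on the charge


def charge_one_hot(charge, max_charge=MAX_CHARGE):
    """One hot encode a charge in 1..max_charge.

    A missing charge (None or 0) yields an all-zero vector.
    """
    vector = [0.0] * max_charge
    if charge:
        if not 1 <= charge <= max_charge:
            raise ValueError(f"charge {charge} outside 1..{max_charge}")
        vector[charge - 1] = 1.0
    return vector


def charge_singleton(charge, max_charge=MAX_CHARGE):
    """Encode the charge as a single float in [0, 1] (assumed linear scaling)."""
    clipped = min(max(charge or 0, 0), max_charge)
    return [clipped / max_charge]
```

The one hot form uses `max_charge` channels while the singleton form uses one, which is the distinction the two `*_channels()` methods report.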

ev_channels()¶

the number of ev channels

Returns: the number of ev channels

ev_embed(row)¶

embed the ev as a one hot tensor

Parameters: row – data record
Returns: one hot tensor

static ev_singleton_channels()¶

the number of channels used by the singleton (float) ev embedding

Returns: the number of singleton ev channels

ev_singleton_embed(row)¶

embed the ev as a single float value from 0 to 1

Parameters: row – data record
Returns: FloatTensor

nce_channels()¶

the number of nce channels

Returns: the number of nce channels

nce_embed(row)¶

embed the nce as a one hot tensor

Parameters: row – data record
Returns: one hot tensor

static nce_singleton_channels()¶

the number of channels used by the singleton (float) nce embedding

Returns: the number of singleton nce channels

nce_singleton_embed(row)¶

embed the nce as a single float value from 0 to 1

Parameters: row – data record
Returns: FloatTensor

masskit_ai.spectrum.spectrum_lightning module¶

class masskit_ai.spectrum.spectrum_lightning.BaseSpectrumLightningModule(*args: Any, **kwargs: Any)¶

Bases: LightningModule, ABC

base class for pytorch lightning module used to run the training

calc_loss(output, batch, params=None)¶

overridable loss function

Parameters:

output – output from the model
batch – batch data, including input and true spectra
params – optional dictionary of parameters, such as epoch type

Returns:

loss

configure_optimizers()¶

forward(x)¶

on_test_epoch_end()¶

on_train_epoch_end()¶

on_validation_epoch_end()¶

test_step(batch, batch_idx)¶

abstract training_step(batch, batch_idx)¶

training_step_end(outputs)¶

validation_step(batch, batch_idx, dataloader_idx=None)¶

validation step

Parameters:

batch – batch data tensor
batch_idx – the index of the batch
dataloader_idx – which dataloader is being used (None if just one)

Returns:

loss

validation_test_epoch_end(outputs, loop)¶

shared code between validation and test epoch ends. logs and prints losses, resets metrics

Parameters:

outputs – output list from model
loop – ‘val’ or ‘test’

abstract validation_test_step(batch, batch_idx, loop, outputs)¶

class masskit_ai.spectrum.spectrum_lightning.SpectrumLightningModule(*args: Any, **kwargs: Any)¶

Bases: BaseSpectrumLightningModule

pytorch lightning module used to run the training

training_step(batch, batch_idx)¶

validation_test_step(batch, batch_idx, loop, outputs)¶

step shared with test and validation loops

Parameters:

batch – batch
batch_idx – index into data for batch
loop – the name of the loop
outputs – the list containing outputs

Returns:

loss

masskit_ai.spectrum.spectrum_losses module¶

class masskit_ai.spectrum.spectrum_losses.BaseSpectrumLoss(*args: Any, **kwargs: Any)¶

Bases: BaseLoss

abstract base class for spectrum losses. assumes spectra have dimensions (batch, channel, mz_array)

extract_spectra(output, batch) → Tuple[torch.Tensor, torch.Tensor]¶

extract_variance(input_tensor: torch.Tensor) → torch.Tensor¶

abstract forward(output, batch, params=None) → torch.Tensor¶

calculate the loss

Parameters:

output – output dictionary from the model, type ModelOutput
batch – batch data from the dataloader, type ModelInput
params – optional dictionary of parameters, such as epoch type

Returns:

loss tensor

class masskit_ai.spectrum.spectrum_losses.SpectrumCosineKLLoss(*args: Any, **kwargs: Any)¶

Bases: BaseSpectrumLoss

cosine similarity of intensity channel and KL divergence

forward(output, batch, params=None) → torch.Tensor¶

calculate the loss

Parameters:

output – output dictionary from the model, type ModelOutput
batch – batch data from the dataloader, type ModelInput
params – optional dictionary of parameters, such as epoch type

Returns:

loss tensor

class masskit_ai.spectrum.spectrum_losses.SpectrumCosineLoss(*args: Any, **kwargs: Any)¶

Bases: BaseSpectrumLoss

cosine similarity of intensity channel

forward(output, batch, params=None) → torch.Tensor¶

calculate the loss

Parameters:

output – output dictionary from the model, type ModelOutput
batch – batch data from the dataloader, type ModelInput
params – optional dictionary of parameters, such as epoch type

Returns:

loss tensor
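As a rough illustration of a cosine loss on the intensity channel, the sketch below computes the cosine similarity of two intensity vectors in plain Python and turns it into a loss as 1 minus the similarity. The names, the `eps` guard, and the `1 - similarity` form are assumptions; the source does not state how SpectrumCosineLoss maps similarity to a loss value.

```python
import math


def cosine_similarity(predicted, observed, eps=1e-12):
    """Cosine of the angle between two intensity vectors of equal length."""
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm = math.sqrt(sum(p * p for p in predicted)) * math.sqrt(
        sum(o * o for o in observed)
    )
    # eps guards against division by zero for an all-zero spectrum
    return dot / max(norm, eps)


def cosine_loss(predicted, observed):
    """Assumed loss form: 0 for identical direction, 1 for orthogonal spectra."""
    return 1.0 - cosine_similarity(predicted, observed)
```

Because the similarity depends only on direction, it is unaffected by the overall scale of the intensities, which fits the note that channel 0 is not necessarily scaled.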

class masskit_ai.spectrum.spectrum_losses.SpectrumLogCosineKLLoss(*args: Any, **kwargs: Any)¶

Bases: BaseSpectrumLoss

log of cosine similarity of intensity channel and KL divergence

forward(output, batch, params=None) → torch.Tensor¶

calculate the loss

Parameters:

output – output dictionary from the model, type ModelOutput
batch – batch data from the dataloader, type ModelInput
params – optional dictionary of parameters, such as epoch type

Returns:

loss tensor

class masskit_ai.spectrum.spectrum_losses.SpectrumMSEKLLoss(*args: Any, **kwargs: Any)¶

Bases: BaseSpectrumLoss

mean square error of intensity channel plus KL divergence

forward(output, batch, params=None) → torch.Tensor¶

calculate the loss

Parameters:

output – output dictionary from the model, type ModelOutput
batch – batch data from the dataloader, type ModelInput
params – optional dictionary of parameters, such as epoch type

Returns:

loss tensor

class masskit_ai.spectrum.spectrum_losses.SpectrumMSELoss(*args: Any, **kwargs: Any)¶

Bases: BaseSpectrumLoss

mean square error of intensity channel

forward(output, batch, params=None) → torch.Tensor¶

calculate the loss

Parameters:

output – output dictionary from the model, type ModelOutput
batch – batch data from the dataloader, type ModelInput
params – optional dictionary of parameters, such as epoch type

Returns:

loss tensor

class masskit_ai.spectrum.spectrum_losses.SpectrumNormalNLL(*args: Any, **kwargs: Any)¶

Bases: BaseSpectrumLoss

negative log likelihood loss for a normal distribution, for a spectral model that emits predictions and the variance of those predictions. omits constants in the log likelihood

forward(output, batch, params=None) → torch.Tensor¶

calculate the loss

Parameters:

output – output dictionary from the model, type ModelOutput
batch – batch data from the dataloader, type ModelInput
params – optional dictionary of parameters, such as epoch type

Returns:

loss tensor
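A normal negative log likelihood with constants dropped reduces, per bin, to 0.5 * (log(variance) + (target - mean)^2 / variance). The sketch below averages that quantity over bins in plain Python; the function name, the variance floor `eps`, and the use of a mean reduction are assumptions rather than details taken from SpectrumNormalNLL.

```python
import math


def normal_nll(mean, variance, target, eps=1e-6):
    """Average normal negative log likelihood over bins, constants omitted."""
    total = 0.0
    for m, v, t in zip(mean, variance, target):
        v = max(v, eps)  # floor the variance to keep log and division finite
        total += 0.5 * (math.log(v) + (t - m) ** 2 / v)
    return total / len(mean)
```

A larger predicted variance shrinks the squared-error term but is penalized through the log term, so the model is discouraged from claiming unlimited uncertainty.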

masskit_ai.spectrum.spectrum_prediction module¶

class masskit_ai.spectrum.spectrum_prediction.PeptideSpectrumPredictor(config=None, *args, **kwargs)¶

Bases: Predictor

class used to predict multiple spectra per record, which are averaged into a single spectrum with a standard deviation
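Averaging several predictions into one spectrum with a standard deviation can be sketched bin by bin. This is an illustration only: `average_spectra` is a hypothetical helper, and the choice of the population standard deviation is an assumption, since the source does not say which estimator the predictor uses.

```python
from statistics import mean, pstdev


def average_spectra(predictions):
    """Combine repeated predictions into per-bin (means, standard deviations).

    predictions is a list of intensity lists, one per prediction, all binned
    on the same m/z grid.
    """
    bins = list(zip(*predictions))           # regroup values by m/z bin
    means = [mean(values) for values in bins]
    std_devs = [pstdev(values) for values in bins]
    return means, std_devs


means, std_devs = average_spectra([[1.0, 2.0], [3.0, 2.0]])
```

The two resulting lists correspond to channel 0 (intensities) and channel 1 (standard deviations) in the layout used elsewhere in this package.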

add_item(item_idx, item)¶

add newly predicted item at index item_idx

Parameters:

item_idx – index into items
item – item to add

create_dataloaders(model)¶

Create dataloaders that contain experimental spectra.

Parameters: model – the model to use to predict spectrum
Returns: list of dataloader objects

create_items(dataloader_idx, start)¶

for a given loader, return a batch of accumulators

Parameters:

dataloader_idx – the index of the dataloader in self.dataloaders
start – the start row of the batch

create_mz_tolerance(model)¶

generate mz array and mass tolerance for model

Parameters: model – the model to use
Returns: mz, tolerance

finalize_items(dataloader_idx, start)¶

do final processing on a batch of predicted spectra

Parameters:

dataloader_idx – the index of the dataloader in self.dataloaders
start – position of the start of the batch

make_spectrum(precursor_mz)¶

single_prediction(model, item_idx, dataloader_idx)¶

predict a single spectrum

Parameters:

model – the prediction model
item_idx – the index of item in the current dataset
dataloader_idx – the index of the dataloader in self.dataloaders

write_items(dataloader_idx, start)¶

write the spectra to files

Parameters:

dataloader_idx – the index of the dataloader in self.dataloaders
start – position of the start of the batch

class masskit_ai.spectrum.spectrum_prediction.SinglePeptideSpectrumPredictor(config=None, *args, **kwargs)¶

Bases: PeptideSpectrumPredictor

class used to predict a single spectrum per record

add_item(idx, item)¶

add newly predicted item at index idx

Parameters:

idx – index into items
item – item to add

make_spectrum(precursor_mz)¶

masskit_ai.spectrum.spectrum_prediction.finalize_spectrum(spectrum, min_intensity, mz_window, upres=False)¶

function to finalize a predicted spectrum. Separated from the class so that it can be used in multiprocessing

Parameters:

spectrum – spectrum to be finalized
min_intensity – minimum peak intensity for filtering
mz_window – size of window for filtering
upres – whether to upres the resulting spectrum

Returns:

the finalized spectrum
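The two filtering parameters can be illustrated with a toy peak filter. Everything here is an assumption made for illustration: `filter_peaks` is a hypothetical function, peaks are plain (mz, intensity) pairs, and `mz_window` is interpreted as keeping only the most intense peak within half a window on either side. The actual behavior of finalize_spectrum may differ.

```python
def filter_peaks(mz, intensity, min_intensity, mz_window):
    """Toy peak filter (assumed semantics, not the masskit_ai implementation).

    1. Drop peaks whose intensity is below min_intensity.
    2. Keep a peak only if no other surviving peak within mz_window / 2
       of it is more intense.
    """
    peaks = [(m, i) for m, i in zip(mz, intensity) if i >= min_intensity]
    half_window = mz_window / 2
    kept = []
    for m, i in peaks:
        neighbors = [oi for om, oi in peaks if abs(om - m) <= half_window]
        if i >= max(neighbors):  # neighbors always includes the peak itself
            kept.append((m, i))
    return kept


filtered = filter_peaks(
    mz=[100.0, 100.2, 150.0, 200.0],
    intensity=[50.0, 80.0, 5.0, 30.0],
    min_intensity=10.0,
    mz_window=1.0,
)
```

Because the function takes only plain values and has no reference to a predictor instance, it can be passed to a multiprocessing pool, which matches the stated reason for keeping finalize_spectrum outside the class.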
