masskit_ai.spectrum package

Subpackages

Submodules

masskit_ai.spectrum.spectrum_base_objects module

class masskit_ai.spectrum.spectrum_base_objects.SpectrumModel(*args: Any, **kwargs: Any)

Bases: SpectrumModule

base class for spectral prediction models

  • the "output" from the model is a dictionary: output['y_prime'] contains a batch of predicted spectra

  • "batch" is the input to the model: batch['y'] is a batch of experimental spectra corresponding to the predicted spectra. Each batch of spectra is a float32 tensor of shape (batch, channel, mz_bins)

  • by convention, channel 0 holds the intensities, which are not necessarily scaled

  • channel 1 holds the standard deviations of the corresponding intensities
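The layout convention above can be illustrated with a minimal sketch. Plain nested lists stand in for float32 tensors so the example runs without PyTorch. The helper names make_spectra and shape and the sample values are hypothetical and not part of masskit_ai.

```python
# Plain nested lists stand in for float32 tensors of shape (batch, channel, mz_bins).
BATCH, CHANNELS, MZ_BINS = 2, 2, 4


def make_spectra(batch, channels, mz_bins, fill=0.0):
    """Build a nested list indexed as [batch][channel][mz_bin] (hypothetical helper)."""
    return [[[fill] * mz_bins for _ in range(channels)] for _ in range(batch)]


def shape(spectra):
    """Return (batch, channel, mz_bins) for a nested list of spectra."""
    return (len(spectra), len(spectra[0]), len(spectra[0][0]))


# "batch" is the model input; batch["y"] holds the experimental spectra.
batch = {"y": make_spectra(BATCH, CHANNELS, MZ_BINS)}
batch["y"][0][0] = [0.0, 10.0, 250.0, 3.0]  # channel 0: intensities (not scaled)
batch["y"][0][1] = [0.0, 1.0, 12.5, 0.5]    # channel 1: standard deviations

# "output" is the model output; output["y_prime"] holds the predicted spectra.
output = {"y_prime": make_spectra(BATCH, CHANNELS, MZ_BINS)}

# Slice out each channel for every spectrum in the batch.
intensities = [spectrum[0] for spectrum in batch["y"]]
std_devs = [spectrum[1] for spectrum in batch["y"]]
```

With real tensors, the same slices would be written as batch['y'][:, 0, :] and batch['y'][:, 1, :].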

property channels

calculate number of channels in the input

class masskit_ai.spectrum.spectrum_base_objects.SpectrumModule(*args: Any, **kwargs: Any)

Bases: Module

base class for a spectrum module. contains the configuration object.

property bins

calculate number of bins in the output spectrum

masskit_ai.spectrum.spectrum_datasets module

class masskit_ai.spectrum.spectrum_datasets.SpectrumDataset(*args: Any, **kwargs: Any)

Bases: BaseDataset

Base spectrum dataset

get_y(data_row)

given the data row, return the target of the network

class masskit_ai.spectrum.spectrum_datasets.TandemArrowDataset(*args: Any, **kwargs: Any)

Bases: SpectrumDataset

class for accessing a tandem dataframe of spectra

How workers are set up requires some explanation:

  • if there is more than one gpu, each gpu has a corresponding main process.

  • the numbering of this gpu within a node is given by the environment variable LOCAL_RANK

  • if there is more than one node, the numbering of the node is given by the NODE_RANK environment variable

  • the number of nodes times the number of gpus is given by the WORLD_SIZE environment variable

  • the number of gpus on the current node can be found by parsing the PL_TRAINER_GPUS environment variable

  • these environment variables are only available when doing ddp. Otherwise sharding should be done using id and num_workers in torch.utils.data.get_worker_info()

  • each main process creates an instance of Dataset. This instance is NOT initialized by worker_init_fn, only by the constructor.

  • each worker is created by forking the main process, giving each worker a copy of Dataset, already constructed.
    • each of these forked Datasets is initialized by worker_init_fn

    • the global torch.utils.data.get_worker_info() contains a reference to the forked Dataset and other info

    • these workers then take turns feeding minibatches into the training process

    • *important*: since each worker is a copy, __init__() is only called once, only in the main process

  • the dataset in the main process is used by other steps, such as the validation callback
    • this means that if there is any important initialization done in worker_init_fn, it must explicitly be done to the main process Dataset

  • alternative sources of parameters:
    • global_rank = trainer.node_rank * trainer.num_processes + process_idx

    • world_size = trainer.num_nodes * trainer.num_processes
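The rank and worker bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not the sharding code used by TandemArrowDataset: the function names, the gpus_per_node argument and the round-robin row assignment are all hypothetical. In a real worker, worker_id and num_workers would come from the id and num_workers attributes of torch.utils.data.get_worker_info().

```python
import os


def global_rank_and_world_size(gpus_per_node, env=None):
    """Derive (global_rank, world_size) from the DDP environment variables.

    Falls back to a single process when the variables are absent (no DDP).
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", 0))  # GPU index within the node
    node_rank = int(env.get("NODE_RANK", 0))    # index of the node
    world_size = int(env.get("WORLD_SIZE", 1))  # nodes * GPUs per node
    global_rank = node_rank * gpus_per_node + local_rank
    return global_rank, world_size


def shard_rows(num_rows, global_rank, world_size, worker_id=0, num_workers=1):
    """Return the row indices owned by one worker of one main process.

    Assumption: rows are dealt round-robin over world_size * num_workers shards.
    """
    num_shards = world_size * num_workers
    shard = global_rank * num_workers + worker_id
    return list(range(shard, num_rows, num_shards))


# Example: second GPU on the second of two nodes, two GPUs per node.
rank, world = global_rank_and_world_size(
    2, env={"LOCAL_RANK": "1", "NODE_RANK": "1", "WORLD_SIZE": "4"}
)
rows = shard_rows(10, rank, world, worker_id=1, num_workers=2)
```

Dealing rows round-robin keeps the shards disjoint and covers every row exactly once, whatever the number of workers.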

property data

get_column(column)

retrieve a column from the parquet file

Parameters:

column – the column to retrieve

get_data_row(index)

given the index, return corresponding data for the index

to_pandas()

return data as pandas dataframe

Raises:

NotImplementedError – not implemented

class masskit_ai.spectrum.spectrum_datasets.TandemDataframeDataset(*args: Any, **kwargs: Any)

Bases: SpectrumDataset, DataframeDataset

masskit_ai.spectrum.spectrum_embed module

class masskit_ai.spectrum.spectrum_embed.Embed1D(config)

Bases: Embed

generic 1d embedding

charge_channels()

the number of charge channels. no charge, which should be rare, is encoded as an empty vector

Returns:

the number of charge channels

charge_embed(row)

embed the charge as a one hot tensor

Parameters:

row – data record

Returns:

one hot tensor

static charge_singleton_channels()

the number of charge_float channels

Returns:

the number of charge channels

charge_singleton_embed(row)

embed the charge as a float tensor ranging from 0 to 1

Parameters:

row – data record

Returns:

float tensor
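The two charge encodings described above, one hot and single float, can be sketched in plain Python. The function names and the max_charge parameter are hypothetical, and reading "empty vector" as an all-zero vector is an assumption. The real class works on data records and its configuration object.

```python
def one_hot_charge(charge, max_charge):
    """One-hot encode a charge in 1..max_charge.

    Assumption: a missing charge (None or 0) is encoded as an all-zero vector.
    """
    vector = [0.0] * max_charge
    if charge:
        if not 1 <= charge <= max_charge:
            raise ValueError(f"charge {charge} outside 1..{max_charge}")
        vector[charge - 1] = 1.0
    return vector


def singleton_charge(charge, max_charge):
    """Encode the charge as a single float in the range 0 to 1."""
    clamped = min(max(charge or 0, 0), max_charge)
    return [clamped / max_charge]
```

The ev and nce embeddings follow the same one hot versus singleton pattern. How those values are binned or scaled is not specified here.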

ev_channels()

the number of ev channels

Returns:

the number of ev channels

ev_embed(row)

embed the ev as a one hot tensor

Parameters:

row – data record

Returns:

one hot tensor

static ev_singleton_channels()

the number of ev channels

Returns:

the number of ev_float channels

ev_singleton_embed(row)

embed the ev as a single float value from 0 to 1

Parameters:

row – data record

Returns:

FloatTensor

nce_channels()

the number of nce channels

Returns:

the number of nce channels

nce_embed(row)

embed the nce as a one hot tensor

Parameters:

row – data record

Returns:

one hot tensor

static nce_singleton_channels()

the number of nce channels

Returns:

the number of nce_float channels

nce_singleton_embed(row)

embed the nce as a single float value from 0 to 1

Parameters:

row – data record

Returns:

FloatTensor

masskit_ai.spectrum.spectrum_lightning module

class masskit_ai.spectrum.spectrum_lightning.BaseSpectrumLightningModule(*args: Any, **kwargs: Any)

Bases: LightningModule, ABC

base class for pytorch lightning module used to run the training

calc_loss(output, batch, params=None)

overrideable loss function

Parameters:
  • output – output from the model

  • batch – batch data, including input and true spectra

  • params – optional dictionary of parameters, such as epoch type

Returns:

loss

configure_optimizers()

forward(x)

on_test_epoch_end()

on_train_epoch_end()

on_validation_epoch_end()

test_step(batch, batch_idx)

abstract training_step(batch, batch_idx)

training_step_end(outputs)

validation_step(batch, batch_idx, dataloader_idx=None)

validation step

Parameters:
  • batch – batch data tensor

  • batch_idx – the index of the batch

  • dataloader_idx – which dataloader is being used (None if just one)

Returns:

loss

validation_test_epoch_end(outputs, loop)

shared code between validation and test epoch ends. logs and prints losses, resets metrics

Parameters:
  • outputs – output list from model

  • loop – ‘val’ or ‘test’

abstract validation_test_step(batch, batch_idx, loop, outputs)

class masskit_ai.spectrum.spectrum_lightning.SpectrumLightningModule(*args: Any, **kwargs: Any)

Bases: BaseSpectrumLightningModule

pytorch lightning module used to run the training

training_step(batch, batch_idx)

validation_test_step(batch, batch_idx, loop, outputs)

step shared with test and validation loops

Parameters:
  • batch – batch

  • batch_idx – index into data for batch

  • loop – the name of the loop

  • outputs – the list containing outputs

Returns:

loss

masskit_ai.spectrum.spectrum_losses module

class masskit_ai.spectrum.spectrum_losses.BaseSpectrumLoss(*args: Any, **kwargs: Any)

Bases: BaseLoss

abstract base class for spectrum losses. assumes spectra have dimensions (batch, channel, mz_array)

extract_spectra(output, batch) → Tuple[torch.Tensor, torch.Tensor]

extract_variance(input_tensor: torch.Tensor) → torch.Tensor

abstract forward(output, batch, params=None) → torch.Tensor

calculate the loss

Parameters:
  • output – output dictionary from the model, type ModelOutput

  • batch – batch data from the dataloader, type ModelInput

  • params – optional dictionary of parameters, such as epoch type

Returns:

loss tensor

class masskit_ai.spectrum.spectrum_losses.SpectrumCosineKLLoss(*args: Any, **kwargs: Any)

Bases: BaseSpectrumLoss

cosine similarity of intensity channel and KL divergence

forward(output, batch, params=None) → torch.Tensor

calculate the loss

Parameters:
  • output – output dictionary from the model, type ModelOutput

  • batch – batch data from the dataloader, type ModelInput

  • params – optional dictionary of parameters, such as epoch type

Returns:

loss tensor

class masskit_ai.spectrum.spectrum_losses.SpectrumCosineLoss(*args: Any, **kwargs: Any)

Bases: BaseSpectrumLoss

cosine similarity of intensity channel

forward(output, batch, params=None) → torch.Tensor

calculate the loss

Parameters:
  • output – output dictionary from the model, type ModelOutput

  • batch – batch data from the dataloader, type ModelInput

  • params – optional dictionary of parameters, such as epoch type

Returns:

loss tensor
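As a rough illustration of a cosine loss on the intensity channel, the sketch below uses plain lists indexed as [batch][channel][mz_bin]. It assumes the loss is 1 minus the cosine similarity, averaged over the batch. The actual sign convention and reduction used by SpectrumCosineLoss are not stated here, and the function names are hypothetical.

```python
import math


def cosine_similarity(predicted, observed, eps=1e-12):
    """Cosine of the angle between two intensity vectors."""
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm = math.sqrt(sum(p * p for p in predicted)) * math.sqrt(
        sum(o * o for o in observed)
    )
    return dot / max(norm, eps)  # eps guards against all-zero spectra


def cosine_loss(predicted_batch, observed_batch):
    """Mean of (1 - cosine similarity) over channel 0 of each spectrum.

    Assumption: this reduction is illustrative only.
    """
    losses = [
        1.0 - cosine_similarity(pred[0], obs[0])
        for pred, obs in zip(predicted_batch, observed_batch)
    ]
    return sum(losses) / len(losses)
```

Because the cosine depends only on the direction of the intensity vectors, this kind of loss is unaffected by the overall intensity scale, which fits the convention that channel 0 is not necessarily scaled.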

class masskit_ai.spectrum.spectrum_losses.SpectrumLogCosineKLLoss(*args: Any, **kwargs: Any)

Bases: BaseSpectrumLoss

log of cosine similarity of intensity channel and KL divergence

forward(output, batch, params=None) → torch.Tensor

calculate the loss

Parameters:
  • output – output dictionary from the model, type ModelOutput

  • batch – batch data from the dataloader, type ModelInput

  • params – optional dictionary of parameters, such as epoch type

Returns:

loss tensor

class masskit_ai.spectrum.spectrum_losses.SpectrumMSEKLLoss(*args: Any, **kwargs: Any)

Bases: BaseSpectrumLoss

mean square error of intensity channel plus KL divergence

forward(output, batch, params=None) → torch.Tensor

calculate the loss

Parameters:
  • output – output dictionary from the model, type ModelOutput

  • batch – batch data from the dataloader, type ModelInput

  • params – optional dictionary of parameters, such as epoch type

Returns:

loss tensor

class masskit_ai.spectrum.spectrum_losses.SpectrumMSELoss(*args: Any, **kwargs: Any)

Bases: BaseSpectrumLoss

mean square error of intensity channel

forward(output, batch, params=None) → torch.Tensor

calculate the loss

Parameters:
  • output – output dictionary from the model, type ModelOutput

  • batch – batch data from the dataloader, type ModelInput

  • params – optional dictionary of parameters, such as epoch type

Returns:

loss tensor

class masskit_ai.spectrum.spectrum_losses.SpectrumNormalNLL(*args: Any, **kwargs: Any)

Bases: BaseSpectrumLoss

negative log likelihood loss for a normal distribution, for a spectral model that emits predictions and the variance of each prediction. omits constants in the log likelihood

forward(output, batch, params=None) → torch.Tensor

calculate the loss

Parameters:
  • output – output dictionary from the model, type ModelOutput

  • batch – batch data from the dataloader, type ModelInput

  • params – optional dictionary of parameters, such as epoch type

Returns:

loss tensor
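For reference, the per-bin negative log likelihood of a normal distribution with the constant term dropped is 0.5 * (log(variance) + (y - mu)^2 / variance). The sketch below computes its mean over the bins of one spectrum. The function name, the eps clamp and the mean reduction are assumptions, not the behaviour of SpectrumNormalNLL.

```python
import math


def normal_nll(predicted, observed, variance, eps=1e-6):
    """Mean over bins of 0.5 * (log(var) + (y - mu)**2 / var).

    The constant 0.5 * log(2 * pi) is omitted. Variances are clamped to
    eps to avoid log(0) and division by zero (an assumption of this sketch).
    """
    total = 0.0
    for mu, y, var in zip(predicted, observed, variance):
        var = max(var, eps)
        total += 0.5 * (math.log(var) + (y - mu) ** 2 / var)
    return total / len(predicted)
```

If a model emits standard deviations in channel 1, as in the SpectrumModel convention, the variance passed in here would be the square of that channel.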

masskit_ai.spectrum.spectrum_prediction module

class masskit_ai.spectrum.spectrum_prediction.PeptideSpectrumPredictor(config=None, *args, **kwargs)

Bases: Predictor

class used to predict multiple spectra per record, which are averaged into a single spectrum with a standard deviation
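The averaging step mentioned above can be sketched as a per-bin mean and standard deviation over several predicted intensity arrays. The function name is hypothetical, and using the population standard deviation is an assumption. The class itself accumulates predictions in its own item objects.

```python
import math


def average_spectra(predictions):
    """Combine several predicted intensity arrays into per-bin mean and std dev.

    predictions: list of equal-length lists, one per prediction of the same record.
    Assumption: population standard deviation (divide by n).
    """
    n = len(predictions)
    num_bins = len(predictions[0])
    means, std_devs = [], []
    for b in range(num_bins):
        values = [prediction[b] for prediction in predictions]
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        means.append(mean)
        std_devs.append(math.sqrt(variance))
    return means, std_devs
```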

add_item(item_idx, item)

add newly predicted item at index item_idx

Parameters:
  • item_idx – index into items

  • item – item to add

create_dataloaders(model)

Create dataloaders that contain experimental spectra.

Parameters:

model – the model to use to predict spectrum

Returns:

list of dataloader objects

create_items(dataloader_idx, start)

for a given loader, return a batch of accumulators

Parameters:
  • dataloader_idx – the index of the dataloader in self.dataloaders

  • start – the start row of the batch

create_mz_tolerance(model)

generate mz array and mass tolerance for model

Parameters:

model – the model to use

Returns:

mz, tolerance

finalize_items(dataloader_idx, start)

do final processing on a batch of predicted spectra

Parameters:
  • dataloader_idx – the index of the dataloader in self.dataloaders

  • start – position of the start of the batch

make_spectrum(precursor_mz)

single_prediction(model, item_idx, dataloader_idx)

predict a single spectrum

Parameters:
  • model – the prediction model

  • item_idx – the index of item in the current dataset

  • dataloader_idx – the index of the dataloader in self.dataloaders

write_items(dataloader_idx, start)

write the spectra to files

Parameters:
  • dataloader_idx – the index of the dataloader in self.dataloaders

  • start – position of the start of the batch

class masskit_ai.spectrum.spectrum_prediction.SinglePeptideSpectrumPredictor(config=None, *args, **kwargs)

Bases: PeptideSpectrumPredictor

class used to predict a single spectrum per record

add_item(idx, item)

add newly predicted item at index idx

Parameters:
  • idx – index into items

  • item – item to add

make_spectrum(precursor_mz)

masskit_ai.spectrum.spectrum_prediction.finalize_spectrum(spectrum, min_intensity, mz_window, upres=False)

function to finalize a predicted spectrum. Separated from the class so it can be used in multiprocessing

Parameters:
  • spectrum – spectrum to be finalized

  • min_intensity – minimum peak intensity for filtering

  • mz_window – size of window for filtering

  • upres – upres the resulting spectrum

Returns:

the finalized spectrum

Module contents