masskit_ai.mol package¶

Subpackages¶

masskit_ai.mol.small package
- Subpackages
  - masskit_ai.mol.small.models package
- Module contents

Submodules¶

masskit_ai.mol.mol_datasets module¶

class masskit_ai.mol.mol_datasets.MolPropDataset(*args: Any, **kwargs: Any)¶

Bases: BaseDataset

Class for accessing a dataframe of small molecules and their properties.

How workers are set up requires some explanation:

- If there is more than one GPU, each GPU has a corresponding main process.
- The numbering of this GPU within a node is given by the LOCAL_RANK environment variable.
- If there is more than one node, the numbering of the node is given by the NODE_RANK environment variable.
- The number of nodes times the number of GPUs is given by the WORLD_SIZE environment variable.
- The number of GPUs on the current node can be found by parsing the PL_TRAINER_GPUS environment variable.
- These environment variables are only available when doing DDP. Otherwise, sharding should be done using id and num_workers from torch.utils.data.get_worker_info().
- Each main process creates an instance of Dataset. This instance is NOT initialized by worker_init_fn; only its constructor is run.
- Each worker is created by forking the main process, giving each worker a copy of Dataset, already constructed.
  - Each of these forked Datasets is initialized by worker_init_fn.
  - The global torch.utils.data.get_worker_info() contains a reference to the forked Dataset and other info.
  - These workers then take turns feeding minibatches into the training process.
  - *Important*: since each worker is a copy, __init__() is only called once, only in the main process.
- The Dataset in the main process is used by other steps, such as the validation callback.
  - This means that any important initialization done in worker_init_fn must also be done explicitly on the main-process Dataset.
- Alternative sources of parameters:
  - global_rank = trainer.node_rank * trainer.num_processes + process_idx
  - world_size = trainer.num_nodes * trainer.num_processes
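The sharding arithmetic described above can be sketched roughly as follows. This is a minimal illustration, not the masskit_ai implementation: the helper names `shard_spec` and `rows_for_shard` and the `gpus_per_node` argument are hypothetical, and the worker id and worker count are passed in as plain integers (in practice they would come from torch.utils.data.get_worker_info()) so that the sketch runs without PyTorch.

```python
import os


def shard_spec(gpus_per_node=1, worker_id=0, num_workers=1, env=None):
    """Return (shard_index, num_shards) for one dataloader worker.

    Hypothetical helper: combines the DDP environment variables with the
    per-process worker numbering to give each worker a unique shard.
    """
    env = os.environ if env is None else env
    # Outside DDP these variables are absent, so fall back to one process.
    local_rank = int(env.get("LOCAL_RANK", 0))
    node_rank = int(env.get("NODE_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))

    # Same formula as global_rank = node_rank * num_processes + process_idx.
    global_rank = node_rank * gpus_per_node + local_rank

    # Every main process forks num_workers workers.
    num_shards = world_size * num_workers
    shard_index = global_rank * num_workers + worker_id
    return shard_index, num_shards


def rows_for_shard(num_rows, shard_index, num_shards):
    """Row indices handled by one shard (simple strided split, an assumption)."""
    return list(range(shard_index, num_rows, num_shards))
```

With two nodes of two GPUs each and three workers per process, worker 1 on the last GPU of the last node would call `shard_spec(2, 1, 3)` with LOCAL_RANK=1, NODE_RANK=1 and WORLD_SIZE=4 set, and would receive shard 10 of 12.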

property data¶

get_column(column)¶

retrieve a column from the parquet file

Parameters:: column – the column to retrieve

get_data_row(index)¶: given the index, return corresponding data for the index

get_y(data_row)¶: given the data row, return the target of the network

to_pandas()¶

return data as pandas dataframe

Raises:: NotImplementedError – not implemented

masskit_ai.mol.mol_datasets.graphormer_collator(config)¶: collation function factory for graphormer

masskit_ai.mol.mol_datasets.pad_1d_unsqueeze(x, padlen)¶

masskit_ai.mol.mol_datasets.pad_2d_unsqueeze(x, padlen)¶

masskit_ai.mol.mol_datasets.pad_3d_unsqueeze(x, padlen1, padlen2, padlen3)¶

masskit_ai.mol.mol_datasets.pad_attn_bias_unsqueeze(x, padlen)¶

masskit_ai.mol.mol_datasets.pad_edge_type_unsqueeze(x, padlen)¶

masskit_ai.mol.mol_datasets.pad_spatial_pos_unsqueeze(x, padlen)¶

masskit_ai.mol.mol_datasets.patn_collator(config)¶: collation function factory for PATN

masskit_ai.mol.mol_embed module¶

class masskit_ai.mol.mol_embed.EmbedGraphormerMol(config)¶

Bases: Embed

Embedding of mol for Graphormer

embed(row)¶

call the requested embedding functions as listed in config.ml.embedding.embeddings

Parameters:: row – the data row
Returns:: the concatenated one hot tensor of the embeddings
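The dispatch pattern that embed() describes (look up each configured embedding by name, call it, and concatenate the results) might look something like the sketch below. Everything here is hypothetical: `ToyEmbed`, its `charge` and `aromatic` methods, and the dictionary row are invented for illustration, and plain Python lists stand in for the tensors the real class returns.

```python
class ToyEmbed:
    """Hypothetical embedder; names and sizes are illustrative only."""

    def __init__(self, embeddings):
        # Stands in for the list found at config.ml.embedding.embeddings.
        self.embeddings = embeddings

    def charge(self, row):
        # One-hot over the charges 1, 2 and 3.
        return [1.0 if row["charge"] == c else 0.0 for c in (1, 2, 3)]

    def aromatic(self, row):
        # One-hot over (not aromatic, aromatic).
        return [0.0, 1.0] if row["aromatic"] else [1.0, 0.0]

    def embed(self, row):
        # Call each requested embedding function by name and concatenate.
        out = []
        for name in self.embeddings:
            out.extend(getattr(self, name)(row))
        return out
```

Looking methods up with getattr keeps the set of active embeddings a matter of configuration rather than code: adding a new embedding means adding a method and listing its name.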

static mol_channels(self)¶

the number of mol channels

Returns:: the number of mol channels

mol_embed(row)¶

embed a processed mol

Parameters:: row – data record
Returns:: one hot tensor

class masskit_ai.mol.mol_embed.EmbedPATNMol(config)¶

Bases: Embed

Embedding for PATN PropPredictor

embed(row)¶

call the requested embedding functions as listed in config.ml.embedding.embeddings

Parameters:: row – the data row
Returns:: the concatenated one hot tensor of the embeddings

static mol_channels(self)¶

the number of mol channels

Returns:: the number of mol channels

mol_path_embed(row)¶

embed the nce as a one hot tensor

Parameters:: row – data record
Returns:: one hot tensor

masskit_ai.mol.mol_embed.convert_to_single_emb(x, offset: int = 512)¶
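This page does not document the function's behavior. The sketch below assumes it follows the function of the same name in the Graphormer reference code, where each feature column is shifted into its own index range so that all columns can share one embedding table. That assumption is unverified for masskit_ai, and the name `convert_to_single_emb_sketch` and the use of nested lists in place of tensors are purely illustrative.

```python
def convert_to_single_emb_sketch(rows, offset=512):
    """Shift column j of every row by 1 + j * offset.

    Assumed behavior (not confirmed by this page): the +1 reserves index 0,
    for example for padding, and the per-column offset keeps categorical
    codes from different feature columns from colliding.
    """
    return [
        [value + 1 + j * offset for j, value in enumerate(row)]
        for row in rows
    ]
```

Under this assumption, a code of 1 in the second column maps to index 514 (1 + 1 + 512), while a code of 1 in the first column maps to index 2.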

masskit_ai.mol.mol_embed.from_mol(mol)¶

Converts a mol to a torch_geometric.data.Data instance.

Parameters:: mol – an rdkit molecule
Returns:: Data

masskit_ai.mol.mol_embed.preprocess_item(item)¶

masskit_ai.mol.mol_prediction module¶

class masskit_ai.mol.mol_prediction.MolPropPredictor(config=None, *args, **kwargs)¶

Bases: Predictor

class used to predict multiple spectra per record, which are averaged into a single spectrum with a standard deviation

add_item(item_idx, item)¶

add newly predicted item at index item_idx

Parameters:

item_idx – index into items
item – item to add

create_dataloaders(model)¶

Create dataloaders.

Parameters:: model – the model to use to predict spectrum
Returns:: list of dataloader objects

create_items(dataloader_idx, start)¶

for a given loader, return a batch of accumulators

Parameters:

dataloader_idx – the index of the dataloader in self.dataloaders
start – the start row of the batch

finalize_items(dataloader_idx, start)¶

do final processing on a batch of predicted spectra

Parameters:

dataloader_idx – the index of the dataloader in self.dataloaders
start – position of the start of the batch

single_prediction(model, item_idx, dataloader_idx)¶

predict a single spectrum

Parameters:

model – the prediction model
item_idx – the index of item in the current dataset
dataloader_idx – the index of the dataloader in self.dataloaders

Returns:

the predicted spectrum

write_items(dataloader_idx, start)¶

write the spectra to files

Parameters:

dataloader_idx – the index of the dataloader in self.dataloaders
start – position of the start of the batch
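One way the methods listed above could fit together is sketched below. The call order, the `ToyPredictor` class, the `predict_all` driver, and the use of plain floats in place of spectra are all assumptions made for illustration; the page documents the individual methods but not the loop that drives them.

```python
from statistics import mean, pstdev


class ToyPredictor:
    """Hypothetical stand-in that mirrors the method names listed above."""

    def __init__(self, rows, batch_size=2):
        self.rows = rows              # input records
        self.batch_size = batch_size
        self.items = []               # accumulators for the current batch
        self.written = []             # stands in for an output file

    def create_items(self, dataloader_idx, start):
        # One empty accumulator per record in the batch.
        n = len(self.rows[start:start + self.batch_size])
        self.items = [[] for _ in range(n)]

    def single_prediction(self, model, item_idx, dataloader_idx):
        return model(self.rows[item_idx])

    def add_item(self, item_idx, item):
        self.items[item_idx].append(item)

    def finalize_items(self, dataloader_idx, start):
        # Collapse repeated predictions into (mean, standard deviation).
        self.items = [(mean(v), pstdev(v)) for v in self.items]

    def write_items(self, dataloader_idx, start):
        self.written.extend(self.items)


def predict_all(predictor, models, dataloader_idx=0):
    """Assumed driver loop: create, predict, accumulate, finalize, write."""
    for start in range(0, len(predictor.rows), predictor.batch_size):
        predictor.create_items(dataloader_idx, start)
        for model in models:
            for offset in range(len(predictor.items)):
                value = predictor.single_prediction(
                    model, start + offset, dataloader_idx
                )
                predictor.add_item(offset, value)
        predictor.finalize_items(dataloader_idx, start)
        predictor.write_items(dataloader_idx, start)
    return predictor.written
```

Processing one batch at a time keeps memory bounded: accumulators exist only for the rows between start and start + batch_size, and are written out before the next batch begins.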

Module contents¶
