masskit_ai.mol package

Subpackages

Submodules

masskit_ai.mol.mol_datasets module

class masskit_ai.mol.mol_datasets.MolPropDataset(*args: Any, **kwargs: Any)

Bases: BaseDataset

class for accessing a dataframe of small molecules and their properties

How workers are set up requires some explanation:

  • if there is more than one gpu, each gpu has a corresponding main process.

  • the numbering of this gpu within a node is given by the environment variable LOCAL_RANK

  • if there is more than one node, the numbering of the node is given by the NODE_RANK environment variable

  • the number of nodes times the number of gpus is given by the WORLD_SIZE environment variable

  • the number of gpus on the current node can be found by parsing the PL_TRAINER_GPUS environment variable

  • these environment variables are only available when running DDP. Otherwise, sharding should be done using id and num_workers from torch.utils.data.get_worker_info()

  • each main process creates an instance of Dataset. This instance is NOT initialized by worker_init_fn, only by the constructor.

  • each worker is created by forking the main process, giving each worker a copy of Dataset, already constructed.
    • each of these forked Datasets is initialized by worker_init_fn

    • the global torch.utils.data.get_worker_info() contains a reference to the forked Dataset and other info

    • these workers then take turns feeding minibatches into the training process

    • *important* since each worker is a copy, __init__() is only called once, only in the main process

  • the dataset in the main processes is used by other steps, such as the validation callback
    • this means that if there is any important initialization done in worker_init_fn, it must explicitly be done to the main process Dataset

  • alternative sources of parameters:
    • global_rank = trainer.node_rank * trainer.num_processes + process_idx

    • world_size = trainer.num_nodes * trainer.num_processes
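
The sharding rules in the list above can be sketched as follows. This is a minimal illustration, not the masskit_ai implementation: global_rank and shard_indices are hypothetical helpers, the worker id and worker count are passed in as plain integers (in practice they would come from torch.utils.data.get_worker_info()), the number of gpus per node is passed in rather than parsed from PL_TRAINER_GPUS, and the strided split is only one of several reasonable ways to divide rows.

```python
import os


def global_rank(node_rank, num_processes, process_idx):
    """Rank of a main process across all nodes, following the formula above."""
    return node_rank * num_processes + process_idx


def shard_indices(num_rows, worker_id=0, num_workers=1, gpus_per_node=1, env=None):
    """Return the row indices a single worker should read (hypothetical helper).

    With DDP, WORLD_SIZE, NODE_RANK and LOCAL_RANK are read from env.
    Without DDP, only worker_id and num_workers are used.
    """
    env = os.environ if env is None else env
    if "WORLD_SIZE" in env:
        world_size = int(env["WORLD_SIZE"])
        rank = global_rank(
            int(env.get("NODE_RANK", 0)),
            gpus_per_node,
            int(env.get("LOCAL_RANK", 0)),
        )
    else:
        # Assumption: no DDP means a single main process.
        world_size, rank = 1, 0
    total_shards = world_size * num_workers
    shard = rank * num_workers + worker_id
    # Strided split: shard k reads rows k, k + total_shards, ...
    return list(range(shard, num_rows, total_shards))
```

Because every (main process, worker) pair receives a distinct shard number, the shards do not overlap and together cover every row exactly once.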

property data
get_column(column)

retrieve a column from the parquet file

Parameters:

column – the column to retrieve

get_data_row(index)

given the index, return corresponding data for the index

get_y(data_row)

given the data row, return the target of the network

to_pandas()

return data as pandas dataframe

Raises:

NotImplementedError – not implemented

masskit_ai.mol.mol_datasets.graphormer_collator(config)

collation function factory for graphormer

masskit_ai.mol.mol_datasets.pad_1d_unsqueeze(x, padlen)
masskit_ai.mol.mol_datasets.pad_2d_unsqueeze(x, padlen)
masskit_ai.mol.mol_datasets.pad_3d_unsqueeze(x, padlen1, padlen2, padlen3)
masskit_ai.mol.mol_datasets.pad_attn_bias_unsqueeze(x, padlen)
masskit_ai.mol.mol_datasets.pad_edge_type_unsqueeze(x, padlen)
masskit_ai.mol.mol_datasets.pad_spatial_pos_unsqueeze(x, padlen)
masskit_ai.mol.mol_datasets.patn_collator(config)

collation function factory for PATN
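
Both graphormer_collator and patn_collator are described as collation function factories: they take a config and return the function that assembles a batch. The general pattern can be sketched as below. Everything here is an assumption made for illustration: make_padding_collator, the dict-based config, and its pad_value and max_len keys are hypothetical, and the sketch pads plain Python lists rather than tensors.

```python
def make_padding_collator(config):
    """Return a collate function configured by config (hypothetical sketch)."""
    pad_value = config.get("pad_value", 0)
    max_len = config.get("max_len")  # None means pad to the longest item

    def collate(batch):
        """Truncate or pad every sequence in batch to a common length."""
        if max_len is not None:
            target = max_len
        else:
            target = max(len(item) for item in batch)
        padded = []
        for item in batch:
            kept = list(item[:target])
            padded.append(kept + [pad_value] * (target - len(kept)))
        return padded

    return collate
```

The returned closure carries its settings with it, so it can be handed to a data loader as a one-argument collate function without any global state.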

masskit_ai.mol.mol_embed module

class masskit_ai.mol.mol_embed.EmbedGraphormerMol(config)

Bases: Embed

Embedding of mol for Graphormer

embed(row)

call the requested embedding functions as listed in config.ml.embedding.embeddings

Parameters:

row – the data row

Returns:

the concatenated one hot tensor of the embeddings
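
The dispatch described for embed(row) can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: EmbedSketch, the rings and aromatic embeddings, the row keys, and the name + "_embed" lookup rule are all hypothetical, and plain lists stand in for tensors.

```python
class EmbedSketch:
    """Hypothetical stand-in for an Embed subclass."""

    def __init__(self, embedding_names):
        # Stands in for the list at config.ml.embedding.embeddings.
        self.embedding_names = list(embedding_names)

    def rings_embed(self, row):
        """One-hot encode row["num_rings"] over the values 0..3."""
        return [1.0 if row["num_rings"] == n else 0.0 for n in range(4)]

    def aromatic_embed(self, row):
        """One-hot encode the boolean row["is_aromatic"]."""
        return [0.0, 1.0] if row["is_aromatic"] else [1.0, 0.0]

    def embed(self, row):
        """Call each requested embedding and concatenate the results."""
        parts = []
        for name in self.embedding_names:
            # Assumption: each embedding is a method named <name>_embed.
            parts.extend(getattr(self, name + "_embed")(row))
        return parts
```

The order of names in the configuration fixes the order of the concatenated channels, so the same list must be used for training and prediction.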

static mol_channels(self)

the number of mol channels

Returns:

the number of mol channels

mol_embed(row)

embed a processed mol

Parameters:

row – data record

Returns:

one hot tensor

class masskit_ai.mol.mol_embed.EmbedPATNMol(config)

Bases: Embed

Embedding for PATN PropPredictor

embed(row)

call the requested embedding functions as listed in config.ml.embedding.embeddings

Parameters:

row – the data row

Returns:

the concatenated one hot tensor of the embeddings

static mol_channels(self)

the number of mol channels

Returns:

the number of mol channels

mol_path_embed(row)

embed the nce as a one hot tensor

Parameters:

row – data record

Returns:

one hot tensor

masskit_ai.mol.mol_embed.convert_to_single_emb(x, offset: int = 512)
masskit_ai.mol.mol_embed.from_mol(mol)

Converts a mol to a torch_geometric.data.Data instance.

Parameters:

mol – an rdkit molecule

Returns:

Data

masskit_ai.mol.mol_embed.preprocess_item(item)

masskit_ai.mol.mol_prediction module

class masskit_ai.mol.mol_prediction.MolPropPredictor(config=None, *args, **kwargs)

Bases: Predictor

class used to predict multiple spectra per record, which are averaged into a single spectrum with a standard deviation

add_item(item_idx, item)

add newly predicted item at index item_idx

Parameters:
  • item_idx – index into items

  • item – item to add

create_dataloaders(model)

Create dataloaders.

Parameters:

model – the model to use to predict spectrum

Returns:

list of dataloader objects

create_items(dataloader_idx, start)

for a given loader, return a batch of accumulators

Parameters:
  • dataloader_idx – the index of the dataloader in self.dataloaders

  • start – the start row of the batch

finalize_items(dataloader_idx, start)

do final processing on a batch of predicted spectra

Parameters:
  • dataloader_idx – the index of the dataloader in self.dataloaders

  • start – position of the start of the batch

single_prediction(model, item_idx, dataloader_idx)

predict a single spectrum

Parameters:
  • model – the prediction model

  • item_idx – the index of item in the current dataset

  • dataloader_idx – the index of the dataloader in self.dataloaders

Returns:

the predicted spectrum

write_items(dataloader_idx, start)

write the spectra to files

Parameters:
  • dataloader_idx – the index of the dataloader in self.dataloaders

  • start – position of the start of the batch
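
The methods above suggest a lifecycle of creating accumulators, adding repeated predictions, and finalizing into a mean with a standard deviation. A minimal sketch of that flow is given below. It is not the MolPropPredictor implementation: MeanStdAccumulator and predict_batch are hypothetical names, predictions are plain lists of floats, and the use of Welford's running update with a sample (n - 1) standard deviation is an assumption.

```python
import math


class MeanStdAccumulator:
    """Running per-element mean and standard deviation (Welford's method)."""

    def __init__(self, size):
        self.count = 0
        self.mean = [0.0] * size
        self.m2 = [0.0] * size  # running sum of squared deviations

    def add(self, values):
        """Fold one prediction into the running statistics."""
        self.count += 1
        for i, value in enumerate(values):
            delta = value - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (value - self.mean[i])

    def finalize(self):
        """Return (mean, std); std is zero with fewer than two predictions."""
        if self.count < 2:
            return self.mean, [0.0] * len(self.mean)
        std = [math.sqrt(m / (self.count - 1)) for m in self.m2]
        return self.mean, std


def predict_batch(predict_once, num_items, num_draws, size):
    """Average num_draws predictions for each of num_items records."""
    items = [MeanStdAccumulator(size) for _ in range(num_items)]  # create
    for _ in range(num_draws):
        for item_idx in range(num_items):
            items[item_idx].add(predict_once(item_idx))  # predict and add
    return [acc.finalize() for acc in items]  # finalize
```

Keeping running statistics instead of storing every prediction means memory use per record stays constant regardless of how many predictions are averaged.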

Module contents