masskit_ai.mol package¶
Subpackages¶
- masskit_ai.mol.small package
- Subpackages
- masskit_ai.mol.small.models package
- Submodules
- masskit_ai.mol.small.models.algos module
- masskit_ai.mol.small.models.gf module
- masskit_ai.mol.small.models.model_utils module
- masskit_ai.mol.small.models.mol_features module
- masskit_ai.mol.small.models.mol_graph module
- masskit_ai.mol.small.models.path_utils module
- masskit_ai.mol.small.models.patn module
- Module contents
- masskit_ai.mol.small.models package
- Module contents
- Subpackages
Submodules¶
masskit_ai.mol.mol_datasets module¶
- class masskit_ai.mol.mol_datasets.MolPropDataset(*args: Any, **kwargs: Any)¶
Bases:
BaseDataset
class for accessing a dataframe of small molecules and their properties
How workers are set up requires some explanation:
- If there is more than one GPU, each GPU has a corresponding main process.
- The numbering of this GPU within a node is given by the LOCAL_RANK environment variable.
- If there is more than one node, the numbering of the node is given by the NODE_RANK environment variable.
- The number of nodes times the number of GPUs is given by the WORLD_SIZE environment variable.
- The number of GPUs on the current node can be found by parsing the PL_TRAINER_GPUS environment variable.
- These environment variables are only available when doing DDP. Otherwise, sharding should be done using id and num_workers from torch.utils.data.get_worker_info().
- Each main process creates an instance of Dataset. This instance is NOT initialized by worker_init_fn, only by the constructor.
- Each worker is created by forking the main process, giving each worker a copy of the already-constructed Dataset.
  - Each of these forked Datasets is initialized by worker_init_fn.
  - The global torch.utils.data.get_worker_info() contains a reference to the forked Dataset and other info.
  - These workers then take turns feeding minibatches into the training process.
  - *Important*: since each worker is a copy, __init__() is only called once, and only in the main process.
- The dataset in the main process is used by other steps, such as the validation callback.
  - This means that if any important initialization is done in worker_init_fn, it must be explicitly applied to the main process Dataset as well.
- Alternative sources of these parameters:
  - global_rank = trainer.node_rank * trainer.num_processes + process_idx
  - world_size = trainer.num_nodes * trainer.num_processes
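The sharding arithmetic described above can be sketched as a small helper. This is a hypothetical illustration, not part of masskit_ai: under DDP the LOCAL_RANK, NODE_RANK, and WORLD_SIZE variables are set by the launcher, and the defaults below cover a single-process run; the worker id would normally come from torch.utils.data.get_worker_info().

```python
import os

def shard_of(num_data_workers: int, worker_id: int, gpus_per_node: int = 1):
    """Return (shard_id, num_shards) for the current dataloader worker.

    Hypothetical helper sketching the sharding arithmetic described above;
    not part of masskit_ai.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # GPU index within the node
    node_rank = int(os.environ.get("NODE_RANK", 0))    # node index
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # nodes * GPUs
    global_rank = node_rank * gpus_per_node + local_rank
    # every (process, worker) pair reads a disjoint shard of the data
    num_shards = world_size * num_data_workers
    shard_id = global_rank * num_data_workers + worker_id
    return shard_id, num_shards
```

Each (main process, dataloader worker) pair then reads rows where `row_index % num_shards == shard_id`, so no row is fed twice.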
- property data¶
- get_column(column)¶
retrieve a column from the parquet file
- Parameters:
column – the column to retrieve
- get_data_row(index)¶
given the index, return the corresponding data row
- get_y(data_row)¶
given the data row, return the target of the network
- to_pandas()¶
return data as pandas dataframe
- Raises:
NotImplementedError – not implemented
- masskit_ai.mol.mol_datasets.graphormer_collator(config)¶
collation function factory for Graphormer
- masskit_ai.mol.mol_datasets.pad_1d_unsqueeze(x, padlen)¶
- masskit_ai.mol.mol_datasets.pad_2d_unsqueeze(x, padlen)¶
- masskit_ai.mol.mol_datasets.pad_3d_unsqueeze(x, padlen1, padlen2, padlen3)¶
- masskit_ai.mol.mol_datasets.pad_attn_bias_unsqueeze(x, padlen)¶
- masskit_ai.mol.mol_datasets.pad_edge_type_unsqueeze(x, padlen)¶
- masskit_ai.mol.mol_datasets.pad_spatial_pos_unsqueeze(x, padlen)¶
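The pad_*_unsqueeze helpers share one pattern: extend an item to a fixed length with a pad value, then add a leading batch dimension so items of different sizes can be stacked into a minibatch. A pure-Python sketch of the 1-D case (list-based so it runs without torch; the real helpers operate on tensors, and Graphormer-style collators commonly shift ids by one so zero can serve as the padding id):

```python
def pad_1d_unsqueeze_sketch(x, padlen, pad_value=0, shift=1):
    """Pad a 1-D sequence to padlen and wrap it in a batch dimension.

    Illustrative stand-in for the tensor-based pad_1d_unsqueeze; `shift`
    mimics the +1 offset that reserves 0 as the padding id.
    """
    shifted = [v + shift for v in x]                   # reserve pad_value for padding
    padded = shifted + [pad_value] * (padlen - len(shifted))
    return [padded]                                    # unsqueeze: shape (1, padlen)
```

To batch several sequences, each is padded to the length of the longest one and the leading dimensions are concatenated; the 2-D and 3-D variants apply the same idea along more axes.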
- masskit_ai.mol.mol_datasets.patn_collator(config)¶
collation function factory for PATN
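graphormer_collator and patn_collator are factories: they close over the config and return the function that a DataLoader will call on a list of dataset rows. A hypothetical sketch of the pattern (the config field is illustrative, not masskit_ai's actual schema):

```python
def make_collator(config):
    """Return a collate_fn bound to config (collation factory pattern)."""
    max_len = config["max_len"]  # illustrative config field

    def collate(batch):
        # pad every item in the batch to the configured length
        return [item + [0] * (max_len - len(item)) for item in batch]

    return collate
```

The returned closure is what gets passed to torch as `collate_fn=make_collator(cfg)`, which is why the factory takes the config rather than a batch.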
masskit_ai.mol.mol_embed module¶
- class masskit_ai.mol.mol_embed.EmbedGraphormerMol(config)¶
Bases:
Embed
Embedding of mol for Graphormer
- embed(row)¶
call the requested embedding functions as listed in config.ml.embedding.embeddings
- Parameters:
row – the data row
- Returns:
the concatenated one hot tensor of the embeddings
- static mol_channels(self)¶
the number of mol channels
- Returns:
the number of mol channels
- mol_embed(row)¶
embed a processed mol
- Parameters:
row – data record
- Returns:
one hot tensor
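embed() dispatches to the embedding methods named in config.ml.embedding.embeddings and concatenates their outputs. A simplified sketch of that name-based dispatch, using getattr lookup and list concatenation in place of tensor concatenation (the class and method names here are illustrative, not masskit_ai's):

```python
class EmbedSketch:
    """Minimal stand-in showing name-based embedding dispatch."""

    def __init__(self, embedding_names):
        # in masskit_ai this list comes from config.ml.embedding.embeddings
        self.embedding_names = embedding_names

    def mol_embed(self, row):
        return [1, 0]  # placeholder one-hot fragment

    def extra_embed(self, row):
        return [0, 1]  # placeholder one-hot fragment

    def embed(self, row):
        # look up each configured embedding method by name and concatenate
        out = []
        for name in self.embedding_names:
            out.extend(getattr(self, name)(row))
        return out
```

Because dispatch is by name, adding a new embedding only requires defining the method and listing it in the config.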
- class masskit_ai.mol.mol_embed.EmbedPATNMol(config)¶
Bases:
Embed
Embedding for PATN PropPredictor
- embed(row)¶
call the requested embedding functions as listed in config.ml.embedding.embeddings
- Parameters:
row – the data row
- Returns:
the concatenated one hot tensor of the embeddings
- static mol_channels(self)¶
the number of mol channels
- Returns:
the number of mol channels
- mol_path_embed(row)¶
embed the NCE as a one-hot tensor
- Parameters:
row – data record
- Returns:
one hot tensor
- masskit_ai.mol.mol_embed.convert_to_single_emb(x, offset: int = 512)¶
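convert_to_single_emb maps several categorical feature columns into a single shared embedding table by shifting column i into its own id range, so ids from different columns never collide. A pure-Python sketch of that shift (the real function operates on tensors; the +1 keeps 0 free as a padding id):

```python
def convert_to_single_emb_sketch(x, offset=512):
    """Shift each feature column into a disjoint id range.

    x: list of rows, each a list of non-negative categorical ids < offset.
    Column i is offset by 1 + i * offset, so distinct columns map to
    distinct regions of one shared embedding table.
    """
    return [[v + 1 + i * offset for i, v in enumerate(row)] for row in x]
```

With offset=512, column 0 occupies ids 1–512, column 1 occupies 513–1024, and so on, which is what lets one embedding matrix serve all feature columns.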
- masskit_ai.mol.mol_embed.from_mol(mol)¶
Converts a mol to a torch_geometric.data.Data instance.
- Parameters:
mol – an rdkit molecule
- Returns:
Data
- masskit_ai.mol.mol_embed.preprocess_item(item)¶
masskit_ai.mol.mol_prediction module¶
- class masskit_ai.mol.mol_prediction.MolPropPredictor(config=None, *args, **kwargs)¶
Bases:
Predictor
class used to predict multiple spectra per record, which are averaged into a single spectrum with a standard deviation
- add_item(item_idx, item)¶
add a newly predicted item at index item_idx
- Parameters:
item_idx – index into items
item – item to add
- create_dataloaders(model)¶
Create dataloaders.
- Parameters:
model – the model to use to predict spectrum
- Returns:
list of dataloader objects
- create_items(dataloader_idx, start)¶
for a given dataloader, return a batch of accumulators
- Parameters:
dataloader_idx – the index of the dataloader in self.dataloaders
start – the start row of the batch
- finalize_items(dataloader_idx, start)¶
do final processing on a batch of predicted spectra
- Parameters:
dataloader_idx – the index of the dataloader in self.dataloaders
start – position of the start of the batch
- single_prediction(model, item_idx, dataloader_idx)¶
predict a single spectrum
- Parameters:
model – the prediction model
item_idx – the index of item in the current dataset
dataloader_idx – the index of the dataloader in self.dataloaders
- Returns:
the predicted spectrum
- write_items(dataloader_idx, start)¶
write the spectra to files
- Parameters:
dataloader_idx – the index of the dataloader in self.dataloaders
start – position of the start of the batch
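Taken together, MolPropPredictor's methods form a per-batch lifecycle: create_items allocates accumulators, single_prediction fills each one, add_item stores it, finalize_items post-processes the batch, and write_items persists it. A schematic driver loop illustrating how a Predictor-like object might be exercised (masskit_ai's actual driver may differ):

```python
def run_prediction(predictor, model, batch_size, num_rows, dataloader_idx=0):
    """Drive a Predictor-like object through its batch lifecycle.

    Illustrative sketch, not masskit_ai's actual prediction loop.
    """
    for start in range(0, num_rows, batch_size):
        # allocate accumulators for this batch
        items = predictor.create_items(dataloader_idx, start)
        for item_idx in range(len(items)):
            # predict one item and store it in its accumulator
            predictor.add_item(
                item_idx, predictor.single_prediction(model, item_idx, dataloader_idx)
            )
        # post-process, then persist the finished batch
        predictor.finalize_items(dataloader_idx, start)
        predictor.write_items(dataloader_idx, start)
```

Keeping allocation, prediction, finalization, and writing as separate hooks is what lets subclasses override one stage (e.g. the output format in write_items) without reimplementing the loop.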