masskit.utils package¶
Subpackages¶
Submodules¶
masskit.utils.accumulator module¶
- class masskit.utils.accumulator.Accumulator(*args, **kwargs)¶
Bases:
ABC
accumulator class used to take the mean and standard deviation of a set of data
- abstract add(new_item)¶
accumulate a new item
- Parameters:
new_item – new item to be accumulated
- abstract finalize()¶
finalize the accumulation
- class masskit.utils.accumulator.AccumulatorProperty(*args, **kwargs)¶
Bases:
Accumulator
used to calculate the mean and standard deviation of a property
- add(new_item)¶
add an item to the average. Keeps a running total of the average and standard deviation using Welford’s algorithm.
- Parameters:
new_item – new item to be added
- finalize()¶
finalize the standard deviation after all the spectra have been added
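The running mean and standard deviation described above can be sketched with Welford’s algorithm. This is an illustrative re-implementation, not masskit’s own code; the attribute names (`mean`, `standard_deviation`) are assumptions.

```python
class WelfordAccumulator:
    """Running mean and standard deviation via Welford's algorithm
    (a sketch of the Accumulator/AccumulatorProperty pattern above)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def add(self, new_item):
        # single-pass update of the mean and squared-deviation total
        self.count += 1
        delta = new_item - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (new_item - self.mean)

    def finalize(self):
        # sample standard deviation; zero when fewer than two items were added
        if self.count > 1:
            self.standard_deviation = (self.m2 / (self.count - 1)) ** 0.5
        else:
            self.standard_deviation = 0.0
```

For example, accumulating 1.0, 2.0, and 3.0 and then calling `finalize()` yields a mean of 2.0 and a sample standard deviation of 1.0.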
masskit.utils.arrow module¶
- masskit.utils.arrow.create_object_id(filename, filters)¶
create an object id based on filename and filters
- Parameters:
filename – filename
filters – filter string
- Returns:
object id
- masskit.utils.arrow.save_to_arrow(filename, columns=None, filters=None, tempdir=None)¶
Load a parquet file and save it as a temporary arrow file after applying filters and column lists. If the temporary arrow file already exists, load it as a memory map.
- Parameters:
filename – parquet file
columns – columns to load
filters – parquet file filters
tempdir – tempdir to use for memmap, otherwise use python default
- Returns:
ArrowLibraryMap
masskit.utils.files module¶
- class masskit.utils.files.BatchFileReader(filename: Union[str, PathLike], format: Union[Dict, omegaconf.DictConfig] = None, row_batch_size: int = 5000)¶
Bases:
object
- iter_tables() Table ¶
read batch generator, returns a Table
- loaders = {'csv': <class 'masskit.utils.files.CSVLoader'>, 'mgf': <class 'masskit.utils.files.MGFLoader'>, 'msp': <class 'masskit.utils.files.MSPLoader'>, 'sdf': <class 'masskit.utils.files.SDFLoader'>}¶
- class masskit.utils.files.BatchFileWriter(filename: Union[str, PathLike], format: str = None, annotate: bool = False, row_batch_size: int = 5000, num_workers: int = 7, column_name: str = None)¶
Bases:
object
- close()¶
Close the writer. Essential to avoid race conditions with threaded writers.
- write_table(table: Table) None ¶
write a table out
- Parameters:
table – table to write
- class masskit.utils.files.BatchLoader(file_type, row_batch_size=5000, format=None, num=None)¶
Bases:
object
- ecfp4 = <masskit.utils.fingerprints.ECFPFingerprint object>¶
- finalize()¶
finalize the chunk
- load(fp)¶
read in a chunk from a stream
- Parameters:
fp – the stream
- loop_end()¶
end of loop to read in one record
- classmethod mol2row(mol, max_size: int = 0, skip_computed_props=True, skip_expensive: bool = True) Dict ¶
Convert an rdkit Mol into a row
- Parameters:
mol – the molecule
max_size – the maximum bounding box size (used to filter out large molecules. 0=no bound)
skip_computed_props – skip computing properties
skip_expensive – skip computing computationally expensive properties
- Returns:
row as dict, mol
- setup()¶
set up loading
- tms = '[#14]([CH3])([CH3])[CH3]'¶
- class masskit.utils.files.CSVLoader(row_batch_size=5000, format=None, num=None)¶
Bases:
BatchLoader
- load(fp)¶
read in a chunk from a stream
- Parameters:
fp – the stream
- class masskit.utils.files.MGFLoader(row_batch_size=5000, format=None, num=None, set_probabilities=(0.0, 0.93, 0.02, 0.05), min_intensity=0.0)¶
Bases:
BatchLoader
- load(fp)¶
read in a chunk from a stream
- Parameters:
fp – the stream
- class masskit.utils.files.MSPLoader(row_batch_size=5000, format=None, min_intensity=0, num=None)¶
Bases:
BatchLoader
- load(fp)¶
read in a chunk from a stream
- Parameters:
fp – the stream
- class masskit.utils.files.MzTab_Reader(fp, dedup=False, decoy_func=None)¶
Bases:
object
Class for reading an mzTab file
- Parameters:
fp – stream or filename
dedup – Only take the first row with a given hit_id
- get_hitlist()¶
- parse_metadata()¶
- parse_psm()¶
- parse_psm_row(row)¶
- read_sections(fp)¶
- class masskit.utils.files.SDFLoader(row_batch_size=5000, format=None, num=None, set_probabilities=(0.0, 0.93, 0.02, 0.05), min_intensity=0.0)¶
Bases:
BatchLoader
- load(fp)¶
read in a chunk from a stream
- Parameters:
fp – the stream
- masskit.utils.files.add_row_to_records(records, row)¶
- masskit.utils.files.create_table_with_mod_dict(records, schema)¶
- masskit.utils.files.empty_records(schema)¶
- masskit.utils.files.get_progress(fp)¶
Given a file pointer, return the best available progress meter
- Parameters:
fp – name of the file to read or a file object
- masskit.utils.files.load_default_config_schema(file_type, format=None)¶
for a given file type and format, return configuration
- Parameters:
file_type – the type of file, e.g. mgf, msp, arrow, parquet, sdf
format – the specific format of the file_type, e.g. msp_mol, msp_peptide, sdf_mol, sdf_nist_mol, sdf_pubchem_mol. If none, use default
- Returns:
config
- masskit.utils.files.load_mzTab(fp, dedup=True, decoy_func=None)¶
Read a file in mzTab format.
- Parameters:
fp – stream or filename
dedup – Only take the first row with a given hit_id
- masskit.utils.files.parse_energy(row, value)¶
parse energy field that has nce or collision_energy in it
- masskit.utils.files.parse_glycopeptide_annot(annots, peak_index)¶
take a glycopeptide annotation string like “Y0-H2O+i/-18.2ppm,Y0-NH3/7.6ppm 22 23” and parse it. This function is not complete.
- Parameters:
annots – the annotation string
peak_index – the index of the peak with the annotation in the mz array
- Returns:
the parsed values in an array
- masskit.utils.files.read_parquet(fp, columns=None, num=None, filters=None)¶
reads a PyArrow table from a parquet file.
- Parameters:
fp – stream or filename
columns – list of columns to read, None=all
filters – parquet predicate as a list of tuples
- Returns:
PyArrow table
- masskit.utils.files.records2table(records, schema_group)¶
convert a flat table with spectral records into a table with a nested spectrum
- Parameters:
records – records to be added to a table
schema_group – the schema group
- masskit.utils.files.seek_size(fp)¶
- masskit.utils.files.spectra_to_array(spectra, min_intensity=0, write_tolerance=False, schema_group=None)¶
convert an array-like of spectra to an arrow_table
- Parameters:
spectra – iterable containing the spectra
min_intensity – the minimum intensity to set the fingerprint bit
write_tolerance – put the tolerance into the arrow Table
schema_group – the schema group of spectrum file
- Returns:
arrow table of spectra
- masskit.utils.files.spectrum2mgf(spectrum)¶
- masskit.utils.files.spectrum2msp(spectrum, annotate=False)¶
- masskit.utils.files.write_parquet(fp, table)¶
save a PyArrow table to a parquet file.
- Parameters:
table – the PyArrow table
fp – stream or filename
masskit.utils.fingerprints module¶
- class masskit.utils.fingerprints.ECFPFingerprint(dimension=4096, radius=2, *args, **kwargs)¶
Bases:
Fingerprint
rdkit version of ECFP fingerprint for a small molecule structure
- from_bitvec(array)¶
create fingerprint from rdkit ExplicitBitVect
- Parameters:
array (DataStructs.ExplicitBitVect) – input ExplicitBitVect
- from_numpy(array)¶
create fingerprint from numpy array
- Parameters:
array (np.ndarray) – input numpy array
- object2fingerprint(obj, dtype=<class 'numpy.float32'>)¶
convert an object into a fingerprint
- Parameters:
obj (object) – the object to convert to a fingerprint
dtype (np.dtype) – data type of output array
- to_bitvec()¶
convert fingerprint to rdkit ExplicitBitVect
- Returns:
the fingerprint
- Return type:
DataStructs.ExplicitBitVect
- to_numpy()¶
convert fingerprint to numpy array
- Returns:
the fingerprint
- Return type:
np.ndarray
- class masskit.utils.fingerprints.Fingerprint(dimension=2000, stride=256, column_name=None, *args, **kwargs)¶
Bases:
ABC
class for encapsulating and calculating a fingerprint
- Parameters:
dimension (int) – size of the fingerprint (includes other features like counts)
- bitvec2numpy()¶
version of to_numpy used when fingerprint is a bitvec
- abstract from_bitvec(array)¶
create fingerprint from rdkit ExplicitBitVect
- Parameters:
array (DataStructs.ExplicitBitVect) – input ExplicitBitVect
- abstract from_numpy(array)¶
create fingerprint from numpy array
- Parameters:
array (np.ndarray) – input numpy array
- get_num_on_bits()¶
retrieve the number of nonzero features
- numpy2bitvec(array)¶
version of from_numpy used when fingerprint is a bitvec
- Parameters:
array – the fingerprint as a numpy array
- abstract object2fingerprint(obj, dtype=<class 'numpy.float32'>)¶
convert an object into a fingerprint
- Parameters:
obj (object) – the object to convert to a fingerprint
dtype (np.dtype) – data type of output array
- size() int ¶
size of the fingerprint in bytes, where the size is a multiple of the stride
- Returns:
size of fingerprint
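One plausible reading of the stride-based sizing above, sketched as a standalone helper. The rounding rule is an assumption for illustration, not taken from the implementation.

```python
def fingerprint_size(dimension_bits: int, stride: int = 256) -> int:
    """Bytes needed to hold dimension_bits bits, rounded up to a
    whole multiple of the stride (assumed interpretation of size())."""
    nbytes = (dimension_bits + 7) // 8          # bits -> bytes, rounded up
    return ((nbytes + stride - 1) // stride) * stride  # round up to stride
```

With the default dimension of 2000 bits and a 256-byte stride this gives 256 bytes; a 4096-bit fingerprint gives 512 bytes.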
- abstract to_bitvec()¶
convert fingerprint to rdkit ExplicitBitVect
- Returns:
the fingerprint
- Return type:
DataStructs.ExplicitBitVect
- abstract to_numpy()¶
convert fingerprint to numpy array
- Returns:
the fingerprint
- Return type:
np.ndarray
- class masskit.utils.fingerprints.MolFingerprint(dimension=4096, *args, **kwargs)¶
Bases:
ABC
class for encapsulating and calculating a molecule fingerprint
- Parameters:
dimension (int) – size of the fingerprint (includes other features like counts)
- abstract from_bitvec(array)¶
create fingerprint from rdkit ExplicitBitVect
- Parameters:
array (DataStructs.ExplicitBitVect) – input ExplicitBitVect
- abstract from_numpy(array)¶
create fingerprint from numpy array
- Parameters:
array (np.ndarray) – input numpy array
- abstract object2fingerprint(obj, dtype=<class 'numpy.float32'>)¶
convert an object into a fingerprint
- Parameters:
obj (object) – the object to convert to a fingerprint
dtype (np.dtype) – data type of output array
- abstract to_bitvec()¶
convert fingerprint to rdkit ExplicitBitVect
- Returns:
the fingerprint
- Return type:
DataStructs.ExplicitBitVect
- abstract to_numpy()¶
convert fingerprint to numpy array
- Returns:
the fingerprint
- Return type:
np.ndarray
- class masskit.utils.fingerprints.SpectrumFingerprint(dimension=2000, bin_size=1.0, first_bin_left=0.5, *args, **kwargs)¶
Bases:
Fingerprint
base class for spectrum fingerprint
- abstract from_bitvec(array)¶
create fingerprint from rdkit ExplicitBitVect
- Parameters:
array (DataStructs.ExplicitBitVect) – input ExplicitBitVect
- abstract from_numpy(array)¶
create fingerprint from numpy array
- Parameters:
array (np.ndarray) – input numpy array
- abstract object2fingerprint(obj, dtype=<class 'numpy.float32'>)¶
convert an object into a fingerprint
- Parameters:
obj (object) – the object to convert to a fingerprint
dtype (np.dtype) – data type of output array
- abstract to_bitvec()¶
convert fingerprint to rdkit ExplicitBitVect
- Returns:
the fingerprint
- Return type:
DataStructs.ExplicitBitVect
- abstract to_numpy()¶
convert fingerprint to numpy array
- Returns:
the fingerprint
- Return type:
np.ndarray
- class masskit.utils.fingerprints.SpectrumFloatFingerprint(dimension=2000, count_max=2000, mz_ratio: bool = False, column_name=None, *args, **kwargs)¶
Bases:
SpectrumFingerprint
create a spectral fingerprint that also contains a peak count encoded by powers of two, skipping 1. In other words, a count bit is set if the number of peaks is >=2, >=4, …
- from_bitvec(array)¶
create fingerprint from rdkit ExplicitBitVect
- Parameters:
array (DataStructs.ExplicitBitVect) – input ExplicitBitVect
- from_numpy(array)¶
create fingerprint from numpy array
- Parameters:
array (np.ndarray) – input numpy array
- object2fingerprint(obj, dtype=<class 'numpy.float32'>)¶
fill out a fixed-size array with the ions. Note that this function assumes the spectra are sorted by mz
- to_bitvec()¶
convert fingerprint to rdkit ExplicitBitVect
- Returns:
the fingerprint
- Return type:
DataStructs.ExplicitBitVect
- to_numpy()¶
convert fingerprint to numpy array
- Returns:
the fingerprint
- Return type:
np.ndarray
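The power-of-two count encoding used by SpectrumFloatFingerprint can be sketched as follows. This is an illustrative helper, not masskit’s code; the exact bit layout is an assumption.

```python
def count_bits(n_peaks: int, count_max: int = 2000) -> list:
    """One bit per power-of-two threshold (2, 4, 8, ..., up to count_max),
    set when the peak count reaches that threshold. The threshold 1 is
    skipped, as described above."""
    bits = []
    threshold = 2
    while threshold <= count_max:
        bits.append(1 if n_peaks >= threshold else 0)
        threshold *= 2
    return bits
```

For a spectrum with 5 peaks and `count_max=16`, the thresholds 2 and 4 are met but 8 and 16 are not, giving `[1, 1, 0, 0]`.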
- class masskit.utils.fingerprints.SpectrumTanimotoFingerPrint(dimension=2000, column_name=None, *args, **kwargs)¶
Bases:
SpectrumFingerprint
create a spectral tanimoto fingerprint
- Parameters:
dimension – overall dimension of fingerprint
bin_size – size of each bin
first_bin_left – the position of the first bin left side
- from_bitvec(array)¶
create fingerprint from rdkit ExplicitBitVect
- Parameters:
array (DataStructs.ExplicitBitVect) – input ExplicitBitVect
- from_numpy(array)¶
create fingerprint from numpy array
- Parameters:
array (np.ndarray) – input numpy array
- object2fingerprint(obj, dtype=<class 'numpy.float32'>)¶
convert an object into a fingerprint
- Parameters:
obj (object) – the object to convert to a fingerprint
dtype (np.dtype) – data type of output array
- to_bitvec()¶
convert fingerprint to rdkit ExplicitBitVect
- Returns:
the fingerprint
- Return type:
DataStructs.ExplicitBitVect
- to_numpy()¶
convert fingerprint to numpy array
- Returns:
the fingerprint
- Return type:
np.ndarray
- masskit.utils.fingerprints.calc_bit(a, b, max_mz_in, tolerance_in)¶
- masskit.utils.fingerprints.calc_dynamic_fingerprint(spectrum, max_mz=2000, tolerance=0.1, min_intensity=50, mz_window=14, min_mz=2.5, max_rank=20, hybrid_mask=None)¶
create a fingerprint created from the top max_rank peaks and the OR of the fingerprint of the top peaks, where each fingerprint is shifted by the mz of one of the top peaks
- Parameters:
spectrum – input spectrum
max_mz – maximum mz used for the fingerprint
tolerance – mass tolerance to use
min_intensity – minimum intensity to allow for creating the fingerprint
mz_window – noise filter mz window
min_mz – minimum mz to use
max_rank – the maximum rank used for the fingerprint
- Returns:
the fingerprint
- masskit.utils.fingerprints.calc_interval_fingerprint(spectrum, max_mz=2000, tolerance=0.1, min_intensity=50, mz_window=14, min_mz=1, peaks=None)¶
create a pairwise interval fingerprint from the filtered peaks of a spectrum
- Parameters:
spectrum – input spectrum
max_mz – maximum mz used for the fingerprint
tolerance – mass tolerance to use
min_intensity – minimum intensity to allow for creating the fingerprint
mz_window – noise filter mz window
min_mz – minimum mz to use
peaks – a list of additional mz values to use for creating the intervals
- Returns:
the fingerprint
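The pairwise-interval idea behind calc_interval_fingerprint can be sketched in plain Python. The binning scheme here (one bin per tolerance width) is an assumption for illustration; masskit’s actual bin layout may differ.

```python
from itertools import combinations

def interval_fingerprint(mz_values, dimension=2000, tolerance=0.1):
    """Set one bit per binned pairwise m/z difference (a sketch of the
    pairwise-interval fingerprint described above)."""
    bits = [0] * dimension
    for a, b in combinations(sorted(mz_values), 2):
        bin_index = int(round((b - a) / tolerance))  # bin by tolerance width
        if 0 <= bin_index < dimension:
            bits[bin_index] = 1
    return bits
```

For peaks at 100.0, 150.0, and 200.0 the pairwise differences are 50.0, 100.0, and 50.0, so exactly two bins are set.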
- masskit.utils.fingerprints.fingerprint_search_numpy(query_fingerprint, fingerprint_array, tanimoto_cutoff, query_fingerprint_count=None, fingerprint_array_count=None)¶
return a list of fingerprint hits to a query fingerprint
- Parameters:
query_fingerprint – query fingerprint
fingerprint_array – array-like list of fingerprints
tanimoto_cutoff – Tanimoto cutoff
query_fingerprint_count – number of bits set in query
fingerprint_array_count – number of bits set in array fingerprints
- Returns:
tanimoto scores (if below threshold, set to 0.0)
Notes: numba doesn’t work with arrays of objects. PyArrow creates a numpy array of objects, where each object is a numpy array; even arrow fixed-size lists do this. Fixed-size binary uses byte objects.
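The Tanimoto scoring described for fingerprint_search_numpy can be sketched with Python ints standing in for bit-packed fingerprints. This is illustrative only; the real code operates on numpy arrays and precomputed bit counts.

```python
def tanimoto_search(query_bits: int, fingerprint_array, cutoff: float):
    """Tanimoto score of the query against each fingerprint, with scores
    below the cutoff set to 0.0 (as in the docstring above).
    Fingerprints are modeled as Python ints used as bitsets."""
    query_count = bin(query_bits).count("1")
    scores = []
    for fp in fingerprint_array:
        intersection = bin(query_bits & fp).count("1")
        union = query_count + bin(fp).count("1") - intersection
        score = intersection / union if union else 0.0
        scores.append(score if score >= cutoff else 0.0)
    return scores
```

An identical fingerprint scores 1.0; a fingerprint sharing 2 of 3 bits scores 2/3; one sharing no bits is zeroed by the cutoff.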
masskit.utils.general module¶
- class masskit.utils.general.MassKitSearchPathPlugin(*args: Any, **kwargs: Any)¶
Bases:
SearchPathPlugin
add the cwd to the search path for configuration yaml files
- manipulate_search_path(search_path: hydra.core.config_search_path.ConfigSearchPath) None ¶
- masskit.utils.general.class_for_name(module_name_list, class_name)¶
dynamically try to find a class in a list of modules
- Parameters:
module_name_list – list of modules to search
class_name – the class to look for
- Returns:
the class
- masskit.utils.general.discounted_cumulative_gain(relevance_array)¶
- masskit.utils.general.expand_path(path_pattern) Iterable[Path] ¶
expand a path with globs (wildcards) into a generator of Paths
- Parameters:
path_pattern – the path to be globbed
- Returns:
iterable of Path
- masskit.utils.general.expand_path_list(path_list)¶
given a file path or list of file paths that may contain wildcards, expand ~ to the user directory and glob the wildcards
- Parameters:
path_list – list or str of paths, may include ~ and *
- Returns:
list of expanded paths
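The ~ expansion and globbing described for expand_path_list can be sketched with pathlib. This is an illustrative re-implementation; masskit’s handling of non-matching literal paths may differ.

```python
from pathlib import Path

def expand_paths(path_list):
    """Expand ~ and glob wildcards in a path or list of paths
    (a sketch of the behavior described above)."""
    if isinstance(path_list, (str, Path)):
        path_list = [path_list]
    expanded = []
    for pattern in path_list:
        pattern = Path(pattern).expanduser()
        # glob relative to the pattern's directory
        expanded.extend(sorted(pattern.parent.glob(pattern.name)))
    return expanded
```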
- masskit.utils.general.get_file(filename, cache_directory=None, search_path=None, tgz_extension=None)¶
for a given file, return the path (downloading file if necessary)
- Parameters:
filename – name of the file
cache_directory – where files are cached (config.paths.cache_directory)
search_path – list of where to search for files (config.paths.search_path)
tgz_extension – if a tgz file to be downloaded, the extension of the unarchived file
- Returns:
path
Notes: zip files are not supported, as the Python zipfile API does not support symlinks
- masskit.utils.general.is_list_like(obj)¶
is the object list-like?
- Parameters:
obj – the object to be tested
- Returns:
true if list like
- masskit.utils.general.open_if_compressed(filename, mode, newline=None)¶
open the file denoted by filename and uncompress it if it is compressed
- Parameters:
filename – filename
mode – file opening mode
newline – specify newline character
- Returns:
stream
- masskit.utils.general.open_if_filename(fp, mode, newline=None)¶
if the fp is a string, open the file, and uncompress if needed
- Parameters:
fp – possible filename
mode – file opening mode
newline – specify newline character
- Returns:
stream
- masskit.utils.general.parse_filename(filename: str)¶
parse filename into root, extension, and compression extension
- Parameters:
filename – filename
- Returns:
root, extension, compression
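The root/extension/compression split can be sketched with pathlib. This is a hypothetical re-implementation; the set of recognized compression extensions is an assumption.

```python
from pathlib import Path

# assumed set of compression extensions recognized by the splitter
COMPRESSION_SUFFIXES = {"gz", "bz2", "xz"}

def split_filename(filename: str):
    """Split a filename into (root, extension, compression extension),
    sketching the behavior of parse_filename above."""
    path = Path(filename)
    compression = ""
    if path.suffix.lstrip(".") in COMPRESSION_SUFFIXES:
        compression = path.suffix.lstrip(".")
        path = path.with_suffix("")  # strip the compression suffix
    extension = path.suffix.lstrip(".")
    root = str(path.with_suffix("")) if extension else str(path)
    return root, extension, compression
```

For example, `"spectra.msp.gz"` splits into `("spectra", "msp", "gz")` and `"data.parquet"` into `("data", "parquet", "")`.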
- masskit.utils.general.read_arrow(filename)¶
read a pyarrow Table from an arrow or parquet file. arrow file is memory mapped.
- Parameters:
filename – file to input. use the suffix “arrow” if an arrow file, “parquet” if a parquet file
- Returns:
pyarrow Table
- masskit.utils.general.search_for_file(filename, directories)¶
look for a file in a list of directories
- Parameters:
filename – the filename to look for
directories – the directories to search
- Returns:
the Path to the file, otherwise None if not found
- masskit.utils.general.write_arrow(table, filename, row_group_size=5000)¶
write a pyarrow Table to an arrow or parquet file.
- Parameters:
filename – file to output. use the suffix “arrow” if an arrow file, “parquet” if a parquet file
table – pyarrow Table
row_group_size – if a parquet file, use this row group size
masskit.utils.hitlist module¶
- class masskit.utils.hitlist.CompareRecallDCG(**kwargs)¶
Bases:
HitlistCompare
compare two hitlists by ranking the scores and computing the recall and discounted cumulative gain (DCG) over several subsets of the hitlist with different lengths. The recall values are the number of hits in each subset of the compare hitlist whose rank is equal to or better than their rank in the ground truth hitlist.
- compare(compare_hitlist, truth_hitlist=None, recall_values_in=None, rank_method=None)¶
compare a hitlist to a ground truth hitlist. Rank according to scores, then compute recall. The ranking applied is min ranking: a group of tied hits all receive the smallest rank in the group.
- Parameters:
compare_hitlist – Hitlist to compare to
truth_hitlist – Hitlist that serves as ground truth
recall_values_in – lengths of the hitlist subsets used to calculate recall. [1, 3, 10] by default
rank_method – what method to use to do ranking? “min” is default
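The min ranking, recall, and DCG pieces of the comparison can be sketched as standalone helpers. This is a simplified approximation of the comparison described above, not the class’s implementation.

```python
from math import log2

def min_rank(scores, ascending=False):
    """'min' ranking: tied values all share the smallest rank in their group."""
    ordered = sorted(scores, reverse=not ascending)
    return [ordered.index(s) + 1 for s in scores]

def recall_at(compare_ranks, truth_ranks, k):
    """Count hits in the top-k subset whose compare rank is equal to or
    better (smaller) than their ground-truth rank."""
    return sum(1 for c, t in zip(compare_ranks[:k], truth_ranks[:k]) if c <= t)

def dcg(relevance):
    """Discounted cumulative gain of a relevance list."""
    return sum(rel / log2(i + 2) for i, rel in enumerate(relevance))
```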
- class masskit.utils.hitlist.CosineScore(hit_table_map, query_table_map=None, score_name=None, scale=1.0)¶
Bases:
Score
- class masskit.utils.hitlist.Hitlist(hitlist=None)¶
Bases:
ABC
base class for a list of hits from a search
- Parameters:
hitlist (pandas.DataFrame) – the hitlist
- get_query_ids()¶
return list of unique query ids
- Returns:
query ids
- Return type:
np.int64
- property hitlist¶
get the hitlist as a pandas dataframe
- Returns:
the hitlist
- Return type:
pandas.DataFrame
- load(file)¶
- save(file)¶
- sort(score=None, ascending=False)¶
sort the hitlist per query
- Parameters:
score – name of the score. default is cosine_score
ascending – should the score be ascending
- to_pandas()¶
get hitlist as a pandas dataframe
- class masskit.utils.hitlist.HitlistCompare(comparison_score=None, truth_score=None, comparison_score_ascending=False, truth_score_ascending=False, comparison_score_rank=None, truth_score_rank=None)¶
Bases:
ABC
base class for comparing two hitlists.
- Parameters:
comparison_score – the column name of the score to compare to ground truth, ‘cosine_score’ default
truth_score – the column name of the ground truth score, default same as comparison score
comparison_score_ascending – is the comparison score ascending?
truth_score_ascending – is the ground truth score ascending?
comparison_score_rank – what is the column name of the rank of the comparison score, default appends _rank to comparison_score
truth_score_rank – what is the column name of the rank of the truth score, default appends _rank to truth_score
- abstract compare(compare_hitlist, truth_hitlist=None, recall_values=None, rank_method=None)¶
compare hitlist to a ground truth hitlist.
- Parameters:
compare_hitlist – Hitlist to compare to
truth_hitlist – Hitlist that serves as ground truth
recall_values – lengths of the hitlist subsets used to calculate recall. [1, 3, 10] by default
rank_method – what method to use to do ranking? “min” is default
- class masskit.utils.hitlist.IdentityRecall(**kwargs)¶
Bases:
HitlistCompare
examine a hitlist with an identity column and compute the recall
- compare(compare_hitlist, recall_values_in=None, rank_method=None, identity_column=None)¶
compute recall of compare hitlist
- Parameters:
compare_hitlist – Hitlist to compute the recall on
recall_values_in – lengths of the hitlist subsets used to calculate recall. [1, 3, 10] by default
rank_method – what method to use to do ranking? “min” is default
identity_column – column name of the identity column
- class masskit.utils.hitlist.Score(hit_table_map=None, query_table_map=None, score_name=None)¶
Bases:
ABC
base class for scoring a hitlist
masskit.utils.index module¶
- class masskit.utils.index.BruteForceIndex(index_name=None, dimension=None, fingerprint_factory=None)¶
Bases:
Index
search a library by brute force spectrum matching
- Parameters:
dimension – the number of features in the fingerprint (ignored)
fingerprint_factory – class used to encapsulate the fingerprint
- create(table_map, column_name=None)¶
create index from a TableMap
- Parameters:
table_map – the library to index
column_name – name of the column containing the objects to index. None=’spectrum’
- load(file=None)¶
load the index
- Parameters:
file – name of the file to load the fingerprint in
- optimize()¶
optimize the index
- save(file=None)¶
save the index
- Parameters:
file – name of file to save fingerprint in
- search(objects, hitlist_size=50, epsilon=0.3, with_raw_score=False, id_list=None, id_list_query=None, predicate=None)¶
search a series of query objects
- Parameters:
objects – the query objects to be searched or their fingerprints
hitlist_size – the size of the hitlist returned
epsilon – max search accuracy error
with_raw_score – calculate and return the raw_score
id_list – array-like for converting row numbers in search results to hit ids
id_list_query – array-like for converting row numbers in search results to query ids
predicate – filter array that is the result of a predicate query
- class masskit.utils.index.DescentIndex(index_name=None, dimension=2000, fingerprint_factory=<class 'masskit.utils.fingerprints.SpectrumFloatFingerprint'>)¶
Bases:
Index
pynndescent index for a fingerprint
- Parameters:
dimension – the number of features in the fingerprint
fingerprint_factory – class used to encapsulate the fingerprint
- create(table_map, metric=None, column_name=None)¶
create index from a TableMap
- Parameters:
table_map – the library to index
column_name – name of the column containing the objects to index. None=’spectrum’
- create_from_fingerprint(table_map, fingerprint_column='ecfp4', fingerprint_count_column='ecfp4_count', metric=None)¶
create the index from a table column containing a binary fingerprint or feature vector
- Parameters:
table_map – TableMap that contains the arrow table
fingerprint_column – name of the fingerprint column, defaults to ‘ecfp4’
fingerprint_count_column – name of the fingerprint count column, defaults to ‘ecfp4_count’
- load(file=None)¶
load the index
- Parameters:
file – name of the file to load the fingerprint in
- optimize()¶
optimize the index
- save(file=None)¶
save the index
- Parameters:
file – name of file to save fingerprint in
- search(objects, hitlist_size=50, epsilon=0.3, with_raw_score=False, id_list=None, id_list_query=None, predicate=None)¶
search a series of query objects
- Parameters:
objects – the query objects to be searched or their fingerprints
hitlist_size – the size of the hitlist returned
epsilon – max search accuracy error
with_raw_score – calculate and return the raw_score
id_list – array-like for converting row numbers in search results to hit ids
id_list_query – array-like for converting row numbers in search results to query ids
predicate – filter array that is the result of a predicate query
- class masskit.utils.index.DotProductIndex(index_name=None, dimension=None, fingerprint_factory=None)¶
Bases:
Index
brute force search of feature vectors using cosine score
- create(table_map)¶
create index from a TableMap
- Parameters:
table_map – the library to index
- load(file=None)¶
load the index
- Parameters:
file – name of the file to load the fingerprint in
- optimize()¶
optimize the index
- save(file=None)¶
save the index
- Parameters:
file – name of file to save fingerprint in
- search(objects, hitlist_size=30, epsilon=0.1, id_list=None, id_list_query=None, predicate=None)¶
search feature vectors
- Parameters:
objects – feature vector or list of feature vectors to be queried
hitlist_size – the size of the hitlist returned, defaults to 30
epsilon – max search accuracy error, defaults to 0.1
- Returns:
the hitlist
- class masskit.utils.index.Index(index_name=None, dimension=None, fingerprint_factory=None)¶
Bases:
ABC
Index used for searching a library
- Parameters:
dimension – the number of features in the fingerprint
fingerprint_factory – class used to encapsulate the fingerprint
index_name – name of index
- abstract create(table_map, column_name=None)¶
create index from a TableMap
- Parameters:
table_map – the library to index
column_name – name of the column containing the objects to index. None=’spectrum’
- abstract load(file=None)¶
load the index
- Parameters:
file – name of the file to load the fingerprint in
- abstract optimize()¶
optimize the index
- abstract save(file=None)¶
save the index
- Parameters:
file – name of file to save fingerprint in
- abstract search(objects, hitlist_size=50, epsilon=0.2, with_raw_score=False, id_list=None, id_list_query=None, predicate=None)¶
search a series of query objects
- Parameters:
objects – the query objects to be searched or their fingerprints
hitlist_size – the size of the hitlist returned
epsilon – max search accuracy error
with_raw_score – calculate and return the raw_score
id_list – array-like for converting row numbers in search results to hit ids
id_list_query – array-like for converting row numbers in search results to query ids
predicate – filter array that is the result of a predicate query
- spectrum2array(spectrum, spectral_array_in, channel_in=0, dtype=<class 'numpy.float32'>, cutoff=0.0)¶
- class masskit.utils.index.TanimotoIndex(index_name=None, dimension=4096, fingerprint_factory=<class 'masskit.utils.fingerprints.SpectrumTanimotoFingerPrint'>)¶
Bases:
Index
brute force search index of binary fingerprints
- Parameters:
dimension – the number of features in the fingerprint
fingerprint_factory – class used to encapsulate the fingerprint
- create(table_map, column_name=None)¶
create index from a TableMap
- Parameters:
table_map – the library to index
column_name – name of the column containing the objects to index. None=’spectrum’
- create_from_fingerprint(table_map, fingerprint_column='ecfp4', fingerprint_count_column='ecfp4_count')¶
create the index from columns in a table_map encoded as binary fingerprints
- Parameters:
table_map – TableMap that contains the arrow table
fingerprint_column – name of the fingerprint column, defaults to ‘ecfp4’
fingerprint_count_column – name of the fingerprint count column, defaults to ‘ecfp4_count’
- load(file=None)¶
load the index
- Parameters:
file – name of the file to load the fingerprint in
- optimize()¶
optimize the index
- save(file=None)¶
save the index
- Parameters:
file – name of file to save fingerprint in
- search(objects, hitlist_size=50, epsilon=0.1, with_raw_score=False, id_list=None, id_list_query=None, predicate=None)¶
search a series of query objects
- Parameters:
objects – the query objects to be searched or their fingerprints
hitlist_size – the size of the hitlist returned
epsilon – max search accuracy error
with_raw_score – calculate and return the raw_score
id_list – array-like for converting row numbers in search results to hit ids
id_list_query – array-like for converting row numbers in search results to query ids
predicate – filter array that is the result of a predicate query
- masskit.utils.index.dot_product(objects, column, hit_ids, cosine_scores, hitlist_size)¶
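The brute-force cosine scoring behind DotProductIndex and dot_product can be sketched as normalized dot products over a library of feature vectors. This is an illustrative sketch; the function names and the (row, score) return shape are assumptions.

```python
from math import sqrt

def cosine_score(a, b, scale=1.0):
    """Cosine similarity of two feature vectors, optionally scaled."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return scale * dot / norm if norm else 0.0

def dot_product_search(query, library, hitlist_size=30):
    """Score a query vector against every library vector and return the
    top hits as (row, score) pairs, best first."""
    scores = [(row, cosine_score(query, vec)) for row, vec in enumerate(library)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:hitlist_size]
```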
masskit.utils.spectrum_writers module¶
- masskit.utils.spectrum_writers.spectra_to_mgf(fp, spectra, charge_list=None)¶
write out an array-like of spectra in mgf format
- Parameters:
fp – stream or filename to write out to. will append
spectra – array-like of spectra to write
charge_list – list of charges for Mascot to search, otherwise use the CHARGE field
- masskit.utils.spectrum_writers.spectra_to_msp(fp, spectra, annotate_peptide=False, ion_types=None)¶
write out an array-like of spectra in msp format
- Parameters:
fp – stream or filename to write out to. will append
spectra – map containing the spectrum
annotate_peptide – annotate the spectra as peptide
ion_types – ion types for annotation
- masskit.utils.spectrum_writers.spectra_to_mzxml(fp, spectra, mzxml_attributes=None, min_intensity=1e-07, compress=True, use_id_as_scan=True)¶
write out an array-like of spectra in mzxml format
- Parameters:
fp – stream or filename to write out to. will not append
min_intensity – the minimum intensity value
spectra – array-like of spectra to write
mzxml_attributes – dict containing mzXML attributes
use_id_as_scan – use spectrum.id instead of spectrum.scan
compress – should the data be compressed?
masskit.utils.tablemap module¶
- class masskit.utils.tablemap.ArrowLibraryMap(table_in, num=0, *args, **kwargs)¶
Bases:
TableMap
wrapper for an arrow library
- create_dict(idx)¶
create dict for row
- Parameters:
idx – row number
- static from_mgf(file, num=None, title_fields=None, min_intensity=0.0, spectrum_type=None)¶
read in an mgf file and create an ArrowLibraryMap
- Parameters:
file – filename or stream
num – number of rows. None means all
title_fields – dict containing column names with corresponding regex to extract field values from the TITLE
min_intensity – the minimum intensity to set the fingerprint bit
spectrum_type – the type of spectrum file
- Returns:
ArrowLibraryMap
- static from_msp(file, num=None, id_field=0, comment_fields=None, min_intensity=0.0, spectrum_type=None)¶
read in an msp file and create an ArrowLibraryMap
- Parameters:
file – filename or stream
num – number of rows. None means all
id_field – start value of the id field
comment_fields – a Dict of regexes used to extract fields from the Comment field. Form of the Dict is { comment_field_name: (regex, type, field_name) }. For example {'Filter': (r'@hcd(\d+\.?\d*)', float, 'nce')}
min_intensity – the minimum intensity to set the fingerprint bit
spectrum_type – the type of spectrum file
- Returns:
ArrowLibraryMap
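The comment_fields mechanism can be illustrated with a plain re extraction. The regex below is a cleaned-up rendering of the docstring’s example (the original lost its backslashes in conversion); the helper function itself is hypothetical, not masskit’s code.

```python
import re

# { comment_field_name: (regex, type, output_field_name) }
comment_fields = {"Filter": (r"@hcd(\d+\.?\d*)", float, "nce")}

def extract_comment_fields(comment: str, fields: dict) -> dict:
    """Apply each regex to the comment and convert the captured group
    to the given type under the given output field name."""
    row = {}
    for regex, converter, field_name in fields.values():
        match = re.search(regex, comment)
        if match:
            row[field_name] = converter(match.group(1))
    return row
```

Applied to a comment like `"Filter=FTMS @hcd35.00"`, this yields `{"nce": 35.0}`.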
- static from_parquet(file, columns=None, num=None, combine_chunks=False, filters=None)¶
create an ArrowLibraryMap from a parquet file
- Parameters:
file – filename or stream
columns – list of columns to read. None=all, []=minimum set
num – number of rows
combine_chunks – combine the chunks of the arrow table into one chunk to allow zero-copy access
filters – parquet predicate as a list of tuples
- static from_sdf(file, num=None, skip_expensive=True, max_size=0, id_field=None, min_intensity=0.0, set_probabilities=(0.01, 0.97, 0.01, 0.01), spectrum_type=None)¶
read in an sdf file and create an ArrowLibraryMap
- Parameters:
file – filename or stream
num – number of rows. None means all
skip_expensive – don’t compute fields that are computationally expensive
max_size – the maximum bounding box size (used to filter out large molecules; 0 = no bound)
id_field – field to use for the mol id, such as NISTNO, ID, or _NAME (the sdf title field). If an integer, use the integer as the starting value for an assigned id
min_intensity – the minimum intensity to set the fingerprint bit
set_probabilities – how to divide into dev, train, valid, test
spectrum_type – the type of spectrum file
- Returns:
ArrowLibraryMap
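The set_probabilities tuple divides records among the dev, train, valid, and test sets. A weighted-sampling sketch of that split (an assumption about the mechanism, not masskit's implementation):

```python
import random

# names and weights mirroring set_probabilities=(0.01, 0.97, 0.01, 0.01)
set_names = ('dev', 'train', 'valid', 'test')
set_probabilities = (0.01, 0.97, 0.01, 0.01)

# seeded generator so the assignment is reproducible
rng = random.Random(42)

# assign each of 1000 hypothetical records to a set by weighted sampling
labels = rng.choices(set_names, weights=set_probabilities, k=1000)
```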
- get_ids()¶
get the ids of all records, in row order
- Returns:
array of ids
- getitem_by_id(key)¶
get an item from the library by id
- Parameters:
key – id
- Returns:
the row dict
- getitem_by_row(key)¶
get an item from the library by row number
- Parameters:
key – row number
- Returns:
the row dict
- getrow_by_id(key)¶
given an id, return the corresponding row number in the table
- Parameters:
key – the id
- Returns:
the row number of the id in the table
- to_arrow()¶
- to_csv(file, columns=None)¶
Write to csv file, skipping any spectrum column and writing mol columns as canonical SMILES
- Parameters:
file – filename or file pointer. newline should be set to ‘’
columns – list of columns to write out to the csv file. If None, write all columns
- to_mgf(file, spectrum_column=None)¶
save spectra to mgf file
- Parameters:
file – filename or file pointer
spectrum_column – column name for spectrum
- to_mzxml(file, use_id_as_scan=True, spectrum_column=None)¶
save spectra to mzxml format file
- Parameters:
file – filename or stream
use_id_as_scan – use spectrum.id instead of spectrum.scan
spectrum_column – column name for spectrum
- to_pandas()¶
- to_parquet(file)¶
save spectra to parquet file
- Parameters:
file – filename or stream
- class masskit.utils.tablemap.ListLibraryMap(list_in, spectrum_column=None, *args, **kwargs)¶
Bases:
TableMap
wrapper for a spectral library using python lists
- create_dict(idx)¶
create dict for row
- Parameters:
idx – row number
- get_ids()¶
get the ids of all records, in row order
- Returns:
array of ids
- getitem_by_id(key)¶
get an item from the library by id
- Parameters:
key – id
- Returns:
the row dict
- getitem_by_row(key)¶
get an item from the library by row number
- Parameters:
key – row number
- Returns:
the row dict
- getrow_by_id(key)¶
given an id, return the corresponding row number in the table
- Parameters:
key – the id
- Returns:
the row number of the id in the table
- class masskit.utils.tablemap.PandasLibraryMap(df, *args, **kwargs)¶
Bases:
TableMap
wrapper for a pandas spectral library
- create_dict(idx)¶
create dict for row
- Parameters:
idx – row number
- get_ids()¶
get the ids of all records, in row order
- Returns:
array of ids
- getitem_by_id(key)¶
get an item from the library by id
- Parameters:
key – id
- Returns:
the row dict
- getitem_by_row(key)¶
get an item from the library by row number
- Parameters:
key – row number
- Returns:
the row dict
- getrow_by_id(key)¶
given an id, return the corresponding row number in the table
- Parameters:
key – the id
- Returns:
the row number of the id in the table
- class masskit.utils.tablemap.TableMap(*args, **kwargs)¶
Bases:
ABC
collections.abc.Sequence wrapper for a library. Allows use of different stores, e.g. arrow or pandas
- abstract create_dict(idx)¶
create dict for row
- Parameters:
idx – row number
- abstract get_ids()¶
get the ids of all records, in row order
- Returns:
array of ids
- abstract getitem_by_id(key)¶
get an item from the library by id
- Parameters:
key – id
- Returns:
the row dict
- abstract getitem_by_row(key)¶
get an item from the library by row number
- Parameters:
key – row number
- Returns:
the row dict
- abstract getrow_by_id(key)¶
given an id, return the corresponding row number in the table
- Parameters:
key – the id
- Returns:
the row number of the id in the table
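The lookup contract shared by getrow_by_id, getitem_by_id, and getitem_by_row can be sketched with a hypothetical list-backed store (MiniTableMap is invented for illustration, not part of masskit):

```python
# minimal sketch of the TableMap lookup contract: ids map to row
# numbers and back, so a by-id lookup composes the two accessors
class MiniTableMap:
    def __init__(self, rows):
        self.rows = rows  # list of row dicts, each with an 'id' key
        self._id_to_row = {row['id']: i for i, row in enumerate(rows)}

    def get_ids(self):
        # ids of all records, in row order
        return [row['id'] for row in self.rows]

    def getitem_by_row(self, key):
        return self.rows[key]

    def getrow_by_id(self, key):
        return self._id_to_row[key]

    def getitem_by_id(self, key):
        return self.getitem_by_row(self.getrow_by_id(key))

lib = MiniTableMap([{'id': 101, 'nce': 27.0}, {'id': 205, 'nce': 35.0}])
```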
- to_msp(file, annotate_peptide=False, ion_types=None, spectrum_column=None)¶
write out spectra in msp format
- Parameters:
file – file or filename to write to
annotate_peptide – annotate the spectra as a peptide
ion_types – ion types for annotation
spectrum_column – column name for spectrum
masskit.utils.tables module¶
- masskit.utils.tables.create_dataset(rows=5, cols=[int, float, str, list], names=array of single-letter column names 'a'–'z', 'A'–'Z')¶
- masskit.utils.tables.is_struct_or_extstruct(chunked_array: ChunkedArray) bool ¶
- masskit.utils.tables.optimize_structarray(struct: ChunkedArray) ChunkedArray ¶
- masskit.utils.tables.optimize_table(table: Table) Table ¶
- masskit.utils.tables.random_string(length)¶
- masskit.utils.tables.row_view(table, idx=0)¶
- masskit.utils.tables.row_view_raw(table, idx=0)¶
- masskit.utils.tables.struct_view(struct_name, parent)¶
- masskit.utils.tables.structarray_to_table(struct: StructArray) Table ¶
- masskit.utils.tables.table_add_structarray(table: Table, structarray: StructArray, column_name: str = None) Table ¶
add a struct array to a table
- Parameters:
table – table to be added to
structarray – structarray to add to table
column_name – name of column to add
- Returns:
new table with appended column
- masskit.utils.tables.table_to_structarray(table: Table, structarray_type: ExtensionType = None) StructArray ¶
convert a spectrum table into a struct array. if an ExtensionType is passed in, will create a struct array of that type
- Parameters:
table – spectrum table
structarray_type – the type of the array returned, e.g. SpectrumArrowType()
- Returns:
StructArray
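The round trip between structarray_to_table and table_to_structarray can be sketched with plain Python containers; masskit operates on pyarrow StructArray and Table, so this only mirrors the shape of the transformation:

```python
# an array of structs: one dict per row (illustrative spectrum fields)
rows = [{'mz': 100.0, 'intensity': 1.0},
        {'mz': 200.5, 'intensity': 0.5}]

# structarray_to_table direction: array-of-structs -> columnar table
table = {name: [row[name] for row in rows] for name in rows[0]}

# table_to_structarray direction: columnar table -> array-of-structs
structarray = [dict(zip(table, values)) for values in zip(*table.values())]
```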