masskit.utils package

Subpackages

Submodules

masskit.utils.accumulator module

class masskit.utils.accumulator.Accumulator(*args, **kwargs)

Bases: ABC

accumulator class used to take the mean and standard deviation of a set of data

abstract add(new_item)

accumulate a new item

Parameters:

new_item – new item to be accumulated

abstract finalize()

finalize the accumulation

class masskit.utils.accumulator.AccumulatorProperty(*args, **kwargs)

Bases: Accumulator

used to calculate the mean and standard deviation of a property

add(new_item)

add an item to the average. Keeps running total of average and std deviation using Welford’s algorithm.

Parameters:

new_item – new item to be added

finalize()

finalize the std deviation after all the spectra have been added
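
A minimal usage sketch. How the finished statistics are read back after finalize() (for example mean and standard deviation attributes) is an assumption here, not part of the documented interface:

    from masskit.utils.accumulator import AccumulatorProperty

    # keep a running mean and standard deviation using Welford's algorithm
    accumulator = AccumulatorProperty()
    for value in (0.8, 1.1, 0.95):
        accumulator.add(value)
    accumulator.finalize()  # fixes up the standard deviation once all items are added
    # reading the finished mean/standard deviation back (e.g. as attributes on the
    # accumulator) is an assumption; check the class for the exact attribute names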

masskit.utils.arrow module

masskit.utils.arrow.create_object_id(filename, filters)

create an object id based on filename and filters

Parameters:
  • filename – filename

  • filters – filter string

Returns:

object id

masskit.utils.arrow.save_to_arrow(filename, columns=None, filters=None, tempdir=None)

Load a parquet file and save it as a temp arrow file after applying filters and column lists. Load it as a memmap if the temp arrow file already exists

Parameters:
  • filename – parquet file

  • columns – columns to load

  • filters – parquet file filters

  • tempdir – tempdir to use for memmap, otherwise use python default

Returns:

ArrowLibraryMap
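
A usage sketch (the parquet file name, column names, and filter values are illustrative assumptions):

    from masskit.utils.arrow import save_to_arrow

    # load a parquet file with a column list and filters; the result is an
    # ArrowLibraryMap backed by a memory-mapped temporary arrow file
    library = save_to_arrow(
        "spectra.parquet",
        columns=["id", "spectrum"],
        filters=[("charge", "=", 2)],
    )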

masskit.utils.files module

class masskit.utils.files.BatchFileReader(filename: Union[str, PathLike], format: Union[Dict, omegaconf.DictConfig] = None, row_batch_size: int = 5000)

Bases: object

iter_tables() Table

batch read generator that yields a Table for each batch

loaders = {'csv': <class 'masskit.utils.files.CSVLoader'>, 'mgf': <class 'masskit.utils.files.MGFLoader'>, 'msp': <class 'masskit.utils.files.MSPLoader'>, 'sdf': <class 'masskit.utils.files.SDFLoader'>}
class masskit.utils.files.BatchFileWriter(filename: Union[str, PathLike], format: str = None, annotate: bool = False, row_batch_size: int = 5000, num_workers: int = 7, column_name: str = None)

Bases: object

close()

Close the writer. Essential to avoid race conditions with threaded writers.

write_table(table: Table) None

write a table out

Parameters:

table – table to write
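
A streaming sketch that converts a library in batches. The file names are illustrative, and both omitting the format for the msp reader and passing the string 'parquet' to the writer are assumptions:

    from masskit.utils.files import BatchFileReader, BatchFileWriter

    # read a library in 5000-row batches and write each batch back out
    reader = BatchFileReader("library.msp")
    writer = BatchFileWriter("library.parquet", format="parquet")
    for table in reader.iter_tables():
        writer.write_table(table)
    writer.close()  # essential to avoid race conditions with threaded writers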

class masskit.utils.files.BatchLoader(file_type, row_batch_size=5000, format=None, num=None)

Bases: object

ecfp4 = <masskit.utils.fingerprints.ECFPFingerprint object>
finalize()

finalize the chunk

load(fp)

read in a chunk from a stream

Parameters:

fp – the stream

loop_end()

end of loop to read in one record

classmethod mol2row(mol, max_size: int = 0, skip_computed_props=True, skip_expensive: bool = True) Dict

Convert an rdkit Mol into a row

Parameters:
  • mol – the molecule

  • max_size – the maximum bounding box size (used to filter out large molecules; 0 = no bound)

  • skip_computed_props – skip computing properties

  • skip_expensive – skip computing computationally expensive properties

Returns:

row as dict, mol

setup()

set up loading

tms = '[#14]([CH3])([CH3])[CH3]'
class masskit.utils.files.CSVLoader(row_batch_size=5000, format=None, num=None)

Bases: BatchLoader

load(fp)

read in a chunk from a stream

Parameters:

fp – the stream

class masskit.utils.files.MGFLoader(row_batch_size=5000, format=None, num=None, set_probabilities=(0.0, 0.93, 0.02, 0.05), min_intensity=0.0)

Bases: BatchLoader

load(fp)

read in a chunk from a stream

Parameters:

fp – the stream

class masskit.utils.files.MSPLoader(row_batch_size=5000, format=None, min_intensity=0, num=None)

Bases: BatchLoader

load(fp)

read in a chunk from a stream

Parameters:

fp – the stream

class masskit.utils.files.MzTab_Reader(fp, dedup=False, decoy_func=None)

Bases: object

Class for reading an mzTab file

Parameters:
  • fp – stream or filename

  • dedup – Only take the first row with a given hit_id

get_hitlist()
parse_metadata()
parse_psm()
parse_psm_row(row)
read_sections(fp)
class masskit.utils.files.SDFLoader(row_batch_size=5000, format=None, num=None, set_probabilities=(0.0, 0.93, 0.02, 0.05), min_intensity=0.0)

Bases: BatchLoader

load(fp)

read in a chunk from a stream

Parameters:

fp – the stream

masskit.utils.files.add_row_to_records(records, row)
masskit.utils.files.create_table_with_mod_dict(records, schema)
masskit.utils.files.empty_records(schema)
masskit.utils.files.get_progress(fp)

Given a file pointer, return the best possible progress meter

Parameters:

fp – name of the file to read or a file object

masskit.utils.files.load_default_config_schema(file_type, format=None)

for a given file type and format, return configuration

Parameters:
  • file_type – the type of file, e.g. mgf, msp, arrow, parquet, sdf

  • format – the specific format of the file_type, e.g. msp_mol, msp_peptide, sdf_mol, sdf_nist_mol, sdf_pubchem_mol. If none, use default

Returns:

config

masskit.utils.files.load_mzTab(fp, dedup=True, decoy_func=None)

Read a file in mzTab format and return an array.

Parameters:
  • fp – stream or filename

  • dedup – Only take the first row with a given hit_id

masskit.utils.files.parse_energy(row, value)

parse energy field that has nce or collision_energy in it

masskit.utils.files.parse_glycopeptide_annot(annots, peak_index)

take a glycopeptide annotation string like “Y0-H2O+i/-18.2ppm,Y0-NH3/7.6ppm 22 23” and parse it. This function is not complete.

Parameters:
  • annots – the annotation string

  • peak_index – the index of the peak with the annotation in the mz array

Returns:

the parsed values in an array

masskit.utils.files.read_parquet(fp, columns=None, num=None, filters=None)

reads a PyArrow table from a parquet file.

Parameters:
  • fp – stream or filename

  • columns – list of columns to read, None=all

  • filters – parquet predicate as a list of tuples

Returns:

PyArrow table

masskit.utils.files.records2table(records, schema_group)

convert a flat table with spectral records into a table with a nested spectrum

Parameters:
  • records – records to be added to a table

  • schema_group – the schema group

masskit.utils.files.seek_size(fp)
masskit.utils.files.spectra_to_array(spectra, min_intensity=0, write_tolerance=False, schema_group=None)

convert an array-like of spectra to an arrow_table

Parameters:
  • spectra – iterable containing the spectra

  • min_intensity – the minimum intensity to set the fingerprint bit

  • write_tolerance – put the tolerance into the arrow Table

  • schema_group – the schema group of spectrum file

Returns:

arrow table of spectra

masskit.utils.files.spectrum2mgf(spectrum)
masskit.utils.files.spectrum2msp(spectrum, annotate=False)
masskit.utils.files.write_parquet(fp, table)

save a PyArrow table to a parquet file.

Parameters:
  • table – the dataframe

  • fp – stream or filename
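
A round-trip sketch (the column names and the filter predicate are illustrative):

    from masskit.utils.files import read_parquet, write_parquet

    # read selected columns with a parquet predicate, then write the table back out
    table = read_parquet(
        "spectra.parquet",
        columns=["id", "precursor_mz"],
        filters=[("precursor_mz", ">", 400.0)],
    )
    write_parquet("filtered.parquet", table)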

masskit.utils.fingerprints module

class masskit.utils.fingerprints.ECFPFingerprint(dimension=4096, radius=2, *args, **kwargs)

Bases: Fingerprint

rdkit version of ECFP fingerprint for a small molecule structure

from_bitvec(array)

create fingerprint from rdkit ExplicitBitVect

Parameters:

array (DataStructs.ExplicitBitVect) – input ExplicitBitVect

from_numpy(array)

create fingerprint from numpy array

Parameters:

array (np.ndarray) – input numpy array

object2fingerprint(obj, dtype=<class 'numpy.float32'>)

convert an object into a fingerprint

Parameters:
  • obj (object) – the object to convert to a fingerprint

  • dtype (np.dtype) – data type of output array

to_bitvec()

convert fingerprint to rdkit ExplicitBitVect

Returns:

the fingerprint

Return type:

DataStructs.ExplicitBitVect

to_numpy()

convert fingerprint to numpy array

Returns:

the fingerprint

Return type:

np.ndarray
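
A sketch of fingerprinting a molecule. Whether object2fingerprint() returns the array or populates the instance for later to_numpy()/to_bitvec() calls is an assumption:

    from rdkit import Chem
    from masskit.utils.fingerprints import ECFPFingerprint

    fingerprint = ECFPFingerprint(dimension=4096, radius=2)
    mol = Chem.MolFromSmiles("c1ccccc1O")   # phenol
    fingerprint.object2fingerprint(mol)     # compute the ECFP bits from the molecule
    bits = fingerprint.to_numpy()           # dense numpy representation
    print(bits.shape, fingerprint.get_num_on_bits())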

class masskit.utils.fingerprints.Fingerprint(dimension=2000, stride=256, column_name=None, *args, **kwargs)

Bases: ABC

class for encapsulating and calculating a fingerprint

Parameters:

dimension (int) – size of the fingerprint (includes other features like counts)

bitvec2numpy()

version of to_numpy used when fingerprint is a bitvec

abstract from_bitvec(array)

create fingerprint from rdkit ExplicitBitVect

Parameters:

array (DataStructs.ExplicitBitVect) – input ExplicitBitVect

abstract from_numpy(array)

create fingerprint from numpy array

Parameters:

array (np.ndarray) – input numpy array

get_num_on_bits()

retrieve the number of nonzero features

numpy2bitvec(array)

version of from_numpy used when fingerprint is a bitvec

Parameters:

array – the fingerprint as a numpy array

abstract object2fingerprint(obj, dtype=<class 'numpy.float32'>)

convert an object into a fingerprint

Parameters:
  • obj (object) – the object to convert to a fingerprint

  • dtype (np.dtype) – data type of output array

size() int

size of the fingerprint in bytes, where the size is a multiple of the stride

Returns:

size of fingerprint

abstract to_bitvec()

convert fingerprint to rdkit ExplicitBitVect

Returns:

the fingerprint

Return type:

DataStructs.ExplicitBitVect

abstract to_numpy()

convert fingerprint to numpy array

Returns:

the fingerprint

Return type:

np.ndarray

class masskit.utils.fingerprints.MolFingerprint(dimension=4096, *args, **kwargs)

Bases: ABC

class for encapsulating and calculating a molecule fingerprint

Parameters:

dimension (int) – size of the fingerprint (includes other features like counts)

abstract from_bitvec(array)

create fingerprint from rdkit ExplicitBitVect

Parameters:

array (DataStructs.ExplicitBitVect) – input ExplicitBitVect

abstract from_numpy(array)

create fingerprint from numpy array

Parameters:

array (np.ndarray) – input numpy array

abstract object2fingerprint(obj, dtype=<class 'numpy.float32'>)

convert an object into a fingerprint

Parameters:
  • obj (object) – the object to convert to a fingerprint

  • dtype (np.dtype) – data type of output array

abstract to_bitvec()

convert fingerprint to rdkit ExplicitBitVect

Returns:

the fingerprint

Return type:

DataStructs.ExplicitBitVect

abstract to_numpy()

convert fingerprint to numpy array

Returns:

the fingerprint

Return type:

np.ndarray

class masskit.utils.fingerprints.SpectrumFingerprint(dimension=2000, bin_size=1.0, first_bin_left=0.5, *args, **kwargs)

Bases: Fingerprint

base class for spectrum fingerprint

abstract from_bitvec(array)

create fingerprint from rdkit ExplicitBitVect

Parameters:

array (DataStructs.ExplicitBitVect) – input ExplicitBitVect

abstract from_numpy(array)

create fingerprint from numpy array

Parameters:

array (np.ndarray) – input numpy array

abstract object2fingerprint(obj, dtype=<class 'numpy.float32'>)

convert an object into a fingerprint

Parameters:
  • obj (object) – the object to convert to a fingerprint

  • dtype (np.dtype) – data type of output array

abstract to_bitvec()

convert fingerprint to rdkit ExplicitBitVect

Returns:

the fingerprint

Return type:

DataStructs.ExplicitBitVect

abstract to_numpy()

convert fingerprint to numpy array

Returns:

the fingerprint

Return type:

np.ndarray

class masskit.utils.fingerprints.SpectrumFloatFingerprint(dimension=2000, count_max=2000, mz_ratio: bool = False, column_name=None, *args, **kwargs)

Bases: SpectrumFingerprint

create a spectral fingerprint that also encodes the peak count by powers of two, skipping 1. In other words, the count bits are set if the number of peaks is >=2, >=4, …

from_bitvec(array)

create fingerprint from rdkit ExplicitBitVect

Parameters:

array (DataStructs.ExplicitBitVect) – input ExplicitBitVect

from_numpy(array)

create fingerprint from numpy array

Parameters:

array (np.ndarray) – input numpy array

object2fingerprint(obj, dtype=<class 'numpy.float32'>)

fill out a fixed-size array with the ions. Note that this function assumes the spectra are sorted by mz

to_bitvec()

convert fingerprint to rdkit ExplicitBitVect

Returns:

the fingerprint

Return type:

DataStructs.ExplicitBitVect

to_numpy()

convert fingerprint to numpy array

Returns:

the fingerprint

Return type:

np.ndarray

class masskit.utils.fingerprints.SpectrumTanimotoFingerPrint(dimension=2000, column_name=None, *args, **kwargs)

Bases: SpectrumFingerprint

create a spectral tanimoto fingerprint

Parameters:
  • dimension – overall dimension of fingerprint

  • bin_size – size of each bin

  • first_bin_left – the position of the first bin left side

from_bitvec(array)

create fingerprint from rdkit ExplicitBitVect

Parameters:

array (DataStructs.ExplicitBitVect) – input ExplicitBitVect

from_numpy(array)

create fingerprint from numpy array

Parameters:

array (np.ndarray) – input numpy array

object2fingerprint(obj, dtype=<class 'numpy.float32'>)

convert an object into a fingerprint

Parameters:
  • obj (object) – the object to convert to a fingerprint

  • dtype (np.dtype) – data type of output array

to_bitvec()

convert fingerprint to rdkit ExplicitBitVect

Returns:

the fingerprint

Return type:

DataStructs.ExplicitBitVect

to_numpy()

convert fingerprint to numpy array

Returns:

the fingerprint

Return type:

np.ndarray

masskit.utils.fingerprints.calc_bit(a, b, max_mz_in, tolerance_in)
masskit.utils.fingerprints.calc_dynamic_fingerprint(spectrum, max_mz=2000, tolerance=0.1, min_intensity=50, mz_window=14, min_mz=2.5, max_rank=20, hybrid_mask=None)

create a fingerprint from the top max_rank peaks and the OR of the fingerprints of the top peaks, where each fingerprint is shifted by the mz of one of the top peaks

Parameters:
  • spectrum – input spectrum

  • max_mz – maximum mz used for the fingerprint

  • tolerance – mass tolerance to use

  • min_intensity – minimum intensity to allow for creating the fingerprint

  • mz_window – noise filter mz window

  • min_mz – minimum mz to use

  • max_rank – the maximum rank used for the fingerprint

Returns:

the fingerprint

masskit.utils.fingerprints.calc_interval_fingerprint(spectrum, max_mz=2000, tolerance=0.1, min_intensity=50, mz_window=14, min_mz=1, peaks=None)

create a pairwise interval fingerprint from the filtered peaks of a spectrum

Parameters:
  • spectrum – input spectrum

  • max_mz – maximum mz used for the fingerprint

  • tolerance – mass tolerance to use

  • min_intensity – minimum intensity to allow for creating the fingerprint

  • mz_window – noise filter mz window

  • min_mz – minimum mz to use

  • peaks – a list of additional mz values to use for creating the intervals

Returns:

the fingerprint

masskit.utils.fingerprints.fingerprint_search_numpy(query_fingerprint, fingerprint_array, tanimoto_cutoff, query_fingerprint_count=None, fingerprint_array_count=None)

return a list of fingerprint hits to a query fingerprint

Parameters:
  • query_fingerprint – query fingerprint

  • fingerprint_array – array-like list of fingerprints

  • tanimoto_cutoff – Tanimoto cutoff

  • query_fingerprint_count – number of bits set in query

  • fingerprint_array_count – number of bits set in array fingerprints

Returns:

tanimoto scores (if below threshold, set to 0.0)

Notes: numba doesn’t work with arrays of objects. pyarrow creates a numpy array of objects, where each object is a numpy array; even arrow fixed size lists do this, and fixedbinary uses byte objects.
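
A search sketch. The uint8 dtype and the randomly generated fingerprints are assumptions used only for illustration; in practice the fingerprints would come from a fingerprint column:

    import numpy as np
    from masskit.utils.fingerprints import fingerprint_search_numpy

    rng = np.random.default_rng(0)
    query = rng.integers(0, 2, 4096, dtype=np.uint8)
    library = rng.integers(0, 2, (1000, 4096), dtype=np.uint8)

    # scores below the Tanimoto cutoff are returned as 0.0
    scores = fingerprint_search_numpy(query, library, tanimoto_cutoff=0.3)
    hits = np.nonzero(scores)[0]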

masskit.utils.general module

class masskit.utils.general.MassKitSearchPathPlugin(*args: Any, **kwargs: Any)

Bases: SearchPathPlugin

add the cwd to the search path for configuration yaml files

manipulate_search_path(search_path: hydra.core.config_search_path.ConfigSearchPath) None
masskit.utils.general.class_for_name(module_name_list, class_name)

dynamically try to find a class in a list of modules

Parameters:
  • module_name_list – list of modules to search

  • class_name – class to look for

Returns:

class
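
For example (the module and class names are only illustrative):

    from masskit.utils.general import class_for_name

    # search the listed modules for a class named ECFPFingerprint
    fingerprint_class = class_for_name(["masskit.utils.fingerprints"], "ECFPFingerprint")
    fingerprint = fingerprint_class()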

masskit.utils.general.discounted_cumulative_gain(relevance_array)
masskit.utils.general.expand_path(path_pattern) Iterable[Path]

expand a path with globs (wildcards) into a generator of Paths

Parameters:

path_pattern – the path to be globbed

Returns:

iterable of Path

masskit.utils.general.expand_path_list(path_list)

given a file path or list of file paths that may contain wildcards, expand ~ to the user directory and glob the wildcards

Parameters:

path_list – list or str of paths, may include ~ and *

Returns:

list of expanded paths
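
A sketch of expanding wildcard paths (the glob patterns are illustrative):

    from masskit.utils.general import expand_path, expand_path_list

    # expand_path yields Path objects matching a single glob pattern
    for path in expand_path("data/*.parquet"):
        print(path)

    # expand_path_list accepts a str or list of paths containing ~ and wildcards
    paths = expand_path_list(["~/data/*.msp", "/tmp/library.mgf"])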

masskit.utils.general.get_file(filename, cache_directory=None, search_path=None, tgz_extension=None)

for a given file, return the path (downloading file if necessary)

Parameters:
  • filename – name of the file

  • cache_directory – where files are cached (config.paths.cache_directory)

  • search_path – list of where to search for files (config.paths.search_path)

  • tgz_extension – if a tgz file to be downloaded, the extension of the unarchived file

Returns:

path

Notes: we don’t support zip files as the python api for zip files doesn’t support symlinks

masskit.utils.general.is_list_like(obj)

is the object list-like?

Parameters:

obj – the object to be tested

Returns:

true if list like

masskit.utils.general.open_if_compressed(filename, mode, newline=None)

open the file denoted by filename and uncompress it if it is compressed

Parameters:
  • filename – filename

  • mode – file opening mode

  • newline – specify newline character

Returns:

stream

masskit.utils.general.open_if_filename(fp, mode, newline=None)

if the fp is a string, open the file, and uncompress if needed

Parameters:
  • fp – possible filename

  • mode – file opening mode

  • newline – specify newline character

Returns:

stream

masskit.utils.general.parse_filename(filename: str)

parse filename into root, extension, and compression extension

Parameters:

filename – filename

Returns:

root, extension, compression
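
A sketch of working with possibly compressed files (the file name is illustrative):

    from masskit.utils.general import open_if_compressed, parse_filename

    root, extension, compression = parse_filename("library.msp.gz")

    # open the file, transparently decompressing it if needed
    fp = open_if_compressed("library.msp.gz", "rt")
    first_line = fp.readline()
    fp.close()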

masskit.utils.general.read_arrow(filename)

read a pyarrow Table from an arrow or parquet file. arrow file is memory mapped.

Parameters:

filename – file to input. use the suffix “arrow” if an arrow file, “parquet” if a parquet file

Returns:

pyarrow Table

masskit.utils.general.search_for_file(filename, directories)

look for a file in a list of directories

Parameters:
  • filename – the filename to look for

  • directories – the directories to search

Returns:

the Path to the file, otherwise None if not found

masskit.utils.general.write_arrow(table, filename, row_group_size=5000)

write a pyarrow Table to an arrow or parquet file.

Parameters:
  • filename – file to output. use the suffix “arrow” if an arrow file, “parquet” if a parquet file

  • table – pyarrow Table

  • row_group_size – if a parquet file, use this row group size
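
A small round trip (the table contents are illustrative):

    import pyarrow as pa
    from masskit.utils.general import read_arrow, write_arrow

    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    write_arrow(table, "example.arrow")      # the "arrow" suffix selects the arrow format
    memmapped = read_arrow("example.arrow")  # arrow files are read back memory mapped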

masskit.utils.hitlist module

class masskit.utils.hitlist.CompareRecallDCG(**kwargs)

Bases: HitlistCompare

compare two hitlists by ranking the scores and computing the recall and discounted cumulative gain at several subsets of the hit list with different lengths. The recall values consist of the number of hits in each subset in the compare hitlist that have a rank equal to or better than the ground truth hitlist.

compare(compare_hitlist, truth_hitlist=None, recall_values_in=None, rank_method=None)

compare a hitlist to a ground truth hitlist. Rank according to scores, then compute recall. The ranking applied is min ranking, where a group of tied hits is given the rank it would have if the group had a single member.

Parameters:
  • compare_hitlist – Hitlist to compare to

  • truth_hitlist – Hitlist that serves as ground truth

  • recall_values_in – lengths of the hitlist subsets used to calculate recall. [1, 3, 10] by default

  • rank_method – the ranking method to use; “min” by default

class masskit.utils.hitlist.CosineScore(hit_table_map, query_table_map=None, score_name=None, scale=1.0)

Bases: Score

score(hitlist)

do the scoring on the hitlist

Parameters:

hitlist (Hitlist) – the hitlist to score

class masskit.utils.hitlist.Hitlist(hitlist=None)

Bases: ABC

base class for a list of hits from a search

Parameters:

hitlist (pandas.DataFrame) – the hitlist

get_query_ids()

return list of unique query ids

Returns:

query ids

Return type:

np.int64

property hitlist

get the hitlist as a pandas dataframe

Returns:

the hitlist

Return type:

pandas.DataFrame

load(file)
save(file)
sort(score=None, ascending=False)

sort the hitlist per query

Parameters:
  • score – name of the score. default is cosine_score

  • ascending – should the score be ascending

to_pandas()

get hitlist as a pandas dataframe
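
A minimal sketch of wrapping search results and sorting them per query. The (query_id, hit_id) multi-index and the cosine_score column are assumptions about the expected DataFrame layout:

    import pandas as pd
    from masskit.utils.hitlist import Hitlist

    df = pd.DataFrame(
        {"cosine_score": [0.42, 0.91, 0.77]},
        index=pd.MultiIndex.from_tuples(
            [(1, 10), (1, 11), (1, 12)], names=["query_id", "hit_id"]
        ),
    )
    hits = Hitlist(df)
    hits.sort(score="cosine_score", ascending=False)
    print(hits.to_pandas())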

class masskit.utils.hitlist.HitlistCompare(comparison_score=None, truth_score=None, comparison_score_ascending=False, truth_score_ascending=False, comparison_score_rank=None, truth_score_rank=None)

Bases: ABC

base class for comparing two hitlists.

Parameters:
  • comparison_score – the column name of the score to compare to ground truth, ‘cosine_score’ default

  • truth_score – the column name of the ground truth score, default same as comparison score

  • comparison_score_ascending – is the comparison score ascending?

  • truth_score_ascending – is the ground truth score ascending?

  • comparison_score_rank – what is the column name of the rank of the comparison score, default appends _rank to comparison_score

  • truth_score_rank – what is the column name of the rank of the truth score, default appends _rank to truth_score

abstract compare(compare_hitlist, truth_hitlist=None, recall_values=None, rank_method=None)

compare hitlist to a ground truth hitlist.

Parameters:
  • compare_hitlist – Hitlist to compare to

  • truth_hitlist – Hitlist that serves as ground truth

  • recall_values – lengths of the hitlist subsets used to calculate recall. [1, 3, 10] by default

  • rank_method – the ranking method to use; “min” by default

class masskit.utils.hitlist.IdentityRecall(**kwargs)

Bases: HitlistCompare

examine a hitlist with an identity column and compute the recall

compare(compare_hitlist, recall_values_in=None, rank_method=None, identity_column=None)

compute recall of compare hitlist

Parameters:
  • compare_hitlist – Hitlist to compute the recall for

  • recall_values_in – lengths of the hitlist subsets used to calculate recall. [1, 3, 10] by default

  • rank_method – the ranking method to use; “min” by default

  • identity_column – column name of the identity column

class masskit.utils.hitlist.PeptideIdentityScore(score_name=None)

Bases: Score

score(hitlist)

do the scoring on the hitlist

Parameters:

hitlist (Hitlist) – the hitlist to score

class masskit.utils.hitlist.Score(hit_table_map=None, query_table_map=None, score_name=None)

Bases: ABC

base class for scoring a hitlist

Parameters:
  • hit_table_map (TableMap) – a TableMap for the hits

  • query_table_map (TableMap) – a TableMap for the queries. uses hit_table_map if set to None

  • score_name (str) – the name of the score

abstract score(hitlist)

do the scoring on the hitlist

Parameters:

hitlist (Hitlist) – the hitlist to score

class masskit.utils.hitlist.TanimotoScore(hit_table_map, query_table_map=None, score_name=None, scale=1.0, hit_fingerprint_column=None, query_fingerprint_column=None)

Bases: Score

score(hitlist)

do the scoring on the hitlist

Parameters:

hitlist (Hitlist) – the hitlist to score

masskit.utils.index module

class masskit.utils.index.BruteForceIndex(index_name=None, dimension=None, fingerprint_factory=None)

Bases: Index

search a library by brute force spectrum matching

Parameters:
  • dimension – the number of features in the fingerprint (ignored)

  • fingerprint_factory – class used to encapsulate the fingerprint

create(table_map, column_name=None)

create index from a TableMap

Parameters:
  • table_map – the library to index

  • column_name – name of the column containing the objects to index. None=’spectrum’

load(file=None)

load the index

Parameters:

file – name of the file to load the fingerprint in

optimize()

optimize the index

save(file=None)

save the index

Parameters:

file – name of file to save fingerprint in

search(objects, hitlist_size=50, epsilon=0.3, with_raw_score=False, id_list=None, id_list_query=None, predicate=None)

search a series of query objects

Parameters:
  • objects – the query objects to be searched or their fingerprints

  • hitlist_size – the size of the hitlist returned

  • epsilon – max search accuracy error

  • with_raw_score – calculate and return the raw_score

  • id_list – array-like for converting row numbers in search results to hit ids

  • id_list_query – array-like for converting row numbers in search results to query ids

  • predicate – filter array that is the result of a predicate query

class masskit.utils.index.DescentIndex(index_name=None, dimension=2000, fingerprint_factory=<class 'masskit.utils.fingerprints.SpectrumFloatFingerprint'>)

Bases: Index

pynndescent index for a fingerprint

Parameters:
  • dimension – the number of features in the fingerprint

  • fingerprint_factory – class used to encapsulate the fingerprint

create(table_map, metric=None, column_name=None)

create index from a TableMap

Parameters:
  • table_map – the library to index

  • column_name – name of the column containing the objects to index. None=’spectrum’

create_from_fingerprint(table_map, fingerprint_column='ecfp4', fingerprint_count_column='ecfp4_count', metric=None)

create the index from a table column containing a binary fingerprint or feature vector

Parameters:
  • table_map – TableMap that contains the arrow table

  • fingerprint_column – name of the fingerprint column, defaults to ‘ecfp4’

  • fingerprint_count_column – name of the fingerprint count column, defaults to ‘ecfp4_count’

load(file=None)

load the index

Parameters:

file – name of the file to load the fingerprint in

optimize()

optimize the index

save(file=None)

save the index

Parameters:

file – name of file to save fingerprint in

search(objects, hitlist_size=50, epsilon=0.3, with_raw_score=False, id_list=None, id_list_query=None, predicate=None)

search a series of query objects

Parameters:
  • objects – the query objects to be searched or their fingerprints

  • hitlist_size – the size of the hitlist returned

  • epsilon – max search accuracy error

  • with_raw_score – calculate and return the raw_score

  • id_list – array-like for converting row numbers in search results to hit ids

  • id_list_query – array-like for converting row numbers in search results to query ids

  • predicate – filter array that is the result of a predicate query
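
A sketch of building and searching an index over a library (the file names, the column name, and the form of the query objects are assumptions):

    from masskit.utils.index import DescentIndex
    from masskit.utils.tablemap import ArrowLibraryMap

    library = ArrowLibraryMap.from_parquet("library.parquet")

    index = DescentIndex(index_name="library_descent")
    index.create(library, column_name="spectrum")
    index.save()  # where the index is written when no file is given is an assumption

    # query with a spectrum taken from the library itself
    queries = [library.getitem_by_row(0)["spectrum"]]
    hits = index.search(queries, hitlist_size=10)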

class masskit.utils.index.DotProductIndex(index_name=None, dimension=None, fingerprint_factory=None)

Bases: Index

brute force search of feature vectors using cosine score

create(table_map)

create index from a TableMap

Parameters:
  • table_map – the library to index

  • column_name – name of the column containing the objects to index. None=’spectrum’

load(file=None)

load the index

Parameters:

file – name of the file to load the fingerprint in

optimize()

optimize the index

save(file=None)

save the index

Parameters:

file – name of file to save fingerprint in

search(objects, hitlist_size=30, epsilon=0.1, id_list=None, id_list_query=None, predicate=None)

search feature vectors

Parameters:
  • objects – feature vector or list of feature vectors to be queried

  • hitlist_size – the size of the hitlist returned, defaults to 30

  • epsilon – max search accuracy error, defaults to 0.1

Returns:

_description_

class masskit.utils.index.Index(index_name=None, dimension=None, fingerprint_factory=None)

Bases: ABC

Index used for searching a library

Parameters:
  • dimension – the number of features in the fingerprint

  • fingerprint_factory – class used to encapsulate the

  • index_name – name of index

abstract create(table_map, column_name=None)

create index from a TableMap

Parameters:
  • table_map – the library to index

  • column_name – name of the column containing the objects to index. None=’spectrum’

abstract load(file=None)

load the index

Parameters:

file – name of the file to load the fingerprint in

abstract optimize()

optimize the index

abstract save(file=None)

save the index

Parameters:

file – name of file to save fingerprint in

abstract search(objects, hitlist_size=50, epsilon=0.2, with_raw_score=False, id_list=None, id_list_query=None, predicate=None)

search a series of query objects

Parameters:
  • objects – the query objects to be searched or their fingerprints

  • hitlist_size – the size of the hitlist returned

  • epsilon – max search accuracy error

  • with_raw_score – calculate and return the raw_score

  • id_list – array-like for converting row numbers in search results to hit ids

  • id_list_query – array-like for converting row numbers in search results to query ids

  • predicate – filter array that is the result of a predicate query

spectrum2array(spectrum, spectral_array_in, channel_in=0, dtype=<class 'numpy.float32'>, cutoff=0.0)
class masskit.utils.index.TanimotoIndex(index_name=None, dimension=4096, fingerprint_factory=<class 'masskit.utils.fingerprints.SpectrumTanimotoFingerPrint'>)

Bases: Index

brute force search index of binary fingerprints

Parameters:
  • dimension – the number of features in the fingerprint

  • fingerprint_factory – class used to encapsulate the fingerprint

create(table_map, column_name=None)

create index from a TableMap

Parameters:
  • table_map – the library to index

  • column_name – name of the column containing the objects to index. None=’spectrum’

create_from_fingerprint(table_map, fingerprint_column='ecfp4', fingerprint_count_column='ecfp4_count')

create the index from columns in a table_map encoded as binary fingerprints

Parameters:
  • table_map – TableMap that contains the arrow table

  • fingerprint_column – name of the fingerprint column, defaults to ‘ecfp4’

  • fingerprint_count_column – name of the fingerprint count column, defaults to ‘ecfp4_count’

load(file=None)

load the index

Parameters:

file – name of the file to load the fingerprint in

optimize()

optimize the index

save(file=None)

save the index

Parameters:

file – name of file to save fingerprint in

search(objects, hitlist_size=50, epsilon=0.1, with_raw_score=False, id_list=None, id_list_query=None, predicate=None)

search a series of query objects

Parameters:
  • objects – the query objects to be searched or their fingerprints

  • hitlist_size – the size of the hitlist returned

  • epsilon – max search accuracy error

  • with_raw_score – calculate and return the raw_score

  • id_list – array-like for converting row numbers in search results to hit ids

  • id_list_query – array-like for converting row numbers in search results to query ids

  • predicate – filter array that is the result of a predicate query

masskit.utils.index.dot_product(objects, column, hit_ids, cosine_scores, hitlist_size)

masskit.utils.spectrum_writers module

masskit.utils.spectrum_writers.spectra_to_mgf(fp, spectra, charge_list=None)

write out an array-like of spectra in mgf format

Parameters:
  • fp – stream or filename to write out to. will append

  • spectra – array-like of spectra to write out

  • charge_list – list of charges for Mascot to search, otherwise use the CHARGE field

masskit.utils.spectrum_writers.spectra_to_msp(fp, spectra, annotate_peptide=False, ion_types=None)

write out an array-like of spectra in msp format

Parameters:
  • fp – stream or filename to write out to. will append

  • spectra – map containing the spectrum

  • annotate_peptide – annotate the spectra as peptide

  • ion_types – ion types for annotation

masskit.utils.spectrum_writers.spectra_to_mzxml(fp, spectra, mzxml_attributes=None, min_intensity=1e-07, compress=True, use_id_as_scan=True)

write out an array-like of spectra in mzxml format

Parameters:
  • fp – stream or filename to write out to. will not append

  • min_intensity – the minimum intensity value

  • spectra – array-like of spectra to write out

  • mzxml_attributes – dict containing mzXML attributes

  • use_id_as_scan – use spectrum.id instead of spectrum.scan

  • compress – should the data be compressed?
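
A sketch of writing spectra out (the library file and the assumption that each row dict exposes a ‘spectrum’ entry are illustrative):

    from masskit.utils.spectrum_writers import spectra_to_msp
    from masskit.utils.tablemap import ArrowLibraryMap

    library = ArrowLibraryMap.from_parquet("peptide_library.parquet")
    spectra = [library.getitem_by_row(i)["spectrum"] for i in range(len(library))]

    # append the spectra to an msp file with peptide annotation
    spectra_to_msp("predicted.msp", spectra, annotate_peptide=True)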

masskit.utils.tablemap module

class masskit.utils.tablemap.ArrowLibraryMap(table_in, num=0, *args, **kwargs)

Bases: TableMap

wrapper for an arrow library

create_dict(idx)

create dict for row

Parameters:

idx – row number

static from_mgf(file, num=None, title_fields=None, min_intensity=0.0, spectrum_type=None)

read in an mgf file and create an ArrowLibraryMap

Parameters:
  • file – filename or stream

  • num – number of rows. None means all

  • title_fields – dict containing column names with corresponding regex to extract field values from the TITLE

  • min_intensity – the minimum intensity to set the fingerprint bit

  • spectrum_type – the type of spectrum file

Returns:

ArrowLibraryMap

static from_msp(file, num=None, id_field=0, comment_fields=None, min_intensity=0.0, spectrum_type=None)

read in an msp file and create an ArrowLibraryMap

Parameters:
  • file – filename or stream

  • num – number of rows. None means all

  • id_field – start value of the id field

  • comment_fields – a Dict of regexes used to extract fields from the Comment field. Form of the Dict is {comment_field_name: (regex, type, field_name)}. For example {‘Filter’: (r’@hcd(\d+\.?\d*)’, float, ‘nce’)}

  • min_intensity – the minimum intensity to set the fingerprint bit

  • spectrum_type – the type of spectrum file

Returns:

ArrowLibraryMap
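
For example, loading an msp library while pulling the NCE out of the Comment field (the file name and field names are illustrative; the regex mirrors the example above with its backslashes restored):

    from masskit.utils.tablemap import ArrowLibraryMap

    library = ArrowLibraryMap.from_msp(
        "library.msp",
        comment_fields={"Filter": (r"@hcd(\d+\.?\d*)", float, "nce")},
    )
    print(len(library), library.getitem_by_row(0).keys())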

static from_parquet(file, columns=None, num=None, combine_chunks=False, filters=None)

create an ArrowLibraryMap from a parquet file

Parameters:
  • file – filename or stream

  • columns – list of columns to read. None=all, []=minimum set

  • num – number of rows

  • combine_chunks – dechunkify the arrow table to allow zero copy

  • filters – parquet predicate as a list of tuples

static from_sdf(file, num=None, skip_expensive=True, max_size=0, id_field=None, min_intensity=0.0, set_probabilities=(0.01, 0.97, 0.01, 0.01), spectrum_type=None)

read in an sdf file and create an ArrowLibraryMap

Parameters:
  • file – filename or stream

  • num – number of rows. None means all

  • skip_expensive – don’t compute fields that are computationally expensive

  • max_size – the maximum bounding box size (used to filter out large molecules; 0 = no bound)

  • id_field – field to use for the mol id, such as NISTNO, ID or _NAME (the sdf title field). if an integer, use the integer as the starting value for an assigned id

  • min_intensity – the minimum intensity to set the fingerprint bit

  • set_probabilities – how to divide into dev, train, valid, test

  • spectrum_type – the type of spectrum file

Returns:

ArrowLibraryMap

get_ids()

get the ids of all records, in row order

Returns:

array of ids

getitem_by_id(key)

get an item from the library by id

Parameters:

key – id

Returns:

the row dict

getitem_by_row(key)

get an item from the library by row number

Parameters:

key – row number

Returns:

the row dict

getrow_by_id(key)

given an id, return the corresponding row number in the table

Parameters:

key – the id

Returns:

the row number of the id in the table

to_arrow()
to_csv(file, columns=None)

Write to csv file, skipping any spectrum column and writing mol columns as canonical SMILES

Parameters:
  • file – filename or file pointer. newline should be set to ‘’

  • columns – list of columns to write out to csv file. If none, all columns

to_mgf(file, spectrum_column=None)

save spectra to mgf file

Parameters:
  • file – filename or file pointer

  • spectrum_column – column name for spectrum

to_mzxml(file, use_id_as_scan=True, spectrum_column=None)

save spectra to mzxml format file

Parameters:
  • file – filename or stream

  • use_id_as_scan – use spectrum.id instead of spectrum.scan

  • spectrum_column – column name for spectrum

to_pandas()
to_parquet(file)

save spectra to parquet file

Parameters:

file – filename or stream

class masskit.utils.tablemap.ListLibraryMap(list_in, spectrum_column=None, *args, **kwargs)

Bases: TableMap

wrapper for a spectral library using python lists

create_dict(idx)

create dict for row

Parameters:

idx – row number

get_ids()

get the ids of all records, in row order

Returns:

array of ids

getitem_by_id(key)

get an item from the library by id

Parameters:

key – id

Returns:

the row dict

getitem_by_row(key)

get an item from the library by row number

Parameters:

key – row number

Returns:

the row dict

getrow_by_id(key)

given an id, return the corresponding row number in the table

Parameters:

key – the id

Returns:

the row number of the id in the table

class masskit.utils.tablemap.PandasLibraryMap(df, *args, **kwargs)

Bases: TableMap

wrapper for a pandas spectral library

create_dict(idx)

create dict for row

Parameters:

idx – row number

get_ids()

get the ids of all records, in row order

Returns:

array of ids

getitem_by_id(key)

get an item from the library by id

Parameters:

key – id

Returns:

the row dict

getitem_by_row(key)

get an item from the library by row number

Parameters:

key – row number

Returns:

the row dict

getrow_by_id(key)

given an id, return the corresponding row number in the table

Parameters:

key – the id

Returns:

the row number of the id in the table

class masskit.utils.tablemap.TableMap(*args, **kwargs)

Bases: ABC

collections.abc.Sequence wrapper for a library. Allows use of different stores, e.g. arrow or pandas

abstract create_dict(idx)

create dict for row

Parameters:

idx – row number

abstract get_ids()

get the ids of all records, in row order

Returns:

array of ids

abstract getitem_by_id(key)

get an item from the library by id

Parameters:

key – id

Returns:

the row dict

abstract getitem_by_row(key)

get an item from the library by row number

Parameters:

key – row number

Returns:

the row dict

abstract getrow_by_id(key)

given an id, return the corresponding row number in the table

Parameters:

key – the id

Returns:

the row number of the id in the table

to_msp(file, annotate_peptide=False, ion_types=None, spectrum_column=None)

write out spectra in msp format

Parameters:
  • file – file or filename to write to

  • annotate_peptide – annotate the spectra as a peptide

  • ion_types – ion types for annotation

  • spectrum_column – column name for spectrum

masskit.utils.tables module

masskit.utils.tables.create_dataset(rows=5, cols=[int, float, str, list], names=array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'], dtype='<U1'))
masskit.utils.tables.is_struct_or_extstruct(chunked_array: ChunkedArray) bool
masskit.utils.tables.optimize_structarray(struct: ChunkedArray) ChunkedArray
masskit.utils.tables.optimize_table(table: Table) Table
masskit.utils.tables.random_string(length)
masskit.utils.tables.row_view(table, idx=0)
masskit.utils.tables.row_view_raw(table, idx=0)
masskit.utils.tables.struct_view(struct_name, parent)
masskit.utils.tables.structarray_to_table(struct: StructArray) Table
masskit.utils.tables.table_add_structarray(table: Table, structarray: StructArray, column_name: str = None) Table

add a struct array to a table

Parameters:
  • table – table to be added to

  • structarray – structarray to add to table

  • column_name – name of column to add

Returns:

new table with appended column

masskit.utils.tables.table_to_structarray(table: Table, structarray_type: ExtensionType = None) StructArray

convert a spectrum table into a struct array. If an ExtensionType is passed in, a struct array of that type will be created

Parameters:
  • table – spectrum table

  • structarray_type – the type of the array returned, e.g. SpectrumArrowType()

Returns:

StructArray
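
A sketch of nesting one table inside another as a struct column. table_to_structarray is documented for spectrum tables, so applying it to a small peaks table here is an assumption, and the column names are illustrative:

    import pyarrow as pa
    from masskit.utils.tables import table_add_structarray, table_to_structarray

    peaks = pa.table({"mz": [100.1, 200.2], "intensity": [50.0, 999.0]})
    struct = table_to_structarray(peaks)   # flat table -> StructArray

    meta = pa.table({"id": [1, 2]})
    combined = table_add_structarray(meta, struct, column_name="peaks")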

Module contents