masskit.utils package

Subpackages

Submodules

masskit.utils.accumulator module

class masskit.utils.accumulator.Accumulator(*args, **kwargs)

Bases: ABC

accumulator class used to take the mean and standard deviation of a set of data

abstract add(new_item)

accumulate a new item

Parameters:

new_item – new item to be accumulated

abstract finalize()

finalize the accumulation

class masskit.utils.accumulator.AccumulatorProperty(*args, **kwargs)

Bases: Accumulator

used to calculate the mean and standard deviation of a property

add(new_item)

add an item to the average. Keeps running total of average and std deviation using Welford’s algorithm.

Parameters:

new_item – new item to be added

finalize()

finalize the std deviation after all the spectra have been added
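
A minimal usage sketch. How the finished statistics are read back after finalize() (for example mean and standard deviation attributes) is an assumption here, not part of the documented interface:

    from masskit.utils.accumulator import AccumulatorProperty

    # keep a running mean and standard deviation using Welford's algorithm
    accumulator = AccumulatorProperty()
    for value in (0.8, 1.1, 0.95):
        accumulator.add(value)
    accumulator.finalize()  # fixes up the standard deviation once all items are added
    # reading the finished mean/standard deviation back (e.g. as attributes on the
    # accumulator) is an assumption; check the class for the exact attribute names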

masskit.utils.arrow module

masskit.utils.arrow.create_object_id(filename, filters)

create an object id based on filename and filters

Parameters:
  • filename – filename

  • filters – filter string

Returns:

object id

masskit.utils.arrow.save_to_arrow(filename, columns=None, filters=None, tempdir=None)

Load a parquet file and save it as a temp arrow file after applying filters and column lists. Load it as a memmap if the temp arrow file already exists

Parameters:
  • filename – parquet file

  • columns – columns to load

  • filters – parquet file filters

  • tempdir – tempdir to use for memmap, otherwise use python default

Returns:

ArrowLibraryMap
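
A usage sketch (the parquet file name, column names, and filter values are illustrative assumptions):

    from masskit.utils.arrow import save_to_arrow

    # load a parquet file with a column list and filters; the result is an
    # ArrowLibraryMap backed by a memory-mapped temporary arrow file
    library = save_to_arrow(
        "spectra.parquet",
        columns=["id", "spectrum"],
        filters=[("charge", "=", 2)],
    )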

masskit.utils.files module

class masskit.utils.files.BatchFileReader(filename: Union[str, PathLike], format: Union[Dict, omegaconf.DictConfig] = None, row_batch_size: int = 5000)

Bases: object

iter_tables() Table

batch read generator that yields a Table for each batch

loaders = {'csv': <class 'masskit.utils.files.CSVLoader'>, 'mgf': <class 'masskit.utils.files.MGFLoader'>, 'msp': <class 'masskit.utils.files.MSPLoader'>, 'sdf': <class 'masskit.utils.files.SDFLoader'>}
class masskit.utils.files.BatchFileWriter(filename: Union[str, PathLike], format: str = None, annotate: bool = False, row_batch_size: int = 5000, num_workers: int = 7, column_name: str = None)

Bases: object

close()

Close the writer. Essential to avoid race conditions with threaded writers.

write_table(table: Table) None

write a table out

Parameters:

table – table to write
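
A streaming sketch that converts a library in batches. The file names are illustrative, and both omitting the format for the msp reader and passing the string 'parquet' to the writer are assumptions:

    from masskit.utils.files import BatchFileReader, BatchFileWriter

    # read a library in 5000-row batches and write each batch back out
    reader = BatchFileReader("library.msp")
    writer = BatchFileWriter("library.parquet", format="parquet")
    for table in reader.iter_tables():
        writer.write_table(table)
    writer.close()  # essential to avoid race conditions with threaded writers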

class masskit.utils.files.BatchLoader(file_type, row_batch_size=5000, format=None, num=None)

Bases: object

ecfp4 = <masskit.utils.fingerprints.ECFPFingerprint object>
finalize()

finalize the chunk

load(fp)

read in a chunk from a stream

Parameters:

fp – the stream

loop_end()

end of loop to read in one record

classmethod mol2row(mol, max_size: int = 0, skip_computed_props=True, skip_expensive: bool = True) Dict

Convert an rdkit Mol into a row

Parameters:
  • mol – the molecule

  • max_size – the maximum bounding box size (used to filter out large molecules; 0 = no bound)

  • skip_computed_props – skip computing properties

  • skip_expensive – skip computing computationally expensive properties

Returns:

row as dict, mol

setup()

set up loading

tms = '[#14]([CH3])([CH3])[CH3]'
class masskit.utils.files.CSVLoader(row_batch_size=5000, format=None, num=None)

Bases: BatchLoader

load(fp)

read in a chunk from a stream

Parameters:

fp – the stream

class masskit.utils.files.MGFLoader(row_batch_size=5000, format=None, num=None, set_probabilities=(0.0, 0.93, 0.02, 0.05), min_intensity=0.0)

Bases: BatchLoader

load(fp)

read in a chunk from a stream

Parameters:

fp – the stream

class masskit.utils.files.MSPLoader(row_batch_size=5000, format=None, min_intensity=0, num=None)

Bases: BatchLoader

load(fp)

read in a chunk from a stream

Parameters:

fp – the stream

class masskit.utils.files.MzTab_Reader(fp, dedup=False, decoy_func=None)

Bases: object

Class for reading an mzTab file

Parameters:
  • fp – stream or filename

  • dedup – Only take the first row with a given hit_id

get_hitlist()
parse_metadata()
parse_psm()
parse_psm_row(row)
read_sections(fp)
class masskit.utils.files.SDFLoader(row_batch_size=5000, format=None, num=None, set_probabilities=(0.0, 0.93, 0.02, 0.05), min_intensity=0.0)

Bases: BatchLoader

load(fp)

read in a chunk from a stream

Parameters:

fp – the stream

masskit.utils.files.add_row_to_records(records, row)
masskit.utils.files.create_table_with_mod_dict(records, schema)
masskit.utils.files.empty_records(schema)
masskit.utils.files.get_progress(fp)

Given a file pointer, return the best possible progress meter

Parameters:

fp – name of the file to read or a file object

masskit.utils.files.load_default_config_schema(file_type, format=None)

for a given file type and format, return configuration

Parameters:
  • file_type – the type of file, e.g. mgf, msp, arrow, parquet, sdf

  • format – the specific format of the file_type, e.g. msp_mol, msp_peptide, sdf_mol, sdf_nist_mol, sdf_pubchem_mol. If none, use default

Returns:

config

masskit.utils.files.load_mzTab(fp, dedup=True, decoy_func=None)

Read a file in mzTab format and return an array.

Parameters:
  • fp – stream or filename

  • dedup – Only take the first row with a given hit_id

masskit.utils.files.parse_energy(row, value)

parse energy field that has nce or collision_energy in it

masskit.utils.files.parse_glycopeptide_annot(annots, peak_index)

take a glycopeptide annotation string like “Y0-H2O+i/-18.2ppm,Y0-NH3/7.6ppm 22 23” and parse it. This function is not complete.

Parameters:
  • annots – the annotation string

  • peak_index – the index of the peak with the annotation in the mz array

Returns:

the parsed values in an array

masskit.utils.files.read_parquet(fp, columns=None, num=None, filters=None)

reads a PyArrow table from a parquet file.

Parameters:
  • fp – stream or filename

  • columns – list of columns to read, None=all

  • filters – parquet predicate as a list of tuples

Returns:

PyArrow table

masskit.utils.files.records2table(records, schema_group)

convert a flat table with spectral records into a table with a nested spectrum

Parameters:
  • records – records to be added to a table

  • schema_group – the schema group

masskit.utils.files.seek_size(fp)
masskit.utils.files.spectra_to_array(spectra, min_intensity=0, write_tolerance=False, schema_group=None)

convert an array-like of spectra to an arrow_table

Parameters:
  • spectra – iterable containing the spectra

  • min_intensity – the minimum intensity to set the fingerprint bit

  • write_tolerance – put the tolerance into the arrow Table

  • schema_group – the schema group of spectrum file

Returns:

arrow table of spectra

masskit.utils.files.spectrum2mgf(spectrum)
masskit.utils.files.spectrum2msp(spectrum, annotate=False)
masskit.utils.files.write_parquet(fp, table)

save a PyArrow table to a parquet file.

Parameters:
  • table – the dataframe

  • fp – stream or filename
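
A round-trip sketch (the column names and the filter predicate are illustrative):

    from masskit.utils.files import read_parquet, write_parquet

    # read selected columns with a parquet predicate, then write the table back out
    table = read_parquet(
        "spectra.parquet",
        columns=["id", "precursor_mz"],
        filters=[("precursor_mz", ">", 400.0)],
    )
    write_parquet("filtered.parquet", table)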

masskit.utils.fingerprints module

class masskit.utils.fingerprints.ECFPFingerprint(dimension=4096, radius=2, *args, **kwargs)

Bases: Fingerprint

rdkit version of ECFP fingerprint for a small molecule structure

from_bitvec(array)

create fingerprint from rdkit ExplicitBitVect

Parameters:

array (DataStructs.ExplicitBitVect) – input ExplicitBitVect

from_numpy(array)

create fingerprint from numpy array

Parameters:

array (np.ndarray) – input numpy array

object2fingerprint(obj, dtype=<class 'numpy.float32'>)

convert an object into a fingerprint

Parameters:
  • obj (object) – the object to convert to a fingerprint

  • dtype (np.dtype) – data type of output array

to_bitvec()

convert fingerprint to rdkit ExplicitBitVect

Returns:

the fingerprint

Return type:

DataStructs.ExplicitBitVect

to_numpy()

convert fingerprint to numpy array

Returns:

the fingerprint

Return type:

np.ndarray
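
A sketch of fingerprinting a molecule. Whether object2fingerprint() returns the array or populates the instance for later to_numpy()/to_bitvec() calls is an assumption:

    from rdkit import Chem
    from masskit.utils.fingerprints import ECFPFingerprint

    fingerprint = ECFPFingerprint(dimension=4096, radius=2)
    mol = Chem.MolFromSmiles("c1ccccc1O")   # phenol
    fingerprint.object2fingerprint(mol)     # compute the ECFP bits from the molecule
    bits = fingerprint.to_numpy()           # dense numpy representation
    print(bits.shape, fingerprint.get_num_on_bits())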

class masskit.utils.fingerprints.Fingerprint(dimension=2000, stride=256, column_name=None, *args, **kwargs)

Bases: ABC

class for encapsulating and calculating a fingerprint

Parameters:

dimension (int) – size of the fingerprint (includes other features like counts)

bitvec2numpy()

version of to_numpy used when fingerprint is a bitvec

abstract from_bitvec(array)

create fingerprint from rdkit ExplicitBitVect

Parameters:

array (DataStructs.ExplicitBitVect) – input ExplicitBitVect

abstract from_numpy(array)

create fingerprint from numpy array

Parameters:

array (np.ndarray) – input numpy array

get_num_on_bits()

retrieve the number of nonzero features

numpy2bitvec(array)

version of from_numpy used when fingerprint is a bitvec

Parameters:

array – the fingerprint as a numpy array

abstract object2fingerprint(obj, dtype=<class 'numpy.float32'>)

convert an object into a fingerprint

Parameters:
  • obj (object) – the object to convert to a fingerprint

  • dtype (np.dtype) – data type of output array

size() int

size of the fingerprint in bytes, where the size is a multiple of the stride

Returns:

size of fingerprint

abstract to_bitvec()

convert fingerprint to rdkit ExplicitBitVect

Returns:

the fingerprint

Return type:

DataStructs.ExplicitBitVect

abstract to_numpy()

convert fingerprint to numpy array

Returns:

the fingerprint

Return type:

np.ndarray

class masskit.utils.fingerprints.MolFingerprint(dimension=4096, *args, **kwargs)

Bases: ABC

class for encapsulating and calculating a molecule fingerprint

Parameters:

dimension (int) – size of the fingerprint (includes other features like counts)

abstract from_bitvec(array)

create fingerprint from rdkit ExplicitBitVect

Parameters:

array (DataStructs.ExplicitBitVect) – input ExplicitBitVect

abstract from_numpy(array)

create fingerprint from numpy array

Parameters:

array (np.ndarray) – input numpy array

abstract object2fingerprint(obj, dtype=<class 'numpy.float32'>)

convert an object into a fingerprint

Parameters:
  • obj (object) – the object to convert to a fingerprint

  • dtype (np.dtype) – data type of output array

abstract to_bitvec()

convert fingerprint to rdkit ExplicitBitVect

Returns:

the fingerprint

Return type:

DataStructs.ExplicitBitVect

abstract to_numpy()

convert fingerprint to numpy array

Returns:

the fingerprint

Return type:

np.ndarray

class masskit.utils.fingerprints.SpectrumFingerprint(dimension=2000, bin_size=1.0, first_bin_left=0.5, *args, **kwargs)

Bases: Fingerprint

base class for spectrum fingerprint

abstract from_bitvec(array)

create fingerprint from rdkit ExplicitBitVect

Parameters:

array (DataStructs.ExplicitBitVect) – input ExplicitBitVect

abstract from_numpy(array)

create fingerprint from numpy array

Parameters:

array (np.ndarray) – input numpy array

abstract object2fingerprint(obj, dtype=<class 'numpy.float32'>)

convert an object into a fingerprint

Parameters:
  • obj (object) – the object to convert to a fingerprint

  • dtype (np.dtype) – data type of output array

abstract to_bitvec()

convert fingerprint to rdkit ExplicitBitVect

Returns:

the fingerprint

Return type:

DataStructs.ExplicitBitVect

abstract to_numpy()

convert fingerprint to numpy array

Returns:

the fingerprint

Return type:

np.ndarray

class masskit.utils.fingerprints.SpectrumFloatFingerprint(dimension=2000, count_max=2000, mz_ratio: bool = False, column_name=None, *args, **kwargs)

Bases: SpectrumFingerprint

create a spectral fingerprint that also encodes the peak count by powers of two, skipping 1. In other words, the count bits are set if the number of peaks is >=2, >=4, …

from_bitvec(array)

create fingerprint from rdkit ExplicitBitVect

Parameters:

array (DataStructs.ExplicitBitVect) – input ExplicitBitVect

from_numpy(array)

create fingerprint from numpy array

Parameters:

array (np.ndarray) – input numpy array

object2fingerprint(obj, dtype=<class 'numpy.float32'>)

fill out a fixed-size array with the ions. Note that this function assumes the spectra are sorted by mz

to_bitvec()

convert fingerprint to rdkit ExplicitBitVect

Returns:

the fingerprint

Return type:

DataStructs.ExplicitBitVect

to_numpy()

convert fingerprint to numpy array

Returns:

the fingerprint

Return type:

np.ndarray

class masskit.utils.fingerprints.SpectrumTanimotoFingerPrint(dimension=2000, column_name=None, *args, **kwargs)

Bases: SpectrumFingerprint

create a spectral tanimoto fingerprint

Parameters:
  • dimension – overall dimension of fingerprint

  • bin_size – size of each bin

  • first_bin_left – the position of the first bin left side

from_bitvec(array)

create fingerprint from rdkit ExplicitBitVect

Parameters:

array (DataStructs.ExplicitBitVect) – input ExplicitBitVect

from_numpy(array)

create fingerprint from numpy array

Parameters:

array (np.ndarray) – input numpy array

object2fingerprint(obj, dtype=<class 'numpy.float32'>)

convert an object into a fingerprint

Parameters:
  • obj (object) – the object to convert to a fingerprint

  • dtype (np.dtype) – data type of output array

to_bitvec()

convert fingerprint to rdkit ExplicitBitVect

Returns:

the fingerprint

Return type:

DataStructs.ExplicitBitVect

to_numpy()

convert fingerprint to numpy array

Returns:

the fingerprint

Return type:

np.ndarray

masskit.utils.fingerprints.calc_bit(a, b, max_mz_in, tolerance_in)
masskit.utils.fingerprints.calc_dynamic_fingerprint(spectrum, max_mz=2000, tolerance=0.1, min_intensity=50, mz_window=14, min_mz=2.5, max_rank=20, hybrid_mask=None)

create a fingerprint from the top max_rank peaks and the OR of the fingerprints of the top peaks, where each fingerprint is shifted by the mz of one of the top peaks

Parameters:
  • spectrum – input spectrum

  • max_mz – maximum mz used for the fingerprint

  • tolerance – mass tolerance to use

  • min_intensity – minimum intensity to allow for creating the fingerprint

  • mz_window – noise filter mz window

  • min_mz – minimum mz to use

  • max_rank – the maximum rank used for the fingerprint

Returns:

the fingerprint

masskit.utils.fingerprints.calc_interval_fingerprint(spectrum, max_mz=2000, tolerance=0.1, min_intensity=50, mz_window=14, min_mz=1, peaks=None)

create a pairwise interval fingerprint from the filtered peaks of a spectrum

Parameters:
  • spectrum – input spectrum

  • max_mz – maximum mz used for the fingerprint

  • tolerance – mass tolerance to use

  • min_intensity – minimum intensity to allow for creating the fingerprint

  • mz_window – noise filter mz window

  • min_mz – minimum mz to use

  • peaks – a list of additional mz values to use for creating the intervals

Returns:

the fingerprint

masskit.utils.fingerprints.fingerprint_search_numpy(query_fingerprint, fingerprint_array, tanimoto_cutoff, query_fingerprint_count=None, fingerprint_array_count=None)

return a list of fingerprint hits to a query fingerprint

Parameters:
  • query_fingerprint – query fingerprint

  • fingerprint_array – array-like list of fingerprints

  • tanimoto_cutoff – Tanimoto cutoff

  • query_fingerprint_count – number of bits set in query

  • fingerprint_array_count – number of bits set in array fingerprints

Returns:

tanimoto scores (if below threshold, set to 0.0)

Notes: numba doesn’t work with arrays of objects. pyarrow creates a numpy array of objects, where each object is a numpy array; even arrow fixed size lists do this, and fixedbinary uses byte objects.
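
A search sketch. The uint8 dtype and the randomly generated fingerprints are assumptions used only for illustration; in practice the fingerprints would come from a fingerprint column:

    import numpy as np
    from masskit.utils.fingerprints import fingerprint_search_numpy

    rng = np.random.default_rng(0)
    query = rng.integers(0, 2, 4096, dtype=np.uint8)
    library = rng.integers(0, 2, (1000, 4096), dtype=np.uint8)

    # scores below the Tanimoto cutoff are returned as 0.0
    scores = fingerprint_search_numpy(query, library, tanimoto_cutoff=0.3)
    hits = np.nonzero(scores)[0]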

masskit.utils.general module

class masskit.utils.general.MassKitSearchPathPlugin(*args: Any, **kwargs: Any)

Bases: SearchPathPlugin

add the cwd to the search path for configuration yaml files

manipulate_search_path(search_path: hydra.core.config_search_path.ConfigSearchPath) None
masskit.utils.general.class_for_name(module_name_list, class_name)

dynamically try to find a class in a list of modules

Parameters:
  • module_name_list – list of modules to search

  • class_name – class to look for

Returns:

class
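
For example (the module and class names are only illustrative):

    from masskit.utils.general import class_for_name

    # search the listed modules for a class named ECFPFingerprint
    fingerprint_class = class_for_name(["masskit.utils.fingerprints"], "ECFPFingerprint")
    fingerprint = fingerprint_class()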

masskit.utils.general.discounted_cumulative_gain(relevance_array)
masskit.utils.general.expand_path(path_pattern) Iterable[Path]

expand a path with globs (wildcards) into a generator of Paths

Parameters:

path_pattern – the path to be globbed

Returns:

iterable of Path

masskit.utils.general.expand_path_list(path_list)

given a file path or list of file paths that may contain wildcards, expand ~ to the user directory and glob the wildcards

Parameters:

path_list – list or str of paths, may include ~ and *

Returns:

list of expanded paths
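
A sketch of expanding wildcard paths (the glob patterns are illustrative):

    from masskit.utils.general import expand_path, expand_path_list

    # expand_path yields Path objects matching a single glob pattern
    for path in expand_path("data/*.parquet"):
        print(path)

    # expand_path_list accepts a str or list of paths containing ~ and wildcards
    paths = expand_path_list(["~/data/*.msp", "/tmp/library.mgf"])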

masskit.utils.general.get_file(filename, cache_directory=None, search_path=None, tgz_extension=None)

for a given file, return the path (downloading file if necessary)

Parameters:
  • filename – name of the file

  • cache_directory – where files are cached (config.paths.cache_directory)

  • search_path – list of where to search for files (config.paths.search_path)

  • tgz_extension – if a tgz file to be downloaded, the extension of the unarchived file

Returns:

path

Notes: we don’t support zip files as the python api for zip files doesn’t support symlinks

masskit.utils.general.is_list_like(obj)

is the object list-like?

Parameters:

obj – the object to be tested

Returns:

true if list like

masskit.utils.general.open_if_compressed(filename, mode, newline=None)

open the file denoted by filename and uncompress it if it is compressed

Parameters:
  • filename – filename

  • mode – file opening mode

  • newline – specify newline character

Returns:

stream

masskit.utils.general.open_if_filename(fp, mode, newline=None)

if the fp is a string, open the file, and uncompress if needed

Parameters:
  • fp – possible filename

  • mode – file opening mode

  • newline – specify newline character

Returns:

stream

masskit.utils.general.parse_filename(filename: str)

parse filename into root, extension, and compression extension

Parameters:

filename – filename

Returns:

root, extension, compression
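
A sketch of working with possibly compressed files (the file name is illustrative):

    from masskit.utils.general import open_if_compressed, parse_filename

    root, extension, compression = parse_filename("library.msp.gz")

    # open the file, transparently decompressing it if needed
    fp = open_if_compressed("library.msp.gz", "rt")
    first_line = fp.readline()
    fp.close()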

masskit.utils.general.read_arrow(filename)

read a pyarrow Table from an arrow or parquet file. arrow file is memory mapped.

Parameters:

filename – file to input. use the suffix “arrow” if an arrow file, “parquet” if a parquet file

Returns:

pyarrow Table

masskit.utils.general.search_for_file(filename, directories)

look for a file in a list of directories

Parameters:
  • filename – the filename to look for

  • directories – the directories to search

Returns:

the Path to the file, otherwise None if not found

masskit.utils.general.write_arrow(table, filename, row_group_size=5000)

write a pyarrow Table to an arrow or parquet file.

Parameters:
  • filename – file to output. use the suffix “arrow” if an arrow file, “parquet” if a parquet file

  • table – pyarrow Table

  • row_group_size – if a parquet file, use this row group size
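
A small round trip (the table contents are illustrative):

    import pyarrow as pa
    from masskit.utils.general import read_arrow, write_arrow

    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    write_arrow(table, "example.arrow")      # the "arrow" suffix selects the arrow format
    memmapped = read_arrow("example.arrow")  # arrow files are read back memory mapped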

masskit.utils.hitlist module

class masskit.utils.hitlist.CompareRecallDCG(**kwargs)

Bases: HitlistCompare

compare two hitlists by ranking the scores and computing the recall and discounted cumulative gain at several subsets of the hit list with different lengths. The recall values consist of the number of hits in each subset in the compare hitlist that have a rank equal to or better than the ground truth hitlist.

compare(compare_hitlist, truth_hitlist=None, recall_values_in=None, rank_method=None)

compare a hitlist to a ground truth hitlist. Rank according to scores, then compute recall. The ranking applied is min ranking, where a group of tied hits is given the rank it would have if the group had a single member.

Parameters:
  • compare_hitlist – Hitlist to compare to

  • truth_hitlist – Hitlist that serves as ground truth

  • recall_values_in – lengths of the hitlist subsets used to calculate recall. [1, 3, 10] by default

  • rank_method – the ranking method to use; “min” by default

class masskit.utils.hitlist.CosineScore(hit_table_map, query_table_map=None, score_name=None, scale=1.0)

Bases: Score

score(hitlist)

do the scoring on the hitlist

Parameters:

hitlist (Hitlist) – the hitlist to score

class masskit.utils.hitlist.Hitlist(hitlist=None)

Bases: ABC

base class for a list of hits from a search

Parameters:

hitlist (pandas.DataFrame) – the hitlist

get_query_ids()

return list of unique query ids

Returns:

query ids

Return type:

np.int64

property hitlist

get the hitlist as a pandas dataframe

Returns:

the hitlist

Return type:

pandas.DataFrame

load(file)
save(file)
sort(score=None, ascending=False)

sort the hitlist per query

Parameters:
  • score – name of the score. default is cosine_score

  • ascending – should the score be ascending

to_pandas()

get hitlist as a pandas dataframe
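
A minimal sketch of wrapping search results and sorting them per query. The (query_id, hit_id) multi-index and the cosine_score column are assumptions about the expected DataFrame layout:

    import pandas as pd
    from masskit.utils.hitlist import Hitlist

    df = pd.DataFrame(
        {"cosine_score": [0.42, 0.91, 0.77]},
        index=pd.MultiIndex.from_tuples(
            [(1, 10), (1, 11), (1, 12)], names=["query_id", "hit_id"]
        ),
    )
    hits = Hitlist(df)
    hits.sort(score="cosine_score", ascending=False)
    print(hits.to_pandas())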

class masskit.utils.hitlist.HitlistCompare(comparison_score=None, truth_score=None, comparison_score_ascending=False, truth_score_ascending=False, comparison_score_rank=None, truth_score_rank=None)

Bases: ABC

base class for comparing two hitlists.

Parameters:
  • comparison_score – the column name of the score to compare to ground truth, ‘cosine_score’ default

  • truth_score – the column name of the ground truth score, default same as comparison score

  • comparison_score_ascending – is the comparison score ascending?

  • truth_score_ascending – is the ground truth score ascending?

  • comparison_score_rank – what is the column name of the rank of the comparison score, default appends _rank to comparison_score

  • truth_score_rank – what is the column name of the rank of the truth score, default appends _rank to truth_score

abstract compare(compare_hitlist, truth_hitlist=None, recall_values=None, rank_method=None)

compare hitlist to a ground truth hitlist.

Parameters:
  • compare_hitlist – Hitlist to compare to

  • truth_hitlist – Hitlist that serves as ground truth

  • recall_values – lengths of the hitlist subsets used to calculate recall. [1, 3, 10] by default

  • rank_method – the ranking method to use; “min” by default

class masskit.utils.hitlist.IdentityRecall(**kwargs)

Bases: HitlistCompare

examine a hitlist with an identity column and compute the recall

compare(compare_hitlist, recall_values_in=None, rank_method=None, identity_column=None)

compute recall of compare hitlist

Parameters:
  • compare_hitlist – Hitlist to compute the recall for

  • recall_values_in – lengths of the hitlist subsets used to calculate recall. [1, 3, 10] by default

  • rank_method – the ranking method to use; “min” by default

  • identity_column – column name of the identity column

class masskit.utils.hitlist.PeptideIdentityScore(score_name=None)

Bases: Score

score(hitlist)

do the scoring on the hitlist

Parameters:

hitlist (Hitlist) – the hitlist to score

class masskit.utils.hitlist.Score(hit_table_map=None, query_table_map=None, score_name=None)

Bases: ABC

base class for scoring a hitlist

Parameters:
  • hit_table_map (TableMap) – a TableMap for the hits

  • query_table_map (TableMap) – a TableMap for the queries. uses hit_table_map if set to None

  • score_name (str) – the name of the score

abstract score(hitlist)

do the scoring on the hitlist

Parameters:

hitlist (Hitlist) – the hitlist to score

class masskit.utils.hitlist.TanimotoScore(hit_table_map, query_table_map=None, score_name=None, scale=1.0, hit_fingerprint_column=None, query_fingerprint_column=None)

Bases: Score

score(hitlist)

do the scoring on the hitlist

Parameters:

hitlist (Hitlist) – the hitlist to score

masskit.utils.index module

class masskit.utils.index.BruteForceIndex(index_name=None, dimension=None, fingerprint_factory=None)

Bases: Index

search a library by brute force spectrum matching

Parameters:
  • dimension – the number of features in the fingerprint (ignored)

  • fingerprint_factory – class used to encapsulate the fingerprint

create(table_map, column_name=None)

create index from a TableMap

Parameters:
  • table_map – the library to index

  • column_name – name of the column containing the objects to index. None=’spectrum’

load(file=None)

load the index

Parameters:

file – name of the file to load the fingerprint in

optimize()

optimize the index

save(file=None)

save the index

Parameters:

file – name of file to save fingerprint in

search(objects, hitlist_size=50, epsilon=0.3, with_raw_score=False, id_list=None, id_list_query=None, predicate=None)

search a series of query objects

Parameters:
  • objects – the query objects to be searched or their fingerprints

  • hitlist_size – the size of the hitlist returned

  • epsilon – max search accuracy error

  • with_raw_score – calculate and return the raw_score

  • id_list – array-like for converting row numbers in search results to hit ids

  • id_list_query – array-like for converting row numbers in search results to query ids

  • predicate – filter array that is the result of a predicate query

class masskit.utils.index.DescentIndex(index_name=None, dimension=2000, fingerprint_factory=<class 'masskit.utils.fingerprints.SpectrumFloatFingerprint'>)

Bases: Index

pynndescent index for a fingerprint

Parameters:
  • dimension – the number of features in the fingerprint

  • fingerprint_factory – class used to encapsulate the fingerprint

create(table_map, metric=None, column_name=None)

create index from a TableMap

Parameters:
  • table_map – the library to index

  • column_name – name of the column containing the objects to index. None=’spectrum’

create_from_fingerprint(table_map, fingerprint_column='ecfp4', fingerprint_count_column='ecfp4_count', metric=None)

create the index from a table column containing a binary fingerprint or feature vector

Parameters:
  • table_map – TableMap that contains the arrow table

  • fingerprint_column – name of the fingerprint column, defaults to ‘ecfp4’

  • fingerprint_count_column – name of the fingerprint count column, defaults to ‘ecfp4_count’

load(file=None)

load the index

Parameters:

file – name of the file to load the fingerprint in

optimize()

optimize the index

save(file=None)

save the index

Parameters:

file – name of file to save fingerprint in

search(objects, hitlist_size=50, epsilon=0.3, with_raw_score=False, id_list=None, id_list_query=None, predicate=None)

search a series of query objects

Parameters:
  • objects – the query objects to be searched or their fingerprints

  • hitlist_size – the size of the hitlist returned

  • epsilon – max search accuracy error

  • with_raw_score – calculate and return the raw_score

  • id_list – array-like for converting row numbers in search results to hit ids

  • id_list_query – array-like for converting row numbers in search results to query ids

  • predicate – filter array that is the result of a predicate query
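
A sketch of building and searching an index over a library (the file names, the column name, and the form of the query objects are assumptions):

    from masskit.utils.index import DescentIndex
    from masskit.utils.tablemap import ArrowLibraryMap

    library = ArrowLibraryMap.from_parquet("library.parquet")

    index = DescentIndex(index_name="library_descent")
    index.create(library, column_name="spectrum")
    index.save()  # where the index is written when no file is given is an assumption

    # query with a spectrum taken from the library itself
    queries = [library.getitem_by_row(0)["spectrum"]]
    hits = index.search(queries, hitlist_size=10)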

class masskit.utils.index.DotProductIndex(index_name=None, dimension=None, fingerprint_factory=None)

Bases: Index

brute force search of feature vectors using cosine score

create(table_map)

create index from a TableMap

Parameters:
  • table_map – the library to index

  • column_name – name of the column containing the objects to index. None=’spectrum’

load(file=None)

load the index

Parameters:

file – name of the file to load the fingerprint in

optimize()

optimize the index

save(file=None)

save the index

Parameters:

file – name of file to save fingerprint in

search(objects, hitlist_size=30, epsilon=0.1, id_list=None, id_list_query=None, predicate=None)

search feature vectors

Parameters:
  • objects – feature vector or list of feature vectors to be queried

  • hitlist_size – the size of the hitlist returned, defaults to 30

  • epsilon – max search accuracy error, defaults to 0.1

Returns:

_description_

class masskit.utils.index.Index(index_name=None, dimension=None, fingerprint_factory=None)

Bases: ABC

Index used for searching a library

Parameters:
  • dimension – the number of features in the fingerprint

  • fingerprint_factory – class used to encapsulate the

  • index_name – name of index

abstract create(table_map, column_name=None)

create index from a TableMap

Parameters:
  • table_map – the library to index

  • column_name – name of the column containing the objects to index. None=’spectrum’

abstract load(file=None)

load the index

Parameters:

file – name of the file to load the fingerprint in

abstract optimize()

optimize the index

abstract save(file=None)

save the index

Parameters:

file – name of file to save fingerprint in

abstract search(objects, hitlist_size=50, epsilon=0.2, with_raw_score=False, id_list=None, id_list_query=None, predicate=None)

search a series of query objects

Parameters:
  • objects – the query objects to be searched or their fingerprints

  • hitlist_size – the size of the hitlist returned

  • epsilon – max search accuracy error

  • with_raw_score – calculate and return the raw_score

  • id_list – array-like for converting row numbers in search results to hit ids

  • id_list_query – array-like for converting row numbers in search results to query ids

  • predicate – filter array that is the result of a predicate query

spectrum2array(spectrum, spectral_array_in, channel_in=0, dtype=<class 'numpy.float32'>, cutoff=0.0)
class masskit.utils.index.TanimotoIndex(index_name=None, dimension=4096, fingerprint_factory=<class 'masskit.utils.fingerprints.SpectrumTanimotoFingerPrint'>)

Bases: Index

brute force search index of binary fingerprints

Parameters:
  • dimension – the number of features in the fingerprint

  • fingerprint_factory – class used to encapsulate the fingerprint

create(table_map, column_name=None)

create index from a TableMap

Parameters:
  • table_map – the library to index

  • column_name – name of the column containing the objects to index. None=’spectrum’

create_from_fingerprint(table_map, fingerprint_column='ecfp4', fingerprint_count_column='ecfp4_count')

create the index from columns in a table_map encoded as binary fingerprints

Parameters:
  • table_map – TableMap that contains the arrow table

  • fingerprint_column – name of the fingerprint column, defaults to ‘ecfp4’

  • fingerprint_count_column – name of the fingerprint count column, defaults to ‘ecfp4_count’

load(file=None)

load the index

Parameters:

file – name of the file to load the fingerprint in

optimize()

optimize the index

save(file=None)

save the index

Parameters:

file – name of file to save fingerprint in

search(objects, hitlist_size=50, epsilon=0.1, with_raw_score=False, id_list=None, id_list_query=None, predicate=None)

search a series of query objects

Parameters:
  • objects – the query objects to be searched or their fingerprints

  • hitlist_size – the size of the hitlist returned

  • epsilon – max search accuracy error

  • with_raw_score – calculate and return the raw_score

  • id_list – array-like for converting row numbers in search results to hit ids

  • id_list_query – array-like for converting row numbers in search results to query ids

  • predicate – filter array that is the result of a predicate query

masskit.utils.index.dot_product(objects, column, hit_ids, cosine_scores, hitlist_size)

masskit.utils.spectrum_writers module

masskit.utils.spectrum_writers.spectra_to_mgf(fp, spectra, charge_list=None)

write out an array-like of spectra in mgf format

Parameters:
  • fp – stream or filename to write out to. will append

  • spectra – array-like of spectra to write out

  • charge_list – list of charges for Mascot to search, otherwise use the CHARGE field

masskit.utils.spectrum_writers.spectra_to_msp(fp, spectra, annotate_peptide=False, ion_types=None)

write out an array-like of spectra in msp format

Parameters:
  • fp – stream or filename to write out to. will append

  • spectra – map containing the spectrum

  • annotate_peptide – annotate the spectra as peptide

  • ion_types – ion types for annotation

masskit.utils.spectrum_writers.spectra_to_mzxml(fp, spectra, mzxml_attributes=None, min_intensity=1e-07, compress=True, use_id_as_scan=True)

write out an array-like of spectra in mzxml format

Parameters:
  • fp – stream or filename to write out to. will not append

  • min_intensity – the minimum intensity value

  • spectra – array-like of spectra to write out

  • mzxml_attributes – dict containing mzXML attributes

  • use_id_as_scan – use spectrum.id instead of spectrum.scan

  • compress – should the data be compressed?
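
A sketch of writing spectra out (the library file and the assumption that each row dict exposes a ‘spectrum’ entry are illustrative):

    from masskit.utils.spectrum_writers import spectra_to_msp
    from masskit.utils.tablemap import ArrowLibraryMap

    library = ArrowLibraryMap.from_parquet("peptide_library.parquet")
    spectra = [library.getitem_by_row(i)["spectrum"] for i in range(len(library))]

    # append the spectra to an msp file with peptide annotation
    spectra_to_msp("predicted.msp", spectra, annotate_peptide=True)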

masskit.utils.tablemap module

class masskit.utils.tablemap.ArrowLibraryMap(table_in, num=0, *args, **kwargs)

Bases: TableMap

wrapper for an arrow library

create_dict(idx)

create dict for row

Parameters:

idx – row number

static from_mgf(file, num=None, title_fields=None, min_intensity=0.0, spectrum_type=None)

read in an mgf file and create an ArrowLibraryMap

Parameters:
  • file – filename or stream

  • num – number of rows. None means all

  • title_fields – dict containing column names with corresponding regex to extract field values from the TITLE

  • min_intensity – the minimum intensity to set the fingerprint bit

  • spectrum_type – the type of spectrum file

Returns:

ArrowLibraryMap

static from_msp(file, num=None, id_field=0, comment_fields=None, min_intensity=0.0, spectrum_type=None)

read in an msp file and create an ArrowLibraryMap

Parameters:
  • file – filename or stream

  • num – number of rows. None means all

  • id_field – start value of the id field

  • comment_fields – a Dict of regexes used to extract fields from the Comment field. Form of the Dict is {comment_field_name: (regex, type, field_name)}. For example {‘Filter’: (r’@hcd(\d+\.?\d*)’, float, ‘nce’)}

  • min_intensity – the minimum intensity to set the fingerprint bit

  • spectrum_type – the type of spectrum file

Returns:

ArrowLibraryMap
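
For example, loading an msp library while pulling the NCE out of the Comment field (the file name and field names are illustrative; the regex mirrors the example above with its backslashes restored):

    from masskit.utils.tablemap import ArrowLibraryMap

    library = ArrowLibraryMap.from_msp(
        "library.msp",
        comment_fields={"Filter": (r"@hcd(\d+\.?\d*)", float, "nce")},
    )
    print(len(library), library.getitem_by_row(0).keys())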

static from_parquet(file, columns=None, num=None, combine_chunks=False, filters=None)

create an ArrowLibraryMap from a parquet file

Parameters:
  • file – filename or stream

  • columns – list of columns to read. None=all, []=minimum set

  • num – number of rows

  • combine_chunks – dechunkify the arrow table to allow zero copy

  • filters – parquet predicate as a list of tuples

static from_sdf(file, num=None, skip_expensive=True, max_size=0, id_field=None, min_intensity=0.0, set_probabilities=(0.01, 0.97, 0.01, 0.01), spectrum_type=None)

read in an sdf file and create an ArrowLibraryMap

Parameters:
  • file – filename or stream

  • num – number of rows. None means all

  • skip_expensive – don’t compute fields that are computationally expensive

  • max_size – the maximum bounding box size (used to filter out large molecules; 0 = no bound)

  • id_field – field to use for the mol id, such as NISTNO, ID or _NAME (the sdf title field). if an integer, use the integer as the starting value for an assigned id

  • min_intensity – the minimum intensity to set the fingerprint bit

  • set_probabilities – how to divide into dev, train, valid, test

  • spectrum_type – the type of spectrum file

Returns:

ArrowLibraryMap

get_ids()

get the ids of all records, in row order

Returns:

array of ids

getitem_by_id(key)

get an item from the library by id

Parameters:

key – id

Returns:

the row dict

getitem_by_row(key)

get an item from the library by row number

Parameters:

key – row number

Returns:

the row dict

getrow_by_id(key)

given an id, return the corresponding row number in the table

Parameters:

key – the id

Returns:

the row number of the id in the table

to_arrow()
to_csv(file, columns=None)

Write to csv file, skipping any spectrum column and writing mol columns as canonical SMILES

Parameters:
  • file – filename or file pointer. newline should be set to ‘’

  • columns – list of columns to write out to csv file. If none, all columns

to_mgf(file, spectrum_column=None)

save spectra to mgf file

Parameters:
  • file – filename or file pointer

  • spectrum_column – column name for spectrum

to_mzxml(file, use_id_as_scan=True, spectrum_column=None)

save spectra to mzxml format file

Parameters:
  • file – filename or stream

  • use_id_as_scan – use spectrum.id instead of spectrum.scan

  • spectrum_column – column name for spectrum

to_pandas()
to_parquet(file)

save spectra to parquet file

Parameters:

file – filename or stream

class masskit.utils.tablemap.ListLibraryMap(list_in, spectrum_column=None, *args, **kwargs)

Bases: TableMap

wrapper for a spectral library using python lists

create_dict(idx)

create dict for row

Parameters:

idx – row number

get_ids()

get the ids of all records, in row order

Returns:

array of ids

getitem_by_id(key)

get an item from the library by id

Parameters:

key – id

Returns:

the row dict

getitem_by_row(key)

get an item from the library by row number

Parameters:

key – row number

Returns:

the row dict

getrow_by_id(key)

given an id, return the corresponding row number in the table

Parameters:

key – the id

Returns:

the row number of the id in the table

class masskit.utils.tablemap.PandasLibraryMap(df, *args, **kwargs)

Bases: TableMap

wrapper for a pandas spectral library

create_dict(idx)

create dict for row

Parameters:

idx – row number

get_ids()

get the ids of all records, in row order

Returns:

array of ids

getitem_by_id(key)

get an item from the library by id

Parameters:

key – id

Returns:

the row dict

getitem_by_row(key)

get an item from the library by row number

Parameters:

key – row number

Returns:

the row dict

getrow_by_id(key)

given an id, return the corresponding row number in the table

Parameters:

key – the id

Returns:

the row number of the id in the table

class masskit.utils.tablemap.TableMap(*args, **kwargs)

Bases: ABC

collections.abc.Sequence wrapper for a library. Allows use of different stores, e.g. arrow or pandas

abstract create_dict(idx)

create dict for row

Parameters:

idx – row number

abstract get_ids()

get the ids of all records, in row order

Returns:

array of ids

abstract getitem_by_id(key)

get an item from the library by id

Parameters:

key – id

Returns:

the row dict

abstract getitem_by_row(key)

get an item from the library by row number

Parameters:

key – row number

Returns:

the row dict

abstract getrow_by_id(key)

given an id, return the corresponding row number in the table

Parameters:

key – the id

Returns:

the row number of the id in the table

to_msp(file, annotate_peptide=False, ion_types=None, spectrum_column=None)

write out spectra in msp format

Parameters:
  • file – file or filename to write to

  • annotate_peptide – annotate the spectra as a peptide

  • ion_types – ion types for annotation

  • spectrum_column – column name for spectrum

masskit.utils.tables module

masskit.utils.tables.create_dataset(rows=5, cols=[int, float, str, list], names=array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'], dtype='<U1'))
masskit.utils.tables.is_struct_or_extstruct(chunked_array: ChunkedArray) bool
masskit.utils.tables.optimize_structarray(struct: ChunkedArray) ChunkedArray
masskit.utils.tables.optimize_table(table: Table) Table
masskit.utils.tables.random_string(length)
masskit.utils.tables.row_view(table, idx=0)
masskit.utils.tables.row_view_raw(table, idx=0)
masskit.utils.tables.struct_view(struct_name, parent)
masskit.utils.tables.structarray_to_table(struct: StructArray) Table
masskit.utils.tables.table_add_structarray(table: Table, structarray: StructArray, column_name: str = None) Table

add a struct array to a table

Parameters:
  • table – table to be added to

  • structarray – structarray to add to table

  • column_name – name of column to add

Returns:

new table with appended column

masskit.utils.tables.table_to_structarray(table: Table, structarray_type: ExtensionType = None) StructArray

convert a spectrum table into a struct array. If an ExtensionType is passed in, a struct array of that type will be created

Parameters:
  • table – spectrum table

  • structarray_type – the type of the array returned, e.g. SpectrumArrowType()

Returns:

StructArray
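
A sketch of nesting one table inside another as a struct column. table_to_structarray is documented for spectrum tables, so applying it to a small peaks table here is an assumption, and the column names are illustrative:

    import pyarrow as pa
    from masskit.utils.tables import table_add_structarray, table_to_structarray

    peaks = pa.table({"mz": [100.1, 200.2], "intensity": [50.0, 999.0]})
    struct = table_to_structarray(peaks)   # flat table -> StructArray

    meta = pa.table({"id": [1, 2]})
    combined = table_add_structarray(meta, struct, column_name="peaks")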

Module contents