The interlab Project class
An interlab analysis is logically divided into interlaboratory comparison projects. A project is represented by a Project object, which contains the following items:
- Sample labels that represent the physical objects that have been distributed for measurement
- Dataset labels that identify the origin of each set of measurement results
- Experimental spectral data containing the measurement results on the objects that have been analyzed
- Interspectral distance functions that will be used to calculate the spread of the data and identify outliers
- A distribution function that will be used to estimate outliers. By default, this is a lognormal distribution, but it can be any continuous distribution derived from scipy.stats.rv_continuous.
The intent is that the user will interact primarily with the Project object, loading it with the information needed to conduct the analysis and then using its built-in methods. Documentation for the ExperimentGroup and InterlabArray objects is included separately.
Creating Projects
A blank Project can be instantiated with no arguments:
my_project = interlab.Project()
This creates a Project with no data or metadata of any kind. This information can be loaded later, or it can be provided when the object is created:
my_project = interlab.Project(x_data_list=xdata,
                              Sample_names=Sample_names,
                              Data_set_names=Data_set_names_dict,
                              distance_metrics=distance_metric_list,
                              data=data_dict,
                              rawdata=rawdata_dict,
                              )
When the Project object is created, it will automatically create ExperimentGroup, DistanceMetric, and Population objects for the experiment groups and distance metrics that have been assigned.
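The keyword arguments passed to the constructor might be assembled as in the following sketch. The container layout is an assumption based on the parameter descriptions (dictionaries keyed by sample name); the sample names, lab names, numeric values, and the check_inputs helper are invented for illustration, and the exact shapes that interlab expects should be checked against the package itself.

```python
# Hypothetical inputs; names and values are illustrative only.
Sample_names = ["Sample A", "Sample B"]

# Assumed layout: for each sample, the labs (data sets) that measured it.
Data_set_names_dict = {
    "Sample A": ["Lab 1", "Lab 2", "Lab 3"],
    "Sample B": ["Lab 1", "Lab 3"],
}

# Shared x axis for every spectrum.
xdata = [0.0, 0.5, 1.0, 1.5]

# Assumed layout: for each sample, one spectrum (row) per lab.
data_dict = {
    "Sample A": [
        [0.1, 0.9, 0.4, 0.1],
        [0.1, 0.8, 0.5, 0.1],
        [0.2, 0.9, 0.4, 0.2],
    ],
    "Sample B": [
        [0.3, 0.2, 0.7, 0.1],
        [0.3, 0.3, 0.6, 0.1],
    ],
}


def check_inputs(sample_names, data_set_names, data, x):
    """Return a list of consistency problems (empty if none are found)."""
    problems = []
    for name in sample_names:
        if name not in data_set_names or name not in data:
            problems.append(f"{name}: missing from a dictionary")
            continue
        if len(data_set_names[name]) != len(data[name]):
            problems.append(f"{name}: lab count does not match spectrum count")
        for spectrum in data[name]:
            if len(spectrum) != len(x):
                problems.append(f"{name}: spectrum length does not match x data")
    return problems


problems = check_inputs(Sample_names, Data_set_names_dict, data_dict, xdata)
```

Running a check of this kind before building the Project makes mismatched labels or truncated spectra visible early, before any distances are calculated.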
Defining distance metrics
The distance metrics are defined by a list of dictionaries. Each dictionary must contain a metric entry, which holds the name of the metric as a text string, and a function entry, which holds the function used to compute it. The function must be either a callable that accepts two inputs or a string that is recognized by scipy.spatial.distance.pdist(). The following are two examples:
jeffries = r'Symmetric Kullback-Leibler'
mahalanobis = r'Mahalanobis'
nmr_distance_metrics = [
    # 'mahalanobis' is recognized by pdist()
    dict(metric=mahalanobis, function='mahalanobis'),
    # interlab.jeffries is a distance included in this package
    dict(metric=jeffries, function=interlab.jeffries),
]
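A custom metric only needs to be a callable that takes two spectra and returns a number. The sketch below is a plain-Python symmetric Kullback-Leibler divergence written for illustration. It is not the implementation of interlab.jeffries, and the function name symmetric_kl, the normalization step, and the example spectra are assumptions.

```python
import math


def symmetric_kl(p, q):
    """Symmetric Kullback-Leibler divergence between two positive spectra.

    Both inputs are normalized to unit sum first (an assumption made here
    so that they can be treated as discrete probability distributions).
    """
    p_total, q_total = sum(p), sum(q)
    divergence = 0.0
    for p_i, q_i in zip(p, q):
        p_n, q_n = p_i / p_total, q_i / q_total
        # KL(p||q) + KL(q||p) simplifies to sum((p - q) * log(p / q)).
        divergence += (p_n - q_n) * math.log(p_n / q_n)
    return divergence


# A metric list with the same metric/function entries as the examples above.
custom_metrics = [dict(metric="Symmetric Kullback-Leibler (sketch)",
                       function=symmetric_kl)]

spectrum_a = [0.1, 0.6, 0.3]
spectrum_b = [0.2, 0.5, 0.3]
distance = custom_metrics[0]["function"](spectrum_a, spectrum_b)
```

This form requires strictly positive intensities, because the logarithm is undefined where either spectrum is zero or negative.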
Project workflow
Once the project has been created with the data and metadata needed for the analysis, the basic workflow is as follows:
my_project.process_mahalanobis()  # Calculates the mean and covariance of the samples; only needed if the Mahalanobis distance is included in the project
my_project.set_distances()
my_project.fit_zscores()
my_project.find_outliers()
my_project.extract_matrices()
my_project.find_lab_outliers()
Apart from the optional process_mahalanobis() call, this will, in order:
- Calculate the interspectral distances
- Fit the project’s distribution function to the distance data and calculate the corresponding scores.
- Identify outliers within each spectral population
- Conduct a principal components analysis on the scores and compute the projected statistical distance
- Use the projected statistical distance to determine the data set outliers.
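The first three steps in this list can be illustrated with a simplified, standard-library stand-in. This is not how interlab implements them: the Euclidean metric, the lognormal fit obtained from the mean and standard deviation of the log distances, the invented spectra, and the cutoff of 1.5 are all assumptions chosen to keep the sketch short.

```python
import math
import statistics

# Hypothetical spectra for one sample, one per lab; "Lab E" is distorted.
spectra = {
    "Lab A": [0.10, 0.90, 0.40, 0.10],
    "Lab B": [0.12, 0.88, 0.42, 0.10],
    "Lab C": [0.09, 0.91, 0.39, 0.11],
    "Lab D": [0.11, 0.89, 0.41, 0.09],
    "Lab E": [0.40, 0.50, 0.70, 0.30],
}
labs = list(spectra)

# Step 1: average interspectral distance from each lab to all the others.
mean_distance = {
    lab: statistics.mean(math.dist(spectra[lab], spectra[other])
                         for other in labs if other != lab)
    for lab in labs
}

# Step 2: fit a lognormal by taking the mean and standard deviation of the
# log distances, then score each lab against that fit.
log_distances = [math.log(d) for d in mean_distance.values()]
mu = statistics.mean(log_distances)
sigma = statistics.stdev(log_distances)
zscores = {lab: (math.log(d) - mu) / sigma for lab, d in mean_distance.items()}

# Step 3: flag labs whose score exceeds an illustrative cutoff.
cutoff = 1.5
outliers = [lab for lab, z in zscores.items() if z > cutoff]
```

With only five labs, a score computed from the sample standard deviation cannot exceed 4/sqrt(5), about 1.79, which is why the cutoff in this sketch is low. A real analysis with more data sets would use a cutoff suited to its own population size.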
Documentation
Method summary
Analysis functions
Project([data, rawdata, distance_metrics, …]) | The top-level project class for the interlaboratory comparison module
Project.set_distances() | Calculates the interspectral distances for each experiment group and metric
Project.fit_zscores() | Fits the sample-level zscores for each experiment group and metric.
Project.find_outliers(**kwargs) | Finds the sample outliers for each experiment group and metric.
Project.extract_matrices(**kwargs) | Runs extract_experimental_matrix() for each experimental matrix
Project.extract_experimental_matrix([sets, …]) | Extracts the zscore data from the dict-of-vectors format and casts it as a 2D array.
Project.fit_lab_zscores() | Fits the lab-level zscores for each metric.
Project.find_lab_outliers(**kwargs) | Finds the lab outliers for each metric.
Plotting functions
Project.plot_distance_fig([plot_range, …]) | For each sample, generates a plot of the spectra from each laboratory and, for each metric, a heat map of the interspectral distance matrix
Project.plot_zscore_distances(metric[, …]) | Plots a bar chart of the average interspectral distance for each sample, annotated with the generalized Z score for each sample
Project.plot_histograms(metric[, …]) | Plots a histogram of the average interspectral distance for each sample, along with the corresponding fit
Project.plot_zscore_outliers(metric[, …]) | Plots the principal component scores for each lab along with the final distribution used to calculate the outliers
Project.plot_projected_zscores([…]) | Plots the projected statistical distances annotated with the corresponding laboratory-level Z scores.
Project.plot_zscore_loadings([…]) | Plots the principal component loadings for the statistical distances
Full documentation
class project.Project(data=None, rawdata=None, distance_metrics=None, Sample_names=None, Data_set_names=None, x_data_list=None, range_to_use=None, distribution_function=<scipy.stats._continuous_distns.lognorm_gen object>, outlier_dist=None)
The top-level project class for the interlaboratory comparison module
Key Sample_names: List of sample names, used as keys for the dictionaries of data set names and of processed and raw data. Each key in this list will correspond to an ExperimentGroup object
Key Data_set_names: Dictionary of data sets (labs) with data for each sample
Key data: Dictionary of data to be used for the interlab analysis
Key rawdata: Dictionary of unprocessed data, if different from data
Key distance_metrics: List of distance metrics. Each metric in this list will be used to create a DistanceMetric object within each ExperimentGroup object
Key x_data_list: The list of x data in the data array. For 2D data, this is not used
Key range_to_use: Used to screen certain parts of the spectral data from consideration in the experimental comparison
Key distribution_function: Which distribution will be assumed when assigning Z scores to each measurement of a sample. The default is scipy.stats.lognorm
Key outlier_dist: Which distribution will be assumed when detecting outliers. The default is the same as distribution_function
set_distances()
Calculates the interspectral distances for each experiment group and metric
extract_experimental_matrix()
Extracts the zscore data from the dict-of-vectors format and casts it as a 2D array.
The dictionary of sample-level z-scores is recast as an array, with one dimension corresponding to sample names and the other corresponding to laboratory.
Key sets: Sets to extract for the interlab comparison
Key metric: Distance metric that will be used
Key screen_outliers: Whether to remove outlier measurements before imputing missing values
Key imputation_axis: Axis along which to impute missing values
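Recasting a dict-of-vectors into a 2D array and filling the gaps can be sketched as follows. The nested-dictionary layout, the helper names to_matrix and impute_column_means, and the use of the per-sample mean to fill missing entries are assumptions made for illustration, not the package's actual behavior.

```python
import math

# Hypothetical sample-level z-scores: {sample: {lab: zscore}}.
zscores = {
    "Sample A": {"Lab 1": 0.2, "Lab 2": -0.5, "Lab 3": 1.9},
    "Sample B": {"Lab 1": -0.1, "Lab 3": 2.3},  # Lab 2 did not report
}


def to_matrix(zscore_dict):
    """Return (labs, samples, matrix): one row per lab, one column per sample."""
    samples = list(zscore_dict)
    labs = sorted({lab for scores in zscore_dict.values() for lab in scores})
    matrix = [[zscore_dict[s].get(lab, math.nan) for s in samples]
              for lab in labs]
    return labs, samples, matrix


def impute_column_means(matrix):
    """Replace NaN entries with the mean of the observed values in that column."""
    filled = [row[:] for row in matrix]
    for col in range(len(matrix[0])):
        observed = [row[col] for row in matrix if not math.isnan(row[col])]
        col_mean = sum(observed) / len(observed)
        for row in filled:
            if math.isnan(row[col]):
                row[col] = col_mean
    return filled


labs, samples, matrix = to_matrix(zscores)
filled = impute_column_means(matrix)
```

Here the missing value for Lab 2 is filled from the other labs' scores on the same sample. Imputing along the other axis would instead average that lab's scores across the samples it did report, which is the kind of choice an imputation axis setting controls.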
plot_distance_fig()
For each sample, generates the following plots:
- A plot of the spectra generated for that sample by each laboratory
- For each metric, a heat map plot of the interspectral distance matrix
Key plot_range: An iterable of integers specifying which sample labels to plot
Key cmap: The color map that will be used for the distance heat maps
Key linecolor: The line color that will be used for the spectral data
Key distance_metrics: A list of the distance metrics for which heat maps will be plotted. If None, plot heat maps for all metrics in this project
Key plot_data: Boolean that tells whether the raw spectral data will be plotted
Key wspace: Horizontal spacing between the heat maps
Key ylabel_buffer: Space allocated for the y axis label (in inches)
Key rightlabel_buffer: Space allocated for the colorbar label (in inches)
Key xlabel_buffer: Space allocated for the x axis label (in inches)
Returns: distance_fig, the distance measure figure matplotlib object.
plot_zscore_distances()
Plots a bar chart of the average interspectral distance for each sample, annotated with the generalized Z score for each sample
Parameters: metric – The metric for which the distances will be plotted
Key plot_range: An iterable of integers specifying which sample labels to plot
Key numcols: The number of columns in the distance plot
Key xlabel_buffer: Space allocated for x-axis labels (in inches)
Key ylabel_buffer: Space allocated for y-axis labels (in inches)
Key rotation: Specifies the orientation of the z-score labels for individual labs
Returns: zscorefig, the distances and scores plot as a matplotlib figure object
plot_histograms()
Plots a histogram of the average interspectral distance for each sample, along with the corresponding fit
Parameters: metric – The metric for which the distances will be plotted
Key plot_range: An iterable of integers specifying which sample labels to plot
Key numcols: The number of columns in the distance plot
Key xlabel_buffer: Space allocated for x-axis labels (in inches)
Key rotation: Specifies the orientation of the z-score labels for individual labs
Returns: pdffig, the distances and scores plot as a matplotlib figure object
plot_zscore_outliers()
Plots the principal component scores for each lab along with the final distribution used to calculate the outliers
Parameters: metric – The metric used to calculate the interspectral distances
Key y_component: Which principal component to use on the Y axis, if not the first
Key text: Whether to label the plot with the name of the
Returns: zscore_outliers_fig, the Z score outlier plot as a matplotlib figure object
plot_projected_zscores()
Plots the projected statistical distances annotated with the corresponding laboratory-level Z scores.
Key distance_metrics: A list of the distance metrics for which statistical distances will be plotted. If None, plot statistical distances for all metrics in this project
Key xlabel_buffer: Space allocated for x-axis labels (in inches)
Key rotation: Specifies the orientation of the z-score labels for individual labs
Returns: zscorefig, the projected statistical distances plot as a matplotlib figure object
plot_zscore_loadings()
Plots the principal component loadings for the statistical distances
Key distance_metrics: A list of the distance metrics for which loadings will be plotted. If None, plot loadings for all metrics in this project
Key xlabel_buffer: Space allocated for x-axis labels (in inches)
Returns: loadfig, the projected statistical loadings plot as a matplotlib figure object
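The principal component scores and loadings that the plotting methods above refer to can be illustrated with a small example. The sketch below finds the first principal component of an invented lab-by-sample matrix of z-scores by power iteration, using only the standard library. The matrix, the helper name first_principal_component, and the choice of power iteration are assumptions; interlab's own decomposition and its definition of the projected statistical distance may differ.

```python
import math

# Hypothetical matrix of z-scores (rows: labs, columns: samples).
labs = ["Lab 1", "Lab 2", "Lab 3", "Lab 4"]
zscore_matrix = [
    [0.1, -0.2, 0.0],
    [-0.3, 0.1, 0.2],
    [0.2, 0.0, -0.1],
    [2.5, 2.1, 2.8],  # consistently high scores
]


def first_principal_component(rows, iterations=100):
    """Return (loadings, scores) for the first principal component."""
    n_rows, n_cols = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in rows]
    # Sample covariance matrix of the columns.
    cov = [[sum(c[i] * c[j] for c in centered) / (n_rows - 1)
            for j in range(n_cols)] for i in range(n_cols)]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to its dominant eigenvector, which holds the loadings.
    loadings = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        product = [sum(cov[i][j] * loadings[j] for j in range(n_cols))
                   for i in range(n_cols)]
        norm = math.sqrt(sum(value * value for value in product))
        loadings = [value / norm for value in product]
    # Scores are the centered rows projected onto the loadings.
    scores = [sum(c[j] * loadings[j] for j in range(n_cols)) for c in centered]
    return loadings, scores


loadings, scores = first_principal_component(zscore_matrix)
most_extreme_lab = labs[max(range(len(labs)), key=lambda i: abs(scores[i]))]
```

In this example the loadings show how much each sample contributes to the dominant direction of variation, and the scores place each lab along that direction, so a lab that is consistently far from the others stands out with the largest absolute score.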