The interlab Project class

An interlab analysis is organized into interlaboratory comparison projects. Each project is represented by a Project object, which contains the following items:

  • Sample labels that represent the physical objects that have been distributed for measurement
  • Dataset labels that identify the origin of each set of measurement results
  • Experimental spectral data containing the measurement results on the objects that have been analyzed
  • Interspectral distance functions that will be used to calculate the spread of the data and identify outliers
  • A distribution function that will be used to estimate outliers. By default, this is a lognormal distribution, but it can be any continuous distribution derived from scipy.stats.rv_continuous.

The intent is that the user will interact primarily with the Project object, loading it with the information needed to conduct the analysis and then using its built-in methods. Documentation for the ExperimentGroup and InterlabArray objects is included separately.

Creating Projects

A blank Project can be instantiated with no arguments:

my_project = interlab.Project()

This creates a Project with no data or metadata of any kind. This information can be loaded later, or it can be provided when the Project is instantiated:

my_project = interlab.Project(x_data_list=xdata,
                              Sample_names=Sample_names,
                              Data_set_names=Data_set_names_dict,
                              distance_metrics=distance_metric_list,
                              data=data_dict,
                              rawdata=rawdata_dict,
                              )

When the Project object is created, it will automatically create ExperimentGroup, DistanceMetric, and Population objects for the experiment groups and distance metrics that have been assigned.
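
As an illustration, the constructor inputs could be organized as in the sketch below. The layout is an assumption inferred from the parameter descriptions (sample names as dictionary keys, one spectrum per data set), not a confirmed specification. The sample names, lab names, and the check_inputs helper are hypothetical and are not part of interlab.

```python
# Hypothetical inputs for a two-sample, three-lab comparison.
# ASSUMPTION: each dictionary is keyed by sample name, and data holds one
# spectrum (row) per data set, in the same order as Data_set_names.
Sample_names = ["Sample A", "Sample B"]

Data_set_names_dict = {
    "Sample A": ["Lab 1", "Lab 2", "Lab 3"],
    "Sample B": ["Lab 1", "Lab 3"],  # Lab 2 did not measure Sample B
}

xdata = [0.0, 0.5, 1.0, 1.5]  # shared x axis for every spectrum

data_dict = {
    "Sample A": [[1.0, 2.0, 3.0, 1.0],
                 [1.1, 2.1, 2.9, 1.0],
                 [0.9, 1.9, 3.1, 1.1]],
    "Sample B": [[2.0, 1.0, 0.5, 0.2],
                 [2.1, 0.9, 0.6, 0.2]],
}


def check_inputs(sample_names, data_set_names, data, x_data):
    """Return a list of inconsistencies between the inputs (hypothetical helper)."""
    problems = []
    for sample in sample_names:
        if sample not in data_set_names or sample not in data:
            problems.append(f"{sample}: missing from a dictionary")
            continue
        labs, spectra = data_set_names[sample], data[sample]
        if len(labs) != len(spectra):
            problems.append(f"{sample}: {len(labs)} labels for {len(spectra)} spectra")
        for lab, spectrum in zip(labs, spectra):
            if len(spectrum) != len(x_data):
                problems.append(f"{sample}/{lab}: spectrum length differs from x data")
    return problems


problems = check_inputs(Sample_names, Data_set_names_dict, data_dict, xdata)
```

Running a check of this kind before building the Project catches mismatched labels and spectra early, before any distances are calculated.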

Defining distance metrics

The distance metrics are defined by a list of dictionaries. Each dictionary must contain two entries: metric, the name of the metric as a text string, and function, the function used to compute it. The function must be either a callable that accepts two inputs or a string that is recognized by scipy.spatial.distance.pdist(). The following are two examples:

jeffries = 'Symmetric Kullback-Leibler'
mahalanobis = 'Mahalanobis'
nmr_distance_metrics = [
    # 'mahalanobis' is recognized by pdist()
    dict(metric=mahalanobis, function='mahalanobis'),
    # interlab.jeffries is a distance included in this package
    dict(metric=jeffries, function=interlab.jeffries),
]
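
A user-supplied callable only needs to accept two spectra and return a number. The sketch below shows one possible symmetric Kullback-Leibler divergence written that way. It is not the implementation of interlab.jeffries: the normalization to unit sum, the clipping constant eps, and the convention KL(p||q) + KL(q||p) are assumptions made for illustration.

```python
import math


def symmetric_kl(u, v, eps=1e-12):
    """Symmetric Kullback-Leibler divergence between two non-negative spectra.

    ASSUMPTION: spectra are normalized to unit sum and the result is
    KL(p||q) + KL(q||p); interlab.jeffries may use a different convention.
    """
    total_u, total_v = sum(u), sum(v)
    distance = 0.0
    for a, b in zip(u, v):
        p = max(a / total_u, eps)  # clip to avoid log(0)
        q = max(b / total_v, eps)
        distance += (p - q) * math.log(p / q)
    return distance


# The callable takes exactly two inputs, as the metric dictionary requires.
custom_metrics = [dict(metric="Symmetric KL (sketch)", function=symmetric_kl)]

d = symmetric_kl([0.5, 0.5], [0.25, 0.75])
```

Because the function is symmetric in its arguments and returns zero for identical spectra, it behaves like the other distance measures in the list.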

Project workflow

Once the project has been created with the basic data and metadata needed for the analysis, the basic workflow is as follows:

# Calculates the mean and covariance of the samples; only needed if the
# Mahalanobis distance is included in the project
my_project.process_mahalanobis()
my_project.set_distances()
my_project.fit_zscores()
my_project.find_outliers()
my_project.extract_matrices()
my_project.find_lab_outliers()

This will, in order:

  • Calculate the mean and covariance of the samples (only if the Mahalanobis distance is used)
  • Calculate the interspectral distances
  • Fit the project’s distribution function to the distance data and calculate the corresponding scores
  • Identify outliers within each spectral population
  • Conduct a principal components analysis on the scores and compute the projected statistical distance
  • Use the projected statistical distance to determine the data set outliers
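
The distance, fitting, and outlier steps can be pictured without the package. The sketch below is not interlab's implementation. It assumes Euclidean distance, summarizes each lab by its average distance to the other labs, and fits the lognormal from the mean and standard deviation of the log-distances rather than through a scipy.stats fit. The function names, lab names, and threshold are hypothetical.

```python
import math
import statistics


def average_distances(spectra):
    """Average Euclidean distance from each lab's spectrum to all the others."""
    return {
        lab: statistics.fmean(
            math.dist(spectrum, other)
            for other_lab, other in spectra.items()
            if other_lab != lab
        )
        for lab, spectrum in spectra.items()
    }


def lognormal_zscores(distances):
    """Z scores of the log-distances (a log-moment fit of a lognormal)."""
    logs = {lab: math.log(d) for lab, d in distances.items()}
    values = list(logs.values())
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    return {lab: (value - mu) / sigma for lab, value in logs.items()}


# Five hypothetical labs measuring the same sample; Lab E deviates strongly.
spectra = {
    "Lab A": [1.0, 2.0, 3.0],
    "Lab B": [1.1, 2.0, 3.0],
    "Lab C": [1.0, 2.1, 3.0],
    "Lab D": [1.0, 2.0, 3.1],
    "Lab E": [1.0, 2.0, 6.0],
}

distances = average_distances(spectra)
zscores = lognormal_zscores(distances)

# Flag labs whose score exceeds a hypothetical one-sided threshold.
threshold = 1.5
outliers = [lab for lab, z in zscores.items() if z > threshold]
```

With only a handful of labs the largest attainable Z score is bounded by the sample size, so the threshold here is deliberately low. A real analysis would choose it from the fitted distribution.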

Documentation

Method summary

Analysis functions

Project([data, rawdata, distance_metrics, …]) The top-level project class for the interlaboratory comparison module
Project.set_distances() Calculates the interspectral distances for each experiment group and metric
Project.fit_zscores() Fits the sample-level zscores for each experiment group and metric.
Project.find_outliers(**kwargs) Finds the sample outliers for each experiment group and metric.
Project.extract_matrices(**kwargs) Runs extract_experimental_matrix() for each experimental matrix
Project.extract_experimental_matrix([sets, …]) Extracts the zscore data from the dict-of-vectors format and casts it as a 2D array.
Project.fit_lab_zscores() Fits the lab-level zscores for each metric.
Project.find_lab_outliers(**kwargs) Finds the lab outliers for each metric.

Plotting functions

Project.plot_distance_fig([plot_range, …]) For each sample, plots the spectra generated by each laboratory and, for each metric, a heat map of the interspectral distance matrix
Project.plot_zscore_distances(metric[, …]) Plots a bar chart of the average interspectral distance for each sample, annotated with the generalized Z score for each sample
Project.plot_histograms(metric[, …]) Plots a histogram of the average interspectral distance for each sample, along with the corresponding fit
Project.plot_zscore_outliers(metric[, …]) Plots the principal component scores for each lab along with the final distribution used to calculate the outliers
Project.plot_projected_zscores([…]) Plots the projected statistical distances annotated with the corresponding laboratory-level Z scores.
Project.plot_zscore_loadings([…]) Plots the principal component loadings for the statistical distances

Full documentation

class project.Project(data=None, rawdata=None, distance_metrics=None, Sample_names=None, Data_set_names=None, x_data_list=None, range_to_use=None, distribution_function=scipy.stats.lognorm, outlier_dist=None)

The top-level project class for the interlaboratory comparison module

Key Sample_names: List of sample names, used as keys for the dictionaries of data set names and of processed and raw data. Each key in this list corresponds to an ExperimentGroup object
Key Data_set_names: Dictionary of data sets (labs) with data for each sample
Key data: Dictionary of data to be used for the interlab analysis
Key rawdata: Dictionary of unprocessed data, if different from data
Key distance_metrics: List of distance metrics. Each metric in this list will be used to create a DistanceMetric object within each ExperimentGroup object
Key x_data_list: The list of x data in the data array. For 2D data, this is not used
Key range_to_use: Used to screen certain parts of the spectral data from consideration in the experimental comparison
Key distribution_function: Which distribution will be assumed when assigning Z scores to each measurement of a sample. The default is scipy.stats.lognorm
Key outlier_dist: Which distribution will be assumed when detecting outliers. The default is the same as distribution_function

set_distances()

Calculates the interspectral distances for each experiment group and metric

fit_zscores()

Fits the sample-level zscores for each experiment group and metric.

find_outliers()

Finds the sample outliers for each experiment group and metric.

extract_matrices()

Runs extract_experimental_matrix() for each experimental matrix

extract_experimental_matrix()

Extracts the zscore data from the dict-of-vectors format and casts it as a 2D array.

The dictionary of sample-level z-scores is recast as an array, with one dimension corresponding to sample names and the other corresponding to laboratory.

Key sets: Sets to extract for the interlab comparison
Key metric: Distance metric that will be used
Key screen_outliers: Whether to remove outlier measurements before imputing missing values
Key imputation_axis: Axis along which to impute missing values
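
The recasting and imputation can be pictured with the sketch below. It is an illustration under stated assumptions, not the package's code: the input layout (sample name mapped to a dictionary of lab scores), the use of None for missing measurements, and mean imputation are hypothetical choices, as are the function names.

```python
import statistics


def to_matrix(zscores, samples, labs):
    """Cast {sample: {lab: z}} to rows=labs, columns=samples; missing -> None."""
    return [[zscores[s].get(lab) for s in samples] for lab in labs]


def impute_mean(matrix, axis=0):
    """Replace None with the mean of the observed values.

    axis=0 averages down each column (per sample, across labs);
    axis=1 averages along each row (per lab, across samples).
    Raises statistics.StatisticsError if a row or column has no observed values.
    """
    if axis == 1:
        return [
            [statistics.fmean(v for v in row if v is not None) if x is None else x
             for x in row]
            for row in matrix
        ]
    # Transpose, impute along rows, then transpose back.
    transposed = [list(col) for col in zip(*matrix)]
    return [list(row) for row in zip(*impute_mean(transposed, axis=1))]


samples = ["S1", "S2"]
labs = ["Lab 1", "Lab 2", "Lab 3"]
zscores = {
    "S1": {"Lab 1": 0.5, "Lab 2": -0.5, "Lab 3": 1.5},
    "S2": {"Lab 1": 1.0, "Lab 3": 2.0},  # Lab 2 did not measure S2
}

matrix = to_matrix(zscores, samples, labs)
filled = impute_mean(matrix, axis=0)
```

Imputing along the sample axis fills a gap with what other labs saw for that sample, while imputing along the lab axis fills it with that lab's typical score. The two choices can lead to different lab-level results.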
fit_lab_zscores()

Fits the lab-level zscores for each metric.

find_lab_outliers()

Finds the lab outliers for each metric.

plot_distance_fig()
For each sample, generates the following plots:
  • A plot of the spectra generated for that sample by each laboratory
  • For each metric, a heat map plot of the interspectral distance matrix
Key plot_range: An iterable of integers specifying which sample labels to plot
Key cmap: The color map that will be used for the distance heat maps
Key linecolor: The line color that will be used for the spectral data
Key distance_metrics: A list of the distance metrics for which heat maps will be plotted. If None, plot heat maps for all metrics in this project
Key plot_data: Boolean that tells whether the raw spectral data will be plotted
Key wspace: Horizontal spacing between the heat maps
Key ylabel_buffer: Space allocated for the y axis label (in inches)
Key rightlabel_buffer: Space allocated for the colorbar label (in inches)
Key xlabel_buffer: Space allocated for the x axis label (in inches)
Returns: distance_fig, the distance measure figure as a matplotlib object.

plot_zscore_distances()

Plots a bar chart of the average interspectral distance for each sample, annotated with the generalized Z score for each sample

Parameters: metric – The metric for which the distances will be plotted
Key plot_range: An iterable of integers specifying which sample labels to plot
Key numcols: The number of columns in the distance plot
Key xlabel_buffer: Space allocated for x-axis labels (in inches)
Key ylabel_buffer: Space allocated for y-axis labels (in inches)
Key rotation: Specifies the orientation of the z-score labels for individual labs
Returns: zscorefig, the distances and scores plot as a matplotlib figure object

plot_histograms()

Plots a histogram of the average interspectral distance for each sample, along with the corresponding fit

Parameters: metric – The metric for which the distances will be plotted
Key plot_range: An iterable of integers specifying which sample labels to plot
Key numcols: The number of columns in the distance plot
Key xlabel_buffer: Space allocated for x-axis labels (in inches)
Key rotation: Specifies the orientation of the z-score labels for individual labs
Returns: pdffig, the distances and scores plot as a matplotlib figure object

plot_zscore_outliers()

Plots the principal component scores for each lab along with the final distribution used to calculate the outliers

Parameters: metric – The metric used to calculate the interspectral distances
Key y_component: Which principal component to use on the Y axis, if not the first
Key text: Whether to label the plot with the name of the
Returns: zscore_outliers_fig, the Z score outlier plot as a matplotlib figure object

plot_projected_zscores()

Plots the projected statistical distances annotated with the corresponding laboratory-level Z scores.

Key distance_metrics: A list of the distance metrics for which statistical distances will be plotted. If None, plot statistical distances for all metrics in this project
Key xlabel_buffer: Space allocated for x-axis labels (in inches)
Key rotation: Specifies the orientation of the z-score labels for individual labs
Returns: zscorefig, the projected statistical distances plot as a matplotlib figure object

plot_zscore_loadings()

Plots the principal component loadings for the statistical distances

Key distance_metrics: A list of the distance metrics for which loadings will be plotted. If None, plot loadings for all metrics in this project
Key xlabel_buffer: Space allocated for x-axis labels (in inches)
Returns: loadfig, the projected statistical loadings plot as a matplotlib figure object