GPR utilities (active_utils)

Functions:

get_logweights(bias)

Given values of the biasing potential for each configuration, calculates the weights for averaging over those configurations for the biased ensemble so that the average represents the unbiased ensemble.

input_GP_from_state(state[, n_rep, log_scale])

Builds input for GP model up to specified order from ExtrapModel object of thermoextrap.

make_matern_expr(p)

Creates a sympy expression for the Matern kernel of order p.

make_rbf_expr([n_dims])

Creates a sympy expression for an RBF kernel.

make_rbf_expr_old()

Creates a sympy expression for an RBF kernel.

make_poly_expr(p)

Creates a sympy expression for a polynomial kernel.

create_base_GP_model(gpr_data[, ...])

Creates just the base GP model without any training; this only sets up sympy and GPflow.

train_GPR(gpr[, record_loss, start_params])

Trains a given gpr model for n_opt steps.

create_GPR(state_list[, log_scale, ...])

Generates and trains a GPR model based on a list of ExtrapModel objects or a StateCollection object from thermoextrap.

active_learning(init_states, sim_wrapper, ...)

Continues adding new points with active learning by running simulations until the specified tolerance is reached or the maximum number of iterations is achieved.

Classes:

DataWrapper(sim_info_files, cv_bias_files, beta)

Class to keep track of metadata around the data.

SimWrapper(sim_func, struc_name, sys_name, ...)

Wrapper around simulations to spawn similar simulations easily and keep track of all parameter values.

RBFDerivKernel(**kwargs)

For convenience, create a derivative kernel specific to RBF function.

ChangeInnerOuterRBFDerivKernel([c1, c2])

Implements a change-points kernel via logistic switching functions (as in GPflow's ChangePoints kernel), but only for two points, where two instead of three kernels are utilized: one for the outer region and one for the inner.

UpdateStopABC([d_order_pred, ...])

Class that forms basis for both update and stopping criteria classes, which both need to define transformation functions and create grids of alpha values.

UpdateFuncBase([show_plot, save_plot, ...])

Base update function class defining structure and implementing basic methods.

UpdateALMbrute(**kwargs)

Performs active learning with a GPR to select new location for performing simulation.

UpdateRandom(**kwargs)

Select point randomly along a grid based on previously sampled points.

UpdateSpaceFill(**kwargs)

Select point as far as possible from previously sampled points.

UpdateAdaptiveIntegrate([tol])

Select point as far as possible from previously sampled points, but within specified error tolerance based on model relative uncertainty predictions.

UpdateALCbrute(**kwargs)

EXPERIMENTAL! MAY BE USEFUL IN FUTURE WORK, BUT NOT NOW!

MetricBase(name, tol)

Base class for structure of metrics used for stopping criteria.

MaxVar(tol[, name])

Metric based on maximum variance of GP output.

AvgVar(tol[, name])

Metric based on average variance of GP output.

MaxRelVar(tol[, threshold, name])

Metric based on maximum relative variance of GP output (actually std).

MaxRelGlobalVar(tol[, name])

Metric based on maximum ratio of GP output variance to variance of data input to the GP (actually ratio of std devs).

AvgRelVar(tol[, threshold, name])

Metric based on average relative variance of GP output.

MSD(tol[, name])

Metric based on mean squared deviation between GP model outputs.

MaxAbsRelDeviation(tol[, threshold, name])

Metric based on maximum absolute relative deviation between GP model outputs.

MaxAbsRelGlobalDeviation(tol[, name])

Metric based on maximum absolute deviation between GP model outputs divided by the std of the data.

AvgAbsRelDeviation(tol[, threshold, name])

Metric based on average absolute relative deviation between GP model outputs.

ErrorStability(tol[, name])

Implements the stopping metric introduced by Ishibashi and Hino (2021).

MaxIter([name])

Metric that always returns False so that the maximum number of iterations is reached.

StopCriteria(metric_funcs, **kwargs)

Class that calculates metrics used to determine stopping criteria for active learning.

thermoextrap.gpr_active.active_utils.get_logweights(bias)

Given values of the biasing potential for each configuration, calculates the weights for averaging over those configurations for the biased ensemble so that the average represents the unbiased ensemble.
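As a rough illustration of the reweighting idea, the sketch below assumes the bias values are already expressed in units of kT, so that the unnormalized weight of configuration i is exp(bias_i). The names log_weights_sketch and reweighted_average are hypothetical and are not the library implementation; the sketch only shows how normalized log-weights can be computed stably with a log-sum-exp.

```python
import math


def log_weights_sketch(bias):
    """Return normalized log-weights for frames sampled with a bias.

    Assumption: `bias` holds the biasing potential in units of kT, so the
    unnormalized weight of frame i is exp(bias[i]).
    """
    b_max = max(bias)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    log_norm = b_max + math.log(sum(math.exp(b - b_max) for b in bias))
    return [b - log_norm for b in bias]


def reweighted_average(values, bias):
    """Unbiased-ensemble average of `values` sampled from the biased ensemble."""
    logw = log_weights_sketch(bias)
    return sum(math.exp(lw) * v for lw, v in zip(logw, values))
```

With a constant bias the weights are uniform, so reweighted_average reduces to the plain mean.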

thermoextrap.gpr_active.active_utils.input_GP_from_state(state, n_rep=100, log_scale=False)

Builds input for GP model up to specified order from ExtrapModel object of thermoextrap. If log_scale, adjust x inputs and derivatives to reflect taking the logarithm of x.

Parameters:
  • state (ExtrapModel) – object containing derivative information

  • n_rep (int, default 100) – Number of bootstrap draws of data to perform to compute variances

  • log_scale (bool, default False) – Whether or not to apply a log scale in the input locations (i.e., compute derivatives of dy/dlog(x) instead of dy/dx)

Returns:

  • x_data (object) – input locations, likely state points (e.g., temperature, pressure, etc.), augmented with derivative order of each observation, as required for GP models

  • y_data (object) – Output data, which includes both function values and derivative information

  • cov_data (object) – covariance matrix between function observations, including derivative observations; note that this will be block-diagonal since it is expected that the state objects at different conditions are based on information from independent simulations, while the observations at different derivative orders from a single simulation are correlated
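The layout of the three outputs described above can be sketched without thermoextrap. In this minimal example, each state is a hypothetical tuple (x, derivs, cov); the function name assemble_gp_inputs and the tuple layout are assumptions made for illustration, not the library's data structures.

```python
def assemble_gp_inputs(states):
    """Stack per-state observations into x, y, and a block-diagonal covariance.

    Each entry of `states` is a hypothetical tuple (x, derivs, cov) where
    derivs[k] is the k-th derivative estimate at x and cov is the
    len(derivs) x len(derivs) covariance between those estimates.
    """
    x_data, y_data = [], []
    for x, derivs, _ in states:
        for order, value in enumerate(derivs):
            x_data.append((x, order))  # location augmented with derivative order
            y_data.append(value)
    n = len(y_data)
    cov_data = [[0.0] * n for _ in range(n)]
    offset = 0
    for _, derivs, cov in states:
        m = len(derivs)
        for i in range(m):
            for j in range(m):
                # Only observations from the same simulation are correlated.
                cov_data[offset + i][offset + j] = cov[i][j]
        offset += m
    return x_data, y_data, cov_data


states = [
    (300.0, [1.0, -0.2], [[0.04, 0.01], [0.01, 0.09]]),
    (350.0, [0.5, -0.1], [[0.02, 0.00], [0.00, 0.05]]),
]
x_data, y_data, cov_data = assemble_gp_inputs(states)
```

Entries of cov_data that couple the two states stay at zero, which is the block-diagonal structure noted for independent simulations.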

class thermoextrap.gpr_active.active_utils.DataWrapper(sim_info_files, cv_bias_files, beta, x_files=None, n_frames=10000, u_col=2, cv_cols=None, x_col=None)

Bases: object

Class to keep track of metadata around the data. Data is not stored here; instead, this class defines how data is loaded and processed. To change the column indices, call get_data and then build_state in two separate steps rather than calling build_state alone. Multiple files are handled, but only under the assumption that all biases are fixed after the initial simulation. If the biases change, MBAR is needed to reweight everything. This is feasible, but difficult when trying to provide all samples to the GPR so that it can bootstrap. Once all derivatives and uncertainties can be provided directly, this can switch over to using MBAR and computeExpectations.

Parameters:
  • sim_info_files (list of str) – list of files containing simulation information, such as potential energy timeseries

  • cv_bias_files (list of str) – list of files containing the collective variable, or quantity of interest for the active learning procedure; a bias along this quantity or another CV may also be included for a simulation with enhanced sampling

  • beta (float) – the reciprocal temperature (1/(kB*T)) of the simulations in this data set

  • x_files (list of str, optional) – the files containing the quantity of interest for the active learning procedure; default is to assume this is the CV in cv_bias_files and that these are not necessary

  • n_frames (int, default 10000) – number of frames from the END of simulation(s), or the files read in, to use for computations; allows exclusion of equilibration periods

  • u_col (int, default 2) – column of sim_info_files in which potential energy is found

  • cv_cols (list, default [1,2]) – columns of cv_bias_files in which the CV and bias are found

  • x_col (list, default [1]) – list of columns from x_files to pull out; can be multiple if there are multiple outputs of interest
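To make the roles of the column indices and n_frames concrete, here is a small stand-alone sketch that reads whitespace-delimited columns and keeps only the last n_frames rows. The helper load_last_frames and the comment-line convention are assumptions for illustration; the actual loading code in DataWrapper may differ.

```python
import io


def load_last_frames(handle, cols, n_frames):
    """Read whitespace-delimited numeric columns and keep the last n_frames rows.

    Lines starting with '#' or '@' are treated as comments (an assumption about
    the file format, not something the wrapper is known to do).
    """
    rows = []
    for line in handle:
        line = line.strip()
        if not line or line[0] in "#@":
            continue
        fields = line.split()
        rows.append([float(fields[c]) for c in cols])
    # Dropping the early rows excludes the equilibration period.
    return rows[-n_frames:]


fake_file = io.StringIO("# time cv bias\n0 0.1 1.0\n1 0.2 1.1\n2 0.3 1.2\n3 0.4 1.3\n")
data = load_last_frames(fake_file, cols=[1, 2], n_frames=2)
```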

Methods:

load_U_info()

Loads potential energies from a list of files.

load_CV_info()

Loads data from a file specifying CV coordinate and added bias at each frame.

load_x_info()

Loads observable data.

get_data()

Loads data from files needed to generate data classes for thermoextrap.

build_state([all_data, max_order])

Builds a thermoextrap data object for the data described by this wrapper class.

load_U_info()

Loads potential energies from a list of files.

load_CV_info()

Loads data from a file specifying CV coordinate and added bias at each frame. Assumes that the first value in col_ind is the index of the CV coordinate column and the second is the index for the bias.

load_x_info()

Loads observable data.

get_data()

Loads data from files needed to generate data classes for thermoextrap. Will change significantly if using MBAR on trajectories with different biases.

build_state(all_data=None, max_order=6)

Builds a thermoextrap data object for the data described by this wrapper class. If all_data is provided, should be list or tuple of (potential energies, X) to be used, where X should be appropriately weighted if the simulation is biased.

class thermoextrap.gpr_active.active_utils.SimWrapper(sim_func, struc_name, sys_name, info_name, bias_name, kw_inputs=None, data_kw_inputs=None, data_class=<class 'thermoextrap.gpr_active.active_utils.DataWrapper'>, post_process_func=None, post_process_out_name=None, post_process_kw_inputs=None, pre_process_func=None)

Bases: object

Wrapper around simulations to spawn similar simulations easily and keep track of all parameter values.

Parameters:
  • sim_func (callable()) – function that runs a new simulation

  • struc_name (str) – name of structure file inputs to simulation

  • sys_name (str) – name of system or topological file inputs to simulation

  • info_name (str) – name of information file for simulation to produce

  • bias_name (str) – name of file with CV values and bias for simulation to produce

  • kw_inputs (dict, optional) – additional keyword inputs to the simulation

  • data_class (object) – class (e.g., DataWrapper) to use for wrapping simulation output data; data will be wrapped before returned to the active learning algorithm

  • post_process_func (callable(), optional) – function for post-processing simulation outputs before they are wrapped in data_class

  • post_process_out_name (str, optional) – name of output files produced by post_process_func

  • post_process_kw_inputs (dict, optional) – additional dictionary of arguments for the post_process_func

  • pre_process_func (callable(), optional) – function to apply before a simulation run in order to produce extra keyword arguments for the simulation run; useful if an extra model predicts information needed by or helpful for the simulation

Methods:

run_sim(sim_dir, alpha[, n_repeats])

Runs simulation(s) and returns an object of type self.data_class pointing to the right files.

run_sim(sim_dir, alpha, n_repeats=1, **extra_kwargs)

Runs simulation(s) and returns an object of type self.data_class pointing to the right files. By default, only one simulation is run, but n_repeats simulations are run in parallel if specified.

thermoextrap.gpr_active.active_utils.make_matern_expr(p)

Creates a sympy expression for the Matern kernel of order p.

Parameters:

p (int) – order of Matern kernel

Returns:

  • expr (Expr)

  • kern_params (dict) – parameters matching naming in sympy expression
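For reference, the closed form of the Matern kernel with half-integer smoothness nu = p + 1/2 can be evaluated numerically as below. This sketch assumes that "order p" refers to that convention and uses hypothetical parameter names var and length; the sympy expression built by the library may be parameterized differently.

```python
import math


def matern_half_integer(r, p, var=1.0, length=1.0):
    """Matern kernel with smoothness nu = p + 1/2 at separation r >= 0."""
    nu = p + 0.5
    z = math.sqrt(2.0 * nu) * r / length
    prefactor = math.factorial(p) / math.factorial(2 * p)
    # Finite polynomial that multiplies the exponential for half-integer nu.
    poly = sum(
        math.factorial(p + i) / (math.factorial(i) * math.factorial(p - i))
        * (2.0 * z) ** (p - i)
        for i in range(p + 1)
    )
    return var * math.exp(-z) * prefactor * poly
```

For p = 0 this reduces to the exponential kernel, and for p = 1 to the familiar Matern-3/2 form (1 + sqrt(3) r / length) exp(-sqrt(3) r / length).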

thermoextrap.gpr_active.active_utils.make_rbf_expr(n_dims=1)

Creates a sympy expression for an RBF kernel.

Parameters:

n_dims (int, default 1) – the number of input dimensions for this kernel

Returns:

  • expr (Expr) – sympy expression representing the kernel

  • kern_params (dict) – parameters matching naming in sympy expression

thermoextrap.gpr_active.active_utils.make_rbf_expr_old()

Creates a sympy expression for an RBF kernel.

Returns:

  • expr (Expr)

  • kern_params (dict) – parameters matching naming in sympy expression

thermoextrap.gpr_active.active_utils.make_poly_expr(p)

Creates a sympy expression for a polynomial kernel.

Parameters:

p (int) – order of polynomial

Returns:

  • expr (Expr)

  • kern_params (dict) – parameters matching naming in sympy expression

class thermoextrap.gpr_active.active_utils.RBFDerivKernel(**kwargs)

Bases: DerivativeKernel

For convenience, create a derivative kernel specific to the RBF function. This kernel is used most often, so it is convenient to have.
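What a derivative kernel provides can be shown by hand for a one-dimensional RBF kernel: the covariance between derivative observations is the corresponding derivative of the kernel. The sketch below covers derivative orders 0 and 1 only, with hypothetical function names; the library builds these derivatives symbolically with sympy to arbitrary order.

```python
import math


def rbf(x1, x2, var=1.0, length=1.0):
    """Squared-exponential (RBF) kernel in one dimension."""
    return var * math.exp(-0.5 * ((x1 - x2) / length) ** 2)


def rbf_deriv(x1, x2, d1, d2, var=1.0, length=1.0):
    """Covariance between the d1-th derivative at x1 and the d2-th at x2."""
    k = rbf(x1, x2, var, length)
    u = (x1 - x2) / length**2
    if (d1, d2) == (0, 0):
        return k
    if (d1, d2) == (1, 0):
        return -u * k  # dk/dx1
    if (d1, d2) == (0, 1):
        return u * k  # dk/dx2
    if (d1, d2) == (1, 1):
        return (1.0 / length**2 - u**2) * k  # d2k/(dx1 dx2)
    raise ValueError("only derivative orders 0 and 1 are covered in this sketch")
```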

class thermoextrap.gpr_active.active_utils.ChangeInnerOuterRBFDerivKernel(c1=-7.0, c2=-2.0, **kwargs)

Bases: DerivativeKernel

Implements a change-points kernel via logistic switching functions (as in GPflow’s ChangePoints kernel), but only for two points, where two instead of three kernels are utilized: one for the outer region and one for the inner. Both kernels are RBF kernels with a shared variance parameter. The resulting kernel is differentiable, inheriting DerivativeKernel. Two points where the kernel changes may be specified, c1 and c2, meaning that the outer kernel is used for x <= c1 and x >= c2, while the inner kernel is used for c1 < x < c2.

Parameters:

See also

DerivativeKernel
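The switching idea behind this kernel can be sketched numerically. The example below is one plausible construction in the spirit of a change-points kernel: a smooth weight that is close to 1 between c1 and c2 selects the inner RBF kernel, and its complement selects the outer one. The steepness parameter, the length scales, and all function names are assumptions for illustration and are not taken from the library.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def rbf(x1, x2, var, length):
    return var * math.exp(-0.5 * ((x1 - x2) / length) ** 2)


def inner_weight(x, c1, c2, steepness):
    """Close to 1 for c1 < x < c2 and close to 0 outside that interval."""
    return sigmoid(steepness * (x - c1)) * (1.0 - sigmoid(steepness * (x - c2)))


def inner_outer_kernel(x1, x2, c1=-7.0, c2=-2.0, var=1.0,
                       len_inner=0.5, len_outer=2.0, steepness=10.0):
    """Blend of two RBF kernels that share one variance parameter."""
    s1 = inner_weight(x1, c1, c2, steepness)
    s2 = inner_weight(x2, c1, c2, steepness)
    k_in = rbf(x1, x2, var, len_inner)
    k_out = rbf(x1, x2, var, len_outer)
    return s1 * k_in * s2 + (1.0 - s1) * k_out * (1.0 - s2)
```

Well inside the interval the result matches the inner kernel, and far outside it matches the outer kernel; because the weights are smooth, the blended kernel remains differentiable.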

thermoextrap.gpr_active.active_utils.create_base_GP_model(gpr_data, d_order_ref=0, shared_kernel=True, kernel=<class 'thermoextrap.gpr_active.active_utils.RBFDerivKernel'>, mean_func=None, likelihood_kwargs=None)

Creates just the base GP model without any training; this only sets up sympy and GPflow. kernel can be a kernel object, in which case it is assumed you know what you’re doing and shared_kernel is ignored: the kernel will not be wrapped in SharedIndependent or SeparateIndependent, so if it is not a subclass of MultioutputKernel, HeteroscedasticGPR will wrap it in a SharedIndependent kernel. If shared_kernel is False and a kernel object is provided, only a warning is printed, so beware. If a class is provided for kernel, then shared_kernel is respected; a class is necessary so that separate instances can be created if the kernel is not to be shared. Note that a class provided for kernel must be instantiable without any passed parameters, which is easy to set up with a wrapper class, as for RBFDerivKernel, the default.

Parameters:
  • gpr_data (tuple) – a tuple of input locations, output data, and the noise covariance matrix

  • d_order_ref (int, default 0) – derivative order to treat as the reference for constructing mean functions; PROBABLY BEST TO REMOVE THIS OPTION UNTIL HAVE MORE SOPHISTICATED MEAN FUNCTIONS - DEFAULT BEHAVIOR DEFENDS AGAINST SITUATION WITH NO ZEROTH ORDER DERIVATIVES, BUT DOES NOT CREATE MEANINGFUL MEAN FUNCTION IN THAT CASE (JUST ZEROS)

  • shared_kernel (bool, default True) – whether or not the kernel will be shared across output dimensions

  • kernel (object) – Defaults to RBFDerivKernel. Kernel to use in GP model.

  • mean_func (callable(), optional) – mean function to use for GP model

  • likelihood_kwargs (dict, optional) – keyword arguments to pass to the likelihood model

Returns:

gpr (thermoextrap.gpr_active.gp_models.HeteroscedasticGPR) – Note that this is an untrained model.

thermoextrap.gpr_active.active_utils.train_GPR(gpr, record_loss=False, start_params=None)

Trains a given gpr model for n_opt steps. This actually uses the scipy wrapper in gpflow, which seems faster. If starting parameter values are provided in start_params, they should be an iterable of numpy array or float values (e.g., in a tuple or list).

Parameters:
  • gpr (object) – The GPR model to train

  • record_loss (bool, default False) – Whether or not to record the output of the optimizer and return it

  • start_params (dict, optional) – Parameters to also try as starting points for the optimization; if these are provided as a list or numpy array, two optimizations are performed, one from the current GPR model parameters, and one with the GPR model parameters set to start_params; the optimization result with the lowest loss function value is selected and the GPR parameters are set to those values
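The two-start strategy described for start_params can be illustrated with a toy one-dimensional loss. This is only a sketch of the selection logic (optimize from each starting point, keep the result with the lowest loss); the optimizer, the loss, and the names minimize_1d and best_of_starts are hypothetical stand-ins for the GPflow machinery.

```python
def minimize_1d(loss, start, step=0.01, n_steps=2000, h=1e-6):
    """Plain gradient descent with a finite-difference gradient (toy optimizer)."""
    x = start
    for _ in range(n_steps):
        grad = (loss(x + h) - loss(x - h)) / (2.0 * h)
        x -= step * grad
    return x, loss(x)


def best_of_starts(loss, current, start_params=None):
    """Optimize from `current`, and from `start_params` if given; keep the lower loss."""
    results = [minimize_1d(loss, current)]
    if start_params is not None:
        results.append(minimize_1d(loss, start_params))
    return min(results, key=lambda r: r[1])


# Double-well loss: local minimum near x = 1, lower minimum near x = -1.
def toy_loss(x):
    return (x**2 - 1.0) ** 2 + 0.3 * x


only_current = best_of_starts(toy_loss, current=1.5)
with_extra = best_of_starts(toy_loss, current=1.5, start_params=-1.5)
```

Starting only from the current parameters gets stuck in the shallower well, while supplying an extra starting point lets the lower minimum be found and selected.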

thermoextrap.gpr_active.active_utils.create_GPR(state_list, log_scale=False, start_params=None, base_kwargs=None)

Generates and trains a GPR model based on a list of ExtrapModel objects or a StateCollection object from thermoextrap. If the list contains another type of object, such as a custom state function, each object is simply called and is expected to return GPR input data.

Parameters:
  • state_list (list of ExtrapModel) – Each at different conditions.

  • log_scale (bool, default False) – whether or not to compute derivatives with respect to x or the logarithm of x, where x is the input location

  • start_params (dict, optional) – Starting parameter values to consider during optimization

  • base_kwargs (dict, optional) – Additional dictionary of keyword arguments to pass to create_base_GP_model

Returns:

gpr (thermoextrap.gpr_active.gp_models.HeteroscedasticGPR) – Trained model.
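The effect of log_scale on derivative observations follows from the chain rule for the change of variable t = ln(x). The sketch below shows the conversion for the first two derivative orders only; the function name to_log_derivs is hypothetical, and the library handles this internally for the orders it uses.

```python
def to_log_derivs(x, dy_dx, d2y_dx2):
    """Convert first and second derivatives in x to derivatives in ln(x), for x > 0."""
    dy_dlnx = x * dy_dx
    d2y_dlnx2 = x * dy_dx + x**2 * d2y_dx2
    return dy_dlnx, d2y_dlnx2


# Check with y = x**3, i.e. y = exp(3 t) for t = ln(x), at x = 2.
x = 2.0
first, second = to_log_derivs(x, 3.0 * x**2, 6.0 * x)
```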

class thermoextrap.gpr_active.active_utils.UpdateStopABC(d_order_pred=0, transform_func=<function identityTransform>, log_scale=False, avoid_repeats=False, rng=None)

Bases: object

Class that forms basis for both update and stopping criteria classes, which both need to define transformation functions and create grids of alpha values.

Methods:

create_alpha_grid(alpha_list)

Given a list of alpha values used in the GP model, creates a grid of values to evaluate the GP model at.

get_transformed_GP_output(gpr, x_vals)

Returns output of GP and transforms it, evaluating GP using predict_f at alpha values.

create_alpha_grid(alpha_list)

Given a list of alpha values used in the GP model, creates a grid of values to evaluate the GP model at. This grid, alpha_grid is returned along with values of possible points to select to add to the GP model, alpha_select. Depending on the update strategy, these may be different points (e.g., if using integrated uncertainty, point to add may be different than grid used to evaluate integrated variance).

get_transformed_GP_output(gpr, x_vals)

Returns output of GP and transforms it, evaluating GP using predict_f at alpha values.

class thermoextrap.gpr_active.active_utils.UpdateFuncBase(show_plot=False, save_plot=False, save_dir='./', compare_func=None, **kwargs)

Bases: UpdateStopABC

Base update function class defining structure and implementing basic methods. This class will be callable and will use the do_update() function to perform updates, which means that for new classes inheriting from this do_update() must be implemented. The update function should typically take two arguments: the GP model and the list of alpha (or x) input values that the model is based on.

Parameters:
  • show_plot (bool, default False) – Whether or not to show a plot after each update

  • save_plot (bool, default False) – Whether or not to save a plot after each update

  • save_dir (str or path-like, default './') – Directory to save figures.

  • compare_func (callable(), optional) – Function to compare to for plotting, like ground truth if it is known.

Methods:

do_plotting(x, y, err, alpha_list)

Plots output used to select new update point.

__call__(gpr, alpha_list)

Call self as a function.

do_plotting(x, y, err, alpha_list)

Plots output used to select new update point. err is expected to be length 2 list with upper and lower confidence intervals.

__call__(gpr, alpha_list)

Call self as a function.

class thermoextrap.gpr_active.active_utils.UpdateALMbrute(**kwargs)

Bases: UpdateFuncBase

Performs active learning with a GPR to select new location for performing simulation. This is called “Active Learning MacKay” in the book by Gramacy (Surrogates, 2022). Selection is based on maximizing uncertainty over the interval, which is done with brute force evaluation on a grid of points (this is cheap compared to running simulations or training the GP model). It is possible to select a point that has already been run and add more data there, but the selection will not move outside the range of data already collected. A function for transforming the GPR prediction may also be provided, taking the inputs and prediction as arguments, in that order. This will not affect the GPR model, but will change the active learning outcomes, as will changing the derivative order of the outcome. Note that the transform should only involve addition or scaling (linear operations) such that the uncertainty can also be adjusted in the same way.
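The brute-force selection can be sketched with a stand-in for the GP predictive standard deviation. Everything here (select_max_uncertainty, toy_std, the 41-point grid) is hypothetical and only illustrates picking the grid point of maximum uncertainty within the sampled range, and why a linear transform a*y + b only rescales the uncertainty by |a|.

```python
def select_max_uncertainty(alpha_grid, std_func, scale=1.0):
    """Return the grid point with the largest (transformed) predictive std.

    `std_func` stands in for the GP predictive standard deviation. A linear
    transform a*y + b of the output rescales the std by |a|; the shift b drops out.
    """
    stds = [abs(scale) * std_func(a) for a in alpha_grid]
    best = max(range(len(alpha_grid)), key=lambda i: stds[i])
    return alpha_grid[best], stds[best]


sampled = [0.0, 1.0, 4.0]
# Grid spans only the range of points already sampled.
grid = [sampled[0] + i * (sampled[-1] - sampled[0]) / 40 for i in range(41)]


# Toy uncertainty: grows with distance to the nearest sampled point.
def toy_std(a):
    return 0.05 + min(abs(a - s) for s in sampled)


new_alpha, new_std = select_max_uncertainty(grid, toy_std)
```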

class thermoextrap.gpr_active.active_utils.UpdateRandom(**kwargs)

Bases: UpdateFuncBase

Select point randomly along a grid based on previously sampled points. This does not require training a GP model, but one is trained anyway for plotting, etc.

class thermoextrap.gpr_active.active_utils.UpdateSpaceFill(**kwargs)

Bases: UpdateFuncBase

Select point as far as possible from previously sampled points. This will just be halfway between for two points. For situations where multiple locations work equally well, locations are chosen randomly. This does not require training a GP model, but one is trained anyway for plotting, etc.
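A minimal sketch of the space-filling rule in one dimension is given below, assuming the candidate is the midpoint of the widest gap between sorted sampled points and that ties are broken at random. The function space_fill_point is hypothetical; the class itself works on a grid and may differ in detail.

```python
import random


def space_fill_point(alpha_list, rng=None):
    """Midpoint of the widest gap between sorted sampled points; ties broken randomly."""
    rng = rng or random.Random()
    pts = sorted(alpha_list)
    gaps = [(hi - lo, lo, hi) for lo, hi in zip(pts[:-1], pts[1:])]
    widest = max(g[0] for g in gaps)
    candidates = [0.5 * (lo + hi) for width, lo, hi in gaps if width == widest]
    return rng.choice(candidates)
```

For two points this returns the point halfway between them, as stated above.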

class thermoextrap.gpr_active.active_utils.UpdateAdaptiveIntegrate(tol=0.005, **kwargs)

Bases: UpdateFuncBase

Select point as far as possible from previously sampled points, but within specified error tolerance based on model relative uncertainty predictions. If all values in the interval satisfy the tolerance, the furthest point from all others will be chosen, as in a space-filling update.

Parameters:

tol (float, default 0.005) – tolerance threshold to stay under when finding next point; this is defined as the relative uncertainty, or the GPR-predicted standard deviation divided by the absolute value of the GPR-predicted mean

class thermoextrap.gpr_active.active_utils.UpdateALCbrute(**kwargs)

Bases: UpdateFuncBase

EXPERIMENTAL! MAY BE USEFUL IN FUTURE WORK, BUT NOT NOW!

Performs active learning with a GPR to select new location for performing simulation. This is called “Active Learning Cohn” in the book by Gramacy (Surrogates, 2022). Selection is based on maximizing INTEGRATED uncertainty, which is done with brute force evaluation on a grid of points (this is cheap compared to running simulations or training the GP model). It is possible to select a point that has already been run and add more data there, but the selection will not move outside the range of data already collected. The provided data should be a list of DataWrapper objects. A function for transforming the GPR prediction may also be provided, taking the inputs and prediction as arguments, in that order. This will not affect the GPR model, but will change the active learning outcomes.

ONLY EXPECTED TO WORK WITH A FULLY HETEROSCEDASTIC GP MODEL WHERE THERE IS A MODEL, PERHAPS A SEPARATE GP PROCESS, FOR THE BEHAVIOR OF THE NOISE ACROSS INPUT LOCATIONS.

(the trivial case is that the noise does not vary with input location, in which case this will also work, but if providing heteroscedastic noise but no model to predict new noise, this will not work)

class thermoextrap.gpr_active.active_utils.MetricBase(name, tol)

Bases: object

Base class for structure of metrics used for stopping criteria. To create a metric, write the calc_metric method. Inputs can be history, x_vals, and gp. See below for the definition of history. x_vals are the values at which the means and variances were evaluated, and gp is the actual Gaussian process model. If possible, avoid using the GP in metric functions, since doing so makes them slow; in some cases it is necessary, however, so it is passed in for flexibility.

Parameters:
  • name (str) – name of this metric

  • tol (float) – tolerance threshold for defining stopping

Methods:

__call__(history, x_vals, gp)

Call self as a function.

__call__(history, x_vals, gp)

Call self as a function.

class thermoextrap.gpr_active.active_utils.MaxVar(tol, name='MaxVar', **kwargs)

Bases: MetricBase

Metric based on maximum variance of GP output.

class thermoextrap.gpr_active.active_utils.AvgVar(tol, name='AvgVar', **kwargs)

Bases: MetricBase

Metric based on average variance of GP output.

Parameters:
  • name (str, default 'AvgVar') – name of this metric

  • tol (float) – tolerance threshold for defining stopping

  • **kwargs – Extra arguments to MetricBase

class thermoextrap.gpr_active.active_utils.MaxRelVar(tol, threshold=1e-12, name='MaxRelVar', **kwargs)

Bases: MetricBase

Metric based on maximum relative variance of GP output (actually std).

Parameters:
  • name (str, default 'MaxRelVar') – Name of this metric

  • tol (float) – tolerance threshold for defining stopping

  • threshold (float, default 1e-12) – checks that GPR-predicted means have absolute value larger than this value to avoid dividing by zero; points below this value are ignored when calculating the metric (set to zero)
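The role of threshold in the relative metrics can be sketched as follows. The helper max_rel_std is hypothetical; it assumes the metric is the largest ratio of predicted standard deviation to the absolute predicted mean, with points whose mean magnitude is at or below the threshold set to zero.

```python
def max_rel_std(means, stds, threshold=1e-12):
    """Largest std/|mean| over the grid, skipping points with |mean| <= threshold."""
    rel = [
        s / abs(m) if abs(m) > threshold else 0.0  # ignored points count as zero
        for m, s in zip(means, stds)
    ]
    return max(rel)


means = [2.0, 0.0, -4.0]
stds = [0.1, 0.5, 0.4]
metric = max_rel_std(means, stds)
stop = metric < 0.2  # compare with a tolerance of 0.2
```

The middle point has a mean of exactly zero, so it is skipped rather than producing a division by zero.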

class thermoextrap.gpr_active.active_utils.MaxRelGlobalVar(tol, name='MaxRelGlobalVar', **kwargs)

Bases: MetricBase, UpdateStopABC

Metric based on maximum ratio of GP output variance to variance of data input to the GP (actually ratio of std devs).

Parameters:
  • name (str, default 'MaxRelGlobalVar') – name of this metric

  • tol (float) – tolerance threshold for defining stopping

class thermoextrap.gpr_active.active_utils.AvgRelVar(tol, threshold=1e-12, name='AvgRelVar', **kwargs)

Bases: MetricBase

Metric based on average relative variance of GP output.

Parameters:
  • name (str, default 'AvgRelVar') – name of this metric

  • tol (float) – tolerance threshold for defining stopping

  • threshold (float, default 1e-12) – checks that GPR-predicted means have absolute value larger than this value to avoid dividing by zero; points below this value are ignored when calculating the metric (set to zero)

class thermoextrap.gpr_active.active_utils.MSD(tol, name='MSD', **kwargs)

Bases: MetricBase

Metric based on mean squared deviation between GP model outputs.

Parameters:
  • name (str, default 'MSD') – Name of this metric

  • tol (float) – tolerance threshold for defining stopping

class thermoextrap.gpr_active.active_utils.MaxAbsRelDeviation(tol, threshold=1e-12, name='MaxAbsRelDev', **kwargs)

Bases: MetricBase

Metric based on maximum absolute relative deviation between GP model outputs.

Parameters:
  • name (str, default 'MaxAbsRelDev') – name of this metric

  • tol (float) – tolerance threshold for defining stopping

  • threshold (float, default 1e-12) – checks that GPR-predicted means have absolute value larger than this value to avoid dividing by zero; points below this value are ignored when calculating the metric (set to zero)

class thermoextrap.gpr_active.active_utils.MaxAbsRelGlobalDeviation(tol, name='MaxAbsRelGlobalDeviation', **kwargs)

Bases: MetricBase, UpdateStopABC

Metric based on maximum absolute deviation between GP model outputs divided by the std of the data.

Parameters:
  • name (str, default 'MaxAbsRelGlobalDeviation') – name of this metric

  • tol (float) – tolerance threshold for defining stopping

class thermoextrap.gpr_active.active_utils.AvgAbsRelDeviation(tol, threshold=1e-12, name='AvgAbsRelDev', **kwargs)

Bases: MetricBase

Metric based on average absolute relative deviation between GP model outputs.

Parameters:
  • name (str, default 'AvgAbsRelDev') – name of this metric

  • tol (float) – tolerance threshold for defining stopping

class thermoextrap.gpr_active.active_utils.ErrorStability(tol, name='ErrorStability', **kwargs)

Bases: MetricBase, UpdateStopABC

Implements the stopping metric introduced by Ishibashi and Hino (2021). Note that this metric also inherits from UpdateStopABC, so it has its own parameters for log_scale, d_order_pred, and transform_func that are separate from those of the StopCriteria it is used in. Note that tol should be between 0 and 1, likely at 0.1 or below. Also note that in the implementation here, transform_func can only be a linear transformation (scale and/or shift) of the GPR output (even though in general transform_func can be more complicated).

Parameters:
  • name (str) – name of this metric

  • tol (float) – tolerance threshold for defining stopping

class thermoextrap.gpr_active.active_utils.MaxIter(name='MaxIter', **kwargs)

Bases: MetricBase

Metric that always returns False so that the maximum number of iterations is reached. This can be used with or without other metrics to reach the maximum number of iterations, since all metrics must be True to meet the stopping criteria. Note that the tolerance does not need to be (and should not be) set here.

Parameters:

name (str) – name of this metric

class thermoextrap.gpr_active.active_utils.StopCriteria(metric_funcs, **kwargs)

Bases: UpdateStopABC

Class that calculates metrics used to determine stopping criteria for active learning. The key component of this class is a list of metric functions which have names and define tolerances. All of the metrics must be less than the tolerance to trigger stopping.

To perform calculations of metrics, this class keeps track of the history of the GPR predictions (necessary for metrics based on deviations since past iterations). This history object is stored as a list of array objects, specifically the GPR mean (list index 0) and GPR standard deviations (list index 1) with rows in each array being different iterations. So GPR prediction of a mean at iteration 3 would be history[0][3, …].

Parameters:

metric_funcs (dict) – A dictionary {name: function} of metric names and associated functions; this will be looped over with metrics calculated to determine stopping
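The history layout and the "all metrics must pass" rule can be sketched with plain lists standing in for arrays. The metric functions, the (function, tol) pairing, and should_stop below are hypothetical simplifications; the real class works with metric objects that carry their own names and tolerances.

```python
def max_abs_deviation(history, x_vals, gp=None):
    """Largest absolute change in the mean between the last two iterations."""
    prev, last = history[0][-2], history[0][-1]
    return max(abs(a - b) for a, b in zip(last, prev))


def max_std(history, x_vals, gp=None):
    """Largest predicted standard deviation at the latest iteration."""
    return max(history[1][-1])


def should_stop(history, x_vals, metrics):
    """`metrics` maps name -> (function, tol); stop only if all are below tol."""
    values = {name: f(history, x_vals) for name, (f, tol) in metrics.items()}
    flags = {name: values[name] < metrics[name][1] for name in metrics}
    return all(flags.values()), values


x_vals = [0.0, 0.5, 1.0]
# history[0]: means, history[1]: stds; one row per iteration.
history = [
    [[1.00, 1.50, 2.00], [1.02, 1.49, 2.01]],
    [[0.30, 0.40, 0.30], [0.05, 0.08, 0.05]],
]
metrics = {"MaxDev": (max_abs_deviation, 0.05), "MaxStd": (max_std, 0.1)}
stop, values = should_stop(history, x_vals, metrics)
```

Tightening any single tolerance so that its metric fails is enough to prevent stopping, which is also why a metric that never passes forces the run to the maximum number of iterations.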

Methods:

compute_metrics(alpha_grid[, history, gpr])

Uses current history (default) or one provided to compute all metrics.

__call__(gpr, alpha_list)

Call self as a function.

compute_metrics(alpha_grid, history=None, gpr=None)

Uses current history (default) or one provided to compute all metrics. Must provide grid of alpha values as well to input to metrics.

__call__(gpr, alpha_list)

Call self as a function.

thermoextrap.gpr_active.active_utils.active_learning(init_states, sim_wrapper, update_func, base_dir='', stop_criteria=None, max_iter=10, alpha_name='alpha', log_scale=False, max_order=4, gp_base_kwargs=None, num_state_repeats=1, save_history=False, use_predictions=False)

Continues adding new points with active learning by running simulations until the specified tolerance is reached or the maximum number of iterations is achieved.

Parameters:
  • init_states (list of DataWrapper)

  • sim_wrapper (SimWrapper) – Object for running simulations.

  • update_func (callable()) – For selecting the next state point.

  • base_dir (str) – base directory in which the active learning run is performed and outputs are generated

  • stop_criteria (callable(), optional) – callable that takes the GP model and determines whether to stop

  • max_iter (int, default 10) – maximum number of iterations to run (new points to add)

  • alpha_name (str, default 'alpha') – the changed parameter; MUST match input name for sim_inputs

  • log_scale (bool, default False) – whether or not to use a log scale for alpha

  • max_order (int, default 4) – Maximum order to use for derivative observations

  • gp_base_kwargs (dict, optional) – dictionary of keyword arguments for create_base_GP_model (allows for more advanced specification of GP model)

  • num_state_repeats (int, default 1) – Number of simulations to run for each state (can help to estimate uncertainty as long as independent)

  • save_history (bool, default False) – If stop_criteria is not None, saves its history (all predictions at each step of the active learning protocol).

  • use_predictions (bool, default False) – Whether or not sim_wrapper needs predictions from the GP model; if True, passes keyword arguments model_pred and model_std (model-predicted mu and std) to sim_wrapper

Returns:

  • data_list (list of DataWrapper) – List of DataWrapper objects describing how to load data (can be used to build states and create_GPR to generate GP model)

  • train_history (dict) – Dictionary of information about results at each training iteration, like GP predictions, losses, parameters, etc.
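The overall control flow can be sketched as a toy loop in which a cheap function stands in for the simulation and no GP model is trained. All names below are hypothetical, and the order of the stopping check relative to the update is an assumption; the sketch only shows points being added until a stopping callable returns True or max_iter is reached.

```python
import math


def toy_active_learning(init_alphas, run_sim, update_func, stop_criteria=None, max_iter=10):
    """Add points chosen by update_func until stop_criteria is met or max_iter is hit."""
    alphas = list(init_alphas)
    data = [run_sim(a) for a in alphas]
    for _ in range(max_iter):
        if stop_criteria is not None and stop_criteria(alphas, data):
            break
        new_alpha = update_func(alphas, data)
        alphas.append(new_alpha)
        data.append(run_sim(new_alpha))
    return alphas, data


def midpoint_of_widest_gap(alphas, data):
    """Toy update rule: midpoint of the widest gap between sampled points."""
    pts = sorted(alphas)
    lo, hi = max(zip(pts[:-1], pts[1:]), key=lambda p: p[1] - p[0])
    return 0.5 * (lo + hi)


def gaps_small_enough(alphas, data, tol=0.3):
    """Toy stopping rule: every gap between sampled points is below tol."""
    pts = sorted(alphas)
    return max(b - a for a, b in zip(pts[:-1], pts[1:])) < tol


alphas, data = toy_active_learning([0.0, 1.0], math.sin, midpoint_of_widest_gap,
                                   stop_criteria=gaps_small_enough)
```

Without a stopping callable, the loop simply adds max_iter new points.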