GP Models Utilizing Derivative Information and Active Learning

The notebooks collected here are tutorials for the Gaussian Process Regression (GPR) modeling capabilities in the thermoextrap.gpr_active module. For all of the code and analysis needed to reproduce the paper associated with the development of this module, see the example_projects directory.

Gaussian Process Models

The components of all Gaussian Process (GP) models are housed in gp_models. A custom kernel function, DerivativeKernel, is what allows derivative information to be used in the GP models. Behind the scenes, it uses sympy to compute the necessary derivatives of a user-provided sympy expression representing the kernel. The unique combinations of derivative orders present in the data are identified, the corresponding derivative of the kernel is determined for each combination, and the results are stored and stitched back together at the end to produce the full covariance matrix. This works because differentiation is a linear operator, so derivatives of the covariance kernel give the covariances between observations at different derivative orders.
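As a minimal sketch of the idea that kernel derivatives supply covariances between derivative orders, the example below hand-codes the derivatives of a one-dimensional squared-exponential kernel up to first order and assembles a covariance matrix over mixed observations. The kernel choice and the names rbf, DERIV_FACTORS, rbf_deriv, and covariance_matrix are assumptions made for illustration and are not part of thermoextrap; DerivativeKernel obtains such derivatives symbolically with sympy rather than by hand.

```python
import math


def rbf(x1, x2, var=1.0, ls=1.0):
    """Squared-exponential kernel k(x1, x2) (assumed kernel for this sketch)."""
    return var * math.exp(-0.5 * (x1 - x2) ** 2 / ls**2)


# One entry per unique (derivative order at x1, derivative order at x2) pair.
# Each function returns the factor multiplying k(x1, x2), with r = x1 - x2.
DERIV_FACTORS = {
    (0, 0): lambda r, ls: 1.0,
    (1, 0): lambda r, ls: -r / ls**2,                  # d k / d x1
    (0, 1): lambda r, ls: r / ls**2,                   # d k / d x2
    (1, 1): lambda r, ls: 1.0 / ls**2 - r**2 / ls**4,  # d^2 k / (d x1 d x2)
}


def rbf_deriv(x1, x2, d1, d2, var=1.0, ls=1.0):
    """Covariance between the d1-th derivative at x1 and the d2-th derivative at x2."""
    factor = DERIV_FACTORS[(d1, d2)](x1 - x2, ls)
    return factor * rbf(x1, x2, var, ls)


def covariance_matrix(obs, var=1.0, ls=1.0):
    """Stitch the per-order covariances into one matrix.

    obs is a list of (x, derivative_order) pairs.
    """
    return [
        [rbf_deriv(xi, xj, di, dj, var, ls) for (xj, dj) in obs]
        for (xi, di) in obs
    ]


# Function values and first derivatives observed at x = 0 and x = 1.
obs = [(0.0, 0), (0.0, 1), (1.0, 0), (1.0, 1)]
cov = covariance_matrix(obs)
```

The resulting matrix is symmetric, and each off-diagonal block between a function value and a derivative is simply the kernel differentiated with respect to the matching argument.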

The other key components are the custom likelihood HetGaussianDeriv and the GP model itself, HeteroscedasticGPR. The former builds a likelihood model that takes covariances between derivatives into account and also allows for heteroscedasticity (different uncertainties for different data points, including different derivative orders). The latter is a heteroscedastic GP model that combines this likelihood with DerivativeKernel and otherwise behaves much like other GP models in GPflow. Note, though, that predict_y is not implemented, because it would require estimates of the observation uncertainty at new points. For heteroscedastic uncertainties, that in turn requires a model of how the uncertainty varies with input location, which has not yet been implemented.
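To illustrate what heteroscedasticity means for prediction, here is a simplified pure-Python sketch of GP regression in which each observation carries its own noise variance. It assumes a squared-exponential kernel, function-value observations only, and independent noise (a diagonal noise matrix), which is a simplification of the likelihood described above. The functions rbf, solve, and predict_f are hypothetical helpers written for this example and are not the thermoextrap or GPflow implementations.

```python
import math


def rbf(x1, x2, var=1.0, ls=1.0):
    """Squared-exponential kernel (assumed for this sketch)."""
    return var * math.exp(-0.5 * (x1 - x2) ** 2 / ls**2)


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def predict_f(x_new, x_obs, y_obs, noise_var):
    """Posterior mean and variance of the latent function at x_new."""
    n = len(x_obs)
    # Heteroscedastic noise: each observation adds its own variance to the diagonal.
    k = [
        [rbf(x_obs[i], x_obs[j]) + (noise_var[i] if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    k_star = [rbf(x_new, xi) for xi in x_obs]
    alpha = solve(k, y_obs)
    mean = sum(ks * a for ks, a in zip(k_star, alpha))
    v = solve(k, k_star)
    variance = rbf(x_new, x_new) - sum(ks * vi for ks, vi in zip(k_star, v))
    return mean, variance


# Two observations of the same value: one precise, one very noisy.
x_obs = [0.0, 3.0]
y_obs = [1.0, 1.0]
noise_var = [1e-4, 1.0]

mean_precise, var_precise = predict_f(0.0, x_obs, y_obs, noise_var)
mean_noisy, var_noisy = predict_f(3.0, x_obs, y_obs, noise_var)
```

The latent prediction needs only the noise variances at the observed points. Predicting a new noisy observation would additionally need a noise variance at x_new, which is exactly the quantity a heteroscedastic model does not have without a separate model of the noise.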

Active Learning

Tools that assist in performing active learning protocols are found in active_utils. Though this includes functions for building and training GPR models, the focus is on classes and methods that enable active learning. Pre-eminent among these are classes for housing data and keeping track of the simulations performed during active learning. Since every simulation environment and project is unique, users are encouraged to treat these as guidelines and templates for creating their own data-management classes. More generally useful are the classes describing active learning update strategies, metrics, and stopping criteria. These can be subclassed to construct new active learning protocols without fundamentally changing the primary active learning function. While that function is fairly general in structure, it does make specific use of some of the other classes described here, which, as noted, vary in how well they generalize and transfer to new situations.
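The separation between update strategies, stopping criteria, and the main loop can be sketched as follows. This is a toy illustration under stated assumptions, not the active_utils API: every class and function name here (UpdateStrategy, MaxUncertainty, StopCriterion, UncertaintyTolerance, nearest_distance, active_learning) is hypothetical, and the "uncertainty" is a stand-in (distance to the nearest sampled point) rather than a GP predictive variance.

```python
import math


class UpdateStrategy:
    """Hypothetical base class: chooses the next input at which to collect data."""

    def select(self, candidates, uncertainty):
        raise NotImplementedError


class MaxUncertainty(UpdateStrategy):
    """Pick the candidate where the surrogate is least certain."""

    def select(self, candidates, uncertainty):
        return max(candidates, key=uncertainty)


class StopCriterion:
    """Hypothetical base class: decides when the loop should end."""

    def should_stop(self, max_uncertainty, iteration):
        raise NotImplementedError


class UncertaintyTolerance(StopCriterion):
    """Stop once the worst-case uncertainty falls below tol, or after max_iter updates."""

    def __init__(self, tol, max_iter=50):
        self.tol = tol
        self.max_iter = max_iter

    def should_stop(self, max_uncertainty, iteration):
        return max_uncertainty < self.tol or iteration >= self.max_iter


def nearest_distance(x, sampled):
    """Stand-in uncertainty: distance from x to the closest sampled input."""
    return min(abs(x - s) for s in sampled)


def active_learning(simulate, candidates, initial, strategy, stop):
    """Generic loop: the strategy and stopping rule are swappable objects."""
    data = {x: simulate(x) for x in initial}
    iteration = 0
    while True:
        def uncertainty(x):
            return nearest_distance(x, data)

        worst = max(uncertainty(x) for x in candidates)
        if stop.should_stop(worst, iteration):
            break
        x_next = strategy.select(candidates, uncertainty)
        data[x_next] = simulate(x_next)  # "run a simulation" at the chosen input
        iteration += 1
    return data


candidates = [i / 10 for i in range(11)]
data = active_learning(
    simulate=math.sin,
    candidates=candidates,
    initial=[0.0, 1.0],
    strategy=MaxUncertainty(),
    stop=UncertaintyTolerance(tol=0.15),
)
```

A new protocol in this sketch only requires subclassing UpdateStrategy or StopCriterion; the loop itself stays unchanged.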