# GP Models Utilizing Derivative Information and Active Learning
The notebooks contained here provide a set of tutorials for using the Gaussian
Process Regression (GPR) modeling capabilities found in the
`thermoextrap.gpr_active` module. For all of the code and analysis necessary
to reproduce the paper associated with the development of this module, please
see the `example_projects` directory.
## Gaussian Process Models
The components of all Gaussian Process (GP) models are housed in
`gp_models`. A custom kernel function, `DerivativeKernel`, forms the basis for
using derivative information in the GP models. Behind the scenes, this kernel
uses sympy to compute the necessary derivatives of a provided sympy expression
representing the kernel. Unique combinations of derivative orders are
identified, the corresponding derivative functions are determined, and the
results are stored and stitched back together at the end to produce the
covariance matrix. This is possible because differentiation is a linear
operator on the covariance kernel, meaning that derivatives of the kernel
provide the covariances between observations at different derivative orders.
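To make the idea concrete, the snippet below shows how derivative-order covariances can be obtained by symbolically differentiating a kernel expression with sympy, in the spirit of what `DerivativeKernel` does internally. The kernel expression, symbol names, and the `deriv_kernel` helper here are illustrative assumptions, not the thermoextrap API.

```python
# Sketch: covariances between derivative observations from a symbolic kernel.
import sympy as sp

x1, x2, var, length = sp.symbols("x1 x2 var l", real=True, positive=True)
# An example kernel expression: squared-exponential (RBF)
kern = var * sp.exp(-(x1 - x2) ** 2 / (2 * length**2))

def deriv_kernel(dx1, dx2):
    """Covariance between the dx1-th derivative at x1 and the dx2-th
    derivative at x2, obtained by differentiating the kernel expression.
    Works because differentiation is a linear operator on the kernel."""
    expr = sp.diff(kern, x1, dx1, x2, dx2)
    return sp.lambdify((x1, x2, var, length), expr, "numpy")

# Cov(f(x1), f'(x2)): differentiate once with respect to x2
k01 = deriv_kernel(0, 1)
print(k01(0.0, 1.0, 1.0, 1.0))  # negative: f(0) anti-correlates with f'(1)
```

Each unique `(dx1, dx2)` pair yields one such function; evaluating them over all pairs of observations and stitching the results together produces the full covariance matrix over mixed derivative orders.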
Other key components are the custom likelihood `HetGaussianDeriv` and the GP
model itself, `HeteroscedasticGPR`. The former builds a likelihood model that
accounts for covariances between derivatives and also allows for
heteroscedasticity (different uncertainties for different data points,
including different derivative orders). The latter is a heteroscedastic GP
model that makes use of the likelihood just described and the
`DerivativeKernel`, and behaves much like other GP models in GPflow. Note,
though, that `predict_y` is not implemented, as that would require estimates
of the uncertainty at new points; for heteroscedastic uncertainties, this in
turn requires a model of how the uncertainty varies with input location,
which has not yet been implemented.
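The structure of such a derivative-aware heteroscedastic noise model can be sketched as follows: each data point contributes its own noise covariance block coupling its derivative orders, and the blocks are assembled block-diagonally before being added to the kernel matrix. This is a conceptual illustration only; the function name and data layout are assumptions, not the `HetGaussianDeriv` API.

```python
# Sketch: block-diagonal noise covariance for heteroscedastic,
# derivative-correlated observations.
import numpy as np

def assemble_noise_cov(blocks):
    """Stack per-point noise covariance blocks into one block-diagonal
    matrix. Each block is (n_orders, n_orders): the estimated covariance
    among the derivative-order observations from a single simulation."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        m = b.shape[0]
        out[i:i + m, i:i + m] = b
        i += m
    return out

# Two state points, each observed at derivative orders 0 and 1.
# Uncertainties differ between points (heteroscedasticity), and the
# value and derivative estimates at one point are correlated.
blocks = [np.array([[0.10, 0.02], [0.02, 0.50]]),
          np.array([[0.05, -0.01], [-0.01, 0.30]])]
noise = assemble_noise_cov(blocks)
```

Observations from different simulations land in different blocks and so remain uncorrelated, while the within-block off-diagonals carry the derivative-order covariances the likelihood needs.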
## Active Learning
Tools that assist in performing active learning protocols are found in
`active_utils`. Though this module includes functions for building and
training GPR models, the focus is on classes and methods that enable active
learning. Pre-eminent among these are classes for housing data and keeping
track of simulations performed during active learning. Since every simulation
environment and project will be unique, users are encouraged to use these as
guidelines and templates for creating their own data-management classes. More
generally useful are classes describing active learning update strategies,
metrics, and stopping criteria. These can easily be inherited to construct
new active learning protocols without fundamentally changing the primary
active learning function. While this function is fairly general in its
structure, it does make specific use of some of the other classes described,
which, as highlighted, vary in their generalizability and transferability to
new situations.
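The separation of concerns described above, an update strategy that picks the next simulation, a data store, and a stopping criterion wrapped in one driver loop, can be sketched generically as below. All class and method names here are illustrative assumptions, not the `active_utils` API; the acquisition rule is a simple space-filling heuristic standing in for a GP-based metric.

```python
# Sketch: a minimal active-learning loop with pluggable pieces.
from dataclasses import dataclass, field

@dataclass
class ActiveLearner:
    candidates: list                           # candidate input locations
    data: dict = field(default_factory=dict)   # location -> observation

    def acquire(self):
        """Update strategy: pick the candidate farthest from any point
        already sampled (a space-filling stand-in for a GP uncertainty
        metric)."""
        sampled = list(self.data)
        if not sampled:
            return self.candidates[0]
        return max(self.candidates,
                   key=lambda c: min(abs(c - s) for s in sampled))

    def run(self, simulate, max_iters=5, tol=0.0):
        """Driver loop: acquire a point, run a simulation, store the
        result, and stop when the criterion is met."""
        for _ in range(max_iters):
            x = self.acquire()
            self.data[x] = simulate(x)
            # Stopping criterion: every candidate within tol of a sample
            if all(min(abs(c - s) for s in self.data) <= tol
                   for c in self.candidates):
                break
        return self.data

learner = ActiveLearner(candidates=[0.0, 0.25, 0.5, 0.75, 1.0])
result = learner.run(simulate=lambda x: x**2, max_iters=3)
```

Subclassing and overriding `acquire` or the stopping check swaps in a new update strategy or criterion without touching the driver loop, which mirrors the inheritance-based extension points the module provides.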