Manual

R. D. McMichael rmcmichael@nist.gov
National Institute of Standards and Technology
Gaithersburg, MD USA
April 7, 2020

Introduction

This manual describes an implementation of optimal Bayesian experimental design methods to control measurement settings in order to efficiently determine model parameters. In situations where parametric models would conventionally be fit to measurement data in order to obtain model parameters, these methods offer an adaptive measurement strategy capable of achieving smaller uncertainty with fewer measurements. These methods are therefore most beneficial in situations where measurements are expensive in terms of money, time, risk, labor or other costs. The price for these benefits lies in the complexity of automating such measurements and in the computational load required. It is the goal of this package to help potential users overcome at least the programming hurdles.

Optimal Bayesian experimental design is not new, at least not in the statistics community. A review paper from 1995 by Kathryn Chaloner and Isabella Verdinelli reveals that the basic methods had been worked out in the preceding decades. The methods implemented here closely follow the approach of Xun Huan and Youssef M. Marzouk, which emphasizes simulation-based experimental design. Optimal Bayesian experimental design remains an active area of research.

There are at least three important factors that encourage the application of these methods today: first, the availability of flexible, modular computer languages such as Python; second, the availability of cheap computational power; and most of all, a growing awareness in scientific communities of the benefits of code sharing and reuse.

Philosophy and goals

If it sounds good, it is good.

—Duke Ellington

Jazz legend Duke Ellington was talking about music, where it’s all about the sound. For this package, it’s all about being useful, so we have adopted a “runs good” philosophy:

  • If it’s a struggle to use, it can’t run good.

  • If technical jargon is a barrier, it can’t run good.

  • If the user finds it useful, it runs good.

  • If it runs good, it is good.

The goals are modest: to adapt some of the developments in optimal Bayesian experimental design research for practical use in laboratory settings, giving users tools to make better measurements.

Requirements for users

It takes a little effort to get this software up and running. Here’s what a user will need to supply to get started.

  1. Python 3.x with the numpy Python package

  2. An experiment that yields measurement results with uncertainty estimates.

  3. A reliable parametric model for the experiment, typically a function with parameters to be determined.

  4. A working knowledge of Python programming, enough to follow examples and program a model function.

Theory of operation

The optimal Bayes experimental design method incorporates two main jobs, which we can describe as “learning fast” and “making good decisions.”

Learning fast

By interpreting measurement data as soon as it becomes available, sequential Bayesian experimental design gains a critical advantage over the traditional measure-then-fit design. With a measure-then-fit strategy, we get information about parameters only at the very end of the process, after all the measurements and fitting are done. In contrast, the optimal Bayesian experimental design method updates our parameter knowledge with each measurement result, so that information-based decisions can be made as data is collected.

The process of digesting new data is Bayesian inference, which frames parameter knowledge in terms of a probability distribution \(p(\theta)\) for an array of parameters \(\theta = [ \theta_0, \theta_1, ...]\). The familiar notation \(a\pm \sigma\) is often a shorthand description of a Gaussian probability distribution. A broad distribution \(p(\theta)\) corresponds to large uncertainty, and if \(p(\theta)\) is a narrow distribution, the uncertainty is small.

When new measurement results \(m\) are taken in to account, there will be a new, revised probability distribution \(p(\theta|m)\). The vertical bar in the notation \(p(\theta|m)\) indicates a conditional probability, so \(p(\theta|m)\) is the distribution of \(\theta\) values given \(m\).

Bayes’ rule provides

\[p(\theta|m) = \frac{p(m|\theta) p(\theta)}{p(m)}.\]

All of the terms here have technical names. The left side is the posterior distribution, i.e. the distribution of parameters \(\theta\) after we include \(m\). On the right, the distribution \(p(\theta)\) is the prior, representing what we knew about the parameters \(\theta\) before the measurement. In the denominator, \(p(m)\) is called the evidence, but because it has no \(\theta\) dependence, it functions as just a normalizing constant in this situation. As wrong as this sounds, we will ignore the evidence.
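As a minimal numerical sketch of this update (illustrative values, not the package API), the posterior for a discrete set of candidate \(\theta\) values is just the prior times the likelihood, renormalized:

    import numpy as np

    # Three candidate values of a parameter theta (hypothetical numbers)
    prior = np.array([0.2, 0.5, 0.3])        # p(theta), sums to 1
    likelihood = np.array([0.1, 0.8, 0.4])   # p(m|theta) for one measured m
    posterior = prior * likelihood
    posterior /= posterior.sum()             # dividing by p(m) is just normalization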

The term that requires attention is in the numerator; \(p(m|\theta)\) is called the likelihood. It’s the probability of getting measurement \(m\) given variable parameter values \(\theta\). Less formally, the likelihood answers the question: “How well does the model explain the measured value \(m\), when the model uses different parameter values \(\theta\)?”

In practice, \(m_i\) will be a fixed measurement result to “plug in” for \(m\). It’s important to keep sight of the fact that \(p(m_i|\theta)\) is still a function of \(\theta\). Conceptually, we can try out different parameter values in our model to produce a variety of measurement predictions. Some parameter values (the more likely ones) will produce model values closer to \(m_i\), and for other parameter values (the less likely ones), the model values will be further away.

The optbayesexpt.OptBayesExpt class requires the user to report a measurement record \(m_i\) that includes the measurement settings \(x_i\), the “value” \(y_i\), and the uncertainty \(\sigma_i\). Together, \(y_i\) and \(\sigma_i\) are more than fixed numbers; they are shorthand for a probability distribution \(p(y|y_i, \sigma_i)\) that a noise-free measurement would yield a value \(y\) given a mean value \(y_i\). If this distribution is symmetric, like a Gaussian, for example, then \(p(y|y_i, \sigma_i) = p(y_i|y, \sigma_i)\), the probability of measuring \(y_i\) given a mean value \(y\) that’s provided by the experimental model \(y = y(x_i,\theta)\). The likelihood is then

\[p(m_i|\theta) = \frac{1}{\sqrt{2\pi}\sigma_i} \exp\left[-\frac{[y_i - y(x_i, \theta)]^2 }{ 2\sigma_i^2 } \right].\]

With this likelihood, Bayes’ theorem provides the updated \(p(\theta|m_i)\). Then, another measurement \(m_j\) can update \(p(\theta|m_i)\) to \(p(\theta|m_j, m_i, \ldots)\) and so on. In order to keep the notation readable, we’ll adopt a convention that \(p(\theta)\) always represents the most up-to-date parameter distribution that we have.
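As a sketch of how this likelihood might be evaluated with numpy over an array of parameter samples (the linear model and the numbers here are hypothetical, not the package source):

    import numpy as np

    rng = np.random.default_rng()
    # Hypothetical parameter samples for a linear model y = a * x + b
    a_samples = rng.normal(1.0, 0.2, 1000)
    b_samples = rng.normal(0.0, 0.5, 1000)

    # One reported measurement record: settings, value, uncertainty
    x_i, y_i, sigma_i = 2.0, 2.1, 0.1

    # Model predictions, one per parameter sample, and the Gaussian likelihood
    y_model = a_samples * x_i + b_samples
    likelihood = (np.exp(-(y_i - y_model) ** 2 / (2 * sigma_i ** 2))
                  / (np.sqrt(2 * np.pi) * sigma_i))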

This approach assumes that our model function \(y(x, \theta)\) is a good description of our system, and that the measurement noise is Gaussian with standard deviation \(\sigma_i\). On one hand we have to admit that these assumptions don’t allow us to address all important cases. On the other hand, these are the same assumptions we often make in doing least-squares curve fitting.

The method described above puts the responsibility for determining measurement uncertainty on the measurement reporter, but as an experimental result, uncertainty is generally a measurement output, not an input. If uncertainty is a parameter to be determined, it enters the process through the likelihood function given above, but it is not part of the model function \(y(x_i, \theta)\). See demos/line_plus_noise.py for an example.

Making good decisions

The next important job in the process is figuring out good measurement settings. The goal is to make the parameter probability distribution \(p(\theta)\) narrow while minimizing cost. More formally, the challenge is to develop a utility function \(U(x)\) that helps us to predict and compare the relative benefits of measurements made with different possible experimental settings \(x\).

Qualitatively, the mechanism for choosing measurement values hinges on the model’s connection between parameter values \(\theta\) and measurement results \(y\). Consider a sampling of several sets of parameter values \(\theta_i\). With these parameter sets, the model can produce a collection of output curves \(y_i(x)\), and generally these curves will be closer together for some settings \(x\) and further apart for others. Intuitively, little would be learned by measuring at \(x\) values where the curves agree. Instead, it would do the most good to “pin down” the results with a measurement at an \(x\) where the predicted \(y_i(x)\) curves disagree.

By drawing samples from the updated parameter distribution \(p(\theta)\) the mechanism above focuses attention on the relevant parts of parameter space, and through the model to relevant settings. Or, stated slightly differently, using an updated parameter distribution helps to avoid wasting measurement resources on low-impact measurements.
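A rough numpy sketch of this intuition (again with a hypothetical linear model and made-up parameter draws) computes the spread of the predicted curves at each candidate setting:

    import numpy as np

    rng = np.random.default_rng()
    x = np.linspace(0.0, 10.0, 200)          # candidate settings
    a_draws = rng.normal(1.0, 0.2, 30)       # draws from p(theta)
    b_draws = rng.normal(0.0, 0.5, 30)

    # One model curve y_i(x) per parameter draw; shape (30, 200)
    curves = a_draws[:, None] * x[None, :] + b_draws[:, None]
    spread = curves.std(axis=0)              # where the predictions disagree
    x_pin = x[np.argmax(spread)]             # a high-impact setting to "pin down"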

Estimate benefits

To translate such a qualitative argument into code, “doing the most good” must be defined more precisely in terms of desired changes in the parameter distribution \(p(\theta)\). Usually, the goal in determining model parameters is to get unambiguous results with small uncertainty. The information entropy provides a measure of the spread of a probability distribution. The information entropy of a probability distribution \(p(a)\) is defined as

\[E = -\int da\; p(a)\; \ln[p(a)].\]

Note that the integrand above is zero for both \(p(a) = 1\) and \(p(a)=0\). It’s the intermediate values encountered in a spread-out distribution where the information entropy accumulates. For common distributions, like rectangular or Gaussian, that have characteristic widths \(w\) the entropy goes like \(\ln(w) + C\) with \(C\) values depending on the shape of the distribution.
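For example, carrying out the entropy integral for a normal distribution with standard deviation \(\sigma\) gives

\[E = \ln(\sigma) + \frac{1}{2}\ln(2\pi e),\]

so here \(w = \sigma\) and \(C = \frac{1}{2}\ln(2\pi e)\).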

Now we can define “doing the most good” in terms of how much entropy change \(E(\mathrm{posterior}) - E(\mathrm{prior})\) we might get for predicted measurement values \(y\) at different settings \(x\). Actually, we use something slightly different called the Kullback-Leibler divergence:

\[D^{KL}(y,x) = \int d\theta\; p(\theta |y,x) \ln \left[ \frac{p(\theta | y,x)}{p(\theta)}\right].\]

In this expression \(p(\theta | y,x)\) is a speculative parameter distribution we would get if we happened to measure a value \(y\) using settings \(x\). By itself, \(D^{KL}(y,x)\) doesn’t work as a utility function \(U(x)\) because it depends on this arbitrary possible measurement value \(y\). So we need to average \(D^{KL}\), weighted by the probability of measuring \(y\).

\[U(x) \propto \int dy \int d\theta\; p(y|x) p(\theta |y,x) \ln \left[ \frac{p(\theta | y,x)}{p(\theta)}\right].\]

With two applications of Bayes’ rule and some rearrangement, this expression for \(U(x)\) can be rewritten as the difference between two information entropy-like terms:

Term 1

The information entropy of \(p(y|x)\), the distribution of measurement values expected at setting \(x\). Importantly, this distribution includes likely variations of \(\theta\). Explicitly,

\[p(y|x) = \int d\theta'\; p(\theta') p(y|\theta',x)\]

Qualitatively, this term is the information entropy of the predicted measurement values including both measurement noise and the effects of parameter uncertainty.

Term 2

The other term is the information entropy of \(p(y|\theta,x)\), the distribution of measurement values when \(\theta\) and \(x\) are fixed, i.e. the entropy of just the measurement noise distribution. The entropy of this distribution is averaged over \(\theta\) values:

\[-\int d\theta\; p(\theta) \int dy\; p(y|\theta,x) \ln [ p(y|\theta, x) ].\]

Term 1 is the entropy of the \(\theta\)-averaged \(y\) distribution and Term 2 is the \(\theta\) average of the entropy of the \(y\) distribution. Loosely, Term 1 is a measure of the spread in \(y\) values due to both measurement noise and likely parameter variations, while Term 2 is (mostly) just the measurement noise.
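For reference, carrying out the rearrangement explicitly gives

\[U(x) \propto -\int dy\; p(y|x) \ln[p(y|x)] + \int d\theta\; p(\theta) \int dy\; p(y|\theta,x) \ln[p(y|\theta, x)],\]

i.e. Term 1 minus Term 2 with the entropy sign convention above.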

An accurate calculation of \(U(x)\) is a big job, requiring integrals over all of parameter space and also all possible measurement outcomes, once for every possible setting. Fortunately, in keeping with the “runs good” project philosophy, accuracy is not required. All we require of an approximate utility function is that it provides a guide for non-stupid decisions. It is not critical that the absolute best measurement choice is made every single time. It is only necessary to know whether there are values of \(x\) where \(U(x)\) is large compared to other \(x\). Even if we don’t choose the absolute best setting, a “pretty good” choice will do more good than an uninformed choice.

Consider an approximate calculation \(U^*(x)\) where all of the distributions are assumed to be normal (Gaussian). The information entropy of a normal distribution has a term that goes like \(\ln\)(width). Term 1 from above is a convolution of the measurement noise distribution (width \(\sigma_y\)) and the distribution of model \(y\) values (width \(\sigma_{y,\theta}\)) that reflects the connection to the parameter distribution. A property of normal distributions is that a convolution of normal distributions is another normal distribution, with width \(\sqrt{\sigma_{y,\theta}^2 + \sigma_y^2}\). Under the assumption of normal distributions, we now have an approximate utility function

\[U^*(x) \approx \ln\left(\sqrt{\sigma_{y,\theta}(x)^2 + \sigma_y(x)^2}\right) - \ln(\sigma_y(x)) = \frac{1}{2}\ln\left[\frac{\sigma_{y,\theta}(x)^2}{\sigma_y(x)^2}+1\right].\]

This approximation has some reasonable properties. The dependence on \(\sigma_{y,\theta}\) matches our initial intuition that high-utility parameters are those where measurements vary a lot due to parameter variations. The dependence on measurement noise \(\sigma_y\) also has an intuitive interpretation: that it’s less useful to make measurements at settings \(x\) where the instrumental noise is larger. This approximate utility function is also positive, i.e. more data helps narrow a distribution.
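A compact numpy sketch of this approximation, reusing the curves array from the earlier sketch (so an illustration, not the package source):

    import numpy as np

    def utility_approx(curves, sigma_y):
        """Approximate utility U*(x) under the normal-distribution assumption.

        curves:  model outputs, shape (n_draws, n_settings)
        sigma_y: measurement noise, scalar or shape (n_settings,)
        """
        sigma_y_theta_sq = curves.var(axis=0)   # spread from parameter uncertainty
        return 0.5 * np.log(sigma_y_theta_sq / np.asarray(sigma_y) ** 2 + 1.0)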

Estimate the costs

The utility as described above focuses on the information entropy change, but if the question is about making efficient measurements, the cost of the measurements is fully half of the problem. In the simplest situations, the time spent measuring is the only cost. In more “interesting” situations, there may be a cost associated with changing settings, or the cost of a measurement may depend on the likelihood of damage. So in contrast to the general approach to predicting information entropy change, the cost model is closely tied to local conditions.

Guided Tour

Where the section above treats the theory of Bayesian experimental design, this section provides an introduction to the algorithms and methods in the OptBayesExpt class that perform the “learning fast” and “making good decisions” tasks. An additional subsection describes how the parameter probability distribution is implemented.

Learn fast routine

To input measurement data, the user script invokes pdf_update(measurement_result). The events that follow are illustrated in the sidebar. OptBayesExpt requires the user to provide a measurement_result that represents a complete measurement record with the settings, the value, and the uncertainty. The settings are required because the actual settings may differ from the suggested settings. The settings are used to evaluate the model function for every parameter sample in the parameter probability distribution, producing y_model_data. For each parameter sample, the difference between the model value and the reported measurement value is used to calculate the likelihood of that parameter sample, assuming a Gaussian noise distribution.
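Schematically, one cycle of a measurement loop looks something like the sketch below, where my_obe stands for an OptBayesExpt instance and run_experiment() is a hypothetical stand-in for the user's instrument code (see the demos folder for complete, working scripts):

    # Sketch of one cycle of the measurement loop
    settings = my_obe.opt_setting()                 # or my_obe.good_setting()
    y_meas, sigma = run_experiment(settings)        # user-supplied; hypothetical
    measurement_result = (settings, y_meas, sigma)  # settings, value, uncertainty
    my_obe.pdf_update(measurement_result)           # the "learning fast" step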

The uncertainty deserves some attention. By requiring the measurement uncertainty, OptBayesExpt places the burden of determining the uncertainty on the experimenter. Bluntly, the reason for this requirement is that it makes the programming easier. But often, the noise is one of the things one would like to learn from an experiment.

To include the raw measurement uncertainty as an unknown, it is convenient to create a child class that inherits from OptBayesExpt. The child class may include the uncertainty alongside the actual model parameters in the parameter_samples tuple, as suggested by the dashed arrow in the sidebar. In this arrangement, the uncertainty parameter is not used in the model function, but it does enter the likelihood calculation. After all, some values of uncertainty explain the data better than others. For examples of this approach, see

  • demos/obe_line_plus_noise.py

  • demos/obe_lockin.py

  • demos/obe_sweeper.py

ParticlePdf.bayesian_update() adjusts the parameter distribution (including the uncertainty part) based on the likelihood.

Make good decisions routine

Measurement settings are selected by opt_setting() and good_setting(), both of which rely on utility(). The sidebar illustrates the inner workings of utility(), where the most time-consuming part is yvar_from_parameter_draws(). For each of N_DRAWS samples from the parameter distribution, the model function is evaluated for every possible setting combination, yielding a set of outputs for each setting. The variance of these outputs is compared to the modeled experimental noise, which is constant by default. The examples listed above for unknown uncertainties include the mean of the variance as a model of experimental noise.

In the denominator, the cost of a potential experiment may depend on the settings, on the predicted value, or on the distance to the next setting; there are many possibilities. In OptBayesExpt the cost is constant by default, but customized cost models can be programmed into child classes. In demos/obe_lockin.py, for example, the cost model includes the additional cost of a settling time if the setting is changed from its current value. In demos/obe_sweeper.py, the cost model includes a cost of setting up a sweep plus a cost proportional to the length of the sweep.
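As a sketch of the child-class pattern for a settling-time cost like the lockin example, the outline below may help; note that the cost_estimate() method name and the attribute names used here are assumptions for illustration, not the package's verified API (see demos/obe_lockin.py for the working version):

    import numpy as np
    from optbayesexpt import OptBayesExpt

    class SettlingCostOBE(OptBayesExpt):
        """Hypothetical child class adding a settling-time cost penalty."""

        def __init__(self, *args, settle_cost=5.0, **kwargs):
            super().__init__(*args, **kwargs)
            self.settle_cost = settle_cost
            self.current_x = None              # the last setting actually used

        def cost_estimate(self):
            # Assumed override point: base cost of 1 for every candidate
            # setting, plus a penalty for moving away from the current one.
            x_candidates = self.setting_values[0]   # assumed attribute name
            if self.current_x is None:
                return np.ones_like(x_candidates)
            return 1.0 + self.settle_cost * (x_candidates != self.current_x)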

Once the utility is computed, opt_setting() and good_setting() offer different strategies for selecting settings for the next measurement. The opt_setting() method implements a “greedy” strategy, always selecting the setting that generates the maximum utility. The resulting measurements tend to cluster strongly around a few settings. The good_setting() method uses the utility as a weighting function for a random setting selection. To tilt the odds toward settings with higher utility, the contrast between high and low utility is stretched by raising the utility to a power, pickiness. The contrasting behaviors of opt_setting() and good_setting() with different pickiness values are illustrated by demos/line_plus_noise.py.
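The weighted selection can be sketched in a few lines (a rough stand-in for good_setting(), not the package source):

    import numpy as np

    def pick_setting(utility, pickiness, rng=np.random.default_rng()):
        """Weighted random choice of a setting index, stretched by pickiness."""
        weights = utility ** pickiness       # raise contrast between high and low
        weights = weights / weights.sum()
        return rng.choice(len(utility), p=weights)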

Probability distribution class

The probability distribution over the model parameters is at the core of OptBayesExpt. As of version 1.0, the optbayesexpt package uses the ParticlePdf class to implement the distribution function for N parameters as a cloud or swarm of particles in N-dimensional parameter space. This approach has been given several names, but “sequential Monte Carlo” seems to be gaining traction. Each point in the cloud has coordinates corresponding to parameter values, and also a weight, normalized so that the sum of the weights is 1. The density of particles in a small region and the weights of those particles together provide a flexible representation of the probability density. The figure in the sidebar illustrates how this works for a 1-D normal (Gaussian) distribution. Panel (a) plots 100 draws from the normal distribution, each with a weight of 0.01. The probability distribution is represented only by the density of points in this case. Panel (b) represents the same normal distribution using only the weights. The points are 100 draws from a uniform distribution, but the weights represent the normal distribution.
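In code, the particle representation amounts to an array of coordinates plus a weight vector. A minimal sketch for two parameters (hypothetical numbers, not the ParticlePdf source):

    import numpy as np

    rng = np.random.default_rng()
    n_particles = 1000

    # Coordinates: one row per parameter, one column per particle
    particles = np.vstack([rng.normal(1.0, 0.2, n_particles),    # parameter 0
                           rng.normal(0.0, 0.5, n_particles)])   # parameter 1
    weights = np.full(n_particles, 1.0 / n_particles)            # sums to 1

    # A weighted draw of parameter samples, as used in the utility calculation
    idx = rng.choice(n_particles, size=30, p=weights)
    parameter_draws = particles[:, idx]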

In the learning fast stage, the distribution is modified based on new data. The likelihood is calculated for every point in the distribution cloud, and then ParticlePdf.bayesian_update() multiplies each particle’s weight by its likelihood. So the distribution is easily modified by adjusting weight values.

In the decision making stage, random draws of parameters from the probability distribution are used to calculate utility. Draws are made by random selection of points with probability determined by weights. In panel (a) values at the center are more likely to be chosen because the density is higher there. In panel (b), values at the center are more likely to be chosen because the weights are higher there.

ParticlePdf also performs a self-maintenance function, resample(). As incoming data is used to modify the distribution, some regions of parameter space may develop very low probabilities. The points near the ends of the plot in panel (b) illustrate the issue. These low-weight points will almost never be chosen in random draws, but they consume computational resources. In resampling, N draws are taken from an N-particle distribution, so some high-weight particles will be chosen more than once and low-weight particles may not be chosen at all. Each particle is then given a small random nudge to separate copies of the same original particle, the cloud of points is shrunk slightly to compensate for the one-step diffusion, and the weight is divided evenly among the particles, as shown in panel (c). Resampling relocates low-weight particles to higher-weight neighborhoods so that as measurements narrow the distributions of parameter values, the representation adapts.
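A rough sketch of such a resampling step (a shrink-and-nudge scheme in the Liu-West style; a stand-in for, not a copy of, ParticlePdf.resample()):

    import numpy as np

    def resample(particles, weights, a=0.98, rng=np.random.default_rng()):
        """Resample an (n_params, n_particles) cloud; reset weights to uniform."""
        n = weights.size
        # N weighted draws: high-weight particles are duplicated,
        # low-weight particles tend to disappear
        idx = rng.choice(n, size=n, p=weights)
        new_p = particles[:, idx]
        # Shrink toward the mean, then nudge each particle; the combination
        # separates duplicates while roughly preserving the overall spread
        mean = new_p.mean(axis=1, keepdims=True)
        new_p = mean + a * (new_p - mean)
        sigma = np.sqrt(1.0 - a ** 2) * particles.std(axis=1, keepdims=True)
        new_p = new_p + sigma * rng.standard_normal(new_p.shape)
        return new_p, np.full(n, 1.0 / n)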