# Manual¶

## Introduction¶

This manual describes an implementation of optimal Bayesian experimental design methods to control measurement settings in order to efficiently determine model parameters. In situations where parametric models would conventionally be fit to measurement data in order to obtain model parameters, these methods offer an adaptive measurement strategy capable of reduced uncertainty with fewer required measurements. These methods are therefore most beneficial in situations where measurements are expensive in terms of money, time, risk, labor or other cost. The price for these benefits lies in the complexity of automating such measurements and in the computational load required. It is the goal of this package to assist potential users in overcoming at least the programming hurdles.

Optimal Bayesian experimental design is not new, at least not in the statistics community. A review paper from 1995 by Kathryn Chaloner and Isabella Verinelli reveals that the basic methods had been worked out in preceding decades. The methods implemented here closely follow Xun Huan and Youssef M. Marzouk which emphasizes simulation-based experimental design. Optimal Bayesian experimental design is also an active area of research.

There are at least three important factors that encourage application of these methods today. First, the availability of flexible, modular computer languages such as Python. Second, availability of cheap computational power. Most of all though, an increased awareness of the benefits of code sharing and reuse is growing in scientific communities.

### Philosophy and goals¶

If it sounds good, it is good.

—Duke Ellington

Jazz legend Duke Ellington was talking about music, where it’s all about the sound. For this package, it’s all about being useful, so we have adopted a “runs good” philosophy:

If it’s a struggle to use, it can’t run good.

If technical jargon is a barrier, it can’t run good

If the user finds it useful, it runs good.

If it runs good, it is good.

The goals are modest: to adapt some of the developments in optimal Bayeseian experimental design research for practical use in laboratory settings, giving users tools to make better measurements.

### Requirements for users¶

It takes a little effort to get this software up and running. Here’s what a user will need to supply to get started.

Python 3.x with the

`numpy`

python packageAn experiment that yields measurement results with uncertainty estimates.

A reliable parametric model for the experiment, typically a function with parameters to be determined.

A working knowledge of Python programming, enough to follow examples and program a model function.

## Theory of operation¶

The optimal Bayes experimental design method incorporates two main jobs, which we can describe as “learning early” and “making good decisions”

### Learning early¶

By interpreting measurement data as soon as it becomes available, sequential Bayesian experimental design gains a critical advantage over the traditional measure-then-fit design. With a measure-then-fit strategy, we get information about parameters only at the very end of the process, after all the measurements and fitting are done. In contrast, the optimal Bayesian experimental design method updates our parameter knowledge with each measurement result, so that information-based decisions can be made as data is collected.

The process of digesting new data is Bayesian inference, which frames parameter knowledge in terms of a probability distribution \(p(\theta)\) for an array of parameters \(\theta = [ \theta_0, \theta_1, ...]\). The familiar notation \(a\pm \sigma\) is often a shorthand description of a Gaussian probability distribution. A broad distribution \(p(\theta)\) corresponds to large uncertainty, and if \(p(\theta)\) is a narrow distribution, the uncertainty is small.

When new measurement results \(m\) are taken in to account, there will be a new,
revised probability distribution \(p(\theta|m)\). The vertical bar in the notation
\(p(\theta|m)\) indicates a conditional probability, so \(p(\theta|m)\) is the
distribution of \(\theta\) values *given* \(m\).

Bayes’ rule provides

All of the terms here have technical names. The left side is the
*posterior* distribution, i.e. the distribution of parameters
\(\theta\) after we include \(m\). On the right, distribution
\(p(\theta)\) is the *prior*, representing what we knew about the
parameters \(\theta\) before the measurement. In the denominator,
\(p(m)\) is called the *evidence*, but because it has no \(\theta\)
dependence, it functions just a normalizing constant in this situation.
As wrong as this sounds, we will ignore the *evidence*.

The term that requires attention is in the numerator; \(p(m|\theta)\) is called the
*likelihood*. It’s the probability of getting measurement \(m\)
given variable parameter values \(\theta\). Less formally, the *likelihood* answers the
question: “How well does the model explain the measured value
\(m\), when the model uses different parameter values \(\theta\)?”

In practice, \(m_i\) will be a fixed measurement result to “plug in” for \(m\). It’s important to keep sight of the fact that \(p(m_i|\theta)\) is still a function of theta. Conceptually, we can try out different parameter values in our model to produce a variety of measurement predictions. Some parameter values (the more likely ones) will produce model values closer to \(m_i\) and for other parameters, (the less likely ones), model model value will be further away.

The `optbayesexpt.OptBayesExpt`

class requires the user to report a measurement record
\(m_i\) that includes the measurement settings \(x_i\), the “value” \(y_i\), and
uncertainty \(\sigma_i\). Together, \(y_i\) and \(\sigma_i\) are more than fixed
numbers; they
are shorthand for a probability that a noise-free measurement would yield a value \(y\).
\(y\) given a mean value \(y_i\). If this distribution is symmetric, like a Gaussian, for
example, then \(p(y|y_i, \sigma_i) = p(y_i|y, \sigma_i)\), the probability of measuring
\(y_i\) given a mean value \(y\) that’s provided by the experimental model \(y=y
(x_i,\theta)\).

With this *likelihood*, Bayes theorem provides the updated \(p(\theta|m_i)\).
Then, another measurement \(m_j\) can update \(p(\theta|m_i)\) to
\(p(\theta|m_j, m_i, \ldots)\) and so on. In order to keep the
notation readable, we’ll adopt a convention that \(p(\theta)\)
always represents the most up-to-date parameter distribution that we
have.

This approach assumes that our model function \(y(x, \theta)\) is a good description of our system, and that the measurement noise is Gaussian with standard deviation \(\sigma_i\). On one hand we have to admit that these assumptions don’t allow us to address all important cases. On the other hand, these are the same assumptions we often make in doing least-squares curve fitting.

The method described above puts the responsibility for determining
measurement uncertainty on the measurement reporter, but as an experimental result, uncertainty is
generally a measurement output, not an input. If uncertainty is a parameter to be determined, it
enters the process through the likelihood function given above, but it is not part of the model
function \(y(x_i, \theta)\). See `demos/line_plus_noise.py`

for an example.

### Making good decisions¶

The next important job in the process is figuring out good measurement
settings. The goal is to make the parameter probability distribution
\(p(\theta)\) narrow while minimizing cost. More formally, the
challenge is to develop a *utility function* \(U(x)\) that helps us
to predict and compare the relative benefits of measurements made with
different possible experimental settings \(x\).

Qualitatively, the mechanism for choosing measurement values hinges on the model’s connection between parameter values \(\theta\) and measurement results \(y\). Consider a sampling of several sets of parameter values \(\theta_i\). With these parameter sets, the model can produce a collection of output curves \(y_i(x)\), and generally these curves will be closer together for some settings \(x\) and further apart for others. Intuitively, little would be learned by measuring at \(x\) values where the curves agree. Instead, it would do the most good to “pin down” the results with a measurement at an \(x\) where the predicted \(y_i(x)\) curves disagree.

By drawing samples from the updated parameter distribution \(p(\theta)\) the mechanism above focuses attention on the relevant parts of parameter space, and through the model to relevant settings. Or, stated slightly differently, using an updated parameter distribution helps to avoid wasting measurement resources on low-impact measurements.

#### Estimate benefits¶

To translate such a qualitative argument into code, “doing the most good”
must be defined more precisely in terms of desired changes in
the parameter distribution \(p(\theta)\). Usually, the goal in
determining model parameters is to get unambiguous results with small uncertainty. The *information
entropy* provides a measure of something like a probability distribution. The information entropy
of a probability distribution \(p(a)\) is defined as

Note that the integrand above is zero for both \(p(a) = 1\) and \(p(a)=0\). It’s the intermediate values encountered in a spread-out distribution where the information entropy accumulates. For common distributions, like rectangular or Gaussian, that have characteristic widths \(w\) the entropy goes like \(\ln(w) + C\) with \(C\) values depending on the shape of the distribution.

Now we can define “doing the most good” in terms of how much
entropy change \(E\)(*posterior*) - \(E\)(*prior*) we
might get for predicted measurement values \(y\) at different
settings \(x\). Actually, we use something slightly
different called the Kulback-Liebler divergence:

In this expression \(p(\theta | y,x)\) is a speculative parameter distribution we would get if we happened to measure a value \(y\) using settings \(x\). By itself, \(D^{KL}(y,x)\) doesn’t work as a utility function \(U(x)\) because it depends on this arbitrary possible measurement value \(y\). So we need to average \(D^{KL}\), weighted by the probability of measuring \(y\).

With two applications of Bayes rule and some rearrangement this expression for \(U(x)\) can be rewritten as the difference between two information entropy-like terms:

- Term 1:
The information entropy of \(p(y|x)\), the distribution of measurement values expected at setting \(x\). Importantly this distribution includes likely variations of \(\theta.\) Explicitly,

\[p(y|x) = \int d\theta'\; p(\theta') p(y|\theta',x)\]Qualitatively, this term is the information entropy of the predicted measurement values including both measurement noise and the effects of parameter uncertainty.

- Term 2:
The other term is the information entropy of \(p(y|\theta,x)\) the measurement value distribution when \(\theta\) and \(x\) are fixed, i.e. the entropy of just the measurement noise distribution. The entropy of this distribution is averaged over \(\theta\) values.

\[\int d\theta\; p(\theta) \int dy\; p(y|\theta,x) \ln [ p(y|\theta, x) ]\]

Term 1 is the entropy of the \(\theta\)-averaged \(y\) distribution and Term 2 is the \(\theta\) average of the entropy of the \(y\) distribution. Loosely, Term 1 is a measure of the spread in \(y\) values due to both measurement noise and likely parameter variations, while term 2 is (mostly) just the measurement noise.

An accurate calculation of \(U(x)\) is a big job, requiring integrals over all parameter space and also all possible measurement outcomes, once for every possible setting. Fortunately, in keeping with the “runs good” project philosophy, accuracy is not required. All we require of an approximate utility function is that provides a guide for non-stupid decisions. It is not critical that the absolute best measurement choice is made every single time. It is only necessary to know if there are values of \(x\) where \(U (x)\) is large compared to other \(x\). Even if we don’t choose the absolute best setting, a “pretty good” choice will do more good than an uninformed choice.

Consider an approximate calculation of \(U^*(x)\) where all of the distributions are assumed to be normal (Gaussian). The information entropy of the normal distribution has a term that goes like \(\ln\)(width). Term 1 from above is a convolution of the measurement noise distribution (width = \(\sigma_y\) and the distribution of model \(y\) values (width = \(\sigma_{y,\theta}\)) that reflects the connection to the parameter distribution. A property of normal distributions is that a convolution of normal distributions is another normal distribution with width = \(\sqrt{\sigma_{y,\theta}^2 + \sigma_y^2}\). Under the assumption of normal distributions, we now have an approximate utility function

This approximation has some reasonable properties. The dependence on \(\sigma_{y,\theta}\) matches our initial intuition that high-utility parameters are those where measurements vary a lot due to parameter variations. The dependence on measurement noise \(\sigma_y\) also has an intuitive interpretation: that it’s less useful to make measurements at settings \(x\) where the instrumental noise is larger. This approximate utility function is also positive, i.e. more data helps narrow a distribution.

#### Estimate the costs¶

The utility as described above focuses on the information entropy change, but if the question is about making efficient measurements, the cost of the measurements is fully half of the problem. In the simplest situations, the time spent measuring is the only cost. In more “interesting” situations, there may be a cost associated with changing settings, the cost of a measurement may depend on the likelihood of damage. So in contrast to the general approach to predicting information entropy change, the cost closely associated with local conditions.

## Guided Tour¶

Where the section above treats the theory of Bayesian experimental design,
this section provides an introduction to the algorithms and methods in the
`OptBayesExpt`

class that perform the “learn fast” and “make good decisions” tasks.
An additional subsection describes how the parameter probability
distribution is implemented.

### Learn fast routine¶

To input measurement data, the user script invokes
`pdf_update(measurement_result)`

. The events that follow are illustrated
in the sidebar. `OptBaesExpt`

requires the user to provide a
`measurement_result`

that represents a complete measurement record with
the settings, the value and the uncertainty. The settings are required since
actual settings may differ from suggested settings. The settings are used to
evaluate the model function for every parameter sample in the
parameter probability distribution to produce `y_model_data`

. For each
parameter sample the difference between the model value and the reported
measurement value is used to calculate the likelihood of that parameter
sample, assuming a Gaussian noise distribution.

The uncertainty deserves some attention. By requiring the measurement uncertainty, OptBayesExpt places the burden of determining the uncertainty on the experimenter. Bluntly, the reason for this requirement is that it makes the programming easier. But often, the noise is one of the things one would like to learn from an experiment.

To include the raw measurement uncertainty as an unknown, it is
convenient to create a child class that inherits from OptBayesExpt.
`ChildClass`

may include the uncertainty alongside the
actual model parameters in the `parameter_samples`

tuple, as suggested by
the dashed arrow in the sidebar. In this arrangement, the
uncertainty parameter is not used in the
model function, but it does enter the likelihood calculation. After all,
some values of uncertainty explain the data better than others. For examples of
this approach, see

`demos/obe_line_plus_noise.py`

`demos/obe_lockin.py`

`demos/obe_sweeper.py`

`ParticlePdf.bayesian_update()`

adjusts the parameter distribution
(including the uncertainty part) based on the likelihood.

### Make good decisions routine¶

Measurement settings are selected by `opt_setting()`

and `good_setting`

,
and both of these rely on `utility()`

. The sidebar illustrates the inner
workings of `utility()`

, where the most time-consuming part is
`yvar_from_parameter_draws()`

. For each of `N_DRAWS`

samples from
the parameter distribution the model function is evaluated for every
possible setting combination, yielding a set of outputs for each setting. The
variance of these outputs is compared to the model experimental noise which
is a constant by default. The examples listed above for unknown
uncertainties include the mean of the variance as a model of experimental
noise.

In the denominator, the cost of a potential experiment may depend on
settings, or predicted value or distance to the next setting. There are
many possibilities. In `OptBayesExpt`

the cost is constant by default but
customized cost models can be programmed into child classes.
In `demos/obe_lockin.py`

for example, the cost model includes the additional
cost of a settling time if the setting is changed from its current value. In
`demos/obe_sweeper.py`

, the cost model includes a cost of setting up a
sweep plus a cost proportional to the length of the sweep.

Once the utility is computed, `opt_setting()`

and `good_setting()`

offer different strategies for selecting settings for the next measurement.
The `opt_setting()`

method implements a “greedy” strategy, always
selecting the setting that generates the maximum utility. The
resulting measurements tend to cluster strongly around a few settings. The
`good_setting()`

method uses the utility as a weighting function for a
random setting selection. To tilt the odds toward selecting settings
with higher utility, the contrast between high and low utility is stretched
by raising utility to a power, `pickiness`

. The contrasting behavior of
`opt_settting()`

and `good_setting()`

with different `pickiness`

values are illustrated by `demos/line_plus_noise.py`

### Probability distribution class¶

The probability distribution over the model parameters is at the core of
`OptBayesExpt`

. As of version 1.0, the optbayesexpt package uses the
`particlePDF`

class to implement the distribution function for *N*
parameters as a cloud or swarm of particles in *N*-dimensional parameter
space. This approach has been given several names, but “sequential Monte
Carlo” seems to be gaining traction. Each point in the cloud has
coordinates corresponding to parameter values, and also a weight, normalized
so that the sum of the weights is 1. The
density of particles in a small region and the weights of those particles
together provide a flexible representation of the probability density. The
figure in the sidebar illustrates how this works for a 1-D normal (Gaussian)
distribution. Panel (a) plots 100 draws from the normal distribution, each
with a weight of 0.01. The probability distribution is represented only
by the density of points in this case. Panel (b) represents the same normal
distribution using only the weights. The points are 100 draws from a
*uniform* distribution, but the weights represent the normal distribution.

In the learning fast stage, the distribution is modified based on new data. The likelihood is calculated for every point in the distribution cloud, and then ParticlePdf.bayesian_update() multiplies the particle’s weight by its likelihood. So the distribution is easily modified by adjusting weight values.

In the decision making stage, random draws of parameters from the probability distribution are used to calculate utility. Draws are made by random selection of points with probability determined by weights. In panel (a) values at the center are more likely to be chosen because the density is higher there. In panel (b), values at the center are more likely to be chosen because the weights are higher there.

`ParticlePdf`

also performs a self-maintenance function, `resample()`

.
As the incoming data is used to modify the
distribution, some regions of parameter space may develop very low
probabilities. The points near the ends of the plot in panel (b) illustrate
the issue. These low-weight points will almost never be chosen in
random draws, but they consume computational resources. In resampling, *N*
draws are taken from an *N*-particle distribution, so some high-weight
particles will be chosen more than once and low-weight particles may not be
chosen at all. Each particle is then given a small random nudge to separate
copies of the same original particle, the cloud of points is shrunk
slightly to compensate for the one-step diffusion, and the weight is divided
evenly between particles as shown in panel (c). Resampling relocates
low-weight particles to higher-weight neighborhoods so that as measurements
narrow the distributions of parameter values, the representation adapts.