Data Organization#

All of the extrapolation and interpolation models in the thermoextrap package expect input data to be organized in a certain fashion. To help manage the data, there are data objects that organize it. Even the inputs to these data objects, however, must be organized appropriately. Here we will use data from our ideal gas test system (from thermoextrap.idealgas) to demonstrate this organization, as well as the various options for the types of data that may be provided as input.

# need imports
%matplotlib inline
import numpy as np
import xarray as xr

# import thermoextrap
import thermoextrap as xtrap

# Import idealgas module
from thermoextrap import idealgas
# Define reference beta
beta_ref = 5.6

# And maximum order
order = 2

npart = 1000  # Number of particles (in single configuration)
nconfig = 100_000  # Number of configurations

# Generate all the data we could want
xdata, udata = idealgas.generate_data((nconfig, npart), beta_ref)

Refer to thermoextrap.data for more information on the data classes.

Basics#

Rather than passing data directly to __init__ methods for creating data class objects and simultaneously telling them which dimensions mean what (or expecting that specific dimensions mean a certain thing), thermoextrap uses xarray to label the dimensions of inputs. While labeled dimensions are also useful internally, they mainly help clarify what is expected of user inputs.

Currently, xdata has shape (nconfig,), i.e., one entry per generated configuration, with each entry being the average \(x\) location for the associated configuration.

print(nconfig, xdata.shape)
100000 (100000,)

The dimension over which independent samples vary is the “record” dimension, with its default name in thermoextrap.data being ‘rec’. So when we create an xarray.DataArray object to house the input \(x\) data, we must label that dimension ‘rec’. Same goes for the input potential energy data. Note that the list provided to the argument dims is a list of strings naming the dimensions in the array passed to xarray.DataArray.

xdata = xr.DataArray(xdata, dims=["rec"])
udata = xr.DataArray(udata, dims=["rec"])

Now when we create a data object in thermoextrap to hold the data, we tell it that the “record” dimension, rec_dim, is named ‘rec’. This is the default, but the dimension could be named something different as long as you provide that name to rec_dim.

Note that xv is the argument for the observable \(x\), and uv is the potential energy or appropriate Hamiltonian or thermodynamic conjugate variable.

data = xtrap.DataCentralMomentsVals.from_vals(
    order=order, rec_dim="rec", xv=xdata, uv=udata, central=True
)

A couple more notes are in order about the inputs to any of the thermoextrap.data object variants. First, you only need to provide the order you expect to extrapolate to up front if you’re using the from_vals() constructor. This is because you need to specify the order of moments that will be calculated from the raw data.

The next argument to be aware of is central. This is True by default and tells the data object to work with central moments for calculating derivatives in the background, which it turns out is much more numerically stable than non-central moments. You probably want central to be True, but know that you can change it if you wish.
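The cancellation problem that central moments avoid can be demonstrated with plain NumPy, independent of thermoextrap (a standalone sketch; the data here is synthetic and illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Single-precision samples with a mean much larger than their spread,
# mimicking a large-system potential energy timeseries
u = (1.0e3 + rng.normal(scale=0.01, size=100_000)).astype(np.float32)

# Raw-moment route: var = <U^2> - <U>^2 subtracts two nearly equal
# large numbers, so most significant digits cancel
var_raw = np.mean(u**2) - np.mean(u) ** 2

# Central-moment route: var = <(U - <U>)^2> subtracts first, then
# averages small quantities, preserving accuracy
var_central = np.mean((u - np.mean(u)) ** 2)

print(var_raw, var_central)  # the central result is close to 1e-4
```

Here the true variance is \(10^{-4}\); in single precision the raw-moment route loses it entirely to cancellation, which is one reason central defaults to True.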

Data Structure#

A lot of data is already computed as soon as we create our data object. The original raw data is still stored in xv and uv, and the maximum order in order, but we can already see the central moments appearing if we look at…

data.xv
<xarray.DataArray (rec: 100000)> Size: 800kB
array([0.181 , 0.1687, 0.1713, ..., 0.1763, 0.1761, 0.1829])
Dimensions without coordinates: rec
data.xave
<xarray.DataArray ()> Size: 8B
array(0.1748)

xave is the average observable value.

data.u
<xarray.DataArray (umom: 3)> Size: 24B
array([1.0000e+00, 1.7485e+02, 3.0601e+04])
Dimensions without coordinates: umom

u holds the moments of the potential energy, i.e., \(\langle U^i \rangle\) for the \(i^\mathrm{th}\) index in the array.

For the central moments of the potential energy, \(\langle (U - \langle U \rangle )^i \rangle\), you can look at du

data.du
<xarray.DataArray (umom: 3)> Size: 24B
array([ 1.    ,  0.    , 28.2438])
Dimensions without coordinates: umom

The other necessary component for calculating derivatives is \(\langle x U^i \rangle\), which is in DataCentralMomentsVals.xu

data.xu
<xarray.DataArray (umom: 3)> Size: 24B
array([1.7485e-01, 3.0601e+01, 5.3604e+03])
Dimensions without coordinates: umom

Or, if working with central moments, \(\langle (x - \langle x \rangle) (U - \langle U \rangle)^i \rangle\) is in DataCentralMomentsVals.dxdu

data.dxdu
<xarray.DataArray (umom: 3)> Size: 24B
array([0.    , 0.0282, 0.0078])
Dimensions without coordinates: umom
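The raw and central cross moments are connected by the usual expansion; for the first moment, \(\langle (x - \langle x \rangle)(U - \langle U \rangle) \rangle = \langle x U \rangle - \langle x \rangle \langle U \rangle\). A quick standalone NumPy check of this identity (synthetic data, independent of thermoextrap):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=10_000)
u = 0.5 * x + rng.normal(size=10_000)  # make u correlated with x

# Raw cross moment <x U> and its central counterpart
xu = np.mean(x * u)
dxdu = np.mean((x - x.mean()) * (u - u.mean()))

# <(x - <x>)(U - <U>)> = <x U> - <x><U>
print(np.isclose(dxdu, xu - x.mean() * u.mean()))  # True
```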

All of this information is condensed in values, which contains exactly what we need for computing derivatives.

data.values
<xarray.DataArray (xmom: 2, umom: 3)> Size: 48B
array([[1.0000e+05, 1.7485e+02, 2.8244e+01],
       [1.7485e-01, 2.8244e-02, 7.8139e-03]])
Dimensions without coordinates: xmom, umom

Understanding this internal structure also helps in understanding the possible inputs. In values, the data object has stored moments of \(x\) and \(U\) along with their cross moments. Note that the length of the second dimension, “umom”, short for “U moments”, is just order plus 1. That makes sense if we remember that the zeroth-order derivative is the observable itself and we asked for derivatives up to order. So the index along the second dimension sets the exponent \(i\) on \(U\), i.e., the order of that moment. The first dimension does the same thing, but with \(x\). Regardless of the order, however, we only ever need \(x\) raised to the zeroth or first power inside the average.

The first row in values contains all moments of just \(U\), i.e., \(\langle U^0 \rangle\), \(\langle U^1 \rangle\), \(\langle U^2 \rangle\), etc. The second row in values contains all moments of \(x\) multiplied by \(U\), i.e., \(\langle x U^0 \rangle\), \(\langle x U^1 \rangle\), etc. But note that beyond the powers of 0 and 1 for the first row, and just 0 for the second row, all values shown are central moments, e.g., \(\langle (x - \langle x \rangle)(U - \langle U \rangle)^i \rangle\) or \(\langle (U - \langle U \rangle)^i \rangle\).

In other words, values is a special array with structure…

for i + j <= 1:
    data.values[0, 0] = {sum of weights, or count}
    data.values[1, 0] = {average of x} = \(\langle x \rangle\)
    data.values[0, 1] = {average of u} = \(\langle U \rangle\)

for i + j > 1:
    data.values[i, j] = \(\langle (x - \langle x \rangle)^i (U - \langle U \rangle)^j \rangle\)

To summarize, values contains the bare bones of what is required for calculating derivatives and will be shared in some form or another across all data classes, with this information passed to the functions that compute derivatives.
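The structure above can be reproduced by hand from raw samples with plain NumPy, which may help make the layout concrete (a sketch with illustrative synthetic data, not thermoextrap's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=1000)  # observable samples
u = rng.uniform(size=1000)  # potential energy samples
order = 2

vals = np.empty((2, order + 1))
vals[0, 0] = x.size      # count (sum of unit weights)
vals[1, 0] = x.mean()    # <x>
vals[0, 1] = u.mean()    # <U>

dx = x - x.mean()
du = u - u.mean()
for j in range(2, order + 1):
    vals[0, j] = np.mean(du**j)       # <(U - <U>)^j>
for j in range(1, order + 1):
    vals[1, j] = np.mean(dx * du**j)  # <(x - <x>)(U - <U>)^j>

print(vals.shape)  # (2, 3): (xmom, umom), matching data.values above
```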

Input formats and resampling#

Since values reflects the internal structure, you can also provide it (or something similar in terms of moments) directly if you prefer. You’ll just need to use a different data class, DataCentralMoments, and a constructor that accepts moments, such as DataCentralMoments.from_data().

While DataCentralMomentsVals is designed to work with ‘values’ (i.e., individual observations), DataCentralMoments is designed to work with moments. Both classes can be constructed from ‘values’, but DataCentralMomentsVals retains the underlying values (for resampling, etc.), while DataCentralMoments converts the values to moments and goes from there. Basically, if you have pre-computed moments (e.g., from a simulation), DataCentralMoments is probably what you want to use. Note that resampling for DataCentralMoments is based on resampling the moments over multiple samples.

For example, if we construct a DataCentralMoments object using the DataCentralMoments.from_data() constructor, we have:

data_noboot = xtrap.DataCentralMoments.from_data(data.values)
xr.testing.assert_allclose(data_noboot.values, data.values)
data_noboot.values
<xarray.DataArray (xmom: 2, umom: 3)> Size: 48B
array([[1.0000e+05, 1.7485e+02, 2.8244e+01],
       [1.7485e-01, 2.8244e-02, 7.8139e-03]])
Dimensions without coordinates: xmom, umom

This is identical to data above. Note that the order here is inferred from the passed moments array. Likewise, we could have created this directly from values using DataCentralMoments.from_vals()

data_noboot = xtrap.DataCentralMoments.from_vals(
    xv=xdata, uv=udata, rec_dim="rec", central=True, order=order
)
xr.testing.assert_allclose(data_noboot.values, data.values)
data_noboot.values
<xarray.DataArray (xmom: 2, umom: 3)> Size: 48B
array([[1.0000e+05, 1.7485e+02, 2.8244e+01],
       [1.7485e-01, 2.8244e-02, 7.8139e-03]])
Dimensions without coordinates: xmom, umom

However, since data_noboot is based on just a single average, bootstrapping makes little sense. For example:

data_noboot = xtrap.DataCentralMoments.from_raw(data.values)

try:
    data_noboot.resample(nrep=3).values
except ValueError as e:
    print("caught error!")
    print(e)
caught error!
not implemented for scalar

versus…

data.resample(nrep=3).values
<xarray.DataArray (rep: 3, xmom: 2, umom: 3)> Size: 144B
array([[[1.0000e+05, 1.7485e+02, 2.8276e+01],
        [1.7485e-01, 2.8276e-02, 7.8536e-03]],

       [[1.0000e+05, 1.7483e+02, 2.8195e+01],
        [1.7483e-01, 2.8195e-02, 6.9730e-03]],

       [[1.0000e+05, 1.7487e+02, 2.8451e+01],
        [1.7487e-01, 2.8451e-02, 9.0861e-03]]])
Dimensions without coordinates: rep, xmom, umom

Note that whenever you call .resample(), a new dimension called ‘rep’, short for “repetitions”, is created for the output. This is similar to the ‘rec’ dimension, but helps you keep track of whether you’re working with original data or a bootstrapped sample.
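The idea can be sketched standalone with xarray: draw ‘rec’ indices with replacement for each repetition and stack the draws along a new ‘rep’ dimension (an illustrative sketch of bootstrap resampling, not thermoextrap's actual implementation):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
data = xr.DataArray(rng.normal(size=1000), dims=["rec"])

nrep = 3
nrec = data.sizes["rec"]
# For each repetition, draw nrec samples from 'rec' with replacement
idx = rng.integers(0, nrec, size=(nrep, nrec))
boot = xr.DataArray(data.values[idx], dims=["rep", "rec"])

# Reducing over 'rec' leaves one bootstrap estimate per repetition
means = boot.mean("rec")
print(means.dims)  # ('rep',)
```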

So clearly, if you prefer to provide raw moments and want the uncertainty quantification in thermoextrap to work, you will need to compute those moments from blocks of your data or from repeated simulations. But everything will still work if you prefer to calculate your own moments (saving them periodically during simulations rather than frequently saving all of the configurations, energies, or observables).

As an example, we can work with blocks of data, adding an axis called ‘block’ that we will ask the constructor to average over by specifying the dim argument.

# Make 100 averaged observations
xx = xr.DataArray(xdata.values.reshape(100, -1), dims=["rec", "block"])
uu = xr.DataArray(udata.values.reshape(100, -1), dims=["rec", "block"])
# Create directly from values of moments - notice that this is DataCentralMoments, not DataCentralMomentsVals
# Effectively just means that the 'rec' dim will not be collapsed when using data for extrapolation, etc.
# Behaves similarly to the 'rep' dim when resampling
data_fv = xtrap.DataCentralMoments.from_vals(
    xv=xx, uv=uu, dim="block", order=order, central=True
)
# So 'rec' is for each separate block average
data_fv.values
<xarray.DataArray (rec: 100, xmom: 2, umom: 3)> Size: 5kB
array([[[ 1.0000e+03,  1.7504e+02,  2.8681e+01],
        [ 1.7504e-01,  2.8681e-02,  1.0778e-02]],

       [[ 1.0000e+03,  1.7487e+02,  2.9507e+01],
        [ 1.7487e-01,  2.9507e-02,  3.6761e-03]],

       [[ 1.0000e+03,  1.7471e+02,  2.8956e+01],
        [ 1.7471e-01,  2.8956e-02,  3.4057e-02]],

       [[ 1.0000e+03,  1.7470e+02,  2.9232e+01],
        [ 1.7470e-01,  2.9232e-02, -6.4238e-03]],

       [[ 1.0000e+03,  1.7469e+02,  2.8072e+01],
        [ 1.7469e-01,  2.8072e-02,  5.0937e-03]],

       [[ 1.0000e+03,  1.7457e+02,  2.6426e+01],
        [ 1.7457e-01,  2.6426e-02,  3.2884e-03]],

       [[ 1.0000e+03,  1.7480e+02,  2.8580e+01],
        [ 1.7480e-01,  2.8580e-02,  2.9648e-03]],
...
       [[ 1.0000e+03,  1.7491e+02,  2.8840e+01],
        [ 1.7491e-01,  2.8840e-02,  1.4932e-02]],

       [[ 1.0000e+03,  1.7478e+02,  2.7380e+01],
        [ 1.7478e-01,  2.7380e-02,  2.4111e-02]],

       [[ 1.0000e+03,  1.7492e+02,  2.6600e+01],
        [ 1.7492e-01,  2.6600e-02, -3.0268e-03]],

       [[ 1.0000e+03,  1.7489e+02,  2.8386e+01],
        [ 1.7489e-01,  2.8386e-02, -8.0114e-03]],

       [[ 1.0000e+03,  1.7477e+02,  2.6906e+01],
        [ 1.7477e-01,  2.6906e-02, -2.0863e-04]],

       [[ 1.0000e+03,  1.7486e+02,  2.6945e+01],
        [ 1.7486e-01,  2.6945e-02,  2.7821e-02]],

       [[ 1.0000e+03,  1.7496e+02,  2.7179e+01],
        [ 1.7496e-01,  2.7179e-02, -1.3813e-02]]])
Dimensions without coordinates: rec, xmom, umom

Again, note that we did not use the DataCentralMomentsVals class above, but instead the DataCentralMoments class. The former is for processing and storing simulation “timeseries” of observable and potential energy values, while the latter takes pre-computed moments, including multiple replicates of precomputed moments as above. Behind the scenes, this will influence how bootstrapped confidence intervals are computed.

What’s functionally different, though, is that the ‘rec’ dim also appears in .values. That means that when this data is used in models for extrapolation or interpolation, that dimension will also be preserved. So prediction to a new \(\beta\) value will result in an output matching the size of the ‘rec’ dimension in the same way that it would match the ‘rep’ dimension created through resampling.

If we resample over this data set, we see that we just take nrep random samples from it, putting those samples into a new dimension called ‘rep’.

data_fv.resample(nrep=3).values
<xarray.DataArray (rep: 3, xmom: 2, umom: 3)> Size: 144B
array([[[1.0000e+05, 1.7485e+02, 2.8257e+01],
        [1.7485e-01, 2.8257e-02, 8.0428e-03]],

       [[1.0000e+05, 1.7485e+02, 2.8462e+01],
        [1.7485e-01, 2.8462e-02, 7.3831e-03]],

       [[1.0000e+05, 1.7485e+02, 2.8340e+01],
        [1.7485e-01, 2.8340e-02, 9.6959e-03]]])
Dimensions without coordinates: rep, xmom, umom

If we had computed the moments from the blocked data ourselves, we could also create a data object with the DataCentralMoments.from_ave_raw() constructor (below). Many other constructors exist, including ones that accept central moments if you like. If you use those, please take a look at the documentation to make sure you are specifying or using the correct dimension naming conventions, such as ‘rec’, ‘xmom’, ‘umom’, etc. Remember, if you are extrapolating an observable that depends explicitly on the extrapolation variable, you also need to specify the ‘deriv’ dimension that describes the observable’s derivatives with respect to the extrapolation variable (see the Temperature extrapolation case 2 notebook).

# Compute moments of U, i.e., averages to integer powers up to maximum order desired
mom_u = xr.DataArray(np.arange(order + 1), dims=["umom"])
uave = (uu**mom_u).mean("block")
xuave = (xx * uu**mom_u).mean("block")
data_fa = xtrap.DataCentralMoments.from_ave_raw(
    u=uave, xu=xuave, central=True, w=xx.sizes["block"]
)

xr.testing.assert_allclose(data_fv.values, data_fa.values)

The above .values should be identical to those from the from_vals constructor.

At this point, we have seen how the same data objects that interface with extrapolation or interpolation models can be created from different inputs. Other features also exist, such as specifying weights with the argument w to a constructor to change the weights used during averaging.
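As a standalone illustration of why such weights matter (plain NumPy, with hypothetical numbers): when combining block averages from blocks of unequal length, weighting each block mean by its sample count recovers the overall mean of the pooled data.

```python
import numpy as np

# Hypothetical block means of an observable, from blocks of unequal length
block_means = np.array([0.20, 0.10])
block_counts = np.array([300, 100])  # weights: number of samples per block

# Weighted average recovers the pooled mean: (0.20*300 + 0.10*100)/400 = 0.175
overall = np.average(block_means, weights=block_counts)
print(overall)

# An unweighted mean would give 0.15, biased toward the short block
```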

Vector observables#

Finally, we can also have vector observables, such as RDFs. This is easy to accomplish with any of the above constructors or types of data input. All that is required is to add another dimension to our xarray.DataArray input. Typically, we will call this dimension ‘vals’, short for “values”, which is the default name for this dimension when using the DataCentralMomentsVals.from_vals() constructor.

# Extrapolate both average x and average x**2
x_xsq_data = xr.DataArray(
    np.vstack([xdata.values, xdata.values**2]).T,
    dims=["rec", "vals"],
    coords={"vals": ["x", "xsq"]},
)
data_vec = xtrap.DataCentralMomentsVals.from_vals(
    order=order, rec_dim="rec", xv=x_xsq_data, uv=udata, central=True
)
data_vec.values
<xarray.DataArray (vals: 2, xmom: 2, umom: 3)> Size: 96B
array([[[1.0000e+05, 1.7485e+02, 2.8244e+01],
        [1.7485e-01, 2.8244e-02, 7.8139e-03]],

       [[1.0000e+05, 1.7485e+02, 2.8244e+01],
        [3.0601e-02, 9.8847e-03, 4.3329e-03]]])
Coordinates:
  * vals     (vals) <U3 24B 'x' 'xsq'
Dimensions without coordinates: xmom, umom
data_vec.resample(nrep=3).values
<xarray.DataArray (rep: 3, vals: 2, xmom: 2, umom: 3)> Size: 288B
array([[[[1.0000e+05, 1.7484e+02, 2.8261e+01],
         [1.7484e-01, 2.8261e-02, 7.9176e-03]],

        [[1.0000e+05, 1.7484e+02, 2.8261e+01],
         [3.0598e-02, 9.8903e-03, 4.3704e-03]]],


       [[[1.0000e+05, 1.7482e+02, 2.8255e+01],
         [1.7482e-01, 2.8255e-02, 7.9522e-03]],

        [[1.0000e+05, 1.7482e+02, 2.8255e+01],
         [3.0590e-02, 9.8870e-03, 4.3943e-03]]],


       [[[1.0000e+05, 1.7485e+02, 2.8214e+01],
         [1.7485e-01, 2.8214e-02, 6.9934e-03]],

        [[1.0000e+05, 1.7485e+02, 2.8214e+01],
         [3.0601e-02, 9.8735e-03, 4.0474e-03]]]])
Coordinates:
  * vals     (vals) <U3 24B 'x' 'xsq'
Dimensions without coordinates: rep, xmom, umom

Note that we have simply added a dimension along which all the same operations happen, but independently for different data. The behavior is identical if we work instead with data from other constructors.

xx_xsqxsq = xr.DataArray(
    x_xsq_data.values.reshape(100, -1, 2),
    dims=["rec", "block", "vals"],
    coords={"vals": ["x", "xsq"]},
)
x_xsq_uave = (xx_xsqxsq * uu**mom_u).mean("block")
data_fa_vec = xtrap.DataCentralMoments.from_ave_raw(
    u=uave, xu=x_xsq_uave, central=True, w=xx_xsqxsq.sizes["block"]
)
data_fa_vec.reduce("rec").values
<xarray.DataArray (vals: 2, xmom: 2, umom: 3)> Size: 96B
array([[[1.0000e+05, 1.7485e+02, 2.8244e+01],
        [1.7485e-01, 2.8244e-02, 7.8139e-03]],

       [[1.0000e+05, 1.7485e+02, 2.8244e+01],
        [3.0601e-02, 9.8847e-03, 4.3329e-03]]])
Coordinates:
  * vals     (vals) <U3 24B 'x' 'xsq'
Dimensions without coordinates: xmom, umom
data_fa_vec.resample(nrep=3).values
<xarray.DataArray (rep: 3, vals: 2, xmom: 2, umom: 3)> Size: 288B
array([[[[1.0000e+05, 1.7485e+02, 2.8203e+01],
         [1.7485e-01, 2.8203e-02, 6.0452e-03]],

        [[1.0000e+05, 1.7485e+02, 2.8203e+01],
         [3.0600e-02, 9.8685e-03, 3.7024e-03]]],


       [[[1.0000e+05, 1.7486e+02, 2.8246e+01],
         [1.7486e-01, 2.8246e-02, 8.5305e-03]],

        [[1.0000e+05, 1.7486e+02, 2.8246e+01],
         [3.0604e-02, 9.8867e-03, 4.5961e-03]]],


       [[[1.0000e+05, 1.7486e+02, 2.8203e+01],
         [1.7486e-01, 2.8203e-02, 1.1006e-02]],

        [[1.0000e+05, 1.7486e+02, 2.8203e+01],
         [3.0603e-02, 9.8740e-03, 5.4435e-03]]]])
Coordinates:
  * vals     (vals) <U3 24B 'x' 'xsq'
Dimensions without coordinates: rep, xmom, umom