Data Organization#
All of the extrapolation and interpolation models in the thermoextrap package expect input data to be organized in a certain fashion. To help manage the data, there are data objects that organize it; even the inputs to these data objects, however, must be structured appropriately. Here we will use data from our ideal gas test system (from thermoextrap.idealgas) to demonstrate this organization, as well as the various options for the types of data that may be provided as input.
# need imports
%matplotlib inline
import numpy as np
import xarray as xr
# import thermoextrap
import thermoextrap as xtrap
# Import idealgas module
from thermoextrap import idealgas
# Define reference beta
beta_ref = 5.6
# And maximum order
order = 2
npart = 1000 # Number of particles (in single configuration)
nconfig = 100_000 # Number of configurations
# Generate all the data we could want
xdata, udata = idealgas.generate_data((nconfig, npart), beta_ref)
Refer to thermoextrap.data for more information on the data classes.
Basics#
Rather than passing data directly to __init__ methods when creating data class objects and simultaneously telling them which dimensions mean what (or expecting that specific dimensions mean a certain thing), thermoextrap uses xarray to label the dimensions of inputs. Besides being useful behind the scenes, this helps clarify what is expected of user inputs.
Currently, xdata has shape (nconfig,): one entry per generated configuration, each being the average \(x\) location for that configuration.
print(nconfig, xdata.shape)
100000 (100000,)
The dimension over which independent samples vary is the “record” dimension, whose default name in thermoextrap.data is ‘rec’. So when we create an xarray.DataArray object to house the input \(x\) data, we must label that dimension ‘rec’. The same goes for the input potential energy data. Note that the list provided to the dims argument names the dimensions of the array passed to xarray.DataArray.
xdata = xr.DataArray(xdata, dims=["rec"])
udata = xr.DataArray(udata, dims=["rec"])
Now when we create a data object in thermoextrap to hold the data, we tell it that the “record” dimension, rec_dim, is named ‘rec’. This is the default, but it could be named something different as long as you provide that name to rec_dim.
Note that xv is the argument for the observable \(x\), and uv is the potential energy (or the appropriate Hamiltonian or thermodynamic conjugate variable).
data = xtrap.DataCentralMomentsVals.from_vals(
    order=order, rec_dim="rec", xv=xdata, uv=udata, central=True
)
A couple more notes are in order about the inputs to any of the thermoextrap.data object variants. First, you only need to provide the order you expect to extrapolate to up front if you’re using the from_vals() constructor. This is because the order determines which moments will be calculated from the raw data.
The next argument to be aware of is central. It is True by default and tells the data object to work with central moments when calculating derivatives in the background, which turns out to be much more numerically stable than using non-central (raw) moments. You probably want central to be True, but know that you can change it if you wish.
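As a quick illustration of that last point, the same object can be built with raw moments instead. This is a minimal sketch; the stored moments are then raw rather than central, and derivative calculations proceed from those.
# Minimal sketch: same construction, but with raw (non-central) moments.
# Numerically less stable, per the note above.
data_raw_moments = xtrap.DataCentralMomentsVals.from_vals(
    order=order, rec_dim="rec", xv=xdata, uv=udata, central=False
)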
Data Structure#
A lot of data is already computed as soon as we create our data object. The original raw data is still stored in xv and uv, and the order in order, but we can already see the computed moments if we look at the attributes below.
data.xv
<xarray.DataArray (rec: 100000)> Size: 800kB array([0.181 , 0.1687, 0.1713, ..., 0.1763, 0.1761, 0.1829]) Dimensions without coordinates: rec
data.xave
<xarray.DataArray ()> Size: 8B array(0.1748)
xave is the average observable value.
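As a quick sanity check, this matches the direct average of the raw samples:
# Average of the raw x samples over the 'rec' dimension
xr.testing.assert_allclose(data.xave, xdata.mean("rec"))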
data.u
<xarray.DataArray (umom: 3)> Size: 24B array([1.0000e+00, 1.7485e+02, 3.0601e+04]) Dimensions without coordinates: umom
u contains the moments of the potential energy, i.e., \(\langle U^i \rangle\) for the \(i^\mathrm{th}\) index in the array.
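We can verify this directly from the raw samples by broadcasting integer powers along a new ‘umom’ dimension:
# Powers 0..order along a 'umom' dimension
imom = xr.DataArray(np.arange(order + 1), dims=["umom"])
xr.testing.assert_allclose(data.u, (udata**imom).mean("rec"))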
For the central moments of the potential energy, \(\langle (U - \langle U \rangle )^i \rangle\), you can look at du:
data.du
<xarray.DataArray (umom: 3)> Size: 24B array([ 1. , 0. , 28.2438]) Dimensions without coordinates: umom
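The same check works for the central moments (reusing imom from above):
# Central moments of U computed directly from the raw samples
xr.testing.assert_allclose(data.du, ((udata - udata.mean("rec")) ** imom).mean("rec"))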
The other necessary component for calculating derivatives is \(\langle x U^i \rangle\), which is in DataCentralMomentsVals.xu:
data.xu
<xarray.DataArray (umom: 3)> Size: 24B array([1.7485e-01, 3.0601e+01, 5.3604e+03]) Dimensions without coordinates: umom
Or, if working with central moments, \(\langle (x - \langle x \rangle) (U - \langle U \rangle)^i \rangle\) is in DataCentralMomentsVals.dxdu:
data.dxdu
<xarray.DataArray (umom: 3)> Size: 24B array([0. , 0.0282, 0.0078]) Dimensions without coordinates: umom
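Both cross-moment arrays also follow directly from the raw samples (again reusing imom):
# Raw and central cross moments of x with powers of U
xr.testing.assert_allclose(data.xu, (xdata * udata**imom).mean("rec"))
xr.testing.assert_allclose(
    data.dxdu,
    ((xdata - xdata.mean("rec")) * (udata - udata.mean("rec")) ** imom).mean("rec"),
)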
All of this information is condensed in values, which contains exactly what we need for computing derivatives.
data.values
<xarray.DataArray (xmom: 2, umom: 3)> Size: 48B array([[1.0000e+05, 1.7485e+02, 2.8244e+01], [1.7485e-01, 2.8244e-02, 7.8139e-03]]) Dimensions without coordinates: xmom, umom
Understanding this internal structure will also help in understanding the possible inputs. In values, the data object has stored the moments \(\langle x \rangle\) and \(\langle U^i \rangle\) along with the cross moments \(\langle x U^i \rangle\) (in central form where appropriate). Note that the length of the second dimension, “umom” (short for “U moments”), is just order plus 1. That makes sense if we remember that the zeroth-order derivative is the observable itself and we asked for derivatives up to order. So the index along the second dimension is the exponent \(i\) on \(U\), i.e., the order of that moment. The first dimension, “xmom”, does the same thing for \(x\). Regardless of the order, however, we only ever need \(x\) raised to the zeroth or first power inside the average.
The first row in values contains moments of \(U\) alone, i.e., \(\langle U^0 \rangle\), \(\langle U^1 \rangle\), \(\langle U^2 \rangle\), etc. The second row contains moments of \(x\) multiplied by powers of \(U\), i.e., \(\langle x U^0 \rangle\), \(\langle x U^1 \rangle\), etc. Note, though, that beyond the powers 0 and 1 in the first row, and the power 0 in the second row, all values shown are central moments, e.g., \(\langle (x - \langle x \rangle)(U - \langle U \rangle)^i \rangle\) or \(\langle (U - \langle U \rangle)^i \rangle\).
In other words, values is a special array with the structure:

for i + j <= 1:
    data.values[0, 0] = {sum of weights, i.e., the sample count}
    data.values[1, 0] = {average of x} = \(\langle x \rangle\)
    data.values[0, 1] = {average of u} = \(\langle U \rangle\)
for i + j > 1:
    data.values[i, j] = \(\langle (x - \langle x \rangle)^i (U - \langle U \rangle)^j \rangle\)
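We can spot-check this layout against the quantities shown above:
vals = data.values
assert vals[0, 0].item() == nconfig  # sum of weights = number of samples
xr.testing.assert_allclose(vals[1, 0], data.xave)  # <x>
xr.testing.assert_allclose(vals[0, 1], data.u.isel(umom=1))  # <U>
xr.testing.assert_allclose(vals[0, 2], data.du.isel(umom=2))  # <(U - <U>)^2>
xr.testing.assert_allclose(vals[1, 1], data.dxdu.isel(umom=1))  # <(x - <x>)(U - <U>)>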
To summarize, values contains the bare bones of what is required for calculating derivatives and will be shared in some form or another across all data classes, with this information passed to the functions that compute derivatives.
Input formats and resampling#
Since values reflects the internal structure, you can provide it (or something similar to it in terms of moments) directly if you prefer. You’ll just need to use a different data class, DataCentralMoments, and a different constructor method, DataCentralMoments.from_vals().
While DataCentralMomentsVals is designed to work with ‘values’ (i.e., individual observations), DataCentralMoments is designed to work with moments. Both classes can be constructed from ‘values’, but DataCentralMomentsVals retains the underlying values (for resampling, etc.), while DataCentralMoments converts the values to moments and goes from there. Basically, if you have pre-computed moments (e.g., from a simulation), DataCentralMoments is probably what you want to use. Note that resampling for DataCentralMoments works by resampling over multiple samples of the moments.
For example, if we construct a DataCentralMoments object using the DataCentralMoments.from_data() constructor, we have:
data_noboot = xtrap.DataCentralMoments.from_data(data.values)
xr.testing.assert_allclose(data_noboot.values, data.values)
data_noboot.values
<xarray.DataArray (xmom: 2, umom: 3)> Size: 48B array([[1.0000e+05, 1.7485e+02, 2.8244e+01], [1.7485e-01, 2.8244e-02, 7.8139e-03]]) Dimensions without coordinates: xmom, umom
This is identical to data above. Note that the order here is inferred from the passed moments array. Likewise, we could have created this from values using DataCentralMoments.from_vals():
data_noboot = xtrap.DataCentralMoments.from_vals(
    xv=xdata, uv=udata, rec_dim="rec", central=True, order=order
)
xr.testing.assert_allclose(data_noboot.values, data.values)
data_noboot.values
<xarray.DataArray (xmom: 2, umom: 3)> Size: 48B array([[1.0000e+05, 1.7485e+02, 2.8244e+01], [1.7485e-01, 2.8244e-02, 7.8139e-03]]) Dimensions without coordinates: xmom, umom
However, since data_noboot is based on just a single average, bootstrapping makes little sense. For example:
data_noboot = xtrap.DataCentralMoments.from_raw(data.values)
try:
    data_noboot.resample(sampler={"nrep": 3}).values
except ValueError as e:
    print("caught error!")
    print(e)
caught error!
'rec' not found in array dimensions ('xmom', 'umom')
versus…
data.resample(sampler={"nrep": 3}).values
<xarray.DataArray (rep: 3, xmom: 2, umom: 3)> Size: 144B array([[[1.0000e+05, 1.7485e+02, 2.8276e+01], [1.7485e-01, 2.8276e-02, 7.8536e-03]], [[1.0000e+05, 1.7483e+02, 2.8195e+01], [1.7483e-01, 2.8195e-02, 6.9730e-03]], [[1.0000e+05, 1.7487e+02, 2.8451e+01], [1.7487e-01, 2.8451e-02, 9.0861e-03]]]) Dimensions without coordinates: rep, xmom, umom
Note that whenever you call .resample(), a new dimension called ‘rep’, short for “repetitions”, is created in the output. This is similar to the ‘rec’ dimension, but helps you keep track of whether you’re working with original data or a bootstrapped sample.
So, clearly, if you want to provide raw moments and still have the uncertainty quantification in thermoextrap work, you will need to build those moments from blocks of your data or from repeated simulations. But the code will all still work if you prefer to calculate your own moments (saving them periodically from simulations rather than frequently saving all of the configurations, energies, or observables).
As an example, we can work with blocks of data, adding an axis called ‘block’ that we will ask the constructor to average over by specifying the dim argument.
# Make 100 averaged observations
xx = xr.DataArray(xdata.values.reshape(100, -1), dims=["rec", "block"])
uu = xr.DataArray(udata.values.reshape(100, -1), dims=["rec", "block"])
# Create directly from values of moments - notice that this is DataCentralMoments, not DataCentralMomentsVals
# Effectively just means that the 'rec' dim will not be collapsed when using data for extrapolation, etc.
# Behaves similarly to the 'rep' dim when resampling
data_fv = xtrap.DataCentralMoments.from_vals(
    xv=xx, uv=uu, dim="block", order=order, central=True
)
# So 'rec' is for each separate block average
data_fv.values
<xarray.DataArray (rec: 100, xmom: 2, umom: 3)> Size: 5kB array([[[ 1.0000e+03, 1.7504e+02, 2.8681e+01], [ 1.7504e-01, 2.8681e-02, 1.0778e-02]], [[ 1.0000e+03, 1.7487e+02, 2.9507e+01], [ 1.7487e-01, 2.9507e-02, 3.6761e-03]], [[ 1.0000e+03, 1.7471e+02, 2.8956e+01], [ 1.7471e-01, 2.8956e-02, 3.4057e-02]], [[ 1.0000e+03, 1.7470e+02, 2.9232e+01], [ 1.7470e-01, 2.9232e-02, -6.4238e-03]], [[ 1.0000e+03, 1.7469e+02, 2.8072e+01], [ 1.7469e-01, 2.8072e-02, 5.0937e-03]], [[ 1.0000e+03, 1.7457e+02, 2.6426e+01], [ 1.7457e-01, 2.6426e-02, 3.2884e-03]], [[ 1.0000e+03, 1.7480e+02, 2.8580e+01], [ 1.7480e-01, 2.8580e-02, 2.9648e-03]], ... [[ 1.0000e+03, 1.7491e+02, 2.8840e+01], [ 1.7491e-01, 2.8840e-02, 1.4932e-02]], [[ 1.0000e+03, 1.7478e+02, 2.7380e+01], [ 1.7478e-01, 2.7380e-02, 2.4111e-02]], [[ 1.0000e+03, 1.7492e+02, 2.6600e+01], [ 1.7492e-01, 2.6600e-02, -3.0268e-03]], [[ 1.0000e+03, 1.7489e+02, 2.8386e+01], [ 1.7489e-01, 2.8386e-02, -8.0114e-03]], [[ 1.0000e+03, 1.7477e+02, 2.6906e+01], [ 1.7477e-01, 2.6906e-02, -2.0863e-04]], [[ 1.0000e+03, 1.7486e+02, 2.6945e+01], [ 1.7486e-01, 2.6945e-02, 2.7821e-02]], [[ 1.0000e+03, 1.7496e+02, 2.7179e+01], [ 1.7496e-01, 2.7179e-02, -1.3813e-02]]]) Dimensions without coordinates: rec, xmom, umom
Again, note that we did not use the DataCentralMomentsVals class above, but instead the DataCentralMoments class. The former is for processing and storing simulation “timeseries” of observable and potential energy values, while the latter takes pre-computed moments, including multiple replicates of pre-computed moments as above. Behind the scenes, this influences how bootstrapped confidence intervals are computed.
What’s functionally different, though, is that the ‘rec’ dim also appears in .values. That means that when this data is used in models for extrapolation or interpolation, that dimension will be preserved: prediction at a new \(\beta\) value will produce output matching the size of the ‘rec’ dimension, just as it would match the ‘rep’ dimension created through resampling.
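As a check, reducing over ‘rec’ (with the reduce() method, also used later in this notebook) combines the block moments into a single overall estimate, recovering the full-data values from data above:
# Combining the 100 block averages reproduces the single full-data estimate
xr.testing.assert_allclose(data_fv.reduce("rec").values, data.values)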
If we resample over this data set, we just take nrep random samples from it, putting those samples into a new dimension called ‘rep’.
data_fv.resample(sampler={"nrep": 3}).values
<xarray.DataArray (rep: 3, xmom: 2, umom: 3)> Size: 144B array([[[1.0000e+05, 1.7485e+02, 2.8257e+01], [1.7485e-01, 2.8257e-02, 8.0428e-03]], [[1.0000e+05, 1.7485e+02, 2.8462e+01], [1.7485e-01, 2.8462e-02, 7.3831e-03]], [[1.0000e+05, 1.7485e+02, 2.8340e+01], [1.7485e-01, 2.8340e-02, 9.6959e-03]]]) Dimensions without coordinates: rep, xmom, umom
If we had computed the moments from the blocked data ourselves, we could also create a data object with the DataCentralMoments.from_ave_raw() constructor (below). Many other constructors exist, including ones that take central moments if you like. If you use those, please check the documentation to make sure you are specifying the correct dimension naming conventions, such as ‘rec’, ‘xmom’, ‘umom’, etc. Remember, if you are extrapolating an observable that depends explicitly on the extrapolation variable, you also need to specify the ‘deriv’ dimension that holds the observed derivatives with respect to that variable (see the Temperature extrapolation case 2 notebook).
# Compute moments of U, i.e., averages to integer powers up to maximum order desired
mom_u = xr.DataArray(np.arange(order + 1), dims=["umom"])
uave = (uu**mom_u).mean("block")
xuave = (xx * uu**mom_u).mean("block")
data_fa = xtrap.DataCentralMoments.from_ave_raw(
    u=uave, xu=xuave, central=True, weight=xx.sizes["block"]
)
xr.testing.assert_allclose(data_fv.values, data_fa.values)
The above .values should be identical to those from the from_vals constructor.
At this point, we have seen how the same data objects that interface with extrapolation or interpolation models can be created from different inputs. Other features exist as well, such as specifying weights with the argument w to a constructor to change the weights used during averaging.
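As a minimal sketch of that last point, uniform weights should reproduce the unweighted result (this assumes the constructor defaults to a weight of one per sample when w is omitted):
# Uniform weights; should match the unweighted construction
# (assumes the default weight is one per sample)
data_w = xtrap.DataCentralMomentsVals.from_vals(
    order=order, rec_dim="rec", xv=xdata, uv=udata, w=xr.ones_like(udata), central=True
)
xr.testing.assert_allclose(data_w.values, data.values)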
Vector observables#
Finally, we can also have vector observables, like RDFs, for example. This is easy to accomplish with any of the above constructors or types of data input. All that is required is to add another dimension to our xarray.DataArray
input. Typically, we will call this dimension ‘vals’, short for “values”, which is the default name for this dimension when using the DataCentralMomentsVals.from_vals()
constructor.
# Extrapolate both average x and average x**2
x_xsq_data = xr.DataArray(
    np.vstack([xdata.values, xdata.values**2]).T,
    dims=["rec", "vals"],
    coords={"vals": ["x", "xsq"]},
)
data_vec = xtrap.DataCentralMomentsVals.from_vals(
    order=order, rec_dim="rec", xv=x_xsq_data, uv=udata, central=True
)
data_vec.values
<xarray.DataArray (vals: 2, xmom: 2, umom: 3)> Size: 96B array([[[1.0000e+05, 1.7485e+02, 2.8244e+01], [1.7485e-01, 2.8244e-02, 7.8139e-03]], [[1.0000e+05, 1.7485e+02, 2.8244e+01], [3.0601e-02, 9.8847e-03, 4.3329e-03]]]) Coordinates: * vals (vals) <U3 24B 'x' 'xsq' Dimensions without coordinates: xmom, umom
data_vec.resample(sampler={"nrep": 3}).values
<xarray.DataArray (rep: 3, vals: 2, xmom: 2, umom: 3)> Size: 288B array([[[[1.0000e+05, 1.7484e+02, 2.8261e+01], [1.7484e-01, 2.8261e-02, 7.9176e-03]], [[1.0000e+05, 1.7484e+02, 2.8261e+01], [3.0598e-02, 9.8903e-03, 4.3704e-03]]], [[[1.0000e+05, 1.7482e+02, 2.8255e+01], [1.7482e-01, 2.8255e-02, 7.9522e-03]], [[1.0000e+05, 1.7482e+02, 2.8255e+01], [3.0590e-02, 9.8870e-03, 4.3943e-03]]], [[[1.0000e+05, 1.7485e+02, 2.8214e+01], [1.7485e-01, 2.8214e-02, 6.9934e-03]], [[1.0000e+05, 1.7485e+02, 2.8214e+01], [3.0601e-02, 9.8735e-03, 4.0474e-03]]]]) Coordinates: * vals (vals) <U3 24B 'x' 'xsq' Dimensions without coordinates: rep, xmom, umom
Note that we have simply added a dimension along which all the same operations happen independently for each observable. The behavior is identical if we instead work with data from the other constructors.
xx_xsqxsq = xr.DataArray(
    x_xsq_data.values.reshape(100, -1, 2),
    dims=["rec", "block", "vals"],
    coords={"vals": ["x", "xsq"]},
)
x_xsq_uave = (xx_xsqxsq * uu**mom_u).mean("block")
data_fa_vec = xtrap.DataCentralMoments.from_ave_raw(
    u=uave, xu=x_xsq_uave, central=True, weight=xx_xsqxsq.sizes["block"]
)
data_fa_vec.reduce("rec").values
<xarray.DataArray (vals: 2, xmom: 2, umom: 3)> Size: 96B array([[[1.0000e+05, 1.7485e+02, 2.8244e+01], [1.7485e-01, 2.8244e-02, 7.8139e-03]], [[1.0000e+05, 1.7485e+02, 2.8244e+01], [3.0601e-02, 9.8847e-03, 4.3329e-03]]]) Coordinates: * vals (vals) <U3 24B 'x' 'xsq' Dimensions without coordinates: xmom, umom
data_fa_vec.resample(sampler={"nrep": 3}).values
<xarray.DataArray (rep: 3, vals: 2, xmom: 2, umom: 3)> Size: 288B array([[[[1.0000e+05, 1.7485e+02, 2.8203e+01], [1.7485e-01, 2.8203e-02, 6.0452e-03]], [[1.0000e+05, 1.7485e+02, 2.8203e+01], [3.0600e-02, 9.8685e-03, 3.7024e-03]]], [[[1.0000e+05, 1.7486e+02, 2.8246e+01], [1.7486e-01, 2.8246e-02, 8.5305e-03]], [[1.0000e+05, 1.7486e+02, 2.8246e+01], [3.0604e-02, 9.8867e-03, 4.5961e-03]]], [[[1.0000e+05, 1.7486e+02, 2.8203e+01], [1.7486e-01, 2.8203e-02, 1.1006e-02]], [[1.0000e+05, 1.7486e+02, 2.8203e+01], [3.0603e-02, 9.8740e-03, 5.4435e-03]]]]) Coordinates: * vals (vals) <U3 24B 'x' 'xsq' Dimensions without coordinates: rep, xmom, umom