Build an xarray.Dataset from Scratch

In this How-To we’ll go through the process of building up an xarray.Dataset that could be used as an input to Pipeline.calculate. We’ll generate random compositions and fake data to go along with these compositions.

The dataset generated in this notebook is the basis for the Building Pipelines tutorial.

Google Colab Setup

Only uncomment and run the next cell if you are running this notebook in Google Colab or if you don’t already have the AFL-agent package installed.

[ ]:
# !pip install git+https://github.com/usnistgov/AFL-agent.git

First Steps

To begin, let’s import the necessary libraries for this document and then make an empty xarray.Dataset.

[1]:
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

ds = xr.Dataset()
ds
[1]:
<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*

Compositions

Next, we generate random ‘compositions’ at which we’ll perform simulated (virtual) measurements. We’ll generate the compositions for a 2-dimensional space with components “A” and “B” as placeholders. You could imagine that A and B are the concentrations of two different preservatives in a liquid mixture.

[2]:
num_measurements = 100
A = np.random.uniform(0,10,size=num_measurements)
B = np.random.uniform(0,25,size=num_measurements)
compositions = np.array([A,B]).T
compositions
[2]:
array([[ 1.93506959,  4.33877746],
       [ 3.99993228, 15.11981127],
       [ 5.14403166,  1.65850983],
       [ 4.57883235, 12.34183192],
       [ 8.05567528, 10.47358865],
       [ 1.04161007, 22.83361697],
       [ 5.85757901, 18.81270953],
       [ 4.29558185, 16.91442648],
       [ 3.14950211,  2.08947439],
       [ 7.65251749, 13.05015789],
       [ 5.58833051,  7.55301393],
       [ 8.75864958, 13.90004698],
       [ 9.03579843,  1.04110731],
       [ 6.94709288, 22.03909555],
       [ 6.30872735, 23.08178649],
       [ 8.38214203, 24.28281802],
       [ 6.09861924,  5.67560421],
       [ 0.12177663,  1.50263558],
       [ 5.4271795 , 24.05339109],
       [ 3.32773495, 12.91733508],
       [ 8.22656778, 18.12750637],
       [ 7.77255352,  1.61982803],
       [ 1.58024907,  8.89957219],
       [ 0.62964449, 13.64241493],
       [ 4.87628903, 11.61657773],
       [ 5.13234999, 15.35880089],
       [ 3.02453785, 10.78092744],
       [ 6.29667967,  9.70296135],
       [ 2.87889006, 21.23149213],
       [ 7.33476093, 23.21218938],
       [ 6.96178175, 14.54078124],
       [ 9.94068744, 13.1306839 ],
       [ 5.81347609, 18.31807371],
       [ 3.65653358,  1.86819243],
       [ 8.97030945, 15.30596251],
       [ 2.93130382, 20.47763125],
       [ 9.37753719, 17.91516725],
       [ 0.41716654,  1.43289172],
       [ 1.1435741 ,  4.91629814],
       [ 1.08112256, 11.13119176],
       [ 5.25354877,  9.45231719],
       [ 7.20589227, 18.57879028],
       [ 7.89243271, 11.31509607],
       [ 6.94687163, 14.82218341],
       [ 7.23403931,  6.79257162],
       [ 8.68249381, 12.53839805],
       [ 1.06877839, 10.32097668],
       [ 7.89831494,  0.4644321 ],
       [ 1.55458517, 24.16727467],
       [ 0.45829217,  3.24497194],
       [ 8.9361479 , 15.47144486],
       [ 6.41770086, 10.88253153],
       [ 5.88298706, 18.25514364],
       [ 1.345648  , 11.08596411],
       [ 1.03642353, 21.48281264],
       [ 8.17349873, 17.80265345],
       [ 8.24719952, 23.52786849],
       [ 7.19131724,  9.11486603],
       [ 0.78123421,  3.94805308],
       [ 6.24072184, 17.0863877 ],
       [ 0.50409961, 16.27277289],
       [ 8.38188047,  1.72138619],
       [ 2.60295083,  4.67442285],
       [ 2.58444436, 16.57143555],
       [ 9.35239534, 16.00199993],
       [ 6.50574801, 17.33916381],
       [ 3.27476499,  7.83797279],
       [ 7.75461602, 23.60987228],
       [ 0.55448762, 24.08902153],
       [ 3.82063523, 20.43360895],
       [ 1.96279418,  8.66791611],
       [ 8.31501072,  0.0249756 ],
       [ 2.4606    ,  5.38843633],
       [ 9.1328654 , 19.8844552 ],
       [ 4.08471932,  1.10690158],
       [ 8.48469695, 11.10944636],
       [ 5.99946032, 10.58338866],
       [ 7.99427379, 10.09969   ],
       [ 8.71171219,  7.02035713],
       [ 7.30678607, 24.33877297],
       [ 7.76378269, 21.99343589],
       [ 7.45991402,  8.11174006],
       [ 7.63854343, 13.54025074],
       [ 0.81733253,  9.43775453],
       [ 7.20059135,  5.85811728],
       [ 2.20804772,  6.7721729 ],
       [ 3.65472461,  4.43961665],
       [ 8.21345636, 10.74735116],
       [ 9.92147334, 15.55468087],
       [ 9.52663079,  6.77762994],
       [ 7.01379838, 22.20232276],
       [ 0.23914684,  3.4963898 ],
       [ 3.46456904, 12.84968187],
       [ 0.26618359,  8.12557401],
       [ 1.1156883 ,  0.45431966],
       [ 6.91416382, 21.4816626 ],
       [ 0.59623227,  7.69198605],
       [ 2.02031934,  4.93580074],
       [ 5.21402634,  4.84272796],
       [ 7.87824238, 14.33035062]])
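
Because the compositions are drawn at random, every run of the cell above produces different values. If repeatable values are needed, one option is a seeded NumPy generator. The sketch below is a minimal illustration; the seed value 42 is an arbitrary assumption, and the numbers it produces will not match the output shown above.

```python
import numpy as np

# Assumption: the seed value is arbitrary and only makes the draw repeatable.
rng = np.random.default_rng(seed=42)

num_measurements = 100
A = rng.uniform(0, 10, size=num_measurements)  # component A in [0, 10)
B = rng.uniform(0, 25, size=num_measurements)  # component B in [0, 25)

# Stack the two 1-D arrays as columns: one row per sample, one column per component
compositions = np.column_stack([A, B])
print(compositions.shape)  # (100, 2)
```

Here `np.column_stack([A, B])` yields the same array as `np.array([A, B]).T` in the cell above.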

Now let’s add this information to the xarray.Dataset.

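Note how, for the composition variable, we need to specify not only the name of the variable in the dataset but also the names of the dimensions of the data (‘sample’ and ‘component’). Assigning a variable named ‘component’ along the ‘component’ dimension turns it into a coordinate that labels the two columns.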

[3]:
ds['composition'] = (['sample','component'],compositions)
ds['component'] = ('component',['A','B'])
ds
[3]:
<xarray.Dataset> Size: 2kB
Dimensions:      (sample: 100, component: 2)
Coordinates:
  * component    (component) <U1 8B 'A' 'B'
Dimensions without coordinates: sample
Data variables:
    composition  (sample, component) float64 2kB 1.935 4.339 4.0 ... 7.878 14.33
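
Naming the dimensions and attaching a coordinate is what makes label-based selection possible. The sketch below uses a small stand-in dataset (the name `demo` and its values are made up for illustration) with the same layout as the one built above.

```python
import numpy as np
import xarray as xr

# Stand-in dataset with the same layout as above; the values are invented.
demo = xr.Dataset()
demo['composition'] = (
    ['sample', 'component'],
    np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]),
)
demo['component'] = ('component', ['A', 'B'])

# Label-based selection along 'component', which has a coordinate
a_values = demo['composition'].sel(component='A')
print(a_values.values.tolist())  # [1.0, 2.0, 3.0]

# Position-based selection along 'sample', which has no coordinate
first_sample = demo['composition'].isel(sample=0)
print(first_sample.values.tolist())  # [1.0, 4.0]
```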

Okay, in order to simulate a ‘phase boundary’ we’ll create labels for the data. We’ll draw an arbitrary line through the composition space and label the data that is above and below that line.

Let’s generate these labels and add them to the dataset.

[4]:
labels = (A>(0.25*B-1)).astype(int)
ds['ground_truth_labels'] = ('sample',labels)
ds
[4]:
<xarray.Dataset> Size: 2kB
Dimensions:              (sample: 100, component: 2)
Coordinates:
  * component            (component) <U1 8B 'A' 'B'
Dimensions without coordinates: sample
Data variables:
    composition          (sample, component) float64 2kB 1.935 4.339 ... 14.33
    ground_truth_labels  (sample) int64 800B 1 1 1 1 1 0 1 1 ... 1 0 1 1 0 1 1 1
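
To see how the labeling rule behaves, it can help to apply it to a few hand-picked points. The arrays `A_demo` and `B_demo` below are hypothetical values chosen to fall on either side of the line A = 0.25*B - 1; they are not part of the dataset.

```python
import numpy as np

# Hypothetical points chosen by hand for illustration
A_demo = np.array([5.0, 0.5, 4.0, 3.0])
B_demo = np.array([10.0, 20.0, 24.0, 8.0])

# The line evaluates to 1.5, 4.0, 5.0, and 1.0 for these B values
threshold = 0.25 * B_demo - 1

# Label 1 where A lies above the line, 0 otherwise
labels_demo = (A_demo > threshold).astype(int)
print(labels_demo.tolist())  # [1, 0, 0, 1]

# Count how many points fall in each group
values, counts = np.unique(labels_demo, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))  # {0: 2, 1: 2}
```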

Now we can plot the data. We do this using xarray by first converting the composition data variable into a new standalone xarray.Dataset, with one variable per component, and then calling plot.scatter on it.

[5]:
ds.composition.to_dataset('component').plot.scatter(x='A',y='B',c=ds.ground_truth_labels)
plt.show()
[Figure: scatter plot of the compositions, A versus B, colored by ground truth label]

Simulated Measurement Data

Now let’s generate the ‘measurement’ data. We’ll generate one measurement for each composition generated above. We’ll generate two kinds of data that depend on the data label:

  1. A flat background signal with random Gaussian noise

  2. A power-law with a power of -4 that decays to a flat background

Both kinds of data will have random Gaussian noise.

Now we can define a function that generates one of the two measurement signals, depending on the label it is given.

[6]:
import numpy as np

def measure(x,label):
    """Generate one of two signals with noise"""

    if label==0:
        m = np.ones_like(x) #flat background
    else:
        m = 1e-6*np.power(x,-4) + 1.0  #power law

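    # add Gaussian noise with a standard deviation of 25% of the signal;
    # note that loc=m centers the draw on m, so the sum is centered on 2*m
    # (use loc=0 to keep the noisy signal centered on m)
    m += np.random.normal(loc=m, scale=0.25*m, size=x.shape[0])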

    return m
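
Before adding noise, it can be useful to look at the two underlying signals on their own. The sketch below evaluates the noise-free forms used in `measure` and works out where the power-law term falls to the level of the flat background. The domain `x` here is an assumption that mirrors the one defined in the next cell.

```python
import numpy as np

# Assumed domain, matching the one used later in this How-To
x = np.geomspace(0.001, 1.0, 150)

background = np.ones_like(x)             # label 0: flat signal
power_law = 1e-6 * np.power(x, -4) + 1.0  # label 1: power law plus background

# The power-law term equals the background where 1e-6 * x**-4 == 1,
# i.e. x = (1e-6)**(1/4)
x_cross = (1e-6) ** 0.25
print(x_cross)  # about 0.0316

# Below x_cross the power law dominates; above it the signal flattens to ~1
print(power_law[0], power_law[-1])
```

At the low end of the domain the power-law signal is about six orders of magnitude above the background, which is why the two groups are easy to tell apart on log-log axes.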

Let’s define a domain for the measurement (x), generate the data, and then add it to the xarray.Dataset.

[7]:
import xarray as xr

#domain of the measurements (e.g., for scattering this would be q, for
#spectroscopy this would be wavelength or wavenumber)
x = np.geomspace(0.001,1.0,150)

# conduct one measurement per sample (100 in total) and gather into an array
measurements = np.array([measure(x,label) for label in labels])

# add the measurement data to the dataset
ds['measurement'] = (['sample','x'],measurements)
ds['ground_truth_labels'] = (['sample'],labels)
ds['x'] = ('x',x)
ds
[7]:
<xarray.Dataset> Size: 124kB
Dimensions:              (sample: 100, component: 2, x: 150)
Coordinates:
  * component            (component) <U1 8B 'A' 'B'
  * x                    (x) float64 1kB 0.001 0.001047 0.001097 ... 0.9547 1.0
Dimensions without coordinates: sample
Data variables:
    composition          (sample, component) float64 2kB 1.935 4.339 ... 14.33
    ground_truth_labels  (sample) int64 800B 1 1 1 1 1 0 1 1 ... 1 0 1 1 0 1 1 1
    measurement          (sample, x) float64 120kB 2.047e+06 1.318e+06 ... 2.065
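
Because the measurements and labels share the ‘sample’ dimension, one variable can be used to filter the other. The sketch below is a minimal illustration on a made-up stand-in dataset (`demo` and its values are assumptions, not the data generated above).

```python
import numpy as np
import xarray as xr

# Stand-in dataset: 4 samples, 3 points along x; the values are invented.
demo = xr.Dataset()
demo['measurement'] = (['sample', 'x'], np.arange(12.0).reshape(4, 3))
demo['ground_truth_labels'] = ('sample', np.array([1, 0, 1, 1]))
demo['x'] = ('x', np.array([0.01, 0.1, 1.0]))

# Keep only the samples labeled 1; drop=True removes the masked-out rows
group_1 = demo['measurement'].where(demo['ground_truth_labels'] == 1, drop=True)
print(group_1.sizes['sample'])  # 3

# For each remaining sample, pick the value at the x closest to 0.1
at_x = group_1.sel(x=0.1, method='nearest')
print(at_x.values.tolist())  # [1.0, 7.0, 10.0]
```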

Now let’s plot the two groups of data on log-log axes.

[8]:
for label, sub_ds in ds.groupby('ground_truth_labels'):
    plt.figure()
    sub_ds.measurement.plot.line(x='x',marker='.',ls='None',xscale='log',yscale='log',add_legend=False)
    plt.title(f'Group {label}')

plt.show()
[Figures: log-log plots of measurement versus x, one titled “Group 0” and one titled “Group 1”]

Composition Grid

Okay, the final piece of data that you need to start is the composition grid. This grid defines the candidate compositions that the agent will evaluate when choosing the next composition.

[9]:
num_grid_points = 50
A_grid = np.linspace(0,10,num_grid_points)
B_grid = np.linspace(0,25,num_grid_points)
composition_grid = np.meshgrid(A_grid,B_grid)
composition_grid = np.array([composition_grid[0].ravel(),composition_grid[1].ravel()]).T

ds['composition_grid'] = (['grid','component'],composition_grid)
ds
[9]:
<xarray.Dataset> Size: 164kB
Dimensions:              (sample: 100, component: 2, x: 150, grid: 2500)
Coordinates:
  * component            (component) <U1 8B 'A' 'B'
  * x                    (x) float64 1kB 0.001 0.001047 0.001097 ... 0.9547 1.0
Dimensions without coordinates: sample, grid
Data variables:
    composition          (sample, component) float64 2kB 1.935 4.339 ... 14.33
    ground_truth_labels  (sample) int64 800B 1 1 1 1 1 0 1 1 ... 1 0 1 1 0 1 1 1
    measurement          (sample, x) float64 120kB 2.047e+06 1.318e+06 ... 2.065
    composition_grid     (grid, component) float64 40kB 0.0 0.0 ... 10.0 25.0
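
The grid construction can be checked independently of the dataset. The sketch below repeats the same meshgrid steps with intermediate names (`AA` and `BB` are introduced here for illustration) and confirms that the result has one row per (A, B) pair.

```python
import numpy as np

num_grid_points = 50
A_grid = np.linspace(0, 10, num_grid_points)
B_grid = np.linspace(0, 25, num_grid_points)

# meshgrid returns two (50, 50) arrays holding the A and B value at each node
AA, BB = np.meshgrid(A_grid, B_grid)

# Flatten both arrays and stack them as columns: one row per grid point
composition_grid = np.column_stack([AA.ravel(), BB.ravel()])

print(composition_grid.shape)          # (2500, 2)
print(composition_grid[0].tolist())    # [0.0, 0.0]
print(composition_grid[-1].tolist())   # [10.0, 25.0]
```

The number of grid points grows as the product of the points per axis, so 50 points along each of two components gives 2500 candidates.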

Let’s inspect the grid in a plot.

[10]:
ds.composition_grid.to_dataset('component').plot.scatter(x='A',y='B',edgecolor='None')
plt.show()
[Figure: scatter plot of the composition grid, A versus B]

Saving the Dataset to Disk

We can save this dataset to disk for use in other notebooks or to keep a record of the input data used in a calculation. We’ll use the netCDF format for this:

[11]:

ds.to_netcdf('../data/example_dataset.nc')
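
To confirm that a saved file can be read back, a round trip can be sketched as follows. This is a minimal illustration on a made-up stand-in dataset written to a temporary directory; the file name `demo_dataset.nc` is hypothetical, and the example assumes a netCDF backend for xarray (such as netCDF4, h5netcdf, or scipy) is installed.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Stand-in dataset; the values are invented for illustration.
demo = xr.Dataset()
demo['composition'] = (['sample', 'component'], np.array([[1.0, 4.0], [2.0, 5.0]]))
demo['component'] = ('component', ['A', 'B'])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'demo_dataset.nc')  # hypothetical file name
    demo.to_netcdf(path)
    # load_dataset reads the whole file into memory and then closes it
    loaded = xr.load_dataset(path)

# Compare the variables and coordinates of the reloaded dataset
print(loaded['composition'].dims)
print(np.allclose(loaded['composition'].values, demo['composition'].values))
```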

Conclusion

In this notebook, we demonstrated how to build an xarray.Dataset from scratch.

We:

  1. Created an empty dataset

  2. Added composition data for samples

  3. Added ground truth labels for the samples

  4. Added simulated measurement data

  5. Added a composition grid for the agent to explore

  6. Saved the dataset to disk in netCDF format

The resulting dataset contains all the necessary components for training and evaluating an active learning agent:

  • Sample compositions and their corresponding measurements

  • Ground truth labels for validation

  • A grid defining the composition space for exploration

This dataset structure represents a typical format expected by many agent pipelines in AFL.double_agent. The exact variables and variable names will change with the pipeline, but the concept of having measurement data and composition information that share dimensions is a foundational feature of analyzing formulations and materials problems where the composition is varying.