{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/usnistgov/AFL-agent/blob/main/docs/source/how-to/building_xarray_datasets.ipynb)\n", "\n", "# Build an xarray.Dataset from Scratch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this `How-To`, we'll build an `xarray.Dataset` that can be used as an input to `Pipeline.calculate`. We'll generate random compositions and synthetic measurement data to go along with them. \n", "\n", "\n", "The dataset generated in this notebook is the basis for the `Building Pipelines` tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Google Colab Setup\n", "\n", "Only uncomment and run the next cell if you are running this notebook in Google Colab or if you don't already have the AFL-agent package installed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# !pip install git+https://github.com/usnistgov/AFL-agent.git" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First Steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To begin, let's import the libraries needed for this document and then make an empty :py:class:`xarray.Dataset`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 0B\n",
       "Dimensions:  ()\n",
       "Data variables:\n",
       "    *empty*
" ], "text/plain": [ " Size: 0B\n", "Dimensions: ()\n", "Data variables:\n", " *empty*" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import numpy as np\n", "import xarray as xr\n", "import matplotlib.pyplot as plt\n", "\n", "ds = xr.Dataset()\n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compositions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we generate random 'compositions' that we'll do simulated/virtual measurements at. We'll generate the compositions for a 2-dimensional space with components \"A\" and \"B\" as placeholders. You could imagine that A and B are the concentrations of two different preservatives in a liquid mixtures. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 1.93506959, 4.33877746],\n", " [ 3.99993228, 15.11981127],\n", " [ 5.14403166, 1.65850983],\n", " [ 4.57883235, 12.34183192],\n", " [ 8.05567528, 10.47358865],\n", " [ 1.04161007, 22.83361697],\n", " [ 5.85757901, 18.81270953],\n", " [ 4.29558185, 16.91442648],\n", " [ 3.14950211, 2.08947439],\n", " [ 7.65251749, 13.05015789],\n", " [ 5.58833051, 7.55301393],\n", " [ 8.75864958, 13.90004698],\n", " [ 9.03579843, 1.04110731],\n", " [ 6.94709288, 22.03909555],\n", " [ 6.30872735, 23.08178649],\n", " [ 8.38214203, 24.28281802],\n", " [ 6.09861924, 5.67560421],\n", " [ 0.12177663, 1.50263558],\n", " [ 5.4271795 , 24.05339109],\n", " [ 3.32773495, 12.91733508],\n", " [ 8.22656778, 18.12750637],\n", " [ 7.77255352, 1.61982803],\n", " [ 1.58024907, 8.89957219],\n", " [ 0.62964449, 13.64241493],\n", " [ 4.87628903, 11.61657773],\n", " [ 5.13234999, 15.35880089],\n", " [ 3.02453785, 10.78092744],\n", " [ 6.29667967, 9.70296135],\n", " [ 2.87889006, 21.23149213],\n", " [ 7.33476093, 23.21218938],\n", " [ 6.96178175, 14.54078124],\n", " [ 9.94068744, 13.1306839 ],\n", " [ 5.81347609, 18.31807371],\n", " [ 
3.65653358, 1.86819243],\n", " [ 8.97030945, 15.30596251],\n", " [ 2.93130382, 20.47763125],\n", " [ 9.37753719, 17.91516725],\n", " [ 0.41716654, 1.43289172],\n", " [ 1.1435741 , 4.91629814],\n", " [ 1.08112256, 11.13119176],\n", " [ 5.25354877, 9.45231719],\n", " [ 7.20589227, 18.57879028],\n", " [ 7.89243271, 11.31509607],\n", " [ 6.94687163, 14.82218341],\n", " [ 7.23403931, 6.79257162],\n", " [ 8.68249381, 12.53839805],\n", " [ 1.06877839, 10.32097668],\n", " [ 7.89831494, 0.4644321 ],\n", " [ 1.55458517, 24.16727467],\n", " [ 0.45829217, 3.24497194],\n", " [ 8.9361479 , 15.47144486],\n", " [ 6.41770086, 10.88253153],\n", " [ 5.88298706, 18.25514364],\n", " [ 1.345648 , 11.08596411],\n", " [ 1.03642353, 21.48281264],\n", " [ 8.17349873, 17.80265345],\n", " [ 8.24719952, 23.52786849],\n", " [ 7.19131724, 9.11486603],\n", " [ 0.78123421, 3.94805308],\n", " [ 6.24072184, 17.0863877 ],\n", " [ 0.50409961, 16.27277289],\n", " [ 8.38188047, 1.72138619],\n", " [ 2.60295083, 4.67442285],\n", " [ 2.58444436, 16.57143555],\n", " [ 9.35239534, 16.00199993],\n", " [ 6.50574801, 17.33916381],\n", " [ 3.27476499, 7.83797279],\n", " [ 7.75461602, 23.60987228],\n", " [ 0.55448762, 24.08902153],\n", " [ 3.82063523, 20.43360895],\n", " [ 1.96279418, 8.66791611],\n", " [ 8.31501072, 0.0249756 ],\n", " [ 2.4606 , 5.38843633],\n", " [ 9.1328654 , 19.8844552 ],\n", " [ 4.08471932, 1.10690158],\n", " [ 8.48469695, 11.10944636],\n", " [ 5.99946032, 10.58338866],\n", " [ 7.99427379, 10.09969 ],\n", " [ 8.71171219, 7.02035713],\n", " [ 7.30678607, 24.33877297],\n", " [ 7.76378269, 21.99343589],\n", " [ 7.45991402, 8.11174006],\n", " [ 7.63854343, 13.54025074],\n", " [ 0.81733253, 9.43775453],\n", " [ 7.20059135, 5.85811728],\n", " [ 2.20804772, 6.7721729 ],\n", " [ 3.65472461, 4.43961665],\n", " [ 8.21345636, 10.74735116],\n", " [ 9.92147334, 15.55468087],\n", " [ 9.52663079, 6.77762994],\n", " [ 7.01379838, 22.20232276],\n", " [ 0.23914684, 3.4963898 ],\n", " [ 3.46456904, 
12.84968187],\n", " [ 0.26618359, 8.12557401],\n", " [ 1.1156883 , 0.45431966],\n", " [ 6.91416382, 21.4816626 ],\n", " [ 0.59623227, 7.69198605],\n", " [ 2.02031934, 4.93580074],\n", " [ 5.21402634, 4.84272796],\n", " [ 7.87824238, 14.33035062]])" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "num_measurements = 100\n", "A = np.random.uniform(0,10,size=num_measurements)\n", "B = np.random.uniform(0,25,size=num_measurements)\n", "# stack into shape (num_measurements, 2): one row per sample, columns A and B\n", "compositions = np.array([A,B]).T\n", "compositions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's add this information to the :py:class:`xarray.Dataset`. \n", "\n", "Note that, for the `composition` variable, we need to specify not only the name of the variable in the dataset but also the names of its dimensions ('sample' and 'component')." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 2kB\n",
       "Dimensions:      (sample: 100, component: 2)\n",
       "Coordinates:\n",
       "  * component    (component) <U1 8B 'A' 'B'\n",
       "Dimensions without coordinates: sample\n",
       "Data variables:\n",
       "    composition  (sample, component) float64 2kB 1.935 4.339 4.0 ... 7.878 14.33
" ], "text/plain": [ " Size: 2kB\n", "Dimensions: (sample: 100, component: 2)\n", "Coordinates:\n", " * component (component) \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 2kB\n",
       "Dimensions:              (sample: 100, component: 2)\n",
       "Coordinates:\n",
       "  * component            (component) <U1 8B 'A' 'B'\n",
       "Dimensions without coordinates: sample\n",
       "Data variables:\n",
       "    composition          (sample, component) float64 2kB 1.935 4.339 ... 14.33\n",
       "    ground_truth_labels  (sample) int64 800B 1 1 1 1 1 0 1 1 ... 1 0 1 1 0 1 1 1
" ], "text/plain": [ " Size: 2kB\n", "Dimensions: (sample: 100, component: 2)\n", "Coordinates:\n", " * component (component) (0.25*B-1)).astype(int)\n", "ds['ground_truth_labels'] = ('sample',labels)\n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can plot the data. We do this using xarray by first extracting the compositions data variable into a new standalone xarray.Dataset and then calling plot.scatter on it. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ds.composition.to_dataset('component').plot.scatter(x='A',y='B',c=ds.ground_truth_labels)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simulated Measurement Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's generate the 'measurement' data. We'll generate one measurement for each composition generated above. We'll generate two kinds of data that depend on the data label:\n", "\n", "1. A flat background signal with random Gaussian noise\n", "2. A power-law with a power of -4 that decays to a flat background\n", "\n", "Both kinds of data will have random Gaussian noise.\n", "\n", "Now we can define a method (Python's name for a function) that randomly generates one of two measurement signals. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def measure(x,label):\n", " \"\"\"Generate one of two signals with noise\"\"\"\n", "\n", " if label==0:\n", " m = np.ones_like(x) #flat background\n", " else:\n", " m = 1e-6*np.power(x,-4) + 1.0 #power law\n", "\n", " # add noise\n", " m += np.random.normal(loc=m, scale=0.25*m, size=x.shape[0])\n", "\n", " return m\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's define a domain for the measurement (x), generate the data, and the create an `xarray.Dataset` with it." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 124kB\n",
       "Dimensions:              (sample: 100, component: 2, x: 150)\n",
       "Coordinates:\n",
       "  * component            (component) <U1 8B 'A' 'B'\n",
       "  * x                    (x) float64 1kB 0.001 0.001047 0.001097 ... 0.9547 1.0\n",
       "Dimensions without coordinates: sample\n",
       "Data variables:\n",
       "    composition          (sample, component) float64 2kB 1.935 4.339 ... 14.33\n",
       "    ground_truth_labels  (sample) int64 800B 1 1 1 1 1 0 1 1 ... 1 0 1 1 0 1 1 1\n",
       "    measurement          (sample, x) float64 120kB 2.047e+06 1.318e+06 ... 2.065
" ], "text/plain": [ " Size: 124kB\n", "Dimensions: (sample: 100, component: 2, x: 150)\n", "Coordinates:\n", " * component (component) " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for label, sub_ds in ds.groupby('ground_truth_labels'):\n", " plt.figure()\n", " sub_ds.measurement.plot.line(x='x',marker='.',ls='None',xscale='log',yscale='log',add_legend=False)\n", " plt.title(f'Group {label}')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Composition Grid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, the final piece of data that you need to start is the composition grid. This grid defines the space that the agent will evaluate when choosing the next composition" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 164kB\n",
       "Dimensions:              (sample: 100, component: 2, x: 150, grid: 2500)\n",
       "Coordinates:\n",
       "  * component            (component) <U1 8B 'A' 'B'\n",
       "  * x                    (x) float64 1kB 0.001 0.001047 0.001097 ... 0.9547 1.0\n",
       "Dimensions without coordinates: sample, grid\n",
       "Data variables:\n",
       "    composition          (sample, component) float64 2kB 1.935 4.339 ... 14.33\n",
       "    ground_truth_labels  (sample) int64 800B 1 1 1 1 1 0 1 1 ... 1 0 1 1 0 1 1 1\n",
       "    measurement          (sample, x) float64 120kB 2.047e+06 1.318e+06 ... 2.065\n",
       "    composition_grid     (grid, component) float64 40kB 0.0 0.0 ... 10.0 25.0
" ], "text/plain": [ " Size: 164kB\n", "Dimensions: (sample: 100, component: 2, x: 150, grid: 2500)\n", "Coordinates:\n", " * component (component) " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ds.composition_grid.to_dataset('component').plot.scatter(x='A',y='B',edgecolor='None')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving the Dataset to disk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can save this dataset to disk for use in other notebooks or to memorialize the input data used in a calculation. We'll use the `netcdf` format for this:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "\n", "ds.to_netcdf('../data/example_dataset.nc')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we demonstrated how to build an `xarray.Dataset` from scratch. \n", "\n", "We:\n", "\n", "1. Created an empty dataset\n", "2. Added composition data for samples\n", "3. Added ground truth labels for the samples\n", "4. Added simulated measurement data\n", "5. Added a composition grid for the agent to explore\n", "6. Saved the dataset to disk in netCDF format\n", "\n", "The resulting dataset contains all the necessary components for training and evaluating an active learning agent:\n", "- Sample compositions and their corresponding measurements\n", "- Ground truth labels for validation\n", "- A grid defining the composition space for exploration\n", "\n", "This dataset structure represents a typical format expected by many agent pipelines in `AFL.double_agent`. The exact variables and variable names will change with the pipeline, but the concept of having measurement data and composition information that shares dimensions is a foundational feature of analyzing formulations and materials problems where the composition is varying. 
" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" } }, "nbformat": 4, "nbformat_minor": 2 }