{ "cells": [ { "cell_type": "markdown", "id": "c68003ae", "metadata": {}, "source": [ "Appending Data to xarray.Datasets\n", "================================" ] }, { "cell_type": "markdown", "id": "8bda8da5", "metadata": {}, "source": [ "When working with scientific data, you'll often need to combine or append datasets from different experiments or measurements. This tutorial demonstrates various methods to append data to existing xarray.Dataset objects, with a focus on scattering and composition data.\n" ] }, { "cell_type": "markdown", "id": "9a8ef101", "metadata": {}, "source": [ "\n", "Setup\n", "-----\n" ] }, { "cell_type": "markdown", "id": "21e0776d", "metadata": {}, "source": [ "## Google Colab Setup\n", "\n", "Only uncomment and run the next cell if you are running this notebook in Google Colab or if you don't already have the AFL-agent package installed." ] }, { "cell_type": "code", "execution_count": null, "id": "aedb2ae3", "metadata": {}, "outputs": [], "source": [ "# !pip install git+https://github.com/usnistgov/AFL-agent.git" ] }, { "cell_type": "markdown", "id": "05f3daf4", "metadata": {}, "source": [ "Next, let's import the necessary support modules and load the example data from AFL.double_agent:" ] }, { "cell_type": "code", "execution_count": 2, "id": "d71bef10", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "xarray version: 2025.1.2\n", "Dataset dimensions: {'sample': 100, 'component': 2, 'x': 150, 'grid': 2500}\n", "Dataset variables: ['composition', 'ground_truth_labels', 'measurement', 'composition_grid']\n", "Dataset coordinates: ['component', 'x']\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "import xarray as xr\n", "import matplotlib.pyplot as plt\n", "\n", "print(f\"xarray version: {xr.__version__}\")\n", "\n", "# Import the example dataset from AFL.double_agent.data\n", "from AFL.double_agent.data import example_dataset1\n", "\n", "# Load the example dataset\n", "ds = example_dataset1()\n", 
"\n", "# Print basic information about the dataset\n", "print(f\"Dataset dimensions: {dict(ds.sizes)}\")\n", "print(f\"Dataset variables: {list(ds.data_vars)}\")\n", "print(f\"Dataset coordinates: {list(ds.coords)}\")" ] }, { "cell_type": "markdown", "id": "8418da99", "metadata": {}, "source": [ "Understanding the Dataset\n", "------------------------\n", "\n", "Let's first understand the structure of our example dataset:" ] }, { "cell_type": "code", "execution_count": 3, "id": "49a69bea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Composition data shape: (100, 2)\n", "Sample of composition data:\n" ] }, { "data": { "text/html": [ "
<xarray.DataArray 'composition' (sample: 3, component: 2)> Size: 48B\n",
       "[6 values with dtype=float64]\n",
       "Coordinates:\n",
       "  * component  (component) <U1 8B 'A' 'B'\n",
       "Dimensions without coordinates: sample
" ], "text/plain": [ " Size: 48B\n", "[6 values with dtype=float64]\n", "Coordinates:\n", " * component (component) \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'measurement' (sample: 2, x: 5)> Size: 80B\n",
       "[10 values with dtype=float64]\n",
       "Coordinates:\n",
       "  * x        (x) float64 40B 0.001 0.001047 0.001097 0.001149 0.001204\n",
       "Dimensions without coordinates: sample
" ], "text/plain": [ " Size: 80B\n", "[10 values with dtype=float64]\n", "Coordinates:\n", " * x (x) float64 40B 0.001 0.001047 0.001097 0.001149 0.001204\n", "Dimensions without coordinates: sample" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Look at the measurement data\n", "print(\"Measurement data shape:\", ds.measurement.shape)\n", "print(\"Sample of measurement data:\")\n", "ds.measurement.isel(sample=slice(0, 2), x=slice(0, 5))" ] }, { "cell_type": "markdown", "id": "a9b0276e", "metadata": {}, "source": [ "The measurement data has dimensions ('sample', 'x') with 100 samples and 150 x-values." ] }, { "cell_type": "code", "execution_count": 5, "id": "33fa9c81", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'Measurement Data for First 3 Samples')" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot measurements for the first 3 samples using xarray's built-in plotting\n", "ds.measurement.isel(sample=slice(0, 3)).plot.line(x='x', hue='sample', xscale='log',yscale='log')\n", "\n", "plt.title('Measurement Data for First 3 Samples')" ] }, { "cell_type": "markdown", "id": "2f2ebc84", "metadata": {}, "source": [ "Creating Subsets of the Dataset\n", "------------------------------\n", "\n", "To demonstrate appending data, let's first create subsets of our dataset:" ] }, { "cell_type": "code", "execution_count": 6, "id": "81f8210c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Batch 1: Frozen({'sample': 50, 'component': 2, 'x': 150, 'grid': 2500})\n", "Batch 2: Frozen({'sample': 50, 'component': 2, 'x': 150, 'grid': 2500})\n" ] } ], "source": [ "# Create two subsets of the data\n", "ds_batch1 = ds.isel(sample=slice(0, 50)) # First 50 samples\n", "ds_batch2 = ds.isel(sample=slice(50, 100)) # Last 50 samples\n", "\n", "print(f\"Batch 1: {ds_batch1.sizes}\")\n", "print(f\"Batch 2: {ds_batch2.sizes}\")" ] }, { "cell_type": "markdown", "id": "655ba7e8", "metadata": {}, "source": [ "Method 1: Concatenating Along the Sample Dimension\n", "-------------------------------------------------\n", "\n", "The most common way to append datasets is using ``xr.concat()`` to combine along a dimension. 
Here, we'll combine our two batches along the sample dimension:" ] }, { "cell_type": "code", "execution_count": 7, "id": "9dc896a4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Combined dataset dimensions: Frozen({'sample': 100, 'component': 2, 'x': 150, 'grid': 2500})\n", "Original samples: 100\n", "Combined samples: 100\n", "Data is identical: True\n" ] } ], "source": [ "# Concatenate along the sample dimension\n", "combined_ds = xr.concat([ds_batch1, ds_batch2], dim='sample')\n", "\n", "print(\"Combined dataset dimensions:\", combined_ds.sizes)\n", "\n", "# Verify that the combined dataset has the same number of samples as the original\n", "print(f\"Original samples: {ds.sizes['sample']}\")\n", "print(f\"Combined samples: {combined_ds.sizes['sample']}\")\n", "\n", "# Check if the data is the same\n", "print(\"Data is identical:\", np.allclose(ds.measurement.values, combined_ds.measurement.values))" ] }, { "cell_type": "markdown", "id": "147abec2", "metadata": {}, "source": [ "The combined dataset has the same dimensions as our original dataset, and the data is identical.\n", "\n", "Let's visualize the combined data:" ] }, { "cell_type": "code", "execution_count": 8, "id": "8742fd87", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'Samples from Combined Dataset')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot using xarray's built-in plotting functionality\n", "combined_ds.measurement.isel(sample=0).plot( label=\"First sample (Batch 1)\",xscale='log',yscale='log')\n", "combined_ds.measurement.isel(sample=50).plot( label=\"First sample (Batch 2)\",xscale='log',yscale='log')\n", "\n", "plt.title('Samples from Combined Dataset')" ] }, { "cell_type": "markdown", "id": "77f6efaf", "metadata": {}, "source": [ "Method 2: Adding New Variables to Existing Datasets\n", "--------------------------------------------------\n", "\n", "Sometimes you might want to add new variables to an existing dataset, such as adding derived data or analysis results.\n", "\n", "Let's create a new variable by calculating the mean of each measurement:" ] }, { "cell_type": "code", "execution_count": 9, "id": "eee3522e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.Dataset> Size: 400B\n",
       "Dimensions:           (sample: 50)\n",
       "Dimensions without coordinates: sample\n",
       "Data variables:\n",
       "    measurement_mean  (sample) float64 400B 8.046e+04 7.622e+04 ... 7.786e+04
" ], "text/plain": [ " Size: 400B\n", "Dimensions: (sample: 50)\n", "Dimensions without coordinates: sample\n", "Data variables:\n", " measurement_mean (sample) float64 400B 8.046e+04 7.622e+04 ... 7.786e+04" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate the mean of each measurement\n", "measurement_mean = ds_batch1.measurement.mean(dim='x')\n", "\n", "\n", "# Create a new dataset with this information\n", "mean_ds = xr.Dataset()\n", "mean_ds['measurement_mean'] = ('sample', measurement_mean.values)\n", "mean_ds\n" ] }, { "cell_type": "markdown", "id": "e0f72d0d", "metadata": {}, "source": [ "Now we can merge this new dataset with our original batch:" ] }, { "cell_type": "code", "execution_count": 14, "id": "f2046d8f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Merged dataset variables: ['composition', 'ground_truth_labels', 'measurement', 'composition_grid', 'measurement_mean']\n" ] }, { "data": { "text/plain": [ "Text(0.5, 1.0, 'Measurements and Their Means')" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Merge the mean dataset with batch1\n", "merged_ds = xr.merge([ds_batch1, mean_ds])\n", "\n", "print(\"Merged dataset variables:\", list(merged_ds.data_vars))\n", "\n", "# Plot the original measurements and their means\n", "merged_ds.measurement.isel(sample=[0,1,2]).plot.line(x='x', hue='sample', xscale='log',yscale='log')\n", "for i in range(3):\n", " plt.axhline(y=merged_ds.measurement_mean.isel(sample=i), linestyle='--', color=f'C{i}', label=f\"Mean of Sample {i}\")\n", "\n", "plt.title('Measurements and Their Means')" ] }, { "cell_type": "markdown", "id": "d03da3a9", "metadata": {}, "source": [ "Method 3: Combining Datasets with Different X Ranges\n", "---------------------------------------------------\n", "\n", "Sometimes you need to combine datasets with different x ranges. Let's create a subset with a different x range:" ] }, { "cell_type": "code", "execution_count": null, "id": "d5108936", "metadata": {}, "outputs": [], "source": [ "# Create a subset with a different x range\n", "x_subset = ds.x.values[::2] # Take every other x value\n", "\n", "# Create a new dataset with this subset\n", "ds_subset_x = ds.isel(sample=slice(0, 10)).copy() # First 10 samples\n", "\n", "# Interpolate the data to the new x values\n", "new_measurement = np.zeros((10, len(x_subset)))\n", "\n", "for i in range(10):\n", " new_measurement[i] = np.interp(\n", " x_subset, \n", " ds.x.values, \n", " ds.measurement.isel(sample=i).values\n", " )\n", "\n", "# Create the new dataset\n", "ds_different_x = xr.Dataset(\n", " data_vars={\n", " 'measurement': (('sample', 'x'), new_measurement),\n", " 'composition': ds_subset_x.composition.values,\n", " },\n", " coords={\n", " 'sample': ds_subset_x.sample,\n", " 'x': x_subset,\n", " 'component': ds.component,\n", " }\n", ")\n", "\n", "print(\"Original x length:\", len(ds.x))\n", "print(\"New x length:\", len(ds_different_x.x))" ] }, { "cell_type": "markdown", "id": 
"9d85070b", "metadata": {}, "source": [ "To combine datasets with different x coordinates, we need to interpolate onto a common grid:" ] }, { "cell_type": "code", "execution_count": null, "id": "9da7816f", "metadata": {}, "outputs": [], "source": [ "# Get a sample from each dataset\n", "sample_original = ds.isel(sample=0)\n", "sample_different_x = ds_different_x.isel(sample=0)\n", "\n", "# Plot to show the different x grids\n", "plt.figure(figsize=(10, 6))\n", "\n", "plt.plot(sample_original.x, sample_original.measurement, \n", " 'o-', label=\"Original x grid\")\n", "plt.plot(sample_different_x.x, sample_different_x.measurement, \n", " 'x-', label=\"Different x grid\")\n", "\n", "plt.xlabel('x')\n", "plt.ylabel('Measurement')\n", "plt.title('Comparison of Different X Grids')\n", "plt.legend()\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "5c6d6a3b", "metadata": {}, "source": [ "To combine these datasets, we need to interpolate one onto the grid of the other:" ] }, { "cell_type": "code", "execution_count": null, "id": "f7ea37de", "metadata": {}, "outputs": [], "source": [ "# Create a combined x grid (union of both)\n", "combined_x = np.sort(np.unique(np.concatenate([\n", " ds.x.values, \n", " ds_different_x.x.values\n", "])))\n", "\n", "# Interpolate both datasets to this new grid\n", "# For demonstration, we'll just use one sample from each\n", "\n", "# Interpolate original data\n", "original_interp = np.interp(\n", " combined_x, \n", " ds.x.isel(sample=0), \n", " ds.measurement.isel(sample=0)\n", ")\n", "\n", "# Interpolate different_x data\n", "different_x_interp = np.interp(\n", " combined_x, \n", " ds_different_x.x.isel(sample=0), \n", " ds_different_x.measurement.isel(sample=0)\n", ")\n", "\n", "# Create a new dataset with the combined x grid\n", "combined_x_ds = xr.Dataset(\n", " data_vars={\n", " 'measurement_original': ('x', original_interp),\n", " 'measurement_different_x': ('x', different_x_interp),\n", " },\n", " 
coords={\n", " 'x': combined_x,\n", " }\n", ")\n", "\n", "print(\"Combined x grid length:\", len(combined_x_ds.x))\n", "\n", "# Plot the interpolated data\n", "plt.figure(figsize=(10, 6))\n", "\n", "plt.plot(combined_x_ds.x, combined_x_ds.measurement_original, \n", " label=\"Original data (interpolated)\")\n", "plt.plot(combined_x_ds.x, combined_x_ds.measurement_different_x, \n", " label=\"Different x data (interpolated)\")\n", "\n", "plt.xlabel('x')\n", "plt.ylabel('Measurement')\n", "plt.title('Data Interpolated to Common X Grid')\n", "plt.legend()\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "937acb22", "metadata": {}, "source": [ "Method 4: Filling Missing Data\n", "-------------------------------\n", "\n", "Sometimes you might have incomplete data that needs to be filled from another dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "e4a705a0", "metadata": {}, "outputs": [], "source": [ "# Create a dataset with some missing values\n", "ds_with_nans = ds.isel(sample=slice(0, 10)).copy()\n", "\n", "# Set some measurement values to NaN\n", "measurement_with_nans = ds_with_nans.measurement.values.copy()\n", "measurement_with_nans[2:5, 30:60] = np.nan # Set a block to NaN\n", "\n", "ds_with_nans['measurement'] = (('sample', 'x'), measurement_with_nans)\n", "\n", "# Visualize the dataset with missing values\n", "plt.figure(figsize=(10, 6))\n", "\n", "for i in range(3):\n", " plt.plot(ds_with_nans.x, ds_with_nans.measurement.isel(sample=i), \n", " label=f\"Sample {i}\")\n", "\n", "plt.xlabel('x')\n", "plt.ylabel('Measurement')\n", "plt.title('Dataset with Missing Values')\n", "plt.legend()\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "509dad49", "metadata": {}, "source": [ "We can use the ``combine_first()`` method to fill missing values from another dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "0308644c", "metadata": {}, "outputs": [], "source": [ "# Create 
a dataset to fill the missing values\n", "# We'll use the original dataset for this\n", "ds_fill = ds.isel(sample=slice(0, 10))\n", "\n", "# Fill the missing values\n", "ds_filled = ds_with_nans.combine_first(ds_fill)\n", "\n", "# Check if all NaNs are filled\n", "print(\"NaNs in original:\", np.isnan(ds_with_nans.measurement.values).sum())\n", "print(\"NaNs after filling:\", np.isnan(ds_filled.measurement.values).sum())\n", "\n", "# Visualize the filled dataset\n", "plt.figure(figsize=(10, 6))\n", "\n", "for i in range(3):\n", " plt.plot(ds_filled.x, ds_filled.measurement.isel(sample=i), \n", " label=f\"Sample {i} (filled)\")\n", " \n", " # Also plot the original data with NaNs for comparison\n", " if i == 2: # Sample 2 had NaNs\n", " plt.plot(ds_with_nans.x, ds_with_nans.measurement.isel(sample=i), \n", " 'r--', label=f\"Sample {i} (with NaNs)\")\n", "\n", "plt.xlabel('x')\n", "plt.ylabel('Measurement')\n", "plt.title('Dataset After Filling Missing Values')\n", "plt.legend()\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "ceab9f6f", "metadata": {}, "source": [ "Method 5: Updating Metadata When Combining Datasets\n", "----------------------------------------------------\n", "\n", "When combining datasets, you might want to update the metadata (attributes):" ] }, { "cell_type": "code", "execution_count": null, "id": "24ca9e37", "metadata": {}, "outputs": [], "source": [ "# Combine datasets and update attributes\n", "combined_ds = xr.concat([ds_batch1, ds_batch2], dim='sample')\n", "\n", "# Update attributes\n", "combined_ds.attrs = {\n", " 'description': 'Combined dataset from two batches',\n", " 'samples': f\"{combined_ds.sizes['sample']} samples\",\n", " 'x_range': f\"{combined_ds.x.values[0]:.3f} to {combined_ds.x.values[-1]:.3f}\",\n", " 'components': ', '.join([str(c.values) for c in combined_ds.component]),\n", " 'created_date': pd.Timestamp.now().strftime('%Y-%m-%d'),\n", "}\n", "\n", "print(\"Combined Dataset 
Attributes:\")\n", "for key, value in combined_ds.attrs.items():\n", "    print(f\"{key}: {value}\")" ] }, { "cell_type": "markdown", "id": "13167c96", "metadata": {}, "source": [ "Best Practices and Considerations\n", "-------------------------------\n", "\n", "When appending data to xarray Datasets, keep these tips in mind:\n", "\n", "1. **Dimension Alignment**: Ensure that the coordinates of every dimension you are not concatenating along (for example, ``x``) match across datasets; by default ``xr.concat()`` outer-joins mismatched coordinates and pads the gaps with NaN.\n", "2. **Data Types**: Check that variables have compatible data types before combining.\n", "3. **Metadata**: Decide how to handle metadata (attributes) when combining datasets; ``xr.concat()`` and ``xr.merge()`` accept a ``combine_attrs`` argument for this.\n", "4. **Interpolation**: When combining data with different coordinate values, consider interpolation to a common grid.\n", "5. **Units**: Ensure that data being combined has consistent units.\n", "6. **Performance**: For very large datasets, consider using dask-backed arrays so that data is combined lazily and in chunks.\n", "\n", "Conclusion\n", "---------\n", "\n", "In this tutorial, we've explored various methods to append data to xarray Datasets:\n", "\n", "1. Using ``xr.concat()`` to combine datasets along a dimension\n", "2. Using ``xr.merge()`` to add new variables to existing datasets\n", "3. Combining datasets with different coordinate values through interpolation\n", "4. Using ``combine_first()`` to fill missing data\n", "5. 
Handling metadata when combining datasets\n", "\n", "These techniques are essential for working with multiple batches of data, combining data from different sources, or extending your dataset with new samples or derived properties.\n", "\n", "Further Reading\n", "-------------\n", "\n", "- xarray Documentation on Combining Data\n", "- Dask Integration with xarray\n", "- xarray API Reference" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" } }, "nbformat": 4, "nbformat_minor": 5 }