Appending Data to xarray.Datasets#

When working with scientific data, you’ll often need to combine or append datasets from different experiments or measurements. This tutorial demonstrates various methods to append data to existing xarray.Dataset objects, with a focus on scattering and composition data.

Setup#

Google Colab Setup#

Only uncomment and run the next cell if you are running this notebook in Google Colab or if you don’t already have the AFL-agent package installed.

[ ]:

# !pip install git+https://github.com/usnistgov/AFL-agent.git

Next, let’s import the necessary support modules and load the example dataset from AFL.double_agent:

[2]:

import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt

print(f"xarray version: {xr.__version__}")

# Import the example dataset from AFL.double_agent.data
from AFL.double_agent.data import example_dataset1

# Load the example dataset
ds = example_dataset1()

# Print basic information about the dataset
print(f"Dataset dimensions: {dict(ds.sizes)}")
print(f"Dataset variables: {list(ds.data_vars)}")
print(f"Dataset coordinates: {list(ds.coords)}")

xarray version: 2025.1.2
Dataset dimensions: {'sample': 100, 'component': 2, 'x': 150, 'grid': 2500}
Dataset variables: ['composition', 'ground_truth_labels', 'measurement', 'composition_grid']
Dataset coordinates: ['component', 'x']

Creating Subsets of the Dataset#

To demonstrate appending data, let’s first create subsets of our dataset:

[6]:

# Create two subsets of the data
ds_batch1 = ds.isel(sample=slice(0, 50))  # First 50 samples
ds_batch2 = ds.isel(sample=slice(50, 100))  # Last 50 samples

print(f"Batch 1: {ds_batch1.sizes}")
print(f"Batch 2: {ds_batch2.sizes}")

Batch 1: Frozen({'sample': 50, 'component': 2, 'x': 150, 'grid': 2500})
Batch 2: Frozen({'sample': 50, 'component': 2, 'x': 150, 'grid': 2500})

Method 1: Concatenating Along the Sample Dimension#

The most common way to append datasets is using xr.concat() to combine along a dimension. Here, we’ll combine our two batches along the sample dimension:

[7]:

# Concatenate along the sample dimension
combined_ds = xr.concat([ds_batch1, ds_batch2], dim='sample')

print("Combined dataset dimensions:", combined_ds.sizes)

# Verify that the combined dataset has the same number of samples as the original
print(f"Original samples: {ds.sizes['sample']}")
print(f"Combined samples: {combined_ds.sizes['sample']}")

# Check if the data is the same
print("Data is identical:", np.allclose(ds.measurement.values, combined_ds.measurement.values))

Combined dataset dimensions: Frozen({'sample': 100, 'component': 2, 'x': 150, 'grid': 2500})
Original samples: 100
Combined samples: 100
Data is identical: True

The combined dataset has the same dimensions as our original dataset, and the measurement values are identical.
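The np.allclose check above compares only the raw values of one variable. A stricter check is to compare whole datasets with Dataset.equals, which also looks at dimensions and coordinates. The sketch below uses a small synthetic dataset; the names toy, first, and second are made up for illustration and are not part of AFL-agent.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the example dataset: 6 samples, 4 x points
toy = xr.Dataset(
    {"measurement": (("sample", "x"), np.arange(24.0).reshape(6, 4))},
    coords={"x": [0.01, 0.02, 0.05, 0.1]},
)

# Split into two batches, as in the tutorial
first = toy.isel(sample=slice(0, 3))
second = toy.isel(sample=slice(3, 6))

# Concatenating in the original order reproduces the dataset
rejoined = xr.concat([first, second], dim="sample")
same_as_original = rejoined.equals(toy)

# Concatenating in the opposite order does not: order along 'sample' matters
reversed_order = xr.concat([second, first], dim="sample")
same_when_reversed = reversed_order.equals(toy)

print("Rejoined equals original:", same_as_original)
print("Reversed equals original:", same_when_reversed)
```

The second comparison is a reminder that xr.concat appends in the order of the input list, so batches should be passed in the order you want the samples to appear.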

Let’s visualize the combined data:

[8]:

# Plot using xarray's built-in plotting functionality
combined_ds.measurement.isel(sample=0).plot(label="First sample (Batch 1)", xscale='log', yscale='log')
combined_ds.measurement.isel(sample=50).plot(label="First sample (Batch 2)", xscale='log', yscale='log')
plt.legend()  # show the labels passed to plot()

plt.title('Samples from Combined Dataset')

[8]:

Text(0.5, 1.0, 'Samples from Combined Dataset')

Method 2: Adding New Variables to Existing Datasets#

Sometimes you might want to add new variables to an existing dataset, such as adding derived data or analysis results.

Let’s create a new variable by calculating the mean of each measurement:

[9]:

# Calculate the mean of each measurement
measurement_mean = ds_batch1.measurement.mean(dim='x')


# Create a new dataset with this information
mean_ds = xr.Dataset()
mean_ds['measurement_mean'] = ('sample', measurement_mean.values)
mean_ds

[9]:

<xarray.Dataset> Size: 400B
Dimensions:           (sample: 50)
Dimensions without coordinates: sample
Data variables:
    measurement_mean  (sample) float64 400B 8.046e+04 7.622e+04 ... 7.786e+04

Now we can merge this new dataset with our original batch:

[14]:

# Merge the mean dataset with batch1
merged_ds = xr.merge([ds_batch1, mean_ds])

print("Merged dataset variables:", list(merged_ds.data_vars))

# Plot the original measurements and their means
merged_ds.measurement.isel(sample=[0,1,2]).plot.line(x='x', hue='sample', xscale='log',yscale='log')
for i in range(3):
    plt.axhline(y=merged_ds.measurement_mean.isel(sample=i), linestyle='--', color=f'C{i}', label=f"Mean of Sample {i}")

plt.title('Measurements and Their Means')

Merged dataset variables: ['composition', 'ground_truth_labels', 'measurement', 'composition_grid', 'measurement_mean']

[14]:

Text(0.5, 1.0, 'Measurements and Their Means')
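When the derived variable already shares the dataset’s dimensions, building a separate Dataset and merging it is not strictly necessary. A minimal sketch of the same idea with Dataset.assign, using a synthetic dataset (toy is a hypothetical stand-in, not the AFL example data):

```python
import numpy as np
import xarray as xr

# Hypothetical two-sample dataset
toy = xr.Dataset(
    {"measurement": (("sample", "x"), np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))},
    coords={"x": [0.1, 0.2, 0.3]},
)

# assign returns a new Dataset; toy itself is left unchanged
with_mean = toy.assign(measurement_mean=toy.measurement.mean(dim="x"))

print(list(with_mean.data_vars))
print(with_mean.measurement_mean.values)
```

xr.merge remains the better fit when the new variables come from a separate source and need to be aligned against the existing dataset.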

Method 3: Combining Datasets with Different X Ranges#

Sometimes you need to combine datasets with different x ranges. Let’s create a subset with a different x range:

[ ]:

# Create a subset with a different x range
x_subset = ds.x.values[::2]  # Take every other x value

# Create a new dataset with this subset
ds_subset_x = ds.isel(sample=slice(0, 10)).copy()  # First 10 samples

# Interpolate the data to the new x values
new_measurement = np.zeros((10, len(x_subset)))

for i in range(10):
    new_measurement[i] = np.interp(
        x_subset,
        ds.x.values,
        ds.measurement.isel(sample=i).values
    )

# Create the new dataset
ds_different_x = xr.Dataset(
    data_vars={
        'measurement': (('sample', 'x'), new_measurement),
        'composition': ds_subset_x.composition,  # pass the DataArray so its dimensions are kept
    },
    coords={
        'sample': ds_subset_x.sample,
        'x': x_subset,
        'component': ds.component,
    }
)

print("Original x length:", len(ds.x))
print("New x length:", len(ds_different_x.x))

Before combining the two datasets, let’s plot one sample from each to see how their x grids differ:

[ ]:

# Get a sample from each dataset
sample_original = ds.isel(sample=0)
sample_different_x = ds_different_x.isel(sample=0)

# Plot to show the different x grids
plt.figure(figsize=(10, 6))

plt.plot(sample_original.x, sample_original.measurement,
         'o-', label="Original x grid")
plt.plot(sample_different_x.x, sample_different_x.measurement,
         'x-', label="Different x grid")

plt.xlabel('x')
plt.ylabel('Measurement')
plt.title('Comparison of Different X Grids')
plt.legend()
plt.tight_layout()
plt.show()

To combine these datasets, we interpolate both onto a common grid. Here we use the union of the two sets of x values:

[ ]:

# Create a combined x grid (union of both)
combined_x = np.sort(np.unique(np.concatenate([
    ds.x.values,
    ds_different_x.x.values
])))

# Interpolate both datasets to this new grid
# For demonstration, we'll just use one sample from each

# Interpolate original data
original_interp = np.interp(
    combined_x,
    ds.x.values,  # x has no sample dimension, so use the coordinate values directly
    ds.measurement.isel(sample=0)
)

# Interpolate different_x data
different_x_interp = np.interp(
    combined_x,
    ds_different_x.x.values,  # x has no sample dimension, so use the coordinate values directly
    ds_different_x.measurement.isel(sample=0)
)

# Create a new dataset with the combined x grid
combined_x_ds = xr.Dataset(
    data_vars={
        'measurement_original': ('x', original_interp),
        'measurement_different_x': ('x', different_x_interp),
    },
    coords={
        'x': combined_x,
    }
)

print("Combined x grid length:", len(combined_x_ds.x))

# Plot the interpolated data
plt.figure(figsize=(10, 6))

plt.plot(combined_x_ds.x, combined_x_ds.measurement_original,
         label="Original data (interpolated)")
plt.plot(combined_x_ds.x, combined_x_ds.measurement_different_x,
         label="Different x data (interpolated)")

plt.xlabel('x')
plt.ylabel('Measurement')
plt.title('Data Interpolated to Common X Grid')
plt.legend()
plt.tight_layout()
plt.show()
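The cells above interpolate a single sample from each dataset. To actually append one batch to another, every sample has to be moved onto the shared grid before concatenating. The sketch below shows one way to do that with np.interp; the helper interp_to_grid and the synthetic batches are hypothetical and assume arrays with dimensions (sample, x) and increasing x values.

```python
import numpy as np
import xarray as xr


def interp_to_grid(da, new_x):
    """Linearly interpolate each row of a (sample, x) DataArray onto new_x."""
    rows = [np.interp(new_x, da.x.values, row) for row in da.values]
    return xr.DataArray(np.stack(rows), dims=("sample", "x"), coords={"x": new_x})


# Two hypothetical batches of y = 2x measured on different x grids
fine_x = np.linspace(1.0, 5.0, 9)
coarse_x = fine_x[::2]
fine = xr.DataArray(2 * fine_x[np.newaxis, :], dims=("sample", "x"), coords={"x": fine_x})
coarse = xr.DataArray(2 * coarse_x[np.newaxis, :], dims=("sample", "x"), coords={"x": coarse_x})

# Bring the coarse batch onto the fine grid, then append along 'sample'
coarse_on_fine = interp_to_grid(coarse, fine_x)
combined = xr.concat([fine, coarse_on_fine], dim="sample")

print(dict(combined.sizes))
```

xarray’s own interp method is an alternative to the explicit loop, though it relies on SciPy being installed.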

Method 4: Filling Missing Data#

Sometimes you might have incomplete data that needs to be filled from another dataset:

[ ]:

# Create a dataset with some missing values
ds_with_nans = ds.isel(sample=slice(0, 10)).copy()

# Set some measurement values to NaN
measurement_with_nans = ds_with_nans.measurement.values.copy()
measurement_with_nans[2:5, 30:60] = np.nan  # Set a block to NaN

ds_with_nans['measurement'] = (('sample', 'x'), measurement_with_nans)

# Visualize the dataset with missing values
plt.figure(figsize=(10, 6))

for i in range(3):
    plt.plot(ds_with_nans.x, ds_with_nans.measurement.isel(sample=i),
             label=f"Sample {i}")

plt.xlabel('x')
plt.ylabel('Measurement')
plt.title('Dataset with Missing Values')
plt.legend()
plt.tight_layout()
plt.show()

We can use the combine_first() method to fill missing values from another dataset:

[ ]:

# Create a dataset to fill the missing values
# We'll use the original dataset for this
ds_fill = ds.isel(sample=slice(0, 10))

# Fill the missing values
ds_filled = ds_with_nans.combine_first(ds_fill)

# Check if all NaNs are filled
print("NaNs in original:", np.isnan(ds_with_nans.measurement.values).sum())
print("NaNs after filling:", np.isnan(ds_filled.measurement.values).sum())

# Visualize the filled dataset
plt.figure(figsize=(10, 6))

for i in range(3):
    plt.plot(ds_filled.x, ds_filled.measurement.isel(sample=i),
             label=f"Sample {i} (filled)")

    # Also plot the original data with NaNs for comparison
    if i == 2:  # Sample 2 had NaNs
        plt.plot(ds_with_nans.x, ds_with_nans.measurement.isel(sample=i),
                 'r--', label=f"Sample {i} (with NaNs)")

plt.xlabel('x')
plt.ylabel('Measurement')
plt.title('Dataset After Filling Missing Values')
plt.legend()
plt.tight_layout()
plt.show()
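The behavior of combine_first is easier to see on a tiny example. In this sketch (the arrays a and b are made up for illustration), values already present in the calling object take priority, gaps are filled from the other object, and the result covers the union of both coordinate sets. For comparison, fillna fills the same gaps but keeps only the caller’s coordinates.

```python
import numpy as np
import xarray as xr

# 'a' has a gap at x=1; 'b' is complete and extends to x=3
a = xr.DataArray([1.0, np.nan, 3.0], dims="x", coords={"x": [0, 1, 2]})
b = xr.DataArray([10.0, 20.0, 30.0, 40.0], dims="x", coords={"x": [0, 1, 2, 3]})

# combine_first: keep a's values, fill NaNs from b, outer-join the coordinates
filled_outer = a.combine_first(b)

# fillna: fill NaNs from b, but stay on a's coordinates
filled_left = a.fillna(b)

print(filled_outer.values)
print(filled_left.values)
```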

Method 5: Updating Metadata When Combining Datasets#

When combining datasets, you might want to update the metadata (attributes):

[ ]:

# Combine datasets and update attributes
combined_ds = xr.concat([ds_batch1, ds_batch2], dim='sample')

# Set the attributes (assignment replaces any existing attrs; use .attrs.update(...) to keep them)
combined_ds.attrs = {
    'description': 'Combined dataset from two batches',
    'samples': f"{combined_ds.sizes['sample']} samples",
    'x_range': f"{combined_ds.x.values[0]:.3f} to {combined_ds.x.values[-1]:.3f}",
    'components': ', '.join([str(c.values) for c in combined_ds.component]),
    'created_date': pd.Timestamp.now().strftime('%Y-%m-%d'),
}

print("Combined Dataset Attributes:")
for key, value in combined_ds.attrs.items():
    print(f"{key}: {value}")
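Besides setting attributes by hand afterwards, xr.concat accepts a combine_attrs argument that controls how the inputs’ attributes are merged. A short sketch with two hypothetical batches (the attribute names and values are invented for illustration):

```python
import xarray as xr

# Two hypothetical batches that agree on 'instrument' but differ on 'batch'
a = xr.Dataset({"v": ("sample", [1.0, 2.0])}, attrs={"instrument": "SAXS", "batch": 1})
b = xr.Dataset({"v": ("sample", [3.0, 4.0])}, attrs={"instrument": "SAXS", "batch": 2})

# "override": copy the attributes of the first dataset
kept_first = xr.concat([a, b], dim="sample", combine_attrs="override")

# "drop_conflicts": keep only attributes on which all inputs agree
kept_shared = xr.concat([a, b], dim="sample", combine_attrs="drop_conflicts")

print(kept_first.attrs)
print(kept_shared.attrs)
```

Dropping conflicting attributes avoids silently labeling the combined data with metadata that only describes the first batch.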

Best Practices and Considerations#

When appending data to xarray Datasets, keep these tips in mind:

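Dimension Alignment: Ensure that the coordinates along dimensions you’re not concatenating along match; otherwise xarray may pad the result with NaN or raise an error, depending on the join setting.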
Data Types: Check that variables have compatible data types before combining.
Metadata: Decide how to handle metadata (attributes) when combining datasets.
Interpolation: When combining data with different coordinate values, consider interpolation to a common grid.
Units: Ensure that data being combined has consistent units.
Performance: For very large datasets, consider using dask-backed arrays so that data is processed lazily in chunks and, where possible, in parallel.
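The dimension alignment tip can be checked programmatically. The sketch below, on two made-up datasets whose x coordinates differ in one value, contrasts join="outer", which pads with NaN, with join="exact", which refuses to combine mismatched coordinates:

```python
import numpy as np
import xarray as xr

# Hypothetical batches whose x coordinates differ in the last value
a = xr.Dataset({"v": (("sample", "x"), np.ones((2, 3)))}, coords={"x": [1.0, 2.0, 3.0]})
b = xr.Dataset({"v": (("sample", "x"), np.ones((2, 3)))}, coords={"x": [1.0, 2.0, 4.0]})

# Outer join: x becomes the union of both grids and missing entries are NaN
padded = xr.concat([a, b], dim="sample", join="outer")
n_missing = int(padded.v.isnull().sum())
print(dict(padded.sizes), "missing values:", n_missing)

# Exact join: mismatched coordinates raise an error instead of being padded
try:
    xr.concat([a, b], dim="sample", join="exact")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print("Mismatch detected:", mismatch_detected)
```

Passing join explicitly makes the intended behavior clear to readers and keeps the result independent of library defaults.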

Conclusion#

In this tutorial, we’ve explored various methods to append data to xarray Datasets:

Using xr.concat() to combine datasets along a dimension
Using xr.merge() to add new variables to existing datasets
Combining datasets with different coordinate values through interpolation
Using combine_first() to fill missing data
Handling metadata when combining datasets

These techniques are essential for working with multiple batches of data, combining data from different sources, or extending your dataset with new samples or derived properties.