{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/usnistgov/AFL-agent/blob/main/docs/source/tutorials/building_pipelines.ipynb)\n", "\n", "# Building Pipelines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we'll go into more detail on the Quick Start Example from [Getting Started](getting_started.rst). In this example, we'll build a pipeline that \n", "\n", "- standardizes the input compositions to improve the convergence of the Gaussian Process optimization\n", "- uses a Savitzky-Golay filter to compute the first derivative of the measurement\n", "- computes the similarity between the derivatives of the measurement data\n", "- clusters (i.e., labels) the data using spectral clustering\n", "- fits a Gaussian Process classifier to the data\n", "- chooses the next optimal measurement based on the entropy of the Gaussian Process posterior \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "Only uncomment and run the next cell if you are running this notebook in Google Colab or if you don't already have the AFL-agent package installed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# !pip install git+https://github.com/usnistgov/AFL-agent.git" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below are the imported modules used in this tutorial" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import xarray as xr\n", "import matplotlib.pyplot as plt\n", "\n", "from AFL.double_agent import *\n", "from AFL.double_agent.plotting import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Input Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, to begin, we'll load in a pre-prepared `xarray.Dataset`. 
These are powerful and flexible data structures for working with multi-dimensional data, and `AFL.double_agent` uses them for all input, intermediate and output data.\n", "\n", "The dataset below contains simulated measurement data along with the compositions at which this simulated data was generated. It also has the ground truth phase labels along with a grid of compositions that the agent will search through for the next optimal measurement. \n", "\n", "To see how this dataset is created, see the [Building xarray.Datasets](../how-to/building_xarray_datasets) page.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 164kB\n",
       "Dimensions:              (sample: 100, component: 2, x: 150, grid: 2500)\n",
       "Coordinates:\n",
       "  * component            (component) <U1 8B 'A' 'B'\n",
       "  * x                    (x) float64 1kB 0.001 0.001047 0.001097 ... 0.9547 1.0\n",
       "Dimensions without coordinates: sample, grid\n",
       "Data variables:\n",
       "    composition          (sample, component) float64 2kB 5.7 1.36 ... 5.104\n",
       "    ground_truth_labels  (sample) int64 800B 1 1 0 1 0 1 1 1 ... 1 1 1 1 1 0 1 1\n",
       "    measurement          (sample, x) float64 120kB 1.915e+06 1.479e+06 ... 1.885\n",
       "    composition_grid     (grid, component) float64 40kB 0.0 0.0 ... 10.0 25.0
" ], "text/plain": [ " Size: 164kB\n", "Dimensions: (sample: 100, component: 2, x: 150, grid: 2500)\n", "Coordinates:\n", " * component (component) " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from AFL.double_agent import *\n", "\n", "with Pipeline() as my_first_pipeline:\n", " Standardize(\n", " input_variable='composition',\n", " output_variable='normalized_composition',\n", " dim='sample',\n", " component_dim='component',\n", " min_val={'A':0.0,'B':0.0},\n", " max_val={'A':10.0,'B':25.0},\n", " )\n", "\n", "my_first_pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Going over the arguments to Standardize one by one:\n", "\n", "- `input_variable='composition'`: The data variable to normalize, in this case the 'composition' variable\n", "- `output_variable='normalized_composition'`: The name of the new variable that will store the normalized data\n", "- `dim='sample'`: The dimension along which to compute statistics for normalization\n", "- `component_dim='component'`: The dimension containing different components/features\n", "- `min_val={'A':0.0,'B':0.0}`: Dictionary specifying minimum values for each component\n", "- `max_val={'A':10.0,'B':25.0}`: Dictionary specifying maximum values for each component\n", "\n", "\n", "We can view more information about the pipeline by printing it " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PipelineOp input_variable ---> output_variable\n", "---------- -----------------------------------\n", "0 ) composition ---> normalized_composition\n", "\n", "Input Variables\n", "---------------\n", "0) composition\n", "\n", "Output Variables\n", "----------------\n", "0) normalized_composition\n" ] } ], "source": [ "my_first_pipeline.print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can add more operations to the `Pipeline` by recreating the context" ] }, 
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PipelineOp input_variable ---> output_variable\n", "---------- -----------------------------------\n", "0 ) composition ---> normalized_composition\n", "1 ) composition_grid ---> normalized_composition_grid\n", "\n", "Input Variables\n", "---------------\n", "0) composition\n", "1) composition_grid\n", "\n", "Output Variables\n", "----------------\n", "0) normalized_composition\n", "1) normalized_composition_grid\n" ] } ], "source": [ "with my_first_pipeline:\n", " Standardize(\n", " input_variable='composition_grid',\n", " output_variable='normalized_composition_grid',\n", " dim='grid',\n", " component_dim='component',\n", " min_val={'A':0.0,'B':0.0},\n", " max_val={'A':10.0,'B':25.0},\n", " )\n", "\n", "my_first_pipeline.print()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can run the pipeline by calling the `.calculate` method and passing in the input dataset `ds`" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5e6b3e09a44b4b2a8326f6c9df390e19", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/2 [00:00\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 205kB\n",
       "Dimensions:                      (sample: 100, component: 2, x: 150, grid: 2500)\n",
       "Coordinates:\n",
       "  * component                    (component) <U1 8B 'A' 'B'\n",
       "  * x                            (x) float64 1kB 0.001 0.001047 ... 0.9547 1.0\n",
       "Dimensions without coordinates: sample, grid\n",
       "Data variables:\n",
       "    composition                  (sample, component) float64 2kB 5.7 ... 5.104\n",
       "    ground_truth_labels          (sample) int64 800B 1 1 0 1 0 1 ... 1 1 1 0 1 1\n",
       "    measurement                  (sample, x) float64 120kB 1.915e+06 ... 1.885\n",
       "    composition_grid             (grid, component) float64 40kB 0.0 0.0 ... 25.0\n",
       "    normalized_composition       (sample, component) float64 2kB 0.57 ... 0.2042\n",
       "    normalized_composition_grid  (grid, component) float64 40kB 0.0 0.0 ... 1.0
" ], "text/plain": [ " Size: 205kB\n", "Dimensions: (sample: 100, component: 2, x: 150, grid: 2500)\n", "Coordinates:\n", " * component (component) " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig,axes = plt.subplots(1,2,figsize=(8,3.25))\n", "\n", "plot_scatter_mpl(ds_result,'composition',component_dim='component',labels='ground_truth_labels',ax=axes[0])\n", "axes[0].set(title='Raw Composition Data')\n", "\n", "plot_scatter_mpl(ds_result,'normalized_composition',component_dim='component',labels='ground_truth_labels',ax=axes[1])\n", "axes[1].set(title='Normalized Composition Data')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the relative positions of the data points are unchanged; only the scale of the axes changes. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Savitzky-Golay Filter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the composition data processed, we can move on to processing the measurement data. In many cases, smoothing and filtering data can help remove noise and emphasize features in data that you want your agent to focus on. \n", "\n", "Here we'll add a `SavgolFilter` operation in order to calculate the first derivative of the measurement data. 
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PipelineOp input_variable ---> output_variable\n", "---------- -----------------------------------\n", "0 ) composition ---> normalized_composition\n", "1 ) composition_grid ---> normalized_composition_grid\n", "2 ) measurement ---> derivative\n", "\n", "Input Variables\n", "---------------\n", "0) composition\n", "1) composition_grid\n", "2) measurement\n", "\n", "Output Variables\n", "----------------\n", "0) normalized_composition\n", "1) normalized_composition_grid\n", "2) derivative\n" ] } ], "source": [ "with my_first_pipeline:\n", "\n", " SavgolFilter(\n", " input_variable='measurement', \n", " output_variable='derivative', \n", " dim='x', \n", " window_length=50,\n", " apply_log_scale=True,\n", " derivative=1,\n", "\n", " )\n", "\n", "my_first_pipeline.print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Let's go through each argument passed to SavgolFilter:\n", "\n", "* `input_variable='measurement'`: Specifies the data variable to filter, in this case the raw measurement data\n", "* `output_variable='derivative'`: Names the new variable that will store the filtered/derivative data\n", "* `dim='x'`: Indicates which dimension to apply the filter along (the x-axis values)\n", "* `window_length=50`: Sets the size of the moving window used for filtering - larger values give smoother results\n", "* `apply_log_scale=True`: Takes the log of the x-axis values before filtering, useful for data spanning multiple orders of magnitude\n", "* `derivative=1`: Calculates the first derivative of the data while filtering\n", "\n", "We can run the pipeline on the dataset and plot the results." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f82802c51a6e44d3baae14cd7202b0cc", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/3 [00:00\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 407kB\n",
       "Dimensions:                      (sample: 100, component: 2, x: 150,\n",
       "                                  grid: 2500, log_x: 250)\n",
       "Coordinates:\n",
       "  * component                    (component) <U1 8B 'A' 'B'\n",
       "  * x                            (x) float64 1kB 0.001 0.001047 ... 0.9547 1.0\n",
       "  * log_x                        (log_x) float64 2kB -3.0 -2.988 ... 0.0\n",
       "Dimensions without coordinates: sample, grid\n",
       "Data variables:\n",
       "    composition                  (sample, component) float64 2kB 5.7 ... 5.104\n",
       "    ground_truth_labels          (sample) int64 800B 1 1 0 1 0 1 ... 1 1 1 0 1 1\n",
       "    measurement                  (sample, x) float64 120kB 1.915e+06 ... 1.885\n",
       "    composition_grid             (grid, component) float64 40kB 0.0 0.0 ... 25.0\n",
       "    normalized_composition       (sample, component) float64 2kB 0.57 ... 0.2042\n",
       "    normalized_composition_grid  (grid, component) float64 40kB 0.0 0.0 ... 1.0\n",
       "    derivative                   (sample, log_x) float64 200kB -3.82 ... -0.4063
" ], "text/plain": [ " Size: 407kB\n", "Dimensions: (sample: 100, component: 2, x: 150,\n", " grid: 2500, log_x: 250)\n", "Coordinates:\n", " * component (component) " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig,axes = plt.subplots(1,2,figsize=(8,3.25))\n", "\n", "ds_result.measurement.plot.line(x='x',xscale='log',yscale='log',ax=axes[0],add_legend=False)\n", "ds_result.derivative.plot.line(x='log_x',ax=axes[1],add_legend=False);\n", "\n", "axes[0].set(title=\"Raw Data\")\n", "axes[1].set(title=\"Derivative of Smoothed Log(Data)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data on the right has more flat, constant regions than the data on the left, making it easier for the similarity and clustering analyses below to separate the measurements. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Calculate Similarity between Measurement Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have preprocessed our data using the Savgol filter, we can calculate the similarity between different measurements. The `Similarity` component computes a similarity matrix between all pairs of samples based on their filtered derivative data. 
This similarity matrix will be used as input for clustering in the next step.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PipelineOp input_variable ---> output_variable\n", "---------- -----------------------------------\n", "0 ) composition ---> normalized_composition\n", "1 ) composition_grid ---> normalized_composition_grid\n", "2 ) measurement ---> derivative\n", "3 ) derivative ---> similarity\n", "\n", "Input Variables\n", "---------------\n", "0) composition\n", "1) composition_grid\n", "2) measurement\n", "\n", "Output Variables\n", "----------------\n", "0) normalized_composition\n", "1) normalized_composition_grid\n", "2) similarity\n" ] } ], "source": [ "with my_first_pipeline:\n", " Similarity(\n", " input_variable='derivative', \n", " output_variable='similarity', \n", " sample_dim='sample',\n", " params={'metric': 'laplacian','gamma':1e-4}\n", " )\n", "\n", "my_first_pipeline.print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Similarity component takes the following inputs:\n", "\n", "- `input_variable`: The variable to calculate similarity between ('derivative')\n", "- `output_variable`: The variable to store the similarity matrix ('similarity')\n", "- `sample_dim`: The dimension containing different samples ('sample')\n", "- `params`: Dictionary of parameters for similarity calculation\n", " - `metric`: The similarity metric to use ('laplacian')\n", " - `gamma`: The scale parameter for the similarity metric (1e-4)\n", "\n", "Let's execute the pipeline" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "200f79473aba43d68d893247f7564ce2", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/4 [00:00\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 487kB\n",
       "Dimensions:                      (sample: 100, component: 2, x: 150,\n",
       "                                  grid: 2500, log_x: 250, sample_i: 100,\n",
       "                                  sample_j: 100)\n",
       "Coordinates:\n",
       "  * component                    (component) <U1 8B 'A' 'B'\n",
       "  * x                            (x) float64 1kB 0.001 0.001047 ... 0.9547 1.0\n",
       "  * log_x                        (log_x) float64 2kB -3.0 -2.988 ... 0.0\n",
       "Dimensions without coordinates: sample, grid, sample_i, sample_j\n",
       "Data variables:\n",
       "    composition                  (sample, component) float64 2kB 5.7 ... 5.104\n",
       "    ground_truth_labels          (sample) int64 800B 1 1 0 1 0 1 ... 1 1 1 0 1 1\n",
       "    measurement                  (sample, x) float64 120kB 1.915e+06 ... 1.885\n",
       "    composition_grid             (grid, component) float64 40kB 0.0 0.0 ... 25.0\n",
       "    normalized_composition       (sample, component) float64 2kB 0.57 ... 0.2042\n",
       "    normalized_composition_grid  (grid, component) float64 40kB 0.0 0.0 ... 1.0\n",
       "    derivative                   (sample, log_x) float64 200kB -3.82 ... -0.4063\n",
       "    similarity                   (sample_i, sample_j) float64 80kB 1.0 ... 1.0
" ], "text/plain": [ " Size: 487kB\n", "Dimensions: (sample: 100, component: 2, x: 150,\n", " grid: 2500, log_x: 250, sample_i: 100,\n", " sample_j: 100)\n", "Coordinates:\n", " * component (component) " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ds_result.similarity.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each pixel indexed by (i,j) in this image corresponds to the similarity between measurements i and j. The bright pixels indicate high similarity and the darker pixels indicate lower similarity. A check on this calculation is that the diagonal should have a perfect similarity of 1.0 because each measurement is perfectly similar to itself, i.e., `S(i,i) = S(j,j) = 1.0`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Cluster Measurement Data based on Similarity" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use the similarity matrix to cluster the data into groups. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PipelineOp input_variable ---> output_variable\n", "---------- -----------------------------------\n", "0 ) composition ---> normalized_composition\n", "1 ) composition_grid ---> normalized_composition_grid\n", "2 ) measurement ---> derivative\n", "3 ) derivative ---> similarity\n", "4 ) similarity ---> labels\n", "\n", "Input Variables\n", "---------------\n", "0) composition\n", "1) composition_grid\n", "2) measurement\n", "\n", "Output Variables\n", "----------------\n", "0) normalized_composition\n", "1) normalized_composition_grid\n", "2) labels\n" ] } ], "source": [ "with my_first_pipeline:\n", " SpectralClustering(\n", " input_variable='similarity',\n", " output_variable='labels',\n", " dim='sample',\n", " params={'n_phases': 2}\n", " )\n", "\n", "\n", "my_first_pipeline.print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The SpectralClustering pipeline operation takes:\n", "\n", " - `input_variable`: The similarity matrix to use for clustering ('similarity')\n", " - `output_variable`: The variable to store the cluster labels ('labels') 
\n", " - `dim`: The dimension containing different samples ('sample')\n", " - `params`: Dictionary of parameters for clustering\n", " - `n_phases`: The number of clusters/phases to find (2)\n", "\n", "Let's run the pipeline with this new operation" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "367c56818d734cee84398ef695190ab9", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/5 [00:00\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 488kB\n",
       "Dimensions:                      (sample: 100, component: 2, x: 150,\n",
       "                                  grid: 2500, log_x: 250, sample_i: 100,\n",
       "                                  sample_j: 100)\n",
       "Coordinates:\n",
       "  * component                    (component) <U1 8B 'A' 'B'\n",
       "  * x                            (x) float64 1kB 0.001 0.001047 ... 0.9547 1.0\n",
       "  * log_x                        (log_x) float64 2kB -3.0 -2.988 ... 0.0\n",
       "Dimensions without coordinates: sample, grid, sample_i, sample_j\n",
       "Data variables:\n",
       "    composition                  (sample, component) float64 2kB 5.7 ... 5.104\n",
       "    ground_truth_labels          (sample) int64 800B 1 1 0 1 0 1 ... 1 1 1 0 1 1\n",
       "    measurement                  (sample, x) float64 120kB 1.915e+06 ... 1.885\n",
       "    composition_grid             (grid, component) float64 40kB 0.0 0.0 ... 25.0\n",
       "    normalized_composition       (sample, component) float64 2kB 0.57 ... 0.2042\n",
       "    normalized_composition_grid  (grid, component) float64 40kB 0.0 0.0 ... 1.0\n",
       "    derivative                   (sample, log_x) float64 200kB -3.82 ... -0.4063\n",
       "    similarity                   (sample_i, sample_j) float64 80kB 1.0 ... 1.0\n",
       "    labels                       (sample) int64 800B 1 1 0 1 0 1 ... 1 1 1 0 1 1
" ], "text/plain": [ " Size: 488kB\n", "Dimensions: (sample: 100, component: 2, x: 150,\n", " grid: 2500, log_x: 250, sample_i: 100,\n", " sample_j: 100)\n", "Coordinates:\n", " * component (component) " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "fig,axes = plt.subplots(1,2,figsize=(8,3.25))\n", "\n", "plot_scatter_mpl(ds_result,'composition',component_dim='component',labels='ground_truth_labels',ax=axes[0])\n", "plot_scatter_mpl(ds_result,'composition',component_dim='component',labels='labels',ax=axes[1])\n", "\n", "axes[0].set(title=\"Ground Truth Labels\")\n", "axes[1].set(title=\"Spectral Clustering Labels\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Extrapolate Cluster Labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can extrapolate the labels from the `SpectralClustering` over the `composition_grid` that we supplied in the input dataset." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PipelineOp input_variable ---> output_variable\n", "---------- -----------------------------------\n", "0 ) composition ---> normalized_composition\n", "1 ) composition_grid ---> normalized_composition_grid\n", "2 ) measurement ---> derivative\n", "3 ) derivative ---> similarity\n", "4 ) similarity ---> labels\n", "5 ) ['normalized_composition', 'labels', 'normalized_composition_grid'] ---> ['extrap_mean', 'extrap_entropy']\n", "\n", "Input Variables\n", "---------------\n", "0) composition\n", "1) composition_grid\n", "2) measurement\n", "\n", "Output Variables\n", "----------------\n", "0) extrap_mean\n", "1) extrap_entropy\n" ] } ], "source": [ "with my_first_pipeline:\n", " GaussianProcessClassifier(\n", " feature_input_variable='normalized_composition',\n", " predictor_input_variable='labels',\n", " output_prefix='extrap',\n", " sample_dim='sample',\n", " 
 grid_variable='normalized_composition_grid',\n", " grid_dim='grid',\n", " )\n", "my_first_pipeline.print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The GaussianProcessClassifier pipeline operation takes:\n", "\n", "- `feature_input_variable`: The composition data to use for training ('normalized_composition')\n", "- `predictor_input_variable`: The labels to predict ('labels')\n", "- `output_prefix`: Prefix for output variables ('extrap')\n", "- `sample_dim`: The dimension containing different samples ('sample')\n", "- `grid_variable`: The grid points to extrapolate to ('normalized_composition_grid')\n", "- `grid_dim`: The dimension containing grid points ('grid')" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "1ef66176cd6e4558b64fb54bc8317da1", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/6 [00:00\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 548kB\n",
       "Dimensions:                      (sample: 100, component: 2, x: 150,\n",
       "                                  grid: 2500, log_x: 250, sample_i: 100,\n",
       "                                  sample_j: 100)\n",
       "Coordinates:\n",
       "  * component                    (component) <U1 8B 'A' 'B'\n",
       "  * x                            (x) float64 1kB 0.001 0.001047 ... 0.9547 1.0\n",
       "  * log_x                        (log_x) float64 2kB -3.0 -2.988 ... 0.0\n",
       "Dimensions without coordinates: sample, grid, sample_i, sample_j\n",
       "Data variables:\n",
       "    composition                  (sample, component) float64 2kB 5.7 ... 5.104\n",
       "    ground_truth_labels          (sample) int64 800B 1 1 0 1 0 1 ... 1 1 1 0 1 1\n",
       "    measurement                  (sample, x) float64 120kB 1.915e+06 ... 1.885\n",
       "    composition_grid             (grid, component) float64 40kB 0.0 0.0 ... 25.0\n",
       "    normalized_composition       (sample, component) float64 2kB 0.57 ... 0.2042\n",
       "    normalized_composition_grid  (grid, component) float64 40kB 0.0 0.0 ... 1.0\n",
       "    derivative                   (sample, log_x) float64 200kB -3.82 ... -0.4063\n",
       "    similarity                   (sample_i, sample_j) float64 80kB 1.0 ... 1.0\n",
       "    labels                       (sample) int64 800B 1 1 0 1 0 1 ... 1 1 1 0 1 1\n",
       "    extrap_mean                  (grid) int64 20kB 1 1 1 1 1 1 1 ... 1 1 1 1 1 1\n",
       "    extrap_entropy               (grid) float64 20kB 0.5813 0.5687 ... 0.4603\n",
       "    extrap_y_prob                (grid) float64 20kB 0.5813 0.5687 ... 0.4603
" ], "text/plain": [ " Size: 548kB\n", "Dimensions: (sample: 100, component: 2, x: 150,\n", " grid: 2500, log_x: 250, sample_i: 100,\n", " sample_j: 100)\n", "Coordinates:\n", " * component (component) " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig,axes = plt.subplots(1,2,figsize=(8,3.25))\n", "\n", "plot_surface_mpl(ds_result,'composition_grid',component_dim='component',labels='extrap_mean',ax=axes[0])\n", "plot_surface_mpl(ds_result,'composition_grid',component_dim='component',labels='extrap_entropy',ax=axes[1])\n", "\n", "axes[0].set(title=\"Most Likely Phase Label\")\n", "axes[1].set(title=\"Entropy of Phase Label\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The right subplot is related to our confidence in the label prediction and is a powerful tool for finding label boundaries because, by construction, it is maximized at label boundaries. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Calculate Next Sample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a model that can predict phase labels and their uncertainty, we can use this information to select the next sample point. 
The `MaxValueAF` pipeline operation will select the composition with maximum entropy as the next point to measure, since high entropy indicates regions where the model is most uncertain about the phase label.\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PipelineOp input_variable ---> output_variable\n", "---------- -----------------------------------\n", "0 ) composition ---> normalized_composition\n", "1 ) composition_grid ---> normalized_composition_grid\n", "2 ) measurement ---> derivative\n", "3 ) derivative ---> similarity\n", "4 ) similarity ---> labels\n", "5 ) ['normalized_composition', 'labels', 'normalized_composition_grid'] ---> ['extrap_mean', 'extrap_entropy']\n", "6 ) ['extrap_entropy', 'composition_grid'] ---> next_sample\n", "\n", "Input Variables\n", "---------------\n", "0) composition\n", "1) composition_grid\n", "2) measurement\n", "\n", "Output Variables\n", "----------------\n", "0) extrap_mean\n", "1) next_sample\n" ] } ], "source": [ "with my_first_pipeline:\n", " MaxValueAF(\n", " input_variables=['extrap_entropy'],\n", " output_variable='next_sample',\n", " grid_variable='composition_grid',\n", " )\n", "\n", "my_first_pipeline.print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's run the pipeline" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "085eea8cbbbb4104a87fe6de9926459e", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/7 [00:00\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 568kB\n",
       "Dimensions:                      (sample: 100, component: 2, x: 150,\n",
       "                                  grid: 2500, log_x: 250, sample_i: 100,\n",
       "                                  sample_j: 100, AF_sample: 1)\n",
       "Coordinates:\n",
       "  * component                    (component) <U1 8B 'A' 'B'\n",
       "  * x                            (x) float64 1kB 0.001 0.001047 ... 0.9547 1.0\n",
       "  * log_x                        (log_x) float64 2kB -3.0 -2.988 ... 0.0\n",
       "Dimensions without coordinates: sample, grid, sample_i, sample_j, AF_sample\n",
       "Data variables: (12/14)\n",
       "    composition                  (sample, component) float64 2kB 5.7 ... 5.104\n",
       "    ground_truth_labels          (sample) int64 800B 1 1 0 1 0 1 ... 1 1 1 0 1 1\n",
       "    measurement                  (sample, x) float64 120kB 1.915e+06 ... 1.885\n",
       "    composition_grid             (grid, component) float64 40kB 0.0 0.0 ... 25.0\n",
       "    normalized_composition       (sample, component) float64 2kB 0.57 ... 0.2042\n",
       "    normalized_composition_grid  (grid, component) float64 40kB 0.0 0.0 ... 1.0\n",
       "    ...                           ...\n",
       "    labels                       (sample) int64 800B 1 1 0 1 0 1 ... 1 1 1 0 1 1\n",
       "    extrap_mean                  (grid) int64 20kB 1 1 1 1 1 1 1 ... 1 1 1 1 1 1\n",
       "    extrap_entropy               (grid) float64 20kB 0.5813 0.5687 ... 0.4603\n",
       "    extrap_y_prob                (grid) float64 20kB 0.5813 0.5687 ... 0.4603\n",
       "    decision_surface             (grid) float64 20kB 0.7655 0.7391 ... 0.512\n",
       "    next_sample                  (AF_sample, component) float64 16B 4.082 24.49
" ], "text/plain": [ " Size: 568kB\n", "Dimensions: (sample: 100, component: 2, x: 150,\n", " grid: 2500, log_x: 250, sample_i: 100,\n", " sample_j: 100, AF_sample: 1)\n", "Coordinates:\n", " * component (component) " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig,axes = plt.subplots(1,2,figsize=(8,3.25))\n", "\n", "plot_surface_mpl(ds_result,'composition_grid',component_dim='component',labels='extrap_mean',ax=axes[0])\n", "plot_surface_mpl(ds_result,'composition_grid',component_dim='component',labels='extrap_entropy',ax=axes[1])\n", "\n", "plot_scatter_mpl(ds_result,'next_sample',component_dim='component',labels=[-1],marker='x',color='red',s=100,ax=axes[0])\n", "plot_scatter_mpl(ds_result,'next_sample',component_dim='component',labels=[-1],marker='x',color='red',s=100,ax=axes[1])\n", "\n", "axes[0].set(title=\"Most Likely Phase Label\")\n", "axes[1].set(title=\"Entropy of Phase Label\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See that the red X is placed near the boundary of the two phases. Running the pipeline several times, you should see the X move about the bright region of entropy. This is because the `MaxValueAF` acquisition function doesn't choose the absolute maximum, but rather randomly chooses from the top `acquisition_rtol` percent of the entropy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Full Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With that, we have a full `Pipeline` which defines the behavior of a decision agent! 
Let's view the whole pipeline defined in a single context:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PipelineOp input_variable ---> output_variable\n", "---------- -----------------------------------\n", "0 ) composition ---> normalized_composition\n", "1 ) composition_grid ---> normalized_composition_grid\n", "2 ) measurement ---> derivative\n", "3 ) derivative ---> similarity\n", "4 ) similarity ---> labels\n", "5 ) ['normalized_composition', 'labels', 'normalized_composition_grid'] ---> ['extrap_mean', 'extrap_entropy']\n", "6 ) ['extrap_entropy', 'composition_grid'] ---> next_sample\n", "\n", "Input Variables\n", "---------------\n", "0) composition\n", "1) composition_grid\n", "2) measurement\n", "\n", "Output Variables\n", "----------------\n", "0) extrap_mean\n", "1) next_sample\n" ] } ], "source": [ "with Pipeline() as my_first_pipeline:\n", "\n", "    Standardize(\n", "        input_variable='composition',\n", "        output_variable='normalized_composition',\n", "        dim='sample',\n", "        component_dim='component',\n", "        min_val={'A':0.0,'B':0.0},\n", "        max_val={'A':10.0,'B':25.0},\n", "    )\n", "\n", "    Standardize(\n", "        input_variable='composition_grid',\n", "        output_variable='normalized_composition_grid',\n", "        dim='grid',\n", "        component_dim='component',\n", "        min_val={'A':0.0,'B':0.0},\n", "        max_val={'A':10.0,'B':25.0},\n", "    )\n", "\n", "    SavgolFilter(\n", "        input_variable='measurement',\n", "        output_variable='derivative',\n", "        dim='x',\n", "        derivative=1\n", "    )\n", "\n", "    Similarity(\n", "        input_variable='derivative',\n", "        output_variable='similarity',\n", "        sample_dim='sample',\n", "        params={'metric': 'laplacian','gamma':1e-4}\n", "    )\n", "\n", "    SpectralClustering(\n", "        input_variable='similarity',\n", "        output_variable='labels',\n", "        dim='sample',\n", "        params={'n_phases': 2}\n", "    )\n", "\n", "    GaussianProcessClassifier(\n",
"        feature_input_variable='normalized_composition',\n", "        predictor_input_variable='labels',\n", "        output_prefix='extrap',\n", "        sample_dim='sample',\n", "        grid_variable='normalized_composition_grid',\n", "        grid_dim='grid',\n", "    )\n", "\n", "    MaxValueAF(\n", "        input_variables=['extrap_entropy'],\n", "        output_variable='next_sample',\n", "        grid_variable='composition_grid',\n", "    )\n", "\n", "my_first_pipeline.print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also visualize the full pipeline using the `.draw` and `.draw_plotly` methods." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "my_first_pipeline.draw();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While this doesn't always produce the most visually appealing graphs, it is a powerful way to check the consistency and flow of complex pipelines." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we learned how to build pipelines using `AFL.double_agent` by:\n", "\n", "- Creating a new pipeline using `Pipeline()`\n", "- Adding data processing steps like normalization and derivative calculation\n", "- Implementing spectral clustering for phase identification\n", "- Using Gaussian Process classification to extrapolate phase boundaries\n", "- Adding active learning with acquisition functions to guide further sampling\n", "- Visualizing the pipeline structure and results at each step\n", "\n", "The pipeline we built demonstrates a complete workflow, from raw data processing through machine learning and active learning. This modular approach allows us to easily modify individual components while maintaining a clear data flow between steps.\n", "\n", "For more examples of AFL pipelines and components, check out the other tutorials and examples in the documentation.\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" } }, "nbformat": 4, "nbformat_minor": 4 }