AFL.double_agent.Preprocessor module#

PipelineOps for Data Preprocessing

This module contains preprocessing operations that transform, normalize, and prepare data for analysis. Preprocessors handle tasks such as:

  • Scaling and normalizing data

  • Transforming between coordinate systems

  • Filtering and smoothing signals

  • Extracting features from raw measurements

  • Converting between different data representations

Each preprocessor is implemented as a PipelineOp that can be composed with others in a processing pipeline.

class AFL.double_agent.Preprocessor.ArrayToVars(input_variable: str, output_variables: list, split_dim: list, postfix: str = '', squeeze: bool = False, name: str = 'DatasetToVars')#

Bases: Preprocessor

Convert an array into multiple variables

Parameters:
  • input_variable (str) – The name of the array variable to split into separate variables

  • output_variables (list) – The names of the variables to create from the array

  • split_dim (str) – The dimension to split the array along

  • postfix (str, default='') – String to append to output variable names

  • squeeze (bool, default=False) – Whether to squeeze out single-dimension axes

  • name (str, default='DatasetToVars') – The name to use when added to a Pipeline

calculate(dataset)#

Apply this PipelineOp to the supplied xarray.Dataset

class AFL.double_agent.Preprocessor.BarycentricToTernaryXY(input_variable: str, output_variable: str, sample_dim: str, name: str = 'BarycentricToTernaryXY')#

Bases: Preprocessor

Transform from ternary coordinates to xy coordinates

Note

Adapted from the barycentric transform in mpltern: yuzie007/mpltern

Parameters:
  • input_variable (str) – The name of the xarray.Dataset data variable to extract from the input dataset

  • output_variable (str) – The name of the variable to be inserted into the xarray.Dataset by this PipelineOp

  • sample_dim (str) – The sample dimension, i.e., the dimension of compositions or grid points

  • name (str) – The name to use when added to a Pipeline. This name is used when calling Pipeline.search()

calculate(dataset: Dataset) Self#

Apply this PipelineOp to the supplied xarray.Dataset
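The coordinate mapping itself is a convex combination of the triangle's corner positions; a minimal numpy sketch follows. The corner placement below is an assumed convention for illustration; BarycentricToTernaryXY follows mpltern's convention, which may differ.

```python
import numpy as np

# Corner positions of the ternary triangle in xy space (an assumed
# convention; the actual corners follow mpltern and may differ).
CORNERS = np.array([
    [0.0, 0.0],                # corner for component 1
    [1.0, 0.0],                # corner for component 2
    [0.5, np.sqrt(3) / 2.0],   # corner for component 3
])

def barycentric_to_xy(bary):
    """Map (N, 3) barycentric compositions to (N, 2) xy coordinates."""
    bary = np.asarray(bary, dtype=float)
    # Normalize so each composition sums to 1 (compositions need not be
    # pre-normalized)
    bary = bary / bary.sum(axis=-1, keepdims=True)
    return bary @ CORNERS

print(barycentric_to_xy([[0.5, 0.25, 0.25]]))  # a point inside the triangle
```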

class AFL.double_agent.Preprocessor.Destandardize(input_variable: str, output_variable: str, dim: str, component_dim: str | None = 'component', scale_variable: str | None = None, min_val: Number | None = None, max_val: Number | None = None, name: str = 'Destandardize')#

Bases: Preprocessor

Transform the data back from 0->1 scaling to its original range

Parameters:
  • input_variable (str) – The name of the xarray.Dataset data variable to extract from the input dataset

  • output_variable (str) – The name of the variable to be inserted into the xarray.Dataset by this PipelineOp

  • dim (str) – The dimension used for calculating the data minimum and maximum

  • component_dim (str | None, default="component") – The dimension for component-wise operations

  • scale_variable (str | None, default=None) – If specified, the min/max of this data variable in the supplied xarray.Dataset will be used to scale the data rather than min/max of the input_variable or the supplied min_val or max_val

  • min_val (Number | None, default=None) – Value used to scale the data minimum

  • max_val (Number | None, default=None) – Value used to scale the data maximum

  • name (str, default="Destandardize") – The name to use when added to a Pipeline

calculate(dataset: Dataset) Self#

Apply this PipelineOp to the supplied xarray.Dataset

class AFL.double_agent.Preprocessor.Extrema(input_variable: str, output_variable: str, dim: str, return_coords: bool = False, operator='max', slice: List | None = None, slice_dim: str | None = None, name: str = 'Extrema')#

Bases: Preprocessor

Find the extrema of a data variable

calculate(dataset: Dataset) Self#

Apply this PipelineOp to the supplied xarray.Dataset
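Since the class docstring is terse, here is a minimal numpy sketch of the core computation. The behavior of return_coords=True shown here (returning the coordinate of the extremum alongside its value) is an assumption based on the parameter name.

```python
import numpy as np

# Extrema operates on xarray variables, but the core computation
# reduces to argmax/argmin along the chosen dimension.
x = np.linspace(0, 2 * np.pi, 101)
y = np.sin(x)

i = np.argmax(y)                  # operator='max'
peak_value, peak_coord = y[i], x[i]
print(peak_value, peak_coord)     # ~1.0 near x = pi/2
```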

class AFL.double_agent.Preprocessor.Preprocessor(input_variable: str = None, output_variable: str = None, name: str = 'PreprocessorBase')#

Bases: PipelineOp

Base class stub for all preprocessors

Parameters:
  • input_variable (str) – The name of the xarray.Dataset data variable to extract from the input dataset

  • output_variable (str) – The name of the variable to be inserted into the xarray.Dataset by this PipelineOp

  • name (str) – The name to use when added to a Pipeline. This name is used when calling Pipeline.search()

calculate(dataset: Dataset) Self#

Apply this PipelineOp to the supplied xarray.Dataset
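To illustrate the subclassing pattern, here is a hypothetical preprocessor written against a minimal stand-in base class. The real Preprocessor/PipelineOp interface carries more machinery (variable extraction, output bookkeeping), so treat the names below as illustrative assumptions rather than the actual AFL API.

```python
import numpy as np

# Minimal stand-in for the Preprocessor base class (hypothetical; the
# real AFL base class is richer).
class MiniPreprocessor:
    def __init__(self, input_variable=None, output_variable=None,
                 name='PreprocessorBase'):
        self.input_variable = input_variable
        self.output_variable = output_variable
        self.name = name
        self.output = {}

class SubtractMean(MiniPreprocessor):
    """Baseline a variable by subtracting its mean."""
    def calculate(self, dataset):
        data = np.asarray(dataset[self.input_variable])
        self.output[self.output_variable] = data - data.mean()
        return self  # returning self lets calculate() calls be chained

op = SubtractMean(input_variable='raw', output_variable='baselined',
                  name='SubtractMean')
op.calculate({'raw': [1.0, 2.0, 3.0]})
print(op.output['baselined'])  # [-1.  0.  1.]
```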

class AFL.double_agent.Preprocessor.SavgolFilter(input_variable: str, output_variable: str, dim: str = 'q', xlo: Number | None = None, xhi: Number | None = None, xlo_isel: int | None = None, xhi_isel: int | None = None, pedestal: Number | None = None, npts: int = 250, derivative: int = 0, window_length: int = 31, polyorder: int = 2, apply_log_scale: bool = True, name: str = 'SavgolFilter')#

Bases: Preprocessor

Smooth and take derivatives of input data via a Savitzky-Golay filter

This PipelineOp cleans measurement data and takes smoothed derivatives using scipy.signal.savgol_filter. Below is a summary of the steps taken.

Parameters:
  • input_variable (str) – The name of the xarray.Dataset data variable to extract from the input dataset

  • output_variable (str) – The name of the variable to be inserted into the xarray.Dataset by this PipelineOp

  • dim (str) – The dimension in the xarray.Dataset to apply this filter over

  • xlo (Optional[Number]) – The values of the input dimension (dim, above) to trim the data to

  • xhi (Optional[Number]) – The values of the input dimension (dim, above) to trim the data to

  • xlo_isel (Optional[int]) – The integer indices of the input dimension (dim, above) to trim the data to

  • xhi_isel (Optional[int]) – The integer indices of the input dimension (dim, above) to trim the data to

  • pedestal (Optional[Number]) – This value is added to the input_variable to establish a fixed data ‘floor’

  • npts (int) – The size of the grid to interpolate onto

  • derivative (int) – The order of the derivative to return. If derivative=0, the data is smoothed with no derivative taken.

  • window_length (int) – The width of the window used in the savgol smoothing. See scipy.signal.savgol_filter for more information.

  • polyorder (int) – The order of polynomial used in the savgol smoothing. See scipy.signal.savgol_filter for more information.

  • apply_log_scale (bool) – If True, the input_variable and associated dim coordinated are scaled with numpy.log10

  • name (str) – The name to use when added to a Pipeline. This name is used when calling Pipeline.search()

Notes

This PipelineOp performs the following steps:

1. Data is trimmed to (xlo, xhi) and then (xlo_isel, xhi_isel), in that order. The former trims the data to a numerical range while the latter trims to integer indices. It is generally not advisable to supply both varieties, and a warning will be raised if this is attempted.

2. If apply_log_scale = True, both the input_variable and dim data will be scaled with numpy.log10. A new xarray dimension and coordinate will be created with the name log_{dim}.

3. All duplicate data (multiple data values at the same dim coordinates) are removed by taking the average of the duplicates.

4. If pedestal is specified, the pedestal value is added to the data and all NaNs are filled with the pedestal value.

5. The data is interpolated onto a constant grid with npts values from the trimmed minimum to the trimmed maximum. If apply_log_scale=True, the grid is geometrically rather than linearly spaced.

6. All remaining NaN values are dropped along dim.

7. Finally, scipy.signal.savgol_filter is applied with the window_length, polyorder, and derivative parameters specified in the constructor.

calculate(dataset: Dataset) Self#

Apply this PipelineOp to the supplied xarray.Dataset
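The numbered steps above can be sketched directly with numpy and scipy. This is a simplified version of what SavgolFilter does, not the actual implementation: deduplication, pedestal handling, and NaN dropping are omitted, and only the apply_log_scale=True path is shown.

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_scattering(q, I, xlo=None, xhi=None, npts=250,
                      window_length=31, polyorder=2, derivative=0):
    """Simplified sketch of the SavgolFilter steps (log-scaled path)."""
    q, I = np.asarray(q, float), np.asarray(I, float)
    # Step 1: trim to (xlo, xhi)
    mask = np.ones_like(q, dtype=bool)
    if xlo is not None:
        mask &= q >= xlo
    if xhi is not None:
        mask &= q <= xhi
    q, I = q[mask], I[mask]
    # Step 2: log-scale both the data and the coordinate
    logq, logI = np.log10(q), np.log10(I)
    # Step 5: interpolate onto a uniform grid of npts points
    grid = np.linspace(logq.min(), logq.max(), npts)
    logI_u = np.interp(grid, logq, logI)
    # Step 7: Savitzky-Golay smoothing; delta makes deriv a derivative
    # with respect to the grid coordinate rather than the sample index
    delta = grid[1] - grid[0]
    out = savgol_filter(logI_u, window_length=window_length,
                        polyorder=polyorder, deriv=derivative, delta=delta)
    return grid, out

# For I(q) ~ q^-2, the smoothed log-log slope (derivative=1) is near -2
q = np.geomspace(1e-3, 1.0, 500)
I = q ** -2 * (1 + 0.05 * np.random.default_rng(0).normal(size=q.size))
grid, slope = smooth_scattering(q, I, derivative=1)
print(slope.mean())
```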

class AFL.double_agent.Preprocessor.Standardize(input_variable: str, output_variable: str, dim: str, component_dim: str | None = 'component', scale_variable: str | None = None, min_val: Number | None = None, max_val: Number | None = None, name: str = 'Standardize')#

Bases: Preprocessor

Standardize the data to have min 0 and max 1

Parameters:
  • input_variable (str) – The name of the xarray.Dataset data variable to extract from the input dataset

  • output_variable (str) – The name of the variable to be inserted into the xarray.Dataset by this PipelineOp

  • dim (str) – The dimension used for calculating the data minimum and maximum

  • component_dim (str | None, default="component") – The dimension for component-wise operations

  • scale_variable (str | None, default=None) – If specified, the min/max of this data variable in the supplied xarray.Dataset will be used to scale the data rather than min/max of the input_variable or the supplied min_val or max_val

  • min_val (Number | None, default=None) – Value used to scale the data minimum

  • max_val (Number | None, default=None) – Value used to scale the data maximum

  • name (str, default="Standardize") – The name to use when added to a Pipeline

calculate(dataset: Dataset) Self#

Apply this PipelineOp to the supplied xarray.Dataset
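The scaling (and its inverse, Destandardize) reduces to an affine min/max map. A minimal numpy sketch, assuming the default behavior with no scale_variable or min_val/max_val overrides:

```python
import numpy as np

data = np.array([2.0, 4.0, 6.0, 10.0])

# Standardize: map [min, max] -> [0, 1] along the chosen dim
lo, hi = data.min(), data.max()
standardized = (data - lo) / (hi - lo)
print(standardized)  # 0 at the minimum, 1 at the maximum

# Destandardize: invert the affine map to recover the original values
restored = standardized * (hi - lo) + lo
print(restored)
```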

class AFL.double_agent.Preprocessor.Subtract(input_variable: str, output_variable: str, dim: str, value: float | str, coord_value: bool = True, name: str = 'Subtract')#

Bases: Preprocessor

Baseline input variable by subtracting a value

calculate(dataset: Dataset) Self#

Apply this PipelineOp to the supplied xarray.Dataset

class AFL.double_agent.Preprocessor.SubtractMin(input_variable: str, output_variable: str, dim: str, name: str = 'SubtractMin')#

Bases: Preprocessor

Baseline input variable by subtracting minimum value

calculate(dataset: Dataset) Self#

Apply this PipelineOp to the supplied xarray.Dataset
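A minimal numpy sketch of the baseline operation, assuming dim selects the axis along which the minimum is taken:

```python
import numpy as np

# SubtractMin reduces to subtracting the per-row minimum along a dimension
data = np.array([[3.0, 5.0, 4.0],
                 [10.0, 12.0, 11.0]])
baselined = data - data.min(axis=1, keepdims=True)
print(baselined)  # each row's minimum is now 0
```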

class AFL.double_agent.Preprocessor.SympyTransform(input_variable: str, output_variable: str, sample_dim: str, transforms: Dict[str, object], transform_dim: str, component_dim: str = 'component', name: str = 'SympyTransform')#

Bases: Preprocessor

Transform data using sympy expressions

Parameters:
  • input_variable (str) – The name of the xarray.Dataset data variable to extract from the input dataset

  • output_variable (str) – The name of the variable to be inserted into the xarray.Dataset by this PipelineOp

  • sample_dim (str) – The sample dimension i.e., the dimension of compositions or grid points

  • component_dim (str, default="component") – The dimension of the component of each gridpoint

  • transforms (Dict[str,object]) – A dictionary of transforms (sympy expressions) to evaluate to generate new variables. For this method to function, the transforms must be completely specified except for the names in component_dim of the input_variable

  • transform_dim (str) – The name of the dimension that the ‘component_dim’ will be transformed to

  • name (str, default="SympyTransform") – The name to use when added to a Pipeline

Example

```python
from AFL.double_agent import *
import sympy
import xarray as xr

with Pipeline() as p:
    CartesianGrid(
        output_variable='comps',
        grid_spec={
            'A': {'min': 1, 'max': 25, 'steps': 5},
            'B': {'min': 1, 'max': 25, 'steps': 5},
            'C': {'min': 1, 'max': 25, 'steps': 5},
        },
        sample_dim='grid',
    )

    A, B, C = sympy.symbols('A B C')
    vA = A / (A + B + C)
    vB = B / (A + B + C)
    vC = C / (A + B + C)
    SympyTransform(
        input_variable='comps',
        output_variable='trans_comps',
        sample_dim='grid',
        transforms={'vA': vA, 'vB': vB, 'vC': vC},
        transform_dim='trans_component',
    )

p.calculate(xr.Dataset())  # returns dataset with grid and transformed grid
```

calculate(dataset: Dataset) Self#

Apply this PipelineOp to the supplied xarray.Dataset

class AFL.double_agent.Preprocessor.TernaryXYToBarycentric(input_variable: str, output_variable: str, sample_dim: str, name: str = 'TernaryXYToBarycentric')#

Bases: Preprocessor

Transform to ternary coordinates from xy coordinates

Note

Adapted from BaryCentric transform mpltern: yuzie007/mpltern

Parameters:
  • input_variable (str) – The name of the xarray.Dataset data variable to extract from the input dataset

  • output_variable (str) – The name of the variable to be inserted into the xarray.Dataset by this PipelineOp

  • sample_dim (str) – The sample dimension, i.e., the dimension of compositions or grid points

  • name (str, default="TernaryXYToBarycentric") – The name to use when added to a Pipeline

calculate(dataset: Dataset) Self#

Apply this PipelineOp to the supplied xarray.Dataset

class AFL.double_agent.Preprocessor.VarsToArray(input_variables: List, output_variable: str, variable_dim: str, squeeze: bool = False, variable_mapping: Dict = None, name: str = 'VarsToArray')#

Bases: Preprocessor

Convert multiple variables into a single array

Parameters:
  • input_variables (List) – List of input variables to combine into an array

  • output_variable (str) – The name of the variable to be inserted into the dataset

  • variable_dim (str) – The dimension name for the variables in the output array

  • squeeze (bool, default=False) – Whether to squeeze out single-dimension axes

  • variable_mapping (Dict, default=None) – Optional mapping to rename variables

  • name (str, default='VarsToArray') – The name to use when added to a Pipeline

calculate(dataset)#

Apply this PipelineOp to the supplied xarray.Dataset
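Together with ArrayToVars, this is essentially a stack/unstack pair. A sketch of the underlying xarray operations (the variable and dimension names here are illustrative, not the op's actual internals):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({
    "a": ("sample", np.array([1.0, 2.0])),
    "b": ("sample", np.array([3.0, 4.0])),
})

# VarsToArray: stack data variables along a new dimension
stacked = xr.concat([ds["a"], ds["b"]], dim="variable")
stacked = stacked.assign_coords(variable=["a", "b"])
print(stacked.shape)  # (2, 2): (variable, sample)

# ArrayToVars: the inverse -- split back into separate variables
unstacked = {name: stacked.sel(variable=name, drop=True)
             for name in ["a", "b"]}
print(unstacked["b"].values)  # [3. 4.]
```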

class AFL.double_agent.Preprocessor.Zscale(input_variable: str, output_variable: str, dim: str, name: str = 'Zscale')#

Bases: Preprocessor

Z-scale the data to have mean 0 and standard deviation 1

Parameters:
  • input_variable (str) – The name of the xarray.Dataset data variable to extract from the input dataset

  • output_variable (str) – The name of the variable to be inserted into the xarray.Dataset by this PipelineOp

  • dim (str) – The dimension to use when calculating the mean and standard deviation

  • name (str) – The name to use when added to a Pipeline. This name is used when calling Pipeline.search()

calculate(dataset: Dataset) Self#

Apply this PipelineOp to the supplied xarray.Dataset

class AFL.double_agent.Preprocessor.ZscaleError(input_variables: str | None | List[str], output_variable: str, dim: str, name: str = 'Zscale_error')#

Bases: Preprocessor

Z-scale the y_err data; the first input variable is y, the second is y_err

Parameters:
  • input_variables (Union[Optional[str], List[str]]) – The names of the input variables - first is y, second is y_err

  • output_variable (str) – The name of the variable to be inserted into the dataset

  • dim (str) – The dimension to use when calculating the mean and standard deviation

  • name (str, default="Zscale_error") – The name to use when added to a Pipeline

calculate(dataset: Dataset) Self#

Apply this PipelineOp to the supplied xarray.Dataset
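The relationship between Zscale and ZscaleError is simple error propagation: shifting y by its mean does not affect the uncertainties, so only the division by the standard deviation carries over to y_err. A numpy sketch under that assumption:

```python
import numpy as np

y = np.array([1.0, 3.0, 5.0, 7.0])
y_err = np.array([0.1, 0.1, 0.2, 0.2])

# Zscale: center and scale y
std = y.std()
y_scaled = (y - y.mean()) / std

# ZscaleError: the additive shift drops out, so only the scale applies
y_err_scaled = y_err / std

print(y_scaled)
print(y_err_scaled)
```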