HLG-MOS_SYNTHETIC_DATA_TEST_DRIVE


Data Synthesis Methods

Fully Conditional Specification (FCS)

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED_BY

Synthpop

R Library

cart, ctree: any data

surv.ctree: duration data

GPL-3

synthpop: Bespoke Creation of Synthetic Data in R

Synthpop Docs

Smart Data Foundry;

DESTATIS;

ISTAT;

Statistics Netherlands;

SynthPop;

VOIGHT-KAMPFF;

IPUMS International;

JAFE;

Innovation Surge Team;

PHAC-DAT;

Scottish Longitudinal Survey;

SAGE;

Synth4SSB;

Parametric / Regression:

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED_BY

Synthpop

R Library

  • Linear Regression: numeric
  • Logistic Regression: binary
  • Predictive Mean Matching: numeric
  • Ordered / Unordered Polytomous Logistic Regression: ordered / unordered factors, > 2 levels

GPL-3

synthpop: Bespoke Creation of Synthetic Data in R

Synthpop Docs

ISTAT;

Statistics Netherlands;

Differential Privacy:

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED_BY

Private-pgm

Python Library

nist-synthetic-data-challenge-submission

SDNist

Differentially Private Probabilistic Graph Model (DP-PGM) (Link) (categorical data)

Link

Graphical model based estimation and inference for differential privacy

DESTATIS;

ISTAT;

SynthPop;

SAGE;

DataSynthesizer

Python Library

Differentially Private Probabilistic Graph Model (DP Bayesian Network)

MIT

DataSynthesizer: Privacy-Preserving Synthetic Datasets

Statistics Lithuania

DP Copula

Method

Computes differentially private copula function from which synthetic data can be sampled

None

CRA

EFPA Histogram

Method

Optimization of the Fourier Perturbation Algorithm (FPA)

None

Differentially Private Histogram Publishing through Lossy Compression

CRA

Generative Adversarial Networks (GAN):

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED_BY

SDV - CTGAN

Python Library

  • GAN-based deep Learning: discrete and continuous data (Link)
  • (TVAE) Variational Autoencoder based deep learning: discrete and continuous data (Link)
  • Gaussian Copulas (Link)
  • CopulaGAN (Link)

MIT

Smart Data Foundry (CTGAN, Gaussian Copulas);

DESTATIS (CTGAN, CopulaGAN);

ISTAT (CTGAN);

Statistics Netherlands (CTGAN, Gaussian Copulas, TVAE);

SynthPop (CTGAN);

ABSEHRD

Python Library

Correlation-Capturing Convolutional Generative Adversarial Networks

Link

CorGAN: Correlation-Capturing Convolutional Generative Adversarial Networks

SAGE

Mostly.ai

Integrated Tool in Commercial Product

Generative Adversarial Networks for data synthesis

Unknown

Documents

SynthPop

Feature Space Sampling Methods:

These methods sample new points from the high-dimensional feature space containing the data distribution (possibly with dimension reduction or distribution modeling steps).

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED_BY

RegSDC

R Library

Regression-based Information Preserving Statistical Obfuscation (IPSO)

Apache 2.0

Information preserving regression-based tools for statistical disclosure control

DESTATIS;

JAFE;

sdcMicro

R Library

Information Preserving Statistical Obfuscation (IPSO)

GPL-2

Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro

VOIGHT-KAMPFF;

GEMINAI

Integrated Tool in Commercial Product

Based on k-nearest-neighbors and information theory.

Unknown

(Link given on Diveplane > Resources > whitepapers)

Smart Data Foundry

PCA

Method

Principal Component Analysis

None

CRA

SMOTE

Method

Synthetic minority over-sampling technique

None

SMOTE: Synthetic minority over-sampling technique

(arXiv)

CRA
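
The interpolation step behind SMOTE can be sketched as follows. This is a minimal illustration in standard-library Python, not the code used by any team listed above; the function name `smote_sample`, its parameters, and the toy points are hypothetical. It picks a record, chooses one of its k nearest neighbours, and returns a random point on the line segment between the two.

```python
import math
import random


def smote_sample(minority, k, rng):
    """Create one synthetic point by interpolating between a record and a near neighbour."""
    base_index = rng.randrange(len(minority))
    base = minority[base_index]
    # Rank all other records by Euclidean distance to the chosen record.
    others = [p for i, p in enumerate(minority) if i != base_index]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


# Toy minority-class records (hypothetical values).
minority = [(0.0, 0.0), (1.0, 1.0)]
point = smote_sample(minority, k=1, rng=random.Random(0))
```

With only two records, every generated point lies on the segment joining them.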

Other Methods:

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED_BY

Miro

Integrated Tool with TDDA Python Library

Includes Test-Driven Data Analysis (TDDA) for generation of relational integrity constraints, detection of PII data, and generation of synthetic data.

TDDA Library: MIT

Smart Data Foundry

Analytically Advanced Simulated Data

Method

Library Used: semTools

Continuous variables were synthesized using the simulation method for generating non-normal data as detailed by Fleishman (1978) and Vale and Maurelli (1983).

None

semTools: GPL3

JAFE

Syntho

Integrated Tool in Commercial Product

AI-generated synthetic data.

Unknown

Privacy Warriors;




Synthetic-Data Utility Evaluation Methods

Simple Statistics:

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED BY

Univariate Distribution

Graphical comparison of marginal distribution between original and synthetic data

Smart Data Foundry;

DESTATIS;

CRA;

Statistics Netherlands;

SynthPop;

VOIGHT-KAMPFF;

IPUMS International;

JAFE;

Statistics Lithuania;

Scottish Longitudinal Survey;

SAGE;

Privacy Warriors;

Descriptive Statistics

Difference in Min, Max, Mean, Median and Standard deviation between original and synthetic data

Smart Data Foundry;

ISTAT;

Statistics Netherlands;

CRA;

JAFE;

Statistics Lithuania;

Pairwise Scatter Plots

Graphical comparison by plotting pairwise scatter plots between variables

JAFE;

Innovation Surge Team;

Scottish Longitudinal Survey;

SAGE;

Joint Distribution

Graphical comparison of joint distributions

SynthPop;
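
The descriptive-statistics comparison listed above can be sketched in a few lines. This is a minimal illustration in standard-library Python, not taken from any team's code; the function names and the toy values are hypothetical.

```python
import statistics


def describe(values):
    """Summary statistics for one numeric variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def descriptive_differences(original, synthetic):
    """Synthetic minus original, statistic by statistic."""
    orig_stats, syn_stats = describe(original), describe(synthetic)
    return {name: syn_stats[name] - orig_stats[name] for name in orig_stats}


# Toy values for a single variable (hypothetical).
original = [21, 25, 30, 34, 40]
synthetic = [22, 24, 31, 35, 43]
diffs = descriptive_differences(original, synthetic)
```

Differences close to zero for every statistic suggest that the variable's marginal summary is preserved; they say nothing about relationships between variables.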

Distribution Divergence Tests:

These tests determine whether synthetic data and real data plausibly represent different draws from the same population.

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED BY

MMD Statistics

GEMINAI

Measures the desirability of the joint distribution.

Unknown

Smart Data Foundry

Energy Statistics

GEMINAI

Measures the desirability of joint distribution of continuous features.

Unknown

Smart Data Foundry

MannWhitney

GEMINAI

Measures the desirability of marginal distribution that measures central tendencies.

Unknown

Smart Data Foundry

GTest

GEMINAI

Measures the desirability of the marginal distribution for nominal features.

Unknown

Smart Data Foundry

Chi Square

GEMINAI

Measures the desirability of marginal distribution for nominal features.

Unknown

Smart Data Foundry

Kolmogorov-Smirnov

GEMINAI

Measures the desirability of marginal distribution for continuous features.

Unknown

Smart Data Foundry

Kolmogorov-Smirnov

Unspecified, presumably applied to numerical variables in SAT_GPA

DESTATIS
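
As an illustration of one test from this group, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical cumulative distribution functions of the two samples. The sketch below is a standard-library Python version written for this overview; it is not the GEMINAI or DESTATIS implementation, and the function name is hypothetical. It returns only the statistic, not a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between two empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    # The gap can only change at observed values, so checking those is enough.
    points = sorted(set(a_sorted) | set(b_sorted))
    return max(
        abs(
            bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted)
        )
        for x in points
    )


same = ks_statistic([1, 2, 3], [1, 2, 3])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
```

A value near 0 means the two marginal distributions are hard to tell apart; a value of 1 means the samples do not overlap at all.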

Propensity Metric:

These metrics train a classifier or model to distinguish real from synthetic data; poor classifier performance indicates that the synthetic data are hard to tell apart from the real data.

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED BY

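pMSE

Synthpop Library

Propensity score mean-squared error (pMSE): a model predicts whether each record of the combined data is original or synthetic, and the mean squared deviation of the predicted propensity scores from the share of synthetic records gives an estimate of distinguishability.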

GPL-3

Smart Data Foundry;

DESTATIS;

ISTAT;

Statistics Netherlands;

CRA;

SynthPop;

VOIGHT-KAMPFF;

IPUMS International;

JAFE;

PHAC-DAT;

Statistics Lithuania;

Scottish Longitudinal Survey;

SAGE;

S_pMSE

Synthpop Library

Standardized propensity score mean-squared error ratio (pMSE divided by its expected value under a correct synthesis model)

GPL-3

Smart Data Foundry;

DESTATIS;

ISTAT;

Statistics Netherlands;

CRA;

SynthPop;

VOIGHT-KAMPFF;

IPUMS International;

JAFE;

Statistics Lithuania;

Scottish Longitudinal Survey;

SAGE;

SPECKS

Synthpop Library

Kolmogorov-Smirnov statistic: the maximum distance between the empirical cumulative distribution functions of the propensity scores for the synthetic and original data.

GPL-3

Assessing, Visualizing and Improving the Utility of Synthetic Data

Smart Data Foundry;

SynthPop

PO50

Synthpop Library

Percentage, in excess of 50%, of records for which the model used correctly predicts whether the record is real or synthetic.

GPL-3

Assessing, Visualizing and Improving the Utility of Synthetic Data

Smart Data Foundry
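
The propensity score mean-squared error can be sketched as follows. This is a simplified illustration, not the synthpop implementation: it assumes all variables are categorical and uses the share of synthetic records in each distinct cell as the propensity score, whereas in practice the scores usually come from a fitted model such as logistic regression or CART. The function name and toy records are hypothetical.

```python
from collections import Counter


def pmse(original, synthetic):
    """Mean squared deviation of propensity scores from the synthetic share."""
    n_orig, n_syn = len(original), len(synthetic)
    synthetic_share = n_syn / (n_orig + n_syn)
    orig_counts, syn_counts = Counter(original), Counter(synthetic)
    total = 0.0
    for cell in set(orig_counts) | set(syn_counts):
        n_cell = orig_counts[cell] + syn_counts[cell]
        # Assumed propensity model: fraction of the cell's records that are synthetic.
        propensity = syn_counts[cell] / n_cell
        total += n_cell * (propensity - synthetic_share) ** 2
    return total / (n_orig + n_syn)


# Records are tuples of categorical values (hypothetical data).
original = [("a", 1), ("b", 2)]
identical = pmse(original, [("a", 1), ("b", 2)])
disjoint = pmse(original, [("c", 3), ("d", 4)])
```

With equally sized datasets the score ranges from 0 (indistinguishable) to 0.25 (perfectly separable).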

Classifier Performance Methods:

These methods are used to compare the performance of classifiers/models trained on real and synthetic data.

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED BY

Regression Comparison

GEMINAI

Measures the desirability of regression models using original and generated data and then compares performance

Unknown

Smart Data Foundry

Binary Decision Tree Classifier

SDV

Trains a binary decision tree classifier on the synthetic data and then evaluates the model performance on the real data.

MIT

Docs

Statistics Netherlands

Binary AdaBoost Classifier

SDV

Trains a binary AdaBoost classifier on the synthetic data and then evaluates the model performance on the real data.

MIT

Docs

Statistics Netherlands

Binary Logistic Regression

SDV

Trains a binary logistic regression on the synthetic data and then evaluates the model performance on the real data

MIT

Docs

Statistics Netherlands

Multiclass Decision Tree Classifier

SDV

Trains a multiclass decision tree classifier on the synthetic data and then evaluates the model performance on the real data

MIT

Docs

Statistics Netherlands

Multiclass MLP Classifier

SDV

Trains a multiclass MLP classifier on the synthetic data and then evaluates the model performance on the real data

MIT

Docs

Statistics Netherlands

Gradient Boosting Decision Tree Training

LightGBM

Trains an ML model on original data one feature at a time and scores predictions on withheld original data. The procedure is then repeated for the synthetic data.

LightGBM: A Highly Efficient Gradient Boosting Decision Tree

SAGE

Compare variables effect sizes

ABSEHRD

Fits a logistic regression model for a binary outcome on the real data and on the synthetic data, then compares the variables' effect sizes between the two models.

Link

SAGE IPython Notebook

SAGE

Compare predictive performance

ABSEHRD

Train and test a logistic regression model with synthetic data and compare performance with training and testing on real data.

Link

SAGE IPython Notebook

SAGE
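
The train-on-synthetic, test-on-real protocol shared by several rows above can be sketched as follows. To stay dependency-free, the sketch swaps in a nearest-centroid classifier for the decision tree, AdaBoost, MLP, and logistic regression models named in the table; it is not the SDV or ABSEHRD code, and all names and values are hypothetical.

```python
def fit_centroids(rows, labels):
    """Nearest-centroid 'training': average the feature vectors of each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


# Hypothetical toy data: training sets and a held-out real test set.
real_train = [[1.0, 1.0], [1.2, 0.8], [3.0, 3.1], [2.9, 3.3]]
synth_train = [[0.9, 1.1], [1.1, 1.0], [3.2, 2.9], [3.0, 3.0]]
train_labels = [0, 0, 1, 1]
real_test, test_labels = [[1.1, 0.9], [3.1, 3.0]], [0, 1]

acc_real = accuracy(fit_centroids(real_train, train_labels), real_test, test_labels)
acc_synth = accuracy(fit_centroids(synth_train, train_labels), real_test, test_labels)
```

The quantity of interest is the gap between `acc_real` and `acc_synth`: a small gap suggests the synthetic data support the same predictive task as the real data.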

Marginal (Density Difference) Methods:

These methods compare the difference between real and synthetic data with respect to multidimensional bin counts.

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED BY

MabsDD

Synthpop Library

Mean absolute difference in densities

GPL-3

DESTATIS

SDNist metric

sdnist

k-marginal

Link

SDNist: Benchmark Data and Evaluation Tools for Data Synthesizers

SAGE
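
A marginal density comparison over multidimensional bin counts can be sketched as follows. This is a simplified illustration rather than the synthpop or SDNist implementation: it assumes the records are already binned, and it reports the average total variation distance over all k-column marginals. The scaling used by the actual tools may differ, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def marginal_density(rows, cols):
    """Relative frequency of each cell for the chosen columns."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    return {cell: count / len(rows) for cell, count in counts.items()}


def k_marginal_distance(real, synth, k):
    """Average total variation distance over every k-column marginal."""
    distances = []
    for cols in combinations(range(len(real[0])), k):
        p = marginal_density(real, cols)
        q = marginal_density(synth, cols)
        cells = set(p) | set(q)
        distances.append(0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells))
    return sum(distances) / len(distances)


# Hypothetical pre-binned records.
real = [("a", "x"), ("b", "y")]
identical = k_marginal_distance(real, [("a", "x"), ("b", "y")], k=1)
disjoint = k_marginal_distance(real, [("c", "z"), ("d", "w")], k=1)
```

The result is 0 when all k-way bin densities agree and 1 when the two datasets never share a cell.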

Other Methods:

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED BY

Pearson Pairwise Correlation

Coefficient

Calculates Pearson pairwise correlation coefficient to measure the strength of linear relationship between two variables.

DESTATIS;

CRA;

Innovation Surge Team;

Information Loss Measure

sdcMicro

Information loss based on measure of distribution disturbance, measure of impact on the variance of estimates, and measure of impact on the intensity of connections.

GPL-2

DESTATIS

Mahalanobis Distance

Measures the distance between a point and a distribution

wikipedia

DESTATIS

dBhatt

Synthpop Library

Bhattacharyya distance metric

GPL-3

wikipedia

DESTATIS

Earth Mover’s Distance / Wasserstein Metric

R

Evaluates dissimilarity between two multidimensional probability distributions inspired by the problem of optimal mass transportation.

Innovation Surge Team;

PCA

Creates a graphical comparison by scattering data on pairwise plot of the components created by PCA

IPUMS International;

Distribution of Euclidean distance

Compares distribution of Euclidean distances between pairs of records sampled from the real data versus pairs of records sampled from the real and synthetic data.

SAGE;
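
The Pearson pairwise correlation comparison can be sketched as follows. This is a minimal standard-library Python illustration, not any team's code; the function names, the dictionary-of-columns layout, and the toy values are assumptions made for the example.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_gaps(original, synthetic):
    """Synthetic minus original correlation for every pair of columns."""
    return {
        (a, b): pearson(synthetic[a], synthetic[b]) - pearson(original[a], original[b])
        for a, b in combinations(sorted(original), 2)
    }


# Hypothetical columns: the synthetic data reverse the original relationship.
original = {"age": [1, 2, 3, 4], "income": [2, 4, 6, 8]}
synthetic = {"age": [1, 2, 3, 4], "income": [8, 6, 4, 2]}
gaps = correlation_gaps(original, synthetic)
```

Gaps near zero indicate that linear relationships between variable pairs are preserved; here the gap of -2 is the worst possible case.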




Synthetic-Data Privacy Evaluation Methods

Unique Record Reproduction:

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED BY

Replicated uniques

Synthpop Library

Number of unique records in the synthetic dataset that match the unique records in the original dataset with regard to their quasi-identifiers

GPL-3

Smart Data Foundry;

JAFE;

DESTATIS;

ISTAT;

Statistics Netherlands;

VOIGHT-KAMPFF;

PHAC-DAT;

SAGE;

Statistics Lithuania;

CRA

Percentage Replicated uniques

Synthpop Library

Percentage of replicated uniques

GPL-3

Smart Data Foundry;

JAFE;

DESTATIS;

ISTAT;

Statistics Netherlands;

VOIGHT-KAMPFF;

PHAC-DAT;

SAGE;

Statistics Lithuania;

CRA

Apparent Match Distribution

Finds every synthetic record that exactly matches a unique real record on a set of quasi-identifier features (an “apparent” unique match between the synthetic and real data), then assesses the degree to which those apparent matches have the same value on other features in the datasets.

IPUMS International

Count Disclosure

Finds the number of replicated unique records in the synthetic data set that have a real disclosure risk in at least one confidential variable (i.e. there is at least one confidential variable where the record in the synthetic data set is “too close” to the matching unique record in the original data set). Two records are defined as “too close” in a variable if they differ in this variable by at most p%.

DESTATIS

Percent Disclosure

Determines the number of replicated unique records in the synthetic data set that have a real disclosure risk in at least one confidential variable, as a proportion of the original data set size.

DESTATIS
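
Counting replicated uniques can be sketched as follows. This is a minimal standard-library Python illustration, not the synthpop implementation; the function names and toy records are hypothetical, and expressing the percentage relative to the number of synthetic records is an assumption.

```python
from collections import Counter


def unique_keys(rows, qi_cols):
    """Quasi-identifier combinations that occur exactly once in the dataset."""
    counts = Counter(tuple(row[c] for c in qi_cols) for row in rows)
    return {key for key, count in counts.items() if count == 1}


def replicated_uniques(original, synthetic, qi_cols):
    """Records unique in the synthetic data that also match a unique original record."""
    replicated = unique_keys(original, qi_cols) & unique_keys(synthetic, qi_cols)
    count = len(replicated)
    percent = 100.0 * count / len(synthetic)  # assumed denominator: synthetic size
    return count, percent


# Hypothetical records: (age, sex, region), all used as quasi-identifiers.
original = [("30", "F", "A"), ("30", "F", "A"), ("41", "M", "B"), ("52", "F", "C")]
synthetic = [("41", "M", "B"), ("52", "F", "C"), ("52", "F", "C"), ("60", "M", "D")]
count, percent = replicated_uniques(original, synthetic, qi_cols=(0, 1, 2))
```

In this example only the record ("41", "M", "B") is unique in both datasets, so one replicated unique is reported.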

Re-identification:

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED BY

Full Matches

Uses inner join of original and synthetic data to find fully replicated records.

Innovation Surge Team

Partial Matches

Iteratively performs inner join on a combination of a subset of features in original and synthetic data to find partially replicated records.

Innovation Surge Team

Pairwise Intersections

Identifies pairwise intersections, which occur if any two rounded columns in the original dataset are the same as any two rounded columns in the synthetic data. All pairwise intersections in the numeric variables are counted within each category group.

CRA

Fuzzy matching distance

Matches the absolute difference between an inferred variable from synthetic data to the real value of the variable in the original data.

CRA

Identical Match Share

Mostly.ai,

Syntho

Determines the ratio of synthetic data records that match a record in the original data

SynthPop;

Privacy Warriors;

Distance Anonymity Preservation

GEMINAI

Measures privacy by computing the distances between the data points of a data set to the closest corresponding data point of another dataset, with a focus on the regions with the densest data.

Unknown

Smart Data Foundry

Density Anonymity Preservation

GEMINAI

Measures privacy by computing the distances between the data points of a data set to the closest corresponding data point of another dataset, relative to density.

Unknown

Smart Data Foundry

Distance to Closest Record

Mostly.ai,

Syntho

Measures the distance of synthetic data records to their nearest actual record within the original data

SynthPop;

Privacy Warriors;

Nearest Neighbor Distance Ratio

Mostly.ai,

Syntho

Computes, for each synthetic record, the ratio of the distance to its nearest record in the original data to the distance to its second-nearest record in the original data

SynthPop;

Privacy Warriors;
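
Distance to Closest Record and the Nearest Neighbor Distance Ratio can be sketched together. This is a minimal illustration in standard-library Python under stated assumptions: numeric features, Euclidean distance, and at least two original records. It is not the implementation used by the commercial tools listed above, and the function name is hypothetical.

```python
import math


def dcr_and_nndr(synthetic, original):
    """For each synthetic record: (distance to closest original, nearest/second-nearest ratio)."""
    results = []
    for record in synthetic:
        distances = sorted(math.dist(record, other) for other in original)
        nearest, second = distances[0], distances[1]
        # Assumed convention: if both distances are zero, report a ratio of 1.
        ratio = nearest / second if second > 0 else 1.0
        results.append((nearest, ratio))
    return results


# Hypothetical numeric records.
original = [(3.0, 4.0), (6.0, 8.0)]
synthetic = [(0.0, 0.0), (6.0, 8.0)]
results = dcr_and_nndr(synthetic, original)
```

A distance of 0 flags an exact copy of an original record, and a ratio close to 0 means the synthetic record sits much closer to one original record than to any other.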

Attribute inference:

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED BY

Targeted Inference Attack

Designed to correctly guess a sensitive variable from a combination of quasi-identifiers

Scottish Longitudinal Survey;
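
One way to sketch a targeted inference attack: the attacker looks up a target's quasi-identifiers in the synthetic data and guesses the most common sensitive value among the matching synthetic records. This is an assumed attack model for illustration only, not the Scottish Longitudinal Survey team's procedure; all names and records are hypothetical.

```python
from collections import Counter


def inference_attack_accuracy(original, synthetic, qi_cols, target_col):
    """Share of attacked original records whose sensitive value is guessed correctly."""
    votes = {}
    for row in synthetic:
        key = tuple(row[c] for c in qi_cols)
        votes.setdefault(key, Counter())[row[target_col]] += 1
    attempted = correct = 0
    for row in original:
        key = tuple(row[c] for c in qi_cols)
        if key not in votes:
            continue  # no matching synthetic record, so no guess is made
        attempted += 1
        guess = votes[key].most_common(1)[0][0]
        correct += guess == row[target_col]
    return correct / attempted if attempted else 0.0


# Hypothetical records: (age band, sex, sensitive income level).
original = [("30-39", "F", "high"), ("40-49", "M", "low"), ("50-59", "F", "low")]
synthetic = [("30-39", "F", "high"), ("30-39", "F", "high"), ("40-49", "M", "high")]
rate = inference_attack_accuracy(original, synthetic, qi_cols=(0, 1), target_col=2)
```

The success rate is only meaningful relative to a baseline, for example the accuracy of always guessing the most frequent sensitive value.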

K-anonymity:

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED BY

k-anonymity

GEMINAI

This risk measure is based on the principle that, in a safe data set, the number of individuals sharing the same combination of values (keys) of categorical quasi-identifiers should be higher than a specified threshold

Unknown

Smart Data Foundry

KL-Divergence

GEMINAI

KL-divergence values measure how similar the distributions of the generated and original datasets are.

Unknown

Smart Data Foundry

l-diversity

GEMINAI

Measures the desirability of l-diversity, which holds when the distribution of a sensitive attribute in each equivalence class has at least l “well-represented” values, protecting against attribute disclosure.

Unknown

Smart Data Foundry

t-closeness

GEMINAI

An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.

Unknown

Smart Data Foundry
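
The k-anonymity and l-diversity definitions above can be sketched as follows. This is a minimal standard-library Python illustration, not the GEMINAI implementation; it uses the simplest reading of l-diversity (the number of distinct sensitive values per equivalence class), and the function names and records are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(rows, qi_cols, sensitive_col):
    """Group sensitive values by their combination of quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[c] for c in qi_cols)].append(row[sensitive_col])
    return classes


def k_anonymity(rows, qi_cols, sensitive_col):
    """Size of the smallest equivalence class."""
    classes = equivalence_classes(rows, qi_cols, sensitive_col)
    return min(len(values) for values in classes.values())


def l_diversity(rows, qi_cols, sensitive_col):
    """Smallest number of distinct sensitive values in any equivalence class."""
    classes = equivalence_classes(rows, qi_cols, sensitive_col)
    return min(len(set(values)) for values in classes.values())


# Hypothetical records: (age band, sex, diagnosis).
rows = [
    ("30-39", "F", "flu"),
    ("30-39", "F", "cold"),
    ("40-49", "M", "flu"),
    ("40-49", "M", "flu"),
]
k = k_anonymity(rows, qi_cols=(0, 1), sensitive_col=2)
l = l_diversity(rows, qi_cols=(0, 1), sensitive_col=2)
```

Here every quasi-identifier combination is shared by two records (k = 2), but the second class contains a single diagnosis (l = 1), so attribute disclosure is still possible.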

Other Methods:

NAME

TYPE

PROCEDURE

LICENSE

DOCUMENTS

USED BY

Nearest Neighbors

ABSEHRD

Compares distances between pairs of real and synthetic data samples

Link

SAGE;

Membership Inference: Distance-based thresholding

ABSEHRD

Determines whether a given data sample was used to train a model of interest.

Link

SAGE;
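
Distance-based thresholding for membership inference can be sketched as follows. This is an assumed formulation for illustration, not the ABSEHRD code: a candidate record is flagged as a likely training record when its nearest synthetic record lies within a chosen Euclidean distance. The function name, threshold, and records are hypothetical.

```python
import math


def infer_membership(candidates, synthetic, threshold):
    """Flag each candidate whose nearest synthetic record is within the threshold."""
    return [
        min(math.dist(candidate, record) for record in synthetic) <= threshold
        for candidate in candidates
    ]


# Hypothetical numeric records.
synthetic = [(1.0, 1.0), (5.0, 5.0)]
candidates = [(1.0, 1.1), (3.0, 3.0)]
flags = infer_membership(candidates, synthetic, threshold=0.5)
```

Comparing the flags with the known membership of the candidates gives the precision and recall of the attack at that threshold.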