HLG-MOS Synthetic Data Test-Drive

LIBRARY OF RESOURCES COMPILED FROM HLG-MOS CHALLENGE 2022

The Test-Drive is an archive of collective expert knowledge gathered from participant submissions to the High-Level Group for the Modernisation of Official Statistics (HLG-MOS) Synthetic Data Challenge, held in January 2022. The platform offers insights from members of National Statistical Organizations (NSOs) and the wider synthetic data community on subject matter expertise in synthetic data generation and on their perspectives on the utility and privacy of synthesized data. The HLG-MOS is a function of the United Nations Economic Commission for Europe (UNECE). The HLG-MOS Challenge used NIST's SDNIST: Synthetic Data Benchmarking Library as an evaluation tool. These results are shared with permission under NIST agreement DTA-22-011. These data are for informational purposes, and inclusion does not imply an endorsement.


Overview

HLG-MOS Synthetic Data Challenge

The HLG-MOS Synthetic Data Challenge was conducted to test-drive the research and recommendations of the HLG-MOS Synthetic Data Project and its resulting publication, Synthetic Data for Official Statistics: A Starter Guide. The challenge was organized in partnership with Statistics Canada, NIST and Knexus Research Corporation, and brought together a broad set of data synthesizers and data evaluators over a week in late January 2022. A total of 17 teams participated, representing NSOs, NGOs, industry and academia from a variety of countries and continents. Participants tried out different combinations of data sets, configurations, synthesizers, and evaluators based on the recommendations and scenarios outlined in the Starter Guide. All teams were provided with the same data options and quickstart guidance documents, and could reach out to a range of subject matter experts over Slack or in office hours.

The challenge focused on synthesizing two benchmark data sets, one small (Students SAT-GPA data with 7 features) and one complex (American Community Survey excerpt with 35 features, including high-cardinality categoricals), both provided by the US National Institute of Standards and Technology (NIST) from their own Synthetic Data Challenges.

Number of different synthesis and evaluation methods tried, collectively across all participating teams:
17 data synthesis techniques
34 data utility evaluation techniques
21 data privacy evaluation techniques

Eight participating teams provided a summary of the use-case suitability of their synthesized data instances. The following table shows the number of teams that found at least one approach that was at least potentially suitable for each use case on each of the two data sets.

	Public Release	Testing Analysis	Education	Testing Code
Student SAT-GPA Data	6	8	8	8
ACS Data	4	5	7	6

Challenge Winners

Teams' Test-Drive reports can be found here

1st Place:
Smart Data Foundry
11 points (6 points Student SAT-GPA Data + 5 points ACS Data)

2nd Place:
DESTATIS
10 points (5 points Student SAT-GPA Data + 5 points ACS Data)

Honorable Mention:
Statistics Netherlands
8 points (4 points Student SAT-GPA Data + 4 points ACS Data)

Honorable Mention:
CRA
8 points (4 points Student SAT-GPA Data + 4 points ACS Data)

Honorable Mention:
ISTAT
7 points (4 points Student SAT-GPA Data + 3 points ACS Data)

Data Synthesis Methods

Fully Conditional Specification (FCS):

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED_BY
Synthpop	R Library	cart, ctree: any data; surv.ctree: duration data	GPL-3	synthpop: Bespoke Creation of Synthetic Data in R	Statistics Netherlands; SynthPop; VOIGHT-KAMPFF; IPUMS International; JAFE; Innovation Surge Team; PHAC-DAT; Scottish Longitudinal Survey; SAGE; Synth4SSB;

Parametric / Regression:

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED_BY
Synthpop	R Library	Linear Regression: numeric; Logistic Regression: binary; Predictive Mean Matching: numeric; Ordered / Unordered Polytomous Logistic Regression: Ordered / Unordered Factors, > 2 levels	GPL-3	synthpop: Bespoke Creation of Synthetic Data in R; Synthpop Docs	ISTAT; Statistics Netherlands;

Differential Privacy:

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED_BY
Private-pgm	Python Library nist-synthetic-data-challenge-submission SDNist	Differentially Private Probabilistic Graph Model (DP-PGM) (Link) (categorical data)	Link	Graphical model based estimation and inference for differential privacy	DESTATIS; ISTAT; SynthPop; SAGE;
DataSynthesizer	Python Library	Differentially Private Probabilistic Graph Model (DP Bayesian Network)	MIT	DataSynthesizer: Privacy-Preserving Synthetic Datasets	Statistics Lithuania
DP Copula	Method	Computes differentially private copula function from which synthetic data can be sampled	None	Differentially Private Synthesization of Multi-Dimensional Data using Copula Functions; DPSynthesizer: Differentially Private Data Synthesizer for Privacy Preserving Data Sharing	CRA
EFPA Histogram	Method	Optimization of the Fourier Perturbation Algorithm (FPA)	None	Differentially Private Histogram Publishing through Lossy Compression	CRA

Generative Adversarial Networks (GAN):

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED_BY
SDV - CTGAN	Python Library	CTGAN: GAN-based deep learning for discrete and continuous data (Link); TVAE: Variational Autoencoder-based deep learning for discrete and continuous data (Link); Gaussian Copulas (Link); CopulaGAN (Link)	MIT	Modeling Tabular data using Conditional GAN, SDV, SDV (IEEE)	Smart Data Foundry (CTGAN, Gaussian Copulas); DESTATIS (CTGAN, CopulaGAN); ISTAT (CTGAN); Statistics Netherlands (CTGAN, Gaussian Copulas, TVAE); SynthPop (CTGAN);
ABSEHRD	Python Library	Correlation-Capturing Convolutional Generative Adversarial Networks	Link	CorGAN: Correlation-Capturing Convolutional Generative Adversarial Networks	SAGE
Mostly.ai	Integrated Tool in Commercial Product	Generative Adversarial Networks for data synthesis	Unknown	Documents	SynthPop

Feature Space Sampling Methods:

These methods sample new points from the high-dimensional feature space containing the data distribution (possibly with dimension reduction or distribution modeling steps).

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED_BY
RegSDC	R Library	Regression-based Information Preserving Statistical Obfuscation (IPSO)	Apache 2.0	Information preserving regression-based tools for statistical disclosure control	DESTATIS; JAFE;
sdcMicro	R Library	Information Preserving Statistical Obfuscation (IPSO)	GPL-2	Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro	VOIGHT-KAMPFF;
GEMINAI	Integrated Tool in Commercial Product	Based on k-nearest-neighbors and information theory.	Unknown	Natively Interpretable Machine Learning and Artificial Intelligence: Preliminary Results and Future Directions (Link given on Diveplane > Resources > whitepapers) GEMINAI Info	Smart Data Foundry
PCA	Method	Principal Component Analysis	None		CRA
SMOTE	Method	Synthetic minority over-sampling technique	None	SMOTE: Synthetic minority over-sampling technique (arXiv)	CRA
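
The interpolation step at the core of SMOTE-style sampling can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CRA team's implementation: the function name `smote_sample`, the parameter `k`, and the toy data are hypothetical, and all features are assumed numeric.

```python
import math
import random


def smote_sample(records, k=2, rng=None):
    """Return one synthetic point interpolated between a record and a near neighbor.

    `records` is a list of equal-length numeric tuples. The name and the
    default `k` are illustrative assumptions, not taken from any library.
    """
    rng = rng or random.Random()
    base = rng.choice(records)
    # Rank the other records by Euclidean distance to the chosen base record.
    others = sorted(
        (r for r in records if r is not base),
        key=lambda r: math.dist(base, r),
    )
    neighbor = rng.choice(others[:k])
    # Move a random fraction of the way from the base toward the neighbor.
    u = rng.random()
    return tuple(b + u * (n - b) for b, n in zip(base, neighbor))


data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
point = smote_sample(data, k=2, rng=random.Random(0))
```

Because each new point is a convex combination of two existing records, it always lies on the line segment between them, which is why such methods are usually paired with a privacy evaluation.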

Other Methods:

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED_BY
Miro	Integrated Tool with TDDA Python Library	Includes Test-Driven Data Analysis (TDDA) for generation of relational integrity constraints, detection of PII data, and generation of synthetic data.	TDDA Library: MIT	Link; Automatic Constraint Generation and Verification White Paper	Smart Data Foundry
Analytically Advanced Simulated Data	Method Library Used: semTools	Continuous variables were synthesized using the simulation method for generating non-normal data as detailed by Fleishman (1978) and Vale and Maurelli (1983).	None semTools: GPL3	A method for simulating non-normal distributions (Fleishman, 1978) Simulating multivariate nonnormal distributions (Vale and Maurelli, 1983)	JAFE
Syntho	Integrated Tool in Commercial Product	AI-generated synthetic data.	Unknown		Privacy Warriors;

Synthetic-Data Utility Evaluation Methods

Simple Statistics:

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED BY
Univariate Distribution		Graphical comparison of marginal distribution between original and synthetic data			Smart Data Foundry; DESTATIS; CRA; Statistics Netherlands; SynthPop; VOIGHT-KAMPFF; IPUMS International; JAFE; Statistics Lithuania; Scottish Longitudinal Survey; SAGE; Privacy Warriors;
Descriptive Statistics		Difference in Min, Max, Mean, Median and Standard deviation between original and synthetic data			Smart Data Foundry; ISTAT; Statistics Netherlands; CRA; JAFE; Statistics Lithuania;
Pairwise Scatter Plots		Graphical comparison by plotting pairwise scatter plots between variables			JAFE; Innovation Surge Team; Scottish Longitudinal Survey; SAGE;
Joint Distribution		Graphical comparison of joint distributions			SynthPop;

Distribution Divergence Tests:

These tests determine whether synthetic data and real data plausibly represent different draws from the same population.

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED BY
MMD Statistics	GEMINAI	Measures the desirability of the joint distribution.	Unknown		Smart Data Foundry
Energy Statistics	GEMINAI	Measures the desirability of joint distribution of continuous features.	Unknown		Smart Data Foundry
MannWhitney	GEMINAI	Measures the desirability of marginal distribution that measures central tendencies.	Unknown		Smart Data Foundry
GTest	GEMINAI	Measures the desirability of the marginal distribution for nominal features.	Unknown		Smart Data Foundry
Chi Square	GEMINAI	Measures the desirability of marginal distribution for nominal features.	Unknown		Smart Data Foundry
Kolmogorov-Smirnov	GEMINAI	Measures the desirability of marginal distribution for continuous features.	Unknown		Smart Data Foundry
Kolmogorov-Smirnov		Unspecified, presumably applied to numerical variables in SAT_GPA			DESTATIS
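
As an illustration of the kind of test listed above, the two-sample Kolmogorov-Smirnov statistic for one continuous feature can be computed directly from the two empirical distribution functions. This is a minimal pure-Python sketch; the function name `ks_statistic` is hypothetical, and it is not the GEMINAI or DESTATIS implementation.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Maximum gap between the two empirical CDFs (two-sample KS statistic)."""
    real_sorted = sorted(real)
    syn_sorted = sorted(synthetic)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in real_sorted + syn_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_syn = bisect_right(syn_sorted, x) / len(syn_sorted)
        gap = max(gap, abs(cdf_real - cdf_syn))
    return gap
```

A value of 0 means the two samples have identical empirical distributions for that feature; a value of 1 means their ranges do not overlap at all.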

Propensity Metric:

These metrics train a classifier/model to distinguish between real and synthetic data, and hope for poor performance.

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED BY
pMSE	Synthpop Library	Propensity score mean-squared error (pMSE) models group membership of records (original versus synthetic) to obtain an estimate of distinguishability.	GPL-3	Assessing, Visualizing and Improving the Utility of Synthetic Data; General and specific utility measures for synthetic data	Smart Data Foundry; DESTATIS; ISTAT; Statistics Netherlands; CRA; SynthPop; VOIGHT-KAMPFF; IPUMS International; JAFE; PHAC-DAT; Statistics Lithuania; Scottish Longitudinal Survey; SAGE;
S_pMSE	Synthpop Library	Propensity score mean-squared error standardized ratio	GPL-3	Assessing, Visualizing and Improving the Utility of Synthetic Data; General and specific utility measures for synthetic data	Smart Data Foundry; DESTATIS; ISTAT; Statistics Netherlands; CRA; SynthPop; VOIGHT-KAMPFF; IPUMS International; JAFE; Statistics Lithuania; Scottish Longitudinal Survey; SAGE;
SPECKS	Synthpop Library	Kolmogorov-Smirnov statistic: the maximum distance between the cumulative distribution functions of the propensity scores for the synthetic and original data.	GPL-3	Assessing, Visualizing and Improving the Utility of Synthetic Data	Smart Data Foundry; SynthPop
PO50	Synthpop Library	Percentage above 50% of synthetic data records for which the model used correctly predicts whether the record is real or synthetic.	GPL-3	Assessing, Visualizing and Improving the Utility of Synthetic Data	Smart Data Foundry
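
The propensity score mean-squared error reduces to a short formula once a model has produced, for each record in the stacked original-plus-synthetic data, a predicted probability of being synthetic. The sketch below assumes those scores are already available (fitting the propensity model is out of scope here); the function name `pmse` is hypothetical and this is not the synthpop code.

```python
def pmse(propensity_scores, n_synthetic):
    """Mean squared deviation of propensity scores from the synthetic share c.

    `propensity_scores` holds one predicted probability of "synthetic" per
    record in the combined data; `n_synthetic` is the number of synthetic rows.
    """
    n_total = len(propensity_scores)
    c = n_synthetic / n_total
    return sum((p - c) ** 2 for p in propensity_scores) / n_total
```

When the model cannot tell the two sources apart, every score equals the synthetic share `c` and the result is 0. With equal-sized data sets and a model that separates them perfectly, the result reaches 0.25.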

Classifier Performance Methods:

These methods are used to compare the performance of classifiers/models trained on real and synthetic data.

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED BY
Regression Comparison	GEMINAI	Measures the desirability of regression models using original and generated data and then compares performance	Unknown		Smart Data Foundry
Binary Decision Tree Classifier	SDV	Trains a binary decision tree classifier on the synthetic data and then evaluates the model performance on the real data.	MIT	Docs	Statistics Netherlands
Binary AdaBoost Classifier	SDV	Trains a binary adaboost classifier on the synthetic data and then evaluates the model performance on the real data	MIT	Docs	Statistics Netherlands
Binary Logistic Regression	SDV	Trains a binary logistic regression on the synthetic data and then evaluates the model performance on the real data	MIT	Docs	Statistics Netherlands
Multiclass Decision Tree Classifier	SDV	Trains a multiclass decision tree classifier on the synthetic data and then evaluates the model performance on the real data	MIT	Docs	Statistics Netherlands
Multiclass MLP Classifier	SDV	Trains a multiclass MLP classifier on the synthetic data and then evaluates the model performance on the real data	MIT	Docs	Statistics Netherlands
Gradient Boosting Decision Tree Training	LightGBM	Trains an ML model on original data one feature at a time and scores predictions on withheld original data. Is repeated similarly for synthetic data.		LightGBM: A Highly Efficient Gradient Boosting Decision Tree	SAGE
Compare variable effect sizes	ABSEHRD	Fits a logistic regression model for a binary outcome on the real data and on the synthetic data, then compares the variable effect sizes between the two models.	Link	SAGE IPython Notebook	SAGE
Compare predictive performance	ABSEHRD	Train and test a logistic regression model with synthetic data and compare performance with training and testing on real data.	Link	SAGE IPython Notebook	SAGE
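
The "train on synthetic, evaluate on real" pattern used by several rows above can be sketched without any ML library. As a stated simplification, a 1-nearest-neighbor classifier stands in for the decision tree, AdaBoost, logistic regression and MLP models named in the table; the helper names and the toy data are hypothetical.

```python
import math


def predict_1nn(train_x, train_y, x):
    """Label of the training record closest to x (1-nearest-neighbor)."""
    nearest = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[nearest]


def accuracy(train_x, train_y, test_x, test_y):
    """Share of test records whose label the 1-NN model predicts correctly."""
    hits = sum(predict_1nn(train_x, train_y, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_x)


# Toy data: the label is 1 when the single feature exceeds 0.5.
real_x, real_y = [(0.1,), (0.4,), (0.6,), (0.9,)], [0, 0, 1, 1]
syn_x, syn_y = [(0.2,), (0.3,), (0.7,), (0.8,)], [0, 0, 1, 1]

# Train on synthetic records, then score on the real records.
tstr = accuracy(syn_x, syn_y, real_x, real_y)
```

In practice the score is compared against a model trained on real data and scored on withheld real data; a small gap suggests the synthetic data preserves the relationships the classifier relies on.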

Marginal (Density Difference) Methods:

These methods compare the difference between real and synthetic data with respect to multidimensional bin counts.

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED BY
MabsDD	Synthpop Library	Mean absolute difference in densities	GPL-3		DESTATIS
SDNist metric	sdnist	k-marginal	Link	SDNist: Benchmark Data and Evaluation Tools for Data Synthesizer	SAGE
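
A density-difference comparison over one marginal can be sketched as below. This is an illustrative assumption of how such a comparison might look, not the SDNist k-marginal scorer or synthpop's MabsDD: the function name is hypothetical, records are assumed to be dicts of already-binned values, and only a single marginal is compared.

```python
from collections import Counter


def marginal_density_difference(real, synthetic, columns):
    """Sum of absolute differences in cell densities for one marginal.

    `real` and `synthetic` are lists of dicts; `columns` picks the features
    that define the marginal. Result ranges from 0 (identical) to 2 (disjoint).
    """
    def densities(rows):
        counts = Counter(tuple(row[c] for c in columns) for row in rows)
        return {cell: n / len(rows) for cell, n in counts.items()}

    d_real, d_syn = densities(real), densities(synthetic)
    cells = set(d_real) | set(d_syn)
    return sum(abs(d_real.get(c, 0.0) - d_syn.get(c, 0.0)) for c in cells)
```

Dividing the sum by the number of cells gives a mean absolute difference in densities, and averaging over many randomly chosen column subsets gives a score in the spirit of a k-marginal metric.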

Other Methods:

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED BY
Pearson Pairwise Correlation Coefficient		Calculates Pearson pairwise correlation coefficient to measure the strength of linear relationship between two variables.			DESTATIS; CRA; Innovation Surge Team;
Information Loss Measure	sdcMicro	Information loss based on measure of distribution disturbance, measure of impact on the variance of estimates, and measure of impact on the intensity of connections.	GPL-2	The trade-off between the risk of disclosure and data utility in SDC - a case of data from a survey of accidents at work Mlodak, A. (2020). Information loss resulting from statistical disclosure control of output data, Wiadomosci Statystyczne. The Polish Statistician, 2020, 65(9), 7-27, DOI: 10.5604/01.3001.0014.4121 Mlodak, A. (2019). Using the Complex Measure in an Assessment of the Information Loss Due to the Microdata Disclosure Control, Przegląd Statystyczny, 2019, 66(1), 7-26, DOI: 10.5604/01.3001.0013.8285	DESTATIS
Mahalanobis Distance		Measures the distance between a point and a distribution		wikipedia	DESTATIS
dBhatt	Synthpop Library	Bhattacharyya distance metric	GPL-3	wikipedia	DESTATIS
Earth Mover’s Distance / Wasserstein Metric	R	Evaluates dissimilarity between two multidimensional probability distributions inspired by the problem of optimal mass transportation.			Innovation Surge Team;
PCA		Creates a graphical comparison by scattering data on pairwise plot of the components created by PCA			IPUMS International;
Distribution of Euclidean distance		Compares distribution of Euclidean distances between pairs of records sampled from the real data versus pairs of records sampled from the real and synthetic data.			SAGE;
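
For the Pearson pairwise correlation comparison listed above, the coefficient itself can be computed as follows and then compared between the real and synthetic data for each pair of numeric variables. This is a minimal sketch with a hypothetical function name; it assumes neither variable is constant.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

A simple utility check is the absolute difference between `pearson` evaluated on a pair of real columns and on the same pair of synthetic columns; values near 0 indicate the linear relationship was preserved.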

Synthetic-Data Privacy Evaluation Methods

Unique Record Reproduction:

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED BY
Replicated uniques	Synthpop Library	Number of unique records in the synthetic dataset that match the unique records in the original dataset with regard to their quasi-identifiers	GPL-3		Smart Data Foundry; JAFE; DESTATIS; ISTAT; Statistics Netherlands; VOIGHT-KAMPFF; PHAC-DAT; SAGE; Statistics Lithuania; CRA
Percentage Replicated uniques	Synthpop Library	Percentage of replicated uniques	GPL-3		Smart Data Foundry; JAFE; DESTATIS; ISTAT; Statistics Netherlands; VOIGHT-KAMPFF; PHAC-DAT; SAGE; Statistics Lithuania; CRA
Apparent Match Distribution		Finds every synthetic record that exactly matches a unique real on a set of quasi-identifier features (an “apparent” unique match between the synthetic and real data), then assesses the degree to which those apparent matches have the same value on other features in the datasets.			IPUMS International
Count Disclosure		Counts the replicated unique records in the synthetic data set that carry a real disclosure risk in at least one confidential variable (i.e. there is at least one confidential variable where the record in the synthetic data set is “too close” to the matching unique record in the original data set). Two records are defined as “too close” in a variable if they differ in this variable by at most p%.			DESTATIS
Percent Disclosure		Proportion of replicated unique records in the synthetic data set that carry a real disclosure risk in at least one confidential variable, relative to the size of the original data set.			DESTATIS
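
Counting replicated uniques, as described in the first row of the table above, amounts to intersecting the sets of quasi-identifier combinations that occur exactly once in each data set. The sketch below is a minimal pure-Python illustration with a hypothetical function name, not the synthpop implementation; records are assumed to be dicts.

```python
from collections import Counter


def replicated_uniques(real, synthetic, quasi_identifiers):
    """Count synthetic uniques whose key is also unique in the real data."""
    def unique_keys(rows):
        counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
        return {key for key, n in counts.items() if n == 1}

    return len(unique_keys(real) & unique_keys(synthetic))
```

Dividing the count by the number of synthetic records gives the percentage variant. The disclosure-based measures go one step further and check whether the confidential values of each matched pair are also close.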

Re-identification:

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED BY
Full Matches		Uses inner join of original and synthetic data to find fully replicated records.			Innovation Surge Team
Partial Matches		Iteratively performs inner join on a combination of a subset of features in original and synthetic data to find partially replicated records.			Innovation Surge Team
Pairwise Intersections		Identifies pairwise intersections, which occur if any two rounded columns in the original dataset are the same as any two rounded columns in the synthetic data. All pairwise intersections in the numeric variables are counted within each category group.			CRA
Fuzzy matching distance		Measures the absolute difference between a variable value inferred from the synthetic data and the real value of that variable in the original data.			CRA
Identical Match Share	Mostly.ai, Syntho	Determines the ratio of synthetic data records that match a record in the original data			SynthPop; Privacy Warriors;
Distance Anonymity Preservation	GEMINAI	Measures privacy by computing the distances between the data points of a data set to the closest corresponding data point of another dataset, with a focus on the regions with the densest data.	Unknown		Smart Data Foundry
Density Anonymity Preservation	GEMINAI	Measures privacy by computing the distances between the data points of a data set to the closest corresponding data point of another dataset, relative to density.	Unknown		Smart Data Foundry
Distance to Closest Record	Mostly.ai, Syntho	Measures the distance of synthetic data records to their nearest actual record within the original data			SynthPop; Privacy Warriors;
Nearest Neighbor Distance Ratio	Mostly.ai, Syntho	Provides the distance ratio between the nearest and second-nearest synthetic record to their closest record within the original data			SynthPop; Privacy Warriors;
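
The two distance-based measures in the last rows of the table can be sketched together for a single synthetic record. This is an illustration under stated assumptions, not the commercial tools' implementation: the function name is hypothetical, features are assumed numeric and comparably scaled, Euclidean distance is used, and at least two real records are required.

```python
import math


def dcr_and_nndr(synthetic_record, real_records):
    """Distance to the closest real record and the nearest/second-nearest ratio."""
    distances = sorted(math.dist(synthetic_record, r) for r in real_records)
    nearest, second = distances[0], distances[1]
    # A ratio near 0 means the synthetic record sits much closer to one real
    # record than to any other; a ratio near 1 means no single record stands out.
    ratio = nearest / second if second > 0 else 1.0
    return nearest, ratio
```

Evaluations typically look at the distribution of these values over all synthetic records, often against a baseline computed from a holdout sample of real data.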

Attribute inference:

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED BY
Targeted Inference Attack		Designed to correctly guess a sensitive variable from a combination of quasi-identifiers			Scottish Longitudinal Survey;
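
One simple way to simulate an attack of this kind is to let the attacker look up each real record's quasi-identifiers in the synthetic data and guess the most common sensitive value found there. The sketch below is an assumption about how such an attack could be set up, not the Scottish Longitudinal Survey team's method; the function name and the dict-based record layout are hypothetical.

```python
from collections import Counter


def inference_attack_accuracy(real, synthetic, quasi_identifiers, sensitive):
    """Share of real records whose sensitive value is guessed from synthetic data."""
    # Group the synthetic sensitive values by quasi-identifier key.
    by_key = {}
    for row in synthetic:
        key = tuple(row[q] for q in quasi_identifiers)
        by_key.setdefault(key, Counter())[row[sensitive]] += 1

    correct = 0
    for row in real:
        key = tuple(row[q] for q in quasi_identifiers)
        if key in by_key:
            # The attacker guesses the most common synthetic value for this key.
            guess = by_key[key].most_common(1)[0][0]
            correct += guess == row[sensitive]
    return correct / len(real)
```

The resulting accuracy is only meaningful relative to a baseline, for example the accuracy of always guessing the most frequent sensitive value.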

K-anonymity:

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED BY
k-anonymity	GEMINAI	This risk measure is based on the principle that, in a safe data set, the number of individuals sharing the same combination of values (keys) of categorical quasi-identifiers should be higher than a specified threshold	Unknown		Smart Data Foundry
KL-divergence	GEMINAI	KL-divergence values measure how similar the distributions of the generated and original data sets are.	Unknown		Smart Data Foundry
l-diversity	GEMINAI	Measures the desirability of the l-diversity, which holds when the distribution of a sensitive attribute in each equivalence class has at least l “well-represented” values, to protect against attribute disclosure.	Unknown		Smart Data Foundry
t-closeness	GEMINAI	An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.	Unknown		Smart Data Foundry
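
The k-anonymity and l-diversity definitions above can be checked by grouping records into equivalence classes on their quasi-identifiers. This is a minimal sketch, not the GEMINAI implementation: the function name is hypothetical, records are assumed to be dicts, and "well-represented" is simplified to "distinct" (the distinct l-diversity variant).

```python
from collections import defaultdict


def k_anonymity_and_l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest equivalence-class size (k) and fewest distinct sensitive values (l)."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[q] for q in quasi_identifiers)].append(row[sensitive])
    k = min(len(values) for values in classes.values())
    # Distinct l-diversity: count distinct sensitive values per class.
    l = min(len(set(values)) for values in classes.values())
    return k, l
```

A data set satisfies k-anonymity for a chosen threshold when the returned `k` meets or exceeds it, and likewise for `l`. A class in which every record shares the same sensitive value yields `l = 1`, which leaves that attribute exposed even when `k` is large.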

Other Methods:

NAME	TYPE	PROCEDURE	LICENSE	DOCUMENTS	USED BY
Nearest Neighbors	ABSEHRD	Compares distances between pairs of real and synthetic data samples	Link		SAGE;
Membership Inference: Distance-based thresholding	ABSEHRD	Determines whether a given data sample was used to train a model of interest.	Link		SAGE;

National Institute of Standards and Technology
