
HLG-MOS Synthetic Data Challenge Information Package

This guide outlines the HLG-MOS Synthetic Data Challenge. It includes the steps and tasks of the challenge along with documentation to aid in the tasks. It also outlines the deliverables and provides instructions on how to communicate with experts and submit your deliverables.

Challenge

Evaluate a broad set of data sets, methods and synthesizers for typical use cases of disseminated microdata. Credit is given for each instance (combination of data and synthesis method) evaluated.

Getting Help

Challenge Instructions

Recommended Reading from Synthetic Data for NSOs: A Starter Guide

Steps to Complete One Synthesis Exploration Instance:

  1. Choose one of the two data sets for your “original” data.
  2. Choose a synthesis technique and tune parameters as desired.
  3. Evaluate the utility and privacy of your resulting synthetic data set.
  4. Repeat steps 1 through 3 for any additional synthesis methods or data sets.

Deliverables

  1. Evaluation of Synthetic Instances
  2. Evaluation of the HLG-MOS Synthetic Data for National Statistical Organizations: A Starter Guide
  3. Presentation to Senior Management

How to Submit Your Responses

Slack Channel

Communications from challenge organizers, video submission, and Q&As will take place on the HLG-MOS Synthetic Data Slack channel.

Getting Help

Experts will be available at the times listed below to answer your questions in person. Your experts for the challenge are the following:

Name | Affiliation | Expertise
Geoffrey Brent | Australian Bureau of Statistics | Microsimulation / agent-based simulation
Joerg Drechsler | Institute for Employment Research | Fully conditional specification
Christine Task | Knexus Research | Fully conditional specification
Ioannis Kaloskampis | Office for National Statistics | GANs, differential privacy
Jecy Yu | Statistics Canada | Fully conditional specification, R
Alistair Ramsden | Statistics New Zealand | Applied confidentiality / statistical disclosure control best practice (especially in the NZ context), including generation and testing of synthetic data
Manel Slokom | TU Delft | Fully conditional specification, deep learning
Gillian Raab | University of Edinburgh, Scotland | Fully conditional specification, synthpop

Office hours are scheduled for the following dates and times. Click on the expert’s name for the meeting link for their office hour.

Office Hour 1 (one session per day, January 24 to 28):

  • January 24: Jecy Yu, 2pm EST (20:00 CET / 6am AEDT)
  • January 25: Geoffrey Brent, 13:00 AEDT (9pm January 24 EST / 03:00 CET)
  • January 26: Joerg Drechsler, 13:00 CET (7am EST / 11pm AEDT)
  • January 27: Joerg Drechsler, 13:00 CET (7am EST / 11pm AEDT)
  • January 28: Jecy Yu, 10am EST (16:00 CET / 2am AEDT)

Office Hour 2:

  • Joerg Drechsler, 13:00 CET (7am EST / 11pm AEDT)
  • Gillian Raab, 17:00 CET (11am EST / 3am AEDT)
  • Gillian Raab, 19:30 CET (1:30pm EST / 5:30am AEDT)

Office Hour 3:

  • Christine Task, 2pm EST (20:00 CET / 6am AEDT)
  • Manel Slokom, 18:00 CET (12pm EST / 4am AEDT)

You can also ask questions of the experts and fellow participants in the #question Slack channel.

Challenge Instructions

You and your team are from a National Statistical Organization (NSO). Your organization has a data holding that is valuable to your many users and stakeholders, so you are tasked with providing these users with microdata that meets the high confidentiality standards of your NSO. You decide that synthetic data is the best method to achieve your goal.

Your stakeholders are interested in using the data holding for the following purposes:

1. Testing their own specific analysis and models (testing analysis)

2. Teaching students and new learners the latest in data science methods (education)

3. Testing complex systems that have been built to transform your data holding into another product (testing technology)

4. The uses are unknown (releasing microdata to the public)

You need to explore the common synthetic data generation methods and communicate to your senior managers by using appropriate disclosure and utility evaluations, focusing on whether and how suitable your synthetic data is for each use case noted above.

Remember: The more methods, data, and tools you try, the more points you get.

Please refer to HLG-MOS Synthetic Data for National Statistical Organizations: A Starter Guide to help you with your task.

Recommended Reading from Synthetic Data for NSOs: A Starter Guide

The starter guide provides a holistic view of synthetic data and offers options and recommendations for organizations producing synthetic data. Reading the entire guide will help you earn top marks. However, the challenge can be completed by reading the following:

  1. Chapter 2: Use Cases (all pages)
  2. Chapter 3: Methods (pages 17-21, 23-25, 28-32)
  3. Chapter 4: Disclosure Risk (pages 33, 40-52)
  4. Chapter 5: Utility Measures (all pages)

Steps to Complete One Synthesis Exploration Instance 

1. Choose one of the two data sets for your “original” data.

We have provided you with your “original” data. You can choose between a toy option and an option that is more realistic, depending on your level of expertise and how much time you have. We encourage you to try both data sets.

  1. satgpa data set: This data set includes SEX, SAT score (Standardized Admissions Test for universities in the United States), and GPA (university grade point average) data for 1000 students at an unnamed college.
  2. ACS data set (csv format): For a real-world example of detailed demographic survey data, try using the American Community Survey (more details and the data dictionary are available here). Note that this data set has more features (33) and will generally require longer computation time for synthesis and evaluation. Participants are welcome to explore partial synthesis (synthesizing only some variables) in addition to full synthesis (synthesizing all variables).

 2. Choose a synthesis technique and tune parameters as desired.

Chapter 3 of the HLG-MOS Synthetic Data for National Statistical Organizations: A Starter Guide will help you explore the different methods to generate synthetic data. The recommendations in this chapter will help you understand how these methods are suitable for different use cases and can help you with your communication to senior managers.  

You are welcome to use a synthesizer (or several synthesizers) of your choice or from the guide. We provide optional instructions and guidance for exploring parameter options (tuning) to improve utility/privacy for specified methods for each data set.

Select Synthesis Tools with Quickstart Guidance

CART with R-synthpop:

RSynthpop (library website)

synthpop, a package for R, provides routines for generating synthetic data and evaluating the utility and privacy of that synthesis. Both parametric and nonparametric synthesis techniques are available through its syn() function.

Quickstart guidance for satgpa data CART

Quickstart guidance for ACS data CART

DP-pgm (Minutemen):

Discussion and walk-through (video), adaptive granularity method by Minutemen (github repo)

A top solution to the 2021 NIST synthesizer model challenge, by Ryan McKenna and the Minutemen team. This technique is based on probabilistic graphical models and is best suited for categorical and discrete features. The linked implementation has been extended to arbitrary data sets and has not been specifically tuned for ACS data. This approach satisfies differential privacy guarantees.

Quickstart guidance for DP-pgm (Minutemen) 

DPSyn

The 4th place solution from the ACS sprint of the 2021 NIST synthesizer model challenge, by the DPSyn team, is based on a combination of marginal and sampling methods.  This synthesizer has been specifically trained using the ACS IL/OH data (and thus does not provide formal privacy for hackathon purposes), but it runs quickly and has a good privacy score on informal privacy metrics.  

Quickstart guidance for DPSyn

 

Other tools from the guide include the following.

Sequential modeling (fully conditional specification):

  • synthpop package for R: www.synthpop.org.uk

Information-preserving statistical obfuscation:

  • Mu-Argus, an implementation of Domingo-Ferrer and Gonzalez-Nicolas (2010): https://github.com/sdcTools/muargus
  • R package sdcMicro (an implementation of Ting et al. (2008) is included as a noise addition method): https://cran.r-project.org/package=sdcMicro
  • R package RegSDC, an implementation of all methods described in Langsrud (2019): https://CRAN.R-project.org/package=RegSDC

Simulated data:

  • R package semTools: https://CRAN.R-project.org/package=semTools
  • ABS-developed simulated data generation code. For more information on the code, contact Geoffrey Brent at geoffrey.brent@abs.gov.au.

Deep learning:

  • Synthetic Data Vault: https://sdv.dev/
  • SDGym: https://github.com/sdv-dev/SDGym

Got code or a synthesizer you want to share? Add it to the table!

3. Evaluate the utility and privacy of your resulting synthetic data set.

Chapters 4 and 5 of HLG-MOS Synthetic Data for National Statistical Organizations: A Starter Guide present disclosure considerations and utility measures, respectively, for evaluating how suitable your synthetic data is for your stakeholders' use cases.

You are welcome to use any utility and disclosure risk measures of your choosing. The synthpop package and the SDNist code suggested above provide built-in utility and disclosure measures.

  1. Utility Evaluation

  2. Privacy Evaluation

4. Repeat steps 1 through 3 for any additional synthesis methods or data sets.

Deliverables

There are three expected deliverables for this challenge.

1. Evaluation of Synthetic Instances

Think about the results, and for each of the four use cases in the guide, write whether you think the synthesis technique applied to that data, with those utility and privacy results, would produce a suitable result for that use case.

The more methods, data sets, and use cases you evaluate, the more points you receive.

In your response, please outline a justification for each of the following:

  1. the method and method category (for example, sequential modeling, deep learning) you used  
  2. whether or not you used any specific tools to generate your synthetic data
  3. how you evaluated your data (explain any specific measures and their results)

You will have more success if you complete this table for both the toy and realistic data sets. You can receive bonus marks for the following:

  1. creating both a fully synthetic and a partially synthetic file
  2. demonstrating evidence of tuning (adjusting model parameters to improve performance)

2. Evaluation of HLG-MOS Synthetic Data for National Statistical Organizations: A Starter Guide

  1. Did you find the guide useful for generating synthetic data? If so, what aspects were the most useful?
  2. Did you find the guide useful for evaluating the utility and disclosure of the synthetic data? If so, what aspects were most useful?
  3. Were there any parts of the guide that were unclear or misleading?
  4. Did you encounter an aspect of the synthetic data generation or evaluation process that was missing from the guide that you would like to propose a modification for?

3. Presentation to Senior Management

Record a 5-minute presentation, reporting your findings from Part 1 and informing senior managers of your organization that the synthetic data you generated is suitable (or unsuitable) to be used for each of the use cases you considered.  

For this challenge, the senior managers are Anders Holmberg from the Australian Bureau of Statistics and Joost Huurman from Statistics Netherlands, both representatives of the HLG-MOS Executive Board.

Your videos will be judged on their clarity and the appropriateness of the techniques (methods, utility and privacy measures) applied, as well as the difficulty of the chosen example.

How to Submit Your Responses

  1. For deliverables 1 and 2 (evaluation of synthetic data methods for stakeholder use cases, and your evaluation of the HLG-MOS synthetic data guide): please submit your answers to your team’s folder here: HLG-MOS Synthetic Data Challenge Submissions Folder.
  2. For deliverable 3: Upload your senior management presentations in both the HLG-MOS Synthetic Data Challenge Submissions Folder and the #submit-deliverables Slack channel.
  • We suggest you record a virtual meeting on the platform of your choice and upload the file.

Thank You!

A big “thank you” to Christine Task and Mary Wall from Knexus Research and the NIST and Sarus teams for helping put this challenge together.  






ACS Dataset Details

The main data set contains survey data with demographic and financial features, representing a subset of the Integrated Public Use Microdata Series (IPUMS) of the American Community Survey for Ohio and Illinois from 2012 through 2018. The data includes a large feature set of quantitative survey variables along with simulated individuals (with a sequence of records across years), time segments (years), and map segments (Public Use Microdata Areas, PUMAs). Solutions in this sprint must include a list of records (i.e., synthetic data) with corresponding time/map segments.

There are 36 total columns in the ground truth data, consisting of PUMA, YEAR, 33 additional survey features, and an ID denoting simulated individuals.

Here is what an example row of the ground_truth.csv data looks like:

Column | Value
PUMA | 17-1001
YEAR | 2012
HHWT | 30
GQ | 1
PERWT | 34
SEX | 1
AGE | 49
MARST | 1
RACE | 1
HISPAN | 0
CITIZEN | 0
SPEAKENG | 3
HCOVANY | 2
HCOVPRIV | 2
HINSEMP | 2
HINSCAID | 1
HINSCARE | 1
EDUC | 7
EMPSTAT | 1
EMPSTATD | 10
LABFORCE | 2
WRKLSTWK | 2
ABSENT | 4
LOOKING | 3
AVAILBLE | 5
WRKRECAL | 3
WORKEDYR | 3
INCTOT | 10000
INCWAGE | 10000
INCWELFR | 0
INCINVST | 0
INCEARN | 10000
POVERTY | 36
DEPARTS | 1205
ARRIVES | 1224
sim_individual_id [deprecated; see note below] | 1947

The columns are as follows:

  • PUMA (str) — Identifies the PUMA where the housing unit is located.
  • YEAR (uint32) — Reports the four-digit year when the household was enumerated or included in the survey.
  • HHWT (float) — Indicates how many households in the U.S. population are represented by a given household in an IPUMS sample.
  • GQ (uint8) — Classifies all housing units into one of three main categories: households, group quarters, or vacant units.
  • PERWT (float) — Indicates how many people in the U.S. population are represented by a given person in an IPUMS sample.
  • SEX (uint8) — Reports whether the person is male or female.
  • AGE (uint8) — Reports the person's age in years as of the last birthday.
  • MARST (uint8) — Gives each person's current marital status.
  • RACE (uint8) — Reports what race the person considers himself/herself to be.
  • HISPAN (uint8) — Identifies people of Hispanic/Spanish/Latino origin and classifies them according to their country of origin when possible.
  • CITIZEN (uint8) — Reports the citizenship status of respondents, distinguishing between naturalized citizens and noncitizens.
  • SPEAKENG (uint8) — Indicates whether the respondent speaks only English at home and, for respondents who speak another language at home, how well they speak English.
  • HCOVANY, HCOVPRIV, HINSEMP, HINSCAID, HINSCARE (uint8) — Indicates whether respondents had any health insurance coverage at the time of interview and whether they had private, employer-provided, Medicaid or other government insurance, or Medicare coverage, respectively.
  • EDUC (uint8) — Indicates respondents' educational attainment, as measured by the highest year of school or degree completed.
  • EMPSTAT, EMPSTATD, LABFORCE, WRKLSTWK, ABSENT, LOOKING, AVAILBLE, WRKRECAL, WORKEDYR (uint8) — Indicates whether the respondent was a part of the labor force (working or seeking work), whether the person was currently unemployed, their work-related status in the previous week, whether they were informed they would be returning to work (if not working in the previous week), and whether they worked during the previous year.
  • INCTOT, INCWAGE, INCWELFR, INCINVST, INCEARN (int32) — Reports each respondent's total pre-tax personal income or losses from all sources for the previous year, as well as income from wages, welfare, investments, or a person's own business or farm, respectively.
  • POVERTY (uint32) — Expresses each family's total income for the previous year as a percentage of the Social Security Administration's inflation-adjusted poverty threshold.
  • DEPARTS, ARRIVES (uint32) — Reports the time that the respondent usually left home for work and arrived at work during the previous week, measured using a 24-hour clock.
  • sim_individual_id (int) — Unique, synthetic ID for the notional person to which this record is attributed. The largest number of records attributed to a single simulated resident is provided in the parameters.json file as max_records_per_individual. This variable was designed to support the temporal map component of the 2020 NIST Synthetic Data Challenge and is not in use for the UNECE Synthetic Data Hackathon.  It is included only to enable execution of NIST Challenge solutions.  

Information is from https://www.drivendata.org/competitions/74/competition-differential-privacy-maps-2/page/282/

This data was originally drawn from the IPUMS harmonized archive of ACS data. Additional details of the code values for all variables can be found on their website at the following links:






SAT-GPA Synthesis with CART (R-synthpop)

If you need to, install R. Optionally, you may also want to download RStudio.

From either RStudio or Jupyter Notebook, install the R-synthpop package:

install.packages("synthpop")

Load R-synthpop library:

library("synthpop")

For access to documentation:

help(package = synthpop)

Preprocessing

Read in the data set:

satgpa <- read.csv(file.choose())

Check out the first few rows of the data set:  

head(satgpa)

The data frame has the following features: sex, sat_v, sat_m, sat_sum, hs_gpa, and fy_gpa.

Use codebook.syn() to see the number of distinct outcomes of each variable, as well as details such as missing records:

codebook.syn(satgpa)

Before generating the synthetic data set, there are some best practices to be aware of:

  • Remove any identifiers (e.g., study number).
  • Change any character (text) variables into factors and rerun codebook.syn() after doing so. The syn() function will do this conversion for you, but it is better that you do it first.
  • Note which variables have missing values, especially those that are not coded as the R missing value NA. For example, the value -9 often signifies missing data for positive items such as income. These can be identified to the syn() function via the cont.na parameter.
  • Note any variables that ought to be derivable from others (e.g., discharge date from length-of-stay and admission date). These could be omitted and recalculated after synthesis or calculated as part of the synthesis process by setting their method to passive (see ?syn.passive).
  • Also note any variables that should obey rules that depend on other variables. For example, the number of cigarettes smoked should be zero or missing for non-smokers. You can set the rules with the rules and rvalues parameters of the syn() function. The syn() function will warn you if the rule is not obeyed in the observed data. (A brief sketch of the cont.na and rules parameters follows this list.)

(From https://www.synthpop.org.uk/get-started.html)
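
As an illustration of the cont.na and rules/rvalues parameters mentioned in the list above, here is a minimal, hypothetical sketch. The toy variables (income, smoker, cigarettes) are invented for this example and are not part of the challenge data sets:

library(synthpop)

# Toy data: -9 codes a missing income; non-smokers always report 0 cigarettes
set.seed(1)
n <- 50
smoker <- factor(sample(c("no", "yes"), n, replace = TRUE))
toy <- data.frame(
  income     = ifelse(runif(n) < 0.1, -9, round(rnorm(n, 35000, 8000))),
  smoker     = smoker,
  cigarettes = ifelse(smoker == "no", 0L, sample(1:30, n, replace = TRUE))
)

toy_syn <- syn(toy,
               cont.na = list(income = -9),                    # treat -9 as missing, not a real value
               rules   = list(cigarettes = "smoker == 'no'"),  # when this condition holds...
               rvalues = list(cigarettes = 0))                 # ...cigarettes is forced to 0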

Note that sat_sum, the sum of the verbal and math SAT scores, can be omitted and recalculated following synthesis.

To omit the sat_sum feature:

mydata <- satgpa[,c(1,2,3,5,6)]

Synthesize the Data

#generate synthetic data with default parameters
synthetic_dataset <- syn(mydata)

Running syn() on this data produces a warning about variables being treated as numeric despite having few distinct values.

The sex variable has two distinct integer-encoded outcomes or levels (i.e., 1, 2), so R treats it as numeric. The suggested fix, setting the minnumlevels parameter, tells syn() to treat numeric variables with only a few distinct values (such as sex) as factors:

synthetic_dataset <- syn(mydata, minnumlevels = 2)

Next, recalculate the sat_sum variable and append it to the synthetic data set, for example as shown below.
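
A minimal sketch of this step, assuming the synthesis above (syn() stores the synthetic data frame in the $syn component of the object it returns, and sat_v and sat_m are the original column names):

syn_df <- synthetic_dataset$syn                 # extract the synthetic data frame
syn_df$sat_sum <- syn_df$sat_v + syn_df$sat_m   # recompute the omitted total score
head(syn_df)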

Further Exploration

The ACS data has many more variables than the satgpa data set, and the results (and runtime) may be more sensitive to the particular choice of predictor matrix and visit sequence.  It may be useful to experiment with different options for these parameters with the satgpa data set before attempting synthesis of ACS. Here is a link to additional resources: https://arxiv.org/pdf/2109.12717.pdf

You can specify one of the built-in synthesizing methods for variable synthesis (parametric or non-parametric). The initial variable synthesis will always use sampling, regardless of the methods selected for subsequent variables. See the methods used for the synthesis shown below:

synthetic_dataset$method

You can specify the order of the sequencing. In our synthesis, the default ordering is the variables as they appear in the data frame from left to right:

synthetic_dataset$visit.sequence

In the predictor matrix, 1 is assigned to any (i,j) entry where the synthesis of the ith variable depends on the jth variable; otherwise the entry is 0. The synthesis of the nth variable in the visit sequence, where n != 1, depends on the preceding 1, …, n-1 variables. The default predictor matrix is therefore a square lower-triangular matrix, with 0’s along the diagonal because no variable can act as its own predictor. See the predictor matrix below:

synthetic_dataset$predictor.matrix

Predictor variables can be adjusted, and it may be helpful to do so when working with larger data sets and/or more variables. Not all variables need to act as predictors, and not all variables must be synthesized. Thus, a variable that is not itself synthesized can still act as a predictor for another variable’s synthesis. A sketch of one such adjustment follows.
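
One way to make such an adjustment is to take the predictor matrix from a first run of syn() and pass an edited copy back in. A minimal sketch, assuming the mydata frame and synthetic_dataset object from above, and supposing you no longer want hs_gpa to act as a predictor for fy_gpa:

pm <- synthetic_dataset$predictor.matrix        # matrix with variable names as dimnames
pm["fy_gpa", "hs_gpa"] <- 0                     # drop hs_gpa as a predictor of fy_gpa
resynth <- syn(mydata, predictor.matrix = pm, minnumlevels = 2)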






ACS Synthesis with CART (R-synthpop)

This is an introduction to ACS synthesis with R-synthpop, designating the CART method for variable synthesis.

install.packages("synthpop")
library("synthpop")
install.packages(
"arrow")
library("arrow")

The ACS data can be downloaded as a csv file from the SDNist repo here: https://github.com/usnistgov/SDNist/tree/main/sdnist/census_public_data

sprint <- read.csv(file.choose())

Run codebook.syn() to see details of the data set (list of variables and number of levels).

codebook.syn(sprint)

Getting Started

To get an initial look at the data, you can explore a subset of the data set. Here we restrict it to the initial 200 records.

sprint_subset <- sprint[1:200,]

Exactly how large is this data set?

dim(sprint_subset)

mysyn <-syn(sprint_subset)

Review the codebook, determine which variables you think should be changed to factors, and assign the corresponding minnumlevels value (see the satgpa directions); a sketch follows.
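
A minimal sketch of that adjustment, assuming the sprint_subset frame from above (the threshold of 10 is only an illustration; choose it from what codebook.syn() reports for your coded variables):

# Treat numeric variables with few distinct values as factors during synthesis
mysyn <- syn(sprint_subset, minnumlevels = 10)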


Partial Synthesis of ACS Data

Generating a full synthesis of the ACS data set requires tuning the model parameters (visit.sequence, predictor.matrix, etc.) to improve efficiency. These are first referenced in the satgpa with R-synthpop document, where a brief description is provided. Before generating a complete synthesis of the ACS data set, you may choose to test a partial synthesis in which you synthesize only a few selected features, leaving the remaining features with their ground-truth values.

Here is sample code for you to consider for partial synthesis of the ACS data set using R-synthpop with the CART method. If you have not already done so, read in the data set.

ACS <- read.csv(file.choose())

In this particular quickstart guidance, we do not address variable sampling weights. You may choose to remove the PERWT, HHWT, and sim_individual_id features from the data set with the following code. For this example, the columns are removed.

df = subset(ACS, select = -c(PERWT,HHWT, sim_individual_id, X) )

This example predicts GQ, SEX, AGE, MARST, RACE, and HISPAN using the remaining features in the data set (excluding the weights and serial identifier, which were removed). The following code sets the method to “cart” for the variables to be predicted. For the variables that are used to predict but are not themselves synthesized, set the method to "" (an empty string).

method.ini=c("","","","cart","cart","cart","cart","cart","cart","","","","","","","","","","","","","","","","","","","","","","","","")

Set the visit sequence, including the indices of the variables to be predicted; exclude the variables that are used only to predict.

Visit.sequence.ini=c(4,5,6,7,8,9)

Set the predictor matrix to 33 x 33, such that only the rows corresponding to GQ, SEX, AGE, MARST, RACE, and HISPAN are predicted. Note: the number of rows and columns in the predictor matrix should equal the number of columns in the ground-truth data set.

tosynth <- c(4,5,6,7,8,9)
P <- matrix(0, nrow=33, ncol=33)
# Every variable that is NOT synthesized may predict each synthesized variable
for (i in tosynth){
    for (j in 1:33)
        if (!is.element(j, tosynth)){
            P[i, j] <- 1
        }
}
# Variables synthesized earlier in the visit sequence predict those synthesized later
for (i in 1:length(tosynth))
    for (j in 1:(i-1)){
        if (j < i){
            a <- tosynth[i]
            b <- tosynth[j]
            P[a, b] <- 1
        }
    }

Take some time to review the results of codebook.syn(ACS). Notice that the PUMA feature has 181 distinct levels. In this example, we suggest setting PUMA to factor with the following.

df$PUMA = factor(df$PUMA)

 

Now pass the visit sequence, predictor matrix, and method to syn, along with the ground-truth data set for synthesis. Set drop.not.used to FALSE so that features which are not synthesized are included in the resulting synthetic data set.

my.seed <- 123  # any fixed value; used only for reproducibility
synth.ini <- syn(data = df, seed = my.seed, visit.sequence = Visit.sequence.ini, predictor.matrix = P, method = method.ini, m = 1, minnumlevels = 6, drop.not.used = FALSE, maxfaclevels = 182)

If you see the error “We have stopped your synthesis because you have factor(s) with more than 60 levels: PUMA (181). This may cause computational problems that lead to failures and/or long running times. You can try continuing by increasing 'maxfaclevels'”, pass maxfaclevels set to 182 as shown above.

This synthesis may take several hours to complete. You may consider different approaches to improve processing time, such as dropping PUMA from the predictor matrix by setting the first entry of each nonzero row to 0.

for (i in 4:9)
    P[i, 1] <- 0

Full Synthesis

Hopefully, you now have some ideas to inform your approach to generating a full synthesis of the ACS data set with the CART method. Some suggestions follow:

  1. In the full synthesis, every variable is included in the visit sequence. It may be helpful to leave PUMA to the end of the visit sequence. In this way, PUMA will be synthesized and not used as a predictor for any additional variable synthesis.
  2. Combine variable categories. For example, you might combine SEX+MARST, HCOVANY+HCOVPRIV+HINSEMP+HINSCAID+HINSCARE, EMPSTATD+WORKEDYR+WRKLSTWK, ABSENT+LOOKING, and/or AVAILBLE+WRKRECAL (see the sketch after this list).
  3. Exclude any variable that is fully determined by others from synthesis and/or from acting as a predictor for the remaining variables.
  4. Consider grouping PUMA levels, for example into urban vs. rural areas, or by income or demographics.
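
A minimal sketch of suggestion 2, combining two variables into a single composite factor before synthesis and splitting it back apart afterwards. The composite name SEX_MARST, the "_" separator, and the object name synth.full are arbitrary choices for this illustration:

# Before synthesis: replace SEX and MARST with one composite factor
df$SEX_MARST <- factor(paste(df$SEX, df$MARST, sep = "_"))
df$SEX <- NULL
df$MARST <- NULL

synth.full <- syn(df, minnumlevels = 6)   # full synthesis; this can take a long time on ACS

# After synthesis: split the composite back into its parts
parts <- strsplit(as.character(synth.full$syn$SEX_MARST), "_")
synth.full$syn$SEX   <- as.integer(sapply(parts, `[`, 1))
synth.full$syn$MARST <- as.integer(sapply(parts, `[`, 2))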

Evaluate Your Results!!  

Once you have the synthesis, run utility evaluations in R-synthpop to understand your model’s performance. Be sure to evaluate your synthesis with SDNist.score as well. (Information to compute the score may be found here: ACS synth eval with SDNist.score). To do so, you will need to export the synthesis (here, the synth.ini object from above) to a csv file.

write.syn(synth.ini, filename = "mysyn", filetype = "csv")






ACS Synthesis with DP-pgm Minutemen

A top solution from the 2021 NIST synthesizer model challenge, by Ryan McKenna and the Minutemen team, this technique is based on probabilistic graphical models and is best suited for categorical and discrete features. The linked implementation has been extended to arbitrary data sets and has not been specifically tuned for ACS data. This approach satisfies differential privacy. Please see the discussion and walk-through video for more details.

Use this synthesis technique with either the ACS or satgpa data sets. To implement the technique, first install Ryan McKenna's nist-synthetic-data-2021, private-pgm, and nist-schemagen repositories, found on GitHub (https://github.com/ryan112358/nist-synthetic-data-2021, https://github.com/ryan112358/private-pgm, and https://github.com/hd23408/nist-schemagen). Instructions to git-clone the libraries are below. Make sure your environment uses Python version 3.9. In Anaconda, you might use conda create -n <NAME> python=3.9.

Additionally, these instructions assume MacOS or Linux, due to a dependency on the jaxlib Python library in the final undiscretization step (which is not supported on Windows). Windows users may want to check out WSL to run the complete synthesis process, or else consider evaluating directly on the discretized synthetic data (against discretized ground-truth data; see the undiscretization step).

Cloning the Necessary DP-pgm Repositories

From the terminal, git-clone the private-pgm repository.  

git clone https://github.com/ryan112358/private-pgm.git

Next, navigate to the private-pgm directory, where the requirements.txt file is located, install the requirements, and then install the package itself.

pip install -r requirements.txt

pip install .

Follow the same process as above: git-clone the nist-synthetic-data-2021 repository, navigate to the nist-synthetic-data-2021 directory containing requirements.txt, and install the requirements.

git clone https://github.com/ryan112358/nist-synthetic-data-2021.git
pip install -r requirements.txt

Clone the nist-schemagen repository, then navigate to the nist-schemagen directory and install the requirements.txt file. This step is optional if using the provided pre-generated schema files (see below).

git clone https://github.com/hd23408/nist-schemagen.git

pip install -r requirements.txt 

Once you have the repositories installed locally, you are ready to begin synthesis of your data set with the Minutemen solution!

The diagram below outlines the synthesis process when using Minutemen. The ground-truth data set feeds into the “black box” synthesis model, and voilà ... synthetic data! A few details to keep in mind:  

  1. This synthesis model should be fed discretized data and schema.
  2. Post synthesis, the data should be undiscretized.  

 

Generate Schema  

This step uses the schemagen library to produce the schema file (parameters.json) for a given data set. If you would like to skip this step, feel free to instead use the previously generated sample parameters.json files for both the satgpa and ACS data sets, which are provided in the next step.

Ground-truth data sets must be in csv format! schemagen will return “column_datatypes.json” and “parameters.json”. It may be helpful to make a new directory to save the schemagen output because you will need the files to discretize the ground-truth data set with transform.

Run from the nist-schemagen directory:

python main.py <PATH_TO_DATASET.csv> -o <PATH_TO_DIRECTORY> -m 181 

Discretize the Ground-Truth Data Set

Pre-generated Schema Files for Each Challenge Data Set

  • satgpa: parameters.json [1/24/22 16:20 CET: Updated hs_gpa binning]
  • ACS: parameters.json

Transform will use the schema file (parameters.json) to discretize the ground-truth data set, returning “discretized.csv” and “domain.json” files. The “transform.py” file is located in nist-synthetic-data-2021/extensions.

Run from the nist-synthetic-data-2021 directory (make sure to install the private-pgm library):

python extensions/transform.py --transform discretize --df <PATH_TO_GROUNDTRUTH> --schema <PATH_TO_parameters.json> --output_dir <PATH_TO_DIRECTORY>

Check Discretized Ground Truth  

Optionally, you can verify that the discretized data set and the domain file are compatible. “check_domain.py” is located in nist-synthetic-data-2021/extensions.

Run from the nist-synthetic-data-2021 directory:

python extensions/check_domain.py --dataset <PATH_TO_discretized.csv> --domain <PATH_TO_domain.json>

Synthesize the Discretized Ground-Truth Data Set

“adaptive_grid.py”, located in nist-synthetic-data-2021/extensions, takes in the discretized ground-truth data set and returns synthetic discretized data with the filename “out.csv”.

(Be careful! There is a file with different content and the same name, “adaptive_grid.py”, in the SDNist repo under examples; that version does not run on arbitrary data sets.)

Run from the nist-synthetic-data-2021 directory:

python extensions/adaptive_grid.py --dataset <PATH_TO_discretized.csv> --domain <PATH_TO_domain.json> --save <PATH_TO_DIRECTORY> --epsilon 10

Transform Synthetic Discretized to Undiscretized Raw Synthetic

The discretized synthetic data uses bin indices rather than feature values to represent the data. The transform.py script will transform the discretized synthetic data back to its original domain value range. This undiscretized synthetic output should be used for any further comparative assessment (utility/disclosure of the synthetic data relative to ground truth).

Run from nist-synthetic-data-2021 directory:

python extensions/transform.py --transform undo_discretize --df <PATH_TO_out.csv> --schema <PATH_TO_parameters.json>

Note: Windows OS users who are not using WSL to run the Minuteman synthesizer may not be able to complete this final step because of a dependency on jaxlib for python, which does not support Windows. If you would like to evaluate the discretized synthetic satgpa data set directly without undiscretizing back to the original data domain, we’ve provided this discretized version of the ground-truth satgpa data to be used as a basis for comparison.  
Unfortunately, because both of the hackathon data sets include numerical/continuous variables, comparing the discretized data sets directly like this won’t necessarily capture the utility or privacy of the undiscretized data. However, it is a representation of the algorithm performance for categorical data.  






ACS with DPSyn Utility Visualization

The DPSyn synthesizer, the fourth-place performer in Sprint 2 of the NIST Differential Privacy Synthetic Data Challenge, runs efficiently and provides an easy way to look at the output of the SDNist visualizer on the synthetic ACS data.

We do note, though, that the version of DPSyn included in the SDNist library was the challenge submission version of the DPSyn approach (rather than the release version), and thus it operates in a more stringent privacy context than we intended for the HLG-MOS Hackathon (it is hardcoded to assume up to 7 years of linked data across individuals, with epsilon = 1). This solution is not indicative of the utility capabilities of DP algorithms in general.

To begin, if you have not already done so, clone the SDNist repo:

git clone https://github.com/usnistgov/SDNist.git
cd SDNist
pip install .

You may need to install pyyaml and the importlib-resources libraries. To do so, in the terminal type the following:

pip install pyyaml
pip install importlib-resources

Next, navigate to the folder in the SDNist directory containing the “main.py” file (SDNist/examples/DPSyn) and run the following:

python main.py

The synthesis should take less than an hour. Computation time will vary depending on your machine. When the synthesis is complete, you will see a new tab on your browser with an interactive heat map that displays the utility score of the evaluation across PUMAs.  


Note: Synthesis output will save as a csv file in the SDNist/sdnist/examples/DPSyn directory.

Note: Owing to some slowness of the NIST servers, some people encounter difficulties when the system goes to access the ACS benchmark data (error: “data/census/final/IL_OH_10Y_PUMS.json does not exist”). Please see the update to the repo documentation here for more guidance: README.md






R-synthpop Utility Evaluation

This document provides examples of utility evaluations using tools from the R-synthpop package, with the satgpa data set.  

Evaluating the Utility of the Synthesis

We evaluate the utility of synthesized data to understand how it compares with the original data set. Ultimately, we hope to understand whether inferences made on the basis of the synthetic data set, rather than the ground-truth data set, are comparably reliable.

A natural comparison might be to look at a similar summary of the synthetic and ground-truth data sets.

summary(synthetic_dataset)

 

summary(satgpa)

utility.gen(synthetic_dataset, satgpa)

utility.tables(synthetic_dataset, satgpa)

Histogram

In addition, we can compare the synthetic and ground-truth data sets at the variable level to see how the distributions match up. The following visualization shows the distribution of the SAT verbal score grouped by sex (1 = men, 2 = women).

Verbal SAT Score by Sex

multi.compare(synthetic_dataset, satgpa, var = "sat_v", by = "sex")

SAT Math Score by Sex

multi.compare(synthetic_dataset, satgpa, var = "sat_m", by = "sex")

High School GPA by Sex

multi.compare(synthetic_dataset, satgpa, var = "hs_gpa", by = "sex")

First Year GPA by Sex

multi.compare(synthetic_dataset, satgpa, var = "fy_gpa", by = "sex")

You might try a box plot to compare the central tendency and distribution of the synthetic and ground-truth variables.  

SAT Math by Sex

multi.compare(synthetic_dataset, satgpa, var = "sat_m", by="sex", cont.type = "boxplot")

In either case, it appears that the SAT math average is preserved under synthesis. However, there are subtle differences in the spread.






Data Set Comparison Using R-Synthpop

Author: Maia Hansen

Introduction

synthpop (https://www.synthpop.org.uk/) is a package for the R programming language that is generally used to produce synthetic data for data sets of individuals. The synthpop package also provides a set of utilities and tools that allow you to compare two arbitrary data sets. These tools are generally used to compare original data sets with synthetic data created using synthpop, but we will be using them to compare original data sets with new synthetic data sets created using our own algorithms.

Installation

To use synthpop, you must first install the R programming language and its synthpop package. You don't need a working knowledge of R because this guide will walk you through the process step by step, but you do need to install it. (If you already have R installed, skip to step 2 below.) If you need help installing R or the synthpop library, please reach out to maia.hansen@gmail.com and schedule a time to work through the process together.
Note: Mac users, please see footnote at the end of this guide.

  1. Install R using the instructions at https://cran.r-project.org/. If you are using Mac or Windows, you will likely want to use one of the pre-compiled binary packages available in the "Download and Install R" section. If you are using Linux, you may want to use your Linux package manager instead.
  2. OPTIONAL: Once you have installed R, you can optionally install the RStudio IDE at https://www.rstudio.com/products/rstudio/download/. This is not required to run the synthpop package, but it can be useful if you intend to do additional things with R.
  3. Install the devtools R package.
       1. Open an R console by opening a terminal or command-line window and typing the single letter R.
       2. Install the R devtools package with the following command:
              install.packages("devtools")
       3. You will be prompted to select a mirror site near you. R will provide you with a list of available mirror sites. Enter the number of the mirror site that is physically located nearest to you ("81" for Oregon, "17" for Beijing, etc.).
       4. R will automatically download and install the devtools package as well as any required dependencies.1
  4. Load the devtools R package into your R terminal session with the following command:

    library(devtools)

You should see the following message:
        > library(devtools)

Loading required package: usethis

  5. Next, install at least version 1.7-0 of the synthpop package. (You will need version 1.7 to use the compare function with data that was not generated directly from synthpop.) Assuming you are connected to the Internet, you can download this directly from github2 with the following command:

install_github("bnowok/synthpop")

You should see the following message:

> install_github("bnowok/synthpop")

Downloading GitHub repo bnowok/synthpop@HEAD...

and the synthpop package will be installed.

If you did not receive any error messages during this process,1 you have successfully installed the required synthpop library! Continue to the next step, "Basic Data Set Comparison."

Basic Data Set Comparison

This section assumes that you have two csv (comma-separated values) files accessible locally on your filesystem. One should contain a ground-truth dataset, and one should contain a data set synthesized from this ground-truth dataset.

This quickstart guide will focus on performing the comparison using the R command line. For convenience, you may wish to write an R script that will perform the comparison for you, or use RStudio if you are already comfortable with R. That is outside the scope of this guide, but if you'd like to try that and want help, please feel free to contact maia.hansen@gmail.com.

To compare two data sets:

  1. If you are not already in an R console, open a terminal or command-line window and type the single letter R (followed by <ENTER>).
  2. Load the synthpop library into your workspace with the following command:

library(synthpop)

  3. Load your ground-truth data set into an R data frame with the following command:
            groundTruth <- read.csv(file = '/path/to/ground_truth.csv')
  4. Load your synthetic data set into an R data frame with the following command:
            syntheticData <- read.csv(file = '/path/to/synthetic_data.csv')
  5. Run a basic comparison between the two data sets with the following command:

        compare(syntheticData, groundTruth)

This comparison will open up a series of graphs that provide a basic comparison between the two data sets. Keep reading for examples and more tools!

Example

As an example, we will use the ground_truth.csv Chicago taxi driver data set from Sprint 3. This data set contains the following columns:

taxi_id,shift,company_id,pickup_community_area,dropoff_community_area,payment_type,trip_day_of_week,trip_hour_of_day,fare,tips,trip_total,trip_seconds,trip_miles

For our comparison data set, we will use a synthetic data set (named submission.csv), created per the requirements of the Sprint 3 competition, containing the following columns:

epsilon,taxi_id,shift,company_id,pickup_community_area,dropoff_community_area,payment_type,fare,tips,trip_total,trip_seconds,trip_miles

Both the ground_truth.csv and submission.csv files in this example are located in the current working directory. First, enter an interactive R session by typing the letter R and hitting <ENTER>. Then, run the following R session to compare these data sets, based on the instructions above.3

> library(synthpop)

Find out more at https://www.synthpop.org.uk/

> groundTruth <- read.csv(file = 'data/ground_truth.csv')

> syntheticData <- read.csv(file = 'submission.csv')

> compare(syntheticData, groundTruth)

Warning: Variable(s) epsilon in synthetic object but not in observed data

  Looks like you might not have the correct data for comparison

Comparing percentages observed with synthetic

Press return for next variable(s):

Press return for next variable(s):

>

When the analysis is complete, comparison graphs will be displayed, providing a comparison of the two data sets.


Other Features

R-synthpop also provides methods to perform other comparisons. Some features that might be useful include the following.

Data Set Summaries

To get a summary (min, median, max, etc.) of the values contained in a data set, type the following command (using the examples loaded as described above):

summary(groundTruth)

or

summary(syntheticData)

Tabular Comparisons

R-synthpop can also be used to produce and compare tables from observed and synthesized data. This uses the utility.tab function of R-synthpop, documented at https://github.com/bnowok/synthpop/blob/master/man/utility.tab.Rd 

As an example, to produce a tabular comparison of the pickup and dropoff community areas between the synthesized and ground-truth data sets loaded in the example above, you would use the following command:

utility.tab(syntheticData, groundTruth, vars=c("pickup_community_area", "dropoff_community_area"))

This produces tables comparing the synthetic and observed distributions of the selected variables.

Distributional Comparisons

You can also use synthpop to do a distributional comparison of synthesized and ground-truth data, using the utility.gen function of R-synthpop, documented at https://github.com/bnowok/synthpop/blob/master/man/utility.gen.Rd. To use this function, make sure the synthetic data does not contain any columns that are not also present in the ground-truth data. However, in our example above, the synthetic data contains the column epsilon, which is not present in the original data.

To delete the epsilon column from the data loaded as syntheticData in the example above, run the following command (after loading the data in your R console) to create a new synthdata data frame that does not have epsilon:

synthdata = subset(syntheticData, select = -c(epsilon))

You should now be able to run the utility.gen function to compare the two data sets. However, this function requires a great deal of internal memory, so you may need to select only the first N rows of each data set in order to get a result, using the head function. As an example, here is a comparison of the first 5000 rows of our ground-truth and synthetic data (after removing the epsilon column), as described in the example above.
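
A minimal sketch of that call, assuming the synthdata and groundTruth data frames created above (the cutoff of 5000 rows is just the value used in this example):

utility.gen(head(synthdata, 5000), head(groundTruth, 5000))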

Refer to the R-synthpop documentation for the utility.gen function  at https://github.com/bnowok/synthpop/blob/master/man/utility.gen.Rd#L120 for more details on how to use and configure the utility.gen function to produce more useful and interesting results.

For more details on R-synthpop, refer to its github repository at https://github.com/bnowok/synthpop.

You can also access the full documentation for R-synthpop at

https://rdrr.io/cran/synthpop/. Be aware, though, that this documentation is for version 1.6.0, which does not  support comparisons between arbitrary data sets. (It only supports comparisons between ground-truth data sets and synthetic data sets created using R-synthpop itself.) However, it may still be useful for information about basic functionality.

Python

There is at least one Python implementation of synthpop, at https://hazy.com/blog/2020/01/31/synthpop-for-python. We have not tried this implementation because the native R implementation has the latest functionality and supports direct comparisons between arbitrary data sets. However, if you'd like to try the Python implementation, please let us know how it goes!

Footnotes

  1. If you are running this on a Mac, you may see a warning message similar to the following:

Warning message:

In doTryCatch(return(expr), name, parentenv, handler) :

  unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':

  dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib

  Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/modules/R_X11.so

  Reason: image not found
This error message is benign and can be safely ignored.

  2. If you are familiar with using R and the CRAN package manager, please note that the version of R-synthpop that is available directly from the R package manager has not yet been updated to 1.7, so you will need to follow the process of installing from github.
  3. Notice that when you run the compare function, you'll receive the following message:

Warning: Variable(s) epsilon in synthetic object but not in observed data

          Looks like you might not have the correct data for comparison

This is expected because the "epsilon" column in the synthesized data does not exist in the ground-truth data. The message can be ignored.






ACS Evaluation with SDNist

This document outlines the steps for evaluating the utility of the ACS synthetic data against the ground-truth data set. You can score a synthetic csv file directly from the command line:

% python -m sdnist your_file.csv

Alternatively, use the Python API from a script or notebook. Import the libraries:

import sdnist
import pandas as pd

Retrieve the ACS data set.

dataset, schema = sdnist.census()  

Read in the data synthesized from your choice of synthesizer.  

mysynth = pd.read_csv("mysynth.csv")

Score with SDNist.

sdnist.score(dataset, mysynth, schema, challenge = "census")




Apparent Match Distribution Privacy Metric - Python

Metric Definition

Even though synthetic data may sufficiently mitigate disclosure risk, perceived disclosure risk can still be a concern. Perceived disclosure risk arises when a unique record in the synthetic data could be perceived to be a unique record in the original data. One way to address this concern in a fully synthetic data file is to look at records with a unique combination of key variable values in the synthetic data that match unique records in the real data set. These key variables, or quasi-identifiers, are variables that can distinctly identify an individual record; they need to be selected, and then the observations are matched between the real and the synthetic data based on this set of key variables. Examples of potential quasi-identifiers are income, age, sex, race, marital status, and location. The matching process results in the percentage of synthetic individual records that are apparent unique matches to real-world people.

For these apparent matches, the next step is to determine what percentage of the remainder of the individual’s record (variables that are not quasi-identifiers) is the same between the real and synthetic data—essentially, to what extent does the apparent match between a real and synthetic record reflect a real match between a synthetic individual and a real individual? The result of this process can determine with what amount of confidence the synthetic data could be used to correctly infer anything about a real-world person.

By graphing the distribution of true similarity between apparently matched records, one can assess how the record has changed in the synthesis process. Exact matches (all record variables matching) could still exist in the synthetic data, or it might be that apparently matched records actually differed in all but one or two non-quasi-identifying variables. For example, Figure 1 illustrates an example where the ACS was synthesized using a differentially private synthesizer. This exercise resulted in only 0.145% of synthetic records having a unique “apparent match” to a real-world individual, based on quasi-identifiers. There were zero true exact matches between the synthetic and real data (synthetic records that matched real records on all variables). By graphing the distribution of additional variable matches between apparently matching records, one can determine the extent to which the synthetic data actually reflects the real data for their apparent matches.

 

 

Figure 1: Similarity distribution between pairs of apparently matching records, using differentially private synthetic ACS data. Most apparent unique matches between the real and synthetic data actually shared less than 60% of their full record information (differed on more than 40% of their variables), meaning no inference could be made with high confidence about the real-world person using the apparently matching synthetic person’s record.

This assessment allows the synthesizer or organization to determine the level of certainty (i.e., how many variables are matched, and at what level of confidence apparent matches with the synthetic data reflect real information about real individuals) they are willing to allow in order to release the synthetic file. The exact thresholds for both the number of variables matched and the percentage of matches depend largely on the information present in the data set (for example, health data) or what information needs to be present in the data set (for example, a minority population).

The matching records could either remain in the synthetic data (depending on other disclosure policies) or could be suppressed. There are two important considerations when using this method. First, a small threshold or a threshold of zero may be the optimal result on disclosure risk alone; however, this may not be realistic. Second, removing such a small number of records may change very little about the overall risk of the data set.

How to Run the Metric

Currently, the relevant files and sample data sets for the example below are in pmetric.  Eventually, this privacy metric will be integrated into the SDNist benchmarking repo.  

The following sample script uses the data sets contained in the Google Drive folder linked above:

python main.py --groundtruth GA_IL_OH_10Y_PUMS.csv --dataset synthetic.csv -q "GQ,SEX,AGE,MARST,RACE,HISPAN,CITIZEN,SPEAKENG,PUMA" -x "sim_individual_id,HHWT,PERWT"

Figure 2: Output generated by SDNist Apparent Match Distribution Module


This privacy tool can also be used as a module within a larger program by using the following code:

import pandas as pd

def cellchange(df1, df2, quasi, exclude_cols):
    # Records that are unique on the quasi-identifiers within each data set
    uniques1 = df1.drop_duplicates(subset=quasi, keep=False)
    uniques2 = df2.drop_duplicates(subset=quasi, keep=False)
    # Apparent matches: unique records that agree on all quasi-identifiers
    matcheduniq = uniques1.merge(uniques2, how='inner', on=quasi)
    allcols = set(df1.columns).intersection(set(df2.columns))
    cols = allcols - set(quasi) - set(exclude_cols)
    print('done with cellchange')
    return match(matcheduniq, cols), uniques1, uniques2, matcheduniq

def match(df, cols):
    # For each apparent match, the percentage of remaining (non-quasi-identifier)
    # columns on which the two records agree
    S = pd.Series(data=0, index=df.index)
    for c in cols:
        c_x = c + "_x"
        c_y = c + "_y"
        S = S + (df[c_x] == df[c_y]).astype(int)
    S = (S / len(cols)) * 100
    return S




R-synthpop Privacy Evaluation

replicated.uniques(synthetic_dataset, satgpa)

Interpretation  

There are 10 replicated uniques in the synthesized data set, compared with 1000 unique records in the ground-truth data set; that is, 1% of unique records are replicated in the synthesized data set.
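
As a hedged follow-up, synthpop's sdc() function can strip those replicated uniques from the synthetic object before release via its rm.replicated.uniques argument (check the current synthpop documentation for details):

# Remove synthetic records that replicate unique ground-truth records
synthetic_dataset_sdc <- sdc(synthetic_dataset, satgpa, rm.replicated.uniques = TRUE)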