Contents:

Benchmark Data

Deidentified Data Archive

Deidentification Algorithm Summary Table

Meta-analysis Tutorial Notebooks

Pair-wise PCA Inspection Tool


Benchmark Data

The Collaborative Research Cycle uses the Diverse Communities Data Excerpts as the target data for this program. All deidentification techniques in our directory have been run on this input data, and the resulting examples of deidentified data are available in the archive below.

The Diverse Communities Data Excerpts include three benchmark datasets: the Massachusetts data is from north of Boston, the Texas data is from near Dallas, and the National data is a collection of very diverse communities from around the nation. The data is derived from the 2019 American Community Survey; the 24 features in the complete schema were chosen because they capture many of the complexities of real-world data while still being small and simple enough to make more formal analysis feasible. The data folder includes lovely postcard documentation about the communities and a JSON data dictionary to make it easy to configure your privacy technique. The usage guidance section in the readme has helpful configuration hints (watch out for 'N').
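For instance, 'N' codes mark structurally missing values in the data. A minimal sketch of handling them when loading data is shown below; the dictionary entries and inline CSV are simplified stand-ins for the archive's actual files, not their exact format:

```python
import io
import json

import pandas as pd

# Simplified stand-in for a couple of entries from the JSON data dictionary;
# the real file in the data folder has one entry per feature.
dictionary = json.loads("""{
  "AGEP": {"description": "Age", "values": "0-99"},
  "MSP":  {"description": "Marital status", "values": {"N": "N/A (age < 15)", "1": "Married"}}
}""")
print(dictionary["MSP"]["values"]["N"])

# 'N' marks values that are structurally missing (e.g. children have no
# marital status), so read features as strings rather than letting pandas
# coerce a mixed numeric/'N' column.
csv = io.StringIO("AGEP,MSP\n40,1\n10,N\n")
df = pd.read_csv(csv, dtype=str)
print(df["MSP"].tolist())
```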

Additionally, data from 2018 has been provided as a control; this may be useful for configuring differentially private algorithms or calibrating privacy metrics. The 2018 data covers the same schema (features and geography) but does not share any individuals with the 2019 data.


Deidentified Data Archive

Download the Research Acceleration Bundle and explore! This archive of deidentified data samples and evaluation metric results provides a broad, representative sample of data deidentification as a whole.

The target data for this project are the NIST Diverse Communities Data Excerpts, curated data drawn from the American Community Survey. The archive comprises deidentified versions of the Excerpts generated by a wide variety of deidentification algorithms, libraries, and privacy definitions. Check out our Algorithm Summary Table for a high-level glimpse of the current archive contents.

The bundle also includes tools and reports designed to support meta-research, which we hope will lead to a more foundational understanding of the mechanics of privacy, utility, and equity on diverse populations. Scroll down this page for a tour of the exploration tools we provide along with the bundle.


Deidentification Algorithm Summary Table

This table provides a very high-level summary of the deidentification algorithms in our archive. Unique Exact Match (UEM) is a simple privacy metric that counts the percentage of singleton records in the target data that are also present in the deidentified data; these are uniquely identifiable individuals who leaked through the deidentification process. The Subsample Equivalent (SsE) utility metric uses an analogy between deidentification error and sampling error to communicate utility: a score of 5% indicates the edit distance between the target and deidentified data distributions is similar to the sampling error induced by randomly discarding 95% of the data. Edit distance is based on the k-marginal metric for sparse distributions.
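As a concrete illustration, here is a minimal sketch of the UEM computation on toy data; the column names and records below are invented for the example, not drawn from the Excerpts:

```python
import pandas as pd

# Toy stand-ins for the target and deidentified data.
target = pd.DataFrame({
    "AGEP": [30, 30, 74, 22, 22],
    "SEX":  [1,  1,  2,  2,  2],
})
deid = pd.DataFrame({
    "AGEP": [30, 74, 51],
    "SEX":  [1,  2,  1],
})

# Unique Exact Match: of the singleton records in the target (records that
# appear exactly once), what percentage also appear in the deid data?
counts = target.value_counts()
singletons = counts[counts == 1].index        # uniquely identifiable records
deid_records = set(map(tuple, deid.to_numpy()))
matched = sum(1 for rec in singletons if tuple(rec) in deid_records)
uem = 100 * matched / len(singletons)
print(f"UEM: {uem:.1f}%")
```

Here the single unique target record (age 74) reappears verbatim in the deid sample, so it counts as leaked.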

Note that this isn't a leaderboard: you can select any column in the dropdown menu and reorder the table according to that column. Algorithms with high utility (high SsE) may have a lot of privacy leakage (high UEM), and algorithms with low privacy leakage (low UEM) may have poor utility (low SsE). Algorithms that have only been run on small subsets of the schema may perform differently on larger feature spaces (see Avg. Feat. Space Size). And, in general, SsE and UEM are very simple, reductive metrics. If you're curious about a deidentification method, we recommend checking out its full evaluation results in the metareports archive.


Library | Algorithm | Team | # Entries | # Feat. sets | Avg. Feat. Space Size | ε | Utility: SsE | Privacy Leak: UEM
rsynthpop | ipf_NonDP | Rsynthpop-categorical | 1 | 1 | 3e+08 |  | 50.0 | 15.82
rsynthpop | catall_NonDP | Rsynthpop-categorical | 1 | 1 | 2e+08 |  | 50.0 | 63.37
subsample_40pcnt | subsample_40pcnt | CRC | 15 | 5 | 4e+25 |  | 40.67 | 39.93
rsynthpop | cart | CRC | 12 | 4 | 3e+20 |  | 40.0 | 16.14
AI-Fairness | smote | UT Dallas DSPL | 9 | 1 | 2e+26 |  | 40.0 | 17.51
sdcmicro | pram | CRC | 12 | 3 | 1e+11 |  | 38.33 | 56.27
Anonos Data Embassy SDK | Anonos Data Embassy SDK | Anonos | 3 | 1 | 2e+26 |  | 30.0 | 0.01
MostlyAI SD | MostlyAI SD | MOSTLY AI | 6 | 1 | 2e+26 |  | 30.0 | 0.01
aindo-synth | aindo-synth | Aindo | 3 | 1 | 2e+26 |  | 30.0 | 0.01
smartnoise-synth | aim | CRC | 48 | 6 | 4e+25 | 1 | 25.0 | 9.12
smartnoise-synth | aim | CRC | 48 | 6 | 4e+25 | 5 | 25.0 | 9.12
smartnoise-synth | aim | CRC | 48 | 6 | 4e+25 | 10 | 25.0 | 9.12
rsynthpop | catall | Rsynthpop-categorical | 6 | 1 | 2e+08 | 1 | 22.33 | 47.24
rsynthpop | catall | Rsynthpop-categorical | 6 | 1 | 2e+08 | 10 | 22.33 | 47.24
rsynthpop | catall | Rsynthpop-categorical | 6 | 1 | 2e+08 | 100 | 22.33 | 47.24
rsynthpop | cart | CBS-NL | 3 | 1 | 2e+08 |  | 21.67 | 28.6
tumult | DPHist | CRC | 5 | 2 | 6e+07 | 1 | 18.8 | 92.14
tumult | DPHist | CRC | 5 | 2 | 6e+07 | 2 | 18.8 | 92.14
tumult | DPHist | CRC | 5 | 2 | 6e+07 | 4 | 18.8 | 92.14
tumult | DPHist | CRC | 5 | 2 | 6e+07 | 10 | 18.8 | 92.14
smartnoise-synth | mst | CRC | 50 | 7 | 3e+25 | 1 | 14.7 | 7.06
smartnoise-synth | mst | CRC | 50 | 7 | 3e+25 | 5 | 14.7 | 7.06
smartnoise-synth | mst | CRC | 50 | 7 | 3e+25 | 10 | 14.7 | 7.06
ydata-sdk | YData Fabric Synthesizers | YData | 33 | 4 | 1e+26 |  | 11.85 | 9.7
Genetic SD | Genetic SD | DataEvolution | 19 | 2 | 9e+25 | 1 | 11.84 | 0.11
Genetic SD | Genetic SD | DataEvolution | 19 | 2 | 9e+25 | 10 | 11.84 | 0.11
rsynthpop | ipf | CRC | 10 | 1 | 2e+08 | 1 | 11.3 | 10.47
rsynthpop | ipf | CRC | 10 | 1 | 2e+08 | 2 | 11.3 | 10.47
rsynthpop | ipf | CRC | 10 | 1 | 2e+08 | 10 | 11.3 | 10.47
rsynthpop | ipf | CRC | 10 | 1 | 2e+08 | 100 | 11.3 | 10.47
LostInTheNoise | MWEM+PGM | LostInTheNoise | 1 | 1 | 5e+26 | 1 | 10.0 | 0.0
synthcity | bayesian_network | CRC | 12 | 4 | 6e+25 |  | 7.17 | 17.86
ydata-synthetic | ctgan | DCAI Community | 1 | 1 | 6e+14 |  | 5.0 | 0.33
subsample_5pcnt | subsample_5pcnt | CRC | 4 | 4 | 1e+26 |  | 5.0 | 4.97
Sarus SDG | Sarus SDG | Sarus | 1 | 1 | 2e+08 | 10 | 5.0 | 13.99
sdv | ctgan | CBS-NL | 6 | 1 | 2e+26 |  | 4.33 | 0.0
synthcity | privbayes | CRC | 18 | 3 | 1e+11 | 1 | 4.33 | 4.75
synthcity | privbayes | CRC | 18 | 3 | 1e+11 | 10 | 4.33 | 4.75
smartnoise-synth | mwem | CRC | 5 | 5 | 2e+11 | 10 | 4.2 | 3.52
sdv | tvae | CRC | 13 | 4 | 6e+25 |  | 3.15 | 5.76
sdv | fastml | CRC | 4 | 2 | 9e+24 |  | 3.0 | 1.15
synthcity | tvae | CRC | 12 | 4 | 6e+25 |  | 3.0 | 4.69
rsynthpop | ipf | Rsynthpop-categorical | 2 | 1 | 2e+08 | 1 | 3.0 | 7.1
rsynthpop | ipf | Rsynthpop-categorical | 2 | 1 | 2e+08 | 2 | 3.0 | 7.1
sdcmicro | kanonymity | CRC | 21 | 3 | 1e+11 |  | 2.67 | 23.41
smartnoise-synth | pacsynth | CRC | 28 | 5 | 2e+11 | 1 | 2.39 | 5.23
smartnoise-synth | pacsynth | CRC | 28 | 5 | 2e+11 | 5 | 2.39 | 5.23
smartnoise-synth | pacsynth | CRC | 28 | 5 | 2e+11 | 10 | 2.39 | 5.23
smartnoise-synth | patectgan | CRC | 41 | 7 | 3e+25 | 1 | 2.17 | 3.64
smartnoise-synth | patectgan | CRC | 41 | 7 | 3e+25 | 5 | 2.17 | 3.64
smartnoise-synth | patectgan | CRC | 41 | 7 | 3e+25 | 10 | 2.17 | 3.64
synthcity | pategan | CRC | 24 | 4 | 6e+25 | 1 | 1.5 | 2.29
synthcity | pategan | CRC | 24 | 4 | 6e+25 | 10 | 1.5 | 2.29
synthcity | adsgan | CCAIM | 1 | 1 | 5e+26 |  | 1.0 | 0.0
sdv | copula-gan | CRC | 1 | 1 | 3e+24 |  | 1.0 | 0.0
sdv | ctgan | CRC | 6 | 1 | 5e+26 |  | 1.0 | 0.0
synthcity | dpgan | CCAIM | 1 | 1 | 5e+26 | 1 | 1.0 | 0.0
sdv | gaussian-copula | Blizzard Wizard | 2 | 1 | 5e+11 |  | 1.0 | 0.0
synthcity | pategan | CCAIM | 1 | 1 | 5e+26 | 1 | 1.0 | 0.0
sdv | gaussian-copula | CommunityData | 1 | 1 | 2e+08 |  | 1.0 | 0.52
subsample_1pcnt | subsample_1pcnt | CRC | 4 | 4 | 1e+26 |  | 1.0 | 0.98
synthcity | adsgan | CRC | 12 | 4 | 6e+25 |  | 1.0 | 1.54

Meta-analysis Tutorial Notebooks

Version 1.1 of the CRC Research Acceleration Bundle includes a notebooks directory containing a complete set of tools to make the archive widely accessible for programmatic navigation and analysis.

Notebook Utility List (libs)

The notebooks folder includes a library of utilities to assist with navigating the deidentified data archive (Data and Metrics Bundle):

  • index.csv file with metadata (library, algorithm type, privacy type, research paper DOI, parameter settings, etc.) on every technique and deidentified data sample in the archive.
  • Utility to provide easy access to the index.csv file, including a help function with metadata definitions.
  • Utility to assist navigation through metrics available in report.json files.
  • Utility to assist navigation through the archive file hierarchy.
  • Utility to easily generate bar charts and scatterplots with configurable colored highlighting by deid technique metadata.
  • Utility to make human-readable display labels for deid samples.
  • Utility to configure and display collections of metric visualizations from the SDNist reports.
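A sketch of how the index might be used, assuming hypothetical column names (library, algorithm, epsilon) that may not match the real index.csv exactly:

```python
import pandas as pd

# Stand-in for a few rows of the archive's index.csv; the actual column
# names and values may differ from these assumed ones.
index = pd.DataFrame({
    "library":   ["smartnoise-synth", "smartnoise-synth", "rsynthpop"],
    "algorithm": ["mst", "patectgan", "cart"],
    "epsilon":   [10.0, 10.0, None],
    "deid_path": ["a.csv", "b.csv", "c.csv"],
})

# Select every differentially private sample from one library.
dp_runs = index[(index["library"] == "smartnoise-synth") & index["epsilon"].notna()]
print(dp_runs["algorithm"].tolist())
```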

For newcomers to Python/pandas:

Welcome! We also teach everything you need to know about pandas dataframes:

  • Reading data from csv file, creating new data frames
  • Displaying excerpts of data frames
  • Adding new columns, updating and operating on columns
  • Filtering rows, iterating over rows
  • Sorting rows based on selected columns
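Those basics look roughly like this in pandas, shown on a tiny inline CSV rather than the archive's files:

```python
import io

import pandas as pd

# Reading data from a csv file (an in-memory one, for this sketch).
csv = io.StringIO("PUMA,AGEP\n25-01,30\n25-02,62\n25-01,45\n")
df = pd.read_csv(csv)

print(df.head(2))                   # displaying an excerpt of a data frame

df["OVER_40"] = df["AGEP"] > 40     # adding a new column
older = df[df["OVER_40"]]           # filtering rows
older = older.sort_values("AGEP", ascending=False)  # sorting rows
print(older["AGEP"].tolist())
```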

Introduction Tutorial:

We teach all the basics for performing meta-analysis on the deidentified data archive:

  1. Set up the notebook.
  2. Load the deid datasets index file (index.csv).
  3. Select specific deid datasets from the index dataframe.
  4. Work with the deidentified data csv files.
  5. Work with the target data csv files.
  6. Compare target and deid datasets.
  7. Use index.csv to highlight plots by algorithm properties.
  8. Access SDNist evaluation reports.
  9. Show the relationship between two evaluation metrics.
  10. Identify specific data samples of interest.
  11. Show images from SDNist evaluation reports.
  12. Get evaluation metrics for specific samples of interest.
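Steps 4 through 6 in miniature, using invented toy rows in place of the archive's csv files:

```python
import pandas as pd

# Toy stand-ins for one target file and one deid sample.
target = pd.DataFrame({"SEX": [1, 1, 2, 2, 2]})
deid   = pd.DataFrame({"SEX": [1, 2, 2, 2, 2]})

# Compare the two datasets on a one-way marginal.
t = target["SEX"].value_counts(normalize=True).sort_index()
d = deid["SEX"].value_counts(normalize=True).sort_index()
tvd = (t - d).abs().sum() / 2   # total variation distance on this marginal
print(f"TVD on SEX: {tvd:.2f}")
```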

Quickstart Example Jupyter Notebooks:

For folks more familiar with pandas and Jupyter notebooks, we include quickstart example notebooks demonstrating common tasks.

Example Notebook 1: Analyzing k-marginal score of deidentified datasets.

We demonstrate using a notebook to collect utility scores from the report.json files for all of the deidentified data submissions in the archive. We use the CRC plotting utility to display these scores as bar plots based on different algorithm properties.
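A rough sketch of the collection step, shown on temporary files; the directory layout and the k_marginal key below are assumptions about the report format, not its exact schema:

```python
import json
import pathlib
import tempfile

# Fabricate a tiny archive: one report.json per deid submission directory.
root = pathlib.Path(tempfile.mkdtemp())
for name, score in [("run1", 880), ("run2", 640)]:
    (root / name).mkdir()
    (root / name / "report.json").write_text(
        json.dumps({"k_marginal": {"score": score}}))

# Walk every report.json and collect its k-marginal score.
scores = {p.parent.name: json.loads(p.read_text())["k_marginal"]["score"]
          for p in root.glob("*/report.json")}
print(scores)
```

The real notebook feeds a dataframe of such scores into the CRC plotting utility to produce the bar plots.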

Notebook Link


Example Notebook 2: Imposter plot (propensity scores and inconsistencies).

We show how to collect the counts of individuals in the 100% confidence propensity bin (obviously synthetic records, 'imposters') across all deidentified data submissions. We then provide a scatterplot comparing imposter count with inconsistency count. This notebook demonstrates accessing metrics in report.json files and metric result .csv files, and using the CRC plotting utility to make highlighted scatterplots.
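A sketch of the counting step, with invented toy numbers standing in for the metric result csv files:

```python
import pandas as pd

# Toy propensity-bin counts for two hypothetical deid samples; the real
# numbers come from each submission's metric result csv files.
propensity_bins = pd.DataFrame({
    "sample": ["mst_e10", "ctgan", "ctgan"],
    "bin":    [100, 100, 90],
    "count":  [3, 41, 12],
})

# Imposters: records landing in the 100%-confidence propensity bin.
imposters = (propensity_bins[propensity_bins["bin"] == 100]
             .groupby("sample")["count"].sum())
print(imposters.to_dict())
```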

Notebook Link


Example Notebook 3: Race distribution metric

We show how to check deidentification impact on race distribution by directly counting individuals in the target and deidentified data csv files. We filter the index.csv metadata to print a data frame containing this score alongside relevant algorithm properties.
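A sketch of that direct-count comparison on toy data; RAC1P is the ACS race feature, but the rows below are invented for the example:

```python
import pandas as pd

# Toy stand-ins for the target and one deid sample.
target = pd.DataFrame({"RAC1P": [1, 1, 2, 6, 1]})
deid   = pd.DataFrame({"RAC1P": [1, 1, 1, 1, 2]})

# Compare the race distributions by direct counting; missing categories in
# either dataset are treated as zero.
t = target["RAC1P"].value_counts(normalize=True)
d = deid["RAC1P"].value_counts(normalize=True)
drift = t.subtract(d, fill_value=0).abs()
print(drift.sort_index().round(2).to_dict())
```

Here the deid sample has dropped category 6 entirely, which shows up as nonzero drift for that group.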

Notebook Link


Example Notebook 4: Privacy utility trade-off

We use the data and metrics archive to empirically explore the classic concept of the “privacy/utility trade-off curve”. This notebook collects utility and privacy metrics from the report.json files and uses the CRC plotting tool to produce scatterplots highlighted by technique properties.
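The gathering step might look like this; the (SsE, UEM) pairs below are copied from the summary table above, and the plotting call is left as a comment:

```python
import pandas as pd

# A few (utility, privacy-leak) pairs from the Algorithm Summary Table,
# arranged the way the notebook arranges them before scatterplotting.
runs = pd.DataFrame({
    "algorithm": ["cart", "pram", "aim", "mst", "DPHist"],
    "sse": [40.0, 38.33, 25.0, 14.7, 18.8],
    "uem": [16.14, 56.27, 9.12, 7.06, 92.14],
})
# e.g. with matplotlib: runs.plot.scatter(x="uem", y="sse")

# A crude single-number summary: the run with the largest utility-minus-leak gap.
best = runs.loc[(runs["sse"] - runs["uem"]).idxmax(), "algorithm"]
print(best)
```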

Notebook Link


Pair-wise PCA Inspection Tool

Pairwise PCA is a relatively new visualization metric that was introduced by the IPUMS International team during the HLG-MOS Synthetic Data Test Drive. It lets us look at the high-dimensional data distribution using a set of 2D scatterplots along principal component axes. The plots look at the deidentified data and target data from the same angle (i.e., using axes from the target data), so we can directly see where their distributions differ from each other.
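The core idea can be sketched in a few lines of NumPy: fit the principal axes on the target only, then project both datasets onto those shared axes (synthetic random data here, standing in for real records):

```python
import numpy as np

# Synthetic stand-ins: a "target" and a noisier "deidentified" version.
rng = np.random.default_rng(0)
target = rng.normal(size=(200, 4))
deid = target + rng.normal(scale=0.5, size=(200, 4))

# Principal axes come from the TARGET data only.
centered = target - target.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)  # rows of vt = axes
axes = vt[:2]                                # top two principal components

# Project BOTH datasets onto the same target-derived axes, so any
# difference in the deid distribution is visible from the target's angle.
t2 = centered @ axes.T
d2 = (deid - target.mean(axis=0)) @ axes.T
print(t2.shape, d2.shape)   # each (n_records, 2), ready for a scatterplot
```

The tool repeats this for each pair of components to build the grid of pairwise plots.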

The pairwise PCA tool lets you interactively explore these plots through a GUI. You can install it by following the directions here.

We provide a brief usage guide below.


Step 1: Select your deidentified data

Select a deidentified data file and the appropriate target data file (MA, TX, or National). Select a feature set. The tool will let you know whether your data contains all the features for that feature set; it's fine if a few features are missing, as the tool will only use the ones that are present.



Step 2: Explore the plot




Step 3: Highlight the plot

In the features list you can click on a feature to highlight the plots. If you click on a specific value, only the selected group will be highlighted. You can check a box if you'd like to highlight all pair plots (this may run more slowly). You can take a screenshot using the screenshot button; it will be stored in the PCA tool's screenshots folder.




Step 4: Check out the component definitions

In the component definitions tab, you'll find the definitions for all five component axes used to make the plots. You can see which features had high or low weight (influence) and whether that influence was in the positive or negative direction. Try highlighting a plot by its high weight features and see what happens! How does that compare to the low weight features?