Contents:

Benchmark Data

Deidentified Data Archive

Deidentification Algorithm Summary Table

Meta-analysis Tutorial Notebooks

Benchmark Data

The Collaborative Research Cycle uses the NIST Excerpts Benchmark Data serve as the target data for this program. All deidentification techniques in our directory have been run on this input data, and the resulting examples of deidentified data are available in the archive below.

The ACS Data Excerpts includes three benchmark datasets – the Massachusetts data is from north of Boston, the Texas data is from near Dallas, and the National data is a collection of communities from around the nation. The data is derived from the 2019 American Community Survey; the 24 features in the complete scheme were chosen because they capture many of the complexities from real-world data, while still being small, and simple enough to make more formal analysis feasible. The data folder includes lovely postcard documentation about the communities and a JSON data dictionary to make it easy to configure your privacy technique. The usage guidance section in the readme has helpful configuration hints (watch out for 'N').

Additionally data from 2018 has been provided as a control; this may be useful for configuring differentially private algorithms or calibrating privacy metrics. The 2018 data covers the same schema (features and geography) but does not share any individuals with the 2019 data.


Deidentified Data Archives

Download the Research Acceleration Bundle and explore! This archive of deidentified data samples and evaluation metric results provides a broad, representative sample of data deidentification as a whole.

The target data for this project are the ACS Data Excerpts , curated data drawn from the American Community Survey. The archive is comprised of deidentified versions of the Excerpts data as generated by a wide variety of deidentification algorithms, libraries and privacy definitions. Check out our Algorithm Summary Table for a high level glimpse of the current archive contents.


Deidentification Algorithm Summary Table

This table provides a very high level summary of the deidentification algorithms in our archive. Unique Exact Match (UEM) is a simple privacy metric that counts the percentage of singleton records in the target that are also present in the deidentified data; these uniquely identifiable individuals leaked through the deidentification process. The Subsample Equivalent (SsE) utility metric uses an analogy between deidentification error and sampling error to communicate utility; a score of 5% indicates the edit distance between the target and deidentified data distributions is similar to the sampling error induced by randomly discarding 95% of the data. Edit distance is based on the k-marginal metric for sparse distributions.

Note that this isn't a leaderboard--- you can select any column in the dropdown menu and reorder the table according to that column. Algorithms with high utility (high SsE) may have a lot of privacy leakage (high UEM), algorithms with low privacy leakage (low UEM) may have poor utility (low SsE). Algorithms that have only been run on small subsets of the schema may perform differently on larger feature spaces (Avg Feat. Space Size). And, in general, SsE and UEM are very simple, reductive metrics. If you're curious about a deidentification method, we recommend checking out its full evaluation results in the metareports archive


Select Column for Sorting

DESC
Library Algorithm Team # Entries # Feat. sets Avg. Feat. Space Size ε values Utility: SSE Privacy Leak: UEM
rsynthpop catall Rsynthpop-categorical 2 1 2e+08 100.0 65.0 81.33
rsynthpop ipf_NonDP Rsynthpop-categorical 1 1 3e+08 50.0 15.82
rsynthpop catall_NonDP Rsynthpop-categorical 1 1 2e+08 50.0 63.37
tumult DPHist CRC 2 2 6e+07 10.0 45.5 99.91
subsample subsample_40pcnt CRC 15 5 4e+25 40.67 39.93
rsynthpop cart CRC 12 4 3e+20 40.0 16.14
UTDallas-AIFairness smote UT Dallas DSPL 9 1 2e+26 40.0 17.51
sdcmicro pram CRC 12 3 1e+11 38.33 56.27
smartnoise-synth aim CRC 16 6 4e+25 10.0 34.69 10.38
Anonos Data Embassy SDK Anonos Data Embassy SDK Anonos 3 1 2e+26 30.0 0.01
MostlyAI SD MostlyAI SD MOSTLY AI 6 1 2e+26 30.0 0.01
aindo-synth aindo-synth Aindo 3 1 2e+26 30.0 0.01
smartnoise-synth aim CRC 16 6 4e+25 5.0 28.12 9.77
rsynthpop cart CBS-NL 3 1 2e+08 21.67 28.6
smartnoise-synth mst CRC 16 6 2e+19 10.0 19.69 7.89
rsynthpop ipf CRC 3 1 2e+08 100.0 18.33 16.97
Genetic SD Genetic SD DataEvolution 10 2 9e+25 10.0 17.5 0.18
rsynthpop ipf CRC 3 1 2e+08 10.0 16.67 14.29
smartnoise-synth mst CRC 18 6 4e+25 5.0 14.44 7.42
smartnoise-synth aim CRC 16 6 4e+25 1.0 12.19 7.22
ydata-sdk YData Fabric Synthesizers YData 33 4 1e+26 11.85 9.7
LostInTheNoise MWEM+PGM LostInTheNoise 1 1 5e+26 1.0 10.0 0.0
smartnoise-synth mst CRC 16 6 2e+19 1.0 10.0 5.81
synthcity bayesian_network CRC 12 4 6e+25 7.17 17.86
Genetic SD Genetic SD DataEvolution 9 2 9e+25 1.0 5.56 0.04
ydata-synthetic ctgan DCAI Community 1 1 6e+14 5.0 0.33
subsample subsample_5pcnt CRC 4 4 1e+26 5.0 4.97
rsynthpop ipf Rsynthpop-categorical 1 1 2e+08 2.0 5.0 10.68
Sarus SDG Sarus SDG Sarus 1 1 2e+08 10.0 5.0 13.99
synthcity privbayes CRC 9 3 1e+11 1.0 4.89 5.17
sdv ctgan CBS-NL 6 1 2e+26 4.33 0.0
smartnoise-synth mwem CRC 5 5 2e+11 10.0 4.2 3.52
synthcity privbayes CRC 9 3 1e+11 10.0 3.78 4.32
smartnoise-synth pacsynth CRC 9 3 1e+11 10.0 3.44 8.9
sdv tvae CRC 13 4 6e+25 3.15 5.76
sdv fastml CRC 4 2 9e+24 3.0 1.15
synthcity tvae CRC 12 4 6e+25 3.0 4.69
smartnoise-synth patectgan CRC 12 4 2e+19 10.0 3.0 5.66
sdcmicro kanonymity CRC 21 3 1e+11 2.67 23.41
rsynthpop ipf CRC 3 1 2e+08 2.0 2.33 3.05
smartnoise-synth patectgan CRC 18 6 4e+25 5.0 2.33 4.49
smartnoise-synth pacsynth CRC 15 5 2e+11 5.0 2.13 4.32
synthcity pategan CRC 12 4 6e+25 10.0 2.0 2.59
synthcity adsgan CCAIM 1 1 5e+26 1.0 0.0
sdv copula-gan CRC 1 1 3e+24 1.0 0.0
sdv ctgan CRC 6 1 5e+26 1.0 0.0
synthcity dpgan CCAIM 1 1 5e+26 1.0 1.0 0.0
sdv gaussian-copula Blizzard Wizard 2 1 5e+11 1.0 0.0
synthcity pategan CCAIM 1 1 5e+26 1.0 1.0 0.0
rsynthpop catall Rsynthpop-categorical 2 1 2e+08 1.0 1.0 0.01
smartnoise-synth patectgan CRC 11 4 2e+19 1.0 1.0 0.02
smartnoise-synth pacsynth CRC 4 2 1e+08 1.0 1.0 0.38
sdv gaussian-copula CommunityData 1 1 2e+08 1.0 0.52
subsample subsample_1pcnt CRC 4 4 1e+26 1.0 0.98
synthcity adsgan CRC 12 4 6e+25 1.0 1.54
rsynthpop ipf CRC 1 1 2e+08 1.0 1.0 1.72
synthcity pategan CRC 12 4 6e+25 1.0 1.0 1.99
rsynthpop ipf Rsynthpop-categorical 1 1 2e+08 1.0 1.0 3.53
rsynthpop catall Rsynthpop-categorical 2 1 2e+08 10.0 1.0 60.39
tumult DPHist CRC 1 1 1e+06 1.0 1.0 74.47
tumult DPHist CRC 1 1 1e+06 2.0 1.0 88.39
tumult DPHist CRC 1 1 1e+06 4.0 1.0 98.05

Tutorial notebooks

Included in the version 1.1 of the CRC Research Acceleration Bundle, we provide a notebooks directory containing a complete set of tools to make the archive widely accessible for programmatic navigation and analysis.

Notebook Utility List (libs)

The notebooks folder includes a library of utilities to assist with navigating the deidentified data archive (Data and Metrics Bundle):

  • Index.csv file with metadata (library, algorithm type, privacy type, research paper doi, parameter settings, etc) on every technique and deidentified data sample in the archive.
  • Utility to provide easy access to the index.csv file, including help function with metadata definitions.
  • Utility to assist navigation through metrics available in report.json files. Utility to assist navigation through archive file hierarchy.
  • Utility to easily generate bar charts and scatterplots with configurable colored highlighting by deid technique metadata. Utility to make human-readable display labels for deid samples
  • Utility to configure and display collections of metric visualizations from the SDNist reports

For newcomers to python/pandas:

Welcome! We also teach everything you need to know about pandas dataframes:

  • Reading data from csv file, creating new data frames
  • Displaying excerpts of data frames
  • Adding new columns, updating and operating on columns
  • Filtering rows, iterating over rows
  • Sorting rows based on selected columns

Introduction Tutorial:

We teach all the basics for performing meta-analysis on the deidentified data archive:

  1. Setup notebook.
  2. Load deid datasets index file (index.csv).
  3. Select specific deid. datasets from the index dataframe.
  4. Working with the deidentified data csv files.
  5. Working with the target data csv files.
  6. Compare target and deid datasets.
  7. Use index.csv to highlight plots by algorithm properties.
  8. Access SDNIST evaluation reports.
  9. Show relationship between two evaluation metrics.
  10. Identify specific data samples of interest.
  11. Show images from SDNist evaluation reports.
  12. Get evaluation metrics for specific samples of interest.