Contents:

Leaderboards

Research Talks and Recent Evaluation Results

Tutorials

Office Hours Discussion Summaries

Invited Subject Matter Expert Seminars

  • TBA

Leaderboards

Algorithm Summary Table:

Summary of selected deidentification algorithms. Unique Exact Match (UEM) is a simple privacy metric: it counts the percentage of singleton records in the target data that also appear in the deidentified data; these are uniquely identifiable individuals whose records leaked through the deidentification process. The Subsample Equivalent (SsE) utility metric uses an analogy between deidentification error and sampling error to communicate utility: a score of 5% indicates that the edit distance between the target and deidentified data distributions is similar to the sampling error induced by randomly discarding 95% of the data. Edit distance is based on the k-marginal metric for sparse distributions.
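The UEM computation described above can be sketched in a few lines. This is an illustrative reading of the definition, not SDNist's implementation; in particular, normalizing by the count of target singletons is an assumption (the official metric's exact denominator may differ).

```python
from collections import Counter

def unique_exact_match(target, deid):
    """Sketch of Unique Exact Match (UEM): the percentage of singleton
    records in the target data that also appear verbatim in the
    deidentified data. Records are tuples of feature values.
    NOTE: normalizing by the number of target singletons is an
    assumption made for this illustration.
    """
    counts = Counter(target)
    singletons = [r for r, c in counts.items() if c == 1]
    if not singletons:
        return 0.0
    deid_set = set(deid)
    leaked = sum(1 for r in singletons if r in deid_set)
    return 100.0 * leaked / len(singletons)

target = [("30-40", "F", "PhD"), ("30-40", "F", "PhD"), ("50-60", "M", "BA")]
deid   = [("50-60", "M", "BA"), ("30-40", "F", "MA")]
# The only target singleton, ("50-60", "M", "BA"), appears in deid -> 100.0
print(unique_exact_match(target, deid))
```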

Library | Algorithm | Team | #Entries | #Feature sets | Avg. Feat. Space Size | ε | Utility: SsE | Privacy Leak: UEM
rsynthpop | ipf_NonDP | Rsynthpop-categorical | 1 | 1 | 3e+08 | | 50.0 | 15.82
rsynthpop | catall_NonDP | Rsynthpop-categorical | 1 | 1 | 2e+08 | | 50.0 | 63.37
subsample_40pcnt | subsample_40pcnt | CRC | 15 | 5 | 4e+25 | | 40.67 | 39.93
rsynthpop | cart | CRC | 12 | 4 | 3e+20 | | 40.0 | 16.14
AI-Fairness | smote | UT Dallas DSPL | 9 | 1 | 2e+26 | | 40.0 | 17.51
sdcmicro | pram | CRC | 12 | 3 | 1e+11 | | 38.33 | 56.27
Anonos SDK | Anonos SDK | Anonos | 3 | 1 | 2e+26 | | 30.0 | 0.01
MostlyAI SD | MostlyAI SD | MOSTLY AI | 6 | 1 | 2e+26 | | 30.0 | 0.01
smartnoise-synth | aim | CRC | 45 | 5 | 4e+25 | 1, 5, 10 | 23.33 | 7.05
rsynthpop | catall | Rsynthpop-categorical | 6 | 1 | 2e+08 | 1, 10, 100 | 22.33 | 47.24
rsynthpop | cart | CBS-NL | 3 | 1 | 2e+08 | | 21.67 | 28.6
tumult | DPHist | CRC | 5 | 2 | 6e+07 | 1, 2, 4, 10 | 18.8 | 92.14
smartnoise-synth | mst | CRC | 50 | 7 | 3e+25 | 1, 5, 10 | 14.7 | 7.06
ydata-sdk | YData Fabric | YData | 33 | 4 | 1e+26 | | 11.85 | 9.7
Genetic SD | Genetic SD | DataEvolution | 19 | 2 | 9e+25 | 1, 10 | 11.84 | 0.11
rsynthpop | ipf | CRC | 10 | 1 | 2e+08 | 1, 2, 10, 100 | 11.3 | 10.47
LostInTheNoise | MWEM+PGM | LostInTheNoise | 1 | 1 | 5e+26 | 1 | 10.0 | 0.0
synthcity | bayesian_network | CRC | 12 | 4 | 6e+25 | | 7.17 | 17.86
ydata-synthetic | ctgan | DCAI Community | 1 | 1 | 6e+14 | | 5.0 | 0.33
subsample_5pcnt | subsample_5pcnt | CRC | 4 | 4 | 1e+26 | | 5.0 | 4.97
Sarus SDG | Sarus SDG | Sarus | 1 | 1 | 2e+08 | 10 | 5.0 | 13.99
sdv | ctgan | CBS-NL | 6 | 1 | 2e+26 | | 4.33 | 0.0
synthcity | privbayes | CRC | 18 | 3 | 1e+11 | 1, 10 | 4.33 | 4.75
smartnoise-synth | mwem | CRC | 5 | 5 | 2e+11 | 10 | 4.2 | 3.52
sdv | tvae | CRC | 13 | 4 | 6e+25 | | 3.15 | 5.76
sdv | fastml | CRC | 4 | 2 | 9e+24 | | 3.0 | 1.15
synthcity | tvae | CRC | 12 | 4 | 6e+25 | | 3.0 | 4.69
rsynthpop | ipf | Rsynthpop-categorical | 2 | 1 | 2e+08 | 1, 2 | 3.0 | 7.1
sdcmicro | kanonymity | CRC | 21 | 3 | 1e+11 | | 2.67 | 23.41
smartnoise-synth | pacsynth | CRC | 28 | 5 | 2e+11 | 1, 5, 10 | 2.39 | 5.23
smartnoise-synth | patectgan | CRC | 41 | 7 | 3e+25 | 1, 5, 10 | 2.17 | 3.64
synthcity | pategan | CRC | 24 | 4 | 6e+25 | 1, 10 | 1.5 | 2.29
synthcity | adsgan | CCAIM | 1 | 1 | 5e+26 | | 1.0 | 0.0
sdv | copula-gan | CRC | 1 | 1 | 3e+24 | | 1.0 | 0.0
sdv | ctgan | CRC | 6 | 1 | 5e+26 | | 1.0 | 0.0
synthcity | dpgan | CCAIM | 1 | 1 | 5e+26 | 1 | 1.0 | 0.0
sdv | gaussian-copula | Blizzard Wizard | 2 | 1 | 5e+11 | | 1.0 | 0.0
synthcity | pategan | CCAIM | 1 | 1 | 5e+26 | 1 | 1.0 | 0.0
sdv | gaussian-copula | CommunityData | 1 | 1 | 2e+08 | | 1.0 | 0.52
subsample_1pcnt | subsample_1pcnt | CRC | 4 | 4 | 1e+26 | | 1.0 | 0.98
synthcity | adsgan | CRC | 12 | 4 | 6e+25 | | 1.0 | 1.54
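The k-marginal edit distance underlying the SsE scores in the table can be sketched as follows. This is a simplified illustration, not SDNist's implementation: the choice of k=2, the use of total variation distance per marginal, and the plain average over feature subsets are all assumptions (SDNist's k-marginal metric works on 3-marginals and reports a rescaled 0-1000 score).

```python
from collections import Counter
from itertools import combinations

def k_marginal_distance(target, deid, k=2):
    """Sketch of a k-marginal edit distance: for every k-subset of
    features, compare the empirical k-way marginal distributions of
    target and deidentified data by total variation distance, then
    average over all subsets. Records are tuples of feature values.
    Simplifying assumptions: k=2, TV distance, unweighted average.
    """
    n_feats = len(target[0])
    dists = []
    for subset in combinations(range(n_feats), k):
        t = Counter(tuple(r[i] for i in subset) for r in target)
        d = Counter(tuple(r[i] for i in subset) for r in deid)
        cells = set(t) | set(d)
        tv = 0.5 * sum(abs(t[c] / len(target) - d[c] / len(deid))
                       for c in cells)
        dists.append(tv)
    return sum(dists) / len(dists)

target = [(0, 1, 0), (1, 1, 0), (0, 0, 1), (1, 0, 1)]
print(k_marginal_distance(target, target))  # identical data -> 0.0
```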


Research Talks and Recent Evaluation Results


A sampling from the Data and Metrics Archive

Below we provide a selection of deidentified data reports from version 1 of the Data and Metrics Bundle (which can be downloaded in its entirety here). The full archive contains 300 deidentified data samples from a variety of libraries, algorithms, parameter settings and feature sets.

Reports:

Posted on: June 19, 2023


CRC Kick-off talk at Privacy-Preserving Artificial Intelligence workshop (PPAI-23)

We launched the CRC program with comparative evaluation results on over a dozen techniques from libraries such as OpenDP/SmartNoise, Synthetic Data Vault, R Synthpop, and Tumult Analytics. This deck includes an introduction to the “Research, Engineering, Engagement” cycle, our first round of demonstration evaluations (with early observations), and an illustrated walkthrough of our project resources.

Document Link

By: Christine Task and Karan Bhagat

Posted on: February 13, 2023


Tutorials


Website orientation and data submission tutorial

This tutorial offers a tour of the resources available on the project website. It then explains how you can use these resources to contribute to the first phase of the Collaborative Research Cycle by submitting samples of deidentified data produced by different privacy techniques. We demonstrate how the data submission process might go for a very cool privacy algorithm, exploring the impact of "parameter [x]".

Video Link

By: Christine Task

Posted on: March 7, 2023


Project overview

This is a re-recording of the slides presented at the initial project kick-off at the PPAI-23 workshop. Feel free to read through the slides to follow along at home; the recording includes additional audio discussion of the metric results, their analysis, and the project's motivations.

Video Link

By: Christine Task

Posted on: March 7, 2023


Office Hours Discussion Summaries


Office hour on March 13, 2023

Office hours are a time to get together, review evaluations, and collaboratively think about the data and algorithms submitted in the past two weeks. For our first office hours we looked at: GAN algorithms from the SynthCity open-source library, MostlyAI's Synthetic Data Platform applied to all three benchmark data sets (MA, TX and National), and two fascinating differentially private marginal-based approaches contributed by academic researchers. Comparing SynthCity with MostlyAI shows the impact of automated neural network tuning on performance (though very diverse data still presents interesting challenges). Comparing MWEM+PGM with GeneticSD shows how pruning marginals can dramatically decrease noisiness but also potentially increase bias (impacting regressions and correlations). Check out the video and the reports for more details.
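The noise-versus-bias trade-off in marginal pruning can be seen from how a fixed privacy budget is divided among the marginals an algorithm measures. The sketch below is a deliberate simplification under stated assumptions: an even budget split, Laplace noise, and a sensitivity of 2 are all illustrative choices (real algorithms such as MWEM+PGM and AIM allocate budget adaptively and use tighter accounting), and the function name is hypothetical.

```python
import math

def per_cell_noise_sigma(epsilon, n_marginals, sensitivity=2.0):
    """Standard deviation of Laplace noise added to each marginal cell
    when a total budget epsilon is split evenly across n_marginals
    measured marginals. Laplace(b) has std b*sqrt(2), with scale
    b = sensitivity / (epsilon / n_marginals).
    ASSUMPTIONS: even split, Laplace mechanism, sensitivity=2 --
    chosen for illustration, not taken from any specific algorithm.
    """
    b = sensitivity / (epsilon / n_marginals)
    return b * math.sqrt(2.0)

# Pruning the measured-marginal set leaves more budget per measurement,
# so each measured marginal is less noisy -- but marginals that are never
# measured must be reconstructed from the model, trading noise for bias.
print(per_cell_noise_sigma(1.0, 100))  # many marginals: large sigma
print(per_cell_noise_sigma(1.0, 5))    # pruned set: much smaller sigma
```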

Video Link

Reports:

Conducted By: Christine Task

Posted on: March 14, 2023


Office hour on March 20, 2023

For this week's office hours we started off by reviewing recent updates to our program schedule: we've added biweekly office hours and will be accepting data submissions through mid-July. Because our goal is to explore how data privacy algorithms behave on diverse data, our team leaderboard rankings are based on how much each team contributes to that exploration -- how many algorithms and privatized data sets you've submitted.

In addition to contributing new research (always appreciated!), we have a lot of work left to do exploring existing mature techniques and libraries. So this week we offer a tour of the libraries already in our archive -- you can help our project (and your own ranking on the leaderboard) by exploring these libraries with different feature sets, target data sets (MA, TX and National), and parameter configurations. Even if you're contributing your own new research, this is a good way to get a broader picture of how data privacy operates -- you may learn something useful by comparing these techniques to your own.

Video Link

Privacy Library Tour:

Conducted By: Christine Task

Posted on: March 20, 2023


Office hour on April 3, 2023

For our third office hours, we cover a few important topics. Because the SDNist evaluation library is applicable to any privacy approach, we're not limited to synthetic data privacy -- we can explore traditional approaches as well. We provide a tour of the sdcmicro Statistical Disclosure Control library and check out two of its approaches: k-anonymity via feature suppression, and Post Randomization (PRAM). Having focused on utility in previous office hours, we take a closer look at our own privacy metrics: the Apparent Match metric and the Unique Exact Match metric.
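The core idea of PRAM is to perturb each categorical value according to a transition matrix. The sketch below uses the simplest symmetric case, where a value is kept with probability `retain` and otherwise swapped uniformly for another category; this is an illustrative assumption, as sdcmicro supports arbitrary (and invariant) transition matrices, and the function name is hypothetical.

```python
import random

def pram(column, categories, retain=0.8, seed=0):
    """Sketch of Post Randomization (PRAM) on one categorical column:
    keep each value with probability `retain`, otherwise replace it
    with a value drawn uniformly from the remaining categories.
    ASSUMPTION: a simple symmetric transition matrix; real PRAM
    implementations accept arbitrary per-category transition matrices.
    """
    rng = random.Random(seed)
    out = []
    for v in column:
        if rng.random() < retain:
            out.append(v)
        else:
            out.append(rng.choice([c for c in categories if c != v]))
    return out

col = ["A", "B", "A", "C", "B", "A"]
print(pram(col, categories=["A", "B", "C"]))
```

Because the transition probabilities are known, an analyst can in principle correct aggregate tabulations of the perturbed column for the randomization, which is what makes PRAM attractive for utility.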

And finally, we take a look at two commercial synthetic data products: A differentially private synthetic data sample from Sarus and an updated synthetic data submission from MostlyAI.

Video Link

Open-Source Privacy Library Tour:

Conducted By: Christine Task

Posted on: April 3, 2023


Invited Subject Matter Expert Seminars