Office Hours And Recent Results
In this project we’re working together to develop a more fundamental understanding of how deidentification algorithms behave on diverse human data. If you want to see how that’s going as the project progresses, here’s where to look. Below you’ll find our own research talks, presentations from invited subject matter experts, and summaries of what we’ve learned from all of our participants during office hours.
Contents:
Leaderboards
Research Talks and Recent Evaluation Results
- A sampling from the Data and Metrics Archive
- CRC Kick-off talk at Privacy-Preserving Artificial Intelligence workshop (PPAI-23)
Tutorials
Office Hours Discussion Summaries
Invited Subject Matter Expert Seminars
- TBA
Leaderboards
Algorithm Summary Table:
Summary of selected deidentification algorithms. Unique Exact Match (UEM) is a simple privacy metric that counts the percentage of singleton records in the target that are also present in the deidentified data; these uniquely identifiable individuals leaked through the deidentification process. The Subsample Equivalent (SsE) utility metric uses an analogy between deidentification error and sampling error to communicate utility; a score of 5% indicates the edit distance between the target and deidentified data distributions is similar to the sampling error induced by randomly discarding 95% of the data. Edit distance is based on the k-marginal metric for sparse distributions.
Library | Algorithm | Team | #Entries | #Feature sets | Avg. Feat. Space Size | ε | Utility: SsE | Privacy Leak: UEM |
---|---|---|---|---|---|---|---|---|
rsynthpop | ipf_NonDP | Rsynthpop-categorical | 1 | 1 | 3e+08 | | 50.0 | 15.82 |
rsynthpop | catall_NonDP | Rsynthpop-categorical | 1 | 1 | 2e+08 | | 50.0 | 63.37 |
subsample_40pcnt | subsample_40pcnt | CRC | 15 | 5 | 4e+25 | | 40.67 | 39.93 |
rsynthpop | cart | CRC | 12 | 4 | 3e+20 | | 40.0 | 16.14 |
AI-Fairness | smote | UT Dallas DSPL | 9 | 1 | 2e+26 | | 40.0 | 17.51 |
sdcmicro | pram | CRC | 12 | 3 | 1e+11 | | 38.33 | 56.27 |
Anonos SDK | Anonos SDK | Anonos | 3 | 1 | 2e+26 | | 30.0 | 0.01 |
MostlyAI SD | MostlyAI SD | MOSTLY AI | 6 | 1 | 2e+26 | | 30.0 | 0.01 |
smartnoise-synth | aim | CRC | 45 | 5 | 4e+25 | 1, 5, 10 | 23.33 | 7.05 |
rsynthpop | catall | Rsynthpop-categorical | 6 | 1 | 2e+08 | 1, 10, 100 | 22.33 | 47.24 |
rsynthpop | cart | CBS-NL | 3 | 1 | 2e+08 | | 21.67 | 28.6 |
tumult | DPHist | CRC | 5 | 2 | 6e+07 | 1, 2, 4, 10 | 18.8 | 92.14 |
smartnoise-synth | mst | CRC | 50 | 7 | 3e+25 | 1, 5, 10 | 14.7 | 7.06 |
ydata-sdk | YData Fabric | YData | 33 | 4 | 1e+26 | | 11.85 | 9.7 |
Genetic SD | Genetic SD | DataEvolution | 19 | 2 | 9e+25 | 1, 10 | 11.84 | 0.11 |
rsynthpop | ipf | CRC | 10 | 1 | 2e+08 | 1, 2, 10, 100 | 11.3 | 10.47 |
LostInTheNoise | MWEM+PGM | LostInTheNoise | 1 | 1 | 5e+26 | 1 | 10.0 | 0.0 |
synthcity | bayesian_network | CRC | 12 | 4 | 6e+25 | | 7.17 | 17.86 |
ydata-synthetic | ctgan | DCAI Community | 1 | 1 | 6e+14 | | 5.0 | 0.33 |
subsample_5pcnt | subsample_5pcnt | CRC | 4 | 4 | 1e+26 | | 5.0 | 4.97 |
Sarus SDG | Sarus SDG | Sarus | 1 | 1 | 2e+08 | 10 | 5.0 | 13.99 |
sdv | ctgan | CBS-NL | 6 | 1 | 2e+26 | | 4.33 | 0.0 |
synthcity | privbayes | CRC | 18 | 3 | 1e+11 | 1, 10 | 4.33 | 4.75 |
smartnoise-synth | mwem | CRC | 5 | 5 | 2e+11 | 10 | 4.2 | 3.52 |
sdv | tvae | CRC | 13 | 4 | 6e+25 | | 3.15 | 5.76 |
sdv | fastml | CRC | 4 | 2 | 9e+24 | | 3.0 | 1.15 |
synthcity | tvae | CRC | 12 | 4 | 6e+25 | | 3.0 | 4.69 |
rsynthpop | ipf | Rsynthpop-categorical | 2 | 1 | 2e+08 | 1, 2 | 3.0 | 7.1 |
sdcmicro | kanonymity | CRC | 21 | 3 | 1e+11 | | 2.67 | 23.41 |
smartnoise-synth | pacsynth | CRC | 28 | 5 | 2e+11 | 1, 5, 10 | 2.39 | 5.23 |
smartnoise-synth | patectgan | CRC | 41 | 7 | 3e+25 | 1, 5, 10 | 2.17 | 3.64 |
synthcity | pategan | CRC | 24 | 4 | 6e+25 | 1, 10 | 1.5 | 2.29 |
synthcity | adsgan | CCAIM | 1 | 1 | 5e+26 | | 1.0 | 0.0 |
sdv | copula-gan | CRC | 1 | 1 | 3e+24 | | 1.0 | 0.0 |
sdv | ctgan | CRC | 6 | 1 | 5e+26 | | 1.0 | 0.0 |
synthcity | dpgan | CCAIM | 1 | 1 | 5e+26 | 1 | 1.0 | 0.0 |
sdv | gaussian-copula | Blizzard Wizard | 2 | 1 | 5e+11 | | 1.0 | 0.0 |
synthcity | pategan | CCAIM | 1 | 1 | 5e+26 | 1 | 1.0 | 0.0 |
sdv | gaussian-copula | CommunityData | 1 | 1 | 2e+08 | | 1.0 | 0.52 |
subsample_1pcnt | subsample_1pcnt | CRC | 4 | 4 | 1e+26 | | 1.0 | 0.98 |
synthcity | adsgan | CRC | 12 | 4 | 6e+25 | | 1.0 | 1.54 |
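To make the two metric definitions in the table caption more concrete, here is a minimal Python sketch. It is not the project's evaluation code, and it rests on several assumptions: records are plain tuples, the UEM denominator is the number of singleton target records, and the k-marginal edit distance is replaced by a simple total variation distance over whole records. The function names (`unique_exact_match`, `distribution_distance`, `subsample_equivalent`) and the candidate `fractions` are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def unique_exact_match(target, deid):
    """Percent of singleton target records that reappear verbatim in deid.

    Assumption: the denominator is the number of singleton target records.
    """
    counts = Counter(map(tuple, target))
    singletons = [rec for rec, n in counts.items() if n == 1]
    if not singletons:
        return 0.0
    deid_records = set(map(tuple, deid))
    leaked = sum(1 for rec in singletons if rec in deid_records)
    return 100.0 * leaked / len(singletons)


def distribution_distance(a, b):
    """Total variation distance between the record distributions of a and b.

    Stand-in for the k-marginal edit distance, which is not implemented here.
    Both inputs must be non-empty.
    """
    pa, pb = Counter(map(tuple, a)), Counter(map(tuple, b))
    keys = set(pa) | set(pb)
    return 0.5 * sum(abs(pa[k] / len(a) - pb[k] / len(b)) for k in keys)


def subsample_equivalent(target, deid, fractions=(0.01, 0.05, 0.1, 0.4, 0.9), seed=0):
    """Percent of the target whose random subsample has error closest to deid's."""
    rng = random.Random(seed)
    deid_error = distribution_distance(target, deid)
    best_fraction, best_gap = None, float("inf")
    for fraction in fractions:
        size = max(1, round(fraction * len(target)))
        sample = rng.sample(target, size)  # sampling without replacement
        gap = abs(distribution_distance(target, sample) - deid_error)
        if gap < best_gap:
            best_fraction, best_gap = fraction, gap
    return 100.0 * best_fraction


# Tiny illustrative data set: two duplicate records and two singletons.
target = [(1, "a"), (1, "a"), (2, "b"), (3, "c")]
deid = [(2, "b"), (9, "z")]
uem = unique_exact_match(target, deid)
sse = subsample_equivalent(target, deid)
```

In this toy data the singleton `(2, "b")` survives into the deidentified set while `(3, "c")` does not, so half of the singletons are matched. The subsample comparison uses a single random draw per fraction; a more careful version would average over several draws.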
Research Talks and Recent Evaluation Results
A sampling from the Data and Metrics Archive
Below we provide a selection of deidentified data reports from version 1 of the Data and Metrics Bundle (which can be downloaded in its entirety here). The full archive contains 300 deidentified data samples from a variety of libraries, algorithms, parameter settings and feature sets.
Reports:
- DP Histogram (epsilon 10)
- SmartNoise PACSynth (epsilon 10)
- SmartNoise MST (epsilon 10)
- R Synthpop CART
- MostlyAI Synthetic Data Platform
- Synthetic Data Vault CTGAN
- SynthCity ADSGAN
Posted on: June 19, 2023
CRC Kick-off talk at Privacy-Preserving Artificial Intelligence workshop (PPAI-23)
We launched the CRC program with comparative evaluation results on over a dozen techniques from libraries like OpenDP/SmartNoise, Synthetic Data Vault, R Synthpop, and Tumult Analytics. This deck includes an introduction to the “Research, Engineering, Engagement” cycle, our first round of demonstration evaluations (and early observations), and an illustrated walkthrough of our project resources.
Document Link
By: Christine Task and Karan Bhagat
Posted on: February 13, 2023
Tutorials
Website orientation and data submission tutorial
This tutorial offers a tour of the resources available on the project website. It then explains how you can use these resources to contribute to the first phase of the Collaborative Research Cycle by submitting samples of deidentified data from different privacy techniques. We demonstrate how the data submission process might go for a very cool privacy algorithm, exploring the impact of "parameter [x]".
Video Link
By: Christine Task
Posted on: March 7, 2023
Project overview
This is a re-recording of the same slides that were presented for the initial project kick-off at the PPAI-23 workshop. Feel free to read through the slides to follow along at home; the presentation recording includes additional (audio) information about metric results/analysis and project motivations.
Video Link
By: Christine Task
Posted on: March 7, 2023
Office Hours Discussion Summaries
Office hour on March 13, 2023
Office hours are a time to get together, review evaluations and collaboratively think about the data/algorithms that have been submitted in the past two weeks. For our first office hours we looked at: GAN algorithms from the SynthCity open source library, MostlyAI's Synthetic Data Platform applied to all three benchmark data sets (MA, TX and National), as well as two fascinating differentially private marginal-based approaches contributed by academic researchers. Comparing SynthCity with MostlyAI shows the impact of automated neural network tuning on performance (but very diverse data still presents interesting challenges). Comparing MWEM+PGM with GeneticSD shows how pruning marginals can dramatically decrease noisiness but also potentially increase bias (impacting regression and correlations). Check out the video and the reports for more details.
Video Link
Reports:
- SynthCity: DP-GAN (epsilon 1), PATE-GAN (epsilon 1), ADSGAN (lambda 10)
- MostlyAI: MA (least diverse), TX (moderately diverse), National (very diverse)
- LostInTheNoise: MWEM+PGM (epsilon 1)
- SynCity: GeneticSD (epsilon 10)
Conducted By: Christine Task
Posted on: March 14, 2023
Office hour on March 20, 2023
For this week's office hours we started off by reviewing recent updates to our program schedule: we've added biweekly office hours and will be accepting data submissions through mid-July. Because our goal is to explore how data privacy algorithms behave on diverse data, our team leaderboard rankings are based on how much each team contributes to that exploration, that is, how many algorithms and privatized data sets you've submitted. In addition to contributing new research (always appreciated!), we have a lot of work left to do exploring existing mature techniques and libraries. So this week we offer a tour of the libraries already in our archive. You can help our project (and your own ranking on the leaderboard) by exploring these libraries using different feature sets, target data sets (MA, TX and National) and parameter configurations. Even if you're contributing your own new research too, this is a good way to get a broader picture of how data privacy operates; you may learn something useful comparing these techniques to your own.
Video Link
Privacy Library Tour:
- Synthcity: Location | Installation | Documentation
- SDV: Location | Installation | Documentation
- SmartNoise: Location | Installation | Documentation
- R Synthpop: Location | Installation | Documentation
- Tumult Analytics: Location | Installation | Documentation
Conducted By: Christine Task
Posted on: March 20, 2023
Office hour on April 3, 2023
For our third office hours, we cover a few important topics. Because the SDNist evaluation library is applicable to any privacy approach, we're not limited to synthetic data privacy; we can explore traditional approaches as well. We provide a tour of the sdcmicro Statistical Disclosure Control library and check out two of its approaches: feature-suppression-based k-anonymity and Post Randomization (PRAM). Having looked more at utility in previous office hours, we take a closer look at our own privacy metrics, the Apparent Match metric and the Unique Exact Match (UEM) metric.
Finally, we take a look at two commercial synthetic data products: a differentially private synthetic data sample from Sarus and an updated synthetic data submission from MostlyAI.
Video Link
Open-Source Privacy Library Tour:
- sdcmicro R: Location | Installation | Documentation | GUI App