Explanatory Workshop (18 Dec 2023)
Exploratory Phase (February - May 2023)
- Join the CRC list-serv for news and updates. (submit an empty email to subscribe)
- Register your team.
- De-identify the NIST Diverse Communities Data Excerpts
- Optionally, try using the SDNist report tool to analyze your deidentified data. We’ll also create a report for you when you submit your data.
- Watch our introductory video, or learn more about the program here.
- Submit your deidentified data.
- Contribute data by Friday before an office hours session, and we’ll walk through your evaluation results with you during office hours.
- Attend office hours (see timeline). We will send links out to the list-serv and registered teams before the sessions (optional).
- Make data contributions by 9 May 2023 to have your data included in the first release of the research acceleration bundle.
- Join us or watch the recording of our introduction to the Research Acceleration Bundle (19 May 2023).
- Contribute data by 7 Jul 2023 to be included in the second release of the Research Acceleration Bundle.
Explanatory Phase (May - November 2023)
Everyone is welcome to join the explanatory phase; participation in the preceding exploratory phase isn’t necessary.
- Download the first release of the Research Acceleration Bundle and explore!
- Link to the Research Acceleration Bundle repository.
- Direct download link for all deidentified data and individual reports (537 MB).
- Direct download link for all meta-reports comparing techniques with discussion (484 MB).
- For more information on the structure of tiny papers, click here.
- Contributors may also append proofs, data, additional experiments, etc. to their tiny papers if they wish.
Data Submission Walkthrough
Help us explore the behavior of deidentification algorithms on diverse data by submitting deidentified samples of our benchmark data, using privacy techniques that are intended to prepare sensitive private data for public release. For a video tutorial on data submission, click here.
We'll be accepting data submissions continuously throughout the program. Submissions received before May 9th will be included in the first release of the research acceleration bundle, and submissions received before July 9th will be included in the cumulative second release.
Do I need to be a privacy expert? : Nope! We want participants both inside and outside the privacy research community. There are a lot of easy-to-use tools out there aimed at the general public, and we’d like to know how they perform just as much as we’d like to understand recent research innovations.
What's a submission? : To make a deidentified-data submission, first pick a privacy technique and a feature subset. Then, run the privacy technique on the data with the feature subset you chose. You can include multiple files in a data submission to try out different parameter settings on your privacy technique.
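One way to organize a single submission along these lines (one technique, one feature subset, several parameter settings, one output file per setting) is sketched below. The technique name, subset label, and epsilon values are all made up for illustration; substitute your own.

```python
# Hypothetical sketch of organizing one data submission:
# one privacy technique + one feature subset, with several
# parameter settings tried out (a submission may hold up to 10 files).
technique = "laplace_histogram"    # hypothetical technique name
feature_set = "simple-features"    # hypothetical feature-subset label
epsilons = [0.1, 1.0, 10.0]        # example parameter sweep

# One deidentified output file per parameter setting.
submission_files = [
    f"{technique}_{feature_set}_eps{eps}.csv" for eps in epsilons
]
```

Keeping the technique, feature set, and parameter value in each filename makes it easy to match files to the parameter questions on the submission form.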
Can I make more than one submission? : Yes! The more techniques and feature sets you try out, the more we’ll have, and that’s what we want. Your team will be given credit for everything you submit, and you can even attach a team logo to your submission if you like.
What data? : Here is the benchmark data we’d like you to deidentify for us. We have three benchmark datasets – the Massachusetts data is from north of Boston, the Texas data is from near Dallas, and the National data is a collection of very diverse communities from around the nation. The data is derived from the 2019 American Community Survey; the 24 features in the complete schema were chosen because they capture many of the complexities of real-world data while still being small and simple enough to make more formal analysis feasible. The data folder includes lovely postcard documentation about the communities and a JSON data dictionary to make it easy to configure your privacy technique. The usage guidance section in the readme has helpful configuration hints (watch out for ‘N’).
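As a hedged illustration of the configuration step above, the sketch below parses a miniature, made-up stand-in for the JSON data dictionary and maps the ‘N’ sentinel (not applicable) to a missing value. The real dictionary covers all 24 features and its exact layout may differ.

```python
import json

# Hypothetical miniature of the JSON data dictionary shipped with the
# benchmark data; the real file's layout may differ.
data_dict_json = """
{
  "PUMA": {"description": "Public use microdata area code"},
  "AGEP": {"description": "Age", "min": 0, "max": 99},
  "DVET": {"description": "Veteran disability rating",
           "values": ["N", "1", "2", "3", "4", "5", "6"]}
}
"""
data_dict = json.loads(data_dict_json)

def decode_value(raw):
    """Map the 'N' sentinel (not applicable) to None, per the usage guidance."""
    return None if raw == "N" else raw

# Toy records: DVET is 'N' for people who aren't disabled veterans.
records = [{"PUMA": "25-00503", "AGEP": "34", "DVET": "N"},
           {"PUMA": "48-00101", "AGEP": "61", "DVET": "2"}]
cleaned = [{f: decode_value(v) for f, v in rec.items()} for rec in records]
```

Handling ‘N’ explicitly before configuring a privacy technique avoids treating “not applicable” as just another category code.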
What privacy technique? : You can pick one to try from our growing collection on the Techniques page, take a look at our list of open-source libraries for privacy newcomers, or you can contribute a new one we don’t have yet. If you submit a new technique (or a different library implementation of one we already have), then the submission form will prompt you for basic information about it: a short name, a one-sentence description, a full algorithm description (optional), and then any links or references that help document it. You’ll also have the option of adding a picture that people can use to identify the technique at a glance. In a few days, you’ll see the technique you contributed added to our website. Note – you don’t need to be the creator of a privacy technique to submit data that uses it; just be sure to properly cite the source when prompted.
What feature set? : The full schema has 24 features, but you often want to focus on just a subset of those. Some privacy algorithms are designed for smaller feature sets, and algorithm analysis may be more approachable on focused subsets. Of course, it’s also important to see how algorithms behave with larger feature sets; the best performing approach on 9 features might be the worst performing one on 21. Here are some options to try out, and you’re welcome to pick your own subset as well. A single data submission should use a single feature subset.
- All Features: Includes all 24 features
- Simpler Features: Includes 21 features, all except (INDP, WGTP, PWGTP)
- Demographic-Focused Subset:
- Industry-Focused Subset:
- Detailed Industry Subset:
- Family-Focused Subset:
- Small Categorical:
- Tiny Categorical:
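Projecting the data onto a chosen subset can be sketched as follows. The feature names come from the subset descriptions above (the remaining schema columns and the record values are made up for illustration).

```python
# A few of the 24 benchmark feature names (the rest are omitted here);
# INDP, WGTP, and PWGTP are the three dropped by "Simpler Features".
all_features = ["PUMA", "AGEP", "SEX", "RAC1P", "INDP", "WGTP", "PWGTP"]

# "Simpler Features": everything except INDP, WGTP, and PWGTP.
simpler_features = [f for f in all_features if f not in {"INDP", "WGTP", "PWGTP"}]

def project(record, features):
    """Keep only the chosen feature subset in one record."""
    return {f: record[f] for f in features}

# Toy record -- values are invented, not benchmark data.
record = {"PUMA": "48-00101", "AGEP": 42, "SEX": 1, "RAC1P": 1,
          "INDP": 7860, "WGTP": 55, "PWGTP": 57}
subset_record = project(record, simpler_features)
```

A single data submission should apply one such projection consistently across all of its files.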
How do I submit? :
- Have you registered your team? Do that first.
- Prepare your deidentified data samples, with your chosen technique and feature set. Your submission can include multiple data files (up to 10) to explore running the technique with different parameter configurations or target datasets (MA, TX, or National).
- Use the Google Form here to submit. It will ask for your contact info and team name, your technique and feature set, and some questions about how you prepared the submission – what privacy the technique provides and what parameter configurations you explored. You’ll also have the option to upload an image to go with your submission and to provide any additional links or references you think we should have.
- If you’re using a custom technique that satisfies differential privacy, decide if you’d like to Assert Publicly Verifiable Differential Privacy. If so, you’ll need to submit some supplementary material to help others verify that your approach correctly satisfies your stated differential-privacy guarantee.
Publicly Verifiable Differential Privacy - Special Submission Process
In the NIST Synthetic Data Challenges, NIST provided a Subject-Matter-Expert differential-privacy validation process. For the Collaborative Research Cycle, we will be doing something more transparent and collaborative instead. If you would like to assert that your submission technically satisfies Publicly Verifiable Differential Privacy, then read more here.
Should I participate?
- Participation is optional: You are welcome to describe your technique as differentially private in the submission form without participating in the Publicly Verifiable Differential Privacy track, and that’s fine! It just won’t be publicly verified.
- In particular, if you’re using a tool from a well-known differentially private library, and you haven’t added any of your own code, then you don’t need to submit for public verification.
- However, if you’re relatively new to differential privacy, and you’re writing some of your own code, it’s a great idea to submit for verification.
- And if you’ve been in the differential-privacy research community for quite a while, then it’s an even better idea to submit for verification. This will make it easier for other DP researchers to find your work in the archive, and more eyes on source code is almost always a good idea.
What do I need to participate?
- On the submission form, you can check that you’d like to assert Publicly Verifiable Differential Privacy. Then you’ll be prompted to submit a few additional items. Note that all of these items will be shared publicly as part of the public verification process.
- Source Code: You will need to link to a public code repository where your source code can be reviewed.
- Privacy Proof and Code Guide Document: You will need to upload a written document that (1) includes a step-by-step walkthrough of your algorithm, (2) states where to find each of the steps in the source code, and (3) has a proof that this algorithm satisfies differential privacy for your chosen DP variant. Note – if you’re new to DP, or even if you’re not, it helps to be very methodical about tracking the sensitivity of each of your steps. This is a different writing style than you’d use for a publication; we’re looking for something simple and unambiguous.
- Execution Instructions: Provide instructions for someone to run your code and reproduce the results you got in the data that you’re submitting.
- This is a quick multiple-choice question in the form: when you were developing (testing and configuring) your technique, did you use any version of the benchmark data? To very strictly preserve privacy, it’s best not to use the private data during development, because it could leak information that’s not protected by the privacy guarantee. Options include:
- No data was used for development (default, blind, or privacy-preserving configuration).
- TX & MA data were used during development, but the submission was run on the National data (in this case, TX & MA are considered public development data, and the National data is considered the private dataset).
- The same target data was used during development and for the submission (this is less strict about preserving privacy, but it is just fine for many practical purposes – however, because it provides a performance advantage, it’s important to note this.)
- Other (if you’ve done something else, let us know).
- Public Point of Contact: Provide the email of the person who should be contacted if anyone has questions about your approach.
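As an illustration of the methodical sensitivity tracking the proof document asks for, here is a toy Laplace-mechanism count with its sensitivity argument written inline. This is only a sketch: the function names and data are invented, and it is not a CRC artifact or requirement.

```python
import math
import random

def dp_count(values, predicate, epsilon, rng):
    """Differentially private count via the Laplace mechanism.

    Sensitivity bookkeeping (the style the proof document asks for):
    adding or removing one record changes the true count by at most 1,
    so the L1 sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    # Inverse-CDF sample from Laplace(0, 1/epsilon).
    u = rng.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# Toy, made-up records -- not the benchmark data.
rng = random.Random(0)  # seeded so the run is reproducible, per the execution instructions
ages = [34, 61, 42, 19, 70]
noisy = dp_count(ages, lambda a: a >= 65, epsilon=1.0, rng=rng)
```

Walking through each step like this (true statistic, sensitivity bound, noise scale) makes the proof document far easier for a reviewer to check against the code.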
What can I expect to occur?
- The Publicly Verifiable Differential Privacy track is what it sounds like: we’ll package these techniques and present them to the public for verification.
- The verification might be empirical, formal, or informal. We’ll be actively encouraging engagement from researchers whose work focuses on validating differential-privacy techniques.
- Within the constraints of available time, we’ll also provide our own informal review and feedback on any issues we uncover.
- If you receive feedback that uncovers issues with your technique, you can address them and submit an updated algorithm and sample data.
Tiny Paper Submission Walkthrough
For the Tiny Papers, we’ll be soliciting research observations/results rather than data or algorithms. The formal call will be out later this year, but you don’t have to wait for it to start thinking about your contribution.
Tiny Papers?: We’re basing them on the ICLR tiny paper track – you can see some great examples there. We’d like you to describe your research concept/contribution as simply and clearly as you can, making it easy for reviewers and other readers to follow along. Incremental or initial results are just fine. We don’t expect you to have everything solved by September, but we want to know what you’ve noticed about our data/algorithms and what you’ve figured out so far. Proofs, detailed experiment results, extended related works, and technical background can go in the appendix.
Contribution Type: We invite participation from all areas of expertise. Purely theoretical and purely empirical submissions are both welcome, as are any mix of the two.
Some Open Problems: What sorts of problems might you explore in your tiny research paper? There are many possibilities, but here are some potential directions to start thinking about (check out our PPAI kick-off slide deck for more context):
Through the looking glass: Why (and how, and when) do data-modeling techniques magnify some parts of the population and shrink others? What’s happening when GANs introduce artifacts or when marginal methods reduce diversity?
Equity and bias: What happens when a minority demographic has a different pattern of feature correlations than the majority (see regression metric)? Which methods are more or less expressive for retaining these differences and why?
Consistency: Why do some methods automatically avoid record inconsistencies while others don’t? Why do different methods do better on different feature categories? How does epsilon impact this? How can we improve results?
Granularity: What happens when a feature definition gives fine-grained information on a particular demographic group, causing only that group to be more sparsely spread out across the data space? (Consider RAC1P and AIANHN, or DVET and military veterans). Which privacy approaches handle this more or less gracefully? How can we improve?
More exciting feature sets: How do these results change when you’re running on 15 features instead of 10? Or all 24? What if you consider weights or household joins?
Verification: How can you verify that the “Publicly Verified Differential Privacy” submissions are in fact differentially private? Try out your preferred verification approach on these real solutions submitted by the community.