Contents:

Connect with Us

Submit Deidentified Data to the Archive

Submit Related Research to our Research Directory

Help Explore the Deidentified Data Archive


Connect with Us

Email, Listserv, Slack, and Office Hours

Please feel free to reach out to us! We're happy to answer questions, provide tours, discuss your relevant research, etc.


Submit Deidentified Data to the Archive

Data Submission Walkthrough

Help us explore the behavior of deidentification algorithms on diverse data by submitting deidentified versions of our benchmark data, produced with privacy techniques that are intended to prepare sensitive private data for public release. For a video tutorial on data submission, click here. We'll be accepting data submissions continuously throughout the program, and releasing them in periodic updates to our public Archive of deidentified data. This Archive serves as the basis for meta-research into the overall problem of reliable, equitable, high-quality data deidentification.

When you submit data to the archive, your work will contribute to solving this problem. We will include your technique in our Techniques Directory and on our Algorithm Summary Table, provide you with an SDNist report on your utility and privacy performance, and encourage you to join us for an office hour or 1-1 meeting to discuss your evaluation results.

Do I need to be a privacy expert? : Nope! We want participants both inside and outside the privacy research community. There are a lot of easy-to-use tools out there aimed at the general public, and we'd like to know how they perform just as much as we'd like to understand recent research innovations.

What's a submission? : To make a deidentified-data submission, first pick a privacy technique and a feature subset. Then, run the privacy technique on the data with the feature subset you chose. You can include multiple files in a data submission to try out different parameter settings on your privacy technique.

Can I make more than one submission? : Yes! The more techniques and feature sets you try out, the more we’ll have, and that’s what we want. Your team will be given credit for everything you submit, and you can even attach a team logo to your submission if you like.

What data? : Here is the benchmark data we'd like you to deidentify for us. We have three benchmark datasets – the Massachusetts data is from north of Boston, the Texas data is from near Dallas, and the National data is a collection of very diverse communities from around the nation. The data is derived from the 2019 American Community Survey; the 24 features in the complete schema were chosen because they capture many of the complexities of real-world data while remaining small and simple enough to make more formal analysis feasible. The data folder includes lovely postcard documentation about the communities and a JSON data dictionary to make it easy to configure your privacy technique. The usage guidance section in the readme has helpful configuration hints (watch out for 'N').
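
For a rough illustration of getting started, the sketch below loads one benchmark CSV and the JSON data dictionary with pandas. The file names are placeholders (check the data folder's readme for the actual paths in your download), and this is not an official loader.

    import json
    import pandas as pd

    # Keep everything as strings at first -- several features use the literal
    # string 'N' for "not applicable", as described in the usage guidance.
    df = pd.read_csv("national2019.csv", dtype=str)     # placeholder file name

    with open("data_dictionary.json") as f:             # placeholder file name
        data_dict = json.load(f)

    # The dictionary describes each feature and its valid codes, which helps
    # when configuring a privacy technique's schema or domain.
    print(df.shape)
    print(sorted(data_dict.keys())[:5])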

What privacy technique? : You can pick one to try from our growing collection on the Techniques page, take a look at our list of open-source libraries for privacy newcomers, or you can contribute a new one we don’t have yet. If you submit a new technique (or a different library implementation of one we already have), then the submission form will prompt you for basic information about it: a short name, a one-sentence description, a full algorithm description (optional), and then any links or references that help document it. You’ll also have the option of adding a picture that people can use to identify the technique at a glance. In a few days, you’ll see the technique you contributed added to our website. Note – you don’t need to be the creator of a privacy technique to submit data that uses it; just be sure to properly cite the source when prompted.

What feature set? : The full schema has 24 features, but you will often want to focus on just a subset of them. Some privacy algorithms are designed for smaller feature sets, and algorithm analysis may be more approachable on focused subsets. Of course, it's also important to see how algorithms behave with larger feature sets; the best-performing approach on 9 features might be the worst-performing one on 21. Here are some options to try out, and you're welcome to pick your own subset as well (a short sketch after this list shows one way to restrict the data to a chosen subset). A single data submission should use a single feature subset.

  • All Features: Includes all 24 features
  • Simpler Features: Includes 21 features, all except (INDP, WGTP, PWGTP)
  • Demographic-Focused Subset: SEX MSP RAC1P OWN_RENT PINCP_DECILE EDU AGEP HOUSING_TYPE DVET DEYE
  • Industry-Focused Subset: SEX MSP RAC1P OWN_RENT PINCP_DECILE EDU HISP PUMA INDP_CAT
  • Detailed Industry Subset: SEX MSP RAC1P OWN_RENT PINCP_DECILE EDU HISP PUMA INDP_CAT INDP
  • Family-Focused Subset: SEX MSP RAC1P OWN_RENT PINCP_DECILE HISP PUMA AGEP NOC NPF POVPIP
  • Small Categorical: SEX RAC1P OWN_RENT PINCP_DECILE PUMA
  • Tiny Categorical: RAC1P PUMA
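
As mentioned above, here is a minimal sketch of restricting the benchmark data to one of these subsets (the Demographic-Focused Subset) and writing out a file for a submission. The file name and the deidentify call are placeholders standing in for whatever technique, parameters, and target dataset you choose.

    import pandas as pd

    DEMOGRAPHIC_FOCUSED = [
        "SEX", "MSP", "RAC1P", "OWN_RENT", "PINCP_DECILE",
        "EDU", "AGEP", "HOUSING_TYPE", "DVET", "DEYE",
    ]

    target = pd.read_csv("national2019.csv", dtype=str)   # placeholder path
    subset = target[DEMOGRAPHIC_FOCUSED]

    # Run your chosen privacy technique here; `deidentify` is a stand-in for
    # whatever library call or script you actually use.
    # deid = deidentify(subset, epsilon=1.0)

    # Each file in the submission should keep the same columns as the feature set.
    # deid.to_csv("myteam_demographic_eps1.csv", index=False)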

How do I submit? :

  1. Have you registered your team? Do that first.
  2. Prepare your deidentified data samples, with your chosen technique and feature set. Your submission can include multiple data files (up to 10) to explore running the technique with different parameter configurations or target datasets (MA, TX, or National).
  3. Use the Google form here to submit. It will ask for your contact info and team name, your technique and feature set, and some questions about how you prepared the submission – what privacy the technique provides and which parameter configurations you explored. You'll also have the option to upload an image to go with your submission and to provide any additional links or references you think we should have.
  4. If you’re using a custom technique that satisfies differential privacy, decide if you’d like to Assert Publicly Verifiable Differential Privacy. If so, you’ll need to submit some supplementary material to help others verify that your approach correctly satisfies your stated differential-privacy guarantee.

Data Submission Tutorial Video

This tutorial offers a tour of the resources that are available on the project website. It then explains how you can use these resources to contribute to the first phase of the Collaborative Research Cycle by submitting samples of deidentified data from different privacy techniques. We demonstrate how the data submission process might go for a very cool privacy algorithm, exploring the impact of "parameter [x]".

Video Link

By: Christine Task

Posted on: March 7, 2023



Publicly Verifiable Differential Privacy Submission Process

In the NIST Synthetic Data Challenges, NIST provided a Subject-Matter-Expert differential-privacy validation process. For the Collaborative Research Cycle, we will be doing something more transparent and collaborative instead. If you would like to assert that your submission technically satisfies Publicly Verifiable Differential Privacy, then read more here.

Should I participate?

  • Participation is optional: You are welcome to describe your technique as differentially private in the submission form without participating in the Publicly Verifiable Differential Privacy track, and that's fine! It just won't be publicly verified.
  • In particular, if you’re using a tool from a well-known differentially private library, and you haven’t added any of your own code, then you don’t need to submit for public verification.
  • However, if you’re relatively new to differential privacy, and you’re writing some of your own code, it’s a great idea to submit for verification.
  • And if you’ve been in the differential-privacy research community for quite a while, then it’s an even better idea to submit for verification. This will make it easier for other DP researchers to find your work in the archive, and more eyes on source code is almost always a good idea.

What do I need to participate?

  • On the submission form, you can check that you’d like to assert Publicly Verifiable Differential Privacy. Then you’ll be prompted to submit a few additional items. Note that all of these items will be shared publicly as part of the public verification process.
  • Source Code: You will need to link to a public code repository where your source code can be reviewed.
  • Privacy Proof and Code Guide Document: You will need to upload a written document that (1) includes a step-by-step walkthrough of your algorithm, (2) states where to find each of the steps in the source code, and (3) gives a proof that the algorithm satisfies differential privacy for your chosen DP variant. Note – whether or not you're new to DP, it helps to be very methodical about tracking the sensitivity of each of your steps (a toy example of this kind of bookkeeping appears after this list). This is a different writing style than you'd use for a publication; we're looking for something simple and unambiguous.
  • Execution Instructions: Provide instructions for someone to run your code and reproduce the results you got in the data that you’re submitting.
  • Tuning Approach: This is a quick multiple-choice question in the form – when you were developing (testing and configuring) your technique, did you use any version of the benchmark data? To very strictly preserve privacy, it’s best not to use the private data during development because it could leak information that’s not protected by the privacy guarantee. Options include:
    • No data was used for development (default, blind, or privacy-preserving configuration).
    • TX & MA data were used during development, but the submission was run on the National data (in this case, TX & MA are considered public development data, and the National data is considered the private dataset).
    • The same target data was used during development and for the submission (this is less strict about preserving privacy, but it is just fine for many practical purposes – however, because it provides a performance advantage, it’s important to note this.)
    • Other (if you’ve done something else, let us know).
  • Public Point of Contact: Provide the email of the person who should be contacted if anyone has questions about your approach.
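
To illustrate the kind of step-by-step sensitivity bookkeeping the proof document asks for, here is a toy sketch (not a required format, and not tied to any particular submission) of a single counting query released with the Laplace mechanism:

    import numpy as np

    rng = np.random.default_rng()

    def dp_count(records, predicate, epsilon):
        """Release a noisy count of the records satisfying `predicate`.

        Step 1: exact count. Adding or removing one record changes the count
                by at most 1, so the L1 sensitivity of this step is 1.
        Step 2: add Laplace noise with scale = sensitivity / epsilon = 1 / epsilon,
                making the released count epsilon-differentially private.
        """
        exact = sum(1 for r in records if predicate(r))    # sensitivity 1
        noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)  # scale = 1/epsilon
        return exact + noise

    # Example: a noisy count of records with an illustrative OWN_RENT code,
    # released at epsilon = 0.5.
    own_rent = ["1", "2", "2", "1", "2"]
    print(dp_count(own_rent, lambda r: r == "2", epsilon=0.5))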

What can I expect to occur?

  • The Publicly Verifiable Differential Privacy track is what it sounds like: we’ll package these techniques and present them to the public for verification.
  • The verification might be empirical, formal, or informal. We’ll be actively encouraging engagement from researchers whose work focuses on validating differential-privacy techniques.
  • Within the constraints of available time, we’ll also provide our own informal review and feedback on any issues we uncover.
  • If you receive feedback that uncovers issues with your technique, you can address them and submit an updated algorithm and sample data.

Submit Related Research to our Research Directory

Research Directory Submission Walkthrough

In addition to a directory of deidentification techniques, we also maintain a directory of individuals and groups doing related research using CRC resources. If you've been making use of CRC resources, we'd love to include you!

The Related Research Directory exists to help researchers find each other and make use of each other's insights. Because everyone in the directory has work referencing the same set of benchmark data, it's often easier to build on others' observations and make substantial progress together. To be included in the directory, we only need the following (provided by email or during an office hour):

  • Name: (individual or group name)
  • Description: 1-2 sentences describing the individual or group, such as research focus, affiliation(s), etc.
  • URL: 1-2 reference links for the group (website, arXiv paper, GitHub, Discord)
  • CRC Related Work: A list of any publicly available work that cites CRC resources. Possibilities include arXiv preprints, commercial whitepapers, government technical reports, and code or data analytics notebooks – anything publicly available that documents your observations of CRC resources and might be relevant for other researchers. Please see below for information on the definition of CRC related work and how to cite CRC resources.
  • Group Type: Academic (including students), Government, Industry, Individual/Recreational
  • Point of Contact: How can other researchers contact you? Individual or group email address, or website with a contact form.
  • [optional] List of Group Members: Anyone in your group working on CRC related research that you'd like included in the directory
  • [optional] Keywords: A list of keywords for the group's research focus, to help us organize groups and make it easier for others to find you.
  • [optional] Image: An icon or image you'd like to use for your group. If you don't provide one, we'll use an excerpt from your CRC Related Work.


How to Cite CRC Resources

We define "CRC Related Research" to be work that uses (and cites) any of the resources available on the Data and Tools page. This includes the Diverse Community Excerpts benchmark data, the SDNist Evaluation Tool, the Deidentified Data Archive (including the Algorithm Summary Table, Data and Metrics Archive, or Meta-report Archive), the meta-analysis notebooks and the Pairwise PCA Exploration Tool.

Below we provide guidance for citing our data and code resources:

  • If you publish work that utilizes the SDNist Deidentified Data Tool, please cite the software. Citation recommendation:

    Task C., Bhagat K., and Howarth G.S. (2023), SDNist v2: Deidentified Data Report Tool, National Institute of Standards and Technology, https://doi.org/10.18434/mds2-2943.

    bibtex:
    @misc{task_sdnist_2023,
      author = {Task, Christine and Bhagat, Karan and Howarth, Gary},
      doi = {10.18434/MDS2-2943},
      month = mar,
      publisher = {National Institute of Standards and Technology},
      shorttitle = {{SDNist} v2},
      title = {{SDNist} v2: {Deidentified} {Data} {Report} {Tool}},
      url = {https://data.nist.gov/od/id/mds2-2943},
      year = {2023}
    }
  • If you publish work that utilizes the NIST Diverse Communities Data Excerpts, please cite the resource. Citation recommendation:

    Task C., Bhagat K., Streat D., and Howarth G.S. (2023), NIST Diverse Communities Data Excerpts, National Institute of Standards and Technology, https://doi.org/10.18434/mds2-2895.

    bibtex:
    @misc{task_nist_2022,
      author = {Task, Christine and Bhagat, Karan and Damon, Streat and Howarth, Gary},
      doi = {10.18434/MDS2-2895},
      month = dec,
      publisher = {National Institute of Standards and Technology},
      title = {{NIST} {Diverse} {Community} {Excerpts} {Data}},
      url = {https://data.nist.gov/od/id/mds2-2895},
      year = {2022}
    }

Help Explore the Deidentified Data Archive

NIST Explanatory Workshop

The CRC program solicits 4-page Research Reports (Tiny Papers) exploring our resources, for potential inclusion in an annual peer-reviewed NIST workshop summarizing the year's progress on the reliable, equitable deidentification problem.

  • Check out our 2023 Call for papers here.
  • Check out the previous NIST CRC Explanatory Workshop Proceedings here.
  • And check out the guidance below for ideas about tiny research reports and the power of exploratory, explanatory research.

Tiny Paper Guidance

Tiny Papers?: We're basing this on the ICLR Tiny Papers track – you can see some great examples there. We'd like you to describe your research concept/contribution as simply and clearly as you can, making it easy for reviewers and other readers to follow along. Incremental or initial results are just fine. We don't expect you to have everything solved, but we want to know what you've noticed about our data/algorithms and what you've figured out so far. Proofs, detailed experiment results, extended related works, and technical background can go in the appendix.

Contribution Type: We invite participation from all areas of expertise. Purely theoretical and purely empirical submissions are both welcome, as are any mix of the two.

Some Open Problems: What sorts of problems might you explore in your tiny research paper? There are a lot of possibilities, but here are some potential things to start thinking about (check out our PPAI kick-off slide deck for more context):

  • Through the looking glass: Why (and how, and when) do data-modeling techniques magnify some parts of the population and shrink others? What’s happening when GANs introduce artifacts or when marginal methods reduce diversity?

  • Equity and bias: What happens when a minority demographic has a different pattern of feature correlations than the majority (see regression metric)? Which methods are more or less expressive for retaining these differences and why?

  • Consistency: Why do some methods automatically avoid record inconsistencies while others don't? Why do different methods do better on different feature categories? How does epsilon impact this? How can we improve results?

  • Granularity: What happens when a feature definition gives fine-grained information on a particular demographic group, causing only that group to be more sparsely spread out across the data space? (Consider RAC1P and AIANHN, or DVET and military veterans). Which privacy approaches handle this more or less gracefully? How can we improve?

  • More exciting feature sets: How do these results change when you’re running on 15 features instead of 10? Or all 24? What if you consider weights or household joins?

  • Verification: How can you verify that the "Publicly Verifiable Differential Privacy" submissions are in fact differentially private? Try out your preferred verification approach on these real solutions submitted by the community.