Understanding Data Deidentification

The NIST Privacy Engineering Program has launched the Collaborative Research Cycle (CRC) to spur research, innovation, and understanding of data deidentification techniques. The CRC is a revolutionary rethinking of the typical data challenge format, using metrics and benchmark data not to competitively rank approaches but to drive collaborative research towards better understanding them.

Program Purpose and Introduction

The most interesting research problems happen in cycles. First there is the idea, then engineering to implement the idea on a real-world use case, followed by engagement with real-world experts. Their feedback spurs us to look more carefully at the problem and realize its true complexity. Each iteration of this collaborative research cycle leads to more valuable and foundational research.

The NIST Differential Privacy Challenge Series saw huge gains in synthetic data generation performance and the publication of a variety of open source tools. Yet, realistic evaluation and benchmarking for deidentified data (synthetic and other techniques) remains difficult. NIST seeks to assist the research community by releasing new data and evaluation tools, drawn from real-world expertise, designed to more deeply understand data deidentification techniques and their behavior on diverse, real-world data.

The Diverse Communities Data Excerpts are real-world, limited-feature data (24 columns), drawn from the American Community Survey and divided into three distinct geographic partitions. The complimentary SDNist Deidentified Data Report Generator provides a suite of both machine- and human-readable outputs with more than ten metrics, including univariate and multivariate statistics, database distance metrics, principal component analysis, propensity, basic privacy evaluation, and other information-rich tools (see examples here).
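
As a rough illustration of what a univariate comparison between original and deidentified data can look like, the sketch below computes the total variation distance between the value frequencies of a single categorical column. It is not SDNist's implementation or API: the function name `univariate_tvd` and the toy columns are hypothetical, and the actual report generator may compute its univariate statistics differently.

```python
from collections import Counter


def univariate_tvd(target_values, deid_values):
    """Total variation distance between the value frequencies of one column.

    Returns 0.0 when the two columns have identical distributions and 1.0
    when they share no values at all. Hypothetical helper, not part of SDNist.
    """
    if not target_values or not deid_values:
        raise ValueError("both columns must be non-empty")

    # Count how often each category appears in each column.
    target_freq = Counter(target_values)
    deid_freq = Counter(deid_values)
    n_target = len(target_values)
    n_deid = len(deid_values)

    # Compare relative frequencies over every category seen in either column.
    categories = set(target_freq) | set(deid_freq)
    return 0.5 * sum(
        abs(target_freq[c] / n_target - deid_freq[c] / n_deid)
        for c in categories
    )


# Toy columns made up for illustration (not drawn from the Data Excerpts).
target = ["a", "a", "b", "c"]
deid = ["a", "b", "b", "c"]
print(univariate_tvd(target, deid))
```

A smaller value means the deidentified column preserves the original's marginal distribution more closely. On its own it says nothing about privacy or about relationships between columns, which is why a suite of several metrics is useful.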

The CRC collects, evaluates, and packages contributed deidentified Diverse Communities Data Excerpts as a research acceleration bundle. The CRC plans to update the acceleration bundle at least once during the program to allow for ongoing submissions. The CRC invites researchers to use this research acceleration bundle for investigation, comparison, and analysis and to submit their findings to a workshop.

Program Summary

The CRC invites researchers to contribute deidentified records from the NIST Diverse Communities Data Excerpts, along with a brief abstract listing their methods. The CRC has released a machine-readable research acceleration bundle of all contributions, along with detailed evaluations using the SDNist report tool. We invite researchers to use the acceleration bundle to perform analysis and submit their findings as tiny papers of four pages or fewer to a workshop to be held in December 2024. Submitted papers and NIST-contributed research will be packaged in a set of conference proceedings we expect to release in January 2025. Prizes and awards are not part of this program. For more information see the links below:

  • Participate: Learn more about how to participate.
  • Techniques: Look through our current collection of privacy approaches (if you’d like to add one we’re missing, go here!).
  • Research: Directory of research groups working with CRC resources.
  • Data and Tools: Current CRC resources including benchmark data, an archive of over 450 deidentified data examples, and a selection of analysis and exploration tools.
  • Workshop: Proceedings from previous NIST CRC Explanatory Workshops.

Questions

Please direct questions to NIST scientist Gary Howarth, or crowd-source them by joining the CRC list-serv (submit an empty email to subscribe).

Download SDNist Version 2

Use our Benchmark Data and Tools in your own research or teaching