Understanding Data Deidentification
The NIST Privacy Engineering Program has launched the Collaborative Research Cycle (CRC) to spur research, innovation, and understanding of data deidentification techniques. The CRC is a revolutionary rethinking of the typical data challenge format, using metrics and benchmark data not to competitively rank approaches but to drive collaborative research towards better understanding them.
Program Purpose and Introduction
The most interesting research problems happen in cycles. First there is the idea, then the engineering to implement the idea on a real-world use case, followed by engagement with real-world experts, whose feedback spurs us to look more carefully at the problem and realize its true complexity. Each iteration of this collaborative research cycle leads to more valuable and foundational research.
The NIST Differential Privacy Challenge Series saw huge gains in synthetic data generation performance and the publication of a variety of open source tools. Yet realistic evaluation and benchmarking of deidentified data (synthetic and other techniques) remain difficult. NIST seeks to assist the research community by releasing new data and evaluation tools, drawn from real-world expertise, designed to more deeply understand data deidentification techniques and their behavior on challenging, real-world data.
The NIST ACS Data Excerpts are real-world, limited-feature data (24 columns), drawn from the American Community Survey and divided into three distinct geographic partitions. The complementary SDNist Deidentified Data Report Generator provides a suite of both machine- and human-readable outputs with more than ten metrics, including univariate and multivariate statistics, database distance metrics, principal component analysis, propensity, basic privacy evaluation, and other information-rich tools (see examples here).
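To give a feel for what a univariate metric of this kind measures, here is a minimal sketch in plain Python. It is not the SDNist implementation or its API. It assumes a simple definition (total variation distance between the value frequencies of one column in the target data and in the deidentified data), and the helper names, column name, and records are all hypothetical.

```python
from collections import Counter


def marginal(records, column):
    """Return the relative frequency of each value in one column."""
    counts = Counter(row[column] for row in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def total_variation_distance(target, deid, column):
    """Half the L1 distance between two marginals (0 = identical, 1 = disjoint)."""
    p = marginal(target, column)
    q = marginal(deid, column)
    values = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in values)


# Toy records with a hypothetical column name and values, not real ACS data.
target = [
    {"AGE_GROUP": "18-34"},
    {"AGE_GROUP": "18-34"},
    {"AGE_GROUP": "35-64"},
    {"AGE_GROUP": "65+"},
]
deid = [
    {"AGE_GROUP": "18-34"},
    {"AGE_GROUP": "35-64"},
    {"AGE_GROUP": "35-64"},
    {"AGE_GROUP": "65+"},
]

tvd = total_variation_distance(target, deid, "AGE_GROUP")
print(f"{tvd:.2f}")  # prints 0.25
```

A value near 0 means the deidentified data preserves that column's distribution closely. A single-column score like this says nothing about relationships between columns or about privacy, which is why a report combines many metrics.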
The CRC collects, evaluates, and packages contributed NIST ACS Data Excerpts as a research acceleration bundle. The CRC plans to update the acceleration bundle at least once during the program to allow for ongoing submissions. The CRC invites researchers to use this research acceleration bundle for investigation, comparison, and analysis and to submit their findings to a workshop.
The original NIST ACS Data Excerpts, released in 2023, have only 24 features and 40K records, and are designed to support fine-grained analysis of the behavior of data deidentification algorithms. In the two years since then we've seen significant advances in the state of synthetic data generation, including some incredibly robust, high-fidelity performance on our 24-feature excerpt data. Check out our algorithms summary table and data and metrics archive to see all of the results for yourself.
In 2025 we have two additional questions: How well do these algorithms scale to larger data? And are they providing genuinely good protection against sophisticated reidentification attacks? We've added the NIST SBO Data Excerpts to the NIST Excerpts Benchmarks collection to provide a platform for stress-testing successful methods on a much larger schema (130 features) and data set (161K records).
We'll be using both Excerpts (the ACS and SBO) in a Community Red Team exercise launching in March 2025 to explore the privacy protection of selected high-performing algorithms in the CRC deidentified data archive. Can you hack the CRC?
Questions
Please send questions to NIST scientist Gary Howarth, or crowd-source them by joining the CRC list-serv (submit an empty email to subscribe).