Understanding Data Deidentification

The NIST Privacy Engineering Program has launched the Collaborative Research Cycle (CRC) to spur research, innovation, and understanding of data deidentification techniques. The CRC is a revolutionary rethinking of the typical data challenge format, using metrics and benchmark data not to competitively rank approaches but to drive collaborative research towards better understanding them.

Come red-teaming with us from March to June 2025!

Program Purpose and Introduction

The most interesting research problems happen in cycles. First comes the idea, then the engineering to implement it on a real-world use case, and then engagement with real-world experts, whose feedback spurs us to look more carefully at the problem and realize its true complexity. Each iteration of this collaborative research cycle leads to more valuable and foundational research.

Benchmark Data

The CRC provides two real-world benchmark datasets for evaluating deidentification techniques. The NIST ACS Data Excerpts are limited-feature data (24 columns), drawn from the American Community Survey and divided into three distinct geographic partitions. The NIST SBO Data Excerpts provide a platform for stress-testing successful methods on a much larger schema (130 features) and dataset (161K records).

Evaluation Metrics

The CRC provides tools for evaluating the privacy, fidelity, and utility of deidentified data. The complimentary SDNist Deidentified Data Report Generator provides a suite of both machine- and human-readable outputs with more than ten metrics, including univariate and multivariate statistics, database distance metrics, principal component analysis, propensity, basic privacy evaluation, and other information-rich tools (see examples here).
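To make the idea of a univariate fidelity metric concrete, here is a minimal sketch that compares the distribution of one column in the original data against the same column in a deidentified sample, using total variation distance. This is an illustration only and not SDNist's implementation: the function names, the `AGE_GROUP` column, and the toy records are all hypothetical.

```python
from collections import Counter


def marginal_distribution(records, column):
    """Return the relative frequency of each value in one column."""
    counts = Counter(row[column] for row in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def total_variation_distance(target, deidentified, column):
    """Half the summed absolute difference between two marginal distributions.

    0.0 means the column's distribution is preserved exactly;
    1.0 means the two datasets share no values in that column.
    """
    p = marginal_distribution(target, column)
    q = marginal_distribution(deidentified, column)
    values = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in values)


# Toy records with a hypothetical column name (not the real ACS schema).
target = [
    {"AGE_GROUP": "18-34"},
    {"AGE_GROUP": "18-34"},
    {"AGE_GROUP": "35-64"},
    {"AGE_GROUP": "65+"},
]
deidentified = [
    {"AGE_GROUP": "18-34"},
    {"AGE_GROUP": "35-64"},
    {"AGE_GROUP": "35-64"},
    {"AGE_GROUP": "65+"},
]

tvd = total_variation_distance(target, deidentified, "AGE_GROUP")
print(f"Total variation distance for AGE_GROUP: {tvd:.2f}")
```

A single-column score like this says nothing about relationships between columns, which is one reason an evaluation would pair it with multivariate and privacy metrics.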

Technique Directory

The CRC is always accepting submissions of new deidentification techniques, which are included in our Techniques Directory and benchmarked on the Algorithm Summary Table. To support research in deidentification, the full collection of deidentified data samples and detailed evaluation reports is available in our Research Acceleration Bundle, and researchers are invited to submit their findings to our research directory.

Get Involved

We invite the public to submit deidentification techniques to our archive and to conduct research using our software and the Research Acceleration Bundle.

Red Teaming

Can you re-identify the CRC synthetic data samples (i.e., reconstruct targeted individuals’ correct attribute values)? Here's your chance! We are working toward understanding deidentification techniques. How well do statistical disclosure limitation and other suppression techniques compare with differentially private solutions?
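As a rough illustration of what reconstructing a targeted individual's attribute value can mean in practice, the sketch below assumes an attacker who already knows a few attributes of each target, looks for deidentified records that match on those attributes, and guesses the unknown attribute by majority vote. This is a simplified assumption about one possible attack, not the CRC's red-teaming protocol or scoring method; the column names and records are hypothetical.

```python
from collections import Counter


def guess_attribute(known, deid_records, quasi_ids, sensitive):
    """Guess a target's sensitive value from deidentified records that
    match the target on every known (quasi-identifier) column."""
    matches = [
        row[sensitive]
        for row in deid_records
        if all(row[q] == known[q] for q in quasi_ids)
    ]
    if not matches:
        return None  # no matching record, so the attacker makes no guess
    return Counter(matches).most_common(1)[0][0]


def reconstruction_rate(targets, deid_records, quasi_ids, sensitive):
    """Fraction of targets whose sensitive value is guessed correctly."""
    correct = sum(
        1
        for t in targets
        if guess_attribute(t, deid_records, quasi_ids, sensitive) == t[sensitive]
    )
    return correct / len(targets)


# Hypothetical targets: the attacker knows AGE and SEX and wants INCOME.
targets = [
    {"AGE": 34, "SEX": "F", "INCOME": "high"},
    {"AGE": 51, "SEX": "M", "INCOME": "low"},
]
deid_records = [
    {"AGE": 34, "SEX": "F", "INCOME": "high"},
    {"AGE": 51, "SEX": "M", "INCOME": "mid"},
    {"AGE": 29, "SEX": "F", "INCOME": "low"},
]

rate = reconstruction_rate(targets, deid_records, ["AGE", "SEX"], "INCOME")
print(f"Reconstruction rate: {rate:.2f}")
```

A correct guess is not by itself evidence of a privacy failure, since some values can be inferred from population-level patterns alone; comparing against a baseline guesser that never sees the deidentified data would help separate the two.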

Questions

Please send questions to NIST scientist Gary Howarth, or crowd-source them by joining the CRC list-serv (submit an empty email to subscribe).

Download SDNist Version 2

Use our Benchmark Data and Tools in your own research or teaching