Understanding Data Deidentification

The NIST Privacy Engineering Program has launched the Collaborative Research Cycle (CRC) to spur research, innovation, and understanding of data deidentification techniques. The CRC is a revolutionary rethinking of the typical data challenge format, using metrics and benchmark data not to competitively rank approaches but to drive collaborative research towards better understanding them.

Program Purpose and Introduction

The most interesting research problems happen in cycles. First there is the idea, then engineering to implement the idea on a real-world use case, followed by engagement with real-world experts, and then that feedback is what spurs us to look more carefully at the problem and realize its true complexity. Each iteration of this collaborative research cycle leads to more valuable and foundational research.

The NIST Differential Privacy Challenge Series saw huge gains in synthetic data generation performance and the publication of a variety of open source tools. Yet, realistic evaluation and benchmarking for deidentified data (synthetic and other techniques) remains difficult. NIST seeks to assist the research community by releasing new data and evaluation tools, drawn from real-world expertise, designed to more deeply understand data deidentification techniques and their behavior on diverse, real-world data.

The Diverse Communities Data Excerpts are real-world, limited-feature data (24 columns), drawn from the American Community Survey and divided into three distinct geographic partitions. The complimentary SDNist Deidentified Data Report Generator provides a suite of both machine- and human-readable outputs with more than ten metrics, including univariate and multivariate statistics, database distance metrics, principal component analysis, propensity, basic privacy evaluation, and other information-rich tools (see examples here).
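To give a feel for what a univariate comparison between target data and deidentified data can look like, here is a minimal sketch in Python. It is not the SDNist implementation or its API: the function names, the choice of total variation distance, and the toy records are assumptions made for illustration only. The column names SEX and AGE_GROUP are hypothetical stand-ins, not a statement about the excerpt schema.

```python
from collections import Counter


def univariate_tvd(target_values, deid_values):
    """Total variation distance between two categorical columns.

    Returns 0.0 when the category frequencies match exactly and 1.0 when
    the two columns share no categories. Illustrative only; this is not
    the metric definition used by any particular tool.
    """
    if not target_values or not deid_values:
        raise ValueError("both columns must be non-empty")
    target_counts = Counter(target_values)
    deid_counts = Counter(deid_values)
    n_target = len(target_values)
    n_deid = len(deid_values)
    categories = set(target_counts) | set(deid_counts)
    # Counter returns 0 for categories missing from one side.
    return 0.5 * sum(
        abs(target_counts[c] / n_target - deid_counts[c] / n_deid)
        for c in categories
    )


def compare_columns(target_rows, deid_rows, columns):
    """Compute the distance per column for records stored as dicts."""
    return {
        col: univariate_tvd(
            [row[col] for row in target_rows],
            [row[col] for row in deid_rows],
        )
        for col in columns
    }


# Hypothetical toy records; real data would have many more rows and columns.
target = [
    {"SEX": "1", "AGE_GROUP": "30s"},
    {"SEX": "1", "AGE_GROUP": "40s"},
    {"SEX": "2", "AGE_GROUP": "30s"},
    {"SEX": "2", "AGE_GROUP": "50s"},
]
deidentified = [
    {"SEX": "1", "AGE_GROUP": "30s"},
    {"SEX": "1", "AGE_GROUP": "30s"},
    {"SEX": "1", "AGE_GROUP": "40s"},
    {"SEX": "2", "AGE_GROUP": "30s"},
]

scores = compare_columns(target, deidentified, ["SEX", "AGE_GROUP"])
for column, score in scores.items():
    print(f"{column}: {score:.2f}")
```

A score near zero suggests the deidentified column preserves the marginal distribution of the target column. It says nothing about relationships between columns or about privacy, which is why a report combines several kinds of metrics.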

The CRC collects, evaluates, and packages contributed deidentified Diverse Communities Data Excerpts as a research acceleration bundle. The CRC plans to update the acceleration bundle at least once during the program to allow for ongoing submissions. The CRC invites researchers to use this research acceleration bundle for investigation, comparison, and analysis and to submit their findings to a workshop.

Program Summary

The CRC invites researchers to contribute deidentified records from the NIST Diverse Communities Data Excerpts, along with a brief abstract listing their methods. The CRC has released a machine-readable research acceleration bundle of all contributions, along with detailed evaluations using the SDNist report tool. We expect to publish another release in July. We invite researchers to use the acceleration bundle to perform analysis and submit their findings in 3-page-or-less tiny papers to a workshop to be held in November 2023. Submitted papers and NIST-contributed research will be packaged in a set of conference proceedings we expect to release in January 2024. Prizes and awards are not part of this program.

  • Participate: Learn more about how to participate.
  • Results Blog: See what interesting things we’ve discovered so far; check out our results blog.
  • Techniques: Look through our current collection of privacy approaches (if you’d like to add one we’re missing, go here!).
  • How to Cite: Look here if you’d like to reference these resources in your own work.

Questions

Please send questions to NIST scientist Gary Howarth, or crowd-source them by joining the CRC list-serv (submit an empty email to subscribe).

Office Hours

We are holding periodic office hours for drop-in conversation (see times and dates below). These informal meetings are an opportunity to discuss the problem and any challenges you may be facing. We're happy to discuss report results and to help you troubleshoot a problem. If you send us some data by the Friday before an office hour session, we are happy to take a detailed look at the results. We will record each session and provide access to recordings upon request. Join our office hours using this link.

Timeline

Exploratory Phase (February - July 2023)

Date Event
28 FEB 2023 SDNist V2 launch; CRC opens for submissions.
7 MAR 2023 CRC releases instructional video.
13 MAR 2023 Office hours session (11AM ET).
20 MAR 2023 Office hours session (11AM ET).
3 APR 2023 Office hours session (11AM ET).
17 APR 2023 Office hours session (11AM ET).
1 MAY 2023 Office hours session (11AM ET).
9 MAY 2023 CRC closes for first release submissions.
19 MAY 2023 First release of the research acceleration bundle.
12 JUN 2023 19 JUN 2023 Office hours session (11AM ET).
26 JUN 2023 Office hours session (11AM ET).
7 JUL 2023 CRC closes for second release submissions.
10 JUL 2023 Office hours session (11AM ET).
14 JUL 2023 Second release of the research acceleration bundle.

Explanatory Phase (May - November 2023)

Date Event
24 JUL 2023 Office hours session (11AM ET).
7 AUG 2023 Office hours session (11AM ET).
18 AUG 2023 Release of call for 4-page tiny papers.
21 AUG 2023 Office hours session (11AM ET).
11 SEP 2023 Office hours session (11AM ET).
18 SEP 2023 Office hours session (11AM ET).
12 OCT 2023 Optional: Early tiny-paper abstract submissions for feedback.
7 NOV 2023 Explanatory tiny papers due.
1 DEC 2023 Explanatory tiny-paper notifications.
18 DEC 2023 Explanatory Workshop.
5 JAN 2024 Optional: Final camera-ready tiny paper.
24 JAN 2024 CRC releases project findings and proceedings.

Download SDNist Version 2

Use our Benchmark Data and Tools in your own research or teaching