Data Evaluation Report

Report created on: March 15, 2023 21:45:38

Data Description

Deidentified (Deid.) Data:

Label Name	Label Value
Team	CCAIM
Submission Timestamp	3/9/2023 3:29:17
Algorithm Name	Synthcity-PATEGAN
Variant Label	syntheticy-pategan-default

Property	Value
Filename	pategan-ZhaozhiQian
Records	21802
Features	24

Target Data:

Property	Value
Filename	national2019
Records	27253
Features	24

Evaluated Data Features:

Feature Name	Feature Description	Feature Type	Feature Has 'N' (N/A) values?
PUMA	Public use microdata area code	object of type string	False
AGEP	Person's age	int64	False
SEX	Person's gender	int64	False
MSP	Marital Status	object of type string	True
HISP	Hispanic origin	int64	False
RAC1P	Person's Race	int64	False
NOC	Number of own children in household (unweighted)	object of type string	True
NPF	Number of persons in family (unweighted)	object of type string	True
HOUSING_TYPE	Housing unit or group quarters	int64	False
OWN_RENT	Housing unit rented or owned	int64	False
DENSITY	Population density among residents of each PUMA	float64	False
INDP	Industry codes	object of type string	True
INDP_CAT	Industry categories	object of type string	True
EDU	Educational attainment	object of type string	True
PINCP	Person's total income in dollars	object of type string	True
PINCP_DECILE	Person's total income in 10-percentile bins	object of type string	True
POVPIP	Income-to-poverty ratio (ex: 250 = 2.5 x poverty line)	object of type string	True
DVET	Veteran service connected disability rating (percentage)	object of type string	True
DREM	Cognitive difficulty	object of type string	True
DPHY	Ambulatory (walking) difficulty	object of type string	True
DEYE	Vision difficulty	int32	False
DEAR	Hearing difficulty	int32	False
WGTP	Housing unit sampling weight	int64	False
PWGTP	Person's sampling weight	int64	False

Utility Evaluation

K-Marginal Synopsis:

The k-marginal metric checks how far the shape of the deidentified data distribution has shifted away from the target data distribution. It does this using many 3-dimensional snapshots of the data, averaging the density differences across all snapshots. It was developed by Sergey Pogodin as an efficient scoring mechanism for the NIST Temporal Data Challenges, and can be applied to measure the distance between any two data distributions. A score of 0 means two distributions have zero overlap, while a score of 1000 means the two distributions match identically.
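The scoring idea described above can be sketched in a few lines. This is a minimal illustration, not the scorer used to produce this report: the function names, the toy records, and the rescaling `(1 - mean_gap / 2) * 1000` are assumptions chosen to match the stated 0 to 1000 range, and a real scorer would typically bin numeric features and sample a subset of marginals rather than enumerate all of them.

```python
from collections import Counter
from itertools import combinations


def marginal_density(records, features):
    """Normalized frequency of each value combination over `features`."""
    counts = Counter(tuple(r[f] for f in features) for r in records)
    total = len(records)
    return {key: n / total for key, n in counts.items()}


def k_marginal_score(target, deid, features, k=3):
    """Mean L1 density gap over all k-feature marginals, rescaled to 0..1000."""
    gaps = []
    for combo in combinations(features, k):
        p = marginal_density(target, combo)
        q = marginal_density(deid, combo)
        # The L1 gap between two densities lies in [0, 2].
        gaps.append(sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q)))
    mean_gap = sum(gaps) / len(gaps)
    return round((1 - mean_gap / 2) * 1000)


# Hypothetical toy records with four categorical features.
FEATURES = ["A", "B", "C", "D"]
target = [{"A": 1, "B": 1, "C": 1, "D": 1}, {"A": 2, "B": 2, "C": 2, "D": 2}]
disjoint = [{"A": 3, "B": 3, "C": 3, "D": 3}]

identical_score = k_marginal_score(target, target, FEATURES)
disjoint_score = k_marginal_score(target, disjoint, FEATURES)
print(identical_score, disjoint_score)
```

Comparing a dataset with itself gives the maximum score, and comparing it with a dataset that shares no value combinations gives the minimum, which mirrors the two endpoints described in the text.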

K-Marginal Score: 476

Sampling Error Comparison:

Here we provide a sampling error baseline: Taking a random subsample of the data also shifts the distribution by introducing sampling error. How does the shift from deidentifying data compare to the shift that would occur from subsampling the target data?

The K-marginal score of the deidentified data (476) is closest to that of a 10% sub-sample of the target data (838), although it still falls 362 points below it.

Sub-Sample Size	Sub-Sample K-Marginal Score	Deidentified Data K-marginal score	Absolute Diff. From Deidentified Data K-marginal Score
10%	838	476	362
20%	881	476	405
30%	906	476	430
40%	921	476	445
50%	935	476	459
60%	944	476	468
70%	955	476	479
80%	965	476	489
90%	976	476	500
100%	1000	476	524
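The last column of the table is plain arithmetic on the two score columns. The sketch below recomputes it from the values printed in this report and picks the sub-sample size with the smallest gap; the variable names are illustrative only.

```python
# Scores copied from the table above: sub-sample size (%) -> k-marginal score.
subsample_scores = {
    10: 838, 20: 881, 30: 906, 40: 921, 50: 935,
    60: 944, 70: 955, 80: 965, 90: 976, 100: 1000,
}
deid_score = 476

# Absolute difference between each sub-sample score and the deidentified score.
abs_diff = {pct: abs(score - deid_score) for pct, score in subsample_scores.items()}

# The sub-sample whose score is nearest to the deidentified data's score.
closest_pct = min(abs_diff, key=abs_diff.get)
print(closest_pct, abs_diff[closest_pct])
```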

K-Marginal Score in Each PUMA:

Different PUMA have different subpopulations and distributions; how much has each PUMA shifted during deidentification?

PUMA	Score
29-01901: St. Louis City (North)	0
01-01301: Birmingham City (West)	23
36-04010: NYC-Brooklyn Community District 17--East Flatbush, Farragut & Rugby	70
19-01700: Des Moines City	101
51-51255: Alexandria City	110
06-08507: Santa Clara County (Southwest)--Cupertino, Saratoga Cities & Los Gatos Town	116
06-07502: San Francisco County (North & East)--North Beach & Chinatown	122
36-03710: NYC-Bronx Community District 1 & 2--Hunts Point, Longwood & Melrose	153
08-00803: Boulder County (Central)--Boulder City	168
17-03531: Chicago City (South)--Auburn Gresham, Roseland, Chatham, Avalon Park & Burnside	175
26-02702: Washtenaw County (East Central)--Ann Arbor City Area	222
51-01301: Arlington County (North)	224
17-03529: Chicago City (South)--South Shore, Hyde Park, Woodlawn, Grand Boulevard & Douglas	228
40-00200: Cherokee, Sequoyah & Adair Counties	234
28-01100: Central Region--Jackson City (East & Central)	237
30-00600: East Montana (Outside Billings City)	239
13-04600: Atlanta Regional Commission--Fulton County (Central)--Atlanta City (Central)	258
32-00405: Las Vegas City (Southeast)	260
24-01004: Montgomery County (South)--Bethesda, Potomac & North Bethesda	262
38-00100: West North Dakota--Minot City	292

Univariate Distributions:

Here we provide single feature distribution comparisons ordered to show worst performing features first (based on the L1 norm of density differences).
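As a rough sketch of the ordering criterion, the L1 norm of density differences for one feature can be computed from its value counts in each dataset. The feature names and counts below are hypothetical, and the report's own implementation may bin or normalize values differently.

```python
def l1_density_gap(target_counts, deid_counts):
    """Sum of absolute differences between the two normalized value densities."""
    t_total = sum(target_counts.values())
    d_total = sum(deid_counts.values())
    values = set(target_counts) | set(deid_counts)
    return sum(
        abs(target_counts.get(v, 0) / t_total - deid_counts.get(v, 0) / d_total)
        for v in values
    )


# Hypothetical value counts for two features, X and Y.
target_counts = {"X": {"N": 50, "1": 50}, "Y": {"1": 30, "2": 70}}
deid_counts = {"X": {"1": 100}, "Y": {"1": 15, "2": 35}}

gaps = {f: l1_density_gap(target_counts[f], deid_counts[f]) for f in target_counts}

# Worst performing feature first, as in the ordering used in this section.
worst_first = sorted(gaps, key=gaps.get, reverse=True)
print(worst_first, gaps)
```

Feature X loses its entire 'N' category in the deidentified counts, so it ranks first; feature Y keeps the same proportions at half the record count, so its gap is zero.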

PWGTP: Person's sampling weight:

PINCP: Person's total income in dollars:

Feature Value: N (N/A)
Target Data Counts: 4247
Deidentified Data Counts: 0

PUMA: Public use microdata area code:

AGEP: Person's age:

WGTP: Housing unit sampling weight:

DENSITY: Population density among residents of each PUMA:

INDP_CAT: Industry categories:

Feature Value: N (N/A)
Target Data Counts: 10929
Deidentified Data Counts: 0

POVPIP: Income-to-poverty ratio (ex: 250 = 2.5 x poverty line):

Feature Value: 501 (Not in poverty: income above 5 x poverty line)
Target Data Counts: 9972
Deidentified Data Counts: 2739

PINCP_DECILE: Person's total income in 10-percentile bins:

EDU: Educational attainment:

DVET: Veteran service connected disability rating (percentage):

Feature Value: N (N/A)
Target Data Counts: 26906
Deidentified Data Counts: 0

NPF: Number of persons in family (unweighted):

MSP: Marital Status:

NOC: Number of own children in household (unweighted):

Feature Value: 0
Target Data Counts: 16052
Deidentified Data Counts: 14136

RAC1P: Person's Race:

SEX: Person's gender:

DPHY: Ambulatory (walking) difficulty:

Feature Value: 2
Target Data Counts: 23776
Deidentified Data Counts: 20294

DREM: Cognitive difficulty:

Feature Value: 2
Target Data Counts: 24379
Deidentified Data Counts: 20474

OWN_RENT: Housing unit rented or owned:

HISP: Hispanic origin:

Feature Value: 0
Target Data Counts: 24400
Deidentified Data Counts: 19206

DEAR: Hearing difficulty:

Feature Value: 2
Target Data Counts: 26282
Deidentified Data Counts: 20806

HOUSING_TYPE: Housing unit or group quarters:

Feature Value: 1
Target Data Counts: 25435
Deidentified Data Counts: 20249

DEYE: Vision difficulty:

Feature Value: 2
Target Data Counts: 26544
Deidentified Data Counts: 21044

Correlations:

A key goal of deidentified data is to preserve the feature correlations from the target data, so that analyses performed on the deidentified data provide meaningful insight about the target population. Which correlations are the deidentified data preserving, and which are being altered during deidentification?

Kendall Tau Correlation Coefficient Difference:

This chart shows pairwise correlations using a somewhat different definition of correlation. To what extent do the two different correlation metrics agree or disagree with each other about the quality of the deidentified data?

Pearson Correlation Coefficient Difference:

The Pearson Correlation difference was a popular utility metric during the HLG-MOS Synthetic Data Test Drive. Note that darker highlighting indicates pairs of features whose correlations were not well preserved by the deidentified data.
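To make the two correlation definitions concrete, here is a small standard-library sketch that computes both coefficients for one feature pair in a target and a deidentified sample, then takes the absolute difference. The data are hypothetical, and the Kendall function is the simple tau-a variant without tie correction, which is an assumption; the report may use a tie-adjusted variant.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical feature pair: perfectly ordered in target, one swap in deid.
target_x, target_y = [1, 2, 3, 4], [2, 4, 6, 8]
deid_x, deid_y = [1, 2, 3, 4], [2, 4, 8, 6]

pearson_diff = abs(pearson(target_x, target_y) - pearson(deid_x, deid_y))
kendall_diff = abs(kendall_tau_a(target_x, target_y) - kendall_tau_a(deid_x, deid_y))
print(round(pearson_diff, 3), round(kendall_diff, 3))
```

The same perturbation produces different difference values under the two definitions, which is why the report presents both.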

Linear Regression:

Linear regression is a fundamental data analysis technique that condenses a multi-dimensional data distribution down to a one-dimensional (line) representation. It works by finding the line that sits in the 'middle' of the data, in some sense: it minimizes the total distance between the points of the data and the line. There are more advanced forms of regression, but here we focus on the simplest case: we fit a straight line to the data and report its slope and y-intercept.

For this metric we're just looking at data from adults (AGEP > 15) and we're only considering the distribution of the data across two features:

EDU: The highest education level this individual has attained, ranging from 1 (elementary school) to 12 (PhD). See Appendix of this report for the full list of code values.
PINCP_DECILE: The individual's income decile relative to their PUMA. This helps us account for differences in cost of living across the country. If an individual makes a moderate income but lives in a very low income area, they may have a high value for PINCP_DECILE, indicating that they have a high income for their PUMA.

Regression: 0.14 slope, 3.1 intercept
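The fit described in this section can be sketched with the closed-form ordinary least squares solution. The records are hypothetical, and treating EDU as the x-axis and PINCP_DECILE as the y-axis is an assumption, since the report does not state the orientation. Ordinary least squares minimizes the squared vertical distance from each point to the line.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    slope = cov / var
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical records; the last one is a child and is filtered out.
records = [
    {"AGEP": 30, "EDU": 2, "PINCP_DECILE": 2.0},
    {"AGEP": 45, "EDU": 4, "PINCP_DECILE": 3.0},
    {"AGEP": 60, "EDU": 6, "PINCP_DECILE": 4.0},
    {"AGEP": 22, "EDU": 8, "PINCP_DECILE": 5.0},
    {"AGEP": 10, "EDU": 1, "PINCP_DECILE": 9.0},
]

# Keep adults only, using the same AGEP > 15 cut-off as the report.
adults = [r for r in records if r["AGEP"] > 15]
slope, intercept = fit_line(
    [r["EDU"] for r in adults],
    [r["PINCP_DECILE"] for r in adults],
)
print(slope, intercept)
```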

Propensity Mean Square Error:

Can a decision tree classifier tell the difference between the target data and the deidentified data? If a classifier is trained to distinguish between the two data sets and it performs poorly on the task, then the deidentified data must not be easy to distinguish from the target data. If the green line matches the blue line, then the deidentified data is high quality. Propensity-based metrics have been developed by Joshua Snoke, Gillian Raab, and Claire Bowen, all of whom have participated in the NIST Synthetic Data Challenges SME panels.

Score: 0.245
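For orientation, the propensity mean square error is commonly defined as the average squared gap between each record's predicted propensity (the classifier's probability that the record is deidentified) and the share of deidentified records in the combined data. The sketch below assumes that definition and skips the classifier entirely: the propensity lists are hypothetical stand-ins for a fitted model's output.

```python
def pmse(propensities, n_deid):
    """Mean squared gap between propensities and the deidentified share."""
    n_total = len(propensities)
    share = n_deid / n_total
    return sum((p - share) ** 2 for p in propensities) / n_total


# Hypothetical combined data: 2 target records followed by 2 deidentified ones.
indistinguishable = [0.5, 0.5, 0.5, 0.5]   # classifier cannot separate them
fully_separated = [0.0, 0.0, 1.0, 1.0]     # classifier separates them perfectly

low_score = pmse(indistinguishable, n_deid=2)
high_score = pmse(fully_separated, n_deid=2)
print(low_score, high_score)
```

With equally sized datasets the value ranges from 0 (indistinguishable) to 0.25 (perfectly separable), so lower is better under this definition.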

Propensities Distribution:

PCA:

This is another approach for visualizing where the distribution of the deidentified data has shifted away from the target data. In this approach, we begin by using Principal Component Analysis to find a way of representing the target data in a lower dimensional space (in 5 dimensions rather than the full 22 dimensions of the original feature space). Descriptions of these new five dimensions (components) are given in the components table; the components will change depending on which target data set you're using. Five dimensions are better than 22, but we actually want to get down to two dimensions so we can plot the data on simple (x,y) axes. The plots below show the data across each possible pair combination of our five components. You can compare how the shapes change between the target data and the deidentified data, and consider what that might mean in light of the component definitions. This is a relatively new visualization metric that was introduced by the IPUMS International team during the HLG-MOS Synthetic Data Test Drive.
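The core step, fitting components on the target data and then projecting both datasets onto them, can be illustrated with a deliberately small sketch. It finds only the first component of two-feature toy data using power iteration, which is a swapped-in technique for brevity; the report uses five components, and all names and data here are hypothetical.

```python
from math import sqrt


def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def center(rows, means):
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iterations=100):
    """Leading eigenvector by power iteration (assumes the start vector
    is not orthogonal to it, which holds for the toy data below)."""
    d = len(cov)
    v = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, component):
    return [sum(a * b for a, b in zip(row, component)) for row in centered]


# Hypothetical data: target lies along y = x, deid lies across it.
target = [[0, 0], [1, 1], [2, 2], [3, 3]]
deid = [[0, 3], [3, 0]]

means = column_means(target)                 # components are fit on target only
component = top_component(covariance(center(target, means)))
target_proj = project(center(target, means), component)
deid_proj = project(center(deid, means), component)
print(component, target_proj, deid_proj)
```

The target records spread out along the component while the deidentified records collapse to a single point, which is the kind of shape difference the pair plots are meant to reveal.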

Contribution of Features in Each Principal Component:

Principal Component	Features Contribution: feature-name (contribution ratio)
PC-0	PINCP (0.38),NOC (0.27),NPF (0.24),WGTP (0.08),OWN_RENT (0.07)
PC-1	OWN_RENT (0.43),WGTP (0.42),PWGTP (0.36),INDP (0.24),INDP_CAT (0.23)
PC-2	PWGTP (0.37),MSP (0.34),WGTP (0.3),DENSITY (0.26),HOUSING_TYPE (0.24)
PC-3	AGEP (0.43),DVET (0.19),OWN_RENT (0.12),PUMA (0.11),PINCP_DECILE (0.1)
PC-4	INDP (0.33),INDP_CAT (0.33),PINCP_DECILE (0.23),POVPIP (0.13),DVET (0.12)

PCA Queries:

The queries below explore the PCA metric results in more detail by zooming in on a single component-pair panel and highlighting all individuals that satisfy a given constraint (such as MSP = 'N', individuals who are unmarried because they are children). If the deidentified data preserves the structure and feature correlations of the target data, the highlighted areas should have similar shape.

MSP_N: Children (AGEP < 15):

Inconsistencies:

Summary:

Inconsistency Group	Number of Records Inconsistent	Percent Records Inconsistent
Age	4464	20.5%
Work	0	0.0%
Housing	5777	26.5%
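The percentages in the summary table appear to be the inconsistent record counts divided by the 21802 deidentified records reported in the Data Description. That denominator is an inference from the numbers, and the short check below reproduces the table under that assumption.

```python
# Counts copied from the summary table above.
deid_records = 21802
inconsistent = {"Age": 4464, "Work": 0, "Housing": 5777}

# Percent of deidentified records with at least one inconsistency per group.
percent = {group: round(100 * n / deid_records, 1) for group, n in inconsistent.items()}
print(percent)
```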

Age-Based Inconsistencies:

These inconsistencies deal with the AGEP feature; records with age-based inconsistencies might have children who are married, or infants with high school diplomas.
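A rule-based check of this kind can be sketched as follows. It covers only the child_* rules named in this section and assumes that a consistent child record carries 'N' (N/A) in each of the listed features; the function name is hypothetical, and the first record reuses values from the example record shown for these rules.

```python
# Features assumed to be 'N' (N/A) for children under 15.
CHILD_NA_FEATURES = ["DVET", "MSP", "PINCP", "PINCP_DECILE", "INDP", "INDP_CAT"]


def child_violations(record):
    """Return the names of the child_* rules that this record breaks."""
    if record["AGEP"] >= 15:
        return []
    return [f"child_{f}" for f in CHILD_NA_FEATURES if record[f] != "N"]


# Values taken from the example record in this report (AGEP = 4).
example = {
    "AGEP": 4, "DVET": "4", "MSP": "4", "PINCP": "26194.95917306499",
    "PINCP_DECILE": "5", "INDP": "7615", "INDP_CAT": "10",
}
# Hypothetical consistent child record for comparison.
consistent = {
    "AGEP": 4, "DVET": "N", "MSP": "N", "PINCP": "N",
    "PINCP_DECILE": "N", "INDP": "N", "INDP_CAT": "N",
}

example_violations = child_violations(example)
consistent_violations = child_violations(consistent)
print(example_violations)
```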

child_DVET: Children (< 15) can't be disabled military veterans:

4464 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
4	2	9	2	2	2	4	3	0	1	7615	10	4	0	5	2	26194.95917306499	5	117	38-00100	262	1	2	131

child_MSP: Children (< 15) can't be married:

4464 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
4	2	9	2	2	2	4	3	0	1	7615	10	4	0	5	2	26194.95917306499	5	117	38-00100	262	1	2	131

child_PINCP: Children (< 15) don't have personal incomes:

4464 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
4	2	9	2	2	2	4	3	0	1	7615	10	4	0	5	2	26194.95917306499	5	117	38-00100	262	1	2	131

child_PINCP_DECILE: Children (< 15) don't have personal incomes:

4464 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
4	2	9	2	2	2	4	3	0	1	7615	10	4	0	5	2	26194.95917306499	5	117	38-00100	262	1	2	131

child_INDP: Children (< 15) don't have work industries:

4464 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
4	2	9	2	2	2	4	3	0	1	7615	10	4	0	5	2	26194.95917306499	5	117	38-00100	262	1	2	131

child_INDP_CAT: Children (< 15) don't have work industries:

4464 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
4	2	9	2	2	2	4	3	0	1	7615	10	4	0	5	2	26194.95917306499	5	117	38-00100	262	1	2	131

child_phd: Children (< 15) don't have PhDs:

11 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
8	2	0	2	2	2	4	12	0	1	3942	13	1	0	3	1	12604.85179915458	6	501	28-01100	97	2	2	87

toddler_DPHY: Toddlers (< 5) naturally toddle, it's not a physical disability:

1222 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
4	2	9	2	2	2	4	3	0	1	7615	10	4	0	5	2	26194.95917306499	5	117	38-00100	262	1	2	131

toddler_DREM: Toddlers (< 5) are naturally forgetful, it's not a cognitive disability:

1222 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
4	2	9	2	2	2	4	3	0	1	7615	10	4	0	5	2	26194.95917306499	5	117	38-00100	262	1	2	131

toddler_diploma: Toddlers (< 5) don't have high school diplomas:

465 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
2	2	0	2	2	2	4	7	0	1	7571	13	4	7	3	1	21139.80314108197	1	390	29-01901	181	6	2	199

infant_EDU: Infants (< 3) aren't in school:

686 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
2	2	0	2	2	2	4	7	0	1	7571	13	4	7	3	1	21139.80314108197	1	390	29-01901	181	6	2	199

Work-Based Inconsistencies:

These inconsistencies deal with the work and finance features; records with work-based inconsistencies might have high incomes while being in poverty, or have conflicts between their industry code and industry category.

Housing-Based Inconsistencies:

These inconsistencies deal with housing and family features; records with household-based inconsistencies might have more children in the house than the total household size, or be residents of group quarters (such as prison inmates) who are listed as owning their residences.

too_many_children: Adults needed: Family size must be at least one greater than number of children:

3200 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
2	2	0	2	2	2	4	7	0	1	7571	13	4	7	3	1	21139.80314108197	1	390	29-01901	181	6	2	199
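The too_many_children rule above reduces to a single comparison between NPF and NOC. A minimal sketch, assuming both features arrive as strings and that 'N' means the rule does not apply; the function name is hypothetical, and the two records reuse the NOC and NPF values from example records in this report.

```python
def too_many_children(record):
    """True when family size is not at least one greater than child count."""
    noc, npf = record["NOC"], record["NPF"]
    if noc == "N" or npf == "N":
        return False  # assumption: N/A values are not checked by this rule
    return int(npf) < int(noc) + 1


violating = {"NOC": "7", "NPF": "3"}   # from the example record above
plausible = {"NOC": "0", "NPF": "5"}   # from the child_DVET example record

is_violation = too_many_children(violating)
is_plausible_violation = too_many_children(plausible)
print(is_violation, is_plausible_violation)
```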

gq_own_jail: Inmates don't own jails, patients don't own hospitals: Group quarters residents aren't owners:

883 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
36	2	7	2	2	2	4	10	4	2	979	9	6	1	5	2	22798.215288480504	0	56	36-03710	270	1	1	89

gq_own_dorm: Students don't own dorms, soldiers don't own barracks: Group quarters residents aren't owners:

450 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
49	2	8	2	1	2	4	4	0	3	7586	10	4	0	3	1	29083.82594914341	5	329	06-07502	76	1	2	57

gq_h_family_NPF: Individuals who live in group quarters aren't considered family households:

1553 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
36	2	7	2	2	2	4	10	4	2	979	9	6	1	5	2	22798.215288480504	0	56	36-03710	270	1	1	89

gq_h_family_NOC: Individuals who live in group quarters aren't considered family households:

1553 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
36	2	7	2	2	2	4	10	4	2	979	9	6	1	5	2	22798.215288480504	0	56	36-03710	270	1	1	89

house_OWN_RENT: Individuals who live in houses must specify if they rent or own:

1464 violations

Example Record:

AGEP	DEAR	DENSITY	DEYE	DPHY	DREM	DVET	EDU	HISP	HOUSING_TYPE	INDP	INDP_CAT	MSP	NOC	NPF	OWN_RENT	PINCP	PINCP_DECILE	POVPIP	PUMA	PWGTP	RAC1P	SEX	WGTP
1	2	0	2	2	2	4	1	0	1	7544	9	1	5	4	0	13373.08526614587	3	199	28-01100	44	2	2	0

K-Marginal Score Breakdown:

In the metrics above we've considered all of the data together; however, we know that algorithms may behave differently on different subgroups in the population. Below we look in more detail at deidentification performance just in the worst performing PUMA, based on k-marginal score.

5 Worst Performing PUMA:

Which are the worst performing PUMA?

PUMA	Score
29-01901: St. Louis City (North)	0
01-01301: Birmingham City (West)	23
36-04010: NYC-Brooklyn Community District 17--East Flatbush, Farragut & Rugby	70
19-01700: Des Moines City	101
51-51255: Alexandria City	110

Record Counts in 5 Worst Performing PUMA:

Did the deidentified versions of these PUMA have similar population totals to the target versions?

Dataset	Record Counts
Target	5837
Deidentified	6086