Univariate Distributions:
Here we provide single-feature distribution comparisons, ordered so that the worst-performing features appear first (ranked by the L1 norm of the density differences between the target and deidentified data).
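As a rough illustration of the ranking described above, the sketch below computes an L1 norm of density differences for each feature and sorts the features worst first. It assumes that "density" means the share of records taking each value; the function names `l1_density_difference` and `rank_features` are hypothetical and are not taken from the report's tooling.

```python
from collections import Counter


def l1_density_difference(target_values, deid_values):
    """Sum of absolute differences between two normalized value distributions.

    Assumption: density = share of records with each value (not raw counts).
    """
    target_counts = Counter(target_values)
    deid_counts = Counter(deid_values)
    n_target = len(target_values)
    n_deid = len(deid_values)
    total = 0.0
    # Visit every value seen in either dataset, including values missing from one.
    for value in set(target_counts) | set(deid_counts):
        p = target_counts[value] / n_target if n_target else 0.0
        q = deid_counts[value] / n_deid if n_deid else 0.0
        total += abs(p - q)
    return total


def rank_features(target_rows, deid_rows, features):
    """Return (feature, score) pairs, largest L1 difference (worst) first."""
    scores = {
        f: l1_density_difference(
            [row[f] for row in target_rows], [row[f] for row in deid_rows]
        )
        for f in features
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Under this definition, a feature value such as N (N/A) that is common in the target data but absent from the deidentified data adds its full target share to the score, which would push that feature toward the top of the list.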
PWGTP: Person's sampling weight:
PINCP: Person's total income in dollars:
Feature Value: N (N/A)
Target Data Counts: 4247
Deidentified Data Counts: 0
PUMA: Public use microdata area code:
WGTP: Housing unit sampling weight:
DENSITY: Population density among residents of each PUMA:
INDP_CAT: Industry categories:
Feature Value: N (N/A)
Target Data Counts: 10929
Deidentified Data Counts: 0
POVPIP: Income-to-poverty ratio (ex: 250 = 2.5 x poverty line):
Feature Value: 501 (Not in poverty: income above 5 x poverty line)
Target Data Counts: 9972
Deidentified Data Counts: 2739
PINCP_DECILE: Person's total income in 10-percentile bins:
EDU: Educational attainment:
DVET: Veteran service connected disability rating (percentage):
Feature Value: N (N/A)
Target Data Counts: 26906
Deidentified Data Counts: 0
NPF: Number of persons in family (unweighted):
NOC: Number of own children in household (unweighted):
Feature Value: 0
Target Data Counts: 16052
Deidentified Data Counts: 14136
DPHY: Ambulatory (walking) difficulty:
Feature Value: 2
Target Data Counts: 23776
Deidentified Data Counts: 20294
DREM: Cognitive difficulty:
Feature Value: 2
Target Data Counts: 24379
Deidentified Data Counts: 20474
OWN_RENT: Housing unit rented or owned:
Feature Value: 0
Target Data Counts: 24400
Deidentified Data Counts: 19206
DEAR: Hearing difficulty:
Feature Value: 2
Target Data Counts: 26282
Deidentified Data Counts: 20806
HOUSING_TYPE: Housing unit or group quarters:
Feature Value: 1
Target Data Counts: 25435
Deidentified Data Counts: 20249
Feature Value: 2
Target Data Counts: 26544
Deidentified Data Counts: 21044
Linear Regression:
Linear regression is a fundamental data analysis technique that condenses a multi-dimensional data distribution down to a one-dimensional (line) representation. It works by finding the line that sits in the 'middle' of the data, in some sense: it minimizes the total distance between the points of the data and the line (in the usual least-squares formulation, the sum of squared vertical distances). There are more advanced forms of regression, but here we focus on the simplest case: we fit a straight line to the data and report the slope and y-intercept of that line.
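The line fit can be sketched as follows. This is a minimal example that assumes ordinary least squares (minimizing squared vertical distances); the report does not state which fitting routine it uses, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Calling `fit_line` once on the target data and once on the deidentified data would give two (slope, intercept) pairs that can be compared directly, as in the per-subgroup results listed below.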
For this metric we're just looking at data from adults (AGEP > 15) and we're only considering the distribution of the data across two features:
- EDU: The highest education level this individual has attained, ranging from 1 (elementary school) to 12 (PhD). See Appendix of this report for the full list of code values.
- PINCP_DECILE: The individual's income decile relative to their PUMA. This helps us account for differences in cost of living across the country. If an individual makes a moderate income but lives in a very low-income area, they may have a high value for PINCP_DECILE, indicating that they have a high income for their PUMA.
The basic idea is that higher values of EDU should lead to higher values of PINCP_DECILE, and this is broadly true. However, it is known that the relationship between EDU and PINCP_DECILE differs across demographic subgroups. The heatmaps in the left column below show the density distribution of the target data for each subgroup, normalized by education category, so the density values in each column sum to 1. When a cell in the heatmap contains too few people (< 20), it is left blank. It's not expected that the deidentified data will match the original distribution precisely. The regression line is drawn in red over the heatmap, so you can see the relationship between the target data distribution and its linear regression analysis. In the right column for each subgroup we show how the deidentified data's regression line compares to the target data's regression line, along with a heatmap of the density differences between the two distributions. Redder areas are where the deidentified data has created too many people; bluer areas are where it has created too few.
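The column normalization and blanking of sparse cells can be sketched as below. This is an illustration under stated assumptions: the input is a list of (EDU, PINCP_DECILE) pairs, each column is normalized over all of its records before sparse cells are blanked, and the name `column_normalized_density` is hypothetical.

```python
from collections import Counter

MIN_CELL_COUNT = 20  # cells with fewer people than this are left blank


def column_normalized_density(pairs, min_count=MIN_CELL_COUNT):
    """Map each (edu, decile) cell to its share of that EDU column.

    Blanked cells (count < min_count) are returned as None.
    Assumption: shares are computed over the full column, so the visible
    values in a column can sum to less than 1 once sparse cells are blanked.
    """
    cell_counts = Counter(pairs)
    column_totals = Counter(edu for edu, _ in pairs)
    density = {}
    for (edu, decile), count in cell_counts.items():
        if count < min_count:
            density[(edu, decile)] = None
        else:
            density[(edu, decile)] = count / column_totals[edu]
    return density
```

A difference heatmap like the one in the right column could then be formed by subtracting the target density from the deidentified density cell by cell, skipping cells that are blank in either dataset.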
We've broken this metric down into demographic subgroups so we can see not only how well the privacy techniques preserve the overall relationship between these features, but also whether they preserve how that overall relationship is built up from the different relationships that hold at each major demographic subgroup. It's important that deidentification techniques preserve these distinct subgroup patterns for analysis.
Total Population:
Target Data:
23006 records, 100.0% of adult (>15) data
Regression: 0.63 slope, -0.1 intercept
Deidentified Data:
21802 records, 100.0% of adult (>15) data
Regression: 0.32 slope, 2.38 intercept
White Men:
Target Data:
6463 records, 28.09% of adult (>15) data
Regression: 0.68 slope, 0.39 intercept
Deidentified Data:
5289 records, 24.26% of adult (>15) data
Regression: 0.36 slope, 2.41 intercept
White Women:
Target Data:
6505 records, 28.28% of adult (>15) data
Regression: 0.66 slope, -0.6 intercept
Deidentified Data:
6133 records, 28.13% of adult (>15) data
Regression: 0.35 slope, 2.6 intercept
Black Men:
Target Data:
2720 records, 11.82% of adult (>15) data
Regression: 0.52 slope, 0.45 intercept
Deidentified Data:
2646 records, 12.14% of adult (>15) data
Regression: 0.25 slope, 2.35 intercept
Black Women:
Target Data:
3366 records, 14.63% of adult (>15) data
Regression: 0.51 slope, 0.3 intercept
Deidentified Data:
3323 records, 15.24% of adult (>15) data
Regression: 0.25 slope, 2.24 intercept
Asian Men:
Target Data:
914 records, 3.97% of adult (>15) data
Regression: 0.7 slope, -0.68 intercept
Deidentified Data:
643 records, 2.95% of adult (>15) data
Regression: 0.3 slope, 2.58 intercept
Asian Women:
Target Data:
982 records, 4.27% of adult (>15) data
Regression: 0.55 slope, -0.19 intercept
Deidentified Data:
782 records, 3.59% of adult (>15) data
Regression: 0.3 slope, 2.43 intercept
American Indian, Alaskan Native and Native Hawaiians (AIANNH) Men:
Target Data:
376 records, 1.63% of adult (>15) data
Regression: 0.42 slope, 1.18 intercept
Deidentified Data:
675 records, 3.1% of adult (>15) data
Regression: 0.26 slope, 2.51 intercept
American Indian, Alaskan Native and Native Hawaiians (AIANNH) Women:
Target Data:
395 records, 1.72% of adult (>15) data
Regression: 0.54 slope, -0.19 intercept
Deidentified Data:
909 records, 4.17% of adult (>15) data
Regression: 0.14 slope, 3.1 intercept
PCA:
This is another approach for visualizing where the distribution of the deidentified data has shifted away from the target data. In this approach, we begin by using Principal Component Analysis (PCA) to find a way of representing the target data in a lower-dimensional space (in 5 dimensions rather than the full 22 dimensions of the original feature space). Descriptions of these five new dimensions (components) are given in the components table; the components will change depending on which target data set you're using. Five dimensions are better than 22, but we actually want to get down to two dimensions so we can plot the data on simple (x, y) axes. The plots below show the data across each possible pair of our five components. You can compare how the shapes change between the target data and the deidentified data, and consider what that might mean in light of the component definitions. This is a relatively new visualization metric that was introduced by the IPUMS International team during the HLG-MOS Synthetic Data Test Drive.
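To make the procedure concrete, here is a small standard-library sketch of PCA: components are fitted on the target data only, and any dataset can then be projected onto those same components. It uses power iteration with deflation, which is one textbook way to obtain principal components and is not necessarily what the report's tooling does (a numerical library routine would normally be used instead). The names `fit_pca` and `project` are hypothetical, and the sketch assumes every feature has already been encoded as a number.

```python
import math
import random


def fit_pca(rows, n_components, iterations=500, seed=0):
    """Fit principal components on `rows` (lists of numbers).

    Returns (means, components); each component is a unit-length direction.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        # Random positive start vector, normalized to unit length.
        v = [rng.random() + 0.1 for _ in range(d)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left to explain
            v = [x / norm for x in w]
        # Rayleigh quotient: the variance captured along v.
        cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        eigenvalue = sum(v[i] * cv[i] for i in range(d))
        components.append(v)
        # Deflate: remove this direction so the next pass finds the next one.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return means, components


def project(rows, means, components):
    """Project rows onto the fitted components (one coordinate per component)."""
    d = len(means)
    return [[sum((r[j] - means[j]) * c[j] for j in range(d)) for c in components]
            for r in rows]
```

With five components there are 10 distinct component pairs, so projecting both the target and the deidentified records with the same `means` and `components` would give the coordinates for 10 scatter-plot panels per dataset.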
Contribution of Features in Each Principal Component:
Principal Component | Features Contribution: feature-name (contribution ratio)
PC-0 | PINCP (0.38), NOC (0.27), NPF (0.24), WGTP (0.08), OWN_RENT (0.07)
PC-1 | OWN_RENT (0.43), WGTP (0.42), PWGTP (0.36), INDP (0.24), INDP_CAT (0.23)
PC-2 | PWGTP (0.37), MSP (0.34), WGTP (0.3), DENSITY (0.26), HOUSING_TYPE (0.24)
PC-3 | AGEP (0.43), DVET (0.19), OWN_RENT (0.12), PUMA (0.11), PINCP_DECILE (0.1)
PC-4 | INDP (0.33), INDP_CAT (0.33), PINCP_DECILE (0.23), POVPIP (0.13), DVET (0.12)
The queries below explore the PCA metric results in more detail by zooming in on a single component-pair panel and highlighting all individuals that satisfy a given constraint (such as MSP = “N”, individuals who are unmarried because they are children). If the deidentified data preserves the structure and feature correlations of the target data, the highlighted areas should have similar shape.
MSP_N: Children (AGEP < 15):
Inconsistencies:
Summary:
Inconsistency Group | Number of Records Inconsistent | Percent Records Inconsistent
Age | 4464 | 20.5%
Work | 0 | 0.0%
Housing | 5777 | 26.5%
Age-Based Inconsistencies:
These inconsistencies deal with the AGEP feature; records with age-based inconsistencies might have children who are married, or infants with high school diplomas.
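The child rules listed below can be pictured as simple per-record checks. The sketch assumes that each of these features should hold the value N (N/A) for anyone under 15, which matches the N (N/A) feature value shown elsewhere in this report for PINCP, INDP_CAT and DVET; extending that to MSP, PINCP_DECILE and INDP is an assumption, and `child_violations` is a hypothetical helper rather than the report's own implementation.

```python
CHILD_AGE = 15  # "Children (< 15)" in the rule descriptions

# Features assumed to be "N" (not applicable) for children.
CHILD_NA_FEATURES = ["DVET", "MSP", "PINCP", "PINCP_DECILE", "INDP", "INDP_CAT"]


def child_violations(record):
    """Return the names of the child_* rules that one record violates.

    `record` is a dict mapping feature name to value.
    """
    if int(record["AGEP"]) >= CHILD_AGE:
        return []  # the rules only apply to children
    return [
        "child_" + feature
        for feature in CHILD_NA_FEATURES
        if str(record[feature]) != "N"
    ]
```

For example, a record with AGEP = 4 that also carries a DVET rating, a marital status, an income and an industry code would be flagged by all six checks, which is why the same example record can appear under several rules.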
child_DVET: Children (< 15) can't be disabled military veterans:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
4 | 2 | 9 | 2 | 2 | 2 | 4 | 3 | 0 | 1 | 7615 | 10 | 4 | 0 | 5 | 2 | 26194.95917306499 | 5 | 117 | 38-00100 | 262 | 1 | 2 | 131
child_MSP: Children (< 15) can't be married:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
4 | 2 | 9 | 2 | 2 | 2 | 4 | 3 | 0 | 1 | 7615 | 10 | 4 | 0 | 5 | 2 | 26194.95917306499 | 5 | 117 | 38-00100 | 262 | 1 | 2 | 131
child_PINCP: Children (< 15) don't have personal incomes:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
4 | 2 | 9 | 2 | 2 | 2 | 4 | 3 | 0 | 1 | 7615 | 10 | 4 | 0 | 5 | 2 | 26194.95917306499 | 5 | 117 | 38-00100 | 262 | 1 | 2 | 131
child_PINCP_DECILE: Children (< 15) don't have personal incomes:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
4 | 2 | 9 | 2 | 2 | 2 | 4 | 3 | 0 | 1 | 7615 | 10 | 4 | 0 | 5 | 2 | 26194.95917306499 | 5 | 117 | 38-00100 | 262 | 1 | 2 | 131
child_INDP: Children (< 15) don't have work industries:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
4 | 2 | 9 | 2 | 2 | 2 | 4 | 3 | 0 | 1 | 7615 | 10 | 4 | 0 | 5 | 2 | 26194.95917306499 | 5 | 117 | 38-00100 | 262 | 1 | 2 | 131
child_INDP_CAT: Children (< 15) don't have work industries:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
4 | 2 | 9 | 2 | 2 | 2 | 4 | 3 | 0 | 1 | 7615 | 10 | 4 | 0 | 5 | 2 | 26194.95917306499 | 5 | 117 | 38-00100 | 262 | 1 | 2 | 131
child_phd: Children (< 15) don't have PhDs:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
8 | 2 | 0 | 2 | 2 | 2 | 4 | 12 | 0 | 1 | 3942 | 13 | 1 | 0 | 3 | 1 | 12604.85179915458 | 6 | 501 | 28-01100 | 97 | 2 | 2 | 87
toddler_DPHY: Toddlers (< 5) naturally toddle, it's not a physical disability:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
4 | 2 | 9 | 2 | 2 | 2 | 4 | 3 | 0 | 1 | 7615 | 10 | 4 | 0 | 5 | 2 | 26194.95917306499 | 5 | 117 | 38-00100 | 262 | 1 | 2 | 131
toddler_DREM: Toddlers (< 5) are naturally forgetful, it's not a cognitive disability:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
4 | 2 | 9 | 2 | 2 | 2 | 4 | 3 | 0 | 1 | 7615 | 10 | 4 | 0 | 5 | 2 | 26194.95917306499 | 5 | 117 | 38-00100 | 262 | 1 | 2 | 131
toddler_diploma: Toddlers (< 5) don't have high school diplomas:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
2 | 2 | 0 | 2 | 2 | 2 | 4 | 7 | 0 | 1 | 7571 | 13 | 4 | 7 | 3 | 1 | 21139.80314108197 | 1 | 390 | 29-01901 | 181 | 6 | 2 | 199
infant_EDU: Infants (< 3) aren't in school:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
2 | 2 | 0 | 2 | 2 | 2 | 4 | 7 | 0 | 1 | 7571 | 13 | 4 | 7 | 3 | 1 | 21139.80314108197 | 1 | 390 | 29-01901 | 181 | 6 | 2 | 199
Work-Based Inconsistencies:
These inconsistencies deal with the work and finance features; records with work-based inconsistencies might have high incomes while being in poverty, or have conflicts between their industry code and industry category.
Housing-Based Inconsistencies:
These inconsistencies deal with housing and family features; records with household-based inconsistencies might have more children in the house than the total household size, or be residents of group quarters (such as prison inmates) who are listed as owning their residences.
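As one concrete case, the too_many_children rule below ("family size must be at least one greater than number of children") reduces to comparing NOC with NPF, and the per-group figures in the Summary table are a count and a percentage of flagged records. The sketch assumes non-numeric values such as N mean the rule does not apply; `too_many_children` and `summarize` are hypothetical names, not the report's own code.

```python
def too_many_children(record):
    """True when the family has no room left for an adult (NOC >= NPF)."""
    try:
        children = int(record["NOC"])
        family_size = int(record["NPF"])
    except (TypeError, ValueError):
        return False  # assumption: "N" or missing values are not checked
    return children >= family_size


def summarize(records, check):
    """Return (number of flagged records, percent of all records flagged)."""
    flagged = sum(1 for record in records if check(record))
    percent = 100.0 * flagged / len(records) if records else 0.0
    return flagged, percent
```

Note that a group-level count such as the Housing row in the Summary table counts each record once even if it breaks several rules in the group, so a faithful group summary would need a check that returns true when any rule in the group fires.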
too_many_children: Adults needed: Family size must be at least one greater than number of children:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
2 | 2 | 0 | 2 | 2 | 2 | 4 | 7 | 0 | 1 | 7571 | 13 | 4 | 7 | 3 | 1 | 21139.80314108197 | 1 | 390 | 29-01901 | 181 | 6 | 2 | 199
gq_own_jail: Inmates don't own jails, patients don't own hospitals: Group quarters residents aren't owners:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
36 | 2 | 7 | 2 | 2 | 2 | 4 | 10 | 4 | 2 | 979 | 9 | 6 | 1 | 5 | 2 | 22798.215288480504 | 0 | 56 | 36-03710 | 270 | 1 | 1 | 89
gq_own_dorm: Students don't own dorms, soldiers don't own barracks: Group quarters residents aren't owners:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
49 | 2 | 8 | 2 | 1 | 2 | 4 | 4 | 0 | 3 | 7586 | 10 | 4 | 0 | 3 | 1 | 29083.82594914341 | 5 | 329 | 06-07502 | 76 | 1 | 2 | 57
gq_h_family_NPF: Individuals who live in group quarters aren't considered family households:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
36 | 2 | 7 | 2 | 2 | 2 | 4 | 10 | 4 | 2 | 979 | 9 | 6 | 1 | 5 | 2 | 22798.215288480504 | 0 | 56 | 36-03710 | 270 | 1 | 1 | 89
gq_h_family_NOC: Individuals who live in group quarters aren't considered family households:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
36 | 2 | 7 | 2 | 2 | 2 | 4 | 10 | 4 | 2 | 979 | 9 | 6 | 1 | 5 | 2 | 22798.215288480504 | 0 | 56 | 36-03710 | 270 | 1 | 1 | 89
house_OWN_RENT: Individuals who live in houses must specify if they rent or own:
AGEP | DEAR | DENSITY | DEYE | DPHY | DREM | DVET | EDU | HISP | HOUSING_TYPE | INDP | INDP_CAT | MSP | NOC | NPF | OWN_RENT | PINCP | PINCP_DECILE | POVPIP | PUMA | PWGTP | RAC1P | SEX | WGTP
1 | 2 | 0 | 2 | 2 | 2 | 4 | 1 | 0 | 1 | 7544 | 9 | 1 | 5 | 4 | 0 | 13373.08526614587 | 3 | 199 | 28-01100 | 44 | 2 | 2 | 0
K-Marginal Score Breakdown:
In the metrics above we’ve considered all of the data together; however we know that algorithms may behave differently on different subgroups in the population. Below we look in more detail at deidentification performance just in the worst performing PUMA, based on k-marginal score.
5 Worst Performing PUMA:
Which are the worst performing PUMA?
Record Counts in 5 Worst Performing PUMA:
Did the deidentified versions of these PUMA have similar population totals to the target versions?
Dataset | Record Counts
Target | 5837
Deidentified | 6086
Univariate Distribution of Worst Performing Features in 5 Worst Performing PUMA:
Which features are performing the worst in each of these PUMA?
PWGTP: Person's sampling weight:
WGTP: Housing unit sampling weight:
PINCP: Person's total income in dollars:
Feature Value: N (N/A)
Target Data Counts: 880
Deidentified Data Counts: 0
DENSITY: Population density among residents of each PUMA:
INDP_CAT: Industry categories:
Feature Value: N (N/A)
Target Data Counts: 2343
Deidentified Data Counts: 0
EDU: Educational attainment:
PINCP_DECILE: Person's total income in 10-percentile bins:
POVPIP: Income-to-poverty ratio (ex: 250 = 2.5 x poverty line):
Feature Value: 501 (Not in poverty: income above 5 x poverty line)
Target Data Counts: 1862
Deidentified Data Counts: 810
DVET: Veteran service connected disability rating (percentage):
Feature Value: N (N/A)
Target Data Counts: 5731
Deidentified Data Counts: 0
NPF: Number of persons in family (unweighted):
PUMA: Public use microdata area code:
NOC: Number of own children in household (unweighted):
Feature Value: 0
Target Data Counts: 3577
Deidentified Data Counts: 3963
OWN_RENT: Housing unit rented or owned:
DPHY: Ambulatory (walking) difficulty:
Feature Value: 2
Target Data Counts: 5015
Deidentified Data Counts: 5664
DREM: Cognitive difficulty:
Feature Value: 2
Target Data Counts: 5126
Deidentified Data Counts: 5697
Feature Value: 0
Target Data Counts: 5383
Deidentified Data Counts: 5418
DEAR: Hearing difficulty:
Feature Value: 2
Target Data Counts: 5635
Deidentified Data Counts: 5834
HOUSING_TYPE: Housing unit or group quarters:
Feature Value: 1
Target Data Counts: 5475
Deidentified Data Counts: 5698
Feature Value: 2
Target Data Counts: 5646
Deidentified Data Counts: 5897
Pearson Correlation Coefficient Difference in 5 Worst Performing PUMA:
How are feature correlations performing in each of these PUMA?