This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
As one of the several effective solutions for personal privacy protection, a global unique identifier (GUID) is linked with hash codes that are generated from combinations of personally identifiable information (PII) by a one-way hash algorithm. On the GUID server, no PII is permitted to be stored, and only GUID and hash codes are allowed. The quality of PII entry is critical to the GUID system.
The goal of our study was to explore a method of checking questionable entry of PII in this context without using or sending any portion of PII while registering a subject.
Following the principle of the GUID system, all possible combination patterns of PII fields were analyzed and used to generate hash codes, which were stored on the GUID server. Based on the matching rules of the GUID system, an error-checking algorithm was developed using set theory to check PII entry errors. We selected 200,000 simulated individuals with randomly planted errors to evaluate the proposed algorithm. These errors were placed in the required or optional PII fields. The performance of the proposed algorithm was also tested in the subject registration system.
Of the 127,700 error-planted subjects, 114,464 (89.64%) could still be identified as previously registered subjects, and the remaining 13,236 (10.36%) were classified as new subjects. As expected, 100% of the nonidentified subjects had errors within the required PII fields. The probability that a subject is identified depends on the count and type of incorrect PII fields. For all identified subjects, the errors could be found by the proposed algorithm. The scope of questionable PII fields also depends on the count and type of incorrect PII fields. In the best case, the exact incorrect PII fields are found; in the worst case, the questionable scope is narrowed only to a set of 13 PII fields. In application, the proposed algorithm gives a hint of questionable PII entry and performs as an effective tool.
The GUID system has high error tolerance and may correctly identify and associate a subject even with a few PII field errors. Correct data entry, especially for required PII fields, is critical to avoiding false splits. In the context of one-way hash transformation, questionable PII input may be identified by applying set theory operators to the hash codes. The count and type of incorrect PII fields play an important role in identifying a subject and locating questionable PII fields.
To accelerate biomedical discovery, it is critical for researchers to collaborate, especially to share their study data with each other. After announcing the Big Data Research and Development Initiative to explore how big data could be used to address important problems faced by the government in 2012, Obama’s administration proposed Precision Medicine Initiative [
There are various methods to protect a patient’s privacy, including data anonymization [
For the GUID system [
Before exploring the analysis of questionable data input while registering a subject in the GUID system, it is necessary to review the principle of the system.
The GUID system [
Each PII field is programmatically normalized to have only uppercase letters and numbers, no spaces, and no punctuation. For each subject, these PII fields are combined with 5 patterns (
Personally identifiable information (PII) fields used in global unique identifier (GUID) system.
Type | Name | Meaning
Required | FN | Complete legal given (first) name at birth
Required | LN | Complete legal family (last) name at birth
Required | MN | Complete legal additional (middle) name
Required | SEX | Physical sex at birth (male or female)
Required | COB | Country of government issued or national ID
Required | DOB | Day of birth
Required | MOB | Month of birth
Required | YOB | Year of birth
Optional | GIID | Government issued or national ID
Optional | MFN | Mother’s complete legal given (first) name at her birth
Optional | MLN | Mother’s complete legal family (last) name at her birth
Optional | FFN | Father’s complete legal given (first) name at his birth
Optional | FLN | Father’s complete legal family (last) name at his birth
Optional | MDOB | Mother’s day of birth
Optional | MMOB | Mother’s month of birth
Optional | FDOB | Father’s day of birth
Optional | FMOB | Father’s month of birth
Personally identifiable information (PII) combination patterns for hash codes.
Hash code | Combination pattern |
1 | YOB + DOB + SEX + GIIDa |
2 | FN + MN + LN + COB + DOB + MOB |
3 | FN + YOB + MFNa + MLNa + FFNa + FLNa |
4 | FN + LN + COB + SEX + MDOBa + MMOBa + FDOBa + FMOBa |
5 | FN + MN + MOB + MFNa + FFNa + MLNa |
aThe field is optional.
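The normalization rule and the combination patterns above can be sketched in a few lines of Python. This is a minimal illustration, not the GUID system's code: the names `PATTERNS`, `normalize`, and `combine` are hypothetical, the sketch assumes ASCII input, and the concatenation order simply follows the table because the actual order is not stated here.

```python
import re

# Combination patterns from the table above: (required fields, optional fields).
PATTERNS = {
    1: (["YOB", "DOB", "SEX"], ["GIID"]),
    2: (["FN", "MN", "LN", "COB", "DOB", "MOB"], []),
    3: (["FN", "YOB"], ["MFN", "MLN", "FFN", "FLN"]),
    4: (["FN", "LN", "COB", "SEX"], ["MDOB", "MMOB", "FDOB", "FMOB"]),
    5: (["FN", "MN", "MOB"], ["MFN", "FFN", "MLN"]),
}


def normalize(value):
    """Keep only uppercase letters and digits (no spaces, no punctuation)."""
    return re.sub(r"[^A-Z0-9]", "", value.upper())


def combine(pii, code):
    """Return (concatenated normalized fields, count of missing optional fields)."""
    required, optional = PATTERNS[code]
    parts = [normalize(pii[name]) for name in required]
    missing = 0
    for name in optional:
        value = normalize(pii.get(name) or "")
        if value:
            parts.append(value)
        else:
            missing += 1  # assumption: a missing optional field is skipped and counted
    return "".join(parts), missing


pii = {"YOB": "1980", "DOB": "07", "SEX": "f", "GIID": "ab-123 456"}
print(combine(pii, 1))  # ('198007FAB123456', 0)
```

Dropping `GIID` from the input would give `('198007F', 1)`, which is the situation the missing-field count is meant to record.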
As part of the GUID system, each hash code consists of a 64-byte hash value, computed from a PII combination pattern using a one-way hash algorithm, plus 1 additional byte that holds the count of missing PII fields in the hash code (
The GUID system has 3 types of hash codes: perfect, good, and bad. For each hash code, 2 parameters are used to determine its type: a lower threshold (L) and an upper threshold (U) (
Thresholds of missing fields to determine type of hash code.
Parameters | Hash code 1 | Hash code 2 | Hash code 3 | Hash code 4 | Hash code 5 |
Lower threshold | 0 | 1 | 1 | 1 | 1 |
Upper threshold | 1 | 2 | 3 | 3 | 3 |
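As a rough illustration of the hash code layout and the threshold table, the sketch below builds a 65-byte hash code and classifies it by its count of missing fields. Two points are assumptions rather than statements from the article: SHA-512 is used only because it yields a 64-byte digest (the actual one-way algorithm is not named here), and the rule "missing ≤ L is perfect, missing ≤ U is good, otherwise bad" is inferred from the thresholds and the table of probable combinations. The function names are hypothetical.

```python
import hashlib

# Lower (L) and upper (U) thresholds of missing fields, from the table above.
THRESHOLDS = {1: (0, 1), 2: (1, 2), 3: (1, 3), 4: (1, 3), 5: (1, 3)}


def make_hash_code(combined, missing):
    """64-byte digest of the combined PII string plus 1 byte for the missing count."""
    digest = hashlib.sha512(combined.encode("utf-8")).digest()  # assumed algorithm
    return digest + bytes([missing])


def hash_code_type(code, missing):
    """Classify a hash code as perfect, good, or bad (inferred rule)."""
    lower, upper = THRESHOLDS[code]
    if missing <= lower:
        return "perfect"
    if missing <= upper:
        return "good"
    return "bad"


code_1 = make_hash_code("198007F", 1)   # hash code 1 with GIID missing
print(len(code_1), code_1[-1])          # 65 1
print(hash_code_type(1, 1))             # good
print(hash_code_type(3, 1))             # perfect
```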
Once PII is inputted while registering a subject, the system will calculate the count of perfect matches or good matches. In turn, it will determine if there exists a matched subject based on matched hash codes. There are 3 parameters to determine if a subject is matched: threshold for a perfect match (P), threshold for a good match (G), and threshold for a mixed match (X). Two subjects match each other when the count of perfect matches ≥ P, or the count of good matches ≥ G, or the sum of the count of perfect matches and good matches ≥ X. In this system, the thresholds are set to
Components of hash code.
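The subject-matching rule described above reduces to a single predicate. The sketch below takes the thresholds P, G, and X as parameters; the function name is hypothetical, and the values in the usage line are placeholders for illustration only, not the system's settings.

```python
def subjects_match(n_perfect, n_good, p, g, x):
    """True when perfect matches >= P, or good matches >= G, or their sum >= X."""
    return n_perfect >= p or n_good >= g or (n_perfect + n_good) >= x


# Placeholder thresholds, chosen only to exercise the mixed-match branch.
print(subjects_match(n_perfect=2, n_good=1, p=3, g=4, x=3))  # True
```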
Hash codes are generated from combinations of PII fields in the GUID system, so each one can be considered a set of transformed PII fields. In addition, PII fields overlap across different hash codes. Therefore, set theory may be used to systematically validate questionable PII fields: once a hash code is matched, its corresponding PII fields can be eliminated from the questionable PII fields by set operations. Because missing values of optional PII fields are permitted, all probable combination patterns of PII fields for perfect or good hash codes must be analyzed first; the algorithm for checking questionable PII input can then be designed.
According to the principle of the GUID system, there are 3 types of hash codes and a subject is identified only with perfect or good hash codes. Missing fields may affect the match of a hash code. While registering a subject, if missing fields are considered, some improper mismatching will be avoided. For example, hash code 4 from
Each hash code is generated from different combination patterns of PII fields, which are optional or required. Based on the combination patterns, the match rule of hash code and the type of PII fields, all probable perfect or good hash codes of the GUID system can be analyzed and identified (
Probable personally identifiable information (PII) combinations for hash codes with different matching types.
Index | Hash code | PII fields present in the combination | Missed fields | Type of hash code
1 | 1 | GIID, SEX, DOB, YOB | None | Perfect
2 | 1 | SEX, DOB, YOB | GIID | Good
3 | 2 | FN, LN, MN, DOB, MOB, COB | None | Perfect
4 | 3 | MFN, MLN, FFN, FLN, FN, YOB | None | Perfect
5 | 3 | MLN, FFN, FLN, FN, YOB | MFN | Perfect
6 | 3 | MFN, FFN, FLN, FN, YOB | MLN | Perfect
7 | 3 | MFN, MLN, FLN, FN, YOB | FFN | Perfect
8 | 3 | MFN, MLN, FFN, FN, YOB | FLN | Perfect
9 | 3 | FFN, FLN, FN, YOB | MFN, MLN | Good
10 | 3 | MLN, FLN, FN, YOB | MFN, FFN | Good
11 | 3 | MLN, FFN, FN, YOB | MFN, FLN | Good
12 | 3 | MFN, FLN, FN, YOB | MLN, FFN | Good
13 | 3 | MFN, FFN, FN, YOB | MLN, FLN | Good
14 | 3 | MFN, MLN, FN, YOB | FFN, FLN | Good
15 | 3 | FLN, FN, YOB | MFN, MLN, FFN | Good
16 | 3 | MFN, FN, YOB | MLN, FFN, FLN | Good
17 | 3 | MLN, FN, YOB | MFN, FFN, FLN | Good
18 | 3 | FFN, FN, YOB | MFN, MLN, FLN | Good
19 | 4 | MDOB, MMOB, FDOB, FMOB, FN, LN, SEX, COB | None | Perfect
20 | 4 | MMOB, FDOB, FMOB, FN, LN, SEX, COB | MDOB | Perfect
21 | 4 | MDOB, FDOB, FMOB, FN, LN, SEX, COB | MMOB | Perfect
22 | 4 | MDOB, MMOB, FMOB, FN, LN, SEX, COB | FDOB | Perfect
23 | 4 | MDOB, MMOB, FDOB, FN, LN, SEX, COB | FMOB | Perfect
24 | 4 | FDOB, FMOB, FN, LN, SEX, COB | MDOB, MMOB | Good
25 | 4 | MMOB, FMOB, FN, LN, SEX, COB | MDOB, FDOB | Good
26 | 4 | MMOB, FDOB, FN, LN, SEX, COB | MDOB, FMOB | Good
27 | 4 | MDOB, FMOB, FN, LN, SEX, COB | MMOB, FDOB | Good
28 | 4 | MDOB, FDOB, FN, LN, SEX, COB | MMOB, FMOB | Good
29 | 4 | MDOB, MMOB, FN, LN, SEX, COB | FDOB, FMOB | Good
30 | 4 | FMOB, FN, LN, SEX, COB | MDOB, MMOB, FDOB | Good
31 | 4 | FDOB, FN, LN, SEX, COB | MDOB, MMOB, FMOB | Good
32 | 4 | MMOB, FN, LN, SEX, COB | MDOB, FDOB, FMOB | Good
33 | 4 | MDOB, FN, LN, SEX, COB | MMOB, FDOB, FMOB | Good
34 | 5 | FN, MN, MFN, FFN, MLN, MOB | None | Perfect
35 | 5 | FN, MN, FFN, MLN, MOB | MFN | Perfect
36 | 5 | FN, MN, MFN, MLN, MOB | FFN | Perfect
37 | 5 | FN, MN, MFN, FFN, MOB | MLN | Perfect
38 | 5 | FN, MN, MFN, MOB | FFN, MLN | Good
39 | 5 | FN, MN, MLN, MOB | MFN, FFN | Good
40 | 5 | FN, MN, FFN, MOB | MFN, MLN | Good
41 | 5 | FN, MN, MOB | MFN, FFN, MLN | Good
Missed fields are optional fields that may be missing when PII is collected.
An example for match among hash codes.
The count of probable perfect or good hash codes.
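The count of probable perfect or good hash codes can be reproduced by enumerating, for each hash code, every subset of optional fields whose size does not exceed the upper threshold. The sketch below is illustrative: the names are hypothetical, and the perfect/good rule (missing ≤ L is perfect, otherwise good up to U) is inferred from the threshold table rather than quoted from the system.

```python
from itertools import combinations

# Optional fields per hash code and (L, U) thresholds, from the tables above.
OPTIONAL = {
    1: ["GIID"],
    2: [],
    3: ["MFN", "MLN", "FFN", "FLN"],
    4: ["MDOB", "MMOB", "FDOB", "FMOB"],
    5: ["MFN", "FFN", "MLN"],
}
THRESHOLDS = {1: (0, 1), 2: (1, 2), 3: (1, 3), 4: (1, 3), 5: (1, 3)}


def probable_patterns():
    """List (hash code, missed optional fields, type) for every perfect or good pattern."""
    rows = []
    for code, optional in OPTIONAL.items():
        lower, upper = THRESHOLDS[code]
        for k in range(min(upper, len(optional)) + 1):
            for missed in combinations(optional, k):
                kind = "perfect" if k <= lower else "good"
                rows.append((code, missed, kind))
    return rows


rows = probable_patterns()
per_code = {code: sum(1 for r in rows if r[0] == code) for code in OPTIONAL}
print(per_code)   # {1: 2, 2: 1, 3: 15, 4: 15, 5: 8}
print(len(rows))  # 41
```

The per-code counts (2, 1, 15, 15, 8) agree with the 41 rows of the table of probable combinations.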
Set theory is one of the most important theories of information processing. A set is a collection of objects, and its basic operations include difference (subtraction), union, and intersection. Set difference is a natural way to eliminate elements from a collection. Since a hash code is transformed from a combination of PII fields, it corresponds to a set of PII fields. Once it matches one of the hash codes of an identified subject, the corresponding PII fields must also match, and those fields can be considered validated. Using set theory together with the matching rules for hash codes and subjects in the GUID system, some PII input errors can therefore be located. For example, assume that while registering a subject, the PII fields for hash codes 3, 4, and 5 have no missing values and those hash codes match perfectly with the corresponding hash codes of the identified subject on the server, whereas hash codes 1 and 2 do not match. According to the subject-matching rules, it may be deduced that the subject has already been registered in the system. The PII fields related to hash codes 3, 4, and 5 can then be eliminated from the questionable PII fields. That is,
{GIID, FN, LN, MN, DOB, MOB, YOB, SEX, COB, MFN, MLN, FFN, FLN, MDOB, MMOB, FDOB, FMOB}
\ ( {FN, YOB, MFN, MLN, FFN, FLN} //PII related to hash code 3
∪ {FN, LN, MDOB, MMOB, FDOB, FMOB, COB, SEX} //PII related to hash code 4
∪ {FN, MN, MOB, MFN, FFN, MLN} ) //PII related to hash code 5
= {GIID, DOB}
Therefore, the result suggests that questionable fields may be located at PII fields DOB and GIID (
{GIID, FN, LN, MN, DOB, MOB, YOB, SEX, COB, MFN, MLN, FFN, FLN, MDOB, MMOB, FDOB, FMOB}
\ ( {FN, YOB, MFN} ∪ {FN, MN, MOB, MFN} ) //PII related to hash codes 3 and 5
= {GIID, LN, SEX, COB, DOB, MLN, FFN, FLN, MDOB, MMOB, FDOB, FMOB}
It may be deduced that a data entry error exists within the PII fields GIID, LN, SEX, COB, DOB, MLN, FFN, FLN, MDOB, MMOB, FDOB, and FMOB.
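Both set-difference examples above can be checked directly with Python's built-in `set` type. This is a minimal sketch; the helper name `questionable` and the constant names are hypothetical.

```python
ALL_FIELDS = {
    "GIID", "FN", "LN", "MN", "DOB", "MOB", "YOB", "SEX", "COB",
    "MFN", "MLN", "FFN", "FLN", "MDOB", "MMOB", "FDOB", "FMOB",
}

# PII fields related to hash codes 3, 4, and 5 when no field is missing.
HC3 = {"FN", "YOB", "MFN", "MLN", "FFN", "FLN"}
HC4 = {"FN", "LN", "MDOB", "MMOB", "FDOB", "FMOB", "COB", "SEX"}
HC5 = {"FN", "MN", "MOB", "MFN", "FFN", "MLN"}


def questionable(entered, matched_field_sets):
    """Entered fields minus the union of fields validated by matched hash codes."""
    validated = set().union(*matched_field_sets)
    return entered - validated


# First example: hash codes 3, 4, and 5 match perfectly.
first = questionable(ALL_FIELDS, [HC3, HC4, HC5])
print(sorted(first))  # ['DOB', 'GIID']

# Second example: hash codes 3 and 5 match with reduced field sets.
second = questionable(ALL_FIELDS, [{"FN", "YOB", "MFN"}, {"FN", "MN", "MOB", "MFN"}])
print(len(second))    # 12
```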
Based on set theory and the principle of the GUID system, the algorithm for checking questionable PII fields while registering subjects can be described as follows.
Step 1 Input PII of subject
Step 2 Generate all probable perfect or good hash codes
…
Step 3 Find matched subjects,
Step 4 If count of
else if
else
Find hash codes in
and get their set of PII fields,
Step 5 Calculate union
Step 6 Calculate subtraction between
Step 7 Return remaining PII fields
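The steps above can be sketched end to end under several stated assumptions. The sketch assumes that exactly one matched subject has already been identified (the threshold test of Step 4 is not reproduced), that the server keeps one hash code per pattern for that subject, and that SHA-512 over a labeled string stands in for the system's one-way hash. All function names and the sample PII values are hypothetical.

```python
import hashlib
import re
from itertools import combinations

PATTERNS = {  # hash code -> (required fields, optional fields)
    1: (("YOB", "DOB", "SEX"), ("GIID",)),
    2: (("FN", "MN", "LN", "COB", "DOB", "MOB"), ()),
    3: (("FN", "YOB"), ("MFN", "MLN", "FFN", "FLN")),
    4: (("FN", "LN", "COB", "SEX"), ("MDOB", "MMOB", "FDOB", "FMOB")),
    5: (("FN", "MN", "MOB"), ("MFN", "FFN", "MLN")),
}
UPPER = {1: 1, 2: 2, 3: 3, 4: 3, 5: 3}  # upper thresholds of missing fields


def normalize(value):
    return re.sub(r"[^A-Z0-9]", "", (value or "").upper())


def digest(code, fields, pii):
    # Assumption: labeled fields and SHA-512, not the system's real encoding.
    text = "|".join(f"{f}={normalize(pii.get(f))}" for f in fields)
    return hashlib.sha512(f"{code}:{text}".encode("utf-8")).hexdigest()


def present_fields(code, pii):
    required, optional = PATTERNS[code]
    return list(required) + [f for f in optional if normalize(pii.get(f))]


def registered_hash_codes(pii):
    """What the server keeps for a subject: hash codes only, no PII."""
    stored = set()
    for code, (required, optional) in PATTERNS.items():
        fields = present_fields(code, pii)
        if len(required) + len(optional) - len(fields) <= UPPER[code]:
            stored.add(digest(code, fields, pii))
    return stored


def probable_hash_codes(pii):
    """Step 2: every probable perfect or good hash code with its field set."""
    codes = {}
    for code, (required, optional) in PATTERNS.items():
        fields = present_fields(code, pii)
        present_optional = fields[len(required):]
        missing = len(optional) - len(present_optional)
        for extra in range(len(present_optional) + 1):
            if missing + extra > UPPER[code]:
                break
            for dropped in combinations(present_optional, extra):
                kept = [f for f in fields if f not in dropped]
                codes[digest(code, kept, pii)] = frozenset(kept)
    return codes


def questionable_fields(pii, stored):
    """Steps 4 to 7 for one matched subject: entered fields minus validated fields."""
    entered = {f for f, v in pii.items() if normalize(v)}
    matched = [fs for d, fs in probable_hash_codes(pii).items() if d in stored]
    return entered - set().union(*matched)


original = {
    "FN": "Ann", "MN": "Lee", "LN": "Doe", "SEX": "F", "COB": "US",
    "DOB": "07", "MOB": "03", "YOB": "1980", "GIID": "A123",
    "MFN": "Eve", "MLN": "Roe", "FFN": "Tom", "FLN": "Doe",
    "MDOB": "11", "MMOB": "05", "FDOB": "23", "FMOB": "09",
}
server = registered_hash_codes(original)
retyped = dict(original, DOB="08")  # one planted error in DOB
print(sorted(questionable_fields(retyped, server)))  # ['DOB', 'GIID']
```

Only hash values cross the client and server boundary in this sketch, which mirrors the constraint that no PII is stored on the GUID server.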
An example for locating questionable personally identifiable information (PII) fields while hash codes are perfect match.
An example for locating questionable personally identifiable information (PII) fields while hash codes are good match.
For evaluating the proposed algorithm, the mailing list information [
Then we randomly planted 200,000 errors into the simulation data, using four operations: emptying, inserting, deleting, and replacing. Within any given field of the same hash code, no more than one error was planted. After planting, 127,700 of the 200,000 subjects had errors and 72,300 had none. The maximum number of planted errors in a single subject was 8. The count (N_Err) and percentage of planted errors by PII field are shown in
Distribution of planted errors by personally identifiable information (PII) fields.
PIIa fields | N_Err | Percent (%) | |
FN | 12,937 | 6.47 | |
LN | 14,166 | 7.08 | |
MN | 10,234 | 5.12 | |
COB | 12,954 | 6.48 | |
DOB | 10,440 | 5.22 | |
MOB | 12,645 | 6.32 | |
YOB | 11,578 | 5.79 | |
SEX | 11,587 | 5.79 | |
GIID | 7980 | 3.99 | |
MFN | 12,984 | 6.49 | |
MLN | 10,504 | 5.25 | |
FFN | 10,823 | 5.41 | |
FLN | 11,656 | 5.83 | |
MDOB | 13,603 | 6.80 | |
MMOB | 11,301 | 5.65 | |
FDOB | 11,188 | 5.59 | |
FMOB | 13,420 | 6.71 | |
Total | 200,000 | 100 |
aPII: personally identifiable information.
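The four error operations named above (emptying, inserting, deleting, and replacing) might be simulated as in the sketch below. The article does not specify how each operation was implemented, so the character-level behavior, the use of uppercase letters as the insertion alphabet, and the function name `plant_error` are all assumptions for illustration.

```python
import random
import string

KINDS = ("empty", "insert", "delete", "replace")


def plant_error(value, kind, rng):
    """Return a corrupted copy of `value` using one of the four error kinds."""
    if kind not in KINDS:
        raise ValueError(f"unknown error kind: {kind}")
    if kind == "empty" or not value:
        return ""
    pos = rng.randrange(len(value))
    if kind == "insert":
        return value[:pos] + rng.choice(string.ascii_uppercase) + value[pos:]
    if kind == "delete":
        return value[:pos] + value[pos + 1:]
    # "replace": swap one character for a different uppercase letter
    choices = [c for c in string.ascii_uppercase if c != value[pos]]
    return value[:pos] + rng.choice(choices) + value[pos + 1:]


rng = random.Random(0)  # fixed seed so a simulation run is repeatable
for kind in KINDS:
    print(kind, repr(plant_error("SMITH", kind, rng)))
```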
After the dataset is treated, only error-planted subjects are used for simulating input while registering from the client application. The proposed algorithm is applied to validate and locate these planted errors.
When reregistering a subject in a GUID system, the proposed methods may be used to perform the following 2 tasks:
1. Checking questionable PII fields to ensure correct input. If any of the PII fields of the subject are improperly input, the client application will prompt the user to recheck the specified PII without revealing actual input value by using the proposed method.
2. Updating hash codes. If the client ensures that input of PII fields are correct and more complete than before, the application will allow the system to update hash codes.
For these 2 tasks, we developed an application program and integrated it into the current GUID registration operation. Previously registered subjects were selected to confirm its value.
Due to planted errors, the values of some PII fields have changed. As shown in
Identifying of error-planted subjects.
Matching type | Recerfa | Recnerfb | Subtotal |
Unidentified | 13,236 | 0 | 13,236 |
Identified | 65,383 | 49,081 | 114,464 |
Total | 78,619 | 49,081 | 127,700 |
aRecerf: the count of subjects with errors in required fields.
bRecnerf: the count of subjects with no error in required fields.
Simulation results show that the mean number of planted errors is 1.48 for identified subjects and 2.29 for unidentified subjects.
Identifying of subjects with different count of planted errors.
nErr | nRec_Err | nRec_Err_Mtch | Ratio (%) |
1 | 74,883 | 71,796 | 95.88 |
2 | 37,327 | 32,104 | 86.01 |
3 | 12,143 | 8798 | 72.45 |
4 | 2792 | 1545 | 55.34 |
5 | 476 | 199 | 41.81 |
6 | 69 | 18 | 26.09 |
7 | 8 | 4 | 50.00 |
8 | 2 | 0 | 0.00 |
Identifying of subjects with different count of error required fields.
nErr_ReqF | nRec_Err_ReqF | nRec_Err_ReqF_Mtch | Ratio (%) |
0 | 49,081 | 49,081 | 100.00 |
1 | 62,716 | 56,750 | 90.49 |
2 | 14,026 | 8038 | 57.31 |
3 | 1740 | 569 | 32.70 |
4 | 132 | 25 | 18.94 |
5 | 5 | 1 | 20.00 |
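The ratios in the two tables above are matched counts divided by subject counts, and both tables sum to the same totals. The short check below recomputes them from the tabulated counts; the variable and function names are hypothetical.

```python
# (error count, subjects with that count, subjects still identified)
by_error_count = [
    (1, 74883, 71796), (2, 37327, 32104), (3, 12143, 8798), (4, 2792, 1545),
    (5, 476, 199), (6, 69, 18), (7, 8, 4), (8, 2, 0),
]
# (count of erroneous required fields, subjects, subjects still identified)
by_required_errors = [
    (0, 49081, 49081), (1, 62716, 56750), (2, 14026, 8038),
    (3, 1740, 569), (4, 132, 25), (5, 5, 1),
]


def ratios(rows):
    """Percentage of identified subjects per row, rounded to 2 decimals."""
    return [round(100 * matched / total, 2) for _, total, matched in rows]


def totals(rows):
    """Total subjects and total identified subjects across all rows."""
    return sum(t for _, t, _ in rows), sum(m for _, _, m in rows)


total, matched = totals(by_error_count)
print(total, matched, round(100 * matched / total, 2))  # 127700 114464 89.64
print(totals(by_required_errors))                       # (127700, 114464)
print(ratios(by_error_count)[:3])                       # [95.88, 86.01, 72.45]
```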
Simulation results show that PII errors can be found and located within a limited set of fields. In the best case, an error is located precisely at 1 PII field. In the worst case, the questionable scope is reduced only to a set of 13 PII fields. According to the simulated results, the mean questionable scope is a set of 5.64 PII fields, 3.59 times the mean number of errors planted into a subject. This suggests that, relative to each planted error, the mean questionable scope can be limited to fewer than 4 PII fields.
For identified subjects, the count of analyzed questionable PII fields (ncqf) is related to the count of planted errors in a subject (
If only 1 error is planted into a subject, the count of analyzed questionable PII fields (ncqf_1) depends on the type of error PII field (
The count of analyzed questionable fields by count of errors.
Count of planted errors in a subject | ncqf minimum | ncqf maximum | ncqf average |
1 | 1 | 13 | 4.27 |
2 | 2 | 13 | 7.39 |
3 | 3 | 13 | 9.42 |
4 | 4 | 13 | 10.86 |
5 | 6 | 13 | 11.67 |
6 | 11 | 13 | 11.83 |
7 | 13 | 13 | 13.00 |
The count of analyzed questionable fields by personally identifiable information (PII) fields.
PIIa fields with planted errors | ncqf_PII minimum | ncqf_PII maximum | ncqf_PII mean |
FN | 13 | 13 | 13 | |
LN | 6 | 13 | 7.65 | |
MN | 2 | 13 | 5.56 | |
SEX | 6 | 12 | 7.30 | |
COB | 6 | 13 | 7.67 | |
DOB | 2 | 11 | 5.69 | |
MOB | 2 | 13 | 5.53 | |
YOB | 3 | 11 | 5.28 | |
GIID | 1 | 11 | 3.74 | |
MFN | 1 | 13 | 6.48 | |
MLN | 1 | 13 | 6.51 | |
FFN | 1 | 13 | 6.59 | |
FLN | 1 | 13 | 4.84 | |
MDOB | 1 | 13 | 6.12 | |
MMOB | 1 | 13 | 6.11 | |
FDOB | 1 | 13 | 6.09 | |
FMOB | 1 | 13 | 6.06 |
aPII: personally identifiable information.
The count of analyzed questionable personally identifiable information (PII) fields from subjects with only one error.
PIIa fields with planted errors | ncqf_1 | |
FN | 13 | |
LN | 6 | |
MN | 2 | |
SEX | 6 | |
COB | 6 | |
DOB | 2 | |
MOB | 2 | |
YOB | 3 | |
GIID | 1 | |
MFN | 1/4 | |
MLN | 1/4 | |
FFN | 1/4 | |
FLN | 1 | |
MDOB | 1/4 | |
MMOB | 1/4 | |
FDOB | 1/4 | |
FMOB | 1/4 |
aPII: personally identifiable information.
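The single-error counts in the table above follow from the combination patterns: a hash code that contains the erroneous field no longer matches, so only the fields of the remaining hash codes can be validated. The sketch below applies that reasoning with no missing fields. It is a simplified model with hypothetical names; for the entries listed as 1/4 it yields the larger value and does not model the case that gives 1.

```python
# PII fields per hash code when no optional field is missing.
HASH_FIELDS = {
    1: {"YOB", "DOB", "SEX", "GIID"},
    2: {"FN", "MN", "LN", "COB", "DOB", "MOB"},
    3: {"FN", "YOB", "MFN", "MLN", "FFN", "FLN"},
    4: {"FN", "LN", "COB", "SEX", "MDOB", "MMOB", "FDOB", "FMOB"},
    5: {"FN", "MN", "MOB", "MFN", "FFN", "MLN"},
}
ALL_FIELDS = set().union(*HASH_FIELDS.values())


def scope_for_single_error(field):
    """Questionable fields when exactly one field is wrong and nothing is missing."""
    intact = [fields for fields in HASH_FIELDS.values() if field not in fields]
    return ALL_FIELDS - set().union(*intact)


for name in ["FN", "LN", "MN", "SEX", "COB", "DOB", "MOB", "YOB", "GIID", "FLN", "MFN", "MDOB"]:
    print(name, len(scope_for_single_error(name)))
# FN 13, LN 6, MN 2, SEX 6, COB 6, DOB 2, MOB 2, YOB 3, GIID 1, FLN 1, MFN 4, MDOB 4
```

An error in FN leaves only hash code 1 intact, which is why it produces the widest scope of 13 fields, whereas GIID and FLN each appear in a single hash code and can be pinpointed exactly.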
The proposed hash code analysis scheme is integrated into the GUID application to enhance GUID accuracy. While registering a subject, who has been previously registered in the system, it analyzes the questionable PII fields, highlights them, and requests the client to correct them (
When the application finds questionable PII fields, it gives a hint regarding possible PII errors. If the user confirms that the input of all PII fields is correct, the user may select the “update hash codes” function, and the application will update the hash codes on the server based on the user’s input.
The application of checking questionable personally identifiable information (PII) fields.
In the GUID system [
In addition, simulation results also show that the count and type of error PII fields in a subject have great effect on identifying the subject. In
Hash codes are generated from PII, but it is an irreversible process and a hash code cannot be transformed back into PII. Therefore, it is impossible to validate questionable input by reversing hash codes to PII, which is intended by design. Additionally, missing values of PII fields make it more difficult to validate questionable PII fields. Fortunately, there exists a map between combinations of PII fields and hash codes and there are overlapping PII fields among hash codes of a subject. Each hash code represents a set of PII fields and all probable perfect or good hash codes (
The simulation results also show that the count of analyzed questionable PII fields is closely related to the count of actual errors. The greater the count of actual errors, the more the questionable PII fields to be evaluated (
With the proposed method, the application may give the user a proper hint about questionable PII input while registering a subject. If the user confirms that the entered PII fields are correct, the hash codes in the system may be updated to correct the previous entry error, thus improving the robustness of the GUID system.
In summary, a subject with PII errors may still be identified in the GUID system, depending on the number and type of PII errors. Using set operations, questionable PII fields from the client application may be analyzed based on hash codes, but it is difficult to find the exact location of an error because hash codes come from combinations of PII fields and cannot be reversed to PII. If questionable PII fields need to be precisely located, either all probable perfect or good hash codes must be stored on the server or the hash code generation mechanism of the system must be redesigned.
COB: Country of birth
Name of city or municipality in which subject was born
DOB: Day of birth
FDOB: Father’s day of birth
FFN: Father’s complete legal given (first) name at his birth
FLN: Father’s complete legal family (last) name at his birth
FMOB: Father’s month of birth
FN: Complete legal given (first) name at birth
GIID: Government issued or national ID
GUID: Global unique identifier
LN: Complete legal family (last) name at birth
MDOB: Mother’s day of birth
MFN: Mother’s complete legal given (first) name at her birth
MLN: Mother’s complete legal family (last) name at her birth
MMOB: Mother’s month of birth
MN: Complete legal additional (middle) name
MOB: Month of birth
PII: Personally identifiable information
SEX: Physical sex at birth (male or female)
YOB: Year of birth
This research is supported by China Scholarship Council, National Social Science Foundation of China (Grant No. 13BTQ052) and a visiting researcher appointment to the NINDS Research Participation Program which is administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the US Department of Energy and the National Institutes of Health.
None declared.