Guidance on how to prepare a Readme describing data de-identification steps performed on datasets ahead of submission
This document guides data contributors in preparing a Readme that describes the de-identification steps performed on data sets ahead of submission to BDC. Clear documentation of the de-identification methods is required for submission; it helps BDC and future data re-users understand the data elements and the de-identification applied to the shared data sets.
Researchers are expected to describe the de-identification methods used on the data in sufficient detail for an external party to understand and replicate them. Documentation of the approach for all 18 de-identifying elements is required.
Section 3 is an example checklist used by BDC to check data de-identification. It documents that the data have been reviewed, whether each type of PHI/PII was found in the dataset, and how it was resolved.
Section 4 is an example Readme, adapted from the NHLBI Collaborating Network of Networks for Evaluating COVID-19 and Therapeutic Strategies (CONNECTS) De-Identification Guidance.
De-identification QC by BDC includes the data elements in the table below.
Data element | De-identification approach
Names of patients, relatives, employers or household members | Data not included or data are masked using a method that renders it unreadable
All geographic subdivisions smaller than a state | Data not included or data are masked using a method that renders it unreadable
All elements of dates (except year) for dates that are directly related to an individual | Dates were addressed as follows: (1) ages were reported as bins spanning 10 years: 40-49, 50-59, 60-69, 70-79, 80-89 and ≥90; (2) dates of events, such as treatment, visit, birth, and death, are given as year only
Telephone numbers | Data not included or data are masked using a method that renders it unreadable
Vehicle identifiers and serial numbers, including license plate numbers | Data not included or data are masked using a method that renders it unreadable
Fax numbers | Data not included or data are masked using a method that renders it unreadable
Device identifiers and serial numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values
Email addresses | Data not included or data are masked using a method that renders it unreadable
Web Universal Resource Locators (URLs) | Data not included or data are masked using a method that renders it unreadable
Social security numbers | Data not included or data are masked using a method that renders it unreadable
Internet Protocol (IP) addresses | Data not included or data are masked using a method that renders it unreadable
Medical record numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values
Biometric identifiers, including finger and voice prints | Data not included or data are masked using a method that renders it unreadable
Health plan beneficiary numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values
Full-face photographs and any comparable images | Data not included or data are masked using a method that renders it unreadable
Account numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values
Any other unique identifying number, characteristic, or code | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values
Certificate/license numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values
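Several elements in the checklist allow values to be masked so that they are "unlinkable to original values". The guidance does not prescribe a masking method. The Python sketch below shows one possible approach, offered only as an illustration: each distinct identifier is replaced by a random surrogate, so repeated values stay consistent within the dataset while the surrogate itself carries no information about the original. The function name `build_surrogate_map`, the `ID-` prefix, and the sample values are hypothetical.

```python
import secrets


def build_surrogate_map(original_ids):
    """Assign each distinct original identifier a random surrogate value.

    Assumption for illustration: the returned mapping is kept only by the
    study team (or destroyed) and is never uploaded with the data.
    """
    mapping = {}
    used = set()
    for original in original_ids:
        if original in mapping:
            continue  # same original value always gets the same surrogate
        surrogate = "ID-" + secrets.token_hex(8)
        while surrogate in used:  # guard against an unlikely collision
            surrogate = "ID-" + secrets.token_hex(8)
        used.add(surrogate)
        mapping[original] = surrogate
    return mapping


# Hypothetical medical record numbers; the first and third are the same person.
mrns = ["MRN001", "MRN002", "MRN001"]
surrogates = build_surrogate_map(mrns)
masked = [surrogates[m] for m in mrns]
```

Because the surrogates are drawn at random rather than derived from the original values, they cannot be reversed without the mapping, which is why the mapping itself must be protected.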
Before uploading data to BDC, studies are required to describe the de-identification approach applied to the study datasets. The study team is responsible for the de-identification of all data uploaded to BDC in accordance with NIH Data Sharing Policy, as outlined in the BDC Data Protection Guidance and Data Generator Guidance.
Below is an example de-identification readme file describing the de-identification of different data elements.
When de-identifying dates, the study should perform date shifting instead of using a reference date (e.g., days from randomization or consent). Because the goal is standardization across studies to the greatest extent possible, date shifting is preferred to reference dates to avoid any confusion in the interpretation of Day 0 vs. Day 1 (and therefore all subsequent days) across studies. Additionally, not all studies collect the same data as clinical trials (e.g., randomization date); using date shifting instead of a reference date allows broad applicability and linkage to different types of studies in the future, including observational and non-randomized studies.
Dates should be shifted by a consistent length of time for each record: a random integer from 0 to 364 days is subtracted from every true date in the record, thus preserving the interval between dates. For example, if a subject had three sequential appointments on April 2, April 15, and April 26, the shifted appointments remain in the same order with the same intervals between them (for instance November 16, November 29, and December 10). For dates where only a month and year are available, the day of the month should be imputed as the 15th for date shifting purposes only. After a date using the 15th of the month is created, the same date shifting method outlined above should be applied. The dummy day of month should then be returned to missing status, and only the shifted month and year should be uploaded as the actual date. If only a year is available, no date shifting should occur, and day and month should be marked as missing.
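The shifting rule for complete dates can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the function names `draw_offset` and `shift_dates` are hypothetical, and the calendar year 2021 is an assumption added so the April appointments from the example can be expressed as full dates.

```python
import secrets
from datetime import date, timedelta


def draw_offset():
    """Draw one random offset in days, from 0 to 364 inclusive, for a record."""
    return secrets.randbelow(365)


def shift_dates(dates, offset_days):
    """Subtract the same offset from every date so intervals are preserved."""
    delta = timedelta(days=offset_days)
    return [d - delta for d in dates]


# Appointments from the example above; the year 2021 is assumed.
visits = [date(2021, 4, 2), date(2021, 4, 15), date(2021, 4, 26)]

# A fixed offset of 137 days is used here only to reproduce the example;
# in practice the offset would come from draw_offset().
shifted = shift_dates(visits, 137)
# shifted is 2020-11-16, 2020-11-29, 2020-12-10
```

Drawing the offset once and reusing it for every date in the record is what keeps the 13-day and 11-day gaps between the appointments unchanged.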
Follow the above guidance as illustrated in the examples below:
Date available    Action
**-MON-YEAR       Impute DD as 15, date shift, return dummy DD to missing
**-***-YEAR       Do not date shift, retain YEAR as is
DD-***-YEAR       Remove DD, do not date shift, retain YEAR as is
**-MON-****       Mark entire date as missing
DD-***-****       Mark entire date as missing
DD-MON-****       Mark entire date as missing
* Indicates missing
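The partial-date rules listed above can be expressed as a single function. The Python sketch below is one possible reading of those rules, with `None` standing in for a missing component; the function name `shift_partial_date` and the sample year are hypothetical.

```python
from datetime import date, timedelta


def shift_partial_date(year, month, day, offset_days):
    """Apply the partial-date rules; returns (year, month, day), None = missing."""
    if year is None:
        # No year available: the entire date is marked as missing.
        return (None, None, None)
    if month is None:
        # Year only (any day is removed): no shifting, year retained as is.
        return (year, None, None)
    if day is None:
        # Month and year only: impute the 15th, shift, then drop the dummy day.
        shifted = date(year, month, 15) - timedelta(days=offset_days)
        return (shifted.year, shifted.month, None)
    # Complete date: shift normally.
    shifted = date(year, month, day) - timedelta(days=offset_days)
    return (shifted.year, shifted.month, shifted.day)


month_year_only = shift_partial_date(2021, 4, None, 137)  # (2020, 11, None)
year_only = shift_partial_date(2021, None, 2, 137)        # (2021, None, None)
no_year = shift_partial_date(None, 4, 2, 137)             # (None, None, None)
```

Note that a month-and-year date can move into a different month or year after shifting, as in the first call above, while a year-only date is never changed.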
Study personnel should review data to ensure no identifying data is included.
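A simple automated scan can support this manual review, although it cannot replace it. The Python sketch below flags free-text values that look like a few of the listed identifier types. The patterns are deliberately rough assumptions (US-style phone and social security number formats, IPv4 only), the names `PATTERNS` and `flag_possible_identifiers` are hypothetical, and this is not a description of the QC that BDC itself performs.

```python
import re

# Rough, illustrative patterns; they will miss some identifiers and may
# flag harmless values, so every hit still needs human review.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "url": re.compile(r"https?://\S+", re.IGNORECASE),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def flag_possible_identifiers(text):
    """Return the sorted names of all patterns that match somewhere in text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


hits = flag_possible_identifiers("Contact jdoe@example.org or 301-555-0100")
# hits is ["email", "phone"]
```

Running such a scan over every free-text column gives reviewers a short list of cells to inspect first; names, addresses, and other identifiers without a fixed format still require reading the data.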
For individuals aged 90 or above, ages should be aggregated into a single age grouping (“90”).
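The age rule can be sketched in Python as shown below. The single group for ages 90 and above comes from the sentence above, and the 10-year bins follow the checklist example earlier in this document; the function name `age_group` and the assumption that ages are whole years are illustrative choices.

```python
def age_group(age_years):
    """Return a 10-year bin label; ages 90 and above share a single group."""
    if age_years >= 90:
        return "90"  # single aggregated group for individuals aged 90 or above
    lower = (age_years // 10) * 10
    return f"{lower}-{lower + 9}"


labels = [age_group(a) for a in (45, 89, 90, 101)]
# labels is ["40-49", "80-89", "90", "90"]
```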