Guidance on how to prepare a Readme describing data de-identification steps performed on datasets ahead of submission
The purpose of this document is to provide guidance to data contributors on how to prepare a Readme describing the data de-identification steps performed on the data sets ahead of submission to BDC. Clear documentation describing the de-identification methods is required for submission so that BDC and future data re-users can understand the data elements and the de-identification applied to the data sets being shared.
Researchers are expected to describe the de-identification methods used on the data in sufficient detail for an external party to understand and replicate them. Documentation of the approach for all 18 de-identification elements is required.
Section 3 is an example checklist used by BDC to check data de-identification, demonstrating that the data have been reviewed and noting whether a particular type of PHI/PII was found in the dataset and how it was resolved.
Section 4 is an example Readme, adapted from the NHLBI Collaborating Network of Networks for Evaluating COVID-19 and Therapeutic Strategies (CONNECTS) De-Identification Guidance.
De-identification QC by BDC includes the data elements in the table below.

Data Element | Process for anonymization/de-identification |
---|---|
Names of patients, relatives, employers or household members | Data not included or data are masked using a method that renders it unreadable |
All geographic subdivisions smaller than a state | Data not included or data are masked using a method that renders it unreadable |
All elements of dates (except year) for dates that are directly related to an individual | Dates were addressed as follows: ages were reported as bins spanning 10 years (40-49, 50-59, 60-69, 70-79, 80-89 and ≥90); dates of events, such as treatment, visit, birth, and death, are given as year only |
Telephone numbers | Data not included or data are masked using a method that renders it unreadable |
Vehicle identifiers and serial numbers, including license plate numbers | Data not included or data are masked using a method that renders it unreadable |
Fax numbers | Data not included or data are masked using a method that renders it unreadable |
Device identifiers and serial numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values |
Email addresses | Data not included or data are masked using a method that renders it unreadable |
Web Universal Resource Locators (URLs) | Data not included or data are masked using a method that renders it unreadable |
Social security numbers | Data not included or data are masked using a method that renders it unreadable |
Internet Protocol (IP) addresses | Data not included or data are masked using a method that renders it unreadable |
Medical record numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values |
Biometric identifiers, including finger and voice prints | Data not included or data are masked using a method that renders it unreadable |
Health plan beneficiary numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values |
Full-face photographs and any comparable images | Data not included or data are masked using a method that renders it unreadable |
Account numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values |
Any other unique identifying number, characteristic, or code | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values |
Certificate/license numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values |
As studies prepare to upload data to BDC, they are required to describe the de-identification approach applied to the study datasets. The study team is responsible for the de-identification of all data uploaded to BDC in accordance with the NIH Data Sharing Policy, as outlined in the BDC Data Protection Guidance and Data Generator Guidance.
Below is an example de-identification readme file describing the de-identification of different data elements.
The study should perform date shifting, rather than using a reference date (e.g., days from randomization or consent), when de-identifying dates. Because the goal is standardization across studies to the greatest extent possible, date shifting is preferred to reference dates to avoid confusion in the interpretation of Day 0 vs. Day 1 (and therefore all subsequent days) across studies. Additionally, not all studies collect the same data as clinical trials (e.g., a randomization date); using date shifting instead of a reference date allows broad applicability and linkage to different types of studies in the future, including observational and non-randomized studies.
Dates should be shifted by a consistent length of time for each record, with a random integer from 0-364 days subtracted from the true date, thus preserving the interval between dates. For example, if a subject had three sequential appointments with dates of April 2, April 15, and April 26, when the dates are shifted, each appointment will remain in order sequentially with the same interval between appointments: November 16, November 29, and December 10. For dates where only a month and year are available, the day of the month should be imputed to the 15th for date shifting purposes only. After a date using the 15th of the month is created, the same date shifting method outlined above should be employed. The dummy day of the month should then be returned to missing status, and only the shifted month and year should be uploaded as the actual date. If only a year is available, no date shifting should occur, and the day and month should be marked as missing.
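A minimal sketch of this date-shifting approach is shown below, assuming visit-level data in a pandas DataFrame; the column names (SUBJECT_ID, VISIT_DATE), the example data, and the 10-year age-bin labels are illustrative assumptions, not a BDC requirement.

```python
# Illustrative date shifting and age binning, per the guidance above.
# Column names and example data are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng()

def assign_shifts(subject_ids):
    """Draw one random shift (0-364 days) per subject so all of a subject's
    dates move by the same amount, preserving the intervals between them."""
    unique_ids = pd.unique(subject_ids)
    shifts = rng.integers(0, 365, size=len(unique_ids))  # 0-364 inclusive
    return dict(zip(unique_ids, shifts))

def shift_full_date(date, shift_days):
    """Shift a complete DD-MON-YEAR date backward by the subject's offset."""
    return date - pd.Timedelta(days=int(shift_days))

def shift_month_year(year, month, shift_days):
    """Month/year-only dates: impute day 15, shift, then drop the dummy day."""
    shifted = pd.Timestamp(year=year, month=month, day=15) - pd.Timedelta(days=int(shift_days))
    return shifted.year, shifted.month  # day returns to missing

def bin_age(age):
    """Report ages in 10-year bins; ages 90 and above collapse to one group."""
    if age >= 90:
        return "90+"
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

visits = pd.DataFrame({
    "SUBJECT_ID": ["A01", "A01", "A01"],
    "VISIT_DATE": pd.to_datetime(["2021-04-02", "2021-04-15", "2021-04-26"]),
})
shifts = assign_shifts(visits["SUBJECT_ID"])
visits["SHIFTED_DATE"] = [
    shift_full_date(d, shifts[s])
    for s, d in zip(visits["SUBJECT_ID"], visits["VISIT_DATE"])
]
print(visits)  # intervals between the three visits are preserved
```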
Follow the above guidance as illustrated in the examples below:

Date Elements Present (DD-MON-YEAR) | Date Shifting Approach |
---|---|
**-MON-YEAR | Impute DD as 15, date shift, return dummy DD to missing |
**-***-YEAR | Do not date shift, retain YEAR as is |
DD-***-YEAR | Remove DD, do not date shift, retain YEAR as is |
**-MON-**** | Mark entire date as missing |
DD-***-**** | Mark entire date as missing |
DD-MON-**** | Mark entire date as missing |

* Indicates a missing date component

For individuals aged 90 or above, ages should be aggregated into a single age grouping (“90”).

Study personnel should review the data to ensure no identifying information is included; an illustrative spot check is sketched below.
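The following is an illustrative spot check, not a BDC-mandated procedure: it scans free-text columns for a few obvious identifier patterns (email addresses, US-style phone numbers, and SSN-like strings). The column name, patterns, and example data are assumptions, and an automated scan does not replace the required manual review.

```python
# Illustrative scan of free-text columns for a few direct-identifier patterns.
# Patterns and column names are assumptions; manual review is still required.
import re
import pandas as pd

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_for_identifiers(df):
    """Return (column, row index, identifier type) for suspicious values."""
    hits = []
    for col in df.select_dtypes(include="object"):
        for idx, value in df[col].dropna().items():
            for label, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    hits.append((col, idx, label))
    return hits

frame = pd.DataFrame({"NOTES": ["follow-up ok", "call 301-555-0123 tomorrow"]})
print(scan_for_identifiers(frame))  # [('NOTES', 1, 'phone')]
```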
The five steps and their subparts outlined below provide instructions for submitting data and making it available through BDC. These instructions will be updated as new information and processes are made available.
Tip: To reduce the time to ingest and release data, you may work on more than one action at a time.
If you have already prepared your data (see Step 3), you may complete Step 1 (Intent to Submit) and then simultaneously work on dbGaP Study Registration (see Step 2) and begin the Data Submission process (see Step 4).
If you have not yet prepared your data, you may complete Step 1 (Intent to Submit) and then simultaneously work on dbGaP Study Registration (see Step 2) and Data Preparation (see Step 3).
This step has two data submitter action items; the first differs for NHLBI intramural and extramural investigators.
Data Submitter Action Item 1 (for NHLBI Intramural Investigators): Email NHLBIDIRBDCSubmission@mail.nih.gov for submission information.
Data Submitter Action Item 1 (for Extramural Investigators): Use the following email template, complete it with information specific to your study, and send it to bdcatalystdatasharing@nih.gov.
Email template:
To: bdcatalystdatasharing@nih.gov
Subject: BioData Catalyst Data Submission [Grant Number / Award Number]
Study Name
Institution Name
PI Name
Grant Number/Award Number/ZIA Number
Expected date for data upload/submission
Does this submission include genomics data?
Does this submission include biospecimen?
Does this submission include imaging data?
After sending the email, you will receive an automated response with the following documents to use in Step 2.
Institutional Certification Form
Data Submission Information Sheet
Guidance document for registration of data in dbGaP
You will receive a response from the Genomic Program Administrator (GPA) confirming receipt of your email.
Extramural submitters will also receive a response from the BioData Catalyst Data Management Core (DMC) (nhlbi.dmc.concierge@rti.org) to provide further assistance or answer data submission-related questions.
Data Submitter Action Item 2: Complete the Institutional Certification and Data Submission Information Sheet (see the results from Step 1, Action Item 1), and email them to nhlbigeneticdata@nhlbi.nih.gov.
Step 1 Related Links
All research data shared with BDC must be registered through dbGaP, though the controlled and non-controlled access processes may differ; the DMC will contact you and provide specific guidance in such cases. Study registration has two parts but only one action item for data submitters.
The GPA completes the first part of dbGaP study registration and, as a result, generates your study accession number. The GPA does this by entering information from your Institutional Certification and Data Submission Information Sheet into the dbGaP Submission System. If needed, the GPA may contact you for additional information or clarification, or to request a data sharing plan and data use agreement.
The GPA will share the accession number and the consent group information with the DMC to create the Data Submission Infrastructure for your study.
You will receive an automated email from dbGaP to complete Study Submission (see screenshots of the dbGaP email below).
Data Submitter Action Item 1: After receiving the automated email from dbGaP, complete the dbGaP submission process using guidance available in the dbGaP Study Configuration Process for Submission of Data to BDC (see a screenshot of the dbGaP Study Submission portal below). Study Config consists of a web form that collects a description of the study data, methods, and findings, inclusion/exclusion criteria, study history, references, attributions, and terms that will be indexed to enable users to search for your study in dbGaP Advanced Search.
Note: Gather all information ahead of the web form entry, as the current form does not have a “save” button for partial entry. Click here to download the example files for dbGaP submission.
Once you finish your study configuration, dbGaP will curate your submission and may contact you for questions. Once dbGaP completes its curation process, you will receive an email from dbGaP to approve and complete your study registration.
Note: While waiting for dbGaP curation, please proceed with data submission to BDC (steps 3 and 4 below) to reduce the time to ingest and release the data.
Data preparation can happen before, during, or after the study registration process and must be completed to submit data to BDC. This step has one action item for all data submitters and a second action item for submitters of omics and phenotypic data types. If you have already prepared your data, you may go to Step 4: Data Submission.
Data Submitter Action Item 1: Prepare supplemental documentation to accompany the data submission (“data package”) according to the Instructions for Preparing Clinical Research Study Datasets for Submission to the NHLBI, including:
Protocols
BDC-compliant Data Dictionaries* - reference the Data Dictionary Requirement
Survey Instruments
Data/Metadata model*
Datasets Readme* - if the datasets are organized in multiple sub-folders, a Readme file is needed to describe the relationship between the sub-folders: whether they are independent (e.g., multiple phases or visits), main studies with ancillary studies, or overlapping (e.g., /raw data and /harmonized data, where the /harmonized data is a subset of the /raw data).
Readme file about the de-identification*
Additional Supplemental documentation to reproduce study results
* Supported documentation types for data dictionaries and models are .csv, tab-delimited, xml, json, and other machine-readable formats. PDF and SAS file formats are not machine-readable and are discouraged from submission.
Data Submitter Action Item 2: Only for Omics and Phenotypic data types, prepare the data files per the dbGaP Study Submission Guidance.
Data submission has three action items for data submitters. This process can happen in parallel with Data Submitter Action Item 1 from Step 2. The data submission process begins by filling out the BDC contact form: https://biodatacatalyst.nhlbi.nih.gov/contact.
Data Submitter Action Item 1: Request bucket creation by filling out the BDC contact form using the following information:
Your institutional email address used for NIH eRA Commons
Subject: Data Submission
Type: Data Submission (select in the dropdown menu)
In the body of the message, 1) include your dbGaP PHS accession number and 2) request access for read/write permission to the assigned cloud bucket
In the rare case that your institution cannot access cloud services hosted by Google or Amazon, request assistance with direct data upload from your data package location (e.g., SFTP transfer)
Data upload may not begin until your data is prepared (see Step 3: Data Preparation) and you receive an invitation from dbGaP to complete your study submission and configuration (see the Results section in Step 2).
Data Submitter Action Item 2: Access the cloud bucket created for your study. You will receive a secure email from the Information Technology Applications Center (ITAC) team at NHLBI that provides the URL to activate access with your user ID and password (see the screenshot below).
Follow the links and instructions in the email to activate the Amazon Web Service (AWS) S3 web interface.
If you have any questions or issues about accessing the buckets, please contact nhlbi.dmc.concierge@rti.org
Data Submitter Action Item 3: Upload the data sets to the cloud bucket created for your study. Once you have access, upload the datasets for each consent group to the corresponding bucket (e.g., xxxx-c1) as described in the dbGaP 2b file.
Once you have selected the specific bucket for a consent group, use the “Upload” button to upload the data files.
If you choose to use the GCP platform, see the screenshot below (“Upload” highlighted).
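If a scripted upload better fits your workflow, the transfer can also be performed programmatically; the sketch below uses boto3 and assumes AWS credentials have already been configured through the activation steps above. The bucket name, key layout, and local directory are placeholders, and you should confirm with the DMC that a scripted upload is acceptable for your submission.

```python
# Illustrative scripted upload to the consent-group bucket assigned to a study.
# Bucket name, prefix, and local path are placeholders; confirm with the DMC.
from pathlib import Path

import boto3

s3 = boto3.client("s3")
bucket = "xxxx-c1"                # consent-group bucket assigned to your study
local_dir = Path("data_package")  # prepared data package (see Step 3)

for path in local_dir.rglob("*"):
    if path.is_file():
        key = str(path.relative_to(local_dir))
        s3.upload_file(str(path), bucket, key)
        print(f"uploaded {path} -> s3://{bucket}/{key}")
```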
Once your data package is uploaded successfully, the data go through quality checks before ingestion and release. If issues are found, the DMC will contact you and assist in resolving them. There are three data submitter action items associated with this step.
Data Submitter Action Item 1: If the DMC contacts you about QC issues with the uploaded data, respond to their inquiries to resolve the issues.
Data Submitter Action Item 2: If requested by the DMC, resubmit the data package after all issues are resolved.
Data Submitter Action Item 3: You are encouraged to log in and view your study data in BDC.
After the data clears the quality checks, the ingestion and release process can take as little as 4-6 weeks. After the data is released, the DMC will notify you that your study is available for use by authorized individuals in BDC (study inventory).
Contact the BioData Catalyst Data Management Core (DMC) via https://biodatacatalyst.nhlbi.nih.gov/contact and select “Data Submission” in the Type field.
A core mission of NHLBI BioData Catalyst® (BDC) is to onboard a wide variety of Heart, Lung, Blood, and Sleep (HLBS) data types into the ecosystem for immediate use by researchers to drive discovery and gain new insights. As new data types are ingested, new fields are identified for the data ingestion process. The Data Management Core (DMC) will integrate the data dictionary into its data submission requirements. The Data Release Management Working Group (DRMWG) will ensure the data submission requirements have been fulfilled before data ingestion is initialized.
Requiring a standardized data dictionary will increase the velocity of data ingestion. It can be used by curators or software to validate the data in the files and enables more automated data processing. The BDC Data Dictionary aligns with the dbGaP Data Dictionary and Format, with a few modifications. The fields DOCFILE and TYPE are required in the BDC Data Dictionary, unlike dbGaP. The data submitter will also be required to submit information about the study, such as a study abbreviation and consent(s). This information is important as more datasets are submitted directly to BDC, and this information is not assigned by dbGaP.
The data ingestion process can be expedited by requiring data submitters to adhere to a standardized data dictionary. If data submitters do not provide the data dictionary in a consistent format, such as TSV, SAS, Excel, or XML, the result is back-and-forth with submitters, manual compilation of the data dictionary, and the development of new or tailored data ingestion pipelines. Additionally, data submitters have not consistently provided data dictionaries with usable decoding information; as a result, the data is presented to the researcher in its encoded format. For example, if 1 is the encoded value for the decoded value Male and 2 for Female, but the decoding is never defined, the researcher will not know the sex of the study participant. Each study that is not consistently formatted requires a unique level of effort for ingestion, as described in the Unique Data Loading Use-Cases.
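As an illustration of why usable decoding information matters, the sketch below parses a pipe-separated VALUE=MEANING string (the format described for the VALUES field later in this section) and applies it to an encoded column; the variable name SEX and the example data are hypothetical.

```python
# Illustrative decoding of an encoded variable using a VALUE=MEANING map.
# Variable name and data are hypothetical.
import pandas as pd

def parse_values_field(values_field):
    """Turn '1=Male|2=Female' into {'1': 'Male', '2': 'Female'}."""
    mapping = {}
    for pair in values_field.split("|"):
        code, _, meaning = pair.partition("=")
        mapping[code.strip()] = meaning.strip()
    return mapping

decode_map = parse_values_field("1=Male|2=Female")
encoded = pd.Series(["1", "2", "1"], name="SEX")
print(encoded.map(decode_map))  # Male, Female, Male
```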
Supported formats include CSV, TSV, SAS, Excel, XML, and other tabular formats. PDFs are not supported.
The BDC Data Dictionary column headers are described below. An asterisk (*) indicates a required field; DOCFILE and TYPE are required by BDC but are not required by dbGaP.
VARNAME*
Variable name. The VARNAME must not contain backward slashes (\). Do not use "dbGaP" in the variable name. "dbGaP" is reserved for dbGaP generated items.
VARDESC*
Variable description. The description should be understandable and enable users to replicate the variable. For example, "blood pressure" is useful, but "brachial blood pressure while sitting" provides more context. Alternatively, study documents with detail are also acceptable.
DOCFILE*
Study document name associated with the variable. To list multiple documents, add a semicolon (;) between documents. Please list only study document filenames that are submitted.
TYPE*
Data value type: integer (1,2,3,4,…), encoded value (integers or strings are coded for non-numerical meaning, ex. 1=Control; 2=Case, see VALUES), decimal (0.5,2.5,…), string (African American, Asian, Caucasian, Hispanic, Non-Hispanic). For mixed values (any combination of string, integers, decimals and/or encoded values) in a single data column, list all types present separated by a comma.
UNITS*
Units of measurement of this variable.
VALUES*
List of all unique values and/or descriptions of all encoded values. Encoded values are defined as a value and its meaning. For example, if a data file contains a variable named "EDUCATION" and its data values are "1, 2, 3, and 99," these coded values will need to be defined in the data dictionary. The format of an encoded value is VALUE=MEANING.
The VALUES field contains each value and its meaning in a single field, with a pipe separating the different entries.
Example: 1=Completed High School|2=Completed College|3=Completed Masters|99=NA
THE FOLLOWING FIELDS ARE NOT REQUIRED
COLLINTERVAL
Collection interval is the time frame in which the data for the variable or dataset was collected.
COMMENT1, COMMENT2
Additional information not included in the VARDESC that will further define the variable. If additional comments are needed beyond COMMENT2, insert new columns (COMMENT3, COMMENT4, etc.) before the column "ORDER."
MAX
The logical maximum value for the variable. If a separate code such as 9999 is used for a missing field, this should not be considered as the MAX value.
MIN
The logical minimum value of the variable. If a separate code such as -1 is used for a missing field, this should not be considered as the MIN value.
ORDER
The order in which VALUES appear on the variable summary report page. If VALUES of a single variable/column of data are integers or decimals, leave blank. If VALUES are encoded values, string, or mixed, define the order. VALUES can be ordered by Frequency (highest to lowest frequency of VALUES) or by List (user specifies order through placement in VALUES columns). For mixed values within a single variable/column of data, see examples: "age" and "weight" in example file 5b_SubjectPhenotypes_DD.xlsx.
RESOLUTION
Measurement resolution – the number of decimal places to which a measured value is presented in the data. For example, in 54.321 the resolution is 3.
SOURCE_VARIABLE_ID
A unique identifier from the VARIABLE_SOURCE or a unique text concept/term from various controlled vocabularies. (Must be submitted as a group with VARIABLE_SOURCE and VARIABLE_MAPPING).
UNIQUEKEY
Unique key is a combination of variables that is designed to uniquely identify a row in a longitudinal dataset or rows that have repeating SUBJECT_IDs or SAMPLE_IDs. Mark "X" for variables that constitute the unique keys, and leave other values blank. Ex. SUBJECT_ID and VISIT_NUMBER. UNIQUEKEYs can only be used in the subject phenotypes file and some cases of the sample attributes file. The SC, SSM, and pedigree files should never have UNIQUEKEYs marked, since there should be a unique identifier appearing once in each file. (A minimal validation sketch follows this list.)
VARIABLE_MAPPING
Describes how the variable relates to its source; for example, a variable could be Identical, Related, or Comparable to the source variable. (Must be submitted as a group with VARIABLE_SOURCE and SOURCE_VARIABLE_ID).
VARIABLE_SOURCE
Source of controlled vocabularies. Ex. PhenX, MeSH, SNOMED, NCI. If there is no match, leave it blank. (Must be submitted as a group with SOURCE_VARIABLE_ID and VARIABLE_MAPPING).
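The conventions above lend themselves to simple pre-submission checks, for example that the UNIQUEKEY columns identify each row uniquely and that observed values fall within the dictionary MIN/MAX once the missing-value code is excluded. The sketch below is illustrative only; the column names, the 9999 missing-value code, and the 30-250 range are assumptions.

```python
# Illustrative pre-submission checks based on the data dictionary conventions.
# Column names, the 9999 missing code, and the 30-250 range are hypothetical.
import pandas as pd

def check_unique_key(df, key_columns):
    """True if the combination of key columns identifies every row uniquely."""
    return not df.duplicated(subset=key_columns).any()

def check_range(series, logical_min, logical_max, missing_code=None):
    """Check values against MIN/MAX, excluding the missing-value code so it
    is not mistaken for a real maximum."""
    observed = series[series != missing_code] if missing_code is not None else series
    return observed.between(logical_min, logical_max).all()

phenotypes = pd.DataFrame({
    "SUBJECT_ID": ["A01", "A01", "A02"],
    "VISIT_NUMBER": [1, 2, 1],
    "WEIGHT_KG": [82.5, 9999, 61.0],  # 9999 used here as a missing-value code
})
print(check_unique_key(phenotypes, ["SUBJECT_ID", "VISIT_NUMBER"]))      # True
print(check_range(phenotypes["WEIGHT_KG"], 30, 250, missing_code=9999))  # True
```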
The required study submission information is described below. An asterisk (*) indicates a required field.

Study name*
The full name of the study
e.g. NHLBI TOPMed: SubPopulations and InteRmediate Outcome Measures In COPD Study
Study abbreviation*
The study abbreviation. Spaces are not permitted. Underscores are permitted.
e.g. SPIROMICS
Study consent number/consent/abbreviation*
Consent Group: Use the dbGaP consent number for guidance.
Consent and abbreviation: Use the NIH Consent Codes: Upholding Standard Data Use Conditions for guidance.
e.g. consent number, consent, and abbreviation
consent | abbreviation |
---|---|
General Research Use | GRU |
Non Profit Use Only | GRU-NPU |
Disease-Specific (Chronic Obstructive Pulmonary Disease) | DS-COPD |
Disease-Specific (Chronic Obstructive Pulmonary Disease, NPU) | DS-COPD-NPU |