De-identification Readme

Guidance on how to prepare a Readme describing data de-identification steps performed on datasets ahead of submission


Last updated 9 months ago


Purpose

The purpose of this document is to provide guidance to data contributors on how to prepare a Readme describing the data de-identification steps performed on data sets ahead of submission in BDC. Clear documentation of the de-identification methods is required for submission so that BDC and future data users can understand the data elements and the de-identification applied to the data sets being shared.

Readme on De-identification Requirements

Researchers are expected to describe the de-identification methods used on the data in sufficient detail for an external party to understand and replicate them. Documentation of the approach for all 18 de-identifying elements is required.

Section 3 (BDC De-identification QC) is an example checklist that BDC uses to check data de-identification. It demonstrates that the data have been reviewed and notes whether a particular PHI/PII element was found in the dataset and how it was resolved.

Section 4 (Example README file) is an example Readme, adapted from the NHLBI Collaborating Network of Networks for Evaluating COVID-19 and Therapeutic Strategies (CONNECTS) De-Identification Guidance.

BDC De-identification QC

De-identification QC by BDC includes the data elements in the table below.

| Data Element | Process for anonymization/de-identification |
| --- | --- |
| Names of patients, relatives, employers or household members | Data not included or data are masked using a method that renders it unreadable |
| All geographic subdivisions smaller than a state | Data not included or data are masked using a method that renders it unreadable |
| All elements of dates (except year) for dates that are directly related to an individual | Dates were addressed as follows: (1) ages were reported as bins spanning 10 years: 40-49, 50-59, 60-69, 70-79, 80-89 and ≥90; (2) dates of events, such as treatment, visit, birth, and death, are given as year only |
| Telephone numbers | Data not included or data are masked using a method that renders it unreadable |
| Vehicle identifiers and serial numbers, including license plate numbers | Data not included or data are masked using a method that renders it unreadable |
| Fax numbers | Data not included or data are masked using a method that renders it unreadable |
| Device identifiers and serial numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values |
| Email addresses | Data not included or data are masked using a method that renders it unreadable |
| Web Universal Resource Locators (URLs) | Data not included or data are masked using a method that renders it unreadable |
| Social security numbers | Data not included or data are masked using a method that renders it unreadable |
| Internet Protocol (IP) addresses | Data not included or data are masked using a method that renders it unreadable |
| Medical record numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values |
| Biometric identifiers, including finger and voice prints | Data not included or data are masked using a method that renders it unreadable |
| Health plan beneficiary numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values |
| Full-face photographs and any comparable images | Data not included or data are masked using a method that renders it unreadable |
| Account numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values |
| Any other unique identifying number, characteristic, or code | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values |
| Certificate/license numbers | Data not included or data are masked using a method that renders it unreadable or unlinkable to original values |
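
The 10-year age bins named in the dates row of the table above could be produced with a small helper such as the one below. This is a minimal sketch, not a BDC tool: the function name `age_bin` is hypothetical, and because the table lists bins only from 40-49 upward, extending the same 10-year pattern to ages under 40 is an assumption.

```python
def age_bin(age):
    """Map an age in years to a 10-year bin label, with one open bin for 90+.

    Assumption: ages under 40 follow the same 10-year pattern (e.g. 30-39);
    the source table lists bins only from 40-49 upward.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= 90:
        return "≥90"
    lower = (int(age) // 10) * 10  # floor to the start of the decade
    return f"{lower}-{lower + 9}"


ages = [47, 89, 90, 103]
labels = [age_bin(a) for a in ages]
```

Binning before upload means the exact age never leaves the study team's environment; only the label is shared.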

Example README file

Below is an example de-identification readme file describing the de-identification of different data elements.

Date De-identification

The study should perform date shifting instead of using a reference date (e.g., days from randomization or consent) when de-identifying dates. Because the goal of the study is standardization across studies to the greatest extent possible, date shifting is preferred to reference dates to avoid any confusion in interpretation of Day 0 vs. Day 1 (and therefore all subsequent days) across studies. Additionally, not all studies include the same data collected in clinical trials (e.g., randomization date); use of date shifting instead of reference date will allow broad applicability and linkage to different types of studies in the future, including observational and non-randomized studies.

Within each record, all dates should be shifted by the same length of time: a random integer from 0 to 364 days is subtracted from each true date, which preserves the intervals between dates. For example, if a subject had three sequential appointments on April 2, April 15, and April 26, the shifted appointments might fall on November 16, November 29, and December 10, in the same order and with the same intervals between them.

For dates where only a month and year are available, the day of the month should be imputed as the 15th for date shifting purposes only. After a date using the 15th of the month is created, the same date shifting method outlined above should be employed. The dummy day of month should then be returned to missing status, and only the shifted month and year should be uploaded as the actual date. If only a year is available, no date shifting should occur, and day and month should be marked as missing.
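
The shifting rule for complete dates can be sketched as follows. This is an illustrative example under stated assumptions, not a prescribed implementation: the function names `draw_offset` and `shift_dates` are hypothetical, the year 2023 is invented so the April appointments can be written as full dates, and the offset of 137 days is simply the value that reproduces the November and December dates in the example above.

```python
import random
from datetime import date, timedelta


def draw_offset(rng):
    """Draw one random offset (0-364 days) to reuse for every date in a record."""
    return rng.randint(0, 364)  # randint includes both endpoints


def shift_dates(dates, offset_days):
    """Subtract the same offset from every date, preserving the intervals."""
    if not 0 <= offset_days <= 364:
        raise ValueError("offset_days must be between 0 and 364")
    delta = timedelta(days=offset_days)
    return [d - delta for d in dates]


# Hypothetical record: the year 2023 is an assumption made for illustration.
visits = [date(2023, 4, 2), date(2023, 4, 15), date(2023, 4, 26)]
shifted = shift_dates(visits, 137)

# In practice the offset would be drawn once per record and kept confidential.
offset = draw_offset(random.Random())
```

Drawing the offset once and applying it to all of a record's dates is what keeps the 13-day and 11-day gaps between the appointments unchanged.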

Follow the above guidance as illustrated in the examples below:

| Date Elements Present (DD-MON-YEAR) | Date Shifting Approach |
| --- | --- |
| **-MON-YEAR | Impute DD as 15, date shift, return dummy DD to missing |
| **-***-YEAR | Do not date shift, retain YEAR as is |
| DD-***-YEAR | Remove DD, do not date shift, retain YEAR as is |
| **-MON-**** | Mark entire date as missing |
| DD-***-**** | Mark entire date as missing |
| DD-MON-**** | Mark entire date as missing |

* Indicates missing
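
The partial-date cases above can be expressed as a single function. This is a sketch only: the name `shift_partial_date`, the use of `None` for a missing element, and the representation of the month as an integer from 1 to 12 (rather than a MON abbreviation) are all assumptions made for illustration.

```python
from datetime import date, timedelta


def shift_partial_date(day, month, year, offset_days):
    """Apply the partial-date rules; returns (day, month, year), None = missing."""
    if year is None:
        # No year: mark the entire date as missing.
        return (None, None, None)
    if month is None:
        # Year only, or day and year: drop the day, do not shift, keep the year.
        return (None, None, year)
    # Month and year present: impute the 15th if the day is missing, then shift.
    imputed_day = 15 if day is None else day
    shifted = date(year, month, imputed_day) - timedelta(days=offset_days)
    if day is None:
        # Return the dummy day of month to missing status after shifting.
        return (None, shifted.month, shifted.year)
    return (shifted.day, shifted.month, shifted.year)


# Hypothetical inputs with an offset of 137 days.
month_year_only = shift_partial_date(None, 4, 2023, 137)
year_only = shift_partial_date(None, None, 2023, 137)
```

The `offset_days` argument should be the same per-record offset used for that record's complete dates, so that partial and complete dates stay consistent with one another.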

Use of Free Text Fields

Study personnel should review data to ensure no identifying data is included.
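
A simple pattern scan can help study personnel prioritize which free-text values to look at first. The sketch below is an assumption-laden aid and not a substitute for manual review: the function name `flag_free_text` and the regular expressions are illustrative, cover only a few identifier types (email addresses, telephone numbers, social security numbers, URLs, and IP addresses), assume US-style number formats, and will miss names and many other identifiers.

```python
import re

# Illustrative patterns only; they are neither exhaustive nor validated.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "url": re.compile(r"https?://\S+"),
}


def flag_free_text(text):
    """Return the sorted names of patterns found in a free-text value."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


# Hypothetical free-text values.
flags = flag_free_text("Contact jane@example.org or 301-555-0100")
clean = flag_free_text("Patient reported mild headache")
```

An empty result does not mean a value is free of identifying data; it only means none of these particular patterns matched.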

Individuals Aged 90 or Older

For individuals aged 90 or above, ages should be aggregated into a single age grouping (“90”).

As studies prepare to upload data to BDC, they are required to describe the de-identification approach applied to the study datasets. The study team is responsible for the de-identification of all data uploaded to BDC in accordance with the NIH Data Sharing Policy, as outlined in the BDC Data Protection Guidance and Data Generator Guidance.

BDC
Collaborating Network of Networks for Evaluating COVID-19 and Therapeutic Strategies (CONNECTS) De-Identification Guidance
NIH Data Sharing Policy
BDC Data Protection Guidance
Data Generator Guidance