Data Dictionary Requirement

Last updated 9 months ago

Background

A core mission of NHLBI BioData Catalyst® (BDC) is to onboard a wide variety of Heart, Lung, Blood, and Sleep (HLBS) data types into the ecosystem for immediate use by researchers to drive discovery and gain new insights. As new data types are ingested, new fields are identified for the data ingestion process. The Data Management Core (DMC) will integrate the data dictionary into its data submission requirements. The Data Release Management Working Group (DRMWG) will ensure the data submission requirements have been fulfilled before data ingestion is initiated.

Requiring a standardized data dictionary will increase the velocity of data ingestion. Curators or software can use the dictionary to validate the data in the files, which enables more automated data processing. The BDC Data Dictionary aligns with the dbGaP Data Dictionary and Format, with a few modifications. Unlike in dbGaP, the fields DOCFILE and TYPE are required in the BDC Data Dictionary. The data submitter will also be required to submit information about the study, such as a study abbreviation and consent(s). This information is important because, as more datasets are submitted directly to BDC, it is not assigned by dbGaP.

The data ingestion process can be expedited by requiring data submitters to adhere to a standardized data dictionary. When data submitters do not provide the data dictionary in a consistent format, such as TSV, SAS, Excel, or XML, the result is back-and-forth with submitters, manual compilation of the data dictionary, and the development of new or tailored data ingestion pipelines. Additionally, data submitters have not consistently provided data dictionaries with usable decoding information, so the data is presented to the researcher in its encoded format. For example, if 1 is the encoded value for Male and 2 for Female, but the decoding is never defined, the researcher will not know the sex of the study participant. Each study that is not consistently formatted requires a unique level of effort for ingestion, as described in the Unique Data Loading Use-Cases.
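
The sex example above can be sketched in a few lines of Python. This is only an illustration of why decoding information matters; `SEX_CODES` and `decode` are hypothetical names and are not part of any BDC tool.

```python
# Hypothetical decoding table, as a data dictionary would define it.
SEX_CODES = {"1": "Male", "2": "Female"}


def decode(raw_value, codes):
    """Return the decoded meaning, or the raw value when no decoding is defined."""
    return codes.get(raw_value, raw_value)


encoded_column = ["1", "2", "2", "1"]

# With a decoding, the researcher sees meanings.
decoded = [decode(v, SEX_CODES) for v in encoded_column]

# Without one, the researcher is left with the encoded values.
undecoded = [decode(v, {}) for v in encoded_column]
```

When the dictionary omits the decoding, `undecoded` is identical to the raw column, which is the situation the paragraph above describes.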

Structures

BioData Catalyst Data Dictionary Format

Supported formats include CSV, TSV, SAS, Excel, XML, and any other tabular format. PDFs are not supported.

BioData Catalyst Study Submission Fields

*Indicates required

Study submission
Description

Study name*

The full name of the study

e.g. NHLBI TOPMed: SubPopulations and InteRmediate Outcome Measures In COPD Study

Study abbreviation*

The study abbreviation. Spaces are not permitted. Underscores are permitted.

e.g. SPIROMICS

Study consent number/consent/abbreviation*

e.g. consent number, consent, and abbreviation

consent: abbreviation
  • General Research Use: GRU
  • Non Profit Use Only: GRU-NPU
  • Disease-Specific (Chronic Obstructive Pulmonary Disease): DS-COPD
  • Disease-Specific (Chronic Obstructive Pulmonary Disease, NPU): DS-COPD-NPU

BioData Catalyst Data Dictionary Fields

*Indicates required

DOCFILE and TYPE are required by BDC; however, they are not required by dbGaP.

Column Headers
Description

VARNAME*

Variable name. The VARNAME must not contain backslashes (\). Do not use "dbGaP" in the variable name; "dbGaP" is reserved for dbGaP-generated items.
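
A submitter could pre-check variable names against the two VARNAME rules with a small helper like the one below. This is a minimal sketch, not a BDC validator: `varname_problems` is a hypothetical name, and treating the reserved term "dbGaP" case-insensitively is an assumption.

```python
def varname_problems(varname):
    """Return a list of rule violations for one VARNAME (empty list means OK)."""
    problems = []
    if not varname.strip():
        problems.append("VARNAME is required")
    if "\\" in varname:
        problems.append("VARNAME must not contain a backslash")
    # Assumption: the reserved term is matched regardless of letter case.
    if "dbgap" in varname.lower():
        problems.append('VARNAME must not contain "dbGaP"')
    return problems


ok = varname_problems("SUBJECT_ID")
reserved = varname_problems("dbGaP_SUBJECT_ID")
slashed = varname_problems("BP\\SITTING")
```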

VARDESC*

Variable description. The description should be understandable and enable users to replicate the variable. For example, "blood pressure" is useful, but "brachial blood pressure while sitting" provides more context. Alternatively, study documents that provide this level of detail are also acceptable.

DOCFILE*

Study document name associated with the variable. To list multiple documents, add a semicolon (;) between documents. Please list only study document filenames that are submitted.

TYPE*

Data value type. One of: integer (1, 2, 3, 4, …); encoded value (integers or strings coded for a non-numerical meaning, e.g. 1=Control; 2=Case, see VALUES); decimal (0.5, 2.5, …); string (African American, Asian, Caucasian, Hispanic, Non-Hispanic). For mixed values (any combination of strings, integers, decimals, and/or encoded values) in a single data column, list all types present, separated by commas.
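
To illustrate how a TYPE entry for a mixed column might be derived, the sketch below classifies each cell and joins the types that occur with commas. The function names and the output ordering are assumptions for illustration; the type labels are the ones listed above. Note that Python's `int()` and `float()` accept a few forms (such as `1_000` or `nan`) that a real validator may want to reject.

```python
def classify(value, encoded_values=()):
    """Classify one raw cell as 'encoded value', 'integer', 'decimal', or 'string'."""
    text = value.strip()
    # Codes declared in VALUES take precedence over their numeric appearance.
    if text in encoded_values:
        return "encoded value"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "decimal"
    except ValueError:
        return "string"


def type_field(column, encoded_values=()):
    """Build a TYPE entry: every type present in the column, comma-separated."""
    order = ["integer", "encoded value", "decimal", "string"]
    present = {classify(v, encoded_values) for v in column if v.strip()}
    return ", ".join(t for t in order if t in present)


integers_only = type_field(["1", "2", "3"])
mixed = type_field(["12", "13.5", "unknown"])
coded = type_field(["1", "2"], encoded_values={"1", "2"})
```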

UNITS*

Units of measurement of this variable.

VALUES*

List of all unique values and/or descriptions of all encoded values. Encoded values are defined as a value and its meaning. For example, if a data file contains a variable named "EDUCATION" and its data values are "1, 2, 3, and 99," these coded values will need to be defined in the data dictionary. The format of an encoded value is VALUE=MEANING.

The VALUES field contains each value together with its meaning. All encoded values for a variable go in a single field, with a pipe (|) separating the different values.

Example: 1=Completed High School|2=Completed College|3=Completed Masters|99=NA
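
The VALUE=MEANING format with pipe separators can be parsed as sketched below. `parse_values` is a hypothetical helper, and it assumes every entry in the field is an encoded value; a VALUES field that lists plain unique values without "=" would need separate handling.

```python
def parse_values(values_field):
    """Split a VALUES field such as '1=Yes|2=No' into a {value: meaning} dict."""
    decoding = {}
    for entry in values_field.split("|"):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the first "=" only, so a meaning may itself contain "=".
        value, sep, meaning = entry.partition("=")
        if not sep:
            raise ValueError(f"expected VALUE=MEANING, got {entry!r}")
        decoding[value.strip()] = meaning.strip()
    return decoding


education = parse_values(
    "1=Completed High School|2=Completed College|3=Completed Masters|99=NA"
)
```

The resulting dictionary can then be used to present decoded values to researchers instead of the raw codes.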

THE FOLLOWING FIELDS ARE NOT REQUIRED

COLLINTERVAL

Collection interval is the time frame in which the data for the variable or dataset was collected.

COMMENT1, COMMENT2

Additional information not included in the VARDESC that will further define the variable. If additional comments are needed beyond COMMENT2, insert new columns (COMMENT3, COMMENT4, etc.) before the column "ORDER."

MAX

The logical maximum value for the variable. If a separate code such as 9999 is used for a missing field, this should not be considered as the MAX value.

MIN

The logical minimum value of the variable. If a separate code such as -1 is used for a missing field, this should not be considered as the MIN value.
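
The rule that missing-value codes are excluded from MAX and MIN can be sketched as follows. `logical_range` is a hypothetical helper; it assumes the submitter knows which codes mark missing data and that every remaining cell is numeric.

```python
def logical_range(column, missing_codes=()):
    """Return (MIN, MAX) over numeric cells, ignoring blanks and missing codes."""
    skip = {str(code) for code in missing_codes}
    numbers = [
        float(v) for v in column if v.strip() and v.strip() not in skip
    ]
    if not numbers:
        return None, None
    return min(numbers), max(numbers)


ages = ["34", "51", "9999", "-1", "", "47"]
age_min, age_max = logical_range(ages, missing_codes=["9999", "-1"])
```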

ORDER

The order in which VALUES appear on the variable summary report page. If VALUES of a single variable/column of data are integers or decimals, leave blank. If VALUES are encoded values, string, or mixed, define the order. VALUES can be ordered by Frequency (highest to lowest frequency of VALUES) or by List (user specifies order through placement in VALUES columns). For mixed values within a single variable/column of data, see examples: "age" and "weight" in example file 5b_SubjectPhenotypes_DD.xlsx.

RESOLUTION

Measurement resolution – the number of decimal places to which a measured value is presented in the data. For example, in 54.321 the resolution is 3.

SOURCE_VARIABLE_ID

A unique identifier from the VARIABLE_SOURCE or a unique text concept/term from various controlled vocabularies. (Must be submitted as a group with VARIABLE_SOURCE and VARIABLE_MAPPING).

UNIQUEKEY

Unique key is a combination of variables that is designed to uniquely identify a row in a longitudinal dataset or rows that have repeating SUBJECT_IDs or SAMPLE_IDs. Mark "X" for variables that constitute the unique keys, and leave other values blank. Ex. SUBJECT_ID and VISIT_NUMBER. UNIQUEKEYs can only be used in the subject phenotypes file and some cases of the sample attributes file. The SC, SSM, and pedigree files should never have UNIQUEKEYs marked, since there should be a unique identifier appearing once in each file.

VARIABLE_MAPPING

For example, a variable from the source could be Identical, Related, or Comparable. (Must be submitted as a group with VARIABLE_SOURCE and SOURCE_VARIABLE_ID).

VARIABLE_SOURCE

Source of controlled vocabularies. Ex. PhenX, MeSH, SNOMED, NCI. If there is no match, leave it blank. (Must be submitted as a group with SOURCE_VARIABLE_ID and VARIABLE_MAPPING).

For the study submission field "Study consent number/consent/abbreviation":

Consent Group: Use the dbGaP consent number for guidance.

Consent and abbreviation: Use the NIH Consent Codes: Upholding Standard Data Use Conditions for guidance.
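
Two of the optional-field rules lend themselves to an automated check: the variables marked in UNIQUEKEY should together identify each data row exactly once, and VARIABLE_SOURCE, SOURCE_VARIABLE_ID, and VARIABLE_MAPPING must be submitted as a group. The sketch below shows one way to test both. All function names and the sample rows are hypothetical, and accepting a lowercase "x" as a mark is an assumption.

```python
from collections import Counter

MAPPING_GROUP = ("VARIABLE_SOURCE", "SOURCE_VARIABLE_ID", "VARIABLE_MAPPING")


def unique_key_columns(dictionary_rows):
    """Return the VARNAMEs whose UNIQUEKEY cell is marked with an X."""
    return [
        row["VARNAME"]
        for row in dictionary_rows
        # Assumption: the mark is accepted in either letter case.
        if row.get("UNIQUEKEY", "").strip().upper() == "X"
    ]


def duplicate_keys(data_rows, key_columns):
    """Return key tuples that occur in more than one data row."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in data_rows)
    return sorted(key for key, n in counts.items() if n > 1)


def incomplete_mapping_groups(dictionary_rows):
    """Return VARNAMEs that fill some, but not all, of the mapping fields."""
    flagged = []
    for row in dictionary_rows:
        filled = [bool(row.get(f, "").strip()) for f in MAPPING_GROUP]
        if any(filled) and not all(filled):
            flagged.append(row["VARNAME"])
    return flagged


# Made-up rows for illustration only.
dictionary = [
    {"VARNAME": "SUBJECT_ID", "UNIQUEKEY": "X"},
    {"VARNAME": "VISIT_NUMBER", "UNIQUEKEY": "X"},
    {"VARNAME": "SBP", "UNIQUEKEY": "", "VARIABLE_SOURCE": "PhenX"},
]
data = [
    {"SUBJECT_ID": "S1", "VISIT_NUMBER": "1", "SBP": "120"},
    {"SUBJECT_ID": "S1", "VISIT_NUMBER": "2", "SBP": "118"},
    {"SUBJECT_ID": "S1", "VISIT_NUMBER": "2", "SBP": "121"},
]

keys = unique_key_columns(dictionary)
dupes = duplicate_keys(data, keys)
flagged = incomplete_mapping_groups(dictionary)
```

In this made-up sample, the third data row repeats the key of the second, and SBP names a VARIABLE_SOURCE without the other two mapping fields, so both checks report a problem.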