Data Dictionary Requirement

Background

A core mission of NHLBI BioData Catalyst® (BDC) is to onboard a wide variety of Heart, Lung, Blood, and Sleep (HLBS) data types into the ecosystem for immediate use by researchers to drive discovery and gain new insights. As new data types are ingested, new fields are identified for the data ingestion process. The Data Management Core (DMC) will integrate the data dictionary into its data submission requirements. The Data Release Management Working Group (DRMWG) will ensure the data submission requirements have been fulfilled before data ingestion is initiated.

Requiring a standardized data dictionary will increase the velocity of data ingestion. Curators or software can use the dictionary to validate the data in the files, which enables more automated data processing. The BDC Data Dictionary aligns with the dbGaP Data Dictionary and Format, with a few modifications. Unlike dbGaP, the BDC Data Dictionary requires the DOCFILE and TYPE fields. The data submitter will also be required to submit information about the study, such as a study abbreviation and consent(s). This information becomes more important as more datasets are submitted directly to BDC, because it is then not assigned by dbGaP.
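As a rough illustration of how software could check a submitted data dictionary for the BDC-required fields, the sketch below scans a TSV dictionary and reports rows where DOCFILE or TYPE is missing or blank. It is only a minimal sketch under assumptions: the function name `find_missing_required`, the VARNAME column used to label rows, and the sample values are all hypothetical and are not part of any BDC tooling.

```python
import csv
import io

# Per the text above, BDC requires DOCFILE and TYPE.
# VARNAME is an assumed identifier column, used here only to label problem rows.
REQUIRED_FIELDS = ("DOCFILE", "TYPE")


def find_missing_required(tsv_text, required=REQUIRED_FIELDS):
    """Return (row_label, field) pairs where a required field is absent or blank."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    # A required column missing from the header is reported once.
    problems = [("<header>", f) for f in required if f not in header]
    # Data rows start on line 2 of the file (line 1 is the header).
    for line_no, row in enumerate(reader, start=2):
        label = row.get("VARNAME") or f"line {line_no}"
        for field in required:
            if field in header and not (row.get(field) or "").strip():
                problems.append((label, field))
    return problems


# Hypothetical sample dictionary: the AGE row has no DOCFILE value.
sample = (
    "VARNAME\tDOCFILE\tTYPE\n"
    "SEX\tbaseline_form.docx\tencoded value\n"
    "AGE\t\tinteger\n"
)
print(find_missing_required(sample))  # [('AGE', 'DOCFILE')]
```

A check of this kind could run before ingestion starts, so that gaps are reported to the submitter in one pass instead of being found during loading.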

The data ingestion process can be expedited by requiring data submitters to adhere to a standardized data dictionary. When data submitters do not provide the data dictionary in a consistent format (such as TSV, SAS, Excel, or XML), the result is back-and-forth with submitters, manual compilation of the data dictionary, and the development of new or tailored data ingestion pipelines. Additionally, data submitters have not consistently provided data dictionaries with usable decoding information; in those cases, the data is presented to the researcher in its encoded format. For example, if 1 is the encoded value for the decoded value Male and 2 for Female, but the decoding is never defined, the researcher will not know the sex of the study participant. Each study that does not follow a standardized format requires its own level of effort for ingestion, as described in the Unique Data Loading Use-Cases.
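The decoding problem described above can be sketched in a few lines. The example assumes that decoding information arrives as `code=label` strings such as `1=Male`; that representation, and the helper names `parse_value_map` and `decode`, are assumptions made for illustration and do not describe the BDC ingestion pipeline.

```python
def parse_value_map(pairs):
    """Turn ["1=Male", "2=Female"] into {"1": "Male", "2": "Female"}."""
    mapping = {}
    for pair in pairs:
        code, sep, label = pair.partition("=")
        if sep:  # skip entries that have no "=" separator
            mapping[code.strip()] = label.strip()
    return mapping


def decode(value, mapping):
    """Return the decoded label, or the raw value unchanged if no decoding is defined."""
    return mapping.get(str(value).strip(), value)


# With decoding information, encoded values become readable labels.
sex_map = parse_value_map(["1=Male", "2=Female"])
print([decode(v, sex_map) for v in ["1", "2", "9"]])  # ['Male', 'Female', '9']

# Without decoding information, the researcher only sees the raw codes.
print([decode(v, {}) for v in ["1", "2"]])  # ['1', '2']
```

The second call shows the situation the paragraph warns about: when the dictionary defines no decoding, the values pass through in their encoded form.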

Structures

BioData Catalyst Data Dictionary Format

Supported formats include CSV, TSV, SAS, Excel, XML, and any other tabular format. PDFs are not supported.

BioData Catalyst Study Submission Fields

*Indicates required

BioData Catalyst Data Dictionary Fields

*Indicates required

DOCFILE and TYPE are required by BDC; however, they are not required by dbGaP.
