Data Dictionary Requirement

Background

A core mission of NHLBI BioData Catalyst® (BDC) is to onboard a wide variety of Heart, Lung, Blood, and Sleep (HLBS) data types into the ecosystem for immediate use by researchers to drive discovery and gain new insights. As new data types are ingested, new fields are identified for the data ingestion process. The Data Management Core (DMC) will integrate the data dictionary into its data submission requirements. The Data Release Management Working Group (DRMWG) will ensure the data submission requirements have been fulfilled before data ingestion is initiated.

Requiring a standardized data dictionary will increase the velocity of data ingestion. Curators or software can use it to validate the data in the files, which enables more automated data processing. The BDC Data Dictionary aligns with the dbGaP Data Dictionary and Format, with a few modifications. The fields DOCFILE and TYPE are required in the BDC Data Dictionary, unlike in dbGaP. The data submitter must also submit information about the study, such as a study abbreviation and consent(s). This information matters because more datasets are now submitted directly to BDC, in which case dbGaP does not assign it.

The data ingestion process can be expedited by requiring data submitters to adhere to a standardized data dictionary. When data submitters do not provide the data dictionary in a consistent format (such as TSV, SAS, Excel, or XML), the result is back-and-forth with submitters, manual compilation of the data dictionary, and development of new or tailored data ingestion pipelines. Additionally, data submitters have not consistently provided data dictionaries with usable decoding information, so the data is presented to the researcher in its encoded form. For example, if 1 is the encoded value for Male and 2 for Female, but the decoding is never defined, the researcher will not know the sex of the study participant. Each study that does not follow the standard format requires its own level of effort for ingestion, as described in the Unique Data Loading Use-Cases.

Structures

BioData Catalyst Data Dictionary Format

Supported formats include CSV, TSV, SAS, Excel, XML, and any other tabular format. PDFs are not supported.
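For the delimited formats (CSV and TSV), a data dictionary can be read into one record per variable using only the Python standard library. The following is a minimal sketch, not part of any BDC tooling: the function name `read_dictionary` and the sample row (`AGE`) are made up for illustration, and the column headers are the ones defined in the Data Dictionary Fields table.

```python
import csv
import io


def read_dictionary(text: str, delimiter: str = "\t") -> list[dict[str, str]]:
    """Parse delimited data dictionary text into one dict per variable.

    The first line is taken as the column headers (VARNAME, VARDESC, ...).
    Pass delimiter="," for CSV; the default is a tab for TSV.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Illustrative input only; a real dictionary carries all required columns.
sample_tsv = "VARNAME\tVARDESC\nAGE\tAge at enrollment\n"
rows = read_dictionary(sample_tsv)
print(rows[0]["VARNAME"])  # AGE
```

SAS, Excel, and XML dictionaries would need format-specific readers, which this sketch does not cover.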

BioData Catalyst Study Submission Fields

*Indicates required

Study submission
Description

Study name*

The full name of the study

e.g. NHLBI TOPMed: SubPopulations and InteRmediate Outcome Measures In COPD Study

Study abbreviation*

The study abbreviation. Spaces are not permitted. Underscores are permitted.

e.g. SPIROMICS

Study consent number/consent/abbreviation*

e.g. consent number, consent, and abbreviation

consent

abbreviation

General Research Use

GRU

Non Profit Use Only

GRU-NPU

Disease-Specific (Chronic Obstructive Pulmonary Disease)

DS-COPD

Disease-Specific (Chronic Obstructive Pulmonary Disease, NPU)

DS-COPD-NPU
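The study submission rules above (three required items, no spaces in the abbreviation) can be checked before a submission is sent. This is a hedged sketch under stated assumptions: the dictionary keys (`study_name`, `study_abbreviation`, `consents`, `number`, `consent`, `abbreviation`) and the function name are hypothetical, since this document does not define a machine-readable submission format, and the consent number in the example is a placeholder.

```python
# Hypothetical key names; the document does not prescribe a schema.
REQUIRED_STUDY_FIELDS = ("study_name", "study_abbreviation", "consents")
CONSENT_KEYS = ("number", "consent", "abbreviation")


def validate_study_submission(submission: dict) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # Every starred study submission field must be present and non-empty.
    for field in REQUIRED_STUDY_FIELDS:
        if not submission.get(field):
            problems.append(f"missing required field: {field}")

    # Spaces are not permitted in the abbreviation; underscores are fine.
    abbreviation = submission.get("study_abbreviation") or ""
    if any(ch.isspace() for ch in abbreviation):
        problems.append("study_abbreviation must not contain spaces")

    # Each consent entry needs a number, a consent, and an abbreviation.
    for i, consent in enumerate(submission.get("consents") or [], start=1):
        for key in CONSENT_KEYS:
            if not str(consent.get(key, "")).strip():
                problems.append(f"consent entry {i} is missing {key}")

    return problems


example = {
    "study_name": "NHLBI TOPMed: SubPopulations and InteRmediate "
                  "Outcome Measures In COPD Study",
    "study_abbreviation": "SPIROMICS",
    # The number 1 is a placeholder, not an assigned consent number.
    "consents": [
        {"number": 1, "consent": "General Research Use", "abbreviation": "GRU"},
    ],
}
print(validate_study_submission(example))  # []
```

Returning a list of problems rather than raising on the first one lets a submitter fix everything in a single pass.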

BioData Catalyst Data Dictionary Fields

*Indicates required

DOCFILE and TYPE are required by BDC; however, they are not required by dbGaP.

Column Headers
Description

VARNAME*

Variable name. The VARNAME must not contain backslashes (\). Do not use "dbGaP" in the variable name; "dbGaP" is reserved for dbGaP-generated items.

VARDESC*

Variable description. The description should be understandable and enable users to replicate the variable. For example, "blood pressure" is useful, but "brachial blood pressure while sitting" provides more context. Alternatively, study documents that provide this detail are also acceptable.

DOCFILE*

Study document name associated with the variable. To list multiple documents, add a semicolon (;) between documents. Please list only study document filenames that are submitted.

TYPE*

Data value type: integer (1, 2, 3, 4, …), encoded value (integers or strings that are coded for a non-numerical meaning, e.g. 1=Control, 2=Case; see VALUES), decimal (0.5, 2.5, …), or string (e.g. African American, Asian, Caucasian, Hispanic, Non-Hispanic). For mixed values (any combination of strings, integers, decimals, and/or encoded values) in a single data column, list all types present, separated by commas.

UNITS*

Units of measurement of this variable.

VALUES*

List of all unique values and/or descriptions of all encoded values. An encoded value is defined as a value and its meaning. For example, if a data file contains a variable named "EDUCATION" whose data values are 1, 2, 3, and 99, these coded values must be defined in the data dictionary. The format of an encoded value is VALUE=MEANING.

All encoded values for a variable go in a single VALUES field, with a pipe (|) separating each VALUE=MEANING pair.

Example: 1=Completed High School|2=Completed College|3=Completed Masters|99=NA
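The required-field rules and the VALUE=MEANING format above lend themselves to automated checks. The sketch below is illustrative only: the function names are hypothetical, it treats every starred column as mandatory for every row (which may be stricter than intended, for instance UNITS on a string variable), it matches "dbGaP" case-insensitively as an assumption, and it skips VALUES entries that have no "=" rather than rejecting them.

```python
REQUIRED_COLUMNS = ("VARNAME", "VARDESC", "DOCFILE", "TYPE", "UNITS", "VALUES")


def parse_values(values_field: str) -> dict[str, str]:
    """Turn 'VALUE=MEANING|VALUE=MEANING' into a {value: meaning} mapping.

    Entries without '=' are plain (non-encoded) values and are skipped.
    """
    decoded = {}
    for item in values_field.split("|"):
        value, sep, meaning = item.partition("=")
        if sep:
            decoded[value.strip()] = meaning.strip()
    return decoded


def split_docfiles(docfile_field: str) -> list[str]:
    """Split the DOCFILE field on semicolons into individual filenames."""
    return [name.strip() for name in docfile_field.split(";") if name.strip()]


def check_row(row: dict[str, str]) -> list[str]:
    """Return the problems found in one data dictionary row."""
    problems = []
    for column in REQUIRED_COLUMNS:
        if not (row.get(column) or "").strip():
            problems.append(f"{column} is required")

    varname = row.get("VARNAME") or ""
    if "\\" in varname:
        problems.append("VARNAME must not contain a backslash")
    # Assumption: the reserved word is matched regardless of letter case.
    if "dbgap" in varname.lower():
        problems.append('VARNAME must not contain "dbGaP"')
    return problems


codes = parse_values(
    "1=Completed High School|2=Completed College|3=Completed Masters|99=NA"
)
print(codes["2"])  # Completed College
```

With the mapping in hand, an encoded data column can be presented to researchers in decoded form, which is the gap the Background section describes.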

THE FOLLOWING FIELDS ARE NOT REQUIRED

COLLINTERVAL

Collection interval is the time frame in which the data for the variable or dataset was collected.

COMMENT1, COMMENT2

Additional information not included in the VARDESC that will further define the variable. If additional comments are needed beyond COMMENT2, insert new columns (COMMENT3, COMMENT4, etc.) before the column "ORDER."

MAX

The logical maximum value for the variable. If a separate code such as 9999 is used for a missing field, this should not be considered as the MAX value.

MIN

The logical minimum value of the variable. If a separate code such as -1 is used for a missing field, this should not be considered as the MIN value.

ORDER

RESOLUTION

Measurement resolution – the number of decimal places to which a measured value is presented in the data. For example, in 54.321 the resolution is 3.

SOURCE_VARIABLE_ID

A unique identifier from the VARIABLE_SOURCE or a unique text concept/term from various controlled vocabularies. (Must be submitted as a group with VARIABLE_SOURCE and VARIABLE_MAPPING).

UNIQUEKEY

Unique key is a combination of variables that uniquely identifies a row in a longitudinal dataset or in a dataset with repeating SUBJECT_IDs or SAMPLE_IDs. Mark "X" for the variables that constitute the unique key, and leave the others blank, e.g. SUBJECT_ID and VISIT_NUMBER. UNIQUEKEYs can only be used in the subject phenotypes file and in some cases of the sample attributes file. The SC, SSM, and pedigree files should never have UNIQUEKEYs marked, since each of those files should have a unique identifier that appears only once.

VARIABLE_MAPPING

How the variable relates to the matching item in the source. For example, a variable could be Identical, Related, or Comparable to the source item. (Must be submitted as a group with VARIABLE_SOURCE and SOURCE_VARIABLE_ID).

VARIABLE_SOURCE

Source of controlled vocabularies. Ex. PhenX, MeSH, SNOMED, NCI. If there is no match, leave it blank. (Must be submitted as a group with SOURCE_VARIABLE_ID and VARIABLE_MAPPING).
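Several of the optional fields carry rules that can be checked mechanically: the three mapping fields must be submitted as a group, RESOLUTION counts decimal places, and MIN and MAX exclude missing-value codes. The following sketch illustrates those three rules. The function names are hypothetical, and it assumes values are finite numbers written in plain decimal notation.

```python
from decimal import Decimal

MAPPING_GROUP = ("VARIABLE_SOURCE", "SOURCE_VARIABLE_ID", "VARIABLE_MAPPING")


def check_mapping_group(row: dict[str, str]) -> list[str]:
    """The three mapping fields must be all filled in or all left blank."""
    filled = [c for c in MAPPING_GROUP if (row.get(c) or "").strip()]
    if filled and len(filled) != len(MAPPING_GROUP):
        missing = [c for c in MAPPING_GROUP if c not in filled]
        return [f"{', '.join(filled)} given without {', '.join(missing)}"]
    return []


def resolution(value: str) -> int:
    """Number of decimal places in a value as written, e.g. '54.321' -> 3."""
    exponent = Decimal(value).as_tuple().exponent
    return max(0, -exponent)


def logical_range(values, missing_codes):
    """Return (MIN, MAX) after dropping codes that stand for missing data."""
    real = [v for v in values if v not in missing_codes]
    if not real:
        return (None, None)
    return (min(real), max(real))


print(resolution("54.321"))  # 3
print(logical_range([12, 47, 9999, -1, 30], {9999, -1}))  # (12, 47)
```

`resolution` takes the value as a string because converting to a float first would lose trailing zeros (for example, "2.50" is presented to two decimal places).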
