Data Dictionary Requirement
A core mission of NHLBI BioData Catalyst® (BDC) is to onboard a wide variety of Heart, Lung, Blood, and Sleep (HLBS) data types into the ecosystem for immediate use by researchers to drive discovery and gain new insights. As new data types are ingested, new fields are identified for the data ingestion process. The Data Management Core (DMC) will integrate the data dictionary into its data submission requirements. The Data Release Management Working Group (DRMWG) will ensure the data submission requirements have been fulfilled before data ingestion is initiated.
Requiring a standardized data dictionary will increase the velocity of data ingestion. Curators or software can use it to validate the data in the files, and it enables more automated data processing. The BDC Data Dictionary aligns with the dbGaP Data Dictionary and Format, with a few modifications. Unlike dbGaP, the BDC Data Dictionary requires the fields DOCFILE and TYPE. The data submitter will also be required to submit information about the study, such as a study abbreviation and consent(s). This information becomes more important as more datasets are submitted directly to BDC, because dbGaP does not assign it for those datasets.
The data ingestion process can be expedited by requiring data submitters to adhere to a standardized data dictionary. If data submitters do not provide the data dictionary in a consistent format, such as TSV, SAS, Excel, or XML, the result is back-and-forth with submitters, manual compilation of the data dictionary, and the development of new or tailored data ingestion pipelines. Additionally, data submitters have not consistently provided data dictionaries with usable decoding information; therefore, the data is presented to the researcher in its encoded format. For example, if 1 is the encoded value for the decoded value Male and 2 for Female, but the decoding is never defined, the researcher will not know the sex of the study participant. Each study that does not follow a standardized format requires a unique level of effort for ingestion, as described in the Unique Data Loading Use-Cases.
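To illustrate why the decoding information matters, the following minimal sketch applies a codebook written in the VALUE=MEANING form to an encoded column. The function names `parse_codebook` and `decode` are hypothetical and not part of any BDC tooling. Codes that are missing from the codebook are left unchanged, which mirrors what a researcher sees when no decoding is supplied.

```python
def parse_codebook(values_field):
    """Turn 'VALUE=MEANING|VALUE=MEANING' into a dict of code -> meaning."""
    codebook = {}
    for item in values_field.split("|"):
        item = item.strip()
        if not item:
            continue
        # Split on the first "=" only, so a meaning may itself contain "=".
        code, sep, meaning = item.partition("=")
        if not sep:
            raise ValueError(f"Missing '=' in encoded value: {item!r}")
        codebook[code.strip()] = meaning.strip()
    return codebook


def decode(column, codebook):
    """Replace each encoded value with its meaning; keep unknown codes as-is."""
    return [codebook.get(str(value).strip(), value) for value in column]


sex_codes = parse_codebook("1=Male|2=Female")
encoded = ["1", "2", "2", "1"]

print(decode(encoded, sex_codes))  # decoded with the dictionary
print(decode(encoded, {}))         # no decoding information: codes stay opaque
```

With an empty codebook the second call returns the raw codes, which is the situation described above.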
Supported formats include CSV, TSV, SAS, Excel, XML, and other tabular formats. PDFs are not supported.
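As a rough illustration of format handling, the sketch below reads a data dictionary from CSV or TSV text using only the Python standard library and rejects PDF input. The function name `read_dictionary` and the file name `dictionary.tsv` are hypothetical. SAS, Excel, and XML would need additional libraries and are deliberately left out of this sketch.

```python
import csv
import io
from pathlib import Path

# Only the two delimited text formats are handled in this sketch.
DELIMITERS = {".csv": ",", ".tsv": "\t"}


def read_dictionary(filename, text):
    """Parse data dictionary text into a list of dicts, one per variable."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        raise ValueError("PDF data dictionaries are not supported")
    if suffix not in DELIMITERS:
        raise NotImplementedError(f"This sketch does not read {suffix!r} files")
    reader = csv.DictReader(io.StringIO(text), delimiter=DELIMITERS[suffix])
    return list(reader)


sample = "VARNAME\tVARDESC\tTYPE\nSEX\tSex of participant\tencoded value\n"
rows = read_dictionary("dictionary.tsv", sample)
print(rows[0]["VARNAME"], "|", rows[0]["TYPE"])
```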
*Indicates required
DOCFILE and TYPE are required by BDC; however, they are not required by dbGaP.
Column Headers | Description |
---|---|
VARNAME* | Variable name. The VARNAME must not contain backward slashes (\). Do not use "dbGaP" in the variable name. "dbGaP" is reserved for dbGaP generated items. |
VARDESC* | Variable description. The description should be understandable and enable users to replicate the variable. For example, "blood pressure" is useful, but "brachial blood pressure while sitting" provides more context. Alternatively, study documents with detail are also acceptable. |
DOCFILE* | Study document name associated with the variable. To list multiple documents, add a semicolon (;) between documents. Please list only study document filenames that are submitted. |
TYPE* | Data value type: integer (1,2,3,4,…), encoded value (integers or strings are coded for non-numerical meaning, ex. 1=Control; 2=Case, see VALUES), decimal (0.5,2.5,…), string (African American, Asian, Caucasian, Hispanic, Non-Hispanic). For mixed values (any combination of string, integers, decimals and/or encoded values) in a single data column, list all types present separated by a comma. |
UNITS* | Units of measurement of this variable. |
VALUES* | List of all unique values and/or descriptions of all encoded values. Encoded values are defined as a value and its meaning. For example, if a data file contains a variable named "EDUCATION" and its data values are "1, 2, 3, and 99," these coded values will need to be defined in the data dictionary. The format of an encoded value is VALUE=MEANING. The VALUES field will contain the value’s meaning. There will be one field using a pipe to separate the different values. Example: 1=Completed High School\|2=Completed College\|3=Completed Masters\|99=NA |
THE FOLLOWING FIELDS ARE NOT REQUIRED | |
COLLINTERVAL | Collection interval is the time frame in which the data for the variable or dataset was collected. |
COMMENT1, COMMENT2 | Additional information not included in the VARDESC that will further define the variable. If additional comments are needed beyond COMMENT2, insert new columns (COMMENT3, COMMENT4, etc.) before the column "ORDER." |
MAX | The logical maximum value for the variable. If a separate code such as 9999 is used for a missing field, this should not be considered as the MAX value. |
MIN | The logical minimum value of the variable. If a separate code such as -1 is used for a missing field, this should not be considered as the MIN value. |
ORDER | The order in which VALUES appear on the variable summary report page. If VALUES of a single variable/column of data are integers or decimals, leave blank. If VALUES are encoded values, string, or mixed, define the order. VALUES can be ordered by Frequency (highest to lowest frequency of VALUES) or by List (user specifies order through placement in VALUES columns). For mixed values within a single variable/column of data, see examples: "age" and "weight" in example file 5b_SubjectPhenotypes_DD.xlsx. |
RESOLUTION | Measurement resolution – the number of decimal places to which a measured value is presented in the data. For example, in 54.321 the resolution is 3. |
SOURCE_VARIABLE_ID | A unique identifier from the VARIABLE_SOURCE or a unique text concept/term from various controlled vocabularies. (Must be submitted as a group with VARIABLE_SOURCE and VARIABLE_MAPPING). |
UNIQUEKEY | Unique key is a combination of variables that is designed to uniquely identify a row in a longitudinal dataset or rows that have repeating SUBJECT_IDs or SAMPLE_IDs. Mark "X" for variables that constitute the unique keys, and leave other values blank. Ex. SUBJECT_ID and VISIT_NUMBER. UNIQUEKEYs can only be used in the subject phenotypes file and some cases of the sample attributes file. The SC, SSM, and pedigree files should never have UNIQUEKEYs marked, since there should be a unique identifier appearing once in each file. |
VARIABLE_MAPPING | For example, a variable from the source could be Identical, Related, or Comparable. (Must be submitted as a group with VARIABLE_SOURCE and SOURCE_VARIABLE_ID). |
VARIABLE_SOURCE | Source of controlled vocabularies. Ex. PhenX, MeSH, SNOMED, NCI. If there is no match, leave it blank. (Must be submitted as a group with SOURCE_VARIABLE_ID and VARIABLE_MAPPING). |
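The column rules above lend themselves to an automated check. The following is a minimal sketch, not BDC's actual validation, that tests one data dictionary row (represented as a Python dict) against a few of those rules. The function name `check_row` and the sample rows, including the file name `protocol.pdf`, are hypothetical. It rests on three assumptions: "required" means the column header must be present (cells such as UNITS may be empty), the reserved term "dbGaP" is matched case-insensitively, and only rows whose TYPE is exactly "encoded value" have their VALUES checked, so mixed types are skipped.

```python
REQUIRED_COLUMNS = ("VARNAME", "VARDESC", "DOCFILE", "TYPE", "UNITS", "VALUES")


def check_row(row):
    """Return a list of problems found in one data dictionary row (a dict)."""
    problems = [f"missing column {name}" for name in REQUIRED_COLUMNS if name not in row]

    varname = row.get("VARNAME", "")
    if "\\" in varname:
        problems.append("VARNAME contains a backward slash")
    # Assumption: the reserved term is matched regardless of letter case.
    if "dbgap" in varname.lower():
        problems.append('VARNAME uses the reserved term "dbGaP"')

    # Assumption: only purely encoded variables are checked; mixed types are skipped.
    if row.get("TYPE", "").strip().lower() == "encoded value":
        for item in row.get("VALUES", "").split("|"):
            if "=" not in item:
                problems.append(f"VALUES item {item!r} is not VALUE=MEANING")
    return problems


good = {
    "VARNAME": "EDUCATION",
    "VARDESC": "Highest education level completed",
    "DOCFILE": "protocol.pdf",  # made-up study document name
    "TYPE": "encoded value",
    "UNITS": "",
    "VALUES": "1=Completed High School|2=Completed College|3=Completed Masters|99=NA",
}
bad = {"VARNAME": "dbGaP\\ID", "VARDESC": "Identifier",
       "TYPE": "encoded value", "VALUES": "1|2=Case"}

for row in (good, bad):
    print(row["VARNAME"], check_row(row))
```

Returning a list of problems rather than raising on the first one lets a curator report every issue to the submitter in a single round.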
Study submission | Description |
---|---|
Study name* | The full name of the study, e.g. NHLBI TOPMed: SubPopulations and InteRmediate Outcome Measures In COPD Study |
Study abbreviation* | The study abbreviation. Spaces are not permitted. Underscores are permitted. e.g. SPIROMICS |
Study consent number/consent/abbreviation* | Consent Group: Use the dbGaP consent number for guidance. Consent and abbreviation: Use the NIH Consent Codes: Upholding Standard Data Use Conditions for guidance. e.g. consent number, consent, and abbreviation (examples follow) |

consent | abbreviation |
---|---|
General Research Use | GRU |
Non Profit Use Only | GRU-NPU |
Disease-Specific (Chronic Obstructive Pulmonary Disease) | DS-COPD |
Disease-Specific (Chronic Obstructive Pulmonary Disease, NPU) | DS-COPD-NPU |
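The study submission rules can be checked in the same spirit. This is a minimal sketch under stated assumptions: the function name `check_study_submission` is hypothetical, all whitespace characters are treated like spaces, and consents are passed as a dict of consent to abbreviation. It checks only the rules stated in this document and does not verify consent codes against the NIH list.

```python
def check_study_submission(name, abbreviation, consents):
    """Return a list of problems with the required study submission fields."""
    problems = []
    if not name.strip():
        problems.append("Study name is required")
    if not abbreviation:
        problems.append("Study abbreviation is required")
    elif any(ch.isspace() for ch in abbreviation):
        # Assumption: tabs and other whitespace are treated like spaces.
        problems.append("Study abbreviation must not contain spaces")
    if not consents:
        problems.append("At least one consent is required")
    return problems


consents = {
    "General Research Use": "GRU",
    "Disease-Specific (Chronic Obstructive Pulmonary Disease)": "DS-COPD",
}
study_name = "NHLBI TOPMed: SubPopulations and InteRmediate Outcome Measures In COPD Study"

print(check_study_submission(study_name, "SPIROMICS", consents))
print(check_study_submission(study_name, "SPIRO MICS", consents))
```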