Data Organization in PIC-SURE
PIC-SURE integrates clinical and genomic datasets across BioData Catalyst, including TOPMed and TOPMed-related studies, COVID-19 studies, and BioLINCC studies. Each variable is organized as a concept path that contains information about the study, variable group, and variable. The specifics of a concept path depend on the type of study, but the overall information it carries is the same.
For more information about additional dbGaP, TOPMed, and PIC-SURE concept paths, refer to Appendix 1.
Table of Data Fields in PIC-SURE
General organization | Data organized using the format implemented by the database of Genotypes and Phenotypes (dbGaP). Find more information on the dbGaP data structure here. Generally, a given study will have several tables, and those tables have several variables. | Data do not follow dbGaP format; there are no phv or pht accessions. Data are organized in groups of like variables, when available. For example, variables like Age, Gender, and Race could be part of the Demographics variable group. |
Concept path structure | \phs\pht\phv\variable name\ | \phs\variable name |
Variable ID | phv corresponding to the variable accession number | Equivalent to variable name |
Variable name | Encoded variable name that was used by the original submitters of the data | Encoded variable name that was used by the original submitters of the data |
Variable description | Description of the variable | Description of the variable, as available |
Dataset ID | pht corresponding to the trait table accession number | Equivalent to dataset name |
Dataset name | Name of the trait table | Name of a group of like variables, as available |
Dataset description | Description of the trait table | Description of a group of like variables, as available |
Study ID | phs corresponding to the study accession number | phs corresponding to the study accession number |
Study description | Description of the study from dbGaP | Description of the study from dbGaP |
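The two concept path structures in the table can be told apart by the number of backslash-separated segments. The following is a minimal sketch of that idea, not part of any PIC-SURE client library: the `ConceptPath` type, the `parse_concept_path` helper, and the example accessions (`phs000001`, `pht000002`, `phv00000003`, `AGE`) are all hypothetical and serve only as illustration.

```python
from typing import NamedTuple, Optional


class ConceptPath(NamedTuple):
    """Hypothetical container for the parts of a concept path."""
    study_id: str               # phs accession
    dataset_id: Optional[str]   # pht accession, dbGaP-format studies only
    variable_id: str            # phv accession, or the variable name otherwise
    variable_name: str          # encoded name used by the original submitters


def parse_concept_path(path: str) -> ConceptPath:
    """Split a concept path into its parts (illustrative helper only)."""
    # Drop the empty strings produced by leading and trailing backslashes.
    parts = [p for p in path.split("\\") if p]
    if len(parts) == 4:
        # dbGaP format: \phs\pht\phv\variable name\
        study, table, variable, name = parts
        return ConceptPath(study, table, variable, name)
    if len(parts) == 2:
        # Non-dbGaP format: \phs\variable name
        # Per the table, the variable ID is equivalent to the variable name.
        study, name = parts
        return ConceptPath(study, None, name, name)
    raise ValueError(f"Unrecognized concept path structure: {path!r}")


# Made-up accessions, used only to exercise both structures.
print(parse_concept_path("\\phs000001\\pht000002\\phv00000003\\AGE\\"))
print(parse_concept_path("\\phs000001\\AGE"))
```

The sketch assumes that variable names do not themselves contain backslashes. Real concept paths may have additional levels for some studies; see Appendix 1.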
PIC-SURE has two data types: categorical and continuous. Categorical variables are variables whose values fall into categories. For example, “Have you ever had asthma?” with values “Yes” and “No” is a categorical variable. Continuous variables are variables that take a numeric range of values. For example, “Age” with a value range from 10 to 90 is a continuous variable. The internal PIC-SURE data load process determines the type of each variable based on the data.
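The exact rule used by the PIC-SURE data load process is not described here. As a rough illustration of what determining the type from the data can look like, the sketch below treats a variable as continuous when every non-missing value parses as a number and as categorical otherwise. The `infer_variable_type` function and its handling of missing values are assumptions made for this example, not the actual PIC-SURE logic.

```python
def infer_variable_type(values):
    """Classify a variable as 'continuous' or 'categorical' (illustrative heuristic)."""
    # Assumption: None and blank strings represent missing values and are ignored.
    observed = [v for v in values if v is not None and str(v).strip() != ""]
    if not observed:
        # Assumption: with no observed data, fall back to categorical.
        return "categorical"
    for value in observed:
        try:
            float(value)
        except (TypeError, ValueError):
            # At least one value is not numeric, so treat the variable as categorical.
            return "categorical"
    return "continuous"


asthma_answers = ["Yes", "No", "No", None]
ages = [10, 45.5, "90", ""]
print(infer_variable_type(asthma_answers))
print(infer_variable_type(ages))
```

Under this heuristic a numerically coded categorical variable (for example, 0 and 1 for “No” and “Yes”) would be classified as continuous, which is one reason to check a variable's type in PIC-SURE before building a query on it.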