Appendix 1: BioData Catalyst Identifiers - dbGaP, TOPMed, and PIC-SURE

Table of BioData Catalyst dbGaP/TOPMed Identifiers

Patient ID

This is the HPDS Patient num, the internal identifier used by PIC-SURE HPDS.

TOPMed / Parent Study Accession with Subject ID

  • These are the identifiers used by each team in the consortium to link data.

  • Values must follow the mask <STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID>. Eg: phs000007.v30_XXXXXXX
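The mask above can be checked mechanically. The following is a minimal sketch, assuming the accession always starts with "phs" followed by digits and the version is "v" followed by digits, as in the example. The function name parse_accession_subject_id and the regular expression are illustrative and not part of PIC-SURE.

```python
import re

# Assumed shape, taken from the example "phs000007.v30_XXXXXXX":
#   <STUDY_ACCESSION_NUMBER> = "phs" + digits
#   <VERSION>                = "v" + digits
#   <SUBJECT_ID>             = everything after the first underscore
ACCESSION_SUBJECT_RE = re.compile(r"^(phs\d+)\.(v\d+)_(.+)$")


def parse_accession_subject_id(value):
    """Split an identifier into (study accession, version, subject ID)."""
    match = ACCESSION_SUBJECT_RE.match(value)
    if match is None:
        raise ValueError(f"identifier does not follow the mask: {value!r}")
    return match.groups()
```

For example, parse_accession_subject_id("phs000007.v30_XXXXXXX") splits the value into its accession, version, and subject ID parts, while a value without a version segment raises ValueError.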

DBGAP_SUBJECT_ID

  • This is a generated id that is unique to each patient in a study.

  • Controlled by dbGaP.

  • It is not unique across unrelated studies. However, patients can be linked across studies; see SOURCE_SUBJECT_ID.

  • A patient is assigned the same dbGaP subject ID across related studies. For dbGaP to assign the same ID, the submission must include the two variables SUBJECT_SOURCE and SOURCE_SUBJECT_ID.

  • This identifier is used in all the phenotypic data files and is what we sequence to an HPDS Patient Num (Patient ID). All sequenced identifiers are recorded in a PatientMapping file, which is stored in S3. These mappings allow HPDS data to be correlated back to the raw data sets.
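The sequencing step described in the last bullet can be pictured as assigning the next free integer to each new source identifier and keeping the pairs for later lookup. This is only a sketch of that idea: the real PatientMapping file layout and loader are not described here, and the function name sequence_patient_nums and the starting value are assumptions.

```python
def sequence_patient_nums(source_ids, start=1):
    """Assign a sequential patient number to each distinct source ID.

    Returns a dict of source ID -> patient number. Repeated source IDs
    keep the number they were given the first time they were seen.
    """
    mapping = {}
    next_num = start
    for source_id in source_ids:
        if source_id not in mapping:
            mapping[source_id] = next_num
            next_num += 1
    return mapping


def invert_mapping(mapping):
    """Build the reverse lookup: patient number -> source ID."""
    return {patient_num: source_id for source_id, patient_num in mapping.items()}
```

Keeping both directions available is what lets an HPDS patient number be traced back to the identifier used in the raw data files.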

SUBJECT_ID

  • This is a generated id that is unique to each patient in a study.

  • Controlled by the submitter of a study.

  • For FHS, phs000007 uses shareid in place of SUBJECT_ID, while phs000974 uses SUBJECT_ID. The values in these two columns are the same, however.

SHARE_ID

  • For FHS phs000007 this was used instead of SUBJECT_ID, but not for FHS phs000974

SOURCE_SUBJECT_ID

  • This is used internally by dbGaP in conjunction with SUBJECT_SOURCE to allow submitters to associate subjects across studies.
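The idea of associating subjects through the (SUBJECT_SOURCE, SOURCE_SUBJECT_ID) pair can be illustrated with a small grouping sketch. dbGaP's actual matching logic is not documented here; the record layout, the "study" key, and the function name group_linked_subjects are all hypothetical.

```python
from collections import defaultdict


def group_linked_subjects(records):
    """Group subject records that share SUBJECT_SOURCE and SOURCE_SUBJECT_ID.

    Each record is a dict with the hypothetical keys "study", "SUBJECT_ID",
    "SUBJECT_SOURCE" and "SOURCE_SUBJECT_ID". Only groups that span more
    than one study are returned.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["SUBJECT_SOURCE"], rec["SOURCE_SUBJECT_ID"])
        groups[key].append((rec["study"], rec["SUBJECT_ID"]))
    return {
        key: members
        for key, members in groups.items()
        if len({study for study, _ in members}) > 1
    }
```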

SAMPLE_ID

  • De-identified sample identifier.

  • These are the IDs that link to the molecular data in dbGaP (VCFs, etc.).

Table of PIC-SURE Identifiers

\_Topmed Study Accession with Subject ID\

Generated identifier for TOPMed studies. These identifiers are a concatenation of the study accession name and the “SUBJECT_ID” value from the study’s subject multi file.

<STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID>

Eg: phs000974.v3_XXXXXXX

\_Parent Study Accession with Subject ID\

Generated identifier for parent studies. In most studies this follows the same pattern as the TOPMed Study Accession with Subject ID.

However, Framingham’s parent study phs000007 does not contain a SUBJECT_ID column; the SHAREID column is used instead.

Eg: phs000007.v3_XXXXXXX
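As a rough illustration of how both the TOPMed and parent study identifiers could be assembled from a row of a subject multi file, the sketch below prefers SUBJECT_ID and falls back to SHAREID when that column is absent. The function name build_accession_subject_id, the dict-based row, and the case-insensitive column lookup are assumptions, not the actual PIC-SURE loader.

```python
def build_accession_subject_id(accession, version, row):
    """Return "<accession>.<version>_<subject id>" for one subject row.

    `row` maps column names to values. SUBJECT_ID is used when present;
    otherwise SHAREID is used, as for Framingham's phs000007.
    """
    # Normalise column names so "shareid" and "SHAREID" both match.
    columns = {name.upper(): value for name, value in row.items()}
    subject_id = columns.get("SUBJECT_ID")
    if subject_id is None:
        subject_id = columns.get("SHAREID")
    if subject_id is None:
        raise KeyError("row has neither a SUBJECT_ID nor a SHAREID column")
    return f"{accession}.{version}_{subject_id}"
```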

\_VCF Sample Id\

This variable is stored in the sample multi file in each dbGaP study.

This is the TOPMed DNA sample identifier. It gives each sample/sequence a unique identifier across TOPMed studies.

Eg: NWD123456

Patient ID (not a concept path but exists in data exports)

This is PIC-SURE’s internal identifier. It is commonly referred to as the HPDS Patient num.

This identifier is generated and assigned to subjects when they are loaded. It is not meant for data correlation between different data sources.
