The BioData Catalyst ecosystem hosts several datasets from the NIH NHLBI Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC). To access the BioLINCC studies, you must request access through dbGaP even if you have authorization from BioLINCC.
BioData Catalyst Powered by PIC-SURE has integrated clinical and genomic data from a variety of heart, lung, blood, and sleep related datasets. These include NHLBI Trans-Omics for Precision Medicine (TOPMed) and TOPMed related studies, BioLINCC datasets, and COVID-19 datasets.
View a summary of the data you have access to by viewing the Data Access Table.
This table displays information about the study and associated data, including the full and abbreviated name of the study, study design and focus, the number of clinical variables, participants, and samples sequenced, additional information with helpful links, consent group information, and the dbGaP accession number (or phs number). The Access column of the table shows which studies you are authorized to access. For information from dbGaP on submitting a data access request, refer to the Tips for Preparing a Successful Data Access Request documentation. Note that studies with a sickle cell disease focus contain links to the Cure SCi Metadata Catalog for additional information.
You can also check the data you have access to by going to the BioData Catalyst Data Access page on the BioData Catalyst website and clicking Check My Access.
The BioData Catalyst ecosystem hosts several datasets from the NIH NHLBI Collaborating Network of Networks for Evaluating COVID-19 and Therapeutic Strategies (CONNECTS) program. These COVID-19 related studies follow the guidelines for implementing common data elements (CDEs) and for de-identifying dates, ages, and free text fields. For more information about these efforts, you can view the CDE Manual and De-Identification Guidance documents on the CONNECTS COVID-19 Therapeutic Trial Common Data Elements webpage.
Table of COVID-19 Studies Included in the CONNECTS Program Available in PIC-SURE
| Full Study Name | Abbreviation | Study Accession |
| --- | --- | --- |
| A Multicenter, Adaptive, Randomized Controlled Platform Trial of the Safety and Efficacy of Antithrombotic Strategies in Hospitalized Adults with COVID-19 | ACTIV4a | phs002694 |
| COVID-19 Positive Outpatient Thrombosis Prevention in Adults Aged 40-80 | ACTIV4b | phs002710 |
| Clinical-trial of COVID-19 Convalescent Plasma in Outpatients | C3PO | phs002752 |
How to get started with PIC-SURE and the common endpoints you can use to query any resource registered with PIC-SURE
The PIC-SURE v2 API is a meta-API used to host any number of resources exposed through a unified set of generalized operations.
PIC-SURE Repositories:
PIC-SURE API: This is the repository for version 2+ of the PIC-SURE API.
PIC-SURE Wiki: This is the wiki page for version 2+ of the PIC-SURE API.
BioData Catalyst PIC-SURE: This is the repository for the BioData Catalyst environment of PIC-SURE.
PIC-SURE-ALL-IN-ONE: This is the repository for PIC-SURE-ALL-IN-ONE.
Additional PIC-SURE Links:
Avillachlab-Jenkins Repository: A link to the Avillach Lab Jenkins repository.
Avillachlab-Jenkins Dev Release Control: A repository for Avillach Lab Jenkins development release control.
The following are the collected client libraries for the entire PIC-SURE project.
The PIC-SURE User Interface acts as a visual aid for running normal queries of resources through PIC-SURE.
PIC-SURE User Interface Repositories:
PIC-SURE HPDS UI: The main High Performance Data Store (HPDS) UI repository.
Additional PIC-SURE User Interface Links:
PIC-SURE UI Flow: Links to a google drawing of the PIC-SURE UI flow.
The PSAMA component of the PIC-SURE ecosystem authorizes and authenticates all actions taken within PIC-SURE.
PSAMA Repos:
Additional PSAMA Links:
PSAMA Core Logic: This is where the core of the PSAMA application is stored in GitHub
HPDS is a datastore designed to work with the PIC-SURE meta-API. It grants researchers fast, dependable access to static datasets and the ability to produce statistics-ready dataframes filtered on any variable they choose at any time.
HPDS Repositories:
PIC-SURE HPDS: The main HPDS repository.
PIC-SURE HPDS Python Client: Python client library to run queries against a PIC-SURE HPDS resource.
PIC-SURE HPDS R Client: R client library to run queries against a PIC-SURE HPDS resource.
PIC-SURE HPDS UI: The main HPDS UI repository.
HPDS Annotation: This repository describes steps to prepare and annotate VCF files for loading into HPDS.
BioData Catalyst Powered by PIC-SURE YouTube channel
Introduction to BioData Catalyst Powered by PIC-SURE
Basics: Applying a Variable on a Filter
Basics: Editing a Variable Filter
PIC-SURE Open Access: Interpreting the Results
PIC-SURE Authorized Access: Applying a Genomic Filter
PIC-SURE Authorized Access: Add Variables to Export
PIC-SURE Authorized Access: Select and Package Data Tool
The BioData Catalyst ecosystem hosts several datasets from the NHLBI Trans-Omics for Precision Medicine (TOPMed) program. The PIC-SURE platform has integrated the clinical and genomic data from all studies listed in the Data Access Dashboard. During the ingestion process, PIC-SURE will occasionally ingest the phenotypic data for a TOPMed study before its genomic data.
There are limited amounts of harmonized data available at this time. The TOPMed Data Coordinating Center (DCC) curation team has identified 44 variables that are shared across 17 NHLBI studies and normalized the participant values for these variables.
The 44 harmonized variables available are listed in the table in Appendix 2. For more information on this initiative, you can view the additional documentation from the TOPMed DCC GitHub repository or on the NHLBI Trans-Omics for Precision Medicine website.
If you are authorized to access any dbGaP dataset(s), the Authorized Access tab at the top will be visible. PIC-SURE Authorized Access provides access to complete, participant-level data, in addition to aggregate counts, and access to the Tool Suite.
A. Select Variables Action: Click the Select Variables icon to include variables when retrieving data. Users can select variables individually or at the dataset level.
Individually select variables: You can individually select variables from two locations:
Variable search results: From the search results you can click the data retrieval icon to include the variable in your data retrieval.
Variable modal variable data retrieval: The data retrieval icon next to the variable adds the variable to your data retrieval.
Select from a dataset or group of variables: In the variable modal the data retrieval icon next to the dataset opens a modal to allow you to select variables from the dataset table or group of variables.
B. Data Summary: In addition to the total number of participants in the filtered cohort, the number of variables the user has selected for data retrieval is also displayed.
There are four concept paths that are automatically included with any data export from PIC-SURE Authorized Access. These fields are listed and described below.
Patient ID: Internal PIC-SURE participant identifier. Please note that this field is not linking participants between studies and therefore should not be used for data correlation between different data sources or to the original data files.
Parent Study Accession with Subject ID: PIC-SURE generated identifier for parent studies. These identifiers are a combination of the study accession number and the subject identifier.
Topmed Study Accession with Subject ID: PIC-SURE generated identifier for TOPMed studies. These identifiers are a combination of the study accession number and the subject identifier.
Consents: Field used to determine which groups users are authorized to access from dbGaP. These identifiers are a combination of the study accession number and consent code.
C. Tool Suite: The Tool Suite contains tools that can be used to further explore filtered cohorts of interest. Note that at least one filter must be added to the query before using the Tool Suite.
Select and Package Data: Retrieve participant-level data corresponding to your filters and variable selections. Variables selected for data retrieval can be reviewed and modified. To learn more about the options associated with this tool, please refer to the Select and Package Data section.
Variable Distributions: View the distributions of query variables based on the filtered cohort. Note that there is a limit to the number of variable distributions that can be viewed at a given time. Additionally, neither genomic variables nor variables associated with an any-record-of filter (e.g. entire datasets) will be graphed.
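The Consents field described earlier in this section combines a study accession number with a consent code. As a minimal sketch, the snippet below splits such values and groups the consent codes by study. The `phs000007.c1` layout (accession, a dot, then the consent code) and the function names are assumptions made for illustration; check the values in your own export before relying on them.

```python
import re
from collections import defaultdict

# Assumed layout: "phs" + six digits, a dot, then "c" + consent group number.
_CONSENT_PATTERN = re.compile(r"^(phs\d{6})\.(c\d+)$")


def parse_consent(value):
    """Split a consent value into (study accession, consent code)."""
    match = _CONSENT_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unexpected consent value: {value!r}")
    return match.group(1), match.group(2)


def consents_by_study(values):
    """Group consent codes by study accession, with codes sorted per study."""
    grouped = defaultdict(set)
    for value in values:
        accession, code = parse_consent(value)
        grouped[accession].add(code)
    return {accession: sorted(codes) for accession, codes in grouped.items()}


if __name__ == "__main__":
    print(consents_by_study(["phs000007.c1", "phs000007.c2", "phs000200.c1"]))
```

Grouping by accession gives a quick view of which consent groups of each study appear in an export.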
The Select and Package Data tool is used to select and export participant-level data corresponding to your filters and variable selections. There are several options for selecting and exporting the data, which are shown using this tool.
In the top left corner of the modal, the number of participants and number of variables included in the query is shown. This is used to display the estimated number of data points in the export.
Note: Queries with more than 1,000,000 data points will not be exportable.
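As a rough illustration of the limit above, the sketch below estimates the size of an export before packaging. It assumes that the estimate is simply participants multiplied by variables, which the modal description suggests but does not state outright; the function names are hypothetical.

```python
MAX_EXPORT_DATA_POINTS = 1_000_000  # limit stated in the note above


def estimate_data_points(participants, variables):
    """Assumed estimate: one data point per participant per variable."""
    if participants < 0 or variables < 0:
        raise ValueError("counts must be non-negative")
    return participants * variables


def is_exportable(participants, variables):
    """True when the estimated size does not exceed the export limit."""
    return estimate_data_points(participants, variables) <= MAX_EXPORT_DATA_POINTS


if __name__ == "__main__":
    print(is_exportable(5_000, 150))   # 750,000 estimated data points
    print(is_exportable(20_000, 75))   # 1,500,000 estimated data points
```

If a query is too large, removing variables from the export or adding filters to shrink the cohort reduces the estimate.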
The table below displays a summary of the variables included in the export. Using the Selected column, variables that have been added to the export can be selected or deselected for the final dataframe.
Note: Variables with filters are automatically included in the export.
The Package Data button in the top right corner is used to prepare the data for export once the variable filters and selections have been finalized.
Once this button is clicked, there are several options to complete the export.
To export into a BioData Catalyst analysis workspace, the Export to Seven Bridges or Export to Terra buttons can be used. After you click either of these buttons, a new modal will be displayed with all the information and instructions needed to complete the export. This includes your personalized access token and the query ID associated with the dataframe. Additionally, there is the option to Copy Query ID without accessing Seven Bridges or Terra if you wish to use a different analysis platform.
The Export to Seven Bridges option includes a Go to Seven Bridges button, which will open a new tab to the Public PIC-SURE API Project on BioData Catalyst Powered by Seven Bridges.
The Export to Terra option includes a Go to Terra via R button and a Go to Terra via Python button, which will open the Public PIC-SURE API R Examples workspace and the Public PIC-SURE API Python Examples workspace on BioData Catalyst Powered by Terra, respectively.
In this section, the functionalities of PIC-SURE Authorized Access will be described in the context of a scientific use case. Specifically, let’s say I am interested in investigating some comorbidities of breast cancer in women with BRCA1 and BRCA2 gene variants, such as hypertension and COPD.
I already have been authorized to access the Women’s Health Initiative (WHI) study and am interested in a single cohort: women with breast cancer and variants of the BRCA1 and BRCA2 genes. I want to select hypertension-related variables of interest, check the distributions of some variables, and export all the data to an analysis workspace.
First, let’s apply our variable filters for the WHI study.
Search “breast cancer” in Authorized Access.
Add the WHI study tag to limit search results to variables found within the WHI study.
Click the “Genomic Filtering” button to begin a filter on genomic variants.
Select “BRCA1” and “BRCA2” genes of “High” and “Moderate” severity. Click “Apply genomic filter”.
Now, let’s filter to participants that have and do not have COPD. Similar to before, we will search ‘COPD’. After selecting the variable of interest, we can filter to the desired values before adding the filter to our query. Notice how the total number of participants in our cohort changes.
Search “hypertension”.
Notice how the number of variables changed in the Data Summary box.
Before we Select and Package the data for export, let’s view the distribution of our participants’ ages to see if we have a normal distribution. Open the Variable Distributions tool in the Tool Suite. Here, we can see the distributions of the two added variable filters: breast cancer (‘BREAST’) and COPD (‘F33COPD’).
Open the Select and Package Data tool in the Tool Suite. The variables shown in this table are those which will be available in your data export; you can remove variables as necessary.
Click “Package Data” when you are ready.
Once the data is packaged, you can select either “Export to Seven Bridges” or “Export to Terra”. Copy the personalized user token and query ID to use the PIC-SURE API and export your data to an analysis workspace.
PIC-SURE provides two ways to search: PIC-SURE Open Access and PIC-SURE Authorized Access. PIC-SURE Open Access enables the user to explore aggregate-level data without any dbGaP data authorizations. PIC-SURE Authorized Access allows the user to explore participant-level data and requires authorization to access at least one study through an active dbGaP Data Access Request (DAR).
Table Comparison of PIC-SURE Open and Authorized Access
To obtain access to BioData Catalyst Powered by PIC-SURE, you must have an NIH eRA Commons account. For instructions and to register an account, refer to the .
After you have created an eRA Commons account, you can log in to BioData Catalyst Powered by PIC-SURE by navigating to and selecting to log in with eRA Commons. You will be directed to the NIH website to log in with your eRA Commons credentials. After signing in and accepting the terms of the agreement on the NIH RAS Information Sharing Consent page, allow the BioData Catalyst Powered by Gen3 service to manage your authorization.
Upon login, you will be directed to the Data Access Dashboard. This page provides a summary of PIC-SURE Authorized Access, PIC-SURE Open Access, and the studies you are authorized to access.
Search bar: Enter any phenotypic variable, study or table keyword into the search bar to search across studies. Users can also search specific variables by accession number, if known (phs/pht/phv).
Study Tags: Users can filter the results found through their search by limiting to studies of interest or excluding studies.
Variable Tags: Users can filter the results found through their search by limiting to keywords of interest or excluding keywords that are out of scope. For example, a user could filter to categorical variables, variables containing the term ‘blood’, and/or exclude variables containing the term ‘pressure’.
How are variable tags generated? Each variable has a set of associated tags, which are generated during the PIC-SURE data loading process. These tags are generated based on information associated with the variable, including the name of the study, study description, dataset name, PIC-SURE data type (continuous or categorical), and variable description. For a search in PIC-SURE, tags associated with a variable are displayed. Note that tags applicable to less than 5% or more than 95% of the search results are not displayed since these are not useful for filtering results.
Search Results table: View all variables associated with your search term and/or study & variable tags.
Results Panel: Panel with content boxes that describe the cohort based on the variable filters applied to the query.
Data Summary: Displays the total number of participants in the filtered cohort which meet the query criteria. When first opening the Open or Authorized Access page, the number will be the total number of participants that you can access.
Added Variable Filters summary: View all filters which have been applied to the cohort.
Filter Action: Click on the filter icon to filter cohort participants by specific variable values.
Reset button: Allows users to start a new search and query by removing all added filters and clearing all active study and variable tags.
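The tag display rule described under "How are variable tags generated?" can be sketched as follows. This is an illustration of the stated 5% and 95% thresholds only, not the actual PIC-SURE implementation; the function name and the input shape (one set of tags per variable in the search results) are assumptions.

```python
from collections import Counter


def displayed_tags(results, low=0.05, high=0.95):
    """Return the tags worth showing for a set of search results.

    `results` holds one collection of tags per variable in the results.
    Tags applying to less than `low` or more than `high` of the results
    are dropped, since they do little to narrow the search.
    """
    if not results:
        return []
    counts = Counter(tag for tags in results for tag in set(tags))
    total = len(results)
    return sorted(tag for tag, n in counts.items() if low <= n / total <= high)


if __name__ == "__main__":
    results = (
        [{"blood", "pressure"}] * 20   # 'pressure' applies to half of the results
        + [{"blood"}] * 19             # 'blood' applies to every result
        + [{"blood", "rare"}]          # 'rare' applies to 1 of 40 results
    )
    print(displayed_tags(results))
```

In this example only 'pressure' is kept: 'blood' applies to every result and 'rare' to 2.5% of them, so neither helps narrow the search.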
Filter to participants with breast cancer by clicking the filter icon next to the variable of interest. Select values to filter your variable on and click “Add Filter to Query”.
Add variables to data export by clicking the select variables icon in the Actions column next to the variable of interest. The icon next to variables selected for export will change to the checkmark icon.
| Full Study Name | Abbreviation | Study Accession |
| --- | --- | --- |
| Atherosclerosis Risk in Communities Study | ARIC | phs000280 |
| Cardiovascular Health Study | CHS | phs000287 |
| Cleveland Family Study | CFS | phs000284 |
| Coronary Artery Risk Development in Young Adults Study | CARDIA | phs000285 |
| Epidemiology of Asthma in Costa Rica Study | CRA | phs000988 |
| Framingham Heart Study | FHS | phs000007 |
| Genetic Epidemiology Network of Arteriopathy | GENOA | phs001238 |
| Genetic Epidemiology of COPD | COPDGene | phs000179 |
| Genetics of Cardiometabolic Health in Amish | AMISH | phs000956 |
| Genome-Wide Association Study of Venous Thrombosis Study | MAYOVTE | phs000289 |
| Heart and Vascular Health Study | HVH | phs001013 |
| Hispanic Community Health Study - Study of Latinos | HCHS-SOL | phs000810 |
| Jackson Heart Study | JHS | phs000286 |
| Multi-Ethnic Study of Atherosclerosis | MESA | phs000209 |
| Study of Adiposity in Samoans | SAS | phs000914 |
| Women’s Health Initiative | WHI | phs000200 |
| Feature | PIC-SURE Open Access | PIC-SURE Authorized Access |
| --- | --- | --- |
| Removed stigmatizing variables | ✓ | |
| Data obfuscation | ✓ | |
| dbGaP approval to access required | | ✓ |
| Access to aggregate counts | ✓ | ✓ |
| Access to participant-level data | | ✓ |
| Phenotypic variable search | ✓ | ✓ |
| Phenotypic variable filtering | ✓ | ✓ |
| Genomic variable filtering | | ✓ |
| Data retrieval | | ✓ |
| Visualizations | | ✓ |
| Variable name | Description | Data type | Measurement units | Controlled vocabulary | Encoded values |
| --- | --- | --- | --- | --- | --- |
| cac_volume_1 | Coronary artery calcium volume using CT scan(s) of coronary arteries | decimal | cubic millimeters | UMLS | |
| cac_score_1 | Coronary artery calcification (CAC) score using Agatston scoring of CT scan(s) of coronary arteries | decimal | | UMLS | |
| cimt_1 | Common carotid intima-media thickness, calculated as the mean of two values: mean of multiple thickness estimates from the left far wall and from the right far wall. | decimal | mm | UMLS | |
| cimt_2 | Common carotid intima-media thickness, calculated as the mean of four values: maximum of multiple thickness estimates from the left far wall, left near wall, right far wall, and right near wall. | decimal | mm | UMLS | |
| carotid_stenosis_1 | Extent of narrowing of the carotid artery. | encoded | | UMLS | 0=None; 1=1%-24%; 2=25%-49%; 3=50%-74%; 4=75%-99%; 5=100% |
| carotid_plaque_1 | Presence or absence of carotid plaque. | encoded | | UMLS | 0=Plaque not present; 1=Plaque present |
| height_baseline_1 | Body height at baseline. | decimal | cm | UMLS | |
| current_smoker_baseline_1 | Indicates whether subject currently smokes cigarettes. | encoded | | UMLS | 0=Does not currently smoke cigarettes; 1=Currently smokes cigarettes |
| weight_baseline_1 | Body weight at baseline. | decimal | kg | UMLS | |
| ever_smoker_baseline_1 | Indicates whether subject ever regularly smoked cigarettes. | encoded | | UMLS | 0=Never a cigarette smoker; 1=Current or former cigarette smoker |
| bmi_baseline_1 | Body mass index calculated at baseline. | decimal | kg/m^2 | UMLS | |
| hemoglobin_mcnc_bld_1 | Measurement of mass per volume, or mass concentration (mcnc), of hemoglobin in the blood (bld). | decimal | g / dL = grams per deciliter | UMLS | |
| hematocrit_vfr_bld_1 | Measurement of hematocrit, the fraction of volume (vfr) of blood (bld) that is composed of red blood cells. | decimal | % = percentage | UMLS | |
| rbc_ncnc_bld_1 | Count by volume, or number concentration (ncnc), of red blood cells in the blood (bld). | decimal | millions / microliter | UMLS | |
| wbc_ncnc_bld_1 | Count by volume, or number concentration (ncnc), of white blood cells in the blood (bld). | decimal | thousands / microliter | UMLS | |
| basophil_ncnc_bld_1 | Count by volume, or number concentration (ncnc), of basophils in the blood (bld). | decimal | thousands / microliter | UMLS | |
| eosinophil_ncnc_bld_1 | Count by volume, or number concentration (ncnc), of eosinophils in the blood (bld). | decimal | thousands / microliter | UMLS | |
| neutrophil_ncnc_bld_1 | Count by volume, or number concentration (ncnc), of neutrophils in the blood (bld). | decimal | thousands / microliter | UMLS | |
| lymphocyte_ncnc_bld_1 | Count by volume, or number concentration (ncnc), of lymphocytes in the blood (bld). | decimal | thousands / microliter | UMLS | |
| monocyte_ncnc_bld_1 | Count by volume, or number concentration (ncnc), of monocytes in the blood (bld). | decimal | thousands / microliter | UMLS | |
| platelet_ncnc_bld_1 | Count by volume, or number concentration (ncnc), of platelets in the blood (bld). | integer | thousands / microliter | UMLS | |
| mch_entmass_rbc_1 | Measurement of the average mass (entmass) of hemoglobin per red blood cell (rbc), known as mean corpuscular hemoglobin (MCH). | decimal | pg = picogram | UMLS | |
| mchc_mcnc_rbc_1 | Measurement of the mass concentration (mcnc) of hemoglobin in a given volume of packed red blood cells (rbc), known as mean corpuscular hemoglobin concentration (MCHC). | decimal | g / dL = grams per deciliter | UMLS | |
| mcv_entvol_rbc_1 | Measurement of the average volume (entvol) of red blood cells (rbc), known as mean corpuscular volume (MCV). | decimal | fL = femtoliter | UMLS | |
| pmv_entvol_bld_1 | Measurement of the mean volume (entvol) of platelets in the blood (bld), known as mean platelet volume (MPV or PMV). | decimal | fL = femtoliter | UMLS | |
| rdw_ratio_rbc_1 | Measurement of the ratio of variation in width to the mean width of the red blood cell (rbc) volume distribution curve taken at +/- 1 CV, known as red cell distribution width (RDW). | decimal | % = percentage | UMLS | |
| bp_systolic_1 | Resting systolic blood pressure from the upper arm in a clinical setting. | decimal | mmHg | UMLS | |
| bp_diastolic_1 | Resting diastolic blood pressure from the upper arm in a clinical setting. | decimal | mmHg | UMLS | |
| antihypertensive_meds_1 | Indicator for use of antihypertensive medication at the time of blood pressure measurement. | encoded | | UMLS | 0=Not taking antihypertensive medication; 1=Taking antihypertensive medication |
| race_1 | Harmonized race category of participant. | encoded | | UMLS | AI_AN=American Indian_Alaskan Native or Native American; Asian=Asian; Black=Black or African American; HI_PI=Native Hawaiian or other Pacific Islander; Multiple=More than one race; Other=Other race; White=White or Caucasian |
| ethnicity_1 | Indicator of Hispanic or Latino ethnicity. | encoded | | UMLS | both=ethnicity component dbGaP variable values for a subject were inconsistent/contradictory (e.g. over multiple visits); HL=Hispanic or Latino; notHL=not Hispanic or Latino |
| hispanic_subgroup_1 | Classification of Hispanic/Latino background for Hispanic/Latino subjects where country or region of origin information is available. | encoded | | UMLS | CentralAmerican=Central American; CostaRican=from Costa Rica; Cuban=Cuban; Dominican=Dominican; Mexican=Mexican; PuertoRican=Puerto Rican; SouthAmerican=South American |
| annotated_sex_1 | Subject sex, as recorded by the study. | encoded | | UMLS | female=Female; male=Male |
| geographic_site_1 | Recruitment/field center, baseline clinic, or geographic region. | encoded | | UMLS | |
| subcohort_1 | A distinct subgroup within a study, generally indicating subjects who share similar characteristics due to study design. Subjects may belong to only one subcohort. | encoded | | UMLS | |
| lipid_lowering_medication_1 | Indicates whether participant was taking any lipid-lowering medication at blood draw to measure lipids phenotypes. | encoded | | UMLS | 0=Participant was not taking lipid-lowering medication; 1=Participant was taking lipid-lowering medication. |
| fasting_lipids_1 | Indicates whether participant fasted for at least eight hours prior to blood draw to measure lipids phenotypes. | encoded | | UMLS | 0=Participant did not fast_or fasted for fewer than eight hours prior to measurement of lipids phenotypes.; 1=Participant fasted for at least eight hours prior to measurement of lipids phenotypes. |
| total_cholesterol_1 | Blood mass concentration of total cholesterol | decimal | mg/dL | UMLS | |
| triglycerides_1 | Blood mass concentration of triglycerides | decimal | mg/dL | UMLS | |
| hdl_1 | Blood mass concentration of high-density lipoprotein cholesterol | decimal | mg/dL | UMLS | |
| ldl_1 | Blood mass concentration of low-density lipoprotein cholesterol | decimal | mg/dL | UMLS | |
| vte_prior_history_1 | An indicator of whether a subject had a venous thromboembolism (VTE) event prior to the start of the medical review process (including self-reported events). | encoded | | UMLS | 0=did not have prior VTE event; 1=had prior VTE event |
| vte_case_status_1 | An indicator of whether a subject experienced a venous thromboembolism event (VTE) that was verified by adjudication or by medical professionals. | encoded | | UMLS | 0=Not known to ever have a VTE event_either self-reported or from medical records; 1=Experienced a VTE event as verified by adjudication or by medical professionals |
| age_at_* | For each phenotypic value for a given subject, an associated age at measurement is provided. | decimal | years | | |
| unit_* | For each harmonized variable, a paired “unit_variable” is provided, whose value indicates where in the documentation to look to find the set of component variables and the algorithm used to harmonize those variables. | encoded | | | |
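The age_at_* and unit_* entries mean that each harmonized variable comes with two companion variables. The sketch below derives the companion names for a given harmonized variable. It assumes the companions are formed by prefixing the full variable name (for example, age_at_bmi_baseline_1), which follows the pattern in the table but should be confirmed against the TOPMed DCC documentation; the function name is hypothetical.

```python
def paired_variable_names(harmonized_name):
    """Return the assumed companion variable names for a harmonized variable."""
    name = harmonized_name.strip()
    if not name:
        raise ValueError("harmonized_name must not be empty")
    return {
        "value": name,              # the harmonized measurement itself
        "age": f"age_at_{name}",    # age at measurement, in years
        "unit": f"unit_{name}",     # pointer to the harmonization documentation
    }


if __name__ == "__main__":
    for role, variable in paired_variable_names("bmi_baseline_1").items():
        print(role, variable)
```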
Once you have refined your queries and created a cohort of interest, you can begin analyzing data using other components of the BioData Catalyst ecosystem.
Databases exposed through the PIC-SURE API vary widely in their underlying architecture and data organization. PIC-SURE hides this complexity and exposes the different databases in the same format, allowing researchers to focus on analysis and medical insights and making reproducible science easier. The API is available in two programming languages, Python and R, so investigators can query databases in the same way using either language. The PIC-SURE API tutorial notebooks can be directly accessed on GitHub.
To access the PIC-SURE API, a user-specific token is needed. This is the way the API grants access to individual users to protected-access data. The user token is strictly personal; do not share it with anyone. You can copy your personalized access token by selecting the User Profile tab at the top of the screen.
Here, you can Copy your personalized access token, Reveal your token, and Refresh your token to retrieve a new token and deactivate the old token.
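Because the token is strictly personal, it is safer to keep it out of notebooks and scripts that may be shared. The sketch below reads the token from an environment variable and masks it for logging. The variable name PICSURE_TOKEN and both helper functions are hypothetical and are not part of the PIC-SURE client libraries.

```python
import os


def load_token(env_var="PICSURE_TOKEN"):
    """Read the access token from an environment variable (name is assumed)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your PIC-SURE token"
        )
    return token


def mask_token(token, visible=4):
    """Hide all but the last few characters so the token can be logged safely."""
    if len(token) <= visible:
        return "*" * len(token)
    return "*" * (len(token) - visible) + token[-visible:]


if __name__ == "__main__":
    os.environ.setdefault("PICSURE_TOKEN", "example-token-value")  # demo only
    print(mask_token(load_token()))
```

If the token is exposed, use Refresh on the User Profile tab to deactivate it and retrieve a new one.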
The PIC-SURE API can be accessed via tutorial notebooks on either BioData Catalyst Powered by Seven Bridges or Powered by Terra.
To launch one of the analysis platforms, go to the BioData Catalyst website. From the Resources menu, select Services. A list of platforms and services on the BioData Catalyst ecosystem will be displayed.
From the Analyze Data in Cloud-based Shared Workspaces section, select Launch for your preferred analysis platform.
Jupyter notebook examples in R and python can be found under the Public projects tab by selecting PIC-SURE API.
From the Data Studio tab, select an example that fits your research needs. Here, we will select PIC-SURE JupyterLab examples.
This will take you to the PIC-SURE API analysis workspace, where you can view the examples in python. Copy this workspace to your own project to edit or run the code yourself.
Note: The project must have network access to run the PIC-SURE examples on Seven Bridges. To ensure this, go to the Settings tab and select “Allow network access”.
To access the Jupyter notebook examples in R and python for the PIC-SURE API, select View Workspaces from the Terra landing page.
Select the Public tab and search for “PIC-SURE”. Workspaces for both the python and R examples will be displayed. You must clone the workspaces to edit or run the code within them.
PIC-SURE: Patient Information Commons Standard Unification of Research Elements
The Patient Information Commons: Standard Unification of Research Elements (PIC-SURE) integrates clinical and genomic data to allow users to search, query, and export data at the variable and variant levels. This allows users to create analysis-ready data frames without manually mapping and merging files.
BioData Catalyst Powered by PIC-SURE functions as part of the BioData Catalyst ecosystem, allowing researchers to explore studies funded by the National Heart, Lung, and Blood Institute (NHLBI), whether they have been granted access to the participant level data or not.
PIC-SURE integrates clinical and genomic datasets across BioData Catalyst, including TOPMed and TOPMed related studies, COVID-19 studies, and BioLINCC studies. Each variable is organized as a concept path that contains information about the study, variable group, and variable. Though the specifics of the concept paths are dependent on the type of study, the overall information included is the same.
For more information about additional dbGaP, TOPMed, and PIC-SURE concept paths, refer to Appendix 1.
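To make the idea of a concept path concrete, the sketch below splits a path into its study, intermediate groups, and variable. It assumes that the segments are separated by backslashes and that the first segment identifies the study and the last the variable. The example path is made up for illustration; real paths differ by study type, as noted above.

```python
def split_concept_path(path):
    """Split a backslash-delimited concept path into labelled parts."""
    parts = [part for part in path.split("\\") if part]
    if len(parts) < 2:
        raise ValueError(f"expected at least a study and a variable: {path!r}")
    return {
        "study": parts[0],
        "groups": parts[1:-1],   # any intermediate levels, in order
        "variable": parts[-1],
    }


if __name__ == "__main__":
    # Hypothetical path; the accession and variable name are placeholders.
    example = "\\phs000000\\example_table\\EXAMPLE_VAR\\"
    print(split_concept_path(example))
```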
Table of Data Fields in PIC-SURE
Note that there are two data types in PIC-SURE: categorical and continuous data. Categorical variables refer to any variables that have categorized values. For example, “Have you ever had asthma?” with values “Yes” and “No” is a categorical variable. Continuous variables refer to any variables that have a numeric range of values. For example, “Age” with a value range from 10 to 90 is a continuous variable. The internal PIC-SURE data load process determines the type of each variable based on the data.
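The exact rule used by the PIC-SURE data load process is not described here. As one plausible heuristic, offered only as an assumption, a variable can be treated as continuous when every observed value parses as a number and as categorical otherwise:

```python
def infer_data_type(values):
    """Classify a variable as 'continuous' or 'categorical'.

    Heuristic (an assumption, not the documented PIC-SURE rule): a variable
    is continuous only if every non-missing value parses as a number.
    """
    observed = [v for v in values if v is not None and str(v).strip() != ""]
    if not observed:
        return "categorical"  # arbitrary choice when nothing was observed
    for value in observed:
        try:
            float(value)
        except (TypeError, ValueError):
            return "categorical"
    return "continuous"


if __name__ == "__main__":
    print(infer_data_type(["Yes", "No", "Yes"]))   # asthma-style answers
    print(infer_data_type([10, 45.5, "90"]))       # age-style values
```

Under this heuristic, a numeric column with a single text entry such as "unknown" would be classified as categorical.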
PIC-SURE Open Access allows you to search any clinical variable available in PIC-SURE. Your queries will return obfuscated aggregate counts per study and consent. There are some features specific to PIC-SURE Open Access, which are outlined below.
A. Stigmatizing Variables Removal: PIC-SURE Open Access data excludes clinical variables that contain potentially sensitive information. These variables are known as stigmatizing variables, which fall into the following categories:
Mental health diagnoses, history, and treatment
Illicit drug use history
Sexually transmitted disease diagnoses, history, and treatment
Sexual history
Intellectual achievement, ability, and educational attainment
Direct or surrogate identifiers of legal status
For more information about stigmatizing variables and the identification process, please refer to the documentation and code on the BioData Catalyst Powered by PIC-SURE Stigmatizing Variables GitHub repository.
B. Data Obfuscation: Because participant-level data are not available in PIC-SURE Open Access, the aggregate counts are obfuscated to further anonymize the data. This means that:
If the number of participants for the consent group, study, and/or query total is between one and nine, the result will be shown as < 10.
If a consent group result is between one and nine and the study and/or query total is greater than 10, the study and/or total result will be obfuscated by ± 3.
Query results that are zero participants will display 0.
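The obfuscation rules above can be sketched as a small function. This is an illustration of the stated rules, not the PIC-SURE implementation. It assumes that "obfuscated by ± 3" means adding a random whole-number offset between -3 and 3, and the function and parameter names are hypothetical.

```python
import random


def obfuscate_count(count, small_consent_group=False, rng=None):
    """Return the display string for an aggregate participant count.

    count: the true count for a consent group, study, or query total.
    small_consent_group: True when a consent group contributing to this
        study or total count has between one and nine participants.
    rng: optional random.Random instance, for reproducible offsets.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return "0"
    if count <= 9:
        return "< 10"
    if small_consent_group:
        if rng is None:
            rng = random.Random()
        # Assumed noise model: uniform integer offset in [-3, 3].
        return str(count + rng.randint(-3, 3))
    return str(count)


if __name__ == "__main__":
    print(obfuscate_count(0))
    print(obfuscate_count(7))
    print(obfuscate_count(250, small_consent_group=True))
```

The offset keeps a small consent group count from being recovered by subtracting the other groups from the study total.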
C. View Filtered Results by Study: The filtered number of participants which match the query criteria is shown broken down by study and consent group. Users can see if they do or do not have access to specific studies.
In this section, the functionalities of PIC-SURE Open Access will be described in the context of a scientific use case. Specifically, let’s say I am interested in investigating asthma in relation to obesity in adults.
I’m interested in two cohorts: obese adults with a body mass index (BMI) greater than 30 and healthy adults with a BMI between 18.5 and 24.9. However, I have not yet submitted a Data Access Request and therefore am not authorized to access any datasets.
First, let’s explore cohort A: Healthy adults with a BMI between 18.5 and 24.9 in Framingham Heart Study (FHS).
Search for ‘age’.
Apply ‘FHS’ study tag to view only ‘age’ variables within the Framingham Heart Study (phs000007).
Select the variable of interest. You may notice many variables that appear similar. These variables may be located in different datasets, or tables, but contain similar information. Open up the variable information modal by clicking on the row containing the variable of interest to learn more.
Now, let’s filter to healthy adults with a BMI between 18.5 and 24.9. Similar to before, we will search ‘BMI’. We can narrow down the search results using the variable-level tags by including terms related to our variable of interest (such as ‘continuous’ to view only continuous variables) and excluding out-of-scope terms (such as ‘allergy’). After selecting the variable of interest, we can filter to the desired ranges before adding the filter to our query. Notice how the total number of participants in our cohort changes.
Finally, we will filter for participants who have asthma.
Note the total participant count in the Data Summary.
We can easily modify our filters to explore cohort B: Obese adults with a body mass index (BMI) greater than 30 in Framingham Heart Study.
Note the total participant count in the Data Summary.
We can easily repeat these steps for other studies, such as the Genetic Epidemiology of COPD (COPDGene) study, and create a table like the one below. By comparing these two studies, I can see that COPDGene may be more promising for my research since it contains many more participants in my cohorts of interest than FHS does.
I can then use the Request Access button to go directly to the study’s dbGaP page and begin submitting a DAR.
Filter to adults only by clicking the filter icon next to the variable. I am interested in adults, so I will set the minimum age to 18, then click “Add filter to query”.
Edit the BMI filter by clicking the edit icon in the Added Variable Filters section. Change the range to have a minimum of 30 and no maximum.
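To make the filter logic of this use case concrete, here is a minimal sketch in plain Python. It does not use the PIC-SURE interface or API; the `Participant` class, the helper names, and the five toy records are all hypothetical and exist only to show how the age, BMI, and asthma filters combine. As in the step above, the BMI minimum is treated as inclusive and a missing maximum means no upper bound.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Participant:
    age: float
    bmi: float
    has_asthma: bool


def in_range(value: float, minimum: Optional[float] = None,
             maximum: Optional[float] = None) -> bool:
    """True if value lies within the optional inclusive bounds."""
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True


def count_cohort(participants: Iterable[Participant], bmi_min: float,
                 bmi_max: Optional[float] = None) -> int:
    """Count adults (age >= 18) with asthma whose BMI is in the given range."""
    return sum(
        1 for p in participants
        if in_range(p.age, minimum=18)
        and in_range(p.bmi, bmi_min, bmi_max)
        and p.has_asthma
    )


# Invented toy records, for illustration only.
toy = [
    Participant(34, 22.0, True),
    Participant(51, 31.5, True),
    Participant(16, 21.0, True),   # excluded: under 18
    Participant(45, 23.0, False),  # excluded: no asthma
    Participant(60, 35.2, True),
]

cohort_a = count_cohort(toy, 18.5, 24.9)  # healthy BMI range
cohort_b = count_cohort(toy, 30)          # minimum of 30, no maximum
```

Switching from cohort A to cohort B only changes the BMI bounds, which mirrors editing the existing BMI filter rather than rebuilding the query.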
Patient ID
This is the HPDS Patient num. This is PIC-SURE HPDS’s internal Identifier.
Topmed / Parent Study Accession with Subject ID
These are the identifiers used by each team in the consortium to link data.
Values must follow the mask <STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID>. Eg: phs000007.v30_XXXXXXX
DBGAP_SUBJECT_ID
This is a generated id that is unique to each patient in a study.
Controlled by dbGaP.
It is not unique across unrelated studies. However, patients can be linked across studies. See SOURCE_SUBJECT_ID.
A patient will be assigned the same dbGaP subject ID across related studies. For dbGaP to assign the same dbGaP subject ID, include the two variables SUBJECT_SOURCE and SOURCE_SUBJECT_ID.
This identifier is used in all the phenotypic data files and is what we sequence to an HPDS Patient Num (Patient ID). All sequenced identifiers are stored in a PatientMapping file in S3. These mappings allow HPDS data to be correlated back to the raw data sets.
SUBJECT_ID
This is a generated id that is unique to each patient in a study.
Controlled by the submitter of a study.
For FHS, phs000007 uses shareid in place of SUBJECT_ID, while phs000974 uses SUBJECT_ID. The values in these two columns are the same, however.
SHARE_ID
For FHS phs000007 this was used instead of SUBJECT_ID, but not for FHS phs000974
SOURCE_SUBJECT_ID
This is used internally by dbGaP in conjunction with SUBJECT_SOURCE to allow submitters to associate subjects across studies.
SAMPLE_ID
De-identified sample identifier.
These are the IDs that link to the molecular data in dbGaP (VCFs, etc.).
\_Topmed Study Accession with Subject ID\
Generated identifier for TOPMed Studies. These identifiers are a concatenation using the accession name and “SUBJECT_ID” from a study’s subject multi file.
<STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID>
Eg: phs000974.v3_XXXXXXX
\_Parent Study Accession with Subject ID\
Generated identifier for parent studies. In most studies this follows the same pattern as the TOPMed Study Accession with Subject ID.
However, Framingham’s parent study phs000007 does not contain a SUBJECT_ID column; the SHAREID column is used instead.
Eg: phs000007.v3_XXXXXXX
\_VCF Sample Id\
This variable is stored in the sample multi file in each dbGaP study.
This is the TOPMed DNA sample identifier. This is used to give each sample/sequence a unique identifier across TOPMed studies.
Eg: NWD123456
Patient ID (not a concept path but exists in data exports)
This is PIC-SURE’s internal Identifier. It is commonly referred to as HPDS Patient num.
This identifier is generated and assigned to subjects when they are loaded. It is not meant for data correlation between different data sources.
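The accession-with-subject-ID mask described above, <STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID>, can be built and parsed with a short helper. The sketch below is illustrative only: the function and class names are hypothetical, and the pattern assumes that the accession is "phs" followed by six digits and that the version is "v" followed by digits, as in the examples phs000007.v30_XXXXXXX and phs000974.v3_XXXXXXX.

```python
import re
from typing import NamedTuple


class StudySubjectId(NamedTuple):
    accession: str   # e.g. "phs000974"
    version: str     # e.g. "v3"
    subject_id: str  # SUBJECT_ID (or SHAREID for phs000007)


# Assumed shape: phs + six digits, ".", v + digits, "_", then the subject ID.
_PATTERN = re.compile(r"^(phs\d{6})\.(v\d+)_(.+)$")


def build_study_subject_id(accession: str, version: str, subject_id: str) -> str:
    """Concatenate the parts following the documented mask."""
    return f"{accession}.{version}_{subject_id}"


def parse_study_subject_id(value: str) -> StudySubjectId:
    """Split an identifier into its parts, or raise ValueError if it does not fit."""
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"not an accession-with-subject-ID value: {value!r}")
    return StudySubjectId(*match.groups())
```

Everything after the first underscore that follows the version is kept as the subject ID, so subject IDs that themselves contain underscores are preserved. A value such as the VCF sample ID NWD123456 does not fit the mask and is rejected.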
| | Data in dbGaP format | Data not in dbGaP format |
| --- | --- | --- |
| General organization | Data organized using the format implemented by the database of Genotypes and Phenotypes (dbGaP). Find more information on the dbGaP data structure here. Generally, a given study will have several tables, and those tables have several variables. | Data do not follow dbGaP format; there are no phv or pht accessions. Data are organized in groups of like variables, when available. For example, variables like Age, Gender, and Race could be part of the Demographics variable group. |
| Concept path structure | `\phs\pht\phv\variable name\` | `\phs\variable name` |
| Variable ID | phv corresponding to the variable accession number | Equivalent to variable name |
| Variable name | Encoded variable name that was used by the original submitters of the data | Encoded variable name that was used by the original submitters of the data |
| Variable description | Description of the variable | Description of the variable, as available |
| Dataset ID | pht corresponding to the trait table accession number | Equivalent to dataset name |
| Dataset name | Name of the trait table | Name of a group of like variables, as available |
| Dataset description | Description of the trait table | Description of a group of like variables, as available |
| Study ID | phs corresponding to the study accession number | phs corresponding to the study accession number |
| Study description | Description of the study from dbGaP | Description of the study from dbGaP |
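The two concept path structures, \phs\pht\phv\variable name\ and \phs\variable name, can be told apart by counting their segments. The following is a minimal sketch under that assumption; the function name and the returned keys are hypothetical, and the pht and phv values in the usage example are placeholders rather than real accessions.

```python
from typing import Dict


def parse_concept_path(path: str) -> Dict[str, str]:
    """Split a concept path into its documented components.

    Four segments are read as phs/pht/phv/variable name (dbGaP format);
    two segments are read as phs/variable name (no pht or phv accessions).
    """
    # Drop leading and trailing backslashes, then split on the separator.
    parts = path.strip("\\").split("\\")
    if len(parts) == 4:
        study, dataset, variable_id, name = parts
        return {
            "study_id": study,
            "dataset_id": dataset,
            "variable_id": variable_id,
            "variable_name": name,
        }
    if len(parts) == 2:
        study, name = parts
        # Without a phv accession, the variable ID is equivalent to the name.
        return {"study_id": study, "variable_id": name, "variable_name": name}
    raise ValueError(f"unrecognized concept path: {path!r}")


# Placeholder accessions, for illustration only.
dbgap_style = parse_concept_path("\\phs000007\\pht000000\\phv00000000\\AGE\\")
other_style = parse_concept_path("\\phs000000\\AGE")
```

Paths with any other number of segments raise an error rather than being guessed at, since only these two structures are described here.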
| Study | Cohort A | Cohort B |
| --- | --- | --- |
| Framingham Heart Study (FHS) | 50 +/- 3 | 72 +/- 3 |
| Genetic Epidemiology of COPD (COPDGene) | 488 +/- 3 | 868 |