PIC-SURE provides two ways to search: PIC-SURE Open Access and PIC-SURE Authorized Access. PIC-SURE Open Access enables the user to explore aggregate-level data without any dbGaP data authorizations. PIC-SURE Authorized Access feature allows the user to explore participant-level data and requires authorization to access at least one study through an active dbGaP Data Access Request (DAR).
Table Comparison of PIC-SURE Open and Authorized Access
Removed stigmatizing variables
✓
Data obfuscation
✓
dbGaP approval to access required
✓
Access to aggregate counts
✓
✓
Access to participant-level data
✓
Phenotypic variable search
✓
✓
Phenotypic variable filtering
✓
✓
Genomic variable filtering
✓
Data retrieval
✓
Visualizations
✓
If you are authorized to access any dbGaP dataset(s), the Authorized Access tab at the top will be visible. PIC-SURE Authorized Access provides access to complete, participant-level data, in addition to aggregate counts, and access to the Tool Suite.
A. Select Variables Action: Click the Select Variables icon to include variables when retrieving data. Users can select variables individually or at the dataset level.
Individually select variables: You can individually select variables from two locations:
Variable search results: From the search results you can click the data retrieval icon to include the variable in your data retrieval.
Variable modal variable data retrieval: The data retrieval icon next to the variable adds the variable to your data retrieval.
Select from a dataset or group of variables: In the variable modal the data retrieval icon next to the dataset opens a modal to allow you to select variables from the dataset table or group of variables.
B. Data Summary: In addition to the total number of participants in the filtered cohort, the number of variables the user has selected for data retrieval is also displayed.
There are four concept paths that are automatically included with any data export from PIC-SURE Authorized Access. These fields are listed and described below.
Patient ID: Internal PIC-SURE participant identifier. Please note that this field is not linking participants between studies and therefore should not be used for data correlation between different data sources or to the original data files.
Parent Study Accession with Subject ID: PIC-SURE generated identifier for parent studies. These identifiers are a combination of the study accession number and the subject identifier.
Topmed Study Accession with Subject ID: PIC-SURE generated identifier for TOPMed studies. These identifiers are a combination of the study accession number and the subject identifier.
Consents: Field used to determine which groups users are authorized to access from dbGaP. These identifiers are a combination of the study accession number and consent code.
C. Tool Suite: The Tool Suite contains tools that can be used to further explore filtered cohorts of interest. Note that at least one filter must be added to the query before using the Tool Suite.
Select and Package Data: Retrieve participant-level data corresponding to your filters and variable selections. Variables selected for data retrieval can be reviewed and modified. To learn more about the options associated with this tool, please refer to the Select and Package Data section.
Variable Distributions: View the distributions of query variables based on the filtered cohort. Note that there is a limit to the number of variable distributions that can be viewed at a given time. Additionally, genomic variables nor variables associated with any-record-of filter (e.g. entire datasets) will not be graphed.
The Select and Package Data tool is used to select and export participant-level data corresponding to your filters and variable selections. There are several options for selecting and exporting the data, which are shown using this tool.
In the top left corner of the modal, the number of participants and number of variables included in the query is shown. This is used to display the estimated number of data points in the export.
Note: Queries with more than 1,000,000 data points will not be exportable.
The table below displays a summary of the variables included in the export. Using the Selected column, variables that have been added to the export can be selected or deselected for the final dataframe.
Note: Variables with filters are automatically included in the export.
The Package Data button in the top right corner is used to prepare the data for export once the variable filters and selections have been finalized.
Once this button is clicked, there are several options to complete the export.
To export into a BioData Catalyst analysis workspace, the Export to Seven Bridges or Export to Terra buttons can be used. Once clicking either of these buttons, a new modal will be displayed with all information and instuctions needed to complete the export. This includes your personalized access token, the query ID associated with the dataframe. Additionally, there is the option to Copy Query ID without accessing Seven Bridges or Terra if you wish to use a different analysis platform.
The Export to Seven Bridges option includes a Go to Seven Bridges button, which will open a new tab to the Public PIC-SURE API Project on BioData Catalyst Powered by Seven Bridges.
The Export to Terra option includes a Go to Terra via R button and a Go to Terra via Python button, which will open the Public PIC-SURE API R Examples workspace and the Public PIC-SURE API Python Examples workspace on BioData Catalyst Powered by Terra, respectively.
In this section, the functionalities of PIC-SURE Authorized Access will be described in the context of a scientific use case. Specifically, let’s say I am interested in investigating some comorbidities of breast cancer in women with BRCA1 and BRCA2 gene variants, such as hypertension and COPD.
I already have been authorized to access the Women’s Health Initiative (WHI) study and am interested in a single cohort: women with breast cancer and variants of the BRCA1 and BRCA2 genes. I want to select hypertension-related variables of interest, check the distributions of some variables, and export all the data to an analysis workspace.
First, let’s apply our variable filters for the WHI study.
Search “breast cancer” in Authorized Access.
Add the WHI study tag to filter search results to only age variables found within the WHI study.
Click the “Genomic Filtering” button to begin a filter on genomic variants.
Select “BRCA1” and “BRCA2” genes of “High” and “Moderate” severity. Click “Apply genomic filter”.
Now, let’s filter to participants that have and do not have COPD. Similar to before, we will search ‘COPD’. After selecting the variable of interest, we can filter to the desired values before adding the filter to our query. Notice how the total number of participants in our cohort changes.
Search “hypertension”.
Notice how the number of variables changed in the Data Summary box.
Before we Select and Package the data for export, let’s view the distribution of our participants’ ages to see if we have a normal distribution. Open the Variable Distributions tool in the Tool Suite. Here, we can see the distributions of the two added variable filters: breast cancer (‘BREAST’) and COPD (‘F33COPD’).
Open the Select and Package Data tool in the Tool Suite. The variables shown in this table are those which will be available in your data export; you can remove variables as necessary.
Click “Package Data” when you are ready.
Once the data is packaged, you can select to either “Export to Seven Bridges” or “Export to Terra”. Copy over the personalized user token and query ID use the PIC-SURE API and export your data to an analysis workspace.
Filter to participants with breast cancer by clicking the filter icon next to the variable of interest. Select values to filter your variable on and click “Add Filter to Query”.
Add variables to data export by clicking the select variables icon in the Actions column next to the variable of interest. The icon next to variables selected for export will change to the checkmark icon.
PIC-SURE Open Access allows you to search any clinical variable available in PIC-SURE. Your queries will return obfuscated aggregate counts per study and consent. There are some features specific to PIC-SURE Open Access, which are outlined below.
A. Stigmatizing Variables Removal: PIC-SURE Open Access data excludes clinical variables that contain potentially sensitive information. These variables are known as stigmatizing variables, which fall into the following categories:
Mental health diagnoses, history, and treatment
Illicit drug use history
Sexually transmitted disease diagnoses, history, and treatment
Sexual history
Intellectual achievement, ability, and educational attainment
Direct or surrogate identifiers of legal status
For more information about stigmatizing variables and the identification process, please refer to the documentation and code on the BioData Catalyst Powered by PIC-SURE Stigmatizing Variables GitHub repository.
B. Data Obfuscation: Because participant-level data are not available in PIC-SURE Open Access, the aggregate counts are obfuscated to further anonymize the data. This means that:
If the consent group, study, and/or total participants of the query is between one and nine, the results will be shown as < 10.\
If the consent group results are between one and nine and the study and/or total participants of the query is greater than 10, the results will be obfuscated by ± 3.
Query results that are zero participants will display 0.
C. View Filtered Results by Study: The filtered number of participants which match the query criteria is shown broken down by study and consent group. Users can see if they do or do not have access to specific studies.
In this section, the functionalities of PIC-SURE Open Access will be described in the context of a scientific use case. Specifically, let’s say I am interested in investigating asthma in relation to obesity in adults.
I’m interested in two cohorts: obese adults with a body mass index (BMI) greater than 30 and healthy adults with a BMI between 18.5 and 24.9. However, I have not yet submitted a Data Access Request and therefore am not authorized to access any datasets.
First, let’s explore cohort A: Healthy adults with a BMI between 18.5 and 24.9 in Framingham Heart Study (FHS).
Search for ‘age’.
Apply ‘FHS’ study tag to view only ‘age’ variables within the Framingham Heart Study (phs000007).
Select the variable of interest. You may notice many variables that appear similar. These variables may be located in different datasets, or tables, but contain similar information. Open up the variable information modal by clicking on the row containing the variable of interest to learn more.
Filter to adults only by clicking the filter icon next to the variable. I am interested in adults, so I will set the minimum age to 18, then click “Add filter to query”.
Now, let’s filter to healthy adults with a BMI between 18.5 and 24.9. Similar to before, we will search ‘BMI’. We can narrow down the search results using the variable-level tags by including terms related to our variable of interest (such as ‘continuous’ to view only continuous variables) and excluding out-of-scope terms (such as ‘allergy’). After selecting the variable of interest, we can filter to the desired ranges before adding the filter to our query. Notice how the total number of participants in our cohort changes.
Finally, we will filter for participants who have asthma.
Note the total participant count in the Data Summary.
We can easily modify our filters to explore cohort B: Obese adults with a body mass index (BMI) greater than 30 in Framingham Heart Study.
Edit the BMI filter by clicking the edit icon in the Added Variable Filters section. Change the range to have a minimum of 30 and no maximum.
Note the total participant count in the Data Summary.
We can easily repeat these steps for other studies, such as the Genetic Epidemiology of COPD (COPDGene) study, and create a table like the one below. By comparing these two studies, I can see that COPDGene may be more promising for my research since it contains many more participants in my cohorts of interest than FHS does.
I can then use the Request Access button to go directly to the study’s dbGaP page and begin submitting a DAR.
Framingham Heart Study (FHS)
50 +/- 3
72 +/- 3
Genetic Epidemiology of COPD (COPDGene)
488 +/- 3
868