Workspace

Overview of Workspaces on BDC-Gen3

When navigating to a Workspace, users are presented with multiple workspace options.

BDC-Gen3 Workspace Page

The Gen3 platform offers two workspace environments: Jupyter Notebooks and R Studio.

There are six workspaces:

Virtual machines (VM):

  • Small Jupyter Notebook VM

  • Large Jupyter Notebook Power VM

  • R Studio VM

Pre-made workflow workspaces:

  • Autoencoder Demo

  • CIP Demo

  • Tensorflow-Pytorch.

To start a workspace, select Launch. You will see the following launch loading screen.

Note: Launching a VM can take up to five minutes depending on the size and complexity of the workspace.

Once the VM is ready, the initial screen for the workspace will appear. For scripts and output that need to be saved when the workspace is terminated, store those files in the pd/ directory.

The workspace persists after the user logs out of the BDC-Gen3 system. If the workspace is no longer being used, terminate it by selecting Terminate Workspace at the bottom of the window. You will be returned to the Workspace page with all of the workspace options.

For more information about the Gen3 Workspace, refer to Data Analysis in a Gen3 Data Commons.

Launch loading screen
The initial workspace for Jupyter Notebooks
The initial workspace for R Studio

PFB Files

Overview of the Portable Format for Bioinformatics (PFB) file type

What is a Portable Format for Bioinformatics?

A Portable Format for Bioinformatics (PFB) allows users to transfer both the metadata from the Data Dictionary and the Data Dictionary itself. As a result, data can be transferred while keeping the structure from the original source. Specifically, a PFB consists of three parts:

  • A schema

  • Metadata

  • Data

For more information and an in-depth review that includes Python tools for PFB creation and exploration, refer to the PyPFB GitHub page and install the newest version.

Note: The following PFB example is a direct PFB export from the tutorial-synthetic_data_set_1 found on BioData Catalyst Powered by Gen3. Due to the large amount of data stored within PFB files, only small sections are shown, with breaks (displayed as ... ) occurring in the output.

Schema

A schema is a JSON formatted Data Dictionary containing information about the properties, such as value types, descriptions, and so on.

To view the PFB schema, use the following command: pfb show -i PFB_file.avro schema

Example Output

NOTE: To make the outputs more human-readable, the output was piped through the program jq. Example: pfb show -i PFB_file.avro schema | jq
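Because the schema is plain JSON, it can also be inspected programmatically. The following is a minimal sketch, assuming the schema output has already been saved or captured as JSON text. It uses a trimmed copy of the gene_expression record from the example output in this section; the helper name describe_fields is hypothetical and is not part of PyPFB.

```python
import json

# Trimmed copy of the "gene_expression" record from the example schema output.
SCHEMA_RECORD = json.loads("""
{
  "type": "record",
  "name": "gene_expression",
  "fields": [
    {"default": null, "name": "data_format",
     "type": ["null", {"type": "enum",
                        "name": "gene_expression_data_format",
                        "symbols": ["TXT", "TSV", "CSV", "GCT"]}]},
    {"default": null, "name": "file_size", "type": ["null", "long"]}
  ]
}
""")


def describe_fields(record):
    """Map each field name to its first non-null type, or to its enum symbols."""
    described = {}
    for field in record["fields"]:
        types = field["type"] if isinstance(field["type"], list) else [field["type"]]
        non_null = [t for t in types if t != "null"]
        if not non_null:
            described[field["name"]] = "null"
            continue
        first = non_null[0]
        if isinstance(first, dict) and first.get("type") == "enum":
            # Enumerated property: report the permitted values.
            described[field["name"]] = first["symbols"]
        elif isinstance(first, dict):
            described[field["name"]] = first.get("type")
        else:
            # Primitive type such as "string" or "long".
            described[field["name"]] = first
    return described


field_summary = describe_fields(SCHEMA_RECORD)
print(field_summary)
```

Each property in the example schema is a union of "null" and one other type, which is why the sketch reports only the first non-null member.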

Metadata

The metadata in a PFB contains all of the information explaining the linkage between nodes and external references for each of the properties.

To view the PFB metadata, use the following command: pfb show -i PFB_file.avro metadata

Example Output

Data

The data in the PFB are the values for the properties in the format of the Data Dictionary.

To view the data within the PFB, use the following command: pfb show -i PFB_file.avro

To view a certain number of entries in the PFB file, use the flag -n to designate a number. For example, to view the first 10 data entries within the PFB, use the following command: pfb show -i PFB_file.avro -n 10

Example Output

Getting Started

PyPFB GitHub page
BioData Catalyst Powered by Gen3
jq
pfb show -i PFB_file.avro schema
...
  {
    "type": "record",
    "name": "gene_expression",
    "fields": [
      {
        "default": null,
        "name": "data_category",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_category",
            "symbols": [
              "Transcriptome Profiling"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "data_type",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_type",
            "symbols": [
              "Gene Expression Quantification"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "data_format",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_format",
            "symbols": [
              "TXT",
              "TSV",
              "CSV",
              "GCT"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "experimental_strategy",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_experimental_strategy",
            "symbols": [
              "RNA-Seq",
              "Total RNA-Seq"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "file_name",
        "type": [
          "null",
          "string"
        ]
      },
      {
        "default": null,
        "name": "file_size",
        "type": [
          "null",
          "long"
        ]
      },
      {
        "default": null,
        "name": "md5sum",
        "type": [
          "null",
          "string"
        ]
      },
      {
        "default": null,
        "doc": "The GUID of the object in the index service.",
        "name": "object_id",
        "type": [
          "null",
          "string"
        ]
      }
...
pfb show -i PFB_file.avro metadata
...
    {
      "name": "exposure",
      "ontology_reference": "",
      "values": {},
      "links": [
        {
          "multiplicity": "MANY_TO_ONE",
          "dst": "subject",
          "name": "subjects"
        }
      ],
      "properties": [
        {
          "name": "years_smoked",
          "ontology_reference": "Person Smoking Duration Year Count",
          "values": {
            "source": "caDSR",
            "cde_id": "3137957",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=3137957&version=1.0"
          }
        },
        {
          "name": "years_smoked_gt89",
          "ontology_reference": "Person Smoking Duration Year Count",
          "values": {
            "source": "caDSR",
            "cde_id": "3137957",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=3137957&version=1.0"
          }
        },
        {
          "name": "alcohol_history",
          "ontology_reference": "Alcohol Lifetime History Indicator",
          "values": {
            "source": "caDSR",
            "cde_id": "2201918",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=2201918&version=1.0"
          }
        },
        {
          "name": "alcohol_intensity",
          "ontology_reference": "Person Self-Report Alcoholic Beverage Exposure Category",
          "values": {
            "source": "caDSR",
            "cde_id": "3457767",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=3457767&version=1.0"
          }
        },
...
pfb show -i PFB_file.avro
pfb show -i PFB_file.avro -n 10
...
{
  "id": "6c5e21d5-da76-49a5-9f82-7e3a726d44c6",
  "name": "lab_result",
  "object": {
    "cer451q1": null,
    "oxldl1": null,
    "f81c": null,
    "renins1c": null,
    "cystatc1": null,
    "triglycerides": -0.40415245294570923,
    "glucos1c": 6.5463337898254395,
    "glucos1u": null,
    "ldl": 2.0789523124694824,
    "hdl": 2.7123606204986572,
    "creatin1": null,
    "total_cholesterol": 3.039848566055298,
    "chlcat1c": null,
    
...

    "uabcat1c": null,
    "inslnr1t": 1.8090298175811768,
    "vldlp31c": null,
    
...

    "unit_hematocrit_vfr_bld": null,
    "age_at_total_cholesterol": 80,
    "unit_total_cholesterol": null,
    "age_at_triglycerides": 80,
    "unit_triglycerides": null,
    "age_at_hdl": 80,
    "unit_hdl": null,
    "age_at_ldl": 80,
    "unit_ldl": null,
    
...

    "unit_mcv_entvol_rbc": null,
    "submitter_id": "HG00325_lab_res",
    "state": "validated",
    "project_id": "tutorial-synthetic_data_set_1",
    "created_datetime": "2020-01-27T13:54:06.745386+00:00",
    "updated_datetime": "2020-01-27T13:54:06.745386+00:00"
  },
  "relations": [
    {
      "dst_id": "f4fdda57-80f4-4995-bea2-161c3242c525",
      "dst_name": "subject"
    }
  ]
}
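Each data entry shown above has the same four top-level keys: id, name, object, and relations. The following is a minimal sketch of how one such entry could be summarized once it has been captured as JSON text. It uses a shortened copy of the lab_result entry above; the helper name summarize_record is hypothetical and is not part of PyPFB.

```python
import json

# Shortened copy of the "lab_result" entry from the example output.
RECORD = json.loads("""
{
  "id": "6c5e21d5-da76-49a5-9f82-7e3a726d44c6",
  "name": "lab_result",
  "object": {
    "oxldl1": null,
    "ldl": 2.0789523124694824,
    "hdl": 2.7123606204986572,
    "submitter_id": "HG00325_lab_res",
    "project_id": "tutorial-synthetic_data_set_1"
  },
  "relations": [
    {"dst_id": "f4fdda57-80f4-4995-bea2-161c3242c525", "dst_name": "subject"}
  ]
}
""")


def summarize_record(record):
    """Return the node name, the non-null properties, and the linked nodes."""
    populated = {k: v for k, v in record["object"].items() if v is not None}
    linked = [(r["dst_name"], r["dst_id"]) for r in record["relations"]]
    return {"node": record["name"], "populated": populated, "linked": linked}


record_summary = summarize_record(RECORD)
print(record_summary["node"], sorted(record_summary["populated"]))
```

Dropping the null properties makes sparse entries such as this one easier to read, and the relations list shows which node (here, subject) the entry links to.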

Search and Results

  1. Navigate to https://biodatacatalyst.nhlbi.nih.gov/use-bdc/explore-data/dug/ to access Dug Semantic Search.

  2. Semantic search is a concept-based search engine designed for users to search biomedical concepts, such as “asthma,” “lung,” or “fever,” and the variables related to and/or used to measure them. For example, a search for “chronic pain acceptance” will return a list of related biomedical concepts, such as chronic pain, headaches, neuralgia, or fibromyalgia, each of which can be expanded to display related variables and CDEs. Semantic search can also find variable names and descriptions directly, using synonyms from its knowledge graphs to find search-related variables.

  3. Enter a search term and press “Enter,” or click on the Search button. This will take you to the Semantic Search interface.

Requirements and Login

Requirements

To obtain access to BDC-PIC-SURE, you must have an NIH eRA Commons account. For instructions and to register an account, refer to the eRA website.

Login

After you have created an eRA Commons account, you can log in to BDC-PIC-SURE by navigating to https://picsure.biodatacatalyst.nhlbi.nih.gov and selecting to log in with eRA Commons. You will be directed to the NIH website to log in with your eRA Commons credentials. After signing in and accepting the terms of the agreement on the NIH RAS Information Sharing Consent page, allow the BDC-Gen3 service to manage your authorization.

Upon login, you will be directed to the Data Access Dashboard. This page provides a summary of PIC-SURE Authorized Access, PIC-SURE Open Access, and the studies you are authorized to access.

https://picsure.biodatacatalyst.nhlbi.nih.gov
PIC-SURE Data Access Dashboard

Dug Semantic Search

Step-by-step guidance on using Dug Semantic Search: efficiently and effectively perform and interpret a search using Dug.

Overview

Dug Semantic Search is a tool that allows users to deep dive into BDC studies and biomedical topics, research, and publications to identify related studies, datasets, and variables. If you are interested in how Dug connects study variables to biomedical concepts, read the Dug paper or visit the Help Portal.

This tool applies semantic web and knowledge graph techniques to improve the Findability, Accessibility, Interoperability, and Reusability (FAIR) of BDC research data. Through this process, semantic search helps users identify novel relations, build unique research questions, and identify potential collaborations.

PIC-SURE User Guide

PIC-SURE: Patient Information Commons Standard Unification of Research Elements

The Patient Information Commons: Standard Unification of Research Elements (PIC-SURE) integrates clinical and genomic data to allow users to search, query, and export data at the variable and variant levels. This allows users to create analysis-ready data frames without manually mapping and merging files.

BDC Powered by PIC-SURE (BDC-PIC-SURE) functions as part of the BDC ecosystem, allowing researchers to explore studies funded by the National Heart, Lung, and Blood Institute (NHLBI), whether they have been granted access to the participant level data or not.

Overview of PIC-SURE search interface

Explore Available Data

BioLINCC Datasets

The BDC ecosystem hosts several datasets from the NIH NHLBI Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC). To access the BioLINCC studies, you must request access through dbGaP even if you have authorization from BioLINCC.

PIC-SURE Features and General Layout

General layout of PIC-SURE search
  1. Search bar: Enter any phenotypic variable, study or table keyword into the search bar to search across studies. Users can also search specific variables by accession number, if known (phs/pht/phv).

  2. Study Tags: Users can filter the results found through their search by limiting to studies of interest or excluding studies.

  3. Variable Tags: Users can filter the results found through their search by limiting to keywords of interest or excluding keywords that are out of scope. For example, a user could filter to categorical variables, variables containing the term ‘blood’, and/or exclude variables containing the term ‘pressure’.

    How are variable tags generated? Each variable has a set of associated tags, which are generated during the PIC-SURE data loading process. These tags are generated based on information associated with the variable, including the name of the study, study description, dataset name, PIC-SURE data type (continuous or categorical), and variable description. For a search in PIC-SURE, tags associated with a variable are displayed. Note that tags applicable to less than 5% or more than 95% of the search results are not displayed since these are not useful for filtering results.

  4. Search Results table: View all variables associated with your search term and/or study & variable tags.

  5. Results Panel: Panel with content boxes that describe the cohort based on the variable filters applied to the query.

  6. Data Summary: Displays the total number of participants in the filtered cohort which meet the query criteria. When first opening the Open or Authorized Access page, the number will be the total number of participants that you can access.

  7. Added Variable Filters summary: View all filters which have been applied to the cohort.

  8. Filter Action: Click on the filter icon to filter cohort participants by specific variable values.

  9. Reset button: Allows users to start a new search and query by removing all added filters and clearing all active study and variable tags.
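The rule for which variable tags are displayed (item 3 above) can be sketched as follows. This is an illustration under stated assumptions, not the PIC-SURE implementation: the function name displayed_tags, the input shape (one set of tags per variable in the search results), and the example tag values are all hypothetical. Only the 5% and 95% thresholds come from the text.

```python
from collections import Counter


def displayed_tags(results, low=0.05, high=0.95):
    """Return tags that apply to between 5% and 95% of the search results.

    `results` is a list with one collection of tags per variable.
    """
    total = len(results)
    if total == 0:
        return []
    # Count each tag once per variable, even if it is listed twice.
    counts = Counter(tag for tags in results for tag in set(tags))
    return sorted(tag for tag, n in counts.items() if low <= n / total <= high)


# 40 hypothetical search results: "FHS" is on every variable (100%),
# "continuous" on 30 (75%), "blood" on 10 (25%), "pressure" on 1 (2.5%).
example_results = (
    [{"FHS", "continuous"}] * 30
    + [{"FHS", "blood"}] * 9
    + [{"FHS", "blood", "pressure"}]
)
visible = displayed_tags(example_results)
print(visible)
```

In this example "FHS" is hidden because it applies to every result and so cannot narrow the search, and "pressure" is hidden because it applies to too few results.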

PIC-SURE Open Access

PIC-SURE Open Access allows you to search any clinical variable available in PIC-SURE. Your queries will return obfuscated aggregate counts per study and consent. There are some features specific to PIC-SURE Open Access, which are outlined below.

PIC-SURE Open Access specific features and layout.

A. Stigmatizing Variables Removal: PIC-SURE Open Access data excludes clinical variables that contain potentially sensitive information. These variables are known as stigmatizing variables, which fall into the following categories:

  • Mental health diagnoses, history, and treatment

  • Illicit drug use history

  • Sexually transmitted disease diagnoses, history, and treatment

  • Sexual history

  • Intellectual achievement, ability, and educational attainment

  • Direct or surrogate identifiers of legal status

For more information about stigmatizing variables and the identification process, please refer to the documentation and code on the Stigmatizing Variables GitHub repository.

B. Data Obfuscation: Because participant-level data are not available in PIC-SURE Open Access, the aggregate counts are obfuscated to further anonymize the data. This means that:

  • If the consent group, study, and/or total participants of the query is between one and nine, the results will be shown as < 10.

  • If the consent group results are between one and nine and the study and/or total participants of the query is greater than 10, the results will be obfuscated by ± 3.

  • Query results that are zero participants will display 0.

C. View Filtered Results by Study: The filtered number of participants which match the query criteria is shown broken down by study and consent group. Users can see if they do or do not have access to specific studies.
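The obfuscation rules in section B can be sketched as a small display function. This is one reading of those rules, not the PIC-SURE implementation. The function name display_count and the small_consent_group flag are hypothetical, and the text does not say how the ± 3 offset is chosen, so a uniformly random integer offset is assumed here.

```python
import random


def display_count(count, small_consent_group=False, rng=random):
    """Format an aggregate count following the Open Access obfuscation rules.

    small_consent_group: True when a consent group contributing to this
    count has between one and nine participants (assumed trigger for +/- 3).
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return "0"          # zero participants is displayed as 0
    if count < 10:
        return "< 10"       # one to nine participants is masked
    if small_consent_group:
        # Assumption: shift the count by a random integer in [-3, 3].
        return f"{count + rng.randint(-3, 3)} +/- 3"
    return str(count)


print(display_count(0))
print(display_count(7))
print(display_count(868))
print(display_count(50, small_consent_group=True, rng=random.Random(0)))
```

Passing a seeded random.Random instance makes the offset reproducible, which is convenient when testing.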

Use Case: Using PIC-SURE Open Access to Investigate Asthma in Healthy and Obese Adult Populations

In this section, the functionalities of PIC-SURE Open Access will be described in the context of a scientific use case. Specifically, let’s say I am interested in investigating asthma in relation to obesity in adults.

I’m interested in two cohorts: obese adults with a body mass index (BMI) greater than 30 and healthy adults with a BMI between 18.5 and 24.9. However, I have not yet submitted a Data Access Request and therefore am not authorized to access any datasets.

First, let’s explore cohort A: Healthy adults with a BMI between 18.5 and 24.9 in Framingham Heart Study (FHS).

  1. Search for ‘age’.

  2. Apply ‘FHS’ study tag to view only ‘age’ variables within the Framingham Heart Study (phs000007).

  3. Select the variable of interest. You may notice many variables that appear similar. These variables may be located in different datasets, or tables, but contain similar information. Open up the variable information modal by clicking on the row containing the variable of interest to learn more.

We can easily modify our filters to explore cohort B: Obese adults with a body mass index (BMI) greater than 30 in Framingham Heart Study.

  1. Edit the BMI filter by clicking the edit icon in the Added Variable Filters section. Change the range to have a minimum of 30 and no maximum.

  2. Note the total participant count in the Data Summary.

We can easily repeat these steps for other studies, such as the Genetic Epidemiology of COPD (COPDGene) study, and create a table like the one below. By comparing these two studies, I can see that COPDGene may be more promising for my research since it contains many more participants in my cohorts of interest than FHS does.

I can then use the Request Access button to go directly to the study’s dbGaP page and begin submitting a DAR.

PIC-SURE Open Access vs. PIC-SURE Authorized Access

PIC-SURE provides two ways to search: PIC-SURE Open Access and PIC-SURE Authorized Access. PIC-SURE Open Access enables the user to explore aggregate-level data without any dbGaP data authorizations. PIC-SURE Authorized Access allows the user to explore participant-level data and requires authorization to access at least one study through an active dbGaP Data Access Request (DAR).

Table Comparison of PIC-SURE Open and Authorized Access

Feature | PIC-SURE Open Access | PIC-SURE Authorized Access
Removed stigmatizing variables | ✓ | -
Data obfuscation | ✓ | -
dbGaP approval to access required | - | ✓
Access to aggregate counts | ✓ | ✓
Access to participant-level data | - | ✓
Phenotypic variable search | ✓ | ✓
Phenotypic variable filtering | ✓ | ✓
Genomic variable filtering | - | ✓
Data retrieval | - | ✓
Visualizations | - | ✓

Steps for exploring cohort A, continued from step 3 above:

  4. Filter to adults only by clicking the filter icon next to the variable. I am interested in adults, so I will set the minimum age to 18, then click “Add filter to query”.

  5. Now, let’s filter to healthy adults with a BMI between 18.5 and 24.9. Similar to before, we will search ‘BMI’. We can narrow down the search results using the variable-level tags by including terms related to our variable of interest (such as ‘continuous’ to view only continuous variables) and excluding out-of-scope terms (such as ‘allergy’). After selecting the variable of interest, we can filter to the desired ranges before adding the filter to our query. Notice how the total number of participants in our cohort changes.

  6. Finally, we will filter for participants who have asthma.

  7. Note the total participant count in the Data Summary.

Participant counts for the two cohorts of interest, by study:

  • Framingham Heart Study (FHS): 50 +/- 3; 72 +/- 3

  • Genetic Epidemiology of COPD (COPDGene): 488 +/- 3; 868

    BioData Catalyst Powered by PIC-SURE
    Stigmatizing Variables GitHub repository
    Variable Information modal for ‘age1’ variable from Framingham Heart Study.
    Adding a filter to the ‘age1’ variable from Framingham Heart Study.
    Adding a filter to the ‘B128’ variable from Framingham Heart Study.

    Data Analysis Using the PIC-SURE API

    Once you have refined your queries and created a cohort of interest, you can begin analyzing data using other components of the BDC ecosystem.

    What is the PIC-SURE API?

    The databases exposed through the PIC-SURE API differ widely in architecture and data organization. PIC-SURE hides this complexity and exposes the different databases in the same format, allowing researchers to focus on analysis and medical insights, which makes reproducible science easier. The API is available in two programming languages, Python and R, so investigators can query databases in the same way using either language. The PIC-SURE API tutorial notebooks can be accessed directly on GitHub.

    PIC-SURE Access Token

    To access the PIC-SURE API, a user-specific token is needed. This is the way the API grants access to individual users to protected-access data. The user token is strictly personal; do not share it with anyone. You can copy your personalized access token by selecting the User Profile tab at the top of the screen.

    Here, you can Copy your personalized access token, Reveal your token, and Refresh your token to retrieve a new token and deactivate the old token.
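Because the token is personal, it is safer to keep it out of notebooks that may be shared or copied. The following is a minimal sketch of one way to do that, reading the token from an environment variable instead of pasting it into code. The variable name PICSURE_TOKEN and both helper functions are hypothetical; they are not part of the PIC-SURE API.

```python
import os


def load_token(env_var="PICSURE_TOKEN"):
    """Read the access token from an environment variable (hypothetical name)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {env_var} to the token copied from the User Profile tab."
        )
    return token


def mask(token, visible=4):
    """Hide all but the last few characters so the token is safe to print."""
    if len(token) <= visible:
        return "*" * len(token)
    return "*" * (len(token) - visible) + token[-visible:]


# Example with a made-up value; a real token should never appear in code.
print(mask("abc123456"))
```

If a token is refreshed from the User Profile tab, only the environment variable needs to be updated; the notebook code stays unchanged.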

    Analysis in the BDC Ecosystem

    The PIC-SURE API can be accessed via tutorial notebooks on either BDC-Seven Bridges or BDC-Terra.

    To launch one of the analysis platforms, go to the BioData Catalyst website. From the Resources menu, select Services. A list of platforms and services on the BDC ecosystem will be displayed.

    From the Analyze Data in Cloud-based Shared Workspaces section, select Launch for your preferred analysis platform.

    BDC-Seven Bridges

    Jupyter notebook examples in R and Python can be found under the Public projects tab by selecting PIC-SURE API.

    From the Data Studio tab, select an example that fits your research needs. Here, we will select PIC-SURE JupyterLab examples.

    This will take you to the PIC-SURE API analysis workspace, where you can view the examples in Python. Copy this workspace to your own project to edit or run the code yourself.

    Note: The project must have network access to run the PIC-SURE examples on Seven Bridges. To ensure this, go to the Settings tab and select “Allow network access”.

    BDC-Terra

    To access the Jupyter notebook examples in R and Python for the PIC-SURE API, select View Workspaces from the Terra landing page.

    Select the Public tab and search for “PIC-SURE”. Workspaces for both the Python and R examples will be displayed. You must clone the workspaces to edit or run the code within them.

    Data Organization in PIC-SURE

    PIC-SURE integrates clinical and genomic datasets across BDC, including TOPMed and TOPMed related studies, COVID-19 studies, and BioLINCC studies. Each variable is organized as a concept path that contains information about the study, variable group, and variable. Though the specifics of the concept paths are dependent on the type of study, the overall information included is the same.

    For more information about additional dbGaP, TOPMed, and PIC-SURE concept paths, refer to Appendix 1.

    Table of Data Fields in PIC-SURE

    Field | Data in dbGaP format | Data not in dbGaP format
    General organization | Data organized using the format implemented by the database of Genotypes and Phenotypes (dbGaP). Find more information on the dbGaP data structure here. Generally, a given study will have several tables, and those tables have several variables. | Data do not follow dbGaP format; there are no phv or pht accessions. Data are organized in groups of like variables, when available. For example, variables like Age, Gender, and Race could be part of the Demographics variable group.
    Concept path structure | \phs\pht\phv\variable name\ | \phs\variable name
    Variable ID | phv corresponding to the variable accession number | Equivalent to variable name
    Variable name | Encoded variable name that was used by the original submitters of the data | Encoded variable name that was used by the original submitters of the data
    Variable description | Description of the variable | Description of the variable, as available
    Dataset ID | pht corresponding to the trait table accession number | Equivalent to dataset name
    Dataset name | Name of the trait table | Name of a group of like variables, as available
    Dataset description | Description of the trait table | Description of a group of like variables, as available
    Study ID | phs corresponding to the study accession number | phs corresponding to the study accession number
    Study description | Description of the study from dbGaP | Description of the study from dbGaP

    Note that there are two data types in PIC-SURE: categorical and continuous data. Categorical variables are variables that have categorized values. For example, “Have you ever had asthma?” with values “Yes” and “No” is a categorical variable. Continuous variables are variables that have a numeric range of values. For example, “Age” with a value range from 10 to 90 is a continuous variable. The internal PIC-SURE data load process determines the type of each variable based on the data.
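The two concept path structures in the table can be told apart by the number of backslash-separated segments. The following is a minimal sketch of a parser built on that observation. The function name parse_concept_path and the accession values in the example are hypothetical, and the sketch assumes that variable names do not themselves contain backslashes.

```python
def parse_concept_path(path):
    """Split a PIC-SURE concept path into its named parts.

    Four segments: \\phs\\pht\\phv\\variable name\\ (dbGaP format).
    Two segments:  \\phs\\variable name            (non-dbGaP format).
    """
    parts = path.strip("\\").split("\\")
    if len(parts) == 4:
        study_id, dataset_id, variable_id, variable_name = parts
        return {
            "study_id": study_id,
            "dataset_id": dataset_id,
            "variable_id": variable_id,
            "variable_name": variable_name,
        }
    if len(parts) == 2:
        study_id, variable_name = parts
        return {
            "study_id": study_id,
            "dataset_id": None,             # no pht accession in this format
            "variable_id": variable_name,   # equivalent to the variable name
            "variable_name": variable_name,
        }
    raise ValueError(f"Unrecognized concept path: {path!r}")


# Hypothetical accession numbers, for illustration only.
dbgap_style = parse_concept_path("\\phs000001\\pht000002\\phv00000003\\AGE\\")
other_style = parse_concept_path("\\phs000004\\AGE")
print(dbgap_style["variable_id"], other_style["variable_id"])
```

For paths that do not follow the dbGaP format, the sketch sets the variable ID to the variable name, matching the Variable ID row of the table.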

    database of Genotypes and Phenotypes (dbGaP)
    here
    Seven Bridges
    Terra
    BioData Catalyst website
    Terra landing page
    User Profile modal displaying personalized access token.
    List of analysis platforms in the Analyze Data in Cloud-based Shared Workspaces section on the BioData Catalyst website.
    Navigating to the PIC-SURE API in Seven Bridges Public Projects.
    Dashboard of the PIC-SURE API on Seven Bridges
    Copying the PIC-SURE API Public Project to a workspace from the Data Studio page.
    BioData Catalyst Powered by Terra landing page
    Searching for the PIC-SURE API examples in Terra workspaces

    Appendix 2: Table of Harmonized Variables

    cac_volume_1 | Coronary artery calcium volume using CT scan(s) of coronary arteries | decimal | cubic millimeters | UMLS
    cac_score_1 | Coronary artery calcification (CAC) score using Agatston scoring of CT scan(s) of coronary arteries | decimal | UMLS

    Additional Resources

    Video Walkthroughs

    Playlist

    BioData Catalyst Powered by PIC-SURE

    Videos

    CONNECTS Dataset

    The BDC ecosystem hosts several datasets from the NIH NHLBI Collaborating Network of Networks for Evaluating COVID-19 and Therapeutic Strategies (CONNECTS) program. These COVID-19 related studies follow the guidelines for implementing common data elements (CDEs) and for de-identifying dates, ages, and free text fields. For more information about these efforts, you can view the CDE Manual and De-Identification Guidance documents on the CONNECTS COVID-19 Therapeutic Trial Common Data Elements webpage.

    Table of COVID-19 Studies Included in the CONNECTS Program Available in PIC-SURE

    Study name | Abbreviation | Accession
    A Multicenter, Adaptive, Randomized Controlled Platform Trial of the Safety and Efficacy of Antithrombotic Strategies in Hospitalized Adults with COVID-19 | ACTIV4a | phs002694
    COVID-19 Positive Outpatient Thrombosis Prevention in Adults Aged 40-80 | ACTIV4b | phs002710

    Available Data and Managing Data Access

    BDC-PIC-SURE has integrated clinical and genomic data from a variety of heart, lung, blood, and sleep related datasets. These include NHLBI Trans-Omics for Precision Medicine (TOPMed) and TOPMed related studies, BioLINCC datasets, and COVID-19 datasets.

    View a summary of the data you have access to by viewing the Data Access Table.

    This table displays information about the study and associated data, including the full and abbreviated name of the study, study design and focus, the number of clinical variables, participants, and samples sequenced, additional information with helpful links, consent group information, and the dbGaP accession number (or phs number). You are also able to see which studies you are authorized to access in the Access column of the table. For information from dbGaP on submitting a data access request, refer to the Tips for Preparing a Successful Data Access Request documentation. Note that studies with a sickle cell disease focus contain links to the Cure SCi Metadata Catalog for additional information.

    Sample summary table of studies available and user-based authorization via the Data Table.

    You can also check the data you have access to by going to the page on the BDC website and clicking Check My Access.

    PIC-SURE Authorized Access

    If you are authorized to access any dbGaP dataset(s), the Authorized Access tab at the top will be visible. PIC-SURE Authorized Access provides access to complete, participant-level data, in addition to aggregate counts, and access to the Tool Suite.

    PIC-SURE Authorized Access specific features and layout.

    A. Select Variables Action: Click the Select Variables icon to include variables when retrieving data. Users can select variables individually or at the dataset level.

    • Individually select variables: You can individually select variables from two locations:

      • Variable search results: From the search results you can click the data retrieval icon to include the variable in your data retrieval.

      • Variable modal variable data retrieval: The data retrieval icon next to the variable adds the variable to your data retrieval.

    • Select from a dataset or group of variables: In the variable modal the data retrieval icon next to the dataset opens a modal to allow you to select variables from the dataset table or group of variables.

    B. Data Summary: In addition to the total number of participants in the filtered cohort, the number of variables the user has selected for data retrieval is also displayed.

    There are four concept paths that are automatically included with any data export from PIC-SURE Authorized Access. These fields are listed and described below.

    • Patient ID: Internal PIC-SURE participant identifier. Please note that this field does not link participants between studies and therefore should not be used for data correlation between different data sources or to the original data files.

    • Parent Study Accession with Subject ID: PIC-SURE generated identifier for parent studies. These identifiers are a combination of the study accession number and the subject identifier.

    C. Tool Suite: The Tool Suite contains tools that can be used to further explore filtered cohorts of interest. Note that at least one filter must be added to the query before using the Tool Suite.

    • Select and Package Data: Retrieve participant-level data corresponding to your filters and variable selections. Variables selected for data retrieval can be reviewed and modified. To learn more about the options associated with this tool, please refer to the Select and Package Data section.

    • Variable Distributions: View the distributions of query variables based on the filtered cohort. Note that there is a limit to the number of variable distributions that can be viewed at a given time. Additionally, neither genomic variables nor variables associated with an any-record-of filter (e.g. entire datasets) will be graphed.

    Select and Package Data

    The Select and Package Data tool is used to select and export participant-level data corresponding to your filters and variable selections. The tool presents several options for selecting and exporting the data.

    The number of participants and the number of variables included in the query are shown in the top left corner of the modal. These counts are used to display the estimated number of data points in the export.

    Note: Queries with more than 1,000,000 data points will not be exportable.
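As a rough illustration of the limit above, the sketch below assumes the estimate is simply the number of participants multiplied by the number of selected variables. The source does not state the exact formula, and the function names are hypothetical, so treat this as a back-of-the-envelope check rather than PIC-SURE's own calculation.

```python
# Assumption: estimate = participants x variables; PIC-SURE may compute it differently.
MAX_EXPORTABLE_DATA_POINTS = 1_000_000  # limit stated in the note above


def estimated_data_points(participants: int, variables: int) -> int:
    """Return the assumed estimate of data points in an export."""
    if participants < 0 or variables < 0:
        raise ValueError("counts must be non-negative")
    return participants * variables


def is_exportable(participants: int, variables: int) -> bool:
    """True when the estimate does not exceed the documented limit."""
    return estimated_data_points(participants, variables) <= MAX_EXPORTABLE_DATA_POINTS
```

Under this assumption, a cohort of 10,000 participants could carry at most 100 variables before the query stops being exportable.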

    The table below displays a summary of the variables included in the export. Using the Selected column, variables that have been added to the export can be selected or deselected for the final dataframe.

    Note: Variables with filters are automatically included in the export.

    The Package Data button in the top right corner is used to prepare the data for export once the variable filters and selections have been finalized.

    Once this button is clicked, there are several options to complete the export.

    To export into a BDC analysis workspace, use the Export to Seven Bridges or Export to Terra button. After you click either button, a new modal displays all the information and instructions needed to complete the export, including your personalized access token and the query ID associated with the dataframe. Additionally, there is the option to Copy Query ID without accessing Seven Bridges or Terra if you wish to use a different analysis platform.

    The Export to Seven Bridges option includes a Go to Seven Bridges button, which will open a new tab to the Public PIC-SURE API Project on BDC-Seven Bridges.

    The Export to Terra option includes a Go to Terra via R button and a Go to Terra via Python button, which will open the Public PIC-SURE API R Examples workspace and the Public PIC-SURE API Python Examples workspace on BDC-Terra, respectively.

    Use Case: Investigating Comorbidities of Breast Cancer in Authorized Access

    In this section, the functionalities of PIC-SURE Authorized Access will be described in the context of a scientific use case. Specifically, let’s say I am interested in investigating some comorbidities of breast cancer in women with BRCA1 and BRCA2 gene variants, such as hypertension and COPD.

    I already have been authorized to access the Women’s Health Initiative (WHI) study and am interested in a single cohort: women with breast cancer and variants of the BRCA1 and BRCA2 genes. I want to select hypertension-related variables of interest, check the distributions of some variables, and export all the data to an analysis workspace.

    First, let’s apply our variable filters for the WHI study.

    1. Search “breast cancer” in Authorized Access.

    2. Add the WHI study tag to filter search results to only age variables found within the WHI study.

    3. Filter to participants with breast cancer by clicking the filter icon next to the variable of interest. Select values to filter your variable on and click “Add Filter to Query”.

    TOPMed and TOPMed related datasets

    The BDC ecosystem hosts several datasets from the NHLBI Trans-Omics for Precision Medicine (TOPMed) program. The PIC-SURE platform has integrated the clinical and genomic data from all studies listed in the Data Access Dashboard. Through the ingestion process, occasionally PIC-SURE will ingest phenotypic data for the TOPMed studies prior to the genomic data.

    Harmonized Data (TOPMed Harmonized Clinical Variables)

    There are limited amounts of harmonized data available at this time. The TOPMed Data Coordinating Center (DCC) curation team has identified 44 variables that are shared across 17 NHLBI studies and normalized the participant values for these variables.

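    The 44 harmonized variables available are listed in the table in Appendix 2. For more information on this initiative, you can view the additional documentation from the TOPMed DCC GitHub repository or on the NHLBI Trans-Omics for Precision Medicine website.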

    Table of Studies Included in the TOPMed Harmonized Dataset Available in PIC-SURE

    Discovering Data Using Gen3

    How to log in to the BDC Powered by Gen3 (BDC-Gen3) platform and view available genomic and phenotypic data.

    Login to the BDC-Gen3 Platform

    In order to navigate and access data available on the Gen3 platform, start by visiting the login page. You will need an eRA Commons account as well as access permissions through the Database of Genotypes and Phenotypes (dbGaP). If you are a researcher, log in by selecting NIH Login and using your eRA Commons account. BDC consortia developers can log in using their Google accounts. Make sure to use the login method that has access to your available projects.

    Once logged in, your username will appear in the upper right-hand corner of the page. You will also see a display with aggregate statistics for the total number of subjects, studies, aliquots and files available within the BDC platform.

    NOTE: These numbers may differ from those displayed in the dbGaP records as they include TOPMed studies as well as the associated parent studies.

    Types of Hosted Data

    Phenotypic

    DCC Harmonized clinical data:

    A number of clinical variables have been harmonized by the Data Coordinating Center (DCC) in order to facilitate cross-study analysis. Faceted search over the DCC Harmonized Variables is available via the Exploration page, under the "Data" tab.

    Unharmonized clinical data:

    Unharmonized clinical files are also available on the Gen3 platform and contain all of the raw phenotypic information for the hosted studies. Unlike the DCC Harmonized Variables, these files are located and searchable under the "Files" tab in the Exploration page.

    Genomic

    The Gen3 platform hosts genomic data provided by the Trans-Omics for Precision Medicine (TOPMed) program and the 1000 Genomes Project, plus synthetic tutorial data from Terra. At present, these projects include CRAM and VCF files together with their respective index files. Specifically for TOPMed projects, each project will contain at least one multi-sample VCF that comprises all subjects within the consent group. CRAM and VCF files are provided at the individual level, whereas multi-sample VCFs are provided at the study consent level.

    All files are available under the "Files" tab in the Exploration page. More detailed information on currently hosted data on the Gen3 platform can be found here.

    Gen3 Pages

    The BDC-Gen3 platform contains five pages described below:

    • Dictionary: An interactive data dictionary display that details the contents and relationships between clinical and biospecimen data

    • Exploration: The facet filter custom cohort creation tool

    • Query: The GraphQL query tool to retrieve specific data within the graph model

    Appendix 1: BDC Identifiers - dbGaP, TOPMed, and PIC-SURE

    Table of BDC dbGAP/TOPMed Identifiers

    Current Projects

    Overview of current projects hosted on BDC-Gen3, including their dependencies, characteristics, and relationships.

    Current Project IDs

    A list of current project IDs can be found in the Data tab, under Filters>Project>Project Id. The current project IDs are:

    cimt_1

    Common carotid intima-media thickness, calculated as the mean of two values: mean of multiple thickness estimates from the left far wall and from the right far wall.

    decimal

    mm

    UMLS

    cimt_2

    Common carotid intima-media thickness, calculated as the mean of four values: maximum of multiple thickness estimates from the left far wall, left near wall, right far wall, and right near wall.

    decimal

    mm

    UMLS

    carotid_stenosis_1

    Extent of narrowing of the carotid artery.

    encoded

    UMLS

    0=None||1=1%-24%||2=25%-49%||3=50%-74%||4=75%-99%||5=100%

    carotid_plaque_1

    Presence or absence of carotid plaque.

    encoded

    UMLS

    0=Plaque not present||1=Plaque present

    height_baseline_1

    Body height at baseline.

    decimal

    cm

    UMLS

    current_smoker_baseline_1

    Indicates whether subject currently smokes cigarettes.

    encoded

    UMLS

    0=Does not currently smoke cigarettes||1=Currently smokes cigarettes

    weight_baseline_1

    Body weight at baseline.

    decimal

    kg

    UMLS

    ever_smoker_baseline_1

    Indicates whether subject ever regularly smoked cigarettes.

    encoded

    UMLS

    0=Never a cigarette smoker||1=Current or former cigarette smoker

    bmi_baseline_1

    Body mass index calculated at baseline.

    decimal

    kg/m^2

    UMLS

    hemoglobin_mcnc_bld_1

    Measurement of mass per volume, or mass concentration (mcnc), of hemoglobin in the blood (bld).

    decimal

    g / dL = grams per deciliter

    UMLS

    hematocrit_vfr_bld_1

    Measurement of hematocrit, the fraction of volume (vfr) of blood (bld) that is composed of red blood cells.

    decimal

    % = percentage

    UMLS

    rbc_ncnc_bld_1

    Count by volume, or number concentration (ncnc), of red blood cells in the blood (bld).

    decimal

    millions / microliter

    UMLS

    wbc_ncnc_bld_1

    Count by volume, or number concentration (ncnc), of white blood cells in the blood (bld).

    decimal

    thousands / microliter

    UMLS

    basophil_ncnc_bld_1

    Count by volume, or number concentration (ncnc), of basophils in the blood (bld).

    decimal

    thousands / microliter

    UMLS

    eosinophil_ncnc_bld_1

    Count by volume, or number concentration (ncnc), of eosinophils in the blood (bld).

    decimal

    thousands / microliter

    UMLS

    neutrophil_ncnc_bld_1

    Count by volume, or number concentration (ncnc), of neutrophils in the blood (bld).

    decimal

    thousands / microliter

    UMLS

    lymphocyte_ncnc_bld_1

    Count by volume, or number concentration (ncnc), of lymphocytes in the blood (bld).

    decimal

    thousands / microliter

    UMLS

    monocyte_ncnc_bld_1

    Count by volume, or number concentration (ncnc), of monocytes in the blood (bld).

    decimal

    thousands / microliter

    UMLS

    platelet_ncnc_bld_1

    Count by volume, or number concentration (ncnc), of platelets in the blood (bld).

    integer

    thousands / microliter

    UMLS

    mch_entmass_rbc_1

    Measurement of the average mass (entmass) of hemoglobin per red blood cell (rbc), known as mean corpuscular hemoglobin (MCH).

    decimal

    pg = picogram

    UMLS

    mchc_mcnc_rbc_1

    Measurement of the mass concentration (mcnc) of hemoglobin in a given volume of packed red blood cells (rbc), known as mean corpuscular hemoglobin concentration (MCHC).

    decimal

    g / dL = grams per deciliter

    UMLS

    mcv_entvol_rbc_1

    Measurement of the average volume (entvol) of red blood cells (rbc), known as mean corpuscular volume (MCV).

    decimal

    fL = femtoliter

    UMLS

    pmv_entvol_bld_1

    Measurement of the mean volume (entvol) of platelets in the blood (bld), known as mean platelet volume (MPV or PMV).

    decimal

    fL = femtoliter

    UMLS

    rdw_ratio_rbc_1

    Measurement of the ratio of variation in width to the mean width of the red blood cell (rbc) volume distribution curve taken at +/- 1 CV, known as red cell distribution width (RDW).

    decimal

    % = percentage

    UMLS

    bp_systolic_1

    Resting systolic blood pressure from the upper arm in a clinical setting.

    decimal

    mmHg

    UMLS

    bp_diastolic_1

    Resting diastolic blood pressure from the upper arm in a clinical setting.

    decimal

    mmHg

    UMLS

    antihypertensive_meds_1

    Indicator for use of antihypertensive medication at the time of blood pressure measurement.

    encoded

    UMLS

    0=Not taking antihypertensive medication||1=Taking antihypertensive medication

    race_1

    Harmonized race category of participant.

    encoded

    UMLS

    AI_AN=American Indian_Alaskan Native or Native American||Asian=Asian||Black=Black or African American||HI_PI=Native Hawaiian or other Pacific Islander||Multiple=More than one race||Other=Other race||White=White or Caucasian

    ethnicity_1

    Indicator of Hispanic or Latino ethnicity.

    encoded

    UMLS

    both=ethnicity component dbGaP variable values for a subject were inconsistent/contradictory (e.g. over multiple visits)||HL=Hispanic or Latino||notHL=not Hispanic or Latino

    hispanic_subgroup_1

    classification of Hispanic/Latino background for Hispanic/Latino subjects where country or region of origin information is available

    encoded

    UMLS

    CentralAmerican=Central American||CostaRican=from Costa Rica||Cuban=Cuban||Dominican=Dominican||Mexican=Mexican||PuertoRican=Puerto Rican||SouthAmerican=South American

    annotated_sex_1

    Subject sex, as recorded by the study.

    encoded

    UMLS

    female=Female||male=Male

    geographic_site_1

    Recruitment/field center, baseline clinic, or geographic region.

    encoded

    UMLS

    subcohort_1

    A distinct subgroup within a study, generally indicating subjects who share similar characteristics due to study design. Subjects may belong to only one subcohort.

    encoded

    UMLS

    lipid_lowering_medication_1

    Indicates whether participant was taking any lipid-lowering medication at blood draw to measure lipids phenotypes

    encoded

    UMLS

    0=Participant was not taking lipid-lowering medication||1=Participant was taking lipid-lowering medication.

    fasting_lipids_1

    Indicates whether participant fasted for at least eight hours prior to blood draw to measure lipids phenotypes.

    encoded

    UMLS

    0=Participant did not fast_or fasted for fewer than eight hours prior to measurement of lipids phenotypes.||1=Participant fasted for at least eight hours prior to measurement of lipids phenotypes.

    total_cholesterol_1

    Blood mass concentration of total cholesterol

    decimal

    mg/dL

    UMLS

    triglycerides_1

    Blood mass concentration of triglycerides

    decimal

    mg/dL

    UMLS

    hdl_1

    Blood mass concentration of high-density lipoprotein cholesterol

    decimal

    mg/dL

    UMLS

    ldl_1

    Blood mass concentration of low-density lipoprotein cholesterol

    decimal

    mg/dL

    UMLS

    vte_prior_history_1

    An indicator of whether a subject had a venous thromboembolism (VTE) event prior to the start of the medical review process (including self-reported events).

    encoded

    UMLS

    0=did not have prior VTE event||1=had prior VTE event

    vte_case_status_1

    An indicator of whether a subject experienced a venous thromboembolism event (VTE) that was verified by adjudication or by medical professionals.

    encoded

    UMLS

    0=Not known to ever have a VTE event_either self-reported or from medical records||1=Experienced a VTE event as verified by adjudication or by medical professionals

    age_at_*

    For each phenotypic value for a given subject, an associated age at measurement is provided.

    decimal

    years

    See TOPMed Harmonization Strategies for more information.

    unit_*

    For each harmonized variable, a paired “unit_variable” is provided, whose value indicates where in the documentation to look to find the set of component variables and the algorithm used to harmonize those variables.

    encoded

    See TOPMed Harmonization Strategies for more information.
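The encoded variables in the table above list their permitted values as `code=label` pairs separated by `||`. A minimal sketch of turning such a string into a lookup table follows; the function name is hypothetical and the parsing rule is inferred from the strings shown in this table, not from a published specification.

```python
def parse_encoded_values(encoded: str) -> dict[str, str]:
    """Split 'code=label||code=label' into a {code: label} dictionary."""
    mapping = {}
    for pair in encoded.split("||"):
        if not pair.strip():
            continue  # tolerate a trailing or doubled separator
        # Split on the first '=' only, so labels may themselves contain '='.
        code, separator, label = pair.partition("=")
        if not separator:
            raise ValueError(f"missing '=' in {pair!r}")
        mapping[code.strip()] = label.strip()
    return mapping


# Value string copied from the carotid_stenosis_1 row above.
stenosis_levels = parse_encoded_values(
    "0=None||1=1%-24%||2=25%-49%||3=50%-74%||4=75%-99%||5=100%"
)
```

The codes are kept as strings because some variables (for example race_1 or annotated_sex_1) use text codes rather than integers.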

    Clinical-trial of COVID-19 Convalescent Plasma in Outpatients

    C3PO

    phs002752

    YouTube channel
    Introduction to
    BioData Catalyst Powered by PIC-SURE
    Basics: Finding Variables
    Basics: Applying a Variable on a Filter
    Basics: Editing a Variable Filter
    PIC-SURE Open Access: Interpreting the Results
    PIC-SURE Authorized Access: Applying a Genomic Filter
    PIC-SURE Authorized Access: Add Variables to Export
    PIC-SURE Authorized Access: Select and Package Data Tool
    PIC-SURE Authorized Access: Variable Distributions Tool
    PIC-SURE Open Application Programming Interface (API)
    BioData Catalyst Data Access
    “Check my access” on the BDC Access page.

    phs000988

    Framingham Heart Study

    FHS

    phs000007

    Genetic Epidemiology Network of Arteriopathy

    GENOA

    phs001238

    Genetic Epidemiology of COPD

    COPDGene

    phs000179

    Genetics of Cardiometabolic Health in Amish

    AMISH

    phs000956

    Genome-Wide Association Study of Venous Thrombosis Study

    MAYOVTE

    phs000289

    Heart and Vascular Health Study

    HVH

    phs001013

    Hispanic Community Health Study - Study of Latinos

    HCHS-SOL

    phs000810

    Jackson Heart Study

    JHS

    phs000286

    Multi-Ethnic Study of Atherosclerosis

    MESA

    phs000209

    Study of Adiposity in Samoans

    SAS

    phs000914

    Women’s Health Initiative WHI

    WHI

    phs000200

    Atherosclerosis Risk in Communities Study

    ARIC

    phs000280

    Cardiovascular Health Study

    CHS

    phs000287

    Cleveland Family Study

    CFS

    phs000284

    Coronary Artery Risk Development in Young Adults Study

    CARDIA

    phs000285

    Epidemiology of Asthma in Costa Rica Study

    additional documentation from the TOPMed DCC GitHub repository
    NHLBI Trans-Omics for Precision Medicine website

    CRA

    Topmed Study Accession with Subject ID: PIC-SURE generated identifier for TOPMed studies. These identifiers are a combination of the study accession number and the subject identifier.
  • Consents: Field used to determine which groups users are authorized to access from dbGaP. These identifiers are a combination of the study accession number and consent code.

  • Click the “Genomic Filtering” button to begin a filter on genomic variants.

  • Select “BRCA1” and “BRCA2” genes of “High” and “Moderate” severity. Click “Apply genomic filter”.

  • Now, let’s filter to participants that have and do not have COPD. Similar to before, we will search ‘COPD’. After selecting the variable of interest, we can filter to the desired values before adding the filter to our query. Notice how the total number of participants in our cohort changes.

  • Search “hypertension”.

  • Add variables to data export by clicking the select variables icon in the Actions column next to the variable of interest. The icon next to variables selected for export will change to the checkmark icon.

    Adding ‘hypertension’ variables (‘HTNTRT’, ‘HYPT’, ‘HYPTPILL’, and ‘HYPTPILN’) for export from the Women’s Health Initiative Study.
  • Notice how the number of variables changed in the Data Summary box.

  • Before we Select and Package the data for export, let’s view the distribution of our participants’ ages to see if we have a normal distribution. Open the Variable Distributions tool in the Tool Suite. Here, we can see the distributions of the two added variable filters: breast cancer (‘BREAST’) and COPD (‘F33COPD’).

    Variable Distributions modal for the Authorized Access example cohort.
  • Open the Select and Package Data tool in the Tool Suite. The variables shown in this table are those which will be available in your data export; you can remove variables as necessary.

    Select and Package Data modal.
  • Click “Package Data” when you are ready.

  • Once the data is packaged, you can select either “Export to Seven Bridges” or “Export to Terra”. Copy the personalized user token and query ID to use the PIC-SURE API and export your data to an analysis workspace.

  • Select and Package Data tool modal with example filters and variables.
    Select and Package Data tool modal with example filters and variables after clicking “Package Data”.
    Export to Seven Bridges modal.
    Export to Terra modal.
    Adding a filter to the ‘BREAST’ variable from Women’s Health Initiative Study.

    Workspace: The launch page for Gen3 workspaces that includes Jupyter Notebooks and RStudio

  • Profile: The information page for each user, displaying access and the location for credential file downloads

  • Data Coordinating Center (DCC)
    Exploration
    Files
    Exploration
    Trans-Omics for Precision Medicine
    1000 Genomes Project
    Exploration
    here
    Dictionary
    Exploration
    Query
    Post-login view of the BDC-Gen3 front page.
    The BDC-Gen3 Pages.

    SUBJECT_ID

    • This is a generated id that is unique to each patient in a study.

    • Controlled by the submitter of a study.

    • For FHS, this is replaced with shareid in phs000007, while phs000974 uses SUBJECT_ID. The values in these two columns are the same, however.

    SHARE_ID

    • For FHS phs000007 this was used instead of SUBJECT_ID, but not for FHS phs000974

    SOURCE_SUBJECT_ID

    • This is used internally by dbGaP in conjunction with SUBJECT_SOURCE to allow submitters to associate subjects across studies.

    SAMPLE_ID

    • De-identified sample identifier.

    • These are the IDs that link to the molecular data in dbGaP (VCFs, etc.).

    Table of PIC-SURE Identifiers

    \_Topmed Study Accession with Subject ID\

    Generated identifier for TOPMed Studies. These identifiers are a concatenation using the accession name and “SUBJECT_ID” from a study’s subject multi file.

    <STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID>

    Eg: phs000974.v3_XXXXXXX

    \_Parent Study Accession with Subject ID\

    Generated identifier for Parent studies. In most studies this follows the same pattern as the TOPMed Study Accession with Subject ID.

    However, Framingham’s parent study phs000007 does not contain a SUBJECT_ID column; the SHAREID column is used instead.

    Eg: phs000007.v3_XXXXXXX

    \_VCF Sample Id\

    This variable is stored in the sample multi file in each dbGaP study.

    This is the TOPMed DNA sample identifier. This is used to give each sample/sequence a unique identifier across TOPMed studies.

    Eg: NWD123456

    Patient ID (not a concept path but exists in data exports)

    This is PIC-SURE’s internal Identifier. It is commonly referred to as HPDS Patient num.

    This identifier is generated and assigned to subjects when they are loaded. It is not meant for data correlation between different data sources.

    Patient ID

    This is the HPDS Patient num. This is PIC-SURE HPDS’s internal Identifier.

    Topmed / Parent Study Accession with Subject ID

    • These are the identifiers used by each team in the consortium to link data.

    • Values must follow this mask <STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID> Eg: phs000007.v30_XXXXXXX

    DBGAP_SUBJECT_ID

    • This is a generated id that is unique to each patient in a study.

    • Controlled by dbGaP.

    • It is not unique across unrelated studies. However, patients can be linked across studies. See SOURCE_SUBJECT_ID.
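The `<STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID>` mask used by the Topmed and Parent Study Accession with Subject ID fields can be checked and split with a small helper. This is a sketch only: the names are hypothetical, and the regular expression assumes accessions look like `phs` followed by six digits and versions like `v` followed by digits, as in the examples above.

```python
import re
from typing import NamedTuple


class StudySubjectId(NamedTuple):
    accession: str   # e.g. "phs000974"
    version: str     # e.g. "v3"
    subject_id: str  # study-specific subject identifier


# Assumed shape, inferred from the examples: phs######.v#_<subject id>
_MASK = re.compile(r"^(phs\d{6})\.(v\d+)_(.+)$")


def parse_study_subject_id(value: str) -> StudySubjectId:
    """Split an identifier that follows the documented mask into its parts."""
    match = _MASK.match(value)
    if match is None:
        raise ValueError(f"{value!r} does not follow <ACCESSION>.<VERSION>_<SUBJECT_ID>")
    return StudySubjectId(*match.groups())


def build_study_subject_id(accession: str, version: str, subject_id: str) -> str:
    """Concatenate the parts back into the documented mask."""
    return f"{accession}.{version}_{subject_id}"
```

Because the subject portion is matched greedily after the first underscore, subject identifiers that themselves contain underscores are preserved intact.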

  • Parent

  • TOPMed

  • Open_Access

  • Tutorial

    The list of current project IDs can be found under Project Id.

    Parent and TOPMed Studies

    Distinguishing Between Parent and TOPMed Studies

    The Parent and TOPMed study types have been categorized on Gen3 by their Program designation. An example of this designation by Program is presented below.

    A list of Parent (underlined in blue) and TOPMed studies (underlined in red).

    The Program types can be further identified by whether there is an underscore (_) at the end of the study:

    • Parent studies will include an underscore at the end of the study name.

      • Example: parent-WHI_HMB-IRB_

    • TOPMed studies will not include an underscore at the end of the study name.

      • Example: topmed-BioMe_HMB-NPU
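The trailing-underscore rule above is easy to apply programmatically when working with a list of project IDs. The following is a minimal sketch with a hypothetical function name; it only distinguishes Parent from TOPMed studies and does not attempt to handle the other programs.

```python
def parent_or_topmed(project_id: str) -> str:
    """Classify a project ID using the trailing-underscore rule described above.

    Only meaningful for Parent and TOPMed projects; other programs
    are outside the scope of this rule.
    """
    return "Parent" if project_id.endswith("_") else "TOPMed"


# Project IDs taken from the examples above.
examples = ["parent-WHI_HMB-IRB_", "topmed-BioMe_HMB-NPU"]
labels = {project_id: parent_or_topmed(project_id) for project_id in examples}
```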

    Relationship Between Parent and TOPMed Studies

    There are three distinct relationships possible between Parent and TOPMed studies. The first two relationships are streamlined:

    • Parent only: The Parent study does not have a TOPMed counterpart study. This usually means that there are no genomic data, such as WXS (whole exome sequencing) or WGS (whole genome sequencing), located within the study; only phenotypic data.

    • TOPMed only: This TOPMed study does not have a Parent counterpart study. These studies will contain both genomic data, WXS or WGS, and phenotypic data.

    • Parent study with a counterpart TOPMed study: The Parent study will contain the phenotypic data, while the TOPMed study will contain the genomic data. Under dbGaP, these studies would be kept separate from one another and the user would need to create the linkages. In the Gen3 platform, these studies have been linked together under the Parent study, based on the participant IDs found in dbGaP. This allows our system to produce valuable information and cohort creation as it combines both phenotypic and genomic data.

    Parent and TOPMed Study Contents

    The most notable difference between the Program categories is the type of hosted data.

    Parent

    • Genomic data: None

    • Phenotypic data: As with TOPMed studies, any phenotypic data found within the Graph Model will only be DCC harmonized variables. The raw phenotypic data from dbGaP can be found in the reference_file node.

    TOPMed

    • Genomic data: Available data can include CRAM, VCFs and Cohort-level VCF files

    • Phenotypic data: TOPMed studies without an associated Parent study will include phenotypic data in the data graph by way of DCC harmonized variables. Additionally, raw phenotypic data from dbGaP can be found in the reference_file as tar files that share this common naming scheme: RootStudyConsentSet_phs######.<study_shorthand>.v#.p#.c#.<consent_codes>.tar.gz
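The tar file naming scheme above can be unpacked with a regular expression. The sketch below is an assumption-laden illustration: the function name is hypothetical, the pattern is inferred from the scheme as written, and the file name used in the usage line is invented for demonstration rather than taken from the platform.

```python
import re

# Pattern inferred from:
# RootStudyConsentSet_phs######.<study_shorthand>.v#.p#.c#.<consent_codes>.tar.gz
_TAR_NAME = re.compile(
    r"^RootStudyConsentSet_(?P<accession>phs\d{6})"
    r"\.(?P<study_shorthand>[^.]+)"
    r"\.v(?P<v>\d+)\.p(?P<p>\d+)\.c(?P<c>\d+)"
    r"\.(?P<consent_codes>.+)\.tar\.gz$"
)


def parse_reference_tar_name(file_name: str) -> dict[str, str]:
    """Return the named parts of a reference_file tar name, or raise ValueError."""
    match = _TAR_NAME.match(file_name)
    if match is None:
        raise ValueError(f"{file_name!r} does not follow the documented naming scheme")
    return match.groupdict()


# Hypothetical file name, for illustration only.
parts = parse_reference_tar_name(
    "RootStudyConsentSet_phs000007.Framingham.v30.p11.c1.HMB-IRB-MDS.tar.gz"
)
```

The numeric parts are returned as strings under the keys `v`, `p`, and `c`, mirroring the placeholders in the naming scheme without assigning them any further meaning.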

    Open_Access - 1000 Genomes project

    The 1000 Genomes Project is an international research effort (2008-2015) to establish the most detailed catalogue of human variation and genotype data. On the Gen3 platform, the Program open_access contains:

    • Genotypic data: Available data can include CRAM and VCF files.

    • Phenotypic data: The data graph will contain phenotypic data by way of DCC harmonized variables. Additionally, raw phenotypic data can be found in the reference_file as VCF and TXT files.

    Tutorial

    This program contains genomic data from 1000 Genomes and synthetic clinical data generated by Terra. The purpose of this dataset is to serve as a genome-wide association study (GWAS) tutorial. GWAS is an approach used in genetics research to associate specific genetic variations with particular diseases. For more information, see Terra Tutorials.

    On the Gen3 platform, the Program tutorial contains:

    • Genotypic data: Available data can include CRAM and VCF files.

    • Phenotypic data: The data graph will contain phenotypic data by way of DCC harmonized variables. Additionally, raw phenotypic data can be found in the reference_file as VCF and GDS files.

    Query

    Overview of the Query page on BDC-Gen3

    Overview

    The Query page can search and return metadata from either the Flat Model or the Graph Model of a commons. Using GraphQL, these searches can be tailored to filter and return fields of interest for the data sets being queried. Because the Query page queries the model directly, searches can be run immediately after data submission.
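To make the idea of tailoring a GraphQL search concrete, the sketch below assembles a simple query body that selects a few fields from one node. It is illustrative only: the helper name is hypothetical, the node and field names in the usage line (`subject`, `submitter_id`, `project_id`) are assumed to exist in the data model, and the `first` argument is assumed to limit the number of returned records. The sketch builds the request body and does not contact any endpoint.

```python
import json


def build_graphql_payload(node: str, fields: list[str], first: int = 10) -> str:
    """Build a JSON request body holding a GraphQL query for one node.

    Assumption: the queried model accepts a `first` argument to cap results.
    """
    if not fields:
        raise ValueError("at least one field is required")
    field_lines = "\n".join(f"    {name}" for name in fields)
    query = f"{{\n  {node}(first: {first}) {{\n{field_lines}\n  }}\n}}"
    return json.dumps({"query": query})


# Assumed node and field names, for illustration only.
payload = build_graphql_payload("subject", ["submitter_id", "project_id"], first=5)
```

The text inside the `query` key is what would be pasted into the Query page editor; consult the Dictionary page for the node and property names that actually exist.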

    BDC Query Page

    For more information about how to use the Query page, refer to the Gen3 documentation.

    PIC-SURE API Documentation

    How to get started with PIC-SURE and the common endpoints you can use to query any resource registered with PIC-SURE

    The PIC-SURE v2 API is a meta-API used to host any number of resources exposed through a unified set of generalized operations.

    PIC-SURE Repositories:

    • PIC-SURE API: This is the repository for version 2+ of the PIC-SURE API.

    • : This is the wiki page for version 2+ of the PIC-SURE API.

    • : This is the repository for the BDC environment of PIC-SURE.

    • : This is the repository for PIC-SURE-ALL-IN-ONE.

    Additional PIC-SURE Links:

    • : A link to the Avillach Lab Jenkins repository.

    • : A repository for Avillach Lab Jenkins development release control.

    Client Libraries

    The following are the collected client libraries for the entire PIC-SURE project.

    PIC-SURE User Interface

    The PIC-SURE User Interface acts as a visual aid for running normal queries of resources through PIC-SURE.

    PIC-SURE User Interface Repositories:

    • : The main High Performance Data Store (HPDS) UI repository.

    Additional PIC-SURE User Interface Links:

    • : Links to a google drawing of the PIC-SURE UI flow.

    PIC-SURE Auth Micro-App (PSAMA)

    The PSAMA component of the PIC-SURE ecosystem authorizes and authenticates all actions taken within PIC-SURE.

    PSAMA Repos:

    Additional PSAMA Links:

    • : This is where the core of the PSAMA application is stored in GitHub

    High Performance Data Store (HPDS)

    HPDS is a datastore designed to work with the PIC-SURE meta-API. It grants researchers fast, dependable access to static datasets and the ability to produce statistics-ready dataframes filtered on any variable they choose at any time.

    HPDS Repositories:

    • : The main HPDS repository.

    • : Python client library to run queries against a PIC-SURE HPDS resource.

    • : R client library to run queries against a PIC-SURE HPDS resource.

    Dictionary

    Interactive Data Dictionary on BDC-Gen3

    Overview

    The Dictionary page contains an interactive visual representation of the Gen3 data model. The default graph model view, as pictured below, displays all of the nodes and relationships between nodes in a hierarchical structure. The model further specifies the node types and links between nodes, as highlighted in the legend located at the top right side of the page.

    Default view of the interactive Gen3 Data Dictionary

    Graph View

    Users can click on any of the graph nodes in order to learn more about their respective properties. By clicking on a node, the graph will highlight that specific node and all associated links that connect it to the Program node. A "Data Model Structure" list will also appear on the left side toolbar. This will display the node path required to reach the selected node from the Program node.

    When a second node in the path is selected, it will then gray out the other possible paths and only highlight the selected path. It will also change the "Data Model Structure" list on the left side toolbar.

    The left side toolbar has two options available:

    • Download templates: Will download the submission files for all the nodes in the "Data Model Structure" list. This option can also be found on the node that was first selected.

    • Open properties: Will open the node properties in a new pop-up window; an example is displayed in the following screenshot.

    This property view will display all properties in the node and information about each property:

    • Property: Name of the property.

    • Type: The type of input for the node. Examples of this are string, integer, Boolean and enumerated values (enum), which are displayed as preset strings.

    Table View

    The Table view is similar to the Properties view, and nodes are displayed as a list of entries grouped by their node category.

    Clicking on one of the nodes will open the Properties view of the node.

    hashtag
    Dictionary Search

    The Dictionary contains a text-based search function that will search through the names of the properties and the descriptions. While typing, a list of suggestions appears below the search bar. Click on a suggestion to search for it.

    When the search function is used, it will default to the graph model and highlight nodes that contain the search term. Frames around the node boxes indicate whether the searched word was identified in the name of the node (full line) or in the node's description and properties' names/descriptions (dashed line).

    Clicking on one of these nodes displays only the properties that contain the keyword in either the property name or the description.

    Click Clear Search Result to clear the free text search if needed.

    The search history is saved below the search bar in the "Last Search" list. Click on an item here to display the results again.

    Profile

    Overview of the Profile page on the BDC-Gen3

    hashtag
    Profile Page

    The Profile page contains two sections: API keys and Project access.

    Profile page with an active key and access to projects

    hashtag
    API key(s)

    To download large amounts of data, an API key is required as part of the gen3-client. To create a key on your local machine, click Create API key, which opens the following pop-up window:

    Click Download json to save the credential file to your local machine. After completion, a new entry will appear in the API key(s) section of the Profile page. It will display the API key key_id and the expiration date (one month after the key creation). The user should delete the key after it has expired. If for any reason a user feels that their API key has been compromised, the key should be deleted before subsequently creating a new one.
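Because the key expires and should be deleted afterward, it can help to check a downloaded credential file before starting a long download. The following is a minimal sketch, not part of the BDC-Gen3 tooling. It assumes the credential JSON holds an `api_key` field next to `key_id`, and that the API key is a JWT-style token whose payload carries an `exp` claim in Unix seconds. Both are assumptions to verify against your own file; the function names are hypothetical.

```python
import base64
import json
import time
from typing import Optional


def api_key_expiry(api_key: str) -> int:
    """Return the `exp` claim (Unix seconds) from a JWT-style API key.

    Assumption: the key has three dot-separated segments and the middle
    one is base64url-encoded JSON. The signature is NOT verified here.
    """
    payload_b64 = api_key.split(".")[1]
    # JWT segments omit base64 padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    return int(payload["exp"])


def is_expired(credentials: dict, now: Optional[float] = None) -> bool:
    """Report whether the key in a loaded credential dict has expired.

    Assumption: `credentials` is the parsed credential file and stores
    the token under the `api_key` field.
    """
    current = time.time() if now is None else now
    return api_key_expiry(credentials["api_key"]) <= current
```

To use it, load the downloaded file with `json.load` and pass the resulting dictionary to `is_expired`. If it returns `True`, delete the key on the Profile page and create a new one.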

    hashtag
    Project Access

    This section of the Profile page lists the projects and the methods of access for the data within the BDC-Gen3 system. If you do not see access to a specific study, check that you have been granted access within dbGaP. If access has been granted for over a week, contact the BDC Help Desk: bdcat-support@datacommons.io

    Exploration

    An explanation for the Exploration page on BDC-Gen3

    hashtag
    Using Exploration

    The Exploration page located in the upper right-hand section of the toolbar allows users to search through data and create cohorts. The Exploration portal contains a dynamic summary statistics display, as well as search facets leveraging the DCC Harmonized Variables.

    hashtag
    Data Accessibility

    Users can navigate through data on the Exploration page by selecting any of the three Data Access categories.

    • Data with Access: A user can view all of the summary data and associated study information for studies the user has access to, including but not limited to Project ID, file types, and clinical variables.

    • Data without Access:

      • Locks next to the project ID signify that users do not have subject-level access. They can still search through the available studies, but can view only summary statistics. Users can request access to data by visiting the dbGaP homepage.

    • All Data: Users can view all of the data available in the BDC-Gen3 platform, including studies with and without access. As a result, studies not available to a user will be locked as demonstrated below.

    By default, all users visiting the Exploration page will be assigned to Data with Access.

    hashtag
    The Data Tab

    Under the "Data" tab, users can leverage the DCC harmonized variables to create custom cohorts. When facets are selected or updated to cover a desired range of values, the display reflects the information relevant to the newly applied filter. If no facets have been selected, all of the data accessible to the user is displayed. At this time, a user can filter based on three categories of clinical information:

    • Project: Any specifically defined piece of work that is undertaken or attempted to meet a single investigative question or requirement.

    • Subject: The collection of all data related to a specific subject in the context of a specific experiment.

    • Harmonized Variables: A selection of different clinical properties from multiple nodes, defined by the Consortium.

    NOTE: The facet filters are based on the DCC Harmonized Variables, which are a selected subset of clinical data that have been transformed for compatibility across the dbGaP studies. TOPMed studies that do not contain harmonized clinical data at this time will be filtered out when a facet is chosen, unless the no data option is also selected for certain facets.

    hashtag
    Exporting Data from the Data Tab

    After a cohort has been selected, the user has four different options for exporting the data.

    hashtag
    Export

    The options for export are as follows:

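    • Export All to Terra: Initiate a Portable Format for Bioinformatics (PFB) export of all clinical data and file GUIDs for the selected cohort to BioData Catalyst powered by Terra. At this time, the maximum number of subjects that can be exported to Terra is 120,000.

    • Export All to Seven Bridges: Initiate a Portable Format for Bioinformatics (PFB) export of all clinical data and file GUIDs for the selected cohort to BioData Catalyst powered by Seven Bridges.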

    NOTE: PFB export times can take up to 60 minutes, but often will complete in less than 10 minutes.

    hashtag
    The Files Tab

    The Files tab displays study files from the facets chosen on the left-side panel (Project ID, Data Type, Data Format, Callset, and Bucket Path). Each time a facet selection is made, the data summary and displays will update to reflect the applied filters.

    hashtag
    Locating Unharmonized Clinical Data

    The Files tab also contains files that are either case-independent or project-level. This is important for files that are part of the Unharmonized Clinical Data category under the Data Type field. Unharmonized clinical files are made available in two distinct data formats:

    • TAR: Contains a complete directory of phenotypic datasets as XML and TXT files that are direct downloads of unharmonized clinical data from dbGaP on a study consent level project.

    • AVRO: Contains the same unharmonized clinical data from dbGaP as the TAR files, but in the form of a PFB file.

    NOTE: The unharmonized clinical data sets contain all data from the dbGaP study, but they are not cross-compatible across all studies within BDC.

    hashtag
    Exporting/Downloading Data from the Files Tab

    Once the user has selected a cohort, there are five options for accessing the files:

    • Download Manifest: Download the file manifest and use it to download the listed data files using the gen3-client.

    • Export to Workspace: The files can be exported to a Gen3 workspace.

    • Export All PFB
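Before handing a downloaded manifest to the gen3-client, it can be useful to see how many files it lists and how much data they represent. The sketch below is illustrative only. It assumes the manifest is a JSON array whose entries carry `object_id` and `file_size` fields; these field names are assumptions, so inspect your own manifest first. The function name is hypothetical.

```python
import json


def summarize_manifest(manifest_text: str) -> dict:
    """Summarize a file manifest given as JSON text.

    Assumption: the manifest is a JSON array of objects, each with an
    `object_id` (the file GUID) and, optionally, a `file_size` in bytes.
    Entries without `file_size` are counted as 0 bytes.
    """
    entries = json.loads(manifest_text)
    sizes = [int(entry.get("file_size", 0)) for entry in entries]
    return {
        "files": len(entries),
        "total_bytes": sum(sizes),
        "object_ids": [entry["object_id"] for entry in entries],
    }
```

Read the manifest file as text and pass it to `summarize_manifest`. The returned `total_bytes` gives a rough idea of the disk space the download will need.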

    hashtag
    File Information Page

    A user can visit the File Information Page by clicking on any of the available GUID links in the Files tab page. The page displays details such as data format, size, object_id, the last time the file was updated, and the md5sum. The page also contains a button to download the file via the browser (see below). For files that are 5 GB or more, we suggest using the gen3-client.
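The md5sum shown on the File Information Page can be used to confirm that a downloaded file arrived intact. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the expected value is whatever md5sum the page shows for your file.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path: str, expected_md5: str) -> bool:
    """Compare a local file against the md5sum copied from the
    File Information Page (case and surrounding whitespace ignored)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

If `matches_md5` returns `False`, the download is likely incomplete or corrupted and should be repeated.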

    hashtag
    Free text search for Submitter IDs and File Names

    Both the Data and Files tabs contain a text-based search function that shows a list of suggestions below the search bar while typing.

    In the Data tab, Submitter IDs can be searched under the Subject tab.

    In the Files tab, File Names can be searched under the File tab.

    Click one or more suggestions in the list underneath the search bar to create a cohort and export or download the data. Click a selection again to remove it from the cohort.

    However a patient will be assigned the same across related studies. For dbGaP to assign the same dbGaP subject ID, include the two variables, SUBJECT_SOURCE and SOURCE_SUBJECT_ID.

  • This identifier is used in all the phenotypic data files and is what we sequence to an HPDS Patient Num (Patient ID). All sequenced identifiers are stored in a PatientMapping file in S3. These mappings allow HPDS data to be correlated back to the raw data sets.

  • PIC-SURE HPDS UI: The main HPDS UI repository.

  • HPDS Annotation: This repository describes steps to prepare and annotate VCF files for loading into HPDS.

  • PIC-SURE Wiki
    BioData Catalyst PIC-SURE
    PIC-SURE-ALL-IN-ONE
    DCPPC Presentation on PIC-SURE as a meta-API
    Avillachlab-Jenkins Repository
    Avillachlab-Jenkins Dev Release Control
    R Client Library
    Python Client Library
    PIC-SURE HPDS UI
    PIC-SURE UI Flow
    PIC-SURE Auth MicroApp Repository
    PSAMA Core Logic
    PIC-SURE HPDS
    PIC-SURE HPDS Python Client
    PIC-SURE HPDS R Client

  • Required: This field will display whether the property is required for the submission of the node into the data model.

  • Description: This field will display further information about the property.

  • Term: This field can be populated with external resources that have further information about the property.

  • An example of a node being selected in the interactive graph view.
    An example of a second node being selected in the path of the first selected node.
    A node's property window.
    Table View of the Gen3 Data Dictionary.
    Opening the Properties in the Table View format.
    An example search for the term "study". Results appear under the search bar as you type.
    Search results for "study" in the Graph View.
    Search results are highlighted in orange.
    Clear the search results.
  • Projects will also be hidden if the selected cohort contains fewer than 50 subjects (50 ↓, "You may only view summary information for this project", example below); in this case, grayed-out boxes and locks both appear. An additional lock means users have no access.

  • Export to PFB: Initiate a PFB export of all clinical data and file GUIDs for the selected cohort to your local storage.
  • Export to Workspaces: Export a manifest to the user's workspace and make the case-associated data files available in the workspace under the /pd/data directory.

  • XML: These files contain either dictionary or variable reports of the phenotypic datasets that are in the TXT files. These supporting files contain study-level information, not subject-level information.

  • TXT: These files contain subject-level phenotypic datasets.

  • Export All PFB: Initiate a PFB export of the selected files.
  • Export All to Terra: Initiate a PFB export of the selected files to BioData Catalyst powered by Terra.

  • Export All to Seven Bridges: Initiate a PFB export of the selected files to BioData Catalyst powered by Seven Bridges.

  • GUID Download File Page: Aside from the five button options, users can download files by first clicking on the link(s) under the GUIDs column, followed by the Download button in the file information pages (see next section below).

  • DCC harmonized variables
    Portable Format for Bioinformatics (PFB)
    BioData Catalyst powered by Terra
    Portable Format for Bioinformatics (PFB)
    BioData Catalyst powered by Seven Bridges.
    PFB
    gen3-client
    gen3-client
    Data Access panel on the Exploration page.
    The view on the list of Projects when "Data without Access" is selected.
    Example: The variable of Ethnicity is hidden once the number of subjects falls below 50.
    A lock, a grayed-out box, and "50" signify that the number of subjects falls below 50 and that users have no access.
    Exploration page with Data Access displaying the Data with Access.
    Four options offered for data export.
    The Files Tab page.
    Five button options offered for file download or export.
    Download files by clicking on the link located under the GUID column.
    An example file information page with the Download button.
    Free text search of Submitter IDs in Subject on the Data Tab.
    Free text search of File Names on the File Tab.
    Select multiple suggestions to create an exportable cohort.
    gen3-client
    dbGaP
    API key creation pop-up window
    PIC-SURE HPDS UI
    HPDS Annotation
    dbGaP homepage
    PFB
    PFB
    PFB
    BioData Catalyst powered by Terra
    PFB
    BioData Catalyst powered by Seven Bridges.