Overview of the Profile page on the BioData Catalyst Powered by Gen3
The Profile page contains two sections: API keys and Project access.
To download large amounts of data, you need an API key for the gen3-client. To create a key, click Create API key, which opens a pop-up window.
Click Download json to save the credential file to your local machine. A new entry will then appear in the API key(s) section of the Profile page, displaying the API key's key_id and its expiration date (one month after key creation). Delete the key after it has expired. If you believe that your API key has been compromised, delete it and then create a new one.
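Because keys expire one month after creation, it can help to check a saved credential file before starting a long download. The following is a minimal sketch, not part of the gen3-client. It assumes that the downloaded JSON file holds the fields api_key and key_id, and that api_key is a JSON Web Token whose payload carries an exp claim in Unix seconds; both are assumptions to verify against your own file.

```python
import base64
import json
import time
from typing import Optional


def api_key_expiry(credentials: dict) -> int:
    """Return the 'exp' claim (Unix seconds) of the token stored under 'api_key'.

    Assumption: 'api_key' is a JWT, i.e. three base64url segments joined by dots.
    """
    payload_b64 = credentials["api_key"].split(".")[1]
    # JWT segments omit base64 padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(payload["exp"])


def is_expired(credentials: dict, now: Optional[float] = None) -> bool:
    """Return True if the key's expiration time has passed."""
    current = time.time() if now is None else now
    return current >= api_key_expiry(credentials)


def load_credentials(path: str) -> dict:
    """Read the credential file saved with the Download json button."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, is_expired(load_credentials("credentials.json")) returns True once the key should be deleted and replaced; the file name here is only a placeholder.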
The Project access section of the Profile page lists the projects you can access and the methods of access for the data within the Gen3 BioData Catalyst system. If you do not see access to a specific study, check that you have been granted access within dbGaP. If access was granted more than a week ago and the study is still missing, contact the BioData Catalyst Help Desk: bdcat-support@datacommons.io
Overview of the Query page on BioData Catalyst Powered by Gen3
The Query page can search and return metadata from either the Flat Model or the Graph Model of a commons. Using GraphQL, these searches can be tailored to filter and return fields of interest for the data sets being queried. Because the Query page queries the model directly, searches can be run immediately after data submission.
For more information about how to use the Query page, refer to the Gen3 documentation.
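As an illustration of how a GraphQL search can be tailored to a node and a set of fields, here is a small sketch in Python that only builds the query text. The node name subject, the fields submitter_id and project_id, and the first argument are assumptions about the data model made for this example; check the Dictionary page for the nodes and properties that actually exist.

```python
import json


def build_graph_query(node, fields, first=10, **filters):
    """Build a GraphQL request body that selects `fields` from `node`.

    `filters` become query arguments; string values are quoted with json.dumps.
    """
    args = [f"first: {first}"]
    for name, value in filters.items():
        args.append(f"{name}: {json.dumps(value)}")
    selection = " ".join(fields)
    query = f"{{ {node}({', '.join(args)}) {{ {selection} }} }}"
    return {"query": query}


if __name__ == "__main__":
    payload = build_graph_query(
        "subject",                      # hypothetical node name
        ["submitter_id", "project_id"],  # hypothetical field names
        first=5,
        project_id="tutorial-synthetic_data_set_1",
    )
    # The query string can be pasted into the Query page editor.
    print(payload["query"])
```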
Overview of current projects hosted on BioData Catalyst Powered by Gen3, including their dependencies, characteristics, and relationships.
A list of current project IDs can be found in the Data tab, under Filters>Project>Project Id. The current project IDs are:
Parent
TOPMed
Open_Access
Tutorial
The Parent and TOPMed study types have been categorized on Gen3 by their Program designation. An example of this designation by Program is presented below.
The Program types can be further identified by whether there is an underscore (_) at the end of the study name:
Parent studies will include an underscore at the end of the study name.
Example: parent-WHI_HMB-IRB_
TOPMed studies will not include an underscore at the end of the study name.
Example: topmed-BioMe_HMB-NPU
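The naming convention above can be checked programmatically. The sketch below is only an illustration of the rule as stated here: it splits a project ID at the first hyphen into a program prefix and a study name, then uses the trailing underscore to tell Parent from TOPMed studies. The function name is hypothetical, and the rule is not meant for the Open_Access or Tutorial programs.

```python
def describe_study(project_id):
    """Split a project ID into program and study, and apply the underscore rule."""
    program, _, study = project_id.partition("-")
    # Parent study names end with an underscore; TOPMed study names do not.
    type_by_suffix = "Parent" if study.endswith("_") else "TOPMed"
    return {"program": program, "study": study, "type_by_suffix": type_by_suffix}


if __name__ == "__main__":
    for example in ("parent-WHI_HMB-IRB_", "topmed-BioMe_HMB-NPU"):
        print(example, "->", describe_study(example)["type_by_suffix"])
```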
There are three distinct relationships possible between Parent and TOPMed studies. The first two relationships are streamlined:
Parent only: The Parent study does not have a TOPMed counterpart study. This usually means that there are no genomic data, such as WXS (whole exome sequencing) or WGS (whole genome sequencing), located within the study; only phenotypic data.
TOPMed only: This TOPMed study does not have a Parent counterpart study. These studies will contain both genomic data, WXS or WGS, and phenotypic data.
Parent study with a counterpart TOPMed study: The Parent study will contain the phenotypic data, while the TOPMed study will contain the genomic data. Under dbGaP, these studies would be kept separate from one another and the user would need to create the linkages. In the Gen3 platform, these studies have been linked together under the Parent study, based on the participant IDs found in dbGaP. Because the linked study combines phenotypic and genomic data, it can be used to create cohorts that draw on both.
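To make the idea of linking on participant IDs concrete, the following sketch joins two small in-memory tables on a shared ID. It is not how the Gen3 platform performs the linkage; the data structures, IDs, variable names, and file names are all hypothetical.

```python
def link_by_participant(phenotypes, genomic_files):
    """Combine phenotypic and genomic records that share a participant ID.

    phenotypes:    participant ID -> dict of phenotypic variables
    genomic_files: participant ID -> list of genomic file names
    """
    linked = {}
    for participant_id, variables in phenotypes.items():
        linked[participant_id] = {
            "phenotypes": variables,
            # Participants without genomic data get an empty file list.
            "files": genomic_files.get(participant_id, []),
        }
    return linked


if __name__ == "__main__":
    parent = {"P001": {"age": 61}, "P002": {"age": 47}}   # hypothetical Parent data
    topmed = {"P001": ["P001.cram", "P001.vcf.gz"]}        # hypothetical TOPMed data
    print(link_by_participant(parent, topmed))
```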
The most notable difference between the Program categories is the type of hosted data.
Parent studies:
Genomic data: None
Phenotypic data: As with TOPMed studies, the only phenotypic data found within the Graph Model are DCC harmonized variables. The raw phenotypic data from dbGaP can be found in the reference_file node.
TOPMed studies:
Genomic data: Available data can include CRAM files, VCF files, and cohort-level VCF files.
Phenotypic data: TOPMed studies without an associated Parent study will include phenotypic data in the data graph by way of DCC harmonized variables. Additionally, raw phenotypic data from dbGaP can be found in the reference_file node as tar files that share this common naming scheme: RootStudyConsentSet_phs######.<study_shorthand>.v#.p#.c#.<consent_codes>.tar.gz
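The naming scheme can be parsed with a regular expression. The sketch below assumes that phs is followed by exactly six digits, that v, p, and c are each followed by one or more digits, and that the study shorthand contains no periods. The group names and the sample file name are made up for illustration.

```python
import re

# One named group per component of the naming scheme.
TAR_NAME = re.compile(
    r"^RootStudyConsentSet_"
    r"(?P<accession>phs\d{6})\."
    r"(?P<study_shorthand>[^.]+)\."
    r"v(?P<version>\d+)\."
    r"p(?P<p_number>\d+)\."
    r"c(?P<c_number>\d+)\."
    r"(?P<consent_codes>.+)\.tar\.gz$"
)


def parse_tar_name(file_name):
    """Return the components of a file name as a dict, or None if it does not match."""
    match = TAR_NAME.match(file_name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Hypothetical file name that follows the scheme.
    print(parse_tar_name("RootStudyConsentSet_phs000000.ExampleStudy.v1.p1.c1.HMB-IRB.tar.gz"))
```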
The 1000 Genomes Project is an international research effort (2008-2015) to establish the most detailed catalogue of human variation and genotype data. On the Gen3 platform, the Program open_access contains:
Genotypic data: Available data can include CRAM and VCF files.
Phenotypic data: The data graph will contain phenotypic data by way of DCC harmonized variables. Additionally, raw phenotypic data can be found in the reference_file node as VCF and TXT files.
This program contains genomic data from 1000 Genomes and synthetic clinical data generated by Terra. The purpose of this dataset is to serve as a genome-wide association study (GWAS) tutorial. GWAS is an approach used in genetics research to associate specific genetic variations with particular diseases. For more information, see Terra Tutorials.
On the Gen3 platform, the Program tutorial contains:
Genotypic data: Available data can include CRAM and VCF files.
Phenotypic data: The data graph will contain phenotypic data by way of DCC harmonized variables. Additionally, raw phenotypic data can be found in the reference_file node as VCF and GDS files.
Overview of Workspaces on BioData Catalyst Powered by Gen3
When navigating to a Workspace, users are presented with multiple workspace options.
The Gen3 platform offers two workspace environments: Jupyter Notebooks and R Studio.
There are six workspaces:
Virtual machines (VM):
Small Jupyter Notebook VM
Large Jupyter Notebook Power VM
R Studio VM
Pre-made workflow workspaces:
Autoencoder Demo
CIP Demo
Tensorflow-Pytorch.
To start a workspace, select Launch. A loading screen is shown while the workspace launches.
Launching a VM can take up to five minutes depending on the size and complexity of the workspace.
Once the VM is ready, the initial screen for the workspace will appear. Store any scripts and output that need to be kept after the workspace is terminated in the pd/ directory.
The workspace will persist after the user has logged out of the Gen3 BioData Catalyst system. If the workspace is no longer being used, terminate it by selecting Terminate Workspace at the bottom of the window. You will be returned to the Workspace page with all of the workspace options.
For more information about the Gen3 Workspace, refer to Data Analysis in a Gen3 Data Commons.
An explanation for the Exploration page on BioData Catalyst Powered by Gen3
The Exploration page, located in the upper right-hand section of the toolbar, allows users to search through data and create cohorts. The Exploration portal contains a dynamic summary statistics display, as well as search facets leveraging the DCC Harmonized Variables.
Users can navigate through data on the Exploration page by selecting any of the three Data Access categories.
Data with Access: A user can view all of the summary data and associated study information for studies the user has access to, including but not limited to Project ID, file types, and clinical variables.
Data without Access:
Locks next to the project ID indicate that the user does not have subject-level access. The user can still search through these studies but can only view summary statistics. Users can request access to data by visiting the dbGaP homepage.
Projects will also be hidden if the selected cohort contains fewer than 50 subjects (shown as 50 ↓, with the message "You may only view summary information for this project"); in this case, grayed-out boxes and locks both appear. An additional lock means users have no access.
All Data: Users can view all of the data available in the BioData Catalyst Gen3 platform, including studies with and without access. As a result, studies not available to a user will be shown as locked.
By default, all users visiting the Exploration page will be assigned to Data with Access.
Under the "Data" tab, users can leverage the DCC harmonized variables to create custom cohorts. When facets are selected or updated to cover a desired range of values, the display will reflect the information relevant to the newly applied filter. If no facets have been selected, all of the data accessible to the user will be displayed. At this time, a user can filter based on three categories of clinical information:
Project: Any specifically defined piece of work that is undertaken or attempted to meet a single investigative question or requirement.
Subject: The collection of all data related to a specific subject in the context of a specific experiment.
Harmonized Variables: A selection of different clinical properties from multiple nodes, defined by the Consortium.
NOTE: The facet filters are based on the DCC Harmonized Variables, which are a selected subset of clinical data that have been transformed for compatibility across the dbGaP studies. TOPMed studies that do not contain harmonized clinical data at this time will be filtered out when a facet is chosen, unless the no data option is also selected for certain facets.
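The effect of the no data option can be pictured with a small filter function. This is a sketch of the behavior described in the note, not the portal's implementation; the subject records and the facet name are hypothetical.

```python
def filter_subjects(subjects, facet, selected_values, include_no_data=False):
    """Keep subjects whose value for `facet` is one of `selected_values`.

    Subjects with no value for the facet are dropped unless include_no_data is True.
    """
    kept = []
    for subject in subjects:
        value = subject.get(facet)
        if value is None:
            if include_no_data:
                kept.append(subject)
        elif value in selected_values:
            kept.append(subject)
    return kept


if __name__ == "__main__":
    cohort = [
        {"id": "S1", "sex": "female"},
        {"id": "S2", "sex": "male"},
        {"id": "S3"},  # study without harmonized data for this facet
    ]
    print(filter_subjects(cohort, "sex", {"female"}))
    print(filter_subjects(cohort, "sex", {"female"}, include_no_data=True))
```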
After a cohort has been selected, the user has four options for exporting the data:
Export All to Terra: Initiate a Portable Format for Bioinformatics (PFB) export of all clinical data and file GUIDs for the selected cohort to BioData Catalyst powered by Terra. At this time, the maximum number of subjects that can be exported to Terra is 120,000.
Export All to Seven Bridges: Initiate a PFB export of all clinical data and file GUIDs for the selected cohort to BioData Catalyst powered by Seven Bridges.
Export to PFB: Initiate a PFB export of all clinical data and file GUIDs for the selected cohort to your local storage.
Export to Workspaces: Export a manifest to the user's workspace and make the case-associated data files available in the workspace under the /pd/data directory.
NOTE: PFB exports can take up to 60 minutes, but often complete in less than 10 minutes.
The Files tab displays study files from the facets chosen on the left-side panel (Project ID, Data Type, Data Format, Callset, and Bucket Path). Each time a facet selection is made, the data summary and displays will update to reflect the applied filters.
The Files tab also contains files that are either case-independent or project-level. This is important for files that are part of the Unharmonized Clinical Data category under the Data Type field. Unharmonized clinical files are made available in two distinct data formats:
TAR: Contains a complete directory of phenotypic datasets as XML and TXT files that are direct downloads of unharmonized clinical data from dbGaP at the study consent level.
AVRO: Contains the same unharmonized clinical data from dbGaP as the TAR files, but in the form of a PFB file.
The XML and TXT files within a TAR file differ as follows:
XML: These files contain either dictionary or variable reports for the phenotypic datasets in the TXT files. These supporting files contain study-level information, not subject-level information.
TXT: These files contain subject-level phenotypic datasets.
NOTE: The unharmonized clinical data sets contain all data from the dbGaP study, but they are not cross-compatible across all studies within BioData Catalyst.
Once the user has selected a cohort, there are five options for accessing the files:
Download Manifest: Download the file manifest and use it to download the listed data files with the gen3-client.
Export to Workspace: Export the files to a Gen3 workspace.
Export All PFB: Initiate a PFB export of the selected files.
Export All to Terra: Initiate a PFB export of the selected files to BioData Catalyst powered by Terra.
Export All to Seven Bridges: Initiate a PFB export of the selected files to BioData Catalyst powered by Seven Bridges.
GUID Download File Page: Aside from the five button options, users can download files by first clicking on the link(s) under the GUIDs column, and then clicking the Download button on the File Information Page.
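Before handing a downloaded manifest to the gen3-client, it can be useful to see how many files it lists and how much data they represent. The sketch below assumes the manifest is a JSON array of objects with an object_id field and an optional file_size field in bytes; these field names are assumptions, so compare them with your own manifest file.

```python
import json


def load_manifest(path):
    """Read a manifest file saved with the Download Manifest button."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_manifest(entries):
    """Count the files in a manifest and add up their sizes in bytes."""
    total_bytes = sum(int(entry.get("file_size", 0)) for entry in entries)
    return {
        "files": len(entries),
        "total_bytes": total_bytes,
        "object_ids": [entry["object_id"] for entry in entries],
    }


if __name__ == "__main__":
    # Hypothetical entries shaped like the assumed manifest format.
    example = [
        {"object_id": "guid-1", "file_name": "a.cram", "file_size": 1200},
        {"object_id": "guid-2", "file_name": "a.vcf.gz", "file_size": 300},
    ]
    print(summarize_manifest(example))
```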
A user can visit the File Information Page by clicking on any of the available GUID link(s) in the Files tab. The page will display details such as the data format, size, object_id, the last time the file was updated, and the md5sum. The page also contains a button to download the file via the browser. For files that are 5 GB or larger, we suggest using the gen3-client.
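The md5sum shown on the File Information Page can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard hashlib module; the function names are illustrative, and the file is read in chunks so that large files do not need to fit in memory.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected):
    """Compare a file's digest with the md5sum copied from the page."""
    return md5sum(path) == expected.strip().lower()
```

For example, matches_md5("downloaded_file.cram", "<md5sum from the page>") returns True when the local copy matches; the file name is a placeholder.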
Both the Data and Files tabs contain a text-based search function that displays a list of suggestions below the search bar while typing.
In the Data tab, Submitter IDs can be searched under the Subject tab.
In the Files tab, File Names can be searched under the File tab.
Click one or more suggestions in the list underneath the search bar to create a cohort and export or download the data. Click a selection again to remove it from the cohort.
Overview of the Portable Format for Bioinformatics (PFB) file type
A Portable Format for Bioinformatics (PFB) file allows users to transfer both the metadata from the Data Dictionary and the Data Dictionary itself. As a result, data can be transferred while keeping the structure from the original source. Specifically, a PFB consists of three parts:
A schema
Metadata
Data
For more information and an in-depth review that includes Python tools for PFB creation and exploration, refer to the and install the newest version.
Note: The following PFB example is a direct PFB export from the tutorial-synthetic_data_set_1 found on . Due to the large amount of data stored within PFB files, only small sections are shown, with breaks (displayed as ...) occurring in the output.
A schema is a JSON formatted Data Dictionary containing information about the properties, such as value types, descriptions, and so on.
To view the PFB schema, use the following command:
Example Output
The metadata in a PFB contains all of the information explaining the linkage between nodes and external references for each of the properties.
To view the PFB metadata, use the following command:
Example Output
The data in the PFB are the values for the properties in the format of the Data Dictionary.
To view the data within the PFB, use the following command:
To view a certain number of entries in the PFB file, use the -n flag to designate a number. For example, to view the first 10 data entries within the PFB, use the following command:
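The invocations described in this section can also be scripted. The sketch below builds pfb show command lines in Python and runs them with subprocess. The schema form mirrors the example given elsewhere in this document (pfb show -i PFB_file.avro schema); the metadata sub-command and the placement of the -n flag are assumptions based on the descriptions above, and running the commands requires the pfb tool to be installed.

```python
import subprocess


def pfb_show_command(pfb_path, section=None, limit=None):
    """Build the argument list for a `pfb show` call.

    section: "schema" or "metadata" (assumed sub-command names), or None for data.
    limit:   number of data entries to show, passed with -n.
    """
    command = ["pfb", "show", "-i", pfb_path]
    if limit is not None:
        command += ["-n", str(limit)]
    if section is not None:
        command.append(section)
    return command


def run_pfb_show(pfb_path, section=None, limit=None):
    """Run the command and return its standard output as text."""
    result = subprocess.run(
        pfb_show_command(pfb_path, section, limit),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Print the command lines without executing them.
    print(" ".join(pfb_show_command("PFB_file.avro", section="schema")))
    print(" ".join(pfb_show_command("PFB_file.avro", section="metadata")))
    print(" ".join(pfb_show_command("PFB_file.avro", limit=10)))
```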
Example Output
Interactive Data Dictionary on BioData Catalyst Powered by Gen3
The Dictionary page contains an interactive visual representation of the Gen3 data model. The default graph model view, as pictured below, displays all of the nodes and relationships between nodes in a hierarchical structure. The model further specifies the node types and links between nodes, as highlighted in the legend located at the top right side of the page.
Users can click on any of the graph nodes in order to learn more about their respective properties. By clicking on a node, the graph will highlight that specific node and all associated links that connect it to the Program node. A "Data Model Structure" list will also appear on the left side toolbar. This will display the node path required to reach the selected node from the Program node.
When a second node in the path is selected, it will then gray out the other possible paths and only highlight the selected path. It will also change the "Data Model Structure" list on the left side toolbar.
The left side toolbar has two options available:
Open properties: Opens the node properties in a new pop-up window. This option can also be found on the node that was first selected.
Download templates: Downloads the submission files for all the nodes in the "Data Model Structure" list.
This property view will display all properties in the node and information about each property:
Property: Name of the property.
Type: The type of input for the property. Examples are string, integer, Boolean, and enumerated values (enum), which are displayed as preset strings.
Required: This field will display whether the property is required for the submission of the node into the data model.
Description: This field will display further information about the property.
Term: This field can be populated with external resources that have further information about the property.
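The property view can be thought of as a table derived from a node definition. The sketch below assumes a JSON-schema-like node with a properties mapping and a required list, which is an assumption about how the dictionary is stored; the node and its properties are invented for illustration.

```python
def property_rows(node):
    """Turn a node definition into rows with the columns of the property view."""
    required = set(node.get("required", []))
    rows = []
    for name, spec in node.get("properties", {}).items():
        if "enum" in spec:
            # Enumerated values are shown as their preset strings.
            type_label = "enum: " + ", ".join(spec["enum"])
        else:
            type_label = spec.get("type", "")
        rows.append({
            "Property": name,
            "Type": type_label,
            "Required": name in required,
            "Description": spec.get("description", ""),
            "Term": spec.get("term", ""),
        })
    return rows


if __name__ == "__main__":
    # Hypothetical node definition.
    example_node = {
        "required": ["submitter_id"],
        "properties": {
            "submitter_id": {"type": "string", "description": "Identifier for the record."},
            "sex": {"enum": ["female", "male"], "description": "Reported sex."},
        },
    }
    for row in property_rows(example_node):
        print(row)
```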
The Table view is similar to the Properties view, and nodes are displayed as a list of entries grouped by their node category.
Clicking on one of the nodes will open the Properties view of the node.
The Dictionary contains a text-based search function that will search through the names of the properties and the descriptions. While typing, a list of suggestions appears below the search bar. Click on a suggestion to search for it.
When the search function is used, it will default to the graph model and highlight nodes that contain the search term. Frames around the node boxes indicate whether the searched word was identified in the name of the node (full line) or in the node's description and properties' names/descriptions (dashed line).
Clicking on one of these nodes displays only the properties that contain the keyword in either the property name or the description.
Click Clear Search Result to clear the free text search if needed.
The search history is saved below the search bar in the "Last Search" list. Click on an item here to display the results again.
NOTE: To make the PFB outputs more human-readable, the information was piped through the program jq. Example: pfb show -i PFB_file.avro schema | jq
How to log in to the NHLBI BioData Catalyst Gen3 platform and view available genomic and phenotypic data.
In order to navigate and access data available on the Gen3 platform, start by visiting the login page. You will need an eRA Commons account as well as access permissions through the Database of Genotypes and Phenotypes (dbGaP). If you are a researcher, log in by selecting NIH Login and using your eRA Commons account. BioData Catalyst consortia developers can log in using their Google accounts. Make sure to use the login method that has access to your available projects.
Once logged in, your username will appear in the upper right-hand corner of the page. You will also see a display with aggregate statistics for the total number of subjects, studies, aliquots and files available within the BioData Catalyst platform.
NOTE: These numbers may differ from those displayed in the dbGaP records as they include TOPMed studies as well as the associated parent studies.
A number of clinical variables have been harmonized by the Data Coordinating Center (DCC) in order to facilitate cross-study analysis. Faceted search over the DCC Harmonized Variables is available via the Exploration page, under the "Data" tab.
Unharmonized clinical files are also available on the Gen3 platform and contain all of the raw phenotypic information for the hosted studies. Unlike the DCC Harmonized Variables, these files are located and searchable under the "Files" tab in the Exploration page.
The Gen3 platform hosts genomic data provided by the Trans-Omics for Precision Medicine (TOPMed) program and the 1000 Genomes Project, plus synthetic tutorial data from Terra. At present, these projects include CRAM and VCF files together with their respective index files. Specifically for TOPMed projects, each project will contain at least one multi-sample VCF that comprises all subjects within the consent group. CRAM and VCF files are provided at the individual level, whereas multi-sample VCFs are provided at the study consent level.
All files are available under the "Files" tab in the Exploration page. More detailed information on currently hosted data on the Gen3 platform can be found here.
The BioData Catalyst Gen3 platform contains five pages described below:
Dictionary: An interactive data dictionary display that details the contents and relationships between clinical and biospecimen data
Exploration: The facet filter custom cohort creation tool
Query: The GraphQL query tool to retrieve specific data within the graph model
Workspace: The launch page for Gen3 workspaces that includes Jupyter Notebooks and RStudio
Profile: The information page for each user, displaying access and the location for credential file downloads