Welcome to the NHLBI BioData Catalyst ecosystem and thank you for joining our community of practice. The ecosystem offers secure workspaces to support your data analysis, along with a number of bioinformatics tools. The ecosystem currently hosts datasets from the Trans-Omics for Precision Medicine (TOPMed) program. There is a lot of information to understand and many resources (documentation, learning guides, videos, etc.) available, so we developed this overview to help you get started. If you have additional questions, please use the links at the very end of this document, under the "Questions" section, to contact us.
What is NHLBI BioData Catalyst?
The NHLBI BioData Catalyst ecosystem is a cloud-based platform providing tools, applications, and workflows in secure workspaces. It is designed to be nimble and responsive to the ever-changing conditions of the biomedical and data science communities. Though the primary goal of the BioData Catalyst project is to build a data science ecosystem, at its core, this is a people-centric endeavor. BioData Catalyst is also building a community of practice working collaboratively to solve technical and scientific challenges.
What are we doing and why does it matter?
By increasing access to the NHLBI’s datasets and innovative data analysis capabilities, the NHLBI BioData Catalyst ecosystem accelerates efficient biomedical research that drives discovery and scientific advancement, leading to novel diagnostic tools, therapeutics, and prevention strategies for heart, lung, blood, and sleep disorders.
Who is developing NHLBI BioData Catalyst?
The ecosystem is funded by the National Heart, Lung, and Blood Institute (NHLBI). The researchers and other professionals who receive NHLBI funding to develop the ecosystem are collectively referred to as “The NHLBI BioData Catalyst Consortium” or “The Consortium” for short. You can find a list of partners and platforms powering the ecosystem on the about page of the project’s website, and a list of the principal investigators is available in our documentation.
Learn about our culture.
The NHLBI BioData Catalyst community follows a statement of conduct that reflects the Consortium’s dedication to providing a harassment-free experience for everyone.
Find out the meanings of our terms and acronyms.
Like many professional communities, NHLBI BioData Catalyst has adopted terms to help us communicate quickly and efficiently, but that can be a challenge for newcomers. To help, we created the NHLBI BioData Catalyst glossary of terms and acronyms. If you come across an unfamiliar ecosystem term or acronym that isn’t in the glossary, please contact us so we can give you the information and add it to the glossary for future newcomers.
Learn about the platforms and services available in the ecosystem.
The NHLBI BioData Catalyst ecosystem features the following platforms and services.
Explore Available Data
BioData Catalyst Powered by Gen3 - Hosts genomic and phenotypic data and enables faceted search for authorized users to create and export cohorts to workspaces in a scalable, reproducible, and secure manner.
BioData Catalyst Powered by PIC-SURE - Enables access to all clinical data, supports feasibility queries, and allows cohorts to be built in real time and results to be exported via the API for analysis.
Analyze Data in Cloud-based Shared Workspaces
BioData Catalyst Powered by Seven Bridges - Collaborative workspaces where researchers can find and analyze hosted datasets (e.g. TOPMed) as well as their own data by using hundreds of optimized analysis tools and workflows in CWL, as well as JupyterLab and RStudio for interactive analysis.
BioData Catalyst Powered by Terra - Secure collaborative place to organize data, run and monitor workflow (e.g. WDL) analysis pipelines, and perform interactive analysis using applications such as Jupyter Notebooks and the Hail GWAS tool.
Use Community Tools on Controlled-access Datasets
Dockstore - Catalog of Docker-based workflows (from individuals, labs, organizations) that export to Terra or Seven Bridges.
The NHLBI BioData Catalyst website provides further details about the platforms and services available in the ecosystem. We encourage you to create accounts on all the platforms as you get to know BioData Catalyst.
How does data access work?
The BioData Catalyst ecosystem manages access to the hosted controlled data using data access approvals from the NIH Database of Genotypes and Phenotypes (dbGaP). Therefore, users who want to access a hosted controlled study on the ecosystem must be approved for access to that study in dbGaP.
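The approval rule above can be pictured as a simple set-membership check. The Python sketch below is only an illustration under stated assumptions, not the ecosystem's actual implementation: the function name `accessible_studies`, the study identifiers, and the data structures are all hypothetical.

```python
def accessible_studies(hosted_studies, dbgap_approvals):
    """Return the hosted studies a user may open, sorted by identifier.

    hosted_studies: iterable of identifiers for studies hosted on the ecosystem.
    dbgap_approvals: iterable of identifiers the user is approved for in dbGaP.
    Both inputs are hypothetical stand-ins used for illustration only.
    """
    approved = set(dbgap_approvals)
    # A hosted controlled study is visible only if the user holds an approval for it.
    return sorted(study for study in hosted_studies if study in approved)


# Made-up identifiers; real studies are identified by dbGaP phs numbers.
hosted = ["STUDY_A", "STUDY_B", "STUDY_C"]
approvals = ["STUDY_B", "STUDY_Z"]

print(accessible_studies(hosted, approvals))  # prints ['STUDY_B']
```

In this sketch, an approval for a study that is not hosted (`STUDY_Z`) has no effect, and a hosted study without an approval (`STUDY_A`, `STUDY_C`) stays inaccessible.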
How do I log in?
Users log into BioData Catalyst platforms with their eRA Commons credentials (see Understanding eRA Commons Accounts), and authentication is performed by iTrust. Every time a user logs in, the ecosystem checks their credentials to ensure they can access only the data for which they have dbGaP approval.
While all of the platforms within BioData Catalyst use eRA Commons credentials and rely on iTrust for authentication, there are some slight differences between the platforms when getting set up:
BioData Catalyst Powered by Gen3 - Users do not set up usernames on Gen3. When logging in for the first time, select “Login from NIH”, then enter your eRA Commons credentials at the prompt. This ‘User Identity’ is used to track the user on the system.
BioData Catalyst Powered by PIC-SURE - As with Gen3, user identities are used: researchers log in to the system by selecting “Log in with eRA Commons.”
BioData Catalyst Powered by Seven Bridges - Users set up platform accounts. The first time on the system, users select “Create an account” and then enter their eRA Commons credentials. The user is then prompted to fill out a registration form with their name, email, and preferred username. Users are also asked to acknowledge that they have read the Privacy Act notice, after which they can proceed to the platform.
BioData Catalyst Powered by Terra - Users initially log in using Google credentials and are asked to agree to the Terms of Service and Privacy Act notice. User activity is tracked via the Google credentials, but users can link their eRA Commons credentials to the account to get access to hosted datasets.
Details about how data access works on the NHLBI BioData Catalyst ecosystem are on the website.
How do I check which data I can access?
We recommend that users check their access to data before logging in. To do this, go to the Accessing BioData Catalyst page and click the “Check My Access” button. Once you confirm your data access, go to the Platforms and Services page and click the “Launch” hyperlink for the platform or service you wish to use. Each platform and service has a login or sign-in link that takes you to the page where you enter your eRA Commons credentials. Documentation on checking your access to data is also available.
What data are available in the ecosystem?
NHLBI BioData Catalyst currently hosts a subset of datasets from TOPMed, including phs numbers with genomic data and related phs numbers with phenotype data. You can find information about which TOPMed studies are currently hosted on the Data page of the website as well as in the Release Notes.
Harmonized data available.
A limited amount of harmonized data is available to users with appropriate access at this time. The TOPMed Data Coordinating Center curation team has produced forty-four (44) harmonized phenotype variables from seventeen (17) NHLBI studies. Information about the 17 studies and the 44 variables can be found in the BioData Catalyst Powered by PIC-SURE User Guide.
Bring your own data and workflows into the system.
We allow researchers to bring their own data and workflows into the ecosystem to support their analysis needs. Researchers can bring their own datasets into BioData Catalyst Powered by Seven Bridges and BioData Catalyst Powered by Terra. Users can also bring their own workflows to the system. Users can either add workflows to Dockstore in CWL or WDL, or they can create CWL tools directly on BioData Catalyst Powered by Seven Bridges and develop custom workflows for use on BioData Catalyst Powered by Terra.
Learn about genome-wide association studies and genetic association testing on BioData Catalyst.
Walk through our self-paced genome-wide association study and genetic association testing tutorials.
Share your workflows.
We encourage users to publish their workflows so they can be used by other researchers working in the NHLBI BioData Catalyst ecosystem. Share your workflows via Dockstore.
Costs and cloud credits.
BioData Catalyst hosts a number of datasets available for analysis to users with appropriate data access approvals. Users are not charged for the storage of these hosted datasets; however, if hosted data are used in analyses, users incur costs for computation and for storage of derived results. Cloud credits are available on the system, and you can learn more here.
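To make the cost rule concrete, here is a minimal Python sketch of how a user-facing charge could be estimated. It is an illustration only: the function name `estimate_user_cost`, its parameters, and all rates are hypothetical placeholders, not actual platform prices or an official calculator.

```python
def estimate_user_cost(compute_hours, derived_storage_gb_months,
                       compute_rate, storage_rate):
    """Estimate charges for one analysis.

    Storage of the hosted input datasets is deliberately absent from the
    calculation because users are not charged for it. All rates are
    caller-supplied placeholders, not real prices.
    """
    values = (compute_hours, derived_storage_gb_months, compute_rate, storage_rate)
    if min(values) < 0:
        raise ValueError("inputs must be non-negative")
    compute_cost = compute_hours * compute_rate
    storage_cost = derived_storage_gb_months * storage_rate
    return compute_cost + storage_cost


# Hypothetical numbers chosen only to make the arithmetic easy to follow.
total = estimate_user_cost(compute_hours=10, derived_storage_gb_months=50,
                           compute_rate=0.5, storage_rate=0.02)
print(f"Estimated cost: ${total:.2f}")
```

With these made-up inputs, computation contributes 10 × 0.5 and derived-result storage contributes 50 × 0.02; the size of the hosted input data does not enter the estimate at all.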
Let us know about your publications and see how you can cite us.
If you are writing a manuscript about research you conducted using NHLBI BioData Catalyst, please use the citation available here.
Immediately after learning your manuscript has been accepted, please email [email protected] to let us know. Please include in your email the manuscript title, the name of the publication that accepted your manuscript, and information about pre-publication posting (if it will take place), along with your name and contact information.
Learn more, ask questions, or request help.
Answers to frequently asked questions are available on the website, as are many resources that can be found under Learn & Support. You can also use a form to Contact Us, and if you aren’t sure which selections to make on the form, please see our help desk directory.