Data Access

Data Interoperability

How to access additional data stacks

GTEx Data

The Genotype-Tissue Expression (GTEx) Program is a widely used data resource and tissue bank to study the relationship between genetic variants (inherited changes in DNA sequence) and gene expression (how genes are turned on and off) in multiple human tissues and across individuals. For information on access to GTEx data, refer to GTEx v8 - Free Egress Instructions as part of the AnVIL documentation.

NCPI Data Portal

The NIH Cloud Platform Interoperability Effort (NCPI) is currently working to establish and implement guidelines and technical standards to empower end-user analyses across participating cloud platforms and facilitate the realization of a trans-NIH, federated data ecosystem. Participating institutions include BioData Catalyst, AnVIL, Cancer Research Data Commons, and Kids First Data Resource Center. Learn what data is currently hosted by these platforms by using the NCPI Data Portal.

Understanding Access

This checklist is intended to help new users understand their obligations regarding access and permissions, which individual users are responsible for obtaining and maintaining.

eRA Commons Account

Users log into BioData Catalyst platforms with their eRA Commons credentials. For more information, see Ecosystem Access, Hosted Data, and System Services.

Users create an eRA Commons Account through their institution's Office of Sponsored Research or equivalent. For more information, refer to Understanding eRA Commons Accounts.

dbGaP

Users who want to access a hosted controlled study on the BioData Catalyst ecosystem must be approved for access to that study in the NIH Database of Genotypes and Phenotypes (dbGaP). For more information, see Ecosystem Access, Hosted Data, and System Services and BioData Catalyst FAQs. Note that obtaining these approvals can be a time-intensive process; failure to obtain them in a timely manner may delay data access.

Users have two options for obtaining dbGaP approval depending on whether they already are affiliated with a PI who has dbGaP access to the relevant data:

The BioData Catalyst user has no affiliation with an existing dbGaP-approved project. In this case, the user needs to create their own dbGaP project and then submit a Data Access Request (DAR) for approval by the NHLBI Data Access Committee (DAC). This process often takes 2-6 months, depending on whether the requested dataset requires local IRB approval, how long the user’s home institution takes to review the dbGaP application, and how long the DAC takes to process the request. See the dbGaP Authorized Access Portal or dbGaP Overview: Requesting Controlled-Access Data. Once a DAR is approved, it can take a week or longer for the approval to be reflected on BioData Catalyst.
The BioData Catalyst user is affiliated with a principal investigator (PI) who already has an approved dbGaP application with existing DAR access (for example, the user is a post-doctoral fellow in the PI’s lab). In this case, the PI assigns the user as a “Downloader” in dbGaP. See Assign Downloaders for dbGaP Data. It can take about 24 hours for “Downloader” approval to be reflected on BioData Catalyst.

Notes

DARs must be renewed annually to maintain your data access permissions. If your permissions expire, you may lose access to hosted data in BioData Catalyst during the renewal process.

A Cloud Use Statement may be required as part of the DAR.
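Because DARs must be renewed annually, it can help to track the renewal date yourself. The following is a minimal sketch, assuming the renewal falls exactly one calendar year after the approval date; that assumption and the function names are illustrative, and the authoritative dates are the ones shown in dbGaP.

```python
from datetime import date


def renewal_due(approved_on: date) -> date:
    """Return the assumed renewal date: one calendar year after approval."""
    try:
        return approved_on.replace(year=approved_on.year + 1)
    except ValueError:
        # The approval date was Feb 29 and the next year is not a leap year.
        return approved_on.replace(year=approved_on.year + 1, day=28)


def days_until_renewal(approved_on: date, today: date) -> int:
    """Days remaining before the assumed renewal date (negative if overdue)."""
    return (renewal_due(approved_on) - today).days


if __name__ == "__main__":
    # Hypothetical approval date, for illustration only.
    print(days_until_renewal(date(2023, 1, 15), date.today()))
```

A negative result means the assumed renewal date has already passed, which is when access to hosted data may lapse.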

TOPMed

BioData Catalyst hosts data from the NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium. BioData Catalyst users are not automatically onboarded as TOPMed investigators. BioData Catalyst users who are not members of the TOPMed Consortium may apply for released data through the regular dbGaP Data Access Request process.

When conducting TOPMed-related research on BioData Catalyst, members of the TOPMed consortium must follow the TOPMed Publications Policy and associated processes; for example, operating within Working Groups.

For more information, refer to the following resources:

Information on joining TOPMed
TOPMed website
TOPMed FAQs (login required)
BioData Catalyst FAQs

IRB

Users must ensure that IRB data use agreements (DUAs) are approved and maintained, as they are enforced by the BioData Catalyst ecosystem.

BioData Catalyst

Refer to the BioData Catalyst Data Protection page to learn more about topics such as data privacy, access controls, and restrictions.

Use your eRA Commons account to review the data indexed by BioData Catalyst to which you have access on the Explore BioData Catalyst Data page. For more information, see Checking Access.

If your data is not indexed, inform BioData Catalyst team members during your onboarding meetings or by submitting a Help Desk ticket.

Submitting a dbGaP Data Access Request

Requirements

An NIH eRA Commons ID (or appropriate NIH Login) is required for submitting a Data Access Request (DAR). If you do not have an eRA Commons account, you must request one through your institution’s Office of Sponsored Research or equivalent. For more information, refer to Understanding eRA Commons Accounts.
To submit a DAR, users must have PI status through their institution. Non-PI users must work with a PI who can submit a DAR and add them as a downloader.

Data Access Request Process

Step 1: Log in to dbGaP.

Step 2: Navigate to My Projects.

Step 3: Select Datasets.

You can search by Primary disease type, or if you know the dataset you are interested in, you can use Study lookup.

We want to request HCT for SCD, so we will use the accession number phs002385. As you type the accession number, the numbers will start to auto-populate.

Select the study to add it to the Data Access Request. You can request up to 200 studies that you are interested in accessing.

You can add further datasets as needed to answer the research question.
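If you keep your list of studies in a script or spreadsheet before entering them in Study lookup, a quick format check can catch typos. The sketch below assumes the common dbGaP accession shape of `phs` followed by six digits, optionally followed by a version and participant-set suffix such as `.v1.p1`; the function names are hypothetical, and the 200-study limit is the one stated above.

```python
import re

# Assumed accession shape: phs + six digits, optional ".v<N>" and ".p<N>".
ACCESSION_PATTERN = re.compile(r"phs(\d{6})(?:\.v(\d+)(?:\.p(\d+))?)?")

# Limit stated in the request process above.
MAX_STUDIES_PER_DAR = 200


def parse_accession(text):
    """Split an accession string into its parts, or return None if malformed."""
    match = ACCESSION_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    study, version, participant_set = match.groups()
    return {
        "study": "phs" + study,
        "version": int(version) if version else None,
        "participant_set": int(participant_set) if participant_set else None,
    }


def check_request(accessions):
    """Return the malformed accessions; raise if the list exceeds the limit."""
    if len(accessions) > MAX_STUDIES_PER_DAR:
        raise ValueError(f"A request can include at most {MAX_STUDIES_PER_DAR} studies")
    return [a for a in accessions if parse_accession(a) is None]


if __name__ == "__main__":
    print(parse_accession("phs002385"))
    print(check_request(["phs002385", "phs2385"]))
```

This checks only the shape of the identifier. Whether a study exists and is requestable is determined by dbGaP itself.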

Sample Research Use Statement

Title

Long-term survival and late death after hematopoietic cell transplant for sickle cell disease

Research Use Statement

Our project is limited to the requested dataset. We have no plans to combine it with other datasets.

In 2018, the National Heart, Lung, and Blood Institute (NHLBI) began work on BioData Catalyst, a shared virtual space where scientists can access NHLBI data and work with the digital objects needed for biomedical research (www.nhlbi.nih.gov/science/biodata-catalyst). This is a cloud-based platform that allows for tools, applications and workflows. It provides secure workspaces to share, store, cross-link and analyze large sets of data generated from biomedical research. BioData Catalyst addresses the NHLBI Strategic Vision objective of leveraging emerging opportunities in data science to facilitate research in heart, lung, blood and sleep disorders. It offers specialized search functions, controlled access to data and analytic tools via programming interfaces, and its interoperability will allow exchange of information with other components of the Data Commons. BioData Catalyst may be accessed by biomedical researchers and the public at large. The first available datasets in BioData Catalyst include data from NHLBI’s Trans-Omics for Precision Medicine (TOPMed) Program and the Cure Sickle Cell Initiative. Rigor in designing and performing scientific research and the ability to reproduce biomedical research are two of the cornerstones of science advancement. In order to test reproducibility of biomedical data available in BioData Catalyst, we accessed NHLBI data from the Cure Sickle Cell Initiative to test and validate the findings of a publication that utilized those data. That report focused on the effect of donor type and transplant conditioning regimen intensity on hematopoietic cell transplant outcomes for sickle cell disease. Hematopoietic cell transplant is potentially curative, yet this treatment is associated with risks for mortality from the treatment procedure. Published reports suggest the life expectancy of adults with sickle cell disease in the United States is shortened by at least two decades compared to the general population.
Thus, a fundamental question that is often asked is whether hematopoietic cell transplant over time would offer a survival advantage compared to treatment with disease-modifying agents. In the report [1] that examined factors associated with survival after transplantation, young patients (aged ≤12 years) and patients who received their graft from an HLA-matched sibling had the highest survival. For those without an HLA-matched sibling, the data did not favor one alternative donor type over another [1]. The purpose of the current analyses is two-fold: 1) to test and validate a publication that utilized data in the public domain and 2) to assess the utility of these data for conducting an independent study. The aim of the latter study is to estimate the conditional survival rates after hematopoietic cell transplantation stratified by time survived since transplantation and to compare all-cause mortality risks to those of an age-, sex-, and race-matched general population in the United States.

Non-technical summary

Investigators in the Cure Sickle Cell Initiative Data Consortium request access to data to examine rigor and reproducibility of data submitted - using a previous publication as reference. Additionally, we will calculate survival after hematopoietic cell transplant by time survived since transplant. For example, we will calculate the 5- and 10-year likelihood of being alive 2 and 5 years after transplantation.
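For readers unfamiliar with the measure named in the sample statement, conditional survival is usually computed as the ratio of two points on the overall survival curve: the probability of surviving a further t years, given survival to s years, is S(s + t) / S(s). The sketch below illustrates that ratio only. It is not the analysis method of the sample project, and the survival values are made up for illustration and are not taken from any study.

```python
def conditional_survival(survival_at, survived_years, additional_years):
    """Probability of surviving `additional_years` more, given survival so far.

    survival_at maps years since transplant to overall survival probability S(t).
    """
    s_now = survival_at[survived_years]
    s_later = survival_at[survived_years + additional_years]
    if s_now <= 0:
        raise ValueError("Survival at the conditioning time must be positive")
    return s_later / s_now


# Hypothetical survival probabilities, for illustration only.
example_curve = {0: 1.00, 2: 0.90, 5: 0.85, 7: 0.84, 10: 0.80, 12: 0.78, 15: 0.75}

if __name__ == "__main__":
    # 5-year survival for a patient already alive 2 years after transplant: S(7) / S(2).
    print(round(conditional_survival(example_curve, 2, 5), 3))
```

In practice, S(t) would come from a survival estimate such as a Kaplan-Meier curve fitted to the study data rather than from a fixed table.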

Cloud-Use Statement

The NHLBI-supported BioData Catalyst (www.nhlbiBioDataCatalyst.org) is a cloud-based infrastructure where heart, lung, blood, and sleep (HLBS) researchers can go to find, search, access, share, cross-link, and compute on large scale datasets. It will provide tools, applications, and workflows to enable those capabilities in secure workspaces. The BioData Catalyst will employ Amazon Web Services and Google Cloud Platform for data storage and compute. BioData Catalyst comprises the Data Commons Framework Services (DCFS) hosted and operated by the University of Chicago. DCFS will provide the gold master data reference as well as authorization/authentication and indexing services. The DCFS will also enable security interoperability with the secure workspaces. Workspaces will be provided by FireCloud, hosted and operated by the Broad Institute, Fair4Cures, hosted and operated by Seven Bridges Genomics and PIC-SURE operated by Harvard Medical School. For the NHLBI BioData Catalyst, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) issued to the Broad Institute, University of Chicago and Seven Bridges Genomics as presenting acceptable risk, and therefore the NCI ATO serves as an Interim Authority to Test (IATT) when used by designated TOPMed investigators and collaborators. Additionally, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) for Harvard Medical School.

Cloud Provider Information

Cloud Provider:

NHLBI BioData Catalyst, Private

The NHLBI-supported BioData Catalyst (https://biodatacatalyst.nhlbi.nih.gov/) is a cloud-based infrastructure where heart, lung, blood, and sleep (HLBS) researchers can go to find, search, access, share, cross-link, and compute on large scale datasets. It will provide tools, applications, and workflows to enable those capabilities in secure workspaces.

The NHLBI BioData Catalyst will employ Amazon Web Services and Google Cloud Platform for data storage and compute. The NHLBI BioData Catalyst comprises the Data Commons Framework Services (DCFS) hosted and operated by the University of Chicago. DCFS will provide the gold master data reference as well as authorization/authentication and indexing services. The DCFS will also enable security interoperability with the secure workspaces. Workspaces will be provided by FireCloud, hosted and operated by the Broad Institute, Fair4Cures, hosted and operated by Seven Bridges Genomics and PIC-SURE operated by Harvard Medical School.

For the NHLBI BioData Catalyst, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) issued to the Broad Institute, University of Chicago and Seven Bridges Genomics as presenting acceptable risk, and therefore the NCI ATO serves as an Interim Authority to Test (IATT) when used by designated TOPMed investigators and collaborators. Additionally, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) for Harvard Medical School.

Google Cloud Platform, Commercial

Google Cloud Platform is a public cloud platform that provides solutions and services such as virtual machines, database instances, storage, and more. We will use Google Compute Engine, a service that provides resizable compute capacity, to allocate Machine Types in which we will develop the methods and infrastructure necessary to build the NHLBI BioData Catalyst. Google Cloud offers several storage options that work in conjunction with Compute Engine: Google Cloud Storage and Google Compute Engine Persistent Disks. We expect to use each of these as they provide different capabilities, including persistent storage and direct and networked storage for attaching to running machine instances. We will use networking technologies based on Google’s Andromeda architecture, which can create networking elements at any level with software. This software-defined networking allows Cloud Platform's services to implement networking features that fit their exact needs, such as secure firewalls for virtual machines in Google Compute Engine. We will use Google Cloud Identity & Access Management to control user access to these compute resources.

Checking Access

You can check your access to data on BioData Catalyst using the public website or on your specific platform.

Public website

Go to Accessing BioData Catalyst Data and click Check My Access.

BioData Catalyst powered by Gen3 platform

Go to BioData Catalyst Powered by Gen3, select NIH Login, then log in using your NIH credentials. Once logged in, select the Exploration tab. From the Data Access panel on the left, make sure Data with Access is selected. Note whether you have access to all the datasets you expect.

Data Access

| Parameter | Description |
| --- | --- |
| Data with Access (default) | Displays projects you have access to. |
| Data without Access | Displays data you do not have subject-level access to, but for which summary statistics can be accessed. |
| All Data | Displays all projects, including those you have no access to. A lock appears for data you cannot access. |
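The three Data Access options above behave like filters over the list of projects. The following sketch models that behavior under stated assumptions: the `Project` record, the `apply_filter` function, and the project names are all hypothetical, and the `locked` flag stands in for the lock icon described for All Data. It does not call or reproduce any Gen3 interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str
    has_access: bool  # True if the user has subject-level access


def apply_filter(projects, option):
    """Return (project name, locked) pairs for the chosen Data Access option."""
    if option == "Data with Access":
        selected = [p for p in projects if p.has_access]
    elif option == "Data without Access":
        selected = [p for p in projects if not p.has_access]
    elif option == "All Data":
        selected = list(projects)
    else:
        raise ValueError(f"Unknown Data Access option: {option}")
    # Assumption: an entry is shown as locked whenever the user lacks access.
    return [(p.name, not p.has_access) for p in selected]


if __name__ == "__main__":
    # Hypothetical projects, for illustration only.
    projects = [Project("study_a", True), Project("study_b", False)]
    for option in ("Data with Access", "Data without Access", "All Data"):
        print(option, apply_filter(projects, option))
```

If a dataset you expect appears only under Data without Access, that suggests the corresponding dbGaP approval has not yet been reflected on the platform.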

Request Access

You can request access to data by visiting the dbGaP homepage. For more information on Data Access, see the Data Accessibility on the Exploration page.

BioData Catalyst powered by Seven Bridges platform

Go to BioData Catalyst powered by Seven Bridges and log in. To check your data access:

Click your username in the upper right and select Account Settings.
Select the tab for Dataset Access.
Browse the datasets and note whether you have access to all the datasets you expect.
- Datasets you have access to will have green check marks.
- Datasets you do not have access to will have red check marks.

BioData Catalyst powered by Terra platform

You do not need to check your data access on BioData Catalyst powered by Terra. But before submitting a help desk ticket, ensure that you’ve done the following steps:

Establish a link in BioData Catalyst powered by Terra to your eRA Commons/NIH Account and the University of Chicago DCP Framework. To link eRA Commons, NIH, and DCP Framework Services, go to your Profile page in BioData Catalyst powered by Terra and log in with your NIH credentials.

If you still have issues using particular files or datasets in analyses on BioData Catalyst powered by Terra, submit a request to our help desk.

BioData Catalyst powered by PIC-SURE platform

You do not need to check your data access on BioData Catalyst powered by PIC-SURE. Instead, refer to the Accessing BioData Catalyst Data page, then click Check My Access.

Cloud Provider Information

Cloud Provider:

Amazon Web Services (AWS), Commercial

Amazon Web Services (AWS) is a public cloud platform that provides solutions and services such as virtual machines, database instances, storage, and more. We will use the Amazon Elastic Compute Cloud (Amazon EC2), a web service that provides resizable compute capacity, to allocate Amazon Machine Instances (AMIs) in which we will develop the methods and infrastructure necessary to build the NHLBI BioData Catalyst. AWS offers several storage options that work in conjunction with EC2: Amazon Simple Storage Service (Amazon S3), Amazon Elastic Block Store (EBS) and Amazon Elastic File System (Amazon EFS). We expect to use each of these as they provide different capabilities, including persistent storage and direct and networked storage for attaching to running AMI(s). We will use the Amazon Virtual Private Cloud (VPC) to provide security and robust networking functionality to these compute resources and Amazon Identity and Access Management (IAM) to control user access to these compute resources. AWS offers extensive security and has written a white paper with guidelines for working with controlled access data sets in AWS, which we will follow.