Users who want to access a hosted controlled study on the BDC ecosystem must be approved for access to that study in the NIH Database of Genotypes and Phenotypes (dbGaP).
An NIH eRA Commons ID (or appropriate NIH Login) is required for submitting a Data Access Request (DAR). If you do not have an eRA Commons account, you must request one through your institution’s Office of Sponsored Research or equivalent. For more information, refer to Understanding eRA Commons Accounts on the eRA website.
To submit a DAR, users must have PI status through their institution. Non-PI users must work with a PI who can submit a DAR and add them as a downloader.
Step 1: Go to the dbGaP website (https://dbgap.ncbi.nlm.nih.gov/) and log in.
Step 2: Navigate to My Projects.
Step 3: Select Datasets.
You can search by Primary disease type, or if you know the dataset you are interested in, you can use Study lookup.
We want to request the HCT for SCD study, so we will use the accession number phs002385. As you type the accession number, matching studies will auto-populate.
Select the study to add it to the Data Access Request. You can request up to 200 studies that you are interested in accessing.
You can add additional datasets as needed to answer the research question.
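For illustration only, the same study lookup can be done programmatically against NCBI's public E-utilities interface, which exposes dbGaP records under the database name "gap". The sketch below assumes the Python requests package; the metadata field names returned by esummary are assumptions and may differ, and this does not replace the dbGaP web workflow or the DAR itself.

```python
# Sketch: look up a dbGaP study accession (e.g., phs002385) via NCBI E-utilities.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
ACCESSION = "phs002385"  # HCT for SCD

# esearch returns internal UIDs for matching records in the dbGaP ("gap") database
search = requests.get(
    f"{EUTILS}/esearch.fcgi",
    params={"db": "gap", "term": ACCESSION, "retmode": "json"},
    timeout=30,
).json()
uids = search["esearchresult"]["idlist"]
if not uids:
    raise SystemExit(f"No dbGaP records found for {ACCESSION}")

# esummary returns record-level metadata for those UIDs
summary = requests.get(
    f"{EUTILS}/esummary.fcgi",
    params={"db": "gap", "id": ",".join(uids), "retmode": "json"},
    timeout=30,
).json()

for uid in uids:
    record = summary["result"].get(uid, {})
    # Field names vary by record type; print whichever study-name field is present.
    name = record.get("d_study_name") or record.get("title") or "<name field not found>"
    print(uid, name)
```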
Long-term survival and late death after hematopoietic cell transplant for sickle cell disease
Our project is limited to the requested dataset. We have no plans to combine it with other datasets.
In 2018, the National Heart, Lung, and Blood Institute (NHLBI) began work on BioData Catalyst, a shared virtual space where scientists can access NHLBI data and work with the digital objects needed for biomedical research (www.nhlbi.nih.gov/science/biodata-catalyst). This cloud-based platform provides tools, applications, and workflows, along with secure workspaces to share, store, cross-link, and analyze large sets of data generated from biomedical research. BioData Catalyst addresses the NHLBI Strategic Vision objective of leveraging emerging opportunities in data science to facilitate research in heart, lung, blood, and sleep disorders. It offers specialized search functions and controlled access to data and analytic tools via programming interfaces, and its interoperability will allow exchange of information with other components of the Data Commons. BioData Catalyst may be accessed by biomedical researchers and the public at large. The first available datasets in BioData Catalyst include data from NHLBI's Trans-Omics for Precision Medicine (TOPMed) Program and the Cure Sickle Cell Initiative.
Rigor in designing and performing scientific research and the ability to reproduce biomedical research are two of the cornerstones of scientific advancement. To test the reproducibility of biomedical data available in BioData Catalyst, we accessed NHLBI data from the Cure Sickle Cell Initiative to test and validate the findings of a publication that utilized those data. That report focused on the effect of donor type and transplant conditioning regimen intensity on hematopoietic cell transplant outcomes for sickle cell disease. Hematopoietic cell transplant is potentially curative, yet this treatment is associated with a risk of mortality from the procedure itself. Published reports suggest the life expectancy of adults with sickle cell disease in the United States is shortened by at least two decades compared to the general population. Thus, a fundamental question that is often asked is whether hematopoietic cell transplant would, over time, offer a survival advantage compared to treatment with disease-modifying agents. In the report1 that examined factors associated with survival after transplantation, young patients (aged ≤12 years) and patients who received their graft from an HLA-matched sibling had the highest survival. For those without an HLA-matched sibling, the data did not favor one alternative donor type over another.1
The purpose of the current analyses is two-fold: 1) to test and validate a publication that utilized data in the public domain, and 2) to assess the utility of these data for conducting an independent study. The aim of the latter study was to estimate conditional survival rates after hematopoietic cell transplantation, stratified by time survived since transplantation, and to compare all-cause mortality risks to those of an age-, sex-, and race-matched general population in the United States.
Investigators in the Cure Sickle Cell Initiative Data Consortium request access to data to examine the rigor and reproducibility of the submitted data, using a previous publication as a reference. Additionally, we will calculate survival after hematopoietic cell transplant by time survived since transplant. For example, we will calculate the 5- and 10-year likelihood of being alive 2 and 5 years after transplantation.
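The conditional survival quantity described above is the ratio of Kaplan-Meier estimates, CS(t | s) = S(s + t) / S(s): the probability of surviving an additional t years given survival to s years after transplant. The sketch below illustrates the calculation with the Python lifelines package on simulated placeholder data; the numbers are not study data, and the exact landmark times used in the analysis would follow the study protocol.

```python
# Sketch: conditional survival CS(t | s) = S(s + t) / S(s) from a Kaplan-Meier fit.
# Uses simulated placeholder data, not the actual HCT for SCD dataset.
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
n = 500
event_time = rng.exponential(scale=25.0, size=n)   # placeholder time to death (years)
censor_time = rng.uniform(1.0, 15.0, size=n)       # placeholder administrative censoring
observed = np.minimum(event_time, censor_time)
died = (event_time <= censor_time).astype(int)

kmf = KaplanMeierFitter()
kmf.fit(observed, event_observed=died)

def conditional_survival(kmf, s, t):
    """Probability of surviving an additional t years given survival to s years."""
    surv = kmf.survival_function_at_times([s, s + t]).values
    return surv[1] / surv[0]

# e.g., conditional survival for patients alive 2 and 5 years post-transplant
print(round(conditional_survival(kmf, s=2, t=5), 3))
print(round(conditional_survival(kmf, s=5, t=10), 3))
```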
NHLBI BioData Catalyst (Private): The NHLBI-supported BioData Catalyst (https://biodatacatalyst.nhlbi.nih.gov/) is a cloud-based infrastructure where heart, lung, blood, and sleep (HLBS) researchers can go to find, search, access, share, cross-link, and compute on large scale datasets. It will provide tools, applications, and workflows to enable those capabilities in secure workspaces.
The NHLBI BioData Catalyst will employ Amazon Web Services and Google Cloud Platform for data storage and compute. The NHLBI BioData Catalyst comprises the Data Commons Framework Services (DCFS), hosted and operated by the University of Chicago. DCFS will provide the gold master data reference as well as authorization/authentication and indexing services. The DCFS will also enable security interoperability with the secure workspaces. Workspaces will be provided by FireCloud, hosted and operated by the Broad Institute; Fair4Cures, hosted and operated by Seven Bridges Genomics; and PIC-SURE, operated by Harvard Medical School.
For the NHLBI BioData Catalyst, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) issued to the Broad Institute, University of Chicago and Seven Bridges Genomics as presenting acceptable risk, and therefore the NCI ATO serves as an Interim Authority to Test (IATT) when used by designated TOPMed investigators and collaborators. Additionally, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) for Harvard Medical School.
Amazon Web Services (Commercial): Amazon Web Services (AWS) is a public cloud platform that provides solutions and services such as virtual machines, database instances, storage, and more. We will use Amazon Elastic Compute Cloud (Amazon EC2), a web service that provides resizable compute capacity, to allocate Amazon Machine Instances (AMIs) in which we will develop the methods and infrastructure necessary to build the NHLBI BioData Catalyst. AWS offers several storage options that work in conjunction with EC2: Amazon Simple Storage Service (Amazon S3), Amazon Elastic Block Store (EBS), and Amazon Elastic File System (Amazon EFS). We expect to use each of these, as they provide different capabilities, including persistent storage and direct and networked storage for attaching to running AMIs. We will use Amazon Virtual Private Cloud (VPC) to provide security and robust networking functionality for these compute resources, and AWS Identity and Access Management (IAM) to control user access to them. AWS offers extensive security and has written a white paper with guidelines for working with controlled-access datasets in AWS, which we will follow.
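As a concrete illustration of how these AWS services fit together for controlled-access data, the minimal sketch below uses boto3 to confirm the calling IAM identity and to stage a file into S3 with server-side encryption. The bucket and object names are placeholders; the actual bucket policies, VPC configuration, and IAM roles would follow the AWS guidance referenced above.

```python
# Sketch: confirm the calling IAM identity and stage a file into S3 with
# server-side encryption. Bucket and key names are placeholders (assumptions).
import boto3

sts = boto3.client("sts")
s3 = boto3.client("s3")

# Verify which IAM identity these credentials resolve to before touching data.
print(sts.get_caller_identity()["Arn"])

# Upload with server-side encryption at rest (SSE-S3); controlled-access data
# should never be written to an unencrypted location.
with open("phenotypes.csv", "rb") as fh:            # placeholder local file
    s3.put_object(
        Bucket="example-controlled-access-bucket",  # placeholder bucket
        Key="staging/phenotypes.csv",               # placeholder object key
        Body=fh,
        ServerSideEncryption="AES256",
    )
```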
Google Cloud Platform is a public cloud platform that provides solutions and services such as virtual machines, database instances, storage, and more. We will use Google Compute Engine, a service that provides resizable compute capacity, to allocate machine types in which we will develop the methods and infrastructure necessary to build the NHLBI BioData Catalyst. Google Cloud offers several storage options that work in conjunction with Compute Engine: Google Cloud Storage and Compute Engine persistent disks. We expect to use each of these, as they provide different capabilities, including persistent storage and direct and networked storage for attaching to running machine instances. We will use networking technologies based on Google's Andromeda architecture, which can create networking elements at any level in software. This software-defined networking allows Cloud Platform's services to implement networking features that fit their exact needs, such as secure firewalls for virtual machines in Google Compute Engine. We will use Google Cloud Identity & Access Management to control user access to these compute resources.
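A comparable sketch for Google Cloud Storage, using the google-cloud-storage client library, is shown below; the bucket and object names are again placeholders, and credentials are assumed to come from Application Default Credentials. Google-managed encryption at rest is applied by default.

```python
# Sketch: stage a file into Google Cloud Storage and list what has been staged.
# Bucket and object names are placeholders; credentials come from the environment.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-controlled-access-bucket")  # placeholder bucket

blob = bucket.blob("staging/phenotypes.csv")                # placeholder object
blob.upload_from_filename("phenotypes.csv")                 # placeholder local file

for staged in client.list_blobs("example-controlled-access-bucket", prefix="staging/"):
    print(staged.name, staged.size)
```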
How to access additional data stacks
The Genotype-Tissue Expression (GTEx) Program is a widely used data resource and tissue bank to study the relationship between genetic variants (inherited changes in DNA sequence) and gene expression (how genes are turned on and off) in multiple human tissues and across individuals. For information on access to GTEx data, refer to GTEx v8 - Free Egress Instructions as part of the AnVIL documentation.
The NIH Cloud Platform Interoperability (NCPI) effort is currently working to establish and implement guidelines and technical standards to empower end-user analyses across participating cloud platforms and facilitate the realization of a trans-NIH, federated data ecosystem. Participating platforms include BDC, AnVIL, the Cancer Research Data Commons, and the Kids First Data Resource Center. Learn what data each of these platforms currently hosts by using the NCPI dataset catalog.
You can check your access to data on BDC using the BDC website or on a specific platform.
Go to the Explore BDC Data page of the BDC website. Under the section, "Requirements for Accessing BDC Hosted Data," click Check My Access.
Go to BDC-Gen3, select NIH Login, then log in using your NIH credentials. Once logged in, select the Exploration tab. From the Data Access panel on the left, make sure Data with Access is selected. Note whether you have access to all the datasets you expect. A scripted way to list your Gen3 project access is sketched at the end of this section.
You do not need to check your data access on BDC-PIC-SURE. Instead, refer to the Explore BDC Data page of the BDC website, then click Check My Access.
Log in to BDC-Seven Bridges.
Click your username in the upper right, then select Account settings.
From the upper left, select the tab for Dataset Access.
Browse the datasets and note whether you have access to all the datasets you expect.
You do not need to check your data access on BDC-Terra, but before submitting a help desk ticket, ensure that you have completed the following steps:
Establish a link between your eRA Commons/NIH account and the University of Chicago DCP Framework Services. To link them, go to your Profile page in BDC-Terra and log in with your NIH credentials.
If you still have issues using particular files or datasets in analyses on BDC-Terra, submit a request to our help desk.
Datasets you have access to will have green check marks.
Datasets you do not have access to will have red check marks.
Data with Access (default)
A user can view all of the summary data and associated study information for studies the user has access to, including but not limited to Project ID, file types, and clinical variables.
Data without Access
Displays data you do not have subject-level access to, but for which summary statistics can be accessed. A lock next to the Project ID indicates that you do not have subject-level access; you can still search through the available studies but can only view summary statistics.
Projects are also hidden if the selected cohort contains fewer than 50 subjects; in this case the counts are grayed out, a lock appears, and the message "You may only view summary information for this project" is shown. An additional lock means the user has no access.
All Data
Displays all of the data available in the BDC-Gen3 platform, including studies you do and do not have access to. A lock appears next to studies that are not available to you.
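For users who prefer to script this check, the minimal sketch below uses the Gen3 Python SDK to ask the BDC-Gen3 fence service which projects the current credentials can access. It assumes a credentials.json API key downloaded from your BDC-Gen3 profile page; the endpoint URL, the /user/user path, and the project_access field are based on the standard Gen3 fence API and should be treated as assumptions. The Exploration view described above remains the authoritative check.

```python
# Sketch: list project-level access reported by the Gen3 fence service.
# Assumes the `gen3` SDK and a credentials.json API key from your BDC-Gen3
# profile page. Endpoint URL and response fields are assumptions based on the
# standard Gen3 fence API.
import requests
from gen3.auth import Gen3Auth

ENDPOINT = "https://gen3.biodatacatalyst.nhlbi.nih.gov"  # assumed BDC-Gen3 endpoint
auth = Gen3Auth(endpoint=ENDPOINT, refresh_file="credentials.json")

resp = requests.get(f"{ENDPOINT}/user/user", auth=auth, timeout=30)
resp.raise_for_status()

for project, permissions in sorted(resp.json().get("project_access", {}).items()):
    print(project, permissions)
```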
This checklist is intended to help new users understand their obligations regarding access and permissions, which individual users are responsible for obtaining and maintaining.
Users log into BDC platforms with their eRA Commons credentials. For more information, see Ecosystem Access, Hosted Data, and System Services.
Users create an eRA Commons Account through their institution's Office of Sponsored Research or equivalent. For more information, refer to Understanding eRA Commons Accounts.
Users who want to access a hosted controlled study on the BDC ecosystem must be approved for access to that study in the NIH Database of Genotypes and Phenotypes (dbGaP). Note that obtaining these approvals can be a time-intensive process; failure to obtain them in a timely manner may delay data access.
Users have two options for obtaining dbGaP approval depending on whether they already are affiliated with a PI who has dbGaP access to the relevant data:
The BDC user has no affiliation with an existing dbGaP-approved project. In this case the user needs to create their own dbGaP project and then submit a data access request (DAR) for approval by the NHLBI Data Access Committee (DAC). This process often takes 2-6 months, depending on whether the requested dataset requires local IRB approval, the time it takes for local review of the dbGaP application by the user's home institution, and processing by the DAC. Once a DAR is approved, it can take a week or longer for the approval to be reflected on BDC.
The BDC user is affiliated with an existing principal investigator who already has an approved dbGaP application with existing DAR access (for example, the BDC user is a post-doctoral fellow in a PI's lab). The principal investigator with dbGaP DAR access assigns the user as a "Downloader" in dbGaP. It can take about 24 hours for "Downloader" approval to be reflected on BDC.
DARs must be renewed periodically to maintain your data access permissions. If your permissions expire, you may lose access to hosted data in BDC during your renewal process.
A cloud use statement may be required as part of the DAR.
BDC hosts data from the NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium. BDC users are not automatically onboarded as TOPMed investigators. BDC users who are not members of the TOPMed Consortium may apply for released data through the regular dbGaP Data Access Request process.
When conducting TOPMed-related research on BDC, members of the TOPMed Consortium must follow TOPMed Consortium policies and associated processes; for example, operating within Working Groups.
For more information, refer to the following resources:
Information on TOPMed
Users must ensure that IRB data use agreements (DUAs) are approved and maintained, as they are enforced by the BDC ecosystem.
Refer to the BDC documentation to learn more about topics such as data privacy, access controls, and restrictions.
Use your eRA Commons account to review the data indexed by BDC to which you have access, as described in the access checks earlier in this section.
If your data is not indexed, inform BDC team members during your onboarding meetings or by submitting a help desk ticket.