NHLBI BioData Catalyst® (BDC) Documentation

This is a repository for documentation related to the platforms and services that are part of the BDC ecosystem.

Click here to access the NHLBI BioData Catalyst® (BDC) website.

Welcome to NHLBI BioData Catalyst® (BDC)

Welcome to the BDC ecosystem and thank you for joining our community of practice. The ecosystem offers secure workspaces to support your data analysis, in addition to a number of bioinformatics tools. The ecosystem currently hosts datasets from the Trans-Omics for Precision Medicine (TOPMed) program. There is a lot of information to understand and many resources (documentation, learning guides, videos, etc.) available, so we developed this overview to help you get started. If you have additional questions, please use the links at the very end of this document, under the "Questions" section, to contact us.

About BDC and Our Community

What is BDC?

NHLBI BioData Catalyst® (BDC) is a cloud-based ecosystem that offers researchers data, analytical tools, applications, and workflows in secure workspaces. BDC is a community where researchers can find, access, share, store, and analyze heart, lung, blood, and sleep data. BDC is an NHLBI data repository where researchers share scientific data from NHLBI-funded research, so they and others can reproduce findings and reuse data to advance science.

By increasing access to NHLBI data and innovative analytic capabilities, BDC accelerates reproducible biomedical research to drive scientific advances that can help prevent, diagnose, and treat heart, lung, blood, and sleep disorders.

What are we doing and why does it matter?

By increasing access to the NHLBI’s datasets and innovative data analysis capabilities, the BDC ecosystem accelerates efficient biomedical research that drives discovery and scientific advancement, leading to novel diagnostic tools, therapeutics, and prevention strategies for heart, lung, blood, and sleep disorders.

Who is developing BDC?

The ecosystem is funded by the National Heart, Lung, and Blood Institute (NHLBI). Researchers and other professionals receive funding from the NHLBI to work on the development of the ecosystem, together often referred to as “The BDC Consortium” or “The Consortium” for short. You can refer to a list of partners and platforms powering the ecosystem on the Overview page of the BDC website and a list of the principal investigators is available in our documentation.

Find out the meanings of our terms and acronyms.

Like many professional communities, BDC has adopted terms to help us communicate quickly and efficiently, but these terms can be a challenge for newcomers. To help, we created a BDC glossary of terms and acronyms. If an ecosystem term or acronym is unfamiliar and isn’t in the glossary, contact us so we can give you the information and add it to the glossary.

The BDC Ecosystem and Services

Learn about the platforms and services available in the ecosystem.

The BDC ecosystem features the following platforms and services.

Explore Available Data

  • BioData Catalyst Powered by Gen3 - Hosts genomic and phenotypic data and enables faceted search for authorized users to create and export cohorts to workspaces in a scalable, reproducible, and secure manner.

  • BioData Catalyst Powered by PIC-SURE - Enables access to all clinical data, allows feasibility queries to be conducted, and allows cohorts to be built in real time and results to be exported via the API for analysis.

Analyze Data in Cloud-based Shared Workspaces

  • BioData Catalyst Powered by Seven Bridges - Collaborative workspaces where researchers can find and analyze hosted datasets (e.g. TOPMed) as well as their own data by using hundreds of optimized analysis tools and workflows in CWL, as well as JupyterLab and RStudio for interactive analysis.

  • BioData Catalyst Powered by Terra - Secure collaborative place to organize data, run and monitor workflow (e.g. WDL) analysis pipelines, and perform interactive analysis using applications such as Jupyter Notebooks and the Hail GWAS tool.

Use Community Tools on Controlled-access Datasets

  • Dockstore - Catalog of Docker-based workflows (from individuals, labs, organizations) that export to Terra or Seven Bridges.

The NHLBI BioData Catalyst website provides further details about the platforms and services available in the ecosystem. We encourage you to create accounts on all the platforms as you get to know BioData Catalyst.

Ecosystem Access, Hosted Data, and System Services

How does data access work?

The BioData Catalyst ecosystem manages access to the hosted controlled data using data access approvals from the NIH Database of Genotypes and Phenotypes (dbGaP). Therefore, users who want to access a hosted controlled study on the ecosystem must be approved for access to that study in dbGaP.

How do I login?

Users log into BioData Catalyst platforms with their eRA Commons credentials (see Understanding eRA Commons Accounts), and authentication is performed by iTrust. Every time a user logs in, the ecosystem checks the user's credentials to ensure they can only access the data for which they have dbGaP approval.

While all of the platforms within BioData Catalyst use eRA Commons credentials and iTrust for authentication and authorization, there are some slight differences between the platforms when getting set up:

  • BioData Catalyst Powered by Gen3 - Users do not set up usernames on Gen3. The first time you log in, select “Login from NIH”, then enter your eRA Commons credentials at the prompt. This ‘User Identity’ is used to track the user on the system.

  • BioData Catalyst Powered by PIC-SURE - Similar to Gen3, user identities are used - researchers log into the system by selecting “Log in with eRA Commons.”

  • BioData Catalyst Powered by Seven Bridges - Users set up platform accounts. The first time on the system, users select to “Create an account” and then proceed with entering their eRA Commons credentials. The user is then prompted to fill out a registration form with their name, email, and preferred username. Users are also asked to acknowledge that they have read the Privacy Act notice and then they can proceed to the platform.

  • BioData Catalyst Powered by Terra - Users initially log in using Google credentials and are asked to agree to the Terms of Service and Privacy Act notice. User activity is tracked via the Google credentials, but users can link their eRA Commons credentials to the account to get access to hosted datasets.

Details about how data access works on the NHLBI BioData Catalyst ecosystem are on the website.

How do I check which data I can access?

We recommend users first check their access to data before logging in. Do this by going to the Accessing BioData Catalyst page and clicking on the “Check My Access” button. Once you confirm your data access, go to the Platforms and Services page from which you click on the “Launch” hyperlink for the platform or service you wish to use. Platforms and services have login/sign in links on their pages that bring you to the pages on which you enter your eRA Commons credentials. Documentation on checking your access to data is also available.

What data are available in the ecosystem?

The NHLBI BioData Catalyst currently hosts a subset of datasets from TOPMed including phs numbers with genomic data and related phs numbers with phenotype data. You can find information about which TOPMed studies are currently hosted on the Data page of the website as well as in the Release Notes.

Harmonized data available.

There are limited amounts of harmonized data available to users with appropriate access at this time. The TOPMed Data Coordinating Center curation team has produced forty-four (44) harmonized phenotype variables from seventeen (17) NHLBI studies. Information about the 17 studies and the 44 variables can be found in the BioData Catalyst Powered by PIC-SURE User Guide.

Bring your own data and workflows into the system.

We allow researchers to bring their own data and workflows into the ecosystem to support their analysis needs. Researchers can bring their own datasets into BioData Catalyst Powered by Seven Bridges and BioData Catalyst Powered by Terra. Users can also bring their own workflows to the system. Users can either add workflows to Dockstore in CWL or WDL, or they can create CWL tools directly on BioData Catalyst Powered by Seven Bridges and develop custom workflows for use on BioData Catalyst Powered by Terra.

Learn about genome-wide association studies and genetic association testing on BioData Catalyst.

Walk through our self-paced genome-wide association study and genetic association testing tutorials.
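For orientation, the sketch below shows the general shape of a single-variant association test of the kind those tutorials walk through, using the Hail library mentioned above as available in BioData Catalyst Powered by Terra notebooks. It is a minimal illustration, not taken from the tutorials: the file paths, the phenotype table layout, and the trait and covariate names are hypothetical.

```python
import hail as hl

hl.init()

# Hypothetical inputs: genotypes in VCF form and a tab-separated phenotype file
# keyed by sample ID. The BDC tutorials use hosted TOPMed-style data instead.
mt = hl.import_vcf("gs://example-bucket/genotypes.vcf.bgz", reference_genome="GRCh38")
pheno = hl.import_table("gs://example-bucket/phenotypes.tsv", impute=True, key="sample_id")

# Attach phenotypes to samples, then run a per-variant linear regression
# with an additive genotype coding and an intercept plus one covariate.
mt = mt.annotate_cols(pheno=pheno[mt.s])
gwas = hl.linear_regression_rows(
    y=mt.pheno.trait,                # hypothetical quantitative trait column
    x=mt.GT.n_alt_alleles(),         # 0/1/2 alternate-allele counts
    covariates=[1.0, mt.pheno.age],  # intercept term plus a hypothetical covariate
)

gwas.show(5)  # inspect the first few rows of effect sizes and p-values
```

The tutorials cover the additional steps a real analysis requires, such as quality control, relatedness handling, and mixed-model methods; the snippet is only meant to make the workflow concrete.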

Share your workflows.

We encourage users to publish their workflows so they can be used by other researchers working in the NHLBI BioData Catalyst ecosystem. Share your workflows via Dockstore.

Costs and cloud credits.

BioData Catalyst hosts a number of datasets available for analysis to users with appropriate data access approvals. Users are not charged for the storage of these hosted datasets; however, if hosted data is used in analyses users incur costs for computation and storage of derived results. Cloud credits are available on the system, and you can learn more here.

BioData Catalyst Publications

Let us know about your publications and see how you can cite us.

If you are writing a manuscript about research you conducted using NHLBI BioData Catalyst, please use the citation available here.

Immediately after learning your manuscript has been accepted, please email BDCatalystOutreach@nih.gov to let us know. Please include in your email the manuscript title, the name of the publication that accepted your manuscript, and information about pre-publication posting (if it will take place), along with your name and contact information.

Questions?

Learn more, ask questions, or request help.

Answers to frequently asked questions are available on the website, as are many resources that can be found under Learn & Support. You can also use a form to Contact Us, and if you aren’t sure which selections to make on the form, please see our help desk directory.

BDC Glossary

Glossary of terms used in the context of the BDC Consortium and platform.

  • Agile Development

    Agile software development is an approach to software development under which requirements and solutions evolve through the collaborative effort of self-organizing and cross-functional teams and their customer(s)/end user(s).

  • Alpha Users

    A small group of users who are willing to tolerate working in a system that isn’t fully developed, providing detailed feedback and engaging in follow-up discussions.

  • [Amazon] EFS

    [Amazon] Elastic File System: a simple, scalable, elastic file system for Linux-based workloads for use with AWS Cloud services and on-premises resources.

  • Ambassadors

    A small group of experts who represent the personas featured within the priority User Narratives. For their time and help, Ambassadors will receive early access to the BDC platform, free compute time, a monetary fee for their time, and coverage of relevant travel expenses.

  • App

    1. In Seven Bridges, an app is a general term to refer to both tools and workflows.

    2. App may also refer to persistent software that is integrated into a platform.

  • API

    Application Programming Interface. API technologies serve as software-based intermediaries to exchange data.

  • AWS

    Amazon Web Services. A provider of cloud services available on-demand.

  • BagIt

    BagIt is a hierarchical file packaging format for storage and transfer of arbitrary digital content.

  • BDC3

    BDC Coordinating Center

  • Beta Users

    A slightly larger group than the alpha users who are not as tolerant to a difficult/clunky environment but understand that the version they are using is not polished and they need to give feedback.

  • Beta-User Training

    Once the platform is available to a broader audience, we will support freely-accessible online training for beta-users at any time.

  • Carpentries Instructor Training Program

    Ambassadors attend this training program to become BDC trainers.

  • CCM

    Change Control Management; the systematic approach to managing all changes made to a document or process. Ensures no unnecessary changes are made, all changes are documented, and a process exists for implementing approved change.

  • CIO

    Chief Information Officer

  • Cloud Computing

    Internet-based computing, wherein computing power, networking, storage or applications running on computers outside an organization are presented to that organization in a secure, services-oriented way.

  • Components

    Software units that implement a specific function or functions and which can be reused.

  • ConOps

    Concept of Operations

  • Consortium

    A collection of teams and stakeholders working to deliver on the common goals of integrated and advanced cyberinfrastructure, leading-edge data management and analysis tools, FAIR data, and HLBS researcher engagement.

  • Containers

    A standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another (for example, Docker).

  • Command

    In workflow writing, the command specifies the literal command line run (akin to what you would run in the terminal).

  • COPDGene

    Chronic Obstructive Pulmonary Disease (COPD) Gene

  • Cost Monitoring (level)

    At the Epic level, the Coordinating Center will facilitate this process by developing reporting templates (see example in the PM Plan, Financial Management) for distribution to the teams. The BDC teams will complete these templates and send them directly to NHLBI. Each team is responsible for tracking its finances based upon the award conditions and for providing status updates as requested to NHLBI.

  • CRAM File

    Compressed columnar file format for storing biological sequences aligned to a reference sequence. Designed to be an efficient reference-based alternative to the Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats. It optionally uses a genomic reference to describe differences between the aligned sequence fragments and the reference sequence, reducing storage costs. Additionally, each column in the SAM format is separated into its own blocks, improving the compression ratio. CRAM files are typically 30 to 60% smaller than BAM files, depending on the data held within them (from Wikipedia).

  • CSOC Alpha

    Common Services Operations Center (CSOC): operates cloud, commons, compliance and security services that enable the operation of data commons; has ATO and hosts production system.

  • CSOC Beta

    Development/testing; Real data in pilot (not production) that can be accessed by users

  • Common Workflow Language (CWL)

    Simple scripting language for describing computational workflows for performing sequential operations on data. CWL is a way to describe command line tools and connect them together to create workflows. CWL is well suited for describing large-scale workflows in cluster, cloud and high performance computing environments where tasks are scheduled in parallel across many nodes.

  • DAC

    Data Access Committee: reviews all requests for access to human studies datasets

  • DAR

    Data Access Request

  • Data Access

    A process that involves authorization to access different data repositories; part of a User Narrative for the December 2020 release goal; a Work Stream; a PM Plan constraint: NHLBI, as the project sponsor, will identify a process to enable data access by the BDC team members and for research users.

  • Data Commons

    Provides tools, applications, and workflows to enable computing large scale data sets in secure workspaces.

  • Data Repository Service (DRS)

    Generic interface (API) to data repositories so data consumers, including workflow systems, can access data in a single, standardized way regardless of where it’s stored or how it’s managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by the ID (from GA4GH).

  • Data Steward

    Members of the TOPMed and COPDGene communities who are working with BDC teams.

  • dbGaP

    Database of Genotypes and Phenotypes

  • DCPPC

    Data Commons Pilot Phase Consortium. The Other Transaction Awardees, Data Stewards, and the NIH.

  • Decision Tree

    A decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility

  • Deep Learning

    A machine learning method based on neural networks to learn from data through training to recognize patterns in the data.

  • Deliverables

    Demonstrations and products.

  • Demos

    Activities and documentation resulting from the DCPPC to build, test and demonstrate completion of goals of the Data Commons Pilot Phase.

  • DEV Environment

    Set of processes and programming tools used to create the program or software product

  • DMI

    Data Management Incident

  • Docker

    Software for running containers, packaged, portable units of code and dependencies that can be run in the same way across many computers. See also Containers.

  • Dockerfile

    A text document that contains all the commands a user could call on the command line to assemble an image.

  • Dockstore

    An open platform developed by the Cancer Genome Collaboratory and used by the GA4GH for sharing Docker-based tools described with the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL)

  • DOI

    Digital Object Identifier; a code used to permanently and stably identify (usually digital) objects. DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.

  • DUO

    Data Use Ontology - a GA4GH standard for automating access (API) to human genomics data (https://github.com/EBISPOT/DUO)

  • DUOS

    Data Use Oversight System, https://duos.broadinstitute.org/

  • Ecosystem

    A software ecosystem is a collection of processes that execute on a shared platform or across shared protocols to provide flexible services. Example: The "BDC Ecosystem" - inclusive of all platforms and tools

  • EEP

    External Expert Panel. A group of experts who provide guidance and direction to NIH about the program.

  • Epic

    A very large user story which can be broken down into executable stories

    *NHLBI’s cost-monitoring level

  • eRA Commons

    Designated ID provider for whitelist

  • External Expert Panel

    An independent body of experts that inform and advise the work of the BDC Consortium.

  • FAIR

    Findable, Accessible, Interoperable, Reusable.

  • Feature

    A functionality at the system level that fulfills a meaningful stakeholder need

    *Level at which the CC coordinates

  • FireCloud

    Broad Institute secure cloud environment for analytical processing, https://software.broadinstitute.org/firecloud/

  • FISMA moderate environment

    Federal Information Security Modernization Act of 2014, amends the Federal Information Security Management Act of 2002 (FISMA), see https://www.dhs.gov/fisma

  • FS

    Full Stack

  • GA4GH

    Global Alliance for Genomics and Health

  • GA4GH APIs

    The Genomic Data Working Group is a coalition assembled to create interoperability standards for storing and sharing genomic data. The GA4GH Genomics API offers interoperability for exchanging genomic data between various platforms and organizations by sending simple HTTP requests through a JSON-equipped RESTful API.

  • GCP

    Google Cloud Platform

  • GCR

    Governance, Compliance, and Risk

  • Gen3

    Gen3 is open-source software, licensed under the Apache license, that can be used for setting up, developing, and operating data commons.

  • GitHub

    An online hub for storing and sharing computer programs and other plain text files. We use it for storage, hosting websites, communication and project management.

  • Gold Master

    A gold master, or GM, is the final version of software or data ready for release to production; a master version from which copies can be made.

  • GWAS

    Genome-wide Association Study

  • HLBS

    Heart, Lung, Blood, Sleep

  • Identity Providers

    A system entity that creates, maintains, and manages identity information for principals while providing authentication services to relying applications within a federation or distributed network; identity providers offer user authentication as a service

  • Interoperability

    The ability of data or tools from multiple resources to effectively integrate data, or operate processes, across all systems with a moderate degree of effort.

  • Instance

    In cloud computing, refers to a virtual server instance from a public or private cloud network.

  • Image

    In the context of containers and Docker, an image is the packaged, at-rest form of the software from which containers are run.

  • IP

    BDC Implementation Plan; outlines how the various elements from the planning phase of the BDC project will come together to form a concrete, operationalized BDC platform.

  • IRB

    Institutional Review Board; the entity within a research organization that reviews and approves research protocols and clinical research protocols to protect human and animal subjects.

  • IRC

    Informatics Research Core

  • ISA

    Interoperability Service Agreement

  • ITAC

    Information Technology Applications Center

  • Jupyter Notebooks

    A web-based interactive environment for organizing data, performing computation, and visualizing output.

  • Linux

    An open source computer operating system

  • Metadata

    Data about other data

  • Milestone

    Milestones mark specific progress points on the development timeline and can be invaluable in measuring and monitoring the evolution and risk of a program. © Scaled Agile, Inc.

  • MSD

    Minimum set of documents

  • MVP

    Minimum viable product

  • NHLBI

    National Heart, Lung, and Blood Institute

  • NIH

    National Institutes of Health

  • NIST Moderate controls

    NIST 800-53 - A collection of security controls and assessment procedures that both U.S. Federal and non-governmental organizations can apply to their information systems, policies, and procedures.

  • OTA

    Other Transaction Authority - the mechanism of award that NHLBI chose because it provides a degree of flexibility in the scope of the work that is needed to advance this type of high risk/high reward project

  • PI

    Principal Investigator

  • Platform

    A piece of the BDC ecosystem. Examples: Terra, Gen3, Seven Bridges, etc.

  • PM

    Project Manager

  • PMP

    BDC Project Management Plan; breaks down the implementation of BDC from the perspective of the project managers involved in the project including details on roles, specific milestones, and the project schedule.

  • PO

    Program Officer

  • Portable Format for Biomedical Data (PFB)

    Avro-based serialization format with a specific schema to import, export, and evolve biomedical data. Specifies metadata and data in one file. Metadata includes the data dictionary, ontology references, and relations between nodes. Supports versioning and backward and forward compatibility. A binary format.

  • Portfolio for Jira

    Software-as-a-Service project management tool, used to track, roadmap, and visualize various project metrics.

  • Python

    Open source programming language, used extensively in research for data manipulation, analysis, and modeling

  • Quality Assurance

    The planned and systematic activities implemented in quality management so that quality requirements for a product or service satisfy stated goals and expectations.

  • Quality Control

    The operational techniques and activities aimed at monitoring and measuring work processes and eliminating the causes of unsatisfactory outputs.

  • RACI

    Responsible, Accountable, Consulted and Informed; tool that can be used for identifying roles and responsibilities during an organizational change process; BDC RACI

  • Researcher Auth Service (RAS)

    A service to be provided by NIH's Center for Information Technology to facilitate access to NIH’s open and controlled data assets and repositories in a consistent and user-friendly manner. The RAS initiative is advancing data infrastructure and ecosystem goals defined in the NIH Strategic Plan for Data Science.

  • RFC

    Request for Comment: A process that documents and enables effective interactions between stakeholders to support shared decision making.

  • Risk Register

    A tool used to continuously identify risk, risk response planning and status updates throughout the project lifecycle. This project risk register is the primary risk reporting tool, and is located in the Project Management Plan.

  • SC

    Steering Committee

  • Scientific use case

    Defined in this project as an analysis of data from the designated sources that has relevance and value in the domain of health sciences, and is generally implementation- and software-agnostic.

  • SF or SFP

    BDC Strategic Framework [Plan]; defines what the BDC teams have accomplished up to this point, what we plan to accomplish in a timeline fashion, and milestones to track and measure implementation.

  • SFTP

    Secure File Transfer Protocol

  • Software Developers Kit

    A set of software development tools that allows the creation of applications for a certain software package, software framework, hardware platform, computer system, or similar development platform

  • Sprints

    Term of art used in software generation, referring to short, iterative cycles of development, with continuous review of code through daily builds and end-of-sprint demos

  • Stack

    Term of art referring to a suite of services that run in the cloud and enable ubiquitous, convenient, on-demand access to a shared pool of configurable computing resources.

  • Steering Committee

    Responsible for decision-making and communication in BDC.

  • STRIDES

    Science & Technology Research Infrastructure for Discovery, Experimentation, and Sustainability

  • Task

    In workflow writing, the term task encompasses all of the information necessary to execute a command, such as specifying input/output files and parameters.

  • Team

    Groups of people led by a Principal Investigator (PI), or PIs, who will complete milestones and produce deliverables. Each group has been assigned a name represented by an element of the periodic table.

  • Tiger Teams

    A diversified group of experts brought together to investigate, solve, build, or recommend possible solutions to unique situations or problems. Populated with mature experts who know what's at stake, what needs to be done, and how to work well with others; their strengths are diversity of knowledge, a single focus or purpose, cross-functional communications, decision-making sovereignty, and organizational agility.

  • Tool

    In Common Workflow Language, the term tool specifies a single command. This definition is not as discrete in other workflow languages such as WDL.

  • Tool Registry Service (TRS)

    The GA4GH Cloud Work Stream has released a standard API for exchanging tools and workflows to analyze, read, and manipulate genomic data. The Tool Registry Service (TRS) API is one of a series of technical standards from the Cloud Work Stream that together allow genomics researchers to bring algorithms to datasets in disparate cloud environments, rather than moving data around.

  • TOPMed

    Trans-Omics for Precision Medicine. One of the primary data sets of the DCPPC.

  • TOPMed DCC

    TOPMed Data Coordinating Center

  • Trans-cloud

    A provider-agnostic multi-cloud deployment architecture.

  • User Narrative

    Descriptions of a user interaction experience within the system from the perspective of a particular persona. User Narratives are further broken down into Features, Epics, and User Stories. Currently formulated into rough 6-month timelines to benchmark progress.

  • User story

    A description of a software feature from a technical/process-oriented perspective; a backlog item that describes a requirement or functionality for a user

    *Finest level of PM Monitoring

  • Variant Call Format (VCF)

    File format for storing gene sequence variations. The format has been developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data such as General feature format (GFF) stored all of the genetic data, much of which is redundant because it will be shared across the genomes. By using the variant call format only the variations need to be stored along with a reference genome. There is also a Genomic VCF (gVCF) extended format, which includes additional information about "blocks" that match the reference and their qualities (from Wikipedia). See http://www.internationalgenome.org/wiki/Analysis/vcf4.0/.

  • VDS

    Virtual Dedicated Server; a composite of complete server hardware, along with the operating system (OS), powered by a remote access layer that allows end users to globally access their server via the Internet.

  • VPC

    Virtual Private Cloud

  • Whitelist

    A security measure to permit only an approved list of entities. We recommend instead using the term "allow list".

  • Workflow

    A sequence of processes, usually computational in this context, through which a user may analyze data.

  • Workflow Description Language (WDL)

    A way to specify data processing workflows with a human-readable and writeable syntax. Used to define complex analysis tasks, chain them together in workflows, and parallelize their execution.

  • Workspace

    Areas to work on/with data within a platform. Examples: projects within Seven Bridges

  • Workstream

    A collection of related features; orthogonal to a User Narrative

  • Wrapping

    The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.

  • Virtual Machine (VM)

    An isolated computing environment with its own operating system.

Who We Are

Our Culture: Though the primary goal of the BDC project is to build a data science ecosystem, at its core, this is a people-centric endeavor. BDC is also building a community of practice working collaboratively to solve technical and scientific challenges in biomedical science.

Principal Investigators (PIs):

  • Stan Ahalt, PI RENCI (Coordination Center)

  • Rebecca Boyles, Co-PI RTI (Coordination Center)

  • Paul Avillach, PI HMS (Team Carbon)

  • Kira Bradford, Co-PI RENCI (Team Helium)

  • Steve Cox, Co-PI RENCI (Team Helium)

  • Brandi Davis-Dusenbery, PI Seven Bridges (Team Xenon)

  • Robert Grossman, PI UChicago (Team Calcium)

  • Ashok Krishnamurthy, PI RENCI (Team Helium)

  • Benedict Paten, PI UCSC (Team Calcium)

  • Anthony Philippakis, PI Broad Institute (Team Calcium)

Note: BDC collaboration is organized around teams based on elements in the periodic table. There are additional modes of collaboration in BDC including Tiger Teams, Working Groups, Steering Committee, and Publications.

More about who we are and the partners empowering our ecosystem can be found on the BioData Catalyst About page.

Citation and Acknowledgement

How to cite and acknowledge NHLBI BioData Catalyst® (BDC)

For citation of BDC:

National Heart, Lung, and Blood Institute, National Institutes of Health, U.S. Department of Health and Human Services (2020). The NHLBI BioData Catalyst. Zenodo. https://doi.org/10.5281/zenodo.3822858

To acknowledge BDC, use:

The authors wish to acknowledge the contributions of the consortium working on the development of the NHLBI BioData Catalyst® (BDC) ecosystem.

NHLBI BioData Catalyst Ecosystem Security Statement

BDCatalyst-RFC-#: 11
BDCatalyst-RFC-Title: NHLBI BioData Catalyst Ecosystem Security Statement
BDCatalyst-RFC-Type: Consensus Building
Name of the person who is to be Point of Contact: Sarah Davis
Email of the person who is to be Point of Contact: sdavis@renci.org
Submitting Team: BDC3/NHLBI
Requested BDCatalyst-RFC posting start date: 6/14/2021
Date Emailed for consideration: 6/14/2021
BDCatalyst-RFC-Status: Comment only
URL Link to this document: https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/community/request-for-comments/nhlbi-biodata-catalyst-ecosystem-security-statement
URL Link to the website: https://www.biodatacatalyst.org/collaboration/rfcs/bdcatalyst-rfc-11/
License: This work is licensed under a CC-BY-4.0 license.

Overview

The purpose of this RFC is to provide the NHLBI BioData Catalyst Consortium and users of the NHLBI BioData Catalyst ecosystem with a clear statement on security mechanisms of the ecosystem that protect the confidentiality, integrity, provenance, and availability of the hosted data as well as any data that may be uploaded using the ecosystem’s “Bring Your Own Data” (BYOD) functionality.

Figure 1. The NHLBI BioData Catalyst ecosystem leverages separately developed and managed platforms to maximize flexibility for users based on their research needs, expertise, and backgrounds. Utilizing multiple Authorizations to Operate (ATO), these platforms combine to provide secure, cloud-based workspaces, user authentication and authorization, search, tools and workflows, applications, and new innovative features to address community needs.

NHLBI BioData Catalyst Ecosystem Security Statement

The NHLBI and the BioData Catalyst Consortium recognize the importance of protecting both the privacy and security of the data and respecting the consent of the study participants whose data are stored within the BioData Catalyst ecosystem. Tackling these issues produces some challenges beyond those faced by most Federal Information Systems. The BioData Catalyst Consortium has implemented many innovative approaches to enable compliance and ensure that users understand their responsibility to protect data as articulated in specific Data Use Agreements (DUA). These approaches and controls work to protect the confidentiality, integrity, and availability of the data; the privacy of the study participants who have contributed data; and data that may be uploaded to BioData Catalyst using the ecosystem’s “Bring Your Own Data” (BYOD) functionality. While the same general security controls are applied to both system and BYOD data, BYOD data is further protected because the ecosystem provides access only to the data’s uploaders and their designated collaborators.

From a Federal Information Security Modernization Act (FISMA) perspective, the BioData Catalyst ecosystem is a set of software systems with distinct security boundaries. Each system owner holds an Authority to Operate (ATO) issued by the NIH. The ATO is the result of a rigorous Security Assessment and Authorization (SA&A) process and third party assessment consistent with guidance from the National Institute of Standards and Technology (NIST). The ecosystem operates via a set of Interconnection Security Agreements (ISA) (Reindl 1979) and utilizes several existing components of security infrastructure (Bridges 2017, Gutiérrez-Sacristán et al. 2018) developed for other NIH platforms. Where the documentation provided as part of the SA&A process describes how security controls are implemented based on the NIST Special Publication 800-53r4 (see Endnote), the ISAs describe the permitted exchange of data and establish ecosystem-wide incident response, logging and auditing expectations that enable the consortium to respond in a unified manner to any suspected cybersecurity incident. The SA&A documentation provides for regular evaluation of the security of the component systems including regular scanning for vulnerabilities and the conduct of an annual penetration test. This level of security represents a baseline, and the BioData Catalyst ecosystem will extend protections over time.

Where the processes, policies, and technical controls protect confidentiality, integrity, and availability of data in accordance with Federal statute and regulation, there are additional ways to ensure that data is used in a manner consistent with study participants’ wishes, as represented by the consent form participants sign when enrolling in a specific study. Respect for these consents is critical to maintaining the public’s trust and requires additional policy, process, and technical controls. The respect for consent in NHLBI BioData Catalyst is enforced using normative NIH policies and processes for data sharing and using the existing infrastructure provided by the National Center for Biotechnology Information’s (NCBI) Database of Genotypes and Phenotypes (dbGaP). All NHLBI-provided data within the NHLBI BioData Catalyst ecosystem are registered in dbGaP; in this process, data are assigned “consent groups” that describe in a machine-readable format the parameters of the consent for the data. These range from the most expansive “General Research Use” to more restrictive, such as only allowing for research outcomes related to “Health Medical Biomedical topics” or even to specific diseases, such as Chronic Obstructive Pulmonary Disease. Further, while secondary analysis of data is not considered human subjects research as described in the Common Rule (45 CFR Part 46), some datasets require the review of a research proposal by an Institutional Review Board (IRB) or a Letter of Collaboration (LOC) with the originating study Principal Investigator, as determined by the informed consents or by special considerations that the submitting institution has identified as needed. These measures provide additional protection for datasets with particular sensitivity or special criteria for use.

For instance, IRB review is required when 1) the informed consents signed by the study participants state that IRB oversight for secondary use of the data is required, and/or 2) the study IRB of record determines that the data may contain sensitive information that requires IRB oversight for secondary research. For collaboration letters, the informed consents indicate that researchers outside the study who make secondary use of the data must work with the study, and therefore formal collaborations need to be put in place. While these are rarely used, they provide additional protection under special circumstances, such as where an indigenous population or sovereign nation requires direct control of how its data is used. Because consent is expressed at the individual level, there may be a variety of consents within a study, either because the study offered choices to its participants or because the consent evolved over an extended longitudinal study, such as the Framingham Heart Study. These variations in consent are reflected as multiple “consent groups” within a study and may mean that an investigator receives permission for only a subset of study participants.

BioData Catalyst uses telemetry provided by dbGaP to enforce compliance with consents. Accordingly, users of BioData Catalyst can see only the data for which they have completed a dbGaP Data Access Request (DAR) and received approval from an NIH Data Access Committee (DAC). DAC approval results in a Data Use Agreement (DUA) describing any Data Use Limitations asserted by the originating study Principal Investigator, including ensuring that any requirements for an IRB review or Letter of Collaboration are met. On the BioData Catalyst ecosystem, the Gen3 “Fence” service, developed and operated by the University of Chicago and utilized by other NIH platforms and many other non-federal data commons, ensures enforcement of data access requirements. In order to ensure that the NHLBI maintains control over the use of the data and has the ability to audit this use, the BioData Catalyst policy does not allow download of source data outside the cloud ecosystem. Instead, BioData Catalyst provides access to data in approved cloud environments where computation occurs, as documented in an ISA, via standard Application Programming Interfaces (APIs) that use the authentication and authorization provided by the NIH Researcher Auth Service (RAS) (REF), together with Fence, to protect access and to allow for monitoring and auditing of appropriate data use (e.g., within the scope of the approved project). These APIs include implementations of the protected GA4GH Data Repository Service (DRS) for access to data objects such as genomic data, and protected PIC-SURE interfaces for access to phenotypic, genotypic variant, and electronic health record data. The use of these APIs will, once fully implemented, enable other trusted cloud-based systems that meet equivalent security requirements to access data stored within BioData Catalyst for analysis using those systems’ tools, without the data being downloaded outside the security perimeters of the systems. This commitment to the use of APIs, together with the requirement that data stay within the designated security boundaries, is a critical component of making NHLBI data FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016, Corpas et al. 2018), while also ensuring confidentiality of data and respect for consent, regardless of the platform where the data is analyzed. BioData Catalyst has extended this model through the use of the NIH Researcher Auth Service (RAS) to provide authentication and authorization controls, which, together with the use of secure APIs, is enabling secure interoperability with other trusted NIH-funded platforms such as the NHGRI’s AnVIL and the Gabriella Miller Kids First Data Resource.
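To make the API-based access model concrete, the sketch below shows how a client resolves a GA4GH DRS identifier into a time-limited URL using the two standard DRS endpoints (object lookup, then access-method exchange). It is an illustrative sketch only: the server host, object ID, and token are hypothetical placeholders, and on BioData Catalyst the token comes from the RAS/Fence authentication flow rather than being hard-coded.

```python
import requests

# Hypothetical placeholders: a real deployment supplies the DRS host, the
# DRS object ID of a hosted file, and a bearer token from the auth flow.
DRS_BASE = "https://drs.example.org/ga4gh/drs/v1"
OBJECT_ID = "dg.EXAMPLE/00000000-0000-0000-0000-000000000000"
TOKEN = "<bearer token from RAS/Fence>"

headers = {"Authorization": f"Bearer {TOKEN}"}

# 1. Resolve the DRS object to its metadata (name, size, checksums, access methods).
resp = requests.get(f"{DRS_BASE}/objects/{OBJECT_ID}", headers=headers)
resp.raise_for_status()
obj = resp.json()
print(obj["name"], obj["size"])

# 2. Exchange an access method's access_id for a time-limited URL.
#    (Some access methods embed an access_url directly instead of an access_id.)
method = obj["access_methods"][0]
access = requests.get(
    f"{DRS_BASE}/objects/{OBJECT_ID}/access/{method['access_id']}",
    headers=headers,
)
access.raise_for_status()

# 3. The returned URL is what a workflow or notebook running in an approved
#    cloud environment reads from; this metadata exchange does not move the
#    source data outside the ecosystem's security boundary.
print(access.json()["url"])
```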

Endnote

While NIST-800-53r4 is a thorough standard for many kinds of systems, it has some gaps for the most modern systems encountered. There are additional standards to apply, either from the High column of NIST-800-53r4 or as “best practice” addendums.

In particular, the Standard does not give guidance on modern application security (“appsec”), especially when the infrastructure is serverless and the entire security surface is the application itself. For instance, there are requirements for regular scanning, but scanning tools do not scan APIs or modern Single Page Apps (SPAs) sufficiently. A “web app scan” with tools like IBM’s AppScan would return no vulnerabilities on an API without actually testing it, yet running such a scan would satisfy RA-5 of NIST-800-53. Nor does the standard require that the Infrastructure-as-a-Service layer (AWS, GCP, Azure) abide by a continual scanning posture for misconfiguration; only networks and VMs are specified. That is one example of where NIST-800-53r4 doesn’t work for modern applications.

There is also no guidance from NIST at large about running an API where external parties build “clients” of that API. Does allowing clients extend the security boundary, such that all third-party applications must be evaluated as part of it? Is there a different consideration for these third-party applications? The standard is silent there. Companies like Apple and Google, through their app stores, require all third-party apps to undergo evaluation against their own security standards, and such an approach might be applicable here. NHLBI might consider adding some extra controls, what the All of Us Research Program calls FISMA+, for enhanced security.


Data Access

Getting Started

Documentation for getting started on the NHLBI BioData Catalyst® (BDC) ecosystem.

Contributing User Resources to BDC

The BDC user community is essential to advancing science with new and exciting discoveries and informing the development of the ecosystem and its infrastructure. Members of the BDC user community learn how to explore the hosted data, use the services, and employ its tools in valuable ways that even the developers may not anticipate. Therefore, we actively invite user resource contributions to be shared with the community.

Types of Resources

Consider supporting fellow ecosystem users in one of the following ways:

  • Written Documentation: Develop step-by-step guides, FAQs, checklists, and so on. Include screenshots to support user understanding.

  • Videos: Record a shortcut, tip, or process you think would be helpful to other users. Keep videos short by dividing larger processes into smaller segments and recording separate videos for each.

  • Respond to inquiries: Answer questions posed in the BDC Forums. Forum content with significant engagement may get incorporated into written documentation or made into videos.

Note

All materials must comply with privacy policies. Be certain to redact any patient information from all content and protect study participants' privacy by not including personally identifiable, confidential, sensitive, or personal health information (for example, blur screenshots that contain data).

Decide How to Share What You Know

Experienced users who want to share their tips and tricks should consider the following questions.

  • Did someone already share my tip? Look through the resources already available to users before investing your time and energy into creating a new one.

    • Check the Frequently Asked Questions page on the BDC website

    • View the “Learn” and “Documentation” links available on the BioData Catalyst Services webpage.

    • View the BDC Documentation hosted on GitBook.

    • Explore the links to platform-specific documentation, videos, FAQs, community forums, blogs, tutorials, and upcoming events on the BioData Catalyst Learn webpage

    • Check out the videos on the BDC YouTube channel.

  • Which format best suits your resource? Ask yourself, "Would I prefer to watch this on video or have a step-by-step guide to help me?" Then ask yourself which you think other users would prefer. Figuring out which you'd prefer is a great place to start because you are the one who identified the tip. But remember that you are creating something to help other people whose preferences will determine whether a resource gets used.

  • Is my tip complex, or does it require several steps? If so, a written how-to guide will probably be easier to follow than a video because someone watching a video may need to stop and restart it often. Still, visual aids will be helpful, so consider using screenshots in your how-to guide.

  • Is the guidance I want to share relatively straightforward, but it requires clicking through several pages/places? If so, a short video could be the best way to share your tip. Finding buttons or links can be much easier if shown rather than described.

  • If I create a video and make sure to go slowly enough that someone can follow along, will it be longer than 15 minutes? If so, creating a video may not be the right format, or breaking down the content into shorter (more digestible) videos may be preferable.

  • Am I comfortable following the requirements outlined for the user-generated video tutorials? If not, please create written documentation (e.g., a how-to guide).

  • Do I want to provide help in almost-real-time without needing to formally draft a document or record a video? Visit the Community Forum often to provide answers to questions posed by other users or even just post your tip.

Creating and Sharing Your Contribution

Once you decide upon the best way to share what you learned, you'll need to create your contribution and then share it.

  • For a quick tip that you want to distribute swiftly, draft something short that you can easily post to the Community Forum. The following is an example of a quick tip for using PIC-SURE’s Data Access Table:

    • In PIC-SURE, did you know you can use the search bar in the Data Access Table to find studies? Instead of scrolling through the table and looking at the list of available studies manually, you can search for studies. An example could be “MESA” for a specific study name, or a phenotype like “Sickle Cell” to find all sickle cell related studies. It seems obvious, but I’m not sure how many other users are aware of this, and I found it really helpful!

  • For Written Documentation, draft your suggestions and include screenshots to help lead users through the process you describe. Once complete, submit the file to BDCatalystOutreach@nih.gov for review and posting to the BDC GitBook. Note that we accept Google Docs (with at least “suggesting” edit access preferred) and Microsoft Word formats; PDFs are not accepted.

  • For videos, review the User-Generated Videos portion of the BDC Video Content Guidance page. By submitting a video, you agree to those conditions. Once your video is uploaded to your YouTube channel, email the link to BDCatalystOutreach@nih.gov for consideration for linking from the BDC YouTube channel as well.

Finding User-Generated User Resources

  • Forum messages will post directly in the community forums.

  • Written documentation will live in the BDC Documentation, hosted on GitBook.

  • User-generated videos will be linked in the BDC YouTube Channel.

Dug Semantic Search

Step-by-step guidance on using Dug Semantic Search: efficiently and effectively perform and interpret a search using Dug.

Overview

Dug Semantic Search is a tool that allows users to deep dive into BDC studies and biomedical topics, research, and publications to identify related studies, datasets, and variables. If you are interested in how Dug connects study variables to biomedical concepts, read the Dug paper or visit the Help Portal.

This tool applies semantic web and knowledge graph techniques to improve the Findability, Accessibility, Interoperability, and Reusability (FAIR) of BDC research data. Through this process, semantic search helps users identify novel relations, build unique research questions, and identify potential collaborations.

Data Interoperability

How to access additional data stacks

GTEx Data

The Genotype-Tissue Expression (GTEx) Program is a widely used data resource and tissue bank to study the relationship between genetic variants (inherited changes in DNA sequence) and gene expression (how genes are turned on and off) in multiple human tissues and across individuals. For information on access to GTEx data, refer to GTEx v8 - Free Egress Instructions as part of the AnVIL documentation.

NCPI Data Portal

The NIH Cloud Platform Interoperability Effort (NCPI) is currently working to establish and implement guidelines and technical standards to empower end-user analyses across participating cloud platforms and facilitate the realization of a trans-NIH, federated data ecosystem. Participating institutions include BDC, AnVIL, Cancer Research Data Commons, and Kids First Data Resource Center. Learn what data is currently hosted by these platforms by using the NCPI Data Portal.

Understanding Access

This checklist is intended to help new users understand their obligations regarding access and permissions, which individual users are responsible for obtaining and maintaining.

eRA Commons Account

Users log into BDC platforms with their eRA Commons credentials. For more information, see Ecosystem Access, Hosted Data, and System Services.

Users create an eRA Commons Account through their institution's Office of Sponsored Research or equivalent. For more information, refer to Understanding eRA Commons Accounts.

dbGaP

Users who want to access a hosted controlled study on the BDC ecosystem must be approved for access to that study in the NIH Database of Genotypes and Phenotypes (dbGaP). For more information, see Ecosystem Access, Hosted Data, and System Services and BioData Catalyst FAQs. Note that obtaining these approvals can be a time-intensive process; failure to obtain them in a timely manner may delay data access.

Users have two options for obtaining dbGaP approval depending on whether they already are affiliated with a PI who has dbGaP access to the relevant data:

  1. The BDC user has no affiliation with an existing dbGaP-approved project. In this case, the user needs to create their own dbGaP project and then submit a data access request (DAR) for approval by the NHLBI Data Access Committee (DAC). This process often takes 2-6 months, depending on whether local IRB approval is required for the dataset the user is requesting, the amount of time it takes for local review of the dbGaP application by the user’s home institution, and processing by the DAR committee. See the dbGaP Authorized Access Portal or dbGaP Overview: Requesting Controlled-Access Data. Once a DAR is approved, it can take a week or longer for the approval to be reflected on BDC.

  2. The BDC user is affiliated with an existing principal investigator, who already has an approved dbGaP application with existing DAR access (for example, the BDC user is a post-doctoral fellow in a PI’s lab). A principal investigator with dbGaP DAR access assigns the User as a “Downloader” in dbGaP. See Assign Downloaders for dbGaP Data. It can take about 24 hours for “Downloader” approval to be reflected on BDC.

Notes

DARs must be renewed annually to maintain your data access permissions. If your permissions expire, you may lose access to hosted data in BDC during the renewal process.

A Cloud Use Statement may be required as part of the DAR.

TOPMed

BDC hosts data from the NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium. BDC users are not automatically onboarded as TOPMed investigators. BDC users who are not members of the TOPMed Consortium may apply for released data through the regular dbGaP Data Access Request process.

When conducting TOPMed-related research on BDC, members of the TOPMed consortium must follow the TOPMed Publications Policy and associated processes; for example, operating within Working Groups.

For more information, refer to the following resources:

  • Information on joining TOPMed

  • TOPMed website

  • TOPMed FAQs (login required)

  • BioData Catalyst FAQs

IRB

Users must ensure that IRB data use agreements (DUAs) are approved and maintained as they are enforced by the BDC ecosystem.

BDC

Refer to the BDC Data Protection page to learn more about topics such as data privacy, access controls, and restrictions.

Use your eRA Commons account to review the data indexed by BDC to which you have access on the Explore BioData Catalyst Data page. For more information, see Checking Access.

If your data is not indexed, inform BDC team members during your onboarding meetings or by submitting a Help Desk ticket.


Strategic Planning

In the context of agile development and a Consortium with a diverse set of members, the application of various agile-development terms may mean different things to different individuals.

The table below defines the BDC Core Terminology:

Term: User Narrative
Definition/Description: Descriptions of a user interaction experience within the system from the perspective of a particular persona. User Narratives are further broken down into Features, Epics, and User Stories. Currently formulated into rough 6-month timelines to benchmark progress.
Example: An experienced bioinformatician wants to search TOPMed studies for a qualitative trait to be used in a GWAS study.

Term: Feature
Definition/Description: A functionality at the system level that fulfills a meaningful stakeholder need. *Level at which the BDC3 coordinates.
Example: Search TOPMed datasets using the PIC-SURE platform.

Term: Epic
Definition/Description: A very large user story which can be broken down into executable stories. *NHLBI’s cost-monitoring level.
Example: PIC-SURE is accessible on BDC.

Term: User Story
Definition/Description: A backlog item that describes a requirement or functionality for a user. *Finest level of PM monitoring.
Example: A user can access PIC-SURE through an icon on BDC to initiate search.

Term: Workstream
Definition/Description: A collection of related features; orthogonal to a User Narrative.
Example: Workstreams impacted by the User Narrative above include:

  • production system

  • data analysis

  • data access

  • data management

Project Management Approach PDF: PM-graphic.pdf (148KB)

Strategic Planning Documents Reviewed & Approved by NHLBI Leadership

  • BioData-Catalyst-Strategic-Framework-Plan-V1-v2.0 (1).pdf (471KB)

  • BioData-Catalyst-Implementation-Plan-V1-v2.0.pdf (685KB)

  • BioData Catalyst Data Management Strategy - V1.0(3).pdf (491KB)

  • BioData Catalyst Project Management Plan V2.0 (1).pdf (622KB)

NHLBI DICOM Medical Image De-Identification Baseline Protocol

BDC-RFC-#: 28
Title: DICOM Medical Image De-Identification Baseline Protocol
Type: Process
Contact Name and Email: Keyvan Farahani, farahank@mail.nih.gov
Submitting Teams: NHLBI, DMC
Date Sent to Consortium: Oct. 11, 2023
Status: Closed for comment
URL Link to this Google Document: https://docs.google.com/document/d/14-WfeMqgZz115DbBnFs-8AvcdRY1oIjCgi0K33pwMjE/edit?usp=sharing
License: This work is licensed under a CC-BY-4.0 license.

Medical Image De-Identification: BDC Baseline Protocol

Contributors:

  • Zixin Nie (BDC Data Management Core)

  • Keyvan Farahani (NHLBI)

  • David Clunie (PixelMed Publishing)

Why image de-identification?

De-identification of protected health information (PHI) is often necessary before potentially sensitive information, such as health data, can be shared. Many data repositories that allow human data to be deposited and shared require the data to be de-identified. Medical images and their associated metadata (i.e., DICOM headers) often contain PHI, such as patient names, dates of birth, or medical record numbers. The de-identification of these images is essential to minimize privacy risk and comply with regulations and standards that require the protection of PHI. The overarching goal in medical image de-identification is to reduce the risk of identification as much as possible.

De-identification facilitates the sharing of medical imaging data, enabling greater access by researchers and the public and allowing for secondary research to be conducted. Several standards exist for de-identification of medical images, including the confidentiality profile detailed in the DICOM Part 15 standard, HIPAA Safe Harbor and Expert Determination. The BioData Catalyst Data Management Core (BDC DMC) performed an evaluation of these standards and used them to create the protocol detailed in this document. This document describes the de-identification processes and technical considerations for de-identifying medical images as they are being added to BDC and made available to researchers using the BDC platform. The protocol, referred to as the “BDC Baseline Protocol for Image De-identification,” takes into account the data use cases for researchers accessing the BDC platform by defining a de-identification profile that strikes a balance between privacy protection and preserving utility.

The Baseline protocol only applies to the metadata in radiologic (DICOM) images (see table below). It does not apply to image pixel information, other imaging formats, or other types of data that may be imported into BDC, such as clinical and omics data. It reflects the understanding of the de-identification needs of BDC as of October 2023. Future RFCs are planned that will address masking of unique identifiers, the details of how imaging pixel data will be de-identified, the de-identification process workflow, and quality management.

Major medical imaging modalities

Imaging Data Type: Conventional format

  • Radiologic (X-ray, PET/CT, MRI, ultrasound): DICOM (Digital Imaging and Communication in Medicine)

  • Cardiac ECG: XML

  • Digital Pathology: Proprietary TIFF and DICOM Pathology

The focus of this RFC is on de-identification of DICOM images.

The Baseline De-Identification Protocol

The de-identification protocol described in this section is intended to be a baseline for de-identification within BDC. The protocol is compliant with regulations such as the HIPAA Privacy Rule and the Common Rule, while retaining the maximal amount of research utility possible. It is designed based on the experiences from the HeartShare imaging pilot project. The protocol will evolve over time, with future iterations to address new issues as they arise, and customizations to address specific research use cases. These may involve Expert Determinations, which can both increase privacy protections and improve research utility. This protocol is to be used for all medical imaging data to be submitted to the BDC. The protocol may be implemented in an image de-identification tool at the submitter’s site, or in a central BDC-related data curation service. Any deviation from this protocol must be discussed with and approved by the BDC/DMC. The baseline de-identification protocol can be found at this link: DICOM_deid_part_15_classified_09_26_2024_Baseline.xlsx.

Introduction to HIPAA Safe Harbor and DICOM Part 15

De-identification of DICOM data can be performed according to different standards. Two commonly accepted standards are HIPAA Safe Harbor and Normative E Attribute Confidentiality Profiles defined in part 15 of the DICOM standard (referred to in the rest of this document as the DICOM Part 15 Standard).

HIPAA Safe Harbor de-identification calls for the removal of 18 types of identifiers (detailed here: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#standard). The standard legally applies to PHI handled by HIPAA Covered Entities; however, because it has been in use for over 20 years, it is generally accepted as a de-identification standard for other types of data as well.

The DICOM Part 15 Standard was developed through a careful review of all DICOM attributes, identifying any that had the possibility of containing identifying information and creating a mitigation strategy. It is more extensive than HIPAA Safe Harbor, covering attributes that are not part of the 18 prescribed types of identifiers such as ethnicity and biological sex. Various mitigation strategies are presented to treat the attributes detailed as part of the standard, with the Basic DICOM Part 15 Confidentiality Profile being the most conservative, calling for suppression of most of the attributes.

De-Identification of DICOM Header Data

To produce de-identified data that retains analytic utility for BDC researchers, while remaining a standardized implementation of de-identification that can be applied across most data ingested by BDC, an evaluation was performed to produce a set of de-identification rules that can be applied to DICOM header attributes. The evaluation examined the de-identification profiles detailed in the DICOM Part 15 standard and aligned them with the minimum requirements for compliance with HIPAA Safe Harbor. The resulting de-identification strategy should be sufficient to construct a de-identification profile that can be applied across all DICOM headers.

The steps for performing this evaluation were as follows:

  1. Attributes from each profile were classified into the following categories: Direct Identifier (DI), Quasi-Identifier (QI), and Non-Identifier (NI), according to the classification framework detailed in the following diagram:

  2. After classification, DIs and QIs were then aligned with the 18 types of identifiers specified for removal within the HIPAA Safe Harbor provision.

  3. Each of the attributes that aligns with one of the HIPAA Safe Harbor identifiers was then assigned a mitigation technique to remove the identifying information that could appear in the field.
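To make the output of this evaluation concrete, the sketch below shows how a handful of DICOM attributes might be recorded after classification and mitigation assignment. The attribute keywords are standard DICOM keywords, but the specific category and mitigation assignments shown here are illustrative only; the Baseline Protocol spreadsheet linked above is the authoritative source.

```python
# Illustrative only: a few DICOM attribute keywords with the identifier class
# (DI = direct identifier, QI = quasi-identifier, NI = non-identifier) and the
# mitigation they might be assigned. The Baseline Protocol spreadsheet is the
# authoritative source for the full attribute-by-attribute assignments.
BASELINE_RULES = {
    "PatientName":             ("DI", "suppress"),
    "PatientAddress":          ("DI", "suppress"),
    "PatientTelephoneNumbers": ("DI", "suppress"),
    "StudyDate":               ("QI", "keep year only"),
    "ImageComments":           ("QI", "suppress (free text)"),
    "Modality":                ("NI", "keep"),
}
```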

Of the attributes within the DICOM Part 15 standard that must be removed for compliance with HIPAA Safe Harbor, there are:

  • 4 name attributes

  • 4 patient address attributes

  • 122 date attributes

  • 5 telephone number attributes

  • 91 other unique ID attributes

Names, addresses, and telephone numbers should be suppressed from the data. Dates can be kept accurate to the year (a future BDC medical image de-identification RFC will address improving this approach for longitudinally acquired imaging studies). The other unique IDs can either be suppressed or masked so that their original values cannot be recovered. The specifics of how the other unique IDs will be masked will come in a separate RFC that describes the masking procedures. Additionally, there are 26 attributes that contain various forms of free text, such as comments, notes, labels, and text strings. Identifying information may be written in these attributes. As such, they should be suppressed to prevent the leakage of identifying information.

The other attributes detailed in the DICOM Part 15 standard do not necessarily require mitigation for compliance with HIPAA Safe Harbor. However, if they do not have analytic usage, it is recommended to mitigate them according to the specifications detailed in the DICOM Part 15 standard in order to decrease the risk of re-identification represented by indirectly identifying fields not mentioned in HIPAA Safe Harbor.
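As an illustration of how these mitigations could be applied in practice, the sketch below uses the open-source pydicom library to suppress a small, illustrative subset of identifying attributes, reduce dates to the year, and drop private tags. It is not the Baseline Protocol itself; the full list of attributes and their mitigations is defined in the protocol spreadsheet.

```python
import pydicom

def deidentify_header(path_in: str, path_out: str) -> None:
    """Illustrative header de-identification: suppress a few identifying
    attributes, keep dates accurate only to the year, drop private tags."""
    ds = pydicom.dcmread(path_in)

    # Suppress a few name / address / phone / free-text attributes (illustrative subset).
    for keyword in ("PatientName", "PatientAddress", "PatientTelephoneNumbers",
                    "ReferringPhysicianName", "ImageComments"):
        if keyword in ds:
            setattr(ds, keyword, "")

    # Keep dates accurate to the year only (DICOM DA values are YYYYMMDD strings).
    for keyword in ("StudyDate", "SeriesDate", "AcquisitionDate", "PatientBirthDate"):
        if keyword in ds and ds.data_element(keyword).value:
            year = str(ds.data_element(keyword).value)[:4]
            setattr(ds, keyword, year + "0101")

    ds.remove_private_tags()   # private tags may also carry identifying information
    ds.save_as(path_out)
```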

De-Identification of Image Pixel Data

Image pixel data, often encountered in ultrasound (echo) imaging, can contain PHI, such as patient names, dates of birth, and the hospital or imaging center names. This information can be shown either in labels on images, which usually have pre-specified areas, or in the form of burned-in text, which can appear anywhere on the image. Any identifying information contained within pixel data should be removed before it is made available to researchers.

Methods for removal of image pixel data include the following:

  • Masking through opaque boxes over parts of the image

  • AI assisted removal of identifying information, deploying optical character recognition (OCR)

  • Deletion of images from the dataset that contain identifying information

Image pixel de-identification will be performed as a service using existing third-party tools provided by DMC contractors. After de-identification, images will still require review to ensure that the process captured and removed all identifying information on the images. This is a necessary quality control step to ensure that there is no leakage of identifying information.
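For the first method listed above (masking pre-specified label areas with opaque boxes), a minimal sketch is shown below. It assumes an uncompressed, single-frame grayscale image and a label region confined to the top rows of the frame; in practice, pixel de-identification is handled by the third-party tools mentioned above and must be followed by human review.

```python
import pydicom

def mask_label_region(path_in: str, path_out: str, label_rows: int = 60) -> None:
    """Cover an assumed burned-in label area (the top `label_rows` pixel rows)
    with an opaque black box. Assumes uncompressed, single-frame grayscale data."""
    ds = pydicom.dcmread(path_in)
    pixels = ds.pixel_array.copy()      # decoded pixel data as a NumPy array
    pixels[:label_rows, :] = 0          # opaque box over the assumed label region
    ds.PixelData = pixels.tobytes()     # write the modified pixels back to the dataset
    ds.save_as(path_out)
```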

De-Identification of Filenames and File Paths

Metadata associated with images, such as filenames and file paths, can often include unique IDs and dates of medical events. This information is important for associating imaging data correctly with other types of data for linkage, processing, and analysis; however, it can also present a risk of leaking identifying information from de-identified data files. To prevent this, the following rules should be followed (a minimal illustrative sketch follows the list):

  1. Folder names should only include the study name and associated visit number, and no further information

    1. e.g., for the first visit of the MESA study, the folder name should be called MESA_V1

  2. Image filenames are to be set to the following format: STUDYNAME_TYPE_VISITNN_YYYYMMDD_SEQ

    1. VISITNN: “VISIT” + VisitNumber (the label “VISIT” is included to tell the investigator what the number refers to)

    2. YYYYMMDD: AcquisitionDate with the month and day set to January 1 (i.e., YYYY0101), where YYYY is the year of acquisition

    3. SEQ: sequence number to ensure filename is unique

    4. e.g., MESA_ECG_VISIT05_20220101_999.xml
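The minimal sketch below builds a filename following this convention; the function name, zero-padding widths, and default extension are illustrative assumptions rather than part of the protocol.

```python
from datetime import date

def deid_image_filename(study: str, img_type: str, visit: int,
                        acquisition: date, seq: int, ext: str = "xml") -> str:
    """Build a de-identified filename of the form
    STUDYNAME_TYPE_VISITNN_YYYYMMDD_SEQ, with the acquisition date reduced
    to January 1 of the acquisition year."""
    return (f"{study}_{img_type}_VISIT{visit:02d}_"
            f"{acquisition.year}0101_{seq:03d}.{ext}")

# Example: deid_image_filename("MESA", "ECG", 5, date(2022, 8, 17), 999)
# -> "MESA_ECG_VISIT05_20220101_999.xml"
```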

Risk Mitigation

The risks presented by using the de-identification methods detailed in this RFC are as follows:

  1. HIPAA Safe Harbor, while being an accepted standard for de-identification, does not cover all potential identifiers (leaving out potentially identifying attributes such as race, employment, diagnoses, procedures, and treatments). Data de-identified under HIPAA Safe Harbor therefore holds a residual risk of re-identification.

  2. Automated imaging de-identification solutions are not 100% accurate, leaving the potential for small amounts of identifying information to be retained.

Data made available through BDC is provided for research purposes to investigators who should not have ulterior motives to perform re-identification. HIPAA Safe Harbor represents a standard that has been in use for over 20 years, so the risks presented from using that standard are well understood and acceptable by BDC. The risk presented by leakage of identifying information from imaging data can be mitigated through human review of de-identified images to ensure that all identifying information has been removed.

In the event that PHI is discovered in de-identified imaging data in BDC, the affected data shall be taken offline and checked for removal of the offending PHI before being posted again on BDC. In such cases, the data submitter shall be informed of the incident.

Local vs. Cloud-based Image De-Identification

Depending on the capabilities of the de-identification tool and the legal and logistic requirements for access to original identifiable images, de-identification may be done locally at the data-generating site or through a central cloud-based service. Although the latter is often more efficient (semi-automated and scalable), the transfer of identifiable (PHI-containing) images to a central cloud may require agreements between the data provider (submitter) and the de-identification service provider, stipulated through execution of a Data Transfer Agreement (DTA). Details of the image de-identification process to be used will be provided in a future RFC.

BDC Video Content Guidance

Overview

BDC recognizes the importance of multimedia resources for ecosystem users, particularly audio/visual recordings. This document provides guidelines on the program's video content approach. Using these guidelines will ensure users get optimized video experiences, from consistent branding that offers insights into the sources of the videos to best practices in video creation that support learning.

Overview of BDC Videos

To share video content - from the consortium, platforms, and users, as described in the following sections - BDC created a YouTube channel: https://www.youtube.com/channel/UCGkmY5oNK8uFZzT8vV_9KgQ

The BioData Catalyst Coordinating Center (BDC3) has authority (with direction from the NHLBI) to post (or not post), remove, edit, and otherwise change video content on this channel with or without permission from or notice to video creators, owners, or sharers. Feedback about videos on the BDC YouTube channel should be sent to BDCatalystOutreach@nih.gov.

Categories and Organization of Videos

The BDC YouTube Channel hosts three categories of videos based on their sources and/or approval statuses:

  • Consortium-produced / Consortium-approved

  • Platform-generated

  • User-generated

Learn more about each video category below. Note that each category has its own set of standards that must be adhered to when creating and publishing video content, whether the final outlet is the BDC YouTube channel or another channel.

BDC3 is responsible for organizing videos on the BDC YouTube channel, grouping them into playlists it believes will be most beneficial to ecosystem community members. Playlists may include videos from any or all categories of videos. Viewers can determine the category of a video based on the branding (or non-branding) that appears. The additional information about each video category includes video standards that direct video creators on branding for each category of videos.

Consortium-produced / Consortium-approved Videos

Videos in this category are produced by BDC3, or are produced by Platforms or Users that receive approval from the BDC Consortium (select organizations developing and maintaining the ecosystem). These videos contain pre-approved opening and closing BDC animations and sound.

Consortium-produced / Consortium-approved Video Standards

Videos produced by the Consortium, or by Platforms or Users that submit for approval for recognition as a Consortium-approved video, must adhere to the following standards:

Comply with all requirements and, when possible, follow all best practices outlined in Addendum A: Consortium-produced / Consortium-approved Videos Best Practices.

Platforms and users generating videos who wish to submit them for recognition as Consortium-approved must complete the BDC Consortium Video Submission Pre-Approval Application. Submit the form BEFORE producing the video to improve the likelihood that the video receives Consortium approval.

Platform-generated Videos

Videos in this category are produced by one of the BDC platforms to support users' understanding of their platform. These videos are not vetted by BDC3, BDC3 Consortium members, or representatives of other BDC platforms. These videos must open with the creator's platform "Powered by" logo (downloadable from the BDC3 internal consortium website).

Platform-generated Video Standards

Unless a Platform plans to seek Consortium-approval status for a video, platforms should use the following standards in the production and posting of their platform-generated videos:

Producers of Platform-generated videos, like all BDC ecosystem users, are always obligated to protect participant privacy and must follow NIH policies for data protection. Platforms are accountable and may be subject to sanctions if policies are violated.

  • Only produce videos that provide information specific to the Platform's BDC instance.

  • Use the Platform's Powered by logo (and only the Powered by logo) for the YouTube thumbnail image.

  • Videos should open with the following information: “In this video we will [discuss/cover/explore] BioData Catalyst Powered by [platform name] and [task/example].”

  • The YouTube description should include the following language: “This is a BioData Catalyst platform-generated video to support ecosystem users' understanding of the BioData Catalyst Powered by [platform name].” and the link to the NHLBI BioData Catalyst homepage: https://biodatacatalyst.nhlbi.nih.gov/

  • Videos should be uploaded using YouTube's auto-generated captions to support 508 compliance.

  • Once the video is uploaded, email the link to BDCatalystOutreach@nih.gov so BDC3 can make it visible on the BioData Catalyst YouTube channel.

Important Notes

  • Only videos offering information specific to the use of ecosystem Platform instances will be shared on the BDC YouTube channel. Videos that support the use of Platforms but are not specific to BDC instances may be linked from the ecosystem documentation but will not appear on the BioData Catalyst YouTube channel.

  • Platform-generated videos that do not follow the above standards will not be made visible on the BDC YouTube channel.

User-generated Videos

These videos are neither approved nor vetted by BDC, the BDC Consortium, BDC Platforms, or the organizations they represent. The opinions and other content in these videos are those of the video creators and sharers alone. These videos may NOT open or close with BDC branding and may only display BDC branding when capturing images of properties where it already appears (e.g., a screencap of an ecosystem platform instance).

User-generated Video Standards

BDC offers user-generated video tutorials and guides. Unless a user plans to seek Consortium-approval status for a video, BDC requires the following for user-generated videos, their creators, and their sharers:

Producers of user-generated videos, like all BDC ecosystem users, are always obligated to protect participant privacy and must follow NIH policies for data protection. User institutions are accountable and may be subject to sanctions if policies are violated. By submitting a video for inclusion, users attest that the content of the video follows NIH policies for data protection, agree to follow this guidance, and commit to including the following statement in the video description:

“This is a user-generated video and is neither approved nor vetted by NHLBI BioData Catalyst (BDC), the members of the BDC Consortium, or the organizations they represent. For more information about BDC, go to https://biodatacatalyst.nhlbi.nih.gov/. For more BDC videos, go to https://www.youtube.com/channel/UCGkmY5oNK8uFZzT8vV_9KgQ. #BioDataCatalyst”

To share a video, please contact: BDCatalystOutreach@nih.gov

Important Notes

  • User-generated videos that do not follow the above standards will not be made visible on the BioData Catalyst YouTube channel.

  • User-generated videos are just one type of user-contributed content BDC seeks to share. To learn about other kinds of user-generated content BDC seeks, read Contributing User Resources to BDC.

Addendum A: Consortium-produced / Consortium-approved Videos Best Practices

Consortium-produced/Consortium-approved videos must adhere to this addendum. While not required of BDC Platforms and users, BDC encourages them to consider these best practices for the videos they produce.

Gaining Approval: Submitting Your Idea

  • Consider if the video is fulfilling a need/gap (Required): Ensure the video isn't replicating information already available to users.

  • Complete & submit the Video Submission Pre-Approval Form (Required): Pre-approval is required to ensure relevance & consistency.

Planning the video: Considerations before recording

  • Outline the video (Best practice): Consider how info can be presented in a concise & useful manner.

  • Avoid having too much text on slides (Best practice): Slides should be concise; keep text & bullets at a minimum; use images when possible as viewers respond to images more positively than text.

Shooting the video: Best practices

  • Use clear language & explain jargon (Best practice): Simple communications are preferred; many viewers may not speak English as a first language.

Policy compliance: Federal regulations & BDC3 best practices

  • Ensure Section 508 compliance (Required): Subtitles & transcripts are required to ensure equity in access for people with disabilities.

  • Ensure privacy policy compliance (Required): Protect study participants' privacy by not including personally identifiable, confidential, sensitive, or personal health information (e.g., blur screenshots with data).

  • Ensure accessibility, including readability & making slides available for download (Required): For people with disabilities, readability can be essential to a successful user experience.

  • Use appropriate branding according to the BDC Style Guide (Required): Required to create a unified look across the BioData Catalyst ecosystem. Work with your BDC3 contact to get a copy of the style guide.

Technical aspects: Steps after shooting

  • Search Google Trends (Best practice): Search for meaningful keywords for titles, descriptions & tags.

  • Create a meaningful title (Required): The title should be under 66 characters to make it easier for Google to display; make the title engaging & descriptive.

  • Create a meaningful description (Required): Think about the action the user is trying to take & the keywords they might use to find your video.

  • Edit automatic transcription (Required): Transcription is free but likely needs editing; you can make changes to the text & timestamps of your captions.

  • Create cards for interaction (Best practice): Cards are clickable calls to action that take viewers to another video, channel, or site.

  • Create end screens for marketing (Best practice): End screens can be added to a video's last 5-20 seconds to promote other videos, encourage viewers to subscribe, etc.

  • Divide into chapters & create Table of Contents (Best practice): Break up videos into sections (each with an individual preview) to provide more info & context; eases re-playing certain sections.

  • Create thumbnail (Required): A clear & colorful video thumbnail will catch viewers' attention & let them see a quick snapshot of your video as they're browsing.

  • Create meaningful tags, including the required #BioDataCatalyst tag (Required): Tags are descriptive keywords you can add to your video to help viewers find your content; include at least 10 tags.

  • Add links to BDC (Best practice): Where possible, provide links to relevant parts of the BDC ecosystem.

Publishing & promoting: Publicizing & sharing video

  • Share completed videos with BDC3 (Required): Email BDCatalystOutreach@nih.gov with info on accessing the video, a thumbnail image, descriptive tags to include, and the video description.

  • BDC3 sets appropriate privacy settings according to policy, with input from the video creator (If approved): Videos can be Public, Unlisted (link needed), or Private (invite needed; most secure).

  • BDC3 uploads the video to the YouTube channel & adds it to relevant playlists (If approved): Videos can be in multiple playlists but don't need to be in any playlist.

  • Teams and BDC3 develop plans to promote the video, if appropriate (Best practice): Potential options include Facebook, Instagram, LinkedIn, Snapchat, Twitter, Vimeo, WeChat, Pinterest, Flipgrid, etc.

Library maintenance: Keeping an up-to-date catalog

  • BDC3 will prompt teams annually to check videos to ensure continued relevance (Required): Outdated videos could cause viewers to lose confidence in the accuracy of info available on the channel.

Checking Access

You can check your access to data on BDC using the public website or on your specific platform.

Public website

Go to Accessing BioData Catalyst Data and click Check My Access.

BDC powered by Gen3 (BDC-Gen3) platform

Go to BioData Catalyst Powered by Gen3, select NIH Login, then log in using your NIH credentials. Once logged in, select the Exploration tab. From the Data Access panel on the left, make sure Data with Access is selected. Note whether you have access to all the datasets you expect.

Checking data access in Gen3

Data Access

  • Data with Access (default): Displays projects you have access to.

  • Data without Access: Displays data you do not have subject-level access to, but for which summary statistics can be accessed.

  • All Data: Displays all projects, including projects you do not have access to. A lock will appear for data you cannot access.

Request Access

You can request access to data by visiting the dbGaP homepage. For more information on data access, see the Data Accessibility section on the Exploration page.

BDC powered by Seven Bridges (BDC-Seven Bridges) platform

Go to BioData Catalyst powered by Seven Bridges and login. To check your data access:

  1. Click your username in the upper right and select Account Settings.

  2. Select the tab for Dataset Access.

  3. Browse the datasets and note whether you have access to all the datasets you expect.

    • Datasets you have access to will have green check marks.

    • Datasets you do not have access to will have red check marks.

BDC powered by Terra (BDC-Terra) platform

You do not need to check your data access on BDC-Terra. But before submitting a help desk ticket, ensure that you’ve done the following steps:

Establish a link in BioData Catalyst powered by Terra to your eRA Commons/NIH Account and the University of Chicago DCP Framework. To link eRA Commons, NIH, and DCP Framework Services, go to your Profile page in BDC-Terra and log in with your NIH credentials.

Screenshot of NIH account credentials

If you still have issues using particular files or datasets in analyses on BDC-Terra, submit a request to our help desk.

BDC powered by PIC-SURE (BDC-PIC-SURE) platform

You do not need to check your data access on BDC-PIC-SURE. Instead, refer to the Accessing BioData Catalyst Data page, then click Check My Access.

Submitting a dbGaP Data Access Request

Requirements

  • An NIH eRA Commons ID (or appropriate NIH Login) is required for submitting a Data Access Request (DAR). If you do not have an eRA Commons account, you must request one through your institution’s Office of Sponsored Research or equivalent. For more information, refer to Understanding eRA Commons Accounts.

  • To submit a DAR, users must have PI status through their institution. Non-PI users must have a PI they work with that can submit a DAR and add them as a downloader.

Data Access Request Process

Step 1: Go to https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login to log in to dbGaP.

Step 2: Navigate to My Projects.

Step 3: Select Datasets.

You can search by Primary disease type, or if you know the dataset you are interested in, you can use Study lookup.

We want to request HCT for SCD, so we will use the accession number phs002385. As you type the accession number, the numbers will start to auto-populate.

Select the study to add it to the Data Access Request. You can request up to 200 studies that you are interested in accessing.

The user can add additional datasets as needed to answer the research question.

Sample Research Use Statement

Title

Long-term survival and late death after hematopoietic cell transplant for sickle cell disease

Research Use Statement

Our project is limited to the requested dataset. We have no plans to combine it with other datasets.

In 2018, the National Heart Lung and Blood Institute (NHLBI) began work on BioData Catalyst, a shared virtual space where scientists can access NHLBI data and work with the digital objects needed for biomedical research (www.nhlbi.nih.gov/science/biodata-catalyst). This is a cloud-based platform that allows for tools, applications and workflows. It provides secure workspaces to share, store, cross-link and analyze large sets of data generated from biomedical research. BioData Catalyst addresses the NHLBI Strategic Vision objective of leveraging emerging opportunities in data science to facilitate research in heart, lung, blood and sleep disorders. It offers specialized search functions, controlled access to data and analytic tools via programming interfaces, and its interoperability will allow exchange of information with other components of the Data Commons. BioData Catalyst may be accessed by biomedical researchers and the public at large. The first available datasets in BioData Catalyst include data from NHLBI’s Trans-Omics for Precision Medicine (TOPMed) Program and the Cure Sickle Cell Initiative. Rigor in designing and performing scientific research and the ability to reproduce biomedical research are two of the cornerstones of science advancement. In order to test reproducibility of biomedical data available in BioData Catalyst, we accessed NHLBI data from the Cure Sickle Cell Initiative to test and validate the findings of a publication that utilized those data. That report focused on the effect of donor type and transplant conditioning regimen intensity on hematopoietic cell transplant outcomes for sickle cell disease. Hematopoietic cell transplant is potentially curative, yet this treatment is associated with risks for mortality from the treatment procedure. Published reports suggest the life expectancy of adults with sickle cell disease in the United States is shortened by at least two decades compared to the general population. Thus, a fundamental question that is often asked is whether hematopoietic cell transplant over time would offer a survival advantage compared to treatment with disease-modifying agents. In the report [1] that examined factors associated with survival after transplantation, young patients (aged ≤12 years) and patients who received their graft from an HLA-matched sibling had the highest survival. For those without an HLA-matched sibling, the data did not favor one alternative donor type over another [1]. The purpose of the current analyses is two-fold: 1) test and validate a publication that utilized data in the public domain and 2) assess the utility of these data to conduct an independent study. The aim of the latter study was to estimate the conditional survival rates after hematopoietic cell transplantation stratified by time survived since transplantation and to compare all-cause mortality risks to those of an age-, sex-, and race-matched general population in the United States.

Non-technical summary

Investigators in the Cure Sickle Cell Initiative Data Consortium request access to data to examine rigor and reproducibility of data submitted - using a previous publication as reference. Additionally, we will calculate survival after hematopoietic cell transplant by time survived since transplant. For example, we will calculate the 5- and 10-year likelihood of being alive 2 and 5 years after transplantation.

Cloud-Use Statement

The NHLBI-supported BioData Catalyst (www.nhlbiBioDataCatalyst.org) is a cloud-based infrastructure where heart, lung, blood, and sleep (HLBS) researchers can go to find, search, access, share, cross-link, and compute on large scale datasets. It will provide tools, applications, and workflows to enable those capabilities in secure workspaces. The BioData Catalyst will employ Amazon Web Services and Google Cloud Platform for data storage and compute. BioData Catalyst comprises the Data Commons Framework Services (DCFS) hosted and operated by the University of Chicago. DCFS will provide the gold master data reference as well as authorization/authentication and indexing services. The DCFS will also enable security interoperability with the secure workspaces. Workspaces will be provided by FireCloud, hosted and operated by the Broad Institute, Fair4Cures, hosted and operated by Seven Bridges Genomics and PIC-SURE operated by Harvard Medical School. For the NHLBI BioData Catalyst, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) issued to the Broad Institute, University of Chicago and Seven Bridges Genomics as presenting acceptable risk, and therefore the NCI ATO serves as an Interim Authority to Test (IATT) when used by designated TOPMed investigators and collaborators. Additionally, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) for Harvard Medical School.

Cloud Provider Information

Cloud Provider:

NHLBI BioData Catalyst, Private, The NHLBI-supported BioData Catalyst (https://biodatacatalyst.nhlbi.nih.gov/) is a cloud-based infrastructure where heart, lung, blood, and sleep (HLBS) researchers can go to find, search, access, share, cross-link, and compute on large scale datasets. It will provide tools, applications, and workflows to enable those capabilities in secure workspaces.

The NHLBI BioData Catalyst will employ Amazon Web Services and Google Cloud Platform for data storage and compute. The NHLBI BioData Catalyst comprises the Data Commons Framework Services (DCFS) hosted and operated by the University of Chicago. DCFS will provide the gold master data reference as well as authorization/authentication and indexing services. The DCFS will also enable security interoperability with the secure workspaces. Workspaces will be provided by FireCloud, hosted and operated by the Broad Institute, Fair4Cures, hosted and operated by Seven Bridges Genomics and PIC-SURE operated by Harvard Medical School.

For the NHLBI BioData Catalyst, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) issued to the Broad Institute, University of Chicago and Seven Bridges Genomics as presenting acceptable risk, and therefore the NCI ATO serves as an Interim Authority to Test (IATT) when used by designated TOPMed investigators and collaborators. Additionally, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) for Harvard Medical School.

Amazon Web Services (AWS), Commercial

Amazon Web Services (AWS) is a public cloud platform that provides solutions and services such as virtual machines, database instances, storage, and more. We will use the Amazon Elastic Compute Cloud (Amazon EC2), a web service that provides resizable compute capacity, to allocate Amazon Machine Instances (AMIs) in which we will develop the methods and infrastructure necessary to build the NHLBI BioData Catalyst. AWS offers several storage options that work in conjunction with EC2: Amazon Simple Storage Service (Amazon S3), Amazon Elastic Block Store (EBS), and Amazon Elastic File System (Amazon EFS). We expect to use each of these as they provide different capabilities, including persistent storage and direct and networked storage for attaching to running AMI(s). We will use Amazon Virtual Private Cloud (VPC) to provide security and robust networking functionality for these compute resources and Amazon Identity and Access Management (IAM) to control user access to them. AWS offers extensive security and has written a white paper with guidelines for working with controlled-access data sets in AWS, which we will follow (see https://d0.awsstatic.com/whitepapers/compliance/AWS_dBGaP_Genomics_on_AWS_Best_Practices.pdf).

Google Cloud Platform, Commercial

Google Cloud Platform is a public cloud platform that provides solutions and services such as virtual machines, database instances, storage, and more. We will use Google Compute Engine, a service that provides resizable compute capacity, to allocate machine types in which we will develop the methods and infrastructure necessary to build the NHLBI BioData Catalyst. Google Cloud offers several storage options that work in conjunction with Compute Engine: Google Cloud Storage and Google Compute Engine Persistent Disks. We expect to use each of these as they provide different capabilities, including persistent storage and direct and networked storage for attaching to running machine instances. We will use networking technologies based on Google’s Andromeda architecture, which can create networking elements at any level with software. This software-defined networking allows Cloud Platform's services to implement networking features that fit their exact needs, such as secure firewalls for virtual machines in Google Compute Engine. We will use Google Cloud Identity & Access Management to control user access to these compute resources.

Requirements and Login

Requirements

To obtain access to BDC-PIC-SURE, you must have an NIH eRA Commons account. For instructions and to register an account, refer to the eRA website.

Login

After you have created an eRA Commons account, you can log in to BDC-PIC-SURE by navigating to https://picsure.biodatacatalyst.nhlbi.nih.gov and selecting to log in with eRA Commons. You will be directed to the NIH website to log in with your eRA Commons credentials. After signing in and accepting the terms of the agreement on the NIH RAS Information Sharing Consent page, allow the BDC-Gen3 service to manage your authorization.

Upon login, you will be directed to the Data Access Dashboard. This page provides a summary of PIC-SURE Authorized Access, PIC-SURE Open Access, and the studies you are authorized to access.

PIC-SURE Data Access Dashboard

PIC-SURE User Guide

PIC-SURE: Patient Information Commons Standard Unification of Research Elements

The Patient Information Commons: Standard Unification of Research Elements (PIC-SURE) integrates clinical and genomic data to allow users to search, query, and export data at the variable and variant levels. This allows users to create analysis-ready data frames without manually mapping and merging files.

BDC Powered by PIC-SURE (BDC-PIC-SURE) functions as part of the BDC ecosystem, allowing researchers to explore studies funded by the National Heart, Lung, and Blood Institute (NHLBI), whether they have been granted access to the participant level data or not.

BioLINCC Datasets

The BDC ecosystem hosts several datasets from the NIH NHLBI Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC). To access the BioLINCC studies, you must request access through dbGaP even if you have authorization from BioLINCC.

Search and Results

  1. Navigate to https://biodatacatalyst.nhlbi.nih.gov/use-bdc/explore-data/dug/ to access Dug Semantic Search.

  2. Semantic search is a concept-based search engine designed for users to search biomedical concepts, such as “asthma,” “lung,” or “fever,” and the variables related to and/or used to measure them. For example, a search for “chronic pain acceptance” will return a list of related biomedical concepts, such as chronic pain, headaches, neuralgia, or fibromyalgia, each of which can be expanded to display related variables and CDEs. Semantic search can also find variable names and descriptions directly, using synonyms from its knowledge graphs to find search-related variables.

  3. Enter a search term and press “Enter,” or click on the Search button. This will take you to the Semantic Search interface.

TOPMed and TOPMed related datasets

The BDC ecosystem hosts several datasets from the NHLBI Trans-Omics for Precision Medicine (TOPMed) program. The PIC-SURE platform has integrated the clinical and genomic data from all studies listed in the Data Access Dashboard. During the ingestion process, PIC-SURE occasionally ingests phenotypic data for TOPMed studies before the genomic data.

Harmonized Data (TOPMed Harmonized Clinical Variables)

There are limited amounts of harmonized data available at this time. The TOPMed Data Coordinating Center (DCC) curation team has identified 44 variables that are shared across 17 NHLBI studies and normalized the participant values for these variables.

The 44 harmonized variables available are listed in the table in Appendix 2. For more information on this initiative, you can view the additional documentation from the TOPMed DCC GitHub repository or the NHLBI Trans-Omics for Precision Medicine website.

Table of Studies Included in the TOPMed Harmonized Dataset Available in PIC-SURE

Available Data and Managing Data Access

BDC-PIC-SURE has integrated clinical and genomic data from a variety of heart, lung, blood, and sleep related datasets. These include NHLBI Trans-Omics for Precision Medicine (TOPMed) and TOPMed related studies, BioLINCC datasets, and COVID-19 datasets.

View a summary of the data you have access to by viewing the Data Access Table.

This table displays information about the study and associated data, including the full and abbreviated name of the study, study design and focus, the number of clinical variables, participants, and samples sequenced, additional information with helpful links, consent group information, and the dbGaP accession number (or phs number). You are also able to see which studies you are authorized to access in the Access column of the table. For information from dbGaP on submitting a data access request, refer to the Tips for Preparing a Successful Data Access Request documentation. Note that studies with a sickle cell disease focus contain links to the Cure SCi Metadata Catalog for additional information.

You can also check the data you have access to by going to the BioData Catalyst Data Access page on the BDC website and clicking Check My Access.


Studies included in the TOPMed Harmonized Dataset available in PIC-SURE (study name, abbreviation, and dbGaP accession):

  • Atherosclerosis Risk in Communities Study (ARIC): phs000280

  • Cardiovascular Health Study (CHS): phs000287

  • Cleveland Family Study (CFS): phs000284

  • Coronary Artery Risk Development in Young Adults Study (CARDIA): phs000285

  • Epidemiology of Asthma in Costa Rica Study (CRA): phs000988

  • Framingham Heart Study (FHS): phs000007

  • Genetic Epidemiology Network of Arteriopathy (GENOA): phs001238

  • Genetic Epidemiology of COPD (COPDGene): phs000179

  • Genetics of Cardiometabolic Health in Amish (AMISH): phs000956

  • Genome-Wide Association Study of Venous Thrombosis Study (MAYOVTE): phs000289

  • Heart and Vascular Health Study (HVH): phs001013

  • Hispanic Community Health Study - Study of Latinos (HCHS-SOL): phs000810

  • Jackson Heart Study (JHS): phs000286

  • Multi-Ethnic Study of Atherosclerosis (MESA): phs000209

  • Study of Adiposity in Samoans (SAS): phs000914

  • Women’s Health Initiative (WHI): phs000200


Getting Started

CONNECTS Dataset

The BDC ecosystem hosts several datasets from the NIH NHLBI Collaborating Network of Networks for Evaluating COVID-19 and Therapeutic Strategies (CONNECTS) program. These COVID-19 related studies follow the guidelines for implementing common data elements (CDEs) and for de-identifying dates, ages, and free text fields. For more information about these efforts, you can view the CDE Manual and De-Identification Guidance documents on the CONNECTS COVID-19 Therapeutic Trial Common Data Elements webpage.

Table of COVID-19 Studies Included in the CONNECTS Program Available in PIC-SURE

  • A Multicenter, Adaptive, Randomized Controlled Platform Trial of the Safety and Efficacy of Antithrombotic Strategies in Hospitalized Adults with COVID-19 (ACTIV4a): phs002694

  • COVID-19 Positive Outpatient Thrombosis Prevention in Adults Aged 40-80 (ACTIV4b): phs002710

  • Clinical-trial of COVID-19 Convalescent Plasma in Outpatients (C3PO): phs002752

PIC-SURE Open Access vs. PIC-SURE Authorized Access

PIC-SURE provides two ways to search: PIC-SURE Open Access and PIC-SURE Authorized Access. PIC-SURE Open Access enables the user to explore aggregate-level data without any dbGaP data authorizations. PIC-SURE Authorized Access allows the user to explore participant-level data and requires authorization to access at least one study through an active dbGaP Data Access Request (DAR).

Table Comparison of PIC-SURE Open and Authorized Access

PIC-SURE Open Access:

  • Removed stigmatizing variables

  • Data obfuscation

  • Access to aggregate counts

  • Phenotypic variable search

  • Phenotypic variable filtering

PIC-SURE Authorized Access:

  • dbGaP approval to access required

  • Access to aggregate counts

  • Access to participant-level data

  • Phenotypic variable search

  • Phenotypic variable filtering

  • Genomic variable filtering

  • Data retrieval

  • Visualizations

PIC-SURE Open Access

PIC-SURE Open Access allows you to search any clinical variable available in PIC-SURE. Your queries will return obfuscated aggregate counts per study and consent. There are some features specific to PIC-SURE Open Access, which are outlined below.

PIC-SURE Open Access specific features and layout.

A. Stigmatizing Variables Removal: PIC-SURE Open Access data excludes clinical variables that contain potentially sensitive information. These variables are known as stigmatizing variables, which fall into the following categories:

  • Mental health diagnoses, history, and treatment

  • Illicit drug use history

  • Sexually transmitted disease diagnoses, history, and treatment

  • Sexual history

  • Intellectual achievement, ability, and educational attainment

  • Direct or surrogate identifiers of legal status

For more information about stigmatizing variables and the identification process, please refer to the documentation and code on the BioData Catalyst Powered by PIC-SURE Stigmatizing Variables GitHub repository.

B. Data Obfuscation: Because participant-level data are not available in PIC-SURE Open Access, the aggregate counts are obfuscated to further anonymize the data. This means that:

  • If the consent group, study, and/or total participants of the query is between one and nine, the results will be shown as < 10.

  • If the consent group results are between one and nine and the study and/or total participants of the query is greater than 10, the results will be obfuscated by ± 3.

  • Query results that are zero participants will display 0.
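The sketch below is a deliberately simplified, illustrative version of the obfuscation rules above, applied to a single count; the production behavior also depends on the consent-group, study, and total counts considered together.

```python
import random

def obfuscate_count(n: int) -> str:
    """Simplified sketch of the Open Access obfuscation rules: zero displays
    as 0, counts between one and nine display as '< 10', and larger counts
    are perturbed by a random offset of up to +/- 3."""
    if n == 0:
        return "0"
    if 1 <= n <= 9:
        return "< 10"
    return str(n + random.randint(-3, 3))
```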

C. View Filtered Results by Study: The filtered number of participants which match the query criteria is shown broken down by study and consent group. Users can see if they do or do not have access to specific studies.

Use Case: Using PIC-SURE Open Access to Investigate Asthma in Healthy and Obese Adult Populations

In this section, the functionalities of PIC-SURE Open Access will be described in the context of a scientific use case. Specifically, let’s say I am interested in investigating asthma in relation to obesity in adults.

I’m interested in two cohorts: obese adults with a body mass index (BMI) greater than 30 and healthy adults with a BMI between 18.5 and 24.9. However, I have not yet submitted a Data Access Request and therefore am not authorized to access any datasets.

First, let’s explore cohort A: Healthy adults with a BMI between 18.5 and 24.9 in Framingham Heart Study (FHS).

  1. Search for ‘age’.

  2. Apply ‘FHS’ study tag to view only ‘age’ variables within the Framingham Heart Study (phs000007).

  3. Select the variable of interest. You may notice many variables that appear similar. These variables may be located in different datasets, or tables, but contain similar information. Open up the variable information modal by clicking on the row containing the variable of interest to learn more.

    Variable Information modal for ‘age1’ variable from Framingham Heart Study.
  4. Filter to adults only by clicking the filter icon next to the variable. I am interested in adults, so I will set the minimum age to 18, then click “Add filter to query”.

    Adding a filter to the ‘age1’ variable from Framingham Heart Study.
  5. Now, let’s filter to healthy adults with a BMI between 18.5 and 24.9. Similar to before, we will search ‘BMI’. We can narrow down the search results using the variable-level tags by including terms related to our variable of interest (such as ‘continuous’ to view only continuous variables) and excluding out-of-scope terms (such as ‘allergy’). After selecting the variable of interest, we can filter to the desired ranges before adding the filter to our query. Notice how the total number of participants in our cohort changes.

  6. Finally, we will filter for participants who have asthma.

    Adding a filter to the ‘B128’ variable from Framingham Heart Study.
  7. Note the total participant count in the Data Summary.

We can easily modify our filters to explore cohort B: Obese adults with a body mass index (BMI) greater than 30 in Framingham Heart Study.

  1. Edit the BMI filter by clicking the edit icon in the Added Variable Filters section. Change the range to have a minimum of 30 and no maximum.

  2. Note the total participant count in the Data Summary.

We can easily repeat these steps for other studies, such as the Genetic Epidemiology of COPD (COPDGene) study, and create a table like the one below. By comparing these two studies, I can see that COPDGene may be more promising for my research since it contains many more participants in my cohorts of interest than FHS does.

Participants with asthma in each cohort (Cohort A: BMI 18.5-24.9; Cohort B: BMI greater than 30):

  • Framingham Heart Study (FHS): Cohort A: 50 +/- 3; Cohort B: 72 +/- 3

  • Genetic Epidemiology of COPD (COPDGene): Cohort A: 488 +/- 3; Cohort B: 868

I can then use the Request Access button to go directly to the study’s dbGaP page and begin submitting a DAR.

PIC-SURE Features and General Layout

General layout of PIC-SURE search
  1. Search bar: Enter any phenotypic variable, study or table keyword into the search bar to search across studies. Users can also search specific variables by accession number, if known (phs/pht/phv).

  2. Study Tags: Users can filter the results found through their search by limiting to studies of interest or excluding studies.

  3. Variable Tags: Users can filter the results found through their search by limiting to keywords of interest or excluding keywords that are out of scope. For example, a user could filter to categorical variables, variables containing the term ‘blood’, and/or exclude variables containing the term ‘pressure’.

    How are variable tags generated? Each variable has a set of associated tags, which are generated during the PIC-SURE data loading process. These tags are based on information associated with the variable, including the name of the study, the study description, the dataset name, the PIC-SURE data type (continuous or categorical), and the variable description. When you search in PIC-SURE, the tags associated with the matching variables are displayed. Note that tags applicable to less than 5% or more than 95% of the search results are not displayed, since these are not useful for filtering results (a minimal sketch of this display rule follows the list).

  4. Search Results table: View all variables associated with your search term and/or study & variable tags.

  5. Results Panel: Panel with content boxes that describe the cohort based on the variable filters applied to the query.

  6. Data Summary: Displays the total number of participants in the filtered cohort which meet the query criteria. When first opening the Open or Authorized Access page, the number will be the total number of participants that you can access.

  7. Added Variable Filters summary: View all filters which have been applied to the cohort.

  8. Filter Action: Click on the filter icon to filter cohort participants by specific variable values.

  9. Reset button: Allows users to start a new search and query by removing all added filters and clearing all active study and variable tags.
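As referenced in the note on variable tags above, the small sketch below illustrates the stated display rule (hide tags that apply to fewer than 5% or more than 95% of the search results); the function name and data shapes are assumptions for illustration.

```python
def tags_to_display(tag_counts: dict, total_results: int) -> list:
    """Sketch of the tag display rule: keep only tags that apply to between
    5% and 95% of the search results, since tags outside that band are not
    useful for narrowing the result set."""
    lower, upper = 0.05 * total_results, 0.95 * total_results
    return sorted(tag for tag, count in tag_counts.items() if lower <= count <= upper)

# Example: tags_to_display({"blood": 40, "categorical": 99, "rare": 1}, 100)
# -> ["blood"]
```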

Data Organization in PIC-SURE

PIC-SURE integrates clinical and genomic datasets across BDC, including TOPMed and TOPMed related studies, COVID-19 studies, and BioLINCC studies. Each variable is organized as a concept path that contains information about the study, variable group, and variable. Though the specifics of the concept paths are dependent on the type of study, the overall information included is the same.

For more information about additional dbGaP, TOPMed, and PIC-SURE concept paths, refer to Appendix 1.

Table of Data Fields in PIC-SURE

General organization

  • dbGaP-formatted studies: Data are organized using the format implemented by the database of Genotypes and Phenotypes (dbGaP); find more information on the dbGaP data structure here. Generally, a given study will have several tables, and those tables have several variables.

  • Studies not in dbGaP format: Data do not follow the dbGaP format; there are no phv or pht accessions. Data are organized in groups of like variables, when available. For example, variables like Age, Gender, and Race could be part of the Demographics variable group.

Concept path structure

  • dbGaP-formatted studies: \phs\pht\phv\variable name\

  • Studies not in dbGaP format: \phs\variable name

Variable ID

  • dbGaP-formatted studies: phv corresponding to the variable accession number

  • Studies not in dbGaP format: Equivalent to variable name

Variable name

  • dbGaP-formatted studies: Encoded variable name that was used by the original submitters of the data

  • Studies not in dbGaP format: Encoded variable name that was used by the original submitters of the data

Variable description

  • dbGaP-formatted studies: Description of the variable

  • Studies not in dbGaP format: Description of the variable, as available

Dataset ID

  • dbGaP-formatted studies: pht corresponding to the trait table accession number

  • Studies not in dbGaP format: Equivalent to dataset name

Dataset name

  • dbGaP-formatted studies: Name of the trait table

  • Studies not in dbGaP format: Name of a group of like variables, as available

Dataset description

  • dbGaP-formatted studies: Description of the trait table

  • Studies not in dbGaP format: Description of a group of like variables, as available

Study ID

  • dbGaP-formatted studies: phs corresponding to the study accession number

  • Studies not in dbGaP format: phs corresponding to the study accession number

Study description

  • dbGaP-formatted studies: Description of the study from dbGaP

  • Studies not in dbGaP format: Description of the study from dbGaP

Note that there are two data types in PIC-SURE: categorical and continuous data. Categorical variables refer to any variables that have categorized values. For example, “Have you ever had asthma?” with values “Yes” and “No” is a categorical variable. Continuous variables refer to any variables that have a numeric range of values. For example, “Age” with a value range from 10 to 90 is a continuous variable. The internal PIC-SURE data load process determines the type of each variable based on the data.
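As a concrete illustration of the concept path structure for a dbGaP-formatted study, the snippet below splits a path into its components. The pht/phv accessions and variable name shown are placeholders for illustration, not real BDC identifiers.

```python
# Placeholder concept path following the \phs\pht\phv\variable name\ structure
# described above (the pht/phv accessions and variable name are illustrative).
concept_path = "\\phs000007\\pht000033\\phv00007079\\AGE1\\"

# Split on the backslash separators to recover study, dataset, and variable parts.
study_id, dataset_id, variable_id, variable_name = [
    part for part in concept_path.split("\\") if part
]
print(study_id, dataset_id, variable_id, variable_name)
# -> phs000007 pht000033 phv00007079 AGE1
```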

database of Genotypes and Phenotypes (dbGaP)
here

Additional Resources

Video Walkthroughs

Playlist

BioData Catalyst Powered by PIC-SURE YouTube channel

Videos

Introduction to BioData Catalyst Powered by PIC-SURE

Basics: Finding Variables

Basics: Applying a Variable on a Filter

Basics: Editing a Variable Filter

PIC-SURE Open Access: Interpreting the Results

PIC-SURE Authorized Access: Applying a Genomic Filter

PIC-SURE Authorized Access: Add Variables to Export

PIC-SURE Authorized Access: Select and Package Data Tool

PIC-SURE Authorized Access: Variable Distributions Tool

PIC-SURE Open Application Programming Interface (API)

Appendix 1: BDC Identifiers - dbGaP, TOPMed, and PIC-SURE

Table of BDC dbGAP/TOPMed Identifiers

Patient ID

This is the HPDS Patient Num, PIC-SURE HPDS’s internal identifier.

Topmed / Parent Study Accession with Subject ID

  • These are the identifiers used by each team in the consortium to link data.

  • Values must follow this mask <STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID> Eg: phs000007.v30_XXXXXXX

DBGAP_SUBJECT_ID

  • This is a generated ID that is unique to each patient in a study.

  • Controlled by dbGaP.

  • It is not unique across unrelated studies; however, patients can be linked across studies (see SOURCE_SUBJECT_ID).

  • A patient will be assigned the same dbGaP subject ID across related studies. For dbGaP to assign the same dbGaP subject ID, include the two variables SUBJECT_SOURCE and SOURCE_SUBJECT_ID.

  • This identifier is used in all the phenotypic data files and is what is mapped to an HPDS Patient Num (Patient ID). All mapped identifiers are stored in a PatientMapping file in S3. These mappings allow HPDS data to be correlated back to the raw data sets.

SUBJECT_ID

  • This is a generated id that is unique to each patient in a study.

  • Controlled by the submitter of a study.

  • For FHS, phs000007 uses shareid in place of SUBJECT_ID, while phs000974 uses SUBJECT_ID; the values for these two columns are the same, however.

SHARE_ID

  • For FHS phs000007, this was used instead of SUBJECT_ID, but not for FHS phs000974.

SOURCE_SUBJECT_ID

  • This is used internally by dbGaP in conjunction with SUBJECT_SOURCE to allow submitters to associate subjects across studies.

SAMPLE_ID

  • De-identified sample identifier.

  • These are the IDs that link to the molecular data in dbGaP (VCFs, etc.).

Table of PIC-SURE Identifiers

\_Topmed Study Accession with Subject ID\

Generated identifier for TOPMed Studies. These identifiers are a concatenation using the accession name and “SUBJECT_ID” from a study’s subject multi file.

<STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID>

Eg: phs000974.v3_XXXXXXX

\_Parent Study Accession with Subject ID\

Generated identifier for PARENT studies. In most studies this follows the same pattern as the TOPMed Study Accession with Subject ID.

However, Framingham’s parent study phs000007 does not contain a SUBJECT_ID column; the SHAREID column is used instead.

Eg: phs000007.v3_XXXXXXX

\_VCF Sample Id\

This variable is stored in the sample multi file in each dbGaP study.

This is the TOPMed DNA sample identifier. This is used to give each sample/sequence a unique identifier across TOPMed studies.

Eg: NWD123456

Patient ID (not a concept path but exists in data exports)

This is PIC-SURE’s internal Identifier. It is commonly referred to as HPDS Patient num.

This identifier is generated and assigned to subjects when they are loaded. It is not meant for data correlation between different data sources.
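
Identifiers of the form <STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID> described above can also be handled programmatically. The following is a minimal, illustrative Python sketch; the identifier value used here is hypothetical.

def parse_picsure_subject_id(identifier: str) -> dict:
    # Split "phs000974.v3_1234567" (hypothetical value) into accession, version, and subject ID.
    accession_and_version, subject_id = identifier.split("_", 1)
    study_accession, version = accession_and_version.split(".", 1)
    return {
        "study_accession": study_accession,  # e.g. phs000974
        "version": version,                  # e.g. v3
        "subject_id": subject_id,            # e.g. 1234567
    }

print(parse_picsure_subject_id("phs000974.v3_1234567"))
# {'study_accession': 'phs000974', 'version': 'v3', 'subject_id': '1234567'}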

PIC-SURE API Documentation

How to get started with PIC-SURE and the common endpoints you can use to query any resource registered with PIC-SURE

The PIC-SURE v2 API is a meta-API used to host any number of resources exposed through a unified set of generalized operations.

PIC-SURE Repositories:

  • PIC-SURE API: This is the repository for version 2+ of the PIC-SURE API.

  • PIC-SURE Wiki: This is the wiki page for version 2+ of the PIC-SURE API.

  • BioData Catalyst PIC-SURE: This is the repository for the BDC environment of PIC-SURE.

  • PIC-SURE-ALL-IN-ONE: This is the repository for PIC-SURE-ALL-IN-ONE.

Additional PIC-SURE Links:

  • DCPPC Presentation on PIC-SURE as a meta-API

  • Avillachlab-Jenkins Repository: A link to the Avillach Lab Jenkins repository.

  • Avillachlab-Jenkins Dev Release Control: A repository for Avillach Lab Jenkins development release control.

Client Libraries

The following are the collected client libraries for the entire PIC-SURE project.

  • R Client Library

  • Python Client Library

PIC-SURE User Interface

The PIC-SURE User Interface acts as a visual aid for running normal queries of resources through PIC-SURE.

PIC-SURE User Interface Repositories:

  • PIC-SURE HPDS UI: The main High Performance Data Store (HPDS) UI repository.

Additional PIC-SURE User Interface Links:

  • PIC-SURE UI Flow: Links to a google drawing of the PIC-SURE UI flow.

PIC-SURE Auth Micro-App (PSAMA)

The PSAMA component of the PIC-SURE ecosystem authorizes and authenticates all actions taken within PIC-SURE.

PSAMA Repos:

  • PIC-SURE Auth MicroApp Repository

Additional PSAMA Links:

  • PSAMA Core Logic: This is where the core of the PSAMA application is stored in GitHub

High Performance Data Store (HPDS)

HPDS is a datastore designed to work with the PIC-SURE meta-API. It grants researchers fast, dependable access to static datasets and the ability to produce statistics-ready dataframes filtered on any variable they choose at any time.

HPDS Repositories:

  • PIC-SURE HPDS: The main HPDS repository.

  • PIC-SURE HPDS Python Client: Python client library to run queries against a PIC-SURE HPDS resource.

  • PIC-SURE HPDS R Client: R client library to run queries against a PIC-SURE HPDS resource.

  • PIC-SURE HPDS UI: The main HPDS UI repository.

  • HPDS Annotation: This repository describes steps to prepare and annotate VCF files for loading into HPDS.

Data Analysis Using the PIC-SURE API

Once you have refined your queries and created a cohort of interest, you can begin analyzing data using other components of the BDC ecosystem.

What is the PIC-SURE API?

Databases exposed through the PIC-SURE API encompass a wide heterogeneity of architectures and data organizations underneath. PIC-SURE hides this complexity and exposes the different databases in the same format, allowing researchers to focus on the analysis and medical insights, thus easing the process of reproducible science. The API is available in two programming languages, Python and R, allowing investigators to query databases in the same way using either language. The PIC-SURE API tutorial notebooks can be accessed directly on GitHub.

PIC-SURE Access Token

To access the PIC-SURE API, a user-specific token is needed. This is the way the API grants access to individual users to protected-access data. The user token is strictly personal; do not share it with anyone. You can copy your personalized access token by selecting the User Profile tab at the top of the screen.

Here, you can Copy your personalized access token, Reveal your token, and Refresh your token to retrieve a new token and deactivate the old token.

User Profile modal displaying personalized access token.
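
For orientation, the minimal sketch below shows the general shape of connecting to PIC-SURE from a Python notebook using the personalized access token. It follows the pattern used in the PIC-SURE tutorial notebooks, which remain the authoritative reference; the network URL, resource ID, and exact client method names are assumptions and may differ between client versions.

import PicSureClient    # PIC-SURE connection library used in the tutorial notebooks
import PicSureHpdsLib   # HPDS adapter library used in the tutorial notebooks

PICSURE_NETWORK_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"  # assumed endpoint
RESOURCE_ID = "<resource id provided in the tutorial notebook>"                # placeholder
MY_TOKEN = "<paste your personalized access token here>"                       # keep this private

# Connect to the PIC-SURE network, attach the HPDS adapter, and select a resource.
connection = PicSureClient.Client.connect(PICSURE_NETWORK_URL, MY_TOKEN)
adapter = PicSureHpdsLib.Adapter(connection)
resource = adapter.useResource(RESOURCE_ID)

# Search the data dictionary for variables mentioning "asthma" and count the matches.
asthma_variables = resource.dictionary().find("asthma")
print(asthma_variables.count())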

Analysis in the BDC Ecosystem

The PIC-SURE API can be accessed via tutorial notebooks on either BDC-Seven Bridges or BDC-Terra.

To launch one of the analysis platforms, go to the BioData Catalyst website. From the Resources menu, select Services. A list of platforms and services on the BDC ecosystem will be displayed.

From the Analyze Data in Cloud-based Shared Workspaces section, select Launch for your preferred analysis platform.

List of analysis platforms in the Analyze Data in Cloud-based Shared Workspaces section on the BioData Catalyst website.

BDC-Seven Bridges

Jupyter notebook examples in R and python can be found under the Public projects tab by selecting PIC-SURE API.

Navigating to the PIC-SURE API in Seven Bridges Public Projects.

From the Data Studio tab, select an example that fits your research needs. Here, we will select PIC-SURE JupyterLab examples.

Dashboard of the PIC-SURE API on Seven Bridges

This will take you to the PIC-SURE API analysis workspace, where you can view the examples in python. Copy this workspace to your own project to edit or run the code yourself.

Copying the PIC-SURE API Public Project to a workspace from the Data Studio page.

Note The project must have network access to run the PIC-SURE examples on Seven Bridges. To ensure this, go to the Settings tab and select “Allow network access”.

BDC-Terra

To access the Jupyter notebook examples in R and python for the PIC-SURE API, select View Workspaces from the Terra landing page.

BioData Catalyst Powered by Terra landing page

Select the Public tab and search for “PIC-SURE”. Workspaces for both the python and R examples will be displayed. You must clone the workspaces to edit or run the code within them.

Searching for the PIC-SURE API examples in Terra workspaces

PIC-SURE Authorized Access

If you are authorized to access any dbGaP dataset(s), the Authorized Access tab at the top will be visible. PIC-SURE Authorized Access provides access to complete, participant-level data, in addition to aggregate counts, and access to the Tool Suite.

PIC-SURE Authorized Access specific features and layout.

A. Select Variables Action: Click the Select Variables icon to include variables when retrieving data. Users can select variables individually or at the dataset level.

  • Individually select variables: You can individually select variables from two locations:

    • Variable search results: From the search results you can click the data retrieval icon to include the variable in your data retrieval.

    • Variable modal variable data retrieval: The data retrieval icon next to the variable adds the variable to your data retrieval.

  • Select from a dataset or group of variables: In the variable modal the data retrieval icon next to the dataset opens a modal to allow you to select variables from the dataset table or group of variables.

B. Data Summary: In addition to the total number of participants in the filtered cohort, the number of variables the user has selected for data retrieval is also displayed.

There are four concept paths that are automatically included with any data export from PIC-SURE Authorized Access. These fields are listed and described below.

  • Patient ID: Internal PIC-SURE participant identifier. Please note that this field does not link participants between studies and therefore should not be used for data correlation between different data sources or back to the original data files.

  • Parent Study Accession with Subject ID: PIC-SURE generated identifier for parent studies. These identifiers are a combination of the study accession number and the subject identifier.

  • Topmed Study Accession with Subject ID: PIC-SURE generated identifier for TOPMed studies. These identifiers are a combination of the study accession number and the subject identifier.

  • Consents: Field used to determine which groups users are authorized to access from dbGaP. These identifiers are a combination of the study accession number and consent code.

C. Tool Suite: The Tool Suite contains tools that can be used to further explore filtered cohorts of interest. Note that at least one filter must be added to the query before using the Tool Suite.

  • Select and Package Data: Retrieve participant-level data corresponding to your filters and variable selections. Variables selected for data retrieval can be reviewed and modified. To learn more about the options associated with this tool, please refer to the Select and Package Data section.

  • Variable Distributions: View the distributions of query variables based on the filtered cohort. Note that there is a limit to the number of variable distributions that can be viewed at a given time. Additionally, neither genomic variables nor variables associated with an any-record-of filter (e.g., entire datasets) will be graphed.

Select and Package Data

The Select and Package Data tool is used to select and export participant-level data corresponding to your filters and variable selections. There are several options for selecting and exporting the data, which are shown using this tool.

Select and Package Data tool modal with example filters and variables.

In the top left corner of the modal, the number of participants and the number of variables included in the query are shown. These numbers are used to display the estimated number of data points in the export (for example, a cohort of 5,000 participants with 100 selected variables would correspond to roughly 500,000 data points).

Note: Queries with more than 1,000,000 data points will not be exportable.

The table below displays a summary of the variables included in the export. Using the Selected column, variables that have been added to the export can be selected or deselected for the final dataframe.

Note: Variables with filters are automatically included in the export.

The Package Data button in the top right corner is used to prepare the data for export once the variable filters and selections have been finalized.

Select and Package Data tool modal with example filters and variables after clicking “Package Data”.

Once this button is clicked, there are several options to complete the export.

To export into a BDC analysis workspace, the Export to Seven Bridges or Export to Terra buttons can be used. After clicking either of these buttons, a new modal will be displayed with all of the information and instructions needed to complete the export, including your personalized access token and the query ID associated with the dataframe. Additionally, there is the option to Copy Query ID without accessing Seven Bridges or Terra if you wish to use a different analysis platform.

The Export to Seven Bridges option includes a Go to Seven Bridges button, which will open a new tab to the Public PIC-SURE API Project on BDC-Seven Bridges.

Export to Seven Bridges modal.

The Export to Terra option includes a Go to Terra via R button and a Go to Terra via Python button, which will open the Public PIC-SURE API R Examples workspace and the Public PIC-SURE API Python Examples workspace on BDC-Terra, respectively.

Export to Terra modal.

Use Case: Investigating Comorbidities of Breast Cancer in Authorized Access

In this section, the functionalities of PIC-SURE Authorized Access will be described in the context of a scientific use case. Specifically, let’s say I am interested in investigating some comorbidities of breast cancer in women with BRCA1 and BRCA2 gene variants, such as hypertension and COPD.

I already have been authorized to access the Women’s Health Initiative (WHI) study and am interested in a single cohort: women with breast cancer and variants of the BRCA1 and BRCA2 genes. I want to select hypertension-related variables of interest, check the distributions of some variables, and export all the data to an analysis workspace.

First, let’s apply our variable filters for the WHI study.

  1. Search “breast cancer” in Authorized Access.

  2. Add the WHI study tag to filter search results to only variables found within the WHI study.

  3. Filter to participants with breast cancer by clicking the filter icon next to the variable of interest. Select values to filter your variable on and click “Add Filter to Query”.

    Adding a filter to the ‘BREAST’ variable from Women’s Health Initiative Study.
  4. Click the “Genomic Filtering” button to begin a filter on genomic variants.

  5. Select “BRCA1” and “BRCA2” genes of “High” and “Moderate” severity. Click “Apply genomic filter”.

    Applying a genomic filter for BRCA1 and BRCA2 gene variants of “High” and “Moderate” severity.
  6. Now, let’s filter to participants that have and do not have COPD. Similar to before, we will search ‘COPD’. After selecting the variable of interest, we can filter to the desired values before adding the filter to our query. Notice how the total number of participants in our cohort changes.

  7. Search “hypertension”.

  8. Add variables to data export by clicking the select variables icon in the Actions column next to the variable of interest. The icon next to variables selected for export will change to the checkmark icon.

    Adding ‘hypertension’ variables (‘HTNTRT’, ‘HYPT’, ‘HYPTPILL’, and ‘HYPTPILN’) for export from the Women’s Health Initiative Study.
  9. Notice how the number of variables changed in the Data Summary box.

  10. Before we Select and Package the data for export, let’s view the distribution of our participants’ ages to see if we have a normal distribution. Open the Variable Distributions tool in the Tool Suite. Here, we can see the distributions of the two added variable filters: breast cancer (‘BREAST’) and COPD (‘F33COPD’).

    Variable Distributions modal for the Authorized Access example cohort.
  11. Open the Select and Package Data tool in the Tool Suite. The variables shown in this table are those which will be available in your data export; you can remove variables as necessary.

    Select and Package Data modal.
  12. Click “Package Data” when you are ready.

  13. Once the data is packaged, you can select either “Export to Seven Bridges” or “Export to Terra”. Copy over the personalized user token and query ID to use the PIC-SURE API and export your data to an analysis workspace.

Appendix 2: Table of Harmonized Variables

Each harmonized variable is listed below with its data type, units (where applicable), its associated terminology source (UMLS), and, for encoded variables, its permissible values.

  • cac_volume_1 (decimal; cubic millimeters; UMLS): Coronary artery calcium volume using CT scan(s) of coronary arteries.

  • cac_score_1 (decimal; UMLS): Coronary artery calcification (CAC) score using Agatston scoring of CT scan(s) of coronary arteries.

  • cimt_1 (decimal; mm; UMLS): Common carotid intima-media thickness, calculated as the mean of two values: mean of multiple thickness estimates from the left far wall and from the right far wall.

  • cimt_2 (decimal; mm; UMLS): Common carotid intima-media thickness, calculated as the mean of four values: maximum of multiple thickness estimates from the left far wall, left near wall, right far wall, and right near wall.

  • carotid_stenosis_1 (encoded; UMLS): Extent of narrowing of the carotid artery. Values: 0=None||1=1%-24%||2=25%-49%||3=50%-74%||4=75%-99%||5=100%

  • carotid_plaque_1 (encoded; UMLS): Presence or absence of carotid plaque. Values: 0=Plaque not present||1=Plaque present

  • height_baseline_1 (decimal; cm; UMLS): Body height at baseline.

  • current_smoker_baseline_1 (encoded; UMLS): Indicates whether subject currently smokes cigarettes. Values: 0=Does not currently smoke cigarettes||1=Currently smokes cigarettes

  • weight_baseline_1 (decimal; kg; UMLS): Body weight at baseline.

  • ever_smoker_baseline_1 (encoded; UMLS): Indicates whether subject ever regularly smoked cigarettes. Values: 0=Never a cigarette smoker||1=Current or former cigarette smoker

  • bmi_baseline_1 (decimal; kg/m^2; UMLS): Body mass index calculated at baseline.

  • hemoglobin_mcnc_bld_1 (decimal; g/dL = grams per deciliter; UMLS): Measurement of mass per volume, or mass concentration (mcnc), of hemoglobin in the blood (bld).

  • hematocrit_vfr_bld_1 (decimal; % = percentage; UMLS): Measurement of hematocrit, the fraction of volume (vfr) of blood (bld) that is composed of red blood cells.

  • rbc_ncnc_bld_1 (decimal; millions / microliter; UMLS): Count by volume, or number concentration (ncnc), of red blood cells in the blood (bld).

  • wbc_ncnc_bld_1 (decimal; thousands / microliter; UMLS): Count by volume, or number concentration (ncnc), of white blood cells in the blood (bld).

  • basophil_ncnc_bld_1 (decimal; thousands / microliter; UMLS): Count by volume, or number concentration (ncnc), of basophils in the blood (bld).

  • eosinophil_ncnc_bld_1 (decimal; thousands / microliter; UMLS): Count by volume, or number concentration (ncnc), of eosinophils in the blood (bld).

  • neutrophil_ncnc_bld_1 (decimal; thousands / microliter; UMLS): Count by volume, or number concentration (ncnc), of neutrophils in the blood (bld).

  • lymphocyte_ncnc_bld_1 (decimal; thousands / microliter; UMLS): Count by volume, or number concentration (ncnc), of lymphocytes in the blood (bld).

  • monocyte_ncnc_bld_1 (decimal; thousands / microliter; UMLS): Count by volume, or number concentration (ncnc), of monocytes in the blood (bld).

  • platelet_ncnc_bld_1 (integer; thousands / microliter; UMLS): Count by volume, or number concentration (ncnc), of platelets in the blood (bld).

  • mch_entmass_rbc_1 (decimal; pg = picogram; UMLS): Measurement of the average mass (entmass) of hemoglobin per red blood cell (rbc), known as mean corpuscular hemoglobin (MCH).

  • mchc_mcnc_rbc_1 (decimal; g/dL = grams per deciliter; UMLS): Measurement of the mass concentration (mcnc) of hemoglobin in a given volume of packed red blood cells (rbc), known as mean corpuscular hemoglobin concentration (MCHC).

  • mcv_entvol_rbc_1 (decimal; fL = femtoliter; UMLS): Measurement of the average volume (entvol) of red blood cells (rbc), known as mean corpuscular volume (MCV).

  • pmv_entvol_bld_1 (decimal; fL = femtoliter; UMLS): Measurement of the mean volume (entvol) of platelets in the blood (bld), known as mean platelet volume (MPV or PMV).

  • rdw_ratio_rbc_1 (decimal; % = percentage; UMLS): Measurement of the ratio of variation in width to the mean width of the red blood cell (rbc) volume distribution curve taken at +/- 1 CV, known as red cell distribution width (RDW).

  • bp_systolic_1 (decimal; mmHg; UMLS): Resting systolic blood pressure from the upper arm in a clinical setting.

  • bp_diastolic_1 (decimal; mmHg; UMLS): Resting diastolic blood pressure from the upper arm in a clinical setting.

  • antihypertensive_meds_1 (encoded; UMLS): Indicator for use of antihypertensive medication at the time of blood pressure measurement. Values: 0=Not taking antihypertensive medication||1=Taking antihypertensive medication

  • race_1 (encoded; UMLS): Harmonized race category of participant. Values: AI_AN=American Indian_Alaskan Native or Native American||Asian=Asian||Black=Black or African American||HI_PI=Native Hawaiian or other Pacific Islander||Multiple=More than one race||Other=Other race||White=White or Caucasian

  • ethnicity_1 (encoded; UMLS): Indicator of Hispanic or Latino ethnicity. Values: both=ethnicity component dbGaP variable values for a subject were inconsistent/contradictory (e.g. over multiple visits)||HL=Hispanic or Latino||notHL=not Hispanic or Latino

  • hispanic_subgroup_1 (encoded; UMLS): Classification of Hispanic/Latino background for Hispanic/Latino subjects where country or region of origin information is available. Values: CentralAmerican=Central American||CostaRican=from Costa Rica||Cuban=Cuban||Dominican=Dominican||Mexican=Mexican||PuertoRican=Puerto Rican||SouthAmerican=South American

  • annotated_sex_1 (encoded; UMLS): Subject sex, as recorded by the study. Values: female=Female||male=Male

  • geographic_site_1 (encoded; UMLS): Recruitment/field center, baseline clinic, or geographic region.

  • subcohort_1 (encoded; UMLS): A distinct subgroup within a study, generally indicating subjects who share similar characteristics due to study design. Subjects may belong to only one subcohort.

  • lipid_lowering_medication_1 (encoded; UMLS): Indicates whether participant was taking any lipid-lowering medication at blood draw to measure lipids phenotypes. Values: 0=Participant was not taking lipid-lowering medication||1=Participant was taking lipid-lowering medication.

  • fasting_lipids_1 (encoded; UMLS): Indicates whether participant fasted for at least eight hours prior to blood draw to measure lipids phenotypes. Values: 0=Participant did not fast_or fasted for fewer than eight hours prior to measurement of lipids phenotypes.||1=Participant fasted for at least eight hours prior to measurement of lipids phenotypes.

  • total_cholesterol_1 (decimal; mg/dL; UMLS): Blood mass concentration of total cholesterol.

  • triglycerides_1 (decimal; mg/dL; UMLS): Blood mass concentration of triglycerides.

  • hdl_1 (decimal; mg/dL; UMLS): Blood mass concentration of high-density lipoprotein cholesterol.

  • ldl_1 (decimal; mg/dL; UMLS): Blood mass concentration of low-density lipoprotein cholesterol.

  • vte_prior_history_1 (encoded; UMLS): An indicator of whether a subject had a venous thromboembolism (VTE) event prior to the start of the medical review process (including self-reported events). Values: 0=did not have prior VTE event||1=had prior VTE event

  • vte_case_status_1 (encoded; UMLS): An indicator of whether a subject experienced a venous thromboembolism event (VTE) that was verified by adjudication or by medical professionals. Values: 0=Not known to ever have a VTE event_either self-reported or from medical records||1=Experienced a VTE event as verified by adjudication or by medical professionals

  • age_at_* (decimal; years): For each phenotypic value for a given subject, an associated age at measurement is provided. See the DCC harmonized variable documentation for more information.

  • unit_* (encoded): For each harmonized variable, a paired “unit_variable” is provided, whose value indicates where in the documentation to look to find the set of component variables and the algorithm used to harmonize those variables. See the DCC harmonized variable documentation for more information.

Workspace

Overview of Workspaces on BDC-Gen3

When navigating to a Workspace, users are presented with multiple workspace options.

The Gen3 platform offers two workspace environments: Jupyter Notebooks and R Studio.

There are six workspaces:

Virtual machines (VM):

  • Small Jupyter Notebook VM

  • Large Jupyter Notebook Power VM

  • R Studio VM

Pre-made workflow workspaces:

  • Autoencoder Demo

  • CIP Demo

  • Tensorflow-Pytorch.

To start a workspace, select Launch. You will see the following launch loading screen.

Launching a VM can take up to five minutes depending on the size and complexity of the workspace.

Once the VM is ready, the initial screen for the workspace will appear. For scripts and output that need to be saved when the workspace is terminated, store those files in the pd/ directory.

This workspace will persist once the user has logged out of the BDC-Gen3 system. If the workspace is no longer being used, terminate the workspace by selecting Terminate Workspace at the bottom of the window. You will be returned to the Workspace page with all of the workspace options.

For more information about the Gen3 Workspace, refer to the Gen3 documentation.

Query

Overview of the Query page on BDC-Gen3

Overview

The Query page can search and return metadata from either the Flat Model or the Graph Model of a commons. Using GraphQL, these searches can be tailored to filter and return fields of interest for the data sets being queried. These queries can be made immediately after data submission because they run against the model directly.

For more information about how to use the Query page, refer to the Gen3 documentation.
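
For users who prefer scripting such queries, the sketch below shows the same kind of graph-model query issued from Python. It is a minimal sketch under stated assumptions: it uses the open-source gen3 Python SDK (Gen3Auth and Gen3Submission), a credentials.json API key file downloaded from the Profile page, and an assumed BDC-Gen3 endpoint URL; the node and field names in the query are illustrative.

from gen3.auth import Gen3Auth
from gen3.submission import Gen3Submission

COMMONS_URL = "https://gen3.biodatacatalyst.nhlbi.nih.gov"     # assumed BDC-Gen3 endpoint
auth = Gen3Auth(COMMONS_URL, refresh_file="credentials.json")  # API key file from the Profile page
submission = Gen3Submission(COMMONS_URL, auth)

# Illustrative graph-model query: submitter and project IDs for the first 10 subject records.
query_text = """
{
  subject(first: 10) {
    submitter_id
    project_id
  }
}
"""
print(submission.query(query_text))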

Dictionary

Interactive Data Dictionary on BDC-Gen3

Overview

The Dictionary page contains an interactive visual representation of the Gen3 data model. The default graph model view, as pictured below, displays all of the nodes and relationships between nodes in a hierarchical structure. The model further specifies the node types and links between nodes, as highlighted in the legend located at the top right side of the page.

Graph View

Users can click on any of the graph nodes in order to learn more about their respective properties. By clicking on a node, the graph will highlight that specific node and all associated links that connect it to the Program node. A "Data Model Structure" list will also appear on the left side toolbar. This will display the node path required to reach the selected node from the Program node.

When a second node in the path is selected, it will then gray out the other possible paths and only highlight the selected path. It will also change the "Data Model Structure" list on the left side toolbar.

The left side toolbar has two options available:

  • Download templates: Will download the submission files for all of the nodes in the "Data Model Structure" list.

  • Open properties: Will open the node properties in a new pop-up window; an example is displayed in the following screenshot. This option can also be found on the node that was first selected.

This property view will display all properties in the node and information about each property:

  • Property: Name of the property.

  • Type: The type of input for the node. Examples of this are string, integer, Boolean and enumerated values (enum), which are displayed as preset strings.

  • Required: This field will display whether the property is required for the submission of the node into the data model.

  • Description: This field will display further information about the property.

  • Term: This field can be populated with external resources that have further information about the property.

Table View

The Table view is similar to the Properties view, and nodes are displayed as a list of entries grouped by their node category.

Clicking on one of the nodes will open the Properties view of the node.

Dictionary Search

The Dictionary contains a text-based search function that will search through the names of the properties and the descriptions. While typing, a list of suggestions appears below the search bar. Click on a suggestion to search for it.

When the search function is used, it will default to the graph model and highlight nodes that contain the search term. Frames around the node boxes indicate whether the searched word was identified in the name of the node (full line) or in the node's description and properties' names/descriptions (dashed line).

Clicking on one of these nodes, it will only display the properties that have this keyword present in either the property name or the description.

Click Clear Search Result to clear the free text search if needed.

The search history is saved below the search bar in the "Last Search" list. Click on an item here to display the results again.

Current Projects

Overview of current projects hosted on BDC-Gen3, including their dependencies, characteristics, and relationships.

Current Project IDs

A list of current project IDs can be found in the Data tab, under Filters>Project>Project Id. The current project IDs are:

  • Parent

  • TOPMed

  • Open_Access

  • Tutorial

Parent and TOPMed Studies

Distinguishing Between Parent and TOPMed Studies

The Parent and TOPMed study types have been categorized on Gen3 by their Program designation. An example of this designation by Program is presented below.

The Program types can be further identified by whether there is an underscore (_) at the end of the study:

  • Parent studies will include an underscore at the end of the study name.

    • Example: parent-WHI_HMB-IRB_

  • TOPMed studies will not include an underscore at the end of the study name.

    • Example: topmed-BioMe_HMB-NPU

Relationship Between Parent and TOPMed Studies

There are three distinct relationships possible between Parent and TOPMed studies. The first two relationships are streamlined:

  • Parent only: The Parent study does not have a TOPMed counterpart study. This usually means that there are no genomic data, such as WXS (whole exome sequencing) or WGS (whole genome sequencing), located within the study; only phenotypic data.

  • TOPMed only: This TOPMed study does not have a Parent counterpart study. These studies will contain both genomic data, WXS or WGS, and phenotypic data.

  • Parent study with a counterpart TOPMed study: The Parent study will contain the phenotypic data, while the TOPMed study will contain the genomic data. Under dbGaP, these studies would be kept separate from one another and the user would need to create the linkages. In the Gen3 platform, these studies have been linked together under the Parent study, based on the participant IDs found in dbGaP. This linkage supports richer queries and cohort creation because it combines both phenotypic and genomic data.

Parent and TOPMed Study Contents

The most notable difference between the Program categories is the type of hosted data.

Parent

  • Genomic data: None

  • Phenotypic data: As with TOPMed studies, any phenotypic data found within the graph model will only be DCC harmonized variables. The raw phenotypic data from dbGaP can, again, be found in the reference_file node.

TOPMed

  • Genomic data: Available data can include CRAM, VCFs and Cohort-level VCF files

  • Phenotypic data: TOPMed studies without an associated Parent study will include phenotypic data in the data graph by way of DCC harmonized variables. Additionally, raw phenotypic data from dbGaP can be found in the reference_file as tar files that share this common naming scheme: RootStudyConsentSet_phs######.<study_shorthand>.v#.p#.c#.<consent_codes>.tar.gz
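
As an illustration of the naming scheme in the last bullet above, the following minimal Python sketch (using a hypothetical file name) checks whether a reference_file name matches the RootStudyConsentSet pattern and extracts the study accession and shorthand:

import re

# Pattern for: RootStudyConsentSet_phs######.<study_shorthand>.v#.p#.c#.<consent_codes>.tar.gz
PATTERN = re.compile(
    r"^RootStudyConsentSet_(phs\d{6})\.(?P<shorthand>[^.]+)\.v\d+\.p\d+\.c\d+\..+\.tar\.gz$"
)

name = "RootStudyConsentSet_phs001234.EXAMPLE.v1.p1.c1.HMB-IRB.tar.gz"  # hypothetical file name
match = PATTERN.match(name)
if match:
    print("Study accession:", match.group(1))             # phs001234
    print("Study shorthand:", match.group("shorthand"))   # EXAMPLE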

Open_Access - 1000 Genomes project

The 1000 Genomes Project is an international research effort (2008-2015) to establish the most detailed catalogue of human variation and genotype data. On the Gen3 platform, the Program open_access contains:

  • Genotypic data: Available data can include CRAM and VCF files.

  • Phenotypic data: The data graph will contain phenotypic data by way of DCC harmonized variables. Additionally, raw phenotypic data can be found in the reference_file as VCF and TXT files.

Tutorial

This program contains genomic data from 1000 Genomes and synthetic clinical data generated by Terra. The purpose of this dataset is to serve as a genome-wide association study (GWAS) tutorial. GWAS is an approach used in genetics research to associate specific genetic variations with particular diseases. For more information, see the Terra Tutorials.

On the Gen3 platform, the Program tutorial contains:

  • Genotypic data: Available data can include CRAM and VCF files.

  • Phenotypic data: The data graph will contain phenotypic data by way of DCC harmonized variables. Additionally, raw phenotypic data can be found in the reference_file as VCF and GDS files.

Discovering Data Using Gen3

How to login to the BDC Powered by Gen3 (BDC-Gen3) platform and view available genomic and phenotypic data.

Login to the BDC-Gen3 Platform

In order to navigate and access data available on the Gen3 platform, start by visiting the BDC-Gen3 login page. You will need an eRA Commons account as well as access permissions through the Database of Genotypes and Phenotypes (dbGaP). If you are a researcher, log in by selecting NIH Login and using your eRA Commons account. BDC consortia developers can log in using their Google accounts. Make sure to use the correct login method that contains access to your available projects.

Once logged in, your username will appear in the upper right-hand corner of the page. You will also see a display with aggregate statistics for the total number of subjects, studies, aliquots and files available within the BDC platform.

NOTE: These numbers may differ from those displayed in the dbGaP records as they include TOPMed studies as well as the associated parent studies.

Types of Hosted Data

Phenotypic

DCC Harmonized clinical data:

A number of clinical variables have been harmonized by the Data Coordinating Center (DCC) in order to facilitate cross-study analysis. Faceted search over the DCC Harmonized Variables is available via the Exploration page, under the "Data" tab.

Unharmonized clinical data:

Unharmonized clinical files are also available on the Gen3 platform and contain all of the raw phenotypic information for the hosted studies. Unlike the DCC Harmonized Variables, these files are located and searchable under the "Files" tab on the Exploration page.

Genomic

The Gen3 platform hosts genomic data provided by the Trans-Omics for Precision Medicine (TOPMed) program and the 1000 Genomes Project, plus synthetic tutorial data from Terra. At present, these projects include CRAM and VCF files together with their respective index files. Specifically for TOPMed projects, each project will contain at least one multi-sample VCF that comprises all subjects within the consent group. CRAM and VCF files are at the individual level, whereas multi-sample VCFs are at the study consent level.

All files are available under the "Files" tab on the Exploration page. More detailed information on the data currently hosted on the Gen3 platform can be found on the BDC website.

Gen3 Pages

The BDC-Gen3 platform contains five pages described below:

  • Dictionary: An interactive data dictionary display that details the contents and relationships between clinical and biospecimen data

  • Exploration: The facet filter custom cohort creation tool

  • Query: The GraphQL query tool to retrieve specific data within the graph model

  • Workspace: The launch page for Gen3 workspaces that includes Jupyter Notebooks and RStudio

  • Profile: The information page for each user, displaying access and the location for credential file downloads

Profile

Overview of the Profile page on the BDC-Gen3

Profile Page

The Profile page contains two sections: API keys and Project access.

API key(s)

To download large amounts of data, an API key will be required as a part of the gen3-client. To create a key on your local machine, click Create API key, which will activate the following pop-up window:

Click Download json to save the credential file to your local machine. After completion, a new entry will appear in the API key(s) section of the Profile page. It will display the API key key_id and the expiration date (one month after the key creation). The user should delete the key after it has expired. If for any reason a user feels that their API key has been compromised, the key should be deleted before subsequently creating a new one.

Project Access

This section of the Profile page lists the projects and the methods of access for the data within the BDC-Gen3 system. If you do not see access to a specific study, check that you have been granted access within dbGaP. If access has been granted for over a week, contact the BDC Help Desk: bdcat-support@datacommons.io

TOPMed Harmonization Strategies

Seven Bridges

About Seven Bridges

BDC-Seven Bridges offers researchers collaborative workspaces for analyzing genomics data at scale. Researchers can find and analyze the hosted TOPMed studies by using hundreds of optimized analysis tools and workflows (pipelines), by creating their own workflows, or through interactive analysis. The platform provides access to the hosted datasets along with Common Workflow Language (CWL) and GENESIS R package pipelines for analysis. It also enables users to bring their own data for analysis and to work in RStudio and Jupyterlab Notebooks for interactive analysis.

Key Features

  • Private, secure, workspaces (projects) for running analyses at scale

  • Collaboration features with the ability to set granular permissions on project members

  • Direct access to BDC without needing to set up a Google or AWS billing account

  • Access hosted TOPMed studies all in one place and analyze data on the cloud at scale

  • Tools and features for performing multiple-variant and single-variant association studies including:

    • Annotation Explorer for variant aggregations

    • Cloud-optimized Genesis R package workflows in Common Workflow Language

  • Cohort creation by searching phenotype data

    • Use PIC-SURE API for searching phenotype data

    • Search by known dbGaP identifiers

  • Rstudio and Jupyterlab Notebooks built directly into the platform for easy interactive analysis and manipulation of phenotype data

  • Hosted TOPMed data you can combine with your own data on AWS or Google Cloud

  • Billing and administrative controls to help your research funding go further: avoid forgotten instances, abort infinite loops, get usage breakdowns by project.

Analyze Data

Getting Started Guide

Just starting out on BDC-Seven Bridges, and need to get up to speed on how to use the platform? Our experts have created a Getting Started Guide to help you jump right in. We recommend users begin learning how to use BDC-Seven Bridges by following the steps in this guide. After reading this guide, you will know how to create an account on BDC-Seven Bridges, learn the basics of creating a workspace (project), run an analysis, and search through the hosted data.

To read our Getting Started Guide, please refer to our documentation page here.

PFB Files

Overview of the Portable Format for Bioinformatics (PFB) file type

What is a Portable Format for Bioinformatics?

A Portable Format for Bioinformatics (PFB) allows users to transfer both the metadata from the Data Dictionary as well as the Data Dictionary itself. As a result, data can be transferred while keeping the structure from the original source. Specifically, a PFB consists of three parts:

  • A schema

  • Metadata

  • Data

For more information and an in-depth review that includes Python tools for PFB creation and exploration, refer to the PyPFB github page and install the newest version.

Note The following PFB example is a direct PFB export from the tutorial-synthetic_data_set_1 found on BioData Catalyst Powered by Gen3. Due to the large amount of data stored within PFB files, only small sections are shown with breaks (displayed as ... ) occurring in the output.

Schema

A schema is a JSON formatted Data Dictionary containing information about the properties, such as value types, descriptions, and so on.

To view the PFB schema, use the following command:

pfb show -i PFB_file.avro schema

Example Output

...
  {
    "type": "record",
    "name": "gene_expression",
    "fields": [
      {
        "default": null,
        "name": "data_category",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_category",
            "symbols": [
              "Transcriptome Profiling"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "data_type",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_type",
            "symbols": [
              "Gene Expression Quantification"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "data_format",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_format",
            "symbols": [
              "TXT",
              "TSV",
              "CSV",
              "GCT"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "experimental_strategy",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_experimental_strategy",
            "symbols": [
              "RNA-Seq",
              "Total RNA-Seq"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "file_name",
        "type": [
          "null",
          "string"
        ]
      },
      {
        "default": null,
        "name": "file_size",
        "type": [
          "null",
          "long"
        ]
      },
      {
        "default": null,
        "name": "md5sum",
        "type": [
          "null",
          "string"
        ]
      },
      {
        "default": null,
        "doc": "The GUID of the object in the index service.",
        "name": "object_id",
        "type": [
          "null",
          "string"
        ]
      }
...

NOTE: To make the outputs more human-readable, the above information was then piped through the program jq. Example: pfb show -i PFB_file.avro schema | jq

Metadata

The metadata in a PFB contains all of the information explaining the linkage between nodes and external references for each of the properties.

To view the PFB metadata, use the following command:

pfb show -i PFB_file.avro metadata

Example Output

...
    {
      "name": "exposure",
      "ontology_reference": "",
      "values": {},
      "links": [
        {
          "multiplicity": "MANY_TO_ONE",
          "dst": "subject",
          "name": "subjects"
        }
      ],
      "properties": [
        {
          "name": "years_smoked",
          "ontology_reference": "Person Smoking Duration Year Count",
          "values": {
            "source": "caDSR",
            "cde_id": "3137957",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=3137957&version=1.0"
          }
        },
        {
          "name": "years_smoked_gt89",
          "ontology_reference": "Person Smoking Duration Year Count",
          "values": {
            "source": "caDSR",
            "cde_id": "3137957",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=3137957&version=1.0"
          }
        },
        {
          "name": "alcohol_history",
          "ontology_reference": "Alcohol Lifetime History Indicator",
          "values": {
            "source": "caDSR",
            "cde_id": "2201918",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=2201918&version=1.0"
          }
        },
        {
          "name": "alcohol_intensity",
          "ontology_reference": "Person Self-Report Alcoholic Beverage Exposure Category",
          "values": {
            "source": "caDSR",
            "cde_id": "3457767",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=3457767&version=1.0"
          }
        },
...

Data

The data in the PFB are the values for the properties in the format of the Data Dictionary.

To view the data within the PFB, use the following command:

pfb show -i PFB_file.avro

To view a certain number of entries in the PFB file, use the flag -n to designate a number. For example, to view the first 10 data entries within the PFB, use the following command:

pfb show -i PFB_file.avro -n 10

Example Output

...
{
  "id": "6c5e21d5-da76-49a5-9f82-7e3a726d44c6",
  "name": "lab_result",
  "object": {
    "cer451q1": null,
    "oxldl1": null,
    "f81c": null,
    "renins1c": null,
    "cystatc1": null,
    "triglycerides": -0.40415245294570923,
    "glucos1c": 6.5463337898254395,
    "glucos1u": null,
    "ldl": 2.0789523124694824,
    "hdl": 2.7123606204986572,
    "creatin1": null,
    "total_cholesterol": 3.039848566055298,
    "chlcat1c": null,
    
...

    "uabcat1c": null,
    "inslnr1t": 1.8090298175811768,
    "vldlp31c": null,
    
...

    "unit_hematocrit_vfr_bld": null,
    "age_at_total_cholesterol": 80,
    "unit_total_cholesterol": null,
    "age_at_triglycerides": 80,
    "unit_triglycerides": null,
    "age_at_hdl": 80,
    "unit_hdl": null,
    "age_at_ldl": 80,
    "unit_ldl": null,
    
...

    "unit_mcv_entvol_rbc": null,
    "submitter_id": "HG00325_lab_res",
    "state": "validated",
    "project_id": "tutorial-synthetic_data_set_1",
    "created_datetime": "2020-01-27T13:54:06.745386+00:00",
    "updated_datetime": "2020-01-27T13:54:06.745386+00:00"
  },
  "relations": [
    {
      "dst_id": "f4fdda57-80f4-4995-bea2-161c3242c525",
      "dst_name": "subject"
    }
  ]
}
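
Because a PFB is an Avro file, it can also be inspected programmatically rather than through the pfb command line. The sketch below is a minimal example that assumes the third-party fastavro package is installed and that PFB_file.avro is the exported file; it iterates over the stored records and prints each entry's node name, stopping after ten entries (similar to pfb show -n 10).

from fastavro import reader

# Iterate over the records stored in a PFB (Avro) export and print each entry's node name.
with open("PFB_file.avro", "rb") as avro_file:
    for i, record in enumerate(reader(avro_file)):
        print(record.get("name"))   # e.g. "lab_result", matching the `pfb show` output above
        if i >= 9:                  # stop after the first 10 entries
            break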

Comprehensive Analysis Tips

This guide has been prepared to help you with your first set of projects on BDC-Seven Bridges.

This guide aims to help you learn how to take advantage of all the various features and functionality for performing analyses on Seven Bridges and ensure that you can set up your analyses in the most efficient way possible to save time and money.

The following topics are covered in this guide:

  • The basics of working with CWL tools and workflows on the platform.

  • How to specify computational resources on the platform and how to use the default options selected by the execution scheduler.

  • How to run Batch analyses and take advantage of parallelization with scatter.

  • The basics of working with Jupyterlab Notebooks and Rstudio for interactive analysis.

You can refer to the guide here.


Exploration

An explanation for the Exploration page on BDC-Gen3

Using Exploration

The Exploration page located in the upper right-hand section of the toolbar allows users to search through data and create cohorts. The Exploration portal contains a dynamic summary statistics display, as well as search facets leveraging the DCC Harmonized Variables.

Data Accessibility

Data Access panel on the Exploration page.

Users can navigate through data on the Exploration page by selecting any of the three Data Access categories.

  • Data with Access: A user can view all of the summary data and associated study information for studies the user has access to, including but not limited to Project ID, file types, and clinical variables.

  • Data without Access:

    • Locks next to the project ID signify that users do not have subject-level access; they can still search through the available studies, but can only view summary statistics. Users can request access to data by visiting the dbGaP homepage.

    • Projects will also be hidden if the selected cohort contains fewer than 50 subjects (indicated by "50 ↓" and the message "You may only view summary information for this project"; see the example below). In this case, grayed-out boxes and locks both appear; an additional lock means users have no access.

The view on the list of Projects when "Data without Access" is selected.
Example: The variable Ethnicity is hidden once the number of subjects falls below 50.
Lock, grayed-out box, and "50" signify that the number of subjects falls below 50 and users have no access.
  • All Data: Users can view all of the data available in the BDC-Gen3 platform, including studies with and without access. As a result, studies not available to a user will be locked as demonstrated below.

By default, all users visiting the Exploration page will be assigned to Data with Access.

The Data Tab

Exploration page with Data Access displaying the Data with Access.

Under the "Data" tab, users can leverage the DCC harmonized variables to create custom cohorts. When facets are selected and/or updated to cover a desired range of values, the display will reflect the information relevant to the new applied filter. If no facets have been selected, all of the data accessible to the user will be displayed. At this time, a user can filter based on three categories of clinical information:

  • Project: Any specifically defined piece of work that is undertaken or attempted to meet a single investigative question or requirement.

  • Subject: The collection of all data related to a specific subject in the context of a specific experiment.

  • Harmonized Variables: A selection of different clinical properties from multiple nodes, defined by the Consortium.

NOTE: The facet filters are based on the DCC Harmonized Variables, which are a selected subset of clinical data that have been transformed for compatibility across the dbGaP studies. TOPMed studies that do not contain harmonized clinical data at this time will be filtered out when a facet is chosen, unless the no data option is also selected for certain facets.

Exporting Data from the Data Tab

After a cohort has been selected, the user has four different options for exporting the data.

Export

The options for export are as follows:

Four options offered for data export.
  • Export All to Terra: Initiate a Portable Format for Bioinformatics (PFB) export of all clinical data and file GUIDs for the selected cohort to BioData Catalyst powered by Terra. At this time the max number of subjects that can be exported to Terra is 120,000.

  • Export All to Seven Bridges: Initiate a Portable Format for Bioinformatics (PFB) export of all clinical data and file GUIDs for the selected cohort to BioData Catalyst powered by Seven Bridges.

  • Export to PFB: Initiate a PFB export of all clinical data and file GUIDs for the selected cohort to your local storage.

  • Export to Workspaces: Export a manifest to the user's workspace and make the case-associated data files available in the workspace under the /pd/data directory.

NOTE: PFB export times can take up to 60 minutes, but often will complete in less than 10 minutes.

The Files Tab

The Files Tab page.

The Files tab displays study files from the facets chosen on the left-side panel (Project ID, Data Type, Data Format, Callset, and Bucket Path). Each time a facet selection is made, the data summary and displays will update to reflect the applied filters.

Locating Unharmonized Clinical Data

The Files tab also contains files that are either case-independent or project-level. This is important for files that are part of the Unharmonized Clinical Data category under the Data Type field. Unharmonized clinical files are made available in the following data formats:

  • TAR: Contains a complete directory of phenotypic datasets as XML and TXT files that are direct downloads of unharmonized clinical data from dbGaP on a study consent level project.

  • AVRO: These files contain the same unharmonized clinical data from dbGaP as the TAR files, but in the form of a PFB file.

  • XML: These files contain either dictionary or variable reports of the phenotypic datasets that are in the TXT files. These supporting files contain information at the study level, not at the subject level.

  • TXT: These files contain subject-level phenotypic datasets.

NOTE: The unharmonized clinical data sets contain all data from the dbGaP study, but they are not cross-compatible across all studies within BDC.

Exporting/Downloading Data from the Files Tab

Once the user has selected a cohort, there are five options for accessing the files:

Five button options offered for file download or export.
  • Download Manifest: Download the file manifest and use this manifest to download the enlisted data files using the gen3-client.

  • Export to Workspace: The files can be exported to a Gen3 workspace.

  • Export All PFB: Initiate a PFB export of the selected files.

  • Export All to Terra: Initiate a PFB export of the selected files to BioData Catalyst powered by Terra.

  • Export All to Seven Bridges: Initiate a PFB export of the selected files to BioData Catalyst powered by Seven Bridges.

  • GUID Download File Page: Aside from the 5 button options, users can download files by first clicking on the link(s) under the GUIDs column, followed by the Download button in the file information pages (see next section below).

Download files by clicking on the link located under the GUID column.

File Information Page

A user can visit the File Information Page after clicking on any of the available GUID link(s) in the Files tab page. The page will display details such as data format, size, object_id, the last time it was updated and the md5sum. The page also contains a button to download the file via the browser (see below). For files that are 5GB or more, we suggest using the gen3-client.

An example file information page with the Download button.

Free text search for Submitter IDs and File Names

Both the Data and File tabs contain a text-based search function that will initiate a list of suggestions below the search bar while typing.

In the Data tab, Submitter IDs can be searched under the Subject tab.

Free text search of Submitter IDs in Subject on the Data Tab.

In the File tab, File Names can be searched under the File tab.

Free text search of File Names on the File Tab.

Click either a single suggestion or multiple suggestions in the list appearing underneath the search bar to create a cohort and export/download the data. Selections can be clicked again to remove them from the created cohort.

Select multiple suggestions to create an exportable cohort.

GWAS with GENESIS workflows

Overview of the GENESIS pipelines

For researchers interested in performing genotype-phenotype association studies, Seven Bridges offers a suite of tools for both single-variant and multiple-variant association testing on BDC-Seven Bridges. These tools and features include the GENetic EStimation and Inference in Structured samples (GENESIS) pipelines, which were developed by the Trans-Omics for Precision Medicine (TOPMed) Data Coordinating Center (DCC) at the University of Washington. The Seven Bridges team collaborated with the TOPMed DCC to create Common Workflow Language (CWL) tools for the GENESIS R functions, and arranged these tools into five computationally-efficient workflows (pipelines).

These GENESIS pipelines offer methods for working with genotypic data obtained from sequencing and microarray analysis. Importantly, these pipelines have the robust ability to estimate and account for population and pedigree structure, which makes them ideal for performing association studies on data from the TOPMed program. These pipelines also implement linear mixed models for association testing of quantitative phenotypes, as well as logistic mixed models for association testing of binary (e.g. case/control) phenotypes.

Below, we feature our GENESIS Benchmarking Guide to assist users in estimating cloud costs when running GENESIS workflows on BDC-Seven Bridges.

GENESIS Benchmarking Guide

Introduction

The objective of the GENESIS Benchmarking Guide is to instruct users on the drivers of cloud costs when running GENESIS workflows on BDC-Seven Bridges.

For all GENESIS workflows, the Seven Bridges team has performed comprehensive benchmarking analysis on Amazon Web Services (AWS) and Google Cloud Platform (GCP) instances for different scenarios:

  • 2.5k samples (1000G data)

  • 10k samples (TOPMed Freeze5 data)

  • 36k samples (TOPMed Freeze5 data)

  • 50k samples (TOPMed Freeze5 data)

The resulting execution times, costs, and general advice for running GENESIS workflows can be found in the sections below. In these sections, each GENESIS workflow is described, followed by the benchmarking results and some tips for implementing that workflow from the Seven Bridges Team. Lastly, we included a Methods section to describe our approach to benchmarking and interpretation for your reference.

The contents of this guide are arranged as follows:

  • Introduction

  • Helpful Terms to Know

  • GENESIS VCF to GDS

  • GENESIS Null model

  • GENESIS Single Association testing

  • GENESIS Aggregate Association testing

  • GENESIS Sliding window Association testing

  • General considerations

The results of the benchmarking analysis described herein are available for download as CSV files (Benchmarking: VCF to GDS, Null Model, Single Test, Aggregate Test, and Sliding Window). It may prove useful to have these files open for reference when reading through this guide.

Helpful terms to know

Before continuing on to the benchmarking results, please familiarize yourself with the following helpful terms to know:

  • Tool: Refers to a stand-alone bioinformatics tool or its Common Workflow Language (CWL) wrapper that is created or already available on the platform.

  • Workflow/Pipeline (interchangeably used): Denotes a number of tools connected together in order to perform multiple analysis steps in one run.

  • App: Stands for a CWL wrapper of a tool or a workflow that is created or already available on the platform.

  • Task: Represents an execution of a particular tool or workflow on the platform. Depending on what is being executed (tool or workflow), a single task can consist of only one tool execution (tool case) or multiple executions (one or more per each tool in the workflow).

  • Job: This refers to the “execution” part of the “Task” definition (see above). It represents a single run of a single tool found within a workflow. If you are coming from a computer science background, you will notice that this is quite similar to the common understanding of the term “job” in computing (see the Wikipedia entry), except that here a “job” is a component of a bigger unit of work called a “task”, rather than the other way around, as may be the case in other contexts. To further illustrate what a job means on the platform, we can visually inspect jobs after a task has been executed using the View stats & logs panel (button in the upper right corner of the task page):

Figure 1. The jobs for an example run of RNA-Seq Quantification (HISAT2, StringTie) public workflow

The green bars under the gray ones (apps) represent the jobs (Figure 1). As you can see, some apps (e.g. HISAT2_Build) consist of only one job, whereas others (e.g. HISAT2) contain multiple jobs that are executed simultaneously.

GENESIS VCF to GDS

In this section, we detail the process of converting a VCF to a GDS via a GENESIS workflow. This VCF to GDS workflow consists of 3 steps:

  • Vcf2gds

  • Unique variant id

  • Check GDS

The first two steps are required, while the last one is optional. When included, the Check GDS step is the biggest cost driver in these tasks.

The Check GDS tool is a QC step that checks whether the final GDS file contains all variants present in the input VCF/BCF. This step is computationally intensive, and its execution time can be 4-5 times longer than the rest of the workflow. Failures of this step are also something we experience very rarely. In our results, there is a Check GDS column which indicates whether the Check GDS step was performed.

We advise anyone using this workflow to consider the results in the table below, because the differences in execution time and price with and without this check are considerable. The final decision on which approach to use depends on the resources available (budget and time) and the preference for including or excluding the optional QC step.

In addition, CPU/job and Memory/job parameters have direct effects on execution time and the cost of the GENESIS VCF to GDS workflow. A combination of these parameters defines the number of jobs (files) that will be processed in parallel.

For example:

If a task is run on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 4GB/job the number of jobs run in parallel will be min{36/1,72/4}=18. If a task is run on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 8GB/job the number of jobs run in parallel will be min{36/1,72/8}=9. In this example, the second case would take twice as long as the first.
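The same arithmetic can be written as a small helper when drafting a task. This is an illustrative sketch only; the instance figures are taken from the example above.

def parallel_jobs(instance_cpus, instance_mem_gb, cpu_per_job, mem_gb_per_job):
    # Jobs that fit on one instance: the tighter of the CPU limit and the memory limit.
    return min(instance_cpus // cpu_per_job, int(instance_mem_gb // mem_gb_per_job))

# c5.9xlarge: 36 CPUs and 72 GB RAM
print(parallel_jobs(36, 72, 1, 4))  # 18 jobs in parallel
print(parallel_jobs(36, 72, 1, 8))  # 9 jobs in parallel, so roughly double the runtime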

The following conclusions were drawn from performed benchmarking:

  • Benchmarking showed that the most suitable AWS instances for this workflow are c5 instances.

  • For all tasks that we ran (from 2.5k up to 50k samples), 1 CPU and 4GB per job were sufficient.

  • For small sample sizes (up to 10k samples), tasks can be run on spot/preemptible instances to additionally decrease the cost.

  • For samples up to 10k, 2GB per job could suffice; however, if the Check GDS step is also run, execution time and price will not be much lower, because the CPU and Memory per job inputs apply only to the Vcf2gds step and not to the whole workflow.

  • We recommend using VCF.GZ as input files rather than BCF, as the conversion process cannot be parallelized when using BCFs.

  • If you have more files to convert (e.g. multiple chromosomes), we recommend running one analysis with all files as an input, rather than batch analysis with separate tasks for each file.

GENESIS Null model

The GENESIS Null model workflow is not computationally intensive and it is relatively low-cost compared to other GENESIS workflows. For that reason, we present results that we obtained without any optimization below:

The null model can be fit with relatedness matrices (i.e. mixed models) or without relatedness matrices (i.e. simple regression models). If a relatedness matrix is provided, it can be sparse or dense. Tasks with a dense relatedness matrix are the most expensive and take the longest to run. For the Null model workflow, the available AWS instances appear to be more suitable than the Google instances available on the platform.
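The cost difference between dense and sparse relatedness matrices is largely a memory story. The GENESIS workflows themselves fit the null model in R, but the back-of-the-envelope arithmetic below (a Python sketch with illustrative numbers) shows why a dense kinship matrix becomes expensive at large sample sizes and why sparsifying it helps.

import numpy as np
from scipy import sparse

n_samples = 50_000
dense_gb = n_samples * n_samples * 8 / 1e9  # float64 kinship matrix held fully in memory
print(f"Dense {n_samples} x {n_samples} relatedness matrix: ~{dense_gb:.0f} GB")  # ~20 GB

# A sparse representation stores only the non-zero relatedness values.
rng = np.random.default_rng(0)
toy = rng.random((1000, 1000))
toy[toy < 0.95] = 0.0                      # toy threshold: treat most pairs as unrelated
toy_sparse = sparse.csr_matrix(toy)
sparse_mb = (toy_sparse.data.nbytes + toy_sparse.indices.nbytes + toy_sparse.indptr.nbytes) / 1e6
print(f"Toy example: dense {toy.nbytes / 1e6:.1f} MB vs sparse {sparse_mb:.1f} MB")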

GENESIS Single Association testing

Results of the GENESIS Single Association Testing workflow benchmarking can be seen in the table above. Some important notes to consider when using this workflow:

  • Null model type effect: The main cost and duration driver in these tasks is the null model type. The null model can be fit without a relatedness matrix (i.e. simple regression models), or with a relatedness matrix that can be sparse or dense (i.e. mixed models). The table above shows that task cost and execution time increase with the null model type (dense > sparse > none). Differences between the model with a dense matrix and the model with a sparse matrix are significant, driven by both the increased CPU time and the memory required to use a dense matrix. Our advice is to check the relatedness matrix before fitting the model and transform it to sparse if it is dense, especially if the sample size is large. The null model type directly influences further parameter choices, such as instance type, CPU per job, and Memory per job.

  • Instance type: Benchmarking showed that the most appropriate instance type is an AWS r5.x instance. Instance type selection is especially important when performing analyses with many samples (30k participants and above). For tasks with up to 30k samples, r5.4xlarge instances can be used, and r5.12xlarge when more participants are included. In addition, if a single association test is performed with a dense null model, r5.12xlarge or r5.24xlarge instances should be chosen. When it comes to Google instances, results can be seen in the above table as well. Since there often isn't a Google instance that is an exact equivalent of the AWS instance, we recommend choosing the most appropriate Google instance (matching the chosen AWS instance) from the list of available Google instances on BDC.

  • CPU and memory per job: The CPU and memory per job input parameters determine the number of jobs run in parallel on one instance. For example:

    • If a task is run on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 4GB/job, the number of jobs run in parallel will be min{36/1,72/4}=18.

    • If a task is run on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 8GB/job, the number of jobs run in parallel will be min{36/1,72/8}=9.

    The bottleneck in single-variant association testing is memory, so we suggest carefully considering this parameter together with the instance type. Workflow defaults are 1 CPU/job and 8GB/job. The table above shows that these tasks require much more memory than CPU, which is why r5 instances are the most appropriate in these cases (see the sketch after this list). The table additionally shows that tasks where the null model is fit with a dense relatedness matrix require the most memory per job. This parameter also depends on the number of participants included in the analysis.

  • Maximum number of parallel instances: The default number of parallel instances is 8. Changing this number mainly affects execution time: tasks with more parallel instances will finish faster. This parameter can be set in Execution settings when drafting a task. However, each user has a limited total number of parallel instances, and reserving a large number of parallel instances for one task reduces the number of different tasks that can run at the same time.

  • Spot instances: If a task is expected to finish within a few hours, it can be run on spot instances to reduce the execution cost. However, losing a spot instance means the task is rerun on on-demand instances, which can end up costing more than running on on-demand instances from the beginning. That is why spot instances are generally only suitable for short tasks.
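To see why memory, rather than CPU, usually dictates the instance choice here, compare a compute-optimized and a memory-optimized instance for memory-hungry jobs. The sketch below reuses the parallel-jobs arithmetic from the VCF to GDS section; the r5.4xlarge figures (16 CPUs, 128 GB RAM) are AWS specifications at the time of writing and should be checked against current listings.

def parallel_jobs(instance_cpus, instance_mem_gb, cpu_per_job, mem_gb_per_job):
    # Jobs that fit on one instance: the tighter of the CPU limit and the memory limit.
    return min(instance_cpus // cpu_per_job, int(instance_mem_gb // mem_gb_per_job))

# Default single-variant settings: 1 CPU/job and 8 GB/job.
print(parallel_jobs(36, 72, 1, 8))    # c5.9xlarge: only 9 jobs, most CPUs sit idle
print(parallel_jobs(16, 128, 1, 8))   # r5.4xlarge: 16 jobs on a smaller, memory-optimized instance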

GENESIS Aggregate Association testing

GENESIS Aggregate association testing can be performed using burden, SKAT, SMMAT, fastSKAT and SKAT-O tests. Our general conclusions are as follows:

  • Null model selection: As in single-variant association testing, the main cost and duration driver in these tasks is the null model type. The null model can be fit without a relatedness matrix (i.e. simple regression models), or with a relatedness matrix that can be sparse or dense (i.e. mixed models). The table above shows that task cost and execution time increase with the null model type (dense > sparse > none). Differences between the model with a dense matrix and the model with a sparse matrix are significant, driven by both the increased CPU time and the memory required to use a dense matrix. Our advice is to check the relatedness matrix before fitting the model and transform it to sparse if it is dense, especially if the sample size is large. The null model type directly influences further parameter choices, such as instance type, CPU per job, and Memory per job.

  • Instance type: Benchmarking showed that the most appropriate instance type is an AWS r5.x instance. The majority of the tasks can be run on r5.12xlarge instances, or on r5.24xlarge instances when the null model includes a dense relatedness matrix. Results for Google instances can be seen in the above table as well. Since the Google instance options often do not have an exact AWS equivalent, we selected the closest match from the list of available Google instances on BDC.

  • CPU and memory per job: CPUs and memory per job input parameters determine the number of jobs to be run in parallel on one instance. For example:

    • If a task is run on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 4GB/job, the number of jobs run in parallel will be min{36/1,72/4}=18.

    • If a task is run on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 8GB/job, the number of jobs run in parallel will be min{36/1,72/8}=9.

    Different tests can require different computational resources.

As can be seen, for small sample sizes up to 10GB per job can be sufficient for tasks to complete successfully. One exception is when running a task with a null model fit with a dense relatedness matrix, where approximately 36GB/job is needed. With 50k samples, jobs require 70GB. Details can be seen in the table above. In addition to sample size, the memory required is determined by the number of variants included in each aggregation unit, as all variants in an aggregation unit are analyzed together.

SKAT and SMMAT tests are similar when it comes to CPU and Memory per job requirements. Roughly, these tests require 8GB/CPU, and details for different task configurations can be seen in the table below:

  • Maximum number of parallel instances: The default number of parallel instances is 8. Changing this number mainly affects execution time: tasks with more parallel instances will finish faster. This parameter can be set in Execution settings when drafting a task. However, each user has a limited total number of parallel instances, and reserving a large number of parallel instances for one task reduces the number of different tasks that can run at the same time.

  • Spot instances: If a task is expected to finish within a few hours, it can be run on spot instances to reduce the execution cost. However, losing a spot instance means the task is rerun on on-demand instances, which can end up costing more than running on on-demand instances from the beginning. That is why spot instances are generally only suitable for short tasks.

GENESIS Sliding window Association testing

GENESIS Sliding window association testing can be performed using burden, SKAT, SMMAT, fastSKAT and SKAT-O tests. When running a sliding window test, it is good to know the following:

  • Null model selection: As in the previous tests, the main cost and duration driver in these tasks is the null model type. The null model can be fit without a relatedness matrix (i.e. simple regression models), or with a relatedness matrix that can be sparse or dense (i.e. mixed models). The table below shows that task cost and execution time increase with the null model type (dense > sparse > none). Differences between the model with a dense matrix and the model with a sparse matrix are significant, driven by both the increased CPU time and the memory required to use a dense matrix. Our advice is to check the relatedness matrix before fitting the model and transform it to sparse if it is dense, especially if the sample size is large. The null model type directly influences further parameter choices, such as instance type, CPU per job, and Memory per job.

  • Instance type: Benchmarking showed that analyses with a sparse relatedness matrix, or without a relatedness matrix, can be completed on a c5.9xlarge AWS instance. For analyses with a dense relatedness matrix included in the null model and with 50k samples or more, r5.12xlarge instances can be used. Also, it is important to note that in this case increasing the instance size (for example, from c5.9xlarge to c5.18xlarge) will not lead to shorter execution time; it can even have the opposite effect. By increasing the size of the instance we also increase the number of jobs running in parallel. At some point there will be many jobs running in parallel and accessing the same memory space, which can reduce performance and increase task duration. Results for Google instances can be seen in the respective tables. Since the Google instance options often do not have an exact AWS equivalent, we selected the closest match from the list of available Google instances on BDC.

  • CPU and memory per job: When running a sliding window test it is important to ensure that the CPU resources of the instances we are using are not overused. Avoiding 100% CPU usage in these tasks is crucial for fast execution. For that reason, it is good to decrease the number of jobs running in parallel on one instance. The number of parallel jobs is highlighted in the summary table, as it is an important parameter for the execution of this task. We can choose different CPU and memory inputs as long as that combination gives us an appropriate number of parallel jobs. This is an example of how the number of parallel jobs is calculated:

    • If we run our task on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 4GB/job, the number of jobs run in parallel will be min{36/1,72/4}=18.

    • If we run our task on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 8GB/job, the number of jobs run in parallel will be min{36/1,72/8}=9.

    For details on the number of jobs that we’ve set for each tested case please refer to the table below.

  • Window size and window step: The default values for these parameters are 50kb and 20kb (kilobases), respectively. Please keep in mind that since the sliding window algorithm considers all bases inside the window, the window length and the number of windows directly affect the execution time and the price of the task (see the sketch after this list).

  • Maximum number of parallel instances: The default number of parallel instances is 8. Changing this number mainly affects execution time: tasks with more parallel instances will finish faster. This parameter can be set in Execution settings when drafting a task. However, each user has a limited total number of parallel instances, and reserving a large number of parallel instances for one task reduces the number of different tasks that can run at the same time.

  • Spot instances: If a task is expected to finish within a few hours, it can be run on spot instances to reduce the execution cost. However, losing a spot instance means the task is rerun on on-demand instances, which can end up costing more than running on on-demand instances from the beginning. That is why spot instances are generally only suitable for short tasks.
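As a rough illustration of how window size and step translate into the amount of work, the sketch below counts the windows covering a region. All numbers are illustrative, and the chromosome 1 length is approximate.

import math

def n_windows(region_length_kb, window_kb=50, step_kb=20):
    # Number of sliding windows needed to tile a region of the given length.
    if region_length_kb <= window_kb:
        return 1
    return 1 + math.ceil((region_length_kb - window_kb) / step_kb)

chr1_kb = 248_956                      # approximate length of chromosome 1 in kb
print(n_windows(chr1_kb))              # roughly 12,400 windows with the default 50 kb / 20 kb
print(n_windows(chr1_kb, 100, 50))     # larger windows and steps mean fewer, longer jobs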

Benchmarking results:

General considerations

In this guide, we have highlighted the main cloud cost and execution time drivers when running GENESIS analyses. Please keep in mind that when running an analysis, users may incur additional costs due to factors such as task failures or the need to rerun an analysis. When estimating cloud costs for your study, please include a cost buffer for these two factors as well.

To prevent task failures, we advise you to carefully read the app descriptions, and if you have any questions or doubts, contact our Support Team at support@sevenbridges.com. Also, using memoization can help reduce costs when rerunning a task after an initial failure.

Methods

Throughout this document, it is important to note that the figures in the tables above are intended to be informative as opposed to predictive. The actual costs incurred for a given analysis will also depend on the number of samples and number of variants in the input files. For our analysis described above, we selected 1000G and TOPMed Freeze5 data as inputs. For TOPMed Freeze5, we selected cohorts of 10k, 36k, and 50k subjects. The benchmarking results for the selected tasks would vary if the cohorts were defined differently.

The selection of instances is another factor that can lead to variation in results for a given analysis. The results depend heavily on the user's ability to choose an appropriate instance and use its resources optimally. For that reason, if two users run the same task with different configurations (different instance type, CPU/job, and/or RAM/job parameters), the results may vary.

The results (execution time and cost) are directly connected to the CPU per job and Memory per job parameters. Different resources dedicated to a given job will result in a different number of total jobs run on the selected instance, and thus in a different execution time and cost. For that reason, setting up a task draft properly is crucial. In this document, we provide details on what we consider optimal CPU and Memory per job inputs for TOPMed Freeze5 and 1000G data. These numbers can be used as a good starting point, bearing in mind that each study has its own unique requirements.

For both Single and Sliding Window Association Testing:

Please note that results for single and sliding window tests are approximations. To avoid unnecessary cloud costs, we performed both single and sliding window tests only on 2 chromosomes. These results were the basis on which we assessed the cost and execution time for the whole genome.

The following is an explanation of the procedure we applied for the GENESIS Single Association testing workflow and TOPMed Freeze5 data (a similar procedure applies to the GENESIS Sliding window Association testing workflow):

In the GENESIS Single Association testing workflow, the variants are tested in segments. The number of segments that the workflow will process is the ratio of the total genome length to the segment length (one of the input parameters of this workflow). For example, if we are testing a whole genome of 3,000,000,000 base pairs and use the default segment length of 10,000kb, we will have 300 segments. Furthermore, if we use the default value for the maximum number of parallel instances, which is 8, we can approximate the average number of segments that each instance processes: 37.

The GENESIS Single Association testing workflow can process segments in parallel (the processing of one segment is a job). The number of parallel segments (jobs) depends on the CPU per job and Memory per job parameters, and can be calculated as described previously. For example, if we are running the analysis on a c5.9xlarge instance (36 CPUs and 72GB RAM) with 1 CPU/job and 4GB/job, we will have 18 jobs in parallel. Knowing that each of our 8 instances processes approximately 37 jobs in total, with 18 running in parallel, each instance will run approximately 2 cycles of jobs. Furthermore, knowing the average job length, we can approximate the running time of one instance: 2 cycles multiplied by the average job length. Since the instances run in parallel, this is also the total execution time. Lastly, when the execution time is known, we can calculate the task price: the number of instances multiplied by the execution time (in hours), multiplied by the instance price per hour. For each tested scenario in our benchmarking analysis, we obtained the average job length from the corresponding tasks that included 2 chromosomes, such that the total number of jobs was above 30.
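The approximation above can be written out as a short calculation. Everything below is illustrative: the genome length, average job length, and instance price are placeholders to be replaced with values from your own benchmarking runs and current cloud pricing.

genome_length_kb = 3_000_000_000 / 1000          # ~3 Gb genome expressed in kb
segment_length_kb = 10_000                       # default segment length
segments = genome_length_kb / segment_length_kb  # ~300 segments

max_parallel_instances = 8                       # workflow default
segments_per_instance = segments / max_parallel_instances  # ~37 segments each

instance_cpus, instance_mem_gb = 36, 72          # c5.9xlarge
cpu_per_job, mem_gb_per_job = 1, 4
jobs_in_parallel = min(instance_cpus // cpu_per_job, instance_mem_gb // mem_gb_per_job)  # 18

cycles = segments_per_instance / jobs_in_parallel  # ~2 waves of jobs per instance
avg_job_hours = 1.0                              # placeholder: measured on a 2-chromosome run
instance_price_per_hour = 1.50                   # placeholder: check current on-demand pricing

execution_hours = cycles * avg_job_hours
task_cost = max_parallel_instances * execution_hours * instance_price_per_hour
print(f"~{execution_hours:.1f} h and ~${task_cost:.2f} for this configuration")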

Transferring Files Between Seven Bridges and Terra

Instructions on transferring files between BDC Powered by Seven Bridges (BDC-Seven Bridges) and BDC Powered by Terra (BDC-Terra)

Introduction

This tutorial guides users through the process of transferring files between the two workspace environments of NHLBI BioData CatalystⓇ (BDC): BDC-Seven Bridges and BDC-Terra.

Most researchers select one of the workspaces as their primary analysis environment, and their labmates and collaborators typically work with them on the same workspace environment. However, there are cases where some collaborators work on Seven Bridges and others work on Terra. In this case, researchers need to share data files between the two workspaces to facilitate collaboration. When researchers run analyses on Seven Bridges, the results, or derived data, are only available on Seven Bridges. Likewise, when researchers run analyses on Terra, the results are only available on Terra. This tutorial provides step-by-step guidance on how to share derived data between the workspace environments. These instructions can also be used to share private data that has been uploaded to Seven Bridges or Terra.

Both open access data and controlled access data can be shared across workspace environments. Importantly, if a researcher intends to share controlled access data, they must ensure that all recipients have the necessary dbGaP permissions for those files. In some cases, this may mean the researchers must be listed as collaborators on their respective dbGaP applications. These instructions are intended for sharing files under 1 terabyte (TB) in size. If you want to share data larger than 1 TB, contact the BioData Catalyst Help Desk to discuss your use case.

It is not recommended to transfer large amounts of data between cloud providers or regions; for example, transferring data from AWS to Google costs approximately $100/TB.
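As a back-of-the-envelope check before transferring, the quoted ~$100/TB figure translates as follows (a sketch; always confirm current egress pricing with the cloud providers):

def egress_cost_usd(size_gb, usd_per_tb=100):
    # Rough cross-cloud egress estimate using the ~$100/TB figure quoted above.
    return size_gb / 1024 * usd_per_tb

print(f"${egress_cost_usd(250):.2f}")   # moving ~250 GB from AWS to Google: roughly $24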

Initial Considerations

Platform Accounts

The first consideration is platform accounts. Moving data between Seven Bridges and Terra is currently a manual process and requires that one of the researchers involved in sharing has an account on both platforms. It is recommended that the recipient of the shared data is the person to have accounts on both Seven Bridges and Terra.

Let’s consider an example case: Sebastian who is working on Seven Bridges and Teresa who is working on Terra. If Sebastian wants to share data with Teresa so that she can use the data on Terra, Teresa first needs to set up an account on Seven Bridges. Now Teresa has an account on Terra and an account on Seven Bridges. Sebastian will share the data with Teresa on Seven Bridges by adding her as a member of the project with the data he wants to share, with Copy permissions. For information on permissions, refer to the Seven Bridges documentation. Once Teresa is added as a member of the project, she can move the data from the Seven Bridges project to a workspace on the Terra platform, following the instructions in the section titled Moving Data From Seven Bridges to Terra.

If Teresa (Terra) wants to share data with Sebastian (Seven Bridges) so that he can use the data on Seven Bridges, Sebastian first needs to create an account on Terra. Now Sebastian has an account on Seven Bridges and an account on Terra. Teresa can share the data with Sebastian on Terra by sharing the workspace with the data she wants to share with Sebastian. For information on sharing workspaces, refer to the Terra documentation.

To create a Terra account, refer to the Terra documentation.

To create a Seven Bridges account, refer to the Seven Bridges documentation. If you are new to Seven Bridges, you may find the Getting Started Guide helpful.

Billing

The second consideration is making sure the researcher moving data between the two workspaces has billing groups set up on both workspaces to cover cloud costs if necessary. Contact the BioData Catalyst Help Desk if you have questions about how to get a billing group on Seven Bridges or Terra.

Moving Files From Terra to Seven Bridges

The following steps describe how to use the Seven Bridges platform to pull data securely from a Terra workspace into a Seven Bridges project.

Refer to the Terra documentation for Moving data to/from a Google bucket (workspace or external), specifically the section Upload and download data files in a terminal using gsutil. This method:

  • Works well for all size transfers.

  • Ideal for large file sizes or 1000s of files.

  • Can be used for transfers between local storage and a bucket, workspace VM or persistent disk and a Google bucket, as well as between Google buckets (external and workspace).

You will use the terminal in JupyterLab on the Seven Bridges workspace environment. The reason for this is that although Seven Bridges can run on the Google Cloud Platform, the Google bucket API is not exposed in the same manner as it is on Terra. Therefore, you will start a JupyterLab notebook on Seven Bridges, using the project you would like to be the destination for the copied data. Refer to the Seven Bridges documentation for launching and accessing the terminal in a JupyterLab environment.

After launching the notebook, the next step is to open the terminal and install gsutil, a Python program that lets end users add data to, or copy data from, a Google Cloud bucket. After opening the terminal, run the following commands:

pip install gsutil
gsutil config

Installing gsutil takes only a few seconds.

The config command provides a secure URL for you to navigate to in the browser. You will authenticate with the same credentials that were used to login to Terra. The shortcut to access the printed URL in the JupyterLab terminal is to press shift and right click, which will display options to copy the URL. Copy and then navigate to the URL in a new browser tab, which will direct you to Google authentication:

Google will provide an authentication code that you will copy and paste into the terminal.

Next, you will type in the Google Project id. This is found on the right side of the Terra Workspace Dashboard.

Next, run the command below to display the Google buckets attached to that project ID.

gsutil ls

The Google bucket name for the Terra project can be found in the lower right corner of the Terra Workspace.

Running gsutil ls on the Google bucket name will display the folders and files from the Terra workspace.

To copy a folder to the Seven Bridges workspace environment, run the following command:

gsutil cp -R gs://[Google-Bucket-Name] /sbgenomics/output-files/

There are a couple of important things to mention about the gsutil cp command. First, the -R flag is used to recursively copy a folder and all of its subfolders and files. Most users will likely want to use the -R flag. This flag should be omitted when copying individual files or when using a wildcard such as “*.vcf”.

Additionally, /sbgenomics/output-files should be the destination folder when bringing in data from Terra, as this ensures the files or folders get populated back to the Seven Bridges project. Refer to the Save analysis outputs documentation for information about working with files in Data Cruncher environments. After the JupyterLab instance is shut down, your files will automatically be populated in your project-files tab on Seven Bridges.

Moving Data From Seven Bridges to Terra

In this section we will discuss pushing data from a Seven Bridges project to a Terra workspace.

The process of moving data from Seven Bridges to Terra uses the same setup as the previous section, with the arguments of the gsutil cp command reversed:

gsutil cp -R /sbgenomics/output-files/vcfs_to_transfer gs://[Google-Bucket-Name]

You will still use the -R flag but the destination is a Terra bucket. The Terra workspace’s Google bucket name/id can be found on the Terra workspace Dashboard tab. You can verify that the folder has been copied by navigating to the Files section of the Data tab in your Terra workspace.

Clicking on the folder, you will see that all three files have been copied.

Account Setup

Logging in to Terra for the first time is a quick and straightforward process. The process is easiest if you already have an email address hosted by Google. If you want to use an email address that is not hosted by Google, we have instructions for that as well.

Article: How to register for a Terra account
Article: Setting up a Google account with a non-Google email

We also recommend our article on navigating in Terra to get familiar with basic menus and options in Terra, as well as this video introduction to Terra.

Read on in the next two subsections for primers on how to set up billing and how to manage costs.

Troubleshooting Tasks

One of the key steps to becoming an advanced user and being able to fully understand and leverage the power of BDC-Seven Bridges is to learn how to detect and correct errors that prevent the successful execution of your analyses. The Troubleshooting tutorial presents some of the most common errors in task execution on the platform and shows you how to debug and resolve them. There is also a corresponding public project on the platform called "Troubleshooting Failed Tasks" which has examples of the failed analyses presented in the written tutorial.

  • Find the written tutorial here.

  • Find the platform public project with examples here.

Knowledge Center

The Seven Bridges Knowledge Center is a collection of user documentation which describes all of the various components of the platform, with step-by-step guides on their use. Our Knowledge Center is the central location where you can learn how to store, analyze, and jointly interpret your bioinformatic data using BDC-Seven Bridges.

From the Knowledge Center, you can access platform documentation. This content is organized into sections that deal with the important aspects of accessing and using BDC-Seven Bridges.

You can also read the Release Notes in the Knowledge Center, keeping you up-to-date on all of the latest updates and new features for BDC-Seven Bridges.

Annotation Explorer

The Annotation Explorer is an application developed by Seven Bridges in collaboration with the TOPMed Data Coordinating Center. The application enables users to interactively explore, query, and study characteristics of an inventory of annotations for the variants called in TOPMed studies. This application can be used pre-association testing to interactively explore aggregation and filtering strategies for variants based on annotations and generate input files for multiple-variant association testing. It can also be used post-association testing to explore annotations associated with a set of variants, like variant sets found significant during association testing.

The Annotation Explorer currently hosts a subset of genomic annotations obtained using Whole Genome Sequence Annotator software for TOPMed variants. Currently, annotations for TOPMed Freeze5 variants and TOPMed Freeze8 variants are integrated with the Annotation Explorer. Researchers who are approved to access one or more of the TOPMed studies included in Freeze8 or Freeze5 will be able to access these annotations in the Annotation Explorer.

For more information, refer to the Annotation Explorer's Public Project Page.


Managing Costs

We have a number of articles on tracking and minimizing the costs of operating on Terra. There are multiple ways of estimating how much your analyses are costing you, including built-in tools and external resources. The articles below contain instructions and advice on managing your cloud resources in a variety of ways:

Article: Understanding and controlling cloud costs
Article: Best practices for managing shared team costs
Article: How much did a workflow cost?
Article: How to disable billing on a Terra project

Workspace Setup

Workspaces are the fundamental building blocks of Terra. You can think of them as modular digital laboratories that enable you to organize and access your data in a number of ways for analysis.

To learn about the basics of operating a Terra workspace, we recommend these resources:

Article: Working with workspaces
Video: Introduction to using workspaces in Terra

Read on in this section to get familiar with:

  • Data storage and management

  • Collaboration

  • Security

Terra

BDC Powered by Terra (BDC-Terra) is a user-friendly system for doing biomedical research in the cloud. Terra workspaces integrate data, analysis tools, and built-in security components to deliver smooth research flows from data to results.

The following entries in this section of the BDC documentation are a starting point for learning how to use Terra in the context of the BDC ecosystem. You can also dive deeper into Terra by visiting the Terra website and the Terra Support Center. Wherever possible, we highlight specific articles, tutorial videos, and example workspaces that will help you learn what you need to know to accelerate your research.

If you can't find what you are looking for, we are happy to help. See the Troubleshooting and Support section for more information.

Please note that Terra is designed for and tested with the Chrome browser.

Data Storage & Management

Terra workspaces include a dedicated workspace Google bucket, as well as a built-in data model for managing your data. We provide articles in Terra’s knowledge base explaining how to organize and access data in a variety of ways.

A key to understanding the power of Terra is understanding its built-in data model, which allows you to rewire the inputs and outputs of your workflows and Jupyter notebooks.

The following resources give you guided instructions for using cloud-based data with Terra:

Article: Managing data with tables
Video: Introduction to Terra data tables
Article: Uploading to a workspace Google bucket
Article: How to import metadata to a workspace data table
Video: Making and uploading data tables to Terra

Collaboration

Sharing a workspace allows collaborators to actively work together in the same project workspace. Workspaces can be used as repositories of data, workflows, and Jupyter notebooks. Learn more about how to securely share a workspace:

Article: How to share a workspace
Article: Reader, writer or owner? Workspace access controls, explained
Article: Using permissions
Video: Introduction to Collaboration and Sharing in Terra

Billing

Now that you can log in, you’ll want to make sure that you have access to a Billing Account and Billing Project. This will allow you to charge storage and analysis costs through a Google account linked to Terra. A Terra Billing Project is Terra's way of connecting a workspace where you accrue costs for things, back to a Google Billing account where you pay for it. You must have a Google Billing Account established before creating a Terra Billing Project. Outlined here are the steps necessary to set this up, as well as instructions on how to add or be added to an existing account/billing project.

Detailed instructions for setting up your billing can be found by following the links below. If you are a BDC Fellow, your procedure for billing setup is a bit different, but you may find some of the information below still relevant (sharing a billing project with another user, for example).

Step 1: Get Cloud credits for BioData Catalyst
Step 2: Wait for approval & review the Billing overview for BioData Catalyst users
Step 3: Credits approved. Now create a new Terra billing project
Step 4 (optional): Sharing Billing Projects among colleagues


Use Your Own Data with Terra

This page describes how researchers can bring their own data files and metadata into Terra. Some researchers may choose to bring their own data to Terra in addition to, or instead of, using BDC data from Gen3. For example, this may be done when bringing additional (e.g., longitudinal) phenotypic data to enhance the harmonized metadata available from Gen3, when using joint variant calling with additional researcher-provided genomic data, or even when using researcher-provided data exclusively.

Generally, there are two types of data that researchers bring to Terra: data files (e.g., genomic data, including CRAM and VCF files) and metadata (e.g., tables of clinical/phenotypic or other data, typically regarding the subjects in their study). These are described separately below.

There are two ways a researcher's data files may be made available in Terra: by uploading data to the researcher's workspace bucket, or by enabling Terra to access the researcher's data in a researcher-managed Google bucket, for which you need to set up a proxy group.

Article: Uploading to a workspace Google bucket
Article: Understanding and setting up a proxy group

The ways in which a researcher may import metadata into Terra data tables are described in the articles and tutorials below:

Article: Managing data with tables
Article: How to import metadata to a workspace data table
Video: Introduction to Terra data tables
Video: Making and uploading data tables to Terra

Bring Data into a Workspace

You can import data into your workspace by either linking directly to external files you have access to, or by interfacing with a number of platforms with which Terra has integrated access.

For BDC researchers, one of the most relevant of these interfacing platforms is Gen3. However, this section also provides you with resources that teach how to import data from other public datasets integrated into Terra's data library, as well as how to bring in your own data.

Read on in this section for more information on:

  • Bringing in data from Gen3

  • Bringing in data from Terra's Data Library

  • Using your own data with Terra

Security

Terra has a number of features to ensure the security of sensitive data accessed through the platform. Many of these features are in place automatically, while tools like authorization domains give you greater control over your data. These articles contain an overview of the security features enabled on Terra:

Article: Understanding the Terra ecosystem and how your files live in it
Article: Authorization Domain overview for BioData Catalyst users
Article: Managing data privacy and access with Authorization Domains
Article: Best Practices for accessing external resources
Article: Terra security posture

Interactive Analysis

The interactive analysis features of Terra support interactive data exploration, including the use of statistical methods and graphical display. Versatile and powerful interactive analysis is provided through Jupyter Notebooks in both Python and R languages.

Jupyter Notebooks run on a virtual machine (VM). You can customize your VM's installed software by selecting one of Terra's preinstalled notebook cloud environments or by choosing a custom environment and specifying a Docker container. Docker containers ensure you and your colleagues analyze with the same software, making your results reproducible.

Article: Interactive statistics and visualization with Jupyter notebooks
Article: Customizing your interactive analysis application compute
Article: Terra's Jupyter Notebooks environment Part I: Key components
Article: Terra's Jupyter Notebooks environment Part II: Key operations
Article: Terra's Jupyter Notebooks environment Part III: Best Practices
Video: Notebooks overview
Video: Notebooks Quickstart walkthrough
Workspace: Notebooks Quickstart workspace
Workspace: BioData Catalyst notebooks collection
Workspace: PIC-SURE Tutorial in R
Workspace: PIC-SURE Tutorial in Python

From Terra’s Data Library

Terra's Dataset Library includes a number of integrated datasets, many of which have individualized Data Explorer interfaces useful for generating and exporting custom cohorts. If you click into a dataset and have the proper permissions, you'll be able to explore the data. If you don't have the necessary permissions, you'll be taken to a page that tells you whom to contact for access.

The resources linked below provide guided instructions for creating custom cohorts from the data library, importing them to your workspace, and using a Jupyter notebook to interact with the data:

Article: Accessing and analysing custom cohorts with Data Explorer
Video: Notebooks Quickstart walkthrough
Workspace: Notebooks Quickstart workspace

Bring in Data from Gen3

BioData Catalyst Powered by Gen3 provides data for many projects and conveniently supports search across the vast set of subjects to identify the best available cohorts for research analysis. Searches are based on harmonized phenotypic variables and may be performed both within and across projects.

When a desired cohort has been identified in Gen3, the cohort may be conveniently "handed-off" to Terra for analysis. Optionally, this dataset may be enhanced with additional metadata from dbGaP, or extended to include additional researcher-provided subject data.

Here we provide essential information for all researchers using BDC data from Gen3, including how to access and select Gen3 subject data and hand it off to Terra, as well as a description of the GA4GH Data Repository Service (DRS) protocol and data identifiers used by Gen3 and Terra.

The resources below contain the information you’ll need to access your desired data: Video: Article: ​ Article: ​ Article: ​ Article: ​Article: ​ Workspace: Workspace:


Run Analyses

Terra supports the following types of analysis: Batch processing with Workflows and Interactive analysis with Jupyter Notebooks. This section will orient you with resources that teach you how to do:

  • Batch processing with workflows

  • Interactive analysis with Jupyter Notebooks

  • Genome-wide association studies

As an introduction, we recommend reading our article on the kinds of analysis you can do in Terra.

Batch Processing with Workflows

The batch workflow features of Terra provide support for computationally-intensive, long-running, and large-scale analysis.

You can run whole pipelines, from preprocessing and trimming sequencing data to alignment and downstream analyses, using Terra workflows. Workflows are written in the human-readable Workflow Description Language (WDL), and you can search for and import them into your workspace from Dockstore or the Broad Methods Repository.

Video: Data Analysis with Gen3, Terra and Dockstore
Article: How to import data from Gen3 into Terra and run the TOPMed aligner workflow
Article: Configure a workflow to process your data
Article: Getting workflows up and running faster with a JSON file
Article: Importing a Dockstore workflow into Terra
Video: Importing a Dockstore workflow into Terra walkthrough
Video: Workflows Quickstart walkthrough
Workspace: Workflows Quickstart workspace
Workspace for BDC: TOPMed Aligner workspace
Workspace for BDC: GWAS with 1000 Genomes and synthetic clinical data
Workspace for BDC: GWAS with TOPMed data

Dockstore

"An app store for bioinformatics workflows"

Dockstore is an open platform used by the GA4GH for sharing Docker-based tools described with either the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL). Dockerized workflows come packaged with all of their requirements, meaning you spend less time searching the web for obscure installation errors and more time doing research.

Dockstore is aimed at scientific use cases, and we hope this helps users find helpful resources more quickly. Our documentation is also created with researchers in mind: we work to distill information about the technologies we use down to the relevant points so that users can get started quickly.

This section highlights the documentation relevant to BioData Catalyst users. If you are brand new to Dockstore, it is suggested to review the Getting Started Guide. Our entire suite of documentation is available here.

Troubleshooting & Support

If things aren’t going quite as expected, there are a number of avenues to help unblock any issues you may have.

Troubleshooting This section of the Terra knowledge base contains many useful articles on how to address problems, including a variety of articles describing common workflow errors, as well as more general articles that explain how to find which errors are affecting your work, and how to proceed once you’ve diagnosed your problem.

Monitor your jobs The Job History tab is your workflow operations dashboard, where you can check the status of past and current workflow submissions and find links to the job manager where you can diagnose issues.

How to report an issue There are a number of ways you can report an issue directly to us outlined in this article. If something appears broken, slow, or just plain weird, feel free to let us know.

Community forum A lot of answers can be found on our forum, which is monitored by our dedicated frontline support team and has an integrated search function. If you suspect that you’re running into a common issue but can’t find an answer in the documentation, this is a great place to check.

Intro to Docker, WDL, CWL

Technologies for reproducible analysis in the cloud

Introduction to Docker

Docker is a fantastic tool for creating lightweight containers to run your tools. It gives you a fast, VM-like environment for Linux where you can automatically install dependencies, make configurations, and set up your tool exactly the way you want, just as you would on a “normal” Linux host. You can then quickly and easily share these Docker images with the world using registries like Quay.io (indexed by Dockstore), Docker Hub, and GitLab.

Learn how to create a Docker image

Introduction to Workflow Languages

There are multiple workflow languages currently available to use with Docker technology. In the BioData Catalyst ecosystem, Seven Bridges uses CWL and Terra uses WDL. To learn more about how these languages compare and differ, read Dockstore's documentation on tools and workflows.

Once you have picked what language works best for you, prepare your pipeline for analysis in the cloud with these tutorials aimed at bioinformaticians:

Learn how to create a tool in Common Workflow Language (CWL)

Learn how to create a tool in Workflow Descriptor Language (WDL)

Best Practices for Secure and FAIR Workflows

Dockstore's integration with BioData Catalyst allows researchers to easily launch reproducible tools and workflows in secure workspace environments for use with sensitive data. This privilege to work with sensitive data requires assurances of safe software.

We believe we can enhance the security and reliability of tools and workflows through open, community-driven best practices that exemplify the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles. We have established a best practices framework for secure and FAIR workflows published in Dockstore. We ask that users try to implement these practices for all workflows they develop.

Community Tools & Integration

Genome-Wide Association Studies

Terra provides powerful support for performing Genome-Wide Association Studies (GWAS). The following featured and template workspaces include Jupyter notebooks for phenotypic and genomic data preparation (using Hail) and workflows (using GENESIS) to perform single or aggregate variant association tests using mixed models. We will continue to provide more resources for performing more complex GWAS scenarios in BioData Catalyst.

Kinship Matrices

A Jupyter Notebook in both of the following workspaces uses Hail to generate genetic relatedness matrices for input into the GWAS workflows. Users with access to kinship matrices from the TOPMed consortium may wish to skip these steps and instead import kinship files using the bring-your-own-data instructions.

BioData Catalyst GWAS tutorial​ workspace

The BioData Catalyst GWAS tutorial workspace was created to walk users through a GWAS with training data that includes synthetic phenotypic data (modeled after traits available in TOPMed) paired with 1000 Genomes open-access data. This tutorial aims to familiarize users with the Gen3 data model so that they can become empowered to use this data model with any existing tutorials available in the Terra library’s showcase section.

BioData Catalyst GWAS blood pressure trait ​template workspace

This template is an example workspace that asks researchers to export TOPMed projects (for which they have access) into an example template for conducting a common variant, mixed-models GWAS of a blood pressure trait. Our goal was to include settings and suggestions to help users interact with data exactly as they receive it through BioData Catalyst. Accommodating other datasets may require modifying many parts of this notebook. Inherently, the notebook is an interactive analysis where decisions are made as you go. It is not recommended that the notebook be applied to another dataset without careful thought.

Cost Examples

Below are reported costs from using 1,000 and 10,000 samples to conduct a GWAS using the BioData Catalyst GWAS Blood Pressure Trait template workspace. The costs were derived from single variant tests that used Freeze 5b VCF files that were filtered for common variants (MAF <0.05) for input into workflows. The way these steps scale will vary with the number of variants, individuals, and parameters chosen. TOPMed Freeze 5b VCF files contain 582 million variants and Freeze 8 increases to ~1.2 billion. For GWAS analyses with Freeze 8 data, computational resources and costs are expected to be significantly higher.

Analysis Step | Cost (n=1,000; Freeze 5b) | Cost (n=10,000; Freeze 5b)
 | $29.34 ($19.56/hr for 1.5 hours) | $336 ($56/hr for 6 hours)
workflow | $1.01 | $5.01
workflow | $0.94 | $6.67
TOTAL | $32.29 | $347.68

These costs were derived from running these analyses in Terra in June 2020.

BYOT Glossary

An introduction to terms used in this document

Each platform within BDC may have slight variations on these definitions; you will find more specific definitions within the relevant sections of the BYOT documentation. Below, we highlight a few terms to introduce you to before you get started.

  • App: 1) In Seven Bridges, an app is a general term to refer to both tools and workflows. 2) App may also refer to persistent software that is integrated into a platform.

  • Container: A standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another (for example, Docker).

  • Command: In workflow writing, the command specifies the literal command line run (akin to what you would run in the terminal).

  • Common Workflow Language (CWL): Simple scripting language for describing computational workflows for performing sequential operations on data. CWL is a way to describe command-line tools and connect them together to create workflows. CWL is well suited for describing large-scale workflows in cluster, cloud, and high-performance computing environments where tasks are scheduled in parallel across many nodes.

  • Docker: Software for running packaged, portable units of code, and dependencies that can be run in the same way across many computers. See also Container.

  • Dockerfile: A text document that contains all the commands a user could call on the command line to assemble an image.

  • Dockstore: An open platform developed by the Cancer Genome Collaboratory and used by the GA4GH for sharing Docker-based tools described with the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL).

  • Image: In the context of containers and Docker, this refers to the resting state of the software.

  • Instance: Refers to a virtual server instance from a public or private cloud network.

  • Task: In workflow writing, the term task encompasses all of the information necessary to execute a command, such as specifying input/output files and parameters.

  • Tool: In CWL, the term tool specifies a single command. This specification is not as discrete in other languages such as WDL.

  • Workflow Description Language (WDL): Way to specify data processing workflows with a human-readable and writable syntax. Define complex analysis tasks, chain them together in workflows, and parallelize their execution.

  • Workflow: A sequence of processes, usually computational in this context, through which a user may analyze data.

  • Workspace: Areas to work on/with data within a platform. Examples: projects within Seven Bridges.

  • Wrapping: The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.

  • Virtual Machine (VM): An isolated computing environment with its own operating system.
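
To make several of these terms concrete, here is a minimal, hypothetical WDL sketch (not one of the BDC-maintained workflows): a single task whose command block runs a shell command inside a public Docker container, and a workflow that calls it. The task name, file names, and image tag are illustrative assumptions only.

```wdl
version 1.0

# Minimal example task: the "command" block is the literal command line,
# and the "runtime" block names the Docker image the command runs in.
task count_variants {
  input {
    File vcf
  }

  command <<<
    # Count non-header lines in a gzipped VCF
    zcat ~{vcf} | grep -vc "^#" > variant_count.txt
  >>>

  runtime {
    docker: "ubuntu:20.04"   # illustrative public image; any image with gzip/grep works
  }

  output {
    Int n_variants = read_int("variant_count.txt")
  }
}

# A workflow chains tasks together and declares the inputs an analysis
# platform will prompt for when the workflow is launched.
workflow count_variants_wf {
  input {
    File vcf
  }

  call count_variants {
    input:
      vcf = vcf
  }

  output {
    Int n_variants = count_variants.n_variants
  }
}
```

When a workflow like this is registered on Dockstore and launched in a platform such as Terra, the workflow-level inputs (here, the VCF file) are what the platform asks you to supply at launch time.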

For other terms, you can reference the BioData Catalyst glossary.

Launch workflows with BDC

How to use Dockstore workflows in our cloud partner platforms

Using the BDC ecosystem, you can launch workflows from Dockstore in both of our partner analysis platforms, Terra and Seven Bridges. It is important to know that these platforms use different workflow languages: Terra uses WDL and Seven Bridges uses CWL.

When you open any WDL or CWL workflow in Dockstore, you will see the option to "Launch with NHLBI BioData Catalyst":

  • If you selected a CWL workflow, it will launch in BDC Powered by Seven Bridges (BDC-Seven Bridges). Learn more about how this integration works.

  • If you selected a WDL workflow, it will launch in BDC Powered by Terra (BDC-Terra). Learn more about how this integration works.

Discover our catalog

How to search our catalog

Dockstore offers faceted search, which allows for flexible querying of tools and workflows. Tabs are used to split up the results between tools and workflows. You can search for basic terms/phrases, filter using facets (like CWL vs WDL), and also use advanced search queries. Learn more.

Organizations

You can also search curated workflows on Dockstore's Organizations page.

Organizations are landing pages for collaborations, institutions, consortia, companies, and other groups that allow users to showcase tools and workflows. This is achieved through the creation of collections, which are groupings of related tools and workflows. Learn more about Organizations and Collections, including how your research group can create its own organization to share your work with the community.

Dockstore Organizations relevant to BDC users:

NHLBI BioData Catalyst

Here, you can find a suite of analysis tools we have developed with researchers that are aimed at the BDC community. Examples include workflows for performing GWAS and Structural Variant Calling. Many of these collections also point users to tutorials where you can launch these workflows in our partner platforms and run an analysis.

TOPMed

These workflows are based on pipelines the University of Michigan developed to perform alignment and variant calling on TOPMed data. If you're bringing your own data to BDC to compare with TOPMed data, these may be helpful resources.

Dockstore Forum

This forum is a great place to find and post questions about Docker files, workflow languages, Dockstore features, and workflow learning resources. The user base includes CWL, WDL, Nextflow, and Galaxy workflow authors and users.

Contribute to the community

Our mission is to catalyze open, reproducible research in the cloud

We hope Dockstore provides a reference implementation for tool sharing in the sciences. Dockstore is essentially a living and evolving proof of concept designed as a starting point for two activities that we hope will result in community standards within the GA4GH:

  • a best practices guide for describing tools in Docker containers with CWL/WDL/Nextflow

  • a minimal web service standard for registering, searching and describing CWL/WDL-annotated Docker containers that can be federated and indexed by multiple websites

We plan on expanding Dockstore in several ways over the coming months. Please see our issues page for details and discussions.

Building a community

To help Dockstore grow, we encourage users to publish their tools and workflows on Dockstore so that they can be used by the greater scientific community. Here is how to get started:

  • Create a Dockstore account

  • Register your tool or workflow on Dockstore

  • Create an Organization, invite your collaborators, and promote your work in collections
