The BioData Catalyst Consortium is dedicated to providing a harassment-free experience for everyone, regardless of gender, gender identity and expression, age, sexual orientation, disability, physical appearance, body size, race, or religion (or lack thereof). We do not tolerate harassment of community members in any form. Sexual language and imagery are generally not appropriate for any venue, including meetings, presentations, or discussions.
Glossary of terms used in the context of the BioData Catalyst Consortium and platform.
Agile Development
Agile software development is an approach to software development under which requirements and solutions evolve through the collaborative effort of self-organizing and cross-functional teams and their customer(s)/end user(s).
Alpha Users
A small group of users who are more willing to tolerate working in a system that isn’t as fully developed and who provide detailed feedback, possibly through back-and-forth discussions.
[Amazon] EFS
[Amazon] Elastic File System; a simple, scalable, elastic file system for Linux-based workloads for use with AWS Cloud services and on-premises resources.
Ambassadors
A small group of experts who represent the personas featured within the priority User Narratives. For their time and help, Ambassadors will receive early access to the BDC platform, free compute time, a monetary fee for their time, and coverage of relevant travel expenses.
App
In Seven Bridges, an app is a general term to refer to both tools and workflows.
App may also refer to persistent software that is integrated into a platform.
API
Application Programming Interface. API technologies serve as software-based intermediaries to exchange data.
AWS
Amazon Web Services. A provider of cloud services available on-demand.
BagIt
BagIt is a hierarchical file packaging format for storage and transfer of arbitrary digital content.
BDC3
BioData Catalyst Coordinating Center
Beta Users
A slightly larger group than the alpha users who are not as tolerant to a difficult/clunky environment but understand that the version they are using is not polished and they need to give feedback.
Beta-User Training
Once the platform is available to a broader audience, we will support freely-accessible online training for beta-users at any time.
Carpentries Instructor Training Program
Ambassadors attend this training program to become BDC trainers.
CCM
Change Control Management; the systematic approach to managing all changes made to a document or process. Ensures no unnecessary changes are made, all changes are documented, and a process exists for implementing approved change.
CIO
Chief Information Officer
Cloud Computing
Internet-based computing, wherein computing power, networking, storage or applications running on computers outside an organization are presented to that organization in a secure, services-oriented way.
Components
Software units that implement a specific function or functions and which can be reused.
ConOps
Concept of Operations
Consortium
A collection of teams and stakeholders working to deliver on the common goals of integrated and advanced cyberinfrastructure, leading-edge data management and analysis tools, FAIR data, and HLBS researcher engagement.
Containers
A standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another (for example, Docker).
Command
In workflow writing, the command specifies the literal command line run (akin to what you would run in the terminal).
COPDGene
Chronic Obstructive Pulmonary Disease (COPD) Gene
Cost Monitoring (level)
Cost monitoring takes place at the Epic level. The Coordinating Center will facilitate this process by developing reporting templates (see example in PM Plan, Financial Management) for distribution to the teams. The BDC teams will complete these templates and send them directly to NHLBI. Each team is responsible for tracking their finances based upon the award conditions and for providing status updates as requested to NHLBI.
CRAM File
Compressed columnar file format for storing biological sequences aligned to a reference sequence. Designed to be an efficient reference-based alternative to the Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats. It optionally uses a genomic reference to describe differences between the aligned sequence fragments and the reference sequence, reducing storage costs. Additionally each column in the SAM format is separated into its own blocks, improving compression ratio. CRAM files typically vary from 30 to 60% smaller than BAM, depending on the data held within them (from Wikipedia).
CSOC Alpha
Common Services Operations Center (CSOC): operates cloud, commons, compliance and security services that enable the operation of data commons; has ATO and hosts production system.
CSOC Beta
Development/testing; Real data in pilot (not production) that can be accessed by users
Common Workflow Language (CWL)
Simple scripting language for describing computational workflows for performing sequential operations on data. CWL is a way to describe command line tools and connect them together to create workflows. CWL is well suited for describing large-scale workflows in cluster, cloud and high performance computing environments where tasks are scheduled in parallel across many nodes.
DAC
Data Access Committee: reviews all requests for access to human studies datasets
DAR
Data Access Request
Data Access
A process that involves authorization to access different data repositories; part of a User Narrative for the December 2020 release goal. A Work Stream PM Plan constraint: NHLBI, as the project sponsor, will identify a process to enable data access by the BDC team members and for research users.
Data Commons
Provides tools, applications, and workflows to enable computing large scale data sets in secure workspaces.
Data Repository Service (DRS)
Generic interface (API) to data repositories so data consumers, including workflow systems, can access data in a single, standardized way regardless of where it’s stored or how it’s managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by the ID (from GA4GH).
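The mapping from a logical ID to a means of physical retrieval can be pictured with a minimal in-memory sketch. This is not the GA4GH DRS API or any BDC service: the `REGISTRY` contents, the `AccessMethod` class, the `resolve` function, and the example ID and URLs are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessMethod:
    """One physical way to fetch the bytes behind a logical ID."""
    type: str  # access protocol, e.g. "s3", "gs", "https"
    url: str   # where the bytes live for that protocol


# Hypothetical registry: logical ID -> available access methods.
# A real service would look this up in a data repository, not a dict.
REGISTRY = {
    "drs://example.org/abc123": [
        AccessMethod("s3", "s3://example-bucket/sample.cram"),
        AccessMethod("https", "https://example.org/files/sample.cram"),
    ],
}


def resolve(logical_id: str, preferred: tuple = ("https",)) -> AccessMethod:
    """Return an access method for logical_id, honoring protocol preference."""
    methods = REGISTRY.get(logical_id)
    if not methods:
        raise KeyError(f"Unknown logical ID: {logical_id!r}")
    for wanted in preferred:
        for method in methods:
            if method.type == wanted:
                return method
    return methods[0]  # no preferred protocol matched: use the first listed
```

A caller only needs the logical ID; where and how the data is stored stays hidden behind `resolve`, which is the point of the standardized interface described above.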
Data Steward
Members of the TOPMed and COPDGene communities who are working with BDC teams.
dbGaP
Database of Genotypes and Phenotypes
DCPPC
Data Commons Pilot Phase Consortium. The Other Transaction Awardees, Data Stewards, and the NIH.
Decision Tree
A decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility
Deep Learning
A machine learning method based on neural networks to learn from data through training to recognize patterns in the data.
Deliverables
Demonstrations and products.
Demos
Activities and documentation resulting from the DCPPC to build, test and demonstrate completion of goals of the Data Commons Pilot Phase.
DEV Environment
Set of processes and programming tools used to create the program or software product
DMI
Data Management Incident
Docker
Software for running containers: packaged, portable units of code and dependencies that can be run in the same way across many computers. See also Containers.
Dockerfile
A text document that contains all the commands a user could call on the command line to assemble an image.
Dockstore
An open platform developed by the Cancer Genome Collaboratory and used by the GA4GH for sharing Docker-based tools described with the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL)
DOI
Digital Object Identifier; a code used to permanently and stably identify (usually digital) objects. DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.
DUO
Data Use Ontology - a GA4GH standard for automating access (API) to human genomics data (https://github.com/EBISPOT/DUO)
DUOS
Data Use Oversight System, https://duos.broadinstitute.org/
Ecosystem
A software ecosystem is a collection of processes that execute on a shared platform or across shared protocols to provide flexible services. Example: The "BDC Ecosystem" - inclusive of all platforms and tools
EEP
External Expert Panel. A group of experts who provide guidance and direction to NIH about the program.
Epic
A very large user story which can be broken down into executable stories
*NHLBI’s cost-monitoring level
eRA Commons
Designated ID provider for whitelist
External Expert Panel
An independent body of experts that inform and advise the work of the BDC Consortium.
FAIR
Findable Accessible Interoperable Reusable.
Feature
A functionality at the system level that fulfills a meaningful stakeholder need
*Level at which the CC coordinates
FireCloud
Broad Institute secure cloud environment for analytical processing, https://software.broadinstitute.org/firecloud/
FISMA moderate environment
Federal Information Security Modernization Act of 2014, amends the Federal Information Security Management Act of 2002 (FISMA), see https://www.dhs.gov/fisma
FS
Full Stack
GA4GH
Global Alliance for Genomics and Health
GA4GH APIs
The Genomic Data Working Group is a coalition assembled to create interoperability standards for storing and sharing genomic data. The GA4GH Genomics API offers interoperability for exchanging genomic data between various platforms and organizations by sending simple HTTP requests through a JSON-equipped RESTful API.
GCP
Google Cloud Platform
GCR
Governance, Compliance, and Risk
Gen3
Gen3 is an open source platform, licensed under the Apache license, that you can use for setting up, developing and operating data commons.
GitHub
An online hub for storing and sharing computer programs and other plain text files. We use it for storage, hosting websites, communication and project management.
Gold Master
A gold master, or GM, is the final version of software or data ready for release to production; a master version from which copies can be made.
GWAS
Genome-wide Association Study
HLBS
Heart, Lung, Blood, Sleep
Identity Providers
A system entity that creates, maintains, and manages identity information for principals while providing authentication services to relying applications within a federation or distributed network; identity providers offer user authentication as a service
Interoperability
The ability of data or tools from multiple resources to effectively integrate data, or operate processes, across all systems with a moderate degree of effort.
Instance
In cloud computing, refers to a virtual server instance from a public or private cloud network.
Image
In the context of containers and Docker, this refers to the resting state of the software.
IP
BDC Implementation Plan; outlines how the various elements from the planning phase of the BDC project will come together to form a concrete, operationalized BDC platform.
IRB
Institutional Review Board; the entity within a research organization that reviews and approves research protocols and clinical research protocols to protect human and animal subjects.
IRC
Informatics Research Core
ISA
Interoperability Service Agreement
ITAC
Information Technology Applications Center
Jupyter Notebooks
A web-based interactive environment for organizing data, performing computation, and visualizing output.
Linux
An open source computer operating system
Metadata
Data about other data
Milestone
Milestones mark specific progress points on the development timeline, and they can be invaluable in measuring and monitoring the evolution and risk of a program. © Scaled Agile, Inc.
MSD
Minimum set of documents
MVP
Minimum viable product
NHLBI
National Heart, Lung, and Blood Institute
NIH
National Institutes of Health
NIST Moderate controls
NIST 800-53 - A collection of security controls and assessment procedures that both U.S. Federal and non-governmental organizations can apply to their information systems, policies, and procedures.
OTA
Other Transaction Authority - the mechanism of award that NHLBI chose because it provides a degree of flexibility in the scope of the work that is needed to advance this type of high risk/high reward project
PI
Principal Investigator
Platform
A piece of the BDC ecosystem. Examples: Terra, Gen3, Seven Bridges, etc.
PM
Project Manager
PMP
BDC Project Management Plan; breaks down the implementation of BDC from the perspective of the project managers involved in the project including details on roles, specific milestones, and the project schedule.
PO
Program Officer
Portable Format for Biomedical Data (PFB)
Avro-based serialization format with specific schema to import, export and evolve biomedical data. Specifies metadata and data in one file. Metadata includes data dictionary, ontology references and relations between nodes. Supports versioning, back- and forward compatibility. A binary format.
Portfolio for Jira
Software-as-a-Service project management tool, used to track, roadmap, and visualize various project metrics.
Python
Open source programming language, used extensively in research for data manipulation, analysis, and modeling
Quality Assurance
The planned and systematic activities implemented in quality management so that quality requirements for a product or service satisfy stated goals and expectations.
Quality Control
The operational techniques and activities aimed at monitoring and measuring work processes and eliminating the causes of unsatisfactory outputs.
RACI
Responsible, Accountable, Consulted and Informed; tool that can be used for identifying roles and responsibilities during an organizational change process; BDC RACI
Researcher Auth Service (RAS)
Will be a service provided by NIH's Center for Information Technology to facilitate access to NIH’s open and controlled data assets and repositories in a consistent and user-friendly manner. The RAS initiative is advancing data infrastructure and ecosystem goals defined in the NIH Strategic Plan for Data Science.
RFC
Request for Comment: A process that documents and enables effective interactions between stakeholders to support shared decision making.
Risk Register
A tool used to continuously track risk identification, risk response planning, and status updates throughout the project lifecycle. This project risk register is the primary risk reporting tool, and is located in the Project Management Plan.
SC
Steering Committee
Scientific use case
Defined in this project as an analysis of data from the designated sources which has relevance and value in the domain of health sciences, probably implementation and software agnostic.
SF or SFP
BDC Strategic Framework [Plan]; defines what the BDC teams have accomplished up to this point, what we plan to accomplish in a timeline fashion, and milestones to track and measure implementation.
SFTP
Secure File Transfer Protocol
Software Development Kit
A set of software development tools that allows the creation of applications for a certain software package, software framework, hardware platform, computer system, or similar development platform
Sprints
Term of art used in software generation, referring to short, iterative cycles of development, with continuous review of code through daily builds and end-of-sprint demos
Stack
Term of art referring to a suite of services that run in the cloud and enable ubiquitous, convenient, on-demand access to a shared pool of configurable computing resources.
Steering Committee
Responsible for decision-making and communication in BDC.
STRIDES
Science & Technology Research Infrastructure for Discovery, Experimentation, and Sustainability
Task
In workflow writing, the term task encompasses all of the information necessary to execute a command, such as specifying input/output files and parameters.
Team
Groups of people led by a Principal Investigator (PI), or PIs, who will complete milestones and produce deliverables. Each group has been assigned a name taken from an element in the periodic table.
Tiger Teams
A diversified group of experts brought together to investigate, solve, build, or recommend possible solutions to unique situations or problems. Populated with mature experts who know what's at stake, what needs to be done, and how to work well with others; their strengths are diversity of knowledge, a single focus or purpose, cross-functional communications, decision-making sovereignty, and organizational agility.
Tool
In Common Workflow Language, the term tool specifies a single command. This definition is not as discrete in other workflow languages such as WDL.
Tool Registry Service (TRS)
The GA4GH Cloud Work Stream has released a standard API for exchanging tools and workflows to analyze, read, and manipulate genomic data. The Tool Registry Service (TRS) API is one of a series of technical standards from the Cloud Work Stream that together allow genomics researchers to bring algorithms to datasets in disparate cloud environments, rather than moving data around.
TOPMed
Trans-Omics for Precision Medicine. One of the primary data sets of the DCPPC.
TOPMed DCC
TOPMed Data Coordinating Center
Trans-cloud
A provider-agnostic multi-cloud deployment architecture.
User Narrative
Descriptions of a user interaction experience within the system from the perspective of a particular persona. User Narratives are further broken down into Features, Epics, and User Stories. Currently formulated into rough 6-month timelines to benchmark progress.
User story
A description of a software feature from a technical/process-oriented perspective; a backlog item that describes a requirement or functionality for a user
*Finest level of PM Monitoring
Variant Call Format (VCF)
File format for storing gene sequence variations. The format has been developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data such as General feature format (GFF) stored all of the genetic data, much of which is redundant because it will be shared across the genomes. By using the variant call format only the variations need to be stored along with a reference genome. There is also a Genomic VCF (gVCF) extended format, which includes additional information about "blocks" that match the reference and their qualities (from Wikipedia). See http://www.internationalgenome.org/wiki/Analysis/vcf4.0/.
VDS
A composite of complete server hardware, along with the operating system (OS), which is powered by a remote access layer that allows end users to globally access their server via the Internet
VPC
Virtual Private Cloud
Whitelist
A security measure to permit only an approved list of entities. We recommend instead using the term "allow list".
Workflow
A sequence of processes, usually computational in this context, through which a user may analyze data.
Workflow Description Language (WDL)
Way to specify data processing workflows with a human-readable and writeable syntax. Define complex analysis tasks, chain them together in workflows, and parallelize their execution.
Workspace
Areas to work on/with data within a platform. Examples: projects within Seven Bridges
Workstream
A collection of related features; orthogonal to a User Narrative
Wrapping
The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.
Virtual Machine (VM)
An isolated computing environment with its own operating system.
Our Culture: Though the primary goal of the BDC project is to build a data science ecosystem, at its core, this is a people-centric endeavor. BDC is also building a community of practice working collaboratively to solve technical and scientific challenges in biomedical science.
Principal Investigators (PIs):
Stan Ahalt, PI RENCI (Coordination Center)
Rebecca Boyles, Co-PI RTI (Coordination Center)
Paul Avillach, PI HMS (Team Carbon)
Kira Bradford, Co-PI RENCI (Team Helium)
Steve Cox, Co-PI RENCI (Team Helium)
Brandi Davis-Dusenbery, PI Seven Bridges (Team Xenon)
Robert Grossman, PI UChicago (Team Calcium)
Ashok Krishnamurthy, PI RENCI (Team Helium)
Benedict Paten, PI UCSC (Team Calcium)
Anthony Philippakis, PI Broad Institute (Team Calcium)
Note: BDC collaboration is organized around teams based on elements in the periodic table. There are additional modes of collaboration in BDC including Tiger Teams, Working Groups, Steering Committee, and Publications.
More about who we are and the partners empowering our ecosystem can be found at the BioData Catalyst About page.
Documentation for getting started on the NHLBI BioData Catalyst ecosystem.
BDCatalyst-RFC-#: 6
BDCatalyst-RFC-Title: Data Access Working Group Data Upload and Download Policy and Recommendations For Users
BDCatalyst-RFC-Type: Process
Name of the person who is to be Point of Contact: Kira Bradford, Jessica Lyons
Email of the person who is to be Point of Contact: kcbradford@renci.org, Jessica_Lyons@hms.harvard.edu
Submitting Team: Data Access Working Group
Requested BDCatalyst-RFC posting start date: 2020-03-31
BDCatalyst-RFC-Status: Comment only
URL Link to this document: https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/community/request-for-comments/data-upload-and-download-policy-and-recommendations-for-users
URL Link to the BDCatalyst-RFC: https://www.nhlbidatastage.org/collaboration/rfcs/bdcatalyst-rfc-6
License: This work is licensed under a CC-BY-4.0 license.
This document describes and defines data movement (data egress and ingress), explaining the types of data movement currently allowable on each platform, and what kinds of data should be downloaded or uploaded. This document is meant to inform the BioData Catalyst Go Live users of the Data Access Working Group’s (DAWG) data upload and download recommendations and policies. The policies and recommendations herein are specific to BioData Catalyst users. The following terminology is in use throughout this document.
Ecosystem: BioData Catalyst
Platform: Piece of the BioData Catalyst ecosystem.
Examples: Terra, Gen3, Seven Bridges, PIC-SURE
Workspace: Areas to work on or with data within a platform.
Examples: Projects/workspaces within Seven Bridges or Terra
Individual-level Data: Data at the level of the individual.
Controlled-access data: Data that is not publicly accessible and requires specific credentials or approvals for access and use, primarily due to study participant consents.
External data/user uploaded data: Other data sources not hosted on BioData Catalyst, such as a user's own dataset, or data from another source.
FISMA moderate: The Federal Information Security Management Act (FISMA) creates standards to ensure that all government partners handle confidential and sensitive data appropriately. FISMA moderate is the designated level that means if data is compromised there could be a serious adverse impact, such as a loss of confidentiality, integrity, or availability of data.
Security boundary: Refers to the technical infrastructure boundaries of a BioData Catalyst platform and Ecosystem.
We define data movement as the transfer of data, including controlled-access data, in and out of the FISMA moderate security boundaries of the BioData Catalyst ecosystem. Types of data movement are listed below along with available functionality by Go Live:
Type 1: Uploading and downloading data to a workspace within a platform
Uploading: Moving data that exists outside the BioData Catalyst security boundary inside the security boundary (i.e. uploading data) - available by Go Live (See more information below on Permissible Data Upload)
Downloading: Moving data from within the BioData Catalyst security boundary outside of the security boundary (i.e. downloading data) - available by Go Live (See Permissible Data Download for data download limitations)
Type 2: Moving data from one workspace to another workspace within the same platform - available by Go Live
Example: Moving data from one workspace to another within Terra, such as when a user copies a Terra workspace to create a new workspace for running a similar analysis.
Type 3: Moving data from one platform workspace to another platform workspace - not available by Go Live from all platforms (see Table 1 below). Moving data using core services will be available, such as accessing data available from Gen3 and/or curated data from PIC-SURE and moving it to another platform for analysis.
Example: If the user generates result files from running an analysis in one workspace (e.g. Seven Bridges workspace), those files cannot currently be accessed from a different platform workspace (e.g. Terra workspace).
Type 4: Sharing controlled-access data with unauthorized users, whether that data was brought into the BioData Catalyst ecosystem or is contained in datasets produced by users within it - prohibited for Go Live. This pertains to controlled-access data brought into the ecosystem by the user or data available through BioData Catalyst, including data that has been created or altered for analyses. For user-generated results files, the BioData Catalyst consortium does not currently have policies or technical implementations in place to track this. We do not have the technical implementation to track a user bringing their own data onto the BioData Catalyst ecosystem and sharing it with others. We also do not have any data provenance implementation for tracking how a dataset is transformed. Therefore, our policy is that the user is responsible for any externally uploaded or transformed data. Users must adhere to all regulations and data use agreements and are solely responsible for the use of any data uploaded and transformed within the ecosystem.
The following table describes the current types of data movement allowed for BioData Catalyst users based on which platform they are using.
| Platform | Type 1 | Type 2 | Type 3 | Type 4 |
| --- | --- | --- | --- | --- |
| Gen3 | ✔ | ✔ | ✔ (as part of core services) | prohibited |
| Terra | ✔ | ✔ | Not available by Go Live | prohibited |
| Seven Bridges | ✔ | ✔ | Not available by Go Live | prohibited |
| PIC-SURE | ✔ | ✔ | ✔ (as part of core services) | prohibited |
Table 1: Data Movement allowed in BioData Catalyst
Data Movement from core services to platforms is permissible. These include accessing data available from Gen3 and/or curated data from PIC-SURE and moving it to another platform for analysis.
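For readers who want to check the rules programmatically, the contents of Table 1 can be transcribed into a small lookup. The sketch below is illustrative only and is not part of any BioData Catalyst platform: the `Movement` and `Status` enums, the `TABLE_1` dictionary, and the `movement_status` function are hypothetical names, and the values simply restate Table 1 as published for Go Live.

```python
from enum import Enum


class Movement(Enum):
    TYPE_1 = "upload/download to a workspace within a platform"
    TYPE_2 = "workspace to workspace within the same platform"
    TYPE_3 = "platform workspace to another platform workspace"
    TYPE_4 = "sharing controlled-access data with unauthorized users"


class Status(Enum):
    AVAILABLE = "available by Go Live"
    CORE_SERVICES = "available as part of core services"
    NOT_AVAILABLE = "not available by Go Live"
    PROHIBITED = "prohibited"


def _row(type_3: Status) -> dict:
    """Build one platform row; only the Type 3 column differs in Table 1."""
    return {
        Movement.TYPE_1: Status.AVAILABLE,
        Movement.TYPE_2: Status.AVAILABLE,
        Movement.TYPE_3: type_3,
        Movement.TYPE_4: Status.PROHIBITED,
    }


# Transcription of Table 1 (hypothetical structure, values from the table).
TABLE_1 = {
    "Gen3": _row(Status.CORE_SERVICES),
    "Terra": _row(Status.NOT_AVAILABLE),
    "Seven Bridges": _row(Status.NOT_AVAILABLE),
    "PIC-SURE": _row(Status.CORE_SERVICES),
}


def movement_status(platform: str, movement: Movement) -> Status:
    """Look up what Table 1 says about a movement type on a platform."""
    if platform not in TABLE_1:
        raise ValueError(f"Unknown platform: {platform!r}")
    return TABLE_1[platform][movement]
```

For example, `movement_status("Terra", Movement.TYPE_3)` returns `Status.NOT_AVAILABLE`, while the same query for Gen3 returns `Status.CORE_SERVICES`. The lookup reflects the table only; it does not replace the user's obligations under Data Use Agreements or IRB requirements.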
Users are permitted to upload data not available to the BioData Catalyst ecosystem (i.e. external data) to their own workspace. Users may upload data into BioData Catalyst if they have the required approvals for such use. These approvals include: Approval of a Data Access Request for controlled access data that includes a BioData Catalyst cloud use statement and the user's institutional review board policies and guidelines. At all times, it is the user's responsibility to ensure they use the data they upload consistent with applicable Data Use Agreements, Data Use Limitations, IRB and any other restrictions on use.
Due to the sensitive nature of data available through the BioData Catalyst ecosystem, users are only allowed to download certain pieces of data/results as outlined in Table 2. The technical infrastructure allows data download on BioData Catalyst platforms, so the responsibility for compliance with data download requirements lies with the user. Results and data that the user would broadly share, such as in an academic publication, may be downloaded through shared workspaces on the BioData Catalyst ecosystem; however, users are strongly encouraged to keep results and data within the BioData Catalyst ecosystem. Users are prohibited from downloading any controlled-access, individual-level data (see Table 2). Users who choose to download permitted data and results should be aware that transferring that data through the BioData Catalyst security boundary may or may not be supported by their Data Use Agreement(s), Limitation(s), or Institutional Review Board policies and guidelines. BioData Catalyst users are solely responsible for adhering to the terms of these policies. All data downloads are logged and regularly reviewed for compliance. See below for examples of data permissible and prohibited for download.
Table 2: BioData Catalyst Permissible and Prohibited Download
Permissible to Download:
- Aggregate results/tables that would be publishable in an academic publication
- Summary data that does not include individual-level and/or controlled-access data
- Your own data that you brought to the platform for analyses following your DUAs and/or IRB protocols

Prohibited to Download:
Users are prohibited from downloading any controlled-access, individual-level data, such as:
- Hosted TOPMed CRAM files with individual-level data
- Hosted TOPMed VCF files with individual-level data
- Hosted TOPMed or TOPMed-related phenotypic study data files with individual-level data
This list is not exhaustive of all possible scenarios and is subject to change. If you have questions about permissible data download please contact https://biodatacatalyst.nhlbi.nih.gov/contact.
Allowing users to share data with other collaborators is permissible per BioData Catalyst policy, but platforms are not required to have this sharing capability available on BioData Catalyst for Go Live. The BioData Catalyst user is ultimately responsible for maintaining the confidentiality, integrity, and the availability of any data uploaded or downloaded from the BioData Catalyst ecosystem. It is therefore essential that all users of the BioData Catalyst ecosystem accessing controlled access data understand their responsibilities for ensuring appropriate information security controls and that they work with their institutions to effectively implement those responsibilities. Users can upload their data to a workspace; however, it is the responsibility of the uploader to ensure that data policies and permissions are in place to permit data transfer to any users of the shared workspace. Additionally, the uploader should make all collaborators aware of any Data Use Agreements or Limitations of any newly uploaded data.
Users will be made aware and reminded of the data upload and download policy recommendations when working in the BioData Catalyst Ecosystem. The user will see this or similar messaging when working on BioData Catalyst platforms.
“You are transferring data through the BioData Catalyst security boundary. Downloading controlled-access, individual-level data through BioData Catalyst is prohibited and downloading other types of data is strongly discouraged, due to the sensitive nature of the data hosted on the platform. Please see the Permissible and Prohibited Data Download section of the Data Upload and Download Policy for more information. Additionally, transferring data may or may not be supported by your Data Use Agreement(s), Limitation(s), or your Institutional Review Board policies and guidelines. As a BioData Catalyst user, you are solely responsible for adhering to the terms of these policies.”
BDC recognizes the importance of multimedia resources for ecosystem users, particularly audio/visual recordings. This document provides guidelines on the program's video content approach. Using these guidelines will ensure users get optimized video experiences, from consistent branding that offers insights into the sources of the videos to best practices in video creation that support learning.
To share video content - from the consortium, platforms, and users, as described in the following sections - BDC created a YouTube channel: https://www.youtube.com/channel/UCGkmY5oNK8uFZzT8vV_9KgQ
The BioData Catalyst Coordinating Center (BDC3) has authority (with direction from the NHLBI) to post (or not post), remove, edit, and otherwise change video content on this channel with or without permission from or notice to video creators, owners, or sharers. Feedback about videos on the BDC YouTube channel should be sent to BDCatalystOutreach@nih.gov.
The BDC YouTube Channel hosts three categories of videos based on their sources and/or approval statuses:
Consortium-produced / Consortium-approved
Platform-generated
User-generated
Learn more about each video category below. Note that each category has its own set of standards that must be adhered to when creating and publishing video content, whether the final outlet is the BDC YouTube channel or another channel.
BDC3 is responsible for organizing videos on the BDC YouTube channel, grouping them into playlists it believes will be most beneficial to ecosystem community members. Playlists may include videos from any or all categories of videos. Viewers can determine the category of a video based on the branding (or non-branding) that appears. The additional information about each video category includes video standards that direct video creators on branding for each category of videos.
Videos in this category are produced by BDC3, or are produced by Platforms or Users that receive approval from the BDC Consortium (select organizations developing and maintaining the ecosystem). These videos contain pre-approved opening and closing BDC animations and sound.
Videos produced by the Consortium, or by Platforms or Users that submit for approval for recognition as a Consortium-approved video, must adhere to the following standards:
Comply with all requirements and, when possible, follow all best practices outlined in Addendum A: Consortium-produced / Consortium-approved Videos Best Practices.
Platforms and users generating videos who wish to submit them for recognition as Consortium-approved must complete the BioData Catalyst Consortium Video Submission Pre-Approval Application.
Submit the form BEFORE producing the video to improve the likelihood that the video receives Consortium approval.
Videos in this category are produced by one of the BDC platforms to support users' understanding of their platform. These videos are not vetted by BDC3, BDC3 Consortium members, or representatives of other BDC platforms. These videos must open with the creator's platform "Powered by" logo (downloadable from the BDC3 internal consortium website).
Unless a Platform plans to seek Consortium-approval status for a video, platforms should use the following standards in the production and posting of their platform-generated videos:
Producers of Platform-generated videos, like all BDC ecosystem users, are always obligated to protect participant privacy and must follow NIH policies for data protection. Platforms are accountable and may be subject to sanctions if policies are violated.
Only produce videos that provide information specific to the Platform's BDC instance.
Use the Platform's Powered by logo (and only the Powered by logo) for the YouTube thumbnail image.
Videos should open with the following information: “In this video we will [discuss/cover/explore] BioData Catalyst Powered by [platform name] and [task/example]”
YouTube description language should include:
The following language: This is a BioData Catalyst platform-generated video to support ecosystem users' understanding of the BioData Catalyst Powered by [platform name].
The link to the NHLBI BioData Catalyst homepage: https://biodatacatalyst.nhlbi.nih.gov/
Videos should be uploaded using YouTube's auto-generated captions to support 508 compliance.
Once the video is uploaded, email the link to: BDCatalystOutreach@nih.gov so BDC3 can make it visible on the BioData Catalyst YouTube channel.
Important Notes
Only videos offering information specific to the use of ecosystem Platform instances will be shared on the BioData Catalyst YouTube channel. Videos that support the use of Platforms but are not specific to BDC instances may be linked from the ecosystem documentation but will not appear on the BioData Catalyst YouTube channel.
Platform-generated videos that do not follow the above standards will not be made visible on the BioData Catalyst YouTube channel.
These videos are neither approved nor vetted by BDC, the BDC Consortium, BDC Platforms, or the organizations they represent. The opinions and other content in these videos are those of the video creators and sharers alone. These videos may NOT open or close with BDC branding and may only display BDC branding when capturing images of properties where it already appears (e.g., a screencap of an ecosystem platform instance).
BDC offers user-generated video tutorials and guides. Unless a user plans to seek Consortium-approval status for a video, BDC requires the following for user-generated videos, their creators, and their sharers:
Producers of user-generated videos, like all BDC ecosystem users, are always obligated to protect participant privacy and must follow NIH policies for data protection. User institutions are accountable and may be subject to sanctions if policies are violated.
By submitting a video for inclusion, users are attesting that the content of the video follows NIH policies for data protection, agreeing to follow this guidance, and committing to the inclusion of the following statement in video descriptions:
This is a user-generated video and is neither approved nor vetted by NHLBI BioData Catalyst (BDC), the members of the BDC Consortium, or the organizations they represent. For more information about BDC, go to https://biodatacatalyst.nhlbi.nih.gov/. For more BDC videos, go to https://www.youtube.com/channel/UCGkmY5oNK8uFZzT8vV_9KgQ. #BioDataCatalyst
To share a video, please contact: BDCatalystOutreach@nih.gov
Important Notes
User-generated videos that do not follow the above standards will not be made visible on the BioData Catalyst YouTube channel.
User-generated videos are just one type of user-contributed content BDC seeks to share. To learn about other kinds of user-generated content BDC seeks, read Contributing User Resources to BDC.
Consortium-produced/Consortium-approved videos must adhere to this addendum. While not required of BDC Platforms and users, BDC encourages them to consider these best practices for the videos they produce.
Consider if the video is fulfilling a need/gap
Required
Ensure video isn't replicating information already available to users
Required
Pre-approval is required to ensure relevance & consistency
Outline the video
Best practice
Consider how info can be presented in a concise & useful manner
Avoid having too much text on slides
Best practice
Slides should be concise; keep text & bullets at a minimum; use images when possible as viewers respond to images more positively than text
Use clear language & explain jargon
Best Practice
Simple communications are preferred; many viewers may not speak English as a first language
Ensure Section 508 compliance
Required
Subtitles & transcripts are required to ensure equity in access for people with disabilities
Ensure privacy policy compliance
Required
Protect study participants' privacy by not including personally identifiable, confidential, sensitive, or personal health information (e.g., blur screenshots with data)
Required
For people with disabilities, readability can be essential to a successful user experience
Use appropriate branding according to the BDC Style Guide
Required
Required to create a unified look across the BioData Catalyst ecosystem. Work with your BDC3 contact to get a copy of the style guide
Best practice
Search for meaningful keywords for titles, descriptions & tags
Create a meaningful title
Required
The title should be under 66 characters to make it easier for Google to display; make the title engaging & descriptive
Required
Think about the action the user is trying to take & the keywords they might use to find your video
Required
Transcription is free but likely needs editing; you can make changes to the text & timestamps of your captions
Best practice
Cards are clickable calls to action that take viewers to another video, channel, or site
Best practice
End screens can be added to a video's last 5 - 20 seconds to promote other videos, encourage viewers to subscribe, etc.
Best practice
Break up videos into sections (each with an individual preview) to provide more info & context; eases re-playing certain sections
Required
A clear & colorful video thumbnail will catch viewers' attention & let them see a quick snapshot of your video as they're browsing
Required
Tags are descriptive keywords you can add to your video to help viewers find your content; include at least 10 tags
Add links to BDC
Best practice
Where possible, provide links to relevant parts of the BDC ecosystem
Share completed videos with BDC3
Required
BDC3 sets appropriate privacy settings according to policy with input from the video creator
If Approved
Videos can be Public, Unlisted (link needed), or Private (invite needed; most secure)
BDC3 uploads to YouTube channel & adds to relevant playlists
If Approved
Videos can be in multiple playlists but don't need to be in any playlists
Teams and BDC3 develop plans to promote the video, if appropriate.
Best Practice
Potential options include Facebook, Instagram, LinkedIn, Snapchat, Twitter, Vimeo, WeChat, Pinterest, Flipgrid, etc.
BDC3 will prompt teams annually to check videos to ensure continued relevance.
Required
Outdated videos could cause viewers to lose confidence in the accuracy of info available on the channel
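The numeric items in this addendum (a title under 66 characters, at least 10 tags, an end screen covering the last 5 - 20 seconds) and the captioning requirement can be checked before a video is submitted. The following is a minimal sketch only: the VideoMetadata class, its field names, and the check_video function are hypothetical helpers written for illustration and are not part of any BDC or YouTube tooling.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_TITLE_CHARS = 65  # "under 66 characters"
MIN_TAGS = 10         # "include at least 10 tags"


@dataclass
class VideoMetadata:
    """Hypothetical container for the fields checked below."""
    title: str
    tags: List[str] = field(default_factory=list)
    has_captions: bool = False                 # subtitles/transcript (Section 508)
    end_screen_seconds: Optional[int] = None   # None means no end screen is used


def check_video(meta: VideoMetadata) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if not meta.title.strip():
        problems.append("title is empty")
    elif len(meta.title) > MAX_TITLE_CHARS:
        problems.append(f"title has {len(meta.title)} characters; keep it under 66")
    if len(meta.tags) < MIN_TAGS:
        problems.append(f"only {len(meta.tags)} tag(s); include at least 10")
    if not meta.has_captions:
        problems.append("captions or transcript missing (Section 508)")
    # End screens are a best practice, so only check the length when one is used.
    if meta.end_screen_seconds is not None and not 5 <= meta.end_screen_seconds <= 20:
        problems.append("end screen should cover the last 5 to 20 seconds")
    return problems


if __name__ == "__main__":
    draft = VideoMetadata(title="Search TOPMed studies in PIC-SURE", tags=["TOPMed"])
    for problem in check_video(draft):
        print("-", problem)
```

A check like this covers only the mechanical items; the judgment-based items in the addendum (relevance, clear language, privacy review, branding) still need a human reviewer.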
In the context of agile development and a Consortium with a diverse set of members, agile-development terms may mean different things to different individuals.
The table below defines the BDC Core Terminology:
Term | Definition/Description | Example
User Narrative | Descriptions of a user interaction experience within the system from the perspective of a particular persona. User Narratives are further broken down into Features, Epics, and User Stories. Currently formulated into rough 6-month timelines to benchmark progress. | An experienced bioinformatician wants to search TOPMed studies for a qualitative trait to be used in a GWAS study
Feature | A functionality at the system level that fulfills a meaningful stakeholder need. *Level at which the BDC3 coordinates | Search TOPMed datasets using PIC-SURE platform
Epic | A very large user story which can be broken down into executable stories. *NHLBI’s cost-monitoring level | PIC-SURE is accessible on BDC
User Stories | A backlog item that describes a requirement or functionality for a user. *Finest level of PM Monitoring | A user can access PIC-SURE through an icon on BDC to initiate search
Workstream | A collection of related features; orthogonal to a User Narrative | Workstreams impacted by the User Narrative above include: production system, data analysis, data access, data management
How to access additional data stacks
The Genotype-Tissue Expression (GTEx) Program is a widely used data resource and tissue bank to study the relationship between genetic variants (inherited changes in DNA sequence) and gene expression (how genes are turned on and off) in multiple human tissues and across individuals. For information on access to GTEx data, refer to as part of the documentation.
The is currently working to establish and implement guidelines and technical standards to empower end-user analyses across participating cloud platforms and facilitate the realization of a trans-NIH, federated data ecosystem. Participating institutions include BioData Catalyst, AnVIL, Cancer Research Data Commons, and Kids First Data Resource Center. Learn what data is currently hosted by these platforms by using the .
An NIH eRA Commons ID (or appropriate NIH Login) is required for submitting a Data Access Request (DAR). If you do not have an eRA Commons account, you must request one through your institution’s Office of Sponsored Research or equivalent. For more information, refer to .
To submit a DAR, users must have PI status through their institution. Non-PI users must work with a PI who can submit a DAR and add them as a downloader.
Step 1: Go to to log in to dbGaP.
Step 2: Navigate to My Projects.
Step 3: Select Datasets.
You can search by Primary disease type, or if you know the dataset you are interested in, you can use Study lookup.
We want to request HCT for SCD, so we will use the accession number phs002385. As you type the accession number, the numbers will start to auto-populate.
Select the study to add it to the Data Access Request. You can request up to 200 studies that you are interested in accessing.
The user can add additional datasets as needed to answer the research question.
Long-term survival and late death after hematopoietic cell transplant for sickle cell disease
Our project is limited to the requested dataset. We have no plans to combine it with other datasets.
In 2018, the National Heart, Lung, and Blood Institute (NHLBI) began work on BioData Catalyst, a shared virtual space where scientists can access NHLBI data and work with the digital objects needed for biomedical research (www.nhlbi.nih.gov/science/biodata-catalyst). This is a cloud-based platform that allows for tools, applications and workflows. It provides secure workspaces to share, store, cross-link and analyze large sets of data generated from biomedical research. BioData Catalyst addresses the NHLBI Strategic Vision objective of leveraging emerging opportunities in data science to facilitate research in heart, lung, blood and sleep disorders. It offers specialized search functions, controlled access to data and analytic tools via programming interfaces, and its interoperability will allow exchange of information with other components of the Data Commons. BioData Catalyst may be accessed by biomedical researchers and the public at large. The first available datasets in BioData Catalyst include data from NHLBI’s Trans-Omics for Precision Medicine (TOPMed) Program and the Cure Sickle Cell Initiative. Rigor in designing and performing scientific research and the ability to reproduce biomedical research are two of the cornerstones of science advancement. In order to test reproducibility of biomedical data available in BioData Catalyst, we accessed NHLBI data from the Cure Sickle Cell Initiative to test and validate the findings of a publication that utilized those data. That report focused on the effect of donor type and transplant conditioning regimen intensity on hematopoietic cell transplant outcomes for sickle cell disease. Hematopoietic cell transplant is potentially curative, yet this treatment is associated with risks for mortality from the treatment procedure. Published reports suggest the life expectancy of adults with sickle cell disease in the United States is shortened by at least two decades compared to the general population.
Thus, a fundamental question that is often asked is whether hematopoietic cell transplant over time would offer a survival advantage compared to treatment with disease-modifying agents. In the report1 that examined factors associated with survival after transplantation, young patients (aged ≤12 years) and patients who received their graft from an HLA-matched sibling had the highest survival. For those without an HLA-matched sibling, the data did not favor one alternative donor type over another.1 The purpose of the current analyses is two-fold: 1) test and validate a publication that utilized data in the public domain and 2) assess the utility of these data to conduct an independent study. The aim of the latter study was to estimate the conditional survival rates after hematopoietic cell transplantation stratified by time survived since transplantation and to compare all-cause mortality risks to those of an age-, sex-, and race-matched general population in the United States.
Investigators in the Cure Sickle Cell Initiative Data Consortium request access to data to examine rigor and reproducibility of data submitted - using a previous publication as reference. Additionally, we will calculate survival after hematopoietic cell transplant by time survived since transplant. For example, we will calculate the 5- and 10-year likelihood of being alive 2 and 5 years after transplantation.
The NHLBI-supported BioData Catalyst (www.nhlbiBioDataCatalyst.org) is a cloud-based infrastructure where heart, lung, blood, and sleep (HLBS) researchers can go to find, search, access, share, cross-link, and compute on large scale datasets. It will provide tools, applications, and workflows to enable those capabilities in secure workspaces. The BioData Catalyst will employ Amazon Web Services and Google Cloud Platform for data storage and compute. BioData Catalyst comprises the Data Commons Framework Services (DCFS) hosted and operated by the University of Chicago. DCFS will provide the gold master data reference as well as authorization/authentication and indexing services. The DCFS will also enable security interoperability with the secure workspaces. Workspaces will be provided by FireCloud, hosted and operated by the Broad Institute, Fair4Cures, hosted and operated by Seven Bridges Genomics and PIC-SURE operated by Harvard Medical School. For the NHLBI BioData Catalyst, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) issued to the Broad Institute, University of Chicago and Seven Bridges Genomics as presenting acceptable risk, and therefore the NCI ATO serves as an Interim Authority to Test (IATT) when used by designated TOPMed investigators and collaborators. Additionally, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) for Harvard Medical School.
NHLBI BioData Catalyst, Private, The NHLBI-supported BioData Catalyst (https://biodatacatalyst.nhlbi.nih.gov/) is a cloud-based infrastructure where heart, lung, blood, and sleep (HLBS) researchers can go to find, search, access, share, cross-link, and compute on large scale datasets. It will provide tools, applications, and workflows to enable those capabilities in secure workspaces.
The NHLBI BioData Catalyst will employ Amazon Web Services and Google Cloud Platform for data storage and compute. The NHLBI BioData Catalyst comprises the Data Commons Framework Services (DCFS) hosted and operated by the University of Chicago. DCFS will provide the gold master data reference as well as authorization/authentication and indexing services. The DCFS will also enable security interoperability with the secure workspaces. Workspaces will be provided by FireCloud, hosted and operated by the Broad Institute, Fair4Cures, hosted and operated by Seven Bridges Genomics and PIC-SURE operated by Harvard Medical School.
For the NHLBI BioData Catalyst, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) issued to the Broad Institute, University of Chicago and Seven Bridges Genomics as presenting acceptable risk, and therefore the NCI ATO serves as an Interim Authority to Test (IATT) when used by designated TOPMed investigators and collaborators. Additionally, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) for Harvard Medical School.
Google Cloud Platform is a public cloud platform that provides solutions and services such as virtual machines, database instances, storage, and more. We will use Google Compute Engine, a service that provides resizable compute capacity, to allocate Machine Types in which we will develop the methods and infrastructure necessary to build the NHLBI BioData Catalyst. Google Cloud offers several storage options that work in conjunction with Compute Engine: Google Cloud Storage and Google Compute Engine Persistent Disks. We expect to use each of these as they provide different capabilities, including persistent storage and direct and networked storage for attaching to running machine instances. We will use networking technologies based on Google’s Andromeda architecture, which can create networking elements at any level with software. This software-defined networking allows Cloud Platform's services to implement networking features that fit their exact needs, such as secure firewalls for virtual machines in Google Compute Engine. We will use Google Cloud Identity & Access Management to control user access to these compute resources.
You can check your access to data on BioData Catalyst using the public website or on your specific platform.
Go to and click Check My Access.
Go to , select NIH Login, then log in using your NIH credentials. Once logged in, select the Exploration tab. From the Data Access panel on the left, make sure Data with Access is selected. Note whether you have access to all the datasets you expect.
Click your username in the upper right and select Account Settings.
Select the tab for Dataset Access.
Browse the datasets and note whether you have access to all the datasets you expect.
Datasets you have access to will have green check marks.
Datasets you do not have access to will have red check marks.
You do not need to check your data access on BioData Catalyst powered by Terra. But before submitting a help desk ticket, ensure that you’ve done the following steps:
This is a repository for documentation related to the platforms and services that are part of the BDC ecosystem.
Click here to access the website.
Welcome to the BDC ecosystem and thank you for joining our community of practice. The ecosystem offers secure workspaces to support your data analysis in addition to a number of bioinformatics tools for analysis. The ecosystem currently hosts datasets from the Trans-Omics for Precision Medicine (TOPMed) program. There is a lot of information to understand and many resources (documentation, learning guides, videos, etc.) available, so we developed this overview to help you get started. If you have additional questions, please use the links at the very end of this document, under the "Questions" section, to contact us.
What is BDC?
The BDC ecosystem is a cloud-based platform providing tools, applications, and workflows in secure workspaces. It is designed to be nimble and responsive to the ever-changing conditions of the biomedical and data science communities. Though the primary goal of the BDC project is to build a data science ecosystem, at its core, this is a people-centric endeavor. BioData Catalyst is also building a community of practice working collaboratively to solve technical and scientific challenges.
What are we doing and why does it matter?
By increasing access to the NHLBI’s datasets and innovative data analysis capabilities, the BDC ecosystem accelerates efficient biomedical research that drives discovery and scientific advancement, leading to novel diagnostic tools, therapeutics, and prevention strategies for heart, lung, blood, and sleep disorders.
Who is developing BDC?
The ecosystem is funded by the National Heart, Lung, and Blood Institute (NHLBI). Researchers and other professionals receive funding from the NHLBI to work on the development of the ecosystem, together often referred to as “The BDC Consortium” or “The Consortium” for short. You can find on the about page of the project’s website and a is available in our documentation.
Learn about our culture.
The BDC community follows a that reflects the Consortium’s dedication to providing a harassment-free experience for everyone.
Find out the meanings of our terms and acronyms.
Like many professional communities, BDC has adopted terms to help us communicate quickly and more efficiently, but that can be a challenge for newcomers. To help, we created the NHLBI BioData Catalyst of terms and acronyms. If ever there is a time when an ecosystem term or acronym is unfamiliar and isn’t in the glossary, please so we can give you the information and add it to the glossary for future newcomers.
Learn about the platforms and services available in the ecosystem.
The BDC ecosystem features the following platforms and services.
Explore Available Data
BioData Catalyst Powered by Gen3 - Hosts genomic and phenotypic data and enables faceted search for authorized users to create and export cohorts to workspaces in a scalable, reproducible, and secure manner.
BioData Catalyst Powered by PIC-SURE - Enables access to all clinical data, supports feasibility queries, and allows cohorts to be built in real time and results to be exported via the API for analysis.
Analyze Data in Cloud-based Shared Workspaces
BioData Catalyst Powered by Seven Bridges - Collaborative workspaces where researchers can find and analyze hosted datasets (e.g. TOPMed) as well as their own data by using hundreds of optimized analysis tools and workflows in CWL, as well as JupyterLab and RStudio for interactive analysis.
BioData Catalyst Powered by Terra - Secure collaborative place to organize data, run and monitor workflow (e.g. WDL) analysis pipelines, and perform interactive analysis using applications such as Jupyter Notebooks and the Hail GWAS tool.
Use Community Tools on Controlled-access Datasets
Dockstore - Catalog of Docker-based workflows (from individuals, labs, organizations) that export to Terra or Seven Bridges.
How does data access work?
How do I login?
While all of the platforms within BioData Catalyst use eRA Commons credentials and iTrust for authorization and authentication, respectively, there are some slight differences between the platforms when getting set up:
BioData Catalyst Powered by Gen3 - Users do not set up usernames on Gen3. When logging in for the first time, select “Login from NIH”, then enter eRA Commons credentials at the prompt. This ‘User Identity’ is used to track the user on the system.
BioData Catalyst Powered by PIC-SURE - Similar to Gen3, user identities are used - researchers log into the system by selecting “Log in with eRA Commons.”
BioData Catalyst Powered by Seven Bridges - Users set up platform accounts. The first time on the system, users select “Create an account” and then proceed with entering their eRA Commons credentials. The user is then prompted to fill out a registration form with their name, email, and preferred username. Users are also asked to acknowledge that they have read the Privacy Act notice and then they can proceed to the platform.
BioData Catalyst Powered by Terra - Users initially log in using Google credentials and are asked to agree to the Terms of Service and Privacy Act notice. User activity is tracked via the Google credentials, but users can link their eRA Commons credentials to the account to get access to hosted datasets.
How do I check which data I can access?
What data are available in the ecosystem?
Harmonized data available.
Bring your own data and workflows into the system.
Learn about Genome-wide association study and genetic association testing on BioData Catalyst.
Share your workflows.
Costs and cloud credits.
Let us know about your publications and see how you can cite us.
Learn more, ask questions, or request help.
The BDC user community is essential to advancing science with new and exciting discoveries and informing the development of the ecosystem and its infrastructure. Members of the BDC user community learn how to explore the hosted data, use the services, and employ its tools in exciting and valuable ways that even developers may not know. Therefore, we actively invite user resource contributions to be shared with the community.
Consider supporting fellow ecosystem users in one of the following ways:
Written Documentation: Develop step-by-step guides, FAQs, checklists, and so on. Include screenshots to support user understanding.
Videos: Record a shortcut, tip, or process you think would be helpful to other users. Keep videos short by dividing larger processes into smaller segments and recording separate videos for each.
Respond to inquiries: Answer questions posed in the BioData Catalyst Forums. Forum content with significant engagement may get incorporated into written documentation or made into videos.
Note
All materials must ensure privacy policy compliance. Make certain to block any patient information on all content and protect study participants' privacy by not including personally identifiable, confidential, sensitive, or personal health information (for example, blur screenshots with data).
Experienced users who want to share their tips and tricks should consider the following questions.
Did someone already share my tip? Look through the resources already available to users before investing your time and energy into creating a new one.
Check the on the BioData Catalyst website
View the “Learn” and “Documentation” links available on the .
View the hosted on GitBook.
Explore the links to platform-specific documentation, videos, FAQs, community forums, blogs, tutorials, and upcoming events on the
Check out the videos on the .
Which format best suits your resource? Ask yourself, "Would I prefer to watch this on video or have a step-by-step guide to help me?" Then ask yourself which you think other users would prefer. Figuring out which you'd prefer is a great place to start because you are the one who identified the tip. But remember that you are creating something to help other people whose preferences will determine whether a resource gets used.
Is my tip complex, or does it require several steps? If so, a written how-to guide will probably be easier to follow than a video because someone watching a video may need to stop and restart it often. Still, visual aids will be helpful, so consider using screenshots in your how-to guide.
Is the guidance I want to share relatively straightforward, but it requires clicking through several pages/places? If so, a short video could be the best way to share your tip. Finding buttons or links can be much easier if shown rather than described.
If I create a video and make sure to go slowly enough that someone can follow along, will it be longer than 15 minutes? If so, creating a video may not be the right format, or breaking down the content into shorter (more digestible) videos may be preferable.
Am I comfortable following the ? If not, please create written documentation (e.g., a how-to guide).
Do I want to provide help in almost-real-time without needing to formally draft a document or record a video? Visit the often to provide answers to questions posed by other users or even just post your tip.
Once you decide upon the best way to share what you learned, you'll need to create your contribution and then share it.
In PIC-SURE, did you know you can use the search bar in the Data Access Table to find studies? Instead of scrolling through the table and looking at the list of available studies manually, you can search for studies. An example could be “MESA” for a specific study name, or a phenotype like “Sickle Cell” to find all sickle cell related studies. It seems obvious, but I’m not sure how many other users are aware of this, and I found it really helpful!
For Written Documentation, draft your suggestions and include screenshots to help lead users through the process you describe. Once complete, submit the file to BDCatalystOutreach@nih.gov for review and posting to the BioData Catalyst GitBook. Note that we accept Google Docs (preferably shared with at least suggesting-edits permission) and Microsoft Word formats; PDFs are not accepted.
Step-by-step guidance on using Dug Semantic Search: efficiently and effectively perform and interpret a search using Dug.
Dug Semantic Search is a tool that allows users to deep dive into BDC studies and biomedical topics, research, and publications to identify related studies, datasets, and variables. If you are interested in how Dug connects study variables to biomedical concepts, or visit the.
This tool applies semantic web and knowledge graph techniques to improve BDC research data Findability, Accessibility, Interoperability, and Reusability (FAIR). Through this process, semantic search helps users identify novel relations, build unique research questions, and identify potential collaborations.
BDCatalyst-RFC-#: 11
BDCatalyst-RFC-Title: NHLBI BioData Catalyst Ecosystem Security Statement
BDCatalyst-RFC-Type: Consensus Building
Name of the person who is to be Point of Contact: Sarah Davis
Email of the person who is to be Point of Contact: sdavis@renci.org
Submitting Team: BDC3/NHLBI
Requested BDCatalyst-RFC posting start date: 6/14/2021
Date Emailed for consideration: 6/14/2021
BDCatalyst-RFC-Status: Comment only
URL Link to this document:
URL Link to the website:
License: This work is licensed under a .
The purpose of this RFC is to provide the NHLBI BioData Catalyst Consortium and users of the NHLBI BioData Catalyst ecosystem with a clear statement on security mechanisms of the ecosystem that protect the confidentiality, integrity, provenance, and availability of the hosted data as well as any data that may be uploaded using the ecosystem’s “Bring Your Own Data” (BYOD) functionality.
Figure 1. The NHLBI BioData Catalyst ecosystem leverages separately developed and managed platforms to maximize flexibility for users based on their research needs, expertise, and backgrounds. Utilizing multiple Authorizations to Operate (ATO), these platforms combine to provide secure, cloud-based workspaces, user authentication and authorization, search, tools and workflows, applications, and new innovative features to address community needs.
The NHLBI and BioData Catalyst Consortium recognize the importance of protecting both the privacy and security of the data and respecting the consent of the study participants whose data is stored within the BioData Catalyst ecosystem. Tackling these issues produces some challenges beyond those faced by most Federal Information Systems. The BioData Catalyst Consortium has implemented many innovative approaches to enable compliance and ensure that users understand their responsibility to protect data as articulated in specific Data Use Agreements (DUA). These approaches and controls work to protect the confidentiality, integrity, and availability of the data; the privacy of the study participants who have contributed data; and data that may be uploaded to BioData Catalyst using the ecosystem’s “Bring Your Own Data” (BYOD) functionality. While the same general security controls are applied to both system and BYOD data, BYOD data is further protected as the ecosystem provides access only to the data’s uploaders and their designated collaborators.
From a Federal Information Security Modernization Act (FISMA) perspective, the BioData Catalyst ecosystem is a set of software systems with distinct security boundaries. Each system owner holds an Authority to Operate (ATO) issued by the NIH. The ATO is the result of a rigorous Security Assessment and Authorization (SA&A) process and third party assessment consistent with guidance from the National Institute of Standards and Technology (NIST). The ecosystem operates via a set of Interconnection Security Agreements (ISA) (Reindl 1979) and utilizes several existing components of security infrastructure (Bridges 2017, Gutiérrez-Sacristán et al. 2018) developed for other NIH platforms. Where the documentation provided as part of the SA&A process describes how security controls are implemented based on the NIST Special Publication 800-53r4 (see Endnote), the ISAs describe the permitted exchange of data and establish ecosystem-wide incident response, logging and auditing expectations that enable the consortium to respond in a unified manner to any suspected cybersecurity incident. The SA&A documentation provides for regular evaluation of the security of the component systems including regular scanning for vulnerabilities and the conduct of an annual penetration test. This level of security represents a baseline, and the BioData Catalyst ecosystem will extend protections over time.
Where the processes, policies, and technical controls protect confidentiality, integrity, and availability of data in accordance with Federal statute and regulation, there are additional ways to ensure that data is used in a manner consistent with study participants’ wishes, as represented by the consent form participants sign when enrolling in a specific study. Respect for these consents is critical to maintaining the public’s trust and requires additional policy, process, and technical controls. The respect for consent in NHLBI BioData Catalyst is enforced using normative NIH policies and processes for data sharing and using the existing infrastructure provided by the National Center for Biotechnology Information’s (NCBI) Database of Genotypes and Phenotypes (dbGaP). All NHLBI-provided data within the NHLBI BioData Catalyst ecosystem are registered in dbGaP; in this process, data are assigned “consent groups” that describe in a machine-readable format the parameters of the consent for the data. These range from the most expansive “General Research Use” to more restrictive groups, such as those only allowing for research outcomes related to “Health Medical Biomedical topics” or even to specific diseases, such as Chronic Obstructive Pulmonary Disease. Further, while secondary analysis of data is not considered human subjects research as described in the Common Rule (45 CFR Part 46), some datasets require the review of a research proposal by an Institutional Review Board (IRB) or a Letter of Collaboration (LOC) with the originating study Principal Investigator, as determined by the informed consents or special considerations that the submitting institution has determined are needed. These measures provide additional protection for datasets with particular sensitivity or special criteria for use.
For instance, IRB review is required when 1) informed consents that were signed by the study participants state that IRB oversight for secondary use of the data is required, and/or 2) the study IRB of record determines that the data may contain sensitive information that requires IRB oversight for secondary research. For collaboration letters, the informed consents indicate that the secondary use of the data by researchers outside the study will work with the study and therefore formal collaborations need to be set in place. While these are rarely used, they provide additional protection under special circumstances, such as where an indigenous population or sovereign nation requires direct control of how their data is used. Because consent is expressed at the individual level there may be a variety of consents for a study, either because the study offered choices to their participants or because the study consent evolved over an extended longitudinal study, such as the Framingham Heart Study. These variations in consent are reflected as multiple “consent groups” within a study and may mean that an investigator may only receive permission for subsets of study participants.
While NIST-800-53r4 is a thorough standard for many kinds of systems, it has some gaps for the most-modern systems encountered. There are additional standards to apply, either from the High column of NIST-800-53r4 or addendums as “best practice”.
In particular, the Standard does not give guidance on modern applications (“appsec”) -- especially when the Infrastructure is serverless and the entire security surface is the application itself. For instance, there are requirements for regular scanning but scanning tools do not scan APIs or modern Single Page Apps (SPA) sufficiently. A “web app scan” with tools like IBM’s Appscan would return no vulnerabilities on an API without actually testing it, yet it would satisfy RA-5 of NIST-800-53 to run such a scan. Nor does the standard require that the Infrastructure as a Service layer (AWS, GCP, Azure) abide by a continual scanning posture for misconfiguration -- only Networks and VMs are specified. That is one example of where NIST-800-53r4 doesn’t work for modern applications.
There’s also no guidance from NIST at large about running an API where external parties are building “clients” to that API. Does allowing clients extend the security boundary, and thus all 3rd party applications are to be evaluated as such? Is there a different consideration for these 3rd party applications? The standard is silent there. Companies like Apple and Google, through their AppStores, require all 3rd party apps to undergo evaluation to their own security standards, and such an idea might be applicable here. NHLBI might consider adding some extra controls, what the AllOfUs Program calls FISMA+, for enhanced security.
Navigate to to access Dug Semantic Search.
Semantic search is a concept-based search engine designed for users to search biomedical concepts, such as “asthma,” “lung,” or “fever,” and the variables related to and/or used to measure them. For example, a search for “chronic pain acceptance” will return a list of related biomedical concepts, such as chronic pain, headaches, neuralgia, or fibromyalgia, each of which can be expanded to display related variables and CDEs. Semantic search can also find variable names and descriptions directly, using synonyms from its knowledge graphs to find search-related variables.
Enter a search term and press “Enter,” or click on the Search button. This will take you to the Semantic Search interface.
BioData Catalyst Powered by PIC-SURE has integrated clinical and genomic data from a variety of heart, lung, blood, and sleep related datasets. These include NHLBI Trans-Omics for Precision Medicine (TOPMed) and TOPMed related studies, BioLINCC datasets, and COVID-19 datasets.
View a summary of the data you have access to by viewing the Data Access Table.
This table displays information about the study and associated data, including the full and abbreviated name of the study, study design and focus, the number of clinical variables, participants, and samples sequenced, additional information with helpful links, consent group information, and the dbGaP accession number (or phs number). You are also able to see which studies you are authorized to access in the Access column of the table. For information from dbGaP on submitting a data access request, refer to . Note that studies with a sickle cell disease focus contain links to the for additional information.
The BioData Catalyst ecosystem hosts several datasets from the . To access the BioLINCC studies, you must request access through dbGaP even if you have authorization from BioLINCC.
This checklist is intended to help new users understand their obligations regarding access and permissions, which individual users are responsible for obtaining and maintaining.
Users log into BioData Catalyst platforms with their eRA Commons credentials. For more information, see.
Users create an eRA Commons Account through their institution's Office of Sponsored Research or equivalent. For more information, refer to.
Users who want to access a hosted controlled study on the BioData Catalyst ecosystem must be approved for access to that study in the NIH Database of Genotypes and Phenotypes (). For more information, see and. Note that obtaining these approvals can be a time-intensive process; failure to obtain them in a timely manner may delay data access.
Users have two options for obtaining dbGaP approval depending on whether they already are affiliated with a PI who has dbGaP access to the relevant data:
The BioData Catalyst user has no affiliation with an existing dbGaP approved project. In this case, the user needs to create their own dbGaP project and then submit a data access request (DAR) for approval by the NHLBI Data Access Committee (DAC). This process often takes 2-6 months, depending on whether local IRB approval is required by the dataset the user is requesting, the amount of time it takes for local review of the dbGaP application by the user’s home institution, and processing by the DAR committee. See the or. Once a DAR is approved, it can take a week or longer for the approval to be reflected on BioData Catalyst.
The BioData Catalyst user is affiliated with an existing principal investigator, who already has an approved dbGaP application with existing DAR access (for example, the BioData Catalyst user is a post-doctoral fellow in a PI’s lab). A principal investigator with dbGaP DAR access assigns the User as a “Downloader” in dbGaP. See. It can take about 24 hours for “Downloader” approval to be reflected on BioData Catalyst.
BioData Catalyst hosts data from the NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium. BioData Catalyst users are not automatically onboarded as TOPMed investigators. BioData Catalyst users who are not members of the TOPMed Consortium may apply for released data through the regular dbGaP Data Access Request process.
When conducting TOPMed-related research on BioData Catalyst, members of the TOPMed consortium must follow the and associated processes; for example, operating within Working Groups.
For more information, refer to the following resources:
Users must ensure that IRB data use agreements (DUAs) are approved and maintained as they are enforced by the BioData Catalyst ecosystem.
The BioData Catalyst ecosystem hosts several datasets from the . The PIC-SURE platform has integrated the clinical and genomic data from all studies listed in the Data Access Dashboard. Through the ingestion process, occasionally PIC-SURE will ingest phenotypic data for the TOPMed studies prior to the genomic data.
There are limited amounts of harmonized data available at this time. The TOPMed Data Coordinating Center (DCC) curation team has identified 44 variables that are shared across 17 NHLBI studies and normalized the participant values for these variables.
The 44 harmonized variables available are listed in the table in Appendix 2. For more information on this initiative, you can view the or on the .
To request special permission for other types of download, please contact the BioData Catalyst help desk at .
Complete & submit for pre-approval
for interaction
for marketing
& create Table of Contents
, including the required #BioDataCatalyst tag
Email with info on accessing the video, a thumbnail image, descriptive tags to include, and the video description
Amazon Web Services (AWS), Commercial: Amazon Web Services (AWS) is a public cloud platform that provides solutions and services such as virtual machines, database instances, storage, and more. We will use the Amazon Elastic Compute Cloud (Amazon EC2), a web service that provides resizable compute capacity, to launch instances from Amazon Machine Images (AMIs), in which we will develop the methods and infrastructure necessary to build the NHLBI BioData Catalyst. AWS offers several storage options that work in conjunction with EC2: Amazon Simple Storage Service (Amazon S3), Amazon Elastic Block Store (EBS), and Amazon Elastic File System (Amazon EFS). We expect to use each of these, as they provide different capabilities, including persistent storage and direct and networked storage for attaching to running instances. We will use the Amazon Virtual Private Cloud (VPC) to provide security and robust networking functionality to these compute resources and AWS Identity and Access Management (IAM) to control user access to these compute resources. AWS offers extensive security and has written a white paper with guidelines for working with controlled access data sets in AWS, which we will follow (see ).
You can request access to data by visiting the . For more information on Data Access, see the on the page.
Go to and login. To check your data access:
Establish a link in to your eRA Commons/NIH Account and the University of Chicago DCP Framework. To link eRA Commons, NIH, and DCP Framework Services, go to your Profile page in BioData Catalyst powered by Terra and log in with your NIH credentials.
If you still have issues using particular files or datasets in analyses on BioData Catalyst powered by Terra, submit a request to our .
You do not need to check your data access on BioData Catalyst powered by PIC-SURE. Instead, refer to the page, then click Check My Access.
The NHLBI BioData Catalyst website provides further details about the available in the ecosystem. We encourage you to create accounts on all the platforms as you get to know BioData Catalyst.
The BioData Catalyst ecosystem manages access to the hosted controlled data using data access approvals from the NIH Database of Genotypes and Phenotypes (). Therefore, users who want to access a hosted controlled study on the ecosystem must be approved for access to that study in dbGaP.
Users log into BioData Catalyst platforms with their eRA Commons credentials (see ), and authentication is performed by iTrust. Every time a user logs in, the ecosystem checks the user's credentials to ensure they can only access the data for which they have dbGaP approval.
Details about how data access works on the NHLBI BioData Catalyst ecosystem are .
We recommend users first check their access to data before logging in. Do this by going to the and clicking on the “Check My Access” button. Once you confirm your data access, go to the page from which you click on the “Launch” hyperlink for the platform or service you wish to use. Platforms and services have login/sign in links on their pages that bring you to the pages on which you enter your eRA Commons credentials. on checking your access to data is also available.
The NHLBI BioData Catalyst currently hosts a subset of datasets from TOPMed including phs numbers with genomic data and related phs numbers with phenotype data. You can find information about which are currently hosted on the of the website as well as in the .
There are limited amounts of harmonized data available to users with appropriate access at this time. The TOPMed Data Coordinating Center curation team has produced forty-four (44) harmonized phenotype variables from seventeen (17) NHLBI studies. Information about the 17 studies and the 44 variables can be found in the .
We allow researchers to bring their own data and workflows into the ecosystem to support their analysis needs. Researchers can bring their own datasets into and . Users can also bring their own workflows to the system. Users can either add workflows to in CWL or WDL, or they can directly on BioData Catalyst Powered by Seven Bridges and for use on BioData Catalyst Powered by Terra.
Walk through our self-paced genome-wide association study and genetic association testing .
We encourage users to publish their workflows so they can be used by other researchers working in the NHLBI BioData Catalyst ecosystem. Share your workflows via .
BioData Catalyst hosts a number of datasets available for analysis to users with appropriate data access approvals. Users are not charged for the storage of these hosted datasets; however, if hosted data is used in analyses users incur costs for computation and storage of derived results. Cloud credits are available on the system, and you can .
If you are writing a manuscript about research you conducted using NHLBI BioData Catalyst, please use .
Immediately after learning your manuscript has been accepted, please email to let us know. Please include in your email the manuscript title, the name of the publication that accepted your manuscript, and information about pre-publication posting (if it will take place), along with your name and contact information.
Answers to are available on the website, as are many resources that can be found under . You can also use a form to , and if you aren’t sure which selections to make on the form, please see our .
For a quick tip that you want to distribute swiftly, draft something short that you can easily post to the . The following is an example of a quick tip for using PIC-SURE’s Data Access Table:
For videos, review the User-Generated Videos portion of the . By submitting a video, you agree to those conditions. Once your video is uploaded to your YouTube channel, email the link to BDCatalystOutreach@nih.gov for consideration to be linked to the BDC YouTube channel also.
Forum messages will post directly in the .
Written documentation will live in the .
User-generated videos will be linked in the .
BioData Catalyst uses telemetry provided by dbGaP to enforce compliance with consents. Accordingly, users of BioData Catalyst can see only the data for which they have completed the process of a dbGaP Data Access Request (DAR) and after they receive approval from an NIH Data Access Committee (DAC). DAC approval results in a Data Use Agreement (DUA) describing any Data Use Limitations asserted by the originating study Principal Investigator, including ensuring that any requirements for an IRB review or Letter of Collaboration are met. On the BioData Catalyst ecosystem, the Gen3 “Fence” service, developed and operated by the University of Chicago and utilized by other NIH platforms and many other non-federal data commons, ensures enforcement of data access requirements. In order to ensure that the NHLBI maintains control over the use of the data and has the ability to audit this use, the BioData Catalyst policy does not allow download of source data outside the cloud ecosystem. Instead, BioData Catalyst provides access to data in approved cloud environments where computation occurs, which is documented in an ISA via standard Application Programming Interfaces (APIs) that use the authentication and authorization provided by the NIH Researcher Auth Service (RAS) (REF), together with Fence, to protect access and to allow for monitoring and auditing for appropriate data use (e.g. within scope of the approved project). These APIs include implementations of protected GA4GH Data Repository Service (DRS) for access to data objects such as genomic data and protected PIC-SURE interfaces for access to phenotypic, genotypic variant, and electronic health record data. The use of these APIs will, once fully implemented, enable other trusted cloud-based systems that meet equivalent security requirements to access data stored within BioData Catalyst for analysis using that other systems’ tools without the data being downloaded outside the security perimeters of the systems. 
This commitment to the use of APIs together with the requirement that data stay within the designated security boundaries is a critical component of making NHLBI data FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016, Corpas et al. 2018), while also ensuring confidentiality of data and respect for consent, regardless of the platform where the data is analyzed. BioData Catalyst has extended this model through the use of the (RAS) to provide authentication and authorization controls which together with the use of secure APIs is enabling secure interoperability with other trusted NIH-funded platforms such as the NHGRI’s AnVIL and the Gabriella Miller Kids First Data Resource.
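The GA4GH Data Repository Service (DRS) mentioned above is a public standard in which a data object is described by a `GET` request to `/ga4gh/drs/v1/objects/{object_id}`. As a rough sketch of what an authorized client request could look like, the example below prepares such a request with the Python standard library without sending it. The host name, object ID, and token are hypothetical placeholders, and the use of a bearer token is an assumption for illustration; none of these values come from BioData Catalyst.

```python
from urllib.parse import quote
from urllib.request import Request

# Path prefix defined by the GA4GH DRS v1 specification.
DRS_OBJECTS_PATH = "/ga4gh/drs/v1/objects/"


def build_drs_object_request(host: str, object_id: str, access_token: str) -> Request:
    """Prepare (but do not send) a GET request for a DRS object's metadata."""
    # Percent-encode the object ID so it is safe to place in the URL path.
    url = f"https://{host}{DRS_OBJECTS_PATH}{quote(object_id, safe='')}"
    return Request(
        url,
        # Assumption: the service accepts an OAuth2-style bearer token.
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


# Hypothetical values for illustration only.
request = build_drs_object_request("drs.example.org", "example-object-id", "EXAMPLE_TOKEN")
print(request.full_url)
```

Whether the request succeeds is decided on the server side: a service such as Fence would check the caller's approvals before any data is returned, so the client code carries credentials but makes no access decision of its own.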
You can also check the data you have access to by going to the page on the BioData Catalyst website and clicking Check My Access.
Information on TOPMed
(login required)
Refer to the BioData Catalyst page to learn more about topics such as data privacy, access controls, and restrictions.
Use your eRA Commons account to review the data indexed by BioData Catalyst to which you have access on the page. For more information, see .
If your data is not indexed, inform BioData Catalyst team members during your onboarding meetings or by submitting a .
| Parameter | Description |
| --- | --- |
| Data with Access (default) | Displays projects you have access to. |
| Data without Access | Displays data you do not have subject-level access to, but for which summary statistics can be accessed. |
| All Data | Displays all projects, including those you have no access to. A lock will appear for data you cannot access. |
| Study | Abbreviation | dbGaP accession |
| --- | --- | --- |
| Atherosclerosis Risk in Communities Study | ARIC | phs000280 |
| Cardiovascular Health Study | CHS | phs000287 |
| Cleveland Family Study | CFS | phs000284 |
| Coronary Artery Risk Development in Young Adults Study | CARDIA | phs000285 |
| Epidemiology of Asthma in Costa Rica Study | CRA | phs000988 |
| Framingham Heart Study | FHS | phs000007 |
| Genetic Epidemiology Network of Arteriopathy | GENOA | phs001238 |
| Genetic Epidemiology of COPD | COPDGene | phs000179 |
| Genetics of Cardiometabolic Health in Amish | AMISH | phs000956 |
| Genome-Wide Association Study of Venous Thrombosis Study | MAYOVTE | phs000289 |
| Heart and Vascular Health Study | HVH | phs001013 |
| Hispanic Community Health Study - Study of Latinos | HCHS-SOL | phs000810 |
| Jackson Heart Study | JHS | phs000286 |
| Multi-Ethnic Study of Atherosclerosis | MESA | phs000209 |
| Study of Adiposity in Samoans | SAS | phs000914 |
| Women’s Health Initiative | WHI | phs000200 |
PIC-SURE provides two ways to search: PIC-SURE Open Access and PIC-SURE Authorized Access. PIC-SURE Open Access enables the user to explore aggregate-level data without any dbGaP data authorizations. PIC-SURE Authorized Access allows the user to explore participant-level data and requires authorization to access at least one study through an active dbGaP Data Access Request (DAR).
Table Comparison of PIC-SURE Open and Authorized Access
Removed stigmatizing variables
✓
Data obfuscation
✓
dbGaP approval to access required
✓
Access to aggregate counts
✓
✓
Access to participant-level data
✓
Phenotypic variable search
✓
✓
Phenotypic variable filtering
✓
✓
Genomic variable filtering
✓
Data retrieval
✓
Visualizations
✓
The BioData Catalyst ecosystem hosts several datasets from the NIH NHLBI Collaborating Network of Networks for Evaluating COVID-19 and Therapeutic Strategies (CONNECTS) program. These COVID-19 related studies follow the guidelines for implementing common data elements (CDEs) and for de-identifying dates, ages, and free text fields. For more information about these efforts, you can view the CDE Manual and De-Identification Guidance documents on the CONNECTS COVID-19 Therapeutic Trial Common Data Elements webpage.
Table of COVID-19 Studies Included in the CONNECTS Program Available in PIC-SURE
| Study | Abbreviation | dbGaP accession |
| --- | --- | --- |
| A Multicenter, Adaptive, Randomized Controlled Platform Trial of the Safety and Efficacy of Antithrombotic Strategies in Hospitalized Adults with COVID-19 | ACTIV4a | phs002694 |
| COVID-19 Positive Outpatient Thrombosis Prevention in Adults Aged 40-80 | ACTIV4b | phs002710 |
| Clinical-trial of COVID-19 Convalescent Plasma in Outpatients | C3PO | phs002752 |
How to get started with PIC-SURE and the common endpoints you can use to query any resource registered with PIC-SURE
The PIC-SURE v2 API is a meta-API used to host any number of resources exposed through a unified set of generalized operations.
PIC-SURE Repositories:
PIC-SURE API: This is the repository for version 2+ of the PIC-SURE API.
PIC-SURE Wiki: This is the wiki page for version 2+ of the PIC-SURE API.
BioData Catalyst PIC-SURE: This is the repository for the BioData Catalyst environment of PIC-SURE.
PIC-SURE-ALL-IN-ONE: This is the repository for PIC-SURE-ALL-IN-ONE.
Additional PIC-SURE Links:
Avillachlab-Jenkins Repository: A link to the Avillach Lab Jenkins repository.
Avillachlab-Jenkins Dev Release Control: A repository for Avillach Lab Jenkins development release control.
The following are the collected client libraries for the entire PIC-SURE project.
The PIC-SURE User Interface acts as a visual aid for running normal queries of resources through PIC-SURE.
PIC-SURE User Interface Repositories:
PIC-SURE HPDS UI: The main High Performance Data Store (HPDS) UI repository.
Additional PIC-SURE User Interface Links:
PIC-SURE UI Flow: Links to a google drawing of the PIC-SURE UI flow.
The PSAMA component of the PIC-SURE ecosystem authorizes and authenticates all actions taken within PIC-SURE.
PSAMA Repos:
Additional PSAMA Links:
PSAMA Core Logic: This is where the core of the PSAMA application is stored in GitHub
HPDS is a datastore designed to work with the PIC-SURE meta-API. It grants researchers fast, dependable access to static datasets and the ability to produce statistics-ready dataframes filtered on any variable they choose at any time.
HPDS Repositories:
PIC-SURE HPDS: The main HPDS repository.
PIC-SURE HPDS Python Client: Python client library to run queries against a PIC-SURE HPDS resource.
PIC-SURE HPDS R Client: R client library to run queries against a PIC-SURE HPDS resource.
PIC-SURE HPDS UI: The main HPDS UI repository.
HPDS Annotation: This repository describes steps to prepare and annotate VCF files for loading into HPDS.
Patient ID
This is the HPDS Patient num. This is PIC-SURE HPDS’s internal Identifier.
Topmed / Parent Study Accession with Subject ID
These are the identifiers used by each team in the consortium to link data.
Values must follow this mask <STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID> Eg: phs000007.v30_XXXXXXX
DBGAP_SUBJECT_ID
This is a generated id that is unique to each patient in a study.
Controlled by dbGaP.
It is not unique across unrelated studies; however, patients can be linked across studies. See SOURCE_SUBJECT_ID.
A patient is assigned the same dbGaP subject ID across related studies. For dbGaP to assign the same dbGaP subject ID, include the two variables SUBJECT_SOURCE and SOURCE_SUBJECT_ID.
This identifier is used in all the phenotypic data files and is what we sequence to an HPDS Patient Num (Patient ID). All sequenced identifiers are stored in a PatientMapping file in S3. These mappings allow HPDS data to be correlated back to the raw data sets.
SUBJECT_ID
This is a generated id that is unique to each patient in a study.
Controlled by the submitter of a study.
For FHS, this is replaced with shareid for phs000007, while phs000974 uses SUBJECT_ID. However, the values in these two columns are the same.
SHARE_ID
For FHS phs000007 this was used instead of SUBJECT_ID, but not for FHS phs000974
SOURCE_SUBJECT_ID
This is used internally by DBGAP in conjunction with SUBJECT_SOURCE to allow submitters to associate subjects across studies.
SAMPLE_ID
De-identified sample identifier.
These are the IDs that link to the molecular data in dbGaP (VCFs, etc.).
\_Topmed Study Accession with Subject ID\
Generated identifier for TOPMed Studies. These identifiers are a concatenation using the accession name and “SUBJECT_ID” from a study’s subject multi file.
<STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID>
Eg: phs000974.v3_XXXXXXX
\_Parent Study Accession with Subject ID\
Generated identifier for PARENT Studies. In most studies this follows the same pattern as the TOPMed Study Accession with Subject id.
However, Framingham’s parent study phs000007 does not contain a SUBJECT_ID column; the SHAREID column is used instead.
Eg: phs000007.v3_XXXXXXX
\_VCF Sample Id\
This variable is stored in the sample multi file in each dbGaP study.
This is the TOPMed DNA sample identifier. This is used to give each sample/sequence a unique identifier across TOPMed studies.
Eg: NWD123456
Patient ID (not a concept path but exists in data exports)
This is PIC-SURE’s internal Identifier. It is commonly referred to as HPDS Patient num.
This identifier is generated and assigned to subjects when they are loaded. It is not meant for data correlation between different data sources.
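The accession-based identifiers above follow the mask `<STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID>`. As a minimal sketch, a value of this form could be split into its parts as shown below. The function and field names are illustrative and not part of PIC-SURE, and the pattern assumes accessions of the form `phs` plus digits and versions of the form `v` plus digits, as in the examples in this section.

```python
import re
from typing import NamedTuple


class StudySubjectId(NamedTuple):
    """Parts of a '<STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID>' value."""
    accession: str   # e.g. "phs000007"
    version: str     # e.g. "v30"
    subject_id: str  # submitter-controlled subject identifier


# Assumption: "phs" + digits, then "." + "v" + digits, then "_" + subject ID.
# The subject ID is everything after the first underscore, so it may itself
# contain underscores.
_MASK = re.compile(r"^(phs\d+)\.(v\d+)_(.+)$")


def parse_study_subject_id(value: str) -> StudySubjectId:
    """Split an identifier into accession, version, and subject ID."""
    match = _MASK.match(value)
    if match is None:
        raise ValueError(f"value does not follow the expected mask: {value!r}")
    return StudySubjectId(*match.groups())


def build_study_subject_id(accession: str, version: str, subject_id: str) -> str:
    """Concatenate the parts back into the masked form."""
    return f"{accession}.{version}_{subject_id}"


# The subject ID below is a made-up placeholder.
parsed = parse_study_subject_id("phs000007.v30_1234567")
print(parsed.accession, parsed.version, parsed.subject_id)
```

Rejecting values that do not match the mask, rather than guessing at their parts, keeps malformed identifiers from being silently linked to the wrong study.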
BioData Catalyst Powered by PIC-SURE YouTube channel
Introduction to BioData Catalyst Powered by PIC-SURE
Basics: Applying a Variable on a Filter
Basics: Editing a Variable Filter
PIC-SURE Open Access: Interpreting the Results
PIC-SURE Authorized Access: Applying a Genomic Filter
PIC-SURE Authorized Access: Add Variables to Export
PIC-SURE Authorized Access: Select and Package Data Tool
PIC-SURE integrates clinical and genomic datasets across BioData Catalyst, including TOPMed and TOPMed related studies, COVID-19 studies, and BioLINCC studies. Each variable is organized as a concept path that contains information about the study, variable group, and variable. Though the specifics of the concept paths are dependent on the type of study, the overall information included is the same.
For more information about additional dbGaP, TOPMed, and PIC-SURE concept paths, refer to Appendix 1.
Table of Data Fields in PIC-SURE
| Field | Studies in dbGaP format | Studies not in dbGaP format |
| --- | --- | --- |
| General organization | Data are organized using the format implemented by dbGaP. Find more information on the dbGaP data structure . Generally, a given study will have several tables, and those tables have several variables. | Data do not follow dbGaP format; there are no phv or pht accessions. Data are organized in groups of like variables, when available. For example, variables like Age, Gender, and Race could be part of the Demographics variable group. |
| Concept path structure | \phs\pht\phv\variable name\ | \phs\variable name |
| Variable ID | phv corresponding to the variable accession number | Equivalent to variable name |
| Variable name | Encoded variable name that was used by the original submitters of the data | Encoded variable name that was used by the original submitters of the data |
| Variable description | Description of the variable | Description of the variable, as available |
| Dataset ID | pht corresponding to the trait table accession number | Equivalent to dataset name |
| Dataset name | Name of the trait table | Name of a group of like variables, as available |
| Dataset description | Description of the trait table | Description of a group of like variables, as available |
| Study ID | phs corresponding to the study accession number | phs corresponding to the study accession number |
| Study description | Description of the study from dbGaP | Description of the study from dbGaP |
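The two concept path structures in the table above (\phs\pht\phv\variable name\ and \phs\variable name) can be split into their parts with a few lines of code. This is a sketch only: the function names are hypothetical, and it assumes a path has exactly four segments in the dbGaP-format case and exactly two otherwise. The pht and phv values in the usage note are placeholders, not real accessions.

```python
from typing import Dict, List


def split_concept_path(path: str) -> List[str]:
    """Split a backslash-delimited concept path into its non-empty segments."""
    return [segment for segment in path.split("\\") if segment]


def describe_concept_path(path: str) -> Dict[str, str]:
    """Label each segment according to the two structures in the table."""
    segments = split_concept_path(path)
    if len(segments) == 4:
        # dbGaP-format study: \phs\pht\phv\variable name\
        study, dataset, variable_id, name = segments
        return {
            "study_id": study,
            "dataset_id": dataset,
            "variable_id": variable_id,
            "variable_name": name,
        }
    if len(segments) == 2:
        # Study without pht/phv accessions: \phs\variable name
        study, name = segments
        return {"study_id": study, "variable_name": name}
    raise ValueError(f"unexpected concept path shape: {path!r}")
```

Calling describe_concept_path("\\phs000007\\pht000000\\phv00000000\\AGE\\") returns a dictionary with study_id "phs000007" and variable_name "AGE".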
Note that there are two data types in PIC-SURE: categorical and continuous. Categorical variables are variables that have categorized values. For example, “Have you ever had asthma?” with values “Yes” and “No” is a categorical variable. Continuous variables are variables that have a numeric range of values. For example, “Age” with a value range from 10 to 90 is a continuous variable. The internal PIC-SURE data load process determines the type of each variable based on the data.
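The exact rule used by the PIC-SURE data load process is not described here. As a rough illustration of deciding a type "based on the data", the sketch below treats a variable as continuous when every non-empty value parses as a number, and as categorical otherwise. The function name and the rule itself are assumptions made for this example.

```python
from typing import Iterable


def infer_data_type(values: Iterable[str]) -> str:
    """Return "continuous" if every non-empty value is numeric, else "categorical"."""
    seen_value = False
    for raw in values:
        text = raw.strip()
        if not text:
            continue  # skip missing values
        seen_value = True
        try:
            float(text)
        except ValueError:
            return "categorical"  # at least one non-numeric value
    # A column with no values at all is treated as categorical (assumption).
    return "continuous" if seen_value else "categorical"
```

Under this rule, the asthma example with values "Yes" and "No" is categorical, and an age column with values such as "10" and "90" is continuous.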
PIC-SURE: Patient Information Commons Standard Unification of Research Elements
The Patient Information Commons: Standard Unification of Research Elements (PIC-SURE) integrates clinical and genomic data to allow users to search, query, and export data at the variable and variant levels. This allows users to create analysis-ready data frames without manually mapping and merging files.
BioData Catalyst Powered by PIC-SURE functions as part of the BioData Catalyst ecosystem, allowing researchers to explore studies funded by the National Heart, Lung, and Blood Institute (NHLBI), whether they have been granted access to the participant level data or not.
To obtain access to BioData Catalyst Powered by PIC-SURE, you must have an NIH eRA Commons account. For instructions and to register an account, refer to the eRA website.
After you have created an eRA Commons account, you can log in to BioData Catalyst Powered by PIC-SURE by navigating to https://picsure.biodatacatalyst.nhlbi.nih.gov and selecting to log in with eRA Commons. You will be directed to the NIH website to log in with your eRA Commons credentials. After signing in and accepting the terms of the agreement on the NIH RAS Information Sharing Consent page, allow the BioData Catalyst Powered by Gen3 service to manage your authorization.
Upon login, you will be directed to the Data Access Dashboard. This page provides a summary of PIC-SURE Authorized Access, PIC-SURE Open Access, and the studies you are authorized to access.
Search bar: Enter any phenotypic variable, study or table keyword into the search bar to search across studies. Users can also search specific variables by accession number, if known (phs/pht/phv).
Study Tags: Users can filter the results found through their search by limiting to studies of interest or excluding studies.
Variable Tags: Users can filter the results found through their search by limiting to keywords of interest or excluding keywords that are out of scope. For example, a user could filter to categorical variables, variables containing the term ‘blood’, and/or exclude variables containing the term ‘pressure’.
How are variable tags generated? Each variable has a set of associated tags, which are generated during the PIC-SURE data loading process. These tags are generated based on information associated with the variable, including the name of the study, study description, dataset name, PIC-SURE data type (continuous or categorical), and variable description. For a search in PIC-SURE, tags associated with a variable are displayed. Note that tags applicable to less than 5% or more than 95% of the search results are not displayed since these are not useful for filtering results.
Search Results table: View all variables associated with your search term and/or study & variable tags.
Results Panel: Panel with content boxes that describe the cohort based on the variable filters applied to the query.
Data Summary: Displays the total number of participants in the filtered cohort which meet the query criteria. When first opening the Open or Authorized Access page, the number will be the total number of participants that you can access.
Added Variable Filters summary: View all filters which have been applied to the cohort.
Filter Action: Click on the filter icon to filter cohort participants by specific variable values.
Reset button: Allows users to start a new search and query by removing all added filters and clearing all active study and variable tags.
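The tag display rule described above (tags applicable to less than 5% or more than 95% of the search results are not displayed) can be sketched as follows. This is not PIC-SURE's implementation; the function name and the input shape (one set of tags per search result) are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Set


def displayable_tags(
    result_tags: List[Set[str]], low: float = 0.05, high: float = 0.95
) -> Dict[str, int]:
    """Count tags across search results and keep those worth filtering on."""
    total = len(result_tags)
    if total == 0:
        return {}
    counts = Counter(tag for tags in result_tags for tag in tags)
    # Drop tags that cover fewer than `low` or more than `high` of the results.
    return {tag: n for tag, n in counts.items() if low <= n / total <= high}
```

With 100 results where every result carries the tag "FHS", 40 carry "blood", and 2 carry "rare", only "blood" would be kept.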
Once you have refined your queries and created a cohort of interest, you can begin analyzing data using other components of the BioData Catalyst ecosystem.
Databases exposed through the PIC-SURE API vary widely in their underlying architectures and data organization. PIC-SURE hides this complexity and exposes the different databases in the same format, allowing researchers to focus on analysis and medical insights and making reproducible science easier. The API is available in two programming languages, Python and R, allowing investigators to query databases in the same way using either language. The PIC-SURE API tutorial notebooks can be accessed directly on GitHub.
To access the PIC-SURE API, a user-specific token is needed. This is the way the API grants access to individual users to protected-access data. The user token is strictly personal; do not share it with anyone. You can copy your personalized access token by selecting the User Profile tab at the top of the screen.
Here, you can Copy your personalized access token, Reveal your token, and Refresh your token to retrieve a new token and deactivate the old token.
The PIC-SURE API can be accessed via tutorial notebooks on either BioData Catalyst Powered by Seven Bridges or Powered by Terra.
To launch one of the analysis platforms, go to the BioData Catalyst website. From the Resources menu, select Services. A list of platforms and services on the BioData Catalyst ecosystem will be displayed.
From the Analyze Data in Cloud-based Shared Workspaces section, select Launch for your preferred analysis platform.
Jupyter notebook examples in R and python can be found under the Public projects tab by selecting PIC-SURE API.
From the Data Studio tab, select an example that fits your research needs. Here, we will select PIC-SURE JupyterLab examples.
This will take you to the PIC-SURE API analysis workspace, where you can view the examples in python. Copy this workspace to your own project to edit or run the code yourself.
Note: The project must have network access to run the PIC-SURE examples on Seven Bridges. To ensure this, go to the Settings tab and select “Allow network access”.
To access the Jupyter notebook examples in R and python for the PIC-SURE API, select View Workspaces from the Terra landing page.
Select the Public tab and search for “PIC-SURE”. Workspaces for both the python and R examples will be displayed. You must clone the workspaces to edit or run the code within them.
PIC-SURE Open Access allows you to search any clinical variable available in PIC-SURE. Your queries will return obfuscated aggregate counts per study and consent. There are some features specific to PIC-SURE Open Access, which are outlined below.
A. Stigmatizing Variables Removal: PIC-SURE Open Access data excludes clinical variables that contain potentially sensitive information. These variables are known as stigmatizing variables, which fall into the following categories:
Mental health diagnoses, history, and treatment
Illicit drug use history
Sexually transmitted disease diagnoses, history, and treatment
Sexual history
Intellectual achievement, ability, and educational attainment
Direct or surrogate identifiers of legal status
B. Data Obfuscation: Because participant-level data are not available in PIC-SURE Open Access, the aggregate counts are obfuscated to further anonymize the data. This means that:
If the consent group, study, and/or total participant count of the query is between one and nine, the results will be shown as < 10.
If the consent group results are between one and nine and the study and/or total participants of the query is greater than 10, the results will be obfuscated by ± 3.
Query results that are zero participants will display 0.
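The three obfuscation rules above can be sketched as a single display function. This is an illustration under stated assumptions, not the PIC-SURE implementation: the function name, the has_small_consent_group flag, the uniform integer noise, and the "N ± 3" display format are all hypothetical, and a count of exactly 10 is treated like larger counts.

```python
import random
from typing import Optional


def display_count(
    count: int,
    has_small_consent_group: bool = False,
    rng: Optional[random.Random] = None,
) -> str:
    """Return the string shown for an aggregate participant count."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return "0"  # zero is shown exactly
    if count < 10:
        return "< 10"  # one to nine participants are never shown exactly
    if has_small_consent_group:
        # A consent group in the query has 1-9 participants, so the larger
        # study or total count is shifted by a random amount within +/- 3.
        rng = rng or random.Random()
        return f"{count + rng.randint(-3, 3)} ± 3"
    return str(count)
```

Passing a seeded random.Random instance makes the noisy branch reproducible when testing.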
C. View Filtered Results by Study: The filtered number of participants which match the query criteria is shown broken down by study and consent group. Users can see if they do or do not have access to specific studies.
In this section, the functionalities of PIC-SURE Open Access will be described in the context of a scientific use case. Specifically, let’s say I am interested in investigating asthma in relation to obesity in adults.
I’m interested in two cohorts: obese adults with a body mass index (BMI) greater than 30 and healthy adults with a BMI between 18.5 and 24.9. However, I have not yet submitted a Data Access Request and therefore am not authorized to access any datasets.
First, let’s explore cohort A: Healthy adults with a BMI between 18.5 and 24.9 in Framingham Heart Study (FHS).
Search for ‘age’.
Apply ‘FHS’ study tag to view only ‘age’ variables within the Framingham Heart Study (phs000007).
Select the variable of interest. You may notice many variables that appear similar. These variables may be located in different datasets, or tables, but contain similar information. Open up the variable information modal by clicking on the row containing the variable of interest to learn more.
Now, let’s filter to healthy adults with a BMI between 18.5 and 24.9. Similar to before, we will search ‘BMI’. We can narrow down the search results using the variable-level tags by including terms related to our variable of interest (such as ‘continuous’ to view only continuous variables) and excluding out-of-scope terms (such as ‘allergy’). After selecting the variable of interest, we can filter to the desired ranges before adding the filter to our query. Notice how the total number of participants in our cohort changes.
Finally, we will filter for participants who have asthma.
Note the total participant count in the Data Summary.
We can easily modify our filters to explore cohort B: Obese adults with a body mass index (BMI) greater than 30 in Framingham Heart Study.
Note the total participant count in the Data Summary.
We can easily repeat these steps for other studies, such as the Genetic Epidemiology of COPD (COPDGene) study, and create a table like the one below. By comparing these two studies, I can see that COPDGene may be more promising for my research since it contains many more participants in my cohorts of interest than FHS does.
I can then use the Request Access button to go directly to the study’s dbGaP page and begin submitting a DAR.
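The two cohorts in this use case are defined by simple numeric ranges, which can be written down explicitly. The sketch below restates those filters (adults aged 18 or older, BMI between 18.5 and 24.9 for cohort A, BMI greater than 30 for cohort B). The function name and return labels are hypothetical, and the code is independent of PIC-SURE.

```python
from typing import Optional

ADULT_MIN_AGE = 18  # minimum age used for the "adults" filter in this use case


def classify_cohort(age: float, bmi: float) -> Optional[str]:
    """Return "A" (healthy BMI), "B" (obese), or None if in neither cohort."""
    if age < ADULT_MIN_AGE:
        return None
    if 18.5 <= bmi <= 24.9:
        return "A"
    if bmi > 30:
        return "B"
    return None
```

Participants whose BMI falls between the two ranges (for example, 27) belong to neither cohort.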
If you are authorized to access any dbGaP dataset(s), the Authorized Access tab at the top will be visible. PIC-SURE Authorized Access provides access to complete, participant-level data, in addition to aggregate counts, and access to the Tool Suite.
A. Select Variables Action: Click the Select Variables icon to include variables when retrieving data. Users can select variables individually or at the dataset level.
Individually select variables: You can individually select variables from two locations:
Variable search results: From the search results you can click the data retrieval icon to include the variable in your data retrieval.
Variable modal variable data retrieval: The data retrieval icon next to the variable adds the variable to your data retrieval.
Select from a dataset or group of variables: In the variable modal the data retrieval icon next to the dataset opens a modal to allow you to select variables from the dataset table or group of variables.
B. Data Summary: In addition to the total number of participants in the filtered cohort, the number of variables the user has selected for data retrieval is also displayed.
There are four concept paths that are automatically included with any data export from PIC-SURE Authorized Access. These fields are listed and described below.
Patient ID: Internal PIC-SURE participant identifier. Please note that this field is not linking participants between studies and therefore should not be used for data correlation between different data sources or to the original data files.
Parent Study Accession with Subject ID: PIC-SURE generated identifier for parent studies. These identifiers are a combination of the study accession number and the subject identifier.
Topmed Study Accession with Subject ID: PIC-SURE generated identifier for TOPMed studies. These identifiers are a combination of the study accession number and the subject identifier.
Consents: Field used to determine which groups users are authorized to access from dbGaP. These identifiers are a combination of the study accession number and consent code.
C. Tool Suite: The Tool Suite contains tools that can be used to further explore filtered cohorts of interest. Note that at least one filter must be added to the query before using the Tool Suite.
Select and Package Data: Retrieve participant-level data corresponding to your filters and variable selections. Variables selected for data retrieval can be reviewed and modified. To learn more about the options associated with this tool, please refer to the Select and Package Data section.
Variable Distributions: View the distributions of query variables based on the filtered cohort. Note that there is a limit to the number of variable distributions that can be viewed at a given time. Additionally, neither genomic variables nor variables associated with an any-record-of filter (e.g., entire datasets) will be graphed.
The Select and Package Data tool is used to select and export participant-level data corresponding to your filters and variable selections. There are several options for selecting and exporting the data, which are shown using this tool.
In the top left corner of the modal, the number of participants and number of variables included in the query is shown. This is used to display the estimated number of data points in the export.
Note: Queries with more than 1,000,000 data points will not be exportable.
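A quick way to check whether a query is likely to stay under this limit is to estimate the data points before packaging. The sketch below assumes the estimate is the number of participants multiplied by the number of variables; that formula and the function names are assumptions for this example, not a documented PIC-SURE calculation.

```python
MAX_EXPORT_DATA_POINTS = 1_000_000  # limit stated in the note above


def estimate_data_points(participants: int, variables: int) -> int:
    """Estimate export size as participants x variables (assumed formula)."""
    return participants * variables


def is_exportable(participants: int, variables: int) -> bool:
    """True when the estimate does not exceed the export limit."""
    return estimate_data_points(participants, variables) <= MAX_EXPORT_DATA_POINTS
```

For example, 5,000 participants with 200 variables gives exactly 1,000,000 data points and passes the check, while 5,001 participants with 200 variables does not.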
The table below displays a summary of the variables included in the export. Using the Selected column, variables that have been added to the export can be selected or deselected for the final dataframe.
Note: Variables with filters are automatically included in the export.
The Package Data button in the top right corner is used to prepare the data for export once the variable filters and selections have been finalized.
Once this button is clicked, there are several options to complete the export.
To export into a BioData Catalyst analysis workspace, the Export to Seven Bridges or Export to Terra buttons can be used. After clicking either of these buttons, a new modal will be displayed with all information and instructions needed to complete the export. This includes your personalized access token and the query ID associated with the dataframe. Additionally, there is the option to Copy Query ID without accessing Seven Bridges or Terra if you wish to use a different analysis platform.
The Export to Seven Bridges option includes a Go to Seven Bridges button, which will open a new tab to the Public PIC-SURE API Project on BioData Catalyst Powered by Seven Bridges.
The Export to Terra option includes a Go to Terra via R button and a Go to Terra via Python button, which will open the Public PIC-SURE API R Examples workspace and the Public PIC-SURE API Python Examples workspace on BioData Catalyst Powered by Terra, respectively.
In this section, the functionalities of PIC-SURE Authorized Access will be described in the context of a scientific use case. Specifically, let’s say I am interested in investigating some comorbidities of breast cancer in women with BRCA1 and BRCA2 gene variants, such as hypertension and COPD.
I already have been authorized to access the Women’s Health Initiative (WHI) study and am interested in a single cohort: women with breast cancer and variants of the BRCA1 and BRCA2 genes. I want to select hypertension-related variables of interest, check the distributions of some variables, and export all the data to an analysis workspace.
First, let’s apply our variable filters for the WHI study.
Search “breast cancer” in Authorized Access.
Add the WHI study tag to filter search results to only variables found within the WHI study.
Click the “Genomic Filtering” button to begin a filter on genomic variants.
Select “BRCA1” and “BRCA2” genes of “High” and “Moderate” severity. Click “Apply genomic filter”.
Now, let’s filter to participants that have and do not have COPD. Similar to before, we will search ‘COPD’. After selecting the variable of interest, we can filter to the desired values before adding the filter to our query. Notice how the total number of participants in our cohort changes.
Search “hypertension”.
Notice how the number of variables changed in the Data Summary box.
Before we Select and Package the data for export, let’s view the distribution of our participants’ ages to see if we have a normal distribution. Open the Variable Distributions tool in the Tool Suite. Here, we can see the distributions of the two added variable filters: breast cancer (‘BREAST’) and COPD (‘F33COPD’).
Open the Select and Package Data tool in the Tool Suite. The variables shown in this table are those which will be available in your data export; you can remove variables as necessary.
Click “Package Data” when you are ready.
Once the data is packaged, you can select either “Export to Seven Bridges” or “Export to Terra”. Copy over the personalized user token and query ID to use the PIC-SURE API and export your data to an analysis workspace.
An explanation for the Exploration page on BioData Catalyst Powered by Gen3
The Exploration page located in the upper right-hand section of the toolbar allows users to search through data and create cohorts. The Exploration portal contains a dynamic summary statistics display, as well as search facets leveraging the DCC Harmonized Variables.
Users can navigate through data on the Exploration page by selecting any of the three Data Access categories.
Data with Access: A user can view all of the summary data and associated study information for studies the user has access to, including but not limited to Project ID, file types, and clinical variables.
Data without Access:
Projects will also be hidden if the selected cohort contains fewer than 50 subjects (50 ↓, "You may only view summary information for this project", example below); in this case, grayed-out boxes and locks both appear. An additional lock means users have no access.
All Data: Users can view all of the data available in the BioData Catalyst Gen3 platform, including studies with and without access. As a result, studies not available to a user will be locked as demonstrated below.
By default, all users visiting the Exploration page will be assigned to Data with Access.
Project: Any specifically defined piece of work that is undertaken or attempted to meet a single investigative question or requirement.
Subject: The collection of all data related to a specific subject in the context of a specific experiment.
Harmonized Variables: A selection of different clinical properties from multiple nodes, defined by the Consortium.
NOTE: The facet filters are based on the DCC Harmonized Variables, which are a selected subset of clinical data that have been transformed for compatibility across the dbGaP studies. TOPMed studies that do not contain harmonized clinical data at this time will be filtered out when a facet is chosen, unless the "no data" option is also selected for certain facets.
After a cohort has been selected, the user has four different options for exporting the data.
The options for export are as follows:
Export to Workspaces: Export a manifest to the user's workspace and make the case-associated data files available in the workspace under the /pd/data directory.
NOTE: PFB export times can take up to 60 minutes, but often will complete in less than 10 minutes.
The Files tab displays study files from the facets chosen on the left-side panel (Project ID, Data Type, Data Format, Callset, and Bucket Path). Each time a facet selection is made, the data summary and displays will update to reflect the applied filters.
The Files tab also contains files that are either case-independent or project-level. This is important for files that are part of the Unharmonized Clinical Data category under the Data Type field. Unharmonized clinical files are made available in two distinct data formats:
TAR: These files contain a complete directory of phenotypic datasets as XML and TXT files that are direct downloads of unharmonized clinical data from dbGaP on a study consent level project.
XML: These files contain either dictionary or variable reports of the phenotypic datasets that are in the TXT files. These supporting files contain study-level, not subject-level, information.
TXT: These files contain subject-level phenotypic datasets.
NOTE: The unharmonized clinical data sets contain all data from the dbGaP study, but they are not cross-compatible across all studies within BioData Catalyst.
Once the user has selected a cohort, there are five options for accessing the files:
Export to Workspace: The files can be exported to a Gen3 workspace.
GUID Download File Page: Aside from the 5 button options, users can download files by first clicking on the link(s) under the GUIDs column, followed by the Download button in the file information pages (see next section below).
Both the Data and File tabs contain a text-based search function that will initiate a list of suggestions below the search bar while typing.
In the Data tab, Submitter IDs can be searched under the Subject tab.
In the File tab, File Names can be searched under the File tab.
Click one or more suggestions in the list that appears underneath the search bar to create a cohort and export or download the data. Click a selection again to remove it from the created cohort.
Overview of the Query page on BioData Catalyst Powered by Gen3
The Query page can search and return metadata from either the Flat Model or the Graph Model of a commons. Using GraphQL, these searches can be tailored to filter and return fields of interest for the data sets being queried. These queries can be made immediately after data submission as this queries the model directly.
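As a rough illustration of what a tailored GraphQL query looks like, the sketch below assembles a query string and the JSON payload a GraphQL client would typically send. The node name subject and the fields submitter_id and project_id are assumptions for this example and may not match the data model of a given commons; no endpoint is contacted.

```python
import json

# Hypothetical query: return two fields for the first few subject records.
QUERY = """
query ($first: Int) {
  subject(first: $first) {
    submitter_id
    project_id
  }
}
"""


def build_graphql_payload(query: str, variables: dict) -> str:
    """Serialize a GraphQL query and its variables as a JSON request body."""
    compact_query = " ".join(query.split())  # collapse whitespace for transport
    return json.dumps({"query": compact_query, "variables": variables})


payload = build_graphql_payload(QUERY, {"first": 5})
```

The same query text, without the variable wrapper, can be pasted into the Query page editor to try it interactively.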
How to login to the NHLBI BioData Catalyst Gen3 platform and view available genomic and phenotypic data.
In order to navigate and access data available on the Gen3 platform, start by visiting the . You will need an eRA Commons account as well as access permissions through the . If you are a researcher, login by selecting NIH Login and using your . BioData Catalyst consortia developers can login using their Google accounts. Make sure to use the correct login method that contains access to your available projects.
Once logged in, your username will appear in the upper right-hand corner of the page. You will also see a display with aggregate statistics for the total number of subjects, studies, aliquots and files available within the BioData Catalyst platform.
NOTE: These numbers may differ from those displayed in the dbGaP records as they include TOPMed studies as well as the associated parent studies.
The BioData Catalyst Gen3 platform contains five pages described below:
For more information about stigmatizing variables and the identification process, please refer to the documentation and code on the .
Filter to adults only by clicking the filter icon next to the variable. I am interested in adults, so I will set the minimum age to 18, then click “Add filter to query”.
Edit the BMI filter by clicking the edit icon in the Added Variable Filters section. Change the range to have a minimum of 30 and no maximum.
Filter to participants with breast cancer by clicking the filter icon next to the variable of interest. Select values to filter your variable on and click “Add Filter to Query”.
Add variables to data export by clicking the select variables icon in the Actions column next to the variable of interest. The icon next to variables selected for export will change to the checkmark icon.
Locks next to the project ID signify that users do not have subject-level access; they can still search through the available studies, but can only view summary statistics. Users can request access to data by visiting the .
Under the "Data" tab, users can leverage the to create custom cohorts. When facets are selected and/or updated to cover a desired range of values, the display will reflect the information relevant to the new applied filter. If no facets have been selected, all of the data accessible to the user will be displayed. At this time, a user can filter based on three categories of clinical information:
Export All to Terra: Initiate an export of all clinical data and file GUIDs for the selected cohort to Terra. At this time, the maximum number of subjects that can be exported to Terra is 120,000.
Export All to Seven Bridges: Initiate an export of all clinical data and file GUIDs for the selected cohort to Seven Bridges.
Export to PFB: Initiate an export of all clinical data and file GUIDs for the selected cohort to your local storage.
AVRO: These files are the same as the unharmonized clinical data from dbGaP as the TAR files, but in form of a file.
Download Manifest: Download the file manifest and use this manifest to download the listed data files using the .
Export All PFB: Initiate an export of the selected files.
Export All to Terra: Initiate an export of the selected files to Terra.
Export All to Seven Bridges: Initiate an export of the selected files to Seven Bridges.
A user can visit the File Information Page after clicking on any of the available GUID link(s) in the Files tab page. The page will display details such as data format, size, object_id, the last time it was updated and the md5sum. The page also contains a button to download the file via the browser (see below). For files that are 5GB or more, we suggest using the .
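Because the File Information Page lists the md5sum of each file, a downloaded copy can be checked against it. The sketch below computes an MD5 digest in chunks, so large files do not need to fit in memory, and compares it to the expected value. The function names are hypothetical, and the expected value is whatever the page shows for your file.

```python
import hashlib


def md5sum(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_md5: str) -> bool:
    """Compare a file's digest to the md5sum shown on the File Information Page."""
    return md5sum(path) == expected_md5.strip().lower()
```

A mismatch usually means the download was interrupted or corrupted and should be repeated.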
For more information about how to use the Query page, refer to the .
A number of clinical variables have been harmonized by the DCC in order to facilitate cross-study analysis. Faceted search over the DCC Harmonized Variables is available via the Exploration page, under the "Data" tab.
Unharmonized clinical files are also available on the Gen3 platform and contain all of the raw phenotypic information for the hosted studies. Unlike the DCC Harmonized Variables, these files are located and searchable under the "Files" tab in the Exploration page.
The Gen3 platform hosts genomic data provided by the (TOPMed) program and the plus synthetic tutorial data from Terra. At present, these projects include CRAM and VCF files together with their respective index files. Specifically for TOPMed projects, each project will contain at least one multi-sample VCF that comprises all subjects within the consent group. CRAM and VCF are based on an individual level, whereas multi-sample VCFs are based on the study consent level.
All files are available under the "Files" tab in the page. More detailed information on currently hosted data on the Gen3 platform can be found .
Dictionary: An interactive data dictionary display that details the contents and relationships between clinical and biospecimen data
Exploration: The facet filter custom cohort creation tool
Query: The GraphQL query tool to retrieve specific data within the graph model
Workspace: The launch page for Gen3 workspaces that includes Jupyter Notebooks and RStudio
Profile: The information page for each user, displaying access and the location for credential file downloads
cac_volume_1
Coronary artery calcium volume using CT scan(s) of coronary arteries
decimal
cubic millimeters
UMLS
cac_score_1
Coronary artery calcification (CAC) score using Agatston scoring of CT scan(s) of coronary arteries
decimal
UMLS
cimt_1
Common carotid intima-media thickness, calculated as the mean of two values: mean of multiple thickness estimates from the left far wall and from the right far wall.
decimal
mm
UMLS
cimt_2
Common carotid intima-media thickness, calculated as the mean of four values: maximum of multiple thickness estimates from the left far wall, left near wall, right far wall, and right near wall.
decimal
mm
UMLS
carotid_stenosis_1
Extent of narrowing of the carotid artery.
encoded
UMLS
0=None||1=1%-24%||2=25%-49%||3=50%-74%||4=75%-99%||5=100%
carotid_plaque_1
Presence or absence of carotid plaque.
encoded
UMLS
0=Plaque not present||1=Plaque present
height_baseline_1
Body height at baseline.
decimal
cm
UMLS
current_smoker_baseline_1
Indicates whether subject currently smokes cigarettes.
encoded
UMLS
0=Does not currently smoke cigarettes||1=Currently smokes cigarettes
weight_baseline_1
Body weight at baseline.
decimal
kg
UMLS
ever_smoker_baseline_1
Indicates whether subject ever regularly smoked cigarettes.
encoded
UMLS
0=Never a cigarette smoker||1=Current or former cigarette smoker
bmi_baseline_1
Body mass index calculated at baseline.
decimal
kg/m^2
UMLS
hemoglobin_mcnc_bld_1
Measurement of mass per volume, or mass concentration (mcnc), of hemoglobin in the blood (bld).
decimal
g / dL = grams per deciliter
UMLS
hematocrit_vfr_bld_1
Measurement of hematocrit, the fraction of volume (vfr) of blood (bld) that is composed of red blood cells.
decimal
% = percentage
UMLS
rbc_ncnc_bld_1
Count by volume, or number concentration (ncnc), of red blood cells in the blood (bld).
decimal
millions / microliter
UMLS
wbc_ncnc_bld_1
Count by volume, or number concentration (ncnc), of white blood cells in the blood (bld).
decimal
thousands / microliter
UMLS
basophil_ncnc_bld_1
Count by volume, or number concentration (ncnc), of basophils in the blood (bld).
decimal
thousands / microliter
UMLS
eosinophil_ncnc_bld_1
Count by volume, or number concentration (ncnc), of eosinophils in the blood (bld).
decimal
thousands / microliter
UMLS
neutrophil_ncnc_bld_1
Count by volume, or number concentration (ncnc), of neutrophils in the blood (bld).
decimal
thousands / microliter
UMLS
lymphocyte_ncnc_bld_1
Count by volume, or number concentration (ncnc), of lymphocytes in the blood (bld).
decimal
thousands / microliter
UMLS
monocyte_ncnc_bld_1
Count by volume, or number concentration (ncnc), of monocytes in the blood (bld).
decimal
thousands / microliter
UMLS
platelet_ncnc_bld_1
Count by volume, or number concentration (ncnc), of platelets in the blood (bld).
integer
thousands / microliter
UMLS
mch_entmass_rbc_1
Measurement of the average mass (entmass) of hemoglobin per red blood cell (rbc), known as mean corpuscular hemoglobin (MCH).
decimal
pg = picogram
UMLS
mchc_mcnc_rbc_1
Measurement of the mass concentration (mcnc) of hemoglobin in a given volume of packed red blood cells (rbc), known as mean corpuscular hemoglobin concentration (MCHC).
decimal
g / dL = grams per deciliter
UMLS
mcv_entvol_rbc_1
Measurement of the average volume (entvol) of red blood cells (rbc), known as mean corpuscular volume (MCV).
decimal
fL = femtoliter
UMLS
pmv_entvol_bld_1
Measurement of the mean volume (entvol) of platelets in the blood (bld), known as mean platelet volume (MPV or PMV).
decimal
fL = femtoliter
UMLS
rdw_ratio_rbc_1
Measurement of the ratio of variation in width to the mean width of the red blood cell (rbc) volume distribution curve taken at +/- 1 CV, known as red cell distribution width (RDW).
decimal
% = percentage
UMLS
bp_systolic_1
Resting systolic blood pressure from the upper arm in a clinical setting.
decimal
mmHg
UMLS
bp_diastolic_1
Resting diastolic blood pressure from the upper arm in a clinical setting.
decimal
mmHg
UMLS
antihypertensive_meds_1
Indicator for use of antihypertensive medication at the time of blood pressure measurement.
encoded
UMLS
0=Not taking antihypertensive medication||1=Taking antihypertensive medication
race_1
Harmonized race category of participant.
encoded
UMLS
AI_AN=American Indian_Alaskan Native or Native American||Asian=Asian||Black=Black or African American||HI_PI=Native Hawaiian or other Pacific Islander||Multiple=More than one race||Other=Other race||White=White or Caucasian
ethnicity_1
Indicator of Hispanic or Latino ethnicity.
encoded
UMLS
both=ethnicity component dbGaP variable values for a subject were inconsistent/contradictory (e.g. over multiple visits)||HL=Hispanic or Latino||notHL=not Hispanic or Latino
hispanic_subgroup_1
Classification of Hispanic/Latino background for Hispanic/Latino subjects where country or region of origin information is available.
encoded
UMLS
CentralAmerican=Central American||CostaRican=from Costa Rica||Cuban=Cuban||Dominican=Dominican||Mexican=Mexican||PuertoRican=Puerto Rican||SouthAmerican=South American
annotated_sex_1
Subject sex, as recorded by the study.
encoded
UMLS
female=Female||male=Male
geographic_site_1
Recruitment/field center, baseline clinic, or geographic region.
encoded
UMLS
subcohort_1
A distinct subgroup within a study, generally indicating subjects who share similar characteristics due to study design. Subjects may belong to only one subcohort.
encoded
UMLS
lipid_lowering_medication_1
Indicates whether participant was taking any lipid-lowering medication at blood draw to measure lipids phenotypes.
encoded
UMLS
0=Participant was not taking lipid-lowering medication||1=Participant was taking lipid-lowering medication.
fasting_lipids_1
Indicates whether participant fasted for at least eight hours prior to blood draw to measure lipids phenotypes.
encoded
UMLS
0=Participant did not fast_or fasted for fewer than eight hours prior to measurement of lipids phenotypes.||1=Participant fasted for at least eight hours prior to measurement of lipids phenotypes.
total_cholesterol_1
Blood mass concentration of total cholesterol.
decimal
mg/dL
UMLS
triglycerides_1
Blood mass concentration of triglycerides.
decimal
mg/dL
UMLS
hdl_1
Blood mass concentration of high-density lipoprotein cholesterol.
decimal
mg/dL
UMLS
ldl_1
Blood mass concentration of low-density lipoprotein cholesterol.
decimal
mg/dL
UMLS
vte_prior_history_1
An indicator of whether a subject had a venous thromboembolism (VTE) event prior to the start of the medical review process (including self-reported events).
encoded
UMLS
0=did not have prior VTE event||1=had prior VTE event
vte_case_status_1
An indicator of whether a subject experienced a venous thromboembolism event (VTE) that was verified by adjudication or by medical professionals.
encoded
UMLS
0=Not known to ever have a VTE event_either self-reported or from medical records||1=Experienced a VTE event as verified by adjudication or by medical professionals
age_at_*
For each phenotypic value for a given subject, an associated age at measurement is provided.
decimal
years
See TOPMed Harmonization Strategies for more information.
unit_*
For each harmonized variable, a paired “unit_variable” is provided, whose value indicates where in the documentation to look to find the set of component variables and the algorithm used to harmonize those variables.
encoded
See TOPMed Harmonization Strategies for more information.
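The encoded variables listed above describe their permitted values as `code=label` pairs separated by `||` (for example, `0=Plaque not present||1=Plaque present`). The following is a minimal Python sketch of how such a string could be turned into a lookup table. The function name `parse_encoded_values` is hypothetical and is not part of any BioData Catalyst or TOPMed tool; the sketch assumes only the `code=label||code=label` layout shown in this table.

```python
def parse_encoded_values(encoded: str) -> dict[str, str]:
    """Split a 'code=label||code=label' string into a {code: label} mapping.

    Hypothetical helper for illustration; assumes the '||' and '=' layout
    used in the encoded-value column above.
    """
    mapping = {}
    for pair in encoded.split("||"):
        # Split on the first '=' only, in case a label itself contains '='.
        code, sep, label = pair.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in encoded pair: {pair!r}")
        mapping[code.strip()] = label.strip()
    return mapping


smoker_codes = parse_encoded_values(
    "0=Never a cigarette smoker||1=Current or former cigarette smoker"
)
print(smoker_codes["1"])  # Current or former cigarette smoker
```

A mapping like this can be used to replace raw codes with readable labels when exploring a harmonized phenotype file.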
Framingham Heart Study (FHS)
50 +/- 3
72 +/- 3
Genetic Epidemiology of COPD (COPDGene)
488 +/- 3
868
Overview of the Portable Format for Bioinformatics (PFB) file type
A Portable Format for Bioinformatics (PFB) allows users to transfer both the metadata from the Data Dictionary and the Data Dictionary itself. As a result, data can be transferred while keeping the structure from the original source. Specifically, a PFB consists of three parts:
A schema
Metadata
Data
For more information and an in-depth review that includes Python tools for PFB creation and exploration, refer to the PyPFB GitHub page and install the newest version.
Note
The following PFB example is a direct PFB export from the tutorial-synthetic_data_set_1 found on BioData Catalyst Powered by Gen3. Due to the large amount of data stored within PFB files, only small sections are shown, with breaks (displayed as ...) occurring in the output.
A schema is a JSON-formatted Data Dictionary containing information about the properties, such as value types, descriptions, and so on.
To view the PFB schema, use the following command:
Example Output
NOTE: To make the outputs more human-readable, the above information was then piped through the program jq. Example:
pfb show -i PFB_file.avro schema | jq
The metadata in a PFB contains all of the information explaining the linkage between nodes and external references for each of the properties.
To view the PFB metadata, use the following command:
Example Output
The data in the PFB are the values for the properties in the format of the Data Dictionary.
To view the data within the PFB, use the following command:
To view a certain number of entries in the PFB file, use the flag -n to designate a number. For example, to view the first 10 data entries within the PFB, use the following command:
Example Output
BioData Catalyst Powered by Seven Bridges offers researchers collaborative workspaces for analyzing genomics data at scale. Researchers can find and analyze the hosted TOPMed studies by using hundreds of optimized analysis tools and workflows (pipelines), by creating their own workflows, or through interactive analysis. The platform provides access to hosted datasets along with Common Workflow Language (CWL) and GENESIS R package pipelines for analysis. It also enables users to bring their own data for analysis and to work in RStudio and JupyterLab Notebooks for interactive analysis.
Private, secure, workspaces (projects) for running analyses at scale
Collaboration features with the ability to set granular permissions on project members
Direct access to BioData Catalyst without needing to set up a Google or AWS billing account
Access hosted TOPMed studies all in one place and analyze data on the cloud at scale
Tools and features for performing multiple-variant and single-variant association studies including:
Annotation Explorer for variant aggregations
Cloud-optimized GENESIS R package workflows in Common Workflow Language
Cohort creation by searching phenotype data
Use PIC-SURE API for searching phenotype data
Search by known dbGaP identifiers
RStudio and JupyterLab Notebooks built directly into the platform for easy interactive analysis and manipulation of phenotype data
Hosted TOPMed data you can combine with your own data on AWS or Google Cloud
Billing and administrative controls to help your research funding go further: avoid forgotten instances, abort infinite loops, get usage breakdowns by project.
Instructions on transferring files between NHLBI BioData Catalyst Powered by Seven Bridges and NHLBI BioData Catalyst Powered by Terra
This tutorial guides users through the process of transferring files between the two workspace environments of NHLBI BioData Catalyst: NHLBI BioData Catalyst Powered by Seven Bridges and NHLBI BioData Catalyst Powered by Terra.
Most researchers select one of the workspaces as their primary analysis environment, and their labmates and collaborators typically work with them on the same workspace environment. However, there are cases where some collaborators work on Seven Bridges and others work on Terra. In this case, researchers need to share data files between the two workspaces to facilitate collaboration. When researchers run analyses on Seven Bridges, the results, or derived data, are only available on Seven Bridges. Likewise, when researchers run analyses on Terra, the results are only available on Terra. This tutorial provides step-by-step guidance on how to share derived data between the workspace environments. These instructions can also be used to share private data that has been uploaded to Seven Bridges or Terra.
Both open access data and controlled access data can be shared across workspace environments. Importantly, if a researcher intends to share controlled access data, they must ensure that all recipients have the necessary dbGaP permissions for those files. In some cases, this may mean the researchers must be listed as collaborators on their respective dbGaP applications. These instructions are intended for sharing files under 1 terabyte (TB) in size. If you want to share data larger than 1 TB, contact the BioData Catalyst Help Desk to discuss your use case.
Transferring large amounts of data between cloud providers or regions is not recommended; for example, transferring data from AWS to Google costs approximately $100/TB.
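To get a feel for what this means for a planned transfer, the arithmetic can be sketched in a few lines of Python. The default rate below is the approximate $100/TB figure quoted above, not a current price from any cloud provider, and the function name `estimate_transfer_cost` is hypothetical. Actual charges depend on the provider, region, and pricing at the time of transfer.

```python
def estimate_transfer_cost(size_tb: float, usd_per_tb: float = 100.0) -> float:
    """Rough cross-cloud transfer cost estimate in US dollars.

    The default rate is the approximate figure quoted in this guide
    (an assumption, not a published price list).
    """
    if size_tb < 0:
        raise ValueError("size_tb must be non-negative")
    return size_tb * usd_per_tb


# A 0.25 TB (250 GB) set of result files at the approximate rate:
print(estimate_transfer_cost(0.25))  # 25.0
```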
The first consideration is platform accounts. Moving data between Seven Bridges and Terra is currently a manual process and requires that one of the researchers involved in sharing has an account on both platforms. It is recommended that the recipient of the shared data be the person who has accounts on both Seven Bridges and Terra.
Let’s consider an example case: Sebastian, who is working on Seven Bridges, and Teresa, who is working on Terra. If Sebastian wants to share data with Teresa so that she can use the data on Terra, Teresa first needs to set up an account on Seven Bridges. Now Teresa has an account on Terra and an account on Seven Bridges. Sebastian will share the data with Teresa on Seven Bridges by adding her, with Copy permissions, as a member of the project that contains the data he wants to share. For information on permissions, refer to the Seven Bridges Set permissions documentation. Once Teresa is added as a member of the project, she can move the data from the Seven Bridges project to a workspace on the Terra platform, following the instructions in the section titled Moving Data From Seven Bridges to Terra.
If Teresa (Terra) wants to share data with Sebastian (Seven Bridges) so that he can use the data on Seven Bridges, Sebastian first needs to create an account on Terra. Now Sebastian has an account on Seven Bridges and an account on Terra. Teresa can share the data with Sebastian on Terra by sharing the workspace with the data she wants to share with Sebastian. For information on sharing workspaces, refer to the Terra How to share a workspace documentation.
To create a Terra account, refer to the Terra documentation.
To create a Seven Bridges account, refer to the Seven Bridges documentation. If you are new to Seven Bridges, you may find this Getting Started Guide helpful.
The second consideration is making sure the researcher moving data between the two workspaces has billing groups set up on both workspaces to cover cloud costs if necessary. Contact the BioData Catalyst Help Desk if you have questions about how to get a billing group on Seven Bridges or Terra.
The following steps describe how to use the Seven Bridges platform to pull data securely from a Terra workspace into a Seven Bridges project.
Refer to the Terra documentation for Moving data to/from a Google bucket (workspace or external), specifically the section Upload and download data files in a terminal using gsutil. This method:
Works well for all size transfers.
Is ideal for large files or thousands of files.
Can be used for transfers between local storage and a bucket, workspace VM or persistent disk and a Google bucket, as well as between Google buckets (external and workspace).
You will use the terminal in JupyterLab on the Seven Bridges workspace environment. The reason for this is that although Seven Bridges can run on the Google Cloud Platform, the Google bucket API is not exposed in the same manner as it is on Terra. Therefore, you will start a JupyterLab notebook on Seven Bridges, using the project you would like to be the destination for the copied data. Refer to the Seven Bridges documentation for launching JupyterLab notebooks on Seven Bridges and accessing the terminal in a JupyterLab environment.
After launching the notebook, the next step is to open the terminal and install the program gsutil, which is a Python program that lets end users add data to or copy data from a Google Cloud bucket. After opening the terminal, run the following commands:
Installing gsutil takes only a few seconds.
The config command provides a secure URL for you to navigate to in the browser. You will authenticate with the same credentials that were used to log in to Terra. The shortcut to access the printed URL in the JupyterLab terminal is to hold Shift and right-click, which will display options to copy the URL. Copy the URL and navigate to it in a new browser tab, which will direct you to Google authentication:
Google will provide an authentication code that you will copy and paste into the terminal.
Next, you will type in the Google project ID. This is found on the right side of the Terra Workspace Dashboard.
Next, run the command below to display the different Google buckets that are attached to the project ID.
The Google bucket name for the Terra project can be found in the lower right corner of the Terra Workspace.
Running gsutil ls on the Google bucket name will display the folders and files from the Terra workspace.
To copy a folder to the Seven Bridges workspace environment, run the following command:
There are a couple of important things to mention about the gsutil cp command. First, the -R flag for gsutil cp is used to recursively copy a folder and all of its subfolders and files. Most users will likely want to use the -R flag. This flag should be omitted if copying individual files or if using a wildcard such as “*.vcf”.
Additionally, /sbgenomics/output-files should be the destination folder when bringing in data from Terra, as this will ensure the files or folders get populated back to the Seven Bridges project. Refer to the Save analysis outputs documentation for information about working with files in Data Cruncher environments. After the JupyterLab instance is shut down, your files will automatically be populated in your project-files tab on Seven Bridges.
In this section we will discuss pushing data from a Seven Bridges project to a Terra workspace.
The process of moving data from Seven Bridges to Terra uses the same setup as the previous section, with one modification to the gsutil copy command: the arguments are reversed.
You will still use the -R flag, but the destination is a Terra bucket. The Terra workspace’s Google bucket name/ID can be found on the Terra workspace Dashboard tab. You can verify that the folder has been copied by navigating to the Files section of the Data tab in your Terra workspace.
Clicking on the folder, you will see that all three files have been copied.
For researchers interested in performing genotype-phenotype association studies, Seven Bridges offers a suite of tools for both single-variant and multiple-variant association testing on NHLBI BioData Catalyst Powered by Seven Bridges. These tools and features include the GENetic EStimation and Inference in Structured samples (GENESIS) pipelines, which were developed by the Trans-Omics for Precision Medicine (TOPMed) Data Coordinating Center (DCC) at the University of Washington. The Seven Bridges team collaborated with the TOPMed DCC to create Common Workflow Language (CWL) tools for the GENESIS R functions, and arranged these tools into five computationally-efficient workflows (pipelines).
These GENESIS pipelines offer methods for working with genotypic data obtained from sequencing and microarray analysis. Importantly, these pipelines have the robust ability to estimate and account for population and pedigree structure, which makes them ideal for performing association studies on data from the TOPMed program. These pipelines also implement linear mixed models for association testing of quantitative phenotypes, as well as logistic mixed models for association testing of binary (e.g. case/control) phenotypes.
Below, we feature our GENESIS Benchmarking Guide to assist users in estimating cloud costs when running GENESIS workflows on NHLBI BioData Catalyst Powered by Seven Bridges.
The objective of the GENESIS Benchmarking Guide is to instruct users on the drivers of cloud costs when running GENESIS workflows on NHLBI BioData Catalyst Powered by Seven Bridges.
For all GENESIS workflows, the Seven Bridges team has performed comprehensive benchmarking analysis on Amazon Web Services (AWS) and Google Cloud Platform (GCP) instances for different scenarios:
2.5k samples (1000G data)
10k samples (TOPMed Freeze5 data)
36k samples (TOPMed Freeze5 data)
50k samples (TOPMed Freeze5 data)
The resulting execution times, costs, and general advice for running GENESIS workflows can be found in the sections below. In these sections, each GENESIS workflow is described, followed by the benchmarking results and some tips for implementing that workflow from the Seven Bridges Team. Lastly, we included a Methods section to describe our approach to benchmarking and interpretation for your reference.
The contents of this guide are arranged as follows:
Introduction
Helpful Terms to Know
GENESIS VCF to GDS
GENESIS Null model
GENESIS Single Association testing
GENESIS Aggregate Association testing
GENESIS Sliding window Association testing
General considerations
Below is a link to download the results of our Benchmarking Analysis described herein. It may prove useful to have this file open for reference when reading through this guide.
Before continuing on to the benchmarking results, please familiarize yourself with the following helpful terms to know:
Tool: Refers to a stand-alone bioinformatics tool or its Common Workflow Language (CWL) wrapper that is created or already available on the platform.
Workflow/Pipeline (interchangeably used): Denotes a number of tools connected together in order to perform multiple analysis steps in one run.
App: Stands for a CWL wrapper of a tool or a workflow that is created or already available on the platform.
Task: Represents an execution of a particular tool or workflow on the platform. Depending on what is being executed (tool or workflow), a single task can consist of only one tool execution (tool case) or multiple executions (one or more per each tool in the workflow).
Figure 1. The jobs for an example run of RNA-Seq Quantification (HISAT2, StringTie) public workflow
The green bars under the gray ones (apps) represent the jobs (Figure 1). As you can see, some apps (e.g. HISAT2_Build) consist of only one job, whereas others (e.g. HISAT2) contain multiple jobs that are executed simultaneously.
In this section, we detail the process of converting a VCF to a GDS via a GENESIS workflow. This VCF to GDS workflow consists of 3 steps:
Vcf2gds
Unique variant id
Check GDS
The first two steps are required, while the last one is optional. The Check GDS step, when included, is the biggest cost driver in these tasks.
The Check GDS tool is a QC step that checks whether the final GDS file contains all variants that the input VCF/BCF has. This step is computationally intensive, and its execution time can be 4-5 times longer than the rest of the workflow. A failure of this step is something that we experience very rarely. In our results, a Check GDS column indicates whether the Check GDS step was performed.
We advise anyone using this workflow to consult the results in the table below, because the differences in execution time and price with and without this check are considerable. The final decision depends on the available resources (budget and time) and on the preference for including or excluding the optional QC step.
In addition, CPU/job and Memory/job parameters have direct effects on execution time and the cost of the GENESIS VCF to GDS workflow. A combination of these parameters defines the number of jobs (files) that will be processed in parallel.
For example:
If a task is run on c5.9xlarge (36 CPUs and 72 GB RAM) with 1 CPU/job and 4 GB/job, the number of jobs run in parallel will be min{36/1, 72/4} = 18. If a task is run on c5.9xlarge (36 CPUs and 72 GB RAM) with 1 CPU/job and 8 GB/job, the number of jobs run in parallel will be min{36/1, 72/8} = 9. In this example, the second case would take twice as long as the first, because only half as many jobs run at once.
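The calculation in this example can be reproduced with a short Python sketch. The function name `parallel_jobs` is hypothetical, and rounding down to whole jobs is an assumption; the platform's scheduler may also reserve some CPU or memory for itself, which this sketch ignores.

```python
def parallel_jobs(instance_cpus: int, instance_mem_gb: int,
                  cpu_per_job: int, mem_gb_per_job: int) -> int:
    """Number of jobs that fit on one instance at the same time.

    Implements min{CPUs / CPU per job, memory / memory per job},
    rounded down to whole jobs (an assumption of this sketch).
    """
    if cpu_per_job <= 0 or mem_gb_per_job <= 0:
        raise ValueError("per-job resources must be positive")
    jobs_by_cpu = instance_cpus // cpu_per_job
    jobs_by_memory = instance_mem_gb // mem_gb_per_job
    return min(jobs_by_cpu, jobs_by_memory)


# c5.9xlarge: 36 CPUs and 72 GB RAM, as in the example above.
print(parallel_jobs(36, 72, cpu_per_job=1, mem_gb_per_job=4))  # 18
print(parallel_jobs(36, 72, cpu_per_job=1, mem_gb_per_job=8))  # 9
```

In both cases memory, not CPU, is the limiting resource, which is why doubling the memory per job halves the number of parallel jobs.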
The following conclusions were drawn from the benchmarking:
Benchmarking showed that the most suitable AWS instances for this workflow are c5 instances.
For all tasks that we ran (from 2.5k up to 50k samples), 1 CPU and 4 GB per job were sufficient.
For small sample sizes (up to 10k samples), tasks can be run on spot/preemptible instances to further decrease the cost.
For up to 10k samples, 2 GB per job could suffice. However, if the Check GDS step is also run, execution time and price will not be much lower, because the CPU and memory per job inputs apply only to the vcf2gds step and not to the whole workflow.
We recommend using VCF.GZ as input files rather than BCF, as the conversion process cannot be parallelized when using BCFs.
If you have multiple files to convert (e.g. multiple chromosomes), we recommend running one analysis with all files as an input, rather than batch analysis with separate tasks for each file.
The GENESIS Null model workflow is not computationally intensive and it is relatively low-cost compared to other GENESIS workflows. For that reason, we present results that we obtained without any optimization below:
The null model can be fit with relatedness matrices (i.e. mixed models) or without relatedness matrices (i.e. simple regression models). If a relatedness matrix is provided, it can be sparse or dense. The tasks with dense relatedness matrix are the most expensive and take the longest to run. For the Null model workflow, available AWS instances appear to be more suitable than Google instances available on the platform.
Results of the GENESIS Single Association Testing workflow benchmarking can be seen in the table above. Some important notes to consider when using this workflow:
Null model type effect: The main cost and duration driver in these tasks is the null model type. The null model can be fit without a relatedness matrix (i.e. simple regression models), or with a relatedness matrix that can be sparse or dense (i.e. mixed models). The table above shows that task cost and execution time increase with the relatedness matrix used in the null model (dense > sparse > none). Differences between the model with a dense matrix and the model with a sparse matrix are significant, driven by both the increased CPU time and the memory required to use a dense matrix. Our advice is to check the relatedness matrix before fitting the model, and transform it to sparse if it is dense, especially if the sample size is large. The null model type directly influences further parameter setup, such as instance type selection, CPU per job, and memory per job.
Instance type: Benchmarking showed that the most appropriate instance type is an AWS r5.x instance. Instance type selection is especially important when performing analysis with many samples (30k participants and above). For tasks with up to 30k samples, r5.4xlarge instances can be used; with more participants, r5.12xlarge instances can be used. In addition, it is important to note that if a single association test is performed with a dense null model, then r5.12xlarge or r5.24xlarge instances should be picked. Results for Google instances can be seen in the above table as well. Since there often isn’t a Google instance that is the exact equivalent of the AWS instance, we recommend choosing the most appropriate Google instance (matching the chosen AWS instance) from the list of available Google instances on BioData Catalyst.
CPU and memory per job: The CPU and memory per job input parameters determine the number of jobs to be run in parallel on one instance. For example:
If a task is run on c5.9xlarge (36 CPUs and 72 GB RAM) with 1 CPU/job and 4 GB/job, the number of jobs run in parallel will be min{36/1, 72/4} = 18.
If a task is run on c5.9xlarge (36 CPUs and 72 GB RAM) with 1 CPU/job and 8 GB/job, the number of jobs run in parallel will be min{36/1, 72/8} = 9.
The bottleneck in single variant association testing is memory, so we suggest carefully considering this parameter and the instance type. Workflow defaults are 1 CPU/job and 8 GB/job. The table above shows that these tasks require much more memory than CPUs; therefore, r5.x instances are most appropriate in these cases. The table additionally shows that the task where the null model is fit with the dense relatedness matrix requires the most memory per job. This parameter also depends on the number of participants included in the analysis.
Maximum number of parallel instances: The default number of parallel instances is 8. The impact of changing this number is mainly reflected in execution time: tasks with more parallel instances will finish faster. This parameter can be set in Execution settings when drafting a task. However, each user has a limited total number of parallel instances, and reserving a large number of parallel instances for one task decreases the number of different tasks that can run at the same time.
Spot instances: If a task is expected to finish within a few hours, it can be run on spot instances in order to reduce the execution cost. However, losing a spot instance leads to rerunning the task on on-demand instances, which can lead to a higher cost than running the task on on-demand instances from the beginning. That is why spot instances are generally only suitable for short tasks.
GENESIS Aggregate association testing can be performed using burden, SKAT, SMMAT, fastSKAT and SKAT-O tests. Our general conclusions are as follows:
Null model selection: As in single variant association testing, the main cost and duration driver in these tasks is the null model type. The null model can be fit without a relatedness matrix (i.e. simple regression models), or with a relatedness matrix that can be sparse or dense (i.e. mixed models). The table above shows that task cost and execution time increase with the relatedness matrix used in the null model (dense > sparse > none). Differences between the model with a dense matrix and the model with a sparse matrix are significant, driven by both the increased CPU time and the memory required to use a dense matrix. Our advice is to check the relatedness matrix before fitting the model, and transform it to sparse if it is dense, especially if the sample size is large. The null model type directly influences further parameter setup, such as instance type selection, CPU per job, and memory per job.
Instance type: Benchmarking showed that the most appropriate instance type is an AWS r5.x instance. The majority of the tasks can be run on r5.12xlarge instances or on r5.24xlarge instances when the null model is with the dense relatedness matrix. Results for Google instances can be seen in the above table as well. Since the Google instance options often do not have an exact AWS equivalent, we selected the closest match from the list of available Google instances on BioData Catalyst.
CPU and memory per job: CPUs and memory per job input parameters determine the number of jobs to be run in parallel on one instance. For example:
If a task is run on c5.9xlarge (36 CPUs and 72 GB RAM) with 1 CPU/job and 4 GB/job, the number of jobs run in parallel will be min{36/1, 72/4} = 18.
If a task is run on c5.9xlarge (36 CPUs and 72 GB RAM) with 1 CPU/job and 8 GB/job, the number of jobs run in parallel will be min{36/1, 72/8} = 9.
Different tests can require different computational resources.
For small sample sizes, up to 10 GB per job can be sufficient to complete tasks successfully. One exception is a task whose null model is fit with the dense relatedness matrix, in which case approximately 36 GB/job is needed. When there are 50k samples, jobs require 70 GB. Details can be seen in the table above. In addition to sample size, the memory required is determined by the number of variants included in each aggregation unit, as all variants in an aggregation unit are analyzed together.
SKAT and SMMAT tests are similar when it comes to CPU and Memory per job requirements. Roughly, these tests require 8GB/CPU, and details for different task configurations can be seen in the table below:
Maximum number of parallel instances: The default number of parallel instances is 8. The impact of changing this number is mainly reflected in execution time: tasks with more parallel instances will finish faster. This parameter can be set in Execution settings when drafting a task. However, each user has a limited total number of parallel instances, and reserving a large number of parallel instances for one task decreases the number of different tasks that can run at the same time.
Spot instances: If a task is expected to finish within a few hours, it can be run on spot instances in order to reduce the execution cost. However, losing a spot instance leads to rerunning the task on on-demand instances, which can lead to a higher cost than running the task on on-demand instances from the beginning. That is why spot instances are generally suitable for short tasks.
GENESIS Sliding window association testing can be performed using burden, SKAT, SMMAT, fastSKAT and SKAT-O tests. When running sliding window test is good to know:
Null model selection: The same as in the previous tests the main cost and duration driver in these tasks is the null model type. The null model can be fit without a relatedness matrix (i.e. simple regression models), or with a relatedness matrix that can be sparse or dense (i.e. mixed models). The table below shows that task cost and execution time increases as the null model is weighted (dense > sparse > none). Differences between the model with dense matrix and the model with sparse matrix are significant, which is driven by both increased CPU time and memory required to use a dense matrix. Our advice is to check the relatedness matrix before fitting the model, and transform it to sparse if it is dense, especially if sample size is large. The null model type has direct influences on further parameters setup such as: instance type selection, CPU per job, Memory per job, etc.
Instance type: Benchmarking showed that for analysis with or without sparse relatedness matrix tasks can be completed on a c5.9xlarge AWS instance. For analysis with dense relatedness matrix included in the null model and with 50k samples or more, r5.12xlarge instances can be used. Also, it is important to note that in this case increasing the instance (for example from c5.9xlarge to c5.18xlarge) will not lead to shorter execution time. Furthermore, it can be completely opposite. By increasing the size of the instance we also increase the number of jobs running in parallel. At one point there will be a lot of jobs running in parallel and accessing the same memory space which can reduce the performance and increase task duration. Results for Google instances can be seen in respective tables. Since the Google instance options often do not have an exact AWS equivalent, we selected the closest match from the list of available Google instances on BioData Catalyst.
CPU and memory per job: When running a sliding window test, it is important to ensure that the CPU resources of the instances we are using are not overused. Avoiding 100% CPU usage in these tasks is crucial for fast execution. For that reason, it is good to decrease the number of jobs running in parallel on one instance. The number of parallel jobs is highlighted in the summary table, as it is an important parameter for the execution of this task. We can choose different CPU and memory inputs as long as the combination gives us an appropriate number of parallel jobs. The following examples show how the number of parallel jobs is calculated:
If we run our task on c5.9xlarge (36 CPUs and 72GB RAM) with 1 CPU/job and 4GB/job, the number of jobs run in parallel will be min{36/1, 72/4} = 18.
If we run our task on c5.9xlarge (36 CPUs and 72GB RAM) with 1 CPU/job and 8GB/job, the number of jobs run in parallel will be min{36/1, 72/8} = 9.
For details on the number of jobs that we’ve set for each tested case please refer to the table below.
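The calculation above can be sketched in a few lines of Python. This is only an illustration of the min{CPUs/CPU per job, RAM/memory per job} rule; the function name is hypothetical, and rounding down when the division is not exact is an assumption, not documented scheduler behavior.

```python
def parallel_jobs(instance_cpus, instance_mem_gb, cpu_per_job, mem_gb_per_job):
    """Estimate how many jobs fit on one instance at the same time.

    The limit is whichever resource runs out first: CPUs or memory.
    Rounding down for non-exact divisions is an assumption.
    """
    jobs_by_cpu = instance_cpus // cpu_per_job
    jobs_by_mem = instance_mem_gb // mem_gb_per_job
    return int(min(jobs_by_cpu, jobs_by_mem))


# c5.9xlarge: 36 CPUs and 72GB RAM (figures taken from the text above)
print(parallel_jobs(36, 72, cpu_per_job=1, mem_gb_per_job=4))
print(parallel_jobs(36, 72, cpu_per_job=1, mem_gb_per_job=8))
```

With 4GB/job memory is the limiting resource and 18 jobs run in parallel; raising memory to 8GB/job halves that to 9, which is one way to keep CPU usage below 100%.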
Window size and window step: The default values for these parameters are 50kb and 20kb (kilobases), respectively. Please keep in mind that, because the sliding window algorithm considers all bases inside the window, the window length and the number of windows directly affect the execution time and the price of the task.
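To get a feeling for how these two parameters drive the amount of work, the number of windows over a region can be approximated as below. This is a rough sketch, not the GENESIS implementation: the function name and the 100,000kb region are hypothetical, and it counts only windows that fit entirely inside the region, so the real tool may treat region edges differently.

```python
def count_windows(region_length_kb, window_size_kb=50, window_step_kb=20):
    """Approximate number of sliding windows that fit fully inside a region.

    Defaults mirror the 50kb window size and 20kb window step from the text.
    """
    if region_length_kb < window_size_kb:
        return 0
    return (region_length_kb - window_size_kb) // window_step_kb + 1


region_kb = 100_000  # hypothetical region length, for illustration only
print(count_windows(region_kb))                     # default size and step
print(count_windows(region_kb, window_step_kb=10))  # a smaller step gives more windows
```

Halving the step roughly doubles the number of windows, and therefore the number of tests to run.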
Maximum number of parallel instances: The default number of parallel instances is 8. Changing this number mainly affects execution time: tasks with more parallel instances finish faster. This parameter can be set in Execution settings when drafting a task. However, each user has a limited total number of parallel instances, and reserving a large number of parallel instances for one task decreases the number of different tasks that can run at the same time.
Spot instances: If the task is expected to finish within a few hours, it can be run on spot instances, which reduces the execution cost. However, if a spot instance is lost, the task is rerun on on-demand instances, which can lead to a higher cost than running the task on on-demand instances from the beginning. That is why spot instances are generally suitable for short tasks.
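The trade-off can be illustrated with a small cost comparison. All prices and durations below are hypothetical, and the sketch assumes that a lost spot instance means the whole task is rerun from the start on an on-demand instance (no memoization).

```python
def on_demand_cost(task_hours, on_demand_price):
    """Cost of running the whole task on an on-demand instance."""
    return task_hours * on_demand_price


def spot_cost_if_completed(task_hours, spot_price):
    """Cost when the spot instance survives until the task finishes."""
    return task_hours * spot_price


def spot_cost_if_interrupted(hours_before_loss, task_hours, spot_price, on_demand_price):
    """Cost when the spot instance is lost and the task is rerun on-demand.

    The time already spent on the spot instance is still billed
    (an assumption made for this sketch).
    """
    return hours_before_loss * spot_price + task_hours * on_demand_price


# Hypothetical 3-hour task, spot at 0.5 and on-demand at 1.5 per hour
print(on_demand_cost(3, 1.5))
print(spot_cost_if_completed(3, 0.5))
print(spot_cost_if_interrupted(2, 3, 0.5, 1.5))
```

The later the interruption, the more spot time is paid for on top of the full on-demand rerun, which is why long tasks are riskier on spot instances.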
Benchmarking results:
In this text, we have highlighted the main cloud cost and execution time drivers when running GENESIS analyses. Please keep in mind that users may incur additional costs when running an analysis, due to factors such as task failures or the need to rerun the analysis. When estimating cloud costs for your study, please include a cost buffer for these two factors as well.
To prevent task failures, we advise you to carefully read app descriptions and if you have any questions or doubts, contact our Support Team at support@sevenbridges.com. Also, using memoization can help in cost reduction when rerunning the task after initial failure.
Throughout this document, it is important to note that the figures in the tables above are intended to be informative as opposed to predictive. The actual costs incurred for a given analysis will also depend on the number of samples and number of variants in the input files. For our analysis described above, we selected 1000G and TOPMed Freeze5 data as inputs. For TOPMed Freeze5, we selected cohorts of 10k, 36k, and 50k subjects. The benchmarking results for the selected tasks would vary if the cohorts were defined differently.
The selection of instances is another factor that can lead to variation in results for a given analysis. The results highly depend on the user’s skill in choosing an appropriate instance and using the instance resources optimally. For that reason, if two users run the same task with different configurations (different instance type, CPU/job, and/or RAM/job parameters), the results may vary.
The results (execution time and cost) are directly connected to the CPU per job and Memory per job parameters. Dedicating different resources to a job results in a different number of jobs running in parallel on the selected instance, and therefore in a different execution time and cost. For that reason, setting up a task draft properly is crucial. In this document, we provided details on what we consider optimal CPU and Memory per job inputs for TOPMed Freeze5 and 1000G data. These numbers can be used as a good starting point, bearing in mind that each study has its own unique requirements.
For both Single and Sliding Window Association Testing:
Please note that results for single and sliding window tests are approximations. To avoid unnecessary cloud costs, we performed both single and sliding window tests only on 2 chromosomes. These results were the basis on which we assessed the cost and execution time for the whole genome.
The following is an explanation of the procedure we applied for the GENESIS Single Association testing workflow and TOPMed freeze5 data (the same applies to the GENESIS Sliding window Association testing workflow):
In the GENESIS Single Association testing workflow, the variants are tested in segments. The number of segments that the workflow processes is the ratio of the total genome length to the segment length (which is one of the input parameters in this workflow). For example: if we are testing a whole genome of 3,000,000,000 base pairs and use the default segment length value of 10,000kb, we will have 300 segments. Furthermore, if we use the default value for the maximum number of parallel instances, which is 8, we can approximate the average number of segments that each instance processes: 300/8 ≈ 37.
The GENESIS Single Association testing workflow can process segments in parallel (processing one segment is a job). The number of parallel segments (jobs) depends on the CPU per job and Memory per job parameters, and can be calculated as described previously. For example: if we are running the analysis on a c5.9xlarge instance (36 CPUs and 72GB RAM) with 1 CPU/job and 4GB/job, we will have 18 jobs in parallel. Since each of our 8 instances processes approximately 37 jobs, 18 at a time, each instance needs approximately 2 cycles (37/18 ≈ 2). Furthermore, knowing the average job length, we can approximate the running time of 1 instance: 2 cycles multiplied by the average job length. Since the instances run in parallel, this is also the total execution time. Lastly, once the execution time is known, we can calculate the task price: the number of instances multiplied by the execution time in hours, multiplied by the instance price per hour. For each tested scenario in our benchmarking analysis, we obtained the average job length from the corresponding tasks, which included 2 chromosomes, such that the total number of jobs was above 30.
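The extrapolation described in the last two paragraphs can be written down as a short calculation. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the 3,000,000,000 figure is treated as a length in base pairs (3,000,000kb) so that it can be divided by a segment length in kb, and the average job length and hourly price are made-up placeholders rather than benchmark values. Ratios are kept unrounded, whereas the text rounds them to 37 segments and 2 cycles.

```python
def estimate_task(genome_length_kb, segment_length_kb, n_instances,
                  jobs_in_parallel, avg_job_hours, instance_price_per_hour):
    """Approximate whole-genome execution time and price from per-job figures."""
    n_segments = genome_length_kb / segment_length_kb
    segments_per_instance = n_segments / n_instances
    # One "cycle" is a batch of jobs that run at the same time on one instance
    cycles = segments_per_instance / jobs_in_parallel
    # Instances run in parallel, so one instance's time is the total time
    execution_hours = cycles * avg_job_hours
    price = n_instances * execution_hours * instance_price_per_hour
    return {
        "n_segments": n_segments,
        "segments_per_instance": segments_per_instance,
        "cycles": cycles,
        "execution_hours": execution_hours,
        "price": price,
    }


estimate = estimate_task(
    genome_length_kb=3_000_000,    # 3,000,000,000 base pairs
    segment_length_kb=10_000,      # default segment length from the text
    n_instances=8,                 # default maximum number of parallel instances
    jobs_in_parallel=18,           # c5.9xlarge with 1 CPU/job and 4GB/job
    avg_job_hours=0.5,             # hypothetical average job length
    instance_price_per_hour=1.5,   # hypothetical hourly price
)
for name, value in estimate.items():
    print(name, round(value, 2))
```

Replacing the two placeholder values with the average job length measured on a small test run and the current price of the chosen instance gives a first-order estimate only; the actual figures will depend on the data and configuration.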
Overview of current projects hosted on BioData Catalyst Powered by Gen3, including their dependencies, characteristics, and relationships.
A list of current project IDs can be found in the Data tab, under Filters>Project>Project Id. The current project IDs are:
Parent
TOPMed
Open_Access
Tutorial
The Parent and TOPMed study types have been categorized on Gen3 by their Program designation. An example of this designation by Program is presented below.
The Program types can be further identified by whether there is an underscore (_) at the end of the study name:
Parent studies will include an underscore at the end of the study name.
Example: parent-WHI_HMB-IRB_
TOPMed studies will not include an underscore at the end of the study name.
Example: topmed-BioMe_HMB-NPU
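When working with a list of study names, the trailing-underscore rule above can be applied programmatically. The sketch below is illustrative only; the function name is hypothetical, and the rule is meant for Parent and TOPMed studies, not for the Open_Access or Tutorial projects.

```python
def program_from_study_name(study_name):
    """Classify a Parent/TOPMed study by the trailing-underscore rule."""
    return "Parent" if study_name.endswith("_") else "TOPMed"


# The two example names given in the text above
print(program_from_study_name("parent-WHI_HMB-IRB_"))
print(program_from_study_name("topmed-BioMe_HMB-NPU"))
```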
There are three distinct relationships possible between Parent and TOPMed studies. The first two relationships are streamlined:
Parent only: The Parent study does not have a TOPMed counterpart study. This usually means that there are no genomic data, such as WXS (whole exome sequencing) or WGS (whole genome sequencing), located within the study; only phenotypic data.
TOPMed only: This TOPMed study does not have a Parent counterpart study. These studies will contain both genomic data, WXS or WGS, and phenotypic data.
Parent study with a counterpart TOPMed study: The Parent study will contain the phenotypic data, while the TOPMed study will contain the genomic data. Under dbGaP, these studies would be kept separate from one another and the user would need to create the linkages. In the Gen3 platform, these studies have been linked together under the Parent study, based on the participant IDs found in dbGaP. This allows our system to produce valuable information and supports cohort creation, as it combines both phenotypic and genomic data.
The most notable difference between the Program categories is the type of hosted data.
Genomic data: None
Phenotypic data: As with TOPMed studies, any phenotypic data found within the Graph Model will only be DCC harmonized variables. The raw phenotypic data from dbGaP can again be found in the reference_file node.
Genomic data: Available data can include CRAM, VCFs and Cohort-level VCF files
Phenotypic data: TOPMed studies without an associated Parent study will include phenotypic data in the data graph by way of DCC harmonized variables. Additionally, raw phenotypic data from dbGaP can be found in the reference_file node as tar files that share this common naming scheme: RootStudyConsentSet_phs######.<study_shorthand>.v#.p#.c#.<consent_codes>.tar.gz
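If you need to pick these archives out of a longer file listing, the naming scheme can be matched with a regular expression. The sketch below is an assumption-laden illustration: it reads each # as one or more digits, the group names are my own labels rather than official field names, and the file name in the example is invented.

```python
import re

# Pattern derived from the naming scheme in the text; each '#' is assumed
# to stand for one or more digits.
TAR_NAME = re.compile(
    r"^RootStudyConsentSet_"
    r"(?P<accession>phs\d+)\."
    r"(?P<study_shorthand>[^.]+)\."
    r"v(?P<version>\d+)\."
    r"p(?P<participant_set>\d+)\."
    r"c(?P<consent_group>\d+)\."
    r"(?P<consent_codes>.+)\.tar\.gz$"
)


def parse_tar_name(file_name):
    """Return the parts of a matching file name as a dict, or None."""
    match = TAR_NAME.match(file_name)
    return match.groupdict() if match else None


# Invented file name, for illustration only
print(parse_tar_name("RootStudyConsentSet_phs000000.ExampleStudy.v1.p1.c1.HMB-IRB.tar.gz"))
print(parse_tar_name("notes.txt"))
```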
The 1000 Genomes Project is an international research effort (2008-2015) to establish the most detailed catalogue of human variation and genotype data. On the Gen3 platform, the Program open_access contains:
Genotypic data: Available data can include CRAM and VCF files.
Phenotypic data: The data graph will contain phenotypic data by way of DCC harmonized variables. Additionally, raw phenotypic data can be found in the reference_file node as VCF and TXT files.
This program contains genomic data from 1000 Genomes and synthetic clinical data generated by Terra. The purpose of this dataset is to serve as a genome-wide association study (GWAS) tutorial. GWAS is an approach used in genetics research to associate specific genetic variations with particular diseases. For more information, see Terra Tutorials.
On the Gen3 platform, the Program tutorial contains:
Genotypic data: Available data can include CRAM and VCF files.
Phenotypic data: The data graph will contain phenotypic data by way of DCC harmonized variables. Additionally, raw phenotypic data can be found in the reference_file node as VCF and GDS files.
Overview of the Profile page on the BioData Catalyst Powered by Gen3
The Profile page contains two sections: API keys and Project access.
To download large amounts of data, an API key will be required as a part of the gen3-client. To create a key on your local machine, click Create API key, which will activate the following pop-up window:
Click Download json to save the credential file to your local machine. After completion, a new entry will appear in the API key(s) section of the Profile page. It will display the API key's key_id and the expiration date (one month after the key creation). The user should delete the key after it has expired. If for any reason a user feels that their API key has been compromised, the key should be deleted before a new one is created.
This section of the Profile page lists the projects and the methods of access for the data within the Gen3 BioData Catalyst system. If you do not see access to a specific study, check that you have been granted access within dbGaP. If access was granted more than a week ago and the study is still missing, contact the BioData Catalyst Help Desk: bdcat-support@datacommons.io
Interactive Data Dictionary on BioData Catalyst Powered by Gen3
The Dictionary page contains an interactive visual representation of the Gen3 data model. The default graph model view, as pictured below, displays all of the nodes and relationships between nodes in a hierarchical structure. The model further specifies the node types and links between nodes, as highlighted in the legend located at the top right side of the page.
Users can click on any of the graph nodes in order to learn more about their respective properties. By clicking on a node, the graph will highlight that specific node and all associated links that connect it to the Program node. A "Data Model Structure" list will also appear on the left side toolbar. This will display the node path required to reach the selected node from the Program node.
When a second node in the path is selected, it will then gray out the other possible paths and only highlight the selected path. It will also change the "Data Model Structure" list on the left side toolbar.
The left side toolbar has two options available:
Download templates: Will download the submission files for all the nodes in the "Data Model Structure" list. This option can also be found on the node that was first selected.
Open properties: Will open the node properties in a new pop-up window; an example is displayed in the following screenshot.
This property view will display all properties in the node and information about each property:
Property: Name of the property.
Type: The type of input for the property. Examples of this are string, integer, Boolean and enumerated values (enum), which are displayed as preset strings.
Required: This field will display whether the property is required for the submission of the node into the data model.
Description: This field will display further information about the property.
Term: This field can be populated with external resources that have further information about the property.
The Table view is similar to the Properties view, and nodes are displayed as a list of entries grouped by their node category.
Clicking on one of the nodes will open the Properties view of the node.
The Dictionary contains a text-based search function that will search through the names of the properties and the descriptions. While typing, a list of suggestions appears below the search bar. Click on a suggestion to search for it.
When the search function is used, it will default to the graph model and highlight nodes that contain the search term. Frames around the node boxes indicate whether the searched word was identified in the name of the node (full line) or in the node's description and properties' names/descriptions (dashed line).
Clicking on one of these nodes will display only the properties that have this keyword present in either the property name or the description.
Click Clear Search Result to clear the free text search if needed.
The search history is saved below the search bar in the "Last Search" list. Click on an item here to display the results again.
Overview of Workspaces on BioData Catalyst Powered by Gen3
When navigating to a Workspace, users are presented with multiple workspace options.
The Gen3 platform offers two workspace environments: Jupyter Notebooks and R Studio.
There are six workspaces:
Virtual machines (VM):
Small Jupyter Notebook VM
Large Jupyter Notebook Power VM
R Studio VM
Pre-made workflow workspaces:
Autoencoder Demo
CIP Demo
Tensorflow-Pytorch.
To start a workspace, select Launch. You will see the following launch loading screen.
Launching a VM can take up to five minutes depending on the size and complexity of the workspace.
Once the VM is ready, the initial screen for the workspace will appear. For scripts and output that need to be saved when the workspace is terminated, store those files in the pd/ directory.
This workspace will persist once the user has logged out of the Gen3 BioData Catalyst system. If the workspace is no longer being used, terminate the workspace by selecting Terminate Workspace at the bottom of the window. You will be returned to the Workspace page with all of the workspace options.
For more information about the Gen3 Workspace, refer to Data Analysis in a Gen3 Data Commons.
One of the key steps to becoming an advanced user and being able to fully understand and leverage the power of BioData Catalyst Powered by Seven Bridges is to learn how to detect and correct errors that prevent the successful execution of your analyses. The Troubleshooting tutorial presents some of the most common errors in task execution on the platform and shows you how to debug and resolve them. There is also a corresponding public project on the platform called "Troubleshooting Failed Tasks" which has examples of the failed analyses presented in the written tutorial.
Find the written tutorial .
Find the platform public project with examples .
This guide has been prepared to help you with your first set of projects on BioData Catalyst Powered by Seven Bridges.
This guide aims to help you learn how to take advantage of all the various features and functionality for performing analyses on Seven Bridges and ensure that you can set up your analyses in the most efficient way possible to save time and money.
The following topics are covered in this guide:
The basics of working with CWL tools and workflows on the platform.
How to specify computational resources on the platform and how to use the default options selected by the execution scheduler.
How to run Batch analyses and take advantage of parallelization with scatter.
The basics of working with JupyterLab notebooks and RStudio for interactive analysis.
You can refer to the guide .
Just starting out on NHLBI BioData Catalyst Powered by Seven Bridges, and need to get up to speed on how to use the platform? Our experts have created a Getting Started Guide to help you jump right in. We recommend users begin learning how to use BioData Catalyst Powered by Seven Bridges by following the steps in this guide. After reading this guide, you will know how to create an account on NHLBI BioData Catalyst Powered by Seven Bridges, learn the basics of creating a workspace (project), run an analysis, and search through the hosted data.
To read our Getting Started Guide, please refer to our documentation page .
The Knowledge Center is a collection of user documentation which describes all of the various components of the platform, with step-by-step guides on their use. Our Knowledge Center is the central location where you can learn how to store, analyze, and jointly interpret your bioinformatic data using BioData Catalyst Powered by Seven Bridges.
From the Knowledge Center, you can access platform documentation. This content is organized into sections that deal with the important aspects of accessing and using BioData Catalyst Powered by Seven Bridges.
You can also read the Release Notes in the Knowledge Center, keeping you up-to-date on all of the latest updates and new features for BioData Catalyst Powered by Seven Bridges.
Job: This refers to the “execution” part from the “Task” definition (see above). It represents a single run of a single tool found within a workflow. If you are coming from a computer science background, you will notice that the definition is quite similar to the common understanding of the term “job”, except that here a “job” is a component of a bigger unit of work called a “task”, and not the other way around, as may be the case in some other areas. To further illustrate what a job means on the platform, we can visually inspect jobs after the task has been executed using the View stats & logs panel (button in the upper right corner on the task page):
Logging in to Terra for the first time is a quick and straightforward process. The process is easiest if you already have an email address hosted by Google. If you want to use an email address that is not hosted by Google, we have instructions for that as well.
Article: How to register for a Terra account
Article: Setting up a Google account with a non-Google email
We also recommend our article on navigating in Terra to get familiar with basic menus and options in Terra, as well as this video introduction to Terra.
Read on in the next two subsections for primers on how to set up billing and how to manage costs.
BioData Catalyst Powered by Terra is a user-friendly system for doing biomedical research in the cloud. Terra workspaces integrate data, analysis tools, and built-in security components to deliver smooth research flows from data to results.
The following entries in this section of the BioData Catalyst documentation are a starting point for learning how to use Terra in the context of the BioData Catalyst ecosystem. You can also dive deeper into Terra by visiting the Terra website and the Terra Support Center. Wherever possible, we highlight specific articles, tutorial videos, and example workspaces that will help you learn what you need to know to accelerate your research.
If you can't find what you are looking for, we are happy to help. See the Troubleshooting and Support section for more information.
Please note that Terra is designed for and tested with the Chrome browser.
Now that you can log in, you’ll want to make sure that you have access to a Billing Account and Billing Project. This will allow you to charge storage and analysis costs through a Google account linked to Terra. A Terra Billing Project is Terra's way of connecting a workspace, where you accrue costs, back to a Google Billing Account, where you pay for them. You must have a Google Billing Account established before creating a Terra Billing Project. Outlined here are the steps necessary to set this up, as well as instructions on how to add or be added to an existing account/billing project.
Detailed instructions for setting up your billing can be found by following the links below. If you are a BioData Catalyst Fellow, your procedure for billing setup is a bit different, but you may still find some of the information below relevant (sharing a billing project with another user, for example).
Step 1: Get Cloud credits for BioData Catalyst
Step 2: Wait for approval & review the Billing overview for BioData Catalyst users
Step 3: Credits approved. Now create a new Terra billing project
Step 4 (optional): Sharing Billing Projects among colleagues
Workspaces are the fundamental building blocks of Terra. You can think of them as modular digital laboratories that enable you to organize and access your data in a number of ways for analysis.
To learn about the basics of operating a Terra workspace, we recommend these resources:
Article: Working with workspaces
Video: Introduction to using workspaces in Terra
Read on in this section to get familiar with:
Sharing a workspace allows collaborators to actively work together in the same project workspace. Workspaces can be used as repositories of data, workflows, and Jupyter notebooks. Learn more about how to securely share a workspace:
Article: How to share a workspace
Article: Reader, writer or owner? Workspace access controls, explained
Article: Using permissions
Video: Introduction to Collaboration and Sharing in Terra
BioData Catalyst Powered by Gen3 provides data for many projects and conveniently supports search across the vast set of subjects to identify the best available cohorts for research analysis. Searches are based on harmonized phenotypic variables and may be performed both within and across projects.
When a desired cohort has been identified in Gen3, the cohort may be conveniently "handed-off" to Terra for analysis. Optionally, this dataset may be enhanced with additional metadata from dbGaP, or extended to include additional researcher-provided subject data.
Here we provide essential information for all researchers using BioData Catalyst data from Gen3, including how to access and select Gen3 subject data and hand it off to Terra, as well as a description of the GA4GH Data Repository Service (DRS) protocol and data identifiers used by Gen3 and Terra.
The resources below contain the information you’ll need to access your desired data:
Video: Data Analysis with Gen3, Terra and Dockstore
Article: Discovering Data Using Gen3
Article: Understanding and using Gen3 data in Terra
Article: Data Access with the GA4GH Data Repository Service (DRS)
Article: Linking Terra to External Servers
Article: Understanding and setting up a proxy group
Workspace: BioDataCatalyst Gen3 data on Terra tutorial
Workspace: TOPMed Aligner workspace
We have a number of articles on tracking and minimizing the costs of operating on Terra. There are multiple ways of estimating how much your analyses are costing you, including built-in tools and external resources. The articles below contain instructions and advice on managing your cloud resources in a variety of ways:
Article: Understanding and controlling cloud costs
Article: Best practices for managing shared team costs
Article: How much did a workflow cost?
Article: How to disable billing on a Terra project
Terra has a number of features to ensure the security of sensitive data accessed through the platform. Many of these features are in place automatically, while tools like authorization domains give you greater control over your data. These articles contain an overview of the security features enabled on Terra:
Article: Authorization Domain overview for BioData Catalyst users
Article: Managing data privacy and access with Authorization Domains
Article: Best Practices for accessing external resources
Article: Terra security posture
Terra workspaces include a dedicated workspace Google bucket, as well as a built-in data model for managing your data. We provide articles in Terra’s knowledge base explaining how to organize and access data in a variety of ways.
A key to understanding the power of Terra is understanding its built-in data model, which allows you to rewire the inputs and outputs of your workflows and Jupyter notebooks.
The following resources give you guided instructions for using cloud-based data with Terra:
Article: Managing data with table
Video: Introduction to Terra data tables
Article: Uploading to a workspace Google bucket
Article: How to import metadata to a workspace data table
Video: Making and uploading data tables to Terra
You can import data into your workspace by either linking directly to external files you have access to, or by interfacing with a number of platforms with which Terra has integrated access.
For BioData Catalyst researchers, one of the most relevant of these interfacing platforms is Gen3. However, this section also provides you with resources that teach how to import data from other public datasets integrated into Terra’s data library, as well as how to bring in your own data.
Read on in this section for more information on:
The Annotation Explorer is an application developed by Seven Bridges in collaboration with the TOPMed Data Coordinating Center. The application enables users to interactively explore, query, and study characteristics of an inventory of annotations for the variants called in TOPMed studies. This application can be used pre-association testing to interactively explore aggregation and filtering strategies for variants based on annotations and generate input files for multiple-variant association testing. It can also be used post-association testing to explore annotations associated with a set of variants, like variant sets found significant during association testing.
The Annotation Explorer currently hosts a subset of genomic annotations obtained using Whole Genome Sequence Annotator software for TOPMed variants. Currently, annotations for TOPMed Freeze5 variants and TOPMed Freeze8 variants are integrated with the Annotation Explorer. Researchers who are approved to access one or more of the TOPMed studies included in Freeze8 or Freeze5 will be able to access these annotations in the Annotation Explorer.
For more information, refer to the Annotation Explorer's Public Project Page.
Terra’s data library includes a number of integrated datasets, many of which have individualized Data Explorer interfaces, useful for generating and exporting custom cohorts. If you click into a dataset and have the proper permissions, you'll be able to explore the data. If you don't have the necessary permission, you'll be taken to a page that tells you whom to contact for access.
The resources linked below provide guided instructions for creating custom cohorts from the data library and importing them to your workspace, and using a Jupyter notebook to interact with the data: Article: Video: Workspace:
If things aren’t going quite as expected, there are a number of avenues to help unblock any issues you may have.
Troubleshooting: This section of the Terra knowledge base contains many useful articles on how to address problems, including a variety of articles describing common workflow errors, as well as more general articles that explain how to find which errors are affecting your work, and how to proceed once you’ve diagnosed your problem.
Monitor your jobs: The Job History tab is your workflow operations dashboard, where you can check the status of past and current workflow submissions and find links to the job manager where you can diagnose issues.
How to report an issue: There are a number of ways you can report an issue directly to us, outlined in this article. If something appears broken, slow, or just plain weird, feel free to let us know.
Community forum: A lot of answers can be found on our forum, which is monitored by our dedicated frontline support team and has an integrated search function. If you suspect that you’re running into a common issue but can’t find an answer in the documentation, this is a great place to check.
The interactive analysis features of Terra support interactive data exploration, including the use of statistical methods and graphical display. Versatile and powerful interactive analysis is provided through Jupyter Notebooks in both Python and R languages.
Jupyter Notebooks run on a virtual machine (VM). You can customize your VM’s installed software by selecting one of Terra's preinstalled notebook cloud environments or choosing a custom environment by specifying a Docker container. Docker containers help ensure that you and your colleagues analyze with the same software, making your results reproducible.
Article: Interactive statistics and visualization with Jupyter notebooks
Article: Customizing your interactive analysis application compute
Article: Terra's Jupyter Notebooks environment Part I: Key components
Article: Terra's Jupyter Notebooks environment Part II: Key operations
Article: Terra's Jupyter Notebooks environment Part III: Best Practices
Video: Notebooks overview
Video: Notebooks Quickstart walkthrough
Workspace: Notebooks Quickstart workspace
Workspace: BioData Catalyst notebooks collection
Workspace: PIC-SURE Tutorial in R
Workspace: PIC-SURE Tutorial in Python
Terra supports the following types of analysis: Batch processing with Workflows and Interactive analysis with Jupyter Notebooks. This section will orient you with resources that teach you how to do:
As an introduction, we recommend reading our article on the kinds of analysis you can do in Terra.
How to use Dockstore workflows in our cloud partner platforms
Using the NHLBI BioData Catalyst ecosystem, you can launch workflows from Dockstore in both of our partner analysis platforms, Terra and Seven Bridges. It is important to know that these platforms use different workflow languages: Terra uses WDL and Seven Bridges uses CWL.
When you open any WDL or CWL workflow in Dockstore, you will see the option to "Launch with NHLBI BioData Catalyst":
If you selected a CWL workflow, this workflow will launch in BioData Catalyst Powered by Seven Bridges. Learn more about how this integration works.
If you selected a WDL workflow, this workflow will launch in BioData Catalyst Powered by Terra. Learn more about how this integration works.
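The routing rule described above is small enough to express as a lookup. The snippet below is only a summary of the text in code form; the names are hypothetical and it does not call Dockstore or either platform.

```python
# Workflow language -> platform the "Launch with NHLBI BioData Catalyst"
# option leads to, as described in the text above.
LAUNCH_TARGET = {
    "CWL": "BioData Catalyst Powered by Seven Bridges",
    "WDL": "BioData Catalyst Powered by Terra",
}


def launch_target(workflow_language):
    """Return the platform a workflow of the given language launches in."""
    key = workflow_language.strip().upper()
    if key not in LAUNCH_TARGET:
        raise ValueError(f"No launch target described for language: {workflow_language!r}")
    return LAUNCH_TARGET[key]


print(launch_target("cwl"))
print(launch_target("WDL"))
```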
Terra provides powerful support for performing Genome-Wide Association Studies (GWAS). The following featured and template workspaces include Jupyter notebooks for phenotypic and genomic data preparation (using Hail) and workflows (using GENESIS) to perform single or aggregate variant association tests using mixed models. We will continue to provide more resources for performing more complex GWAS scenarios in BioData Catalyst.
A Jupyter Notebook in both of the following workspaces uses Hail to generate Genetic Relatedness Matrices for input into the GWAS workflows. Users with access to kinship matrices from the TOPMed consortium may wish to skip these steps and instead import kinship files using the bring your own data instructions.
The BioData Catalyst GWAS tutorial workspace was created to walk users through a GWAS with training data that includes synthetic phenotypic data (modeled after traits available in TOPMed) paired with 1000 Genomes open-access data. This tutorial aims to familiarize users with the Gen3 data model so that they can become empowered to use this data model with any existing tutorials available in the Terra library’s showcase section.
This template is an example workspace that asks researchers to export TOPMed projects (for which they have access) into an example template for conducting a common variant, mixed-models GWAS of a blood pressure trait. Our goal was to include settings and suggestions to help users interact with data exactly as they receive it through BioData Catalyst. Accommodating other datasets may require modifying many parts of this notebook. Inherently, the notebook is an interactive analysis where decisions are made as you go. It is not recommended that the notebook be applied to another dataset without careful thought.
Cost Examples
Below are reported costs from using 1,000 and 10,000 samples to conduct a GWAS using the BioData Catalyst GWAS Blood Pressure Trait template workspace. The costs were derived from single variant tests that used Freeze 5b VCF files that were filtered for common variants (MAF > 0.05) for input into workflows. The way these steps scale will vary with the number of variants, individuals, and parameters chosen. TOPMed Freeze 5b VCF files contain 582 million variants, and Freeze 8 increases this to ~1.2 billion. For GWAS analyses with Freeze 8 data, computational resources and costs are expected to be significantly higher.
| Analysis Step | Cost (n=1,000; Freeze 5b) | Cost (n=10,000; Freeze 5b) |
| | $29.34 ($19.56/hr for 1.5 hours) | $336 ($56/hr for 6 hours) |
| workflow | $1.01 | $5.01 |
| workflow | $0.94 | $6.67 |
| TOTAL | $32.29 | $347.68 |
These costs were derived from running these analyses in Terra in June 2020.
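The hourly-billed figures in the table are simply the hourly rate multiplied by the run time, and a column total is the sum of its step costs. The short sketch below reproduces that arithmetic for the reported numbers. The helper names are hypothetical, and `Decimal` is used only to avoid floating-point rounding in currency values.

```python
from decimal import Decimal

CENT = Decimal("0.01")


def hourly_step_cost(rate_per_hour: str, hours: str) -> Decimal:
    """Cost of a step billed at a flat hourly rate, rounded to cents."""
    return (Decimal(rate_per_hour) * Decimal(hours)).quantize(CENT)


def total_cost(step_costs) -> Decimal:
    """Sum the per-step costs for one column of the table."""
    return sum(step_costs, Decimal("0")).quantize(CENT)


# Hourly-billed step, using the rates and durations reported in the table.
cost_1k = hourly_step_cost("19.56", "1.5")   # n=1,000 column
cost_10k = hourly_step_cost("56", "6")       # n=10,000 column

# Column total for n=10,000: hourly step plus the two workflow steps.
total_10k = total_cost([cost_10k, Decimal("5.01"), Decimal("6.67")])

if __name__ == "__main__":
    print(cost_1k, cost_10k, total_10k)
```

Because the rates reflect Terra pricing at the time the analyses were run, treat the inputs as historical values rather than current prices.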
Dockstore offers faceted search, which allows for flexible querying of tools and workflows. Tabs are used to split up the results between tools and workflows. You can search for basic terms/phrases, filter using facets (like CWL vs WDL), and also use advanced search queries. Learn more.
You can also search curated workflows on Dockstore's Organizations page.
Organizations are landing pages for collaborations, institutions, consortiums, companies, etc. that allow users to showcase tools and workflows. This is achieved through the creation of collections, which are groupings of related tools and workflows. Learn more about organizations, including how your research group can create its own organization to share your work with the community.
Dockstore Organizations relevant to BioData Catalyst users:
Here, you can find a suite of analysis tools we have developed with researchers that are aimed at the BioData Catalyst community. Examples include workflows for performing GWAS and Structural Variant Calling. Many of these collections also point users to tutorials where you can launch these workflows in our partner platforms and run an analysis.
These workflows are based on pipelines the University of Michigan developed to perform alignment and variant calling on TOPMed data. If you're bringing your own data to BioData Catalyst to compare with TOPMed data, these may be helpful resources.
Our mission is to catalyze open, reproducible research in the cloud
We hope Dockstore provides a reference implementation for tool sharing in the sciences. Dockstore is essentially a living and evolving proof of concept designed as a starting point for two activities that we hope will result in community standards within the GA4GH:
a best practices guide for describing tools in Docker containers with CWL/WDL/Nextflow
a minimal web service standard for registering, searching and describing CWL/WDL-annotated Docker containers that can be federated and indexed by multiple websites
We plan on expanding the Dockstore in several ways over the coming months. Please see our for details and discussions.
To help Dockstore grow, we encourage users to publish their tools and workflows on Dockstore so that they can be used by the greater scientific community. Here is how to get started:
Register your tool or workflow on Dockstore
Create an Organization, invite your collaborators, and promote your work in collections
This forum is a great place to find and post questions about Docker files, workflow languages, Dockstore features, and workflow learning resources. The user base includes CWL, WDL, Nextflow, and Galaxy workflow authors and users.
"An app store for bioinformatics workflows"
Dockstore is an open platform used by the GA4GH for sharing Docker-based tools described with either the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL). Dockerized workflows come packaged with all of their requirements, meaning you spend less time searching the web for obscure installation errors and more time doing research.
Dockstore is aimed at scientific use cases, and we hope this helps users find helpful resources more quickly. Our documentation is also created with researchers in mind: we work to distill down information about the technologies we use to the relevant points to get users started quickly.
This section highlights the documentation relevant to BioData Catalyst users. If you are brand new to Dockstore, it is suggested to review the Getting Started Guide. Our entire suite of documentation is available here.
Authors: Beth Sheets (UC Santa Cruz, Genomics Institute), Dave Roberson (Seven Bridges)
Contributors: Dan Vicente (Seven Bridges), Alison Leaf (Seven Bridges), Stephanie Gogarten (Fellow), Sheila Gaynor (Fellow), Jean Monlong (Fellow), Kenny Westermann (Fellow)
Reproducibility is one of the biggest challenges facing science. Several issues associated with reproducibility have been well summarized in the FAIR (Findable, Accessible, Interoperable, and Reusable) Guiding Principles. The BioData Catalyst ecosystem promotes FAIR and reproducible analyses by leveraging Docker-based reproducible tools in two descriptor languages. The Common Workflow Language (CWL) is currently supported in Seven Bridges workspaces, while the Workflow Description Language (WDL) is currently supported in Terra workspaces.
A combination of software containers (like Docker) and workflow languages wrap your bioinformatics pipeline, making your analysis portable across local and cloud execution environments. This allows researchers to reproduce your method(s) with exactly the same software, dependencies, and configurations. For example, BioData Catalyst researchers have been able to reuse CWL and WDL versions of a Genome-Wide Association pipeline developed by the TOPMed Data Coordinating Center in multiple cloud workspaces.
There are hundreds of CWL and WDL pipelines already available for researchers to run on BioData Catalyst. Both CWL pipelines and WDL pipelines can be discovered in Dockstore’s open-access catalog and then executed in the workspace environments. In addition, the Seven Bridges platform hosts CWL workflows directly on the platform in the Public Apps Gallery, and the Terra platform hosts WDL workflows in the Broad Methods Repository. However, many researchers will want to work with pipelines that do not have CWL or WDL versions yet or need to make changes to existing CWL and WDL pipelines. This guide will describe the steps for how to “Bring Your Own Tool” to the BioData Catalyst ecosystem.
Whether you are working with WDL or CWL tools, all users will begin by creating a containerized version of their pipeline. There are multiple methods users take to create these tools, but we simplify this process by walking through two example paths. For researchers utilizing the Terra workspace environment, we describe how to start by writing your WDL tool locally and then configuring and testing in the cloud workspace. For researchers performing analyses on the Seven Bridges workspace environment, we describe how to use the Seven Bridges platform web composer and web editor features to add a CWL wrapper to the Docker image. You may find it easiest to start with learning one language (for example, the one that works in your chosen workspace environment) and then expanding to multiple languages if needed.
Technologies for reproducible analysis in the cloud
Docker is a fantastic tool for creating lightweight containers to run your tools. It gives you a fast, VM-like environment for Linux where you can automatically install dependencies, make configurations, and set up your tool exactly the way you want, just as you would on a “normal” Linux host. You can then quickly and easily share these Docker images with the world using registries like Quay.io (indexed by Dockstore), Docker Hub, and GitLab.
Learn how to create a Docker image
There are multiple workflow languages currently available to use with Docker. In the BioData Catalyst ecosystem, Seven Bridges uses CWL and Terra uses WDL. To learn more about how these languages compare and differ, read Dockstore's documentation on tools and workflows.
Once you have picked what language works best for you, prepare your pipeline for analysis in the cloud with these tutorials aimed at bioinformaticians:
Learn how to create a tool in Common Workflow Language (CWL)
Learn how to create a tool in Workflow Description Language (WDL)
Dockstore’s integration with BioData Catalyst allows researchers the ability to easily launch reproducible tools and workflows in secure workspace environments for use with sensitive data. This privilege to work with sensitive data requires assurances of safe software.
We believe we can enhance the security and reliability of tools and workflows through open, community-driven best practices that exemplify the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles. We have established a best practices framework for secure and FAIR workflows published in Dockstore. We ask that users try to implement these practices for all workflows they develop.
If you want your workflow to be available to both WDL and CWL communities, you can use conversion tools to aid in the process. It is best practice to verify that the conversion was done correctly.
If you are interested in using Docker on your High-Performance Compute cluster, you may find the Singularity tool helpful.
You can use the workflow runner Toil for large parallelized CWL jobs in the AWS and/or Google clouds, locally, on Kubernetes, and/or on high-performance computing clusters. Toil is built for researchers and should run any CWL 1.0 workflow from Dockstore at scale. Toil also has some experimental support for WDL.
An introduction to terms used in this document
Each platform within BioData Catalyst may define these terms slightly differently. You will find more specific definitions within the platform-specific sections of the BYOT document. Below, we introduce a few terms before you get started.
App: 1) In Seven Bridges, an app is a general term to refer to both tools and workflows. 2) App may also refer to persistent software that is integrated into a platform.
Container: A standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another (for example, Docker).
Command: In workflow writing, the command specifies the literal command line run (akin to what you would run in the terminal).
Common Workflow Language (CWL): Simple scripting language for describing computational workflows for performing sequential operations on data. CWL is a way to describe command-line tools and connect them together to create workflows. CWL is well suited for describing large-scale workflows in cluster, cloud, and high-performance computing environments where tasks are scheduled in parallel across many nodes.
Docker: Software for running packaged, portable units of code, and dependencies that can be run in the same way across many computers. See also Container.
Dockerfile: A text document that contains all the commands a user could call on the command line to assemble an image.
Dockstore: An open platform developed by the Cancer Genome Collaboratory and used by the GA4GH for sharing Docker-based tools described with the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL).
Image: In the context of containers and Docker, this refers to the resting state of the software.
Instance: Refers to a virtual server instance from a public or private cloud network.
Task: In workflow writing, the term task encompasses all of the information necessary to execute a command, such as specifying input/output files and parameters.
Tool: In CWL, the term tool specifies a single command. This specification is not as discrete in other languages such as WDL.
Workflow Description Language (WDL): Way to specify data processing workflows with a human-readable and writable syntax. Define complex analysis tasks, chain them together in workflows, and parallelize their execution.
Workflow: A sequence of processes, usually computational in this context, through which a user may analyze data.
Workspace: Areas to work on/with data within a platform. Examples: projects within Seven Bridges.
Wrapping: The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.
Virtual Machine (VM): An isolated computing environment with its own operating system.
For other terms, you can reference the BioData Catalyst glossary.
Version control is vital for reproducibility since it helps track changes you or contributors make to your code and documentation. We suggest using GitHub to host your workflows in an open-access repository so that the research community can benefit from your work, and your work can benefit from feedback from the research community. Below, find steps for getting started with GitHub:
Upload your descriptor file (workflow), parameter files, and source code to a GitHub repository (see an example)
We encourage users to publish their tools and workflows on Dockstore so that they can be used by the greater scientific community. Dockstore features allow users to build their pipelines to be open, reusable, and interoperable. Publishing your work in this way will enhance the value of your work and the resources available to the scientific community.
Here is how to get started sharing your work on Dockstore:
Create a Dockstore Account and link your account to external services, such as GitHub
Link your Dockstore account to your ORCID to display your scientific identity.
Create an Organization, invite your collaborators, and promote your work in collections
We believe we can enhance the security and reusability of tools and workflows we share through open, community-driven best practices that exemplify the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles. We have established best practices for secure and FAIR workflows published in Dockstore. We ask that users try to implement these practices in the workflows that they share with the community.
Dockstore can help you create more accessible and transparent data science methods in your scientific publications. In this section, we want to provide some examples of FAIR workflows the community has shared.
In this 2020 Science paper by Lemieux, et al., the researchers provided transparent methods by citing immutable DOI archives of their container-based workflows, and also shared a Viral Genomics collection in the Broad Institute's organization on Dockstore. This collection includes several workflows, a README, and a link to a public workspace tutorial in the Terra cloud environment where users can learn exactly how to recreate their methods.
In this 2020 Nature paper by Li, et al. the authors shared their pipelines written in the Workflow Description Language in this Cumulus collection on Dockstore, and created a public Terra workspace where the community can recreate an exact analysis and figure from their publication.
This page describes how researchers may bring their own data files and metadata into Terra. Some researchers may choose to bring their own data to Terra in addition to, or instead of, using BioData Catalyst data from Gen3. For example, a researcher might bring additional (e.g., longitudinal) phenotypic data to enhance the harmonized metadata available from Gen3, perform joint variant calling with additional researcher-provided genomic data, or use researcher-provided data exclusively.
Researchers typically bring two types of data to Terra: data files (e.g., genomic data, including CRAM and VCF files) and metadata (e.g., tables of clinical, phenotypic, or other data, typically about the subjects in their study). These are described separately below.
There are two ways to make a researcher's data files available in Terra: uploading the data to the researcher's workspace bucket, or enabling Terra to access the data in a researcher-managed Google bucket, which requires setting up a proxy group.
Article: Uploading to a workspace Google bucket
Article: Understanding and setting up a proxy group
The ways in which a researcher may import metadata to the Terra Data tables are described in the articles and tutorials below:
Article: Managing data with tables
Article: How to import metadata to a workspace data table
Video: Introduction to Terra data tables
Video: Making and uploading data tables to Terra
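As a rough illustration of preparing metadata for import, the sketch below writes a tab-separated load file from a list of records. It assumes Terra's load-file convention in which the first column header is `entity:<table>_id`. The function name, the `subject` table name, and the example columns are hypothetical, so check the articles above for the exact requirements before uploading.

```python
import csv
import io


def build_load_table(entity_type: str, rows: list) -> str:
    """Return TSV text whose first column is `entity:<entity_type>_id`.

    Each row is a dict that must contain the key `<entity_type>_id`.
    """
    id_column = f"{entity_type}_id"
    for row in rows:
        if id_column not in row:
            raise ValueError(f"row is missing required key {id_column!r}")

    # Use every other key seen in any row as an attribute column.
    other_columns = sorted({key for row in rows for key in row} - {id_column})

    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([f"entity:{id_column}"] + other_columns)
    for row in rows:
        writer.writerow([row[id_column]] + [row.get(col, "") for col in other_columns])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical phenotype records for two subjects.
    records = [
        {"subject_id": "S1", "age": 54, "sex": "F"},
        {"subject_id": "S2", "age": 61},
    ]
    print(build_load_table("subject", records))
```

Missing attribute values are written as empty cells here, which is a design choice of this sketch rather than a documented requirement.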
Various costs associated with cloud computing
Platforms within BDC use a combination of Google Cloud Platform (GCP) and Amazon Web Services (AWS) for storing and analyzing data in the ecosystem. Researchers on BioData Catalyst begin to incur fees when they use the ecosystem in one of the following ways:
Data Storage: When a researcher uploads their own data or stores derived results on a cloud environment, they will begin to incur data storage costs on the platform their instance is located on.
Computing / Analysis: When a researcher runs a task in a platform, they will incur charges based on their usage.
Egress charges: When a researcher transfers data out of cloud storage.
Platform Support: When projects require a significant amount of support, researchers may need to purchase time from ecosystem platforms, though standard support is provided free of charge.
For more information on each of these categories, see below. You may also use the following links to view platform-specific guidance for BioData Catalyst Powered by Terra or BioData Catalyst Powered by Seven Bridges.
In general, storage charges are billed on all files in a workspace that belong to that project. This includes all files a researcher uploads to BioData Catalyst and any results files generated by their workflows and analysis. This does NOT include controlled dataset files hosted by BioData Catalyst.
Storage costs vary based on the amount of data a researcher stores, what type of disk or service they use for storing the data, and the services they select (AWS or GCP). For the most up-to-date information on storage rates, see these articles on Amazon S3 storage and Google Cloud Storage.
Compute costs vary and depend on a range of factors including:
The platform and cloud infrastructure provider where an analysis is performed
Workspace & cloud instance settings
Length of time to workflow completion
By default, any data uploaded or generated in a workspace is stored on a single cloud provider instance. If a researcher opts to move these files, they will be charged egress fees, otherwise known as data transfer fees. These fees will occur if they:
Transfer files to another cloud provider, OR
Download files to a local machine
Fees for data egress vary based on service providers and what actions a researcher takes.
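The storage, compute, and egress categories described above can be combined into a rough estimate. The sketch below is a simplification under stated assumptions: it uses flat per-unit rates, whereas real cloud pricing varies by provider, storage class, instance type, and usage tier. The `CloudRates` class, the `estimate_cost` function, and all example rates are hypothetical placeholders, not actual AWS or GCP prices.

```python
from dataclasses import dataclass


@dataclass
class CloudRates:
    """Flat per-unit rates in USD (placeholder values supplied by the caller)."""
    storage_per_gb_month: float
    compute_per_hour: float
    egress_per_gb: float


def estimate_cost(rates: CloudRates, stored_gb: float, months: float,
                  compute_hours: float, egress_gb: float) -> dict:
    """Return a per-category cost breakdown and the total, rounded to cents."""
    storage = stored_gb * months * rates.storage_per_gb_month
    compute = compute_hours * rates.compute_per_hour
    egress = egress_gb * rates.egress_per_gb
    return {
        "storage": round(storage, 2),
        "compute": round(compute, 2),
        "egress": round(egress, 2),
        "total": round(storage + compute + egress, 2),
    }


if __name__ == "__main__":
    # Made-up rates for illustration only; look up current provider pricing.
    example_rates = CloudRates(storage_per_gb_month=0.02,
                               compute_per_hour=2.0,
                               egress_per_gb=0.10)
    print(estimate_cost(example_rates, stored_gb=500, months=3,
                        compute_hours=10, egress_gb=100))
```

Platform support is left out of the sketch because it is arranged with the platforms rather than billed per unit.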
BioData Catalyst provides general support for researchers on all ecosystem platforms free of charge. If a researcher anticipates needing a large amount of support for specialized activities, such as organizing a large training workshop, they can reach out to the BioData Catalyst Coordinating Center (bdc3@renci.org) and/or the platform liaisons to discuss these needs as they develop their proposal.
Each NHLBI BioData Catalyst platform offers tools and tutorials to help you estimate your cloud costs. For information on these tools and how to run them, please see the below articles.
For more help on estimating your anticipated cloud costs, please contact the NHLBI BioData Catalyst help desk.
The batch workflow features of Terra provide support for computationally-intensive, long-running, and large-scale analysis.
You can run whole pipelines, from preprocessing and trimming sequencing data to alignment and downstream analyses, using Terra workflows. Workflows are written in the human-readable Workflow Description Language (WDL), and you can search for and import them into your workspace from Dockstore or the Broad Methods Repository.
Video: Data Analysis with Gen3, Terra and Dockstore
Article: How to import data from Gen3 into Terra and run the TOPMed aligner workflow
Article: Configure a workflow to process your data
Article: Getting workflows up and running faster with a JSON file
Article: Importing a Dockstore workflow into Terra
Video: Importing a Dockstore workflow into Terra walkthrough
Video: Workflows Quickstart walkthrough
Workspace: Workflows Quickstart workspace
Workspace for BioData Catalyst: TOPMed Aligner workspace
Workspace for BioData Catalyst: GWAS with 1000 Genomes and synthetic clinical data
Workspace for BioData Catalyst: GWAS with TOPMed data
The 2024-10-21 release marks the 19th release for the NHLBI BioData Catalyst® (BDC) ecosystem. This release includes several new features (e.g., supporting seqr genomics analysis, and exporting selected cohort data in PFB format). Please find more detail on the new features in the sections below.
The 2024-10-21 data releases include the addition of studies on asthma and sickle cell disease, plus new imaging from cardiovascular and atherosclerosis studies. Updates are highlighted for COPD, atrial fibrillation, and childhood asthma studies, and new additions include liver disease, myocardial genomics, and exRNA studies. The release also introduces the RECOVER-Pediatric project and the REDS-IV-P Epidemiology of COVID-19 study. Please refer to the Data Releases section below for more information as well as the Data page on the BDC website.
BDC Powered by Terra (BDC-Terra) now supports seqr genomics analysis: seqr provides rich gene and variant-level annotations and powerful filtration tools to perform variant searches within a family or across projects. To get started, check out the video tutorials, including a video describing how to load your data in seqr.
Export selected cohort data in Portable Format for Biomedical Data (PFB): BDC Powered by PIC-SURE (BDC-PIC-SURE) now allows researchers to export selected participant-level data in PFB file format. When using the Select and Package Data tool in Authorized PIC-SURE, simply choose “Package Data as PFB” to export in this file format.
The table below highlights which studies were included in the 2024-10-21 data release.
The latest release features NHLBI TOPMed projects such as the Severe Asthma Research Program (SARP) and Pharmacogenomics of Hydroxyurea in Sickle Cell Disease (PharmHU). Additionally, it includes new imaging XML schemas from the Cardiovascular Health Study (CHS) and the Multi-Ethnic Study of Atherosclerosis. Updates are also highlighted in the Boston Early-Onset COPD Study, Cleveland Clinic Atrial Fibrillation Study, and the Childhood Asthma Management Program (CAMP). New additions include the Human Liver Cohort and studies on myocardial genomics and exRNA profiles. The release also introduces the RECOVER-Pediatric project and the REDS-IV-P Epidemiology of COVID-19 study.
The data is now available for access across the entire ecosystem.
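The study accessions listed for this release (for example, phs001446.v3.p2.c1) follow the dbGaP accession pattern: a study identifier, then a study version (v), a participant set (p), and a consent group (c). A minimal sketch for splitting such a string into its parts is shown here. The names `Accession` and `parse_accession` are hypothetical, and the pattern assumes a six-digit study number, as in every accession listed for this release.

```python
import re
from typing import NamedTuple


class Accession(NamedTuple):
    study: str            # e.g. "phs001446"
    version: int          # number after ".v"
    participant_set: int  # number after ".p"
    consent_group: int    # number after ".c"


# Assumes the form phsNNNNNN.vN.pN.cN used in the release tables.
_ACCESSION_PATTERN = re.compile(r"^(phs\d{6})\.v(\d+)\.p(\d+)\.c(\d+)$")


def parse_accession(text: str) -> Accession:
    """Split a full study accession into its components."""
    match = _ACCESSION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a full study accession: {text!r}")
    study, version, participant_set, consent_group = match.groups()
    return Accession(study, int(version), int(participant_set), int(consent_group))


if __name__ == "__main__":
    print(parse_accession("phs001446.v3.p2.c1"))
```

Parsing the components separately makes it easy to group rows by study or to spot which consent groups of a study appear in a release.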
NHLBI TOPMed: Severe Asthma Research Program (SARP)
phs001446.v3.p2.c1
topmed-SARP_GRU
Yes
No
NHLBI TOPMed: Severe Asthma Research Program (SARP)
phs001446.v3.p2.c2
topmed-SARP_DS-AAI-PUB
Yes
No
NHLBI TOPMed: Pharmacogenomics of Hydroxyurea in Sickle Cell Disease (PharmHU)
phs001466.v2.p1.c2
topmed-pharmHU_DS-SCD-RD
Yes
No
NHLBI TOPMed: Pharmacogenomics of Hydroxyurea in Sickle Cell Disease (PharmHU)
phs001466.v2.p1.c3
topmed-pharmHU_DS-SCD
Yes
No
Cardiovascular Health Study (CHS) - Imaging
phs003639.v1.p1.c1
imaging-img_CHS_HMB-MDS
Yes
No
Cardiovascular Health Study (CHS) - Imaging
phs003639.v1.p1.c2
imaging-img_CHS_HMB-NPU-MDS
Yes
No
Cardiovascular Health Study (CHS) - Imaging
phs003639.v1.p1.c3
imaging-img_CHS_DS-CVD-MDS
Yes
No
Cardiovascular Health Study (CHS) - Imaging
phs003639.v1.p1.c4
imaging-img_CHS_DS-CVD-NPU-MDS
Yes
No
Multi-Ethnic Study of Atherosclerosis (Electrocardiogram Tracing Repository)
phs003703.v1.p1.c1
imaging-img_MESA_ECG_HMB
Yes
No
Multi-Ethnic Study of Atherosclerosis (Electrocardiogram Tracing Repository)
phs003703.v1.p1.c2
imaging-img_MESA_ECG_HMB-NPU
Yes
No
Sleep Heart Health Study (SHHS-BioLINCC)
phs003637.v1.p1.c1
BioLINCC-BL_SHHS_HMB-MDS
No
No
NHLBI TOPMed: Boston Early-Onset COPD Study
phs000946.v6.p2.c1
topmed-EOCOPD_DS-CS-RD
No
Yes
NHLBI TOPMed: Cleveland Clinic Atrial Fibrillation (CCAF) Study
phs001189.v5.p1.c1
topmed-CCAF_AF_GRU-IRB
No
Yes
NHLBI TOPMed: NHGRI CCDG: AF Biobank LMU in the context of the MED Biobank LMU
phs001543.v3.p1.c1
topmed-AFLMU_HMB-IRB-PUB-COL-NPU-MDS
No
Yes
NHLBI TOPMed - NHGRI CCDG: The GENetics in Atrial Fibrillation (GENAF) Study
phs001547.v3.p1.c1
topmed-GENAF_HMB-NPU
No
Yes
NHLBI TOPMed: Early-Onset Atrial Fibrillation in the Estonian Biobank
phs001606.v3.p1.c1
topmed-EGCUT_GRU
No
Yes
NHLBI TOPMed: NHGRI CCDG: The BioMe Biobank at Mount Sinai
phs001644.v3.p2.c1
topmed-BioMe_HMB-NPU
No
Yes
NHLBI TOPMed: Childhood Asthma Management Program (CAMP)
phs001726.v3.p1.c1
topmed-CAMP_DS-AST-COPD
No
Yes
Human Liver Cohort (HLC)
phs000253.v1.p1.c1
heartfailure-HLC_GRU
Yes
No
NHLBI Exome Sequencing in SCID
phs000479.v1.p1.c1
heartfailure-Exome_SCID_GRU
Yes
No
Familial Exome Sequencing in Rare Pediatric Phenotypes
phs000553.v1.p1.c1
heartfailure-FamExome_RarePeds_GRU-MDS
Yes
No
PCGC: Congenital Heart Disease Genetic Network Study
phs000571.v6.p2.c2
PCGC-CHD-GENES_DS-CHD
Yes
No
NHLBI GO-ESP: Family Studies (Mendelian Lipid Disorders)
phs000587.v1.p1.c1
heartfailure-Fam_MLD_DS-CLA
Yes
No
NextGen Consortium: iPS Derived Hepatocytes Study (PhLiPS Study)
phs001341.v1.p1.c1
heartfailure-PhLiPS_GRU
Yes
No
Myocardial Applied Genomics Network (MAGNet) Study
phs001539.v4.p1.c1
heartfailure-MAGNet_HMB-MDS
Yes
No
Cardiovascular ATVB: Atherosclerosis Thrombosis and Vascular Biology
phs001592.v1.p1.c1
heartfailure-CardioATVB_DS-CVD
Yes
No
Profiles of exRNA in CSF and Plasma from Subarachnoid Hemorrhage Patients
phs001759.v1.p1.c1
heartfailure-exRNA_CSF_HMB
Yes
No
miRNA Profiling of Maternal and Non-Maternal Healthy Adult Blood Plasma Using Small RNA-Sequencing
phs001892.v1.p1.c1
heartfailure-miRNA_Maternal_Plasma_GRU
Yes
No
NHLBI TOPMed: NHGRI CCDG: UCSF Atrial Fibrillation Study
phs001933.v2.p1.c1
topmed-UCSF_Afib_HMB-MDS
Yes
No
NIH RECOVER-Pediatric: Understanding the Long-Term Impact of COVID on Children and Families
phs003461.v1.p1.c1
RECOVER-RC_Pediatrics_GRU
Yes
No
REDS-IV-P Epidemiology, Surveillance and Preparedness of the Novel SARS-CoV-2 Epidemic (RESPONSE)
phs003578.v1.p1.c1
REDS-RESPONSE_GRU
Yes
No
Sudden Cardiac Death in Heart Failure Trial (SCD-HeFT-BioLINCC)
phs003654.v1.p1.c1
BioLINCC-BL_SCD-HeFT_GRU
Yes
No
NHLBI TOPMed: Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE)
phs001472.v3.p2.c1
topmed-ECLIPSE_DS-CS-MDS-RD
No
Yes
NHLBI TOPMed: Characterizing the Response to a Leukotriene Receptor Antagonist and an Inhaled Corticosteroid (CLIC)
phs001729.v3.p1.c2
topmed-CARE_CLIC_DS-ASTHMA-IRB-COL
No
No
TRanscriptomic ANalySis of left ventriCulaR gene Expression (TRANSCRibE)
phs001679.v1.p1.c1
heartfailure-TRANSCRibE_GRU
Yes
No
TRanscriptomic ANalySis of left ventriCulaR gene Expression (TRANSCRibE)
phs001679.v1.p1.c2
heartfailure-TRANSCRibE_DS-CI
Yes
No
Molecular Genetics of Heterotaxy and Related Congenital Heart Defects
phs001814.v1.p1.c1
heartfailure-MolGen_CHD_GRU
Yes
No
NHLBI TOPMed: Whole Genome Sequencing of Venous Thromboembolism (WGS of VTE)
phs001402.v3.p1.c1
topmed-Mayo_VTE_GRU
No
Yes
NHLBI TOPMed: My Life Our Future (MLOF) Research Repository of Patients with Hemophilia A (Factor VIII Deficiency) or Hemophilia B (Factor IX Deficiency)
phs001515.v2.p2.c1
topmed-MLOF_HMB-PUB
No
Yes
NHLBI TOPMed: Trans-Omics for Precision Medicine (TOPMed) Whole Genome Sequencing Project: Cardiovascular Health Study
phs001368.v4.p2.c1
topmed-CHS_HMB-NPU-MDS
No
Yes
NHLBI TOPMed: Trans-Omics for Precision Medicine (TOPMed) Whole Genome Sequencing Project: Cardiovascular Health Study
phs001368.v4.p2.c2
topmed-CHS_HMB-MDS
No
Yes
NHLBI TOPMed: Trans-Omics for Precision Medicine (TOPMed) Whole Genome Sequencing Project: Cardiovascular Health Study
phs001368.v4.p2.c3
topmed-CHS_DS-CVD-NPU-MDS
No
Yes
NHLBI TOPMed: Trans-Omics for Precision Medicine (TOPMed) Whole Genome Sequencing Project: Cardiovascular Health Study
phs001368.v4.p2.c4
topmed-CHS_DS-CVD-MDS
No
Yes
NHLBI TOPMed: San Antonio Family Heart Study (SAFHS)
phs001215.v4.p2.c1
topmed-SAFHS_DS-DHD-IRB-PUB-MDS-RD
No
Yes
NHLBI TOPMed: The Genetic Epidemiology of Asthma in Costa Rica
phs000988.v6.p1.c1
topmed-CRA_DS-ASTHMA-IRB-MDS-RD
No
Yes
NHLBI TOPMed: Pulmonary Fibrosis Whole Genome Sequencing
phs001607.v3.p2.c1
topmed-IPF_DS-ILD-IRB-NPU
No
Yes
NHLBI TOPMed: Pulmonary Fibrosis Whole Genome Sequencing
phs001607.v3.p2.c2
topmed-IPF_DS-LD-IRB-NPU
No
Yes
NHLBI TOPMed: Pulmonary Fibrosis Whole Genome Sequencing
phs001607.v3.p2.c3
topmed-IPF_DS-PFIB-IRB-NPU
No
Yes
NHLBI TOPMed: Pulmonary Fibrosis Whole Genome Sequencing
phs001607.v3.p2.c4
topmed-IPF_DS-PUL-ILD-IRB-NPU
No
Yes
NHLBI TOPMed: Pulmonary Fibrosis Whole Genome Sequencing
phs001607.v3.p2.c5
topmed-IPF_HMB-IRB-NPU
No
Yes
NHLBI TOPMed: Pulmonary Fibrosis Whole Genome Sequencing
phs001607.v3.p2.c6
topmed-IPF_DS-LD-IRB-COL-NPU
No
Yes
NIH RECOVER: A Multi-Site Observational Study of Post-Acute Sequelae of SARS-CoV-2 Infection in Adults
phs003463.v2.p2.c1
RECOVER-RC_Adult_GRU
No
Yes
NHLBI TOPMed: Women's Health Initiative (WHI)
phs001237.v3.p1.c1
topmed-WHI_HMB-IRB
No
Yes
NHLBI TOPMed: Women's Health Initiative (WHI)
phs001237.v3.p1.c2
topmed-WHI_HMB-IRB-NPU
No
Yes
NHLBI TOPMed: Genomic Activities such as Whole Genome Sequencing and Related Phenotypes in the Framingham Heart Study
phs000974.v5.p4.c1
topmed-FHS_HMB-IRB-MDS
No
Yes
NHLBI TOPMed: Genomic Activities such as Whole Genome Sequencing and Related Phenotypes in the Framingham Heart Study
phs000974.v5.p4.c2
topmed-FHS_HMB-IRB-NPU-MDS
No
Yes
NHLBI TOPMed - NHGRI CCDG: Atherosclerosis Risk in Communities (ARIC)
phs001211.v4.p3.c1
topmed-ARIC_HMB-IRB
No
Yes
NHLBI TOPMed - NHGRI CCDG: Atherosclerosis Risk in Communities (ARIC)
phs001211.v4.p3.c2
topmed-ARIC_DS-CVD-IRB
No
Yes
NHLBI TOPMed: MESA and MESA Family AA-CAC
phs001416.v3.p1.c1
topmed-MESA_HMB
No
Yes
NHLBI TOPMed: MESA and MESA Family AA-CAC
phs001416.v3.p1.c2
topmed-MESA_HMB-NPU
No
Yes
NHLBI TOPMed: Pediatric Cardiac Genomics Consortium (PCGC)'s Congenital Heart Disease Biobank
phs001735.v2.p1.c1
topmed-PCGC_HMB
No
Yes
NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE)
phs000921.v5.p2.c2
topmed-SAGE_DS-LD-IRB-COL
No
Yes
NHLBI TOPMed - NHGRI CCDG: Genes-Environments and Admixture in Latino Asthmatics (GALA II)
phs000920.v6.p4.c2
topmed-GALAII_DS-LD-IRB-COL
No
Yes
NHLBI TOPMed: Pulmonary Fibrosis Whole Genome Sequencing
phs001607.v3.p2.c1
topmed-IPF_HMB-IRB-NPU
No
Yes
NHLBI TOPMed: Pulmonary Fibrosis Whole Genome Sequencing
phs001607.v3.p2.c2
topmed-IPF_DS-LD-IRB-NPU
No
Yes
NHLBI TOPMed: Pulmonary Fibrosis Whole Genome Sequencing
phs001607.v3.p2.c3
topmed-IPF_DS-ILD-IRB-NPU
No
Yes
NHLBI TOPMed: Pulmonary Fibrosis Whole Genome Sequencing
phs001607.v3.p2.c4
topmed-IPF_DS-PFIB-IRB-NPU
No
Yes
NHLBI TOPMed: Pulmonary Fibrosis Whole Genome Sequencing
phs001607.v3.p2.c5
topmed-IPF_DS-PUL-ILD-IRB-NPU
No
Yes
The Collaborative Cohort of Cohorts for COVID-19 Research (C4R)
phs003045.v1.p1.c1
COVID19-C4R_CARDIA_HMB
Yes
No
The Collaborative Cohort of Cohorts for COVID-19 Research (C4R)
phs003045.v1.p1.c2
COVID19-C4R_CARDIA_HMB-NPU
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Subpopulations and Intermediate Outcome Measures in COPD Study (SPIROMICS)
phs002909.v1.p1.c1
COVID19-C4R_SPIROMICS_GRU
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Subpopulations and Intermediate Outcome Measures in COPD Study (SPIROMICS)
phs002909.v1.p1.c2
COVID19-C4R_SPIROMICS_GRU_NPU
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Subpopulations and Intermediate Outcome Measures in COPD Study (SPIROMICS)
phs002909.v1.p1.c3
COVID19-C4R_SPIROMICS_COPD
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Subpopulations and Intermediate Outcome Measures in COPD Study (SPIROMICS)
phs002909.v1.p1.c4
COVID19-C4R_SPIROMICS_COPD_NPU
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Subpopulations and Intermediate Outcome Measures in COPD Study (SPIROMICS)
phs002909.v1.p1.c5
COVID19-C4R_SPIROMICS_GRU_COL
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Subpopulations and Intermediate Outcome Measures in COPD Study (SPIROMICS)
phs002909.v1.p1.c6
COVID19-C4R_SPIROMICS_GRU-NPU-COL
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Subpopulations and Intermediate Outcome Measures in COPD Study (SPIROMICS)
phs002909.v1.p1.c7
COVID19-C4R_SPIROMICS_COPD-COL
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Subpopulations and Intermediate Outcome Measures in COPD Study (SPIROMICS)
phs002909.v1.p1.c8
COVID19-C4R_SPIROMICS_COPD-NPU-COL
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Multi-Ethnic Study of Atherosclerosis (MESA)
phs003017.v1.p1.c1
COVID19-C4R_MESA_HMB
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Multi-Ethnic Study of Atherosclerosis (MESA)
phs003017.v1.p1.c2
COVID19-C4R_MESA_HMB-NPU
Yes
No
BDC Powered by Gen3 release notes
BDC Powered by Terra release notes
BDC Powered by Seven Bridges release notes
BDC Powered by PIC-SURE release notes
Guidance on writing BDC into a research proposal
BDC is a cloud-based ecosystem that empowers researchers to analyze phenotypic and genotypic heart, lung, blood, and sleep data. Researchers on NHLBI BioData Catalyst have access to a number of controlled and open datasets and can also bring their own data to the ecosystem for analysis.
This document intends to serve as a resource for researchers writing NHLBI BioData Catalyst into grant proposals.
The BDC ecosystem leverages two well-known cloud computing services, Google Cloud Platform (GCP) and Amazon Web Services (AWS), to perform computational analysis and store data. Users may scale their workloads up or down by toggling the virtual machine (VM) instance size and attached storage, as well as horizontally scale workloads by specifying a number of parallel instances. Increasing compute power, storage, and parallelization has an associated increase in cost, which is estimated for the researcher.
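The relationship between instance size, parallelization, and cost described above can be sketched as simple arithmetic. This is a minimal illustration only: the function name and the hourly rates are hypothetical placeholders, not actual GCP or AWS prices, and real estimates should come from the platform's own cost tools.

```python
def estimate_compute_cost(hourly_rate_usd, hours, instances=1):
    """Estimated cost of running `instances` identical VMs for `hours` each.

    All inputs are supplied by the caller; nothing here reflects real pricing.
    """
    if hourly_rate_usd < 0 or hours < 0 or instances < 1:
        raise ValueError("rate and hours must be non-negative, instances >= 1")
    return hourly_rate_usd * hours * instances


# Hypothetical rates chosen only to show how the estimate grows.
small_vm = estimate_compute_cost(0.10, hours=10)               # vertical: small VM
large_vm = estimate_compute_cost(0.40, hours=10)               # vertical: larger VM
parallel = estimate_compute_cost(0.40, hours=10, instances=5)  # horizontal: 5 VMs

print(f"small VM: ${small_vm:.2f}, large VM: ${large_vm:.2f}, 5 large VMs: ${parallel:.2f}")
```

A larger instance or more parallel instances may also shorten the run time, so the hours term is not fixed in practice; the sketch keeps it constant for simplicity.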
The platforms within the BDC ecosystem come equipped with cloud workspaces containing workflows and analysis tools. Depending on the platform, workflows may be available in WDL (Workflow Description Language) or CWL (Common Workflow Language), and are accessible from Dockstore.org, the Seven Bridges public gallery, or the Broad Methods Repository. In total, these sources contribute over 2,000 workflows. Additionally, researchers may access standard analytical tools such as RStudio, JupyterLab, Jupyter Notebooks, and SAS Studio.
For information on the different costs you should budget for and how to estimate costs, see Incurring Cloud Costs.
The sample language below can serve as a resource when preparing a budget justification that includes NHLBI BioData Catalyst cloud costs in a proposal.
Note: In the following sample language, items in [brackets] mark where you should insert your details.
All users with appropriate access credentials will have access to data hosted on BioData Catalyst. Controlled and open datasets already hosted on NHLBI BioData Catalyst will not incur storage costs. Our data storage budget will fund the storage of any derived results data (e.g., temporary and secondary files generated as a result of analyses on hosted data) and/or [XX TB] of data we plan to upload using the Bring Your Own Data tool. Data storage estimates were generated using amounts pulled on [MM/DD/YYYY estimate was generated].
The BioData Catalyst ecosystem features several platforms with secure workspaces where researchers can run workflow analyses of genomic and phenotypic data. Our estimated analysis costs include [insert time amount] of analyst time, as well as an overall compute estimate generated with BioData Catalyst documentation help.
BioData Catalyst, and all files generated by it, are hosted on the Google Cloud Platform and Amazon Web Services. We anticipate that, during the course of this project, some data will be subject to egress charges as a result of transferring across cloud providers or downloading data to local compute infrastructure. We currently anticipate [insert data estimate] will be subject to egress charges each year. Our estimated egress costs are based on pricing information gathered on [MM/DD/YYYY estimate was generated].
After consulting with members of the BioData Catalyst support team, we anticipate needing to purchase support time for additional training not covered under the standard provisions. The support team estimated we will require [insert estimated funding amount] to purchase this additional time.
To request a Letter of Support from the BioData Catalyst Coordinating Center, email bdc3@renci.org with the following information:
Researcher Name
Role in BioData Catalyst, if any
For example: Fellow, graduate student working with a Fellow, and so on
Project title
Brief project description
Brief description of how you plan to use BioData Catalyst in your project
What resources you might need from the Consortium, if any
For example: training, ingestion of new data, and so on
What resources you might add to the Consortium, if any
For example: New workflows or tools
How to cite and acknowledge NHLBI BioData Catalyst® (BDC)
For citation of BDC:
National Heart, Lung, and Blood Institute, National Institutes of Health, U.S. Department of Health and Human Services (2020). The NHLBI BioData Catalyst. Zenodo. https://doi.org/10.5281/zenodo.3822858
To acknowledge BDC, use:
The authors wish to acknowledge the contributions of the consortium working on the development of the NHLBI BioData Catalyst® (BDC) ecosystem.
In this section, the reader will learn how to use the Terra and Dockstore platforms for the creation of WDL workflows for analysis and sharing with the scientific community. Below we have compiled community and BioData Catalyst resources to help users get started learning WDL to create their own workflows.
Workspace: A dedicated space where you and collaborators can access and organize the same data and tools and run analyses together. They can include: data, notebooks, and workflows. They can be public or controlled access.
Workflow: Chains of connected tools to accomplish a full analysis. Tools are often connected in a specific way to enable maximum computational efficiency and are also constructed to accomplish a specific analysis goal. A workflow typically describes a full analysis (e.g. variant discovery, differential expression, or multiple variant association tests).
Workflow Description Language (WDL): A community-driven standard for describing data analysis pipelines that is easily portable across different computing environments. It is the language currently used to run batch processes in Terra, which uses Cromwell as an executor. Like other descriptor languages, it is paired with Docker containers and can execute pipelines written in any language (bash, R, Python, etc.).
Authoring:
SublimeText offers a nice balance between usability and editing features.
Syntax highlighters: Plugins that enable syntax highlighting (i.e. coloring code elements based on their function) for supported text editors. Syntax highlighting has been developed for SublimeText, Visual Studio, vim, and IntelliJ.
Visualization: Pipeline Builder is a web-based tool that creates an interactive graphical representation of any workflow written in WDL; also includes WDL code generation functionality.
Validation & inputs: WOMTool is a Java command-line tool co-developed with WDL that performs utility functions, including syntax validation and generation of input JSON templates. See the doc entries on validation and inputs for quickstart instructions.
Running tools:
Terra is a cloud-based analysis platform for running workflows written in WDL via Cromwell on Google Cloud; it is open to the public and offers sophisticated data and workflow management features. In this BYOT document, we walk through all of the steps to run a workflow in Terra.
Wdl_runner is a lightweight command-line workflow submission system that runs WDLs via Cromwell on Google Cloud.
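To make the "inputs JSON" mentioned under Validation & inputs more concrete, the sketch below fills in such a file by hand. It assumes the usual convention that workflow-level inputs are keyed as `<workflow name>.<input name>`; the workflow name `qc_wf`, the input names, and the bucket path are hypothetical, and in practice the template would come from WOMTool rather than from this helper.

```python
import json


def build_inputs(workflow_name, values):
    """Qualify each input name with the workflow name, one key per input."""
    return {f"{workflow_name}.{name}": value for name, value in values.items()}


# Hypothetical workflow and inputs, used only to show the key layout.
inputs = build_inputs(
    "qc_wf",
    {"input_bam": "gs://my-bucket/sample.bam", "threads": 4},
)

# The resulting text is what would be saved as the inputs JSON file.
print(json.dumps(inputs, indent=2))
```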
Below are a few learning resource tutorials we have compiled from various sources:
Open WDL’s Learn WDL offers a comprehensive set of exercises for users that are just learning WDL.
Getting Started with WDL from Dockstore is an introductory guide.
These Dockstore training exercises along with this accompanying video provide more complex examples using common bioinformatics tools.
Once you are more familiar with writing workflows, we suggest you continue with WDL Best Practices from Dockstore.
You can start developing your WDL workflow locally with Dockstore’s CLI and a small test dataset. This route allows you to debug syntax errors while avoiding cloud costs. Once your workflow is debugged, you can launch in a cloud environment to test for permissions errors and scaling issues. The Dockstore CLI automatically installs the Cromwell execution engine for running WDL workflows locally.
Instructions:
Install Dockstore’s CLI locally
Install Docker locally
This example WDL exercise using Dockstore’s CLI steps through creating a basic WDL workflow locally and pushing the tool to GitHub, triggering an automated build on Quay.io.
In order to transition your workflow from local development to Terra, a typical approach is to make the workflow available in a GitHub repository and then build. Quay.io integrates with Dockstore and GitHub by automatically building upon GitHub pushes. The Quay.io build can then be registered on Dockstore. You can follow the steps for linking your Dockstore account to external services like Quay.io in this document.
You can find more information about this process in the section Version Control, Publishing, and Validation of Workflows below.
Now that you have a workflow ready for running in a cloud environment, you can port your workflow into Terra in two ways. First, if you are already using Dockstore and GitHub for version control, you can navigate to your Dockstore WDL workflow and use the "Launch with NHLBI BioData Catalyst" button. The article "Importing a Dockstore workflow into Terra" provides instructions for selecting a workflow in Dockstore and then importing it into Terra.
Figure 1. Dockstore’s “Launch with BioData Catalyst” button.
If you haven’t published your workflow to Dockstore, you can also upload a workflow directly into Terra using the Broad Methods Repository. The Broad Methods Repository can be found in the “Add workflows” section of your Terra workspace. Similar to Dockstore, this repository hosts many WDL workflows that have been created by the Terra community. These workflows are visible only after a user has signed in to Terra.
Figure 2. In Terra workspaces, when you are in the "Workflows" tab you can “Find Additional Workflows” from Dockstore and the Broad Methods Repository.
Once your workflow is in Terra, you may want to check out some of the learning resources below for configuring, troubleshooting, and optimizing your workflow. There are likely additional configuring and troubleshooting steps needed for getting your workflow up and running on larger datasets hosted in the cloud.
Terra also has several tips for reducing costs and improving the efficiency of a workflow. These approaches include deleting intermediate files and returning only final output to limit storage costs. Virtual machines can also be configured with lower-cost settings, such as preemptible machines, which trade a lower price for the risk of interruption. Cost optimizations are described at the following links:
Once your workflow is working as expected, we ask that you publish your work to share with the research community. You can find resources for how to publish your work on GitHub and Dockstore in the section below titled Version Control, Publishing, and Validation of Workflows.
In this section, the reader will first learn how the Seven Bridges Software Development Kit (SDK) enables the easy creation of CWL workflows that will benefit the greater BioData Catalyst community. We will review the benefits of the SDK and then walk through an example of workflow creation, testing, and scaling. There are also links to more detailed resources for further reading.
Wrapping: The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.
Tool: A CWL description of a reusable piece of software that performs one specific function. An example is the bwa read alignment tool which can be applied to multiple workflows in different contexts. Tools need to have several things specified in the CWL description that includes Docker image, Linux base command, input files (or parameters), and output files. Tools can be used in completely disparate workflows and can be thought of as building blocks.
Workflow: Chains of connected tools to accomplish a full analysis. Tools are often connected in a specific way to enable maximum computational efficiency and are also constructed to accomplish a specific analysis goal. Whereas tools describe a single software step (e.g. alignment or read sorting), a workflow describes a full analysis (e.g. variant discovery, differential expression, or multiple variant association tests).
App: An app is a general term to refer to both tools and workflows. The platform user will typically only see the term “app” in reference to mixed groups of tools and workflows, such as in the Public Apps Gallery, the Apps collection tab, or within a workspace.
Throughout this guide, it will be useful for the reader to refer to our documentation found for each section. For the Seven Bridges Software Development Kit documentation, please see the following: https://sb-biodatacatalyst.readme.io/docs/sdk-overview
NHLBI BioData Catalyst Powered by Seven Bridges (Seven Bridges) provides a full Software Development Kit (SDK) that enables the BioData Catalyst community to easily create CWL apps that can be tested and scaled up to production level directly on the platform. Once validated by the user, these workflows can be exported and published on Dockstore so that other users can search for and find them.
The SDK consists of a tool editor and a pipeline editor. Both are based on the open-source project Rabix, a portmanteau of "Reproducible Analyses for Bioinformatics" (for more information, see rabix.io). The goal of the SDK is to guide the user through the process of creating fully functional analytical pipelines that can be tested, scaled up to population-scale analysis, and shared with the research community. The SDK also has built-in version control at the tool and workflow level to enable the full reproducibility of previous versions.
The Tool Editor guides the user through the creation of a portable CWL description by linking a pre-built Docker image (see section Working with Docker) to a command line or script that will be run inside the container. The above image shows the tool wrapping process. The Tool Editor enables users to easily create CWL by filling out the GUI template (Figure 4). This simplifies the technical aspects of this process and makes it as easy as possible for users to get their tools set up on the platform. The CWL code can also be edited directly in the tool editor if that is desired. For users working with JavaScript, JavaScript dynamic expressions can be tested without having to leave the tool editor.
Learn more via this tutorial.
The Workflow Editor enables users to create full pipelines by linking together multiple discrete tools. The workflow editor is a drag-and-drop visual interface that makes it easy to build even the most complex pipelines.
Before we dive into more detail on how to use the Tool Editor and the Workflow Editor, it is important to understand the distinction between tools and workflows. The distinction is only present in the CWL, and it is an important one. Wrapping a tool requires knowledge of Docker and Linux command lines. The Tool Editor helps the user get past even the most technical and dynamic of command-line and script issues, with the goal being the creation of a reusable and shareable component. For building workflows, the Docker and Linux command lines are abstracted away to enable less-technical users to build full analytical pipelines. We can refer to this as “separation of concerns.” Each tool should be designed to handle one functional aspect, and therefore will be able to operate in multiple analytical pipelines. For example, BWA-MEM or the Samtools suite can be used in both DNA analysis workflows and RNA analysis workflows.
Linking together multiple tools into a full computational pipeline has many advantages, so it is important to understand the benefits of building a full and robust workflow that includes each of the analysis steps. The most apparent benefit is that it makes the entire pipeline easier to share, as there will only be one resulting CWL file. The CWL file is a human-readable text file that can be distributed digitally in multiple ways, such as through Dockstore, Seven Bridges, GitHub, or over email. A novice user can easily reproduce the full analysis using one file. They can also use the SDK to make adjustments if necessary, or even decompose the workflow to get at the constituent tools for use in other contexts (more on this below in the section Version Control, Publishing, and Validation of Workflows). The Seven Bridges platform has built-in optimizations to execute a workflow for maximum efficiency and cost savings. For example, workflows only save final output files back to the project. Since intermediate files from earlier steps in the workflow are not saved, they do not accumulate cloud storage costs, saving funds that would otherwise be used for intermediate file object storage. Users can still make use of intermediate files for subsequent reruns of a task by turning on “memoization” for that task, in which case intermediate files will be re-used where appropriate.
Finally, linking multiple tools together also has the added benefit of increasing computational efficiency. When running workflows, multiple tools can use the same compute instance if multiple CPU cores are available. This saves time and funds and increases the ability to run jobs in parallel with no additional configurations.
In the following sections, we will build the workflow in the above image. Here, we can visually see the importance of creating a workflow: running each of these tools separately would require more steps from the user and require more unnecessary data to be moved back and forth from the cloud computational instance to the user’s workspace. Therefore, running as a single workflow achieves the best efficiency.
Before getting started with this section, we recommend first creating a development workspace (called projects on Seven Bridges) to house the new tool(s) and workflow(s) while they are being created and tested. Please see the Seven Bridges Getting Started Guide for detailed instructions about how to create projects.
Figure 6 shows all the options available when creating a project on Seven Bridges including selecting the Billing Group. If used conservatively, the NHLBI BioData Catalyst pilot funding is adequate to cover the costs associated with developing a tool or workflow on the platform.
For the purposes of this tutorial, we will create a Next-Generation Sequencing (NGS) alignment Quality Control (QC) workflow as an example problem. BioData Catalyst hosts data from TOPMed, and TOPMed studies generally have the most up-to-date alignments to HG38. Therefore, for this example tutorial we will (1) create a pipeline that can be used to make sure these CRAMs have high-quality reads, and (2) perform alignment read depth QC. We will also show how to bring a new tool to the platform that will combine the outputs of the previous tools.
Researchers should outline their pipeline into individual steps. These steps should correspond to individual software executables (e.g., bwa, samtools) or scripts (e.g., R, Python, shell).
A great place to outline your tool is in your development project description, shown below:
It is important to determine if there are tools (steps in your outline) that have already been wrapped and are published in either Dockstore or the Seven Bridges Public Apps Gallery. This reduces the time in porting analytical workflows to the cloud because these steps will not have to be re-validated or re-benchmarked. This also promotes developing with “separation of concerns.” This means that every tool (step) can be versioned, tested, and improved without adversely affecting the entire workflow. We recommend searching the Seven Bridges Public Apps Gallery and the BioData Catalyst Organization on Dockstore to find validated and reusable components.
Tools from the Seven Bridges Public Apps Gallery can be easily imported directly into your project. These apps have been validated and optimized for the cloud. By re-using existing tools, the development time is dramatically reduced.
Searching the Public Apps Gallery reveals that CWL tools are available for Fastqc and Picard CollectWgsMetrics. Therefore, the only tool that needs to be wrapped is MultiQC.
As described previously, the process of describing a command-line tool or script in CWL so that it can be run in a cloud environment either by itself or in a larger workflow is known as wrapping. Let us proceed with wrapping our MultiQC tool. The first step is to either (1) create a Docker image from a Docker build file or (2) find one available to us on a hosted repository. Next, we should run the Docker image locally to test out the basic command line. If a Docker image was previously created and hosted for us, we can use that to save time. On the other hand, if the software programs are not available in a single Docker image, you will need to build one. Please see the section on Working with Docker for more information on creating images.
For this example, a MultiQC Docker image is available for us via biocontainers.pro, with the image hosted at quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0.
To use the SDK tool editor to wrap MultiQC, we will follow these steps in the development project:
Step 1: In the development project, click on the Apps tab and select “Add app.”
Step 2: On the next screen, select “Create a Tool.”
Step 3: Name your tool “MultiQC” and create a version CWL 1.0 tool. This will automatically take you to the visual Tool Editor.
Step 4: To complete our wrapping of MultiQC, we need to fill in the Docker Image, Base Command, Input Ports, and Outputs sections in the Visual Editor.
The Docker image/repository is quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0.
For this example, the base command (see screenshot below) is simply: multiqc (See Creating a tool from a script for how this will look for your own custom script).
For this example, we just need one input port which is an array of our quality control files from the upstream apps. We do not need any additional parameters for this example. If you are wrapping your own custom script, you can configure multiple input ports of different types.
Please see Figure 8 for where to fill in the details.
Be sure to create an input of type “array” which has items of type “File.” The MultiQC executable does not require that inputs of different types be prefaced with any flags or indicators. When wrapping an executable that requires distinguishing inputs (e.g. “--arg1 --arg2”), multiple inputs would need to be added.
The tool editor gives the user a preview of the resulting command line:
This is a relatively simple one and the arguments are just the full paths in the input file array. Features such as file metadata and JavaScript expression can be used to create a more sophisticated Linux command line for other tools.
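As a rough illustration of what that preview amounts to, the sketch below assembles the command line from a base command and an array of input paths. The function name and the file paths are hypothetical; the actual preview is generated by the Tool Editor from the CWL description.

```python
import shlex


def build_command(input_paths, base_command="multiqc"):
    """Join the base command and the input file paths into one shell-safe line."""
    return shlex.join([base_command, *input_paths])


# Hypothetical upstream QC outputs passed in as the input file array.
qc_files = ["/path/to/sample_fastqc.zip", "/path/to/sample.wgs_metrics.txt"]
print(build_command(qc_files))
```

Using `shlex.join` keeps paths containing spaces or other special characters safely quoted, which is one of the details a wrapper has to get right.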
The output port is the comprehensive MultiQC report (Figure 9) that the software tool creates. For this output, we will use a wildcard inside the “glob” field. The “glob” is simply how the tool will select which files to keep from the current working directory. The user can create as many output ports as necessary. Since MultiQC is a simplification and summarization tool we will only have one HTML report which can be acquired using a glob of “*.html.”
The output “glob” field (like all fields in the tool editor) has the ability to use JavaScript expressions to dynamically search for files in a very specific manner such as the full path that is based on the input files or to scan through a deep folder structure.
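For readers who prefer to see the pieces as text, the following sketch assembles a CWL v1.0 `CommandLineTool` description for this example as a Python dictionary and prints it as JSON (CWL documents may be written in JSON as well as YAML). It is an approximation under stated assumptions, not the exact document the Tool Editor exports: the port names `qc_files` and `report` are hypothetical, and the small `kept_outputs` helper only imitates glob matching on file names to show which files an output port would keep.

```python
import json
from fnmatch import fnmatchcase


def make_multiqc_tool():
    """Build a minimal CWL v1.0 CommandLineTool description for MultiQC."""
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "id": "multiqc",
        "baseCommand": ["multiqc"],
        "requirements": [
            {
                "class": "DockerRequirement",
                "dockerPull": "quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0",
            }
        ],
        # One input port: an array of files, placed after the base command.
        "inputs": [
            {"id": "qc_files", "type": "File[]", "inputBinding": {"position": 1}}
        ],
        # One output port: the HTML report, selected with a wildcard glob.
        "outputs": [
            {"id": "report", "type": "File", "outputBinding": {"glob": "*.html"}}
        ],
    }


def kept_outputs(tool, working_dir_files):
    """For each output port, list the file names matching that port's glob."""
    return {
        port["id"]: [
            name
            for name in working_dir_files
            if fnmatchcase(name, port["outputBinding"]["glob"])
        ]
        for port in tool["outputs"]
    }


tool = make_multiqc_tool()
print(json.dumps(tool, indent=2))

# Hypothetical contents of the working directory after the tool has run.
print(kept_outputs(tool, ["multiqc_report.html", "multiqc_data.json", "run.log"]))
```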
The completed tool will look like this:
Finally, we should consider the Computational Resources section of the Tool Editor. Here it is important to specify the minimum compute required. Because our example tool is not computationally intensive we can require a minimal amount of RAM and CPU. Through JavaScript dynamic expressions we can customize these computational requirements to scale with either the input file sizes or user input parameters. The Seven Bridges job scheduler will select the appropriate cloud instance(s) based on these constraints. In the next section, we will discuss how the user can also specify a suggested AWS or GCP cloud instance by adding “hints.”
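The idea of scaling resource requirements with input size can be sketched as follows. On the platform this logic would be written as a JavaScript expression inside the tool; the Python below only illustrates the arithmetic, and the base amount and per-gigabyte increment are hypothetical values, not recommendations.

```python
import math


def ram_mb_for_inputs(input_sizes_bytes, base_mb=1000, mb_per_input_gb=500):
    """Suggest a minimum RAM (in MB) that grows with total input size.

    base_mb and mb_per_input_gb are illustrative placeholders.
    """
    total_gb = sum(input_sizes_bytes) / 1024**3
    return base_mb + math.ceil(total_gb * mb_per_input_gb)


GIB = 1024**3
print(ram_mb_for_inputs([]))                # no inputs: only the base amount
print(ram_mb_for_inputs([2 * GIB]))         # one 2 GiB file
print(ram_mb_for_inputs([2 * GIB, 4 * GIB]))  # two files, 6 GiB in total
```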
Tools can be tested by themselves, but in some cases, it makes more sense to test the tool in the context of the complete workflow. For simplicity, we will add MultiQC to the workflow and use the output from the tool upstream of MultiQC in the workflow to test the MultiQC tool.
Finding appropriate test data is key to testing tools and workflows. Wherever possible, we recommend working with data that is small in size when testing tools and workflows. Small in size generally means a small size on disk and usually correlates to a smaller number of NGS reads, a smaller number of variants, or a smaller number of samples. Sometimes this small data is referred to as a “toy” dataset or a “subset” of data. Testing the tool wrapper will generally require multiple test runs using this small data set.
Seven Bridges hosts a number of test files in the Public Files Gallery that range from reference files to test size input data. Users can link these test files to their project instead of uploading their own test data to avoid storage costs. One of these test files is the human whole-exome sequencing sample merged-normal.bam which we will use for testing here. You can view the provenance of this test file by clicking on the file name and then on “metadata”:
This file is a “subset” of the whole exome data and is, therefore, a good choice for testing since the cost per analysis will be less than if data from all chromosomes were used. Tools should always be tested separately. When wrapping a tool the user should obtain access to data they can use for testing. The above metadata description also tells us the exact reference that was used for the read alignment. Seven Bridges also has the same reference file in the Public Files Gallery called human_g1k_v37_decoy.fasta.
Make sure to copy both testing files to your development project. Because these files are hosted in the Public Files Gallery, linking these files to your project will not lead to any additional storage costs.
The next step is to add our tool to a workflow with the upstream QC tools. We will use the pipeline editor to do this.
Step 1. The first step is to create a new “blank canvas” in the workflow editor. Go to the Apps tab in the development project and click on “Add app.” This time select “Create a workflow”.
Step 2. After creating the workflow, the next screen is a blank canvas in the Workflow Editor. From here, we can add multiple QC apps that are compatible with MultiQC to the canvas directly from the Public Apps Gallery. Search for “fastqc” and then for “picard alignment metrics” and use the mouse to drag them onto the workflow canvas.
Add the MultiQC CWL tool from the current project in the “My Projects” tab. The screen will look like this:
The next step is to connect the apps together. The nodes that are displayed on the workflow canvas represent apps. The input and output ports are represented by small circles on the perimeter of the node. Circles on the left of the node represent input ports whereas the ones on the right indicate output ports. Use the mouse to connect the wireframe together. The completed workflow will look like this:
This simple workflow highlights several advantages of the workflow editor. Notice that the “input file” input port node which represents an aligned bam file for this workflow feeds into both the Picard CollectWgsMetrics and FastQC tools. This means that the end-user only needs to specify this input one time when running the task and that the alignment metrics and FastQC tools will run in parallel, conserving time and funds.
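The parallelism described here follows from the shape of the workflow graph. As a minimal sketch (using Python's standard `graphlib`, available from Python 3.9, rather than anything specific to Seven Bridges), the steps can be grouped into "waves" in which every step has all of its upstream dependencies satisfied. The step names are hypothetical labels for the three apps in this example.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes.
workflow = {
    "fastqc": set(),
    "picard_collect_wgs_metrics": set(),
    "multiqc": {"fastqc", "picard_collect_wgs_metrics"},
}


def execution_waves(graph):
    """Group steps into waves; steps in the same wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable display
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(execution_waves(workflow))
# [['fastqc', 'picard_collect_wgs_metrics'], ['multiqc']]
```

The first wave contains both QC tools because neither depends on the other, while MultiQC has to wait for both.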
Take note that one of the FastQC outputs is not connected to any downstream tool. In this case, this output port creates a zip file of the raw report data. However, the MultiQC tool does not need this output file and therefore it does not need to be moved or persisted outside the Docker container of the FastQC tool. In addition, although the CollectWgsMetrics and FastQC nodes feed into MultiQC, they do not have output nodes for themselves. This workflow has only 1 output which is the MultiQC HTML report. The intermediate reports will be saved temporarily in case the tool needs to be re-run, but will not persist in the file page of the user’s workspace, highlighting another way the workflow conserves funding.
We can test the workflow directly on the platform. Seven Bridges has multiple reference files in the Public Gallery. A completed task of the workflow will have one interactive report as an output. See the completed task in Figure 12.
The output of MultiQC is an interactive report that is viewable directly on the platform:
For more information about the workflow editor and for other examples please refer to the following materials in the Seven Bridges documentation:
There are two easy ways to scale your workflows on Seven Bridges. We refer to these as “batching” and “scattering.” A batch analysis separates files into batches, or groups, when running your analysis. The batching is done according to specified metadata criteria of your input files or according to a file list you provide. A batch analysis can be defined at run time with no special setup in the tool. However, each batch is run on a separate instance. For more information on batch analyses, please see the Seven Bridges documentation.
Using our NGS QC workflow example we can create a batch task for every file in the input file port, as shown in Figure 14. This batch task will create 1 child task for each input bam file.
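The grouping idea behind batching can be sketched in a few lines of Python. This is only an illustration of the concept, not the platform's implementation: the file records, the `sample_id` metadata key, and the `make_batches` helper are all hypothetical.

```python
from collections import defaultdict


def make_batches(files, metadata_key):
    """Group file records by one metadata value.

    Each resulting group corresponds to what would become one child task.
    """
    batches = defaultdict(list)
    for record in files:
        batches[record["metadata"][metadata_key]].append(record["name"])
    return dict(batches)


# Hypothetical inputs: the file names and the "sample_id" key are made up.
files = [
    {"name": "A.bam", "metadata": {"sample_id": "S1"}},
    {"name": "B.bam", "metadata": {"sample_id": "S2"}},
    {"name": "C.bam", "metadata": {"sample_id": "S1"}},
]

batches = make_batches(files, "sample_id")
for key, names in batches.items():
    print(key, names)
```

Batching by file, as in the NGS QC example, is the special case where every file forms its own group.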
We can use another method called “scattering,” which operates inside a single task. This means that a workflow can utilize multiple cores in a single compute instance, which is often more efficient than using multiple instances. Scattering can only be used at the workflow level, not at the tool level. To use scattering, we need to edit our workflow. We make the input file of type “array” and the array type “file” as shown in Figure 15.
Click on each of our QC tools and select “Step.” In the “Step” panel, select the appropriate input to scatter on. In this case, we scatter by “input_bam” for the Picard Collect WGS Metrics tool and by “input_fastq” for the FastQC tool. When the workflow is run, the user can select multiple input files and each of them will be processed in parallel as a separate job within the task.
This was a brief introduction to the powerful scatter ability of the workflow editor. Please see the section Comprehensive tips for reliable and efficient analysis set-up of the Seven Bridges documentation for more information.
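As a rough analogy, scattering behaves like a parallel map over an input array: the same step is applied to each element, and the results are gathered in order. The sketch below uses Python's standard thread pool purely to illustrate that pattern. The `qc_step` function is a hypothetical stand-in for a QC tool, and nothing here reflects how the platform schedules jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def qc_step(bam_name):
    # Stand-in for a QC tool; a real step would run FastQC or Picard on the file.
    return f"{bam_name}.qc_report"


def scatter(step, inputs, max_parallel=4):
    """Apply one step to every element of an input array in parallel.

    Results come back in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(step, inputs))


reports = scatter(qc_step, ["A.bam", "B.bam", "C.bam"])
print(reports)
```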
When running your custom workflow, you can define computational requirements so that there is enough memory and enough CPUs to run multiple jobs in parallel. For example, if your tool requires 4 GB of RAM and you select an instance with 8 CPUs and 32 GB of RAM, you will see 8 jobs running in parallel when you run your workflow, as shown in Figure 18.
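The arithmetic behind this example can be written out as a short sketch. It assumes each job needs one CPU (the text only states the memory requirement) and ignores any overhead the scheduler may reserve, so treat it as a back-of-the-envelope estimate. The `max_parallel_jobs` helper is hypothetical.

```python
def max_parallel_jobs(instance_cpus, instance_mem_gb, job_cpus, job_mem_gb):
    """Estimate how many jobs fit on one instance.

    The count is limited by whichever resource (CPU or memory) runs out first.
    """
    return min(instance_cpus // job_cpus, instance_mem_gb // job_mem_gb)


# Assumption: one CPU per job. The 4 GB per job and the 8 CPU / 32 GB instance
# come from the example in the text.
jobs = max_parallel_jobs(instance_cpus=8, instance_mem_gb=32, job_cpus=1, job_mem_gb=4)
print(jobs)  # 8
```

If the same tool needed 8 GB per job instead, memory would become the limiting resource and the estimate would drop to 4.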
If you have followed this guide, your tool has now been wrapped and added to a workflow. It has also been tested on a “toy” dataset and validated against real data for your project. In the next sections, you will learn how to export CWL from the platform, create a GitHub repository for version control, and also how to publish to Dockstore.
Not all tools need to be command-line binaries. Many researchers bring their shell scripts, Python scripts, and R scripts to Seven Bridges, and this is all possible using the Seven Bridges Tool Editor.
For example, if we wanted to run an R script using the GENESIS Docker image we could do that without having to recreate the Docker image. To run a specific script that is not included in the Docker image, use the “File requirements” field shown in Figure 19. Specify a name for your file and paste in the file contents.
Then enter the name of the file in the “Base command” section along with the command required to execute it (e.g. Rscript):
Similarly, if you were using a Python script, the base command would be “python.” Using the “File requirements” section of the Tool Editor, we can execute any type of script without having to create a new Docker container.
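Conceptually, the file requirement plus the base command amount to two steps: place a file with the given name and contents in the working directory, then invoke the interpreter on it. The sketch below imitates those two steps locally. The `run_staged_script` helper and the script contents are hypothetical, and `sys.executable` stands in for the “python” base command so the example runs anywhere.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_staged_script(file_name, file_contents, base_command):
    """Write a script into a fresh working directory, then run it.

    Mirrors "name the file, paste the contents, set the base command".
    """
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, file_name).write_text(file_contents)
        result = subprocess.run(
            [*base_command, file_name],
            cwd=workdir,
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout


# sys.executable is used instead of a bare "python" so the sketch is portable.
output = run_staged_script(
    "hello.py",
    'print("hello from a staged script")',
    [sys.executable],
)
print(output.strip())
```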
Training Exercises (CWL solutions available for the same exercise)
The 2023-07-11 release marks the fourteenth release for the NHLBI BioData Catalyst® (BDC) ecosystem. This release includes several new features, e.g., Faceted Search in BDC Powered by Seven Bridges (BDC-Seven Bridges), along with documentation to help new users get started on the ecosystem, e.g., updated WDL documentation in BDC Powered by Terra (BDC-Terra). This release also includes enhanced support for discovering what datasets are available via BDC Powered by Gen3. Please find more detail on the new features and user support materials in the sections below.
The 2023-07-11 data releases include the addition of various research projects related to COVID-19, lung development, platelet transfusion refractoriness, sickle cell anemia, asthma, pregnancy outcomes, and family health studies. Please refer to the Data Releases section below for information on upcoming data releases. A list of currently available data can be viewed on the BDC website.
Faceted Search in BDC-Seven Bridges: Version 1 of Faceted Search has been deployed for all users on BDC-Seven Bridges. This feature enables users to query or filter any BDC ingested data in a faceted way to find files and form groups of files by searching characteristics such as authorization status, study accession number, type of data, etc. With the release of v1 Faceted Search, users can now more easily find data that is relevant to their research. Faceted Search is currently available for 10 datasets and will be expanded to all hosted datasets in the following quarter. The Faceted Search feature can be found under the Data drop-down menu.
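The idea of faceted filtering can be illustrated with a small sketch: each facet is a field and value pair, and a file is kept only if it matches every selected facet. The records, field names, and accession numbers below are hypothetical, and this is not the platform's search API.

```python
def facet_filter(records, **facets):
    """Keep only the records whose fields match every requested facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]


# Hypothetical records; field names and accession numbers are illustrative only.
records = [
    {"file": "a.cram", "study": "phs000001", "data_type": "CRAM", "authorized": True},
    {"file": "b.vcf", "study": "phs000001", "data_type": "VCF", "authorized": False},
    {"file": "c.cram", "study": "phs000002", "data_type": "CRAM", "authorized": True},
]

hits = facet_filter(records, study="phs000001", authorized=True)
print([r["file"] for r in hits])
```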
BDC-Gen3 Metadata Being Updated to bring data from dbGaP FHIR database: BDC-Gen3’s Discovery Page (and underlying BDC-Gen3 Source of Truth Metadata API) allows unauthenticated users to discover what datasets are available in BDC. Fast Healthcare Interoperability Resources (FHIR) is a Health Level Seven International (HL7) specification for healthcare interoperability. Last quarter, BDC-Gen3 worked to consume the new metadata from the dbGaP FHIR Server (as part of the officially defined data ingestion process). This quarter, BDC-Gen3’s Data Ingestion Pipeline has been updated to load FHIR metadata with every new data release. The loaded metadata is available to all clients/users through BDC-Gen3’s Metadata API and is viewable in BDC-Gen3’s Discovery Page.
New and Improved Genomic Filtering on BDC Powered by PIC-SURE (BDC-PIC-SURE): The Genomic Filtering modal on BDC-PIC-SURE has been updated to more accurately represent the relatedness between the various filtering fields. This includes the revamped “Variant consequence calculated” field, which includes different levels of severity and their associated consequences. Additionally, the “Selected Genomic Filters” section now more explicitly summarizes the filter criteria being applied.
Edit Queries Built in BDC-PIC-SURE Using the API: Researchers who created a cohort in BDC-PIC-SURE’s user interface can now edit that query’s parameters using Python or R code via the BDC-PIC-SURE API. This provides more flexibility for researchers who want to refine or change their cohort after export and eliminates the need to return to the user interface.
Updated WDL documentation in BDC-Terra: Based on user feedback, Terra documentation has been expanded and updated to include: A new with a section dedicated to resources created by the WDL community, a new wdl-docs website to host the documentation from the new wdl-docs GitHub repository, updates to all existing WDL syntax documentation to match the WDL 1.0 spec, 17 new articles, 11 cookbook-style documents to teach users about specific use cases and provide example workflows, and 6 best practices documents to help users understand some of the grayer areas of coding in WDL. The documents are now available on the new wdl-docs GitHub repository.
New Code in “0_Export_from_UI” BDC-PIC-SURE API Examples: The example code has been updated to include new coding examples on how to use the BDC-PIC-SURE API to edit query parameters of a cohort built in the BDC-PIC-SURE user interface. These examples are available in both Python and R in both Jupyter and RStudio.
The table below highlights which studies were included in the 2023-07-11 data release. The Q2 data release included various research projects related to COVID-19, lung development, platelet transfusion refractoriness, sickle cell anemia, asthma, pregnancy outcomes, and family health studies. These include two studies from the COVID-19 Therapeutic Interventions and Vaccines initiative (ACTIV4a and ACTIV4c). There is a study on lung development (LungMAP) and another tackling platelet transfusion refractoriness in patients with severe thrombocytopenia using Eculizumab (DIR-Eculizumab). Other studies revolve around the use of hydroxyurea in children with sickle cell anemia (BABYHUG), the genetic epidemiology of asthma in Costa Rica (CRA), nulliparous pregnancy outcomes (nuMoM2b), multicenter study of hydroxyurea (MSH), and the Cleveland Family Study (CFS). The data is now available for access across the entire ecosystem.
The 2024-07-02 release marks the 18th release for the NHLBI BioData Catalyst® (BDC) ecosystem. This release includes several new features (e.g., an expanded workflow cost estimator, cascading authorization from parent to child studies, and DOIs at the dataset level). Please find more detail on the new features and user support materials in the sections below.
The 2024-07-02 data releases include the addition of research on atrial fibrillation, asthma, sickle cell disease, atherosclerosis, and more. Please refer to the Data Releases section below, as well as the BDC website, for more information.
Fixed Interoperability on BioData Catalyst Powered By Seven Bridges (BDC-Seven Bridges): BDC-Seven Bridges completed work on updating interoperability functionality. The initial release of the project-based data download restriction functionality inadvertently interfered with DRS data interoperability between BDC-Seven Bridges and other ecosystems such as CAVATICA. This unintentionally re-siloed data on those systems and runs counter to the overarching NIH data ecosystem goals of making data available to users across NIH institute/system boundaries.
Workflow Cost Estimator Expansion: A feature that enables users to estimate analysis costs before running has been expanded to three new workflows on BDC-Seven Bridges: 1) Cyrius, a tool to genotype CYP2D6 from WGS BAM or CRAM files, 2) kallisto quant, a tool to quantify RNA-seq data, and 3) BEDTools Coverage, a tool that computes both the depth and breadth of coverage of features in file B on the features in file A, useful for comparing WGS files. Users can filter tools based on the interactive cost estimator.
Support Cascading authorization from dbGaP parent to child studies: Gen3 has updated the authorization process in BDC to enable a researcher with access to a dbGaP parent study to automatically gain access to relevant child studies. The authorization process as it previously existed in BDC expected dbGaP to explicitly grant access to a parent study and each of its associated substudies individually. Since dbGaP did not provide explicit access for child studies, users were not able to access these child studies without manually requesting additional authorization. With support for cascading authorization from parent to child studies, a researcher with access to a dbGaP parent study will also gain access to relevant child studies in BDC, eliminating the need for any manual authorization process.
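The cascading rule can be sketched as a simple expansion: start from the studies a researcher was explicitly granted, then add every child reachable from them. This is an illustration of the logic only, not Gen3's implementation. The study names, the `children_of` mapping, and the `effective_access` helper are hypothetical.

```python
def effective_access(granted, children_of):
    """Expand explicitly granted studies to include all of their child studies."""
    accessible = set()
    stack = list(granted)
    while stack:
        study = stack.pop()
        if study not in accessible:
            accessible.add(study)
            # Follow the parent -> child relationship, if the study has children.
            stack.extend(children_of.get(study, []))
    return accessible


# Hypothetical parent/child relationships; real studies use dbGaP accessions.
children_of = {
    "parent_A": ["child_A1", "child_A2"],
    "parent_B": ["child_B1"],
}

access = effective_access({"parent_A"}, children_of)
print(sorted(access))
```

Access to `parent_A` cascades to its two children, while `parent_B` and its child remain inaccessible because no grant covers them.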
Implementation of DOIs at Dataset level: A digital object identifier (DOI) is a persistent identifier or handle used to identify objects uniquely, standardized by the International Organization for Standardization (ISO). In BDC, DOIs have been created and made available at the dataset level to assign a persistent identifier in a standard format. The DOIs are available via the Gen3 discovery page as well as the API. DataCite was used as the registration service. Going forward, every BDC dataset will have a DOI minted as part of the data ingestion process. For a user, having assigned DOIs to datasets will promote research reproducibility and data FAIR-ness.
View Stigmatizing Variables in PIC-SURE Open Access: Researchers can now view all variables, including stigmatizing variables, that are relevant to their search. Though these variables are not filterable in Open Access to protect participant data, this allows researchers to better understand what information is present in BDC. For more information about stigmatizing variables, please visit the .
The table below highlights which studies were included in the 2024-07-02 data release.
The latest release includes studies from NHLBI TOPMed projects such as Partners HealthCare Biobank, Novel Risk Factors for the Development of Atrial Fibrillation in Women, and the Study of Asthma Phenotypes and Pharmacogenomic Interactions by Race-Ethnicity (SAPPHIRE). New versions of studies like Walk-PHaSST Sickle Cell Disease, the Malmo Preventive Project, and the Johns Hopkins University School of Medicine Atrial Fibrillation Genetics Study are also featured. Additionally, the release includes updates to studies like Outcome Modifying Genes in Sickle Cell Disease (OMG) and the Vanderbilt University BioVU Atrial Fibrillation Genetics Study. The Collaborative Cohort of Cohorts for COVID-19 Research (C4R) and NIH RECOVER projects are also part of this release, including studies from the Hispanic Community Health Study/Study of Latinos and the Multi-Ethnic Study of Atherosclerosis.
The data is now available for access across the entire ecosystem.
The 2023-01-09 release marks the twelfth release for the NHLBI BioData Catalyst® (BDC) ecosystem. This release includes several new features (e.g., Azure volumes now available on both main analysis platforms) along with documentation and tutorials (e.g., information on how variable tags are generated) to help new users get started on the system. This release also includes enhanced support for moving data seamlessly across platforms. Please find more detail on the new features and user support materials in the sections below.
The 2023-01-09 data releases include the addition of the Pediatric Cardiac Genomics Consortium (PCGC). Please refer to the Data Releases section below for more information as well as the page on the BDC website.
Azure volumes are now available on BDC Powered by Seven Bridges: Users can now link a Microsoft Azure bucket to their Seven Bridges workspaces. After logging in, go to Data > Volumes and select “Microsoft Azure” to be led through a bucket-linking wizard.
DRS Manifest Export: To further improve interoperability and allow users to move their data seamlessly across platforms, a DRS export option is now available on Seven Bridges platforms. With the new functionality, users can export links to platform files (DRS URIs) and metadata into a manifest file, which can then be used to import the files and metadata on other platforms.
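For readers unfamiliar with DRS URIs, the sketch below shows how a hostname-based URI of the form `drs://<hostname>/<id>` maps to the object endpoint defined by the GA4GH DRS specification. It deliberately does not handle compact-identifier URIs (the `drs://prefix:accession` form), and the hostname and object ID are made-up examples. The layout of the exported manifest file itself is not described here, so the sketch does not attempt to read one.

```python
from urllib.parse import urlparse


def drs_uri_to_url(drs_uri):
    """Convert a hostname-based DRS URI into its HTTPS object endpoint.

    Compact-identifier URIs are not supported by this sketch.
    """
    parsed = urlparse(drs_uri)
    object_id = parsed.path.lstrip("/")
    if parsed.scheme != "drs" or not parsed.netloc or not object_id:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


# Hypothetical host and object ID, for illustration only.
url = drs_uri_to_url("drs://drs.example.org/314159")
print(url)
```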
OmicCircos R Shiny app now available on BDC-Seven Bridges: The OmicCircos app is an R Shiny application built around the OmicCircos R package for more effective generation of high-quality circular plots for visualizing genomic data. Common use cases include mutation patterns, copy number variations (CNVs), expression patterns, and methylation patterns. Such variations can be displayed as scatterplot, line, or text-label figures.
Introduction to SAS Public Project on BDC-Seven Bridges: Seven Bridges released a Public Project to train users on how to use SAS. The public project contains three notebooks that walk a user through: 1) loading and cleaning data in SAS using ICD9 codes, 2) pulling the CDC’s Social Vulnerability Index data via API and running a regression, and 3) loading hosted 1000 Genomes data into SAS and visualizing mutation information. A user can copy the public project to their own workspace and modify the tutorial notebooks to suit their needs.
New CWL Tools/Workflows on BDC-Seven Bridges:
BEDTools 2.30.0 toolkit:
BEDTools Coverage - returns the depth and breadth of coverage of features from B on the intervals in A
BEDTools Genomecov - computes histograms of feature coverage for a given genome
BEDTools GetFasta - extracts sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file
BEDTools Intersect - screens for overlaps between two sets of genomic features
BEDTools Merge - combines overlapping or “book-ended” features in an interval file into a single feature
BEDTools Sort - sorts a feature file by chromosome and other criteria
FlowSOM 2.4.0 which presents an algorithm used to distinguish cell populations from both flow and mass cytometry data in an unsupervised way.
cytofkit2 0.99.80 which is designed to analyze mass cytometry data from FCS files. It includes preprocessing, cell subset detection, cell subset visualization and interpretation, and inference of subset progression.
flowAI 1.24.0 which performs quality control on FCS data acquired using flow cytometry instruments. By evaluating three properties (flow rate, signal acquisition, and dynamic range), it enables the detection and removal of anomalies.
CNVkit 0.9.9 toolkit for inferring and visualizing copy number from high-throughput DNA sequencing data.
SBG Single-Cell RNA Deep Learning - Training is a single-cell classifier pipeline for human data. It relies on the transfer learning approach, which uses pre-trained gene embeddings as the starting point for building a model adjusted to given single-cell datasets.
SBG Single-Cell RNA Deep Learning - Predict is a single-cell classifier pipeline for human data. This tool uses the deep learning model generated by the SBG Single-Cell RNA Deep Learning - Training workflow to classify the input dataset.
Azure is now available on BDC Powered by Terra: Users can now log into Terra with a Microsoft Azure Cloud account. This is an invite-only version of Terra on the Azure platform. The public offering of Terra on Azure is expected in early 2023.
A new spend report is now available for BDC-Terra billing projects: The report identifies which workspaces are costing the most, to provide more transparency around cloud costs incurred in Terra. To access the spend report, go to your billing project (main menu > billing > billing project) and click on the "Spend report" tab.
New streamlined user journey from BDC Powered by PIC-SURE to analysis platforms: PIC-SURE has added “Export to Seven Bridges” and “Export to Terra” buttons to streamline data export into a BioData Catalyst analysis workspace. After exploring and filtering variables in PIC-SURE Authorized Access, users can package their data with the Select and Package Data Tool. Once the data is packaged, users can select their preferred BDC analysis platform with the new Export buttons. This provides all information needed and points the user directly to the public PIC-SURE project on either Seven Bridges or Terra.
Take a Tour of BDC-PIC-SURE: PIC-SURE has updated the guided tour of the interface to interactively display search results based on the user’s authorization. This guided tour walks through the different parts of the platform, including how to use tags, where search results are displayed, and how to interpret the Results Panel.
BABYHUG Data Field Issue: The study BABYHUG, phs002415, contained a data file that, as provided by the data submitter, included SAS-derived newline characters in data fields. This caused shifts in the data rows, leading to fields being mapped to the wrong variable. A corrected version of the file has been requested from the data submitter.
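To illustrate how an embedded newline shifts rows, the sketch below flags lines in a delimited text whose field count differs from the header. It assumes a tab-delimited file with unquoted fields. The sample data and the `find_shifted_rows` helper are hypothetical and unrelated to the actual BABYHUG file.

```python
def find_shifted_rows(text, delimiter="\t"):
    """Return 1-based line numbers whose field count differs from the header row."""
    lines = text.split("\n")
    expected = len(lines[0].split(delimiter))
    return [
        number
        for number, line in enumerate(lines[1:], start=2)
        if line and len(line.split(delimiter)) != expected
    ]


# Hypothetical data: the "note" field of record 2 contains a stray newline,
# which splits that record across lines 3 and 4.
sample = "id\tnote\tage\n1\tok\t5\n2\tbro\nken\t7\n3\tfine\t9"

shifted = find_shifted_rows(sample)
print(shifted)  # [3, 4]
```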
BDC GitBook on BDC-PIC-SURE: Users can now access the BDC GitBook documentation directly from the PIC-SURE platform under the “Help” tab.
The table below highlights which studies were included in the 2023-01-09 data release.
The PCGC substudy contains whole exome sequences, targeted sequences, and SNP array data. It is a multi-center, observational cohort study of individuals with congenital heart defects. The study aims to investigate the relationship between genetic factors and phenotypic and clinical outcomes in patients with CHD. Summary level phenotypes for the study participants can be viewed on the top-level study page. Individual level data and molecular data for the study are available by requesting Authorized Access. The study has collected phenotypic data and source DNA from 10,000 probands, parents, and families of interest. The data is now available for access across the entire ecosystem.
The 2024-01-08 release marks the 16th release for the NHLBI BioData Catalyst® (BDC) ecosystem. This release includes several new features (e.g., enabling Azure and searching data without logging in) along with documentation and tutorials (e.g., data dictionary field documentation) to help new users get started on the system. Please find more detail on the new features and user support materials in the sections below.
The 2024-01-08 data releases include the addition of research on multisystem inflammatory syndrome in children linked to COVID-19, bone marrow transplant and pulmonary hypertension in sickle cell disease, atherosclerosis, and psoriasis. Please refer to the Data Releases section below, as well as the BDC website, for more information.
Azure available on BDC Powered by Seven Bridges (BDC-SB): Velsera expanded their existing multi-cloud offerings by enabling Microsoft Azure (southcentralus) on BDC-SB. Users can select that computing and storage environment when creating a project. This allows users to avoid any egress charges when computing on data stored in Azure. This is of particular interest to users who want to connect their own Azure cloud buckets to BDC-SB.
SAS upgrade in BDC-SB: SAS on BDC-SB has been upgraded from SAS Viya 3.5 to SAS Studio 9.4. SAS 9.4 has improved functionality over SAS Viya 3.5, including more complete data management solutions and additional programming languages.
Open PIC-SURE without login: Open PIC-SURE is now publicly available on BDC Powered by PIC-SURE (BDC-PIC-SURE), meaning no eRA Commons credentials are required to access the site. Researchers can access this site to search terms of interest, apply filters at the variable-value level, retrieve obfuscated, aggregate counts, and view single variable distributions of their selected cohort. This new functionality allows researchers to discover and interact with data available on BDC without needing to log in, decreasing the barrier to data exploration. Check out Open PIC-SURE.
Data Hierarchies in BDC-PIC-SURE: Researchers are now able to view the data hierarchy associated with variables in BDC-PIC-SURE by clicking the “Data Tree” icon in the “Actions” column of the search results. This enables researchers to understand better how variables are related and obtain additional context for these variables. Note that this feature is currently in beta and will only be available for some studies. Feedback and input on this feature is welcome!
BDC-PIC-SURE Data Dictionary fields documentation: Documentation outlining the data dictionary fields returned from the PIC-SURE API was created. This provides a detailed account of what each field represents, including relationships between fields. This documentation can be found in the BDC-PIC-SURE GitBook.
The table below highlights which studies were included in the 2024-01-08 data release. The release features research on long-term outcomes of multisystem inflammatory syndrome in children linked to COVID-19 (COVID19-MUSIC_GRU), bone marrow transplant for severe sickle cell disease (BioLINCC-BMT_CTN_HMB), and ApoA-1, atherosclerosis, and psoriasis (DIR-ApoA-1_Atherosclerosis_in_Psoriasis_GRU). Additionally, updated metadata is provided for the ongoing study on sildenafil therapy in treating pulmonary hypertension in sickle cell disease (walk-PHaSST). This data includes clinical files. The data is now available for access across the entire ecosystem.
The 2023-10-04 release marks the 15th release for the NHLBI BioData Catalyst® (BDC) ecosystem. This release includes several new features (e.g., the ability to view cohort variables prior to access, and the ability to export selected data into an analysis workspace). Please find more detail on the new features in the section below.
The 2023-10-04 data releases include the addition of TOPMed studies spanning early-onset COPD, heart studies from various geographies, diabetes heart studies, and more. CRAMs and unharmonized clinical files were updated for six TOPMed studies already in BDC. BioLINCC Multi-Ethnic Study of Atherosclerosis studies were also added. Please refer to the Data Releases section below, as well as the BDC website, for more information.
BDC Powered by PIC-SURE (BDC-PIC-SURE): Open Access Variable Distributions Tool: Researchers can now view the variable distributions for their selected cohort with BDC-PIC-SURE Open Access to further their data discovery and exploration prior to access. Once variable filters have been applied, the Variable Distributions Tool displays bar charts for categorical variables and histograms for continuous variables. Note that the visualizations are obfuscated to protect participant-level data.
BDC Powered by Seven Bridges (BDC-Seven Bridges): This public project enables users to use a CWL tool to export selected data from BDC-PIC-SURE into a BDC-Seven Bridges project using a query from the BDC-PIC-SURE UI and the BDC-PIC-SURE API. This project is a continuation of our original public project. Combined, these public projects give savvy and novice users the ability to make cohorts on BDC-PIC-SURE and transfer them as data frames to BDC-Seven Bridges for analysis.
BDC Powered by Terra (BDC-Terra) workspace data security: When users import data from NIH data repositories such as BDC, they are only allowed to import into existing BDC-Terra workspaces that have an authorization domain and/or protected data setting. Import of these datasets into unprotected workspaces will not succeed. This ensures that the data access is appropriately logged by BDC-Terra.
The table below highlights which studies were included in the 2023-10-04 data release. This release includes a significant representation from the NHLBI TOPMed program with studies spanning areas such as early-onset COPD, heart studies from various geographies, diabetes heart studies, and more. Notably, CRAMs and unharmonized clinical files have been updated for 6 TOPMed studies that were already a part of BDC. Additionally, new studies pertaining to the BioLINCC Multi-Ethnic Study of Atherosclerosis have been introduced. The data is now available for access across the entire ecosystem.
The 2023-04-04 release marks the thirteenth release for the NHLBI BioData Catalyst® (BDC) ecosystem. This release includes several new features, e.g., a new gallery for Public Projects and new project-based download restrictions on BDC Powered by Seven Bridges (BDC-Seven Bridges). It also includes documentation and tutorials to help new users get started on the system, e.g., how to start using the BDC Powered by PIC-SURE (BDC-PIC-SURE) API. Please find more details on the new features and user support materials in the sections below.
Please refer to the Data Releases section below for information on upcoming data releases. A list of currently available data can be viewed on the BDC website.
New gallery for Public Projects on BDC-Seven Bridges: BDC-Seven Bridges has released a new user interface to make browsing and selecting public projects easier. Previously, Public Projects were found as a list under a dropdown menu. The interface has been updated so that the Public Resources > Projects dropdown displays a gallery of project cards with summaries and easily clickable “Copy Project” buttons.
Project-based download restrictions on BDC-Seven Bridges: Many consortia have found value in using the BDC-Seven Bridges project member permissions to collaborate and distribute data prior to public release. However, the ability to add new files to a project also allows a user to download files to their local environment. BDC-Seven Bridges released a new feature providing project-based download restrictions to the owner of the project. When creating a project, a user can turn on Download Restrictions and select to either allow analysis (CWL tools/workflows or Data Studio) but no download to a local environment, or no analysis and no download to the local environment. To request access to the new feature, email .
New CWL tools and workflows on BDC-Seven Bridges:
Minimac 4 4.1.2: a tool for imputing genotypes.
GATK 4.4.0.0
GATK IndexFeatureFile for indexing of provided feature files.
GATK MergeVcfs for combining multiple variant files.
GATK VariantEval BETA for evaluating variant calls.
GATK FilterMutectCalls for filtering somatic SNVs and indels called by Mutect2.
HTSeq-count 2.0.2: HTSeq-count is a Python tool for counting how many reads map to each feature.
GraphicsMagick 1.3.38
GraphicsMagick compare compares two images using statistics and/or visual differencing. The tool compares two images and reports difference statistics according to specified metrics, and/or outputs an image with a visual representation of the differences.
GraphicsMagick composite composites (combines) images to create a new image.
GraphicsMagick conjure interprets and executes scripts in the Magick Scripting Language (MSL). The Magick scripting language (MSL) will primarily benefit those that want to accomplish custom image processing tasks but do not wish to program.
GraphicsMagick convert is used to convert an input image file using one image format to an output file with the same or different image format while applying an arbitrary number of image transformations.
GraphicsMagick montage creates a composite image by combining several separate images.
MHC-I Binding Prediction tool (MHC I 3.1.2 toolkit) - which is used for prediction of peptides that bind to MHC I molecules.
MHC-II Binding Prediction tool (MHC II 3.1.6 toolkit) - which is used for prediction of peptides that bind to MHC II molecules.
MHCflurry Predict tool (MHCflurry 2.0.4 toolkit) - which is used for peptide/MHC I binding affinity prediction.
MHCflurry Scan tool (MHCflurry 2.0.4 toolkit) - which is designed to scan protein sequences and predict MHC-I ligands.
AXEL-F: Antigen eXpression based Epitope Likelihood-Function tool (AXEL-F 1.0.0 toolkit) - which is used for MHC-I epitope prediction.
NetChop tool (NetChop 3.0 toolkit) - which is a predictor of proteasomal processing based upon a neural network.
NetCTL tool (NetCTL 3.0 toolkit) - which is a T cell epitope predictor.
NetCTLpan tool (NetCTLpan 3.0 toolkit) - which is a T cell epitope predictor.
Class I Immunogenicity tool (Class I Immunogenicity 3.0 toolkit) - which predicts the immunogenicity of a peptide MHC (pMHC) complex.
TCRMatch tool (TCRMatch 1.0.2 toolkit) - which predicts T-Cell receptor specificity based on sequence similarity to characterized receptors.
BCell tool (BCell 3.1 toolkit) - which predicts linear B cell epitopes based on the antigen characteristics.
ElliPro tool (ElliPro 1.0 toolkit) - which predicts antibody epitopes based upon solvent-accessibility and flexibility.
Population Coverage tool (Population Coverage 3.0 toolkit) - which calculates the fraction of individuals predicted to respond to a given set of epitopes.
Epitope Cluster Analysis tool (Epitope Cluster Analysis 1.0 toolkit) - which groups epitopes into clusters based on sequence identity.
Picard 3.0.0 toolkit:
Picard CollectMultipleMetrics collects BAM statistics by running multiple Picard modules at once.
Picard ValidateSamFile validates an alignments file against the SAM specification.
Picard SortSam sorts alignment files (BAM or SAM).
Picard RevertSam reverts a BAM/SAM file to a previous state.
Picard MarkDuplicates marks duplicate reads in alignment files.
Picard GenotypeConcordance calculates genotype concordance between two VCF files.
Picard GatherBamFiles merges BAM files after a scattered analysis.
Picard FixMateInformation verifies and fixes mate-pair information.
Picard FastqToSam converts FASTQ files to an unaligned SAM or BAM file.
Picard CrosscheckFingerprints checks a set of data files for sample identity.
Picard CreateSequenceDictionary creates a DICT index file for a sequence.
Picard CollectWgsMetricsWithNonZeroCoverage evaluates the coverage and performance of WGS experiments.
Picard CollectVariantCallingMetrics can be used to collect variant call statistics after variant calling.
Picard CollectSequencingArtifactMetrics collects metrics to quantify single-base sequencing artifacts.
Picard CollectHsMetrics collects hybrid-selection metrics for alignments in SAM or BAM format.
Picard CollectAlignmentSummaryMetrics produces a summary of alignment metrics from a SAM or BAM file.
Picard CheckFingerprint checks sample identity of provided data against known genotypes.
Picard BedToIntervalList converts a BED file to a Picard INTERVAL_LIST format.
Picard AddOrReplaceReadGroups assigns all reads to the specified read group.
MetaCyto workflow (1.16.0 in CWL 1.2): based on the R package MetaCyto, this workflow performs meta-analysis of both flow cytometry and mass cytometry (CyTOF) data. It can jointly analyze cytometry data from different studies with diverse sets of markers.
New and improved R adapter for BDC-PIC-SURE API: The R adapter for the BDC-PIC-SURE API has been completely revamped to improve performance, address known bugs, and make the API easier to use for R coders. All example code, in both Jupyter and RStudio, has been updated to show these code improvements in practice. Note: The old version of the R API will be available for use until August 31st, 2023. It is recommended that you update your code with the new changes.
A FHIR client
Direct interaction with dbGaP’s FHIR API
Extract, Transform, Load (ETL) logic to parse the content from dbGaP’s FHIR and load into BDC-Gen3’s Metadata API
BDC-Gen3’s Data Ingestion Pipeline will be updated to use the above tool to load FHIR metadata with every new data release. In April 2023, the loaded metadata will be available to all clients/users through BDC-Gen3’s Metadata API and viewable on BDC-Gen3’s Discovery Page.
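The extract, transform, load flow described above can be illustrated with a small, offline sketch. This is not the BDC-Gen3 SDK or CLI: the function names, the output field names (`study_id`, `title`, `description`), and the in-memory `store` standing in for a metadata service are all hypothetical. The only parts taken from the FHIR specification are the `Bundle` layout (`entry[].resource`) and the `ResearchStudy` fields `id`, `title`, and `description`. The sample bundle is made up and no network call is performed.

```python
from typing import Any, Dict, List

Resource = Dict[str, Any]


def extract_studies(bundle: Resource) -> List[Resource]:
    """Extract: keep only ResearchStudy resources from a FHIR Bundle."""
    studies = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") == "ResearchStudy":
            studies.append(resource)
    return studies


def transform_study(study: Resource) -> Dict[str, str]:
    """Transform: flatten one ResearchStudy into a metadata record.

    The output keys are hypothetical, chosen for illustration only.
    """
    return {
        "study_id": study.get("id", ""),
        "title": study.get("title", ""),
        "description": study.get("description", ""),
    }


def load_records(records: List[Dict[str, str]], store: Dict[str, Dict[str, str]]) -> int:
    """Load: write records into a dict that stands in for a metadata service."""
    for record in records:
        store[record["study_id"]] = record
    return len(records)


def run_etl(bundle: Resource, store: Dict[str, Dict[str, str]]) -> int:
    """Run extract, transform, and load in sequence; return the record count."""
    records = [transform_study(s) for s in extract_studies(bundle)]
    return load_records(records, store)


# Made-up sample data; a real bundle would come from a FHIR server response.
SAMPLE_BUNDLE: Resource = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {"resource": {"resourceType": "ResearchStudy",
                      "id": "phs000000",
                      "title": "Example study (made up)"}},
        {"resource": {"resourceType": "Organization", "id": "org-1"}},
    ],
}

if __name__ == "__main__":
    metadata_store: Dict[str, Dict[str, str]] = {}
    loaded = run_etl(SAMPLE_BUNDLE, metadata_store)
    print(loaded, sorted(metadata_store))
```

Keeping the three stages as separate functions means the transform step can be tested against saved bundles without contacting any server.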
Learn about and start using the BDC-PIC-SURE API on the new “API” page: The “API” page on the BDC-PIC-SURE website provides everything you need to get started with the BDC-PIC-SURE API. This includes the personalized access token, links to publicly available R and Python code on both BDC Powered by Seven Bridges and Powered by Terra, and links to additional documentation.
In Q1 2023, progress was made in establishing procedures, clarifying data submission, and reworking screening protocols for multiple datasets for use with upcoming dataset ingestion. This included collaborative efforts with NHLBI to support pre-ingestion quality assurance, as well as data support for screening and assisting data submitters in preparing their data for future ingestion into BDC. Key datasets that underwent these processes include nuMoM2b (phs002808.v1.p1.c1), BABY HUG (phs002415.v1.p1.c1), MSH (phs002348.v1.p1.c1), NSRR-CFS (phs002715.v1.p1.c1), and CRA (phs000988.v4.p1.c1).
BDC-Gen3 release notes
BDC-PIC-SURE Tag Generation: PIC-SURE has updated help text in the user interface and documentation to address the frequently asked question, “How are variable tags generated?” Users can find this help text in the “Filter by Variable Tags” box on the PIC-SURE platform and in the documentation.
Updated BDC-PIC-SURE documentation on the Export buttons: The and were updated to include information about the new Export buttons. These updates were also released in the .
Gen3 release notes PIC-SURE release notes
BDC-Gen3 release notes
BDC Powered by Gen3 (BDC-Gen3) metadata updated with data from the dbGaP FHIR database: BDC-Gen3’s Discovery Page (and the underlying BDC-Gen3 Source of Truth Metadata API) allows unauthenticated users to discover which datasets are available in BDC. Fast Healthcare Interoperability Resources (FHIR) is a Health Level Seven International (HL7) specification for healthcare interoperability. The database of Genotypes and Phenotypes (dbGaP) has recently exposed a FHIR API. BDC-Gen3 has worked to consume the new metadata from the dbGaP FHIR Server (as part of the officially defined data ingestion process). BDC-Gen3’s Python-based Software Development Kit (SDK) and Command Line Interface (CLI) now have:
Gen3 release notes
Study Name
phs I.D. #
Acronym
New to BioData Catalyst
New study version
NHLBI TOPMed: Boston Early-Onset COPD Study in the TOPMed Program (EOCOPD)
phs000946.v5.p1.c1
topmed-EOCOPD_DS-CS-RD
No
Yes
NHLBI TOPMed: The Cleveland Family Study (CFS)
phs000954.v4.p2.c1
topmed-CFS_DS-HLBS-IRB-NPU
No
Yes
NHLBI TOPMed: The Jackson Heart Study (JHS)
phs000964.v5.p1.c1
topmed-JHS_HMB-IRB-NPU
No
Yes
NHLBI TOPMed: The Jackson Heart Study (JHS)
phs000964.v5.p1.c2
topmed-JHS_DS-FDO-IRB-NPU
No
Yes
NHLBI TOPMed: The Jackson Heart Study (JHS)
phs000964.v5.p1.c3
topmed-JHS_HMB-IRB
No
Yes
NHLBI TOPMed: The Jackson Heart Study (JHS)
phs000964.v5.p1.c4
topmed-JHS_DS-FDO-IRB
No
Yes
NHLBI TOPMed: Whole Genome Sequencing and Related Phenotypes in the Framingham Heart Study (FHS)
phs000974.v5.p3.c1
topmed-FHS_HMB-IRB-MDS
No
Yes
NHLBI TOPMed: Whole Genome Sequencing and Related Phenotypes in the Framingham Heart Study (FHS)
phs000974.v5.p3.c2
topmed-FHS_HMB-IRB-NPU-MDS
No
Yes
NHLBI TOPMed: Heart and Vascular Health Study (HVH)
phs000993.v5.p2.c1
topmed-HVH_HMB-IRB-MDS
No
Yes
NHLBI TOPMed: Heart and Vascular Health Study (HVH)
phs000993.v5.p2.c2
topmed-HVH_DS-CVD-IRB-MDS
No
Yes
NHLBI TOPMed: The Vanderbilt AF Ablation Registry (VAFAR)
phs000997.v5.p2.c1
topmed-VAFAR_HMB-IRB
No
Yes
NHLBI TOPMed: The Vanderbilt Atrial Fibrillation Registry (VU)
phs001032.v6.p2.c1
topmed-VU_AF_GRU-IRB
No
Yes
NHLBI TOPMed: The Genetics and Epidemiology of Asthma in Barbados (BAGS)
phs001143.v4.p1.c1
topmed-BAGS_GRU-IRB
No
Yes
NHLBI TOPMed: Cleveland Clinic Atrial Fibrillation Study (CCAF)
phs001189.v4.p1.c1
topmed-CCAF_AF_GRU-IRB
No
Yes
NHLBI TOPMed: Cardiovascular Health Study (CHS)
phs001368.v3.p2.c1
topmed-CHS_HMB-MDS
No
Yes
NHLBI TOPMed: Cardiovascular Health Study (CHS)
phs001368.v3.p2.c2
topmed-CHS_HMB-NPU-MDS
No
Yes
NHLBI TOPMed: Cardiovascular Health Study (CHS)
phs001368.v3.p2.c3
topmed-CHS_DS-CVD-MDS
Yes
Yes
NHLBI TOPMed: Cardiovascular Health Study (CHS)
phs001368.v3.p2.c4
topmed-CHS_DS-CVD-NPU-MDS
No
Yes
NHLBI TOPMed: Diabetes Heart Study (DHS) African American Coronary Artery Calcification (AACAC)
phs001412.v3.p1.c1
topmed-AACAC_HMB-IRB-COL-NPU
No
Yes
NHLBI TOPMed: Diabetes Heart Study (DHS) African American Coronary Artery Calcification (AACAC)
phs001412.v3.p1.c2
topmed-AACAC_DS-DHD-IRB-COL-NPU
No
Yes
NHLBI TOPMed: MESA and MESA Family AA-CAC (MESA)
phs001416.v2.p1.c1
topmed-MESA_HMB
No
Yes
NHLBI TOPMed: MESA and MESA Family AA-CAC (MESA)
phs001416.v2.p1.c2
topmed-MESA_HMB-NPU
No
Yes
Clinical-trial of COVID-19 Convalescent Plasma in Outpatients (C3PO)
phs002752.v1.p1.c1
COVID19-C3PO_GRU
No
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Genetic Epidemiology of COPD Study (COPDGene)
phs002910.v1.p1.c1
COVID19-C4R_COPDGene_HMB
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Genetic Epidemiology of COPD Study (COPDGene)
phs002910.v1.p1.c2
COVID19-C4R_COPDGene_DS-CS
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Atherosclerosis Risk in Communities Study (ARIC)
phs002988.v1.p1.c1
COVID19-C4R_ARIC_HMB-IRB
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Framingham Heart Study (FHS)
phs002911.v1.p1.c1
COVID19-C4R_FHS_HMB-IRB-MDS
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Framingham Heart Study (FHS)
phs002911.v1.p1.c2
COVID19-C4R_FHS_HMB-IRB-NPU-MDS
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Severe Asthma Research Program (SARP)
phs002913.v1.p1.c1
COVID19-C4R_GRU-PUB-NPU
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Severe Asthma Research Program (SARP)
phs002913.v1.p1.c2
COVID19-C4R_GRU-PUB
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Severe Asthma Research Program (SARP)
phs002913.v1.p1.c3
COVID19-C4R_DS-AAI-PUB-NPU
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Severe Asthma Research Program (SARP)
phs002913.v1.p1.c4
COVID19-C4R_DS-AAI-PUB
Yes
Yes
Multi-Ethnic Study of Atherosclerosis (BioLINCC)
phs003288.v1.p1.c1
BioLINCC-MESA_HMB
Yes
Yes
Multi-Ethnic Study of Atherosclerosis (BioLINCC)
phs003288.v1.p1.c2
BioLINCC-MESA_HMB-NPU
Yes
Yes
NHLBI TOPMed: Pharmacogenomics of Hydroxyurea in Sickle Cell Disease (PharmHU)
phs001466.v2.p1.c1
topmed-pharmHU_HMB
No
Yes
NHLBI TOPMed: Pharmacogenomics of Hydroxyurea in Sickle Cell Disease (PharmHU)
phs001466.v2.p1.c2
topmed-pharmHU_DS-SCD-RD
No
Yes
NHLBI TOPMed: Pharmacogenomics of Hydroxyurea in Sickle Cell Disease (PharmHU)
phs001466.v2.p1.c3
topmed-pharmHU_DS-SCD
No
Yes
NHLBI TOPMed: Partners HealthCare Biobank
phs001024.v6.p1.c1
topmed-PARTNERS_HMB
No
Yes
NHLBI TOPMed - NHGRI CCDG: The Vanderbilt University BioVU Atrial Fibrillation Genetics Study
phs001624.v3.p2.c1
topmed-BioVU_AF_HMB-GSO
No
Yes
NHLBI TOPMed: Novel Risk Factors for the Development of Atrial Fibrillation in Women
phs001040.v6.p1.c1
topmed-WGHS_HMB
No
Yes
NHLBI TOPMed - NHGRI CCDG: The Johns Hopkins University School of Medicine Atrial Fibrillation Genetics Study
phs001598.v3.p1.c1
topmed-JHU_AF_HMB-NPU-MDS
No
Yes
NHLBI TOPMed - NHGRI CCDG: Malmo Preventive Project (MPP)
phs001544.v3.p1.c1
topmed-MPP_HMB-NPU-MDS
No
Yes
NHLBI TOPMed: Pathways to Immunologically Mediated Asthma (PIMA)
phs001727.v3.p1.c2
topmed-PIMA_DS-ASTHMA-IRB-COL
No
Yes
NHLBI TOPMed: Characterizing the Response to a Leukotriene Receptor Antagonist and an Inhaled Corticosteroid (CLIC)
phs001729.v3.p1.c2
topmed-CARE_CLIC_DS-ASTHMA-IRB-COL
No
Yes
NHLBI TOPMed: Best ADd-on Therapy Giving Effective Response (BADGER)
phs001728.v3.p1.c2
topmed-CARE_BADGER_DS-ASTHMA-IRB-COL
No
Yes
Guiding Evidence Based Therapy Using Biomarker Intensified Treatment in Heart Failure (GUIDE-IT-BioLINCC)
phs003621.v1.p1.c1
BioLINCC-BL_GUIDE-IT_GRU
Yes
Yes
Heart Failure: A Controlled Trial Investigating Outcomes of Exercise Training (HF-ACTION-BioLINCC)
phs003599.v1.p1.c1
BioLINCC-BL_HF-ACTION_HMB
Yes
Yes
Heart Failure: A Controlled Trial Investigating Outcomes of Exercise Training (HF-ACTION-BioLINCC)
phs003599.v1.p1.c2
BioLINCC-BL_HF-ACTION_HMB-NPU
Yes
Yes
Sleep Heart Health Study (SHHS-BioLINCC)
phs003637.v1.p1.c1
BioLINCC-BL_SHHS_HMB-MDS
Yes
Yes
The Pediatric Cardiac Genomics Consortium (PCGC)
phs000571.v6.p2.c1
PCGC-CHD-GENES_HMB
No
Yes
The Collaborative Cohort of Cohorts for COVID-19 Research (C4R)
phs002988.v1.p1.c1
phs002910.v1.p1.c1
phs002910.v1.p1.c2
phs002911.v1.p1.c1
phs002911.v1.p1.c2
phs003017.v1.p1.c1
phs002919.v1.p1.c1
C4R_ARIC_phs002988
C4R_COPDGene_phs002910
C4R_FHS_phs002911
C4R_MESA_phs003017
C4R_REGARDS_phs002919
No
Yes
Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-Be (nuMoM2b)
phs002339.v1.p1.c1
topmed-NuMom2B_GRU-IRB
Yes
Yes
Genetic Epidemiology of COPD Study (COPDGene)
phs002910.v1.p1.c1
COVID19-C4R_COPDGene_HMB
Yes
Yes
Genetic Epidemiology of COPD Study (COPDGene)
phs002910.v1.p1.c2
COVID19-C4R_COPDGene_DS-CS
Yes
Yes
The Mediators of Atherosclerosis in South Asians Living in America (MASALA)
phs002980.v1.p1.c1
COVID19-C4R_MASALA_HMB-IRB-COL
Yes
Yes
Prevent Pulmonary Fibrosis (PrePF)
phs002975.v1.p1.c1
COVID19-C4R_PrePF_HMB
Yes
Yes
A Multi-site Observational Study of Post-Acute Sequelae of SARS-CoV-2 Infection in Adults (RECOVER)
phs003463.v1.p1.c1
RECOVER-Adult
Yes
Yes
Hispanic Community Health Study (HCHS)
phs003457.v1.p1.c1
NSRR-HCHS
Yes
Yes
Hispanic Community Health Study (HCHS)
phs003457.v1.p1.c2
NSRR-HCHS
Yes
Yes
NHLBI TOPMed: The Genetic Epidemiology of Asthma in Costa Rica (CRA)
phs000988.v5.p1.c1
topmed-CRA_DS-ASTHMA-IRB-MDS-RD
No
Yes
NHLBI TOPMed - NHGRI CCDG: Genes-Environments and Admixture in Latino Asthmatics (GALA II)
phs000920.v5.p2.c2
topmed-GALAII_DS-LD-IRB-COL
No
Yes
NHLBI TOPMed: HyperGEN - Genetics of Left Ventricular (LV) Hypertrophy
phs001293.v3.p1.c1
topmed-HyperGEN_GRU-IRB
No
Yes
NHLBI TOPMed: HyperGEN - Genetics of Left Ventricular (LV) Hypertrophy
phs001293.v3.p1.c2
HyperGEN_DS-CVD-IRB-RD
No
Yes
NHLBI TOPMed: Whole Genome Sequencing of Venous Thromboembolism (WGS of VTE)
phs001402.v3.p1.c1
Mayo_VTE_GRU
No
Yes
NHLBI TOPMed - NHGRI CCDG: Massachusetts General Hospital (MGH) Atrial Fibrillation Study
phs001062.v5.p2.c2
MGH_AF_DS-AF-IRB-RD
No
Yes
NHLBI TOPMed - NHGRI CCDG: Massachusetts General Hospital (MGH) Atrial Fibrillation Study
phs001062.v5.p2.c1
MGH_AF_HMB-IRB
No
Yes
NHLBI TOPMed: African American Sarcoidosis Genetics Resource
phs001207.v3.p1.c1
Sarcoidosis_DS-SAR-IRB
No
Yes
NHLBI TOPMed: Women's Health Initiative (WHI)
phs001237.v3.p1.c1
WHI_HMB-IRB
No
Yes
NHLBI TOPMed: Women's Health Initiative (WHI)
phs001237.v3.p1.c2
WHI_HMB-IRB-NPU
No
Yes
NHLBI TOPMed - NHGRI CCDG: Atherosclerosis Risk in Communities (ARIC)
phs001211.v4.p2.c2
ARIC_DS-CVD-IRB
No
Yes
NHLBI TOPMed - NHGRI CCDG: Atherosclerosis Risk in Communities (ARIC)
phs001211.v4.p2.c1
ARIC_HMB-IRB
No
Yes
NHLBI TOPMed: Genetics of Cardiometabolic Health in the Amish
phs000956.v5.p1.c2
Amish_HMB-IRB-MDS
No
Yes
NHLBI TOPMed: Australian Familial Atrial Fibrillation Study
phs001435.v2.p1.c1
AustralianFamilialAF_HMB-NPU-MDS
No
Yes
NHLBI TOPMed - NHGRI CCDG: Early-onset Atrial Fibrillation in the CATHeterization GENetics (CATHGEN) Cohort
phs001600.v3.p2.c1
CATHGEN_DS-CVD-IRB
No
Yes
NHLBI TOPMed: Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE)
phs001472.v2.p1.c1
ECLIPSE_DS-COPD-MDS-RD
No
Yes
NHLBI TOPMed: Genetic Epidemiology Network of Arteriopathy (GENOA)
phs001345.v3.p1.c1
GENOA_DS-ASC-RF-NPU
No
Yes
NHLBI TOPMed: Genetic Epidemiology Network of Salt Sensitivity (GenSalt)
phs001217.v3.p1.c1
GenSalt_DS-HCR-IRB
No
Yes
NHLBI TOPMed: GOLDN Epigenetic Determinants of Lipid Response to Dietary Fat and Fenofibrate
phs001359.v3.p1.c1
GOLDN_DS-CVD-IRB
No
Yes
NHLBI TOPMed: Defining the time-dependent genetic and transcriptomic responses to cardiac injury among patients with arrhythmias
phs001434.v2.p1.c1
miRhythm_GRU
No
Yes
NHLBI TOPMed: Partners HealthCare Biobank
phs001024.v5.p1.c1
PARTNERS_HMB
No
Yes
NHLBI TOPMed: Pharmacogenomics of Hydroxyurea in Sickle Cell Disease (PharmHU)
phs001466.v2.p1.c3
pharmHU_DS-SCD
No
Yes
NHLBI TOPMed: Pharmacogenomics of Hydroxyurea in Sickle Cell Disease (PharmHU)
phs001466.v2.p1.c2
pharmHU_DS-SCD-RD
No
Yes
NHLBI TOPMed: Pharmacogenomics of Hydroxyurea in Sickle Cell Disease (PharmHU)
phs001466.v2.p1.c1
pharmHU_HMB
No
Yes
NHLBI TOPMed: REDS-III Brazil Sickle Cell Disease Cohort (REDS-BSCDC)
phs001468.v3.p1.c1
REDS-III_Brazil_SCD_GRU-IRB-PUB-NPU
No
Yes
NHLBI TOPMed: San Antonio Family Heart Study (SAFHS)
phs001215.v4.p2.c1
SAFHS_DS-DHD-IRB-PUB-MDS-RD
No
Yes
NHLBI TOPMed: Study of Asthma Phenotypes and Pharmacogenomic Interactions by Race-Ethnicity (SAPPHIRE)
phs001467.v2.p1.c1
SAPPHIRE_asthma_DS-ASTHMA-IRB-COL
No
Yes
NHLBI TOPMed: Novel Risk Factors for the Development of Atrial Fibrillation in Women
phs001040.v5.p1.c1
WGHS_HMB
No
Yes
NHLBI TOPMed: Trans-Omics for Precision Medicine (TOPMed) Whole Genome Sequencing Project: Cardiovascular Health Study (CHS)
phs001368.v3.p2.c3
topmed-CHS_DS-NPU-MDS
Yes
Yes
NHLBI TOPMed: The Genetic Epidemiology of Asthma in Costa Rica
phs000988.v5.p1.c1
topmed-CRA_DS-ASTHMA-IRB-MDS-RD
No
Yes
NHLBI TOPMed - NHGRI CCDG: Genes-Environments and Admixture in Latino Asthmatics (GALA II)
phs000920.v5.p3.c2
topmed-GALAII_DS-LD-IRB-COL
No
Yes
NHLBI TOPMed: HyperGEN - Genetics of Left Ventricular (LV) Hypertrophy
phs001293.v3.p1.c2
topmed-HyperGEN_DS-CVD-IRB-RD
No
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Genetic Epidemiology of COPD Study (COPDGene)
phs002910.v1.p1.c1
C4R-COPDGene_HMB
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Genetic Epidemiology of COPD Study (COPDGene)
phs002910.v1.p1.c2
C4R-COPDGene_DS-CS
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Atherosclerosis Risk in Communities Study (ARIC)
phs002988.v1.p1.c1
C4R-ARIC_HMB-IRB
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Atherosclerosis Risk in Communities Study (ARIC)
phs002988.v1.p1.c2
C4R-ARIC_DS-CVD-IRB
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Severe Asthma Research Program (SARP)
phs002913.v1.p1.c1
C4R-SARP_GRU-PUB-NPU
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Severe Asthma Research Program (SARP)
phs002913.v1.p1.c2
C4R-SARP_GRU-PUB
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Severe Asthma Research Program (SARP)
phs002913.v1.p1.c3
C4R-SARP_DS-AAI-PUB-NPU
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Severe Asthma Research Program (SARP)
phs002913.v1.p1.c4
C4R-SARP_DS-AAI-PUB
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Framingham Heart Study (FHS)
phs002911.v1.p1.c1
C4R-FHS_HMB-IRB-MDS
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Framingham Heart Study (FHS)
phs002911.v1.p1.c2
C4R-FHS_HMB-IRB-NPU-MDS
Yes
Yes
ApoA-1 and Atherosclerosis in Psoriasis (DIR)
phs003231.v1.p1.c1
DIR-AAP_GRU
Yes
Yes
Method to Assess Lung Water Accumulation During Exercise (DIR)
phs003346.v1.p1.c1
DIR-MALWADE_GRU-IRB
Yes
Yes
Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-Be (nuMoM2b)
phs002808.v1.p1.c1
topmed-NuMom2B_GRU-IRB
Yes
Yes
Hydroxyurea to Prevent Organ Damage in Children with Sickle Cell Anemia (BABY HUG)
phs002415.v1.p1.c1
BioLINCC-BabyHug_DS-SCD-IRB-RD
No
No
Multicenter Study of Hydroxyurea (MSH)
phs002348.v1.p1.c1
BioLINCC-MSH_GRU
No
No
The Cleveland Family Study (NSRR-CFS)
phs002715.v1.p1.c1
NSRR-NSRR-CFS_DS-HLBS-IRB-NPU
No
No
The Genetic Epidemiology of Asthma in Costa Rica (CRA)
phs000988.v4.p1.c1
topmed-CRA_DS-ASTHMA-IRB-MDS-RD
No
Yes
Long-Term Outcomes after the Multisystem Inflammatory Syndrome In Children (MUSIC)
phs002770
-
Yes
Yes
Accelerating COVID-19 Therapeutic Interventions and Vaccines 4 ACUTE (ACTIV4a) v1.0, v1.1
phs002694.v1.p1.c1
COVID19-ACTIV4A_GRU
No
Yes
Molecular Atlas of Lung Development (LungMAP)
phs001961.v2.p1.c1
-
Yes
Yes
Freeze 9 version Updates: Batch 1
-
-
No
Yes
Accelerating COVID-19 Therapeutic Interventions and Vaccines 4 ACUTE (ACTIV4a) v1.0, v1.1
phs002694.v3.p1.c1
COVID19-ACTIV4A_GRU
No
Yes
COVID-19 Post-hospital Thrombosis Prevention Study (ACTIV4c)
phs003063.v1.p1.c1
COVID19-ACTIV4C_GRU
No
Yes
Molecular Atlas of Lung Development (LungMAP)
phs001961.v2.p1.c1
LungMAP-MALD_GRU
Yes
Yes
Complement Inhibition Using Eculizumab to Overcome Platelet Transfusion Refractoriness in Patients with Severe Thrombocytopenia (DIR-Eculizumab)
phs003212.v1.p1.c1
DIR-Eculizumab_GRU
Yes
Yes
Hydroxyurea to Prevent Organ Damage in Children with Sickle Cell Anemia (BABYHUG)
phs002415.v1.p1.c1
BioLINCC-BabyHug_DS-SCD-IRB-RD
No
No
The Genetic Epidemiology of Asthma in Costa Rica (CRA)
phs000988.v4.p1.c1
topmed-CRA_DS-ASTHMA-IRB-MDS-RD
No
Yes
Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-Be (nuMoM2b)
phs002808.v1.p1.c1
topmed-NuMom2B_GRU-IRB
Yes
Yes
Multicenter Study of Hydroxyurea (MSH)
phs002348.v1.p1.c1
BioLINCC-MSH_GRU
No
No
The Cleveland Family Study (NSRR-CFS)
phs002715.v1.p1.c1
NSRR-NSRR-CFS_DS-HLBS-IRB-NPU
No
No
NHLBI TOPMed: Partners HealthCare Biobank
phs001024.v6.p1.c1
topmed-PARTNERS_HMB
No
Yes
NHLBI TOPMed: Novel Risk Factors
phs001040.v6.p1.c1
topmed-WGHS_HMB
No
Yes
NHLBI TOPMed: Study of Asthma Phenotypes and Pharmacogenomic Interactions by Race-Ethnicity (SAPPHIRE)
phs001467.v2.p2.c1
topmed-SAPPHIRE_asthma_HMB-COL
No
Yes
NHLBI TOPMed: Walk-PHaSST Sickle Cell Disease (SCD)
phs001514.v2.p1.c1
topmed-Walk_PHaSST_SCD_HMB-IRB-PUB-COL-NPU-MDS-GSO
No
Yes
NHLBI TOPMed: Walk-PHaSST Sickle Cell Disease (SCD)
phs001514.v2.p1.c2
topmed-Walk_PHaSST_SCD_DS-SCD-IRB-PUB-COL-NPU-MDS-RDN
No
Yes
NHLBI TOPMed - NHGRI CCDG: Malmo Preventive Project (MPP)
phs001544.v3.p1.c1
topmed-MPP_HMB-NPU-MDS
No
Yes
NHLBI TOPMed - NHGRI CCDG: The Johns Hopkins University School of Medicine Atrial Fibrillation Genetics Study
phs001598.v3.p1.c1
topmed-JHU_AF_HMB-NPU-MDS
No
Yes
NHLBI TOPMed: Outcome Modifying Genes in Sickle Cell Disease (OMG)
phs001608.v2.p1.c1
topmed-OMG_SCD_DS-SCD-IRB-PUB-COL-MDS-RD
No
Yes
NHLBI TOPMed - NHGRI CCDG: The Vanderbilt University BioVU Atrial Fibrillation Genetics Study
phs001624.v3.p2.c1
topmed-BioVU_AF_HMB-GSO
No
Yes
NHLBI TOPMed: Genetic Causes of Complex Pediatric Disorders - Asthma (GCPD-A)
phs001661.v3.p1.c1
topmed-GCPD-A_DS-ASTHMA-GSO
No
Yes
NHLBI TOPMed: Lung Tissue Research Consortium (LTRC)
phs001662.v2.p1.c2
topmed-LTRC_HMB-MDS
No
Yes
NHLBI TOPMed: Pulmonary Hypertension and the Hypoxic Response in SCD (PUSH)
phs001682.v2.p1.c1
topmed-PUSH_SCD_DS-SCD-IRB-PUB-COL
No
Yes
NHLBI TOPMed - NHGRI CCDG: Groningen Genetics of Atrial Fibrillation (GGAF) Study
phs001725.v2.p1.c1
topmed-GGAF_GRU
No
Yes
NHLBI TOPMed: Childhood Asthma Management Program (CAMP)
phs001726.v2.p1.c1
topmed-CAMP_DS-AST-COPD
No
Yes
NHLBI TOPMed: Best ADd-on Therapy Giving Effective Response (BADGER)
phs001728.v3.p1.c2
topmed-CARE_BADGER_DS-ASTHMA-IRB-COL
No
Yes
NHLBI TOPMed: Characterizing the Response to a Leukotriene Receptor Antagonist and an Inhaled Corticosteroid (CLIC)
phs001729.v3.p1.c2
topmed-CARE_CLIC_DS-ASTHMA-IRB-COL
No
Yes
NHLBI TOPMed: Pediatric Asthma Controller Trial (PACT)
phs001730.v2.p1.c2
topmed-CARE_PACT_DS-ASTHMA-IRB-COL
No
Yes
NHLBI TOPMed: TReating Children to Prevent EXacerbations of Asthma (TREXA)
phs001732.v2.p1.c2
topmed-CARE_TREXA_DS-ASTHMA-IRB-COL
No
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Hispanic Community Health Study/Study of Latinos (HCHS/SOL)
phs002908.v1.p1.c1
COVID19-C4R_HCHS_SOL_HMB-NPU
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Hispanic Community Health Study/Study of Latinos (HCHS/SOL)
phs002908.v1.p1.c2
COVID19-C4R_HCHS_SOL_HMB
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Multi-Ethnic Study of Atherosclerosis (MESA)
phs003017.v1.p1.c1
COVID19-C4R_MESA_HMB
Yes
Yes
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Multi-Ethnic Study of Atherosclerosis (MESA)
phs003017.v1.p1.c2
COVID19-C4R_MESA_HMB-NPU
Yes
Yes
NIH RECOVER: A Multi-Site Observational Study of Post-Acute Sequelae of SARS-CoV-2 Infection in Adults
phs003463.v2.p2.c1
RECOVER-RC-Adult_GRU
No
Yes
Heart Failure Network: Functional Impact of GLP-1 for Heart Failure Treatment (HFN FIGHT-BioLINCC)
phs003542.v1.p1.c1
BioLINCC_BL_HFN-FIGHT_GRU
No
Yes
Action to Control Cardiovascular Risk in Diabetes (ACCORD-BioLINCC)
phs003551.v1.p1.c1
BioLINCC-BL_ACCORD_GRU
No
Yes
Action to Control Cardiovascular Risk in Diabetes (ACCORD - Imaging)
phs003562.v2.p1.c1
imaging-ACCORD_GRU
No
Yes
Systolic Blood Pressure Intervention Trial (SPRINT-Imaging)
phs003566.v2.p1.c1
imaging-SPRINT_GRU
No
Yes
Framingham Heart Study-Cohort (FHS-Cohort) - Imaging
phs003593.v1.p1.c1
Imaging-img_FHS_HMB-IRB-MDS
No
Yes
Framingham Heart Study-Cohort (FHS-Cohort) - Imaging
phs003593.v1.p1.c2
Imaging-img_FHS_HMB-IRB-NPU-MDS
No
Yes
Long-TerM OUtcomes after the Multisystem Inflammatory Syndrome In Children (MUSIC)
phs002770.v1.p1.c1
COVID19-MUSIC_GRU
Yes
Yes
Unrelated Donor Reduced Intensity Bone Marrow Transplant for Children with Severe Sickle Cell Disease (BMT CTN-0601-BioLINCC)
phs003470.v1.p1.c1
BioLINCC-BMT_CTN_HMB
Yes
Yes
ApoA-1 and Atherosclerosis in Psoriasis (DIR)
phs003231.v1.p1.c1
DIR-ApoA-1_Atherosclerosis_in_Psoriasis_GRU
Yes
Yes
Treatment of Pulmonary Hypertension and Sickle Cell Disease With Sildenafil Therapy (walk-PHaSST)
phs002383.v1.p1.c1
BioLINCC-Walk_PHaSST_DS-SCD-IRB-PUB-COL-NPU-MDS-RD
No
No
| Study Name | phs I.D. # | Acronym | New to BioData Catalyst | New study version |
| --- | --- | --- | --- | --- |
| NHLBI TOPMed: Boston Early-Onset COPD Study (EOCOPD) | phs000946.v5.p1.c1 | topmed-EOCOPD_DS-CS-RD | No | No |
| NHLBI TOPMed: The Cleveland Family Study (CFS) | phs000954.v4.p2.c1 | topmed-CFS_DS-HLBS-IRB-NPU | No | No |
| NHLBI TOPMed: The Jackson Heart Study (JHS) | phs000964.v5.p1.c1 | topmed-JHS_HMB-IRB-NPU | No | No |
| NHLBI TOPMed: The Jackson Heart Study (JHS) | phs000964.v5.p1.c2 | topmed-JHS_DS-FDO-IRB-NPU | No | No |
| NHLBI TOPMed: The Jackson Heart Study (JHS) | phs000964.v5.p1.c3 | topmed-JHS_HMB-IRB | No | No |
| NHLBI TOPMed: The Jackson Heart Study (JHS) | phs000964.v5.p1.c4 | topmed-JHS_DS-FDO-IRB | No | Yes |
| NHLBI TOPMed: Genomic Activities such as Whole Genome Sequencing and Related Phenotypes in the Framingham Heart Study (FHS) | phs000974.v5.p3.c1 | topmed-FHS_HMB-IRB-MDS | No | No |
| NHLBI TOPMed: Genomic Activities such as Whole Genome Sequencing and Related Phenotypes in the Framingham Heart Study (FHS) | phs000974.v5.p3.c2 | topmed-FHS_HMB-IRB-NPU-MDS | No | No |
| NHLBI TOPMed: Heart and Vascular Health Study (HVH) | phs000993.v5.p2.c1 | topmed-HVH_HMB-IRB-MDS | No | No |
| NHLBI TOPMed: Heart and Vascular Health Study (HVH) | phs000993.v5.p2.c2 | topmed-HVH_DS-CVD-IRB-MDS | No | No |
| NHLBI TOPMed - NHGRI CCDG: The Vanderbilt AF Ablation Registry | phs000997.v5.p2.c1 | topmed-VAFAR_HMB-IRB | No | No |
| NHLBI TOPMed: The Vanderbilt Atrial Fibrillation Registry (VU) | phs001032.v6.p2.c1 | topmed-VU_AF_GRU-IRB | No | No |
| NHLBI TOPMed: The Genetics and Epidemiology of Asthma in Barbados | phs001143.v4.p1.c1 | topmed-BAGS_GRU-IRB | No | No |
| NHLBI TOPMed: Cleveland Clinic Atrial Fibrillation (CCAF) Study | phs001189.v4.p1.c1 | topmed-CCAF_AF_GRU-IRB | No | No |
| NHLBI TOPMed: Trans-Omics for Precision Medicine (TOPMed) Whole Genome Sequencing Project: Cardiovascular Health Study (CHS) | phs001368.v3.p2.c1 | topmed-CHS_HMB-MDS | No | No |
| NHLBI TOPMed: Trans-Omics for Precision Medicine (TOPMed) Whole Genome Sequencing Project: Cardiovascular Health Study (CHS) | phs001368.v3.p2.c2 | topmed-CHS_HMB-NPU-MDS | No | No |
| NHLBI TOPMed: Trans-Omics for Precision Medicine (TOPMed) Whole Genome Sequencing Project: Cardiovascular Health Study (CHS) | phs001368.v3.p2.c4 | topmed-CHS_DS-CVD-NPU-MDS | No | No |
| NHLBI TOPMed: Diabetes Heart Study (DHS) African American Coronary Artery Calcification (AACAC) | phs001412.v3.p1.c1 | topmed-AACAC_HMB-IRB-COL-NPU | No | No |
| NHLBI TOPMed: Diabetes Heart Study (DHS) African American Coronary Artery Calcification (AACAC) | phs001412.v3.p1.c2 | topmed-AACAC_DS-DHD-IRB-COL-NPU | No | No |
| NHLBI TOPMed: MESA and MESA Family AA-CAC (MESA) | phs001416.v3.p1.c1 | topmed-MESA_HMB | No | No |
| NHLBI TOPMed: MESA and MESA Family AA-CAC (MESA) | phs001416.v3.p1.c2 | topmed-MESA_HMB-NPU | No | No |
| Clinical-trial of COVID-19 Convalescent Plasma in Outpatients (C3PO) | phs002752.v1.p1.c1 | COVID19-C3PO_GRU | No | No |
| COVID-19 Post-hospital Thrombosis Prevention Study (ACTIV-4C) | phs003063.v1.p1.c1 | COVID19-ACTIV4C_GRU | No | No |
| Multi-Ethnic Study of Atherosclerosis (BioLINCC) | phs003288.v1.p1.c1 | BioLINCC-MESA_HMB | Yes | Yes |
| Multi-Ethnic Study of Atherosclerosis (BioLINCC) | phs003288.v1.p1.c2 | BioLINCC-MESA_HMB-NPU | Yes | Yes |
| RECOVER Synthetic Data Set | tutorial-RECOVER_synthetic_data_set_1 | tutorial-RECOVER_synthetic_data_set_1 | Yes | Yes |
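The phs identifiers listed above follow the dbGaP accession pattern `phs<study>.v<version>.p<participant set>.c<consent group>`, so rows for the same study differ only in their consent group or version. The sketch below shows one way to split such an identifier into its parts when working with these tables programmatically. The names `Accession`, `parse_accession`, and `same_study` are hypothetical helpers, not part of any BDC tool, and the pattern assumes a six-digit study number, as in every identifier listed here.

```python
import re
from typing import NamedTuple, Optional


class Accession(NamedTuple):
    study: str            # study accession, e.g. "phs003288"
    version: int          # study version (the "v" part)
    participant_set: int  # participant set (the "p" part)
    consent_group: int    # consent group (the "c" part)


# Assumes "phs" plus six digits, then .v, .p, and .c components.
_ACCESSION_RE = re.compile(r"^(phs\d{6})\.v(\d+)\.p(\d+)\.c(\d+)$")


def parse_accession(text: str) -> Optional[Accession]:
    """Return the parsed accession, or None if the text is not a full accession."""
    match = _ACCESSION_RE.match(text.strip())
    if match is None:
        return None
    study, version, participant_set, consent_group = match.groups()
    return Accession(study, int(version), int(participant_set), int(consent_group))


def same_study(left: str, right: str) -> bool:
    """True when both strings parse and share the same study accession."""
    a, b = parse_accession(left), parse_accession(right)
    return a is not None and b is not None and a.study == b.study


if __name__ == "__main__":
    print(parse_accession("phs003288.v1.p1.c2"))
    # Non-accession values, such as the tutorial data set ID, yield None.
    print(parse_accession("tutorial-RECOVER_synthetic_data_set_1"))
    print(same_study("phs000964.v5.p1.c1", "phs000964.v5.p1.c4"))
```

Returning `None` for values that do not match keeps entries such as a bare `phs002770` or a tutorial data set ID from being mistaken for a versioned accession.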
The 2024-04-01 release marks the 17th release for the NHLBI BioData Catalyst® (BDC) ecosystem. This release includes several new features (e.g., SRA import via DRS and the ability to save dataset IDs). Please find more details on the new features below.
The 2024-04-01 data releases include the addition of research on heart failure and COVID-19 plus version updates to ongoing genetic and genomic studies including COPD and atrial fibrillation. Please refer to the Data Releases section below for more information as well as the Data page on the BDC website.
BDC Powered by Seven Bridges (BDC-Seven Bridges) SRA Import via DRS: The Sequence Read Archive (SRA) has been accessible via the SRA Toolkit, which involves users downloading a copy to their local environment and then downloading the SRA data to their project on BDC-Seven Bridges. NCBI is now storing the SRA data in cloud buckets on Amazon and Google, allowing users to avoid egress charges and simplifying access to the data via BDC-Seven Bridges’ new SRA to DRS Converter workflow.
BDC Powered by PIC-SURE Save Dataset ID: Users can now save the dataset ID after applying filters and building a cohort, allowing them to view and access their saved cohorts at a later time. Saved dataset IDs can be viewed and managed on the Authorized PIC-SURE Dataset Management page.
The table below highlights which studies were included in the 2024-04-01 data release.
The latest release incorporates studies from the Heart Failure Network (HFN), National Sleep Research Resource (NSRR), Observational Study of Post-Acute Sequelae of SARS-CoV-2 Infection (RECOVER Adult), and the Collaborative Cohort of Cohorts for COVID-19 Research (C4R). Additionally, the release broadens its scope with version updates to ongoing genetic and genomic studies, including the NHLBI TOPMed projects such as the evaluation of COPD longitudinally, and the genetic epidemiology of conditions like atrial fibrillation within the CATHGEN cohort, among others.
The data will be available for access across the entire ecosystem by 2024-04-05.
Heart Failure Network: Diuretic Optimization Strategies Evaluation in Acute Heart Failure (HFN DOSE-BioLINCC)
phs003524.v1.p1.c1
BioLINCC-BL_HFN_DOSE_AHF_GRU
Yes
No
National Sleep Research Resource (NSRR): Hispanic Community Health Study/Study of Latinos
phs003543.v1.p1.c1
NSRR-HCHS_HMB-NPU
Yes
No
National Sleep Research Resource (NSRR): Hispanic Community Health Study/Study of Latinos
phs003543.v1.p1.c2
NSRR-HCHS_HMB
Yes
No
NIH RECOVER: A Multi-Site Observational Study of Post-Acute Sequelae of SARS-CoV-2 Infection in Adults
phs003463.v1.p1.c1
RECOVER-RC_Adult_GRU
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Prevent Pulmonary Fibrosis (PrePF)
phs002975.v1.p1.c1
COVID19-C4R_PREPF_DS-PMD-IRB
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Genetic Epidemiology of COPD Study (COPDGene)
phs002910.v1.p1.c2
COVID19-C4R_COPDGENE_DS-CS
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Genetic Epidemiology of COPD Study (COPDGene)
phs002910.v1.p1.c1
COVID19-C4R_COPDGENE_HMB
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Multi-Ethnic Study of Atherosclerosis (MESA)
phs003017.v1.p1.c1
COVID19-C4R_MESA_HMB
Yes
No
Collaborative Cohort of Cohorts for COVID-19 Research (C4R): Multi-Ethnic Study of Atherosclerosis (MESA)
phs003017.v1.p1.c2
COVID19-C4R_MESA_HMB-NPU
Yes
No
NHLBI TOPMed: Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE)
phs001472.v2.p1.c1
topmed-ECLIPSE_DS-COPD-MDS-RD
No
Yes
NHGRI CCDG: Early-onset Atrial Fibrillation in the CATHeterization GENetics (CATHGEN) Cohort
phs001600.v3.p2.c1
topmed-CATHGEN_DS-CVD-IRB
No
Yes
NHLBI TOPMed: Genetic Epidemiology Network of Arteriopathy (GENOA)
phs001345.v3.p1.c1
topmed-GENOA_DS-ASC-RF-NPU
No
Yes
NHLBI TOPMed: Genetics of Lipid Lowering Drugs and Diet Network (GOLDN)
phs001359.v3.p1.c1
topmed-GOLDN_DS-CVD-IRB
No
Yes
NHLBI TOPMed: University of Massachusetts Medical School (UMMS) miRhythm Study
phs001434.v2.p1.c1
topmed-miRhythm_GRU
No
Yes
NHLBI TOPMed: Genetics of Cardiometabolic Health in the Amish
phs000956.v4.p1.c2
topmed-Amish_HMB-IRB-MDS
No
Yes
NHLBI TOPMed: Genetic Epidemiology of COPD (COPDGene) in the TOPMed Program
phs000951.v5.p4.c1
topmed-COPDGene_HMB
No
Yes
NHLBI TOPMed: Genetic Epidemiology of COPD (COPDGene) in the TOPMed Program
phs000951.v5.p4.c2
topmed-COPDGene_DS-CS-RD
No
Yes
NHLBI TOPMed: Trans-Omics for Precision Medicine Whole Genome Sequencing Project: ARIC
phs001211.v4.p2.c1
topmed-ARIC_HMB-IRB
No
Yes
NHLBI TOPMed: Trans-Omics for Precision Medicine Whole Genome Sequencing Project: ARIC
phs001211.v4.p2.c2
topmed-ARIC_DS-CVD-IRB
No
Yes
NHLBI TOPMed: REDS-III Brazil Sickle Cell Disease Cohort (REDS-BSCDC)
phs001468.v3.p1.c1
topmed-REDS-III_Brazil_SCD_GRU-IRB-PUB-NPU
No
Yes
NHLBI TOPMed: The Genetic Epidemiology of Asthma in Costa Rica
phs000988.v5.p1.c1
topmed-CRA_DS-ASTHMA-IRB-MDS-RD
No
Yes
NHLBI TOPMed: Genes-environments and Admixture in Latino Asthmatics (GALA II) Study
phs000920.v5.p2.c2
topmed-GALAII_DS-LD-IRB-COL
No
Yes
LungMAP: Molecular Atlas of Lung Development - Human Lung Tissue
phs001961.v2.p1.c1
LungMAP-MALD_GRU
No
No
Unrelated Donor Reduced Intensity Bone Marrow Transplant for Children with Severe Sickle Cell Disease (BMT CTN-0601-BioLINCC)
phs003470.v1.p1.c1
BioLINCC-BMT_CTN-0601_GRU
No
No
NHLBI TOPMed: HyperGEN - Genetics of Left Ventricular (LV) Hypertrophy
phs001293.v3.p1.c2
topmed-HyperGEN_DS-CVD-IRB-RD
No
Yes
NHLBI TOPMed - NHGRI CCDG: AF Biobank LMU in the context of the MED Biobank LMU
phs001543.v2.p1.c1
topmed-AFLMU_HMB-IRB-PUB-COL-NPU-MDS
No
Yes
NHLBI TOPMed: Australian Familial Atrial Fibrillation Study
phs001435.v2.p1.c1
topmed-AustralianFamilialAF_HMB-NPU-MDS
No
Yes
NHLBI TOPMed - NHGRI CCDG: Penn Medicine BioBank Early Onset Atrial Fibrillation Study
phs001601.v2.p1.c1
topmed-CCDG_PMBB_AF_HMB-IRB-PUB
No
Yes
NHLBI TOPMed: Children's Health Study (CHS) Integrative Genetic Approaches to Gene-Air Pollution Interactions in Asthma (GAP)
phs001602.v2.p1.c1
topmed-ChildrensHS_GAP_GRU
No
Yes
NHLBI TOPMed: Children's Health Study (CHS) Integrative Genomics and Environmental Research of Asthma (IGERA)
phs001603.v2.p1.c1
topmed-ChildrensHS_IGERA_GRU
No
Yes
NHLBI TOPMed: Children's Health Study (CHS) Effects of Air Pollution on the Development of Obesity in Children (Meta-AIR)
phs001604.v2.p1.c1
topmed-ChildrensHS_MetaAir_GRU
No
Yes
NHLBI TOPMed: Chicago Initiative to Raise Asthma Health Equity (CHIRAH)
phs001605.v2.p1.c2
topmed-CHIRAH_DS-ASTHMA-IRB-COL
No
Yes
NHLBI TOPMed: Determining the association of chromosomal variants with non-PV triggers and ablation-outcome in AF (DECAF)
phs001546.v2.p1.c1
topmed-DECAF_GRU
No
Yes
NHLBI TOPMed: Early-onset Atrial Fibrillation in the Estonian Biobank
phs001606.v2.p1.c1
topmed-EGCUT_GRU
No
Yes
NHLBI TOPMed: Genetics of Asthma in Latino Americans (GALA)
phs001542.v2.p1.c2
topmed-GALA_DS-LD-IRB-COL
No
Yes
NHLBI TOPMed: Pharmacogenomics of Hydroxyurea in Sickle Cell Disease (PharmHU)
phs001466.v2.p1.c3
topmed-pharmHU_DS-SCD
No
Yes
NHLBI TOPMed: Pharmacogenomics of Hydroxyurea in Sickle Cell Disease (PharmHU)
phs001466.v2.p1.c2
topmed-pharmHU_DS-SCD-RD
No
Yes
NHLBI TOPMed - NHGRI CCDG: The GENetics in Atrial Fibrillation (GENAF) Study
phs001547.v2.p1.c1
topmed-GENAF_HMB-NPU
No
Yes
NHLBI TOPMed: Genetic Study of Atherosclerosis Risk (GeneSTAR)
phs001218.v3.p1.c2
topmed-GeneSTAR_DS-CVD-IRB-NPU-MDS
No
Yes
NHLBI TOPMed: Genetic Epidemiology Network of Salt Sensitivity (GenSalt)
phs001217.v3.p1.c1
topmed-GenSalt_DS-HCR-IRB
No
Yes
NHLBI TOPMed - NHGRI CCDG: Hispanic Community Health Study/Study of Latinos (HCHS/SOL)
phs001395.v2.p1.c2
topmed-HCHS-SOL_HMB
No
Yes
NHLBI TOPMed - NHGRI CCDG: Hispanic Community Health Study/Study of Latinos (HCHS/SOL)
phs001395.v2.p1.c1
topmed-HCHS-SOL_HMB-NPU
No
Yes
NHLBI TOPMed: HyperGEN - Genetics of Left Ventricular (LV) Hypertrophy
phs001293.v3.p1.c1
topmed-HyperGEN_GRU-IRB
No
Yes
NHLBI TOPMed - NHGRI CCDG: Intermountain INSPIRE Registry
phs001545.v2.p1.c1
topmed-INSPIRE_AF_DS-MULTIPLE_DISEASES-MDS
No
Yes
NHLBI TOPMed - NHGRI CCDG: The Johns Hopkins University School of Medicine Atrial Fibrillation Genetics Study
phs001598.v2.p1.c1
topmed-JHU_AF_HMB-NPU-MDS
No
Yes
NHLBI TOPMed: Whole Genome Sequencing of Venous Thromboembolism (WGS of VTE)
phs001402.v3.p1.c1
topmed-Mayo_VTE_GRU
No
Yes
NHLBI TOPMed - NHGRI CCDG: Massachusetts General Hospital (MGH) Atrial Fibrillation Study
phs001062.v5.p2.c2
topmed-MGH_AF_DS-AF-IRB-RD
No
Yes
NHLBI TOPMed - NHGRI CCDG: Massachusetts General Hospital (MGH) Atrial Fibrillation Study
phs001062.v5.p2.c1
topmed-MGH_AF_HMB-IRB
No
Yes
NHLBI TOPMed: MyLifeOurFuture (MLOF) Research Repository of patients with hemophilia A (factor VIII deficiency) or hemophilia B (factor IX deficiency)
phs001515.v2.p1.c1
topmed-MLOF_HMB-PUB
No
Yes
NHLBI TOPMed - NHGRI CCDG: Malmo Preventive Project (MPP)
phs001544.v2.p1.c1
topmed-MPP_HMB-NPU-MDS
No
Yes
NHLBI TOPMed: Partners HealthCare Biobank
phs001024.v5.p1.c1
topmed-PARTNERS_HMB
No
Yes
NHLBI TOPMed: Pharmacogenomics of Hydroxyurea in Sickle Cell Disease (PharmHU)
phs001466.v2.p1.c1
topmed-pharmHU_HMB
No
Yes
NHLBI TOPMed: San Antonio Family Heart Study (SAFHS)
phs001215.v4.p2.c1
topmed-SAFHS_DS-DHD-IRB-PUB-MDS-RD
No
Yes
NHLBI TOPMed: Study of Asthma Phenotypes and Pharmacogenomic Interactions by Race-Ethnicity (SAPPHIRE)
phs001467.v2.p1.c1
topmed-SAPPHIRE_asthma_DS-ASTHMA-IRB-COL
No
Yes
NHLBI TOPMed: African American Sarcoidosis Genetics Resource
phs001207.v3.p1.c1
topmed-Sarcoidosis_DS-SAR-IRB
No
Yes
NHLBI TOPMed: Genome-Wide Association Study of Adiposity in Samoans
phs000972.v5.p1.c1
topmed-SAS_GRU-IRB-PUB-COL-NPU-GSO
No
Yes
NHLBI TOPMed: Rare Variants for Hypertension in Taiwan Chinese (THRV)
phs001387.v3.p1.c3
topmed-THRV_DS-CVD-IRB-COL-NPU-RD
No
Yes
NHLBI TOPMed: Novel Risk Factors for the Development of Atrial Fibrillation in Women
phs001040.v5.p1.c1
topmed-WGHS_HMB
No
Yes
NHLBI TOPMed: Women's Health Initiative (WHI)
phs001237.v3.p1.c1
topmed-WHI_HMB-IRB
No
Yes
NHLBI TOPMed: Women's Health Initiative (WHI)
phs001237.v3.p1.c2
topmed-WHI_HMB-IRB-NPU
No
Yes
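The accessions listed in the table above follow the pattern `phs` plus six digits, optionally followed by a version (`v`), participant set (`p`), and consent group (`c`). The following is a minimal sketch of how such strings could be split into their parts when working with these tables in a script. The names `Accession` and `parse_accession` are illustrative and are not part of any BioData Catalyst tool.

```python
import re
from typing import NamedTuple, Optional

# Matches accessions such as phs001237.v3.p1.c2; every suffix is optional.
ACCESSION_RE = re.compile(
    r"^(?P<study>phs\d{6})"
    r"(?:\.v(?P<version>\d+))?"
    r"(?:\.p(?P<participant_set>\d+))?"
    r"(?:\.c(?P<consent_group>\d+))?$"
)


class Accession(NamedTuple):
    """Hypothetical container for the parts of an accession string."""
    study: str
    version: Optional[int]
    participant_set: Optional[int]
    consent_group: Optional[int]


def parse_accession(text: str) -> Accession:
    """Split an accession string into study, version, participant set, consent group."""
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognized accession: {text!r}")

    def to_int(name: str) -> Optional[int]:
        value = match.group(name)
        return int(value) if value is not None else None

    return Accession(
        match.group("study"),
        to_int("version"),
        to_int("participant_set"),
        to_int("consent_group"),
    )


if __name__ == "__main__":
    print(parse_accession("phs001237.v3.p1.c2"))
```

Strings that do not follow the dotted pattern raise `ValueError` rather than being guessed at.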
BDC-Gen3 release notes
BDC-Terra release notes
BDC-Seven Bridges release notes
BDC-PIC-SURE release notes
Docker technology has revolutionized reproducibility by providing a fast, portable, easily shareable way to recreate the exact compute environment, with all dependencies and configurations, that was used to run a tool or workflow.
Below, we provide resources for finding public Docker images or creating your own image to use with your analysis. Docker is commonly used by software engineers, and learning material on the internet may be overly complex for the researcher use case. We compiled learning materials from each platform within BioData Catalyst to help you get started using Docker specifically for bioinformatics pipelines.
We highly recommend that you begin with an official or maintained image (for example, from BioContainers) to help ensure you are using secure software.
Below, we have compiled learning resources from various sources to help you get started learning Docker:
Dockstore’s Getting Started with Docker
The 2022-10-03 release marks the eleventh release for the NHLBI BioData Catalyst ecosystem. This release includes several new features (e.g., PIC-SURE's new search interface) along with updated documentation. This release also includes updated versions of the Study Variable Explorer and the Annotation Explorer. Please find more detail on the new features and user support materials in the sections below.
The 2022-10-03 data releases include the addition of TOPMed Boston-Brazil SCD and PCGC datasets. Please refer to the Data Releases section below for more information as well as the Data page on the BioData Catalyst website.
Now export with Study Variable Explorer on BioData Catalyst Powered by Seven Bridges: The Study Variable Explorer on BioData Catalyst Powered by Seven Bridges allows researchers to explore phenotypic variables from the TOPMed data dictionaries in an open access manner. Seven Bridges released Study Variable Explorer version 2, which expands on version 1 by adding tag search, notes, and data export. The latest update enables researchers to track their variable selection process through notes tied to study and variable information, which can be shared with collaborators through .json export. This gives analysts a traceable record for reproducing decisions made during the harmonization process.
New Interactive Web Apps Gallery: Under the “Public Gallery” dropdown on BioData Catalyst Powered by Seven Bridges, a new display for “Interactive Web Apps” provides access to the LocusZoom and Model Explorer R Shiny applications.
Annotation Explorer Version 2: The Annotation Explorer enables users to interactively explore, query, and study characteristics of an inventory of annotations for the variants across the genome. This application can be used before association testing to interactively explore variant aggregation and filtering strategies and to generate input files for multiple-variant association testing, or after association testing to explore annotations associated with a set of significant variants or variants of interest. Seven Bridges previously released the Annotation Explorer R Shiny application through a Public Project. Now, Annotation Explorer is integrated with BioData Catalyst Powered by Seven Bridges through the “Data” dropdown. The new integration enables querying genome-wide annotations and variants (including the TOPMed Freeze5 and Freeze8 datasets) in a more user-friendly interface without running an RStudio notebook. This release is integrated into the billing system so a user can select their compute needs based on price and monitor Annotation Explorer-specific costs through their billing group.
New CWL Tools and Workflows on BioData Catalyst Powered by Seven Bridges:
GATK VariantEval BETA 4.2.5.0 tool which is used for evaluating variant calls.
GATK FilterMutectCalls 4.2.5.0 tool which is used to filter somatic SNVs and indels called by Mutect2.
Picard CreateSequenceDictionary 2.25.7 tool for creating a DICT index file for a sequence.
WARP ExomeGermlineSingleSample 2.4.4 pipeline for data pre-processing and variant calling in human WES data.
BCFtools 1.15.1 toolkit - CWL1.2
Kraken2 2.1.2 toolkit
SRA (v3.0.0, CWL1.2)
SRA sam-dump that converts SRA data into SAM format. With aligned data, NCBI uses Compression by Reference, which only stores the differences in base pairs between sequence data and the segment it aligns to. The process to restore original data, for example as FASTQ, requires fast access to the reference sequences that the original data was aligned to.
SRA fasterq-dump tool that converts SRA data into FASTQ format while using temporary files and multi-threading to speed up the extraction.
SRA fastq-dump tool that converts SRA data into FASTQ format.
Salmon (v1.5.2, CWL1.2)
Salmon Alevin tool that introduces a family of algorithms for quantification and analysis of 3’ tagged-end single-cell sequencing data.
Salmon Index tool that builds an index necessary for the Salmon Quant and Salmon Alevin tools. To create an index, it uses a transcriptome reference file in FASTA format. Additionally, one can provide a genome reference along with transcriptome to create a hybrid index compatible with the improved mapping algorithm named Selective Alignment.
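The SRA sam-dump entry above mentions Compression by Reference, in which only the differences between a read and the reference segment it aligns to are stored. The toy sketch below illustrates the idea for substitutions only. It is not NCBI's storage format, and the function names `compress` and `restore` are hypothetical. It also shows why restoring the original data requires access to the same reference sequence.

```python
from typing import List, Tuple

# A compressed read: (offset into reference, read length, list of (index, base) mismatches).
Record = Tuple[int, int, List[Tuple[int, str]]]


def compress(read: str, reference: str, offset: int) -> Record:
    """Store only the positions where the read differs from the reference segment."""
    segment = reference[offset:offset + len(read)]
    if len(segment) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [
        (i, base)
        for i, (base, ref_base) in enumerate(zip(read, segment))
        if base != ref_base
    ]
    return offset, len(read), diffs


def restore(record: Record, reference: str) -> str:
    """Rebuild the original read; this needs the same reference used for compression."""
    offset, length, diffs = record
    bases = list(reference[offset:offset + length])
    for i, base in diffs:
        bases[i] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    record = compress("GTTC", reference, offset=2)
    print(record, restore(record, reference))
```

A read that matches the reference exactly compresses to an empty difference list, which is where the storage savings come from.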
Updated Interactive Analysis interface on Terra: Under the new design, the “Notebooks” tab is transformed into the more general “Analyses” tab, from where you can access the multiple applications available for Interactive Analysis in Terra. Accordingly, the list of Notebook files (.ipynb) becomes the list of “Your Analyses”, which now supports including R Markdown files (.Rmd). Just like Notebook files, any R Markdown files created in or added to the Analyses tab will be automatically stored in the workspace bucket and synced between the bucket and your persistent disk.
PIC-SURE's new search interface: PIC-SURE has released an improved dynamic data exploration experience, allowing users to easily search and query at the variable value and genomic variant level. The streamlined search experience enables users to search variables and view associated information, such as decoded variable level information, details about the dataset, and study information - all without opening any data files. Updates to the interface include filtering search results by variable and study tags, a new genomic filtering model, adding variables to export without filtering, a simpler select and package data process, and visualizing single variable distributions.
Dedicated PIC-SURE images within Seven Bridges analysis workspaces: The Seven Bridges and PIC-SURE teams have collaborated to provide users with dedicated workspace images that contain all the pre-installed packages necessary to run the PIC-SURE example notebooks. PIC-SURE API users in Seven Bridges will not have to worry about changes to package dependencies and/or versions, and R users in particular will notice a significantly faster start-up time during environment set-up. The PIC-SURE images are available in both the JupyterLab and RStudio Seven Bridges environments. Users can find this feature by specifying the Environment setup of any Data Cruncher analysis.
Cure Sickle Cell Metadata Catalog integration: PIC-SURE has updated the Data Access Table to integrate information about sickle cell disease (SCD) studies from the Cure Sickle Cell Metadata Catalog (MDC). The “Additional Information” column includes a link to that SCD study’s page on the MDC. The Data Access Table also includes other new information, such as study design and study focus.
New BioData Catalyst Powered by PIC-SURE search interface: The documentation associated with PIC-SURE has been updated to reflect the recent release of the new search interface. This includes the BioData Catalyst Powered by PIC-SURE User Guide and the tutorial videos on the BioData Catalyst Powered by PIC-SURE YouTube playlist.
Updated documentation on new Terra Interface: The documentation associated with Terra has been updated to reflect the recent release of the new analysis interface. This includes the Terra Workspace Quickstart Guide and the tutorial videos on the Terra YouTube channel.
The table below highlights which studies were included in the Q3 2022 data releases. The data is now available for access across the entire ecosystem.
BostonBrazil_SCD
phs001599
topmed-BostonBrazil_SCD_HMB-IRB-COL
Yes
PCGC
phs001735.c1
topmed-PCGC_CHD_HMB
No
Yes
PCGC
phs001735.c2
topmed-PCGC_CHD_DS-CHD
No
Yes
National Sleep Research Resource (NSRR)
phs002715-c1
NSRR-CFS_DS-HLBS-IRB-NPU
Yes
FHS_phs000974_TOPMed_WGS_freeze.9b
phs000974
TOPMed_FHS
No
Yes
PCGC SRA
phs000571.v6.p2
PCGC-CHD-GENES_HMB
Yes
National Sleep Research Resource (NSRR)
This dataset had to be ingested again to accommodate additional data provided by data owners
phs002715-c1
NSRR-CFS_DS-HLBS-IRB-NPU
No
No
Gen3 release notes
Terra release notes
Seven Bridges release notes
PIC-SURE release notes
Dockstore release notes
The 2022-07-11 release marks the tenth release for the NHLBI BioData Catalyst ecosystem. This release includes several new features (e.g., importing files from AnVIL via DRS and creating multi-sample VCFs). This release also includes enhanced support for CWL tools on GitHub. Please find more detail on the new features in the sections below.
The 2022-07-11 data release includes the addition of COVID-19 dataset C3PO and TOPMed Freeze 9 batch 3 and 4. Please refer to the Data Release section below for more information as well as the Data page on the BioData Catalyst website.
Import files from AnVIL to BioData Catalyst Powered by Seven Bridges via DRS
Seven Bridges released an interoperability feature enabling import of data from AnVIL to BioData Catalyst Powered by Seven Bridges. A TOPMed researcher working in BioData Catalyst who identifies a causal variant through association testing might want to next investigate how that variant affects gene expression. The AnVIL ecosystem hosts the Genotype-Tissue Expression (GTEx) datasets which can be used to understand which tissues are affected by novel variants. Seven Bridges’ latest release allows a TOPMed researcher to go to AnVIL and push data they have permissions for to BioData Catalyst Powered by Seven Bridges, thus allowing the researcher to run the variant association test on TOPMed data and identify how that variant changes tissue expression with GTEx data in one workspace.
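For readers unfamiliar with DRS, the GA4GH Data Repository Service specification defines hostname-based URIs of the form `drs://<hostname>/<id>`, which resolve to an HTTPS endpoint under `/ga4gh/drs/v1/objects/`. The sketch below shows that mapping only. It does not describe how the Seven Bridges import feature is implemented, it does not handle compact-identifier DRS URIs, and the hostname used in the example is a placeholder.

```python
from urllib.parse import quote, urlparse


def drs_uri_to_url(drs_uri: str) -> str:
    """Map a hostname-based DRS URI to the HTTPS URL of its DRS object record."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parsed.path.lstrip("/")
    if not object_id:
        raise ValueError(f"DRS URI has no object id: {drs_uri!r}")
    # The object id is percent-encoded so it stays a single path segment.
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"


if __name__ == "__main__":
    # drs.example.org is a placeholder host, not a real service.
    print(drs_uri_to_url("drs://drs.example.org/314159"))
```

Fetching the resulting URL from a real DRS server would additionally require whatever authorization that server enforces.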
Create multi-sample VCFs with the Variant Store
Researchers who have access to many TOPMed studies will want to mix and combine VCF files into a multi-sample VCF. Additionally, researchers might want to subset samples based on genomic regions. Using standard bioinformatics tools, this process involves many manual steps and can be time intensive and cost prohibitive. The Variant Store on BioData Catalyst Powered by Seven Bridges uses a series of API calls to combine VCFs from studies of interest and subset the multi-sample VCF based on the selected genomic region. The latest release allows researchers to track the costs associated with generating multi-sample VCFs via the Variant Store as a dedicated line item in their billing group separate from analysis and storage costs.
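To make the two operations described above concrete, the sketch below merges per-sample genotype calls into a multi-sample table and then subsets it by genomic region. It is a toy illustration on in-memory dictionaries, not the Variant Store API, and it ignores details that real VCF merging must handle (multiallelic sites, headers, normalization). All names are hypothetical. Missing calls are filled with `./.`, the VCF convention for a missing genotype.

```python
from typing import Dict, List, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def merge_samples(
    per_sample: Dict[str, Dict[Variant, str]]
) -> Tuple[List[str], Dict[Variant, List[str]]]:
    """Combine per-sample genotype calls into one row per variant, one column per sample."""
    samples = sorted(per_sample)
    all_variants = sorted({v for calls in per_sample.values() for v in calls})
    merged = {
        variant: [per_sample[s].get(variant, "./.") for s in samples]
        for variant in all_variants
    }
    return samples, merged


def subset_region(
    merged: Dict[Variant, List[str]], chrom: str, start: int, end: int
) -> Dict[Variant, List[str]]:
    """Keep only variants on chrom whose position lies in [start, end]."""
    return {
        variant: genotypes
        for variant, genotypes in merged.items()
        if variant[0] == chrom and start <= variant[1] <= end
    }


if __name__ == "__main__":
    calls = {
        "S1": {("chr1", 100, "A", "G"): "0/1", ("chr2", 50, "C", "T"): "1/1"},
        "S2": {("chr1", 100, "A", "G"): "1/1"},
    }
    samples, merged = merge_samples(calls)
    print(samples, subset_region(merged, "chr1", 1, 1000))
```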
Explore, tag, and annotate phenotypes in the Study Variable Explorer
The Study Variable Explorer on BioData Catalyst Powered by Seven Bridges allows researchers to explore phenotypic variables from the TOPMed data dictionaries in an open access manner. Previously, researchers were limited to searching data dictionary information on dbGaP, and comparing variables across studies was cumbersome. Study Variable Explorer enables researchers to select phenotypic variables from across TOPMed studies and view detailed information and distributions of the variable data. By searching keywords, such as obesity, a researcher can compare like variables within and across hosted datasets, including the number of subjects and descriptions of the variables. Additionally, users can create custom searchable tags and notes for each variable to track their variable selection and pre-harmonization process.
New CWL Tools and Workflows on BioData Catalyst Powered by Seven Bridges
An updated version of the SRA Download and Set Metadata workflow (SRA Toolkit 3.0.0), which downloads the metadata associated with an SRA accession via the SRA Run Info CGI, downloads the FASTQ files (using an on-demand instance), and sets the corresponding metadata.
fastENLOC (v1.0, CWL1.2), a tool that enables integrative genetic association analysis of molecular QTL data and GWAS data. It performs integration of the results from molecular quantitative trait loci (QTL) mapping into genome-wide genetic association analysis of complex traits, with the primary objective of quantitatively assessing the enrichment of the molecular QTLs in complex trait-associated genetic variants and the colocalizations of the two types of association signals.
GATK Somatic SNVs and INDELs (Mutect2) 4.2.5.0, a workflow used for somatic short variant calling. It runs on a single tumor-normal pair or on a single tumor sample and performs additional filtering and functional annotation tasks.
GATK Create Mutect2 Panel of Normals 4.2.5.0, a workflow that creates a panel of normals for use in other GATK workflows. The workflow takes multiple normal sample callsets and passes them to GATK Somatic SNVs and INDELs (Mutect2) 4.2.5.0 in tumor-only mode (although it is called tumor-only, normal samples are given as the input) and additionally collates sites present in two or more samples into a sites-only VCF.
Three apps from the MetaXcan toolkit:
S-PrediXcan for computing associations between omic features and a complex trait starting from GWAS summary statistics.
S-MultiXcan for computing association from predicted gene expression to a trait, using multiple studies for each gene.
MetaMany for serially performing multiple MetaXcan runs on a GWAS study from summary statistics using multiple tissues.
The MetaXcan Workflow for computing associations between omic features and complex traits across multiple tissues. The workflow includes two tools from the MetaXcan framework, MetaMany and S-MultiXcan, and uses summary statistics from a GWAS study together with multiple models that predict expression or splicing quantification.
MaxQuant (v2.0.3.0, CWL1.2), a quantitative proteomics tool designed for analyzing large mass-spectrometric data. It uses a target-decoy search strategy to estimate and control the extent of false positives. Within the target-decoy strategy, MaxQuant applies the concept of posterior error probability (PEP) to integrate multiple peptide properties (e.g., length, charge, number of modifications) together with Andromeda score into a single quantity, reflecting the quality of a peptide spectrum match (PSM).
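The target-decoy strategy mentioned for MaxQuant above can be illustrated with a simple estimate: among peptide spectrum matches scoring at or above a threshold, the number of decoy hits approximates the number of false target hits. The sketch below shows only this basic counting estimate. It is not MaxQuant's implementation, which additionally uses posterior error probabilities, and the function name and example scores are made up for illustration.

```python
from typing import List, Tuple

# Each PSM is (score, is_decoy); higher scores mean better matches.
PSM = Tuple[float, bool]


def estimate_fdr(psms: List[PSM], threshold: float) -> float:
    """Estimate FDR as decoy hits divided by target hits at or above the threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # no accepted target matches, so nothing to call false
    return decoys / targets


if __name__ == "__main__":
    # Made-up scores for illustration only.
    psms = [(95, False), (90, False), (88, True), (80, False), (75, False), (60, True)]
    print(estimate_fdr(psms, threshold=70))
```

Raising the threshold typically lowers the estimated FDR at the cost of accepting fewer matches.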
Dockstore GitHub app support expanded to CWL tools
Researchers can now register their tools to sync automatically with GitHub. Using GitHub Apps, Dockstore can react to changes on GitHub as they are made, keeping Dockstore synced with GitHub automatically. Additional details are available here.
The table below highlights which studies were included in the Q2 2022 data releases. The data is now available for access across the entire ecosystem.
C3PO (COVID-19)
phs002752
C3PO
true
1
TOPMed Freeze 9 - Batch 3
various
various
false
NA
TOPMed Freeze 9 - Batch 4
various
various
false
NA
National Sleep Research Resource (NSRR)
phs002715
NSRR-CFS
true
1
SPIROMICS (topmed: phs001927)
phs001927
SPIROMICS
true
1
BostonBrazil_SCD (TOPMed - phs001599)
phs001599
BostonBrazil_SCD
true
1
TOPMed - PCGC (Version update)
phs001735
PCGC
false
2
PCGC SRA Data
phs000571
true
5
TOPMed Freeze 9 - WHI
various
various
false
NA
MUSIC/CARING (COVID-19)
phs002770
MUSIC/CARING
true
1
Gen3 release notes
Terra release notes
Seven Bridges release notes
PIC-SURE release notes
Dockstore release notes