NHLBI BioData Catalyst® (BDC) Documentation

This is a repository for documentation related to the platforms and services that are part of the BDC ecosystem.

Click here to access the NHLBI BioData Catalyst® (BDC) website.

Welcome to NHLBI BioData Catalyst® (BDC)

Welcome to the BDC ecosystem and thank you for joining our community of practice. The ecosystem offers secure workspaces to support your data analysis, in addition to a number of bioinformatics tools. The ecosystem currently hosts datasets from the Trans-Omics for Precision Medicine (TOPMed) program. There is a lot of information to understand and many resources (documentation, learning guides, videos, etc.) available, so we developed this overview to help you get started. If you have additional questions, please use the links at the very end of this document, under the "Questions" section, to contact us.

About BDC and Our Community

What is BDC?

NHLBI BioData Catalyst® (BDC) is a cloud-based ecosystem that offers researchers data, analytical tools, applications, and workflows in secure workspaces. BDC is a community where researchers can find, access, share, store, and analyze heart, lung, blood, and sleep data. BDC is an NHLBI data repository where researchers share scientific data from NHLBI-funded research, so they and others can reproduce findings and reuse data to advance science.

By increasing access to NHLBI data and innovative analytic capabilities, BDC accelerates reproducible biomedical research to drive scientific advances that can help prevent, diagnose, and treat heart, lung, blood, and sleep disorders.

What are we doing and why does it matter?

By increasing access to the NHLBI’s datasets and innovative data analysis capabilities, the BDC ecosystem accelerates efficient biomedical research that drives discovery and scientific advancement, leading to novel diagnostic tools, therapeutics, and prevention strategies for heart, lung, blood, and sleep disorders.

Who is developing BDC?

The ecosystem is funded by the National Heart, Lung, and Blood Institute (NHLBI). Researchers and other professionals receive funding from the NHLBI to work on the development of the ecosystem, together often referred to as “The BDC Consortium” or “The Consortium” for short. You can refer to a list of partners and platforms powering the ecosystem on the Overview page of the BDC website and a list of the principal investigators is available in our documentation.

Find out the meanings of our terms and acronyms.

Like many professional communities, BDC has adopted terms to help us communicate quickly and efficiently, but these terms can be a challenge for newcomers. To help, we created a BDC glossary of terms and acronyms. If an ecosystem term or acronym is unfamiliar and isn’t in the glossary, contact us so we can give you the information and add it to the glossary.

The BDC Ecosystem and Services

Learn about the platforms and services available in the ecosystem.

The BDC ecosystem features the following platforms and services.

Explore Available Data

  • BioData Catalyst Powered by Gen3 - Hosts genomic and phenotypic data and enables faceted search for authorized users to create and export cohorts to workspaces in a scalable, reproducible, and secure manner.

  • BioData Catalyst Powered by PIC-SURE - Enables access to all clinical data, allows feasibility queries to be conducted, and allows cohorts to be built in real time and results to be exported via the API for analysis.

Analyze Data in Cloud-based Shared Workspaces

  • BioData Catalyst Powered by Seven Bridges - Collaborative workspaces where researchers can find and analyze hosted datasets (e.g. TOPMed) as well as their own data by using hundreds of optimized analysis tools and workflows in CWL, as well as JupyterLab and RStudio for interactive analysis.

  • BioData Catalyst Powered by Terra - Secure collaborative place to organize data, run and monitor workflow (e.g. WDL) analysis pipelines, and perform interactive analysis using applications such as Jupyter Notebooks and the Hail GWAS tool.

Use Community Tools on Controlled-access Datasets

  • Dockstore - Catalog of Docker-based workflows (from individuals, labs, organizations) that export to Terra or Seven Bridges.

The NHLBI BioData Catalyst website provides further details about the platforms and services available in the ecosystem. We encourage you to create accounts on all the platforms as you get to know BioData Catalyst.

Ecosystem Access, Hosted Data, and System Services

How does data access work?

The BioData Catalyst ecosystem manages access to the hosted controlled data using data access approvals from the NIH Database of Genotypes and Phenotypes (dbGaP). Therefore, users who want to access a hosted controlled study on the ecosystem must be approved for access to that study in dbGaP.

How do I login?

Users log into BioData Catalyst platforms with their eRA Commons credentials (see Understanding eRA Commons Accounts), and authentication is performed by iTrust. Every time a user logs in, the ecosystem checks the user's credentials to ensure they can only access the data for which they have dbGaP approval.

While all of the platforms within BioData Catalyst use eRA Commons credentials and iTrust for authentication and authorization, there are some slight differences between the platforms when getting set up:

  • BioData Catalyst Powered by Gen3 - Users do not set up usernames on Gen3. The first time you log in, select “Login from NIH”, then enter your eRA Commons credentials at the prompt. This ‘User Identity’ is used to track the user on the system.

  • BioData Catalyst Powered by PIC-SURE - Similar to Gen3, user identities are used - researchers log into the system by selecting “Log in with eRA Commons.”

  • BioData Catalyst Powered by Seven Bridges - Users set up platform accounts. The first time on the system, users select to “Create an account” and then proceed with entering their eRA Commons credentials. The user is then prompted to fill out a registration form with their name, email, and preferred username. Users are also asked to acknowledge that they have read the Privacy Act notice and then they can proceed to the platform.

  • BioData Catalyst Powered by Terra - Users initially log in using Google credentials and are asked to agree to the Terms of Service and Privacy Act notice. User activity is tracked via the Google credentials, but users can link their eRA Commons credentials to the account to get access to hosted datasets.

Details about how data access works on the NHLBI BioData Catalyst ecosystem are on the website.

How do I check which data I can access?

We recommend users first check their access to data before logging in. Do this by going to the Accessing BioData Catalyst page and clicking on the “Check My Access” button. Once you confirm your data access, go to the Platforms and Services page from which you click on the “Launch” hyperlink for the platform or service you wish to use. Platforms and services have login/sign in links on their pages that bring you to the pages on which you enter your eRA Commons credentials. Documentation on checking your access to data is also available.

What data are available in the ecosystem?

The NHLBI BioData Catalyst currently hosts a subset of datasets from TOPMed including phs numbers with genomic data and related phs numbers with phenotype data. You can find information about which TOPMed studies are currently hosted on the Data page of the website as well as in the Release Notes.

Harmonized data available.

There are limited amounts of harmonized data available to users with appropriate access at this time. The TOPMed Data Coordinating Center curation team has produced forty-four (44) harmonized phenotype variables from seventeen (17) NHLBI studies. Information about the 17 studies and the 44 variables can be found in the BioData Catalyst Powered by PIC-SURE User Guide.

Bring your own data and workflows into the system.

We allow researchers to bring their own data and workflows into the ecosystem to support their analysis needs. Researchers can bring their own datasets into BioData Catalyst Powered by Seven Bridges and BioData Catalyst Powered by Terra. Users can also bring their own workflows to the system. Users can either add workflows to Dockstore in CWL or WDL, or they can create CWL tools directly on BioData Catalyst Powered by Seven Bridges and develop custom workflows for use on BioData Catalyst Powered by Terra.

Learn about genome-wide association studies and genetic association testing on BioData Catalyst.

Walk through our self-paced genome-wide association study and genetic association testing tutorials.
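For orientation, the sketch below shows the general shape of a single-variant association test of the kind those tutorials walk through, using the Hail library mentioned above as available in BioData Catalyst Powered by Terra notebooks. It is a minimal illustration, not taken from the tutorials: the file paths, the phenotype table layout, and the trait and covariate names are hypothetical.

```python
import hail as hl

hl.init()

# Hypothetical inputs: genotypes in VCF form and a tab-separated phenotype file
# keyed by sample ID. The BDC tutorials use hosted TOPMed-style data instead.
mt = hl.import_vcf("gs://example-bucket/genotypes.vcf.bgz", reference_genome="GRCh38")
pheno = hl.import_table("gs://example-bucket/phenotypes.tsv", impute=True, key="sample_id")

# Attach phenotypes to samples, then run a per-variant linear regression
# with an additive genotype coding and an intercept plus one covariate.
mt = mt.annotate_cols(pheno=pheno[mt.s])
gwas = hl.linear_regression_rows(
    y=mt.pheno.trait,                # hypothetical quantitative trait column
    x=mt.GT.n_alt_alleles(),         # 0/1/2 alternate-allele counts
    covariates=[1.0, mt.pheno.age],  # intercept term plus a hypothetical covariate
)

gwas.show(5)  # inspect the first few rows of effect sizes and p-values
```

The tutorials cover the additional steps a real analysis requires, such as quality control, relatedness handling, and mixed-model methods; the snippet is only meant to make the workflow concrete.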

Share your workflows.

We encourage users to publish their workflows so they can be used by other researchers working in the NHLBI BioData Catalyst ecosystem. Share your workflows via Dockstore.

Costs and cloud credits.

BioData Catalyst hosts a number of datasets available for analysis to users with appropriate data access approvals. Users are not charged for the storage of these hosted datasets; however, if hosted data is used in analyses users incur costs for computation and storage of derived results. Cloud credits are available on the system, and you can learn more here.

BioData Catalyst Publications

Let us know about your publications and see how you can cite us.

If you are writing a manuscript about research you conducted using NHLBI BioData Catalyst, please use the citation available here.

Immediately after learning your manuscript has been accepted, please email BDCatalystOutreach@nih.gov to let us know. Please include in your email the manuscript title, the name of the publication that accepted your manuscript, and information about pre-publication posting (if it will take place), along with your name and contact information.

Questions?

Learn more, ask questions, or request help.

Answers to frequently asked questions are available on the website, as are many resources that can be found under Learn & Support. You can also use a form to Contact Us, and if you aren’t sure which selections to make on the form, please see our help desk directory.

BDC Glossary

Glossary of terms used in the context of the BDC Consortium and platform.

  • Agile Development

    Agile software development is an approach to software development under which requirements and solutions evolve through the collaborative effort of self-organizing and cross-functional teams and their customer(s)/end user(s).

  • Alpha Users

    A small group of users who are willing to tolerate working in a system that isn’t fully developed, providing detailed feedback and engaging in follow-up discussions.

  • [Amazon] EFS

    [Amazon] Elastic File System: a simple, scalable, elastic file system for Linux-based workloads for use with AWS Cloud services and on-premises resources.

  • Ambassadors

    A small group of experts who represent the personas featured within the priority User Narratives. For their time and help, Ambassadors will receive early access to the BDC platform, free compute time, a monetary fee for their time, and coverage of relevant travel expenses.

  • App

    1. In Seven Bridges, an app is a general term to refer to both tools and workflows.

    2. App may also refer to persistent software that is integrated into a platform.

  • API

    Application Programming Interface. API technologies serve as software-based intermediaries to exchange data.

  • AWS

    Amazon Web Services. A provider of cloud services available on-demand.

  • BagIt

    BagIt is a hierarchical file packaging format for storage and transfer of arbitrary digital content.

  • BDC3

    BDC Coordinating Center

  • Beta Users

    A slightly larger group than the alpha users who are not as tolerant to a difficult/clunky environment but understand that the version they are using is not polished and they need to give feedback.

  • Beta-User Training

    Once the platform is available to a broader audience, we will support freely-accessible online training for beta-users at any time.

  • Carpentries Instructor Training Program

    Ambassadors attend this training program to become BDC trainers.

  • CCM

    Change Control Management; the systematic approach to managing all changes made to a document or process. Ensures no unnecessary changes are made, all changes are documented, and a process exists for implementing approved change.

  • CIO

    Chief Information Officer

  • Cloud Computing

    Internet-based computing, wherein computing power, networking, storage or applications running on computers outside an organization are presented to that organization in a secure, services-oriented way.

  • Components

    Software units that implement a specific function or functions and which can be reused.

  • ConOps

    Concept of Operations

  • Consortium

    A collection of teams and stakeholders working to deliver on the common goals of integrated and advanced cyberinfrastructure, leading-edge data management and analysis tools, FAIR data, and HLBS researcher engagement.

  • Containers

    A standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another (for example, Docker).

  • Command

    In workflow writing, the command specifies the literal command line run (akin to what you would run in the terminal).

  • COPDGene

    Chronic Obstructive Pulmonary Disease (COPD) Gene

  • Cost Monitoring (level)

    At the Epic level, the Coordinating Center will facilitate this process by developing reporting templates (see example in the PM Plan, Financial Management) for distribution to the teams. The BDC teams will complete these templates and send them directly to NHLBI. Each team is responsible for tracking its finances based upon the award conditions and for providing status updates as requested to NHLBI.

  • CRAM File

    Compressed columnar file format for storing biological sequences aligned to a reference sequence. Designed to be an efficient reference-based alternative to the Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats. It optionally uses a genomic reference to describe differences between the aligned sequence fragments and the reference sequence, reducing storage costs. Additionally, each column in the SAM format is separated into its own blocks, improving the compression ratio. CRAM files are typically 30 to 60% smaller than BAM files, depending on the data held within them (from Wikipedia).

  • CSOC Alpha

    Common Services Operations Center (CSOC): operates cloud, commons, compliance and security services that enable the operation of data commons; has ATO and hosts production system.

  • CSOC Beta

    Development/testing; Real data in pilot (not production) that can be accessed by users

  • Common Workflow Language (CWL)

    Simple scripting language for describing computational workflows for performing sequential operations on data. CWL is a way to describe command line tools and connect them together to create workflows. CWL is well suited for describing large-scale workflows in cluster, cloud and high performance computing environments where tasks are scheduled in parallel across many nodes.

  • DAC

    Data Access Committee: reviews all requests for access to human studies datasets

  • DAR

    Data Access Request

  • Data Access

    A process that involves authorization to access different data repositories; part of a User Narrative for the December 2020 release goal; a Work Stream; a PM Plan constraint: NHLBI, as the project sponsor, will identify a process to enable data access by the BDC team members and for research users.

  • Data Commons

    Provides tools, applications, and workflows to enable computing large scale data sets in secure workspaces.

  • Data Repository Service (DRS)

    Generic interface (API) to data repositories so data consumers, including workflow systems, can access data in a single, standardized way regardless of where it’s stored or how it’s managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by the ID (from GA4GH).

  • Data Steward

    Members of the TOPMed and COPDGene communities who are working with BDC teams.

  • dbGaP

    Database of Genotypes and Phenotypes

  • DCPPC

    Data Commons Pilot Phase Consortium. The Other Transaction Awardees, Data Stewards, and the NIH.

  • Decision Tree

    A decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility

  • Deep Learning

    A machine learning method based on neural networks to learn from data through training to recognize patterns in the data.

  • Deliverables

    Demonstrations and products.

  • Demos

    Activities and documentation resulting from the DCPPC to build, test and demonstrate completion of goals of the Data Commons Pilot Phase.

  • DEV Environment

    Set of processes and programming tools used to create the program or software product

  • DMI

    Data Management Incident

  • Docker

    Software for running containers, packaged, portable units of code and dependencies that can be run in the same way across many computers. See also Containers.

  • Dockerfile

    A text document that contains all the commands a user could call on the command line to assemble an image.

  • Dockstore

    An open platform developed by the Cancer Genome Collaboratory and used by the GA4GH for sharing Docker-based tools described with the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL)

  • DOI

    Digital Object Identifier; a code used to permanently and stably identify (usually digital) objects. DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.

  • DUO

    Data Use Ontology - a GA4GH standard for automating access (API) to human genomics data (https://github.com/EBISPOT/DUO)

  • DUOS

    Data Use Oversight System, https://duos.broadinstitute.org/

  • Ecosystem

    A software ecosystem is a collection of processes that execute on a shared platform or across shared protocols to provide flexible services. Example: The "BDC Ecosystem" - inclusive of all platforms and tools

  • EEP

    External Expert Panel. A group of experts who provide guidance and direction to NIH about the program.

  • Epic

    A very large user story which can be broken down into executable stories

    *NHLBI’s cost-monitoring level

  • eRA Commons

    Designated ID provider for whitelist

  • External Expert Panel

    An independent body of experts that inform and advise the work of the BDC Consortium.

  • FAIR

    Findable, Accessible, Interoperable, Reusable.

  • Feature

    A functionality at the system level that fulfills a meaningful stakeholder need

    *Level at which the CC coordinates

  • FireCloud

    Broad Institute secure cloud environment for analytical processing, https://software.broadinstitute.org/firecloud/

  • FISMA moderate environment

    Federal Information Security Modernization Act of 2014, amends the Federal Information Security Management Act of 2002 (FISMA), see https://www.dhs.gov/fisma

  • FS

    Full Stack

  • GA4GH

    Global Alliance for Genomics and Health

  • GA4GH APIs

    The Genomic Data Working Group is a coalition assembled to create interoperability standards for storing and sharing genomic data. The GA4GH Genomics API offers interoperability for exchanging genomic data between various platforms and organizations by sending simple HTTP requests through a JSON-equipped RESTful API.

  • GCP

    Google Cloud Platform

  • GCR

    Governance, Compliance, and Risk

  • Gen3

    Gen3 is open-source software, licensed under the Apache license, that can be used for setting up, developing, and operating data commons.

  • GitHub

    An online hub for storing and sharing computer programs and other plain text files. We use it for storage, hosting websites, communication and project management.

  • Gold Master

    A gold master, or GM, is the final version of software or data ready for release to production; a master version from which copies can be made.

  • GWAS

    Genome-wide Association Study

  • HLBS

    Heart, Lung, Blood, Sleep

  • Identity Providers

    A system entity that creates, maintains, and manages identity information for principals while providing authentication services to relying applications within a federation or distributed network; identity providers offer user authentication as a service

  • Interoperability

    The ability of data or tools from multiple resources to effectively integrate data, or operate processes, across all systems with a moderate degree of effort.

  • Instance

    In cloud computing, refers to a virtual server instance from a public or private cloud network.

  • Image

    In the context of containers and Docker, an image is the packaged, at-rest form of the software from which containers are run.

  • IP

    BDC Implementation Plan; outlines how the various elements from the planning phase of the BDC project will come together to form a concrete, operationalized BDC platform.

  • IRB

    Institutional Review Board; the entity within a research organization that reviews and approves research protocols and clinical research protocols to protect human and animal subjects.

  • IRC

    Informatics Research Core

  • ISA

    Interoperability Service Agreement

  • ITAC

    Information Technology Applications Center

  • Jupyter Notebooks

    A web-based interactive environment for organizing data, performing computation, and visualizing output.

  • Linux

    An open source computer operating system

  • Metadata

    Data about other data

  • Milestone

    Milestones mark specific progress points on the development timeline and can be invaluable in measuring and monitoring the evolution and risk of a program. © Scaled Agile, Inc.

  • MSD

    Minimum set of documents

  • MVP

    Minimum viable product

  • NHLBI

    National Heart, Lung, and Blood Institute

  • NIH

    National Institutes of Health

  • NIST Moderate controls

    NIST 800-53 - A collection of security controls and assessment procedures that both U.S. Federal and non-governmental organizations can apply to their information systems, policies, and procedures.

  • OTA

    Other Transaction Authority - the mechanism of award that NHLBI chose because it provides a degree of flexibility in the scope of the work that is needed to advance this type of high risk/high reward project

  • PI

    Principal Investigator

  • Platform

    A piece of the BDC ecosystem. Examples: Terra, Gen3, Seven Bridges, etc.

  • PM

    Project Manager

  • PMP

    BDC Project Management Plan; breaks down the implementation of BDC from the perspective of the project managers involved in the project including details on roles, specific milestones, and the project schedule.

  • PO

    Program Officer

  • Portable Format for Biomedical Data (PFB)

    Avro-based serialization format with a specific schema to import, export, and evolve biomedical data. Specifies metadata and data in one file. Metadata includes the data dictionary, ontology references, and relations between nodes. Supports versioning and backward and forward compatibility. A binary format.

  • Portfolio for Jira

    Software-as-a-Service project management tool, used to track, roadmap, and visualize various project metrics.

  • Python

    Open source programming language, used extensively in research for data manipulation, analysis, and modeling

  • Quality Assurance

    The planned and systematic activities implemented in quality management so that quality requirements for a product or service satisfy stated goals and expectations.

  • Quality Control

    The operational techniques and activities aimed at monitoring and measuring work processes and eliminating the causes of unsatisfactory outputs.

  • RACI

    Responsible, Accountable, Consulted and Informed; tool that can be used for identifying roles and responsibilities during an organizational change process; BDC RACI

  • Researcher Auth Service (RAS)

    A service to be provided by NIH's Center for Information Technology to facilitate access to NIH’s open and controlled data assets and repositories in a consistent and user-friendly manner. The RAS initiative is advancing data infrastructure and ecosystem goals defined in the NIH Strategic Plan for Data Science.

  • RFC

    Request for Comment: A process that documents and enables effective interactions between stakeholders to support shared decision making.

  • Risk Register

    A tool used to continuously identify risk, risk response planning and status updates throughout the project lifecycle. This project risk register is the primary risk reporting tool, and is located in the Project Management Plan.

  • SC

    Steering Committee

  • Scientific use case

    Defined in this project as an analysis of data from the designated sources that has relevance and value in the domain of health sciences, and is generally implementation- and software-agnostic.

  • SF or SFP

    BDC Strategic Framework [Plan]; defines what the BDC teams have accomplished up to this point, what we plan to accomplish in a timeline fashion, and milestones to track and measure implementation.

  • SFTP

    Secure File Transfer Protocol

  • Software Developers Kit

    A set of software development tools that allows the creation of applications for a certain software package, software framework, hardware platform, computer system, or similar development platform

  • Sprints

    Term of art used in software generation, referring to short, iterative cycles of development, with continuous review of code through daily builds and end-of-sprint demos

  • Stack

    Term of art referring to a suite of services that run in the cloud and enable ubiquitous, convenient, on-demand access to a shared pool of configurable computing resources.

  • Steering Committee

    Responsible for decision-making and communication in BDC.

  • STRIDES

    Science & Technology Research Infrastructure for Discovery, Experimentation, and Sustainability

  • Task

    In workflow writing, the term task encompasses all of the information necessary to execute a command, such as specifying input/output files and parameters.

  • Team

    Groups of people led by a Principal Investigator (PI), or PIs, who will complete milestones and produce deliverables. Each group has been assigned a name represented by an element of the periodic table.

  • Tiger Teams

    A diversified group of experts brought together to investigate, solve, build, or recommend possible solutions to unique situations or problems. Populated with mature experts who know what's at stake, what needs to be done, and how to work well with others; their strengths are diversity of knowledge, a single focus or purpose, cross-functional communications, decision-making sovereignty, and organizational agility.

  • Tool

    In Common Workflow Language, the term tool specifies a single command. This definition is not as discrete in other workflow languages such as WDL.

  • Tool Registry Service (TRS)

    The GA4GH Cloud Work Stream has released a standard API for exchanging tools and workflows to analyze, read, and manipulate genomic data. The Tool Registry Service (TRS) API is one of a series of technical standards from the Cloud Work Stream that together allow genomics researchers to bring algorithms to datasets in disparate cloud environments, rather than moving data around.

  • TOPMed

    Trans-Omics for Precision Medicine. One of the primary data sets of the DCPPC.

  • TOPMed DCC

    TOPMed Data Coordinating Center

  • Trans-cloud

    A provider-agnostic multi-cloud deployment architecture.

  • User Narrative

    Descriptions of a user interaction experience within the system from the perspective of a particular persona. User Narratives are further broken down into Features, Epics, and User Stories. Currently formulated into rough 6-month timelines to benchmark progress.

  • User story

    A description of a software feature from a technical/process-oriented perspective; a backlog item that describes a requirement or functionality for a user

    *Finest level of PM Monitoring

  • Variant Call Format (VCF)

    File format for storing gene sequence variations. The format has been developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data such as General feature format (GFF) stored all of the genetic data, much of which is redundant because it will be shared across the genomes. By using the variant call format only the variations need to be stored along with a reference genome. There is also a Genomic VCF (gVCF) extended format, which includes additional information about "blocks" that match the reference and their qualities (from Wikipedia). See http://www.internationalgenome.org/wiki/Analysis/vcf4.0/.

  • VDS

    Virtual Dedicated Server; a composite of complete server hardware, along with the operating system (OS), powered by a remote access layer that allows end users to globally access their server via the Internet.

  • VPC

    Virtual Private Cloud

  • Whitelist

    A security measure to permit only an approved list of entities. We recommend instead using the term "allow list".

  • Workflow

    A sequence of processes, usually computational in this context, through which a user may analyze data.

  • Workflow Description Language (WDL)

    A way to specify data processing workflows with a human-readable and writeable syntax. Used to define complex analysis tasks, chain them together in workflows, and parallelize their execution.

  • Workspace

    Areas to work on/with data within a platform. Examples: projects within Seven Bridges

  • Workstream

    A collection of related features; orthogonal to a User Narrative

  • Wrapping

    The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.

  • Virtual Machine (VM)

    An isolated computing environment with its own operating system.

Who We Are

Our Culture: Though the primary goal of the BDC project is to build a data science ecosystem, at its core, this is a people-centric endeavor. BDC is also building a community of practice working collaboratively to solve technical and scientific challenges in biomedical science.

Principal Investigators (PIs):

  • Stan Ahalt, PI RENCI (Coordination Center)

  • Rebecca Boyles, Co-PI RTI (Coordination Center)

  • Paul Avillach, PI HMS (Team Carbon)

  • Kira Bradford, Co-PI RENCI (Team Helium)

  • Steve Cox, Co-PI RENCI (Team Helium)

  • Brandi Davis-Dusenbery, PI Seven Bridges (Team Xenon)

  • Robert Grossman, PI UChicago (Team Calcium)

  • Ashok Krishnamurthy, PI RENCI (Team Helium)

  • Benedict Paten, PI UCSC (Team Calcium)

  • Anthony Philippakis, PI Broad Institute (Team Calcium)

Note: BDC collaboration is organized around teams based on elements in the periodic table. There are additional modes of collaboration in BDC including Tiger Teams, Working Groups, Steering Committee, and Publications.

More about who we are and the partners empowering our ecosystem can be found on the BioData Catalyst About page.

Citation and Acknowledgement

How to cite and acknowledge NHLBI BioData Catalyst® (BDC)

For citation of BDC:

National Heart, Lung, and Blood Institute, National Institutes of Health, U.S. Department of Health and Human Services (2020). The NHLBI BioData Catalyst. Zenodo. https://doi.org/10.5281/zenodo.3822858

To acknowledge BDC, use:

The authors wish to acknowledge the contributions of the consortium working on the development of the NHLBI BioData Catalyst® (BDC) ecosystem.

NHLBI BioData Catalyst Ecosystem Security Statement

BDCatalyst-RFC-#: 11
BDCatalyst-RFC-Title: NHLBI BioData Catalyst Ecosystem Security Statement
BDCatalyst-RFC-Type: Consensus Building
Name of the person who is to be Point of Contact: Sarah Davis
Email of the person who is to be Point of Contact: sdavis@renci.org
Submitting Team: BDC3/NHLBI
Requested BDCatalyst-RFC posting start date: 6/14/2021
Date Emailed for consideration: 6/14/2021
BDCatalyst-RFC-Status: Comment only
URL Link to this document: https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/community/request-for-comments/nhlbi-biodata-catalyst-ecosystem-security-statement
URL Link to the website: https://www.biodatacatalyst.org/collaboration/rfcs/bdcatalyst-rfc-11/
License: This work is licensed under a CC-BY-4.0 license.

Overview

The purpose of this RFC is to provide the NHLBI BioData Catalyst Consortium and users of the NHLBI BioData Catalyst ecosystem with a clear statement on security mechanisms of the ecosystem that protect the confidentiality, integrity, provenance, and availability of the hosted data as well as any data that may be uploaded using the ecosystem’s “Bring Your Own Data” (BYOD) functionality.

Figure 1. The NHLBI BioData Catalyst ecosystem leverages separately developed and managed platforms to maximize flexibility for users based on their research needs, expertise, and backgrounds. Utilizing multiple Authorizations to Operate (ATO), these platforms combine to provide secure, cloud-based workspaces, user authentication and authorization, search, tools and workflows, applications, and new innovative features to address community needs.

NHLBI BioData Catalyst Ecosystem Security Statement

The NHLBI and the BioData Catalyst Consortium recognize the importance of protecting both the privacy and security of the data and respecting the consent of the study participants whose data are stored within the BioData Catalyst ecosystem. Tackling these issues produces some challenges beyond those faced by most Federal Information Systems. The BioData Catalyst Consortium has implemented many innovative approaches to enable compliance and ensure that users understand their responsibility to protect data as articulated in specific Data Use Agreements (DUA). These approaches and controls work to protect the confidentiality, integrity, and availability of the data; the privacy of the study participants who have contributed data; and data that may be uploaded to BioData Catalyst using the ecosystem’s “Bring Your Own Data” (BYOD) functionality. While the same general security controls are applied to both system and BYOD data, BYOD data is further protected because the ecosystem provides access only to the data’s uploaders and their designated collaborators.

From a Federal Information Security Modernization Act (FISMA) perspective, the BioData Catalyst ecosystem is a set of software systems with distinct security boundaries. Each system owner holds an Authority to Operate (ATO) issued by the NIH. The ATO is the result of a rigorous Security Assessment and Authorization (SA&A) process and third party assessment consistent with guidance from the National Institute of Standards and Technology (NIST). The ecosystem operates via a set of Interconnection Security Agreements (ISA) (Reindl 1979) and utilizes several existing components of security infrastructure (Bridges 2017, Gutiérrez-Sacristán et al. 2018) developed for other NIH platforms. Where the documentation provided as part of the SA&A process describes how security controls are implemented based on the NIST Special Publication 800-53r4 (see Endnote), the ISAs describe the permitted exchange of data and establish ecosystem-wide incident response, logging and auditing expectations that enable the consortium to respond in a unified manner to any suspected cybersecurity incident. The SA&A documentation provides for regular evaluation of the security of the component systems including regular scanning for vulnerabilities and the conduct of an annual penetration test. This level of security represents a baseline, and the BioData Catalyst ecosystem will extend protections over time.

Where the processes, policies, and technical controls protect confidentiality, integrity, and availability of data in accordance with Federal statute and regulation, there are additional ways to ensure that data is used in a manner consistent with study participants’ wishes, as represented by the consent form participants sign when enrolling in a specific study. Respect for these consents is critical to maintaining the public’s trust and requires additional policy, process, and technical controls. The respect for consent in NHLBI BioData Catalyst is enforced using normative NIH policies and processes for data sharing and using the existing infrastructure provided by the National Center for Biotechnology Information’s (NCBI) Database of Genotypes and Phenotypes (dbGaP). All NHLBI-provided data within the NHLBI BioData Catalyst ecosystem are registered in dbGaP; in this process, data are assigned “consent groups” that describe in a machine-readable format the parameters of the consent for the data. These range from the most expansive “General Research Use” to more restrictive, such as only allowing for research outcomes related to “Health Medical Biomedical topics” or even to specific diseases, such as Chronic Obstructive Pulmonary Disease. Further, while secondary analysis of data is not considered human subjects research as described in the Common Rule (45 CFR Part 46), some datasets require the review of a research proposal by an Institutional Review Board (IRB) or a Letter of Collaboration (LOC) with the originating study Principal Investigator, as determined by the informed consents or by special considerations that the submitting institution has identified as needed. These measures provide additional protection for datasets with particular sensitivity or special criteria for use.

For instance, IRB review is required when 1) the informed consents signed by the study participants state that IRB oversight for secondary use of the data is required, and/or 2) the study IRB of record determines that the data may contain sensitive information that requires IRB oversight for secondary research. For collaboration letters, the informed consents indicate that researchers outside the study who make secondary use of the data must work with the study, and therefore formal collaborations need to be put in place. While these are rarely used, they provide additional protection under special circumstances, such as where an indigenous population or sovereign nation requires direct control of how its data is used. Because consent is expressed at the individual level, there may be a variety of consents within a study, either because the study offered choices to its participants or because the consent evolved over an extended longitudinal study, such as the Framingham Heart Study. These variations in consent are reflected as multiple “consent groups” within a study and may mean that an investigator receives permission for only a subset of study participants.

BioData Catalyst uses telemetry provided by dbGaP to enforce compliance with consents. Accordingly, users of BioData Catalyst can see only the data for which they have completed a dbGaP Data Access Request (DAR) and received approval from an NIH Data Access Committee (DAC). DAC approval results in a Data Use Agreement (DUA) describing any Data Use Limitations asserted by the originating study Principal Investigator, including ensuring that any requirements for an IRB review or Letter of Collaboration are met. On the BioData Catalyst ecosystem, the Gen3 “Fence” service, developed and operated by the University of Chicago and utilized by other NIH platforms and many other non-federal data commons, ensures enforcement of data access requirements. In order to ensure that the NHLBI maintains control over the use of the data and has the ability to audit this use, the BioData Catalyst policy does not allow download of source data outside the cloud ecosystem. Instead, BioData Catalyst provides access to data in approved cloud environments where computation occurs, as documented in an ISA, via standard Application Programming Interfaces (APIs) that use the authentication and authorization provided by the NIH Researcher Auth Service (RAS) (REF), together with Fence, to protect access and to allow for monitoring and auditing of appropriate data use (e.g., within the scope of the approved project). These APIs include implementations of the protected GA4GH Data Repository Service (DRS) for access to data objects such as genomic data, and protected PIC-SURE interfaces for access to phenotypic, genotypic variant, and electronic health record data. The use of these APIs will, once fully implemented, enable other trusted cloud-based systems that meet equivalent security requirements to access data stored within BioData Catalyst for analysis using those systems’ tools, without the data being downloaded outside the security perimeters of the systems. This commitment to the use of APIs, together with the requirement that data stay within the designated security boundaries, is a critical component of making NHLBI data FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016, Corpas et al. 2018), while also ensuring confidentiality of data and respect for consent, regardless of the platform where the data is analyzed. BioData Catalyst has extended this model through the use of the NIH Researcher Auth Service (RAS) to provide authentication and authorization controls, which, together with the use of secure APIs, is enabling secure interoperability with other trusted NIH-funded platforms such as the NHGRI’s AnVIL and the Gabriella Miller Kids First Data Resource.
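To make the API-based access model concrete, the sketch below shows how a client resolves a GA4GH DRS identifier into a time-limited URL using the two standard DRS endpoints (object lookup, then access-method exchange). It is an illustrative sketch only: the server host, object ID, and token are hypothetical placeholders, and on BioData Catalyst the token comes from the RAS/Fence authentication flow rather than being hard-coded.

```python
import requests

# Hypothetical placeholders: a real deployment supplies the DRS host, the
# DRS object ID of a hosted file, and a bearer token from the auth flow.
DRS_BASE = "https://drs.example.org/ga4gh/drs/v1"
OBJECT_ID = "dg.EXAMPLE/00000000-0000-0000-0000-000000000000"
TOKEN = "<bearer token from RAS/Fence>"

headers = {"Authorization": f"Bearer {TOKEN}"}

# 1. Resolve the DRS object to its metadata (name, size, checksums, access methods).
resp = requests.get(f"{DRS_BASE}/objects/{OBJECT_ID}", headers=headers)
resp.raise_for_status()
obj = resp.json()
print(obj["name"], obj["size"])

# 2. Exchange an access method's access_id for a time-limited URL.
#    (Some access methods embed an access_url directly instead of an access_id.)
method = obj["access_methods"][0]
access = requests.get(
    f"{DRS_BASE}/objects/{OBJECT_ID}/access/{method['access_id']}",
    headers=headers,
)
access.raise_for_status()

# 3. The returned URL is what a workflow or notebook running in an approved
#    cloud environment reads from; this metadata exchange does not move the
#    source data outside the ecosystem's security boundary.
print(access.json()["url"])
```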

Endnote

While NIST-800-53r4 is a thorough standard for many kinds of systems, it has some gaps for the most modern systems encountered. There are additional standards to apply, either from the High column of NIST-800-53r4 or as “best practice” addendums.

In particular, the Standard does not give guidance on modern application security (“appsec”), especially when the infrastructure is serverless and the entire security surface is the application itself. For instance, there are requirements for regular scanning, but scanning tools do not scan APIs or modern Single Page Apps (SPAs) sufficiently. A “web app scan” with tools like IBM’s AppScan would return no vulnerabilities on an API without actually testing it, yet running such a scan would satisfy RA-5 of NIST-800-53. Nor does the standard require that the Infrastructure-as-a-Service layer (AWS, GCP, Azure) abide by a continual scanning posture for misconfiguration; only networks and VMs are specified. That is one example of where NIST-800-53r4 doesn’t work for modern applications.

There is also no guidance from NIST at large about running an API where external parties build “clients” of that API. Does allowing clients extend the security boundary, such that all third-party applications must be evaluated as part of it? Is there a different consideration for these third-party applications? The standard is silent there. Companies like Apple and Google, through their app stores, require all third-party apps to undergo evaluation against their own security standards, and such an approach might be applicable here. NHLBI might consider adding some extra controls, what the All of Us Research Program calls FISMA+, for enhanced security.


Data Access

Getting Started

Documentation for getting started on the NHLBI BioData Catalyst® (BDC) ecosystem.

Contributing User Resources to BDC

The BDC user community is essential to advancing science with new and exciting discoveries and informing the development of the ecosystem and its infrastructure. Members of the BDC user community learn how to explore the hosted data, use the services, and employ its tools in valuable ways that even the developers may not anticipate. Therefore, we actively invite user resource contributions to be shared with the community.

Types of Resources

Consider supporting fellow ecosystem users in one of the following ways:

  • Written Documentation: Develop step-by-step guides, FAQs, checklists, and so on. Include screenshots to support user understanding.

  • Videos: Record a shortcut, tip, or process you think would be helpful to other users. Keep videos short by dividing larger processes into smaller segments and recording separate videos for each.

  • Respond to inquiries: Answer questions posed in the BDC Forums. Forum content with significant engagement may get incorporated into written documentation or made into videos.

Note

All materials must comply with privacy policies. Be certain to redact any patient information from all content and protect study participants' privacy by not including personally identifiable, confidential, sensitive, or personal health information (for example, blur screenshots that contain data).

Decide How to Share What You Know

Experienced users who want to share their tips and tricks should consider the following questions.

  • Did someone already share my tip? Look through the resources already available to users before investing your time and energy into creating a new one.

    • Check the Frequently Asked Questions page on the BDC website

    • View the “Learn” and “Documentation” links available on the BioData Catalyst Services webpage.

    • View the BDC Documentation hosted on GitBook.

    • Explore the links to platform-specific documentation, videos, FAQs, community forums, blogs, tutorials, and upcoming events on the BioData Catalyst Learn webpage

    • Check out the videos on the BDC YouTube channel.

  • Which format best suits your resource? Ask yourself, "Would I prefer to watch this on video or have a step-by-step guide to help me?" Then ask yourself which you think other users would prefer. Figuring out which you'd prefer is a great place to start because you are the one who identified the tip. But remember that you are creating something to help other people whose preferences will determine whether a resource gets used.

  • Is my tip complex, or does it require several steps? If so, a written how-to guide will probably be easier to follow than a video because someone watching a video may need to stop and restart it often. Still, visual aids will be helpful, so consider using screenshots in your how-to guide.

  • Is the guidance I want to share relatively straightforward, but it requires clicking through several pages/places? If so, a short video could be the best way to share your tip. Finding buttons or links can be much easier if shown rather than described.

  • If I create a video and make sure to go slowly enough that someone can follow along, will it be longer than 15 minutes? If so, creating a video may not be the right format, or breaking down the content into shorter (more digestible) videos may be preferable.

  • Am I comfortable following the requirements outlined for the user-generated video tutorials? If not, please create written documentation (e.g., a how-to guide).

  • Do I want to provide help in almost-real-time without needing to formally draft a document or record a video? Visit the Community Forum often to provide answers to questions posed by other users or even just post your tip.

Creating and Sharing Your Contribution

Once you decide upon the best way to share what you learned, you'll need to create your contribution and then share it.

  • For a quick tip that you want to distribute swiftly, draft something short that you can easily post to the Community Forum. The following is an example of a quick tip for using PIC-SURE’s Data Access Table:

    • In PIC-SURE, did you know you can use the search bar in the Data Access Table to find studies? Instead of scrolling through the table and looking at the list of available studies manually, you can search for studies. An example could be “MESA” for a specific study name, or a phenotype like “Sickle Cell” to find all sickle cell related studies. It seems obvious, but I’m not sure how many other users are aware of this, and I found it really helpful!

  • For Written Documentation, draft your suggestions and include screenshots to help lead users through the process you describe. Once complete, submit the file to BDCatalystOutreach@nih.gov for review and posting to the BDC GitBook. Note that we accept Google Docs (with at least “suggesting” edit access preferred) and Microsoft Word formats; PDFs are not accepted.

  • For videos, review the User-Generated Videos portion of the BDC Video Content Guidance page. By submitting a video, you agree to those conditions. Once your video is uploaded to your YouTube channel, email the link to BDCatalystOutreach@nih.gov for consideration for linking from the BDC YouTube channel as well.

Finding User-Generated User Resources

  • Forum messages will post directly in the community forums.

  • Written documentation will live in the BDC Documentation, hosted on GitBook.

  • User-generated videos will be linked in the BDC YouTube Channel.

Dug Semantic Search

Step-by-step guidance on using Dug Semantic Search: efficiently and effectively perform and interpret a search using Dug.

Overview

Dug Semantic Search is a tool that allows users to deep dive into BDC studies and biomedical topics, research, and publications to identify related studies, datasets, and variables. If you are interested in how Dug connects study variables to biomedical concepts, read the Dug paper or visit the Help Portal.

This tool applies semantic web and knowledge graph techniques to improve the Findability, Accessibility, Interoperability, and Reusability (FAIR) of BDC research data. Through this process, semantic search helps users identify novel relations, build unique research questions, and identify potential collaborations.

Data Interoperability

How to access additional data stacks

GTEx Data

The Genotype-Tissue Expression (GTEx) Program is a widely used data resource and tissue bank to study the relationship between genetic variants (inherited changes in DNA sequence) and gene expression (how genes are turned on and off) in multiple human tissues and across individuals. For information on access to GTEx data, refer to GTEx v8 - Free Egress Instructions as part of the AnVIL documentation.

NCPI Data Portal

The NIH Cloud Platform Interoperability Effort (NCPI) is currently working to establish and implement guidelines and technical standards to empower end-user analyses across participating cloud platforms and facilitate the realization of a trans-NIH, federated data ecosystem. Participating institutions include BDC, AnVIL, Cancer Research Data Commons, and Kids First Data Resource Center. Learn what data is currently hosted by these platforms by using the NCPI Data Portal.

Understanding Access

This checklist is intended to help new users understand their obligations regarding access and permissions, which individual users are responsible for obtaining and maintaining.

eRA Commons Account

Users log into BDC platforms with their eRA Commons credentials. For more information, see Ecosystem Access, Hosted Data, and System Services.

Users create an eRA Commons Account through their institution's Office of Sponsored Research or equivalent. For more information, refer to Understanding eRA Commons Accounts.

dbGaP

Users who want to access a hosted controlled study on the BDC ecosystem must be approved for access to that study in the NIH Database of Genotypes and Phenotypes (dbGaP). For more information, see Ecosystem Access, Hosted Data, and System Services and BioData Catalyst FAQs. Note that obtaining these approvals can be a time-intensive process; failure to obtain them in a timely manner may delay data access.

Users have two options for obtaining dbGaP approval depending on whether they already are affiliated with a PI who has dbGaP access to the relevant data:

  1. The BDC user has no affiliation with an existing dbGaP-approved project. In this case, the user needs to create their own dbGaP project and then submit a data access request (DAR) for approval by the NHLBI Data Access Committee (DAC). This process often takes 2-6 months, depending on whether local IRB approval is required for the dataset the user is requesting, the amount of time it takes for local review of the dbGaP application by the user’s home institution, and processing by the DAR committee. See the dbGaP Authorized Access Portal or dbGaP Overview: Requesting Controlled-Access Data. Once a DAR is approved, it can take a week or longer for the approval to be reflected on BDC.

  2. The BDC user is affiliated with an existing principal investigator, who already has an approved dbGaP application with existing DAR access (for example, the BDC user is a post-doctoral fellow in a PI’s lab). A principal investigator with dbGaP DAR access assigns the User as a “Downloader” in dbGaP. See Assign Downloaders for dbGaP Data. It can take about 24 hours for “Downloader” approval to be reflected on BDC.

Notes

DARs must be renewed annually to maintain your data access permissions. If your permissions expire, you may lose access to hosted data in BDC during the renewal process.

A Cloud Use Statement may be required as part of the DAR.

TOPMed

BDC hosts data from the NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium. BDC users are not automatically onboarded as TOPMed investigators. BDC users who are not members of the TOPMed Consortium may apply for released data through the regular dbGaP Data Access Request process.

When conducting TOPMed-related research on BDC, members of the TOPMed consortium must follow the TOPMed Publications Policy and associated processes; for example, operating within Working Groups.

For more information, refer to the following resources:

  • Information on joining TOPMed

  • TOPMed website

  • TOPMed FAQs (login required)

  • BioData Catalyst FAQs

IRB

Users must ensure that IRB data use agreements (DUAs) are approved and maintained as they are enforced by the BDC ecosystem.

BDC

Refer to the BDC Data Protection page to learn more about topics such as data privacy, access controls, and restrictions.

Use your eRA Commons account to review the data indexed by BDC to which you have access on the Explore BioData Catalyst Data page. For more information, see Checking Access.

If your data is not indexed, inform BDC team members during your onboarding meetings or by submitting a Help Desk ticket.


Strategic Planning

In the context of agile development and a Consortium with a diverse set of members, the application of various agile-development terms may mean different things to different individuals.

The table below defines the BDC Core Terminology:

Term: User Narrative
Definition/Description: Descriptions of a user interaction experience within the system from the perspective of a particular persona. User Narratives are further broken down into Features, Epics, and User Stories. Currently formulated into rough 6-month timelines to benchmark progress.
Example: An experienced bioinformatician wants to search TOPMed studies for a qualitative trait to be used in a GWAS study.

Term: Feature
Definition/Description: A functionality at the system level that fulfills a meaningful stakeholder need. *Level at which the BDC3 coordinates.
Example: Search TOPMed datasets using the PIC-SURE platform.

Term: Epic
Definition/Description: A very large user story which can be broken down into executable stories. *NHLBI’s cost-monitoring level.
Example: PIC-SURE is accessible on BDC.

Term: User Story
Definition/Description: A backlog item that describes a requirement or functionality for a user. *Finest level of PM monitoring.
Example: A user can access PIC-SURE through an icon on BDC to initiate search.

Term: Workstream
Definition/Description: A collection of related features; orthogonal to a User Narrative.
Example: Workstreams impacted by the User Narrative above include:

  • production system

  • data analysis

  • data access

  • data management

Project Management Approach PDF: PM-graphic.pdf (148KB)

Strategic Planning Documents Reviewed & Approved by NHLBI Leadership

  • BioData-Catalyst-Strategic-Framework-Plan-V1-v2.0 (1).pdf (471KB)

  • BioData-Catalyst-Implementation-Plan-V1-v2.0.pdf (685KB)

  • BioData Catalyst Data Management Strategy - V1.0(3).pdf (491KB)

  • BioData Catalyst Project Management Plan V2.0 (1).pdf (622KB)

NHLBI DICOM Medical Image De-Identification Baseline Protocol

BDC-RFC-#: 28
Title: DICOM Medical Image De-Identification Baseline Protocol
Type: Process
Contact Name and Email: Keyvan Farahani, farahank@mail.nih.gov
Submitting Teams: NHLBI, DMC
Date Sent to Consortium: Oct. 11, 2023
Status: Closed for comment
URL Link to this Google Document: https://docs.google.com/document/d/14-WfeMqgZz115DbBnFs-8AvcdRY1oIjCgi0K33pwMjE/edit?usp=sharing
License: This work is licensed under a CC-BY-4.0 license.

Medical Image De-Identification: BDC Baseline Protocol

Contributors:

  • Zixin Nie (BDC Data Management Core)

  • Keyvan Farahani (NHLBI)

  • David Clunie (PixelMed Publishing)

Why image de-identification?

De-identification of protected health information (PHI) is often necessary before potentially sensitive information, such as health data, can be shared. Many data repositories that allow human data to be deposited and shared require the data to be de-identified. Medical images and their associated metadata (i.e., DICOM headers) often contain PHI, such as patient names, dates of birth, or medical record numbers. The de-identification of these images is essential to minimize privacy risk and comply with regulations and standards that require the protection of PHI. The overarching goal in medical image de-identification is to reduce the risk of identification as much as possible.

De-identification facilitates the sharing of medical imaging data, enabling greater access by researchers and the public and allowing for secondary research to be conducted. Several standards exist for de-identification of medical images, including the confidentiality profile detailed in the DICOM Part 15 standard, HIPAA Safe Harbor and Expert Determination. The BioData Catalyst Data Management Core (BDC DMC) performed an evaluation of these standards and used them to create the protocol detailed in this document. This document describes the de-identification processes and technical considerations for de-identifying medical images as they are being added to BDC and made available to researchers using the BDC platform. The protocol, referred to as the “BDC Baseline Protocol for Image De-identification,” takes into account the data use cases for researchers accessing the BDC platform by defining a de-identification profile that strikes a balance between privacy protection and preserving utility.

The Baseline protocol only applies to the metadata in radiologic (DICOM) images (see table below). It does not apply to image pixel information, other imaging formats, or other types of data that may be imported into BDC, such as clinical and omics data. It reflects the understanding of the de-identification needs of BDC as of October 2023. Future RFCs are planned that will address masking of unique identifiers, the details of how imaging pixel data will be de-identified, the de-identification process workflow, and quality management.

Major medical imaging modalities

Imaging Data Type: Conventional format

  • Radiologic (X-ray, PET/CT, MRI, ultrasound): DICOM (Digital Imaging and Communication in Medicine)

  • Cardiac ECG: XML

  • Digital Pathology: Proprietary TIFF and DICOM Pathology

The focus of this RFC is on de-identification of DICOM images.

The Baseline De-Identification Protocol

The de-identification protocol described in this section is intended to be a baseline for de-identification within BDC. The protocol is compliant with regulations such as the HIPAA Privacy Rule and the Common Rule, while retaining the maximal amount of research utility possible. It is designed based on the experiences from the HeartShare imaging pilot project. The protocol will evolve over time, with future iterations to address new issues as they arise, and customizations to address specific research use cases. These may involve Expert Determinations, which can both increase privacy protections and improve research utility. This protocol is to be used for all medical imaging data to be submitted to the BDC. The protocol may be implemented in an image de-identification tool at the submitter’s site, or in a central BDC-related data curation service. Any deviation from this protocol must be discussed with and approved by the BDC/DMC. The baseline de-identification protocol can be found at this link: DICOM_deid_part_15_classified_09_26_2024_Baseline.xlsx.

Introduction to HIPAA Safe Harbor and DICOM Part 15

De-identification of DICOM data can be performed according to different standards. Two commonly accepted standards are HIPAA Safe Harbor and Normative E Attribute Confidentiality Profiles defined in part 15 of the DICOM standard (referred to in the rest of this document as the DICOM Part 15 Standard).

HIPAA Safe Harbor de-identification calls for the removal of 18 types of identifiers (detailed here: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#standard). The standard legally applies to PHI handled by HIPAA Covered Entities; however, because it has been in use for over 20 years, it is generally accepted as a de-identification standard for other types of data as well.

The DICOM Part 15 Standard was developed through a careful review of all DICOM attributes, identifying any that had the possibility of containing identifying information and creating a mitigation strategy. It is more extensive than HIPAA Safe Harbor, covering attributes that are not part of the 18 prescribed types of identifiers such as ethnicity and biological sex. Various mitigation strategies are presented to treat the attributes detailed as part of the standard, with the Basic DICOM Part 15 Confidentiality Profile being the most conservative, calling for suppression of most of the attributes.

De-Identification of DICOM Header Data

To produce de-identified data that retains analytic utility for BDC researchers, while remaining a standardized implementation of de-identification that can be applied across most data ingested by BDC, an evaluation was performed to produce a set of de-identification rules that can be applied to DICOM header attributes. The evaluation examined the de-identification profiles detailed in the DICOM Part 15 standard and aligned them with the minimum requirements for compliance with HIPAA Safe Harbor. The resulting de-identification strategy should be sufficient to construct a de-identification profile that can be applied across all DICOM headers.

The steps for performing this evaluation were as follows:

  1. Attributes from each profile were classified into the following categories: Direct Identifier (DI), Quasi-Identifier (QI), and Non-Identifier (NI), according to the classification framework detailed in the following diagram:

  2. After classification, DIs and QIs were then aligned with the 18 types of identifiers specified for removal within the HIPAA Safe Harbor provision.

  3. Each of the attributes that aligns with one of the HIPAA Safe Harbor identifiers was then assigned a mitigation technique to remove the identifying information that could appear in the field.
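To make the output of this evaluation concrete, the sketch below shows how a handful of DICOM attributes might be recorded after classification and mitigation assignment. The attribute keywords are standard DICOM keywords, but the specific category and mitigation assignments shown here are illustrative only; the Baseline Protocol spreadsheet linked above is the authoritative source.

```python
# Illustrative only: a few DICOM attribute keywords with the identifier class
# (DI = direct identifier, QI = quasi-identifier, NI = non-identifier) and the
# mitigation they might be assigned. The Baseline Protocol spreadsheet is the
# authoritative source for the full attribute-by-attribute assignments.
BASELINE_RULES = {
    "PatientName":             ("DI", "suppress"),
    "PatientAddress":          ("DI", "suppress"),
    "PatientTelephoneNumbers": ("DI", "suppress"),
    "StudyDate":               ("QI", "keep year only"),
    "ImageComments":           ("QI", "suppress (free text)"),
    "Modality":                ("NI", "keep"),
}
```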

Of the attributes within the DICOM Part 15 standard that must be removed for compliance with HIPAA Safe Harbor, there are:

  • 4 name attributes

  • 4 patient address attributes

  • 122 date attributes

  • 5 telephone number attributes

  • 91 other unique ID attributes

Names, addresses, and telephone numbers should be suppressed from the data. Dates can be kept accurate to the year (a future BDC medical image de-identification RFC will address improving this approach for longitudinally acquired imaging studies). The other unique IDs can either be suppressed or masked so that their original values cannot be recovered. The specifics of how the other unique IDs will be masked will come in a separate RFC that describes the masking procedures. Additionally, there are 26 attributes that contain various forms of free text, such as comments, notes, labels, and text strings. Identifying information may be written in these attributes. As such, they should be suppressed to prevent the leakage of identifying information.

The other attributes detailed in the DICOM Part 15 standard do not necessarily require mitigation for compliance with HIPAA Safe Harbor. However, if they do not have analytic usage, it is recommended to mitigate them according to the specifications detailed in the DICOM Part 15 standard in order to decrease the risk of re-identification represented by indirectly identifying fields not mentioned in HIPAA Safe Harbor.
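As an illustration of how these mitigations could be applied in practice, the sketch below uses the open-source pydicom library to suppress a small, illustrative subset of identifying attributes, reduce dates to the year, and drop private tags. It is not the Baseline Protocol itself; the full list of attributes and their mitigations is defined in the protocol spreadsheet.

```python
import pydicom

def deidentify_header(path_in: str, path_out: str) -> None:
    """Illustrative header de-identification: suppress a few identifying
    attributes, keep dates accurate only to the year, drop private tags."""
    ds = pydicom.dcmread(path_in)

    # Suppress a few name / address / phone / free-text attributes (illustrative subset).
    for keyword in ("PatientName", "PatientAddress", "PatientTelephoneNumbers",
                    "ReferringPhysicianName", "ImageComments"):
        if keyword in ds:
            setattr(ds, keyword, "")

    # Keep dates accurate to the year only (DICOM DA values are YYYYMMDD strings).
    for keyword in ("StudyDate", "SeriesDate", "AcquisitionDate", "PatientBirthDate"):
        if keyword in ds and ds.data_element(keyword).value:
            year = str(ds.data_element(keyword).value)[:4]
            setattr(ds, keyword, year + "0101")

    ds.remove_private_tags()   # private tags may also carry identifying information
    ds.save_as(path_out)
```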

De-Identification of Image Pixel Data

Image pixel data, often encountered in ultrasound (echo) imaging, can contain PHI, such as patient names, dates of birth, and the hospital or imaging center names. This information can be shown either in labels on images, which usually have pre-specified areas, or in the form of burned-in text, which can appear anywhere on the image. Any identifying information contained within pixel data should be removed before it is made available to researchers.

Methods for removal of image pixel data include the following:

  • Masking through opaque boxes over parts of the image

  • AI assisted removal of identifying information, deploying optical character recognition (OCR)

  • Deletion of images from the dataset that contain identifying information

Image pixel de-identification will be performed as a service using existing third-party tools provided by DMC contractors. After de-identification, images will still require review to ensure that the process captured and removed all identifying information on the images. This is a necessary quality control step to ensure that there is no leakage of identifying information.
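For the first method listed above (masking pre-specified label areas with opaque boxes), a minimal sketch is shown below. It assumes an uncompressed, single-frame grayscale image and a label region confined to the top rows of the frame; in practice, pixel de-identification is handled by the third-party tools mentioned above and must be followed by human review.

```python
import pydicom

def mask_label_region(path_in: str, path_out: str, label_rows: int = 60) -> None:
    """Cover an assumed burned-in label area (the top `label_rows` pixel rows)
    with an opaque black box. Assumes uncompressed, single-frame grayscale data."""
    ds = pydicom.dcmread(path_in)
    pixels = ds.pixel_array.copy()      # decoded pixel data as a NumPy array
    pixels[:label_rows, :] = 0          # opaque box over the assumed label region
    ds.PixelData = pixels.tobytes()     # write the modified pixels back to the dataset
    ds.save_as(path_out)
```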

De-Identification of Filenames and File Paths

Metadata associated with images, such as filenames and file paths, can often include unique IDs and dates of medical events. This information is important for associating imaging data correctly with other types of data for linkage, processing, and analysis; however, it can also present a risk of leaking identifying information from de-identified data files. To prevent this, the following rules should be followed (a minimal illustrative sketch follows the list):

  1. Folder names should only include the study name and associated visit number, and no further information

    1. e.g., for the first visit of the MESA study, the folder name should be called MESA_V1

  2. Image filenames are to be set to the following format: STUDYNAME_TYPE_VISITNN_YYYYMMDD_SEQ

    1. VISITNN: “VISIT” + VisitNumber (the label “VISIT” is included to tell the investigator what the number refers to)

    2. YYYYMMDD: AcquisitionDate with the month and day set to January 1 (i.e., YYYY0101), where YYYY is the year of acquisition

    3. SEQ: sequence number to ensure filename is unique

    4. e.g., MESA_ECG_VISIT05_20220101_999.xml
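The minimal sketch below builds a filename following this convention; the function name, zero-padding widths, and default extension are illustrative assumptions rather than part of the protocol.

```python
from datetime import date

def deid_image_filename(study: str, img_type: str, visit: int,
                        acquisition: date, seq: int, ext: str = "xml") -> str:
    """Build a de-identified filename of the form
    STUDYNAME_TYPE_VISITNN_YYYYMMDD_SEQ, with the acquisition date reduced
    to January 1 of the acquisition year."""
    return (f"{study}_{img_type}_VISIT{visit:02d}_"
            f"{acquisition.year}0101_{seq:03d}.{ext}")

# Example: deid_image_filename("MESA", "ECG", 5, date(2022, 8, 17), 999)
# -> "MESA_ECG_VISIT05_20220101_999.xml"
```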

Risk Mitigation

The risks presented by using the de-identification methods detailed in this RFC are as follows:

  1. HIPAA Safe Harbor, while being an accepted standard for de-identification, does not cover all potential identifiers (leaving out potentially identifying attributes such as race, employment, diagnoses, procedures, and treatments). Data de-identified under HIPAA Safe Harbor therefore holds a residual risk of re-identification.

  2. Automated imaging de-identification solutions are not 100% accurate, leaving the potential for small amounts of identifying information to be retained.

Data made available through BDC is provided for research purposes to investigators who should not have ulterior motives to perform re-identification. HIPAA Safe Harbor represents a standard that has been in use for over 20 years, so the risks presented from using that standard are well understood and acceptable by BDC. The risk presented by leakage of identifying information from imaging data can be mitigated through human review of de-identified images to ensure that all identifying information has been removed.

In the event that PHI is discovered in de-identified imaging data in BDC, the affected data shall be taken offline and checked for removal of the offending PHI before being posted again on BDC. In such cases, the data submitter shall be informed of the incident.

Local vs. Cloud-based Image De-Identification

Depending on the capabilities of the de-identification tool and the legal and logistic requirements for access to original identifiable images, de-identification may be done locally at the data-generating site or through a central cloud-based service. Although the latter is often more efficient (semi-automated and scalable), the transfer of identifiable (PHI-containing) images to a central cloud may require agreements between the data provider (submitter) and the de-identification service provider, stipulated through execution of a Data Transfer Agreement (DTA). Details of the image de-identification process to be used will be provided in a future RFC.

BDC Video Content Guidance

Overview

BDC recognizes the importance of multimedia resources for ecosystem users, particularly audio/visual recordings. This document provides guidelines on the program's video content approach. Using these guidelines will ensure users get optimized video experiences, from consistent branding that offers insights into the sources of the videos to best practices in video creation that support learning.

Overview of BDC Videos

To share video content - from the consortium, platforms, and users, as described in the following sections - BDC created a YouTube channel: https://www.youtube.com/channel/UCGkmY5oNK8uFZzT8vV_9KgQ

The BioData Catalyst Coordinating Center (BDC3) has authority (with direction from the NHLBI) to post (or not post), remove, edit, and otherwise change video content on this channel with or without permission from or notice to video creators, owners, or sharers. Feedback about videos on the BDC YouTube channel should be sent to BDCatalystOutreach@nih.gov.

Categories and Organization of Videos

The BDC YouTube Channel hosts three categories of videos based on their sources and/or approval statuses:

  • Consortium-produced / Consortium-approved

  • Platform-generated

  • User-generated

Learn more about each video category below. Note that each category has its own set of standards that must be adhered to when creating and publishing video content, whether the final outlet is the BDC YouTube channel or another channel.

BDC3 is responsible for organizing videos on the BDC YouTube channel, grouping them into playlists it believes will be most beneficial to ecosystem community members. Playlists may include videos from any or all categories of videos. Viewers can determine the category of a video based on the branding (or non-branding) that appears. The additional information about each video category includes video standards that direct video creators on branding for each category of videos.

Consortium-produced / Consortium-approved Videos

Videos in this category are produced by BDC3, or are produced by Platforms or Users that receive approval from the BDC Consortium (select organizations developing and maintaining the ecosystem). These videos contain pre-approved opening and closing BDC animations and sound.

Consortium-produced / Consortium-approved Video Standards

Videos produced by the Consortium, or by Platforms or Users that submit for approval for recognition as a Consortium-approved video, must adhere to the following standards:

Comply with all requirements and, when possible, follow all best practices outlined in Addendum A: Consortium-produced / Consortium-approved Videos Best Practices.

Platforms and users generating videos who wish to submit them for recognition as Consortium-approved must complete the BDC Consortium Video Submission Pre-Approval Application. Submit the form BEFORE producing the video to improve the likelihood that the video receives Consortium approval.

Platform-generated Videos

Videos in this category are produced by one of the BDC platforms to support users' understanding of their platform. These videos are not vetted by BDC3, BDC3 Consortium members, or representatives of other BDC platforms. These videos must open with the creator's platform "Powered by" logo (downloadable from the BDC3 internal consortium website).

Platform-generated Video Standards

Unless a Platform plans to seek Consortium-approval status for a video, platforms should use the following standards in the production and posting of their platform-generated videos:

Producers of Platform-generated videos, like all BDC ecosystem users, are always obligated to protect participant privacy and must follow NIH policies for data protection. Platforms are accountable and may be subject to sanctions if policies are violated.

  • Only produce videos that provide information specific to the Platform's BDC instance.

  • Use the Platform's Powered by logo (and only the Powered by logo) for the YouTube thumbnail image.

  • Videos should open with the following information: “In this video we will [discuss/cover/explore] BioData Catalyst Powered by [platform name] and [task/example].”

  • The YouTube description should include the following language: “This is a BioData Catalyst platform-generated video to support ecosystem users' understanding of the BioData Catalyst Powered by [platform name].” and the link to the NHLBI BioData Catalyst homepage: https://biodatacatalyst.nhlbi.nih.gov/

  • Videos should be uploaded using YouTube's auto-generated captions to support 508 compliance.

  • Once the video is uploaded, email the link to BDCatalystOutreach@nih.gov so BDC3 can make it visible on the BioData Catalyst YouTube channel.

Important Notes

  • Only videos offering information specific to the use of ecosystem Platform instances will be shared on the BDC YouTube channel. Videos that support the use of Platforms but are not specific to BDC instances may be linked from the ecosystem documentation but will not appear on the BioData Catalyst YouTube channel.

  • Platform-generated videos that do not follow the above standards will not be made visible on the BDC YouTube channel.

User-generated Videos

These videos are neither approved nor vetted by BDC, the BDC Consortium, BDC Platforms, or the organizations they represent. The opinions and other content in these videos are those of the video creators and sharers alone. These videos may NOT open or close with BDC branding and may only display BDC branding when capturing images of properties where it already appears (e.g., a screencap of an ecosystem platform instance).

User-generated Video Standards

BDC offers user-generated video tutorials and guides. Unless a user plans to seek Consortium-approval status for a video, BDC requires the following for user-generated videos, their creators, and their sharers:

Producers of user-generated videos, like all BDC ecosystem users, are always obligated to protect participant privacy and must follow NIH policies for data protection. User institutions are accountable and may be subject to sanctions if policies are violated. By submitting a video for inclusion, users attest that the content of the video follows NIH policies for data protection, agree to follow this guidance, and commit to including the following statement in the video description:

“This is a user-generated video and is neither approved nor vetted by NHLBI BioData Catalyst (BDC), the members of the BDC Consortium, or the organizations they represent. For more information about BDC, go to https://biodatacatalyst.nhlbi.nih.gov/. For more BDC videos, go to https://www.youtube.com/channel/UCGkmY5oNK8uFZzT8vV_9KgQ. #BioDataCatalyst”

To share a video, please contact: BDCatalystOutreach@nih.gov

Important Notes

  • User-generated videos that do not follow the above standards will not be made visible on the BioData Catalyst YouTube channel.

  • User-generated videos are just one type of user-contributed content BDC seeks to share. To learn about other kinds of user-generated content BDC seeks, read Contributing User Resources to BDC.

Addendum A: Consortium-produced / Consortium-approved Videos Best Practices

Consortium-produced/Consortium-approved videos must adhere to this addendum. While not required of BDC Platforms and users, BDC encourages them to consider these best practices for the videos they produce.

Gaining Approval: Submitting Your Idea

  • Consider if the video is fulfilling a need/gap (Required): Ensure the video isn't replicating information already available to users.

  • Complete & submit the Video Submission Pre-Approval Form (Required): Pre-approval is required to ensure relevance & consistency.

Planning the video: Considerations before recording

  • Outline the video (Best practice): Consider how info can be presented in a concise & useful manner.

  • Avoid having too much text on slides (Best practice): Slides should be concise; keep text & bullets at a minimum; use images when possible as viewers respond to images more positively than text.

Shooting the video: Best practices

  • Use clear language & explain jargon (Best practice): Simple communications are preferred; many viewers may not speak English as a first language.

Policy compliance: Federal regulations & BDC3 best practices

  • Ensure Section 508 compliance (Required): Subtitles & transcripts are required to ensure equity in access for people with disabilities.

  • Ensure privacy policy compliance (Required): Protect study participants' privacy by not including personally identifiable, confidential, sensitive, or personal health information (e.g., blur screenshots with data).

  • Ensure accessibility, including readability & making slides available for download (Required): For people with disabilities, readability can be essential to a successful user experience.

  • Use appropriate branding according to the BDC Style Guide (Required): Required to create a unified look across the BioData Catalyst ecosystem. Work with your BDC3 contact to get a copy of the style guide.

Technical aspects: Steps after shooting

  • Search Google Trends (Best practice): Search for meaningful keywords for titles, descriptions & tags.

  • Create a meaningful title (Required): The title should be under 66 characters to make it easier for Google to display; make the title engaging & descriptive.

  • Create a meaningful description (Required): Think about the action the user is trying to take & the keywords they might use to find your video.

  • Edit automatic transcription (Required): Transcription is free but likely needs editing; you can make changes to the text & timestamps of your captions.

  • Create cards for interaction (Best practice): Cards are clickable calls to action that take viewers to another video, channel, or site.

  • Create end screens for marketing (Best practice): End screens can be added to a video's last 5-20 seconds to promote other videos, encourage viewers to subscribe, etc.

  • Divide into chapters & create Table of Contents (Best practice): Break up videos into sections (each with an individual preview) to provide more info & context; eases re-playing certain sections.

  • Create thumbnail (Required): A clear & colorful video thumbnail will catch viewers' attention & let them see a quick snapshot of your video as they're browsing.

  • Create meaningful tags, including the required #BioDataCatalyst tag (Required): Tags are descriptive keywords you can add to your video to help viewers find your content; include at least 10 tags.

  • Add links to BDC (Best practice): Where possible, provide links to relevant parts of the BDC ecosystem.

Publishing & promoting: Publicizing & sharing video

  • Share completed videos with BDC3 (Required): Email BDCatalystOutreach@nih.gov with info on accessing the video, a thumbnail image, descriptive tags to include, and the video description.

  • BDC3 sets appropriate privacy settings according to policy, with input from the video creator (If approved): Videos can be Public, Unlisted (link needed), or Private (invite needed; most secure).

  • BDC3 uploads the video to the YouTube channel & adds it to relevant playlists (If approved): Videos can be in multiple playlists but don't need to be in any playlist.

  • Teams and BDC3 develop plans to promote the video, if appropriate (Best practice): Potential options include Facebook, Instagram, LinkedIn, Snapchat, Twitter, Vimeo, WeChat, Pinterest, Flipgrid, etc.

Library maintenance: Keeping an up-to-date catalog

  • BDC3 will prompt teams annually to check videos to ensure continued relevance (Required): Outdated videos could cause viewers to lose confidence in the accuracy of info available on the channel.

Checking Access

You can check your access to data on BDC using the public website or on your specific platform.

Public website

Go to Accessing BioData Catalyst Data and click Check My Access.

BDC powered by Gen3 (BDC-Gen3) platform

Go to BioData Catalyst Powered by Gen3, select NIH Login, then log in using your NIH credentials. Once logged in, select the Exploration tab. From the Data Access panel on the left, make sure Data with Access is selected. Note whether you have access to all the datasets you expect.

Checking data access in Gen3

Data Access

  • Data with Access (default): Displays projects you have access to.

  • Data without Access: Displays data you do not have subject-level access to, but for which summary statistics can be accessed.

  • All Data: Displays all projects, including projects you do not have access to. A lock will appear for data you cannot access.

Request Access

You can request access to data by visiting the dbGaP homepage. For more information on data access, see the Data Accessibility section on the Exploration page.

BDC powered by Seven Bridges (BDC-Seven Bridges) platform

Go to BioData Catalyst powered by Seven Bridges and login. To check your data access:

  1. Click your username in the upper right and select Account Settings.

  2. Select the tab for Dataset Access.

  3. Browse the datasets and note whether you have access to all the datasets you expect.

    • Datasets you have access to will have green check marks.

    • Datasets you do not have access to will have red check marks.

BDC powered by Terra (BDC-Terra) platform

You do not need to check your data access on BDC-Terra. But before submitting a help desk ticket, ensure that you’ve done the following steps:

Establish a link in BioData Catalyst powered by Terra to your eRA Commons/NIH Account and the University of Chicago DCP Framework. To link eRA Commons, NIH, and DCP Framework Services, go to your Profile page in BDC-Terra and log in with your NIH credentials.

Screenshot of NIH account credentials

If you still have issues using particular files or datasets in analyses on BDC-Terra, submit a request to our help desk.

BDC powered by PIC-SURE (BDC-PIC-SURE) platform

You do not need to check your data access on BDC-PIC-SURE. Instead, refer to the Accessing BioData Catalyst Data page, then click Check My Access.

Submitting a dbGaP Data Access Request

Requirements

  • An NIH eRA Commons ID (or appropriate NIH Login) is required for submitting a Data Access Request (DAR). If you do not have an eRA Commons account, you must request one through your institution’s Office of Sponsored Research or equivalent. For more information, refer to Understanding eRA Commons Accounts.

  • To submit a DAR, users must have PI status through their institution. Non-PI users must have a PI they work with that can submit a DAR and add them as a downloader.

Data Access Request Process

Step 1: Go to https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login to log in to dbGaP.

Step 2: Navigate to My Projects.

Step 3: Select Datasets.

You can search by Primary disease type, or if you know the dataset you are interested in, you can use Study lookup.

We want to request HCT for SCD, so we will use the accession number phs002385. As you type the accession number, the numbers will start to auto-populate.

Select the study to add it to the Data Access Request. You can request up to 200 studies that you are interested in accessing.

The user can add additional datasets as needed to answer the research question.

Sample Research Use Statement

Title

Long-term survival and late death after hematopoietic cell transplant for sickle cell disease

Research Use Statement

Our project is limited to the requested dataset. We have no plans to combine it with other datasets.

In 2018, the National Heart Lung and Blood Institute (NHLBI) began work on BioData Catalyst, a shared virtual space where scientists can access NHLBI data and work with the digital objects needed for biomedical research (www.nhlbi.nih.gov/science/biodata-catalyst). This is a cloud-based platform that allows for tools, applications and workflows. It provides secure workspaces to share, store, cross-link and analyze large sets of data generated from biomedical research. BioData Catalyst addresses the NHLBI Strategic Vision objective of leveraging emerging opportunities in data science to facilitate research in heart, lung, blood and sleep disorders. It offers specialized search functions, controlled access to data and analytic tools via programming interfaces, and its interoperability will allow exchange of information with other components of the Data Commons. BioData Catalyst may be accessed by biomedical researchers and the public at large. The first available datasets in BioData Catalyst include data from NHLBI’s Trans-Omics for Precision Medicine (TOPMed) Program and the Cure Sickle Cell Initiative. Rigor in designing and performing scientific research and the ability to reproduce biomedical research are two of the cornerstones of science advancement. In order to test reproducibility of biomedical data available in BioData Catalyst, we accessed NHLBI data from the Cure Sickle Cell Initiative to test and validate the findings of a publication that utilized those data. That report focused on the effect of donor type and transplant conditioning regimen intensity on hematopoietic cell transplant outcomes for sickle cell disease. Hematopoietic cell transplant is potentially curative, yet this treatment is associated with risks for mortality from the treatment procedure. Published reports suggest the life expectancy of adults with sickle cell disease in the United States is shortened by at least two decades compared to the general population. Thus, a fundamental question that is often asked is whether hematopoietic cell transplant over time would offer a survival advantage compared to treatment with disease-modifying agents. In the report [1] that examined factors associated with survival after transplantation, young patients (aged ≤12 years) and patients who received their graft from an HLA-matched sibling had the highest survival. For those without an HLA-matched sibling, the data did not favor one alternative donor type over another [1]. The purpose of the current analyses is two-fold: 1) test and validate a publication that utilized data in the public domain and 2) assess the utility of these data to conduct an independent study. The aim of the latter study was to estimate the conditional survival rates after hematopoietic cell transplantation stratified by time survived since transplantation and to compare all-cause mortality risks to those of an age-, sex-, and race-matched general population in the United States.

Non-technical summary

Investigators in the Cure Sickle Cell Initiative Data Consortium request access to data to examine rigor and reproducibility of data submitted - using a previous publication as reference. Additionally, we will calculate survival after hematopoietic cell transplant by time survived since transplant. For example, we will calculate the 5- and 10-year likelihood of being alive 2 and 5 years after transplantation.

Cloud-Use Statement

The NHLBI-supported BioData Catalyst (www.nhlbiBioDataCatalyst.org) is a cloud-based infrastructure where heart, lung, blood, and sleep (HLBS) researchers can go to find, search, access, share, cross-link, and compute on large scale datasets. It will provide tools, applications, and workflows to enable those capabilities in secure workspaces. The BioData Catalyst will employ Amazon Web Services and Google Cloud Platform for data storage and compute. BioData Catalyst comprises the Data Commons Framework Services (DCFS) hosted and operated by the University of Chicago. DCFS will provide the gold master data reference as well as authorization/authentication and indexing services. The DCFS will also enable security interoperability with the secure workspaces. Workspaces will be provided by FireCloud, hosted and operated by the Broad Institute, Fair4Cures, hosted and operated by Seven Bridges Genomics and PIC-SURE operated by Harvard Medical School. For the NHLBI BioData Catalyst, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) issued to the Broad Institute, University of Chicago and Seven Bridges Genomics as presenting acceptable risk, and therefore the NCI ATO serves as an Interim Authority to Test (IATT) when used by designated TOPMed investigators and collaborators. Additionally, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) for Harvard Medical School.

Cloud Provider Information

Cloud Provider:

NHLBI BioData Catalyst, Private, The NHLBI-supported BioData Catalyst (https://biodatacatalyst.nhlbi.nih.gov/) is a cloud-based infrastructure where heart, lung, blood, and sleep (HLBS) researchers can go to find, search, access, share, cross-link, and compute on large scale datasets. It will provide tools, applications, and workflows to enable those capabilities in secure workspaces.

The NHLBI BioData Catalyst will employ Amazon Web Services and Google Cloud Platform for data storage and compute. The NHLBI BioData Catalyst comprises the Data Commons Framework Services (DCFS) hosted and operated by the University of Chicago. DCFS will provide the gold master data reference as well as authorization/authentication and indexing services. The DCFS will also enable security interoperability with the secure workspaces. Workspaces will be provided by FireCloud, hosted and operated by the Broad Institute, Fair4Cures, hosted and operated by Seven Bridges Genomics and PIC-SURE operated by Harvard Medical School.

For the NHLBI BioData Catalyst, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) issued to the Broad Institute, University of Chicago and Seven Bridges Genomics as presenting acceptable risk, and therefore the NCI ATO serves as an Interim Authority to Test (IATT) when used by designated TOPMed investigators and collaborators. Additionally, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) for Harvard Medical School.

Amazon Web Services (AWS), Commercial

Amazon Web Services (AWS) is a public cloud platform that provides solutions and services such as virtual machines, database instances, storage, and more. We will use the Amazon Elastic Compute Cloud (Amazon EC2), a web service that provides resizable compute capacity, to allocate Amazon Machine Instances (AMIs) in which we will develop the methods and infrastructure necessary to build the NHLBI BioData Catalyst. AWS offers several storage options that work in conjunction with EC2: Amazon Simple Storage Service (Amazon S3), Amazon Elastic Block Store (EBS), and Amazon Elastic File System (Amazon EFS). We expect to use each of these as they provide different capabilities, including persistent storage and direct and networked storage for attaching to running AMI(s). We will use Amazon Virtual Private Cloud (VPC) to provide security and robust networking functionality for these compute resources and Amazon Identity and Access Management (IAM) to control user access to them. AWS offers extensive security and has written a white paper with guidelines for working with controlled-access data sets in AWS, which we will follow (see https://d0.awsstatic.com/whitepapers/compliance/AWS_dBGaP_Genomics_on_AWS_Best_Practices.pdf).

Google Cloud Platform, Commercial

Google Cloud Platform is a public cloud platform that provides solutions and services such as virtual machines, database instances, storage, and more. We will use Google Compute Engine, a service that provides resizable compute capacity, to allocate machine types in which we will develop the methods and infrastructure necessary to build the NHLBI BioData Catalyst. Google Cloud offers several storage options that work in conjunction with Compute Engine: Google Cloud Storage and Google Compute Engine Persistent Disks. We expect to use each of these as they provide different capabilities, including persistent storage and direct and networked storage for attaching to running machine instances. We will use networking technologies based on Google’s Andromeda architecture, which can create networking elements at any level with software. This software-defined networking allows Cloud Platform's services to implement networking features that fit their exact needs, such as secure firewalls for virtual machines in Google Compute Engine. We will use Google Cloud Identity & Access Management to control user access to these compute resources.

Requirements and Login

Requirements

To obtain access to BDC-PIC-SURE, you must have an NIH eRA Commons account. For instructions and to register an account, refer to the eRA website.

Login

After you have created an eRA Commons account, you can log in to BDC-PIC-SURE by navigating to https://picsure.biodatacatalyst.nhlbi.nih.gov and selecting to log in with eRA Commons. You will be directed to the NIH website to log in with your eRA Commons credentials. After signing in and accepting the terms of the agreement on the NIH RAS Information Sharing Consent page, allow the BDC-Gen3 service to manage your authorization.

Upon login, you will be directed to the Data Access Dashboard. This page provides a summary of PIC-SURE Authorized Access, PIC-SURE Open Access, and the studies you are authorized to access.

PIC-SURE Data Access Dashboard

PIC-SURE User Guide

PIC-SURE: Patient Information Commons Standard Unification of Research Elements

The Patient Information Commons: Standard Unification of Research Elements (PIC-SURE) integrates clinical and genomic data to allow users to search, query, and export data at the variable and variant levels. This allows users to create analysis-ready data frames without manually mapping and merging files.

BDC Powered by PIC-SURE (BDC-PIC-SURE) functions as part of the BDC ecosystem, allowing researchers to explore studies funded by the National Heart, Lung, and Blood Institute (NHLBI), whether they have been granted access to the participant level data or not.

BioLINCC Datasets

The BDC ecosystem hosts several datasets from the NIH NHLBI Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC). To access the BioLINCC studies, you must request access through dbGaP even if you have authorization from BioLINCC.

Search and Results

  1. Navigate to https://biodatacatalyst.nhlbi.nih.gov/use-bdc/explore-data/dug/ to access Dug Semantic Search.

  2. Semantic search is a concept-based search engine designed for users to search biomedical concepts, such as “asthma,” “lung,” or “fever,” and the variables related to and/or used to measure them. For example, a search for “chronic pain acceptance” will return a list of related biomedical concepts, such as chronic pain, headaches, neuralgia, or fibromyalgia, each of which can be expanded to display related variables and CDEs. Semantic search can also find variable names and descriptions directly, using synonyms from its knowledge graphs to find search-related variables.

  3. Enter a search term and press “Enter,” or click on the Search button. This will take you to the Semantic Search interface.

TOPMed and TOPMed related datasets

The BDC ecosystem hosts several datasets from the NHLBI Trans-Omics for Precision Medicine (TOPMed) program. The PIC-SURE platform has integrated the clinical and genomic data from all studies listed in the Data Access Dashboard. During the ingestion process, PIC-SURE occasionally ingests phenotypic data for TOPMed studies before the genomic data.

Harmonized Data (TOPMed Harmonized Clinical Variables)

There are limited amounts of harmonized data available at this time. The TOPMed Data Coordinating Center (DCC) curation team has identified 44 variables that are shared across 17 NHLBI studies and normalized the participant values for these variables.

The 44 harmonized variables available are listed in the table in Appendix 2. For more information on this initiative, you can view the additional documentation from the TOPMed DCC GitHub repository or the NHLBI Trans-Omics for Precision Medicine website.

Table of Studies Included in the TOPMed Harmonized Dataset Available in PIC-SURE

Available Data and Managing Data Access

BDC-PIC-SURE has integrated clinical and genomic data from a variety of heart, lung, blood, and sleep related datasets. These include NHLBI Trans-Omics for Precision Medicine (TOPMed) and TOPMed related studies, BioLINCC datasets, and COVID-19 datasets.

View a summary of the data you have access to by viewing the Data Access Table.

This table displays information about the study and associated data, including the full and abbreviated name of the study, study design and focus, the number of clinical variables, participants, and samples sequenced, additional information with helpful links, consent group information, and the dbGaP accession number (or phs number). You are also able to see which studies you are authorized to access in the Access column of the table. For information from dbGaP on submitting a data access request, refer to the Tips for Preparing a Successful Data Access Request documentation. Note that studies with a sickle cell disease focus contain links to the Cure SCi Metadata Catalog for additional information.

You can also check the data you have access to by going to the BioData Catalyst Data Access page on the BDC website and clicking Check My Access.


Studies included in the TOPMed Harmonized Dataset available in PIC-SURE (study name, abbreviation, and dbGaP accession):

  • Atherosclerosis Risk in Communities Study (ARIC): phs000280

  • Cardiovascular Health Study (CHS): phs000287

  • Cleveland Family Study (CFS): phs000284

  • Coronary Artery Risk Development in Young Adults Study (CARDIA): phs000285

  • Epidemiology of Asthma in Costa Rica Study (CRA): phs000988

  • Framingham Heart Study (FHS): phs000007

  • Genetic Epidemiology Network of Arteriopathy (GENOA): phs001238

  • Genetic Epidemiology of COPD (COPDGene): phs000179

  • Genetics of Cardiometabolic Health in Amish (AMISH): phs000956

  • Genome-Wide Association Study of Venous Thrombosis Study (MAYOVTE): phs000289

  • Heart and Vascular Health Study (HVH): phs001013

  • Hispanic Community Health Study - Study of Latinos (HCHS-SOL): phs000810

  • Jackson Heart Study (JHS): phs000286

  • Multi-Ethnic Study of Atherosclerosis (MESA): phs000209

  • Study of Adiposity in Samoans (SAS): phs000914

  • Women’s Health Initiative (WHI): phs000200


Getting Started

CONNECTS Dataset

The BDC ecosystem hosts several datasets from the NIH NHLBI Collaborating Network of Networks for Evaluating COVID-19 and Therapeutic Strategies (CONNECTS) program. These COVID-19 related studies follow the guidelines for implementing common data elements (CDEs) and for de-identifying dates, ages, and free text fields. For more information about these efforts, you can view the CDE Manual and De-Identification Guidance documents on the CONNECTS COVID-19 Therapeutic Trial Common Data Elements webpage.

Table of COVID-19 Studies Included in the CONNECTS Program Available in PIC-SURE

  • A Multicenter, Adaptive, Randomized Controlled Platform Trial of the Safety and Efficacy of Antithrombotic Strategies in Hospitalized Adults with COVID-19 (ACTIV4a): phs002694

  • COVID-19 Positive Outpatient Thrombosis Prevention in Adults Aged 40-80 (ACTIV4b): phs002710

  • Clinical-trial of COVID-19 Convalescent Plasma in Outpatients (C3PO): phs002752

PIC-SURE Open Access vs. PIC-SURE Authorized Access

PIC-SURE provides two ways to search: PIC-SURE Open Access and PIC-SURE Authorized Access. PIC-SURE Open Access enables the user to explore aggregate-level data without any dbGaP data authorizations. PIC-SURE Authorized Access allows the user to explore participant-level data and requires authorization to access at least one study through an active dbGaP Data Access Request (DAR).

Table Comparison of PIC-SURE Open and Authorized Access

PIC-SURE Open Access:

  • Removed stigmatizing variables

  • Data obfuscation

  • Access to aggregate counts

  • Phenotypic variable search

  • Phenotypic variable filtering

PIC-SURE Authorized Access:

  • dbGaP approval to access required

  • Access to aggregate counts

  • Access to participant-level data

  • Phenotypic variable search

  • Phenotypic variable filtering

  • Genomic variable filtering

  • Data retrieval

  • Visualizations

PIC-SURE Open Access

PIC-SURE Open Access allows you to search any clinical variable available in PIC-SURE. Your queries will return obfuscated aggregate counts per study and consent. There are some features specific to PIC-SURE Open Access, which are outlined below.

PIC-SURE Open Access specific features and layout.

A. Stigmatizing Variables Removal: PIC-SURE Open Access data excludes clinical variables that contain potentially sensitive information. These variables are known as stigmatizing variables, which fall into the following categories:

  • Mental health diagnoses, history, and treatment

  • Illicit drug use history

  • Sexually transmitted disease diagnoses, history, and treatment

  • Sexual history

  • Intellectual achievement, ability, and educational attainment

  • Direct or surrogate identifiers of legal status

For more information about stigmatizing variables and the identification process, please refer to the documentation and code on the BioData Catalyst Powered by PIC-SURE Stigmatizing Variables GitHub repository.

B. Data Obfuscation: Because participant-level data are not available in PIC-SURE Open Access, the aggregate counts are obfuscated to further anonymize the data. This means that:

  • If the consent group, study, and/or total participants of the query is between one and nine, the results will be shown as < 10.

  • If the consent group results are between one and nine and the study and/or total participants of the query is greater than 10, the results will be obfuscated by ± 3.

  • Query results that are zero participants will display 0.
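The sketch below is a deliberately simplified, illustrative version of the obfuscation rules above, applied to a single count; the production behavior also depends on the consent-group, study, and total counts considered together.

```python
import random

def obfuscate_count(n: int) -> str:
    """Simplified sketch of the Open Access obfuscation rules: zero displays
    as 0, counts between one and nine display as '< 10', and larger counts
    are perturbed by a random offset of up to +/- 3."""
    if n == 0:
        return "0"
    if 1 <= n <= 9:
        return "< 10"
    return str(n + random.randint(-3, 3))
```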

C. View Filtered Results by Study: The filtered number of participants which match the query criteria is shown broken down by study and consent group. Users can see if they do or do not have access to specific studies.

Use Case: Using PIC-SURE Open Access to Investigate Asthma in Healthy and Obese Adult Populations

In this section, the functionalities of PIC-SURE Open Access will be described in the context of a scientific use case. Specifically, let’s say I am interested in investigating asthma in relation to obesity in adults.

I’m interested in two cohorts: obese adults with a body mass index (BMI) greater than 30 and healthy adults with a BMI between 18.5 and 24.9. However, I have not yet submitted a Data Access Request and therefore am not authorized to access any datasets.

First, let’s explore cohort A: Healthy adults with a BMI between 18.5 and 24.9 in Framingham Heart Study (FHS).

  1. Search for ‘age’.

  2. Apply ‘FHS’ study tag to view only ‘age’ variables within the Framingham Heart Study (phs000007).

  3. Select the variable of interest. You may notice many variables that appear similar. These variables may be located in different datasets, or tables, but contain similar information. Open up the variable information modal by clicking on the row containing the variable of interest to learn more.

    Variable Information modal for ‘age1’ variable from Framingham Heart Study.
  4. Filter to adults only by clicking the filter icon next to the variable. I am interested in adults, so I will set the minimum age to 18, then click “Add filter to query”.

    Adding a filter to the ‘age1’ variable from Framingham Heart Study.
  5. Now, let’s filter to healthy adults with a BMI between 18.5 and 24.9. Similar to before, we will search ‘BMI’. We can narrow down the search results using the variable-level tags by including terms related to our variable of interest (such as ‘continuous’ to view only continuous variables) and excluding out-of-scope terms (such as ‘allergy’). After selecting the variable of interest, we can filter to the desired ranges before adding the filter to our query. Notice how the total number of participants in our cohort changes.

  6. Finally, we will filter for participants who have asthma.

    Adding a filter to the ‘B128’ variable from Framingham Heart Study.
  7. Note the total participant count in the Data Summary.

We can easily modify our filters to explore cohort B: Obese adults with a body mass index (BMI) greater than 30 in Framingham Heart Study.

  1. Edit the BMI filter by clicking the edit icon in the Added Variable Filters section. Change the range to have a minimum of 30 and no maximum.

  2. Note the total participant count in the Data Summary.

We can easily repeat these steps for other studies, such as the Genetic Epidemiology of COPD (COPDGene) study, and create a table like the one below. By comparing these two studies, I can see that COPDGene may be more promising for my research since it contains many more participants in my cohorts of interest than FHS does.

Participants with asthma in each cohort (Cohort A: BMI 18.5-24.9; Cohort B: BMI greater than 30):

  • Framingham Heart Study (FHS): Cohort A: 50 +/- 3; Cohort B: 72 +/- 3

  • Genetic Epidemiology of COPD (COPDGene): Cohort A: 488 +/- 3; Cohort B: 868

I can then use the Request Access button to go directly to the study’s dbGaP page and begin submitting a DAR.

PIC-SURE Features and General Layout

General layout of PIC-SURE search
  1. Search bar: Enter any phenotypic variable, study or table keyword into the search bar to search across studies. Users can also search specific variables by accession number, if known (phs/pht/phv).

  2. Study Tags: Users can filter the results found through their search by limiting to studies of interest or excluding studies.

  3. Variable Tags: Users can filter the results found through their search by limiting to keywords of interest or excluding keywords that are out of scope. For example, a user could filter to categorical variables, variables containing the term ‘blood’, and/or exclude variables containing the term ‘pressure’.

    How are variable tags generated? Each variable has a set of associated tags, which are generated during the PIC-SURE data loading process. These tags are based on information associated with the variable, including the name of the study, the study description, the dataset name, the PIC-SURE data type (continuous or categorical), and the variable description. When you search in PIC-SURE, the tags associated with the matching variables are displayed. Note that tags applicable to less than 5% or more than 95% of the search results are not displayed, since these are not useful for filtering results (a minimal sketch of this display rule follows the list).

  4. Search Results table: View all variables associated with your search term and/or study & variable tags.

  5. Results Panel: Panel with content boxes that describe the cohort based on the variable filters applied to the query.

  6. Data Summary: Displays the total number of participants in the filtered cohort which meet the query criteria. When first opening the Open or Authorized Access page, the number will be the total number of participants that you can access.

  7. Added Variable Filters summary: View all filters which have been applied to the cohort.

  8. Filter Action: Click on the filter icon to filter cohort participants by specific variable values.

  9. Reset button: Allows users to start a new search and query by removing all added filters and clearing all active study and variable tags.
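As referenced in the note on variable tags above, the small sketch below illustrates the stated display rule (hide tags that apply to fewer than 5% or more than 95% of the search results); the function name and data shapes are assumptions for illustration.

```python
def tags_to_display(tag_counts: dict, total_results: int) -> list:
    """Sketch of the tag display rule: keep only tags that apply to between
    5% and 95% of the search results, since tags outside that band are not
    useful for narrowing the result set."""
    lower, upper = 0.05 * total_results, 0.95 * total_results
    return sorted(tag for tag, count in tag_counts.items() if lower <= count <= upper)

# Example: tags_to_display({"blood": 40, "categorical": 99, "rare": 1}, 100)
# -> ["blood"]
```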

Data Organization in PIC-SURE

PIC-SURE integrates clinical and genomic datasets across BDC, including TOPMed and TOPMed related studies, COVID-19 studies, and BioLINCC studies. Each variable is organized as a concept path that contains information about the study, variable group, and variable. Though the specifics of the concept paths are dependent on the type of study, the overall information included is the same.

For more information about additional dbGaP, TOPMed, and PIC-SURE concept paths, refer to Appendix 1.

Table of Data Fields in PIC-SURE

General organization

  • dbGaP-formatted studies: Data are organized using the format implemented by the database of Genotypes and Phenotypes (dbGaP); find more information on the dbGaP data structure here. Generally, a given study will have several tables, and those tables have several variables.

  • Studies not in dbGaP format: Data do not follow the dbGaP format; there are no phv or pht accessions. Data are organized in groups of like variables, when available. For example, variables like Age, Gender, and Race could be part of the Demographics variable group.

Concept path structure

  • dbGaP-formatted studies: \phs\pht\phv\variable name\

  • Studies not in dbGaP format: \phs\variable name

Variable ID

  • dbGaP-formatted studies: phv corresponding to the variable accession number

  • Studies not in dbGaP format: Equivalent to variable name

Variable name

  • dbGaP-formatted studies: Encoded variable name that was used by the original submitters of the data

  • Studies not in dbGaP format: Encoded variable name that was used by the original submitters of the data

Variable description

  • dbGaP-formatted studies: Description of the variable

  • Studies not in dbGaP format: Description of the variable, as available

Dataset ID

  • dbGaP-formatted studies: pht corresponding to the trait table accession number

  • Studies not in dbGaP format: Equivalent to dataset name

Dataset name

  • dbGaP-formatted studies: Name of the trait table

  • Studies not in dbGaP format: Name of a group of like variables, as available

Dataset description

  • dbGaP-formatted studies: Description of the trait table

  • Studies not in dbGaP format: Description of a group of like variables, as available

Study ID

  • dbGaP-formatted studies: phs corresponding to the study accession number

  • Studies not in dbGaP format: phs corresponding to the study accession number

Study description

  • dbGaP-formatted studies: Description of the study from dbGaP

  • Studies not in dbGaP format: Description of the study from dbGaP

Note that there are two data types in PIC-SURE: categorical and continuous data. Categorical variables refer to any variables that have categorized values. For example, “Have you ever had asthma?” with values “Yes” and “No” is a categorical variable. Continuous variables refer to any variables that have a numeric range of values. For example, “Age” with a value range from 10 to 90 is a continuous variable. The internal PIC-SURE data load process determines the type of each variable based on the data.
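As a concrete illustration of the concept path structure for a dbGaP-formatted study, the snippet below splits a path into its components. The pht/phv accessions and variable name shown are placeholders for illustration, not real BDC identifiers.

```python
# Placeholder concept path following the \phs\pht\phv\variable name\ structure
# described above (the pht/phv accessions and variable name are illustrative).
concept_path = "\\phs000007\\pht000033\\phv00007079\\AGE1\\"

# Split on the backslash separators to recover study, dataset, and variable parts.
study_id, dataset_id, variable_id, variable_name = [
    part for part in concept_path.split("\\") if part
]
print(study_id, dataset_id, variable_id, variable_name)
# -> phs000007 pht000033 phv00007079 AGE1
```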

database of Genotypes and Phenotypes (dbGaP)
here

Additional Resources

Video Walkthroughs

Playlist

BioData Catalyst Powered by PIC-SURE YouTube channel

Videos

Introduction to BioData Catalyst Powered by PIC-SURE

Basics: Finding Variables

Basics: Applying a Variable on a Filter

Basics: Editing a Variable Filter

PIC-SURE Open Access: Interpreting the Results

PIC-SURE Authorized Access: Applying a Genomic Filter

PIC-SURE Authorized Access: Add Variables to Export

PIC-SURE Authorized Access: Select and Package Data Tool

PIC-SURE Authorized Access: Variable Distributions Tool

PIC-SURE Open Application Programming Interface (API)

Appendix 1: BDC Identifiers - dbGaP, TOPMed, and PIC-SURE

Table of BDC dbGAP/TOPMed Identifiers

Patient ID

This is the HPDS Patient Num, PIC-SURE HPDS’s internal identifier.

Topmed / Parent Study Accession with Subject ID

  • These are the identifiers used by each team in the consortium to link data.

  • Values must follow this mask <STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID> Eg: phs000007.v30_XXXXXXX

DBGAP_SUBJECT_ID

  • This is a generated ID that is unique to each patient in a study.

  • Controlled by dbGaP.

  • It is not unique across unrelated studies; however, patients can be linked across studies (see SOURCE_SUBJECT_ID).

  • A patient will be assigned the same dbGaP subject ID across related studies. For dbGaP to assign the same dbGaP subject ID, include the two variables SUBJECT_SOURCE and SOURCE_SUBJECT_ID.

  • This identifier is used in all the phenotypic data files and is what is mapped to an HPDS Patient Num (Patient ID). All mapped identifiers are stored in a PatientMapping file in S3. These mappings allow HPDS data to be correlated back to the raw data sets.

SUBJECT_ID

  • This is a generated id that is unique to each patient in a study.

  • Controlled by the submitter of a study.

  • For FHS, phs000007 uses shareid in place of SUBJECT_ID, while phs000974 uses SUBJECT_ID; the values for these two columns are the same, however.

SHARE_ID

  • For FHS phs000007, this was used instead of SUBJECT_ID, but not for FHS phs000974.

SOURCE_SUBJECT_ID

  • This is used internally by dbGaP in conjunction with SUBJECT_SOURCE to allow submitters to associate subjects across studies.

SAMPLE_ID

  • De-identified sample identifier.

  • These are the IDs that link to the molecular data in dbGaP (VCFs, etc.).

Table of PIC-SURE Identifiers

\_Topmed Study Accession with Subject ID\

Generated identifier for TOPMed Studies. These identifiers are a concatenation using the accession name and “SUBJECT_ID” from a study’s subject multi file.

<STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID>

Eg: phs000974.v3_XXXXXXX

\_Parent Study Accession with Subject ID\

Generated identifier for PARENT studies. In most studies this follows the same pattern as the TOPMed Study Accession with Subject ID.

However, Framingham’s parent study phs000007 does not contain a SUBJECT_ID column; the SHAREID column is used instead.

Eg: phs000007.v3_XXXXXXX

\_VCF Sample Id\

This variable is stored in the sample multi file in each dbGaP study.

This is the TOPMed DNA sample identifier. This is used to give each sample/sequence a unique identifier across TOPMed studies.

Eg: NWD123456

Patient ID (not a concept path but exists in data exports)

This is PIC-SURE’s internal Identifier. It is commonly referred to as HPDS Patient num.

This identifier is generated and assigned to subjects when they are loaded. It is not meant for data correlation between different data sources.
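
Identifiers of the form <STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID> described above can also be handled programmatically. The following is a minimal, illustrative Python sketch; the identifier value used here is hypothetical.

def parse_picsure_subject_id(identifier: str) -> dict:
    # Split "phs000974.v3_1234567" (hypothetical value) into accession, version, and subject ID.
    accession_and_version, subject_id = identifier.split("_", 1)
    study_accession, version = accession_and_version.split(".", 1)
    return {
        "study_accession": study_accession,  # e.g. phs000974
        "version": version,                  # e.g. v3
        "subject_id": subject_id,            # e.g. 1234567
    }

print(parse_picsure_subject_id("phs000974.v3_1234567"))
# {'study_accession': 'phs000974', 'version': 'v3', 'subject_id': '1234567'}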

PIC-SURE API Documentation

How to get started with PIC-SURE and the common endpoints you can use to query any resource registered with PIC-SURE

The PIC-SURE v2 API is a meta-API used to host any number of resources exposed through a unified set of generalized operations.

PIC-SURE Repositories:

  • PIC-SURE API: This is the repository for version 2+ of the PIC-SURE API.

  • PIC-SURE Wiki: This is the wiki page for version 2+ of the PIC-SURE API.

  • BioData Catalyst PIC-SURE: This is the repository for the BDC environment of PIC-SURE.

  • PIC-SURE-ALL-IN-ONE: This is the repository for PIC-SURE-ALL-IN-ONE.

Additional PIC-SURE Links:

  • DCPPC Presentation on PIC-SURE as a meta-API

  • Avillachlab-Jenkins Repository: A link to the Avillach Lab Jenkins repository.

  • Avillachlab-Jenkins Dev Release Control: A repository for Avillach Lab Jenkins development release control.

Client Libraries

The following are the collected client libraries for the entire PIC-SURE project.

  • R Client Library

  • Python Client Library

PIC-SURE User Interface

The PIC-SURE User Interface acts as a visual aid for running normal queries of resources through PIC-SURE.

PIC-SURE User Interface Repositories:

  • PIC-SURE HPDS UI: The main High Performance Data Store (HPDS) UI repository.

Additional PIC-SURE User Interface Links:

  • PIC-SURE UI Flow: Links to a google drawing of the PIC-SURE UI flow.

PIC-SURE Auth Micro-App (PSAMA)

The PSAMA component of the PIC-SURE ecosystem authorizes and authenticates all actions taken within PIC-SURE.

PSAMA Repos:

  • PIC-SURE Auth MicroApp Repository

Additional PSAMA Links:

  • PSAMA Core Logic: This is where the core of the PSAMA application is stored in GitHub

High Performance Data Store (HPDS)

HPDS is a datastore designed to work with the PIC-SURE meta-API. It grants researchers fast, dependable access to static datasets and the ability to produce statistics-ready dataframes filtered on any variable they choose at any time.

HPDS Repositories:

  • PIC-SURE HPDS: The main HPDS repository.

  • PIC-SURE HPDS Python Client: Python client library to run queries against a PIC-SURE HPDS resource.

  • PIC-SURE HPDS R Client: R client library to run queries against a PIC-SURE HPDS resource.

  • PIC-SURE HPDS UI: The main HPDS UI repository.

  • HPDS Annotation: This repository describes steps to prepare and annotate VCF files for loading into HPDS.

Data Analysis Using the PIC-SURE API

Once you have refined your queries and created a cohort of interest, you can begin analyzing data using other components of the BDC ecosystem.

What is the PIC-SURE API?

Databases exposed through the PIC-SURE API encompass a wide heterogeneity of architectures and data organizations underneath. PIC-SURE hides this complexity and exposes the different databases in the same format, allowing researchers to focus on the analysis and medical insights, thus easing the process of reproducible science. The API is available in two programming languages, Python and R, allowing investigators to query databases in the same way using either language. The PIC-SURE API tutorial notebooks can be accessed directly on GitHub.

PIC-SURE Access Token

To access the PIC-SURE API, a user-specific token is needed. This is the way the API grants access to individual users to protected-access data. The user token is strictly personal; do not share it with anyone. You can copy your personalized access token by selecting the User Profile tab at the top of the screen.

Here, you can Copy your personalized access token, Reveal your token, and Refresh your token to retrieve a new token and deactivate the old token.

User Profile modal displaying personalized access token.
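
For orientation, the minimal sketch below shows the general shape of connecting to PIC-SURE from a Python notebook using the personalized access token. It follows the pattern used in the PIC-SURE tutorial notebooks, which remain the authoritative reference; the network URL, resource ID, and exact client method names are assumptions and may differ between client versions.

import PicSureClient    # PIC-SURE connection library used in the tutorial notebooks
import PicSureHpdsLib   # HPDS adapter library used in the tutorial notebooks

PICSURE_NETWORK_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"  # assumed endpoint
RESOURCE_ID = "<resource id provided in the tutorial notebook>"                # placeholder
MY_TOKEN = "<paste your personalized access token here>"                       # keep this private

# Connect to the PIC-SURE network, attach the HPDS adapter, and select a resource.
connection = PicSureClient.Client.connect(PICSURE_NETWORK_URL, MY_TOKEN)
adapter = PicSureHpdsLib.Adapter(connection)
resource = adapter.useResource(RESOURCE_ID)

# Search the data dictionary for variables mentioning "asthma" and count the matches.
asthma_variables = resource.dictionary().find("asthma")
print(asthma_variables.count())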

Analysis in the BDC Ecosystem

The PIC-SURE API can be accessed via tutorial notebooks on either BDC-Seven Bridges or BDC-Terra.

To launch one of the analysis platforms, go to the BioData Catalyst website. From the Resources menu, select Services. A list of platforms and services on the BDC ecosystem will be displayed.

From the Analyze Data in Cloud-based Shared Workspaces section, select Launch for your preferred analysis platform.

List of analysis platforms in the Analyze Data in Cloud-based Shared Workspaces section on the BioData Catalyst website.

BDC-Seven Bridges

Jupyter notebook examples in R and python can be found under the Public projects tab by selecting PIC-SURE API.

Navigating to the PIC-SURE API in Seven Bridges Public Projects.

From the Data Studio tab, select an example that fits your research needs. Here, we will select PIC-SURE JupyterLab examples.

Dashboard of the PIC-SURE API on Seven Bridges

This will take you to the PIC-SURE API analysis workspace, where you can view the examples in python. Copy this workspace to your own project to edit or run the code yourself.

Copying the PIC-SURE API Public Project to a workspace from the Data Studio page.

Note The project must have network access to run the PIC-SURE examples on Seven Bridges. To ensure this, go to the Settings tab and select “Allow network access”.

BDC-Terra

To access the Jupyter notebook examples in R and python for the PIC-SURE API, select View Workspaces from the Terra landing page.

BioData Catalyst Powered by Terra landing page

Select the Public tab and search for “PIC-SURE”. Workspaces for both the python and R examples will be displayed. You must clone the workspaces to edit or run the code within them.

Searching for the PIC-SURE API examples in Terra workspaces

PIC-SURE Authorized Access

If you are authorized to access any dbGaP dataset(s), the Authorized Access tab at the top will be visible. PIC-SURE Authorized Access provides access to complete, participant-level data, in addition to aggregate counts, and access to the Tool Suite.

PIC-SURE Authorized Access specific features and layout.

A. Select Variables Action: Click the Select Variables icon to include variables when retrieving data. Users can select variables individually or at the dataset level.

  • Individually select variables: You can individually select variables from two locations:

    • Variable search results: From the search results you can click the data retrieval icon to include the variable in your data retrieval.

    • Variable modal variable data retrieval: The data retrieval icon next to the variable adds the variable to your data retrieval.

  • Select from a dataset or group of variables: In the variable modal the data retrieval icon next to the dataset opens a modal to allow you to select variables from the dataset table or group of variables.

B. Data Summary: In addition to the total number of participants in the filtered cohort, the number of variables the user has selected for data retrieval is also displayed.

There are four concept paths that are automatically included with any data export from PIC-SURE Authorized Access. These fields are listed and described below.

  • Patient ID: Internal PIC-SURE participant identifier. Please note that this field does not link participants between studies and therefore should not be used for data correlation between different data sources or back to the original data files.

  • Parent Study Accession with Subject ID: PIC-SURE generated identifier for parent studies. These identifiers are a combination of the study accession number and the subject identifier.

  • Topmed Study Accession with Subject ID: PIC-SURE generated identifier for TOPMed studies. These identifiers are a combination of the study accession number and the subject identifier.

  • Consents: Field used to determine which groups users are authorized to access from dbGaP. These identifiers are a combination of the study accession number and consent code.

C. Tool Suite: The Tool Suite contains tools that can be used to further explore filtered cohorts of interest. Note that at least one filter must be added to the query before using the Tool Suite.

  • Select and Package Data: Retrieve participant-level data corresponding to your filters and variable selections. Variables selected for data retrieval can be reviewed and modified. To learn more about the options associated with this tool, please refer to the Select and Package Data section.

  • Variable Distributions: View the distributions of query variables based on the filtered cohort. Note that there is a limit to the number of variable distributions that can be viewed at a given time. Additionally, neither genomic variables nor variables associated with an any-record-of filter (e.g., entire datasets) will be graphed.

Select and Package Data

The Select and Package Data tool is used to select and export participant-level data corresponding to your filters and variable selections. There are several options for selecting and exporting the data, which are shown using this tool.

Select and Package Data tool modal with example filters and variables.

In the top left corner of the modal, the number of participants and the number of variables included in the query are shown. These numbers are used to display the estimated number of data points in the export (for example, a cohort of 5,000 participants with 100 selected variables would correspond to roughly 500,000 data points).

Note: Queries with more than 1,000,000 data points will not be exportable.

The table below displays a summary of the variables included in the export. Using the Selected column, variables that have been added to the export can be selected or deselected for the final dataframe.

Note: Variables with filters are automatically included in the export.

The Package Data button in the top right corner is used to prepare the data for export once the variable filters and selections have been finalized.

Select and Package Data tool modal with example filters and variables after clicking “Package Data”.

Once this button is clicked, there are several options to complete the export.

To export into a BDC analysis workspace, the Export to Seven Bridges or Export to Terra buttons can be used. After clicking either of these buttons, a new modal will be displayed with all of the information and instructions needed to complete the export, including your personalized access token and the query ID associated with the dataframe. Additionally, there is the option to Copy Query ID without accessing Seven Bridges or Terra if you wish to use a different analysis platform.

The Export to Seven Bridges option includes a Go to Seven Bridges button, which will open a new tab to the Public PIC-SURE API Project on BDC-Seven Bridges.

Export to Seven Bridges modal.

The Export to Terra option includes a Go to Terra via R button and a Go to Terra via Python button, which will open the Public PIC-SURE API R Examples workspace and the Public PIC-SURE API Python Examples workspace on BDC-Terra, respectively.

Export to Terra modal.

Use Case: Investigating Comorbidities of Breast Cancer in Authorized Access

In this section, the functionalities of PIC-SURE Authorized Access will be described in the context of a scientific use case. Specifically, let’s say I am interested in investigating some comorbidities of breast cancer in women with BRCA1 and BRCA2 gene variants, such as hypertension and COPD.

I already have been authorized to access the Women’s Health Initiative (WHI) study and am interested in a single cohort: women with breast cancer and variants of the BRCA1 and BRCA2 genes. I want to select hypertension-related variables of interest, check the distributions of some variables, and export all the data to an analysis workspace.

First, let’s apply our variable filters for the WHI study.

  1. Search “breast cancer” in Authorized Access.

  2. Add the WHI study tag to filter search results to only variables found within the WHI study.

  3. Filter to participants with breast cancer by clicking the filter icon next to the variable of interest. Select values to filter your variable on and click “Add Filter to Query”.

    Adding a filter to the ‘BREAST’ variable from Women’s Health Initiative Study.
  4. Click the “Genomic Filtering” button to begin a filter on genomic variants.

  5. Select “BRCA1” and “BRCA2” genes of “High” and “Moderate” severity. Click “Apply genomic filter”.

    Applying a genomic filter for BRCA1 and BRCA2 gene variants of “High” and “Moderate” severity.
  6. Now, let’s filter to participants that have and do not have COPD. Similar to before, we will search ‘COPD’. After selecting the variable of interest, we can filter to the desired values before adding the filter to our query. Notice how the total number of participants in our cohort changes.

  7. Search “hypertension”.

  8. Add variables to data export by clicking the select variables icon in the Actions column next to the variable of interest. The icon next to variables selected for export will change to the checkmark icon.

    Adding ‘hypertension’ variables (‘HTNTRT’, ‘HYPT’, ‘HYPTPILL’, and ‘HYPTPILN’) for export from the Women’s Health Initiative Study.
  9. Notice how the number of variables changed in the Data Summary box.

  10. Before we Select and Package the data for export, let’s view the distribution of our participants’ ages to see if we have a normal distribution. Open the Variable Distributions tool in the Tool Suite. Here, we can see the distributions of the two added variable filters: breast cancer (‘BREAST’) and COPD (‘F33COPD’).

    Variable Distributions modal for the Authorized Access example cohort.
  11. Open the Select and Package Data tool in the Tool Suite. The variables shown in this table are those which will be available in your data export; you can remove variables as necessary.

    Select and Package Data modal.
  12. Click “Package Data” when you are ready.

  13. Once the data is packaged, you can select either “Export to Seven Bridges” or “Export to Terra”. Copy over the personalized user token and query ID to use the PIC-SURE API and export your data to an analysis workspace.

Appendix 2: Table of Harmonized Variables

Each harmonized variable is listed below with its data type, units (where applicable), its associated terminology source (UMLS), and, for encoded variables, its permissible values.

  • cac_volume_1 (decimal; cubic millimeters; UMLS): Coronary artery calcium volume using CT scan(s) of coronary arteries.

  • cac_score_1 (decimal; UMLS): Coronary artery calcification (CAC) score using Agatston scoring of CT scan(s) of coronary arteries.

  • cimt_1 (decimal; mm; UMLS): Common carotid intima-media thickness, calculated as the mean of two values: mean of multiple thickness estimates from the left far wall and from the right far wall.

  • cimt_2 (decimal; mm; UMLS): Common carotid intima-media thickness, calculated as the mean of four values: maximum of multiple thickness estimates from the left far wall, left near wall, right far wall, and right near wall.

  • carotid_stenosis_1 (encoded; UMLS): Extent of narrowing of the carotid artery. Values: 0=None||1=1%-24%||2=25%-49%||3=50%-74%||4=75%-99%||5=100%

  • carotid_plaque_1 (encoded; UMLS): Presence or absence of carotid plaque. Values: 0=Plaque not present||1=Plaque present

  • height_baseline_1 (decimal; cm; UMLS): Body height at baseline.

  • current_smoker_baseline_1 (encoded; UMLS): Indicates whether subject currently smokes cigarettes. Values: 0=Does not currently smoke cigarettes||1=Currently smokes cigarettes

  • weight_baseline_1 (decimal; kg; UMLS): Body weight at baseline.

  • ever_smoker_baseline_1 (encoded; UMLS): Indicates whether subject ever regularly smoked cigarettes. Values: 0=Never a cigarette smoker||1=Current or former cigarette smoker

  • bmi_baseline_1 (decimal; kg/m^2; UMLS): Body mass index calculated at baseline.

  • hemoglobin_mcnc_bld_1 (decimal; g/dL = grams per deciliter; UMLS): Measurement of mass per volume, or mass concentration (mcnc), of hemoglobin in the blood (bld).

  • hematocrit_vfr_bld_1 (decimal; % = percentage; UMLS): Measurement of hematocrit, the fraction of volume (vfr) of blood (bld) that is composed of red blood cells.

  • rbc_ncnc_bld_1 (decimal; millions / microliter; UMLS): Count by volume, or number concentration (ncnc), of red blood cells in the blood (bld).

  • wbc_ncnc_bld_1 (decimal; thousands / microliter; UMLS): Count by volume, or number concentration (ncnc), of white blood cells in the blood (bld).

  • basophil_ncnc_bld_1 (decimal; thousands / microliter; UMLS): Count by volume, or number concentration (ncnc), of basophils in the blood (bld).

  • eosinophil_ncnc_bld_1 (decimal; thousands / microliter; UMLS): Count by volume, or number concentration (ncnc), of eosinophils in the blood (bld).

  • neutrophil_ncnc_bld_1 (decimal; thousands / microliter; UMLS): Count by volume, or number concentration (ncnc), of neutrophils in the blood (bld).

  • lymphocyte_ncnc_bld_1 (decimal; thousands / microliter; UMLS): Count by volume, or number concentration (ncnc), of lymphocytes in the blood (bld).

  • monocyte_ncnc_bld_1 (decimal; thousands / microliter; UMLS): Count by volume, or number concentration (ncnc), of monocytes in the blood (bld).

  • platelet_ncnc_bld_1 (integer; thousands / microliter; UMLS): Count by volume, or number concentration (ncnc), of platelets in the blood (bld).

  • mch_entmass_rbc_1 (decimal; pg = picogram; UMLS): Measurement of the average mass (entmass) of hemoglobin per red blood cell (rbc), known as mean corpuscular hemoglobin (MCH).

  • mchc_mcnc_rbc_1 (decimal; g/dL = grams per deciliter; UMLS): Measurement of the mass concentration (mcnc) of hemoglobin in a given volume of packed red blood cells (rbc), known as mean corpuscular hemoglobin concentration (MCHC).

  • mcv_entvol_rbc_1 (decimal; fL = femtoliter; UMLS): Measurement of the average volume (entvol) of red blood cells (rbc), known as mean corpuscular volume (MCV).

  • pmv_entvol_bld_1 (decimal; fL = femtoliter; UMLS): Measurement of the mean volume (entvol) of platelets in the blood (bld), known as mean platelet volume (MPV or PMV).

  • rdw_ratio_rbc_1 (decimal; % = percentage; UMLS): Measurement of the ratio of variation in width to the mean width of the red blood cell (rbc) volume distribution curve taken at +/- 1 CV, known as red cell distribution width (RDW).

  • bp_systolic_1 (decimal; mmHg; UMLS): Resting systolic blood pressure from the upper arm in a clinical setting.

  • bp_diastolic_1 (decimal; mmHg; UMLS): Resting diastolic blood pressure from the upper arm in a clinical setting.

  • antihypertensive_meds_1 (encoded; UMLS): Indicator for use of antihypertensive medication at the time of blood pressure measurement. Values: 0=Not taking antihypertensive medication||1=Taking antihypertensive medication

  • race_1 (encoded; UMLS): Harmonized race category of participant. Values: AI_AN=American Indian_Alaskan Native or Native American||Asian=Asian||Black=Black or African American||HI_PI=Native Hawaiian or other Pacific Islander||Multiple=More than one race||Other=Other race||White=White or Caucasian

  • ethnicity_1 (encoded; UMLS): Indicator of Hispanic or Latino ethnicity. Values: both=ethnicity component dbGaP variable values for a subject were inconsistent/contradictory (e.g. over multiple visits)||HL=Hispanic or Latino||notHL=not Hispanic or Latino

  • hispanic_subgroup_1 (encoded; UMLS): Classification of Hispanic/Latino background for Hispanic/Latino subjects where country or region of origin information is available. Values: CentralAmerican=Central American||CostaRican=from Costa Rica||Cuban=Cuban||Dominican=Dominican||Mexican=Mexican||PuertoRican=Puerto Rican||SouthAmerican=South American

  • annotated_sex_1 (encoded; UMLS): Subject sex, as recorded by the study. Values: female=Female||male=Male

  • geographic_site_1 (encoded; UMLS): Recruitment/field center, baseline clinic, or geographic region.

  • subcohort_1 (encoded; UMLS): A distinct subgroup within a study, generally indicating subjects who share similar characteristics due to study design. Subjects may belong to only one subcohort.

  • lipid_lowering_medication_1 (encoded; UMLS): Indicates whether participant was taking any lipid-lowering medication at blood draw to measure lipids phenotypes. Values: 0=Participant was not taking lipid-lowering medication||1=Participant was taking lipid-lowering medication.

  • fasting_lipids_1 (encoded; UMLS): Indicates whether participant fasted for at least eight hours prior to blood draw to measure lipids phenotypes. Values: 0=Participant did not fast_or fasted for fewer than eight hours prior to measurement of lipids phenotypes.||1=Participant fasted for at least eight hours prior to measurement of lipids phenotypes.

  • total_cholesterol_1 (decimal; mg/dL; UMLS): Blood mass concentration of total cholesterol.

  • triglycerides_1 (decimal; mg/dL; UMLS): Blood mass concentration of triglycerides.

  • hdl_1 (decimal; mg/dL; UMLS): Blood mass concentration of high-density lipoprotein cholesterol.

  • ldl_1 (decimal; mg/dL; UMLS): Blood mass concentration of low-density lipoprotein cholesterol.

  • vte_prior_history_1 (encoded; UMLS): An indicator of whether a subject had a venous thromboembolism (VTE) event prior to the start of the medical review process (including self-reported events). Values: 0=did not have prior VTE event||1=had prior VTE event

  • vte_case_status_1 (encoded; UMLS): An indicator of whether a subject experienced a venous thromboembolism event (VTE) that was verified by adjudication or by medical professionals. Values: 0=Not known to ever have a VTE event_either self-reported or from medical records||1=Experienced a VTE event as verified by adjudication or by medical professionals

  • age_at_* (decimal; years): For each phenotypic value for a given subject, an associated age at measurement is provided. See the DCC harmonized variable documentation for more information.

  • unit_* (encoded): For each harmonized variable, a paired “unit_variable” is provided, whose value indicates where in the documentation to look to find the set of component variables and the algorithm used to harmonize those variables. See the DCC harmonized variable documentation for more information.

Workspace

Overview of Workspaces on BDC-Gen3

When navigating to a Workspace, users are presented with multiple workspace options.

The Gen3 platform offers two workspace environments: Jupyter Notebooks and R Studio.

There are six workspaces:

Virtual machines (VM):

  • Small Jupyter Notebook VM

  • Large Jupyter Notebook Power VM

  • R Studio VM

Pre-made workflow workspaces:

  • Autoencoder Demo

  • CIP Demo

  • Tensorflow-Pytorch.

To start a workspace, select Launch. You will see the following launch loading screen.

Launching a VM can take up to five minutes depending on the size and complexity of the workspace.

Once the VM is ready, the initial screen for the workspace will appear. For scripts and output that need to be saved when the workspace is terminated, store those files in the pd/ directory.

This workspace will persist once the user has logged out of the BDC-Gen3 system. If the workspace is no longer being used, terminate the workspace by selecting Terminate Workspace at the bottom of the window. You will be returned to the Workspace page with all of the workspace options.

For more information about the Gen3 Workspace, refer to the Gen3 documentation.

Query

Overview of the Query page on BDC-Gen3

Overview

The Query page can search and return metadata from either the Flat Model or the Graph Model of a commons. Using GraphQL, these searches can be tailored to filter and return fields of interest for the data sets being queried. These queries can be made immediately after data submission because they run against the model directly.

For more information about how to use the Query page, refer to the Gen3 documentation.
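
For users who prefer scripting such queries, the sketch below shows the same kind of graph-model query issued from Python. It is a minimal sketch under stated assumptions: it uses the open-source gen3 Python SDK (Gen3Auth and Gen3Submission), a credentials.json API key file downloaded from the Profile page, and an assumed BDC-Gen3 endpoint URL; the node and field names in the query are illustrative.

from gen3.auth import Gen3Auth
from gen3.submission import Gen3Submission

COMMONS_URL = "https://gen3.biodatacatalyst.nhlbi.nih.gov"     # assumed BDC-Gen3 endpoint
auth = Gen3Auth(COMMONS_URL, refresh_file="credentials.json")  # API key file from the Profile page
submission = Gen3Submission(COMMONS_URL, auth)

# Illustrative graph-model query: submitter and project IDs for the first 10 subject records.
query_text = """
{
  subject(first: 10) {
    submitter_id
    project_id
  }
}
"""
print(submission.query(query_text))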

Dictionary

Interactive Data Dictionary on BDC-Gen3

Overview

The Dictionary page contains an interactive visual representation of the Gen3 data model. The default graph model view, as pictured below, displays all of the nodes and relationships between nodes in a hierarchical structure. The model further specifies the node types and links between nodes, as highlighted in the legend located at the top right side of the page.

Graph View

Users can click on any of the graph nodes in order to learn more about their respective properties. By clicking on a node, the graph will highlight that specific node and all associated links that connect it to the Program node. A "Data Model Structure" list will also appear on the left side toolbar. This will display the node path required to reach the selected node from the Program node.

When a second node in the path is selected, it will then gray out the other possible paths and only highlight the selected path. It will also change the "Data Model Structure" list on the left side toolbar.

The left side toolbar has two options available:

  • Download templates: Will download the submission files for all of the nodes in the "Data Model Structure" list.

  • Open properties: Will open the node properties in a new pop-up window; an example is displayed in the following screenshot. This option can also be found on the node that was first selected.

This property view will display all properties in the node and information about each property:

  • Property: Name of the property.

  • Type: The type of input for the node. Examples of this are string, integer, Boolean and enumerated values (enum), which are displayed as preset strings.

  • Required: This field will display whether the property is required for the submission of the node into the data model.

  • Description: This field will display further information about the property.

  • Term: This field can be populated with external resources that have further information about the property.

Table View

The Table view is similar to the Properties view, and nodes are displayed as a list of entries grouped by their node category.

Clicking on one of the nodes will open the Properties view of the node.

Dictionary Search

The Dictionary contains a text-based search function that will search through the names of the properties and the descriptions. While typing, a list of suggestions appears below the search bar. Click on a suggestion to search for it.

When the search function is used, it will default to the graph model and highlight nodes that contain the search term. Frames around the node boxes indicate whether the searched word was identified in the name of the node (full line) or in the node's description and properties' names/descriptions (dashed line).

Clicking on one of these nodes, it will only display the properties that have this keyword present in either the property name or the description.

Click Clear Search Result to clear the free text search if needed.

The search history is saved below the search bar in the "Last Search" list. Click on an item here to display the results again.

Current Projects

Overview of current projects hosted on BDC-Gen3, including their dependencies, characteristics, and relationships.

Current Project IDs

A list of current project IDs can be found in the Data tab, under Filters>Project>Project Id. The current project IDs are:

  • Parent

  • TOPMed

  • Open_Access

  • Tutorial

Parent and TOPMed Studies

Distinguishing Between Parent and TOPMed Studies

The Parent and TOPMed study types have been categorized on Gen3 by their Program designation. An example of this designation by Program is presented below.

The Program types can be further identified by whether there is an underscore (_) at the end of the study:

  • Parent studies will include an underscore at the end of the study name.

    • Example: parent-WHI_HMB-IRB_

  • TOPMed studies will not include an underscore at the end of the study name.

    • Example: topmed-BioMe_HMB-NPU

Relationship Between Parent and TOPMed Studies

There are three distinct relationships possible between Parent and TOPMed studies. The first two relationships are streamlined:

  • Parent only: The Parent study does not have a TOPMed counterpart study. This usually means that there are no genomic data, such as WXS (whole exome sequencing) or WGS (whole genome sequencing), located within the study; only phenotypic data.

  • TOPMed only: This TOPMed study does not have a Parent counterpart study. These studies will contain both genomic data, WXS or WGS, and phenotypic data.

  • Parent study with a counterpart TOPMed study: The Parent study will contain the phenotypic data, while the TOPMed study will contain the genomic data. Under dbGaP, these studies would be kept separate from one another and the user would need to create the linkages. In the Gen3 platform, these studies have been linked together under the Parent study, based on the participant IDs found in dbGaP. This linkage supports richer queries and cohort creation because it combines both phenotypic and genomic data.

Parent and TOPMed Study Contents

The most notable difference between the Program categories is the type of hosted data.

Parent

  • Genomic data: None

  • Phenotypic data: As with TOPMed studies, any phenotypic data found within the graph model will only be DCC harmonized variables. The raw phenotypic data from dbGaP can, again, be found in the reference_file node.

TOPMed

  • Genomic data: Available data can include CRAM, VCFs and Cohort-level VCF files

  • Phenotypic data: TOPMed studies without an associated Parent study will include phenotypic data in the data graph by way of DCC harmonized variables. Additionally, raw phenotypic data from dbGaP can be found in the reference_file as tar files that share this common naming scheme: RootStudyConsentSet_phs######.<study_shorthand>.v#.p#.c#.<consent_codes>.tar.gz
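
As an illustration of the naming scheme in the last bullet above, the following minimal Python sketch (using a hypothetical file name) checks whether a reference_file name matches the RootStudyConsentSet pattern and extracts the study accession and shorthand:

import re

# Pattern for: RootStudyConsentSet_phs######.<study_shorthand>.v#.p#.c#.<consent_codes>.tar.gz
PATTERN = re.compile(
    r"^RootStudyConsentSet_(phs\d{6})\.(?P<shorthand>[^.]+)\.v\d+\.p\d+\.c\d+\..+\.tar\.gz$"
)

name = "RootStudyConsentSet_phs001234.EXAMPLE.v1.p1.c1.HMB-IRB.tar.gz"  # hypothetical file name
match = PATTERN.match(name)
if match:
    print("Study accession:", match.group(1))             # phs001234
    print("Study shorthand:", match.group("shorthand"))   # EXAMPLE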

Open_Access - 1000 Genomes project

The 1000 Genomes Project is an international research effort (2008-2015) to establish the most detailed catalogue of human variation and genotype data. On the Gen3 platform, the Program open_access contains:

  • Genotypic data: Available data can include CRAM and VCF files.

  • Phenotypic data: The data graph will contain phenotypic data by way of DCC harmonized variables. Additionally, raw phenotypic data can be found in the reference_file as VCF and TXT files.

Tutorial

This program contains genomic data from 1000 Genomes and synthetic clinical data generated by Terra. The purpose of this dataset is to serve as a genome-wide association study (GWAS) tutorial. GWAS is an approach used in genetics research to associate specific genetic variations with particular diseases. For more information, see the Terra Tutorials.

On the Gen3 platform, the Program tutorial contains:

  • Genotypic data: Available data can include CRAM and VCF files.

  • Phenotypic data: The data graph will contain phenotypic data by way of DCC harmonized variables. Additionally, raw phenotypic data can be found in the reference_file as VCF and GDS files.

Discovering Data Using Gen3

How to login to the BDC Powered by Gen3 (BDC-Gen3) platform and view available genomic and phenotypic data.

Login to the BDC-Gen3 Platform

In order to navigate and access data available on the Gen3 platform, start by visiting the BDC-Gen3 login page. You will need an eRA Commons account as well as access permissions through the Database of Genotypes and Phenotypes (dbGaP). If you are a researcher, log in by selecting NIH Login and using your eRA Commons account. BDC consortia developers can log in using their Google accounts. Make sure to use the correct login method that contains access to your available projects.

Once logged in, your username will appear in the upper right-hand corner of the page. You will also see a display with aggregate statistics for the total number of subjects, studies, aliquots and files available within the BDC platform.

NOTE: These numbers may differ from those displayed in the dbGaP records as they include TOPMed studies as well as the associated parent studies.

Types of Hosted Data

Phenotypic

DCC Harmonized clinical data:

A number of clinical variables have been harmonized by the Data Coordinating Center (DCC) in order to facilitate cross-study analysis. Faceted search over the DCC Harmonized Variables is available via the Exploration page, under the "Data" tab.

Unharmonized clinical data:

Unharmonized clinical files are also available on the Gen3 platform and contain all of the raw phenotypic information for the hosted studies. Unlike the DCC Harmonized Variables, these files are located and searchable under the "Files" tab on the Exploration page.

Genomic

The Gen3 platform hosts genomic data provided by the Trans-Omics for Precision Medicine (TOPMed) program and the 1000 Genomes Project, plus synthetic tutorial data from Terra. At present, these projects include CRAM and VCF files together with their respective index files. Specifically for TOPMed projects, each project will contain at least one multi-sample VCF that comprises all subjects within the consent group. CRAM and VCF files are at the individual level, whereas multi-sample VCFs are at the study consent level.

All files are available under the "Files" tab on the Exploration page. More detailed information on the data currently hosted on the Gen3 platform can be found on the BDC website.

Gen3 Pages

The BDC-Gen3 platform contains five pages described below:

  • Dictionary: An interactive data dictionary display that details the contents and relationships between clinical and biospecimen data

  • Exploration: The facet filter custom cohort creation tool

  • Query: The GraphQL query tool to retrieve specific data within the graph model

  • Workspace: The launch page for Gen3 workspaces that includes Jupyter Notebooks and RStudio

  • Profile: The information page for each user, displaying access and the location for credential file downloads

Profile

Overview of the Profile page on the BDC-Gen3

Profile Page

The Profile page contains two sections: API keys and Project access.

API key(s)

To download large amounts of data, an API key will be required as a part of the gen3-client. To create a key on your local machine, click Create API key, which will activate the following pop-up window:

Click Download json to save the credential file to your local machine. After completion, a new entry will appear in the API key(s) section of the Profile page. It will display the API key key_id and the expiration date (one month after the key creation). The user should delete the key after it has expired. If for any reason a user feels that their API key has been compromised, the key should be deleted before subsequently creating a new one.

Project Access

This section of the Profile page lists the projects and the methods of access for the data within the BDC-Gen3 system. If you do not see access to a specific study, check that you have been granted access within dbGaP. If access has been granted for over a week, contact the BDC Help Desk: bdcat-support@datacommons.io

TOPMed Harmonization Strategies

Seven Bridges

About Seven Bridges

BDC-Seven Bridges offers researchers collaborative workspaces for analyzing genomics data at scale. Researchers can find and analyze the hosted TOPMed studies by using hundreds of optimized analysis tools and workflows (pipelines), by creating their own workflows, or through interactive analysis. The platform provides access to the hosted datasets along with Common Workflow Language (CWL) and GENESIS R package pipelines for analysis. It also enables users to bring their own data for analysis and to work in RStudio and Jupyterlab Notebooks for interactive analysis.

Key Features

  • Private, secure, workspaces (projects) for running analyses at scale

  • Collaboration features with the ability to set granular permissions on project members

  • Direct access to BDC without needing to set up a Google or AWS billing account

  • Access hosted TOPMed studies all in one place and analyze data on the cloud at scale

  • Tools and features for performing multiple-variant and single-variant association studies including:

    • Annotation Explorer for variant aggregations

    • Cloud-optimized Genesis R package workflows in Common Workflow Language

  • Cohort creation by searching phenotype data

    • Use PIC-SURE API for searching phenotype data

    • Search by known dbGaP identifiers

  • Rstudio and Jupyterlab Notebooks built directly into the platform for easy interactive analysis and manipulation of phenotype data

  • Hosted TOPMed data you can combine with your own data on AWS or Google Cloud

  • Billing and administrative controls to help your research funding go further: avoid forgotten instances, abort infinite loops, get usage breakdowns by project.

Analyze Data

Getting Started Guide

Just starting out on BDC-Seven Bridges, and need to get up to speed on how to use the platform? Our experts have created a Getting Started Guide to help you jump right in. We recommend users begin learning how to use BDC-Seven Bridges by following the steps in this guide. After reading this guide, you will know how to create an account on BDC-Seven Bridges, learn the basics of creating a workspace (project), run an analysis, and search through the hosted data.

To read our Getting Started Guide, please refer to our documentation page here.

PFB Files

Overview of the Portable Format for Bioinformatics (PFB) file type

What is a Portable Format for Bioinformatics?

A Portable Format for Bioinformatics (PFB) allows users to transfer both the metadata from the Data Dictionary as well as the Data Dictionary itself. As a result, data can be transferred while keeping the structure from the original source. Specifically, a PFB consists of three parts:

  • A schema

  • Metadata

  • Data

For more information and an in-depth review that includes Python tools for PFB creation and exploration, refer to the PyPFB github page and install the newest version.

Note The following PFB example is a direct PFB export from the tutorial-synthetic_data_set_1 found on BioData Catalyst Powered by Gen3. Due to the large amount of data stored within PFB files, only small sections are shown with breaks (displayed as ... ) occurring in the output.

Schema

A schema is a JSON formatted Data Dictionary containing information about the properties, such as value types, descriptions, and so on.

To view the PFB schema, use the following command:

pfb show -i PFB_file.avro schema

Example Output

...
  {
    "type": "record",
    "name": "gene_expression",
    "fields": [
      {
        "default": null,
        "name": "data_category",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_category",
            "symbols": [
              "Transcriptome Profiling"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "data_type",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_type",
            "symbols": [
              "Gene Expression Quantification"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "data_format",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_format",
            "symbols": [
              "TXT",
              "TSV",
              "CSV",
              "GCT"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "experimental_strategy",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_experimental_strategy",
            "symbols": [
              "RNA-Seq",
              "Total RNA-Seq"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "file_name",
        "type": [
          "null",
          "string"
        ]
      },
      {
        "default": null,
        "name": "file_size",
        "type": [
          "null",
          "long"
        ]
      },
      {
        "default": null,
        "name": "md5sum",
        "type": [
          "null",
          "string"
        ]
      },
      {
        "default": null,
        "doc": "The GUID of the object in the index service.",
        "name": "object_id",
        "type": [
          "null",
          "string"
        ]
      }
...

NOTE: To make the outputs more human-readable, the above information was then piped through the program jq. Example: pfb show -i PFB_file.avro schema | jq

Metadata

The metadata in a PFB contains all of the information explaining the linkage between nodes and external references for each of the properties.

To view the PFB metadata, use the following command:

pfb show -i PFB_file.avro metadata

Example Output

...
    {
      "name": "exposure",
      "ontology_reference": "",
      "values": {},
      "links": [
        {
          "multiplicity": "MANY_TO_ONE",
          "dst": "subject",
          "name": "subjects"
        }
      ],
      "properties": [
        {
          "name": "years_smoked",
          "ontology_reference": "Person Smoking Duration Year Count",
          "values": {
            "source": "caDSR",
            "cde_id": "3137957",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=3137957&version=1.0"
          }
        },
        {
          "name": "years_smoked_gt89",
          "ontology_reference": "Person Smoking Duration Year Count",
          "values": {
            "source": "caDSR",
            "cde_id": "3137957",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=3137957&version=1.0"
          }
        },
        {
          "name": "alcohol_history",
          "ontology_reference": "Alcohol Lifetime History Indicator",
          "values": {
            "source": "caDSR",
            "cde_id": "2201918",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=2201918&version=1.0"
          }
        },
        {
          "name": "alcohol_intensity",
          "ontology_reference": "Person Self-Report Alcoholic Beverage Exposure Category",
          "values": {
            "source": "caDSR",
            "cde_id": "3457767",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=3457767&version=1.0"
          }
        },
...

Data

The data in the PFB are the values for the properties in the format of the Data Dictionary.

To view the data within the PFB, use the following command:

pfb show -i PFB_file.avro

To view a certain number of entries in the PFB file, use the flag -n to designate a number. For example, to view the first 10 data entries within the PFB, use the following command:

pfb show -i PFB_file.avro -n 10

Example Output

...
{
  "id": "6c5e21d5-da76-49a5-9f82-7e3a726d44c6",
  "name": "lab_result",
  "object": {
    "cer451q1": null,
    "oxldl1": null,
    "f81c": null,
    "renins1c": null,
    "cystatc1": null,
    "triglycerides": -0.40415245294570923,
    "glucos1c": 6.5463337898254395,
    "glucos1u": null,
    "ldl": 2.0789523124694824,
    "hdl": 2.7123606204986572,
    "creatin1": null,
    "total_cholesterol": 3.039848566055298,
    "chlcat1c": null,
    
...

    "uabcat1c": null,
    "inslnr1t": 1.8090298175811768,
    "vldlp31c": null,
    
...

    "unit_hematocrit_vfr_bld": null,
    "age_at_total_cholesterol": 80,
    "unit_total_cholesterol": null,
    "age_at_triglycerides": 80,
    "unit_triglycerides": null,
    "age_at_hdl": 80,
    "unit_hdl": null,
    "age_at_ldl": 80,
    "unit_ldl": null,
    
...

    "unit_mcv_entvol_rbc": null,
    "submitter_id": "HG00325_lab_res",
    "state": "validated",
    "project_id": "tutorial-synthetic_data_set_1",
    "created_datetime": "2020-01-27T13:54:06.745386+00:00",
    "updated_datetime": "2020-01-27T13:54:06.745386+00:00"
  },
  "relations": [
    {
      "dst_id": "f4fdda57-80f4-4995-bea2-161c3242c525",
      "dst_name": "subject"
    }
  ]
}
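
Because a PFB is an Avro file, it can also be inspected programmatically rather than through the pfb command line. The sketch below is a minimal example that assumes the third-party fastavro package is installed and that PFB_file.avro is the exported file; it iterates over the stored records and prints each entry's node name, stopping after ten entries (similar to pfb show -n 10).

from fastavro import reader

# Iterate over the records stored in a PFB (Avro) export and print each entry's node name.
with open("PFB_file.avro", "rb") as avro_file:
    for i, record in enumerate(reader(avro_file)):
        print(record.get("name"))   # e.g. "lab_result", matching the `pfb show` output above
        if i >= 9:                  # stop after the first 10 entries
            break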

Comprehensive Analysis Tips

This guide has been prepared to help you with your first set of projects on BDC-Seven Bridges.

This guide aims to help you learn how to take advantage of all the various features and functionality for performing analyses on Seven Bridges and ensure that you can set up your analyses in the most efficient way possible to save time and money.

The following topics are covered in this guide:

  • The basics of working with CWL tools and workflows on the platform.

  • How to specify computational resources on the platform and how to use the default options selected by the execution scheduler.

  • How to run Batch analyses and take advantage of parallelization with scatter.

  • The basics of working with Jupyterlab Notebooks and Rstudio for interactive analysis.

You can refer to the guide here.


Exploration

An explanation for the Exploration page on BDC-Gen3

Using Exploration

The Exploration page located in the upper right-hand section of the toolbar allows users to search through data and create cohorts. The Exploration portal contains a dynamic summary statistics display, as well as search facets leveraging the DCC Harmonized Variables.

Data Accessibility

Data Access panel on the Exploration page.

Users can navigate through data on the Exploration page by selecting any of the three Data Access categories.

  • Data with Access: A user can view all of the summary data and associated study information for studies the user has access to, including but not limited to Project ID, file types, and clinical variables.

  • Data without Access:

    • Locks next to the project ID signify that users do not have subject-level access; they can still search through the available studies, but can only view summary statistics. Users can request access to data by visiting the dbGaP homepage.

    • Projects will also be hidden if the selected cohort contains fewer than 50 subjects (indicated by "50 ↓" and the message "You may only view summary information for this project"; see the example below). In this case, grayed-out boxes and locks both appear; an additional lock means users have no access.

The view on the list of Projects when "Data without Access" is selected.
Example: The variable Ethnicity is hidden once the number of subjects falls below 50.
Lock, grayed-out box, and "50" signify that the number of subjects falls below 50 and users have no access.
  • All Data: Users can view all of the data available in the BDC-Gen3 platform, including studies with and without access. As a result, studies not available to a user will be locked as demonstrated below.

By default, all users visiting the Exploration page will be assigned to Data with Access.

The Data Tab

Exploration page with Data Access displaying the Data with Access.

Under the "Data" tab, users can leverage the DCC harmonized variables to create custom cohorts. When facets are selected and/or updated to cover a desired range of values, the display will reflect the information relevant to the new applied filter. If no facets have been selected, all of the data accessible to the user will be displayed. At this time, a user can filter based on three categories of clinical information:

  • Project: Any specifically defined piece of work that is undertaken or attempted to meet a single investigative question or requirement.

  • Subject: The collection of all data related to a specific subject in the context of a specific experiment.

  • Harmonized Variables: A selection of different clinical properties from multiple nodes, defined by the Consortium.

NOTE: The facet filters are based on the DCC Harmonized Variables, which are a selected subset of clinical data that have been transformed for compatibility across the dbGaP studies. TOPMed studies that do not contain harmonized clinical data at this time will be filtered out when a facet is chosen, unless the no data option is also selected for certain facets.

Exporting Data from the Data Tab

After a cohort has been selected, the user has four different options for exporting the data.

Export

The options for export are as follows:

Four options offered for data export.
  • Export All to Terra: Initiate a Portable Format for Bioinformatics (PFB) export of all clinical data and file GUIDs for the selected cohort to BioData Catalyst powered by Terra. At this time the max number of subjects that can be exported to Terra is 120,000.

  • Export All to Seven Bridges: Initiate a Portable Format for Bioinformatics (PFB) export of all clinical data and file GUIDs for the selected cohort to BioData Catalyst powered by Seven Bridges.

  • Export to PFB: Initiate a PFB export of all clinical data and file GUIDs for the selected cohort to your local storage.

  • Export to Workspaces: Export a manifest to the user's workspace and make the case-associated data files available in the workspace under the /pd/data directory.

NOTE: PFB export times can take up to 60 minutes, but often will complete in less than 10 minutes.

The Files Tab

The Files Tab page.

The Files tab displays study files from the facets chosen on the left-side panel (Project ID, Data Type, Data Format, Callset, and Bucket Path). Each time a facet selection is made, the data summary and displays will update to reflect the applied filters.

Locating Unharmonized Clinical Data

The Files tab also contains files that are either case-independent or project-level. This is important for files that are part of the Unharmonized Clinical Data category under the Data Type field. Unharmonized clinical files are made available in the following data formats:

  • TAR: Contains a complete directory of phenotypic datasets as XML and TXT files that are direct downloads of unharmonized clinical data from dbGaP on a study consent level project.

  • AVRO: These files contain the same unharmonized clinical data from dbGaP as the TAR files, but in the form of a PFB file.

  • XML: These files contain either dictionary or variable reports of the phenotypic datasets that are in the TXT files. These supporting files contain information at the study level, not at the subject level.

  • TXT: These files contain subject-level phenotypic datasets.

NOTE: The unharmonized clinical data sets contain all data from the dbGaP study, but they are not cross-compatible across all studies within BDC.

Exporting/Downloading Data from the Files Tab

Once the user has selected a cohort, there are five options for accessing the files:

Five button options offered for file download or export.
  • Download Manifest: Download the file manifest and use this manifest to download the enlisted data files using the gen3-client.

  • Export to Workspace: The files can be exported to a Gen3 workspace.

  • Export All PFB: Initiate a PFB export of the selected files.

  • Export All to Terra: Initiate a PFB export of the selected files to BioData Catalyst powered by Terra.

  • Export All to Seven Bridges: Initiate a PFB export of the selected files to BioData Catalyst powered by Seven Bridges.

  • GUID Download File Page: Aside from the 5 button options, users can download files by first clicking on the link(s) under the GUIDs column, followed by the Download button in the file information pages (see next section below).

Download files by clicking on the link located under the GUID column.

File Information Page

A user can visit the File Information Page after clicking on any of the available GUID link(s) in the Files tab page. The page will display details such as data format, size, object_id, the last time it was updated and the md5sum. The page also contains a button to download the file via the browser (see below). For files that are 5GB or more, we suggest using the gen3-client.

An example file information page with the Download button.

Free text search for Submitter IDs and File Names

Both the Data and File tabs contain a text-based search function that will initiate a list of suggestions below the search bar while typing.

In the Data tab, Submitter IDs can be searched under the Subject tab.

Free text search of Submitter IDs in Subject on the Data Tab.

In the File tab, File Names can be searched under the File tab.

Free text search of File Names on the File Tab.

Click either a single suggestion or multiple suggestions in the list appearing underneath the search bar to create a cohort and export/download the data. Selections can be clicked again to remove them from the created cohort.

Select multiple suggestions to create an exportable cohort.

GWAS with GENESIS workflows

Overview of the GENESIS pipelines

For researchers interested in performing genotype-phenotype association studies, Seven Bridges offers a suite of tools for both single-variant and multiple-variant association testing on BDC-Seven Bridges. These tools and features include the GENetic EStimation and Inference in Structured samples (GENESIS) pipelines, which were developed by the Trans-Omics for Precision Medicine (TOPMed) Data Coordinating Center (DCC) at the University of Washington. The Seven Bridges team collaborated with the TOPMed DCC to create Common Workflow Language (CWL) tools for the GENESIS R functions, and arranged these tools into five computationally-efficient workflows (pipelines).

These GENESIS pipelines offer methods for working with genotypic data obtained from sequencing and microarray analysis. Importantly, these pipelines have the robust ability to estimate and account for population and pedigree structure, which makes them ideal for performing association studies on data from the TOPMed program. These pipelines also implement linear mixed models for association testing of quantitative phenotypes, as well as logistic mixed models for association testing of binary (e.g. case/control) phenotypes.

Below, we feature our GENESIS Benchmarking Guide to assist users in estimating cloud costs when running GENESIS workflows on BDC-Seven Bridges.

GENESIS Benchmarking Guide

Introduction

The objective of the GENESIS Benchmarking Guide is to instruct users on the drivers of cloud costs when running GENESIS workflows on BDC-Seven Bridges.

For all GENESIS workflows, the Seven Bridges team has performed comprehensive benchmarking analysis on Amazon Web Services (AWS) and Google Cloud Platform (GCP) instances for different scenarios:

  • 2.5k samples (1000G data)

  • 10k samples (TOPMed Freeze5 data)

  • 36k samples (TOPMed Freeze5 data)

  • 50k samples (TOPMed Freeze5 data)

The resulting execution times, costs, and general advice for running GENESIS workflows can be found in the sections below. In these sections, each GENESIS workflow is described, followed by the benchmarking results and some tips for implementing that workflow from the Seven Bridges Team. Lastly, we included a Methods section to describe our approach to benchmarking and interpretation for your reference.

The contents of this guide are arranged as follows:

  • Introduction

  • Helpful Terms to Know

  • GENESIS VCF to GDS

  • GENESIS Null model

  • GENESIS Single Association testing

  • GENESIS Aggregate Association testing

  • GENESIS Sliding window Association testing

  • General considerations

The results of the benchmarking analysis described herein are available for download as CSV files (Benchmarking: VCF to GDS, Null Model, Single Test, Aggregate Test, and Sliding Window). It may prove useful to have these files open for reference when reading through this guide.

Helpful terms to know

Before continuing on to the benchmarking results, please familiarize yourself with the following helpful terms to know:

  • Tool: Refers to a stand-alone bioinformatics tool or its Common Workflow Language (CWL) wrapper that is created or already available on the platform.

  • Workflow/Pipeline (interchangeably used): Denotes a number of tools connected together in order to perform multiple analysis steps in one run.

  • App: Stands for a CWL wrapper of a tool or a workflow that is created or already available on the platform.

  • Task: Represents an execution of a particular tool or workflow on the platform. Depending on what is being executed (tool or workflow), a single task can consist of only one tool execution (tool case) or multiple executions (one or more per each tool in the workflow).

  • Job: This refers to the “execution” part of the “Task” definition (see above). It represents a single run of a single tool found within a workflow. If you are coming from a computer science background, you will notice that this is quite similar to the common understanding of the term “job” in computing (see the Wikipedia entry), except that here a “job” is a component of a bigger unit of work called a “task”, rather than the other way around, as may be the case in other contexts. To further illustrate what a job means on the platform, we can visually inspect jobs after a task has been executed using the View stats & logs panel (button in the upper right corner of the task page):

Figure 1. The jobs for an example run of RNA-Seq Quantification (HISAT2, StringTie) public workflow

The green bars under the gray ones (apps) represent the jobs (Figure 1). As you can see, some apps (e.g. HISAT2_Build) consist of only one job, whereas others (e.g. HISAT2) contain multiple jobs that are executed simultaneously.

GENESIS VCF to GDS

In this section, we detail the process of converting a VCF to a GDS via a GENESIS workflow. This VCF to GDS workflow consists of 3 steps:

  • Vcf2gds

  • Unique variant id

  • Check GDS

The first two steps are required, while the last one is optional. When included, the Check GDS step is the biggest cost driver in these tasks.

The Check GDS tool is a QC step that checks whether the final GDS file contains all variants present in the input VCF/BCF. This step is computationally intensive, and its execution time can be 4-5 times longer than the rest of the workflow. Failures of this step are also something we experience very rarely. In our results, there is a Check GDS column which indicates whether the Check GDS step was performed.

We advise anyone using this workflow to consider the results in the table below, because the differences in execution time and price with and without this check are considerable. The final decision on which approach to use depends on the resources available (budget and time) and the preference for including or excluding the optional QC step.

In addition, CPU/job and Memory/job parameters have direct effects on execution time and the cost of the GENESIS VCF to GDS workflow. A combination of these parameters defines the number of jobs (files) that will be processed in parallel.

For example:

If a task is run on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 4GB/job the number of jobs run in parallel will be min{36/1,72/4}=18. If a task is run on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 8GB/job the number of jobs run in parallel will be min{36/1,72/8}=9. In this example, the second case would take twice as long as the first.
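The same arithmetic can be written as a small helper when drafting a task. This is an illustrative sketch only; the instance figures are taken from the example above.

def parallel_jobs(instance_cpus, instance_mem_gb, cpu_per_job, mem_gb_per_job):
    # Jobs that fit on one instance: the tighter of the CPU limit and the memory limit.
    return min(instance_cpus // cpu_per_job, int(instance_mem_gb // mem_gb_per_job))

# c5.9xlarge: 36 CPUs and 72 GB RAM
print(parallel_jobs(36, 72, 1, 4))  # 18 jobs in parallel
print(parallel_jobs(36, 72, 1, 8))  # 9 jobs in parallel, so roughly double the runtime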

The following conclusions were drawn from performed benchmarking:

  • Benchmarking showed that the most suitable AWS instances for this workflow are c5 instances.

  • For all tasks that we ran (from 2.5k up to 50k samples), 1 CPU and 4GB per job were sufficient.

  • For small sample sizes (up to 10k samples), tasks can be run on spot/preemptible instances to additionally decrease the cost.

  • For samples up to 10k, 2GB per job could suffice; however, if the Check GDS step is also run, execution time and price will not be much lower, because the CPU and Memory per job inputs apply only to the Vcf2gds step and not to the whole workflow.

  • We recommend using VCF.GZ as input files rather than BCF, as the conversion process cannot be parallelized when using BCFs.

  • If you have more files to convert (e.g. multiple chromosomes), we recommend running one analysis with all files as an input, rather than batch analysis with separate tasks for each file.

GENESIS Null model

The GENESIS Null model workflow is not computationally intensive and it is relatively low-cost compared to other GENESIS workflows. For that reason, we present results that we obtained without any optimization below:

The null model can be fit with relatedness matrices (i.e. mixed models) or without relatedness matrices (i.e. simple regression models). If a relatedness matrix is provided, it can be sparse or dense. Tasks with a dense relatedness matrix are the most expensive and take the longest to run. For the Null model workflow, the available AWS instances appear to be more suitable than the Google instances available on the platform.
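The cost difference between dense and sparse relatedness matrices is largely a memory story. The GENESIS workflows themselves fit the null model in R, but the back-of-the-envelope arithmetic below (a Python sketch with illustrative numbers) shows why a dense kinship matrix becomes expensive at large sample sizes and why sparsifying it helps.

import numpy as np
from scipy import sparse

n_samples = 50_000
dense_gb = n_samples * n_samples * 8 / 1e9  # float64 kinship matrix held fully in memory
print(f"Dense {n_samples} x {n_samples} relatedness matrix: ~{dense_gb:.0f} GB")  # ~20 GB

# A sparse representation stores only the non-zero relatedness values.
rng = np.random.default_rng(0)
toy = rng.random((1000, 1000))
toy[toy < 0.95] = 0.0                      # toy threshold: treat most pairs as unrelated
toy_sparse = sparse.csr_matrix(toy)
sparse_mb = (toy_sparse.data.nbytes + toy_sparse.indices.nbytes + toy_sparse.indptr.nbytes) / 1e6
print(f"Toy example: dense {toy.nbytes / 1e6:.1f} MB vs sparse {sparse_mb:.1f} MB")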

GENESIS Single Association testing

Results of the GENESIS Single Association Testing workflow benchmarking can be seen in the table above. Some important notes to consider when using this workflow:

  • Null model type effect: The main cost and duration driver in these tasks is the null model type. The null model can be fit without a relatedness matrix (i.e. simple regression models), or with a relatedness matrix that can be sparse or dense (i.e. mixed models). The table above shows that task cost and execution time increase with the null model type (dense > sparse > none). Differences between the model with a dense matrix and the model with a sparse matrix are significant, driven by both the increased CPU time and the memory required to use a dense matrix. Our advice is to check the relatedness matrix before fitting the model and transform it to sparse if it is dense, especially if the sample size is large. The null model type directly influences further parameter choices, such as instance type, CPU per job, and Memory per job.

  • Instance type: Benchmarking showed that the most appropriate instance type is an AWS r5.x instance. Instance type selection is especially important when performing analyses with many samples (30k participants and above). For tasks with up to 30k samples, r5.4xlarge instances can be used, and r5.12xlarge when more participants are included. In addition, if a single association test is performed with a dense null model, r5.12xlarge or r5.24xlarge instances should be chosen. When it comes to Google instances, results can be seen in the above table as well. Since there often isn't a Google instance that is an exact equivalent of the AWS instance, we recommend choosing the most appropriate Google instance (matching the chosen AWS instance) from the list of available Google instances on BDC.

  • CPU and memory per job: The CPU and memory per job input parameters determine the number of jobs run in parallel on one instance. For example:

    • If a task is run on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 4GB/job, the number of jobs run in parallel will be min{36/1,72/4}=18.

    • If a task is run on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 8GB/job, the number of jobs run in parallel will be min{36/1,72/8}=9.

    The bottleneck in single-variant association testing is memory, so we suggest carefully considering this parameter together with the instance type. Workflow defaults are 1 CPU/job and 8GB/job. The table above shows that these tasks require much more memory than CPU, which is why r5 instances are the most appropriate in these cases (see the sketch after this list). The table additionally shows that tasks where the null model is fit with a dense relatedness matrix require the most memory per job. This parameter also depends on the number of participants included in the analysis.

  • Maximum number of parallel instances: The default number of parallel instances is 8. Changing this number mainly affects execution time: tasks with more parallel instances will finish faster. This parameter can be set in Execution settings when drafting a task. However, each user has a limited total number of parallel instances, and reserving a large number of parallel instances for one task reduces the number of different tasks that can run at the same time.

  • Spot instances: If a task is expected to finish within a few hours, it can be run on spot instances to reduce the execution cost. However, losing a spot instance means the task is rerun on on-demand instances, which can end up costing more than running on on-demand instances from the beginning. That is why spot instances are generally only suitable for short tasks.
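To see why memory, rather than CPU, usually dictates the instance choice here, compare a compute-optimized and a memory-optimized instance for memory-hungry jobs. The sketch below reuses the parallel-jobs arithmetic from the VCF to GDS section; the r5.4xlarge figures (16 CPUs, 128 GB RAM) are AWS specifications at the time of writing and should be checked against current listings.

def parallel_jobs(instance_cpus, instance_mem_gb, cpu_per_job, mem_gb_per_job):
    # Jobs that fit on one instance: the tighter of the CPU limit and the memory limit.
    return min(instance_cpus // cpu_per_job, int(instance_mem_gb // mem_gb_per_job))

# Default single-variant settings: 1 CPU/job and 8 GB/job.
print(parallel_jobs(36, 72, 1, 8))    # c5.9xlarge: only 9 jobs, most CPUs sit idle
print(parallel_jobs(16, 128, 1, 8))   # r5.4xlarge: 16 jobs on a smaller, memory-optimized instance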

GENESIS Aggregate Association testing

GENESIS Aggregate association testing can be performed using burden, SKAT, SMMAT, fastSKAT and SKAT-O tests. Our general conclusions are as follows:

  • Null model selection: As in single-variant association testing, the main cost and duration driver in these tasks is the null model type. The null model can be fit without a relatedness matrix (i.e. simple regression models), or with a relatedness matrix that can be sparse or dense (i.e. mixed models). The table above shows that task cost and execution time increase with the null model type (dense > sparse > none). Differences between the model with a dense matrix and the model with a sparse matrix are significant, driven by both the increased CPU time and the memory required to use a dense matrix. Our advice is to check the relatedness matrix before fitting the model and transform it to sparse if it is dense, especially if the sample size is large. The null model type directly influences further parameter choices, such as instance type, CPU per job, and Memory per job.

  • Instance type: Benchmarking showed that the most appropriate instance type is an AWS r5.x instance. The majority of the tasks can be run on r5.12xlarge instances, or on r5.24xlarge instances when the null model includes a dense relatedness matrix. Results for Google instances can be seen in the above table as well. Since the Google instance options often do not have an exact AWS equivalent, we selected the closest match from the list of available Google instances on BDC.

  • CPU and memory per job: CPUs and memory per job input parameters determine the number of jobs to be run in parallel on one instance. For example:

    • If a task is run on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 4GB/job, the number of jobs run in parallel will be min{36/1,72/4}=18.

    • If a task is run on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 8GB/job, the number of jobs run in parallel will be min{36/1,72/8}=9.

    Different tests can require different computational resources.

As can be seen, for small sample sizes up to 10GB per job can be sufficient for tasks to complete successfully. One exception is when running a task with a null model fit with a dense relatedness matrix, where approximately 36GB/job is needed. With 50k samples, jobs require 70GB. Details can be seen in the table above. In addition to sample size, the memory required is determined by the number of variants included in each aggregation unit, as all variants in an aggregation unit are analyzed together.

SKAT and SMMAT tests are similar when it comes to CPU and Memory per job requirements. Roughly, these tests require 8GB/CPU, and details for different task configurations can be seen in the table below:

  • Maximum number of parallel instances: The default number of parallel instances is 8. Changing this number mainly affects execution time: tasks with more parallel instances will finish faster. This parameter can be set in Execution settings when drafting a task. However, each user has a limited total number of parallel instances, and reserving a large number of parallel instances for one task reduces the number of different tasks that can run at the same time.

  • Spot instances: If a task is expected to finish within a few hours, it can be run on spot instances to reduce the execution cost. However, losing a spot instance means the task is rerun on on-demand instances, which can end up costing more than running on on-demand instances from the beginning. That is why spot instances are generally only suitable for short tasks.

GENESIS Sliding window Association testing

GENESIS Sliding window association testing can be performed using burden, SKAT, SMMAT, fastSKAT and SKAT-O tests. When running a sliding window test, it is good to know the following:

  • Null model selection: As in the previous tests, the main cost and duration driver in these tasks is the null model type. The null model can be fit without a relatedness matrix (i.e. simple regression models), or with a relatedness matrix that can be sparse or dense (i.e. mixed models). The table below shows that task cost and execution time increase with the null model type (dense > sparse > none). Differences between the model with a dense matrix and the model with a sparse matrix are significant, driven by both the increased CPU time and the memory required to use a dense matrix. Our advice is to check the relatedness matrix before fitting the model and transform it to sparse if it is dense, especially if the sample size is large. The null model type directly influences further parameter choices, such as instance type, CPU per job, and Memory per job.

  • Instance type: Benchmarking showed that analyses with a sparse relatedness matrix, or without a relatedness matrix, can be completed on a c5.9xlarge AWS instance. For analyses with a dense relatedness matrix included in the null model and with 50k samples or more, r5.12xlarge instances can be used. Also, it is important to note that in this case increasing the instance size (for example, from c5.9xlarge to c5.18xlarge) will not lead to shorter execution time; it can even have the opposite effect. By increasing the size of the instance we also increase the number of jobs running in parallel. At some point there will be many jobs running in parallel and accessing the same memory space, which can reduce performance and increase task duration. Results for Google instances can be seen in the respective tables. Since the Google instance options often do not have an exact AWS equivalent, we selected the closest match from the list of available Google instances on BDC.

  • CPU and memory per job: When running a sliding window test it is important to ensure that the CPU resources of the instances we are using are not overused. Avoiding 100% CPU usage in these tasks is crucial for fast execution. For that reason, it is good to decrease the number of jobs running in parallel on one instance. The number of parallel jobs is highlighted in the summary table, as it is an important parameter for the execution of this task. We can choose different CPU and memory inputs as long as that combination gives us an appropriate number of parallel jobs. This is an example of how the number of parallel jobs is calculated:

    • If we run our task on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 4GB/job, the number of jobs run in parallel will be min{36/1,72/4}=18.

    • If we run our task on c5.9xlarge(36CPUs and 72GB RAM) with 1CPU/job and 8GB/job, the number of jobs run in parallel will be min{36/1,72/8}=9.

    For details on the number of jobs that we’ve set for each tested case please refer to the table below.

  • Window size and window step: The default values for these parameters are 50kb and 20kb (kilobases), respectively. Please keep in mind that since the sliding window algorithm considers all bases inside the window, the window length and the number of windows directly affect the execution time and the price of the task (see the sketch after this list).

  • Maximum number of parallel instances: The default number of parallel instances is 8. Changing this number mainly affects execution time: tasks with more parallel instances will finish faster. This parameter can be set in Execution settings when drafting a task. However, each user has a limited total number of parallel instances, and reserving a large number of parallel instances for one task reduces the number of different tasks that can run at the same time.

  • Spot instances: If a task is expected to finish within a few hours, it can be run on spot instances to reduce the execution cost. However, losing a spot instance means the task is rerun on on-demand instances, which can end up costing more than running on on-demand instances from the beginning. That is why spot instances are generally only suitable for short tasks.
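As a rough illustration of how window size and step translate into the amount of work, the sketch below counts the windows covering a region. All numbers are illustrative, and the chromosome 1 length is approximate.

import math

def n_windows(region_length_kb, window_kb=50, step_kb=20):
    # Number of sliding windows needed to tile a region of the given length.
    if region_length_kb <= window_kb:
        return 1
    return 1 + math.ceil((region_length_kb - window_kb) / step_kb)

chr1_kb = 248_956                      # approximate length of chromosome 1 in kb
print(n_windows(chr1_kb))              # roughly 12,400 windows with the default 50 kb / 20 kb
print(n_windows(chr1_kb, 100, 50))     # larger windows and steps mean fewer, longer jobs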

Benchmarking results:

General considerations

In this guide, we have highlighted the main cloud cost and execution time drivers when running GENESIS analyses. Please keep in mind that when running an analysis, users may incur additional costs due to factors such as task failures or the need to rerun an analysis. When estimating cloud costs for your study, please include a cost buffer for these two factors as well.

To prevent task failures, we advise you to carefully read the app descriptions, and if you have any questions or doubts, contact our Support Team at support@sevenbridges.com. Also, using memoization can help reduce costs when rerunning a task after an initial failure.

Methods

Throughout this document, it is important to note that the figures in the tables above are intended to be informative as opposed to predictive. The actual costs incurred for a given analysis will also depend on the number of samples and number of variants in the input files. For our analysis described above, we selected 1000G and TOPMed Freeze5 data as inputs. For TOPMed Freeze5, we selected cohorts of 10k, 36k, and 50k subjects. The benchmarking results for the selected tasks would vary if the cohorts were defined differently.

The selection of instances is another factor that can lead to variation in results for a given analysis. The results depend heavily on the user's ability to choose an appropriate instance and use its resources optimally. For that reason, if two users run the same task with different configurations (different instance type, CPU/job, and/or RAM/job parameters), the results may vary.

The results (execution time and cost) are directly connected to the CPU per job and Memory per job parameters. Different resources dedicated to a given job will result in a different number of total jobs run on the selected instance, and thus in a different execution time and cost. For that reason, setting up a task draft properly is crucial. In this document, we provide details on what we consider optimal CPU and Memory per job inputs for TOPMed Freeze5 and 1000G data. These numbers can be used as a good starting point, bearing in mind that each study has its own unique requirements.

For both Single and Sliding Window Association Testing:

Please note that results for single and sliding window tests are approximations. To avoid unnecessary cloud costs, we performed both single and sliding window tests only on 2 chromosomes. These results were the basis on which we assessed the cost and execution time for the whole genome.

The following is an explanation of the procedure we applied for the GENESIS Single Association testing workflow and TOPMed Freeze5 data (a similar procedure applies to the GENESIS Sliding window Association testing workflow):

In the GENESIS Single Association testing workflow, the variants are tested in segments. The number of segments that the workflow will process is the ratio of the total genome length to the segment length (one of the input parameters of this workflow). For example, if we are testing a whole genome of 3,000,000,000 base pairs and use the default segment length of 10,000kb, we will have 300 segments. Furthermore, if we use the default value for the maximum number of parallel instances, which is 8, we can approximate the average number of segments that each instance processes: 37.

The GENESIS Single Association testing workflow can process segments in parallel (the processing of one segment is a job). The number of parallel segments (jobs) depends on the CPU per job and Memory per job parameters, and can be calculated as described previously. For example, if we are running the analysis on a c5.9xlarge instance (36 CPUs and 72GB RAM) with 1 CPU/job and 4GB/job, we will have 18 jobs in parallel. Knowing that each of our 8 instances processes approximately 37 jobs in total, with 18 running in parallel, each instance will run approximately 2 cycles of jobs. Furthermore, knowing the average job length, we can approximate the running time of one instance: 2 cycles multiplied by the average job length. Since the instances run in parallel, this is also the total execution time. Lastly, when the execution time is known, we can calculate the task price: the number of instances multiplied by the execution time (in hours), multiplied by the instance price per hour. For each tested scenario in our benchmarking analysis, we obtained the average job length from the corresponding tasks that included 2 chromosomes, such that the total number of jobs was above 30.
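The approximation above can be written out as a short calculation. Everything below is illustrative: the genome length, average job length, and instance price are placeholders to be replaced with values from your own benchmarking runs and current cloud pricing.

genome_length_kb = 3_000_000_000 / 1000          # ~3 Gb genome expressed in kb
segment_length_kb = 10_000                       # default segment length
segments = genome_length_kb / segment_length_kb  # ~300 segments

max_parallel_instances = 8                       # workflow default
segments_per_instance = segments / max_parallel_instances  # ~37 segments each

instance_cpus, instance_mem_gb = 36, 72          # c5.9xlarge
cpu_per_job, mem_gb_per_job = 1, 4
jobs_in_parallel = min(instance_cpus // cpu_per_job, instance_mem_gb // mem_gb_per_job)  # 18

cycles = segments_per_instance / jobs_in_parallel  # ~2 waves of jobs per instance
avg_job_hours = 1.0                              # placeholder: measured on a 2-chromosome run
instance_price_per_hour = 1.50                   # placeholder: check current on-demand pricing

execution_hours = cycles * avg_job_hours
task_cost = max_parallel_instances * execution_hours * instance_price_per_hour
print(f"~{execution_hours:.1f} h and ~${task_cost:.2f} for this configuration")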

Transferring Files Between Seven Bridges and Terra

Instructions on transferring files between BDC Powered by Seven Bridges (BDC-Seven Bridges) and BDC Powered by Terra (BDC-Terra)

Introduction

This tutorial guides users through the process of transferring files between the two workspace environments of NHLBI BioData CatalystⓇ (BDC): BDC-Seven Bridges and BDC-Terra.

Most researchers select one of the workspaces as their primary analysis environment, and their labmates and collaborators typically work with them on the same workspace environment. However, there are cases where some collaborators work on Seven Bridges and others work on Terra. In this case, researchers need to share data files between the two workspaces to facilitate collaboration. When researchers run analyses on Seven Bridges, the results, or derived data, are only available on Seven Bridges. Likewise, when researchers run analyses on Terra, the results are only available on Terra. This tutorial provides step-by-step guidance on how to share derived data between the workspace environments. These instructions can also be used to share private data that has been uploaded to Seven Bridges or Terra.

Both open access data and controlled access data can be shared across workspace environments. Importantly, if a researcher intends to share controlled access data, they must ensure that all recipients have the necessary dbGaP permissions for those files. In some cases, this may mean the researchers must be listed as collaborators on their respective dbGaP applications. These instructions are intended for sharing files under 1 terabyte (TB) in size. If you want to share data larger than 1 TB, contact the BioData Catalyst Help Desk to discuss your use case.

It is not recommended to transfer large amounts of data between cloud providers or regions; for example, transferring data from AWS to Google costs approximately $100/TB.
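As a back-of-the-envelope check before transferring, the quoted ~$100/TB figure translates as follows (a sketch; always confirm current egress pricing with the cloud providers):

def egress_cost_usd(size_gb, usd_per_tb=100):
    # Rough cross-cloud egress estimate using the ~$100/TB figure quoted above.
    return size_gb / 1024 * usd_per_tb

print(f"${egress_cost_usd(250):.2f}")   # moving ~250 GB from AWS to Google: roughly $24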

Initial Considerations

Platform Accounts

The first consideration is platform accounts. Moving data between Seven Bridges and Terra is currently a manual process and requires that one of the researchers involved in sharing has an account on both platforms. It is recommended that the recipient of the shared data is the person to have accounts on both Seven Bridges and Terra.

Let’s consider an example case: Sebastian who is working on Seven Bridges and Teresa who is working on Terra. If Sebastian wants to share data with Teresa so that she can use the data on Terra, Teresa first needs to set up an account on Seven Bridges. Now Teresa has an account on Terra and an account on Seven Bridges. Sebastian will share the data with Teresa on Seven Bridges by adding her as a member of the project with the data he wants to share, with Copy permissions. For information on permissions, refer to the Seven Bridges documentation. Once Teresa is added as a member of the project, she can move the data from the Seven Bridges project to a workspace on the Terra platform, following the instructions in the section titled Moving Data From Seven Bridges to Terra.

If Teresa (Terra) wants to share data with Sebastian (Seven Bridges) so that he can use the data on Seven Bridges, Sebastian first needs to create an account on Terra. Now Sebastian has an account on Seven Bridges and an account on Terra. Teresa can share the data with Sebastian on Terra by sharing the workspace with the data she wants to share with Sebastian. For information on sharing workspaces, refer to the Terra documentation.

To create a Terra account, refer to the Terra documentation.

To create a Seven Bridges account, refer to the Seven Bridges documentation. If you are new to Seven Bridges, you may find the Getting Started Guide helpful.

Billing

The second consideration is making sure the researcher moving data between the two workspaces has billing groups set up on both workspaces to cover cloud costs if necessary. Contact the BioData Catalyst Help Desk if you have questions about how to get a billing group on Seven Bridges or Terra.

Moving Files From Terra to Seven Bridges

The following steps describe how to use the Seven Bridges platform to pull data securely from a Terra workspace into a Seven Bridges project.

Refer to the Terra documentation for Moving data to/from a Google bucket (workspace or external), specifically the section Upload and download data files in a terminal using gsutil. This method:

  • Works well for all size transfers.

  • Ideal for large file sizes or 1000s of files.

  • Can be used for transfers between local storage and a bucket, workspace VM or persistent disk and a Google bucket, as well as between Google buckets (external and workspace).

You will use the terminal in JupyterLab on the Seven Bridges workspace environment. The reason for this is that although Seven Bridges can run on the Google Cloud Platform, the Google bucket API is not exposed in the same manner as it is on Terra. Therefore, you will start a JupyterLab notebook on Seven Bridges, using the project you would like to be the destination for the copied data. Refer to the Seven Bridges documentation for launching and accessing the terminal in a JupyterLab environment.

After launching the notebook, the next step is to open the terminal and install gsutil, a Python program that lets end users add data to, or copy data from, a Google Cloud bucket. After opening the terminal, run the following commands:

pip install gsutil
gsutil config

Installing gsutil takes only a few seconds.

The config command provides a secure URL for you to navigate to in the browser. You will authenticate with the same credentials that were used to login to Terra. The shortcut to access the printed URL in the JupyterLab terminal is to press shift and right click, which will display options to copy the URL. Copy and then navigate to the URL in a new browser tab, which will direct you to Google authentication:

Google will provide an authentication code that you will copy and paste into the terminal.

Next, you will type in the Google Project id. This is found on the right side of the Terra Workspace Dashboard.

Next, run the command below to display the Google buckets attached to that project ID.

gsutil ls

The Google bucket name for the Terra project can be found in the lower right corner of the Terra Workspace.

Running gsutil ls on the Google bucket name will display the folders and files from the Terra workspace.

To copy a folder to the Seven Bridges workspace environment, run the following command:

gsutil cp -R gs://[Google-Bucket-Name] /sbgenomics/output-files/

There are a couple of important things to mention about the gsutil cp command. First, the -R flag is used to recursively copy a folder and all of its subfolders and files. Most users will likely want to use the -R flag. This flag should be omitted when copying individual files or when using a wildcard such as “*.vcf”.

Additionally, /sbgenomics/output-files should be the destination folder when bringing in data from Terra, as this ensures the files or folders get populated back to the Seven Bridges project. Refer to the Save analysis outputs documentation for information about working with files in Data Cruncher environments. After the JupyterLab instance is shut down, your files will automatically be populated in your project-files tab on Seven Bridges.

Moving Data From Seven Bridges to Terra

In this section we will discuss pushing data from a Seven Bridges project to a Terra workspace.

The process of moving data from Seven Bridges to Terra uses the same setup as the previous section, with the arguments of the gsutil cp command reversed:

gsutil cp -R /sbgenomics/output-files/vcfs_to_transfer gs://[Google-Bucket-Name]

You will still use the -R flag but the destination is a Terra bucket. The Terra workspace’s Google bucket name/id can be found on the Terra workspace Dashboard tab. You can verify that the folder has been copied by navigating to the Files section of the Data tab in your Terra workspace.

Clicking on the folder, you will see that all three files have been copied.

Account Setup

Logging in to Terra for the first time is a quick and straightforward process. The process is easiest if you already have an email address hosted by Google. If you want to use an email address that is not hosted by Google, we have instructions for that as well.

Article: How to register for a Terra account
Article: Setting up a Google account with a non-Google email

We also recommend our article on navigating in Terra to get familiar with basic menus and options in Terra, as well as this video introduction to Terra.

Read on in the next two subsections for primers on how to set up billing and how to manage costs.

Troubleshooting Tasks

One of the key steps to becoming an advanced user and being able to fully understand and leverage the power of BDC-Seven Bridges is to learn how to detect and correct errors that prevent the successful execution of your analyses. The Troubleshooting tutorial presents some of the most common errors in task execution on the platform and shows you how to debug and resolve them. There is also a corresponding public project on the platform called "Troubleshooting Failed Tasks" which has examples of the failed analyses presented in the written tutorial.

  • Find the written tutorial here.

  • Find the platform public project with examples here.

Knowledge Center

The Seven Bridges Knowledge Center is a collection of user documentation which describes all of the various components of the platform, with step-by-step guides on their use. Our Knowledge Center is the central location where you can learn how to store, analyze, and jointly interpret your bioinformatic data using BDC-Seven Bridges.

From the Knowledge Center, you can access platform documentation. This content is organized into sections that deal with the important aspects of accessing and using BDC-Seven Bridges.

You can also read the Release Notes in the Knowledge Center, keeping you up-to-date on all of the latest updates and new features for BDC-Seven Bridges.

Annotation Explorer

The Annotation Explorer is an application developed by Seven Bridges in collaboration with the TOPMed Data Coordinating Center. The application enables users to interactively explore, query, and study characteristics of an inventory of annotations for the variants called in TOPMed studies. This application can be used pre-association testing to interactively explore aggregation and filtering strategies for variants based on annotations and generate input files for multiple-variant association testing. It can also be used post-association testing to explore annotations associated with a set of variants, like variant sets found significant during association testing.

The Annotation Explorer currently hosts a subset of genomic annotations obtained using Whole Genome Sequence Annotator software for TOPMed variants. Currently, annotations for TOPMed Freeze5 variants and TOPMed Freeze8 variants are integrated with the Annotation Explorer. Researchers who are approved to access one or more of the TOPMed studies included in Freeze8 or Freeze5 will be able to access these annotations in the Annotation Explorer.

For more information, refer to the Annotation Explorer's Public Project Page.


Managing Costs

We have a number of articles on tracking and minimizing the costs of operating on Terra. There are multiple ways of estimating how much your analyses are costing you, including built-in tools and external resources. The articles below contain instructions and advice on managing your cloud resources in a variety of ways:

Article: Understanding and controlling cloud costs
Article: Best practices for managing shared team costs
Article: How much did a workflow cost?
Article: How to disable billing on a Terra project

Workspace Setup

Workspaces are the fundamental building blocks of Terra. You can think of them as modular digital laboratories that enable you to organize and access your data in a number of ways for analysis.

To learn about the basics of operating a Terra workspace, we recommend these resources:

Article: Working with workspaces
Video: Introduction to using workspaces in Terra

Read on in this section to get familiar with:

  • Data storage and management

  • Collaboration

  • Security

Terra

BDC Powered by Terra (BDC-Terra) is a user-friendly system for doing biomedical research in the cloud. Terra workspaces integrate data, analysis tools, and built-in security components to deliver smooth research flows from data to results.

The following entries in this section of the BDC documentation are a starting point for learning how to use Terra in the context of the BDC ecosystem. You can also dive deeper into Terra by visiting the Terra website and the Terra Support Center. Wherever possible, we highlight specific articles, tutorial videos, and example workspaces that will help you learn what you need to know to accelerate your research.

If you can't find what you are looking for, we are happy to help. See the Troubleshooting and Support section for more information.

Please note that Terra is designed for and tested with the Chrome browser.

Data Storage & Management

Terra workspaces include a dedicated workspace Google bucket, as well as a built-in data model for managing your data. We provide articles in Terra’s knowledge base explaining how to organize and access data in a variety of ways.

A key to understanding the power of Terra is understanding its built-in data model, which allows you to rewire the inputs and outputs of your workflows and Jupyter notebooks.

The following resources give you guided instructions for using cloud-based data with Terra:

Article: Managing data with tables
Video: Introduction to Terra data tables
Article: Uploading to a workspace Google bucket
Article: How to import metadata to a workspace data table
Video: Making and uploading data tables to Terra

Collaboration

Sharing a workspace allows collaborators to actively work together in the same project workspace. Workspaces can be used as repositories of data, workflows, and Jupyter notebooks. Learn more about how to securely share a workspace:

Article: How to share a workspace
Article: Reader, writer or owner? Workspace access controls, explained
Article: Using permissions
Video: Introduction to Collaboration and Sharing in Terra

Billing

Now that you can log in, you’ll want to make sure that you have access to a Billing Account and Billing Project. This will allow you to charge storage and analysis costs through a Google account linked to Terra. A Terra Billing Project is Terra's way of connecting a workspace where you accrue costs for things, back to a Google Billing account where you pay for it. You must have a Google Billing Account established before creating a Terra Billing Project. Outlined here are the steps necessary to set this up, as well as instructions on how to add or be added to an existing account/billing project.

Detailed instructions for setting up your billing can be found by following the links below. If you are a BDC Fellow, your procedure for billing setup is a bit different, but you may find some of the information below still relevant (sharing a billing project with another user, for example).

Step 1: Get Cloud credits for BioData Catalyst
Step 2: Wait for approval & review the Billing overview for BioData Catalyst users
Step 3: Credits approved. Now create a new Terra billing project
Step 4 (optional): Sharing Billing Projects among colleagues


Use Your Own Data with Terra

This page describes how researchers can bring their own data files and metadata into Terra. Some researchers may choose to bring their own data to Terra in addition to, or instead of, using BDC data from Gen3. For example, this may be done when bringing additional (e.g., longitudinal) phenotypic data to enhance the harmonized metadata available from Gen3, when using joint variant calling with additional researcher-provided genomic data, or even when using researcher-provided data exclusively.

Generally, there are two types of data that researchers bring to Terra: data files (e.g., genomic data, including CRAM and VCF files) and metadata (e.g., tables of clinical/phenotypic or other data, typically regarding the subjects in their study). These are described separately below.

There are two ways a researcher's data files may be made available in Terra: by uploading data to the researcher's workspace bucket, or by enabling Terra to access the researcher's data in a researcher-managed Google bucket, for which you need to set up a proxy group.

Article: Uploading to a workspace Google bucket
Article: Understanding and setting up a proxy group

The ways in which a researcher may import metadata into Terra data tables are described in the articles and tutorials below:

Article: Managing data with tables
Article: How to import metadata to a workspace data table
Video: Introduction to Terra data tables
Video: Making and uploading data tables to Terra

Bring Data into a Workspace

You can import data into your workspace by either linking directly to external files you have access to, or by interfacing with a number of platforms with which Terra has integrated access.

For BDC researchers, one of the most relevant of these interfacing platforms is Gen3. However, this section also provides you with resources that teach how to import data from other public datasets integrated into Terra's data library, as well as how to bring in your own data.

Read on in this section for more information on:

  • Bringing in data from Gen3

  • Bringing in data from Terra's Data Library

  • Using your own data with Terra

Security

Terra has a number of features to ensure the security of sensitive data accessed through the platform. Many of these features are in place automatically, while tools like authorization domains give you greater control over your data. These articles contain an overview of the security features enabled on Terra:

Article: Understanding the Terra ecosystem and how your files live in it
Article: Authorization Domain overview for BioData Catalyst users
Article: Managing data privacy and access with Authorization Domains
Article: Best Practices for accessing external resources
Article: Terra security posture

Interactive Analysis

The interactive analysis features of Terra support interactive data exploration, including the use of statistical methods and graphical display. Versatile and powerful interactive analysis is provided through Jupyter Notebooks in both Python and R languages.

Jupyter Notebooks run on a virtual machine (VM). You can customize your VM's installed software by selecting one of Terra's preinstalled notebook cloud environments or by choosing a custom environment and specifying a Docker container. Docker containers ensure you and your colleagues analyze with the same software, making your results reproducible.

Article: Interactive statistics and visualization with Jupyter notebooks
Article: Customizing your interactive analysis application compute
Article: Terra's Jupyter Notebooks environment Part I: Key components
Article: Terra's Jupyter Notebooks environment Part II: Key operations
Article: Terra's Jupyter Notebooks environment Part III: Best Practices
Video: Notebooks overview
Video: Notebooks Quickstart walkthrough
Workspace: Notebooks Quickstart workspace
Workspace: BioData Catalyst notebooks collection
Workspace: PIC-SURE Tutorial in R
Workspace: PIC-SURE Tutorial in Python

From Terra’s Data Library

Terra's Dataset Library includes a number of integrated datasets, many of which have individualized Data Explorer interfaces useful for generating and exporting custom cohorts. If you click into a dataset and have the proper permissions, you'll be able to explore the data. If you don't have the necessary permissions, you'll be taken to a page that tells you whom to contact for access.

The resources linked below provide guided instructions for creating custom cohorts from the data library, importing them to your workspace, and using a Jupyter notebook to interact with the data:

Article: Accessing and analysing custom cohorts with Data Explorer
Video: Notebooks Quickstart walkthrough
Workspace: Notebooks Quickstart workspace

Bring in Data from Gen3

BioData Catalyst Powered by Gen3 provides data for many projects and conveniently supports search across the vast set of subjects to identify the best available cohorts for research analysis. Searches are based on harmonized phenotypic variables and may be performed both within and across projects.

When a desired cohort has been identified in Gen3, the cohort may be conveniently "handed-off" to Terra for analysis. Optionally, this dataset may be enhanced with additional metadata from dbGaP, or extended to include additional researcher-provided subject data.

Here we provide essential information for all researchers using BDC data from Gen3, including how to access and select Gen3 subject data and hand it off to Terra, as well as a description of the GA4GH Data Repository Service (DRS) protocol and data identifiers used by Gen3 and Terra.

The resources below contain the information you’ll need to access your desired data: Video: Article: ​ Article: ​ Article: ​ Article: ​Article: ​ Workspace: Workspace:


Run Analyses

Terra supports the following types of analysis: Batch processing with Workflows and Interactive analysis with Jupyter Notebooks. This section will orient you with resources that teach you how to do:

  • Batch processing with workflows

  • Interactive analysis with Jupyter Notebooks

  • Genome-wide association studies

As an introduction, we recommend reading our article on the kinds of analysis you can do in Terra.

Batch Processing with Workflows

The batch workflow features of Terra provide support for computationally-intensive, long-running, and large-scale analysis.

You can run whole pipelines, from preprocessing and trimming sequencing data to alignment and downstream analyses, using Terra workflows. Workflows are written in the human-readable Workflow Description Language (WDL), and you can search for and import them into your workspace from Dockstore or the Broad Methods Repository.

Video: Data Analysis with Gen3, Terra and Dockstore
Article: How to import data from Gen3 into Terra and run the TOPMed aligner workflow
Article: Configure a workflow to process your data
Article: Getting workflows up and running faster with a JSON file
Article: Importing a Dockstore workflow into Terra
Video: Importing a Dockstore workflow into Terra walkthrough
Video: Workflows Quickstart walkthrough
Workspace: Workflows Quickstart workspace
Workspace for BDC: TOPMed Aligner workspace
Workspace for BDC: GWAS with 1000 Genomes and synthetic clinical data
Workspace for BDC: GWAS with TOPMed data

Dockstore

"An app store for bioinformatics workflows"

Dockstore is an open platform used by the GA4GH for sharing Docker-based tools described with either the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL). Dockerized workflows come packaged with all of their requirements, meaning you spend less time searching the web for obscure installation errors and more time doing research.

Dockstore is aimed at scientific use cases, and we hope this helps users find helpful resources more quickly. Our documentation is also created with researchers in mind: we work to distill information about the technologies we use down to the relevant points so that users can get started quickly.

This section highlights the documentation relevant to BioData Catalyst users. If you are brand new to Dockstore, it is suggested to review the Getting Started Guide. Our entire suite of documentation is available here.

Troubleshooting & Support

If things aren’t going quite as expected, there are a number of avenues to help unblock any issues you may have.

Troubleshooting This section of the Terra knowledge base contains many useful articles on how to address problems, including a variety of articles describing common workflow errors, as well as more general articles that explain how to find which errors are affecting your work, and how to proceed once you’ve diagnosed your problem.

Monitor your jobs The Job History tab is your workflow operations dashboard, where you can check the status of past and current workflow submissions and find links to the job manager where you can diagnose issues.

How to report an issue There are a number of ways you can report an issue directly to us outlined in this article. If something appears broken, slow, or just plain weird, feel free to let us know.

Community forum A lot of answers can be found on our forum, which is monitored by our dedicated frontline support team and has an integrated search function. If you suspect that you’re running into a common issue but can’t find an answer in the documentation, this is a great place to check.

Intro to Docker, WDL, CWL

Technologies for reproducible analysis in the cloud

Introduction to Docker

Docker is a fantastic tool for creating lightweight containers to run your tools. It gives you a fast, VM-like environment for Linux where you can automatically install dependencies, make configurations, and set up your tool exactly the way you want, just as you would on a “normal” Linux host. You can then quickly and easily share these Docker images with the world using registries like Quay.io (indexed by Dockstore), Docker Hub, and GitLab.

Learn how to create a Docker image

Introduction to Workflow Languages

There are multiple workflow languages currently available to use with Docker technology. In the BioData Catalyst ecosystem, Seven Bridges uses CWL and Terra uses WDL. To learn more about how these languages compare and differ, read Dockstore's documentation on tools and workflows.

Once you have picked what language works best for you, prepare your pipeline for analysis in the cloud with these tutorials aimed at bioinformaticians:

Learn how to create a tool in Common Workflow Language (CWL)

Learn how to create a tool in Workflow Descriptor Language (WDL)

Best Practices for Secure and FAIR Workflows

Dockstore's integration with BioData Catalyst allows researchers to easily launch reproducible tools and workflows in secure workspace environments for use with sensitive data. This privilege to work with sensitive data requires assurances of safe software.

We believe we can enhance the security and reliability of tools and workflows through open, community-driven best practices that exemplify the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles. We have established a best practices framework for secure and FAIR workflows published in Dockstore. We ask that users try to implement these practices for all workflows they develop.

Community Tools & Integration

Genome-Wide Association Studies

Terra provides powerful support for performing Genome-Wide Association Studies (GWAS). The following featured and template workspaces include Jupyter notebooks for phenotypic and genomic data preparation (using Hail) and workflows (using GENESIS) to perform single or aggregate variant association tests using mixed models. We will continue to provide more resources for performing more complex GWAS scenarios in BioData Catalyst.

Kinship Matrices

A Jupyter Notebook in both of the following workspaces uses Hail to generate genetic relatedness matrices for input into the GWAS workflows. Users with access to kinship matrices from the TOPMed consortium may wish to skip these steps and instead import kinship files using the bring-your-own-data instructions.

BioData Catalyst GWAS tutorial​ workspace

The BioData Catalyst GWAS tutorial workspace was created to walk users through a GWAS with training data that includes synthetic phenotypic data (modeled after traits available in TOPMed) paired with 1000 Genomes open-access data. This tutorial aims to familiarize users with the Gen3 data model so that they can become empowered to use this data model with any existing tutorials available in the Terra library’s showcase section.

BioData Catalyst GWAS blood pressure trait ​template workspace

This template is an example workspace that asks researchers to export TOPMed projects (for which they have access) into an example template for conducting a common variant, mixed-models GWAS of a blood pressure trait. Our goal was to include settings and suggestions to help users interact with data exactly as they receive it through BioData Catalyst. Accommodating other datasets may require modifying many parts of this notebook. Inherently, the notebook is an interactive analysis where decisions are made as you go. It is not recommended that the notebook be applied to another dataset without careful thought.

Cost Examples

Below are reported costs from using 1,000 and 10,000 samples to conduct a GWAS using the BioData Catalyst GWAS Blood Pressure Trait template workspace. The costs were derived from single variant tests that used Freeze 5b VCF files that were filtered for common variants (MAF <0.05) for input into workflows. The way these steps scale will vary with the number of variants, individuals, and parameters chosen. TOPMed Freeze 5b VCF files contain 582 million variants and Freeze 8 increases to ~1.2 billion. For GWAS analyses with Freeze 8 data, computational resources and costs are expected to be significantly higher.

Analysis Step | Cost (n=1,000; Freeze 5b) | Cost (n=10,000; Freeze 5b)
 | $29.34 ($19.56/hr for 1.5 hours) | $336 ($56/hr for 6 hours)
workflow | $1.01 | $5.01
workflow | $0.94 | $6.67
TOTAL | $32.29 | $347.68

These costs were derived from running these analyses in Terra in June 2020.

BYOT Glossary

An introduction to terms used in this document

Each platform within BDC may have slight variations on these definitions; you will find more specific definitions within the relevant sections of the BYOT documentation. Below, we highlight a few terms to introduce you to before you get started.

  • App: 1) In Seven Bridges, an app is a general term to refer to both tools and workflows. 2) App may also refer to persistent software that is integrated into a platform.

  • Container: A standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another (for example, Docker).

  • Command: In workflow writing, the command specifies the literal command line run (akin to what you would run in the terminal).

  • Common Workflow Language (CWL): Simple scripting language for describing computational workflows for performing sequential operations on data. CWL is a way to describe command-line tools and connect them together to create workflows. CWL is well suited for describing large-scale workflows in cluster, cloud, and high-performance computing environments where tasks are scheduled in parallel across many nodes.

  • Docker: Software for running packaged, portable units of code, and dependencies that can be run in the same way across many computers. See also Container.

  • Dockerfile: A text document that contains all the commands a user could call on the command line to assemble an image.

  • Dockstore: An open platform developed by the Cancer Genome Collaboratory and used by the GA4GH for sharing Docker-based tools described with the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL).

  • Image: In the context of containers and Docker, this refers to the resting state of the software.

  • Instance: Refers to a virtual server instance from a public or private cloud network.

  • Task: In workflow writing, the term task encompasses all of the information necessary to execute a command, such as specifying input/output files and parameters.

  • Tool: In CWL, the term tool specifies a single command. This specification is not as discrete in other languages such as WDL.

  • Workflow Description Language (WDL): Way to specify data processing workflows with a human-readable and writable syntax. Define complex analysis tasks, chain them together in workflows, and parallelize their execution.

  • Workflow: A sequence of processes, usually computational in this context, through which a user may analyze data.

  • Workspace: Areas to work on/with data within a platform. Examples: projects within Seven Bridges.

  • Wrapping: The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.

  • Virtual Machine (VM): An isolated computing environment with its own operating system.
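
To make several of these terms concrete, here is a minimal, hypothetical WDL sketch (not one of the BDC-maintained workflows): a single task whose command block runs a shell command inside a public Docker container, and a workflow that calls it. The task name, file names, and image tag are illustrative assumptions only.

```wdl
version 1.0

# Minimal example task: the "command" block is the literal command line,
# and the "runtime" block names the Docker image the command runs in.
task count_variants {
  input {
    File vcf
  }

  command <<<
    # Count non-header lines in a gzipped VCF
    zcat ~{vcf} | grep -vc "^#" > variant_count.txt
  >>>

  runtime {
    docker: "ubuntu:20.04"   # illustrative public image; any image with gzip/grep works
  }

  output {
    Int n_variants = read_int("variant_count.txt")
  }
}

# A workflow chains tasks together and declares the inputs an analysis
# platform will prompt for when the workflow is launched.
workflow count_variants_wf {
  input {
    File vcf
  }

  call count_variants {
    input:
      vcf = vcf
  }

  output {
    Int n_variants = count_variants.n_variants
  }
}
```

When a workflow like this is registered on Dockstore and launched in a platform such as Terra, the workflow-level inputs (here, the VCF file) are what the platform asks you to supply at launch time.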

For other terms, you can reference the BioData Catalyst glossary.

Launch workflows with BDC

How to use Dockstore workflows in our cloud partner platforms

Using the BDC ecosystem, you can launch workflows from Dockstore in both of our partner analysis platforms, Terra and Seven Bridges. It is important to know that these platforms use different workflow languages: Terra uses WDL and Seven Bridges uses CWL.

When you open any WDL or CWL workflow in Dockstore, you will see the option to "Launch with NHLBI BioData Catalyst":

  • If you selected a CWL workflow, it will launch in BDC Powered by Seven Bridges (BDC-Seven Bridges). Learn more about how this integration works.

  • If you selected a WDL workflow, it will launch in BDC Powered by Terra (BDC-Terra). Learn more about how this integration works.

Discover our catalog

How to search our catalog

Dockstore offers faceted search, which allows for flexible querying of tools and workflows. Tabs are used to split up the results between tools and workflows. You can search for basic terms/phrases, filter using facets (like CWL vs WDL), and also use advanced search queries. Learn more.

Organizations

You can also search curated workflows on Dockstore's Organizations page.

Organizations are landing pages for collaborations, institutions, consortia, companies, and other groups that allow users to showcase tools and workflows. This is achieved through the creation of collections, which are groupings of related tools and workflows. Learn more about Organizations and Collections, including how your research group can create its own organization to share your work with the community.

Dockstore Organizations relevant to BDC users:

NHLBI BioData Catalyst

Here, you can find a suite of analysis tools we have developed with researchers that are aimed at the BDC community. Examples include workflows for performing GWAS and Structural Variant Calling. Many of these collections also point users to tutorials where you can launch these workflows in our partner platforms and run an analysis.

TOPMed

These workflows are based on pipelines the University of Michigan developed to perform alignment and variant calling on TOPMed data. If you're bringing your own data to BDC to compare with TOPMed data, these may be helpful resources.

Dockstore Forum

This forum is a great place to find and post questions about Docker files, workflow languages, Dockstore features, and workflow learning resources. The user base includes CWL, WDL, Nextflow, and Galaxy workflow authors and users.

Contribute to the community

Our mission is to catalyze open, reproducible research in the cloud

We hope Dockstore provides a reference implementation for tool sharing in the sciences. Dockstore is essentially a living and evolving proof of concept designed as a starting point for two activities that we hope will result in community standards within the GA4GH:

  • a best practices guide for describing tools in Docker containers with CWL/WDL/Nextflow

  • a minimal web service standard for registering, searching and describing CWL/WDL-annotated Docker containers that can be federated and indexed by multiple websites

We plan on expanding Dockstore in several ways over the coming months. Please see our issues page for details and discussions.

Building a community

To help Dockstore grow, we encourage users to publish their tools and workflows on Dockstore so that they can be used by the greater scientific community. Here is how to get started:

  • Create a Dockstore account

  • Register your tool or workflow on Dockstore

  • Create an Organization, invite your collaborators, and promote your work in collections
