
Citation and Acknowledgement

Let us know about your publications and learn how to cite us.

If you are writing a manuscript about research you conducted using BDC, use the citation available below.

When you learn that your manuscript has been accepted, let us know by filling out our Contact form and selecting Published Research under the "Select the Type of Assistance Needed" dropdown.

For citation of BDC:

National Heart, Lung, and Blood Institute, National Institutes of Health, U.S. Department of Health and Human Services (2020). The NHLBI BioData Catalyst. Zenodo. https://doi.org/10.5281/zenodo.3822858

To acknowledge BDC, use:

The authors wish to acknowledge the contributions of the consortium working on the development of the NHLBI BioData Catalyst® (BDC) ecosystem.


NHLBI BioData Catalyst® (BDC) Documentation

This is a repository for documentation related to the platforms and services that are part of the BDC ecosystem.

Click here to access the NHLBI BioData Catalyst® (BDC) website.

Welcome to NHLBI BioData Catalyst® (BDC)

Welcome to the BDC ecosystem and thank you for joining our community of practice. The ecosystem offers secure workspaces to support your data analysis in addition to a number of bioinformatics tools for analysis. There is a lot of information to understand and many resources (documentation, learning guides, videos, etc.) available, so we developed this overview to help you get started. If you have additional questions, use the links at the very end of this document, under the "Questions" section, to contact us.

About BDC and Our Community

What is BDC?

NHLBI BioData Catalyst® (BDC) is a cloud-based ecosystem that offers researchers data, analytical tools, applications, and workflows in secure workspaces. BDC is a community where researchers can find, access, share, store, and analyze heart, lung, blood, and sleep data. BDC is an NHLBI data repository where researchers share scientific data from NHLBI-funded research, so they and others can reproduce findings and reuse data to advance science.

By increasing access to NHLBI data and innovative analytic capabilities, BDC accelerates reproducible biomedical research to drive scientific advances that can help prevent, diagnose, and treat heart, lung, blood, and sleep disorders.

What are we doing and why does it matter?

By increasing access to the NHLBI’s datasets and innovative data analysis capabilities, the BDC ecosystem accelerates efficient biomedical research that drives discovery and scientific advancement, leading to novel diagnostic tools, therapeutics, and prevention strategies for heart, lung, blood, and sleep disorders.

Who is developing BDC?

The ecosystem is funded by the National Heart, Lung, and Blood Institute (NHLBI). Researchers and other professionals receive funding from the NHLBI to work on the development of the ecosystem; together they are often referred to as "The BDC Consortium." You can refer to a list of partners and platforms powering the ecosystem on the Overview page of the BDC website, and a list of the principal investigators is available in our documentation.

Find out the meanings of our terms and acronyms.

Like many professional communities, BDC has adopted terms to help us communicate quickly and more efficiently, but that can be a challenge for newcomers. To help, we created a BDC glossary of terms and acronyms. If ever there is a time when an ecosystem term or acronym is unfamiliar and isn’t in the glossary, contact us so we can give you the information and add it to the glossary.

The BDC Ecosystem

Learn about the platforms available in the ecosystem.

The BDC ecosystem features the following platforms.

Explore Available Data

  • BDC Powered by Gen3 (BDC-Gen3) - Hosts genomic and phenotypic data and enables faceted search for authorized users to create and export cohorts to workspaces in a scalable, reproducible, and secure manner.

  • BDC Powered by PIC-SURE (BDC-PIC-SURE) - Enables access to all clinical data, supports feasibility queries, and allows cohorts to be built in real time and results to be exported via the API for analysis (see the sketch below).
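For orientation, here is a minimal sketch of what an API export might look like from a notebook. The endpoint URL, token, and query payload are illustrative placeholders, not the documented PIC-SURE API; consult the PIC-SURE user guide for the supported client libraries and real endpoints.

```python
# Hypothetical sketch of exporting a PIC-SURE cohort query for analysis.
# The URL, token, and payload below are placeholders, not the real API.
import requests

API_URL = "https://picsure.example.nih.gov/query"  # hypothetical endpoint
TOKEN = "YOUR_PERSONAL_ACCESS_TOKEN"               # issued from your PIC-SURE profile

query = {
    "study": "phs000951",                 # example phs accession
    "variables": ["age", "sex", "fev1"],  # phenotype variables to export
}

resp = requests.post(
    API_URL,
    json=query,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=60,
)
resp.raise_for_status()

# Results come back as rows suitable for loading into pandas for analysis.
rows = resp.json()
print(f"Exported {len(rows)} records")
```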

Analyze Data in Cloud-based Shared Workspaces

  • BDC Powered by Seven Bridges (BDC-Seven Bridges) - Collaborative workspaces where researchers can find and analyze hosted datasets as well as their own data by using hundreds of optimized analysis tools and workflows in CWL, as well as JupyterLab and RStudio for interactive analysis.

  • BDC Powered by Terra (BDC-Terra) - Secure collaborative place to organize data, run and monitor workflow analysis pipelines in WDL, and perform interactive analysis using applications such as Jupyter Notebooks and the Hail GWAS tool.

The BDC website provides details about the platforms and services available in the ecosystem. Click here to view the differences between BDC’s standard workspaces (BDC-Seven Bridges) and those provided by BDC-Terra. We encourage you to create accounts on all the platforms as you get to know BioData Catalyst.

Ecosystem Access, Hosted Data, and System Services

How do I log in?

Users log into BioData Catalyst platforms with their eRA Commons credentials (see Understanding eRA Commons Accounts), and authentication is performed by iTrust. Every time users log in, the ecosystem checks their credentials to ensure they can only access the data for which they have dbGaP approval.

While all of the platforms within BioData Catalyst use eRA Commons credentials for login and iTrust for authentication and authorization, there are some slight differences between the platforms when getting set up:

  • BioData Catalyst Powered by Gen3 - Users do not set up usernames on Gen3. The first time you log in, select “Login from NIH,” then enter your eRA Commons credentials at the prompt. This ‘User Identity’ is used to track the user on the system.

  • BioData Catalyst Powered by PIC-SURE - Similar to Gen3, user identities are used - researchers log into the system by selecting “Log in with eRA Commons.”

  • BioData Catalyst Powered by Seven Bridges - Users set up platform accounts. The first time on the system, users select to “Create an account” and then proceed with entering their eRA Commons credentials. The user is then prompted to fill out a registration form with their name, email, and preferred username. Users are also asked to acknowledge that they have read the Privacy Act notice and then they can proceed to the platform.

  • BioData Catalyst Powered by Terra - Users initially log in using Google credentials and are asked to agree to the Terms of Service and Privacy Act notice. User activity is tracked via the Google credentials, but users can link their eRA Commons credentials to the account to get access to hosted datasets.

Details about how data access works on the NHLBI BioData Catalyst ecosystem are available on the website.

How do I check which data I can access?

We recommend users first check their access to data before logging in. Do this by going to the Accessing BioData Catalyst page and clicking on the “Check My Access” button. Once you confirm your data access, go to the Platforms and Services page, from which you click on the “Launch” hyperlink for the platform or service you wish to use. Platforms and services have login/sign-in links on their pages that bring you to the pages on which you enter your eRA Commons credentials. Documentation on checking your access to data is also available.

What data are available in the ecosystem?

The NHLBI BioData Catalyst currently hosts a subset of datasets from TOPMed, including phs numbers with genomic data and related phs numbers with phenotype data. You can find information about which TOPMed studies are currently hosted on the Data page of the website as well as in the Release Notes.

Harmonized data available.

There are limited amounts of harmonized data available to users with appropriate access at this time. The TOPMed Data Coordinating Center curation team has produced forty-four (44) harmonized phenotype variables from seventeen (17) NHLBI studies. Information about the 17 studies and the 44 variables can be found in the BioData Catalyst Powered by PIC-SURE User Guide.

Bring your own data and workflows into the system.

We allow researchers to bring their own data and workflows into the ecosystem to support their analysis needs. Researchers can bring their own datasets into BioData Catalyst Powered by Seven Bridges and BioData Catalyst Powered by Terra. Users can also bring their own workflows to the system: they can add workflows to Dockstore in CWL or WDL, create CWL tools directly on BioData Catalyst Powered by Seven Bridges, or develop custom workflows for use on BioData Catalyst Powered by Terra. A sketch of programmatic data upload follows.
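As one illustration of bringing your own data in programmatically, the following sketch uses the sevenbridges-python client. The API URL, project ID, and file name are assumptions for illustration; confirm the correct endpoint, token, and exact call names for your client version in the platform's developer settings and documentation.

```python
# Sketch: uploading your own data into a BDC-Seven Bridges project with the
# sevenbridges-python client. URL, project ID, and file name are assumptions.
import sevenbridges as sbg

api = sbg.Api(
    url="https://api.sb.biodatacatalyst.nhlbi.nih.gov/v2",  # assumed BDC API endpoint
    token="YOUR_AUTH_TOKEN",  # from the platform's Developer > Auth Token page
)

# Look up one of your projects (format: "<username>/<project-name>").
project = api.projects.get(id="my-username/my-analysis-project")

# Upload a local file into that project (call signature per the
# sevenbridges-python docs; verify for your installed version).
upload = api.files.upload("my_variants.vcf.gz", project=project)
uploaded_file = upload.result()
print("Uploaded:", uploaded_file.name)
```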

Learn about genome-wide association studies and genetic association testing on BioData Catalyst.

Walk through our self-paced genome-wide association study and genetic association testing tutorials.
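For a taste of what the tutorials cover, here is a minimal Hail sketch of a single-variant association test, Hail being the GWAS tool mentioned above for BDC-Terra notebooks. File paths and the phenotype column are placeholders; the self-paced tutorials walk through the real workflow.

```python
# Minimal sketch of a genetic association test with Hail.
# Paths and the phenotype column ("fev1") are illustrative placeholders.
import hail as hl

hl.init()

# Load genotypes and annotate samples with a phenotype table keyed by sample ID.
mt = hl.import_vcf("gs://my-bucket/cohort.vcf.bgz", reference_genome="GRCh38")
pheno = hl.import_table("gs://my-bucket/phenotypes.tsv", impute=True, key="sample_id")
mt = mt.annotate_cols(pheno=pheno[mt.s])

# Simple additive-model linear regression per variant; [1.0] adds an intercept.
gwas = hl.linear_regression_rows(
    y=mt.pheno.fev1,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0],
)
gwas.order_by(gwas.p_value).show(10)  # top associations
```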

Share your workflows.

We encourage users to publish their workflows so they can be used by other researchers working in the NHLBI BioData Catalyst ecosystem. Share your workflows via Dockstore.
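Published workflows can also be discovered programmatically. The sketch below queries Dockstore's implementation of the GA4GH Tool Registry Service (TRS); the base URL is Dockstore's public API, while the search parameters are illustrative assumptions.

```python
# Sketch: discovering shared tools/workflows via Dockstore's GA4GH
# Tool Registry Service (TRS) API. Search parameters are examples.
import requests

TRS_BASE = "https://dockstore.org/api/ga4gh/trs/v2"

resp = requests.get(
    f"{TRS_BASE}/tools",
    params={"name": "gwas", "limit": 5},  # assumed filter params; small page
    timeout=30,
)
resp.raise_for_status()

# Each entry is a TRS Tool record with an id and (optionally) a description.
for tool in resp.json():
    print(tool["id"], "-", (tool.get("description") or "")[:60])
```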

Costs and cloud credits.

BioData Catalyst hosts a number of datasets available for analysis to users with appropriate data access approvals. Users are not charged for the storage of these hosted datasets; however, if hosted data are used in analyses, users incur costs for computation and storage of derived results. Cloud credits are available on the system, and you can learn more here.

Questions?

Learn more, ask questions, or request help.

Answers to frequently asked questions are available on the website, as are many resources that can be found under Learn & Support. You can also use a form to contact us, and if you aren’t sure which selections to make on the form, please see our help desk directory.

Strategic Planning

In the context of agile development and a Consortium with a diverse set of members, the application of various agile-development terms may mean different things to different individuals.

The table below defines the BDC Core Terminology:

Term: User Narrative
Definition/Description: Descriptions of a user interaction experience within the system from the perspective of a particular persona. User Narratives are further broken down into Features, Epics, and User Stories. Currently formulated into rough 6-month timelines to benchmark progress.
Example: An experienced bioinformatician wants to search TOPMed studies for a qualitative trait to be used in a GWAS study.

Term: Feature
Definition/Description: A functionality at the system level that fulfills a meaningful stakeholder need. *Level at which the BDC3 coordinates.
Example: Search TOPMed datasets using the PIC-SURE platform.

Term: Epic
Definition/Description: A very large user story which can be broken down into executable stories. *NHLBI’s cost-monitoring level.
Example: PIC-SURE is accessible on BDC.

Term: User Stories
Definition/Description: A backlog item that describes a requirement or functionality for a user. *Finest level of PM monitoring.
Example: A user can access PIC-SURE through an icon on BDC to initiate search.

Term: Workstream
Definition/Description: A collection of related features; orthogonal to a User Narrative.
Example: Workstreams impacted by the User Narrative above include: production system, data analysis, data access, and data management.

Strategic Planning Documents Reviewed & Approved by NHLBI Leadership

  • Project Management Approach (PM-graphic.pdf, 148 KB)

  • BioData Catalyst Strategic Framework Plan (BioData-Catalyst-Strategic-Framework-Plan-V1-v2.0 (1).pdf, 471 KB)

  • BioData Catalyst Implementation Plan (BioData-Catalyst-Implementation-Plan-V1-v2.0.pdf, 685 KB)

  • BioData Catalyst Data Management Strategy (BioData Catalyst Data Management Strategy - V1.0(3).pdf, 491 KB)

  • BioData Catalyst Project Management Plan (BioData Catalyst Project Management Plan V2.0 (1).pdf, 622 KB)

Who We Are

Our Culture: Though the primary goal of the BDC project is to build a data science ecosystem, at its core, this is a people-centric endeavor. BDC is also building a community of practice working collaboratively to solve technical and scientific challenges in biomedical science.

Principal Investigators (PIs):

  • Stan Ahalt, PI RENCI (Coordination Center)

  • Rebecca Boyles, Co-PI RTI (Coordination Center)

  • Paul Avillach, PI HMS (Team Carbon)

  • Kira Bradford, Co-PI RENCI (Team Helium)

  • Steve Cox, Co-PI RENCI (Team Helium)

  • Brandi Davis-Dusenbery, PI Seven Bridges (Team Xenon)

  • Robert Grossman, PI UChicago (Team Calcium)

  • Ashok Krishnamurthy, PI RENCI (Team Helium)

  • Benedict Paten, PI UCSC (Team Calcium)

  • Anthony Philippakis, PI Broad Institute (Team Calcium)

Note: BDC collaboration is organized around teams based on elements in the periodic table. There are additional modes of collaboration in BDC including Tiger Teams, Working Groups, Steering Committee, and Publications.

More about who we are and the partners empowering our ecosystem can be found at the BioData Catalyst About page.





BDC Glossary

Glossary of terms used in the context of the BDC Consortium and platform.

  • Agile Development

    Agile software development is an approach to software development under which requirements and solutions evolve through the collaborative effort of self-organizing and cross-functional teams and their customer(s)/end user(s).

  • Alpha Users

    A small group of users who are more willing to tolerate working in a system that isn’t fully developed, providing detailed feedback and perhaps some back-and-forth discussion.


  • [Amazon] EFS

    [Amazon] Elastic File System: a simple, scalable, elastic file system for Linux-based workloads for use with AWS Cloud services and on-premises resources.

  • Ambassadors

    A small group of experts that represent the personas featured within the priority User Narratives. For their time and help, Ambassadors will receive early access to the BDC platform, free compute time, and a monetary fee for their time; relevant travel expenses will also be covered.

  • App

    1. In Seven Bridges, an app is a general term to refer to both tools and workflows.

    2. App may also refer to persistent software that is integrated into a platform.

  • API

    Application Programming Interface. API technologies serve as software-based intermediaries to exchange data.

  • AWS

    Amazon Web Services. A provider of cloud services available on-demand.

  • BagIt

    BagIt is a hierarchical file packaging format for storage and transfer of arbitrary digital content.

  • BDC3

    BDC Coordinating Center

  • Beta Users

    A slightly larger group than the alpha users who are not as tolerant of a difficult/clunky environment but understand that the version they are using is not polished and that they need to give feedback.

  • Beta-User Training

    Once the platform is available to a broader audience, we will support freely-accessible online training for beta-users at any time.

  • Carpentries Instructor Training Program

    Ambassadors attend this training program to become BDC trainers.

  • CCM

    Change Control Management; the systematic approach to managing all changes made to a document or process. Ensures no unnecessary changes are made, all changes are documented, and a process exists for implementing approved changes.

  • CIO

    Chief Information Officer

  • Cloud Computing

    Internet-based computing, wherein computing power, networking, storage or applications running on computers outside an organization are presented to that organization in a secure, services-oriented way.

  • Components

    Software units that implement a specific function or functions and which can be reused.

  • ConOps

    Concept of Operations

  • Consortium

    A collection of teams and stakeholders working to deliver on the common goals of integrated and advanced cyberinfrastructure, leading-edge data management and analysis tools, FAIR data, and HLBS researcher engagement.

  • Containers

    A standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another (for example, Docker).

  • Command

    In workflow writing, the command specifies the literal command line run (akin to what you would run in the terminal).

  • COPDGene

    Chronic Obstructive Pulmonary Disease (COPD) Gene

  • Cost Monitoring (level)

    At the Epic level. The Coordinating Center will facilitate this process by developing reporting templates (see the example in the PM Plan, Financial Management) for distribution to the teams. The BDC teams will complete these templates and send them directly to NHLBI. Each team is responsible for tracking their finances based upon the award conditions and for providing status updates as requested to NHLBI.

  • CRAM File

    Compressed columnar file format for storing biological sequences aligned to a reference sequence. Designed to be an efficient reference-based alternative to the Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats. It optionally uses a genomic reference to describe differences between the aligned sequence fragments and the reference sequence, reducing storage costs. Additionally, each column in the SAM format is separated into its own blocks, improving compression ratio. CRAM files are typically 30 to 60% smaller than BAM, depending on the data held within them (from Wikipedia).

  • CSOC Alpha

    Common Services Operations Center (CSOC): operates cloud, commons, compliance and security services that enable the operation of data commons; has ATO and hosts production system.

  • CSOC Beta

    Development/testing; Real data in pilot (not production) that can be accessed by users

  • Common Workflow Language (CWL)

    Simple scripting language for describing computational workflows for performing sequential operations on data. CWL is a way to describe command line tools and connect them together to create workflows. CWL is well suited for describing large-scale workflows in cluster, cloud and high performance computing environments where tasks are scheduled in parallel across many nodes.

  • DAC

    Data Access Committee: reviews all requests for access to human studies datasets

  • DAR

    Data Access Request

  • Data Access

    A process that involves authorization to access different data repositories; part of a User Narrative for the December 2020 release goal. A Workstream. PM Plan constraint: NHLBI, as the project sponsor, will identify a process to enable data access by the BDC team members and for research users.

  • Data Commons

    Provides tools, applications, and workflows to enable computing large scale data sets in secure workspaces.

  • Data Repository Service (DRS)

    Generic interface (API) to data repositories so data consumers, including workflow systems, can access data in a single, standardized way regardless of where it’s stored or how it’s managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by the ID (from GA4GH).

  • Data Steward

    Members of the TOPMed and COPDGene communities who are working with BDC teams.

  • dbGaP

    Database of Genotypes and Phenotypes

  • DCPPC

    Data Commons Pilot Phase Consortium. The Other Transaction Awardees, Data Stewards, and the NIH.

  • Decision Tree

    A decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility

  • Deep Learning

    A machine learning method based on neural networks to learn from data through training to recognize patterns in the data.

  • Deliverables

    Demonstrations and products.

  • Demos

    Activities and documentation resulting from the DCPPC to build, test and demonstrate completion of goals of the Data Commons Pilot Phase.

  • DEV Environment

    Set of processes and programming tools used to create the program or software product

  • DMI

    Data Management Incident

  • Docker

    Software for running containers, packaged, portable units of code and dependencies that can be run in the same way across many computers. See also Containers.

  • Dockerfile

    A text document that contains all the commands a user could call on the command line to assemble an image.

  • Dockstore

    An open platform developed by the Cancer Genome Collaboratory and used by the GA4GH for sharing Docker-based tools described with the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL)

  • DOI

    Digital Object Identifier; a code used to permanently and stably identify (usually digital) objects. DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.

  • DUO

    Data Use Ontology - a GA4GH standard for automating access (API) to human genomics data (https://github.com/EBISPOT/DUO)

  • DUOS

    Data Use Oversight System, https://duos.broadinstitute.org/

  • Ecosystem

    A software ecosystem is a collection of processes that execute on a shared platform or across shared protocols to provide flexible services. Example: The "BDC Ecosystem" - inclusive of all platforms and tools

  • EEP

    External Expert Panel. A group of experts who provide guidance and direction to NIH about the program.

  • Epic

    A very large user story which can be broken down into executable stories

    *NHLBI’s cost-monitoring level

  • eRA Commons

    Designated ID provider for whitelist

  • External Expert Panel

    An independent body of experts that inform and advise the work of the BDC Consortium.

  • FAIR

    Findable, Accessible, Interoperable, Reusable.

  • Feature

    A functionality at the system level that fulfills a meaningful stakeholder need

    *Level at which the CC coordinates

  • FireCloud

    Broad Institute secure cloud environment for analytical processing, https://software.broadinstitute.org/firecloud/

  • FISMA moderate environment

    Federal Information Security Modernization Act of 2014, amends the Federal Information Security Management Act of 2002 (FISMA), see https://www.dhs.gov/fisma

  • FS

    Full Stack

  • GA4GH

    Global Alliance for Genomics and Health

  • GA4GH APIs

    The Genomic Data Working Group is a coalition assembled to create interoperability standards for storing and sharing genomic data. The GA4GH Genomics API offers interoperability for exchanging genomic data between various platforms and organizations by sending simple HTTP requests through a JSON-equipped RESTful API.

  • GCP

    Google Cloud Platform

  • GCR

    Governance, Compliance, and Risk

  • Gen3

    Gen3 is an open-source platform, licensed under the Apache License, that you can use for setting up, developing, and operating data commons.

  • GitHub

    An online hub for storing and sharing computer programs and other plain text files. We use it for storage, hosting websites, communication and project management.

  • Gold Master

    A gold master, or GM, is the final version of software or data ready for release to production; a master version from which copies can be made.

  • GWAS

    Genome-wide Association Study

  • HLBS

    Heart, Lung, Blood, Sleep

  • Identity Providers

    A system entity that creates, maintains, and manages identity information for principals while providing authentication services to relying applications within a federation or distributed network; identity providers offer user authentication as a service

  • Interoperability

    The ability of data or tools from multiple resources to effectively integrate data, or operate processes, across all systems with a moderate degree of effort.

  • Instance

    In cloud computing, refers to a virtual server instance from a public or private cloud network.

  • Image

    In the context of containers and Docker, this refers to the resting state of the software.

  • IP

    BDC Implementation Plan; outlines how the various elements from the planning phase of the BDC project will come together to form a concrete, operationalized BDC platform.

  • IRB

    Institutional Review Board; the entity within a research organization that reviews and approves research protocols and clinical research protocols to protect human and animal subjects.

  • IRC

    Informatics Research Core

  • ISA

    Interoperability Service Agreement

  • ITAC

    Information Technology Applications Center

  • Jupyter Notebooks

    A web-based interactive environment for organizing data, performing computation, and visualizing output.

  • Linux

    An open source computer operating system

  • Metadata

    Data about other data

  • Milestone

    Milestones mark specific progress points on the development timeline; they can be invaluable in measuring and monitoring the evolution and risk of a program. © Scaled Agile, Inc.

  • MSD

    Minimum set of documents

  • MVP

    Minimum viable product

  • NHLBI

    National Heart, Lung, and Blood Institute

  • NIH

    National Institutes of Health

  • NIST Moderate controls

    NIST 800-53 - A collection of security controls and assessment procedures that both U.S. Federal and non-governmental organizations can apply to their information systems, policies, and procedures.

  • OTA

    Other Transaction Authority - the mechanism of award that NHLBI chose because it provides a degree of flexibility in the scope of the work that is needed to advance this type of high risk/high reward project

  • PI

    Principal Investigator

  • Platform

    A piece of the BDC ecosystem. Examples: Terra, Gen3, Seven Bridges, etc.

  • PM

    Project Manager

  • PMP

    BDC Project Management Plan; breaks down the implementation of BDC from the perspective of the project managers involved in the project including details on roles, specific milestones, and the project schedule.

  • PO

    Program Officer

  • Portable Format for Biomedical Data (PFB)

    Avro-based serialization format with specific schema to import, export and evolve biomedical data. Specifies metadata and data in one file. Metadata includes data dictionary, ontology references and relations between nodes. Supports versioning, back- and forward compatibility. A binary format.

  • Portfolio for Jira

    Software-as-a-Service project management tool, used to track, roadmap, and visualize various project metrics.

  • Python

    Open source programming language, used extensively in research for data manipulation, analysis, and modeling

  • Quality Assurance

    The planned and systematic activities implemented in quality management so that quality requirements for a product or service satisfy stated goals and expectations.

  • Quality Control

    The operational techniques and activities aimed at monitoring and measuring work processes and eliminating the causes of unsatisfactory outputs.

  • RACI

    Responsible, Accountable, Consulted and Informed; tool that can be used for identifying roles and responsibilities during an organizational change process; BDC RACI

  • Researcher Auth Service (RAS)

    Will be a service provided by NIH's Center for Information Technology to facilitate access to NIH’s open and controlled data assets and repositories in a consistent and user-friendly manner. The RAS initiative is advancing data infrastructure and ecosystem goals defined in the NIH Strategic Plan for Data Science.

  • RFC

    Request for Comment: A process that documents and enables effective interactions between stakeholders to support shared decision making.

  • Risk Register

    A tool used to continuously identify risk, risk response planning and status updates throughout the project lifecycle. This project risk register is the primary risk reporting tool, and is located in the Project Management Plan.

  • SC

    Steering Committee

  • Scientific use case

    Defined in this project as an analysis of data from the designated sources which has relevance and value in the domain of health sciences, probably implementation and software agnostic.

  • SF or SFP

    BDC Strategic Framework [Plan]; defines what the BDC teams have accomplished up to this point, what we plan to accomplish in a timeline fashion, and milestones to track and measure implementation.

  • SFTP

    Secure File Transfer Protocol

  • Software Developers Kit

    A set of software development tools that allows the creation of applications for a certain software package, software framework, hardware platform, computer system, or similar development platform

  • Sprints

    Term of art used in software generation, referring to short, iterative cycles of development, with continuous review of code through daily builds and end-of-sprint demos

  • Stack

    Term of art referring to a suite of services that run in the cloud and enable ubiquitous, convenient, on-demand access to a shared pool of configurable computing resources.

  • Steering Committee

    Responsible for decision-making and communication in BDC.

  • STRIDES

    Science & Technology Research Infrastructure for Discovery, Experimentation, and Sustainability

  • Task

    In workflow writing, the term task encompasses all of the information necessary to execute a command, such as specifying input/output files and parameters.

  • Team

    Groups of people led by a Principal Investigator (PI), or PIs, who will complete milestones and produce deliverables. Each group has been assigned a name, represented by the elements of the periodic table.

  • Tiger Teams

    A diversified group of experts brought together to investigate, solve, build, or recommend possible solutions to unique situations or problems. Populated with mature experts who know what's at stake, what needs to be done, and how to work well with others; their strengths are diversity of knowledge, a single focus or purpose, cross-functional communications, decision-making sovereignty, and organizational agility.

  • Tool

    In Common Workflow Language, the term tool specifies a single command. This definition is not as discrete in other workflow languages such as WDL.

  • Tool Registry Service (TRS)

    The GA4GH Cloud Work Stream has released a standard API for exchanging tools and workflows to analyze, read, and manipulate genomic data. The Tool Registry Service (TRS) API is one of a series of technical standards from the Cloud Work Stream that together allow genomics researchers to bring algorithms to datasets in disparate cloud environments, rather than moving data around.

  • TOPMed

    Trans-Omics for Precision Medicine. One of the primary data sets of the DCPPC.

  • TOPMed DCC

    TOPMed Data Coordinating Center

  • Trans-cloud

    A provider-agnostic multi-cloud deployment architecture.

  • User Narrative

    Descriptions of a user interaction experience within the system from the perspective of a particular persona. User Narratives are further broken down into Features, Epics, and User Stories. Currently formulated into rough 6-month timelines to benchmark progress.

  • User story

    A description of a software feature from a technical/process-oriented perspective; a backlog item that describes a requirement or functionality for a user

    *Finest level of PM Monitoring

  • Variant Call Format (VCF)

    File format for storing gene sequence variations. The format has been developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data such as General feature format (GFF) stored all of the genetic data, much of which is redundant because it will be shared across the genomes. By using the variant call format only the variations need to be stored along with a reference genome. There is also a Genomic VCF (gVCF) extended format, which includes additional information about "blocks" that match the reference and their qualities (from Wikipedia). See http://www.internationalgenome.org/wiki/Analysis/vcf4.0/.

  • VDS

    A composite of complete server hardware, along with the operating system (OS), which is powered by a remote access layer that allows end users to globally access their server via the Internet

  • VPC

    Virtual Private Cloud

  • Whitelist

    A security measure to permit only an approved list of entities. We recommend instead using the term "allow list".

  • Workflow

    A sequence of processes, usually computational in this context, through which a user may analyze data.

  • Workflow Description Language (WDL)

    Way to specify data processing workflows with a human-readable and writeable syntax. Define complex analysis tasks, chain them together in workflows, and parallelize their execution.

  • Workspace

    Areas to work on/with data within a platform. Examples: projects within Seven Bridges

  • Workstream

    A collection of related features; orthogonal to a User Narrative

  • Wrapping

    The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.

  • Virtual Machine (VM)

    An isolated computing environment with its own operating system.

Contributing User Resources to BDC

The BDC user community is essential to advancing science with new and exciting discoveries and informing the development of the ecosystem and its infrastructure. Members of the BDC user community learn how to explore the hosted data, use the services, and employ the ecosystem's tools in exciting and valuable ways that even developers may not know. Therefore, we actively invite user resource contributions to be shared with the community.

Types of Resources

Consider supporting fellow ecosystem users in one of the following ways:

  • Written Documentation: Develop step-by-step guides, FAQs, checklists, and so on. Include screenshots to support user understanding.
  • Videos: Record a shortcut, tip, or process you think would be helpful to other users. Keep videos short by dividing larger processes into smaller segments and recording separate videos for each.

  • Respond to inquiries: Answer questions posed in the BDC Forums. Forum content with significant engagement may get incorporated into written documentation or made into videos.

  • Note: All materials must ensure privacy policy compliance. Make certain to block any patient information on all content and protect study participants' privacy by not including personally identifiable, confidential, sensitive, or personal health information (for example, blur screenshots with data).

    Decide How to Share What You Know

    Experienced users who want to share their tips and tricks should consider the following questions.

    • Did someone already share my tip? Look through the resources already available to users before investing your time and energy into creating a new one.

      • Check the Frequently Asked Questions page on the BDC website

      • View the “Learn” and “Documentation” links available on the BioData Catalyst Services webpage.

      • View the BDC documentation hosted on GitBook.

      • Explore the links to platform-specific documentation, videos, FAQs, community forums, blogs, tutorials, and upcoming events on the BDC website.

      • Check out the videos on the BDC YouTube channel.

    • Which format best suits your resource? Ask yourself, "Would I prefer to watch this on video or have a step-by-step guide to help me?" Then ask yourself which you think other users would prefer. Figuring out which you'd prefer is a great place to start because you are the one who identified the tip. But remember that you are creating something to help other people whose preferences will determine whether a resource gets used.

    • Is my tip complex, or does it require several steps? If so, a written how-to guide will probably be easier to follow than a video because someone watching a video may need to stop and restart it often. Still, visual aids will be helpful, so consider using screenshots in your how-to guide.

    • Is the guidance I want to share relatively straightforward, but it requires clicking through several pages/places? If so, a short video could be the best way to share your tip. Finding buttons or links can be much easier if shown rather than described.

    • If I create a video and make sure to go slowly enough that someone can follow along, will it be longer than 15 minutes? If so, creating a video may not be the right format, or breaking down the content into shorter (more digestible) videos may be preferable.

    • Am I comfortable following the BDC Video Content Guidance? If not, please create written documentation (e.g., a how-to guide).

    • Do I want to provide help in almost-real-time without needing to formally draft a document or record a video? Visit the BDC Community Forum often to provide answers to questions posed by other users or even just post your tip.

    Creating and Sharing Your Contribution

    Once you decide upon the best way to share what you learned, you'll need to create your contribution and then share it.

    • For a quick tip that you want to distribute swiftly, draft something short that you can easily post to the Community Forum. The following is an example of a quick tip for using PIC-SURE’s Data Access Table:

      • In PIC-SURE, did you know you can use the search bar in the Data Access Table to find studies? Instead of scrolling through the table and looking at the list of available studies manually, you can search for studies. An example could be “MESA” for a specific study name, or a phenotype like “Sickle Cell” to find all sickle cell related studies. It seems obvious, but I’m not sure how many other users are aware of this, and I found it really helpful!

    • For Written Documentation, draft your suggestions and include screenshots to help lead users through the process you describe. Once complete, submit the file to BDCatalystOutreach@nih.gov for review and posting to the BDC GitBook. Note that we accept Google Doc (preferably shared with at least "suggesting edits" status) and Microsoft Word formats; PDFs are not accepted.

    • For videos, review the User-Generated Videos portion of the BDC Video Content Guidance. By submitting a video, you agree to those conditions. Once your video is uploaded to your YouTube channel, email the link to BDCatalystOutreach@nih.gov for consideration to be linked to the BDC YouTube channel as well.

    Finding User-Generated User Resources

    • Forum messages will post directly in the community forums.

    • Written documentation will live in the BDC Documentation, hosted in GitBook.

    • User-generated videos will be linked in the BDC YouTube Channel.

    BDC Video Content Guidance

    Overview

    BDC recognizes the importance of multimedia resources for ecosystem users, particularly audio/visual recordings. This document provides guidelines on the program's video content approach. Using these guidelines will ensure users get optimized video experiences, from consistent branding that offers insights into the sources of the videos to best practices in video creation that support learning.

    Overview of BDC Videos

    To share video content - from the consortium, platforms, and users, as described in the following sections - BDC created a YouTube channel: https://www.youtube.com/channel/UCGkmY5oNK8uFZzT8vV_9KgQ

    The BioData Catalyst Coordinating Center (BDC3) has authority (with direction from the NHLBI) to post (or not post), remove, edit, and otherwise change video content on this channel with or without permission from or notice to video creators, owners, or sharers. Feedback about videos on the BDC YouTube channel should be sent to BDCatalystOutreach@nih.gov.

    Categories and Organization of Videos

    The BDC YouTube Channel hosts three categories of videos based on their sources and/or approval statuses:

    • Consortium-produced / Consortium-approved

    • Platform-generated

    • User-generated

    Learn more about each video category below. Note that each category has its own set of standards that must be adhered to when creating and publishing video content, whether the final outlet is the BDC YouTube channel or another channel.

    BDC3 is responsible for organizing videos on the BDC YouTube channel, grouping them into playlists it believes will be most beneficial to ecosystem community members. Playlists may include videos from any or all categories of videos. Viewers can determine the category of a video based on the branding (or non-branding) that appears. The additional information about each video category includes video standards that direct video creators on branding for each category of videos.

    Consortium-produced / Consortium-approved Videos

    Videos in this category are produced by BDC3, or are produced by Platforms or Users that receive approval from the BDC Consortium (select organizations developing and maintaining the ecosystem). These videos contain pre-approved opening and closing BDC animations and sound.

    Consortium-produced / Consortium-approved Video Standards

    Videos produced by the Consortium, or by Platforms or Users that submit for approval for recognition as a Consortium-approved video, must adhere to the following standards:

    Comply with all requirements and, when possible, follow all best practices outlined in Addendum A: Consortium-produced / Consortium-approved Videos Best Practices. Platforms and users generating videos who wish to submit them for recognition as Consortium-approved must complete the BDC Consortium Video Submission Pre-Approval Application. Submit the form BEFORE producing the video to improve the likelihood that the video receives Consortium approval.

    Platform-generated Videos

    Videos in this category are produced by one of the BDC platforms to support users' understanding of their platform. These videos are not vetted by BDC3, BDC3 Consortium members, or representatives of other BDC platforms. These videos must open with the creator's platform "Powered by" logo (downloadable from the BDC3 internal consortium website).

    Platform-generated Video Standards

    Unless a Platform plans to seek Consortium-approval status for a video, platforms should use the following standards in the production and posting of their platform-generated videos:

    • Producers of Platform-generated videos, like all BDC ecosystem users, are always obligated to protect participant privacy and must follow NIH policies for data protection. Platforms are accountable and may be subject to sanctions if policies are violated.

    • Only produce videos that provide information specific to the Platform's BDC instance.

    • Use the Platform's Powered by logo (and only the Powered by logo) for the YouTube thumbnail image.

    • Videos should open with the following information: “In this video we will [discuss/cover/explore] BioData Catalyst Powered by [platform name] and [task/example].”

    • The YouTube description should include the following language: “This is a BioData Catalyst platform-generated video to support ecosystem users' understanding of the BioData Catalyst Powered by [platform name].” It should also include the link to the NHLBI BioData Catalyst homepage: https://biodatacatalyst.nhlbi.nih.gov/

    • Videos should be uploaded using YouTube's auto-generated captions to support 508 compliance.

    • Once the video is uploaded, email the link to BDCatalystOutreach@nih.gov so BDC3 can make it visible on the BioData Catalyst YouTube channel.

    Important Notes

    • Only videos offering information specific to the use of ecosystem Platform instances will be shared on the BDC YouTube channel. Videos that support the use of Platforms but are not specific to BDC instances may be linked from the ecosystem documentation but will not appear on the BioData Catalyst YouTube channel.

    • Platform-generated videos that do not follow the above standards will not be made visible on the BDC YouTube channel.

    User-generated Videos

    These videos are neither approved nor vetted by BDC, the BDC Consortium, BDC Platforms, or the organizations they represent. The opinions and other content in these videos are those of the video creators and sharers alone. These videos may NOT open or close with BDC branding and may only display BDC branding when capturing images of properties where it already appears (i.e., a screencap of an ecosystem platform instance).

    User-generated Video Standards

    BDC offers user-generated video tutorials and guides. Unless a user plans to seek Consortium-approval status for a video, BDC requires the following for user-generated videos, their creators, and their sharers:

    • Producers of user-generated videos, like all BDC ecosystem users, are always obligated to protect participant privacy and must follow NIH policies for data protection. User institutions are accountable and may be subject to sanctions if policies are violated.

    • By submitting a video for inclusion, users are attesting that the content of the video follows NIH policies for data protection, agreeing to follow this guidance, and committing to the inclusion of the following statement in video descriptions:

      “This is a user-generated video and is neither approved nor vetted by NHLBI BioData Catalyst (BDC), the members of the BDC Consortium, or the organizations they represent. For more information about BDC, go to https://biodatacatalyst.nhlbi.nih.gov/. For more BDC videos, go to https://www.youtube.com/channel/UCGkmY5oNK8uFZzT8vV_9KgQ. #BioDataCatalyst”

    • To share a video, please contact: BDCatalystOutreach@nih.gov

    Important Notes

    • User-generated videos that do not follow the above standards will not be made visible on the BioData Catalyst YouTube channel.

    • User-generated videos are just one type of user-contributed content BDC seeks to share. To learn about other kinds of user-generated content BDC seeks, read Contributing User Resources to BDC.

    Addendum A: Consortium-produced / Consortium-approved Videos Best Practices

    Consortium-produced/Consortium-approved videos must adhere to this addendum. While not required of BDC Platforms and users, BDC encourages them to consider these best practices for the videos they produce.

    Gaining Approval: Submitting Your Idea

    • Consider if the video is fulfilling a need/gap (Required): Ensure the video isn't replicating information already available to users.

    • Complete & submit for pre-approval (Required): Pre-approval is required to ensure relevance & consistency.

    Planning the video: Considerations before recording

    • Outline the video (Best practice): Consider how info can be presented in a concise & useful manner.

    • Avoid having too much text on slides (Best practice): Slides should be concise; keep text & bullets at a minimum; use images when possible, as viewers respond to images more positively than text.

    Shooting the video: Best practices

    • Use clear language & explain jargon (Best practice): Simple communications are preferred; many viewers may not speak English as a first language.

    Policy compliance: Federal regulations & BDC3 best practices

    • Ensure Section 508 compliance (Required): Subtitles & transcripts are required to ensure equity in access for people with disabilities.

    • Ensure privacy policy compliance (Required): Protect study participants' privacy by not including personally identifiable, confidential, sensitive, or personal health information (e.g., blur screenshots with data).

    • Required: For people with disabilities, readability can be essential to a successful user experience.

    Technical aspects: Steps after shooting

    • Best practice: Search for meaningful keywords for titles, descriptions & tags.

    • Create a meaningful title (Required): The title should be under 66 characters to make it easier for Google to display; make the title engaging & descriptive.

    • Required: Think about the action the user is trying to take & the keywords they might use to find your video.

    Publishing & promoting: Publicizing & sharing video

    • Share completed videos with BDC3 (Required): Email BDC3 with info on accessing the video, a thumbnail image, descriptive tags to include, and the video description.

    • BDC3 sets appropriate privacy settings according to policy with input from the video creator (If approved): Videos can be Public, Unlisted (link needed), or Private (invite needed; most secure).

    • BDC3 uploads to the YouTube channel & adds to relevant playlists (If approved): Videos can be in multiple playlists but don't need to be in any playlists.

    Library maintenance: Keeping an up-to-date catalog

    • BDC3 will prompt teams annually to check videos to ensure continued relevance (Required): Outdated videos could cause viewers to lose confidence in the accuracy of info available on the channel.

Request for Comments

NHLBI DICOM Medical Image De-Identification Baseline Protocol

BDC-RFC-#: 28
Title: DICOM Medical Image De-Identification Baseline Protocol
Type: Process
Contact Name and Email: Keyvan Farahani,
Submitting Teams: NHLBI, DMC
Date Sent to Consortium: Oct. 11, 2023
Status: Closed for comment
URL Link to this Google Document:
License: This work is licensed under a CC-BY-4.0 license.

Medical Image De-Identification: BDC Baseline Protocol

Contributors:

  • Zixin Nie (BDC Data Management Core)

  • Keyvan Farahani (NHLBI)

  • David Clunie (PixelMed Publishing)

    Why image de-identification?

    De-identification of protected health information (PHI) is often a necessary procedure to undertake in order to share potentially sensitive information, such as health data. Many data repositories that allow human data to be deposited and shared require the data to be de-identified. Medical images and their associated metadata (i.e., DICOM headers) often contain PHI, such as patient names, dates of birth, or medical record numbers. The de-identification of these images is essential to minimize privacy risk and comply with regulations and standards that require the protection of PHI. The overarching goal in medical image de-identification is to reduce the risk of identification as much as possible.

    De-identification facilitates the sharing of medical imaging data, enabling greater access by researchers and the public and allowing for secondary research to be conducted. Several standards exist for de-identification of medical images, including the confidentiality profile detailed in the DICOM Part 15 standard, HIPAA Safe Harbor and Expert Determination. The BioData Catalyst Data Management Core (BDC DMC) performed an evaluation of these standards and used them to create the protocol detailed in this document. This document describes the de-identification processes and technical considerations for de-identifying medical images as they are being added to BDC and made available to researchers using the BDC platform. The protocol, referred to as the “BDC Baseline Protocol for Image De-identification,” takes into account the data use cases for researchers accessing the BDC platform by defining a de-identification profile that strikes a balance between privacy protection and preserving utility.

    The Baseline protocol only applies to the metadata in radiologic (DICOM) images (see table below). It does not apply to image pixel information, other imaging formats, or other types of data that may be imported into BDC, such as clinical and omics data. It reflects the understanding of the de-identification needs of BDC as of October 2023. Future RFCs are planned that will address masking of unique identifiers, the details of how imaging pixel data will be de-identified, the de-identification process workflow, and quality management.

    Major medical imaging modalities and their conventional formats:

    • Radiologic (X-ray, PET/CT, MRI, ultrasound): DICOM (Digital Imaging and Communication in Medicine)

    • Cardiac ECG: XML

    • Digital Pathology: Proprietary TIFF and DICOM Pathology

    The focus of this RFC is on de-identification of DICOM images.

    The Baseline De-Identification Protocol

    The de-identification protocol described in this section is intended to be a baseline for de-identification within BDC. The protocol is compliant with regulations such as the HIPAA Privacy Rule and the Common Rule, while retaining the maximal amount of research utility possible. It is designed based on the experiences from the HeartShare imaging pilot project. The protocol will evolve over time, with future iterations to address new issues as they arise, and customizations to address specific research use cases. These may involve Expert Determinations, which can both increase privacy protections and improve research utility. This protocol is to be used for all medical imaging data to be submitted to the BDC. The protocol may be implemented in an image de-identification tool at the submitter’s site, or in a central BDC-related data curation service. Any deviation from this protocol must be discussed with and approved by the BDC/DMC. The baseline de-identification protocol can be found at this link: DICOM_deid_part_15_classified_09_26_2024_Baseline.xlsx.

    Introduction to HIPAA Safe Harbor and DICOM Part 15

    De-identification of DICOM data can be performed according to different standards. Two commonly accepted standards are HIPAA Safe Harbor and Normative E Attribute Confidentiality Profiles defined in part 15 of the DICOM standard (referred to in the rest of this document as the DICOM Part 15 Standard).

    HIPAA Safe Harbor de-identification calls for the removal of 18 types of identifiers (detailed here: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#standard). The standard legally applies to PHI handled by HIPAA Covered Entities; however, as it has been in use for over 20 years, it is generally accepted as a standard for de-identification for other types of data as well.

    The DICOM Part 15 Standard was developed through a careful review of all DICOM attributes, identifying any that had the possibility of containing identifying information and creating a mitigation strategy. It is more extensive than HIPAA Safe Harbor, covering attributes that are not part of the 18 prescribed types of identifiers such as ethnicity and biological sex. Various mitigation strategies are presented to treat the attributes detailed as part of the standard, with the Basic DICOM Part 15 Confidentiality Profile being the most conservative, calling for suppression of most of the attributes.

    De-Identification of DICOM Header Data

    In order to have de-identified data that still possesses analytic utility for BDC researchers, while also being a standardized implementation of de-identification that can be applied across most data to be ingested by BDC, an evaluation was performed to produce a set of de-identification rules that can be applied to DICOM header attributes. The evaluation leveraged the de-identification profiles detailed in the DICOM Part 15 standard by evaluating its contents and aligning with the minimum requirements to comply with HIPAA Safe Harbor. The resulting de-identification strategy should be sufficient to construct a de-identification profile that can be applied across all DICOM headers.

    The steps for performing this evaluation were as follows:

1. Attributes from each profile were classified into the following categories: Direct Identifier (DI), Quasi-Identifier (QI), and Non-Identifier (NI), according to the classification framework detailed in the following diagram:

2. After classification, DIs and QIs were then aligned with the 18 types of identifiers specified for removal within the HIPAA Safe Harbor provision.

3. Each of the attributes that aligns with one of the HIPAA Safe Harbor identifiers was then assigned a mitigation technique to remove the identifying information that could appear in the field.
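    To make the outcome of this evaluation concrete, the fragment below sketches how such classification and mitigation rules might be represented in software. It is illustrative only: the attribute keywords are standard DICOM keywords, but the classifications and actions shown here are examples, not the authoritative contents of the baseline protocol spreadsheet.

        # Illustrative only: one possible in-software representation of the
        # classification (DI/QI/NI) and mitigation rules. The authoritative
        # rules live in the baseline protocol spreadsheet referenced above.
        DEID_RULES = {
            "PatientName":      {"class": "DI", "action": "suppress"},
            "PatientBirthDate": {"class": "QI", "action": "keep_year_only"},
            "StudyInstanceUID": {"class": "QI", "action": "mask"},
            "ImageComments":    {"class": "QI", "action": "suppress"},
            "Modality":         {"class": "NI", "action": "keep"},
        }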

    Of the attributes within the DICOM Part 15 standard that must be removed for compliance with HIPAA Safe Harbor, there are:

    • 4 name attributes

    • 4 patient address attributes

    • 122 date attributes

    • 5 telephone number attributes

    • 91 other unique ID attributes

Names, addresses, and telephone numbers should be suppressed from the data. Dates can be kept accurate to the year (a future BDC medical image de-identification RFC will address improving this approach for longitudinally acquired imaging studies). The other unique IDs can either be suppressed or masked so that their original values cannot be re-obtained; the specifics of the masking procedures will come in a separate RFC. Additionally, there are 26 attributes that contain various forms of free text, such as comments, notes, labels, and text strings. Identifying information may be written in these attributes, so they should be suppressed to prevent the leakage of identifying information.
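    As a rough illustration of these header mitigations, the sketch below uses the open-source pydicom library. It is a minimal sketch, not the BDC de-identification tooling: it assumes single-valued date elements, and the handful of attribute keywords shown are examples; the authoritative attribute list is the baseline protocol spreadsheet.

        # Minimal sketch of the baseline header mitigations described above,
        # using pydicom (not the official BDC tooling). Attribute lists are
        # illustrative examples only.
        import pydicom

        SUPPRESS = ["PatientName", "PatientAddress", "PatientTelephoneNumbers",
                    "ReferringPhysicianName", "InstitutionAddress"]
        FREE_TEXT = ["ImageComments", "StudyComments", "PatientComments"]

        def deidentify_header(ds: pydicom.Dataset) -> pydicom.Dataset:
            # Suppress names, addresses, and telephone numbers.
            for keyword in SUPPRESS:
                if keyword in ds:
                    ds.data_element(keyword).value = ""
            # Suppress free-text fields that may carry identifying information.
            for keyword in FREE_TEXT:
                if keyword in ds:
                    delattr(ds, keyword)
            # Keep dates accurate only to the year (January 1 of that year);
            # simplified: assumes single-valued DA (date) elements.
            for elem in ds.iterall():
                if elem.VR == "DA" and elem.value:
                    elem.value = str(elem.value)[:4] + "0101"
            return ds

        # Usage: ds = pydicom.dcmread("image.dcm"); deidentify_header(ds); ds.save_as("deid.dcm")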

    The other attributes detailed in the DICOM Part 15 standard do not necessarily require mitigation for compliance with HIPAA Safe Harbor. However, if they do not have analytic usage, it is recommended to mitigate them according to the specifications detailed in the DICOM Part 15 standard in order to decrease the risk of re-identification represented by indirectly identifying fields not mentioned in HIPAA Safe Harbor.

    De-Identification of Image Pixel Data

    Image pixel data, often encountered in ultrasound (echo) imaging, can contain PHI, such as patient names, dates of birth, and the hospital or imaging center names. This information can be shown either in labels on images, which usually have pre-specified areas, or in the form of burned-in text, which can appear anywhere on the image. Any identifying information contained within pixel data should be removed before it is made available to researchers.

    Methods for removal of image pixel data include the following:

    • Masking through opaque boxes over parts of the image

• AI-assisted removal of identifying information, deploying optical character recognition (OCR)

    • Deletion of images from the dataset that contain identifying information

Image pixel de-identification will be performed as a service using existing third-party tools provided by DMC contractors. After de-identification, images will still require review to ensure that the process captured and removed all identifying information on the images. This is a necessary quality control step to ensure that there is no leakage of identifying information.

    De-Identification of Filenames and File Paths

    Metadata associated with images, such as filenames and file paths, can often include unique IDs and dates of medical events. This information is important to associate imaging data correctly with other types of data for linkage, processing, and analysis, however it can also present a risk of leakage of identifying information on de-identified data files. To prevent that from happening, the following rules should be followed:

    1. Folder names should only include the study name and associated visit number, and no further information

      1. e.g., for the first visit of the MESA study, the folder name should be called MESA_V1

2. Image filenames are to be set to the following format: STUDYNAME_TYPE_VISITNN_YYYYMMDD_SEQ (a helper sketch follows this list)

  1. VISITNN: "VISIT" + VisitNumber (the label "VISIT" is included specifically to inform the investigator what the number refers to)

  2. YYYYMMDD: AcquisitionDate set to January 1 of the acquisition year (YYYY0101), where YYYY is the year of acquisition

  3. SEQ: sequence number to ensure the filename is unique

  4. e.g., MESA_ECG_VISIT05_20220101_999.xml
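    A minimal sketch of a helper that produces filenames in this format; the zero-padding widths and the extension parameter are assumptions based on the example above:

        # Hypothetical helper that builds a filename following the convention above.
        def image_filename(study: str, ftype: str, visit: int, year: int,
                           seq: int, ext: str) -> str:
            # VISIT number padded to two digits, date fixed to January 1 of the
            # acquisition year, sequence number padded to three digits.
            return f"{study}_{ftype}_VISIT{visit:02d}_{year}0101_{seq:03d}.{ext}"

        # image_filename("MESA", "ECG", 5, 2022, 999, "xml")
        # -> "MESA_ECG_VISIT05_20220101_999.xml"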

    Risk Mitigation

    The risks presented by using the de-identification methods detailed in this RFC are as follows:

1. HIPAA Safe Harbor, while being an accepted standard for de-identification, does not cover all potential identifiers (leaving out potentially identifying attributes such as race, employment, diagnoses, procedures, and treatments). Data de-identified under HIPAA Safe Harbor therefore holds a residual risk of re-identification.

    2. Automated imaging de-identification solutions are not 100% accurate, leaving the potential for small amounts of identifying information to be retained.

Data made available through BDC is provided for research purposes to investigators who should not have ulterior motives to perform re-identification. HIPAA Safe Harbor represents a standard that has been in use for over 20 years, so the risks presented by using that standard are well understood and accepted by BDC. The risk presented by leakage of identifying information from imaging data can be mitigated through human review of de-identified images to ensure that all identifying information has been removed.

In the event that PHI is discovered in de-identified imaging data in BDC, such data shall be pulled offline and checked, and the offending PHI removed, before the data are posted again on BDC. In such cases, the data submitter shall be informed of the incident.

    Local vs. Cloud-based Image De-Identification

Depending on the capabilities of the de-identification tool and the legal and logistical requirements for access to original identifiable images, de-identification may be done locally at the data-generating site or through a central cloud-based service. Although the latter is often more efficient (semi-automated and scalable), the transfer of identifiable (PHI-containing) images to a central cloud may require agreements between the data provider (submitter) and the de-identification service provider, stipulated through execution of a Data Transfer Agreement (DTA). Details of the image de-identification process that will be used will be provided in a future RFC.


    Search and Results

1. Navigate to https://biodatacatalyst.nhlbi.nih.gov/use-bdc/explore-data/dug/ to access Dug Semantic Search.

    2. Semantic search is a concept-based search engine designed for users to search biomedical concepts, such as “asthma,” “lung,” or “fever,” and the variables related to and/or used to measure them. For example, a search for “chronic pain acceptance” will return a list of related biomedical concepts, such as chronic pain, headaches, neuralgia, or fibromyalgia, each of which can be expanded to display related variables and CDEs. Semantic search can also find variable names and descriptions directly, using synonyms from its knowledge graphs to find search-related variables.

    3. Enter a search term and press “Enter,” or click on the Search button. This will take you to the Semantic Search interface.

    Dug Semantic Search

    Step-by-step guidance on using Dug Semantic Search: efficiently and effectively perform and interpret a search using Dug.

    Overview

Dug Semantic Search is a tool that allows users to deep dive into BDC studies and biomedical topics, research, and publications to identify related studies, datasets, and variables. If you are interested in how Dug connects study variables to biomedical concepts, read the Dug paper or visit the Help Portal.

    This tool applies semantic web and knowledge graph techniques to improve BDC research data Findability, Access, Interoperability, and Reusability (FAIR). Through this process, semantic search helps users identify novel relations, build unique research questions, and identify potential collaborations.

    Understanding Access

    This checklist is intended to help new users understand their obligations regarding access and permissions, which individual users are responsible for obtaining and maintaining.

    About BDC Access: eRA Commons Account

Users log into BDC platforms with their eRA Commons credentials. For more information, see Ecosystem Access, Hosted Data, and System Services.

Users create an eRA Commons Account through their institution's Office of Sponsored Research or equivalent. For more information, refer to Understanding eRA Commons Accounts.

    Data Interoperability

    How to access additional data stacks

    GTEx Data

The Genotype-Tissue Expression (GTEx) Program is a widely used data resource and tissue bank to study the relationship between genetic variants (inherited changes in DNA sequence) and gene expression (how genes are turned on and off) in multiple human tissues and across individuals. For information on access to GTEx data, refer to GTEx v8 - Free Egress Instructions in the AnVIL documentation.

    Checking Access

You can check your access to data on BDC using the BDC website or on a specific platform.

    BDC Website

Go to the Explore BDC Data page of the BDC website. Under the section "Requirements for Accessing BDC Hosted Data," click Check My Access.

    NHLBI BioData Catalyst Ecosystem Security Statement

BDCatalyst-RFC-#: 11
BDCatalyst-RFC-Title: NHLBI BioData Catalyst Ecosystem Security Statement
BDCatalyst-RFC-Type: Consensus Building
Name of the person who is to be Point of Contact: Sarah Davis
Email of the person who is to be Point of Contact: sdavis@renci.org
Submitting Team: BDC3/NHLBI
Requested BDCatalyst-RFC posting start date: 6/14/2021
Date Emailed for consideration: 6/14/2021
BDCatalyst-RFC-Status: Comment only
URL Link to this document: https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/community/request-for-comments/nhlbi-biodata-catalyst-ecosystem-security-statement
URL Link to the website: https://www.biodatacatalyst.org/collaboration/rfcs/bdcatalyst-rfc-11/
License: This work is licensed under a CC-BY-4.0 license.

    Submitting a dbGaP Data Access Request

    Users who want to access a hosted controlled study on the ecosystem must be approved for access to that study in dbGaP

    Requirements

    • An NIH eRA Commons ID (or appropriate NIH Login) is required for submitting a Data Access Request (DAR). If you do not have an eRA Commons account, you must request one through your institution’s Office of Sponsored Research or equivalent. For more information, refer to on the eRA website.


Guideline | Required / Best practice | Description
Use appropriate branding according to the BDC Style Guide | Required | Required to create a unified look across the BioData Catalyst ecosystem. Work with your BDC3 contact to get a copy of the style guide.
Edit automatic transcription | Required | Transcription is free but likely needs editing; you can make changes to the text & timestamps of your captions.
Create cards for interaction | Best practice | Cards are clickable calls to action that take viewers to another video, channel, or site.
Create end screens for marketing | Best practice | End screens can be added to a video's last 5-20 seconds to promote other videos, encourage viewers to subscribe, etc.
Divide into chapters & create Table of Contents | Best practice | Break up videos into sections (each with an individual preview) to provide more info & context; eases re-playing certain sections.
Create thumbnail | Required | A clear & colorful video thumbnail will catch viewers' attention & let them see a quick snapshot of your video as they're browsing.
Create meaningful tags, including the required #BioDataCatalyst tag | Required | Tags are descriptive keywords you can add to your video to help viewers find your content; include at least 10 tags.
Add links to BDC | Best practice | Where possible, provide links to relevant parts of the BDC ecosystem.
Teams and BDC3 develop plans to promote the video, if appropriate | Best practice | Potential options include Facebook, Instagram, LinkedIn, Snapchat, Twitter, Vimeo, WeChat, Pinterest, Flipgrid, etc.

    About Data Access: dbGaP

Users who want to access a hosted controlled study on the BDC ecosystem must be approved for access to that study in the NIH Database of Genotypes and Phenotypes (dbGaP). For more information, see the BioData Catalyst FAQs. Note that obtaining these approvals can be a time-intensive process; failure to obtain them in a timely manner may delay data access.

    Users have two options for obtaining dbGaP approval depending on whether they already are affiliated with a PI who has dbGaP access to the relevant data:

1. The BDC user has no affiliation with an existing dbGaP-approved project. In this case the user needs to create their own dbGaP project and then submit a data access request (DAR) for approval by the NHLBI Data Access Committee (DAC). This process often takes 2-6 months, depending on whether local IRB approval is required for the dataset the user is requesting, the amount of time local review of the dbGaP application by the user's home institution takes, and processing by the DAR committee. See the dbGaP Authorized Access Portal or dbGaP Overview: Requesting Controlled-Access Data. Once a DAR is approved, it can take a week or longer for the approval to be reflected on BDC.

2. The BDC user is affiliated with an existing principal investigator who already has an approved dbGaP application with existing DAR access (for example, the BDC user is a post-doctoral fellow in a PI's lab). A principal investigator with dbGaP DAR access assigns the user as a "Downloader" in dbGaP. See Assign Downloaders for dbGaP Data. It can take about 24 hours for "Downloader" approval to be reflected on BDC.

    Notes

DARs must be renewed annually to maintain your data access permissions. If your permissions expire, you may lose access to hosted data in BDC during the renewal process.

    A Cloud Use Statement may be required as part of the DAR.

    TOPMed

    BDC hosts data from the NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium. BDC users are not automatically onboarded as TOPMed investigators. BDC users who are not members of the TOPMed Consortium may apply for released data through the regular dbGaP Data Access Request process.

    When conducting TOPMed-related research on BDC, members of the TOPMed consortium must follow the TOPMed Publications Policy and associated processes; for example, operating within Working Groups.

    For more information, refer to the following resources:

    • Information on joining TOPMed

    • TOPMed website

    • TOPMed FAQs (login required)

    • BioData Catalyst FAQs

    IRB

Users must ensure that IRB approvals and data use agreements (DUAs) are obtained and maintained, as they are enforced by the BDC ecosystem.

    BDC

    Refer to the BDC Data Protection page to learn more about topics such as data privacy, access controls, and restrictions.

    Use your eRA Commons account to review the data indexed by BDC to which you have access on the Explore BioData Catalyst Data page. For more information, see Checking Access.

    If your data is not indexed, inform BDC team members during your onboarding meetings or by submitting a Help Desk ticket.

    NCPI Data Portal

    The NIH Cloud Platform Interoperability Effort (NCPI) is currently working to establish and implement guidelines and technical standards to empower end-user analyses across participating cloud platforms and facilitate the realization of a trans-NIH, federated data ecosystem. Participating institutions include BDC, AnVIL, Cancer Research Data Commons, and Kids First Data Resource Center. Learn what data is currently hosted by these platforms by using the NCPI Data Portal.


    Explore Available Data

    Getting Started

    BDC Powered by Gen3 (BDC-Gen3)

    Go to BioData Catalyst Powered by Gen3, select NIH Login, then log in using your NIH credentials. Once logged in, select the Exploration tab. From the Data Access panel on the left, make sure Data with Access is selected. Note whether you have access to all the datasets you expect.

    Checking data access in BDC-Gen3

Data Access

Parameter | Description
Data with Access (default) | A user can view all of the summary data and associated study information for studies the user has access to, including but not limited to Project ID, file types, and clinical variables.
Data without Access | Displays studies for which you do not have subject-level access but for which summary statistics can be accessed. Locks next to the project ID signify that you do not have subject-level access; you can still search through the available studies but only view summary statistics. Projects will also be hidden if the selected cohort contains fewer than 50 subjects ("You may only view summary information for this project," example below); in this case grayed-out boxes and locks both appear, and an additional lock means users have no access.
All Data | Displays all projects, including those you have no access to. A lock will appear next to data you cannot access, as demonstrated below.

    BDC Powered by PIC-SURE (BDC-PIC-SURE)

    You do not need to check your data access on BDC-PIC-SURE. Instead, refer to the Accessing BioData Catalyst Data page, then click Check My Access.

    BDC Powered by Seven Bridges (BDC-Seven Bridges)

    1. Log into BDC-Seven Bridges.

    2. Click your username in the upper right, then select Account settings.

    3. From the upper left, select the tab for Dataset Access.

    4. Browse the datasets and note whether you have access to all the datasets you expect.

      • Datasets you have access to will have green check marks.

• Datasets you do not have access to will be marked in red.

    BDC Powered by Terra (BDC-Terra)

    You do not need to check your data access on BDC-Terra. But before submitting a help desk ticket, ensure that you’ve done the following steps:

    Establish a link in BioData Catalyst powered by Terra to your eRA Commons/NIH Account and the University of Chicago DCP Framework. To link eRA Commons, NIH, and DCP Framework Services, go to your Profile page in BDC-Terra and log in with your NIH credentials.

    Screenshot of NIH account credentials

If you still have issues using particular files or datasets in analyses on BDC-Terra, submit a request to our help desk.

    Overview

    The purpose of this RFC is to provide the NHLBI BioData Catalyst Consortium and users of the NHLBI BioData Catalyst ecosystem with a clear statement on security mechanisms of the ecosystem that protect the confidentiality, integrity, provenance, and availability of the hosted data as well as any data that may be uploaded using the ecosystem’s “Bring Your Own Data” (BYOD) functionality.

    Figure 1. The NHLBI BioData Catalyst ecosystem leverages separately developed and managed platforms to maximize flexibility for users based on their research needs, expertise, and backgrounds. Utilizing multiple Authorizations to Operate (ATO), these platforms combine to provide secure, cloud-based workspaces, user authentication and authorization, search, tools and workflows, applications, and new innovative features to address community needs.

    NHLBI BioData Catalyst Ecosystem Security Statement

The NHLBI and the BioData Catalyst Consortium recognize the importance of protecting both the privacy and security of the data and respecting the consent of the study participants whose data is stored within the BioData Catalyst ecosystem. Tackling these issues produces some challenges beyond those faced by most Federal Information Systems. The BioData Catalyst Consortium has implemented many innovative approaches to enable compliance and ensure that users understand their responsibility to protect data as articulated in specific Data Use Agreements (DUA). These approaches and controls work to protect the confidentiality, integrity and availability of the data; the privacy of the study participants who have contributed data; and data that may be uploaded to BioData Catalyst using the ecosystem’s “Bring Your Own Data” (BYOD) functionality. While the same general security controls are applied to both system and BYOD data, BYOD data is further protected as the ecosystem provides access only to the data’s uploaders and their designated collaborators.

    From a Federal Information Security Modernization Act (FISMA) perspective, the BioData Catalyst ecosystem is a set of software systems with distinct security boundaries. Each system owner holds an Authority to Operate (ATO) issued by the NIH. The ATO is the result of a rigorous Security Assessment and Authorization (SA&A) process and third party assessment consistent with guidance from the National Institute of Standards and Technology (NIST). The ecosystem operates via a set of Interconnection Security Agreements (ISA) (Reindl 1979) and utilizes several existing components of security infrastructure (Bridges 2017, Gutiérrez-Sacristán et al. 2018) developed for other NIH platforms. Where the documentation provided as part of the SA&A process describes how security controls are implemented based on the NIST Special Publication 800-53r4 (see Endnote), the ISAs describe the permitted exchange of data and establish ecosystem-wide incident response, logging and auditing expectations that enable the consortium to respond in a unified manner to any suspected cybersecurity incident. The SA&A documentation provides for regular evaluation of the security of the component systems including regular scanning for vulnerabilities and the conduct of an annual penetration test. This level of security represents a baseline, and the BioData Catalyst ecosystem will extend protections over time.

Where the processes, policies, and technical controls protect confidentiality, integrity, and availability of data in accordance with Federal statute and regulation, there are additional ways to ensure that data is used in a manner consistent with study participants’ wishes, as represented by the consent form participants sign when enrolling in a specific study. Respect for these consents is critical to maintaining the public’s trust and requires additional policy, process, and technical controls. The respect for consent in NHLBI BioData Catalyst is enforced using normative NIH policies and processes for data sharing and using the existing infrastructure provided by the National Center for Biotechnology Information’s (NCBI) Database of Genotypes and Phenotypes (dbGaP). All NHLBI-provided data within the NHLBI BioData Catalyst ecosystem are registered in dbGaP; in this process, data are assigned “consent groups” that describe in a machine-readable format the parameters of the consent for the data. These range from the most expansive “General Research Use” to more restrictive, such as only allowing for research outcomes related to “Health/Medical/Biomedical” topics or even to specific diseases, such as Chronic Obstructive Pulmonary Disease. Further, while secondary analysis of data is not considered human subjects research as described in the Common Rule (45 CFR Part 46), some datasets require the review of a research proposal by an Institutional Review Board (IRB) or a Letter of Collaboration (LOC) with the originating study Principal Investigator, as determined by the informed consents or special considerations that the submitting institution has determined are needed. These measures provide additional protection for datasets with particular sensitivity or special criteria for use.

For instance, IRB review is required when 1) informed consents that were signed by the study participants state that IRB oversight for secondary use of the data is required, and/or 2) the study IRB of record determines that the data may contain sensitive information that requires IRB oversight for secondary research. For collaboration letters, the informed consents indicate that researchers outside the study making secondary use of the data must work with the study, and therefore formal collaborations need to be put in place. While these are rarely used, they provide additional protection under special circumstances, such as where an indigenous population or sovereign nation requires direct control of how their data is used. Because consent is expressed at the individual level, there may be a variety of consents for a study, either because the study offered choices to its participants or because the study consent evolved over an extended longitudinal study, such as the Framingham Heart Study. These variations in consent are reflected as multiple “consent groups” within a study and may mean that an investigator may only receive permission for subsets of study participants.

BioData Catalyst uses telemetry provided by dbGaP to enforce compliance with consents. Accordingly, users of BioData Catalyst can see only the data for which they have completed the process of a dbGaP Data Access Request (DAR) and after they receive approval from an NIH Data Access Committee (DAC). DAC approval results in a Data Use Agreement (DUA) describing any Data Use Limitations asserted by the originating study Principal Investigator, including ensuring that any requirements for an IRB review or Letter of Collaboration are met. On the BioData Catalyst ecosystem, the Gen3 “Fence” service, developed and operated by the University of Chicago and utilized by other NIH platforms and many other non-federal data commons, ensures enforcement of data access requirements. In order to ensure that the NHLBI maintains control over the use of the data and has the ability to audit this use, BioData Catalyst policy does not allow download of source data outside the cloud ecosystem. Instead, BioData Catalyst provides access to data in approved cloud environments where computation occurs, as documented in an ISA, via standard Application Programming Interfaces (APIs) that use the authentication and authorization provided by the NIH Researcher Auth Service (RAS) (REF), together with Fence, to protect access and to allow for monitoring and auditing of appropriate data use (e.g., within scope of the approved project). These APIs include implementations of the protected GA4GH Data Repository Service (DRS) for access to data objects such as genomic data and protected PIC-SURE interfaces for access to phenotypic, genotypic variant, and electronic health record data. The use of these APIs will, once fully implemented, enable other trusted cloud-based systems that meet equivalent security requirements to access data stored within BioData Catalyst for analysis using those systems’ tools, without the data being downloaded outside the security perimeters of the systems.

This commitment to the use of APIs, together with the requirement that data stay within the designated security boundaries, is a critical component of making NHLBI data FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016, Corpas et al. 2018), while also ensuring confidentiality of data and respect for consent, regardless of the platform where the data is analyzed. BioData Catalyst has extended this model through the use of the NIH Researcher Auth Service (RAS) to provide authentication and authorization controls, which together with the use of secure APIs is enabling secure interoperability with other trusted NIH-funded platforms such as the NHGRI’s AnVIL and the Gabriella Miller Kids First Data Resource.
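    For illustration, the sketch below shows what retrieving object metadata through a GA4GH DRS v1 endpoint can look like from Python. The base URL, token handling, and object ID are placeholders or assumptions for illustration, not a documented BDC workflow; only the /ga4gh/drs/v1/objects path follows the published GA4GH DRS specification.

        # Hedged sketch: fetching DRS object metadata via the GA4GH DRS v1 API.
        # Base URL and token handling are assumptions for illustration.
        import requests

        DRS_BASE = "https://gen3.biodatacatalyst.nhlbi.nih.gov/ga4gh/drs/v1"  # assumed
        drs_id = "<drs-object-id>"   # placeholder
        token = "<access-token>"     # placeholder

        resp = requests.get(f"{DRS_BASE}/objects/{drs_id}",
                            headers={"Authorization": f"bearer {token}"})
        resp.raise_for_status()
        obj = resp.json()
        # A DrsObject carries, among other fields, a name and access methods.
        print(obj.get("name"), [m["type"] for m in obj.get("access_methods", [])])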

    Endnote

While NIST-800-53r4 is a thorough standard for many kinds of systems, it has some gaps for the most modern systems encountered. There are additional standards to apply, either from the High column of NIST-800-53r4 or as “best practice” addendums.

    In particular, the Standard does not give guidance on modern applications (“appsec”) -- especially when the Infrastructure is serverless and the entire security surface is the application itself. For instance, there are requirements for regular scanning but scanning tools do not scan APIs or modern Single Page Apps (SPA) sufficiently. A “web app scan” with tools like IBM’s Appscan would return no vulnerabilities on an API without actually testing it, yet it would satisfy RA-5 of NIST-800-53 to run such a scan. Nor does the standard require that the Infrastructure as a Service layer (AWS, GCP, Azure) abide by a continual scanning posture for misconfiguration -- only Networks and VMs are specified. That is one example of where NIST-800-53r4 doesn’t work for modern applications.

There’s also no guidance from NIST at large about running an API where external parties are building “clients” to that API. Does allowing clients extend the security boundary, such that all 3rd-party applications are to be evaluated as part of it? Is there a different consideration for these 3rd-party applications? The standard is silent there. Companies like Apple and Google, through their app stores, require all 3rd-party apps to undergo evaluation against their own security standards, and such an idea might be applicable here. NHLBI might consider adding some extra controls, what the All of Us Research Program calls FISMA+, for enhanced security.


To submit a DAR, users must have PI status through their institution. Non-PI users must work with a PI who can submit a DAR and add them as a downloader.

    Data Access Request Process

    Step 1: Go to https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login to log in to dbGaP.

    Step 2: Navigate to My Projects.

    Step 3: Select Datasets.

    You can search by Primary disease type, or if you know the dataset you are interested in, you can use Study lookup.

We want to request HCT for SCD (hematopoietic cell transplant for sickle cell disease), so we will use the accession number phs002385. As you type the accession number, matching studies will start to auto-populate.

    Select the study to add it to the Data Access Request. You can request up to 200 studies that you are interested in accessing.

The user can add additional datasets as needed to answer the research question.

    Sample Research Use Statement

    Title

    Long-term survival and late death after hematopoietic cell transplant for sickle cell disease

    Research Use Statement

Our project is limited to the requested dataset. We have no plans to combine it with other datasets.

In 2018, the National Heart, Lung, and Blood Institute (NHLBI) began work on BioData Catalyst, a shared virtual space where scientists can access NHLBI data and work with the digital objects needed for biomedical research (www.nhlbi.nih.gov/science/biodata-catalyst). This is a cloud-based platform that provides tools, applications, and workflows. It provides secure workspaces to share, store, cross-link, and analyze large sets of data generated from biomedical research. BioData Catalyst addresses the NHLBI Strategic Vision objective of leveraging emerging opportunities in data science to facilitate research in heart, lung, blood, and sleep disorders. It offers specialized search functions and controlled access to data and analytic tools via programming interfaces, and its interoperability will allow exchange of information with other components of the Data Commons. BioData Catalyst may be accessed by biomedical researchers and the public at large. The first available datasets in BioData Catalyst include data from NHLBI’s Trans-Omics for Precision Medicine (TOPMed) Program and the Cure Sickle Cell Initiative.

Rigor in designing and performing scientific research and the ability to reproduce biomedical research are two of the cornerstones of science advancement. In order to test reproducibility of biomedical data available in BioData Catalyst, we accessed NHLBI data from the Cure Sickle Cell Initiative to test and validate the findings of a publication that utilized those data. That report focused on the effect of donor type and transplant conditioning regimen intensity on hematopoietic cell transplant outcomes for sickle cell disease. Hematopoietic cell transplant is potentially curative, yet this treatment is associated with risks for mortality from the treatment procedure. Published reports suggest the life expectancy of adults with sickle cell disease in the United States is shortened by at least two decades compared to the general population. Thus, a fundamental question that is often asked is whether hematopoietic cell transplant over time would offer a survival advantage compared to treatment with disease-modifying agents. In the report [1] that examined factors associated with survival after transplantation, young patients (aged ≤12 years) and patients who received their graft from an HLA-matched sibling had the highest survival. For those without an HLA-matched sibling, the data did not favor one alternative donor type over another [1].

The purpose of the current analyses is two-fold: 1) test and validate a publication that utilized data in the public domain and 2) assess the utility of these data to conduct an independent study. The aim of the latter study was to estimate the conditional survival rates after hematopoietic cell transplantation stratified by time survived since transplantation and to compare all-cause mortality risks to those of an age-, sex-, and race-matched general population in the United States.

    Non-technical summary

Investigators in the Cure Sickle Cell Initiative Data Consortium request access to data to examine rigor and reproducibility of data submitted, using a previous publication as reference. Additionally, we will calculate survival after hematopoietic cell transplant by time survived since transplant. For example, we will calculate the 5- and 10-year likelihood of being alive for patients who have already survived 2 and 5 years after transplantation.

    Cloud-Use Statement

    The NHLBI-supported BioData Catalyst (www.nhlbiBioDataCatalyst.org) is a cloud-based infrastructure where heart, lung, blood, and sleep (HLBS) researchers can go to find, search, access, share, cross-link, and compute on large scale datasets. It will provide tools, applications, and workflows to enable those capabilities in secure workspaces. The BioData Catalyst will employ Amazon Web Services and Google Cloud Platform for data storage and compute. BioData Catalyst comprises the Data Commons Framework Services (DCFS) hosted and operated by the University of Chicago. DCFS will provide the gold master data reference as well as authorization/authentication and indexing services. The DCFS will also enable security interoperability with the secure workspaces. Workspaces will be provided by FireCloud, hosted and operated by the Broad Institute, Fair4Cures, hosted and operated by Seven Bridges Genomics and PIC-SURE operated by Harvard Medical School. For the NHLBI BioData Catalyst, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) issued to the Broad Institute, University of Chicago and Seven Bridges Genomics as presenting acceptable risk, and therefore the NCI ATO serves as an Interim Authority to Test (IATT) when used by designated TOPMed investigators and collaborators. Additionally, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) for Harvard Medical School.

    Cloud Provider Information

    Cloud Provider:

    NHLBI BioData Catalyst, Private, The NHLBI-supported BioData Catalyst (https://biodatacatalyst.nhlbi.nih.gov/) is a cloud-based infrastructure where heart, lung, blood, and sleep (HLBS) researchers can go to find, search, access, share, cross-link, and compute on large scale datasets. It will provide tools, applications, and workflows to enable those capabilities in secure workspaces.

    The NHLBI BioData Catalyst will employ Amazon Web Services and Google Cloud Platform for data storage and compute. The NHLBI BioData Catalyst comprises the Data Commons Framework Services (DCFS) hosted and operated by the University of Chicago. DCFS will provide the gold master data reference as well as authorization/authentication and indexing services. The DCFS will also enable security interoperability with the secure workspaces. Workspaces will be provided by FireCloud, hosted and operated by the Broad Institute, Fair4Cures, hosted and operated by Seven Bridges Genomics and PIC-SURE operated by Harvard Medical School.

    For the NHLBI BioData Catalyst, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) issued to the Broad Institute, University of Chicago and Seven Bridges Genomics as presenting acceptable risk, and therefore the NCI ATO serves as an Interim Authority to Test (IATT) when used by designated TOPMed investigators and collaborators. Additionally, the NHLBI Designated Authorizing Official has recognized the Authority to Operate (ATO) for Harvard Medical School.

Amazon Web Services (AWS), Commercial. Amazon Web Services (AWS) is a public cloud platform that provides solutions and services such as virtual machines, database instances, storage, and more. We will use the Amazon Elastic Compute Cloud (Amazon EC2), a web service that provides resizable compute capacity, to allocate Amazon Machine Instances (AMIs) in which we will develop the methods and infrastructure necessary to build the NHLBI BioData Catalyst. AWS offers several storage options that work in conjunction with EC2: Amazon Simple Storage Service (Amazon S3), Amazon Elastic Block Store (EBS) and Amazon Elastic File System (Amazon EFS). We expect to use each of these as they provide different capabilities, including persistent storage and direct and networked storage for attaching to running AMI(s). We will use the Amazon Virtual Private Cloud (VPC) to provide security and robust networking functionality to these compute resources and Amazon Identity and Access Management (IAM) to control user access to these compute resources. AWS offers extensive security and has written a white paper with guidelines for working with controlled access data sets in AWS, which we will follow (see https://d0.awsstatic.com/whitepapers/compliance/AWS_dBGaP_Genomics_on_AWS_Best_Practices.pdf).

    Google Cloud Platform, Commercial

Google Cloud Platform is a public cloud platform that provides solutions and services such as virtual machines, database instances, storage, and more. We will use Google Compute Engine, a service that provides resizable compute capacity, to allocate Machine Types in which we will develop the methods and infrastructure necessary to build the NHLBI BioData Catalyst. Google Cloud offers several storage options that work in conjunction with Compute Engine: Google Cloud Storage and Google Compute Engine Persistent Disks. We expect to use each of these as they provide different capabilities, including persistent storage and direct and networked storage for attaching to running machine instances. We will use networking technologies based on Google’s Andromeda architecture, which can create networking elements at any level with software. This software-defined networking allows Cloud Platform's services to implement networking features that fit their exact needs, such as secure firewalls for virtual machines in Google Compute Engine. We will use Google Cloud Identity & Access Management to control user access to these compute resources.


    PIC-SURE User Guide

    PIC-SURE: Patient Information Commons Standard Unification of Research Elements

    The Patient Information Commons: Standard Unification of Research Elements (PIC-SURE) integrates clinical and genomic data to allow users to search, query, and export data at the variable and variant levels. This allows users to create analysis-ready data frames without manually mapping and merging files.

BDC Powered by PIC-SURE (BDC-PIC-SURE) functions as part of the BDC ecosystem, allowing researchers to explore studies funded by the National Heart, Lung, and Blood Institute (NHLBI), whether or not they have been granted access to the participant-level data.

    Overview of PIC-SURE search interface

    Requirements and Login

    Requirements

    To obtain access to BDC-PIC-SURE, you must have an NIH eRA Commons account. For instructions and to register an account, refer to the eRA website.

    Login

After you have created an eRA Commons account, you can log in to BDC-PIC-SURE by navigating to https://picsure.biodatacatalyst.nhlbi.nih.gov and selecting the option to log in with eRA Commons. You will be directed to the NIH website to log in with your eRA Commons credentials. After signing in and accepting the terms of the agreement on the NIH RAS Information Sharing Consent page, allow the BDC-Gen3 service to manage your authorization.

    Upon login, you will be directed to the Data Access Dashboard. This page provides a summary of PIC-SURE Authorized Access, PIC-SURE Open Access, and the studies you are authorized to access.

    Available Data and Managing Data Access

    BDC-PIC-SURE has integrated clinical and genomic data from a variety of heart, lung, blood, and sleep related datasets. These include NHLBI Trans-Omics for Precision Medicine (TOPMed) and TOPMed related studies, BioLINCC datasets, and COVID-19 datasets.

    View a summary of the data you have access to by viewing the Data Access Table.

    This table displays information about the study and associated data, including the full and abbreviated name of the study, study design and focus, the number of clinical variables, participants, and samples sequenced, additional information with helpful links, consent group information, and the dbGaP accession number (or phs number). You are also able to see which studies you are authorized to access in the Access column of the table. For information from dbGaP on submitting a data access request, refer to Tips for Preparing a Successful Data Access Request documentation. Note that studies with a sickle cell disease focus contain links to the Cure SCi Metadata Catalog for additional information.

    Sample summary table of studies available and user-based authorization via the Data Table.

    You can also check the data you have access to by going to the BioData Catalyst Data Access page on the BDC website and clicking Check My Access.

    PIC-SURE Features and General Layout

    General layout of PIC-SURE search
    1. Search bar: Enter any phenotypic variable, study or table keyword into the search bar to search across studies. Users can also search specific variables by accession number, if known (phs/pht/phv).

    2. Study Tags: Users can filter the results found through their search by limiting to studies of interest or excluding studies.

    3. Variable Tags: Users can filter the results found through their search by limiting to keywords of interest or excluding keywords that are out of scope. For example, a user could filter to categorical variables, variables containing the term ‘blood’, and/or exclude variables containing the term ‘pressure’.

How are variable tags generated? Each variable has a set of associated tags, which are generated during the PIC-SURE data loading process. These tags are generated based on information associated with the variable, including the name of the study, study description, dataset name, PIC-SURE data type (continuous or categorical), and variable description. For a search in PIC-SURE, tags associated with a variable are displayed. Note that tags applicable to less than 5% or more than 95% of the search results are not displayed since these are not useful for filtering results (a sketch of this rule follows the list).

    4. Search Results table: View all variables associated with your search term and/or study & variable tags.

    5. Results Panel: Panel with content boxes that describe the cohort based on the variable filters applied to the query.

    6. Data Summary: Displays the total number of participants in the filtered cohort which meet the query criteria. When first opening the Open or Authorized Access page, the number will be the total number of participants that you can access.

    7. Added Variable Filters summary: View all filters which have been applied to the cohort.

    8. Filter Action: Click on the filter icon to filter cohort participants by specific variable values.

    9. Reset button: Allows users to start a new search and query by removing all added filters and clearing all active study and variable tags.
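    A minimal sketch of the 5%/95% tag-display rule described in item 3 above; the function name and inputs are hypothetical:

        # Hypothetical illustration of the tag-display rule: keep only tags
        # that apply to between 5% and 95% of the search results.
        def displayable_tags(tag_counts: dict[str, int], total_results: int) -> list[str]:
            return [tag for tag, n in tag_counts.items()
                    if 0.05 <= n / total_results <= 0.95]

        displayable_tags({"categorical": 90, "blood": 40, "rare": 2}, 100)
        # -> ["categorical", "blood"]   ("rare" applies to under 5% and is hidden)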

    TOPMed and TOPMed related datasets

The BDC ecosystem hosts several datasets from the NHLBI Trans-Omics for Precision Medicine (TOPMed) program. The PIC-SURE platform has integrated the clinical and genomic data from all studies listed in the Data Access Dashboard. During the ingestion process, PIC-SURE occasionally ingests phenotypic data for the TOPMed studies prior to the genomic data.

    Harmonized Data (TOPMed Harmonized Clinical Variables)

    There are limited amounts of harmonized data available at this time. The TOPMed Data Coordinating Center (DCC) curation team has identified 44 variables that are shared across 17 NHLBI studies and normalized the participant values for these variables.

The 44 harmonized variables available are listed in the table in Appendix 2. For more information on this initiative, you can view the additional documentation from the TOPMed DCC GitHub repository or the NHLBI Trans-Omics for Precision Medicine website.

    CONNECTS Dataset

The BDC ecosystem hosts several datasets from the NIH NHLBI Collaborating Network of Networks for Evaluating COVID-19 and Therapeutic Strategies (CONNECTS) program. These COVID-19 related studies follow the guidelines for implementing common data elements (CDEs) and for de-identifying dates, ages, and free text fields. For more information about these efforts, you can view the CDE Manual and De-Identification Guidance documents on the CONNECTS COVID-19 Therapeutic Trial Common Data Elements webpage.

Table of COVID-19 Studies Included in the CONNECTS Program Available in PIC-SURE

Study Name | Abbreviation | Accession
A Multicenter, Adaptive, Randomized Controlled Platform Trial of the Safety and Efficacy of Antithrombotic Strategies in Hospitalized Adults with COVID-19 | ACTIV4a | phs002752
COVID-19 Positive Outpatient Thrombosis Prevention in Adults Aged 40-80 | ACTIV4b | phs002694
Clinical-trial of COVID-19 Convalescent Plasma in Outpatients | C3PO | phs002710

    Additional Resources

    Video Walkthroughs

Playlist available on the BioData Catalyst Powered by PIC-SURE YouTube channel

    BioLINCC Datasets

The BDC ecosystem hosts several datasets from the NIH NHLBI Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC). To access the BioLINCC studies, you must request access through dbGaP even if you have authorization from BioLINCC.

    Data Organization in PIC-SURE

    PIC-SURE integrates clinical and genomic datasets across BDC, including TOPMed and TOPMed related studies, COVID-19 studies, and BioLINCC studies. Each variable is organized as a concept path that contains information about the study, variable group, and variable. Though the specifics of the concept paths are dependent on the type of study, the overall information included is the same.

    For more information about additional dbGaP, TOPMed, and PIC-SURE concept paths, refer to Appendix 1.

Table of Data Fields in PIC-SURE

Data field | Studies following the dbGaP data format | Studies not following the dbGaP data format
Variable ID | phv corresponding to the variable accession number | Equivalent to variable name
Variable name | Encoded variable name that was used by the original submitters of the data | Encoded variable name that was used by the original submitters of the data
Variable description | Description of the variable | Description of the variable, as available
Dataset ID | pht corresponding to the trait table accession number | Equivalent to dataset name
Dataset name | Name of the trait table | Name of a group of like variables, as available
Dataset description | Description of the trait table | Description of a group of like variables, as available
Study ID | phs corresponding to the study accession number | phs corresponding to the study accession number
Study description | Description of the study from dbGaP | Description of the study from dbGaP
General organization | Data are organized using the format implemented by the database of Genotypes and Phenotypes (dbGaP): generally, a given study will have several tables, and those tables have several variables. More information on the dbGaP data structure is available from dbGaP. | Data do not follow the dbGaP format; there are no phv or pht accessions. Data are organized in groups of like variables, when available. For example, variables like Age, Gender, and Race could be part of the Demographics variable group.
Concept path structure | \phs\pht\phv\variable name\ | \phs\variable name

Note that there are two data types in PIC-SURE: categorical and continuous data. Categorical variables refer to any variables that have categorized values. For example, “Have you ever had asthma?” with values “Yes” and “No” is a categorical variable. Continuous variables refer to any variables that have a numeric range of values. For example, “Age” with a value range from 10 to 90 is a continuous variable. The internal PIC-SURE data load process determines the type of each variable based on the data.
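    As a small illustration of the concept path structure shown in the table, the hypothetical helper below splits a path into its components; the pht/phv values in the example are made up:

        # Hypothetical helper: split a PIC-SURE concept path into its components.
        def parse_concept_path(path: str) -> list[str]:
            # Concept paths use backslash delimiters with leading/trailing slashes.
            return [part for part in path.split("\\") if part]

        parse_concept_path("\\phs000284\\pht001902\\phv00122012\\AGE\\")
        # -> ['phs000284', 'pht001902', 'phv00122012', 'AGE']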

    PIC-SURE Open Access vs. PIC-SURE Authorized Access

PIC-SURE provides two ways to search: PIC-SURE Open Access and PIC-SURE Authorized Access. PIC-SURE Open Access enables the user to explore aggregate-level data without any dbGaP data authorizations. The PIC-SURE Authorized Access feature allows the user to explore participant-level data and requires authorization to access at least one study through an active dbGaP Data Access Request (DAR).

Table Comparison of PIC-SURE Open and Authorized Access

Feature | PIC-SURE Open Access | PIC-SURE Authorized Access
dbGaP approval to access required | | ✓
Access to aggregate counts | ✓ | ✓
Access to participant-level data | | ✓
Phenotypic variable search | ✓ | ✓
Phenotypic variable filtering | ✓ | ✓
Genomic variable filtering | | ✓
Data retrieval | | ✓
Visualizations | | ✓
Removed stigmatizing variables | ✓ |
Data obfuscation | ✓ |

    Videos

    Introduction to BioData Catalyst Powered by PIC-SURE

    Basics: Finding Variables

    Basics: Applying a Variable on a Filter

    Basics: Editing a Variable Filter

    PIC-SURE Open Access: Interpreting the Results

    PIC-SURE Authorized Access: Applying a Genomic Filter

    PIC-SURE Authorized Access: Add Variables to Export

    PIC-SURE Authorized Access: Select and Package Data Tool

    PIC-SURE Authorized Access: Variable Distributions Tool

    PIC-SURE Open Application Programming Interface (API)






    Table of Studies Included in the TOPMed Harmonized Dataset Available in PIC-SURE

Study Name | Abbreviation | Accession
Atherosclerosis Risk in Communities Study | ARIC | phs000280
Cardiovascular Health Study | CHS | phs000287
Cleveland Family Study | CFS | phs000284
Coronary Artery Risk Development in Young Adults Study | CARDIA | phs000285
Epidemiology of Asthma in Costa Rica Study | |


    Query

    Overview of the Query page on BDC-Gen3

    Overview

The Query page can search and return metadata from either the Flat Model or the Graph Model of a commons. Using GraphQL, these searches can be tailored to filter and return fields of interest for the data sets being queried. Queries can be made immediately after data submission because they run directly against the model.

    BDC Query Page

    For more information about how to use the Query page, refer to the Gen3 documentation.
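    As an illustration, a Graph Model query can be sent from Python as an HTTP POST carrying a GraphQL document. The sketch below is hypothetical: the node name ("case"), the fields, and the endpoint path follow common Gen3 conventions rather than examples documented on this page.

        # Hypothetical sketch: running a Graph Model query against the Gen3
        # GraphQL endpoint with Python. Node and field names depend on the
        # commons' data dictionary; "case"/"submitter_id" are assumptions.
        import requests

        API = "https://gen3.biodatacatalyst.nhlbi.nih.gov/api/v0/submission/graphql"
        token = "<access-token>"  # placeholder
        query = '{ case(first: 10) { submitter_id project_id } }'

        resp = requests.post(API, json={"query": query},
                             headers={"Authorization": f"bearer {token}"})
        print(resp.json())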

    Data Analysis Using the PIC-SURE API

    Once you have refined your queries and created a cohort of interest, you can begin analyzing data using other components of the BDC ecosystem.

    What is the PIC-SURE API?

Databases exposed through the PIC-SURE API encompass a wide heterogeneity of architectures and data organizations underneath. PIC-SURE hides this complexity and exposes the different databases in the same format, allowing researchers to focus on the analysis and medical insights, thus easing the process of reproducible science. The API is available in two programming languages, Python and R, allowing investigators to query databases in the same way using either language. The PIC-SURE API tutorial notebooks can be directly accessed on GitHub.

    PIC-SURE Access Token

    To access the PIC-SURE API, a user-specific token is needed. This is the way the API grants access to individual users to protected-access data. The user token is strictly personal; do not share it with anyone. You can copy your personalized access token by selecting the User Profile tab at the top of the screen.

    Here, you can Copy your personalized access token, Reveal your token, and Refresh your token to retrieve a new token and deactivate the old token.
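    Putting the token to use, the sketch below outlines a session with the Python client libraries used in the PIC-SURE tutorial notebooks (PicSureClient and PicSureHpdsLib). Treat it as a hedged outline rather than the canonical workflow: the endpoint URL, resource ID, and concept path are placeholders, and method names may differ between client releases.

        # Hedged outline of a PIC-SURE API session in Python, based on the
        # tutorial notebooks; exact names may differ between client releases.
        import PicSureClient
        import PicSureHpdsLib

        PICSURE_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"  # assumed endpoint
        token = open("token.txt").read().strip()  # personal token from the User Profile tab

        client = PicSureClient.Client()
        connection = client.connect(PICSURE_URL, token)
        adapter = PicSureHpdsLib.Adapter(connection)
        resource = adapter.useResource("<resource-uuid>")  # placeholder resource ID

        # Search the data dictionary for variables mentioning "asthma",
        # then build a query filtered on one of them.
        dictionary = resource.dictionary()
        matches = dictionary.find("asthma")
        print(matches.count())

        query = resource.query()
        query.filter().add("\\Example Study\\asthma status\\", ["Yes"])  # hypothetical concept path
        df = query.getResultsDataFrame()  # analysis-ready pandas dataframe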

    Analysis in the BDC Ecosystem

The PIC-SURE API can be accessed via tutorial notebooks on either BDC-Seven Bridges or BDC-Terra.

To launch one of the analysis platforms, go to the BDC website. From the Resources menu, select Services. A list of platforms and services on the BDC ecosystem will be displayed.

    From the Analyze Data in Cloud-based Shared Workspaces section, select Launch for your preferred analysis platform.

    BDC-Seven Bridges

Jupyter notebook examples in R and Python can be found under the Public projects tab by selecting PIC-SURE API.

    From the Data Studio tab, select an example that fits your research needs. Here, we will select PIC-SURE JupyterLab examples.

This will take you to the PIC-SURE API analysis workspace, where you can view the examples in Python. Copy this workspace to your own project to edit or run the code yourself.

Note: The project must have network access to run the PIC-SURE examples on Seven Bridges. To ensure this, go to the Settings tab and select “Allow network access”.

    BDC-Terra

To access the Jupyter notebook examples in R and Python for the PIC-SURE API, select View Workspaces in BDC-Terra.

Select the Public tab and search for “PIC-SURE”. Workspaces for both the Python and R examples will be displayed. You must clone the workspaces to edit or run the code within them.

    PIC-SURE API Documentation

    How to get started with PIC-SURE and the common endpoints you can use to query any resource registered with PIC-SURE

    The PIC-SURE v2 API is a meta-API used to host any number of resources exposed through a unified set of generalized operations.

    PIC-SURE Repositories:

    • PIC-SURE API: This is the repository for version 2+ of the PIC-SURE API.

    • PIC-SURE Wiki: This is the wiki page for version 2+ of the PIC-SURE API.

• This is the repository for the BDC environment of PIC-SURE.

• PIC-SURE-ALL-IN-ONE: This is the repository for PIC-SURE-ALL-IN-ONE.

    Additional PIC-SURE Links:

• A link to the Avillach Lab Jenkins repository.

• A repository for Avillach Lab Jenkins development release control.

    Client Libraries

    The following are the collected client libraries for the entire PIC-SURE project.

    PIC-SURE User Interface

    The PIC-SURE User Interface provides a visual way to build and run queries against resources through PIC-SURE.

    PIC-SURE User Interface Repositories:

    • : The main High Performance Data Store (HPDS) UI repository.

    Additional PIC-SURE User Interface Links:

    • : Links to a Google Drawing of the PIC-SURE UI flow.

    PIC-SURE Auth Micro-App (PSAMA)

    The PSAMA component of the PIC-SURE ecosystem authorizes and authenticates all actions taken within PIC-SURE.

    PSAMA Repositories:

    • : This is where the core of the PSAMA application is stored on GitHub.

    High Performance Data Store (HPDS)

    HPDS is a datastore designed to work with the PIC-SURE meta-API. It grants researchers fast, dependable access to static datasets and the ability to produce statistics-ready dataframes filtered on any variable they choose at any time.
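    To make this concrete, here is a hedged Python sketch of building a filtered HPDS query with the client library listed below; the `resource` object is assumed to have been obtained as in the earlier connection sketch, and the concept paths are hypothetical.

    ```python
    # A minimal sketch of an HPDS query via the Python client library. The
    # `resource` object is assumed to come from
    # PicSureHpdsLib.Adapter(connection).useResource(resource_id), and the
    # concept paths below are hypothetical.
    query = resource.query()

    # Restrict the cohort with a categorical filter (hypothetical concept path).
    query.filter().add("\\phs000007\\sex\\", ["Female"])

    # Add another variable to the output without filtering on it.
    query.select().add("\\phs000007\\age\\")

    # Produce a statistics-ready pandas dataframe for the filtered cohort.
    df = query.getResultsDataFrame()
    print(df.head())
    ```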

    HPDS Repositories:

    • : The main HPDS repository.

    • : Python client library to run queries against a PIC-SURE HPDS resource.

    • : R client library to run queries against a PIC-SURE HPDS resource.

    PIC-SURE Authorized Access

    If you are authorized to access any dbGaP dataset(s), the Authorized Access tab will be visible at the top of the screen. PIC-SURE Authorized Access provides access to complete participant-level data and aggregate counts, as well as to the Tool Suite.

    PIC-SURE Authorized Access specific features and layout.

    A. Select Variables Action: Click the Select Variables icon to include variables when retrieving data. Users can select variables individually or at the dataset level.

    • Individually select variables: You can individually select variables from two locations:

      • Variable search results: From the search results you can click the data retrieval icon to include the variable in your data retrieval.

      • Variable information modal: The data retrieval icon next to the variable adds the variable to your data retrieval.

    • Select from a dataset or group of variables: In the variable modal, the data retrieval icon next to the dataset opens a modal that allows you to select variables from the dataset table or group of variables.

    B. Data Summary: In addition to the total number of participants in the filtered cohort, the number of variables the user has selected for data retrieval is also displayed.

    The following concept paths are automatically included with any data export from PIC-SURE Authorized Access; they are listed and described below.

    • Patient ID: Internal PIC-SURE participant identifier. Please note that this field does not link participants between studies and therefore should not be used to correlate data between different data sources or with the original data files.

    • Parent Study Accession with Subject ID: PIC-SURE generated identifier for parent studies. These identifiers are a combination of the study accession number and the subject identifier.

    • Topmed Study Accession with Subject ID: PIC-SURE generated identifier for TOPMed studies. These identifiers are a combination of the study accession number and the subject identifier.

    C. Tool Suite: The Tool Suite contains tools that can be used to further explore filtered cohorts of interest. Note that at least one filter must be added to the query before using the Tool Suite.

    • Select and Package Data: Retrieve participant-level data corresponding to your filters and variable selections. Variables selected for data retrieval can be reviewed and modified. To learn more about the options associated with this tool, please refer to the Select and Package Data section.

    • Variable Distributions: View the distributions of query variables based on the filtered cohort. Note that there is a limit to the number of variable distributions that can be viewed at a given time. Additionally, neither genomic variables nor variables associated with an any-record-of filter (e.g., entire datasets) will be graphed.

    Select and Package Data

    The Select and Package Data tool is used to select and export participant-level data corresponding to your filters and variable selections. There are several options for selecting and exporting the data, which are shown using this tool.

    In the top left corner of the modal, the number of participants and the number of variables included in the query are shown. These are used to display the estimated number of data points in the export.

    Note: Queries with more than 1,000,000 data points will not be exportable.
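    To illustrate the limit, assuming the estimate is simply the number of participants multiplied by the number of variables (an assumption; PIC-SURE's exact calculation may differ):

    ```python
    # A trivial sanity check of the export limit, assuming the estimate is
    # participants x variables; PIC-SURE's exact calculation may differ.
    MAX_DATA_POINTS = 1_000_000

    def is_exportable(participants: int, variables: int) -> bool:
        return participants * variables <= MAX_DATA_POINTS

    print(is_exportable(2_000, 400))  # 800,000 data points   -> True
    print(is_exportable(2_000, 600))  # 1,200,000 data points -> False
    ```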

    The table below displays a summary of the variables included in the export. Using the Selected column, variables that have been added to the export can be selected or deselected for the final dataframe.

    Note: Variables with filters are automatically included in the export.

    The Package Data button in the top right corner is used to prepare the data for export once the variable filters and selections have been finalized.

    Once this button is clicked, there are several options to complete the export.

    To export into a BDC analysis workspace, use the Export to Seven Bridges or Export to Terra buttons. After clicking either button, a new modal displays all the information and instructions needed to complete the export, including your personalized access token and the query ID associated with the dataframe. Alternatively, you can Copy Query ID without accessing Seven Bridges or Terra if you wish to use a different analysis platform.

    The Export to Seven Bridges option includes a Go to Seven Bridges button, which will open a new tab to the Public PIC-SURE API Project on BDC-Seven Bridges.

    The Export to Terra option includes a Go to Terra via R button and a Go to Terra via Python button, which will open the Public PIC-SURE API R Examples workspace and the Public PIC-SURE API Python Examples workspace on BDC-Terra, respectively.
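    Once in a workspace notebook, the copied query ID can be used to pull the packaged dataframe. The hedged sketch below fetches the result over HTTP and loads it with pandas; the result endpoint path is an assumption based on the meta-API conventions, and the export modal shows the exact instructions for the current client version.

    ```python
    # A hedged sketch of loading a packaged export with the Query ID copied
    # from the export modal. The result endpoint path is an assumption; follow
    # the instructions shown in the export modal for your client version.
    from io import StringIO

    import pandas as pd
    import requests

    BASE_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"  # placeholder
    TOKEN = "your-personal-access-token"                                # placeholder
    QUERY_ID = "query-id-copied-from-the-export-modal"                  # placeholder

    resp = requests.get(f"{BASE_URL}/query/{QUERY_ID}/result",
                        headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()

    # The packaged data arrive as CSV text; load them into a dataframe.
    df = pd.read_csv(StringIO(resp.text))
    print(df.shape)
    ```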

    Use Case: Investigating Comorbidities of Breast Cancer in Authorized Access

    In this section, the functionalities of PIC-SURE Authorized Access will be described in the context of a scientific use case. Specifically, let’s say I am interested in investigating some comorbidities of breast cancer in women with BRCA1 and BRCA2 gene variants, such as hypertension and COPD.

    I have already been authorized to access the Women’s Health Initiative (WHI) study and am interested in a single cohort: women with breast cancer and variants of the BRCA1 and BRCA2 genes. I want to select hypertension-related variables of interest, check the distributions of some variables, and export all the data to an analysis workspace.

    First, let’s apply our variable filters for the WHI study.

    1. Search “breast cancer” in Authorized Access.

    2. Add the WHI study tag to filter search results to only breast cancer variables found within the WHI study.

    3. Filter to participants with breast cancer by clicking the filter icon next to the variable of interest. Select values to filter your variable on and click “Add Filter to Query”.

    PIC-SURE Open Access

    PIC-SURE Open Access allows you to search any clinical variable available in PIC-SURE. Your queries will return obfuscated aggregate counts per study and consent. There are some features specific to PIC-SURE Open Access, which are outlined below.

    PIC-SURE Open Access specific features and layout.

    A. Stigmatizing Variables Removal: PIC-SURE Open Access data excludes clinical variables that contain potentially sensitive information. These variables are known as stigmatizing variables, which fall into the following categories:

    • Mental health diagnoses, history, and treatment

    • Illicit drug use history

    • Sexually transmitted disease diagnoses, history, and treatment

    • Sexual history

    • Intellectual achievement, ability, and educational attainment

    • Direct or surrogate identifiers of legal status

    For more information about stigmatizing variables and the identification process, please refer to the accompanying documentation and code repository.

    B. Data Obfuscation: Because participant-level data are not available in PIC-SURE Open Access, the aggregate counts are obfuscated to further anonymize the data; a sketch of these rules follows the list below. This means that:

    • If the consent group, study, and/or total participant counts for the query are between one and nine, the results will be shown as < 10.

    • If the consent group results are between one and nine and the study and/or total participant counts are greater than 10, the results will be obfuscated by ± 3.

    • Query results of zero participants will display 0.
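    The sketch below makes these rules explicit in Python; it is one plausible reading of the stated behavior, not the production implementation, and details such as how perturbed values are displayed may differ.

    ```python
    # An illustrative sketch of the obfuscation rules above; one plausible
    # reading of the stated behavior, not the production implementation.
    import random

    def obfuscated_display(consent_count: int, query_total: int) -> str:
        """Return the displayed value for an aggregate participant count."""
        if consent_count == 0:
            return "0"           # zero results display as 0
        if query_total < 10:
            return "< 10"        # small counts are masked entirely
        if consent_count < 10:
            # Small consent-group counts within a larger query are perturbed.
            return str(max(consent_count + random.randint(-3, 3), 1))
        return str(consent_count)

    for count, total in [(0, 0), (4, 8), (4, 250), (136, 250)]:
        print(count, total, "->", obfuscated_display(count, total))
    ```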

    C. View Filtered Results by Study: The number of participants matching the query criteria is shown broken down by study and consent group. Users can see whether or not they have access to specific studies.

    Use Case: Using PIC-SURE Open Access to Investigate Asthma in Healthy and Obese Adult Populations

    In this section, the functionalities of PIC-SURE Open Access will be described in the context of a scientific use case. Specifically, let’s say I am interested in investigating asthma in relation to obesity in adults.

    I’m interested in two cohorts: obese adults with a body mass index (BMI) greater than 30 and healthy adults with a BMI between 18.5 and 24.9. However, I have not yet submitted a Data Access Request (DAR) and therefore am not authorized to access any datasets.

    First, let’s explore cohort A: Healthy adults with a BMI between 18.5 and 24.9 in Framingham Heart Study (FHS).

    1. Search for ‘age’.

    2. Apply ‘FHS’ study tag to view only ‘age’ variables within the Framingham Heart Study (phs000007).

    3. Select the variable of interest. You may notice many variables that appear similar. These variables may be located in different datasets, or tables, but contain similar information. Open up the variable information modal by clicking on the row containing the variable of interest to learn more.

    We can easily modify our filters to explore cohort B: Obese adults with a body mass index (BMI) greater than 30 in Framingham Heart Study.

    1. Edit the BMI filter by clicking the edit icon in the Added Variable Filters section. Change the range to have a minimum of 30 and no maximum.

    2. Note the total participant count in the Data Summary.

    We can easily repeat these steps for other studies, such as the Genetic Epidemiology of COPD (COPDGene) study, and compile the participant counts into a comparison table. By comparing these two studies, I can see that COPDGene may be more promising for my research, since it contains many more participants in my cohorts of interest than FHS does.

    I can then use the Request Access button to go directly to the study’s dbGaP page and begin submitting a DAR.

    Appendix 1: BDC Identifiers - dbGaP, TOPMed, and PIC-SURE

    Table of BDC dbGaP/TOPMed Identifiers

    • Patient ID: The HPDS patient number; PIC-SURE HPDS’s internal identifier.

    • Topmed / Parent Study Accession with Subject ID: These are the identifiers used by each team in the consortium to link data. Values must follow the mask <STUDY_ACCESSION_NUMBER>.<VERSION>_<SUBJECT_ID>, e.g. phs000007.v30_XXXXXXX (see the parsing sketch after this table).

    • DBGAP_SUBJECT_ID
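    The identifier mask above is straightforward to parse. Below is a small illustrative helper; the regex and function are hypothetical, not part of PIC-SURE.

    ```python
    # A small illustrative parser for the identifier mask above; the regex
    # and helper are hypothetical, not part of PIC-SURE.
    import re

    MASK = re.compile(r"^(phs\d{6})\.v(\d+)_(.+)$")

    def parse_accession_subject_id(value: str):
        """Split e.g. 'phs000007.v30_1234567' into accession, version, subject ID."""
        match = MASK.match(value)
        if match is None:
            raise ValueError(f"Unexpected identifier format: {value!r}")
        accession, version, subject_id = match.groups()
        return accession, int(version), subject_id

    print(parse_accession_subject_id("phs000007.v30_1234567"))
    ```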

    Table of PIC-SURE Identifiers