BDC Glossary

Glossary of terms used in the context of the BioData Catalyst Consortium and platform.

  • Agile Development

    Agile software development is an approach to software development under which requirements and solutions evolve through the collaborative effort of self-organizing and cross-functional teams and their customer(s)/end user(s).

  • Alpha Users

    A small group of users who are willing to tolerate working in a system that is not yet fully developed, providing detailed feedback and sometimes engaging in back-and-forth discussion.

  • [Amazon] EFS

    [Amazon] Elastic File System: a simple, scalable, elastic file system for Linux-based workloads, for use with AWS Cloud services and on-premises resources.

  • Ambassadors

    A small group of experts that represent the personas featured within the priority User Narratives. For their time and help, Ambassadors receive early access to the BDC platform, free compute time, a monetary fee for their time, and coverage of relevant travel expenses.

  • App

    1. In Seven Bridges, an app is a general term to refer to both tools and workflows.

    2. App may also refer to persistent software that is integrated into a platform.

  • API

    Application Programming Interface. API technologies serve as software-based intermediaries for exchanging data.

  • AWS

    Amazon Web Services. A provider of cloud services available on-demand.

  • BagIt

    BagIt is a hierarchical file packaging format for storage and transfer of arbitrary digital content.
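
    For illustration, the on-disk layout of a minimal bag (file names are examples; the structure follows the BagIt specification, RFC 8493):

```text
mybag/
    bagit.txt             declaration: BagIt version and tag-file encoding
    manifest-sha256.txt   checksum line for every file under data/
    data/                 the payload directory holding the actual content
        study.csv
```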

  • BDC3

    BioData Catalyst Coordinating Center

  • Beta Users

    A slightly larger group than the alpha users. Beta users are less tolerant of a difficult or clunky environment, but understand that the version they are using is not polished and that their feedback is needed.

  • Beta-User Training

    Once the platform is available to a broader audience, we will support freely-accessible online training for beta-users at any time.

  • Carpentries Instructor Training Program

    Ambassadors attend this training program to become BDC trainers.

  • CCM

    Change Control Management; the systematic approach to managing all changes made to a document or process. Ensures no unnecessary changes are made, all changes are documented, and a process exists for implementing approved changes.

  • CIO

    Chief Information Officer

  • Cloud Computing

    Internet-based computing, wherein computing power, networking, storage or applications running on computers outside an organization are presented to that organization in a secure, services-oriented way.

  • Components

    Software units that implement a specific function or functions and which can be reused.

  • ConOps

    Concept of Operations

  • Consortium

    A collection of teams and stakeholders working to deliver on the common goals of integrated and advanced cyberinfrastructure, leading-edge data management and analysis tools, FAIR data, and HLBS researcher engagement.

  • Containers

    A standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another (for example, Docker).

  • Command

    In workflow writing, the command specifies the literal command line run (akin to what you would run in the terminal).

  • COPDGene

    Chronic Obstructive Pulmonary Disease (COPD) Gene

  • Cost Monitoring (level)

    At the Epic level. The Coordinating Center will facilitate this process by developing reporting templates (see the example in the PM Plan, Financial Management) for distribution to the teams. The BDC teams will complete these templates and send them directly to NHLBI. Each team is responsible for tracking its finances based upon the award conditions and for providing status updates as requested to NHLBI.

  • CRAM File

    Compressed columnar file format for storing biological sequences aligned to a reference sequence. Designed as an efficient reference-based alternative to the Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats. It optionally uses a genomic reference to describe differences between the aligned sequence fragments and the reference sequence, reducing storage costs. Additionally, each column in the SAM format is separated into its own blocks, improving the compression ratio. CRAM files are typically 30 to 60% smaller than BAM, depending on the data held within them (from Wikipedia).

  • CSOC Alpha

    Common Services Operations Center (CSOC): operates cloud, commons, compliance, and security services that enable the operation of data commons; holds an Authority to Operate (ATO) and hosts the production system.

  • CSOC Beta

    Development and testing environment; real data in a pilot (not production) system that can be accessed by users.

  • Common Workflow Language (CWL)

    Simple scripting language for describing computational workflows for performing sequential operations on data. CWL is a way to describe command line tools and connect them together to create workflows. CWL is well suited for describing large-scale workflows in cluster, cloud and high performance computing environments where tasks are scheduled in parallel across many nodes.
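
    As a sketch, a minimal CWL description of a single command-line tool (here wrapping the Unix `wc -l` command; all names are illustrative):

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool

# The literal command to run: wc -l <input_file>
baseCommand: wc
arguments: ["-l"]

inputs:
  input_file:
    type: File
    inputBinding:
      position: 1     # appended after "wc -l" on the command line

# Capture the tool's standard output as the result file
stdout: line_count.txt
outputs:
  line_count:
    type: stdout
```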

  • DAC

    Data Access Committee: reviews all requests for access to human studies datasets

  • DAR

    Data Access Request

  • Data Access

    A process that involves authorization to access different data repositories; part of a User Narrative for the December 2020 release goal, and a Work Stream. PM Plan constraint: NHLBI, as the project sponsor, will identify a process to enable data access by the BDC team members and for research users.

  • Data Commons

    Provides tools, applications, and workflows to enable computing large scale data sets in secure workspaces.

  • Data Repository Service (DRS)

    Generic interface (API) to data repositories so data consumers, including workflow systems, can access data in a single, standardized way regardless of where it’s stored or how it’s managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by that ID (from GA4GH).
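
    As a small sketch of that ID-to-location mapping, the helper below builds the metadata-retrieval URL defined by the GA4GH DRS v1 specification (`/ga4gh/drs/v1/objects/{object_id}`); the server name and object ID in the example are hypothetical:

```python
def drs_object_url(base_url: str, object_id: str) -> str:
    """Build the GA4GH DRS v1 URL for retrieving an object's metadata.

    The /ga4gh/drs/v1/objects/{object_id} path comes from the DRS spec;
    the server and object ID used below are made up for illustration.
    """
    return f"{base_url.rstrip('/')}/ga4gh/drs/v1/objects/{object_id}"

# Resolve a (hypothetical) logical ID against a (hypothetical) DRS server.
url = drs_object_url("https://drs.example.org", "abc123")
```

    An HTTP GET on the resulting URL would return a JSON document describing the object, including one or more access methods for fetching the underlying bytes.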

  • Data Steward

    Members of the TOPMed and COPDGene communities who are working with BDC teams.

  • dbGaP

    Database of Genotypes and Phenotypes


  • DCPPC

    Data Commons Pilot Phase Consortium: the Other Transaction Awardees, Data Stewards, and the NIH.

  • Decision Tree

    A decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility

  • Deep Learning

    A machine learning method based on neural networks to learn from data through training to recognize patterns in the data.

  • Deliverables

    Demonstrations and products.

  • Demos

    Activities and documentation resulting from the DCPPC to build, test and demonstrate completion of goals of the Data Commons Pilot Phase.

  • DEV Environment

    Set of processes and programming tools used to create the program or software product

  • DMI

    Data Management Incident

  • Docker

    Software for running containers, packaged, portable units of code and dependencies that can be run in the same way across many computers. See also Containers.

  • Dockerfile

    A text document that contains all the commands a user could call on the command line to assemble an image.
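
    As an illustration, a minimal Dockerfile that packages a hypothetical Python analysis script (the base image, the `pysam` dependency, and `analyze.py` are placeholders, not BDC conventions):

```dockerfile
# Base image: a slim official Python image (illustrative choice)
FROM python:3.11-slim

# Install the tool's Python dependency into the image
RUN pip install --no-cache-dir pysam

# Copy the (hypothetical) analysis script into the image
COPY analyze.py /usr/local/bin/analyze.py

# Command run by default when a container starts from this image
ENTRYPOINT ["python", "/usr/local/bin/analyze.py"]
```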

  • Dockstore

    An open platform developed by the Cancer Genome Collaboratory and used by the GA4GH for sharing Docker-based tools described with the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL)

  • DOI

    Digital Object Identifier; a code used to permanently and stably identify (usually digital) objects. DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.

  • DUO

    Data Use Ontology; a GA4GH standard for automating access decisions (via API) for human genomics data.

  • DUOS

    Data Use Oversight System

  • Ecosystem

    A software ecosystem is a collection of processes that execute on a shared platform or across shared protocols to provide flexible services. Example: The "BDC Ecosystem" - inclusive of all platforms and tools

  • EEP

    External Expert Panel. A group of experts who provide guidance and direction to NIH about the program.

  • Epic

    A very large user story which can be broken down into executable stories

    *NHLBI’s cost-monitoring level

  • eRA Commons

    Designated ID provider for whitelist

  • External Expert Panel

    An independent body of experts that inform and advise the work of the BDC Consortium.

  • FAIR

    Findable Accessible Interoperable Reusable.

  • Feature

    A functionality at the system level that fulfills a meaningful stakeholder need

    *Level at which the CC coordinates

  • FireCloud

    The Broad Institute’s secure cloud environment for analytical processing.

  • FISMA moderate environment

    A computing environment that implements the moderate-impact security baseline under the Federal Information Security Modernization Act of 2014, which amends the Federal Information Security Management Act of 2002 (FISMA).

  • FS

    Full Stack

  • GA4GH

    Global Alliance for Genomics and Health

  • GA4GH APIs

    The Genomic Data Working Group is a coalition assembled to create interoperability standards for storing and sharing genomic data. The GA4GH Genomics API enables the exchange of genomic data between platforms and organizations via simple HTTP requests to a JSON-based RESTful API.

  • GCP

    Google Cloud Platform

  • GCR

    Governance, Compliance, and Risk

  • Gen3

    Open-source software, licensed under the Apache license, for setting up, developing, and operating data commons.

  • GitHub

    An online hub for storing and sharing computer programs and other plain text files. We use it for storage, hosting websites, communication and project management.

  • Gold Master

    A gold master, or GM, is the final version of software or data ready for release to production; a master version from which copies can be made.

  • GWAS

    Genome-wide Association Study

  • HLBS

    Heart, Lung, Blood, Sleep

  • Identity Providers

    A system entity that creates, maintains, and manages identity information for principals while providing authentication services to relying applications within a federation or distributed network; identity providers offer user authentication as a service

  • Interoperability

    The ability of data or tools from multiple resources to effectively integrate data, or operate processes, across all systems with a moderate degree of effort.

  • Instance

    In cloud computing, refers to a virtual server instance from a public or private cloud network.

  • Image

    In the context of containers and Docker, an image is the static, packaged form of the software (its resting state) from which running containers are created.

  • IP

    BDC Implementation Plan; outlines how the various elements from the planning phase of the BDC project will come together to form a concrete, operationalized BDC platform.

  • IRB

    Institutional Review Board; the entity within a research organization that reviews and approves research protocols and clinical research protocols to protect human and animal subjects.

  • IRC

    Informatics Research Core

  • ISA

    Interoperability Service Agreement

  • ITAC

    Information Technology Applications Center

  • Jupyter Notebooks

    A web-based interactive environment for organizing data, performing computation, and visualizing output.

  • Linux

    An open source computer operating system

  • Metadata

    Data about other data

  • Milestone

    Marks a specific progress point on the development timeline; milestones can be invaluable in measuring and monitoring the evolution and risk of a program. © Scaled Agile, Inc.

  • MSD

    Minimum set of documents

  • MVP

    Minimum viable product


  • NHLBI

    National Heart, Lung, and Blood Institute

  • NIH

    National Institutes of Health

  • NIST Moderate controls

    NIST 800-53 - A collection of security controls and assessment procedures that both U.S. Federal and non-governmental organizations can apply to their information systems, policies, and procedures.

  • OTA

    Other Transaction Authority - the mechanism of award that NHLBI chose because it provides a degree of flexibility in the scope of the work that is needed to advance this type of high risk/high reward project

  • PI

    Principal Investigator

  • Platform

    A piece of the BDC ecosystem. Examples: Terra, Gen3, Seven Bridges, etc.

  • PM

    Project Manager

  • PMP

    BDC Project Management Plan; breaks down the implementation of BDC from the perspective of the project managers involved in the project including details on roles, specific milestones, and the project schedule.

  • PO

    Program Officer

  • Portable Format for Biomedical Data (PFB)

    An Avro-based binary serialization format with a specific schema for importing, exporting, and evolving biomedical data. PFB specifies metadata and data in one file; the metadata includes the data dictionary, ontology references, and relations between nodes. Supports versioning and backward and forward compatibility.

  • Portfolio for Jira

    Software-as-a-Service project management tool, used to track, roadmap, and visualize various project metrics.

  • Python

    Open source programming language, used extensively in research for data manipulation, analysis, and modeling

  • Quality Assurance

    The planned and systematic activities implemented in quality management so that quality requirements for a product or service satisfy stated goals and expectations.

  • Quality Control

    The operational techniques and activities aimed at monitoring and measuring work processes and eliminating the causes of unsatisfactory outputs.

  • RACI

    Responsible, Accountable, Consulted and Informed; tool that can be used for identifying roles and responsibilities during an organizational change process; BDC RACI

  • Researcher Auth Service (RAS)

    A service to be provided by NIH’s Center for Information Technology to facilitate access to NIH’s open and controlled data assets and repositories in a consistent and user-friendly manner. The RAS initiative advances data infrastructure and ecosystem goals defined in the NIH Strategic Plan for Data Science.

  • RFC

    Request for Comment: A process that documents and enables effective interactions between stakeholders to support shared decision making.

  • Risk Register

    A tool used throughout the project lifecycle to continuously record identified risks, risk response plans, and status updates. The project risk register is the primary risk-reporting tool and is located in the Project Management Plan.

  • SC

    Steering Committee

  • Scientific use case

    Defined in this project as an analysis of data from the designated sources which has relevance and value in the domain of health sciences; generally implementation- and software-agnostic.

  • SF or SFP

    BDC Strategic Framework [Plan]; defines what the BDC teams have accomplished up to this point, what we plan to accomplish in a timeline fashion, and milestones to track and measure implementation.

  • SFTP

    Secure File Transfer Protocol

  • Software Developers Kit

    A set of software development tools that allows the creation of applications for a certain software package, software framework, hardware platform, computer system, or similar development platform

  • Sprints

    Term of art used in software development, referring to short, iterative cycles of development, with continuous review of code through daily builds and end-of-sprint demos.

  • Stack

    Term of art referring to a suite of services that run in the cloud and enable ubiquitous, convenient, on-demand access to a shared pool of configurable computing resources.

  • Steering Committee

    Responsible for decision-making and communication in BDC.


  • STRIDES

    Science & Technology Research Infrastructure for Discovery, Experimentation, and Sustainability

  • Task

    In workflow writing, the term task encompasses all of the information necessary to execute a command, such as specifying input/output files and parameters.

  • Team

    Groups of people led by a Principal Investigator (PI), or PIs, who will complete milestones and produce deliverables. Each group has been assigned a name, represented by an element of the periodic table.

  • Tiger Teams

    A diversified group of experts brought together to investigate, solve, build, or recommend possible solutions to unique situations or problems. Populated with mature experts who know what's at stake, what needs to be done, and how to work well with others; their strengths are diversity of knowledge, a single focus or purpose, cross-functional communications, decision-making sovereignty, and organizational agility.

  • Tool

    In Common Workflow Language, the term tool specifies a single command. This definition is not as discrete in other workflow languages such as WDL.

  • Tool Registry Service (TRS)

    The GA4GH Cloud Work Stream has released a standard API for exchanging tools and workflows to analyze, read, and manipulate genomic data. The Tool Registry Service (TRS) API is one of a series of technical standards from the Cloud Work Stream that together allow genomics researchers to bring algorithms to datasets in disparate cloud environments, rather than moving data around.

  • TOPMed

    Trans-Omics for Precision Medicine. One of the primary data sets of the DCPPC.

  • TOPMed DCC

    TOPMed Data Coordinating Center

  • Trans-cloud

    A provider-agnostic multi-cloud deployment architecture.

  • User Narrative

    Descriptions of a user interaction experience within the system from the perspective of a particular persona. User Narratives are further broken down into Features, Epics, and User Stories. Currently formulated into rough 6-month timelines to benchmark progress.

  • User story

    A description of a software feature from a technical/process-oriented perspective; a backlog item that describes a requirement or functionality for a user

    *Finest level of PM Monitoring

  • Variant Call Format (VCF)

    File format for storing gene sequence variations. The format was developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data, such as the General Feature Format (GFF), stored all of the genetic data, much of which is redundant because it is shared across genomes. Using the Variant Call Format, only the variations need to be stored, along with a reference genome. There is also a Genomic VCF (gVCF) extended format, which includes additional information about "blocks" that match the reference and their qualities (from Wikipedia).
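
    As a minimal sketch of the format, the helper below splits one tab-delimited VCF data line into its eight fixed columns (the sample line follows the VCF 4.x column order; a real parser would also handle the header and per-sample genotype columns):

```python
def parse_vcf_line(line: str) -> dict:
    """Split one tab-delimited VCF data line into its eight fixed fields."""
    fields = line.rstrip("\n").split("\t")
    return {
        "CHROM": fields[0],           # chromosome
        "POS": int(fields[1]),        # 1-based position of the variant
        "ID": fields[2],              # variant identifier (e.g. an rsID)
        "REF": fields[3],             # reference allele
        "ALT": fields[4].split(","),  # alternate allele(s)
        "QUAL": fields[5],            # phred-scaled quality
        "FILTER": fields[6],          # filter status
        "INFO": fields[7],            # semicolon-separated annotations
    }

record = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14")
```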

  • VDS

    Virtual Dedicated Server; a composite of complete server hardware and operating system (OS), powered by a remote-access layer that allows end users to access their server globally via the Internet.

  • VPC

    Virtual Private Cloud

  • Whitelist

    A security measure to permit only an approved list of entities. We recommend instead using the term "allow list".

  • Workflow

    A sequence of processes, usually computational in this context, through which a user may analyze data.

  • Workflow Description Language (WDL)

    A way to specify data processing workflows with a human-readable and writable syntax. WDL lets you define complex analysis tasks, chain them together into workflows, and parallelize their execution.
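
    For comparison with the CWL entry above, a minimal sketch of the same line-counting idea in WDL (a single task called by a workflow; all names are illustrative):

```wdl
version 1.0

# A single task: run `wc -l` on an input file and read the count back.
task count_lines {
  input {
    File infile
  }
  command {
    wc -l < ~{infile}
  }
  output {
    Int n = read_int(stdout())
  }
}

# A workflow that calls the task once and exposes its output.
workflow line_count {
  input {
    File infile
  }
  call count_lines { input: infile = infile }
  output {
    Int n = count_lines.n
  }
}
```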

  • Workspace

    Areas to work on/with data within a platform. Examples: projects within Seven Bridges

  • Workstream

    A collection of related features; orthogonal to a User Narrative

  • Wrapping

    The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.

  • Virtual Machine (VM)

    An isolated computing environment with its own operating system.
