Glossary of terms used in the context of the BioData Catalyst Consortium and platform.
Agile Development
Agile software development is an approach to software development under which requirements and solutions evolve through the collaborative effort of self-organizing and cross-functional teams and their customer(s)/end user(s).
Alpha Users
A small group of users who are more willing to tolerate working in a system that isn’t fully developed, and who provide detailed feedback and possibly some back-and-forth discussion.
[Amazon] EFS
[Amazon] Elastic File System: a simple, scalable, elastic file system for Linux-based workloads, for use with AWS Cloud services and on-premises resources.
Ambassadors
A small group of experts who represent the personas featured within the priority User Narratives. For their time and help, Ambassadors will receive early access to the BDC platform, free compute time, a monetary fee for their time, and coverage of relevant travel expenses.
App
In Seven Bridges, an app is a general term to refer to both tools and workflows.
App may also refer to persistent software that is integrated into a platform.
API
Application Programming Interface. APIs serve as software-based intermediaries for exchanging data.
AWS
Amazon Web Services. A provider of cloud services available on-demand.
BagIt
BagIt is a hierarchical file packaging format for storage and transfer of arbitrary digital content.
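As a rough illustration of the hierarchy, the sketch below writes a minimal bag: payload files under data/, a bagit.txt declaration, and a SHA-256 manifest. The function name make_bag and the sample payload are hypothetical; real bags are normally produced with dedicated BagIt tooling rather than hand-written code.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, payload):
    """Write payload files under data/ plus a bag declaration and manifest.

    `payload` maps relative paths to bytes (a hypothetical input shape).
    """
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, content in sorted(payload.items()):
        target = bag / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        # Each manifest line pairs a checksum with a payload path.
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_path}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = make_bag(tmp, {"reads/sample1.txt": b"ACGT\n"})
        for path in sorted(p for p in bag.rglob("*") if p.is_file()):
            print(path.relative_to(bag))
```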
BDC3
BioData Catalyst Coordinating Center
Beta Users
A slightly larger group than the alpha users, who are less tolerant of a difficult or clunky environment but understand that the version they are using is not polished and that they need to give feedback.
Beta-User Training
Once the platform is available to a broader audience, we will support freely-accessible online training for beta-users at any time.
Carpentries Instructor Training Program
Ambassadors attend this training program to become BDC trainers.
CCM
Change Control Management; the systematic approach to managing all changes made to a document or process. Ensures no unnecessary changes are made, all changes are documented, and a process exists for implementing approved change.
CIO
Chief Information Officer
Cloud Computing
Internet-based computing, wherein computing power, networking, storage or applications running on computers outside an organization are presented to that organization in a secure, services-oriented way.
Components
Software units that implement a specific function or functions and which can be reused.
ConOps
Concept of Operations
Consortium
A collection of teams and stakeholders working to deliver on the common goals of integrated and advanced cyberinfrastructure, leading-edge data management and analysis tools, FAIR data, and HLBS researcher engagement.
Containers
A standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another (for example, Docker).
Command
In workflow writing, the command specifies the literal command line run (akin to what you would run in the terminal).
COPDGene
Chronic Obstructive Pulmonary Disease (COPD) Gene
Cost Monitoring (level)
At the Epic level, the Coordinating Center will facilitate this process by developing reporting templates (see example in PM Plan, Financial Management) for distribution to the teams. The BDC teams will complete these templates and send them directly to NHLBI. Each team is responsible for tracking its finances based upon the award conditions and for providing status updates to NHLBI as requested.
CRAM File
Compressed columnar file format for storing biological sequences aligned to a reference sequence. Designed to be an efficient reference-based alternative to the Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats. It optionally uses a genomic reference to describe differences between the aligned sequence fragments and the reference sequence, reducing storage costs. Additionally, each column in the SAM format is separated into its own blocks, improving the compression ratio. CRAM files are typically 30 to 60% smaller than BAM files, depending on the data held within them (from Wikipedia).
CSOC Alpha
Common Services Operations Center (CSOC): operates cloud, commons, compliance and security services that enable the operation of data commons; has ATO and hosts production system.
CSOC Beta
Development/testing; Real data in pilot (not production) that can be accessed by users
Common Workflow Language (CWL)
Simple scripting language for describing computational workflows for performing sequential operations on data. CWL is a way to describe command line tools and connect them together to create workflows. CWL is well suited for describing large-scale workflows in cluster, cloud and high performance computing environments where tasks are scheduled in parallel across many nodes.
DAC
Data Access Committee: reviews all requests for access to human studies datasets
DAR
Data Access Request
Data Access
A process that involves authorization to access different data repositories; part of a User Narrative for the December 2020 release goal. A Work Stream PM Plan constraint: NHLBI, as the project sponsor, will identify a process to enable data access by BDC team members and by research users.
Data Commons
Provides tools, applications, and workflows to enable computing on large-scale data sets in secure workspaces.
Data Repository Service (DRS)
Generic interface (API) to data repositories so data consumers, including workflow systems, can access data in a single, standardized way regardless of where it’s stored or how it’s managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by the ID (from GA4GH).
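The ID-to-location mapping can be pictured with a small in-memory sketch. This is not the GA4GH API itself: the registry, the object ID, the URLs, and the resolve function are all hypothetical stand-ins for what a real DRS server exposes over HTTP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessMethod:
    """One way to physically retrieve an object (hypothetical, simplified)."""
    type: str  # e.g. "s3" or "https"
    url: str


# Hypothetical registry: logical ID -> ways to fetch the underlying bytes.
REGISTRY = {
    "drs-object-001": [
        AccessMethod("s3", "s3://example-bucket/sample1.cram"),
        AccessMethod("https", "https://example.org/data/sample1.cram"),
    ],
}


def resolve(object_id, preferred_type=None):
    """Return an access method for a logical ID, optionally filtered by type."""
    methods = REGISTRY.get(object_id)
    if not methods:
        raise KeyError(f"unknown object ID: {object_id}")
    if preferred_type is None:
        return methods[0]
    for method in methods:
        if method.type == preferred_type:
            return method
    raise LookupError(f"no {preferred_type!r} access method for {object_id}")


if __name__ == "__main__":
    print(resolve("drs-object-001", "https").url)
```

The consumer only ever handles the logical ID; where the data lives can change without the ID changing.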
Data Steward
Members of the TOPMed and COPDGene communities who are working with BDC teams.
dbGaP
Database of Genotypes and Phenotypes
DCPPC
Data Commons Pilot Phase Consortium. The Other Transaction Awardees, Data Stewards, and the NIH.
Decision Tree
A decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility
Deep Learning
A machine learning method based on neural networks to learn from data through training to recognize patterns in the data.
Deliverables
Demonstrations and products.
Demos
Activities and documentation resulting from the DCPPC to build, test and demonstrate completion of goals of the Data Commons Pilot Phase.
DEV Environment
Set of processes and programming tools used to create the program or software product
DMI
Data Management Incident
Docker
Software for running containers: packaged, portable units of code and dependencies that can be run in the same way across many computers. See also Containers.
Dockerfile
A text document that contains all the commands a user could call on the command line to assemble an image.
Dockstore
An open platform developed by the Cancer Genome Collaboratory and used by the GA4GH for sharing Docker-based tools described with the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL)
DOI
Digital Object Identifier; a code used to permanently and stably identify (usually digital) objects. DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.
DUO
Data Use Ontology - a GA4GH standard for automating access (API) to human genomics data (https://github.com/EBISPOT/DUO)
DUOS
Data Use Oversight System, https://duos.broadinstitute.org/
Ecosystem
A software ecosystem is a collection of processes that execute on a shared platform or across shared protocols to provide flexible services. Example: The "BDC Ecosystem" - inclusive of all platforms and tools
EEP
External Expert Panel. A group of experts who provide guidance and direction to NIH about the program.
Epic
A very large user story which can be broken down into executable stories
*NHLBI’s cost-monitoring level
eRA Commons
Designated ID provider for whitelist
External Expert Panel
An independent body of experts that inform and advise the work of the BDC Consortium.
FAIR
Findable, Accessible, Interoperable, Reusable.
Feature
A functionality at the system level that fulfills a meaningful stakeholder need
*Level at which the CC coordinates
FireCloud
Broad Institute secure cloud environment for analytical processing, https://software.broadinstitute.org/firecloud/
FISMA moderate environment
Federal Information Security Modernization Act of 2014, amends the Federal Information Security Management Act of 2002 (FISMA), see https://www.dhs.gov/fisma
FS
Full Stack
GA4GH
Global Alliance for Genomics and Health
GA4GH APIs
The Genomic Data Working Group is a coalition assembled to create interoperability standards for storing and sharing genomic data. The GA4GH Genomics API offers interoperability for exchanging genomic data between various platforms and organizations by sending simple HTTP requests through a JSON-equipped RESTful API.
GCP
Google Cloud Platform
GCR
Governance, Compliance, and Risk
Gen3
An open-source platform, licensed under the Apache license, for setting up, developing, and operating data commons.
GitHub
An online hub for storing and sharing computer programs and other plain text files. We use it for storage, hosting websites, communication and project management.
Gold Master
A gold master, or GM, is the final version of software or data ready for release to production; a master version from which copies can be made.
GWAS
Genome-wide Association Study
HLBS
Heart, Lung, Blood, Sleep
Identity Providers
A system entity that creates, maintains, and manages identity information for principals while providing authentication services to relying applications within a federation or distributed network; identity providers offer user authentication as a service
Interoperability
The ability of data or tools from multiple resources to effectively integrate data, or operate processes, across all systems with a moderate degree of effort.
Instance
In cloud computing, refers to a virtual server instance from a public or private cloud network.
Image
In the context of containers and Docker, this refers to the resting state of the software.
IP
BDC Implementation Plan; outlines how the various elements from the planning phase of the BDC project will come together to form a concrete, operationalized BDC platform.
IRB
Institutional Review Board; the entity within a research organization that reviews and approves research protocols and clinical research protocols to protect human and animal subjects.
IRC
Informatics Research Core
ISA
Interoperability Service Agreement
ITAC
Information Technology Applications Center
Jupyter Notebooks
A web-based interactive environment for organizing data, performing computation, and visualizing output.
Linux
An open source computer operating system
Metadata
Data about other data
Milestone
Milestones mark specific progress points on the development timeline, and they can be invaluable in measuring and monitoring the evolution and risk of a program. © Scaled Agile, Inc.
MSD
Minimum set of documents
MVP
Minimum viable product
NHLBI
National Heart, Lung, and Blood Institute
NIH
National Institutes of Health
NIST Moderate controls
NIST 800-53 - A collection of security controls and assessment procedures that both U.S. Federal and non-governmental organizations can apply to their information systems, policies, and procedures.
OTA
Other Transaction Authority - the mechanism of award that NHLBI chose because it provides a degree of flexibility in the scope of the work that is needed to advance this type of high risk/high reward project
PI
Principal Investigator
Platform
A piece of the BDC ecosystem. Examples: Terra, Gen3, Seven Bridges, etc.
PM
Project Manager
PMP
BDC Project Management Plan; breaks down the implementation of BDC from the perspective of the project managers involved in the project including details on roles, specific milestones, and the project schedule.
PO
Program Officer
Portable Format for Biomedical Data (PFB)
Avro-based serialization format with a specific schema to import, export, and evolve biomedical data. Specifies metadata and data in one file. Metadata includes the data dictionary, ontology references, and relations between nodes. Supports versioning and backward and forward compatibility. A binary format.
Portfolio for Jira
Software-as-a-Service project management tool, used to track, roadmap, and visualize various project metrics.
Python
Open source programming language, used extensively in research for data manipulation, analysis, and modeling
Quality Assurance
The planned and systematic activities implemented in quality management so that quality requirements for a product or service satisfy stated goals and expectations.
Quality Control
The operational techniques and activities aimed at monitoring and measuring work processes and eliminating the causes of unsatisfactory outputs.
RACI
Responsible, Accountable, Consulted and Informed; tool that can be used for identifying roles and responsibilities during an organizational change process; BDC RACI
Researcher Auth Service (RAS)
A service to be provided by NIH's Center for Information Technology to facilitate access to NIH’s open and controlled data assets and repositories in a consistent and user-friendly manner. The RAS initiative is advancing data infrastructure and ecosystem goals defined in the NIH Strategic Plan for Data Science.
RFC
Request for Comment: A process that documents and enables effective interactions between stakeholders to support shared decision making.
Risk Register
A tool used to continuously identify risks and to record risk response planning and status updates throughout the project lifecycle. This project risk register is the primary risk reporting tool, and is located in the Project Management Plan.
SC
Steering Committee
Scientific use case
Defined in this project as an analysis of data from the designated sources which has relevance and value in the domain of health sciences, probably implementation and software agnostic.
SF or SFP
BDC Strategic Framework [Plan]; defines what the BDC teams have accomplished up to this point, what we plan to accomplish in a timeline fashion, and milestones to track and measure implementation.
SFTP
Secure File Transfer Protocol
Software Development Kit (SDK)
A set of software development tools that allows the creation of applications for a certain software package, software framework, hardware platform, computer system, or similar development platform
Sprints
Term of art used in software generation, referring to short, iterative cycles of development, with continuous review of code through daily builds and end-of-sprint demos
Stack
Term of art referring to a suite of services that run in the cloud and enable ubiquitous, convenient, on-demand access to a shared pool of configurable computing resources.
Steering Committee
Responsible for decision-making and communication in BDC.
STRIDES
Science & Technology Research Infrastructure for Discovery, Experimentation, and Sustainability
Task
In workflow writing, the term task encompasses all of the information necessary to execute a command, such as specifying input/output files and parameters.
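The relationship between a task and its command can be sketched in a few lines. The Task class below is a hypothetical illustration, not the syntax of CWL, WDL, or any workflow engine: it bundles a command template with the inputs and outputs needed to fill it in.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class Task:
    """A command template plus everything needed to run it (hypothetical)."""
    name: str
    command: str  # literal command line with {placeholders}
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)

    def render(self):
        """Fill the template, shell-quoting every value."""
        values = {
            key: shlex.quote(str(value))
            for key, value in {**self.inputs, **self.outputs}.items()
        }
        return self.command.format(**values)


if __name__ == "__main__":
    task = Task(
        name="count_lines",
        command="wc -l {infile} > {outfile}",
        inputs={"infile": "reads.txt"},
        outputs={"outfile": "count.txt"},
    )
    print(task.render())  # prints: wc -l reads.txt > count.txt
```

Here the command is only the rendered string; the task is the surrounding record that says which files and parameters that string depends on.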
Team
Groups of people led by a Principal Investigator (PI), or PIs, who will complete milestones and produce deliverables. Each group has been assigned a name taken from the elements of the periodic table.
Tiger Teams
A diversified group of experts brought together to investigate, solve, build, or recommend possible solutions to unique situations or problems. Populated with mature experts who know what's at stake, what needs to be done, and how to work well with others; their strengths are diversity of knowledge, a single focus or purpose, cross-functional communications, decision-making sovereignty, and organizational agility.
Tool
In Common Workflow Language, the term tool specifies a single command. This definition is not as discrete in other workflow languages such as WDL.
Tool Registry Service (TRS)
The GA4GH Cloud Work Stream has released a standard API for exchanging tools and workflows to analyze, read, and manipulate genomic data. The Tool Registry Service (TRS) API is one of a series of technical standards from the Cloud Work Stream that together allow genomics researchers to bring algorithms to datasets in disparate cloud environments, rather than moving data around.
TOPMed
Trans-Omics for Precision Medicine. One of the primary data sets of the DCPPC.
TOPMed DCC
TOPMed Data Coordinating Center
Trans-cloud
A provider-agnostic multi-cloud deployment architecture.
User Narrative
Descriptions of a user interaction experience within the system from the perspective of a particular persona. User Narratives are further broken down into Features, Epics, and User Stories. Currently formulated into rough 6-month timelines to benchmark progress.
User story
A description of a software feature from a technical/process-oriented perspective; a backlog item that describes a requirement or functionality for a user
*Finest level of PM Monitoring
Variant Call Format (VCF)
File format for storing gene sequence variations. The format has been developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data such as General feature format (GFF) stored all of the genetic data, much of which is redundant because it will be shared across the genomes. By using the variant call format only the variations need to be stored along with a reference genome. There is also a Genomic VCF (gVCF) extended format, which includes additional information about "blocks" that match the reference and their qualities (from Wikipedia). See http://www.internationalgenome.org/wiki/Analysis/vcf4.0/.
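The idea of storing only the variations alongside a reference can be sketched as follows. This is a toy model, not the VCF format: it handles single-base substitutions only, and the reference string, variant tuples, and function names are hypothetical.

```python
REFERENCE = "ACGTACGTAC"

# Hypothetical records: (1-based position, reference base, alternate base).
VARIANTS = [(3, "G", "T"), (8, "T", "A")]


def apply_variants(reference, variants):
    """Rebuild a sample sequence from a reference plus substitutions."""
    bases = list(reference)
    for pos, ref, alt in variants:
        if bases[pos - 1] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        bases[pos - 1] = alt
    return "".join(bases)


def call_variants(reference, sample):
    """List single-base differences between two equal-length sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (index + 1, ref, alt)
        for index, (ref, alt) in enumerate(zip(reference, sample))
        if ref != alt
    ]


if __name__ == "__main__":
    sample = apply_variants(REFERENCE, VARIANTS)
    print(sample)  # prints: ACTTACGAAC
    print(call_variants(REFERENCE, sample) == VARIANTS)  # prints: True
```

Two short records replace a full copy of the sample sequence, which is the saving the definition describes.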
VDS
Virtual Dedicated Server; a composite of complete server hardware, along with the operating system (OS), powered by a remote access layer that allows end users to access their server globally via the Internet.
VPC
Virtual Private Cloud
Whitelist
A security measure to permit only an approved list of entities. We recommend instead using the term "allow list".
Workflow
A sequence of processes, usually computational in this context, through which a user may analyze data.
Workflow Description Language (WDL)
Way to specify data processing workflows with a human-readable and writeable syntax. Define complex analysis tasks, chain them together in workflows, and parallelize their execution.
Workspace
Areas to work on/with data within a platform. Examples: projects within Seven Bridges
Workstream
A collection of related features; orthogonal to a User Narrative
Wrapping
The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.
Virtual Machine (VM)
An isolated computing environment with its own operating system.