Agile software development is an approach to software development under which requirements and solutions evolve through the collaborative effort of self-organizing and cross-functional teams and their customer(s)/end user(s).
A small group of users who are willing to tolerate working in a system that is not fully developed, providing detailed feedback and engaging in occasional back-and-forth discussions.
[Amazon] Elastic File System: a simple, scalable, elastic file system for Linux-based workloads for use with AWS Cloud services and on-premises resources.
A small group of experts that represent the personas featured within the priority User Narratives. For their time and help, Ambassadors will receive early access to the BioData Catalyst platform, free compute time, a monetary fee for their time, and coverage of relevant travel expenses.
In Seven Bridges, an app is a general term to refer to both tools and workflows.
App may also refer to persistent software that is integrated into a platform.
Application Programming Interfaces. APIs serve as software-based intermediaries for exchanging data.
Amazon Web Services. A provider of cloud services available on-demand.
BagIt is a hierarchical file packaging format for storage and transfer of arbitrary digital content.
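A minimal sketch of the BagIt layout (payload under `data/`, a `bagit.txt` declaration, and a checksum manifest), written here as an illustrative Python function; the file names follow the BagIt convention, but the helper itself is hypothetical, not part of any BioData Catalyst tooling:

```python
import hashlib
import os
import tempfile

def make_bag(bag_dir, payload):
    """Create a minimal BagIt bag: payload files under data/,
    a bagit.txt declaration, and an MD5 payload manifest."""
    data_dir = os.path.join(bag_dir, "data")
    os.makedirs(data_dir, exist_ok=True)
    # Payload files live under the data/ directory.
    for name, content in payload.items():
        with open(os.path.join(data_dir, name), "w") as f:
            f.write(content)
    # bagit.txt declares the BagIt version and tag-file encoding.
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
    # manifest-md5.txt lists a checksum for every payload file.
    with open(os.path.join(bag_dir, "manifest-md5.txt"), "w") as f:
        for name in sorted(payload):
            digest = hashlib.md5(payload[name].encode()).hexdigest()
            f.write(f"{digest}  data/{name}\n")

bag = tempfile.mkdtemp()
make_bag(bag, {"readme.txt": "hello"})
```

Because every payload file is checksummed in the manifest, a receiver can verify that a transferred bag arrived intact.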
BioData Catalyst Coordinating Center
A slightly larger group than the alpha users, who are less tolerant of a difficult or clunky environment but understand that the version they are using is not polished and that their feedback is needed.
Once the platform is available to a broader audience, we will support freely-accessible online training for beta-users at any time.
Carpentries Instructor Training Program
Ambassadors attend this training program to become BioData Catalyst trainers.
Change Control Management; the systematic approach to managing all changes made to a document or process. Ensures no unnecessary changes are made, all changes are documented, and a process exists for implementing approved changes.
Chief Information Officer
Internet-based computing, wherein computing power, networking, storage or applications running on computers outside an organization are presented to that organization in a secure, services-oriented way.
Software units that implement a specific function or functions and which can be reused.
Concept of Operations
A collection of teams and stakeholders working to deliver on the common goals of integrated and advanced cyberinfrastructure, leading-edge data management and analysis tools, FAIR data, and HLBS researcher engagement.
A standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another (for example, Docker).
In workflow writing, the command specifies the literal command line run (akin to what you would run in the terminal).
Chronic Obstructive Pulmonary Disease (COPD) Gene
Cost Monitoring (level)
At the Epic level, the Coordinating Center will facilitate this process by developing reporting templates (see example in PM Plan, Financial Management) for distribution to the teams. The BioData Catalyst teams will complete these templates and send them directly to NHLBI. Each team is responsible for tracking its finances based upon the award conditions and for providing status updates as requested to NHLBI.
CRAM File: a compressed columnar file format for storing biological sequences aligned to a reference sequence, designed as an efficient reference-based alternative to the Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats. It optionally uses a genomic reference to describe differences between the aligned sequence fragments and the reference sequence, reducing storage costs. Additionally, each column in the SAM format is separated into its own blocks, improving the compression ratio. CRAM files are typically 30 to 60% smaller than BAM files, depending on the data held within them (from Wikipedia).
Common Services Operations Center (CSOC): operates cloud, commons, compliance, and security services that enable the operation of data commons; has an ATO and hosts the production system.
Development/testing; real data in a pilot (not production) environment that can be accessed by users.
Common Workflow Language (CWL)
Simple scripting language for describing computational workflows that perform sequential operations on data. CWL is a way to describe command-line tools and connect them together to create workflows. CWL is well suited for describing large-scale workflows in cluster, cloud, and high-performance computing environments where tasks are scheduled in parallel across many nodes.
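As a minimal illustration, a CWL CommandLineTool description wraps a single command; the sketch below counts lines in an input file with `wc -l` (the input and output names are illustrative, not from any BioData Catalyst workflow):

```cwl
#!/usr/bin/env cwl-runner
# Minimal CommandLineTool: count lines in an input file.
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [wc, -l]
inputs:
  in_file:
    type: File
    inputBinding:
      position: 1
outputs:
  line_count:
    type: stdout
stdout: count.txt
```

A workflow document (class: Workflow) can then connect several such tool descriptions by wiring one tool's outputs to another's inputs.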
Data Access Committee: reviews all requests for access to human studies datasets
DAR: Data Access Request
A process that involves authorization to access different data repositories; part of a User Narrative for the December 2020 release goal. A Work Stream PM Plan constraint: NHLBI, as the project sponsor, will identify a process to enable data access by BioData Catalyst team members and by research users.
Provides tools, applications, and workflows to enable computing large scale data sets in secure workspaces.
Data Repository Service (DRS): a generic interface (API) to data repositories so data consumers, including workflow systems, can access data in a single, standardized way regardless of where it’s stored or how it’s managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by the ID (from GA4GH).
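A small sketch of that ID-to-location mapping, assuming the hostname-based DRS URI form defined in the GA4GH DRS v1 specification (`drs://<host>/<id>` resolves to a `GET /ga4gh/drs/v1/objects/<id>` request); the host and ID below are made up for illustration:

```python
from urllib.parse import urlparse

def drs_to_https(drs_uri):
    """Map a hostname-based DRS URI to the HTTPS URL of the
    corresponding GET /objects request (GA4GH DRS v1)."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs":
        raise ValueError("not a DRS URI")
    object_id = parsed.path.lstrip("/")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"

print(drs_to_https("drs://example.org/314159"))
# https://example.org/ga4gh/drs/v1/objects/314159
```

The JSON object returned by that request includes one or more access methods, which is how a workflow engine turns the logical ID into bytes it can read.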
Members of the TOPMed and COPDGene communities who are working with BioData Catalyst teams.
Database of Genotypes and Phenotypes
Data Commons Pilot Phase Consortium. The Other Transaction Awardees, Data Stewards, and the NIH.
A decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
A machine learning method based on neural networks that learn from data through training to recognize patterns in the data.
Demonstrations and products.
Activities and documentation resulting from the DCPPC to build, test and demonstrate completion of goals of the Data Commons Pilot Phase.
Set of processes and programming tools used to create the program or software product
Data Management Incident
Software for running containers, packaged, portable units of code and dependencies that can be run in the same way across many computers. See also Containers.
A text document that contains all the commands a user could call on the command line to assemble an image.
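For example, a minimal Dockerfile might look like the following sketch (the `analyze.py` script is hypothetical; each instruction becomes a layer of the resulting image):

```docker
# Minimal illustration: package Python and one script into an image.
FROM python:3.11-slim
WORKDIR /app
COPY analyze.py .
ENTRYPOINT ["python", "analyze.py"]
```

Running `docker build` on a directory containing this file and `analyze.py` assembles the image; `docker run` then launches a container from it.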
An open platform developed by the Cancer Genome Collaboratory and used by the GA4GH for sharing Docker-based tools described with the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL)
Digital Object Identifier; a code used to permanently and stably identify (usually digital) objects. DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.
Data Use Ontology - a GA4GH standard for automating access (API) to human genomics data (https://github.com/EBISPOT/DUO)
Data Use Oversight System, https://duos.broadinstitute.org/
A software ecosystem is a collection of processes that execute on a shared platform or across shared protocols to provide flexible services. Example: The "BioData Catalyst Ecosystem" - inclusive of all platforms and tools
External Expert Panel. A group of experts who provide guidance and direction to NIH about the program.
A very large user story which can be broken down into executable stories
*NHLBI’s cost-monitoring level
Designated ID provider for whitelist
External Expert Panel
An independent body of experts that inform and advise the work of the BioData Catalyst Consortium.
Findable, Accessible, Interoperable, Reusable.
A functionality at the system level that fulfills a meaningful stakeholder need
*Level at which the CC coordinates
Broad Institute secure cloud environment for analytical processing, https://software.broadinstitute.org/firecloud/
FISMA moderate environment
Federal Information Security Modernization Act of 2014, amends the Federal Information Security Management Act of 2002 (FISMA), see https://www.dhs.gov/fisma
Global Alliance for Genomics and Health
The Genomic Data Working Group is a coalition assembled to create interoperability standards for storing and sharing genomic data. The GA4GH Genomics API offers interoperability for exchanging genomic data between platforms and organizations by sending simple HTTP requests through a JSON-equipped RESTful API.
Google Cloud Platform
Governance, Compliance, and Risk
Gen3 is open-source software, licensed under the Apache license, that you can use to set up, develop, and operate data commons.
An online hub for storing and sharing computer programs and other plain text files. We use it for storage, hosting websites, communication and project management.
A gold master, or GM, is the final version of software or data ready for release to production; a master version from which copies can be made.
Genome-wide Association Study
Heart, Lung, Blood, Sleep
A system entity that creates, maintains, and manages identity information for principals while providing authentication services to relying applications within a federation or distributed network; identity providers offer user authentication as a service
The ability of data or tools from multiple resources to effectively integrate data, or operate processes, across all systems with a moderate degree of effort.
In cloud computing, refers to a virtual server instance from a public or private cloud network.
In the context of containers and Docker, an image is the software at rest: a packaged snapshot from which running containers are launched.
BioData Catalyst Implementation Plan; outlines how the various elements from the planning phase of the BioData Catalyst project will come together to form a concrete, operationalized BioData Catalyst platform.
Institutional Review Board; the entity within a research organization that reviews and approves research protocols and clinical research protocols to protect human and animal subjects.
Informatics Research Core
Interoperability Service Agreement
Information Technology Applications Center
A web-based interactive environment for organizing data, performing computation, and visualizing output.
An open source computer operating system
Data about other data
Marks specific progress points on the development timeline; milestones can be invaluable in measuring and monitoring the evolution and risk of a program. © Scaled Agile, Inc.
Minimum set of documents
Minimum viable product
National Heart, Lung, and Blood Institute
National Institutes of Health
NIST Moderate controls
NIST 800-53 - A collection of security controls and assessment procedures that both U.S. Federal and non-governmental organizations can apply to their information systems, policies, and procedures.
Other Transaction Authority - the mechanism of award that NHLBI chose because it provides a degree of flexibility in the scope of the work that is needed to advance this type of high risk/high reward project
A piece of the BioData Catalyst ecosystem. Examples: Terra, Gen3, Seven Bridges, etc.
BioData Catalyst Project Management Plan; breaks down the implementation of BioData Catalyst from the perspective of the project managers involved in the project including details on roles, specific milestones, and the project schedule.
Portable Format for Biomedical Data (PFB): an Avro-based serialization format with a specific schema to import, export, and evolve biomedical data. Specifies metadata and data in one file; metadata includes the data dictionary, ontology references, and relations between nodes. Supports versioning and backward and forward compatibility. A binary format.
Portfolio for Jira
Software-as-a-Service project management tool, used to track, roadmap, and visualize various project metrics.
Open source programming language, used extensively in research for data manipulation, analysis, and modeling
The planned and systematic activities implemented in quality management so that quality requirements for a product or service satisfy stated goals and expectations.
The operational techniques and activities aimed at monitoring and measuring work processes and eliminating the causes of unsatisfactory outputs.
Responsible, Accountable, Consulted and Informed; tool that can be used for identifying roles and responsibilities during an organizational change process; BioData Catalyst RACI
Researcher Auth Service (RAS): a service to be provided by NIH's Center for Information Technology to facilitate access to NIH’s open and controlled data assets and repositories in a consistent and user-friendly manner. The RAS initiative advances data infrastructure and ecosystem goals defined in the NIH Strategic Plan for Data Science.
Request for Comment: A process that documents and enables effective interactions between stakeholders to support shared decision making.
A tool used to continuously identify risks, plan risk responses, and record status updates throughout the project lifecycle. The project risk register is the primary risk reporting tool and is located in the Project Management Plan.
Scientific use case
Defined in this project as an analysis of data from the designated sources that has relevance and value in the domain of health sciences, and is ideally implementation- and software-agnostic.
SF or SFP
BioData Catalyst Strategic Framework [Plan]; defines what the BioData Catalyst teams have accomplished to date, what we plan to accomplish on a timeline, and milestones to track and measure implementation.
Secure File Transfer Protocol
Software Developers Kit
A set of software development tools that allows the creation of applications for a certain software package, software framework, hardware platform, computer system, or similar development platform
Term of art used in software generation, referring to short, iterative cycles of development, with continuous review of code through daily builds and end-of-sprint demos
Term of art referring to a suite of services that run in the cloud and enable ubiquitous, convenient, on-demand access to a shared pool of configurable computing resources.
Responsible for decision-making and communication in BioData Catalyst.
Science & Technology Research Infrastructure for Discovery, Experimentation, and Sustainability
In workflow writing, the term task encompasses all of the information necessary to execute a command, such as specifying input/output files and parameters.
Groups of people led by a Principal Investigator (PI), or PIs, who will complete milestones and produce deliverables. Each group has been assigned a name represented by an element on the periodic table.
A diversified group of experts brought together to investigate, solve, build, or recommend possible solutions to unique situations or problems. Populated with mature experts who know what's at stake, what needs to be done, and how to work well with others; their strengths are diversity of knowledge, a single focus or purpose, cross-functional communications, decision-making sovereignty, and organizational agility.
In Common Workflow Language, the term tool specifies a single command. This definition is not as discrete in other workflow languages such as WDL.
Tool Registry Service (TRS)
The GA4GH Cloud Work Stream has released a standard API for exchanging tools and workflows to analyze, read, and manipulate genomic data. The Tool Registry Service (TRS) API is one of a series of technical standards from the Cloud Work Stream that together allow genomics researchers to bring algorithms to datasets in disparate cloud environments, rather than moving data around.
Trans-Omics for Precision Medicine. One of the primary data sets of the DCPPC.
TOPMed Data Coordinating Center
A provider-agnostic multi-cloud deployment architecture.
Descriptions of a user interaction experience within the system from the perspective of a particular persona. User Narratives are further broken down into Features, Epics, and User Stories. Currently formulated into rough 6-month timelines to benchmark progress.
A description of a software feature from a technical/process-oriented perspective; a backlog item that describes a requirement or functionality for a user
*Finest level of PM Monitoring
Variant Call Format (VCF)
File format for storing gene sequence variations. The format was developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data, such as General Feature Format (GFF), stored all of the genetic data, much of which is redundant because it is shared across genomes. With VCF, only the variations need to be stored, along with a reference genome. There is also a Genomic VCF (gVCF) extended format, which includes additional information about "blocks" that match the reference and their qualities (from Wikipedia). See http://www.internationalgenome.org/wiki/Analysis/vcf4.0/.
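A minimal illustrative VCF fragment is shown below (columns are tab-delimited in a real file; spaces are used here for readability, and the record is an example, not drawn from any BioData Catalyst dataset): `##` lines are metadata, the `#CHROM` line names the columns, and each following line describes one variant relative to the reference.

```
##fileformat=VCFv4.0
##reference=GRCh38
#CHROM  POS    ID           REF  ALT  QUAL  FILTER  INFO
chr1    10177  rs367896724  A    AC   100   PASS    AF=0.425
```

This record says that at position 10177 on chr1, the reference allele `A` has an insertion alternate `AC` with an allele frequency of 0.425.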
A composite of complete server hardware and operating system (OS), powered by a remote access layer that allows end users to access their server globally via the Internet.
Virtual Private Cloud
A security measure to permit only an approved list of entities. We recommend instead using the term "allow list".
A sequence of processes, usually computational in this context, through which a user may analyze data.
Workflow Description Language (WDL): a way to specify data processing workflows with a human-readable and writable syntax. Define complex analysis tasks, chain them together in workflows, and parallelize their execution.
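A minimal sketch of those pieces in WDL 1.0, with one task (the command and its inputs/outputs) called from a workflow; the task, names, and Docker image are illustrative only:

```wdl
version 1.0

# Illustrative task: count the lines in a file.
task count_lines {
  input {
    File in_file
  }
  command <<<
    wc -l < ~{in_file}
  >>>
  output {
    Int n = read_int(stdout())
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}

# The workflow wires inputs to task calls and exposes outputs.
workflow line_count {
  input {
    File in_file
  }
  call count_lines { input: in_file = in_file }
  output {
    Int n = count_lines.n
  }
}
```

An engine such as Cromwell executes the workflow, scheduling independent task calls in parallel where the dependency graph allows.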
Areas to work on/with data within a platform. Examples: projects within Seven Bridges
A collection of related features; orthogonal to a User Narrative
The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.
Virtual Machine (VM)
An isolated computing environment with its own operating system.