2021-10-04 BioData Catalyst Ecosystem Release Notes

Introduction

The 2021-10-04 release marks the seventh release for the NHLBI BioData Catalyst ecosystem. This release includes several new features (e.g., project cost reporting on Terra and archiving files on AWS) along with documentation and tutorials (e.g., estimating and managing cloud costs) to help new users get started on the system. This release also includes enhanced support for semantic search and R Shiny apps. Please find more detail on the new features and user support materials in the sections below.

The 2021-10-04 data release includes the addition of the final BioLINCC training dataset plus another BioLINCC study, BabyHug. The TOPMed Combined Exchange Area buckets were updated with more datasets from multiple new freezes. The last dataset ingested was PCGC’s CMG. Please refer to the Data Releases section below for more information, as well as the Data page on the BioData Catalyst website.

Significant new features

Updated Semantic Search UI: Dug, BioData Catalyst's semantic search tool, has an updated user interface. The new interface shows more results on one page. A zoom feature lets users expand individual results to explore them in greater detail. Provenance in knowledge graphs and links to published literature are presented where available.

Archive files on AWS: Users on BioData Catalyst Powered by Seven Bridges can now select files to move from AWS S3 storage to AWS Glacier (archival storage). Moving files to archival storage can result in an ~80% cost reduction. It’s recommended that users move files to archival storage if the files will not be used for three or more months.
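To make the ~80% figure concrete, here is a minimal sketch of the savings arithmetic. The function name, the example monthly cost, and the flat 80% default are illustrative assumptions rather than platform values, and the estimate ignores any retrieval or early-deletion charges that archival storage may incur.

```python
def archival_savings(standard_monthly_cost: float, months: int,
                     reduction: float = 0.80) -> float:
    """Estimate storage savings from archiving files for `months` months.

    `reduction` is the fractional price reduction of archival storage
    relative to standard storage (assumed ~0.80, per the release note).
    """
    if standard_monthly_cost < 0 or months < 0:
        raise ValueError("cost and months must be non-negative")
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return standard_monthly_cost * reduction * months


# Hypothetical example: files that cost $50/month in standard storage,
# left untouched in archival storage for 6 months.
savings = archival_savings(50.0, 6)
print(f"Estimated savings over 6 months: ${savings:.2f}")
```

Actual savings depend on current storage pricing and on how often archived files need to be restored.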

Project Per Work Space (PPWS) Cost Reporting on Terra: Users on BioData Catalyst Powered by Terra will now have more transparency and access to cost information with the rollout of PPWS. This update associates each Terra workspace with its own Google Project, created by Terra on behalf of users when workspaces are created. Switching to this “project-per-workspace” model enables a breakdown of costs per workspace in the Terra user interface, and allows Terra users to set up and use GCP budget alerts to be notified of cloud spending. This change applies only to newly created workspaces, with plans to migrate existing workspaces to this model in the future.

Try out R Shiny apps in Terra: Since the rollout of RStudio and Bioconductor last quarter, Terra’s Interactive Analysis team has expanded the capabilities of the cloud environments framework that supports running RStudio, Jupyter Notebook, and Galaxy in Terra. Terra users can now launch R Shiny apps from Terra’s built-in RStudio environment. Check out an example of an open-source R Shiny app developed by the Manning Lab to visualize whole-genome association data.

Save data from an IA environment: With the new R Shiny apps in Terra, users can save data from an IA environment. Saving data from an interactive cloud environment (such as an instance of RStudio or a Jupyter notebook) is useful in some situations. Users who are worried about losing work done in an interactive environment because they need to delete or modify the persistent disk can use "gsutil" to copy that work to the workspace bucket.
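As a rough illustration of the "gsutil" copy described above, the sketch below builds a recursive copy command from Python. The helper name, the `ia-backup` prefix, the `analysis_outputs` directory, and the fallback bucket URL are hypothetical, and reading the bucket from a `WORKSPACE_BUCKET` environment variable is an assumption about the environment that should be checked in your own workspace.

```python
import os
import shlex
from typing import List


def build_backup_command(local_path: str, bucket: str,
                         prefix: str = "ia-backup") -> List[str]:
    """Return the gsutil argument list for a recursive copy to the bucket."""
    if not bucket.startswith("gs://"):
        raise ValueError("bucket must be a gs:// URL")
    destination = f"{bucket.rstrip('/')}/{prefix.strip('/')}/"
    # -m parallelizes the transfer; cp -r copies the directory recursively.
    return ["gsutil", "-m", "cp", "-r", local_path, destination]


# Assumption: the workspace bucket is exposed as WORKSPACE_BUCKET;
# the fallback value is a placeholder, not a real bucket.
bucket = os.environ.get("WORKSPACE_BUCKET", "gs://example-workspace-bucket")
command = build_backup_command("analysis_outputs", bucket)
print(shlex.join(command))
```

The printed command can be reviewed first and then executed, for example with `subprocess.run(command, check=True)`, before the persistent disk is deleted or modified.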

Speed up machine learning work with GPUs on Terra: Terra’s Interactive Analysis team has released an upgrade that enables adding Graphical Processor Units (GPUs) to Notebook cloud environments in Terra. Terra already offered the ability to use GPUs in workflows, and the team is now responding to user requests to run GPU-enabled computations interactively with GPU support for Jupyter Notebooks.

Speed up workflows and save costs using N2 instances sporting Intel’s 2nd Generation Xeon CPUs on Terra: Terra users will now have the option to use new-generation N2 instances, which have demonstrated faster performance and reduced cost. Read more about these updates and how to request N2 instances for workflows here.

New user support materials and documentation

Cross-study harmonization example notebook: This tutorial notebook demonstrates how to query and work with the BioData Catalyst studies, particularly cross-study harmonization, using the PIC-SURE API.

Estimate and Manage Cloud Costs on Seven Bridges: This tutorial describes how to estimate costs associated with using Seven Bridges. The tutorial includes an overview of both cloud storage costs and cloud computation costs and the primary drivers of those costs. The tutorial also provides guidance on how to approach estimating cloud storage and computation costs so that researchers can budget for cloud costs in their grants, request cloud credits, and plan their work on BioData Catalyst.

Public project for TOPMed Freeze8 variant calling pipelines: Users on Seven Bridges can now access a public project that walks through how to use the CWL tools and workflows that were used to perform variant calling of TOPMed Freeze8. The public project provides explanations of the purpose of all of the tools and workflows and how they are used together, along with examples of completed analyses. All of the CWL tools and workflows in the project are available in the Public Apps Gallery.

Need an easy way to explain Terra to your colleagues or collaborators? Try this quick (2-min.) overview of Terra.

Estimate Workflow Costs on Terra: Terra users can also follow this documentation to estimate costs of workflows. This is the original document describing the steps summarized in this blog post.

Understanding and controlling cloud costs on Terra: This article includes a detailed breakdown of the types of costs that you may incur when working on Google Cloud, plus some advice on how to reduce costs.

Understanding costs and billing on Terra: This article includes an overview of how billing works, including how billing accounts, projects and workspaces relate to each other, and the difference between workspace permissions and billing permissions.

Controlling cloud costs on Terra – sample use cases: This article includes a selection of typical analysis use cases, for which the costs are broken down in several scenarios in order to illustrate the effect of cost control strategies.

New tools and workflows released to Dockstore’s NHLBI BioData Catalyst Organization:

  • Three additional WDL workflows have been released in the UWGAC Ancestry, Relatedness, and Association Testing Collection, including KING, PC-Relate, and PC-AIR.

  • The xvcfView WDL was released to the Utilities collection. This workflow provides the full power of bcftools view to subset, subsample, and filter VCF files.

  • The new PrediXcan collection with CWL workflows can predict gene expression (or whatever biology the models predict) in a cohort with available genotypes and run associations to a trait measured in the cohort.
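For readers unfamiliar with the kind of subsetting, subsampling, and filtering that the VCF utility workflow in the list above exposes, the sketch below assembles a plain `bcftools view` command from Python. It illustrates the underlying command-line flags only and is not the WDL workflow's input interface; the helper name, file names, sample IDs, region, and filter expression are all hypothetical.

```python
import shlex
from typing import List, Optional, Sequence


def build_bcftools_view(input_vcf: str, output_vcf: str,
                        samples: Optional[Sequence[str]] = None,
                        regions: Optional[Sequence[str]] = None,
                        include: Optional[str] = None) -> List[str]:
    """Build a `bcftools view` argument list writing bgzipped VCF output."""
    cmd = ["bcftools", "view"]
    if samples:
        cmd += ["-s", ",".join(samples)]   # subsample: keep only these samples
    if regions:
        cmd += ["-r", ",".join(regions)]   # subset: needs an indexed input file
    if include:
        cmd += ["-i", include]             # filter: keep sites matching expression
    cmd += ["-O", "z", "-o", output_vcf, input_vcf]
    return cmd


# Hypothetical inputs for illustration only.
cmd = build_bcftools_view(
    "cohort.vcf.gz", "subset.vcf.gz",
    samples=["SAMPLE_A", "SAMPLE_B"],
    regions=["chr1:1000000-2000000"],
    include="QUAL>30",
)
print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to inspect the command before executing it with `subprocess.run(cmd, check=True)`.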

Launch Galaxy workflows from Dockstore into multiple Galaxy instances, including Terra:

  • New to Galaxy? The Galaxy Training Network is continuing to add training material in their Organization on Dockstore.

  • Additionally, users can explore some of the Galaxy community’s best practices workflows in their IWC Organization on Dockstore.

Ready to publish and share the tool or workflow you developed with the research community? Dockstore users can link their accounts to their ORCID and Zenodo accounts, mint DOIs for their workflows hosted on Dockstore, and now can export their workflows directly to their ORCID profile.

New video tutorials demonstrate exporting data from PIC-SURE to Terra and Seven Bridges using BioLINCC/Sickle Cell related data.

Data Releases

The table below highlights which studies were included in the 2021-10-04 data release. The final BioLINCC training dataset was uploaded, plus another BioLINCC study, BabyHug. The ORCHID dataset was re-ingested after the data owners found they had provided incorrect versions of the files at the time of initial ingestion. The TOPMed Combined Exchange Area buckets were updated with more datasets from multiple new freezes. The last dataset ingested was PCGC’s CMG. The data is now available for access across the entire ecosystem.

| Study Name | phs I.D. # | Acronym | New to BioData Catalyst | New study version |
| --- | --- | --- | --- | --- |
| BioLINCC (Phase 1) - Training Data (Digitalis) | open | | true | NA |
| Additional TOPMed combined EA | c999 | Freeze1/Freeze9b/Freeze10a | true | |
| PETAL - ORCHID (data re-ingested since files initially provided by data submitters were not the final version) | phs002299 | ORCHID | false | 1 |
| PCGC (CMG/Wagner) | | CMG | true | 1 |
| CureSCi - BabyHug (via BioLINCC) | phs002415 | BabyHug | true | 1 |

Planned upcoming Data Releases

| Study Name | phs I.D. # | Acronym | New to BioData Catalyst | New study version |
| --- | --- | --- | --- | --- |
| TOPMed Freeze 9 - Batch 1 (22 datasets included) | Various | Various | false | NA |
| PCGC SRA Data | | | | |
| Additional TOPMed Freeze 8 Studies (CATHGen) | phs000571 | | true | |

For detailed platform release notes please consult the following resources:

  • Gen3 release notes

  • PIC-SURE release notes

  • Dockstore release notes

  • Terra release notes

  • Seven Bridges release notes

    NA

    6

    phs001194