2021-07-09 BioData Catalyst Ecosystem Release Notes

Introduction

The 2021-07-09 release marks the sixth release for the NHLBI BioData Catalyst ecosystem. This release includes several new features (e.g., SAS on Seven Bridges and Galaxy’s integration with Terra) along with documentation and tutorials to help new users get started on the system (e.g., PIC-SURE Open Access). This release also includes enhanced support for maintaining and versioning CWL on external tool repositories. Please find more details on the new features and user support materials in the sections below.

The 2021-07-09 data release includes the addition of CRAMs and unharmonized clinical files for the parent project CARDIA and 3 other TOPMed programs. The TOPMed program REDS-III received a version update. The unharmonized clinical files were uploaded for the 5 BioLINCC projects and the 2 open tutorial projects. Please refer to the Data Release section below for more information as well as the Data page on the BioData Catalyst website.

Significant new features

Authentication through the NIH Researcher Authentication Service: The BioData Catalyst ecosystem updated the authentication mechanism to use the NIH Researcher Authentication Service. Researchers will now be redirected to the NIH RAS page to enter their eRA Commons credentials when logging into one of the platforms within the ecosystem.

New Jupyter notebook: Sample and Variant quality control methods with Hail is now published in the BioData Catalyst Collection Featured Workspace on Terra.

Galaxy has integrated with Terra: Galaxy is now available through the other “faces” of Terra, including the NHLBI BioData Catalyst. You can launch your very own Galaxy server without having to do any configuration yourself, right from the Terra web interface. This marks a transition from alpha to beta development status of Galaxy on Terra, meaning that the software is more mature and considered reliable enough for regular work, with the caveat that minor changes may occur over time as we smooth out any remaining rough edges and improve user experience in the application. Learn more about how to use Galaxy in Terra here. Features of Galaxy and its use within Terra are also featured in our blog post here. You can also import Dockstore workflows into Galaxy when it's launched in Terra. Speaking of workflows, Cromwell 64 is now live on Terra.

Seurat package now included by default in R-based cloud environments: The RStudio image now has Seurat, a tool for single-cell transcriptomics, as well as crcmod, a package for verifying the integrity of an object in Google Cloud Storage.
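
As a rough illustration of what an integrity check of this kind involves, the sketch below computes a checksum over local bytes and compares it with a reported value. The release note does not say which checksum is used; the sketch assumes CRC32C, which Google Cloud Storage reports for objects as a base64-encoded big-endian value. It is written in pure Python rather than with crcmod so that it runs without extra packages, and the function names are hypothetical.

```python
import base64


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; apply the polynomial when the low bit was set.
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


def gcs_style_crc32c(data: bytes) -> str:
    """Encode the checksum as base64 of its 4 big-endian bytes (assumed format)."""
    return base64.b64encode(crc32c(data).to_bytes(4, "big")).decode("ascii")


def matches_reported_checksum(data: bytes, reported_b64: str) -> bool:
    """Return True when the local bytes hash to the reported checksum."""
    return gcs_style_crc32c(data) == reported_b64
```

A bitwise loop like this is slow on large files; a compiled implementation such as the one crcmod provides is the practical choice for real objects.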

Interactive Analysis: Jupyter Notebook images have been updated with Bioconductor 3.13.0. See the Bioconductor release notes here.

SAS: Users on Seven Bridges can now launch SAS for interactive analysis from the Data Cruncher feature. All project files are available within SAS. Users can select from three SAS offerings built on top of SAS Studio: 1) SAS Business Intelligence enables users to use SAS code to manage data and to create, modify, and compare descriptive and predictive models. Capabilities include clustering, decision trees, and linear and logistic regression. 2) SAS Analytics adds the power of SAS Viya’s Data Mining and Machine Learning algorithms, such as neural networks, gradient boosting, and random forest. 3) SAS Data Science provides access to text analysis, time series models, advanced forecasting, and model governance.

LocusZoom Interactive Application: Users on Seven Bridges can now launch an R Shiny application that lets them select, visualize, and interactively explore single-variant association test results with no prior R programming knowledge. Researchers can explore existing analyses available in the University of Michigan database, generate LocusZoom plots for example data, or provide their own association .RData files. The app also provides the JSONizer tool, which enables researchers to subset their association test results (.RData) files and convert them into the JSON files required by LocusZoom. Users can launch the application from the Public project “LocusZoom Shiny App” in the top navigation bar.

Example notebooks for data import with DRS: These example notebooks on Seven Bridges provide users with the code and steps for importing data from CAVATICA (Kids First data) as well as importing GTEx data from the NHGRI AnVIL system. The import uses the DRS functionality to access files that are stored on other NIH cloud systems. Users can find the notebooks in the Public Project “Data Interoperability” in the top navigation bar.
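
For readers unfamiliar with DRS, the sketch below shows one small piece of how a DRS reference is resolved: turning a hostname-based DRS URI into the HTTPS endpoint that returns the object's metadata, following the GA4GH DRS v1 path layout. It is not taken from the example notebooks, the function name and the host `drs.example.org` are hypothetical, and it does not handle compact-identifier URIs or authentication.

```python
from urllib.parse import quote, urlparse


def drs_uri_to_url(drs_uri: str) -> str:
    """Map drs://<hostname>/<object_id> to the DRS v1 object metadata URL."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")

    object_id = parsed.path.lstrip("/")
    if not object_id:
        raise ValueError(f"DRS URI has no object ID: {drs_uri!r}")

    # Percent-encode the ID so reserved characters survive as a single path segment.
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"


if __name__ == "__main__":
    # Hypothetical host and object ID, for illustration only.
    print(drs_uri_to_url("drs://drs.example.org/314159"))
```

A client would then send an authenticated GET request to that URL and read the access methods in the response to locate the file bytes.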

CWL v1.2 available: BioData Catalyst Powered by Seven Bridges now supports Common Workflow Language (CWL) version 1.2. The new version of CWL brings one major new feature, conditional execution of workflow steps, as well as several minor features and improvements. For the detailed change log, please see the CWL CommandLineTool specification and the CWL Workflow specification.

New CWL tools and workflows on BioData Catalyst Powered by Seven Bridges: Users can find all these tools and more in the Public Apps Gallery:

PIC-SURE Data Access Dashboard Updates: PIC-SURE’s Data Access Dashboard has been updated to include the number of studies and participants the user has access to based on their authorization.

New PIC-SURE Open Access: PIC-SURE Open Access is now available in BioData Catalyst! PIC-SURE Open Access is available to users who have an eRA Commons account, including those who are not authorized to access any studies. The Open Access feature allows users to explore de-stigmatized phenotypic data available in PIC-SURE prior to requesting access to data. For more information, check out the user guide and tutorial.

New PIC-SURE Jupyter notebook examples are available as public projects in Seven Bridges and Terra as follows:

  • An example showing users how to access lipid measurements across harmonized variables and multiple visits using the PIC-SURE API in R, RStudio, and Python.

  • All previous notebook examples are now available in RStudio on Seven Bridges.

New UWGAC Ancestry and Relatedness analysis collection on Dockstore under the BioData Catalyst organization: This collection includes two WDL workflows to help users prepare their data for association testing: one for converting VCF files to GDS and one for linkage disequilibrium pruning. Stay tuned as more workflows are released.

New Large-scale Gene by Environment collection on Dockstore under the BioData Catalyst organization: The WDL workflows in this collection enable scalable, efficient, and flexible genome-wide gene-environment interaction analysis. GEM conducts single-variant analysis for common variants (currently in unrelated individuals only) and MAGEE conducts single-variant and variant set-based analysis for common or rare variants while allowing for relatedness. The collection also includes examples of cloud costs in the README.

Known issues and workarounds

BioLINCC Phase 2 data dictionaries: These data dictionaries were submitted in PDF format, which required additional intervention and delayed their general release to the platform. They will be released for use across the platform as soon as is feasible.

New user support materials and documentation

Maintaining and Versioning CWL on External Tool Repositories: This tutorial presents best practices for writing and maintaining CWL tools/workflows in an external tool repository, such as GitHub, so that users can better manage versions of their tools. Users should follow these best practices if they would like to publish and share their CWL tools and workflows in the Dockstore repository, since Dockstore can automatically pull changes from GitHub. These best practices will ensure that the CWL is fully portable and can run successfully not only on Seven Bridges Platforms, but also on other CWL executors such as cwltool and Toil.

Transferring Files Between Seven Bridges and Terra: This tutorial guides users through the process of transferring files between the two workspace environments, Seven Bridges and Terra.

Accessing Egress-Free GTEx Data From AnVIL: A new data interoperability page, with linked instructions for accessing egress-free GTEx data from NHGRI’s AnVIL cloud ecosystem, is available here.

PIC-SURE Documentation Updates: New PIC-SURE documentation provides new information on the Data Access Dashboard, PIC-SURE Open Access, and a new table for understanding study-specific subject identifiers.

PIC-SURE Video Tutorials: PIC-SURE Video tutorials are now available for the following topics:

  • Introduction to PIC-SURE

  • Introduction to PIC-SURE Open Access: Harmonized

  • Introduction to PIC-SURE Open Access: One Criterion Search

  • Introduction to PIC-SURE Open Access

  • Introduction to PIC-SURE Open Access: Multiple search criteria

  • Introduction to PIC-SURE Authorized Access

  • Introduction to PIC-SURE Authorized Access: Data Export

Published a blog post on the role of a secure cloud ecosystem for supporting infrastructure projects and creating connected communities, highlighting BioData Catalyst as one of several NIH-commissioned infrastructure development projects that involve not just putting data on the cloud but also building the additional layers of services that are necessary to deliver on the extraordinary promise of this new model for data sharing and analysis.

Data Releases

The table below highlights which studies were included in the 2021-07-09 data release. CRAMs and unharmonized clinical files were uploaded for the parent project CARDIA and 3 other TOPMed programs. The TOPMed program REDS-III received a version update. The unharmonized clinical files were uploaded for the 5 BioLINCC projects and the 2 open tutorial projects. The data is now available for access across the entire ecosystem.

Study Name | phs I.D. # | Acronym | New to BioData Catalyst | New study version
Treatment of Pulmonary Hypertension and Sickle Cell Disease with Sildenafil Therapy | phs002383 | WalkPHaSST | true | 1
CARDIA Cohort | phs000285 | CARDIA | false | 3
Tutorial-biolincc_camp | open | | true |
tutorial-biolincc_framingham | open | | true |

Planned upcoming Data Releases

Study Name | phs I.D. # | Acronym | New to BioData Catalyst | New study version
Combined Exchange Area new data | | | false |
BioLINCC – Training Dataset – Digitalis | | | |
BioLINCC – BabyHug | phs002415 | | true |

For detailed platform release notes please consult the following resources:
