2021-07-09 BioData Catalyst Ecosystem Release Notes

Introduction

The 2021-07-09 release marks the sixth release for the NHLBI BioData Catalyst ecosystem. This release includes several new features (e.g., SAS on Seven Bridges and Galaxy’s integration with Terra) along with documentation and tutorials to help new users get started on the system (e.g., PIC-SURE Open Access). This release also includes enhanced support for maintaining and versioning CWL on external tool repositories. Please find more details on the new features and user support materials in the sections below.

The 2021-07-09 data release includes the addition of CRAMs and unharmonized clinical files for the parent project CARDIA and 3 other TOPMed programs. The TOPMed program REDS-III received a version update. The unharmonized clinical files were uploaded for the 5 BioLINCC projects and the 2 open tutorial projects. Please refer to the Data Release section below for more information as well as the Data page on the BioData Catalyst website.

Significant new features

Authentication through the NIH Researcher Authentication Service: The BioData Catalyst ecosystem updated the authentication mechanism to use the NIH Researcher Authentication Service. Researchers will now be redirected to the NIH RAS page to enter their eRA Commons credentials when logging into one of the platforms within the ecosystem.

New Jupyter notebook: The notebook "Sample and Variant quality control methods with Hail" has been published in the BioData Catalyst Collection Featured Workspace on Terra.

Galaxy has integrated with Terra: Galaxy is now available through the other “faces” of Terra, including the NHLBI BioData Catalyst. You can launch your very own Galaxy server without having to do any configuration yourself, right from the Terra web interface. This marks a transition from alpha to beta development status of Galaxy on Terra, meaning that the software is more mature and considered reliable enough for regular work, with the caveat that minor changes may occur over time as we smooth out any remaining rough edges and improve user experience in the application. Learn more about how to use Galaxy in Terra here. Galaxy's features and its use within Terra are also described in our blog post here. You can also import Dockstore workflows into Galaxy when it is launched in Terra. Speaking of workflows, Cromwell 64 is now live on Terra.

Seurat package now included by default in R-based cloud environments: The RStudio image now has Seurat, a tool for single-cell transcriptomics, as well as crcmod, a package for verifying the integrity of an object in Google Cloud Storage.

Interactive Analysis: Jupyter Notebook images have been updated with Bioconductor 3.13.0. See the Bioconductor release notes here.

SAS: Users on Seven Bridges can now launch SAS for interactive analysis from the Data Cruncher feature. All project files are available within SAS. Users can select from three SAS offerings built on top of SAS Studio: 1) SAS Business Intelligence enables users to utilize SAS code to manage data, create, modify and compare descriptive and predictive models. Capabilities include clustering, decision trees, linear and logistic regression. 2) SAS Analytics adds the power of SAS Viya’s Data Mining and Machine Learning algorithms such as neural networks, gradient boosting, and random forest. 3) SAS Data Science provides access to text analysis, time series models, advanced forecasting and model governance.

LocusZoom Interactive Application: Users on Seven Bridges can now launch an R Shiny application that enables users to select, visualize and interactively explore single variant association test results data, with no prior R programming knowledge. Researchers can explore existing analyses available in the University of Michigan database, generate LocusZoom plots for example data, or provide their own association .RData files. The app also provides the JSONizer tool, which enables researchers to subset their association test results (.RData) files and to convert them into the appropriate JSON files required by LocusZoom. Users can launch the application from the Public project “LocusZoom Shiny App” in the top navigation bar.

Example notebook for data import with DRS: These example notebooks on Seven Bridges provide users with the code and steps for importing data from CAVATICA (Kids First data) as well as importing GTEx data from the NHGRI AnVIL system. The import utilizes the DRS functionality to access files that are stored on other NIH cloud systems. Users can find notebooks from the Public Project “Data Interoperability” in the top navigation bar.
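As a rough illustration of how DRS lets one system point at a file held on another, the sketch below converts a hostname-based DRS URI of the form drs://<host>/<id> into the object-metadata URL defined by the GA4GH DRS v1 convention. The host `drs.example.org` and the object ID are placeholders, not real BioData Catalyst, CAVATICA, or AnVIL identifiers, and the function name is hypothetical. The example notebooks may perform this step differently (for instance, through the platform API), and compact-identifier DRS URIs are not handled here.

```python
from urllib.parse import quote, urlparse


def drs_uri_to_https(drs_uri: str) -> str:
    """Map a hostname-based DRS URI to its DRS v1 object-metadata URL.

    Hypothetical helper for illustration; not part of any platform SDK.
    """
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs":
        raise ValueError(f"not a DRS URI: {drs_uri!r}")
    object_id = parsed.path.lstrip("/")
    if not parsed.netloc or not object_id:
        raise ValueError(f"DRS URI needs a host and an object ID: {drs_uri!r}")
    # Percent-encode the ID so it is safe as a single path segment.
    encoded_id = quote(object_id, safe="")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{encoded_id}"


if __name__ == "__main__":
    # Placeholder host and ID, for illustration only.
    print(drs_uri_to_https("drs://drs.example.org/314159"))
```

Fetching the resulting URL with an authorized request would return the object's metadata, including its access methods; authentication is outside the scope of this sketch.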

CWL v1.2 available: BioData Catalyst Powered by Seven Bridges now supports Common Workflow Language (CWL) version 1.2. The new version of CWL adds one major capability, conditional execution of workflow steps, along with several minor features and improvements. For the detailed change log, please see the CWL CommandLineTool specification and the CWL Workflow specification.
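To illustrate conditional execution, the sketch below builds a small CWL v1.2 workflow as a Python dictionary and serializes it to JSON (CWL documents may be written in JSON as well as YAML). The `when` field on a step and `pickValue` on an output are part of CWL v1.2; the step name, the input names, and the tool file `filter_variants.cwl` are hypothetical, and the document has not been validated against any particular executor.

```python
import json


def build_conditional_workflow() -> dict:
    """Return a CWL v1.2 workflow whose only step runs conditionally."""
    return {
        "cwlVersion": "v1.2",
        "class": "Workflow",
        # Needed because the output below lists more than one source.
        "requirements": {"MultipleInputFeatureRequirement": {}},
        "inputs": {
            "variants": "File",
            "run_filter": "boolean",
        },
        "steps": {
            "filter": {
                "run": "filter_variants.cwl",  # hypothetical tool description
                # New in v1.2: the step is skipped when this evaluates to false.
                "when": "$(inputs.run_filter)",
                "in": {
                    "variants": "variants",
                    # Passed to the step so the `when` expression can read it.
                    "run_filter": "run_filter",
                },
                "out": ["filtered"],
            }
        },
        "outputs": {
            "result": {
                "type": "File",
                # A skipped step yields null, so fall back to the raw input.
                "outputSource": ["filter/filtered", "variants"],
                "pickValue": "first_non_null",
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_conditional_workflow(), indent=2))
```

With `run_filter` set to false, the step is skipped and `result` resolves to the unfiltered input file; with it set to true, the filtered file is used.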

New CWL tools and workflows on BioData Catalyst Powered by Seven Bridges: Users can find all these tools and more in the Public Apps Gallery:

  • Regenie 2.0.1 - This is a tool for whole genome regression analysis.

  • GENESIS Association results plotting - This UW-GAC tool is a standalone app for creating Manhattan and QQ plots from the GENESIS association test results with additional filtering and stratification options available.

  • WGSA 0.9 - This is a scalable SNV and INDEL annotation pipeline, performing a spectrum of annotations in a single tool. It integrates annotations from dozens of databases and annotation tools.

  • GENESIS Update Null Model for Fast Score Test - This updates the null model file obtained with the GENESIS Null model workflow so that it can be used in the GENESIS Single Variant Association Testing workflow in fast score mode.

  • Missing rate by sample - This UW-GAC tool was created for QC in GWAS. The tool calculates missing rate by sample. A subset of variants may be specified.

  • Missing rate by variant - This UW-GAC tool was created for QC in GWAS. The tool calculates missing rate by variant. A subset of samples and/or variants may be specified.

  • Allele frequency - This UW-GAC tool was created for QC in GWAS. The tool calculates allele frequency and counts. Values for both the alternate allele (count, frequency) and the minor allele (MAC, MAF) are returned. A subset of samples and/or variants may be specified.

  • ld-index - This UW-GAC tool calculates the LD between an index variant and each variant in a set of other variants stored in a GDS file using the snpgdsLDMat function in the SNPRelate R package and a wrapper LDcompute R package.

  • ld-pair - This UW-GAC tool calculates the LD between a pair of variants stored in a GDS file using the snpgdsLDMat function in the SNPRelate R package and a wrapper LDcompute R package.

  • ld-set - This UW-GAC tool calculates the LD between all pairs in a user-specified set of variants stored in a GDS file using the snpgdsLDMat function in the SNPRelate R package and a wrapper LDcompute R package.
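The QC quantities named in the list above (missing rate, allele frequency, and LD) can be sketched on a toy genotype matrix. This is an illustration in plain Python, not the UW-GAC implementation, which works on GDS files in R. Several things here are assumptions of the example: genotypes are coded as alternate-allele counts 0, 1, 2 for diploid samples, `None` marks a missing call, rows are variants and columns are samples, and LD is taken as the squared Pearson correlation of genotype dosages, which is only one of several LD measures.

```python
from typing import List, Optional, Sequence

Genotype = Optional[int]  # alternate-allele count (0, 1, 2), or None if missing


def missing_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of calls that are missing."""
    return sum(g is None for g in calls) / len(calls)


def missing_rate_by_variant(geno: Sequence[Sequence[Genotype]]) -> List[float]:
    """One missing rate per row (variant)."""
    return [missing_rate(row) for row in geno]


def missing_rate_by_sample(geno: Sequence[Sequence[Genotype]]) -> List[float]:
    """One missing rate per column (sample)."""
    return [missing_rate(col) for col in zip(*geno)]


def allele_frequency(row: Sequence[Genotype]) -> dict:
    """Alternate-allele and minor-allele count and frequency for one variant.

    Raises ZeroDivisionError if every call is missing.
    """
    observed = [g for g in row if g is not None]
    n_alleles = 2 * len(observed)  # diploid assumption
    alt_count = sum(observed)
    mac = min(alt_count, n_alleles - alt_count)
    return {
        "alt_count": alt_count,
        "alt_freq": alt_count / n_alleles,
        "mac": mac,
        "maf": mac / n_alleles,
    }


def ld_r2(a: Sequence[Genotype], b: Sequence[Genotype]) -> float:
    """Squared Pearson correlation of dosages over samples called at both variants.

    Raises ZeroDivisionError if either variant is monomorphic in those samples.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov * cov / (var_x * var_y)


if __name__ == "__main__":
    # Toy data: 3 variants (rows) by 4 samples (columns).
    toy = [
        [0, 1, 2, None],
        [2, 2, 1, None],
        [0, 0, 1, 2],
    ]
    print(missing_rate_by_variant(toy))
    print(missing_rate_by_sample(toy))
    print(allele_frequency(toy[1]))
    print(ld_r2(toy[0], toy[2]))
```

The minor allele is whichever allele is rarer among the observed calls, so MAC and MAF differ from the alternate-allele values whenever the alternate allele is the common one, as in the second toy variant.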

PIC-SURE Data Access Dashboard Updates: PIC-SURE’s Data Access Dashboard has been updated to include the number of studies and participants the user has access to based on their authorization.

New PIC-SURE Open Access: PIC-SURE Open Access is now available in BioData Catalyst! PIC-SURE Open Access is available to users who have an eRA Commons account, including those who are not authorized to access any studies. The Open Access feature allows users to explore de-stigmatized, phenotypic data available in PIC-SURE prior to requesting access to data. For more information check out the user guide and tutorial.

New PIC-SURE Jupyter notebook examples are available as public projects in Seven Bridges and Terra as follows:

  • An example showing users how to access lipid measurements across harmonized variables and multiple visits using the PIC-SURE API in R, RStudio, and Python.

  • All previous notebook examples are now available in RStudio on Seven Bridges.

New UWGAC Ancestry and Relatedness analysis collection on Dockstore under the BioData Catalyst organization: This collection includes two WDL workflows to help users prepare their data for association testing: one for converting VCF files to GDS and one for linkage disequilibrium pruning. Stay tuned as more workflows are released.

New Large-scale Gene by Environment collection on Dockstore under the BioData Catalyst organization: The WDL workflows in this collection enable scalable, efficient, and flexible genome-wide gene-environment interaction analysis. GEM conducts single-variant analysis for common variants (currently in unrelated individuals only) and MAGEE conducts single-variant and variant set-based analysis for common or rare variants while allowing for relatedness. The collection also includes examples of cloud costs in the README.

Known issues and workarounds

BioLINCC Phase 2 data dictionaries: These data dictionaries were submitted in PDF format, which required additional intervention and delayed their general release to the platform. They will be released for use across the platform as soon as is feasible.

New user support materials and documentation

Maintaining and Versioning CWL on External Tool Repositories: This tutorial presents best practices for writing and maintaining CWL tools/workflows in an external tool repository, such as GitHub, so that users can better manage versions of their tools. Users should follow these best practices if they would like to publish and share their CWL tools and workflows in the Dockstore repository since Dockstore has the ability to automatically pull changes from GitHub. These best practices will ensure that the CWL is fully portable and can run successfully not only on Seven Bridges Platforms, but also on other CWL executors such as cwltool and Toil.

Transferring Files Between Seven Bridges and Terra: This tutorial guides users through the process of transferring files between the two workspace environments Seven Bridges and Terra.

Accessing Egress-Free GTEx Data From AnVIL: A new data interoperability page that includes linked instructions for how to access egress-free GTEx data from NHGRI’s AnVIL cloud ecosystem is here.

PIC-SURE Documentation Updates: The updated PIC-SURE documentation covers the Data Access Dashboard and PIC-SURE Open Access, and adds a table for understanding study-specific subject identifiers.

PIC-SURE Video Tutorials: PIC-SURE Video tutorials are now available for the following topics:

  • Introduction to PIC-SURE

  • Introduction to PIC-SURE Open Access: Harmonized

  • Introduction to PIC-SURE Open Access: One Criterion Search

  • Introduction to PIC-SURE Open Access

  • Introduction to PIC-SURE Open Access: Multiple search criteria

  • Introduction to PIC-SURE Authorized Access

  • Introduction to PIC-SURE Authorized Access: Data Export

Blog post: A new blog post discusses the role of a secure cloud ecosystem in supporting infrastructure projects and creating connected communities. It highlights BioData Catalyst as one of several NIH-commissioned infrastructure development projects that involve not just putting data on the cloud but also building the additional layers of services needed to deliver on the promise of this new model for data sharing and analysis.

Data Releases

The table below highlights which studies were included in the 2021-07-09 data release. CRAMs and unharmonized clinical files were uploaded for the parent project CARDIA and 3 other TOPMed programs. The TOPMed program REDS-III received a version update. The unharmonized clinical files were uploaded for the 5 BioLINCC projects and the 2 open tutorial projects. The data is now available for access across the entire ecosystem.

Planned upcoming Data Releases

For detailed platform release notes please consult the following resources:
