2021-07-09 BioData Catalyst Ecosystem Release Notes
The 2021-07-09 release marks the sixth release for the NHLBI BioData Catalyst ecosystem. This release includes several new features (e.g., SAS on Seven Bridges and Galaxy’s integration with Terra) along with documentation and tutorials to help new users get started on the system (e.g., PIC-SURE Open Access). This release also includes enhanced support for maintaining and versioning CWL on external tool repositories. Please find more details on the new features and user support materials in the sections below.
The 2021-07-09 data release includes the addition of CRAMs and unharmonized clinical files for the parent project CARDIA and 3 other TOPMed programs. The TOPMed program REDS-III received a version update. The unharmonized clinical files were uploaded for the 5 BioLINCC projects and the 2 open tutorial projects. Please refer to the Data Release section below for more information as well as the page on the BioData Catalyst website.
Authentication through the NIH Researcher Authentication Service: The BioData Catalyst ecosystem updated the authentication mechanism to use the NIH Researcher Authentication Service. Researchers will now be redirected to the NIH RAS page to enter their eRA Commons credentials when logging into one of the platforms within the ecosystem.
New Jupyter notebook published in the BioData Catalyst Collection Featured Workspace on Terra.
Galaxy has integrated with Terra: Galaxy is now available through the other “faces” of Terra, including the . You can launch your own Galaxy server right from the Terra web interface, without doing any configuration yourself. This marks the transition of Galaxy on Terra from alpha to beta development status: the software is more mature and considered reliable enough for regular work, with the caveat that minor changes may occur over time as we smooth out remaining rough edges and improve the user experience. Learn more about how to use Galaxy in Terra. Features of Galaxy and its use within Terra are also covered in our blog post. You can also import Dockstore workflows into Galaxy when it is launched in Terra. Speaking of workflows, Cromwell 64 is now live on Terra.
: The RStudio image now has , a tool for single-cell transcriptomics, as well as , a package for verifying the integrity of an object in Google Cloud Storage.
Interactive Analysis: Jupyter Notebook images have been updated with Bioconductor 3.13.0. See the Bioconductor release notes.
SAS: Users on Seven Bridges can now launch SAS for interactive analysis from the Data Cruncher feature. All project files are available within SAS. Users can select from three SAS offerings built on top of SAS Studio: 1) SAS Business Intelligence enables users to utilize SAS code to manage data, create, modify and compare descriptive and predictive models. Capabilities include clustering, decision trees, linear and logistic regression. 2) SAS Analytics adds the power of SAS Viya’s Data Mining and Machine Learning algorithms such as neural networks, gradient boosting, and random forest. 3) SAS Data Science provides access to text analysis, time series models, advanced forecasting and model governance.
LocusZoom Interactive Application: Users on Seven Bridges can now launch an R Shiny application that enables users to select, visualize and interactively explore single variant association test results data, with no prior R programming knowledge. Researchers can explore existing analyses available in the database, generate LocusZoom plots for example data, or provide their own association .RData files. The app also provides the JSONizer tool, which enables researchers to subset their association test results (.RData) files and to convert them into the appropriate JSON files required by LocusZoom. Users can launch the application from the in the top navigation bar.
Example notebook for data import with DRS: These example notebooks on Seven Bridges provide users with the code and steps for importing data from CAVATICA (Kids First data) as well as importing GTEx data from the NHGRI AnVIL system. The import utilizes the DRS functionality to access files that are stored on other NIH cloud systems. Users can find notebooks from the in the top navigation bar.
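For readers unfamiliar with DRS, the sketch below shows how a hostname-based DRS URI maps to the object endpoint defined by the GA4GH Data Repository Service v1 specification. It is only an illustration of the addressing scheme, not the code in the Seven Bridges notebooks. The function name `drs_uri_to_url` and the host `drs.example.org` are assumptions made for this example, and compact-identifier URIs (which need a resolver) are deliberately not handled.

```python
from urllib.parse import quote, urlparse


def drs_uri_to_url(drs_uri: str) -> str:
    """Map a hostname-based DRS URI to its GA4GH DRS v1 object endpoint.

    drs://<hostname>/<object_id> -> https://<hostname>/ga4gh/drs/v1/objects/<object_id>
    """
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs":
        raise ValueError(f"not a DRS URI: {drs_uri!r}")

    host = parsed.netloc
    object_id = parsed.path.lstrip("/")
    if not host or not object_id:
        # Compact-identifier URIs (drs://prefix:accession) end up here;
        # they must be resolved through an identifier resolver first.
        raise ValueError(f"expected drs://<hostname>/<object_id>, got {drs_uri!r}")

    # Percent-encode the object ID so it is safe to use as one path segment.
    return f"https://{host}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"


if __name__ == "__main__":
    # Hypothetical host and object ID, used only for illustration.
    print(drs_uri_to_url("drs://drs.example.org/abc-123"))
```

An authenticated GET request to the resulting URL would normally return the object's metadata, including the access methods a client can use to fetch the file; the request itself is omitted here because it depends on the platform's credentials.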
CWL v1.2 available: BioData Catalyst Powered by Seven Bridges now supports Common Workflow Language (CWL) version 1.2. The new version of CWL brings a major new functionality - , as well as several minor features and improvements. For the detailed change log please see the and the .
New CWL tools and workflows on BioData Catalyst Powered by Seven Bridges: Users can find all these tools and more in the :
- This is a tool for whole genome regression analysis.
- This UW-GAC tool is a standalone app for creating Manhattan and QQ plots from the GENESIS association test results with additional filtering and stratification options available.
- This is a scalable SNV and INDEL annotation pipeline, performing a spectrum of annotations in a single tool. It integrates annotations from dozens of databases and annotation tools.
- This updates the null model file obtained with the GENESIS Null model workflow so that it can be used in the GENESIS Single Variant Association Testing workflow in fast score mode.
- This UW-GAC tool was created for QC in GWAS. The tool calculates missing rate by sample. A subset of variants may be specified.
- This UW-GAC tool was created for QC in GWAS. The tool calculates missing rate by variant. A subset of samples and/or variants may be specified.
- This UW-GAC tool was created for QC in GWAS. The tool calculates allele frequency and counts. Values for both the alternate allele (count, frequency) and the minor allele (MAC, MAF) are returned. A subset of samples and/or variants may be specified.
- This UW-GAC tool calculates the LD among an index variant and each variant in a set of other variants stored in a GDS file using the snpgdsLDMat function in the and a wrapper .
- This UW-GAC tool calculates the LD between a pair of variants stored in a GDS file using the snpgdsLDMat function in the and a wrapper .
- This UW-GAC tool calculates the LD between all pairs in a user-specified set of variants stored in a GDS file using the snpgdsLDMat function in the and a wrapper .
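To make the QC quantities named in the list above concrete, here is a minimal sketch of missing rate by sample, missing rate by variant, and alternate/minor allele counts and frequencies on a toy genotype matrix. It is not the UW-GAC implementation, which runs in R on GDS files. The data layout (rows are samples, columns are variants, values are alternate-allele counts 0/1/2, `None` is a missing call), the function names, and the assumption of diploid biallelic variants are all choices made for this example.

```python
from typing import Dict, List, Optional

# Rows = samples, columns = variants; None marks a missing genotype call.
Genotypes = List[List[Optional[int]]]


def missing_rate_by_sample(geno: Genotypes) -> List[float]:
    """Fraction of variants with no call, for each sample."""
    return [sum(g is None for g in row) / len(row) for row in geno]


def missing_rate_by_variant(geno: Genotypes) -> List[float]:
    """Fraction of samples with no call, for each variant."""
    n_samples = len(geno)
    n_variants = len(geno[0])
    return [
        sum(row[j] is None for row in geno) / n_samples
        for j in range(n_variants)
    ]


def allele_stats(geno: Genotypes, variant: int) -> Dict[str, Optional[float]]:
    """Alternate and minor allele count/frequency for one diploid variant."""
    called = [row[variant] for row in geno if row[variant] is not None]
    n_alleles = 2 * len(called)  # two alleles per called diploid genotype
    if n_alleles == 0:
        return {"alt_count": 0, "alt_freq": None, "mac": 0, "maf": None}
    alt_count = sum(called)
    alt_freq = alt_count / n_alleles
    return {
        "alt_count": alt_count,
        "alt_freq": alt_freq,
        "mac": min(alt_count, n_alleles - alt_count),
        "maf": min(alt_freq, 1 - alt_freq),
    }


if __name__ == "__main__":
    toy = [
        [0, 1, None],
        [1, 2, 0],
        [2, 2, None],
        [0, None, 1],
    ]
    print(missing_rate_by_sample(toy))
    print(missing_rate_by_variant(toy))
    print(allele_stats(toy, 1))
```

Note that the alternate allele and the minor allele differ whenever the alternate allele is the more common one, as in the second toy variant, which is why both pairs of values are worth reporting.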
PIC-SURE Data Access Dashboard Updates: PIC-SURE’s Data Access Dashboard has been updated to include the number of studies and participants the user has access to based on their authorization.
Example showing the users how to access lipid measurements across harmonized variables and multiple visits using the PIC-SURE API in R, RStudio, and Python.
All previous notebook examples are now available in RStudio on Seven Bridges.
BioLINCC Phase 2 data dictionaries: These data dictionaries were submitted in PDF format which required additional intervention and delayed general release to the platform. These data dictionaries will be released as soon as is feasible for use across the platform.
Introduction to PIC-SURE
Introduction to PIC-SURE Open Access: Harmonized
Introduction to PIC-SURE Open Access: One Criterion Search
Introduction to PIC-SURE Open Access
Introduction to PIC-SURE Open Access: Multiple search criteria
Introduction to PIC-SURE Authorized Access
Introduction to PIC-SURE Authorized Access: Data Export
The table below highlights which studies were included in the 2021-07-09 data release. CRAMs and unharmonized clinical files were uploaded for the parent project CARDIA and 3 other TOPMed programs. The TOPMed program REDS-III received a version update. The unharmonized clinical files were uploaded for the 5 BioLINCC projects and the 2 open tutorial projects. The data is now available for access across the entire ecosystem.
| Study Name | phs I.D. # | Acronym | New to BioData Catalyst | New study version |
| --- | --- | --- | --- | --- |
| Treatment of Pulmonary Hypertension and Sickle Cell Disease with Sildenafil Therapy | phs002383 | WalkPHaSST | true | 1 |
| CARDIA Cohort | phs000285 | CARDIA | false | 3 |
|  | phs001601 | CCDG-PMBB | true | 1 |
|  | phs002385 | CIBMTR | true | 1 |
|  | phs002362 | CSSCD | true | 1 |
|  | phs002348 | MSH | true | 1 |
|  | phs002386 | STOPII | true | 1 |
|  | phs001542 | GALA | true | 1 |
|  | phs001661 | GCPD-A | true | 2 |
|  | phs001468 | REDS-III | false | 2 |
| Tutorial-biolincc_camp | open |  | true |  |
| tutorial-biolincc_framingham | open |  | true |  |
| Study Name | phs I.D. # | Acronym | New to BioData Catalyst | New study version |
| --- | --- | --- | --- | --- |
| Combined Exchange Area new data |  |  | false |  |
| BioLINCC – Training Dataset – Digitalis |  |  |  |  |
| BioLINCC – BabyHug | phs002415 |  | true |  |
Gen3 release notes
PIC-SURE release notes
New PIC-SURE Open Access: is now available in BioData Catalyst! PIC-SURE Open Access is available to users who have an eRA Commons account, including those who are not authorized to access any studies. The Open Access feature allows users to explore de-stigmatized, phenotypic data available in PIC-SURE prior to requesting access to data. For more information check out the and .
New PIC-SURE are available as public projects in Seven Bridges and Terra as follows:
New on Dockstore under the BioData Catalyst organization: This collection includes two WDL workflows to help users prepare their data for association testing: one for converting VCF files to GDS and one for linkage disequilibrium pruning. Stay tuned as more workflows are released.
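As a rough illustration of what linkage disequilibrium pruning does, the sketch below keeps a variant only if its squared correlation with every already-kept variant inside a base-pair window stays at or below a threshold. This is a simplified greedy scheme written for this example, not the WDL workflow's implementation. The function names, the `r2_threshold` and `window_bp` parameters and their default values, and the use of squared Pearson correlation on dosages as the LD measure are all assumptions.

```python
from typing import List, Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic variant carries no LD information
    return (cov * cov) / (var_x * var_y)


def ld_prune(
    dosages: Sequence[Sequence[float]],  # one dosage vector per variant
    positions: Sequence[int],            # base-pair positions, ascending
    r2_threshold: float = 0.1,
    window_bp: int = 1_000_000,
) -> List[int]:
    """Return indices of variants kept after greedy LD pruning."""
    kept: List[int] = []
    for j in range(len(dosages)):
        in_ld = any(
            positions[j] - positions[k] <= window_bp
            and r_squared(dosages[j], dosages[k]) > r2_threshold
            for k in kept
        )
        if not in_ld:
            kept.append(j)
    return kept


if __name__ == "__main__":
    toy_dosages = [
        [0, 1, 2, 0, 1, 2],
        [0, 1, 2, 0, 1, 2],  # identical to the first variant
        [2, 0, 1, 1, 2, 0],
        [0, 1, 2, 0, 1, 2],  # identical again, but far away
    ]
    toy_positions = [100, 200, 300, 5_000_000]
    print(ld_prune(toy_dosages, toy_positions, r2_threshold=0.5))
```

In the toy data, the second variant duplicates the first and is dropped, while the fourth survives despite being identical because it lies outside the window. Production tools typically differ in details such as the LD statistic and the choice of starting variant.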
New on Dockstore under the BioData Catalyst organization: The WDL workflows in this collection enable scalable, efficient, and flexible genome-wide gene-environment interaction analysis. GEM conducts single-variant analysis for common variants (currently in unrelated individuals only) and MAGEE conducts single-variant and variant set-based analysis for common or rare variants while allowing for relatedness. The collection also includes examples of cloud costs in the README.
Maintaining and Versioning CWL on External Tool Repositories: presents best practices for writing and maintaining CWL tools/workflows in an external tool repository, such as GitHub, so that users can better manage versions of their tools. Users should follow these best practices if they would like to publish and share their CWL tools and workflows in the since Dockstore has the ability to automatically pull changes from GitHub. These best practices will ensure that the CWL is fully portable and can run successfully not only on Seven Bridges Platforms, but also on other CWL executors such as cwltool and Toil.
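The guide itself is the authority on the recommended practices. Purely as an illustration of the kind of self-describing tool file one might keep under version control, the sketch below builds a minimal CWL v1.2 `CommandLineTool` as JSON (CWL documents may be written in JSON as well as YAML) and checks that its container image carries an explicit tag. The tool, the image `ubuntu:20.04`, the function names, and the "pin the image tag" check are assumptions chosen for this example, not rules quoted from the guide.

```python
import json


def build_echo_tool(image: str = "ubuntu:20.04") -> dict:
    """Build a minimal CWL v1.2 CommandLineTool description (hypothetical tool)."""
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "label": "echo-message",
        "doc": "Writes the input message to a text file.",
        "requirements": {"DockerRequirement": {"dockerPull": image}},
        "baseCommand": "echo",
        "inputs": {
            "message": {"type": "string", "inputBinding": {"position": 1}},
        },
        "stdout": "message.txt",
        "outputs": {"echoed": {"type": "stdout"}},
    }


def uses_pinned_image(tool: dict) -> bool:
    """True if the Docker image has an explicit tag other than 'latest'."""
    image = tool["requirements"]["DockerRequirement"]["dockerPull"]
    # Look only at the last path segment so a registry port is not
    # mistaken for a tag (e.g. localhost:5000/image).
    tag = image.rsplit("/", 1)[-1].partition(":")[2]
    return bool(tag) and tag != "latest"


if __name__ == "__main__":
    tool = build_echo_tool()
    print(json.dumps(tool, indent=2))
    print("pinned image:", uses_pinned_image(tool))
```

Saving the printed JSON to a file in a Git repository gives a tool description that any CWL v1.2 executor should be able to read, and a small check like `uses_pinned_image` could run before each tagged release.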
Transferring Files Between Seven Bridges and Terra: guides users through the process of transferring files between the two workspace environments Seven Bridges and Terra.
Accessing Egress-Free GTEx Data From AnVIL: A new data interoperability page that includes linked instructions for how to access egress-free GTEx data from NHGRI’s AnVIL cloud ecosystem is .
PIC-SURE Documentation Updates: New provides new information on the Data Access Dashboard, PIC-SURE Open Access, and a new table for understanding study-specific subject identifiers.
PIC-SURE Video Tutorials: are now available for the following topics:
Published a on the role of a secure cloud ecosystem for supporting infrastructure projects and creating connected communities, highlighting BioData Catalyst as one of several NIH-commissioned infrastructure development projects that involve not just putting data on the cloud but also building the additional layers of services that are necessary to deliver on the extraordinary promise of this new model for data sharing and analysis.