2023-01-09 BioData Catalyst Ecosystem Release Notes

Introduction

The 2023-01-09 release marks the twelfth release for the NHLBI BioData Catalyst® (BDC) ecosystem. This release includes several new features (e.g., Azure volumes now available on both main analysis platforms) along with documentation and tutorials (e.g., information on how variable tags are generated) to help new users get started on the system. This release also includes enhanced support for moving data seamlessly across platforms. Please find more detail on the new features and user support materials in the sections below.

The 2023-01-09 data releases include the addition of the Pediatric Cardiac Genomics Consortium (PCGC). Please refer to the Data Releases section below for more information as well as the Data page on the BDC website.

Significant new features

Azure volumes are now available on BDC Powered by Seven Bridges: Users can now link a Microsoft Azure bucket to their Seven Bridges workspaces. After logging in, go to Data > Volumes and select “Microsoft Azure” to be led through a bucket-linking wizard.

DRS Manifest Export: To further improve interoperability and let users move their data seamlessly across platforms, a DRS export option is now available on the Seven Bridges platforms. With this functionality, users can export links to platform files (DRS URIs) together with their metadata into a manifest file, which can then be used to import the files and metadata on other platforms.
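As a rough illustration of what a DRS URI in such a manifest points to, the sketch below converts a hostname-based DRS URI into the HTTPS object endpoint defined by the GA4GH DRS specification. The host name and object ID are made up for the example, and the function name is hypothetical. The layout of the Seven Bridges manifest itself is not described in these notes, so this only covers the URI, not the export format.

```python
from urllib.parse import urlparse


def drs_uri_to_url(drs_uri: str) -> str:
    """Map a hostname-based DRS URI to its GA4GH DRS v1 object endpoint."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parsed.path.lstrip("/")
    if not object_id:
        raise ValueError(f"DRS URI has no object ID: {drs_uri!r}")
    # Hostname-based DRS URIs resolve to /ga4gh/drs/v1/objects/<id> on that host.
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


# Hypothetical URI; a real manifest would contain platform-issued identifiers.
example = "drs://drs.example.org/abc123"
print(drs_uri_to_url(example))
```

Fetching that endpoint normally requires the credentials of the platform hosting the file, which this sketch does not handle.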

OmicCircos R Shiny app now available on BDC-Seven Bridges: The OmicCircos app is an R Shiny application built around the OmicCircos R package for generating high-quality circular plots that visualize genomic data. Common use cases include displaying mutation patterns, copy number variations (CNVs), expression patterns, and methylation patterns. Such variations can be displayed as scatterplot, line, or text-label figures.

Introduction to SAS Public Project on BDC-Seven Bridges: Seven Bridges released a Public Project to train users on how to use SAS. The public project contains three notebooks that walk a user through: 1) loading and cleaning data in SAS using ICD9 codes, 2) pulling the CDC’s Social Vulnerability Index data via API and running a regression, and 3) loading hosted 1000 Genomes data into SAS and visualizing mutation information. A user can copy the public project to their own workspace and modify the tutorial notebooks to suit their needs.

New CWL Tools/Workflows on BDC-Seven Bridges:

  • BEDTools 2.30.0 toolkit:

    • BEDTools Coverage - returns the depth and breadth of coverage of features from B on the intervals in A

    • BEDTools Genomecov - computes histograms of feature coverage for a given genome

    • BEDTools GetFasta - extracts sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file

    • BEDTools Intersect - screens for overlaps between two sets of genomic features

    • BEDTools Merge - combines overlapping or “book-ended” features in an interval file into a single feature

    • BEDTools Sort - sorts a feature file by chromosome and other criteria

  • FlowSOM 2.4.0, an algorithm for distinguishing cell populations in both flow and mass cytometry data in an unsupervised way.

  • cytofkit2 0.99.80, which is designed to analyze mass cytometry data from FCS files. It includes preprocessing, cell subset detection, cell subset visualization and interpretation, and inference of subset progression.

  • flowAI 1.24.0, which performs quality control on FCS data acquired using flow cytometry instruments. By evaluating three properties (flow rate, signal acquisition, and dynamic range), it enables the detection and removal of anomalies.

  • CNVkit 0.9.9, a toolkit for inferring and visualizing copy number from high-throughput DNA sequencing data.

  • SBG Single-Cell RNA Deep Learning - Training is a single-cell classifier pipeline for human data. It relies on a transfer learning approach that uses pre-trained gene embeddings as the starting point for building a model adjusted to the given single-cell datasets.

  • SBG Single-Cell RNA Deep Learning - Predict is a single-cell classifier pipeline for human data. This tool uses the deep learning model generated by the SBG Single-Cell RNA Deep Learning - Training workflow to classify the input dataset.
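To make the BEDTools Merge entry in the list above more concrete, here is a small pure-Python sketch of the same idea: intervals on one chromosome are combined when they overlap or are book-ended (one ends exactly where the next starts). It is an illustration only and does not call BEDTools; the function name and coordinates are invented for the example.

```python
from typing import List, Tuple

# (chromosome, start, end) using BED-style half-open coordinates
Interval = Tuple[str, int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Combine overlapping or book-ended intervals per chromosome."""
    merged: List[Interval] = []
    # Sorting by chromosome name, then start, puts mergeable intervals side by side.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


features = [
    ("chr1", 100, 200),
    ("chr1", 180, 250),  # overlaps the first feature
    ("chr1", 250, 300),  # book-ended with the previous one
    ("chr1", 400, 500),
    ("chr2", 100, 150),
]
print(merge_intervals(features))
```

The sketch sorts its input itself. Chromosomes are ordered as plain strings here, which is enough for grouping but is not a karyotypic order.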

Azure is now available on BDC Powered by Terra: Users can now log into Terra with a Microsoft Azure Cloud account. This is an invite-only version of Terra on the Azure platform. The public offering of Terra on Azure is expected in early 2023.

A new spend report is now available for BDC-Terra billing projects: The report identifies which workspaces are costing the most, to provide more transparency around cloud costs incurred in Terra. To access the spend report, go to your billing project (main menu > billing > billing project) and click on the "Spend report" tab.

New streamlined user journey from BDC Powered by PIC-SURE to analysis platforms: PIC-SURE has added “Export to Seven Bridges” and “Export to Terra” buttons to streamline data export into a BioData Catalyst analysis workspace. After exploring and filtering variables in PIC-SURE Authorized Access, users can package their data with the Select and Package Data Tool. Once the data is packaged, users can select their preferred BDC analysis platform with the new Export buttons. This provides all information needed and points the user directly to the public PIC-SURE project on either Seven Bridges or Terra.

Take a Tour of BDC-PIC-SURE: PIC-SURE has updated the guided tour of the interface to interactively display search results based on the user’s authorization. This guided tour walks through the different parts of the platform, including how to use tags, where search results are displayed, and how to interpret the Results Panel.

Known issues and workarounds

BABYHUG Data Field Issue: The BABYHUG study (phs002415) contained a data file, as provided by the data submitter, with SAS-derived newline characters inside data fields. These characters caused shifts in the data rows, so that field values were mapped to the wrong variables. A corrected version of the file has been requested from the data submitter.
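The kind of row shift described above can often be spotted by checking that every record has as many fields as the header. The sketch below does this for a small made-up tab-delimited sample. The column names, values, and function name are invented and have no connection to the actual BABYHUG file or to how the issue was found.

```python
from typing import List, Tuple


def find_shifted_rows(text: str, delimiter: str = "\t") -> List[Tuple[int, int]]:
    """Return (line_number, field_count) for lines whose field count differs from the header."""
    lines = text.split("\n")
    expected = len(lines[0].split(delimiter))
    problems = []
    for number, line in enumerate(lines[1:], start=2):
        if not line:
            continue  # ignore the empty string after a trailing newline
        count = len(line.split(delimiter))
        if count != expected:
            problems.append((number, count))
    return problems


# Invented sample: the NOTE value of record 2 contains a stray newline,
# so that record is split across physical lines 3 and 4.
sample = (
    "ID\tNOTE\tAGE\n"
    "1\tnormal\t5\n"
    "2\tsplit by a\n"
    "line break\t6\n"
    "3\tnormal\t7\n"
)
print(find_shifted_rows(sample))
```

A check like this only flags suspicious lines. Repairing the file still requires knowing which fields may legitimately contain line breaks.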

New user support materials and documentation

BDC-PIC-SURE Tag Generation: PIC-SURE has updated help text in the user interface and documentation to address the frequently asked question, “How are variable tags generated?” Users can find this help text in the “Filter by Variable Tags” box on the PIC-SURE platform and in the PIC-SURE User Guide.

Updated BDC-PIC-SURE documentation on the Export buttons: The PIC-SURE User Guide and the "Authorized Access: Select and Package Data Tool" YouTube video were updated to include information about the new Export buttons. These updates were also released in the BDC GitBook documentation.

BDC GitBook on BDC-PIC-SURE: Users can now access the BDC GitBook documentation directly from the PIC-SURE platform under the “Help” tab.

Data Releases

The table below highlights which studies were included in the 2023-01-09 data release.

The PCGC substudy contains whole exome sequences, targeted sequences, and SNP array data. It is a multi-center, observational cohort study of individuals with congenital heart defects (CHD). The study aims to investigate the relationship between genetic factors and phenotypic and clinical outcomes in patients with CHD. Summary-level phenotypes for the study participants can be viewed on the top-level study page. Individual-level data and molecular data for the study are available by requesting Authorized Access. The study has collected phenotypic data and source DNA from 10,000 probands, parents, and families of interest. The data is now available for access across the entire ecosystem.

| Study Name | phs I.D. # | Acronym | New to BioData Catalyst | New study version |
| --- | --- | --- | --- | --- |
| The Pediatric Cardiac Genomics Consortium (PCGC) | phs000571.v6.p2.c1 | PCGC-CHD-GENES_HMB | No | Yes |

Planned Upcoming Data Releases

| Study Name | phs I.D. # | Acronym | New to BioData Catalyst | New study version |
| --- | --- | --- | --- | --- |
| The Collaborative Cohort of Cohorts for COVID-19 Research (C4R) | phs002988.v1.p1.c1; phs002910.v1.p1.c1; phs002910.v1.p1.c2; phs002911.v1.p1.c1; phs002911.v1.p1.c2; phs003017.v1.p1.c1; phs002919.v1.p1.c1 | C4R_ARIC_phs002988; C4R_COPDGene_phs002910; C4R_FHS_phs002911; C4R_MESA_phs003017; C4R_REGARDS_phs002919 | No | Yes |
| Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-Be (nuMoM2b) | phs002339.v1.p1.c1 | topmed-NuMom2B_GRU-IRB | Yes | Yes |

For detailed platform release notes, please consult the following resources:

  • Gen3 release notes

  • Terra release notes

  • Seven Bridges release notes

  • PIC-SURE release notes

  • Dockstore release notes
