2023-04-04 BioData Catalyst Ecosystem Release Notes

Introduction

The 2023-04-04 release marks the thirteenth release for the NHLBI BioData Catalyst® (BDC) ecosystem. This release includes several new features, e.g., a new gallery for Public Projects and new project-based download restrictions on BDC Powered by Seven Bridges (BDC-Seven Bridges). It also includes documentation and tutorials to help new users get started on the system, e.g., how to start using the BDC Powered by PIC-SURE (BDC-PIC-SURE) API. Please find more details on the new features and user support materials in the sections below.

Please refer to the Data Releases section below for information on upcoming data releases. A list of currently available data can be viewed on the Data page of the BDC website.

Significant new features

New gallery for Public Projects on BDC-Seven Bridges: BDC-Seven Bridges has released a new user interface to make browsing and selecting public projects easier. Previously, Public Projects were listed under a dropdown menu. Now, the Public Resources > Projects dropdown opens a gallery of project cards, each with a summary and an easily clickable “Copy Project” button.

Project-based download restrictions on BDC-Seven Bridges: Many consortia have found value in using the BDC-Seven Bridges project member permissions to collaborate and distribute data prior to public release. However, the permission that lets a user add new files to a project also lets that user download files to their local environment. BDC-Seven Bridges has released a new feature that gives the project owner project-based download restrictions. When creating a project, a user can turn on Download Restrictions and choose one of two modes: allow analysis (CWL tools/workflows or Data Studio) but no download to a local environment, or allow neither analysis nor download to a local environment. To request access to the new feature, email support@sevenbridges.com.

New CWL tools and workflows on BDC-Seven Bridges:

  • Minimac 4 4.1.2: a tool for imputing genotypes.

  • GATK 4.4.0.0

    • GATK IndexFeatureFile for indexing of provided feature files.

    • GATK MergeVcfs for combining multiple variant files.

    • GATK VariantEval BETA for evaluating variant calls.

    • GATK FilterMutectCalls for filtering somatic SNVs and indels called by Mutect2.

  • HTSeq-count 2.0.2: a Python tool for counting how many reads map to each feature.

  • GraphicsMagick 1.3.38

    • GraphicsMagick compare compares two images and reports difference statistics according to specified metrics, and/or outputs an image with a visual representation of the differences.

    • GraphicsMagick composite composites (combines) images to create a new image.

    • GraphicsMagick conjure interprets and executes scripts in the Magick Scripting Language (MSL). MSL primarily benefits those who want to accomplish custom image processing tasks but do not wish to program.

    • GraphicsMagick convert is used to convert an input image file using one image format to an output file with the same or different image format while applying an arbitrary number of image transformations.

    • GraphicsMagick montage creates a composite image by combining several separate images.

  • MHC-I Binding Prediction tool (MHC I 3.1.2 toolkit) - which is used for prediction of peptides that bind to MHC I molecules.

  • MHC-II Binding Prediction tool (MHC II 3.1.6 toolkit) - which is used for prediction of peptides that bind to MHC II molecules.

  • MHCflurry Predict tool (MHCflurry 2.0.4 toolkit) - which is used for peptide/MHC I binding affinity prediction.

  • MHCflurry Scan tool (MHCflurry 2.0.4 toolkit) - which is designed to scan protein sequences and predict MHC-I ligands.

  • AXEL-F: Antigen eXpression based Epitope Likelihood-Function tool (AXEL-F 1.0.0 toolkit) - which is used for MHC-I epitope prediction.

  • NetChop tool (NetChop 3.0 toolkit) - which is a predictor of proteasomal processing based upon a neural network.

  • NetCTL tool (NetCTL 3.0 toolkit) - which is a T cell epitope predictor.

  • NetCTLpan tool (NetCTLpan 3.0 toolkit) - which is a T cell epitope predictor.

  • Class I Immunogenicity tool (Class I Immunogenicity 3.0 toolkit) - which predicts the immunogenicity of a peptide MHC (pMHC) complex.

  • TCRMatch tool (TCRMatch 1.0.2 toolkit) - which predicts T cell receptor specificity based on sequence similarity to characterized receptors.

  • BCell tool (BCell 3.1 toolkit) - which predicts linear B cell epitopes based on the antigen characteristics.

  • ElliPro tool (ElliPro 1.0 toolkit) - which predicts antibody epitopes based upon solvent-accessibility and flexibility.

  • Population Coverage tool (Population Coverage 3.0 toolkit) - which calculates the fraction of individuals predicted to respond to a given set of epitopes.

  • Epitope Cluster Analysis tool (Epitope Cluster Analysis 1.0 toolkit) - which groups epitopes into clusters based on sequence identity.

  • Picard 3.0.0 toolkit:

    • Picard CollectMultipleMetrics collects BAM statistics by running multiple Picard modules at once.

    • Picard ValidateSamFile validates an alignments file against the SAM specification.

    • Picard SortSam sorts alignment files (BAM or SAM).

    • Picard RevertSam reverts a BAM/SAM file to a previous state.

    • Picard MarkDuplicates marks duplicate reads in alignment files.

    • Picard GenotypeConcordance calculates genotype concordance between two VCF files.

    • Picard GatherBamFiles merges BAM files after a scattered analysis.

    • Picard FixMateInformation verifies and fixes mate-pair information.

    • Picard FastqToSam converts FASTQ files to an unaligned SAM or BAM file.

    • Picard CrosscheckFingerprints checks a set of data files for sample identity.

    • Picard CreateSequenceDictionary creates a DICT index file for a sequence.

    • Picard CollectWgsMetricsWithNonZeroCoverage evaluates the coverage and performance of WGS experiments.

    • Picard CollectVariantCallingMetrics can be used to collect variant call statistics after variant calling.

    • Picard CollectSequencingArtifactMetrics collects metrics to quantify single-base sequencing artifacts.

    • Picard CollectHsMetrics collects hybrid-selection metrics for alignments in SAM or BAM format.

    • Picard CollectAlignmentSummaryMetrics produces a summary of alignment metrics from a SAM or BAM file.

    • Picard CheckFingerprint checks sample identity of provided data against known genotypes.

    • Picard BedToIntervalList converts a BED file to a Picard INTERVAL_LIST format.

    • Picard AddOrReplaceReadGroups assigns all reads to the specified read group.

  • MetaCyto workflow (1.16.0 in CWL 1.2): based on R package MetaCyto that performs meta-analysis of both flow cytometry and mass cytometry (CyTOF) data. It is able to jointly analyze cytometry data from different studies with diverse sets of markers.

New and improved R adapter for BDC-PIC-SURE API: The R adapter for the BDC-PIC-SURE API has been completely revamped to improve performance, address known bugs, and make the API easier to use for R coders. All example code, in both Jupyter and RStudio, has been updated to show these code improvements in practice. Note: The old version of the R API will remain available until August 31, 2023. Updating your code to the new version before then is recommended.

BDC Powered by Gen3 (BDC-Gen3) metadata being updated to bring in data from the dbGaP FHIR database: BDC-Gen3’s Discovery Page (and the underlying BDC-Gen3 Source of Truth Metadata API) allows unauthenticated users to discover what datasets are available in BDC. Fast Healthcare Interoperability Resources (FHIR) is a Health Level Seven International (HL7) specification for healthcare interoperability. The database of Genotypes and Phenotypes (dbGaP) has recently exposed a FHIR server. BDC-Gen3 has worked to consume the new metadata from the dbGaP FHIR server (as part of the officially defined data ingestion process). BDC-Gen3’s Python-based Software Development Kit (SDK) and Command Line Interface (CLI) now have:

  • A FHIR client

  • Direct interaction with dbGaP’s FHIR API

  • Extract, Transform, Load (ETL) logic to parse the content from dbGaP’s FHIR server and load it into BDC-Gen3’s Metadata API

BDC-Gen3’s Data Ingestion Pipeline will be updated to use the above tool to load FHIR metadata with every new data release. In April 2023, the loaded metadata will be available to all clients/users through BDC-Gen3’s Metadata API and viewable on BDC-Gen3’s Discovery Page.
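The transform step of such an ETL flow can be illustrated with a minimal Python sketch. This is not the BDC-Gen3 SDK code: the function name study_records, the output keys, and the sample data are all hypothetical. The only things taken from the FHIR specification are the general shape of a Bundle (an "entry" list of "resource" objects) and the ResearchStudy fields "id", "title", "description", and "identifier".

```python
from typing import Any


def study_records(bundle: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten the ResearchStudy entries of a FHIR Bundle into simple records."""
    records = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "ResearchStudy":
            continue  # ignore resources that are not studies
        records.append(
            {
                # Output keys are hypothetical, not the BDC-Gen3 metadata schema.
                "study_id": resource.get("id"),
                "title": resource.get("title"),
                "description": resource.get("description"),
                "identifiers": [
                    ident["value"]
                    for ident in resource.get("identifier", [])
                    if "value" in ident
                ],
            }
        )
    return records


# Made-up sample data in the shape of a FHIR search Bundle.
sample_bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {
            "resource": {
                "resourceType": "ResearchStudy",
                "id": "phs000000",
                "title": "Example Study",
                "identifier": [{"value": "phs000000.v1.p1"}],
            }
        },
        {"resource": {"resourceType": "Organization", "id": "org-1"}},
    ],
}

records = study_records(sample_bundle)
print(records)
```

In a real pipeline, the Bundle would come from the FHIR server and the records would be posted to a metadata service. Both steps are omitted here because their interfaces are not described in these notes.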

New user support materials and documentation

Learn about and start using the BDC-PIC-SURE API on the new “API” page: The “API” page on the BDC-PIC-SURE website provides everything you need to get started with the BDC-PIC-SURE API. This includes the personalized access token, links to publicly available R and Python code on both BDC Powered by Seven Bridges and BDC Powered by Terra, and links to additional documentation.

Data Releases

In Q1 2023, progress was made in establishing procedures, clarifying data submission, and reworking screening protocols for multiple datasets for use with upcoming dataset ingestion. This included collaborative efforts with NHLBI to support pre-ingestion quality assurance, as well as data support for screening and assisting data submitters in preparing their data for future ingestion into BDC. Key datasets that underwent these processes include nuMoM2b (phs002808.v1.p1.c1), BABY HUG (phs002415.v1.p1.c1), MSH (phs002348.v1.p1.c1), NSRR-CFS (phs002715.v1.p1.c1), and CRA (phs000988.v4.p1.c1).
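The accession strings above follow the usual dbGaP pattern of a study number, a version (v), a participant set (p), and a consent group (c). As a small illustration, the sketch below splits such a string into its parts. The names Accession and parse_accession are hypothetical helpers, not part of any BDC tool, and the regular expression assumes a six-digit study number with an optional consent group suffix.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: phs + 6 digits, then .v<version>.p<participant set>,
# optionally followed by .c<consent group>.
ACCESSION_RE = re.compile(r"^phs(\d{6})\.v(\d+)\.p(\d+)(?:\.c(\d+))?$")


class Accession(NamedTuple):
    study: str
    version: int
    participant_set: int
    consent_group: Optional[int]


def parse_accession(text: str) -> Accession:
    """Split a dbGaP-style accession string into its components."""
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognized accession: {text!r}")
    study, version, participant_set, consent = match.groups()
    return Accession(
        study=f"phs{study}",
        version=int(version),
        participant_set=int(participant_set),
        consent_group=int(consent) if consent is not None else None,
    )


parsed = parse_accession("phs000988.v4.p1.c1")
print(parsed)
```

Parsing the identifier this way makes it easy to group files by study or to check whether a newer version of a dataset has been released.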

Planned Upcoming Data Releases

For detailed platform release notes please consult the following resources:

  • Gen3 release notes

  • Terra release notes

  • Seven Bridges release notes

  • PIC-SURE release notes

  • Dockstore release notes
