BioData Catalyst Powered by Seven Bridges offers researchers private, collaborative workspaces for analyzing genomics data at scale. Researchers can find and analyze the hosted TOPMed studies using hundreds of optimized analysis tools and workflows (pipelines), create their own workflows, or perform interactive analysis. The platform provides access to hosted datasets along with Common Workflow Language (CWL) and GENESIS R package pipelines for analysis. It also enables users to bring their own data and work in RStudio and JupyterLab Notebooks for interactive analysis.
Private, secure workspaces (projects) for running analyses at scale
Collaboration features with the ability to set granular permissions on project members
Direct access to BioData Catalyst without needing to set up a Google or AWS billing account
Access hosted TOPMed studies all in one place and analyze data on the cloud at scale
Tools and features for performing multiple-variant and single-variant association studies including:
Annotation Explorer for variant aggregations
Cloud-optimized Genesis R package workflows in Common Workflow Language
Cohort creation by searching phenotype data
Use the PIC-SURE API for searching phenotype data
Search by known dbGaP identifiers
RStudio and JupyterLab Notebooks built directly into the platform for easy interactive analysis and manipulation of phenotype data
Hosted TOPMed data you can combine with your own data on AWS or Google Cloud
Billing and administrative controls to help your research funding go further: avoid forgotten instances, abort infinite loops, get usage breakdowns by project.
The Seven Bridges Knowledge Center is a collection of user documentation which describes all of the various components of the platform, with step-by-step guides on their use. Our Knowledge Center is the central location where you can learn how to store, analyze, and jointly interpret your bioinformatic data using BioData Catalyst Powered by Seven Bridges.
From the Knowledge Center, you can access platform documentation. This content is organized into sections that deal with the important aspects of accessing and using BioData Catalyst Powered by Seven Bridges.
You can also read the Release Notes in the Knowledge Center, which keep you up to date on the latest updates and new features of BioData Catalyst Powered by Seven Bridges.
Instructions on transferring files between NHLBI BioData Catalyst Powered by Seven Bridges and NHLBI BioData Catalyst Powered by Terra
This tutorial guides users through the process of transferring files between the two workspace environments of NHLBI BioData Catalyst: NHLBI BioData Catalyst Powered by Seven Bridges and NHLBI BioData Catalyst Powered by Terra.
Most researchers select one of the workspaces as their primary analysis environment, and their labmates and collaborators typically work with them in the same workspace environment. However, there are cases where some collaborators work on Seven Bridges and others work on Terra. In this case, researchers need to share data files between the two workspaces to facilitate collaboration. When researchers run analyses on Seven Bridges, the results, or derived data, are only available on Seven Bridges. Likewise, when researchers run analyses on Terra, the results are only available on Terra. This tutorial provides step-by-step guidance on how to share derived data between the workspace environments. These instructions can also be used to share private data that has been uploaded to Seven Bridges or Terra.
Both open access data and controlled access data can be shared across workspace environments. Importantly, if a researcher intends to share controlled access data, they must ensure that all recipients have the necessary dbGaP permissions for those files. In some cases, this may mean the researchers must be listed as collaborators on their respective dbGaP applications. These instructions are intended for sharing files under 1 terabyte (TB) in size. If you want to share data larger than 1 TB, contact the BioData Catalyst Help Desk to discuss your use case.
Transferring large amounts of data between cloud providers or regions is not recommended; for example, moving data from AWS to Google costs approximately $100/TB.
The first consideration is platform accounts. Moving data between Seven Bridges and Terra is currently a manual process and requires that one of the researchers involved in sharing has an account on both platforms. It is recommended that the recipient of the shared data is the person to have accounts on both Seven Bridges and Terra.
Let’s consider an example: Sebastian, who works on Seven Bridges, and Teresa, who works on Terra. If Sebastian wants to share data with Teresa so that she can use the data on Terra, Teresa first needs to set up an account on Seven Bridges. Now Teresa has an account on Terra and an account on Seven Bridges. Sebastian shares the data with Teresa on Seven Bridges by adding her as a member of the project containing the data he wants to share, with Copy permissions. For information on permissions, refer to the Seven Bridges Set permissions documentation. Once Teresa is added as a member of the project, she can move the data from the Seven Bridges project to a workspace on the Terra platform, following the instructions in the section titled Moving Data From Seven Bridges to Terra.
If Teresa (Terra) wants to share data with Sebastian (Seven Bridges) so that he can use the data on Seven Bridges, Sebastian first needs to create an account on Terra. Now Sebastian has an account on Seven Bridges and an account on Terra. Teresa can share the data with Sebastian on Terra by sharing the workspace with the data she wants to share with Sebastian. For information on sharing workspaces, refer to the Terra How to share a workspace documentation.
To create a Terra account, refer to the Terra documentation.
To create a Seven Bridges account, refer to the Seven Bridges documentation. If you are new to Seven Bridges, you may find this Getting Started Guide helpful.
The second consideration is making sure the researcher moving data between the two workspaces has billing groups set up on both workspaces to cover cloud costs if necessary. Contact the BioData Catalyst Help Desk if you have questions about how to get a billing group on Seven Bridges or Terra.
The following steps describe how to use the Seven Bridges platform to pull data securely from a Terra workspace into a Seven Bridges project.
Refer to the Terra documentation for Moving data to/from a Google bucket (workspace or external), specifically the section Upload and download data files in a terminal using gsutil. This method:
Works well for transfers of all sizes.
Is ideal for large files or thousands of files.
Can be used for transfers between local storage and a bucket, between a workspace VM or persistent disk and a Google bucket, and between Google buckets (external and workspace).
You will use the terminal in JupyterLab on the Seven Bridges workspace environment. The reason for this is that although Seven Bridges can run on the Google Cloud Platform, the Google bucket API is not exposed in the same manner as it is on Terra. Therefore, you will start a JupyterLab notebook on Seven Bridges, using the project you would like to be the destination for the copied data. Refer to the Seven Bridges documentation for launching JupyterLab notebooks on Seven Bridges and accessing the terminal in a JupyterLab environment.
After launching the notebook, the next step is to open the terminal and install gsutil, a Python program that lets end users add data to or copy data from a Google Cloud bucket. After opening the terminal, run the following commands:
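A minimal sketch of these commands (assuming pip is available in the JupyterLab image, which it typically is):

```bash
# Install the standalone gsutil client from PyPI
pip install gsutil

# Start the authentication flow; this prints a URL to open in the browser
gsutil config
```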
Installing gsutil takes only a few seconds.
The config command provides a secure URL for you to navigate to in the browser. You will authenticate with the same credentials you used to log in to Terra. To access the printed URL from the JupyterLab terminal, press shift and right-click, which displays options to copy the URL. Copy and then navigate to the URL in a new browser tab, which will direct you to Google authentication:
Google will provide an authentication code that you will copy and paste into the terminal.
Next, type in the Google project ID. This is found on the right side of the Terra Workspace Dashboard.
Next, run the command below to display the Google buckets attached to the project ID.
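A sketch of that command, assuming the project ID was entered during gsutil config:

```bash
# List the Google buckets attached to the configured project
gsutil ls
```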
The Google bucket name for the Terra project can be found in the lower right corner of the Terra Workspace.
Running gsutil ls on the Google bucket name will display the folders and files from the Terra workspace.
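For example (the bucket name below is a hypothetical placeholder; substitute your Terra workspace bucket):

```bash
# List the contents of the Terra workspace bucket
gsutil ls gs://fc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```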
To copy a folder to the Seven Bridges workspace environment, run the following command:
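A sketch of the copy command (the bucket and folder names are hypothetical placeholders):

```bash
# Recursively copy a folder from the Terra workspace bucket into the
# Seven Bridges project output directory
gsutil cp -R gs://fc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/my-results /sbgenomics/output-files
```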
There are a couple of important things to note about the gsutil cp command. First, the -R flag is used to recursively copy a folder and all of its subfolders and files. Most users will likely want to use the -R flag; it should be omitted when copying individual files or when using a wildcard such as “*.vcf”.
Additionally, /sbgenomics/output-files should be the destination folder when bringing in data from Terra, as this will ensure the files or folders get populated back to the Seven Bridges project. Refer to the Save analysis outputs documentation for information about working with files in Data Cruncher environments. After the JupyterLab instance is shut down, your files will automatically be populated in your project-files tab on Seven Bridges.
In this section we will discuss pushing data from a Seven Bridges project to a Terra workspace.
The process of moving data from Seven Bridges to Terra uses the same setup as the previous section, with the arguments of the gsutil copy command reversed:
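A sketch of the reversed command (folder and bucket names are hypothetical placeholders; /sbgenomics/project-files is where Data Cruncher exposes your project files):

```bash
# Push a folder from the Seven Bridges project to the Terra workspace bucket
gsutil cp -R /sbgenomics/project-files/my-results gs://fc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```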
You will still use the -R flag, but the destination is now a Terra bucket. The Terra workspace’s Google bucket name/ID can be found on the Terra workspace Dashboard tab. You can verify that the folder has been copied by navigating to the Files section of the Data tab in your Terra workspace.
Click on the folder to confirm that its files have been copied.
BioData Catalyst Powered by Terra is a user-friendly system for doing biomedical research in the cloud. Terra workspaces integrate data, analysis tools, and built-in security components to deliver smooth research flows from data to results.
The following entries in this section of the BioData Catalyst documentation are a starting point for learning how to use Terra in the context of the BioData Catalyst ecosystem. You can also dive deeper into Terra by visiting the Terra website and the Terra Support Center. Wherever possible, we highlight specific articles, tutorial videos, and example workspaces that will help you learn what you need to know to accelerate your research.
If you can't find what you are looking for, we are happy to help. See the Troubleshooting and Support section for more information.
Please note that Terra is designed for and tested with the Chrome browser.
This guide has been prepared to help you with your first set of projects on BioData Catalyst Powered by Seven Bridges.
This guide aims to help you take advantage of the platform’s features and functionality for performing analyses on Seven Bridges, and to ensure that you can set up your analyses as efficiently as possible to save time and money.
The following topics are covered in this guide:
The basics of working with CWL tools and workflows on the platform.
How to specify computational resources on the platform and how to use the default options selected by the execution scheduler.
How to run Batch analyses and take advantage of parallelization with scatter.
The basics of working with JupyterLab Notebooks and RStudio for interactive analysis.
You can refer to the guide here.
The Annotation Explorer is an application developed by Seven Bridges in collaboration with the TOPMed Data Coordinating Center. The application enables users to interactively explore, query, and study characteristics of an inventory of annotations for the variants called in TOPMed studies. This application can be used pre-association testing to interactively explore aggregation and filtering strategies for variants based on annotations and generate input files for multiple-variant association testing. It can also be used post-association testing to explore annotations associated with a set of variants, like variant sets found significant during association testing.
The Annotation Explorer currently hosts a subset of genomic annotations obtained using Whole Genome Sequence Annotator software for TOPMed variants. Currently, annotations for TOPMed Freeze5 variants and TOPMed Freeze8 variants are integrated with the Annotation Explorer. Researchers who are approved to access one or more of the TOPMed studies included in Freeze8 or Freeze5 will be able to access these annotations in the Annotation Explorer.
For more information, refer to the Annotation Explorer's Public Project Page.
Now that you can log in, you’ll want to make sure that you have access to a Billing Account and Billing Project. This will allow you to charge storage and analysis costs to a Google account linked to Terra. A Terra Billing Project is Terra's way of connecting a workspace where you accrue costs back to the Google Billing Account that pays for them. You must have a Google Billing Account established before creating a Terra Billing Project. Outlined here are the steps necessary to set this up, as well as instructions on how to add or be added to an existing account/billing project.
Detailed instructions for setting up your billing can be found by following the links below. If you are a BioData Catalyst Fellow, your billing setup procedure is a bit different, but you may find some of the information below still relevant (sharing a billing project with another user, for example).
Step 1: Get Cloud credits for BioData Catalyst
Step 2: Wait for approval & review the Billing overview for BioData Catalyst users
Step 3: Credits approved. Now create a new Terra billing project
Step 4 (optional): Sharing Billing Projects among colleagues
Logging in to Terra for the first time is a quick and straightforward process. The process is easiest if you already have an email address hosted by Google. If you want to use an email address that is not hosted by Google, we have instructions for that as well.
Article: How to register for a Terra account
Article: Setting up a Google account with a non-Google email
We also recommend our article on navigating in Terra to get familiar with basic menus and options, as well as this video introduction to Terra.
Read on in the next two subsections for primers on how to set up billing and how to manage costs.
Terra workspaces include a dedicated workspace Google bucket, as well as a built-in data model for managing your data. We provide articles in Terra’s knowledge base explaining how to organize and access data in a variety of ways.
A key to understanding the power of Terra is understanding its built-in data model, which allows you to rewire the inputs and outputs of your workflows and Jupyter notebooks.
The following resources give you guided instructions for using cloud-based data with Terra:
Article: Managing data with tables
Video: Introduction to Terra data tables
Article: Uploading to a workspace Google bucket
Article: How to import metadata to a workspace data table
Video: Making and uploading data tables to Terra
Sharing a workspace allows collaborators to actively work together in the same project workspace. Workspaces can be used as repositories of data, workflows, and Jupyter notebooks. Learn more about how to securely share a workspace:
Article: How to share a workspace
Article: Reader, writer or owner? Workspace access controls, explained
Article: Using permissions
Video: Introduction to Collaboration and Sharing in Terra
"An app store for bioinformatics workflows"
Dockstore is an open platform used by the GA4GH for sharing Docker-based tools described with either the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL). Dockerized workflows come packaged with all of their requirements, meaning you spend less time searching the web for obscure installation errors and more time doing research.
Dockstore is aimed at scientific use cases, and we hope this helps users find helpful resources more quickly. Our documentation is also created with researchers in mind: we work to distill down information about the technologies we use to the relevant points to get users started quickly.
This section highlights the documentation relevant to BioData Catalyst users. If you are brand new to Dockstore, it is suggested to review the Getting Started Guide. Our entire suite of documentation is available here.
This page describes how researchers may bring their own data files and metadata into Terra. Some researchers may choose to bring their own data to Terra in addition to - or instead of - using BioData Catalyst data from Gen3. For example, this may be done when bringing additional (e.g., longitudinal) phenotypic data to enhance the harmonized metadata available from Gen3, when using joint variant calling with additional researcher-provided genomic data, or even when using researcher-provided data exclusively.
Generally, researchers bring two types of data to Terra: data files (e.g., genomic data, including CRAM and VCF files) and metadata (e.g., tables of clinical/phenotypic or other data, typically regarding the subjects in their study). These are described separately below.
There are two ways a researcher's data files may be made available in Terra: by uploading data to the researcher's workspace bucket, or by enabling Terra to access the researcher's data in a researcher-managed Google bucket, which requires setting up a proxy group.
Article: Uploading to a workspace Google bucket
Article: Understanding and setting up a proxy group
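As a concrete illustration of the first option, an upload with gsutil might look like the following (the file and bucket names are hypothetical placeholders):

```bash
# Upload a local VCF into a folder of your Terra workspace bucket
gsutil cp my_cohort.vcf.gz gs://fc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/uploads/
```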
The ways in which a researcher may import metadata to the Terra Data tables are described in the articles and tutorials below:
Article: Managing data with tables
Article: How to import metadata to a workspace data table
Video: Introduction to Terra data tables
Video: Making and uploading data tables to Terra
The batch workflow features of Terra provide support for computationally intensive, long-running, and large-scale analysis.
You can run whole pipelines, from preprocessing and trimming sequencing data to alignment and downstream analyses, using Terra workflows. Workflows are written in the human-readable Workflow Description Language (WDL), and you can search for them and import them into your workspace from Dockstore or the Broad Methods Repository.
Video: Data Analysis with Gen3, Terra and Dockstore
Article: How to import data from Gen3 into Terra and run the TOPMed aligner workflow
Article: Configure a workflow to process your data
Article: Getting workflows up and running faster with a JSON file
Article: Importing a Dockstore workflow into Terra
Video: Importing a Dockstore workflow into Terra walkthrough
Video: Workflows Quickstart walkthrough
Workspace: Workflows Quickstart workspace
Workspace for BioData Catalyst: TOPMed Aligner workspace
Workspace for BioData Catalyst: GWAS with 1000 Genomes and synthetic clinical data
Workspace for BioData Catalyst: GWAS with TOPMed data
Terra provides powerful support for performing Genome-Wide Association Studies (GWAS). The following featured and template workspaces include Jupyter notebooks for phenotypic and genomic data preparation (using Hail) and workflows (using GENESIS) to perform single or aggregate variant association tests using mixed models. We will continue to provide more resources for performing more complex GWAS scenarios in BioData Catalyst.
A Jupyter Notebook in both of the following workspaces uses Hail to generate Genetic Relatedness Matrices for input into the GWAS workflows. Users with access to kinship matrices from the TOPMed consortium may wish to skip these steps and instead import kinship files using the bring-your-own-data instructions.
The BioData Catalyst GWAS tutorial workspace was created to walk users through a GWAS with training data that includes synthetic phenotypic data (modeled after traits available in TOPMed) paired with 1000 Genomes open-access data. This tutorial aims to familiarize users with the Gen3 data model so that they can become empowered to use this data model with any existing tutorials available in the Terra library’s showcase section.
This template is an example workspace that asks researchers to export TOPMed projects (to which they have access) into an example template for conducting a common-variant, mixed-models GWAS of a blood pressure trait. Our goal was to include settings and suggestions to help users interact with data exactly as they receive it through BioData Catalyst. Accommodating other datasets may require modifying many parts of this notebook. Inherently, the notebook is an interactive analysis where decisions are made as you go. It is not recommended to apply the notebook to another dataset without careful thought.
Cost Examples
Below are reported costs from using 1,000 and 10,000 samples to conduct a GWAS using the BioData Catalyst GWAS Blood Pressure Trait template workspace. The costs were derived from single-variant tests that used Freeze 5b VCF files filtered for common variants (MAF < 0.05) for input into workflows. How these steps scale will vary with the number of variants, individuals, and parameters chosen. TOPMed Freeze 5b VCF files contain 582 million variants, and Freeze 8 increases to ~1.2 billion. For GWAS analyses with Freeze 8 data, computational resources and costs are expected to be significantly higher.
These costs were derived from running these analyses in Terra in June 2020.
Analysis Step | Cost (n=1,000; Freeze 5b) | Cost (n=10,000; Freeze 5b)
Hail notebook (GRM generation) | $29.34 ($19.56/hr for 1.5 hours) | $336 ($56/hr for 6 hours)
vcfTogds workflow | $1.01 | $5.01
genesis_GWAS workflow | $0.94 | $6.67
TOTAL | $31.29 | $347.68
The interactive analysis features of Terra support interactive data exploration, including the use of statistical methods and graphical display. Versatile and powerful interactive analysis is provided through Jupyter Notebooks in both Python and R languages.
Jupyter Notebooks run on a virtual machine (VM). You can customize your VM’s installed software by selecting one of Terra's preinstalled notebook cloud environments or by specifying a custom Docker container. Docker containers ensure you and your colleagues analyze with the same software, making your results reproducible.
Article: Interactive statistics and visualization with Jupyter notebooks
Article: Customizing your interactive analysis application compute
Article: Terra's Jupyter Notebooks environment Part I: Key components
Article: Terra's Jupyter Notebooks environment Part II: Key operations
Article: Terra's Jupyter Notebooks environment Part III: Best Practices
Video: Notebooks overview
Video: Notebooks Quickstart walkthrough
Workspace: Notebooks Quickstart workspace
Workspace: BioData Catalyst notebooks collection
Workspace: PIC-SURE Tutorial in R
Workspace: PIC-SURE Tutorial in Python
If things aren’t going quite as expected, there are a number of avenues to help unblock any issues you may have.
Troubleshooting
This section of the Terra knowledge base contains many useful articles on how to address problems, including a variety of articles describing common workflow errors, as well as more general articles that explain how to find which errors are affecting your work and how to proceed once you’ve diagnosed your problem.
Monitor your jobs
The Job History tab is your workflow operations dashboard, where you can check the status of past and current workflow submissions and find links to the Job Manager, where you can diagnose issues.
How to report an issue
There are a number of ways you can report an issue directly to us, outlined in this article. If something appears broken, slow, or just plain weird, feel free to let us know.
Community forum
A lot of answers can be found on our forum, which is monitored by our dedicated frontline support team and has an integrated search function. If you suspect that you’re running into a common issue but can’t find an answer in the documentation, this is a great place to check.
How to use Dockstore workflows in our cloud partner platforms
Using the NHLBI BioData Catalyst ecosystem, you can launch workflows from Dockstore in both of our partner analysis platforms, Terra and Seven Bridges. It is important to know that these platforms use different workflow languages: Terra uses WDL and Seven Bridges uses CWL.
When you open any WDL or CWL workflow in Dockstore, you will see the option to "Launch with NHLBI BioData Catalyst":
If you selected a CWL workflow, it will launch in BioData Catalyst Powered by Seven Bridges. Learn more about how this integration works.
If you selected a WDL workflow, this workflow will launch in BioData Catalyst Powered by Terra. Learn more about how this integration works.
Dockstore offers faceted search, which allows for flexible querying of tools and workflows. Tabs are used to split up the results between tools and workflows. You can search for basic terms/phrases, filter using facets (like CWL vs WDL), and also use advanced search queries. Learn more here.
You can also search curated workflows in Dockstore's Organizations page.
Organizations are landing pages for collaborations, institutions, consortiums, companies, etc. that allow users to showcase tools and workflows. This is achieved through the creation of collections, which are groupings of related tools and workflows. Learn more about Organizations and Collections, including how your research group can create its own organization to share your work with the community.
Dockstore Organizations relevant to BioData Catalyst users:
Here, you can find a suite of analysis tools we have developed with researchers that are aimed at the BioData Catalyst community. Examples include workflows for performing GWAS and Structural Variant Calling. Many of these collections also point users to tutorials where you can launch these workflows in our partner platforms and run an analysis.
These workflows are based on pipelines the University of Michigan developed to perform alignment and variant calling on TOPMed data. If you're bringing your own data to BioData Catalyst to compare with TOPMed data, these may be helpful resources.
Technologies for reproducible analysis in the cloud
Docker is a fantastic tool for creating lightweight containers to run your tools. It gives you a fast, VM-like environment for Linux where you can automatically install dependencies, make configurations, and set up your tool exactly the way you want, just as you would on a “normal” Linux host. You can then quickly and easily share these Docker images with the world using registries like Quay.io (indexed by Dockstore), Docker Hub, and GitLab.
Learn how to create a Docker image
There are multiple workflow languages currently available for use with Docker technology. In the BioData Catalyst ecosystem, Seven Bridges uses CWL and Terra uses WDL. To learn more about how these languages compare and differ, read Dockstore's documentation on tools and workflows.
Once you have picked what language works best for you, prepare your pipeline for analysis in the cloud with these tutorials aimed at bioinformaticians:
Learn how to create a tool in Common Workflow Language (CWL)
Learn how to create a tool in Workflow Description Language (WDL)
Dockstore’s integration with BioData Catalyst allows researchers to easily launch reproducible tools and workflows in secure workspace environments for use with sensitive data. This privilege to work with sensitive data requires assurances of safe software.
We believe we can enhance the security and reliability of tools and workflows through open, community-driven best practices that exemplify the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles. We have established a best practices framework for secure and FAIR workflows published in Dockstore. We ask that users try to implement these practices for all workflows they develop.
One of the key steps to becoming an advanced user and being able to fully understand and leverage the power of BioData Catalyst Powered by Seven Bridges is to learn how to detect and correct errors that prevent the successful execution of your analyses. The Troubleshooting tutorial presents some of the most common errors in task execution on the platform and shows you how to debug and resolve them. There is also a corresponding public project on the platform called "Troubleshooting Failed Tasks" which has examples of the failed analyses presented in the written tutorial.
For researchers interested in performing genotype-phenotype association studies, Seven Bridges offers a suite of tools for both single-variant and multiple-variant association testing on NHLBI BioData Catalyst Powered by Seven Bridges. These tools and features include the GENetic EStimation and Inference in Structured samples (GENESIS) pipelines, which were developed by the Trans-Omics for Precision Medicine (TOPMed) Data Coordinating Center (DCC) at the University of Washington. The Seven Bridges team collaborated with the TOPMed DCC to create Common Workflow Language (CWL) tools for the GENESIS R functions, and arranged these tools into five computationally efficient workflows (pipelines).
These GENESIS pipelines offer methods for working with genotypic data obtained from sequencing and microarray analysis. Importantly, these pipelines have the robust ability to estimate and account for population and pedigree structure, which makes them ideal for performing association studies on data from the TOPMed program. These pipelines also implement linear mixed models for association testing of quantitative phenotypes, as well as logistic mixed models for association testing of binary (e.g. case/control) phenotypes.
Below, we feature our GENESIS Benchmarking Guide to assist users in estimating cloud costs when running GENESIS workflows on NHLBI BioData Catalyst Powered by Seven Bridges.
The objective of the GENESIS Benchmarking Guide is to instruct users on the drivers of cloud costs when running GENESIS workflows on the NHLBI BioData Catalyst Powered by Seven Bridges.
For all GENESIS workflows, the Seven Bridges team has performed comprehensive benchmarking analyses on Amazon Web Services (AWS) and Google Cloud Platform (GCP) instances for different scenarios:
2.5k samples (1000G data)
10k samples (TOPMed Freeze5 data)
36k samples (TOPMed Freeze5 data)
50k samples (TOPMed Freeze5 data)
The resulting execution times, costs, and general advice for running GENESIS workflows can be found in the sections below. In these sections, each GENESIS workflow is described, followed by the benchmarking results and some tips for implementing that workflow from the Seven Bridges Team. Lastly, we included a Methods section to describe our approach to benchmarking and interpretation for your reference.
The contents of this guide are arranged as follows:
Introduction
Helpful Terms to Know
GENESIS VCF to GDS
GENESIS Null model
GENESIS Single Association testing
GENESIS Aggregate Association testing
GENESIS Sliding window Association testing
General considerations
Below is a link to download the results of our Benchmarking Analysis described herein. It may prove useful to have this file open for reference when reading through this guide.
Before continuing on to the benchmarking results, please familiarize yourself with the following helpful terms to know:
Tool: Refers to a stand-alone bioinformatics tool or its Common Workflow Language (CWL) wrapper that is created or already available on the platform.
Workflow/Pipeline (interchangeably used): Denotes a number of tools connected together in order to perform multiple analysis steps in one run.
App: Stands for a CWL wrapper of a tool or a workflow that is created or already available on the platform.
Task: Represents an execution of a particular tool or workflow on the platform. Depending on what is being executed (tool or workflow), a single task can consist of only one tool execution (tool case) or multiple executions (one or more per each tool in the workflow).
Job: This refers to the “execution” part of the “Task” definition (see above). It represents a single run of a single tool found within a workflow. If you come from a computer science background, you will notice that this definition is quite similar to the common understanding of the term “job”, except that here a “job” is a component of a bigger unit of work called a “task”, and not the other way around. To further illustrate what a job means on the platform, we can visually inspect jobs after a task has been executed using the View stats & logs panel (button in the upper right corner on the task page):
Figure 1. The jobs for an example run of RNA-Seq Quantification (HISAT2, StringTie) public workflow
The green bars under the gray ones (apps) represent the jobs (Figure 1). As you can see, some apps (e.g. HISAT2_Build) consist of only one job, whereas others (e.g. HISAT2) contain multiple jobs that are executed simultaneously.
In this section, we detail the process of converting a VCF to a GDS via a GENESIS workflow. This VCF to GDS workflow consists of 3 steps:
Vcf2gds
Unique variant id
Check GDS
The first two steps are required, while the last one is optional. The Check GDS step, when included, is the biggest cost driver in these tasks.
The Check GDS tool is a QC step that checks whether the final GDS file contains all variants present in the input VCF/BCF. This step is computationally intensive, and its execution time can be 4-5 times longer than the rest of the workflow. A failure of this step is also something we experience very rarely. In our results, there is a Check GDS column that indicates whether the Check GDS step was performed.
We advise anyone using this workflow to consider the results from the table below, because the differences in execution time and price with and without this check are considerable. The final decision on the approach depends on the resources available (budget and time) and the preference for including or excluding the optional QC step.
In addition, CPU/job and Memory/job parameters have direct effects on execution time and the cost of the GENESIS VCF to GDS workflow. A combination of these parameters defines the number of jobs (files) that will be processed in parallel.
For example:
If a task is run on c5.9xlarge (36 CPUs and 72 GB RAM) with 1 CPU/job and 4 GB/job, the number of jobs run in parallel will be min{36/1, 72/4} = 18.
If a task is run on c5.9xlarge (36 CPUs and 72 GB RAM) with 1 CPU/job and 8 GB/job, the number of jobs run in parallel will be min{36/1, 72/8} = 9.
In this example, the second case would take twice as long as the first.
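The same arithmetic as a short shell sketch, using the figures from the example above:

```bash
# Parallel jobs per instance = min(CPUs / CPU_per_job, RAM / memory_per_job)
instance_cpus=36; instance_ram_gb=72    # c5.9xlarge
cpu_per_job=1; mem_gb_per_job=4         # first example's settings
jobs_by_cpu=$(( instance_cpus / cpu_per_job ))       # 36
jobs_by_mem=$(( instance_ram_gb / mem_gb_per_job ))  # 18
echo $(( jobs_by_cpu < jobs_by_mem ? jobs_by_cpu : jobs_by_mem ))  # prints 18
```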
The following conclusions were drawn from performed benchmarking:
Benchmarking showed that the most suitable AWS instances for this workflow are c5 instances.
For all tasks that we ran (from 2.5k up to 50k samples), 1 CPU and 4 GB per job were sufficient.
For small sample sizes (up to 10k samples), tasks can be run on spot/preemptible instances to additionally decrease the cost.
For up to 10k samples, 2 GB per job could suffice, but note that if the Check GDS step is also run, execution time and price will not be much lower, because the CPU and memory per job inputs apply only to the Vcf2gds step and not to the whole workflow.
We recommend using VCF.GZ as input files rather than BCF, as the conversion process cannot be parallelized when using BCFs.
If you have more files to convert (e.g., multiple chromosomes), we recommend running one analysis with all files as inputs, rather than a batch analysis with separate tasks for each file.
The GENESIS Null model workflow is not computationally intensive and it is relatively low-cost compared to other GENESIS workflows. For that reason, we present results that we obtained without any optimization below:
The null model can be fit with relatedness matrices (i.e. mixed models) or without relatedness matrices (i.e. simple regression models). If a relatedness matrix is provided, it can be sparse or dense. The tasks with dense relatedness matrix are the most expensive and take the longest to run. For the Null model workflow, available AWS instances appear to be more suitable than Google instances available on the platform.
Results of the GENESIS Single Association Testing workflow benchmarking can be seen in the table above. Some important notes to consider when using this workflow:
Null model type effect: The main cost and duration driver in these tasks is the null model type. The null model can be fit without a relatedness matrix (i.e., simple regression models), or with a relatedness matrix that can be sparse or dense (i.e., mixed models). The table above shows that task cost and execution time increase with the density of the null model (dense > sparse > none). Differences between the model with a dense matrix and the model with a sparse matrix are significant, driven by both the increased CPU time and the memory required to use a dense matrix. Our advice is to check the relatedness matrix before fitting the model and transform it to sparse if it is dense, especially if the sample size is large. The null model type directly influences further parameter setup, such as instance type selection, CPU per job, and memory per job.
Instance type: Benchmarking showed that the most appropriate instance type is an AWS r5.x instance. Instance type selection is especially important when performing an analysis with many samples (30k participants and above). In tasks with up to 30k samples, r5.4xlarge instances can be used, and r5.12xlarge when more participants are included. In addition, it is important to note that if a single association test is performed with a dense null model, then r5.12xlarge or r5.24xlarge instances should be picked. Results for Google instances can be seen in the above table as well. Since there often isn’t a Google instance that is the exact equivalent of the AWS instance, we recommend choosing the most appropriate Google instance (matching the chosen AWS instance) from the list of available Google instances on BioData Catalyst.
CPU and memory per job: The CPU and memory per job input parameters determine the number of jobs to be run in parallel on one instance. For example:
If a task is run on c5.9xlarge (36 CPUs and 72 GB RAM) with 1 CPU/job and 4 GB/job, the number of jobs run in parallel will be min{36/1, 72/4} = 18.
If a task is run on c5.9xlarge (36 CPUs and 72 GB RAM) with 1 CPU/job and 8 GB/job, the number of jobs run in parallel will be min{36/1, 72/8} = 9.
The bottleneck in single-variant association testing is memory, so we suggest carefully considering this parameter together with the instance type. Workflow defaults are 1 CPU/job and 8 GB/job. The table above shows that these tasks require much more memory than CPUs; therefore, r5.x instances are most appropriate in these cases. The table additionally shows that the task where the null model is fit with the dense relatedness matrix requires the most memory per job. This parameter also depends on the number of participants included in the analysis.
Maximum number of parallel instances: The default number of parallel instances is 8. The impact of changing this number is mainly reflected in execution time: tasks with more parallel instances will finish faster. This parameter can be set in Execution settings when drafting a task. However, each user has a limited total number of parallel instances, and reserving a large number of parallel instances per task decreases the number of different tasks that can run at the same time.
Spot instances: If a task is expected to finish within a few hours, it can be run on spot instances to reduce execution cost. However, losing a spot instance means rerunning the task on on-demand instances, which can lead to a higher cost than running the task on on-demand instances from the beginning. That is why spot instances are generally suitable only for short tasks.
GENESIS Aggregate association testing can be performed using burden, SKAT, SMMAT, fastSKAT and SKAT-O tests. Our general conclusions are as follows:
Null model selection: As in single-variant association testing, the main cost and duration driver in these tasks is the null model type. The null model can be fit without a relatedness matrix (i.e., simple regression models), or with a relatedness matrix that can be sparse or dense (i.e., mixed models). The table above shows that task cost and execution time increase with the density of the null model (dense > sparse > none). Differences between the model with a dense matrix and the model with a sparse matrix are significant, driven by both the increased CPU time and the memory required to use a dense matrix. Our advice is to check the relatedness matrix before fitting the model and transform it to sparse if it is dense, especially if the sample size is large. The null model type directly influences further parameter setup, such as instance type selection, CPU per job, and memory per job.
Instance type: Benchmarking showed that the most appropriate instance type is an AWS r5.x instance. The majority of the tasks can be run on r5.12xlarge instances, or on r5.24xlarge instances when the null model includes the dense relatedness matrix. Results for Google instances can be seen in the above table as well. Since the Google instance options often do not have an exact AWS equivalent, we selected the closest match from the list of available Google instances on BioData Catalyst.
CPU and memory per job: CPUs and memory per job input parameters determine the number of jobs to be run in parallel on one instance. For example:
If a task is run on c5.9xlarge (36 CPUs and 72 GB RAM) with 1 CPU/job and 4 GB/job, the number of jobs run in parallel will be min{36/1, 72/4} = 18.
If a task is run on c5.9xlarge (36 CPUs and 72 GB RAM) with 1 CPU/job and 8 GB/job, the number of jobs run in parallel will be min{36/1, 72/8} = 9.
Different tests can require different computational resources.
As can be seen, for small sample sizes up to 10 GB per job can be sufficient for tasks to complete successfully. One exception is a task whose null model is fit with the dense relatedness matrix, where approximately 36 GB/job is needed. With 50k samples, jobs require 70 GB. Details can be seen in the table above. In addition to sample size, the memory required is determined by the number of variants included in each aggregation unit, as all variants in an aggregation unit are analyzed together.
SKAT and SMMAT tests are similar when it comes to CPU and memory per job requirements. Roughly, these tests require 8 GB/CPU; details for different task configurations can be seen in the table below:
Maximum number of parallel instances: The default number of parallel instances is 8. The impact of changing this number is mainly reflected in execution time: tasks with more parallel instances will finish faster. This parameter can be set in Execution settings when drafting a task. However, each user has a limited total number of parallel instances, and reserving a large number of parallel instances per task decreases the number of different tasks that can run at the same time.
Spot instances: If a task is expected to finish within a few hours, it can be run on spot instances to reduce execution cost. However, losing a spot instance means rerunning the task on on-demand instances, which can lead to a higher cost than running the task on on-demand instances from the beginning. That is why spot instances are generally suitable only for short tasks.
GENESIS Sliding window association testing can be performed using burden, SKAT, SMMAT, fastSKAT, and SKAT-O tests. When running a sliding window test, it is good to know the following:
Null model selection: As in the previous tests, the main cost and duration driver in these tasks is the null model type. The null model can be fit without a relatedness matrix (i.e., simple regression models), or with a relatedness matrix that can be sparse or dense (i.e., mixed models). The table below shows that task cost and execution time increase with the density of the null model (dense > sparse > none). Differences between the model with a dense matrix and the model with a sparse matrix are significant, driven by both the increased CPU time and the memory required to use a dense matrix. Our advice is to check the relatedness matrix before fitting the model and transform it to sparse if it is dense, especially if the sample size is large. The null model type directly influences further parameter setup, such as instance type selection, CPU per job, and memory per job.
Instance type: Benchmarking showed that analyses with or without a sparse relatedness matrix can be completed on a c5.9xlarge AWS instance. For analyses with a dense relatedness matrix included in the null model and with 50k samples or more, r5.12xlarge instances can be used. Also, it is important to note that in this case increasing the instance size (for example, from c5.9xlarge to c5.18xlarge) will not shorten execution time; it can have the completely opposite effect. By increasing the size of the instance, we also increase the number of jobs running in parallel, and at some point many jobs accessing the same memory space can reduce performance and increase task duration. Results for Google instances can be seen in the respective tables. Since the Google instance options often do not have an exact AWS equivalent, we selected the closest match from the list of available Google instances on BioData Catalyst.
CPU and memory per job: When running a sliding window test, it is important to ensure that the CPU resources of the instances are not overused. Avoiding 100% CPU usage in these tasks is crucial for fast execution. For that reason, it is good to decrease the number of jobs running in parallel on one instance. The number of parallel jobs is highlighted in the summary table, as it is an important parameter for the execution of this task. We can choose different CPU and memory inputs as long as the combination gives an appropriate number of parallel jobs. This is how the number of parallel jobs is calculated:
If we run our task on c5.9xlarge (36 CPUs and 72 GB RAM) with 1 CPU/job and 4 GB/job, the number of jobs run in parallel will be min{36/1, 72/4} = 18.
If we run our task on c5.9xlarge (36 CPUs and 72 GB RAM) with 1 CPU/job and 8 GB/job, the number of jobs run in parallel will be min{36/1, 72/8} = 9.
For details on the number of jobs we set for each tested case, please refer to the table below.
Window size and window step: The default values for these parameters are 50 kb and 20 kb (kilobases), respectively. Keep in mind that since the sliding window algorithm considers all bases inside the window, the window length and the number of windows directly affect the execution time and the price of the task; see the sketch after this list for a rough sense of how the step size drives the window count.
Maximum number of parallel instances: The default number of parallel instances is 8. The impact of changing this number is mainly reflected in execution time: tasks with more parallel instances will finish faster. This parameter can be set in Execution settings when drafting a task. However, each user has a limited total number of parallel instances, and reserving a large number of parallel instances per task decreases the number of different tasks that can run at the same time.
Spot instances: If a task is expected to finish within a few hours, it can be run on spot instances to reduce execution cost. However, losing a spot instance means rerunning the task on on-demand instances, which can lead to a higher cost than running the task on on-demand instances from the beginning. That is why spot instances are generally suitable only for short tasks.
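As a back-of-the-envelope illustration of how the window step drives the amount of work (the region length is a hypothetical example; window defaults from above):

```bash
# The number of windows across a region is roughly region_length / step
region_kb=250000   # hypothetical ~250 Mb chromosome
window_kb=50       # default window size
step_kb=20         # default window step
echo $(( region_kb / step_kb ))   # ~12,500 windows to test
```

Halving the step roughly doubles the number of windows, and with it the compute the task must perform.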
Benchmarking results:
In this guide, we have highlighted the main cloud cost and execution time drivers when running GENESIS analyses. Keep in mind that users may experience additional costs due to factors such as task failures or the need to rerun an analysis. When estimating cloud costs for your study, please include a cost buffer for these two factors as well.
To prevent task failures, we advise you to carefully read the app descriptions, and if you have any questions or doubts, contact our Support Team at support@sevenbridges.com. Also, using memoization can help reduce costs when rerunning a task after an initial failure.
Throughout this document, it is important to note that the figures in the tables above are intended to be informative as opposed to predictive. The actual costs incurred for a given analysis will also depend on the number of samples and number of variants in the input files. For our analysis described above, we selected 1000G and TOPMed Freeze5 data as inputs. For TOPMed Freeze5, we selected cohorts of 10k, 36k, and 50k subjects. The benchmarking results for the selected tasks would vary if the cohorts were defined differently.
The selection of instances is another factor that can lead to variation in results for a given analysis. The results depend heavily on the user’s skill in choosing an appropriate instance and using its resources optimally. For that reason, if two users run the same task with different configurations (different instance type, CPU/job, and/or RAM/job parameters), the results may vary.
The results (execution time and cost) are directly connected to the CPU per job and memory per job parameters. Different resources dedicated to a given job will result in a different number of total jobs run on the selected instance, and hence different execution times and costs. For that reason, setting up a task draft properly is crucial. In this document, we provide what we consider optimal CPU and memory per job inputs for TOPMed Freeze5 and 1000G data. These numbers are a good starting point, bearing in mind that each study has its own unique requirements.
For both Single and Sliding Window Association Testing:
Please note that results for single and sliding window tests are approximations. To avoid unnecessary cloud costs, we performed both single and sliding window tests only on 2 chromosomes. These results were the basis on which we assessed the cost and execution time for the whole genome.
The following is an explanation of the procedure we applied for the GENESIS Single Association testing workflow and TOPMed Freeze5 data (a similar procedure applies to the GENESIS Sliding window Association testing workflow):
In the GENESIS Single Association testing workflow, variants are tested in segments. The number of segments the workflow will process is the ratio of the total number of variants to the segment length (one of the input parameters of this workflow). For example: if we are testing a whole genome with 3,000,000,000 variants and use the default segment length of 10,000 kb, we will have 300 segments. Furthermore, if we use the default maximum number of parallel instances, which is 8, we can approximate the average number of segments each instance processes: 37.
The GENESIS Single Association testing workflow can process segments in parallel (processing one segment is a job). The number of parallel segments (jobs) depends on the CPU per job and memory per job parameters, and can be calculated as described previously. For example: if we are running the analysis on a c5.9xlarge instance (36 CPUs and 72 GB RAM) with 1 CPU/job and 4 GB/job, we will have 18 jobs in parallel. Knowing that each of our 8 instances processes approximately 37 jobs, 18 of them at a time, each instance will run approximately 2 cycles. Furthermore, knowing the average job length, we can approximate the running time of one instance: 2 cycles multiplied by the average job length. Since the instances run in parallel, this is also the total execution time. Lastly, once the execution time is known, we can calculate the task price: the number of instances multiplied by the execution time in hours, multiplied by the instance price per hour. For each tested scenario in our benchmarking analysis, we obtained the average job length from the corresponding tasks covering 2 chromosomes, such that the total number of jobs was above 30.
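The worked example above can be reproduced with a short sketch (all values are the example's assumptions, not measured figures):

```bash
genome_variants=3000000000   # whole-genome example from the text
segment_kb=10000             # default segment length (10,000 kb)
instances=8                  # default maximum number of parallel instances
parallel_jobs=18             # e.g. 1 CPU/job and 4 GB/job on c5.9xlarge

segments=$(( genome_variants / (segment_kb * 1000) ))   # 300 segments
per_instance=$(( segments / instances ))                # ~37 per instance
cycles=$(( per_instance / parallel_jobs ))              # ~2 cycles
echo "$segments segments, $per_instance per instance, ~$cycles cycles"
# total time  ~= cycles x average job length
# task price  ~= instances x execution hours x instance price per hour
```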
Just starting out on NHLBI BioData Catalyst Powered by Seven Bridges, and need to get up to speed on how to use the platform? Our experts have created a Getting Started Guide to help you jump right in. We recommend users begin learning how to use BioData Catalyst Powered by Seven Bridges by following the steps in this guide. After reading this guide, you will know how to create an account on NHLBI BioData Catalyst Powered by Seven Bridges, learn the basics of creating a workspace (project), run an analysis, and search through the hosted data.
To read our Getting Started Guide, please refer to our documentation page here.
Workspaces are the fundamental building blocks of Terra. You can think of them as modular digital laboratories that enable you to organize and access your data in a number of ways for analysis.
To learn about the basics of operating a Terra workspace, we recommend these resources:
Article: Working with workspaces
Video: Introduction to using workspaces in Terra
Read on in this section to get familiar with:
We have a number of articles on tracking and minimizing the costs of operating on Terra. There are multiple ways of estimating how much your analyses are costing you, including built-in tools and external resources. The articles below contain instructions and advice on managing your cloud resources in a variety of ways:
Article: Understanding and controlling cloud costs
Article: Best practices for managing shared team costs
Article: How much did a workflow cost?
Article: How to disable billing on a Terra project
Terra has a number of features to ensure the security of sensitive data accessed through the platform. Many of these features are in place automatically, while tools like authorization domains give you greater control over your data. These articles contain an overview of the security features enabled on Terra:
Article: Authorization Domain overview for BioData Catalyst users
Article: Managing data privacy and access with Authorization Domains
Article: Best Practices for accessing external resources
Article: Terra security posture
You can import data into your workspace by either linking directly to external files you have access to, or by interfacing with a number of platforms with which Terra has integrated access.
For BioData Catalyst researchers, one of the most relevant of these interfacing platforms is Gen3. However, this section also provides resources that teach how to import data from other public datasets integrated into Terra’s data library, as well as how to bring in your own data.
Read on in this section for more information on:
This forum is a great place to find and post questions about Docker files, workflow languages, Dockstore features, and workflow learning resources. The user base includes CWL, WDL, Nextflow, and Galaxy workflow authors and users.
Our mission is to catalyze open, reproducible research in the cloud
We hope Dockstore provides a reference implementation for tool sharing in the sciences. Dockstore is essentially a living and evolving proof of concept designed as a starting point for two activities that we hope will result in community standards within the GA4GH:
a best practices guide for describing tools in Docker containers with CWL/WDL/Nextflow
a minimal web service standard for registering, searching and describing CWL/WDL-annotated Docker containers that can be federated and indexed by multiple websites
We plan on expanding Dockstore in several ways over the coming months. Please see our issues page for details and discussions.
To help Dockstore grow, we encourage users to publish their tools and workflows on Dockstore so that they can be used by the greater scientific community. Here is how to get started:
Register your tool or workflow on Dockstore
Create an Organization, invite your collaborators, and promote your work in collections
BioData Catalyst Powered by Gen3 provides data for many projects and conveniently supports search across the vast set of subjects to identify the best available cohorts for research analysis. Searches are based on harmonized phenotypic variables and may be performed both within and across projects.
When a desired cohort has been identified in Gen3, the cohort may be conveniently "handed-off" to Terra for analysis. Optionally, this dataset may be enhanced with additional metadata from dbGaP, or extended to include additional researcher-provided subject data.
Here we provide essential information for all researchers using BioData Catalyst data from Gen3, including how to access and select Gen3 subject data and hand it off to Terra, as well as a description of the GA4GH Data Repository Service (DRS) protocol and data identifiers used by Gen3 and Terra.
The resources below contain the information you’ll need to access your desired data:
Video: Data Analysis with Gen3, Terra and Dockstore
Article: Discovering Data Using Gen3
Article: Understanding and using Gen3 data in Terra
Article: Data Access with the GA4GH Data Repository Service (DRS)
Article: Linking Terra to External Servers
Article: Understanding and setting up a proxy group
Workspace: BioDataCatalyst Gen3 data on Terra tutorial
Workspace: TOPMed Aligner workspace
Terra’s Dataset Library includes a number of integrated datasets, many of which have individualized Data Explorer interfaces, useful for generating and exporting custom cohorts. If you click into a dataset and have the proper permissions, you'll be able to explore the data. If you don't have the necessary permission, you'll be taken to a page that tells you whom to contact for access.
The resources linked below provide guided instructions for creating custom cohorts from the data library, importing them to your workspace, and using a Jupyter notebook to interact with the data:
Article: Accessing and analysing custom cohorts with Data Explorer
Video: Notebooks Quickstart walkthrough
Workspace: Notebooks Quickstart workspace
Terra supports two types of analysis: batch processing with workflows, and interactive analysis with Jupyter Notebooks. This section will orient you with resources that teach you how to do both.
As an introduction, we recommend reading our article on the kinds of analysis you can do in Terra.