Terra provides powerful support for performing Genome-Wide Association Studies (GWAS). The following featured and template workspaces include Jupyter notebooks for phenotypic and genomic data preparation (using Hail) and workflows (using GENESIS) to perform single or aggregate variant association tests using mixed models. We will continue to provide more resources for performing more complex GWAS scenarios in BioData Catalyst.
A Jupyter Notebook in each of the following workspaces uses Hail to generate Genetic Relatedness Matrices (GRMs) for input into the GWAS workflows. Users with access to kinship matrices from the TOPMed consortium may wish to skip these steps and instead import kinship files using the bring-your-own-data instructions.
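To make the idea of a relatedness matrix concrete, here is a minimal pure-Python sketch of one common way to compute it from alternate-allele counts. It is an illustration only, not the Hail code used in the workspace notebooks: the function name, the toy genotypes, and the choice to skip monomorphic variants are all assumptions made for this example.

```python
from math import sqrt


def genetic_relatedness_matrix(genotypes):
    """Return an n x n relatedness matrix from alt-allele counts.

    genotypes: one list per sample, each holding counts (0, 1, or 2) per variant.
    Hypothetical helper for illustration; not the notebooks' Hail implementation.
    """
    n_samples = len(genotypes)
    n_variants = len(genotypes[0])
    standardized = [[] for _ in range(n_samples)]

    for j in range(n_variants):
        column = [row[j] for row in genotypes]
        p = sum(column) / (2 * n_samples)  # alternate allele frequency
        if p == 0.0 or p == 1.0:
            continue  # monomorphic variants carry no information; skip them
        scale = sqrt(2 * p * (1 - p))
        for i in range(n_samples):
            # Center by the expected count 2p and scale by the binomial SD.
            standardized[i].append((column[i] - 2 * p) / scale)

    n_used = len(standardized[0])
    if n_used == 0:
        raise ValueError("no polymorphic variants to compute relatedness from")

    # Entry (i, k) is the average product of standardized genotypes.
    return [
        [
            sum(a * b for a, b in zip(standardized[i], standardized[k])) / n_used
            for k in range(n_samples)
        ]
        for i in range(n_samples)
    ]


# Toy data: 4 samples x 4 variants (the last variant is monomorphic).
toy_genotypes = [
    [0, 1, 2, 0],
    [1, 1, 0, 0],
    [2, 0, 1, 0],
    [1, 2, 1, 0],
]
grm = genetic_relatedness_matrix(toy_genotypes)
```

The result is symmetric, and because each variant is centered, every row sums to approximately zero. At real cohort sizes this computation is done with a distributed tool such as Hail rather than nested Python lists.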
The BioData Catalyst GWAS tutorial workspace was created to walk users through a GWAS with training data that includes synthetic phenotypic data (modeled after traits available in TOPMed) paired with 1000 Genomes open-access data. This tutorial aims to familiarize users with the Gen3 data model so that they can become empowered to use this data model with any existing tutorials available in the Terra library’s showcase section.
This template is an example workspace that asks researchers to export TOPMed projects (to which they have access) into an example template for conducting a common-variant, mixed-models GWAS of a blood pressure trait. Our goal was to include settings and suggestions to help users interact with data exactly as they receive it through BioData Catalyst. Accommodating other datasets may require modifying many parts of this notebook. The notebook is inherently an interactive analysis in which decisions are made as you go, so we do not recommend applying it to another dataset without careful thought.
Cost Examples
Below are reported costs from using 1,000 and 10,000 samples to conduct a GWAS using the BioData Catalyst GWAS Blood Pressure Trait template workspace. The costs were derived from single-variant tests that used Freeze 5b VCF files filtered to keep common variants (that is, removing variants with MAF < 0.05) for input into workflows. How these steps scale depends on the number of variants, the number of individuals, and the parameters chosen. TOPMed Freeze 5b VCF files contain 582 million variants, and Freeze 8 increases this to ~1.2 billion. For GWAS analyses with Freeze 8 data, computational resources and costs are expected to be significantly higher.
These costs were derived from running these analyses in Terra in June 2020.
| Analysis Step | Cost (n=1,000; Freeze 5b) | Cost (n=10,000; Freeze 5b) |
| --- | --- | --- |
| Jupyter Notebook | $29.34 ($19.56/hr for 1.5 hours) | $336 ($56/hr for 6 hours) |
| vcfTogds workflow | $1.01 | $5.01 |
| genesis_GWAS workflow | $0.94 | $6.67 |
| TOTAL | $32.29 | $347.68 |
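The arithmetic behind the figures above can be checked with a short sketch. It uses only the reported numbers and assumes that the hourly-rate line is the notebook's cloud environment, billed as rate times hours. The dictionary keys and the function name are illustrative. As listed, the three n=1,000 line items sum to $31.29, whereas the reported total is $32.29; the sketch simply prints the computed sums.

```python
from decimal import Decimal

CENT = Decimal("0.01")


def hourly_cost(rate_per_hour, hours):
    """Cost of a VM billed by the hour, rounded to cents."""
    return (Decimal(rate_per_hour) * Decimal(hours)).quantize(CENT)


# Reported figures; "notebook" is assumed to be the hourly-rate line.
reported = {
    "n=1,000": {
        "notebook": hourly_cost("19.56", "1.5"),
        "vcfTogds": Decimal("1.01"),
        "genesis_GWAS": Decimal("0.94"),
    },
    "n=10,000": {
        "notebook": hourly_cost("56", "6"),
        "vcfTogds": Decimal("5.01"),
        "genesis_GWAS": Decimal("6.67"),
    },
}

# Sum the line items for each sample size.
totals = {size: sum(items.values(), Decimal("0")) for size, items in reported.items()}

for size, total in totals.items():
    print(f"{size}: notebook ${reported[size]['notebook']}, computed total ${total}")
```

In both scenarios the interactive notebook time dominates the total, so the hourly rate of the chosen cloud environment and the time spent in it are the main levers on cost.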
The interactive analysis features of Terra support interactive data exploration, including the use of statistical methods and graphical display. Versatile and powerful interactive analysis is provided through Jupyter Notebooks in both Python and R languages.
Jupyter Notebooks run on a virtual machine (VM). You can customize your VM’s installed software by selecting one of Terra's preinstalled notebook cloud environments or by specifying a Docker container to use as a custom environment. Docker containers help ensure that you and your colleagues analyze data with the same software, which makes your results easier to reproduce.
- Article: Interactive statistics and visualization with Jupyter notebooks
- Article: Customizing your interactive analysis application compute
- Article: Terra's Jupyter Notebooks environment Part I: Key components
- Article: Terra's Jupyter Notebooks environment Part II: Key operations
- Article: Terra's Jupyter Notebooks environment Part III: Best Practices
- Video: Notebooks overview
- Video: Notebooks Quickstart walkthrough
- Workspace: Notebooks Quickstart workspace
- Workspace: BioData Catalyst notebooks collection
- Workspace: PIC-SURE Tutorial in R
- Workspace: PIC-SURE Tutorial in Python
Terra supports two types of analysis: batch processing with workflows and interactive analysis with Jupyter Notebooks. This section points you to resources that teach you how to do both.
As an introduction, we recommend reading our article on the kinds of analysis you can do in Terra.
The batch workflow features of Terra provide support for computationally-intensive, long-running, and large-scale analysis.
You can run whole pipelines, from preprocessing and trimming sequencing data to alignment and downstream analyses, using Terra workflows. Workflows are written in the human-readable Workflow Description Language (WDL), and you can search for and import them into your workspace from Dockstore or the Broad Methods Repository.
- Video: Data Analysis with Gen3, Terra and Dockstore
- Article: How to import data from Gen3 into Terra and run the TOPMed aligner workflow
- Article: Configure a workflow to process your data
- Article: Getting workflows up and running faster with a JSON file
- Article: Importing a Dockstore workflow into Terra
- Video: Importing a Dockstore workflow into Terra walkthrough
- Video: Workflows Quickstart walkthrough
- Workspace: Workflows Quickstart workspace
- Workspace for BioData Catalyst: TOPMed Aligner workspace
- Workspace for BioData Catalyst: GWAS with 1000 Genomes and synthetic clinical data
- Workspace for BioData Catalyst: GWAS with TOPMed data