Genome Wide Association Study with TOPMed Data Tutorial
Powered by Dockstore, Terra, and Gen3
This page is an open access preview of the Terra tutorial workspace that you can find at this link.
GWAS Tutorial Using Blood Pressure Traits
This template workspace was created to offer example tools for conducting a single variant, mixed-models GWAS focusing on a blood pressure trait from start to finish using the NHLBI BioData Catalyst ecosystem. We have created a set of documents to get you started in the BioData Catalyst system. If you're ready to conduct an analysis, proceed with this dashboard:
Data Model
This template was set up to work with the NHLBI BioData Catalyst Gen3 data model. In this dashboard, you will learn how to import data from the Gen3 platform into this Terra template and conduct an association test using this particular data model.
TOPMed Data
Currently, BioData Catalyst's Gen3 hosts the TOPMed program, which is controlled access. If you do not already have access to a TOPMed project through dbGaP, this template workspace may not yet be helpful to you. To apply for access to TOPMed, submit an application within dbGaP.
If you already have access to a TOPMed project and have been onboarded to the BioData Catalyst platform, you should be able to access your data through BioData Catalyst powered by Gen3 and use your data with this template workspace. We focused this template on analyzing a blood pressure trait, but not all TOPMed projects contain blood pressure data. You will need to carefully consider how to update this analysis for the dataset you bring and how this may affect the scientific accuracy of the question you are asking.
A note about TOPMed metadata
Some types of metadata will always be present: GUID, Case ID, Project Name, Number of Samples, Study, Gender, Age at Index, Race, Ethnicity, Number of Aliquots, SNP Array Files, Unaligned Read Files, Aligned Read Files, Germline Variation Files.
Other metadata depend on the analysis plan submitted when applying for TOPMed access. Examples include BMI, Years Smoked, years smoked greater than 89, hypertension, hypertension medications, diastolic blood pressure, systolic blood pressure, etc.
The TOPMed Data Coordinating Center (DCC) is currently harmonizing select phenotypes across TOPMed, which will also be deposited into the TOPMed accessions. The progress of phenotype metadata harmonization can be tracked here. The data in the Gen3 graph model in this tutorial are harmonized phenotypes. You can find the unharmonized phenotypic and environmental data in the "Reference File" node of the Gen3 graph. Documentation about how to interact with unharmonized data in Terra is coming soon.
Outline of this template:
Part 1: Navigate the BioData Catalyst environment. Learn how to search and export data from Gen3 and workflows from Dockstore into a Terra workspace. These cloud-based platforms interoperate with one another for fast and secure research. You can clone this template and walk through it as suggested, or you can use the basics you learn here to perform your own analysis.
Part 2: Explore TOPMed data in an interactive Jupyter notebook. In this Terra workspace, you can find a series of interactive notebooks to explore TOPMed data.
First, you will use the notebook 1-unarchive-vcf-tar-file-to-workspace to extract the contents of tar bundles to your workspace for use in the GWAS. These tar bundles were generated by dbGaP and contain TOPMed multi-sample VCFs per consent code and chromosome.
Next, the 2-GWAS-preliminary-analysis notebook will lead you through a series of steps to explore the phenotypic and genotypic data. It will call the functions in the companion terra_data_table_util notebook to consolidate your clinical data from Gen3 into a single data table that can be imported into the Jupyter notebook. Then you will examine phenotypic distributions and genetic relatedness using the Hail genomic data analysis tool.
Part 3: Perform mixed-model association tests using workflows. Next, perform mixed-model genetic association tests, run as a series of batch workflows on GCP Compute Engine. For details on the two workflows and what they do, scroll down to "Part 3: Perform mixed-model association tests using workflows". The workflows are publicly available in Dockstore in this collection.
Mixed models require two steps within the GENESIS package in R: 1) Fitting a null model assuming that each genetic variant has no effect on phenotype and 2) Testing each genetic variant for association with the outcome, using the fitted null model.
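The two steps can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not the GENESIS implementation: the null model here is an intercept-only linear model with no covariates and no random effect for relatedness, and the function names `fit_null_model` and `score_test` are hypothetical.

```python
import math


def fit_null_model(phenotype):
    """Step 1: fit the null model (here, only the phenotype mean)."""
    n = len(phenotype)
    mean = sum(phenotype) / n
    residuals = [y - mean for y in phenotype]
    # Residual variance under the null; n - 1 because one parameter was fit.
    sigma2 = sum(r * r for r in residuals) / (n - 1)
    return {"residuals": residuals, "sigma2": sigma2}


def score_test(null_model, dosages):
    """Step 2: score test of one variant, reusing the fitted null model."""
    mean_g = sum(dosages) / len(dosages)
    centered = [g - mean_g for g in dosages]
    score = sum(g * r for g, r in zip(centered, null_model["residuals"]))
    variance = null_model["sigma2"] * sum(g * g for g in centered)
    if variance == 0:
        # Monomorphic variant or constant phenotype: no evidence either way.
        return 0.0, 1.0
    stat = score * score / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Toy data: four individuals, two variants (alternate-allele dosages).
null_model = fit_null_model([118.0, 131.0, 125.0, 142.0])
for name, dosages in {"variant_a": [0, 1, 1, 2], "variant_b": [1, 1, 1, 1]}.items():
    stat, p_value = score_test(null_model, dosages)
    print(name, round(stat, 3), round(p_value, 3))
```

The point of the split is that the null model is fit once and then reused for every variant, which is what makes testing millions of variants tractable.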
Part 1: Navigate the NHLBI BioData Catalyst multi-platform ecosystem
1a) Link your Terra account to external services
Before you're able to access genomic data from Gen3 in the Terra data table, you need to link your Terra account to external services. Link your profile by following these instructions.
1b) Create an Authorization Domain to protect your controlled-access data
Because this workspace was created to be used with controlled access data, it should be registered under an Authorization Domain that limits its access to only researchers with the appropriate approvals. Learn how to set up an Authorization Domain here before proceeding.
1c) Export a TOPMed project with blood pressure data from Gen3
Start by learning about Gen3's graph-structured data model for NHLBI's BioData Catalyst using this orientation document.
Once you better understand the graph, log into Gen3 through the NIH portal using your eRA Commons username and password.
Navigate to the Gen3 Explorer view to see what datasets you currently have and do not have access to. On the left-hand side, you can use the faceted search tool to narrow your results to specific projects.
First, under "Files" and "Access", select "Data with Access" to filter through projects that you currently have access to.
Next, under "Filters", you can select phenotypic or environmental data to narrow your results. Here, select the "Diagnosis" tab and, under "BP Diastolic", move the left-hand end of the slider from 0 to 35. This sets your search range to 35 - 163, so only the TOPMed projects that contain diastolic blood pressure data are shown.
In all of TOPMed, there are 23 studies with diastolic blood pressure data. You may see anywhere from 0 to 23, depending on what projects you have applied for and received access to.
Next, click on the "Subject" tab. If you have access to a TOPMed project with blood pressure data, this will list all of the project names. Select only a single project to use in this template.
Once selected, click the button "Export all to Terra", wait until the Terra window appears, and add your data to your copy of this template workspace.
1d) Select a workspace for your data
Once the new Terra window appears, you are given a few options for where to place your data.
1) "Start with a template": This feature allows you to import data directly into a template workspace that has everything set up for you to do an analysis but does not contain any data. Once you select a workspace, you will need to enter:
Workspace name: Enter a name that is meaningful for your records.
Billing Project: Select one of the billing projects available to you.
Authorization Domain: Assign the authorization domain that you generated above to protect your data. This is important for working with controlled-access data.
2) "Start with an existing workspace": If you have already created a workspace, you can import your data directly to this workspace.
3) "Start a new workspace": This will create an empty workspace. You can individually copy notebooks and workflows from other workspaces, import workflows from Dockstore, or start fresh.
Part 2: Explore TOPMed data in Jupyter Notebooks
2a) Extract multi-sample VCFs to your workspace
Gen3 uploaded tar-compressed bundles, as provided by dbGaP, into cloud buckets owned by BioData Catalyst. To make these tar files ready for use in analyses, you will need to unarchive the tar bundles to your workspace.
First, open the 1-unarchive-vcf-tar-file-to-workspace notebook and follow the steps to select which tar bundle(s) to extract to your workspace for use in the GWAS. Please understand that this step may be time consuming since TOPMed multi-sample VCF files are several hundred gigabytes in size.
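As a rough picture of what unarchiving involves, the sketch below pulls only the VCF-like members out of a tar bundle using Python's standard `tarfile` module. It is not the notebook's code: the function name `extract_tar_bundle`, the file suffixes, and the local paths are assumptions for illustration, and the real bundles live in cloud buckets rather than on local disk.

```python
import shutil
import tarfile
from pathlib import Path

# Assumed suffixes for genotype files inside a bundle (hypothetical).
VCF_SUFFIXES = (".vcf", ".vcf.gz", ".bcf")


def extract_tar_bundle(tar_path, dest_dir):
    """Extract VCF-like members of a tar bundle and return their names."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz).
    with tarfile.open(tar_path, "r:*") as bundle:
        for member in bundle.getmembers():
            if not member.isfile() or not member.name.endswith(VCF_SUFFIXES):
                continue
            target = (dest / member.name).resolve()
            # Refuse members that would land outside the destination.
            if dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
            target.parent.mkdir(parents=True, exist_ok=True)
            source = bundle.extractfile(member)
            with source, open(target, "wb") as out:
                # Stream the member so large VCFs are never held in memory.
                shutil.copyfileobj(source, out)
            extracted.append(member.name)
    return extracted
```

A call such as `extract_tar_bundle("bundle.tar.gz", "vcfs/")` would return the names of the members it wrote. Streaming matters here because the members can be far larger than available memory.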
2b) Prepare your phenotypic and genotypic data for input into association test workflows
Now that you can interact with the Gen3 structured data more easily, you will use an interactive notebook to explore your phenotypic and environmental data and perform several analyses that prepare the data for use in batch association workflows.
Learn how to customize your interactive analysis compute to work with the data you imported.
Open the 2-GWAS-preliminary-analysis notebook and set your runtime configuration. We have given a suggested configuration within the notebook.
From within this 2-GWAS-preliminary-analysis notebook you can call functions from the companion terra_data_table_util notebook to reformat multiple data tables into a single data table that can be loaded as a dataframe in the notebook.
Subset the dataframe to include only your traits of interest and remove any individuals that lack data for these traits.
Visualize phenotype and environmental variable distributions in a series of plots.
Import the multi-sample VCF from the "Reference File" data table using DRS. You can learn more about GA4GH's Data Repository Service here.
Filter your VCF to only common variants to increase statistical power. Genetic analyses in this notebook utilize the Hail software. Hail is a framework for distributed computing with a focus on genetics. Particularly relevant for whole genome sequence (WGS) analysis, Hail allows for efficient, nearly boundless computing (in terms of variant and sample size).
Perform a principal component analysis (PCA) to assess population stratification. Genetic stratification can strongly affect association tests and should be accounted for.
Generate a genetic relatedness matrix (GRM) to account for closely related individuals in your association testing workflows.
Generate a new "sample_set" data table that holds the derived files we created in the steps above using the FireCloud Service Selector (FISS) package. The files in this data table will be used in the workflows we run in Part 3.
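To make two of the steps above concrete, here is a small pure-Python sketch of a minor allele frequency (MAF) filter and a genetic relatedness matrix. The notebook performs these steps with Hail on distributed data; the code below does not use the Hail API. The function names, the 0.05 default cutoff, and the GRM estimator (a commonly used standardized-genotype form) are assumptions for illustration only.

```python
def minor_allele_frequency(dosages):
    """MAF from alternate-allele dosages (0, 1, 2); None marks a missing call."""
    called = [d for d in dosages if d is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def filter_common(variants, min_maf=0.05):
    """Keep only variants whose MAF is at least min_maf (assumed cutoff)."""
    return {
        name: dosages
        for name, dosages in variants.items()
        if minor_allele_frequency(dosages) >= min_maf
    }


def genetic_relatedness_matrix(variants):
    """Average over variants of standardized genotype cross-products."""
    n_samples = len(next(iter(variants.values())))
    grm = [[0.0] * n_samples for _ in range(n_samples)]
    used = 0
    for dosages in variants.values():
        called = [d for d in dosages if d is not None]
        if not called:
            continue
        p = sum(called) / (2 * len(called))
        scale = 2 * p * (1 - p)
        if scale == 0:
            continue  # monomorphic variants carry no relatedness signal
        # Missing calls are set to the mean, so they contribute zero.
        centered = [0.0 if d is None else d - 2 * p for d in dosages]
        for i in range(n_samples):
            for j in range(n_samples):
                grm[i][j] += centered[i] * centered[j] / scale
        used += 1
    if used == 0:
        raise ValueError("no polymorphic variants to build a GRM from")
    return [[value / used for value in row] for row in grm]


# Toy data: rows are variants, columns are four samples.
variants = {
    "v1": [0, 1, 2, 1],
    "v2": [0, 0, 0, 1],
    "v3": [0, 0, 0, 0],  # monomorphic, dropped by the MAF filter
}
common = filter_common(variants)
print(sorted(common))
print(genetic_relatedness_matrix(common)[0])
```

The quadratic loop over samples is only workable for toy data, which is one reason the notebook relies on a Spark cluster for these steps.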
Time and cost estimate
You can adjust the runtime configuration to fit your computational needs in the Jupyter notebook. We recommend selecting the default environment and the custom profile so that you can configure a Spark cluster for parallel processing. With the profile suggested within the Jupyter notebook and a project of around 1,000 samples, running this notebook takes about 90 minutes at roughly $20/hr.
When working in a notebook with computing times over 30 minutes, learn more about Terra's auto-pause feature and how to adjust auto-pause for your needs. Please carefully consider how adjusting auto-pause can remove protections that keep you from accidentally accumulating unnecessary cloud costs.
Part 3: Perform mixed-model association tests using workflows
In Part 2, we explored the data we imported from Gen3 and performed a few important steps for preparing our data for association testing. We generated a new "sample_set" data table that holds the files we created in the interactive notebook. These files will be used in our batch workflows that will perform the association tests. Below, we describe the two workflows in this workspace and their cost estimates for running on the sample set we created in this tutorial.
The workflows used in this template were imported from Dockstore and their parameters were configured to work with Terra's data model. If you're interested in searching for other Docker-based workflows, learn more about how they can easily be launched in Terra.
Notes on how attributes are set in workflows
We have set the input and output attributes for each workflow in this template. Before running the first workflow, you can look through the inputs and outputs of each workflow and see that outputs from the first workflow feed into the second workflow, and so on.
In the 2-GWAS-preliminary-analysis notebook, we created a Sample Set data table that holds a row called "systolicbp" which contains the input files for the following workflows. You can check this data table out in the Data tab of this workspace. When you open a workflow, make sure that "Sample Set" is set and the "systolicbp" (or whatever you named your run) is selected before running a workflow.
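One way to picture that table is as a single tab-separated row keyed by the set name. The sketch below only builds such a row in memory, assuming Terra's load-file convention of an `entity:sample_set_id` first column. The attribute names and `gs://` paths are hypothetical placeholders, and the notebook itself writes the table through FISS rather than through this helper.

```python
import csv
import io


def sample_set_load_file(set_name, attributes):
    """Build tab-separated text with one sample_set row and its attributes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Header: the entity ID column first, then one column per attribute.
    writer.writerow(["entity:sample_set_id", *attributes])
    writer.writerow([set_name, *attributes.values()])
    return buffer.getvalue()


# Hypothetical attribute names and bucket paths, for illustration only.
print(sample_set_load_file(
    "systolicbp",
    {
        "vcf_files": "gs://your-bucket/filtered.vcf.bgz",
        "grm": "gs://your-bucket/grm.csv",
        "phenotypes": "gs://your-bucket/phenotypes.csv",
    },
))
```

Each workflow input then points at one of these columns, which is why the correct row must be selected before launching a run.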
1-vcfToGds
This workflow converts genotype files from Variant Call Format (VCF) to Genomic Data Structure (GDS), the input format required by the R package GENESIS.
Time and cost estimates
| Sample Set Name | Sample Size | # Variants | Time | Cost |
| --- | --- | --- | --- | --- |
| systolicbp | 1,052 samples | 6,429,788 | 15 min | $1.01 |
Inputs:
VCF genotype file (or chunks of VCF files)
Outputs:
GDS genotype file
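The workflow itself performs this conversion in R. Purely as an illustration of the information that moves from a VCF record into an array-oriented genotype store, the sketch below parses one VCF data line into alternate-allele dosages. The function name `genotype_dosages` is hypothetical, and the parser assumes a biallelic site with a `GT` entry in the FORMAT column.

```python
def genotype_dosages(vcf_line):
    """Return (site, dosages) for one tab-separated VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    # FORMAT (column 9) names the colon-separated keys of every sample column.
    gt_index = fields[8].split(":").index("GT")
    dosages = []
    for sample in fields[9:]:
        genotype = sample.split(":")[gt_index]
        # Phased (|) and unphased (/) calls give the same dosage.
        alleles = genotype.replace("|", "/").split("/")
        if "." in alleles:
            dosages.append(None)  # missing call
        else:
            dosages.append(sum(1 for allele in alleles if allele != "0"))
    return (chrom, int(pos), ref, alt), dosages


line = "1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:10\t0|1:12\t1/1:9\t./.:0"
print(genotype_dosages(line))
```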
2-genesis_GWAS
This workflow creates a null model from phenotype data with the GENESIS biostatistical package. This null model can then be used for association testing. The workflow also runs single-variant and aggregate tests on the genetic data. It implements single-variant, Burden, SKAT, SKAT-O, and SMMAT tests for continuous or dichotomous outcomes. All tests account for familial relatedness through kinship matrices. Underlying functions adapted from: Conomos MP and Thornton T (2016). GENESIS: GENetic EStimation and Inference in Structured samples (GENESIS): Statistical methods for analyzing genetic data from samples with population structure and/or relatedness. R package version 2.3.4.
Time and cost estimates
| Sample Set Name | Sample Size | Time | Cost |
| --- | --- | --- | --- |
| systolicbp | 1,052 samples | 26 min | $0.94 |
Inputs:
GDS genotype file
Genetic Relatedness Matrix
Trait outcome name
Trait outcome type
CSV file of covariate traits
Sample ID list
Outputs:
A null model as an RData file
Compressed CSV file(s) containing raw results
CSV file containing all associations
CSV file containing top associations
PNG file of Quantile-Quantile and Manhattan plots
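To show how the last few outputs relate to the raw results, here is a minimal sketch that selects top associations and computes the points of a quantile-quantile plot. The row layout (`variant`, `pvalue`), the function names, and the 5e-8 cutoff (a conventional genome-wide significance level) are assumptions. The workflow's own column names and threshold may differ.

```python
import math


def top_associations(results, p_threshold=5e-8):
    """Rows below the p-value threshold, most significant first."""
    hits = [row for row in results if row["pvalue"] < p_threshold]
    return sorted(hits, key=lambda row: row["pvalue"])


def qq_points(pvalues):
    """(expected, observed) -log10 p-value pairs; assumes every p-value > 0."""
    observed = sorted(pvalues)
    n = len(observed)
    return [
        # Under the null, the rank-th smallest p-value is about (rank - 0.5) / n.
        (-math.log10((rank - 0.5) / n), -math.log10(p))
        for rank, p in enumerate(observed, start=1)
    ]


# Made-up results, for illustration only.
results = [
    {"variant": "chr1:1000_A_G", "pvalue": 0.42},
    {"variant": "chr2:2000_C_T", "pvalue": 3e-9},
    {"variant": "chr3:3000_G_A", "pvalue": 1e-12},
]
print([row["variant"] for row in top_associations(results)])
print(qq_points([row["pvalue"] for row in results])[0])
```

Points that rise above the diagonal across the whole range, rather than only in the tail, usually suggest unaccounted-for stratification or relatedness.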
Cost Examples
Below are reported costs for conducting a GWAS with 1,000 and 10,000 samples using the BioData Catalyst GWAS Blood Pressure Trait template workspace. The costs were derived from single-variant tests on Freeze 5b VCF files that were filtered to common variants (variants with MAF < 0.05 were removed) before input into the workflows. How these steps scale will vary with the number of variants, the number of individuals, and the parameters chosen. TOPMed Freeze 5b VCF files contain 582 million variants, and Freeze 8 increases this to ~1.2 billion. For GWAS analyses with Freeze 8 data, computational resources and costs are expected to be significantly higher.
| Analysis Step | Cost (n=1,000; Freeze 5b) | Cost (n=10,000; Freeze 5b) |
| --- | --- | --- |
| GWAS Preliminary Analysis Notebook | $29.34 ($19.56/hr for 1.5 hours) | $336 ($56/hr for 6 hours) |
| vcfToGds workflow | $1.01 | $5.01 |
| genesis_GWAS workflow | $0.94 | $6.67 |
| TOTAL | $31.29 | $347.68 |
These costs were derived from running these analyses in Terra in June 2020.
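If you want to budget a run of your own, the arithmetic behind the table is an hourly rate times hours for the notebook, plus the per-workflow costs. The sketch below redoes that sum with the figures reported above. The function names are hypothetical, and `Decimal` is used only to avoid floating-point rounding noise in currency values.

```python
from decimal import Decimal


def notebook_cost(rate_per_hour, hours):
    """Interactive notebook cost: hourly cluster rate times hours used."""
    return Decimal(rate_per_hour) * Decimal(hours)


def total_cost(step_costs):
    """Sum of the per-step costs for one analysis."""
    return sum(step_costs, Decimal("0"))


# Figures taken from the cost table (notebook, vcfToGds, genesis_GWAS).
cost_1k = [notebook_cost("19.56", "1.5"), Decimal("1.01"), Decimal("0.94")]
cost_10k = [notebook_cost("56", "6"), Decimal("5.01"), Decimal("6.67")]

for label, costs in (("n=1,000", cost_1k), ("n=10,000", cost_10k)):
    print(label, total_cost(costs).quantize(Decimal("0.01")))
```

In both scenarios the notebook accounts for most of the total, so the runtime configuration and auto-pause settings are the main levers on cost.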
Optional: Bring your own data
Both the notebook and workflow can be adapted to other genetic datasets. The steps for adapting these tools to another dataset are outlined below:
Update the data tables: Learn more about uploading data to Terra here. You can use functions available from the terra_data_table_util companion notebook to consolidate new data tables you generate.
Update the notebook: Accommodating other datasets may require modifying many parts of this notebook. Inherently, the notebook is an interactive analysis where decisions are made as you go. It is not recommended that the notebook be applied to another dataset without careful thought.
Run an additional workflow: You can search Dockstore for available workflows and export them to Terra following this method.
Helpful resources to master this tutorial
If you are new to BioData Catalyst powered by Terra, we have created an onboarding syllabus that includes several introductory webinars.
Authors, contact information, and funding
This template was created for the NHLBI's BioData Catalyst project in collaboration with the Computational Genomics Platform at UCSC Genomics Institute and the Data Sciences Platform at The Broad Institute. The association analysis tools were contributed by the Manning Lab.
Contributing authors include:
Beth Sheets (UC Santa Cruz Genomics Institute)
Michael Baumann (Broad Institute, Data Sciences Platform)
Brian Hannafious (UC Santa Cruz Genomics Institute)
Tim Majarian (Manning Lab)
Alisa Manning (Manning Lab)
Ash O'Farrell (UC Santa Cruz Genomics Institute)
Workspace change log
| Date | Change | Author |
| --- | --- | --- |
| December 9, 2020 | Update notebooks, workflows, and workspace markdown | Ash |
| June 26, 2020 | terra_data_table_util updates | Beth |
| Feb 26, 2020 | Added notebook to copy/extract VCF | Beth |
| Jan 31, 2020 | Replaced text with new Broad documentation | Beth |
| Jan 30, 2020 | Template updates | Beth |
| Jan 3, 2020 | Updates from BDC F2F | Beth |
| December 3, 2019 | Gen3 updates | Beth |
| November 22, 2019 | Updates from Alisa | Beth |
| October 22, 2019 | User experience edits from Beri | Beth |