arrow-left

All pages
gitbookPowered by GitBook
1 of 1

Loading...

Genome Wide Association Study with TOPMed Data Tutorial

Powered by Dockstore, Terra, and Gen3

This page is an open access preview of the Terra tutorial workspace that you can find at this linkarrow-up-right.

hashtag
GWAS Tutorial Using Blood Pressure Traits

This template workspace was created to offer example tools for conducting a single variant, mixed-models GWAS focusing on a blood pressure trait from start to finish using the NHLBI BioData Catalystarrow-up-right ecosystem. We have created a set of documents to get you started in the BioData Catalyst systemarrow-up-right. If you're ready to conduct an analysis, proceed with this dashboard:

hashtag
Data Model

This template was set up to work with the NHLBI BioData Catalyst Gen3 data model. In this dashboard, you will learn how to import data from the Gen3 platform into this Terra template and conduct an association test using this particular data model.

hashtag
TOPMed Data

Currently, BDC-Gen3 hosts the program, which is controlled access. If you do not already have access to a TOPMed project through dbGAP, this template workspace may not yet be helpful to you. To apply for access to TOPMed, submit an application within .

If you already have access to a TOPMEd project and have been onboarded to the BDC platform, you should be able to access your data through BDC-Gen3 and use your data with this template workspace. We focused this template on analyzing a blood pressure trait, but not all TOPMed projects may contain blood pressure data. You will need to carefully consider how to update this analysis for the dataset you bring and how this may affect the scientific accuracy of the question you are asking.

hashtag
A note about TOPMed metadata

Some types of metadata will always be present: GUID, Case ID, Project Name, Number of Samples, Study, Gender, Age at Index, Race, Ethnicity, Number of Aliquots, SNP Array Files, Unaligned Read Files, Aligned Read Files, Germline Variation Files.

Other metadata depend on the analysis plan submitted when applying for TOPMed access. Examples include BMI, Years Smoked, years smoked greater than 89, hypertension, hypertension medications, diastolic blood pressure, systolic blood pressure, etc.

The TOPMed Data Coordinating Center (DCC) is currently harmonizing select phenotypes across TOPMed, which will also be deposited into the TOPMed accessions. The progress of phenotype metadata harmonization can . The data in the Gen3 graph model in this tutorial are harmonized phenotypes. You can find the unharmonized phenotypic and environmental data in the "Reference File" node of the Gen3 graph. Documentation about how to interact with unharmonized data in Terra is coming soon.

hashtag
Outline of this template:

Part 1: Navigate the BDC environment Learn how to search and export data from Gen3 and workflows from Dockstore into a Terra workspace. Each cloud-based platform interoperates with one another for fast and secure research. The template we have created here can be cloned for you to walk through as suggested, or you can use the basics you learn here to perform your own analysis.

Part 2: Explore TOPMed data in an interactive Jupyter notebook In this Terra workspace, you can find a series of interactive notebooks to explore TOPMed data.

First, you will use the notebook 1-unarchive-vcf-tar-file-to-workspace to extract the contents of tar bundles to your workspace for use in the GWAS. These tar bundles were generated by dbGAP and contain TOPMed multi-sample VCFs per consent code and chromosome.

Next, the 2-GWAS-preliminary-analysis notebook will lead you through a series of steps to explore the phenotypic and genotypic data. It will call the functions in the companion terra-data-util notebook to consolidate your clinical data from Gen3 into a single data table that can be imported into the Jupyter notebook. Then you will examine phenotypic distributions and genetic relatedness using the .

Part 3: Perform mixed-model association tests using workflows Next, perform mixed models genetic association tests (run as a series of batch workflows using GCP Compute engine). For details on the four workflows and what they do, scroll down to Perform mixed-model association test workflows. The workflows are publicly available in in this .

Mixed models require two steps within the package in : 1) Fitting a null model assuming that each genetic variant has no effect on phenotype and 2) Testing each genetic variant for association with the outcome, using the fitted null model.

hashtag
Part 1: Navigate the BDC multi-platform ecosystem

hashtag
1a) Link your Terra account to external services

Before you're able to access genomic data from Gen3 in the Terra data table, you need to link your Terra account to external services. Link your profile .

hashtag
1b) Create an Authorization Domain to protect your controlled-access data

Because this workspace was created to be used with controlled access data, it should be registered under an Authorization Domain that limits its access to only researchers with the appropriate approvals. Learn how to set up an Authorization Domain before proceeding.

hashtag
1c) Export a TOPMed project with blood pressure data from Gen3

  1. Start by learning about Gen3's graph-structured data model for BDC using this .

  2. Once you better understand the graph, log into through the NIH portal using your eRA Commons username and password.

  3. Navigate to the view to see what datasets you currently have and do not have access to. On the left-hand side, you can use the faceted search tool to narrow your results to specific projects.

hashtag
1d) Select a workspace for your data

Once the new Terra window appears, you are given a few options for where to place your data.

1) "Start with a template" This feature allows you to import data directly into a template workspace that has everything set up for you to do an analysis but does not contain any data. Once you select a workspace, you will need to enter:

  • Workspace name: Enter a name that is meaningful for your records.

  • Billing Project: Select the billing projects available to you.

  • Authorization Domain: Assign the authorization domain that you generated above to protect your data. This is important for working with controlled access. data.

2) "Start with an existing workspace" If you have already created a workspace, you can import your data directly to this workspace.

3) "Start a new workspace" This will create an empty workspace. You can individually copy notebooks and workflows from other workspaces, import workflows from Dockstore, or start fresh.

hashtag
Part 2: Explore TOPMed data in Jupyter Notebooks

hashtag
2a) Extract multi-sample VCFs to your workspace

Gen3 uploaded tar compressed bundles, as they are provided by dbGAP, into cloud buckets owned by BDC. To make these tar files actionable and ready for use in analyses, users will need to unarchive these tar bundlers to their workspace.

First, open the 1-unarchive-vcf-tar-file-to-workspace notebook and follow the steps to select which tar bundle(s) to extract to your workspace for use in the GWAS. Please understand that this step may be time consuming since TOPMed multi-sample VCF files are several hundred gigabytes in size.

hashtag
2b) Prepare your phenotypic and genotypic data for input into association test workflows

Now that you can interact with the Gen3 structured data more easily, you will use an interactive notebook to explore your phenotypic and environmental data and performs several analyses to prepare the data for use in batch association workflows.

  1. to work with the data you imported.

  2. Open the 2-GWAS-preliminary-analysis notebook and set your runtime configuration. We have given a suggested configuration within the notebook.

  3. From within this 2-GWAS-preliminary-analysis notebook you can call functions from the companion terra_data_table_util notebook to reformat multiple data tables into a single data table that can be loaded as a dataframe in the notebook.

hashtag
Time and cost estimate

You can adjust the runtime configuration to fit your computational needs in the Jupyter notebook. We recommend selecting the default environment and selecting the custom profile to use and configure the spark cluster for parallel processing. Using the profile suggested profile within the Jupyter notebook and a project with around 1000 samples, running this notebook on this dataset takes about 90 minutes and $20/hr to compute.

When working in a notebook with computing times over 30 minutes, learn more about Terra's and for your needs. Please carefully consider how adjusting auto-pause can remove protections that help you from accidentally accumulating cloud costs that you did not need.

hashtag
Part 3: Perform mixed-model association tests using workflows

In Part 2, we explored the data we imported from Gen3 and performed a few important steps for preparing our data for association testing. We generated a new "sample_set" data table that holds the files we created in the interactive notebook. These files will be used in our batch workflows that will perform the association tests. Below, we describe the four workflows in this workspace and their cost estimates for running on the sample set we create in this tutorial.

The workflows used in this template were imported from and their parameters were configured to work with Terra's data model. If you're interested in searching other Docker-based workflows, .

hashtag
Notes on how attributes are set in workflows

We have set the input and output attributes for each workflow in this template. Before running the first workflow, you can look through the inputs and outputs of each workflow and see that outputs from the first workflow feed into the second workflow, and so on.

In the 2-GWAS-preliminary-analysis notebook, we created a Sample Set data table that holds a row called "systolicbp" which contains the input files for the following workflows. You can check this data table out in the Data tab of this workspace. When you open a workflow, make sure that "Sample Set" is set and the "systolicbp" (or whatever you named your run) is selected before running a workflow.

1-vcfToGds

This workflow converts genotype files from Variant Call Format () to Genomic Data Structure (), the input format required by the R package GENESIS.

Time and cost estimates

Inputs:

  • VCF genotype file (or chunks of VCF files)

Outputs:

  • GDS genotype file

2-genesis_GWAS

This workflow creates a null model from phenotype data with the GENESIS biostatistical package. This null model can then be used for association testing. This workflow also runs single variant and aggregate test for genetic data. Implements Single-variant, Burden, SKAT, SKAT-O and, SMMAT tests for Continuous or Dichotomous outcomes. All tests account for familiar relatedness through kinship matrixes. Underlying functions adapted from: Conomos MP and Thornton T (2016). GENESIS: GENetic EStimation and Inference in Structured samples (GENESIS): Statistical methods for analyzing genetic data from samples with population structure and/or relatedness. R package version 2.3.4.

Time and cost estimates

Inputs:

  • GDS genotype file

  • Genetic Relatedness Matrix

  • Trait outcome name

Outputs:

  • A null model as an RData file

  • Compressed CSV file(s) containing raw results

  • CSV file containing all associations

hashtag
Cost Examples

Below are reported costs from using 1,000 and 10,000 samples to conduct a GWAS using the BioData Catalyst GWAS Blood Pressure Trait template workspace. The costs were derived from single variant tests that used Freeze 5b VCF files that were filtered for common variants (MAF <0.05) for input into workflows. The way these steps scale will vary with the number of variants, individuals, and parameters chosen. TOPMed Freeze 5b VCF files contain 582 million variants and Freeze 8 increases to ~1.2 billion. For GWAS analyses with Freeze 8 data, computational resources and costs are expected to be significantly higher.

These costs were derived from running these analyses in Terra in June 2020.

hashtag
Optional: Bring your own data

Both the notebook and workflow can be adapted to other genetic datasets. The steps for adapting these tools to another dataset are outlined below:

Update the data tables Learn more about uploading data to Terra . You can use functions available from the terra_data_table_util companion notebook to consolidate new data tables you generate.

Update the notebook Accommodating other datasets may require modifying many parts of this notebook. Inherently, the notebook is an interactive analysis where decisions are made as you go. It is not recommended that the notebook be applied to another dataset without careful thought.

Run an additional workflow You can search for available workflows and export them to Terra following .

hashtag
Helpful resources to master this tutorial

  • If you are new to BDC-Terra, we have created an that includes several introductory webinars.

hashtag
Authors, contact information, and funding

This template was created for the project in collaboration with the at and the at . The association analysis tools were contributed by the .

Contributing authors include:

  • (UC Santa Cruz Genomics Institute)

  • Michael Baumann (Broad Institute, Data Sciences Platform)

  • Brian Hannafious (UC Santa Cruz Genomics Institute)

hashtag
Workspace change log

First, under "Files" and "Access", select "Data with Access" to filter through projects that you currently have access to.

  • Next, under "Filters", you can select phenotypic or environmental data to narrow your results. Here, select the "Diagnosis" tab and under "BP Diastolic" move the left hand side of the sliding bar from 0 to 35. This will make your search range 35 - 163. This will only show the TOPMed projects that contain that trait data.

  • In all of TOPMed, there are 23 studies with diastolic blood pressure data. You may see anywhere from 0 to 23, depending on what projects you have applied for and received access to.

  • Next, click on the "Subject" tab. If you have access to a TOPMed project with blood pressure data, this will list all of the project names. Select only a single project to use in this template.

  • Once selected, click the button "Export all to Terra", wait until the Terra window appears, and add your data to your copy of this template workspace.

  • Subset the dataframe to include only your traits of interest and remove any individuals that lack data for these traits.

  • Visualize phenotype and environmental variable distributions in a series of plots.

  • Import the multi-sample VCF from the "Reference File" data table using DRS. You can learn more about GA4GH's Data Repository Service herearrow-up-right.

  • Filter your VCF to only common variants to increase statistical power. Genetic analyses in this notebook utilize the Hail softwarearrow-up-right. Hail is a framework for distributed computing with a focus on genetics. Particularly relevant for whole genome sequence (WGSarrow-up-right) analysis, Hail allows for efficient, nearly boundless computing (in terms of variant and sample size).

  • Perform a principal component analysis (PCAarrow-up-right) to assess population stratification. Genetic stratification can strongly affect association tests and should be accounted for.

  • Generate a genetic relatedness matrix (GRMarrow-up-right) to account for closely related individuals in your association testing workflows.

  • Generate a new "sample_set" data table that holds the derived files we created in the steps above using the FireCloud Service Selector (FISS) packagearrow-up-right. The files in this data table will be used in the workflows we run in Part 3.

  • Trait outcome type
  • CSV file of covariate traits

  • Sample ID list

  • CSV file containing top associations
  • PNG file of Quantile-Quantile and Manhattan plots

  • $347.68

    Intro to Hail using a Terra workspacearrow-up-right

  • GWAS tutorial using open data from the 1000 Genomes Projectarrow-up-right

  • Tim Majarianenvelope (Manning Lab)
  • Alisa Manning (Manning Lab)

  • Ash O'Farrell (UC Santa Cruz Genomics Institute)

  • Beth

    Jan 30, 2020

    Template updates

    Beth

    Jan 3, 2020

    Updates from BDC F2F

    Beth

    December 3, 2019

    Gen3 updates

    Beth

    November 22, 2019

    Updates from Alisa

    Beth

    October 22, 2019

    User experience edits from Beri

    Beth

    Sample Set Name

    Sample Size

    # Variants

    Time

    Cost $

    systolicbp

    1,052 samples

    6,429,788

    15m

    $1.01

    Sample Set Name

    Sample Size

    Time

    Cost

    systolicbp

    1,052 samples

    26m

    $0.94

    Analysis Step

    Cost (n=1,000; Freeze5b)

    Cost (n=10,000; Freeze 5b)

    GWAS Preliminary Analysis Notebook

    $29.34 ($19.56/hr for 1.5 hours)

    $336 ($56/hr for 6 hours)

    vcfTogds workflow

    $1.01

    $5.01

    genesis_GWAS workflow

    $0.94

    $6.67

    TOTAL

    Date

    Change

    Author

    December 9, 2020

    Update notebooks, workflows, and workspace markdown

    Ash

    June 26, 2020

    terra_data_table_util updates

    Beth

    Feb 26, 2020

    Added notebook to copy/extract VCF

    Beth

    Jan 31, 2020

    TOPMedarrow-up-right
    dbGAParrow-up-right
    be tracked herearrow-up-right
    HAIL genomic data analysis toolarrow-up-right
    Dockstorearrow-up-right
    collectionarrow-up-right
    GENESISarrow-up-right
    Rarrow-up-right
    by following these instructionsarrow-up-right
    herearrow-up-right
    orientation documentarrow-up-right
    Gen3arrow-up-right
    Gen3 Explorerarrow-up-right
    Learn how to customize your interactive analysis computearrow-up-right
    auto-pause featurearrow-up-right
    how to adjust auto-pausearrow-up-right
    Dockstorearrow-up-right
    learn more about how they can easily be launched in Terraarrow-up-right
    VCFarrow-up-right
    GDSarrow-up-right
    herearrow-up-right
    Dockstorearrow-up-right
    this methodarrow-up-right
    onboarding syllabusarrow-up-right
    Controlling cloud costsarrow-up-right
    Intro to Jupyter notebooks in Terraarrow-up-right
    NHLBI's BioData Catalystarrow-up-right
    Computational Genomics Platformarrow-up-right
    UCSC Genomics Institutearrow-up-right
    Data Sciences Platformarrow-up-right
    The Broad Institutearrow-up-right
    Manning Labarrow-up-right
    Beth Sheetsenvelope

    $32.29

    Replaced text with new Broad documentation