
Genome Wide Association Study with TOPMed Data Tutorial

This page is an open access preview of the Terra tutorial workspace that you can find at this link.

GWAS Tutorial Using Blood Pressure Traits

This template workspace was created to offer example tools for conducting a single variant, mixed-models GWAS focusing on a blood pressure trait from start to finish using the NHLBI BioData Catalyst ecosystem. We have created a set of documents to get you started in the BioData Catalyst system. If you're ready to conduct an analysis, proceed with this dashboard:

Data Model

This template was set up to work with the NHLBI BioData Catalyst Gen3 data model. In this dashboard, you will learn how to import data from the Gen3 platform into this Terra template and conduct an association test using this particular data model.

TOPMed Data

Currently, BDC-Gen3 hosts the TOPMed program, which is controlled access. If you do not already have access to a TOPMed project through dbGaP, this template workspace may not yet be helpful to you. To apply for access to TOPMed, submit an application within dbGaP.

If you already have access to a TOPMed project and have been onboarded to the BDC platform, you should be able to access your data through BDC-Gen3 and use your data with this template workspace. We focused this template on analyzing a blood pressure trait, but not all TOPMed projects contain blood pressure data. You will need to carefully consider how to update this analysis for the dataset you bring and how this may affect the scientific accuracy of the question you are asking.

A note about TOPMed metadata

Some types of metadata will always be present: GUID, Case ID, Project Name, Number of Samples, Study, Gender, Age at Index, Race, Ethnicity, Number of Aliquots, SNP Array Files, Unaligned Read Files, Aligned Read Files, Germline Variation Files.

Other metadata depend on the analysis plan submitted when applying for TOPMed access. Examples include BMI, Years Smoked, years smoked greater than 89, hypertension, hypertension medications, diastolic blood pressure, systolic blood pressure, etc.

The TOPMed Data Coordinating Center (DCC) is currently harmonizing select phenotypes across TOPMed, which will also be deposited into the TOPMed accessions. The progress of phenotype metadata harmonization can . The data in the Gen3 graph model in this tutorial are harmonized phenotypes. You can find the unharmonized phenotypic and environmental data in the "Reference File" node of the Gen3 graph. Documentation about how to interact with unharmonized data in Terra is coming soon.

Outline of this template:

Part 1: Navigate the BDC environment. Learn how to search and export data from Gen3 and workflows from Dockstore into a Terra workspace. These cloud-based platforms interoperate with one another for fast and secure research. The template we have created here can be cloned for you to walk through as suggested, or you can use the basics you learn here to perform your own analysis.

Part 2: Explore TOPMed data in an interactive Jupyter notebook. In this Terra workspace, you can find a series of interactive notebooks to explore TOPMed data.

First, you will use the notebook 1-unarchive-vcf-tar-file-to-workspace to extract the contents of tar bundles to your workspace for use in the GWAS. These tar bundles were generated by dbGaP and contain TOPMed multi-sample VCFs per consent code and chromosome.

Next, the 2-GWAS-preliminary-analysis notebook will lead you through a series of steps to explore the phenotypic and genotypic data. It calls functions in the companion terra_data_table_util notebook to consolidate your clinical data from Gen3 into a single data table that can be imported into the Jupyter notebook. Then you will examine phenotypic distributions and genetic relatedness using the .

Part 3: Perform mixed-model association tests using workflows. Next, perform mixed-model genetic association tests (run as a series of batch workflows using GCP Compute Engine). For details on the four workflows and what they do, scroll down to Perform mixed-model association test workflows. The workflows are publicly available in in this .

Mixed models require two steps within the GENESIS package in R: 1) fitting a null model assuming that each genetic variant has no effect on phenotype, and 2) testing each genetic variant for association with the outcome, using the fitted null model.
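The two-step idea can be sketched in a few lines of NumPy. This is an illustration only, not the GENESIS implementation: the function names `fit_null_model` and `score_test`, the simulated sibling-pair relatedness matrix, and the fixed variance components (0.5 and 0.5) are all assumptions made for the example. A real analysis estimates the variance components from the data.

```python
import numpy as np

def fit_null_model(y, X, V):
    """Step 1: fit y = X @ beta + e with e ~ N(0, V), i.e. no variant effect."""
    V_inv = np.linalg.inv(V)
    XtVi = X.T @ V_inv
    XtViX = XtVi @ X
    beta = np.linalg.solve(XtViX, XtVi @ y)
    # P removes the covariates: P = V^-1 - V^-1 X (X' V^-1 X)^-1 X' V^-1
    P = V_inv - XtVi.T @ np.linalg.solve(XtViX, XtVi)
    return {"beta": beta, "P": P, "Py": P @ y}

def score_test(genotypes, null_model):
    """Step 2: 1-df score statistic for one variant, reusing the null model."""
    score = genotypes @ null_model["Py"]
    variance = genotypes @ null_model["P"] @ genotypes
    return score ** 2 / variance

# Simulated data (all values are made up for illustration).
rng = np.random.default_rng(0)
n = 200
# Hypothetical relatedness matrix: 100 sibling pairs.
K = np.kron(np.eye(n // 2), np.array([[1.0, 0.5], [0.5, 1.0]]))
V = 0.5 * K + 0.5 * np.eye(n)  # assumed, not estimated, variance components
X = np.column_stack([np.ones(n), rng.normal(50, 10, n)])  # intercept + age
causal = rng.binomial(2, 0.3, n).astype(float)
unrelated = rng.binomial(2, 0.3, n).astype(float)
y = 120 + 0.2 * X[:, 1] + 2.0 * causal + rng.normal(0, 1, n)

null_model = fit_null_model(y, X, V)          # fitted once
stat_causal = score_test(causal, null_model)  # reused for every variant
stat_unrelated = score_test(unrelated, null_model)
```

The point of the design is that the expensive step (fitting the null model) happens once, while the per-variant test reduces to a few vector products, which is what makes testing millions of variants feasible.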

Part 1: Navigate the BDC multi-platform ecosystem

1a) Link your Terra account to external services

Before you're able to access genomic data from Gen3 in the Terra data table, you need to link your Terra account to external services. Link your profile .

1b) Create an Authorization Domain to protect your controlled-access data

Because this workspace was created to be used with controlled access data, it should be registered under an Authorization Domain that limits its access to only researchers with the appropriate approvals. Learn how to set up an Authorization Domain before proceeding.

1c) Export a TOPMed project with blood pressure data from Gen3

Start by learning about Gen3's graph-structured data model for BDC using this .
Once you better understand the graph, log in to Gen3 through the NIH portal using your eRA Commons username and password.
Navigate to the view to see what datasets you currently have and do not have access to. On the left-hand side, you can use the faceted search tool to narrow your results to specific projects.

1d) Select a workspace for your data

Once the new Terra window appears, you are given a few options for where to place your data.

1) "Start with a template" This feature allows you to import data directly into a template workspace that has everything set up for you to do an analysis but does not contain any data. Once you select a workspace, you will need to enter:

Workspace name: Enter a name that is meaningful for your records.
Billing Project: Select one of the billing projects available to you.
Authorization Domain: Assign the authorization domain that you generated above to protect your data. This is important for working with controlled-access data.

2) "Start with an existing workspace" If you have already created a workspace, you can import your data directly to this workspace.

3) "Start a new workspace" This will create an empty workspace. You can individually copy notebooks and workflows from other workspaces, import workflows from Dockstore, or start fresh.

Part 2: Explore TOPMed data in Jupyter Notebooks

2a) Extract multi-sample VCFs to your workspace

Gen3 uploaded the tar bundles, as provided by dbGaP, into cloud buckets owned by BDC. To make these tar files ready for use in analyses, users need to unarchive the tar bundles to their workspace.

First, open the 1-unarchive-vcf-tar-file-to-workspace notebook and follow the steps to select which tar bundle(s) to extract to your workspace for use in the GWAS. This step may be time-consuming, since TOPMed multi-sample VCF files are several hundred gigabytes in size.
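For readers who want a feel for what unarchiving involves, here is a minimal local sketch using Python's standard `tarfile` module. It is not the notebook's code: the function name `extract_vcfs`, the suffix list, and the tiny stand-in bundle are assumptions for illustration, and the real notebook works against cloud buckets rather than a local temporary directory.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path

VCF_SUFFIXES = (".vcf", ".vcf.gz", ".bcf")  # assumed set of genotype file types

def extract_vcfs(tar_path, dest_dir, suffixes=VCF_SUFFIXES):
    """Copy only the VCF-like members of a tar bundle into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(suffixes)):
                continue
            # Keep only the base name so archive paths cannot escape dest_dir.
            target = dest / Path(member.name).name
            with tar.extractfile(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)  # streams, so large files are fine
            extracted.append(target.name)
    return sorted(extracted)

# Demo on a tiny stand-in bundle (file names and contents are made up).
workdir = Path(tempfile.mkdtemp())
bundle_src = workdir / "bundle"
bundle_src.mkdir()
(bundle_src / "chr22.vcf.gz").write_bytes(b"placeholder")
(bundle_src / "README.txt").write_text("not a genotype file")
tar_path = workdir / "bundle.tar"
with tarfile.open(tar_path, "w") as tar:
    tar.add(bundle_src, arcname="bundle")

extracted = extract_vcfs(tar_path, workdir / "vcfs")
```

Streaming each member with `shutil.copyfileobj` avoids loading a whole file into memory, which matters when individual VCFs are very large.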

2b) Prepare your phenotypic and genotypic data for input into association test workflows

Now that you can interact with the Gen3 structured data more easily, you will use an interactive notebook to explore your phenotypic and environmental data and perform several analyses to prepare the data for use in batch association workflows.

to work with the data you imported.
Open the 2-GWAS-preliminary-analysis notebook and set your runtime configuration. We have given a suggested configuration within the notebook.
From within this 2-GWAS-preliminary-analysis notebook you can call functions from the companion terra_data_table_util notebook to reformat multiple data tables into a single data table that can be loaded as a dataframe in the notebook.
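Conceptually, consolidating several data tables means joining them on a shared identifier. The sketch below shows that idea in plain Python. It does not reproduce the functions in the companion notebook: the helper `consolidate`, the key `subject_id`, and the column names and values are hypothetical.

```python
def consolidate(tables, key="subject_id"):
    """Merge several lists of row dicts into one row per key (outer join)."""
    merged = {}
    for rows in tables:
        for row in rows:
            # Start a new combined row the first time a key is seen.
            merged.setdefault(row[key], {key: row[key]}).update(row)
    return [merged[k] for k in sorted(merged)]

# Hypothetical tables standing in for separate nodes of the data model.
demographic = [
    {"subject_id": "S1", "age_at_index": 54},
    {"subject_id": "S2", "age_at_index": 61},
]
blood_pressure = [
    {"subject_id": "S1", "systolic_bp": 128},
    {"subject_id": "S2", "systolic_bp": 141},
]
sample = [{"subject_id": "S1", "sample_id": "NWD0001"}]

combined = consolidate([demographic, blood_pressure, sample])
```

If pandas is available in the notebook environment, a list of dictionaries like `combined` can be passed to `pandas.DataFrame` to obtain a dataframe; subjects missing from a table simply lack those columns.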

Time and cost estimate

You can adjust the runtime configuration to fit your computational needs in the Jupyter notebook. We recommend selecting the default environment and the custom profile, which lets you use and configure a Spark cluster for parallel processing. Using the profile suggested within the Jupyter notebook and a project with around 1,000 samples, running this notebook takes about 90 minutes at a rate of about $20/hr.
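As a quick worked example of the estimate above, the notebook cost is the runtime in hours multiplied by the hourly rate. The helper name `notebook_cost` is made up for this sketch, and it assumes the rate stays constant for the whole run.

```python
def notebook_cost(runtime_minutes, rate_per_hour):
    """Approximate cost of a notebook run at a constant hourly rate."""
    return runtime_minutes / 60 * rate_per_hour

# 90 minutes at $20/hr, the figures quoted in the text.
estimated_cost = notebook_cost(90, 20.0)
print(f"Estimated cost: ${estimated_cost:.2f}")  # prints "Estimated cost: $30.00"
```

The same function can be used to see how a longer runtime or a larger cluster (a higher hourly rate) changes the total.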

When working in a notebook with computing times over 30 minutes, learn more about Terra's auto-pause feature and how to adjust it for your needs. Please carefully consider how adjusting auto-pause can remove protections that help keep you from accidentally accumulating unnecessary cloud costs.

Part 3: Perform mixed-model association tests using workflows

In Part 2, we explored the data we imported from Gen3 and performed a few important steps for preparing our data for association testing. We generated a new "sample_set" data table that holds the files we created in the interactive notebook. These files will be used in our batch workflows that will perform the association tests. Below, we describe the four workflows in this workspace and their cost estimates for running on the sample set we create in this tutorial.

The workflows used in this template were imported from Dockstore, and their parameters were configured to work with Terra's data model. If you're interested in other Docker-based workflows, you can search for them in Dockstore.

Notes on how attributes are set in workflows

We have set the input and output attributes for each workflow in this template. Before running the first workflow, you can look through the inputs and outputs of each workflow and see that outputs from the first workflow feed into the second workflow, and so on.

In the 2-GWAS-preliminary-analysis notebook, we created a Sample Set data table that holds a row called "systolicbp" which contains the input files for the following workflows. You can check this data table out in the Data tab of this workspace. When you open a workflow, make sure that "Sample Set" is set and the "systolicbp" (or whatever you named your run) is selected before running a workflow.

1-vcfToGds

This workflow converts genotype files from Variant Call Format (VCF) to Genomic Data Structure (GDS), the input format required by the R package GENESIS.

Time and cost estimates

Inputs:

VCF genotype file (or chunks of VCF files)

Outputs:

GDS genotype file
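The conversion itself is done by the workflow in R, so the following is only a Python sketch of the information a VCF record carries and that any converter has to read: one row per variant and one genotype call per sample. The function name `parse_vcf` and the three-sample example are assumptions for illustration. Genotypes are summarized here as alternate-allele dosages (0, 1, or 2), with `None` for missing calls.

```python
def parse_vcf(lines):
    """Return (sample_ids, [(variant_id, dosages), ...]) from VCF text lines."""
    samples, variants = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if fields[0] == "#CHROM":
            samples = fields[9:]  # sample IDs follow the 9 fixed columns
            continue
        chrom, pos, _, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")
        dosages = []
        for call in fields[9:]:
            alleles = call.split(":")[gt_index].replace("|", "/").split("/")
            # Count non-reference alleles; a "." allele means a missing call.
            dosages.append(None if "." in alleles else sum(a != "0" for a in alleles))
        variants.append((f"{chrom}:{pos}:{ref}:{alt}", dosages))
    return samples, variants

# Made-up three-sample example.
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "S1", "S2", "S3"]
records = [
    ["22", "16050075", ".", "A", "G", ".", "PASS", ".", "GT", "0/0", "0|1", "1/1"],
    ["22", "16050115", ".", "G", "A", ".", "PASS", ".", "GT:DP", "0/1:12", "./.:0", "0/0:9"],
]
vcf_lines = ["##fileformat=VCFv4.2", "\t".join(header)] + ["\t".join(r) for r in records]

samples, variants = parse_vcf(vcf_lines)
```

Parsing text like this for every variant is slow at TOPMed scale, which is one reason a one-time conversion to an array-oriented format is done before association testing.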

2-genesis_GWAS

This workflow creates a null model from phenotype data with the GENESIS biostatistical package. This null model can then be used for association testing. The workflow also runs single-variant and aggregate tests on the genetic data: it implements single-variant, Burden, SKAT, SKAT-O, and SMMAT tests for continuous or dichotomous outcomes. All tests account for familial relatedness through kinship matrices. Underlying functions adapted from: Conomos MP and Thornton T (2016). GENESIS: GENetic EStimation and Inference in Structured samples (GENESIS): Statistical methods for analyzing genetic data from samples with population structure and/or relatedness. R package version 2.3.4.

Time and cost estimates

Inputs:

GDS genotype file
Genetic Relatedness Matrix
Trait outcome name

Outputs:

A null model as an RData file
Compressed CSV file(s) containing raw results
CSV file containing all associations

Cost Examples

Below are reported costs from using 1,000 and 10,000 samples to conduct a GWAS using the BioData Catalyst GWAS Blood Pressure Trait template workspace. The costs were derived from single-variant tests that used Freeze 5b VCF files filtered to common variants (variants with MAF < 0.05 removed) as input to the workflows. The way these steps scale will vary with the number of variants, individuals, and parameters chosen. TOPMed Freeze 5b VCF files contain 582 million variants, and Freeze 8 increases this to ~1.2 billion. For GWAS analyses with Freeze 8 data, computational resources and costs are expected to be significantly higher.

These costs were derived from running these analyses in Terra in June 2020.
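To make the common-variant filter mentioned above concrete, here is a small sketch of computing minor allele frequency (MAF) from genotype dosages and keeping variants at or above a threshold. It reads the filter as retaining variants with MAF of at least 0.05, which is an assumption about the intended direction of the threshold. The function names and the toy genotypes are hypothetical.

```python
def minor_allele_frequency(dosages):
    """MAF from alternate-allele dosages (0, 1, 2); None marks a missing call."""
    called = [d for d in dosages if d is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))  # two alleles per diploid sample
    return min(alt_freq, 1 - alt_freq)

def keep_common(variants, threshold=0.05):
    """Return IDs of variants whose MAF is at least the threshold."""
    return [vid for vid, d in variants.items() if minor_allele_frequency(d) >= threshold]

# Toy genotypes for 20 samples per variant (made up).
variants = {
    "rare": [0] * 19 + [1],                  # alt frequency 1/40
    "common": [0] * 10 + [1] * 8 + [2] * 2,  # alt frequency 12/40
    "flipped": [2] * 17 + [1] * 3,           # alt is the major allele (37/40)
}

common_ids = keep_common(variants)
```

Taking the minimum of the alternate and reference frequencies means a variant is treated the same regardless of which allele happens to be labeled as the reference.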

Optional: Bring your own data

Both the notebook and workflow can be adapted to other genetic datasets. The steps for adapting these tools to another dataset are outlined below:

Update the data tables. Learn more about uploading data to Terra . You can use functions available from the terra_data_table_util companion notebook to consolidate new data tables you generate.

Update the notebook. Accommodating other datasets may require modifying many parts of this notebook. Inherently, the notebook is an interactive analysis where decisions are made as you go. It is not recommended that the notebook be applied to another dataset without careful thought.

Run an additional workflow. You can search for available workflows and export them to Terra following .

Helpful resources to master this tutorial

If you are new to BDC-Terra, we have created an that includes several introductory webinars.

Authors, contact information, and funding

This template was created for the project in collaboration with the at and the at . The association analysis tools were contributed by the .

Contributing authors include:

(UC Santa Cruz Genomics Institute)
Michael Baumann (Broad Institute, Data Sciences Platform)
Brian Hannafious (UC Santa Cruz Genomics Institute)

Workspace change log
