Terra

BioData Catalyst Powered by Terra is a user-friendly system for doing biomedical research in the cloud. Terra workspaces integrate data, analysis tools, and built-in security components to deliver smooth research flows from data to results.

The following entries in this section of the BioData Catalyst documentation are a starting point for learning how to use Terra in the context of the BioData Catalyst ecosystem. You can also dive deeper into Terra by visiting the Terra website and the Terra Support Center. Wherever possible, we highlight specific articles, tutorial videos, and example workspaces that will help you learn what you need to know to accelerate your research.

If you can't find what you are looking for, we are happy to help. See the Troubleshooting and Support section for more information.

Please note that Terra is designed for and tested with the Chrome browser.

Account Setup

Logging in to Terra for the first time is a quick and straightforward process. The process is easiest if you already have an email address hosted by Google. If you want to use an email address that is not hosted by Google, we have instructions for that as well.

Article: How to register for a Terra account
Article: Setting up a Google account with a non-Google email

We also recommend our article on navigating in Terra to get familiar with basic menus and options in Terra, as well as this video introduction to Terra.

Read on in the next two subsections for primers on how to set up billing and how to manage costs.

Billing

Now that you can log in, you’ll want to make sure that you have access to a Billing Account and Billing Project. This will allow you to charge storage and analysis costs through a Google account linked to Terra. A Terra Billing Project is Terra's way of connecting a workspace, where you accrue costs, back to a Google Billing Account, where you pay for them. You must have a Google Billing Account established before creating a Terra Billing Project. Outlined here are the steps necessary to set this up, as well as instructions on how to add users to, or be added to, an existing Billing Account or Billing Project.

Detailed instructions for setting up your billing can be found by following the links below. If you are a BioData Catalyst Fellow, your procedure for billing setup is a bit different, but you may still find some of the information below relevant (sharing a billing project with another user, for example).

Step 1: Get Cloud credits for BioData Catalyst
Step 2: Wait for approval & review the Billing overview for BioData Catalyst users
Step 3: Credits approved. Now create a new Terra billing project
Step 4 (optional): Sharing Billing Projects among colleagues

Managing Costs

We have a number of articles on tracking and minimizing the costs of operating on Terra. There are multiple ways of estimating how much your analyses are costing you, including built-in tools and external resources. The articles below contain instructions and advice on managing your cloud resources in a variety of ways:

Article: Understanding and controlling cloud costs
Article: Best practices for managing shared team costs
Article: How much did a workflow cost?
Article: How to disable billing on a Terra project

Workspace Setup

Workspaces are the fundamental building blocks of Terra. You can think of them as modular digital laboratories that enable you to organize and access your data in a number of ways for analysis.

To learn about the basics of operating a Terra workspace, we recommend these resources:

Article: Working with workspaces
Video: Introduction to using workspaces in Terra

Read on in this section to get familiar with data storage and management, collaboration, and security.

Data Storage & Management

Terra workspaces include a dedicated workspace Google bucket, as well as a built-in data model for managing your data. We provide articles in Terra’s knowledge base explaining how to organize and access data in a variety of ways.

A key to understanding the power of Terra is understanding its built-in data model, which allows you to rewire the inputs and outputs of your workflows and Jupyter notebooks.

The following resources give you guided instructions for using cloud-based data with Terra:

Article: Managing data with tables
Video: Introduction to Terra data tables
Article: Uploading to a workspace Google bucket
Article: How to import metadata to a workspace data table
Video: Making and uploading data tables to Terra

Collaboration

Sharing a workspace allows collaborators to actively work together in the same project workspace. Workspaces can be used as repositories of data, workflows, and Jupyter notebooks. Learn more about how to securely share a workspace:

Article: How to share a workspace
Article: Reader, writer or owner? Workspace access controls, explained
Article: Using permissions
Video: Introduction to Collaboration and Sharing in Terra

Security

Terra has a number of features to ensure the security of sensitive data accessed through the platform. Many of these features are in place automatically, while tools like authorization domains give you greater control over your data. These articles contain an overview of the security features enabled on Terra:

Article: Authorization Domain overview for BioData Catalyst users
Article: Managing data privacy and access with Authorization Domains
Article: Best Practices for accessing external resources
Article: Terra security posture

Bring Data into a Workspace

You can import data into your workspace by either linking directly to external files you have access to, or by interfacing with a number of platforms with which Terra has integrated access.

For BioData Catalyst researchers, one of the most relevant of these interfacing platforms is Gen3. However, this section also points you to resources that explain how to import data from other public datasets integrated into Terra’s data library, as well as how to bring in your own data.

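Read on in this section for more information on bringing in data from Gen3, importing from Terra’s data library, and using your own data with Terra.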

Bring in Data from Gen3

BioData Catalyst Powered by Gen3 provides data for many projects and conveniently supports search across the vast set of subjects to identify the best available cohorts for research analysis. Searches are based on harmonized phenotypic variables and may be performed both within and across projects.

When a desired cohort has been identified in Gen3, the cohort may be conveniently "handed-off" to Terra for analysis. Optionally, this dataset may be enhanced with additional metadata from dbGaP, or extended to include additional researcher-provided subject data.

Here we provide essential information for all researchers using BioData Catalyst data from Gen3, including how to access and select Gen3 subject data and hand it off to Terra, as well as a description of the GA4GH Data Repository Service (DRS) protocol and data identifiers used by Gen3 and Terra.

The resources below contain the information you’ll need to access your desired data:

Video: Data Analysis with Gen3, Terra and Dockstore
Article: Discovering Data Using Gen3
Article: Understanding and using Gen3 data in Terra
Article: Data Access with the GA4GH Data Repository Service (DRS)
Article: Linking Terra to External Servers
Article: Understanding and setting up a proxy group
Workspace: BioDataCatalyst Gen3 data on Terra tutorial
Workspace: TOPMed Aligner workspace
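To make the DRS identifiers mentioned above more concrete, here is a minimal sketch of how a hostname-based DRS URI can be mapped to the HTTPS address of its object record, following the GA4GH DRS v1 path convention. It is an illustration only: it does not handle compact identifier-based DRS URIs, which are resolved through a prefix registry, and the host and object ID in the usage note are hypothetical.

```python
from urllib.parse import urlparse


def drs_uri_to_url(drs_uri):
    """Map a hostname-based DRS URI to the HTTPS URL of its object record."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs":
        raise ValueError(f"not a DRS URI: {drs_uri!r}")
    host = parsed.netloc
    object_id = parsed.path.lstrip("/")
    if not host or not object_id:
        raise ValueError(f"DRS URI needs a host and an object ID: {drs_uri!r}")
    # GA4GH DRS v1 path convention for hostname-based URIs.
    return f"https://{host}/ga4gh/drs/v1/objects/{object_id}"
```

For example, `drs_uri_to_url("drs://drs.example.org/abc-123")` yields `https://drs.example.org/ga4gh/drs/v1/objects/abc-123`. In practice Terra resolves DRS URIs for you, so this sketch is meant only to show what the identifier encodes.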

From Terra’s Data Library

Terra’s Dataset Library includes a number of integrated datasets, many of which have individualized Data Explorer interfaces, useful for generating and exporting custom cohorts. If you click into a dataset and have the proper permissions, you'll be able to explore the data. If you don't have the necessary permission, you'll be taken to a page that tells you whom to contact for access.

The resources linked below provide guided instructions for creating custom cohorts from the data library, importing them to your workspace, and using a Jupyter notebook to interact with the data:

Article: Accessing and analysing custom cohorts with Data Explorer
Video: Notebooks Quickstart walkthrough
Workspace: Notebooks Quickstart workspace

Use Your Own Data with Terra

This page describes how researchers may bring their own data files and metadata into Terra. Some researchers may choose to bring their own data to Terra in addition to, or instead of, using BioData Catalyst data from Gen3. For example, this may be done when bringing additional (e.g., longitudinal) phenotypic data to enhance the harmonized metadata available from Gen3, when using joint variant calling with additional researcher-provided genomic data, or even when using researcher-provided data exclusively.

Researchers typically bring two types of data to Terra: data files (e.g., genomic data, including CRAM and VCF data) and metadata (e.g., tables of clinical, phenotypic, or other data, typically regarding the subjects in their study). These are described separately below.

There are two ways a researcher's data files may be made available in Terra: by uploading data to the researcher's workspace bucket, or by enabling Terra to access the data in a researcher-managed Google bucket. The second option requires setting up a proxy group.

Article: Uploading to a workspace Google bucket
Article: Understanding and setting up a proxy group
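As a rough illustration of the first option, the sketch below assembles a `gsutil cp` command for copying a local file into a workspace bucket. The helper name, bucket name, and folder prefix are hypothetical, and the function only builds the argument list; running it requires the Google Cloud SDK and write access to the bucket.

```python
def gsutil_upload_command(local_path, bucket, prefix=""):
    """Build the argument list for copying one local file into a bucket."""
    # Accept the bucket with or without its gs:// scheme.
    if bucket.startswith("gs://"):
        bucket = bucket[len("gs://"):]
    destination = "gs://" + bucket.strip("/")
    if prefix:
        destination += "/" + prefix.strip("/")
    # A trailing slash tells gsutil to treat the destination as a folder.
    destination += "/"
    return ["gsutil", "cp", local_path, destination]
```

The resulting list can be passed to `subprocess.run(..., check=True)` from a terminal session or a notebook where `gsutil` is installed.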

The ways in which a researcher may import metadata to the Terra Data tables are described in the articles and tutorials below:

Article: Managing data with tables
Article: How to import metadata to a workspace data table
Video: Introduction to Terra data tables
Video: Making and uploading data tables to Terra
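The sketch below shows one way to prepare metadata for import as a tab-separated load file, in which the first column header takes the form `entity:<type>_id`. The function name, column names, and bucket paths are hypothetical; consult the articles above for the authoritative file format.

```python
import csv
import io


def build_load_file(entity_type, id_column, rows):
    """Render a list of dict rows as a tab-separated data table load file."""
    # Every column except the ID column becomes an attribute column.
    columns = [name for name in rows[0] if name != id_column]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # The first header names the table, e.g. "entity:sample_id".
    writer.writerow([f"entity:{entity_type}_id"] + columns)
    for row in rows:
        writer.writerow([row[id_column]] + [row[name] for name in columns])
    return buffer.getvalue()


rows = [
    {"subject": "S1", "age": 52, "cram": "gs://my-bucket/S1.cram"},
    {"subject": "S2", "age": 47, "cram": "gs://my-bucket/S2.cram"},
]
load_file_text = build_load_file("sample", "subject", rows)
```

Saving `load_file_text` to a `.tsv` file gives a table with one row per subject, where the `cram` column links each row to a file in a bucket.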

Run Analyses

Terra supports two types of analysis: batch processing with workflows and interactive analysis with Jupyter Notebooks. This section points you to resources that teach you how to do both.

As an introduction, we recommend reading our article on the kinds of analysis you can do in Terra.

Batch Processing with Workflows

The batch workflow features of Terra provide support for computationally-intensive, long-running, and large-scale analysis.

You can run whole pipelines, from preprocessing and trimming sequencing data to alignment and downstream analyses, using Terra workflows. Workflows are written in the human-readable Workflow Description Language (WDL), and you can search for and import them into your workspace from Dockstore or the Broad Methods Repository.

Video: Data Analysis with Gen3, Terra and Dockstore
Article: How to import data from Gen3 into Terra and run the TOPMed aligner workflow
Article: Configure a workflow to process your data
Article: Getting workflows up and running faster with a JSON file
Article: Importing a Dockstore workflow into Terra
Video: Importing a Dockstore workflow into Terra walkthrough
Video: Workflows Quickstart walkthrough
Workspace: Workflows Quickstart workspace
Workspace for BioData Catalyst: TOPMed Aligner workspace
Workspace for BioData Catalyst: GWAS with 1000 Genomes and synthetic clinical data
Workspace for BioData Catalyst: GWAS with TOPMed data
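Workflow inputs can be supplied as a JSON file whose keys follow the WDL convention of `<workflow name>.<input name>`. The following sketch generates such a file from a plain dictionary. The workflow name and input names are hypothetical and must match those declared in the WDL you actually import.

```python
import json


def build_inputs_json(workflow_name, inputs):
    """Qualify each input name with the workflow name and render as JSON."""
    qualified = {f"{workflow_name}.{name}": value for name, value in inputs.items()}
    return json.dumps(qualified, indent=2, sort_keys=True)


inputs_text = build_inputs_json(
    "aligner",  # hypothetical workflow name
    {"input_cram": "gs://my-bucket/S1.cram", "cpu": 4},
)
```

Writing `inputs_text` to a `.json` file produces a configuration that can be reused across submissions instead of filling in each input by hand.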

Interactive Analysis

The interactive analysis features of Terra support interactive data exploration, including the use of statistical methods and graphical display. Versatile and powerful interactive analysis is provided through Jupyter Notebooks in both Python and R languages.

Jupyter Notebooks run on a virtual machine (VM). You can customize your VM’s installed software by selecting one of Terra's preinstalled notebook cloud environments or by choosing a custom environment specified as a Docker image. Using the same Docker image ensures that you and your colleagues analyze data with the same software, which makes your results easier to reproduce.

Article: Interactive statistics and visualization with Jupyter notebooks
Article: Customizing your interactive analysis application compute
Article: Terra's Jupyter Notebooks environment Part I: Key components
Article: Terra's Jupyter Notebooks environment Part II: Key operations
Article: Terra's Jupyter Notebooks environment Part III: Best Practices
Video: Notebooks overview
Video: Notebooks Quickstart walkthrough
Workspace: Notebooks Quickstart workspace
Workspace: BioData Catalyst notebooks collection
Workspace: PIC-SURE Tutorial in R
Workspace: PIC-SURE Tutorial in Python
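A common first step in a notebook is to find out which workspace and bucket it is attached to. The sketch below assumes that the notebook environment exposes this information through the environment variables `WORKSPACE_NAME`, `WORKSPACE_BUCKET`, and `WORKSPACE_NAMESPACE`; treat those names as an assumption and check them in your own environment. Missing variables are returned as `None` rather than raising an error.

```python
import os


def workspace_context(environ=None):
    """Collect workspace details from environment variables, if present."""
    env = os.environ if environ is None else environ
    return {
        "workspace": env.get("WORKSPACE_NAME"),
        "bucket": env.get("WORKSPACE_BUCKET"),
        "billing_project": env.get("WORKSPACE_NAMESPACE"),
    }
```

Calling `workspace_context()` with no arguments reads the real environment. Passing a dictionary instead makes the function easy to try outside of Terra.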

Genome-Wide Association Studies

Terra provides powerful support for performing Genome-Wide Association Studies (GWAS). The following featured and template workspaces include Jupyter notebooks for phenotypic and genomic data preparation (using Hail) and workflows (using GENESIS) to perform single or aggregate variant association tests using mixed models. We will continue to provide more resources for performing more complex GWAS scenarios in BioData Catalyst.

Kinship Matrices

A Jupyter Notebook in both of the following workspaces uses Hail to generate Genetic Relatedness Matrices (GRMs) for input into the GWAS workflows. Users with access to kinship matrices from the TOPMed consortium may wish to skip these steps and instead import kinship files using the bring-your-own-data instructions.
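For readers unfamiliar with what a relatedness matrix contains, here is a small pure-Python sketch of one common estimator: genotypes are centered by twice the allele frequency, scaled by the expected variance, and averaged over variants. It is not the Hail implementation used in the workspaces, and the toy genotypes in the usage note are made up.

```python
def genetic_relatedness_matrix(genotypes):
    """Estimate pairwise relatedness from alternate-allele counts (0, 1, 2).

    `genotypes` is a list of samples, each a list with one count per variant.
    """
    n_samples = len(genotypes)
    n_variants = len(genotypes[0])
    grm = [[0.0] * n_samples for _ in range(n_samples)]
    used = 0
    for k in range(n_variants):
        # Alternate allele frequency across all samples at this variant.
        p = sum(sample[k] for sample in genotypes) / (2 * n_samples)
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic variants carry no information
        used += 1
        variance = 2 * p * (1 - p)
        centered = [sample[k] - 2 * p for sample in genotypes]
        for i in range(n_samples):
            for j in range(n_samples):
                grm[i][j] += centered[i] * centered[j] / variance
    if used == 0:
        raise ValueError("no polymorphic variants to estimate relatedness from")
    return [[value / used for value in row] for row in grm]
```

With the toy input `[[0, 1, 2], [2, 1, 0], [1, 1, 1]]`, the first two samples have opposite genotypes and receive a negative off-diagonal value, while the matrix as a whole is symmetric. Real analyses use thousands of variants and a dedicated library.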

BioData Catalyst GWAS tutorial workspace

The BioData Catalyst GWAS tutorial workspace was created to walk users through a GWAS with training data that includes synthetic phenotypic data (modeled after traits available in TOPMed) paired with 1000 Genomes open-access data. This tutorial aims to familiarize users with the Gen3 data model so that they can become empowered to use this data model with any existing tutorials available in the Terra library’s showcase section.

BioData Catalyst GWAS blood pressure trait template workspace

This template is an example workspace that asks researchers to export TOPMed projects (to which they have access) into an example template for conducting a common variant, mixed-models GWAS of a blood pressure trait. Our goal was to include settings and suggestions to help users interact with data exactly as they receive it through BioData Catalyst. Accommodating other datasets may require modifying many parts of this notebook. The notebook is inherently an interactive analysis in which decisions are made as you go. We do not recommend applying it to another dataset without careful thought.

Cost Examples

Below are reported costs from using 1,000 and 10,000 samples to conduct a GWAS using the BioData Catalyst GWAS Blood Pressure Trait template workspace. The costs were derived from single variant tests that used Freeze 5b VCF files that were filtered for common variants (MAF <0.05) for input into workflows. The way these steps scale will vary with the number of variants, individuals, and parameters chosen. TOPMed Freeze 5b VCF files contain 582 million variants and Freeze 8 increases to ~1.2 billion. For GWAS analyses with Freeze 8 data, computational resources and costs are expected to be significantly higher.

These costs were derived from running these analyses in Terra in June 2020.

Troubleshooting & Support

If things aren’t going quite as expected, there are a number of avenues to help unblock any issues you may have.

Troubleshooting

This section of the Terra knowledge base contains many useful articles on how to address problems, including a variety of articles describing common workflow errors, as well as more general articles that explain how to find which errors are affecting your work, and how to proceed once you’ve diagnosed your problem.

Monitor your jobs

The Job History tab is your workflow operations dashboard, where you can check the status of past and current workflow submissions and find links to the job manager where you can diagnose issues.

How to report an issue

This article outlines a number of ways you can report an issue directly to us. If something appears broken, slow, or just plain weird, feel free to let us know.

Community forum

A lot of answers can be found on our forum, which is monitored by our dedicated frontline support team and has an integrated search function. If you suspect that you’re running into a common issue but can’t find an answer in the documentation, this is a great place to check.
