LogoLogo
  • NHLBI BioData Catalyst® (BDC) Documentation
  • Community
    • Who We Are
    • BDC Glossary
    • Citation and Acknowledgement
    • Strategic Planning
    • Request for Comments
      • NHLBI BioData Catalyst Ecosystem Security Statement
      • NHLBI DICOM Medical Image De-Identification Baseline Protocol
    • BDC Video Content Guidance
    • Contributing User Resources to BDC
  • Written Documentation
    • Getting Started
    • Data Access
      • Data Interoperability
      • Understanding Access
      • Submitting a dbGaP Data Access Request
      • Checking Access
    • Explore Available Data
      • Dug Semantic Search
        • Search and Results
      • PIC-SURE User Guide
        • Getting Started
          • Requirements and Login
          • Available Data and Managing Data Access
            • TOPMed and TOPMed related datasets
            • BioLINCC Datasets
            • CONNECTS Dataset
        • Data Organization in PIC-SURE
        • PIC-SURE Features and General Layout
        • PIC-SURE Open Access vs. PIC-SURE Authorized Access
          • PIC-SURE Open Access
          • PIC-SURE Authorized Access
        • Data Analysis Using the PIC-SURE API
        • Additional Resources
        • PIC-SURE API Documentation
        • Appendix 1: BioData Catalyst Identifiers - dbGaP, TOPMed, and PIC-SURE
        • Appendix 2: Table of Harmonized Variables
      • Discovering Data Using Gen3
        • Dictionary
        • Exploration
        • Query
        • Workspace
        • Profile
        • PFB Files
        • Current Projects
    • Analyze Data
      • Transferring Files Between Seven Bridges and Terra
      • Seven Bridges
        • Knowledge Center
        • Getting Started Guide
        • Comprehensive Analysis Tips
        • Troubleshooting Tasks
        • GWAS with GENESIS workflows
        • Annotation Explorer
      • Terra
        • Account Setup
          • Billing
          • Managing Costs
        • Workspace Setup
          • Data Storage & Management
          • Collaboration
          • Security
        • Bring Data into a Workspace
          • Bring in Data from Gen3
          • From Terra’s Data Library
          • Use Your Own Data with Terra
        • Run Analyses
          • Batch Processing with Workflows
          • Interactive Analysis
          • Genome-Wide Association Studies
        • Troubleshooting & Support
      • Dockstore
        • Launch workflows with BioData Catalyst
        • Discover our catalog
        • Intro to Docker, WDL, CWL
        • Dockstore Forum
        • Contribute to the community
    • Community Tools & Integration
      • Bring Your Own Tool(s)
        • BYOT Glossary
        • Working with Docker
        • Creating, testing & scaling WDL workflows
        • Creating, testing & scaling CWL workflows
        • Version Control, Publishing & Validation of Workflows
        • Advanced Topics
      • Import a Dockstore App With Seven Bridges
    • Writing BDC into a Grant Proposal
    • Incurring Cloud Costs
    • Release Notes
      • 2025-04-15 BDC Release Notes
      • 2025-01-15 NHLBI BioData Catalyst Ecosystem Release Notes
      • 2024-10-21 NHLBI BioData Catalyst Ecosystem Release Notes
      • 2024-07-02 NHLBI BioData Catalyst Ecosystem Release Notes
      • 2024-04-01 NHLBI BioData Catalyst Ecosystem Release Notes
      • 2024-01-08 NHLBI BioData Catalyst Ecosystem Release Notes
      • 2023-10-04 NHLBI BioData Catalyst Ecosystem Release Notes
      • 2023-07-11 NHLBI BioData Catalyst Ecosystem Release Notes
      • 2023-04-04 BioData Catalyst Ecosystem Release Notes
      • 2023-01-09 BioData Catalyst Ecosystem Release Notes
      • 2022-10-03 BioData Catalyst Ecosystem Release Notes
      • 2022-07-11 BioData Catalyst Ecosystem Release Notes
      • 2022-04-04 BioData Catalyst Ecosystem Release Notes
      • 2022-01-24 BioData Catalyst Ecosystem Release Notes
      • 2021-10-04 BioData Catalyst Ecosystem Release Notes
      • 2021-07-09 BioData Catalyst Ecosystem Release Notes
      • 2021-04-02 BioData Catalyst Ecosystem Release Notes
      • 2021-01-15 BioData Catalyst Ecosystem Release Notes
      • 2020-10-23 BioData Catalyst Ecosystem Release Notes
      • 2020-08-24 BioData Catalyst Ecosystem Release Notes
      • 2020-04-02 BioData Catalyst Ecosystem Release Notes
    • Data Versioning Release Notes
    • NIH RECOVER Release Notes
  • Tutorials: Videos & Modules
    • Seven Bridges Tutorials
      • Genetic Association Testing using GENESIS Workflows
      • Estimating and Managing Your Cloud Costs
    • Terra Tutorials
      • Getting Started with Gen3 Data on Terra Tutorial
      • Genome Wide Association Study with 1000 Genomes Data Tutorial
      • Genome Wide Association Study with TOPMed Data Tutorial
      • TOPMed Aligner, or, How to Import Data From Gen3 into Terra and Run a Workflow on It
  • Data Management
    • Data Management Strategy
    • Instructions for Data Submission to BDC
      • De-identification Readme
      • Data Dictionary Requirement
    • dbGaP Study Configuration Process for Submission of Data to BDC
Powered by GitBook
On this page
  • Overview
  • Tutorial Overview
  • Science overview
  • Data disclaimer
  • Step-by-step Instructions
  • Set up tutorial workspace
  • Follow-along worksheet
  • Step 1. Link your Terra profile with Gen3
  • Step 2. Access synthetic dataset on Gen3 platform
  • Step 3. Export data to workspace
  • Step 4. Consolidate and format exported Gen3 data
  • Step 5. Tidy up the consolidated table
  • Step 6. Analyze consolidated data ("2-Analyze Gen3 data" notebook)
  • Step 7. Run a workflow on genomic data
  • Next steps and additional resources
  • Gen3 data overview
  • Authors

Was this helpful?

Export as PDF
  1. Tutorials: Videos & Modules
  2. Terra Tutorials

Getting Started with Gen3 Data on Terra Tutorial

A step-by-step tutorial to help researchers working with Gen3 data in Terra

PreviousTerra TutorialsNextGenome Wide Association Study with 1000 Genomes Data Tutorial

Last updated 3 years ago

Was this helpful?

Overview

This has been developed to help researchers working with Gen3 data in Terra. The tutorial includes step-by-step instructions to cover the entire analysis process:

  • How to link your Gen3 and Terra accounts

  • How to find and export Gen3 data to a project workspace

  • Understanding the structure of Gen3 data in Terra and how to make exported data more familiar

  • How to run an interactive analysis in a Jupyter notebook

  • How to configure and run a bulk analysis using workflow tools

Reading the documentation and following the step-by-step instructions will help familiarize you with Terra. The content on this page is available in a in Terra.

Tutorial Overview

1. Link Gen3 and Terra accounts

2. Access dataset in Gen3

3. Export data to workspace

4. Consolidate Gen3 data tables

5. Tidy workspace tables

6. Analyze data in a notebook

7. Run a bulk analysis workflow

2 minutes

3 minutes

10 minutes

20 minutes

2 minutes

20 minutes

5 minutes

Tutorial flow diagram

A few caveats

  • You should run through the steps below in order

  • Note that you do not have to go through all of the steps in one sitting, only in the correct order

  • Time estimates are conservative - many steps will take less time

How long does the tutorial take and how much does it cost?

Working through all of the tutorial steps should take just over an hour total (you can break it up and do at different times). The total computation cost of running both notebooks and the workflow will be less than $1.00.

Science overview

The tutorial models steps typical in a research journey using synthetic Gen3 phenotypic data and public-access 1,000 genomes genomic data. The data are stored in the BioData Catalyst instance on the Gen3 platform. Phenotypic data are contained in the lab results and demographic nodes in Gen3 (and corresponding tables in Terra). The data are exported to Terra in the form of fifteen individual data tables connected to each other by UUIDs. The notebook section of the tutorial demonstrates how to select a subset of Gen3 tables you need - based on the data required for your analysis - and consolidate into one table with familiar subject ID conventions. This process of converting Gen3 data to a format that is more intuitive in Terra will be the same no matter what your final analysis. Section 7 covers how to run a bulk analysis workflow on the genomic data exported from Gen3. You would use the same process to do any bulk analysis on genomic data, such as aligning reads or calling variants.

Data disclaimer

Step-by-step Instructions

Set up tutorial workspace

Estimated time: one minute

Before you begin, you will need to create your own editable copy (clone) of this WORKSPACE to work in.

  1. Click on the round circle with three dots in the upper right corner of the workspace.

  2. Select "Clone" from the dropdown menu.

  3. Rename your new workspace something memorable. It may help to write down or memorize the name of your workspace.

  4. Choose your billing project (could be free credit).

  5. Select the Gen3 BioData Catalyst authorization domain.

  6. Click the "Clone Workspace" button to make your own copy. You'll be automatically redirected to your new workspace.

  7. Open a second tab of these instructions, to make it easier to follow along.

Follow-along worksheet

Step 1. Link your Terra profile with Gen3

Estimated time: 2 minutes

Step 2. Access synthetic dataset on Gen3 platform

Estimated time: 5 minutes

Step 3. Export data to workspace

Estimated time: 10 minutes - mostly exporting time

In this step you will practice exporting a Gen3 dataset to Terra. This process will be the same no matter what Gen3 data you are using. As part of this step, you will also learn about the structure of the exported data in a Terra workspace.

Step 4. Consolidate and format exported Gen3 data

Estimated time: 20 minutes

In this step, you will run the notebook 1-Consolidate-Gen3-data-in-Terra-tutorial. The tutorial notebook resolves some of the challenges of Gen3 data exported to Terra. In particular, it consolidates the tables that contain phenotypic data into one Terra table and reorganizes the tables and subject IDs to be more familiar. The notebook is also a good introduction to interactive analysis in a Jupyter notebook.

Notebook outline

  • Set up the notebook environment

  • Discover all of the metadata available from the Gen3 export

  • Define and use a set of custom functions to

    • Merge the metadata in a consolidated Pandas dataframe

    • Reformat a bit of the Gen3 graph language to be more familiar TOPMed nomenclature

    • Push the consolidated dataframe to a workspace data table

Step 5. Tidy up the consolidated table

Estimated time: 3 minutes

The consolidated table generated in Step 4 has almost 50 columns, including all of the Gen3 metadata from phenotypic data nodes. To help make the table more manageable to read without scrolling left and right, you can select the columns you want to see when you expand the table. This will allow you to focus on the information you need, without having to scroll across tens of columns. Your selections will persist even when you leave the workspace.

Step 6. Analyze consolidated data ("2-Analyze Gen3 data" notebook)

Estimated time: 15 minutes

In this step you will run the notebook 2-Analyze-consolidated-data-tutorial, which pulls data from the consolidated table into the notebook environment and does a bit of plotting analysis.

Notebook outline

  • Import the consolidated phenotypic data from the new data table to the notebook compute environment

  • This is where you would begin your project-specific analysis (in R or Python). We will explore the data by generating a few plots in the notebook.

Step 7. Run a workflow on genomic data

Estimated time: 5 minutes

Another mode of analysis in the Terra platform is bulk analysis with workflows. This mode is what you would typically use to do automated analysis steps such as processing genomic data. Bulk analysis uses workflow tools, which you can add to your workspace from another workspace, or by searching Dockstore or the Broad Methods Repository. In this step you will learn how to run a bulk analysis workflow to process genomic data in the the synthetic dataset from the Gen3 platform.

Next steps and additional resources

Congratulations! You have accessed and analyzed your first Gen3 data in Terra. Now that you have the basics of how to work with Gen3 data, here are some additional BioData Catalyst and TOPMed resources to explore.

  • Getting Started with Gen3

  • GWAS template

  • TOPMed alignment workflows

Gen3 data overview

To understand what you will see when you export Gen3 data to a Terra workspace, it helps to first understand how the data is organized in Gen3. A diagram of the graph structure of the synthetic data in Gen3 is below on the left. Each box is a Gen3 data node and each line represents UUIDs connecting the metadata of the two nodes. Administrative nodes are purple, clinical data are in blue boxes, and genomic data are green boxes. When you export a project's data to a Terra workspace, each node in the graph gets imported as its own data table. Your data page will look like the figure on the right. The synthetic data in this tutorial is captured in 15 tables, corresponding to the 15 nodes in the graph structure.

Each data table includes all of the metadata fields associated with the Gen3 node, in alphabetical order. The lab_results table, for example, looks like this:

You might see from this example several features specific to the Gen3 data structure that can be challenging to work with in Terra:

  • There are 15 separate tables from the export - not all of them include data that you will need

    • The phenotypic data we want to use for this tutorial are in two separate tables (demographic and lab_results)

  • The first column in each table is not the subject ID you are used to, is not human-readable, and is not used across levels. Rather, a UUID connects each node to the node directly above it in the graph.

    • The familiar subject ID (called `submitter ID in Gen3 tables) is buried (in the 13th column of the lab_results table, for example)

    • The subject ID is only indirectly connected to the phenotypes in the lab_results and demographics tables with two different UUIDs (i.e. two separate lines connect their nodes)

  • There are 16 columns (metadata fields) in the lab_results table, and the five or six we want to include in our analysis aren't easy to pick out because they are far to the right of the table

Authors

Workspace author: Allie Hajian Notebook code, edits and bug-squashing: Michael Baumann, Beth Sheets

This workspace was completed under the NHLBI BioData Catalyst project.

Workspace change log

Date

Change

Author

March 22, 2020

initial draft

Allie

The tutorial uses a synthetic, public-access dataset derived from 1,000 Genomes data and stored in the BioData Catalyst instance in Gen3. For more information about how the synthetic data were generated, see .

Work from your copy of this workspace following these . You can skim summaries of each step below.

If you haven't linked your Terra and Gen3 accounts yet, you need to follow these before you can analyze genomic data from Gen3 inside your Terra workspace.

This section walks through the process of accessing the public-access tutorial dataset (tutorial-synthetic_data_set_1) by going to the .

workspace: This workspace includes the terra_data_util notebook, a notebook for unarchiving tar files, and eventually a vcf merge notebook

To learn more about the structure of Gen3 data, see .

this workspace
step-by-step instructions
steps
Gen3 platform
BioData Catalyst Utilities Collection
this article
hands-on tutorial
tutorial workspace