Working with Docker

Docker technology has revolutionized reproducibility by providing a fast, portable, easily shareable way to recreate the exact compute environment, with all dependencies and configurations, that was used to run a tool or workflow.

Below, we provide resources for finding public Docker images or creating your own image to use with your analysis. Docker is commonly used by software engineers, and learning material on the internet may be overly complex for the researcher use case. We compiled learning materials from each platform within BDC to help you get started using Docker specifically for bioinformatics pipelines.

Available Docker images

We highly recommend users begin with an official or maintained image (for example, from BioContainer) to ensure you are using secure software. Maintained image sources include:

  • Docker Official Images
  • UW-GAC TOPMed images
  • BioContainer images for 1K+ bioinformatics tools
  • The BioData Catalyst Image Registry

Introductory learning material

Below, we have compiled learning resources from various sources to help you get started learning Docker:

  • Dockstore
    • Getting Started with Docker
    • Making a safe and secure custom Docker image
  • Terra
    • Install Docker and test that it works
    • Publish a Docker container to Google Container Registry
    • Make a Docker container the easy way
  • Seven Bridges
    • Docker basics
    • Install Docker
    • Manage Docker repositories
  • Docker documentation
    • Best Practices for Writing Dockerfiles
    • Build Context tips
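If you do need a custom image, extending a maintained base image in a Dockerfile is usually the safest route. Below is a minimal, hypothetical sketch (the helper script, repository name, and tag are placeholders, not part of any BDC resource):

```dockerfile
# Start from a maintained, pinned BioContainers image rather than building
# from scratch; pinning an exact tag keeps the environment reproducible.
FROM quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0

# Add a hypothetical helper script used by your pipeline.
COPY summarize.sh /usr/local/bin/summarize.sh
RUN chmod +x /usr/local/bin/summarize.sh

# Build and smoke-test locally before pushing to a registry:
#   docker build -t myuser/multiqc-extras:0.1 .
#   docker run --rm myuser/multiqc-extras:0.1 multiqc --version
```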

    Creating, Testing, and Scaling CWL Workflows

    Section overview

    In this section, the reader will first learn how the Seven Bridges Software Development Kit (SDK) enables the easy creation of CWL workflows that will benefit the greater BDC community. We will review the benefits of the SDK and then walk through an example of workflow creation, testing, and scaling. There are also links to more detailed resources for further reading.

    Helpful Terms to Know on Seven Bridges

    Wrapping: The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.

    Tool: A CWL description of a reusable piece of software that performs one specific function. An example is the bwa read alignment tool, which can be applied to multiple workflows in different contexts. The CWL description of a tool specifies several things, including the Docker image, the Linux base command, input files (or parameters), and output files. Tools can be used in completely disparate workflows and can be thought of as building blocks.

    Workflow: Chains of connected tools to accomplish a full analysis. Tools are often connected in a specific way to enable maximum computational efficiency and are also constructed to accomplish a specific analysis goal. Whereas tools describe a single software step (e.g. alignment or read sorting), a workflow describes a full analysis (e.g. variant discovery, differential expression, or multiple variant association tests).

    App: An app is a general term to refer to both tools and workflows. The platform user will typically only see the term “app” in reference to mixed groups of tools and workflows, such as in the Public Apps Gallery, the Apps collection tab, or within a workspace.

    Introduction to the Seven Bridges Software Development Kit

    Throughout this guide, it will be useful to refer to the platform documentation for each section. For the Seven Bridges Software Development Kit, see https://sb-biodatacatalyst.readme.io/docs/sdk-overview.

    BDC Powered by Seven Bridges (BDC-Seven Bridges) provides a full Software Development Kit (SDK) that enables the BDC community to easily create CWL apps that can be tested and scaled up to production level directly on the platform. Once validated by the user, these workflows can be exported and published on Dockstore so they can become searchable and findable by other users.

    The SDK consists of a tool editor and a pipeline editor. Both are based on the open-source project Rabix, a portmanteau of "Reproducible Analyses for Bioinformatics" (for more information, see rabix.io). The goal of the SDK is to guide the user through the process of creating fully functional analytical pipelines that can be tested, scaled up to population-scale analysis, and shared with the research community. The SDK also has built-in version control at the tool and workflow level to enable full reproducibility of previous versions.

    The Tool Editor guides the user through the creation of a portable CWL description by linking a pre-built Docker image (see the section Working with Docker) to a command line or script that will be run inside the container. Figure 3 shows the tool wrapping process. The Tool Editor enables users to create CWL easily by filling out the GUI template (Figure 4), which simplifies the technical aspects of the process and makes it as easy as possible to get tools set up on the platform. The CWL code can also be edited directly in the tool editor if desired. For users working with JavaScript, dynamic expressions can be tested without having to leave the tool editor.

    Learn more via this tutorial.

    The Workflow Editor enables users to create full pipelines by linking together multiple discrete tools. The workflow editor is a drag-and-drop visual interface that makes it easy to build even the most complex pipelines.

    Before we dive into more detail on how to use the Tool Editor and the Workflow Editor, it is important to understand the distinction between tools and workflows. The distinction is only present in the CWL, and it is an important one. Wrapping a tool requires knowledge of Docker and Linux command lines. The Tool Editor helps the user get past even the most technical and dynamic of command-line and script issues, with the goal being the creation of a reusable and shareable component. For building workflows, the Docker and Linux command lines are abstracted away to enable less-technical users to build full analytical pipelines. We can refer to this as “separation of concerns.” Each tool should be designed to handle one functional aspect, and therefore will be able to operate in multiple analytical pipelines. For example, BWA-MEM or the Samtools suite can be used in both DNA analysis workflows and RNA analysis workflows.

    Linking together multiple tools into a full computational pipeline has many advantages, so it is important to understand the benefits of building a full, robust workflow that includes each of the analysis steps. The most apparent benefit is that it makes the entire pipeline easier to share, as there will be only one resulting CWL file. The CWL file is a human-readable text file that can be distributed digitally in multiple ways, such as through Dockstore, Seven Bridges, GitHub, or over email. A novice user can easily reproduce the full analysis using one file. They can also use the SDK to make adjustments if necessary, or even decompose the workflow into its constituent tools for use in other contexts (more on this below in the section Version Control, Publishing, and Validation of Workflows). The Seven Bridges platform has built-in optimizations to execute a workflow for maximum efficiency and cost savings. For example, workflows only save final output files back to the project. Since intermediate files from earlier steps in the workflow are not saved, they do not accumulate cloud storage costs, saving funds that would otherwise be spent on intermediate file object storage. Users can still make use of intermediate files for subsequent reruns of a task by simply turning on "memoization" for that task; intermediate files will then be re-used where appropriate.

    Finally, linking multiple tools together also has the added benefit of increasing computational efficiency. When running workflows, multiple tools can use the same compute instance if multiple CPU cores are available. This saves time and funds and increases the ability to run jobs in parallel with no additional configurations.

    In the following sections, we will build the workflow shown in Figure 5. Here, we can see visually why creating a workflow matters: running each of these tools separately would require more steps from the user and would move unnecessary data back and forth between the cloud computational instance and the user's workspace. Running a single workflow therefore achieves the best efficiency.

    Creating a Development Project

    Before getting started with this section, we recommend first creating a development workspace (called a project on Seven Bridges) to house the new tool(s) and workflow(s) while they are being created and tested. Please see the Seven Bridges Getting Started Guide for detailed instructions on how to create projects.

    Figure 6 shows all the options available when creating a project on Seven Bridges including selecting the Billing Group. If used conservatively, the BDC pilot funding is adequate to cover the costs associated with developing a tool or workflow on the platform.

    Outlining Your Workflow

    For the purposes of this tutorial, we will create a Next-Generation Sequencing (NGS) alignment Quality Control (QC) workflow as an example problem. BDC hosts data from TOPMed, and TOPMed studies generally have the most up-to-date alignments to HG38. For this example tutorial we will (1) create a pipeline that can be used to make sure these CRAMs have high-quality reads, and (2) perform alignment read depth QC. We will also show how to bring a new tool to the platform that combines the outputs of the previous tools.

    Researchers should outline their pipeline as individual steps. These steps should correspond to individual software executables (e.g., bwa, samtools) or scripts (e.g., R, Python, shell).

    A great place to outline your tool is in your development project description, as shown in Figure 7.

    It is important to determine whether there are tools (steps in your outline) that have already been wrapped and published in either Dockstore or the Seven Bridges Public Apps Gallery. This reduces the time needed to port analytical workflows to the cloud because these steps will not have to be re-validated or re-benchmarked. It also promotes developing with "separation of concerns": every tool (step) can be versioned, tested, and improved without adversely affecting the entire workflow. We recommend searching the Seven Bridges Public Apps Gallery and the BioData Catalyst Organization on Dockstore to find validated and reusable components.

    Tools from the Seven Bridges Public Apps Gallery can be easily imported directly into your project. These apps have been validated and optimized for the cloud. By re-using existing tools, the development time is dramatically reduced.

    Searching the Public Apps Gallery reveals that CWL tools are already available for FastQC and Picard CollectWgsMetrics. Therefore, the only tool that needs to be wrapped is MultiQC.

    Wrapping Your Tool

    As described previously, wrapping is the process of describing a command-line tool or script in CWL so that it can be run in a cloud environment, either by itself or in a larger workflow. Let us proceed with wrapping our MultiQC tool. The first step is to either (1) create a Docker image from a Dockerfile or (2) find one available on a hosted repository. Next, we should run the image locally to test out the basic command line. If a Docker image has already been created and hosted for us, we can use it to save time. On the other hand, if the software programs are not available in a single Docker image, you will need to build one. Please see the section Working with Docker for more information on creating images.

    For this example, a MultiQC Docker image is available via biocontainers.pro, with the image hosted at quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0.
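    To sanity-check the image locally before wrapping it, you can pull it and invoke the executable inside the container; this is a quick smoke test of the basic command line:

```bash
# Pull the exact image version referenced above
docker pull quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0

# Run the MultiQC executable inside the container to confirm it works
docker run --rm quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0 multiqc --help
```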

    To use the SDK tool editor to wrap MultiQC, we will follow these steps in the development project:

    Step 1: In the development project, click on the Apps tab and select “Add app.”

    Step 2: On the next screen, select “Create a Tool.”

    Step 3: Name your tool "MultiQC" and create it as a CWL 1.0 tool. This will automatically take you to the visual Tool Editor.

    Step 4: To complete our wrapping of MultiQC, we need to fill in the Docker Image, Base Command, Input Ports, and Outputs sections in the Visual Editor (see Figure 8 for where to fill in the details).

    • The Docker image/repository is quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0.

    • For this example, the base command is simply multiqc (see Creating a tool from a script for how this will look for your own custom script).

    • For this example, we just need one input port which is an array of our quality control files from the upstream apps. We do not need any additional parameters for this example. If you are wrapping your own custom script, you can configure multiple input ports of different types.

    Be sure to create an input of type “array” which has items of type “File.” The MultiQC executable does not require that inputs of different types be prefaced with any flags or indicators. When wrapping an executable that requires distinguishing inputs (e.g. “--arg1 --arg2”), multiple inputs would need to be added.

    The tool editor gives the user a preview of the resulting command line:

    This command line is relatively simple: the arguments are just the full paths of the files in the input array. Features such as file metadata and JavaScript expressions can be used to create more sophisticated Linux command lines for other tools.

    • The output port is the comprehensive MultiQC report (Figure 9) that the software tool creates. For this output, we will use a wildcard inside the “glob” field. The “glob” is simply how the tool will select which files to keep from the current working directory. The user can create as many output ports as necessary. Since MultiQC is a simplification and summarization tool we will only have one HTML report which can be acquired using a glob of “*.html.”

    The output "glob" field (like all fields in the tool editor) can use JavaScript expressions to select files dynamically, for example by building a full path from the input files or by scanning through a deep folder structure.

    The completed tool will look like Figure 10.
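    For readers who want to see the CWL text itself, the following is a rough sketch of what the editor generates for this tool. Port names such as qc_reports are illustrative, and the platform adds further fields of its own:

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [multiqc]

requirements:
  - class: DockerRequirement
    dockerPull: quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0

inputs:
  qc_reports:              # array input port holding the upstream QC files
    type: File[]
    inputBinding:
      position: 1          # paths are appended directly; no flags required

outputs:
  multiqc_html:            # the single HTML report selected by the glob
    type: File
    outputBinding:
      glob: "*.html"
```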

    Finally, we should consider the Computational Resources section of the Tool Editor. Here it is important to specify the minimum compute required. Because our example tool is not computationally intensive we can require a minimal amount of RAM and CPU. Through JavaScript dynamic expressions we can customize these computational requirements to scale with either the input file sizes or user input parameters. The Seven Bridges job scheduler will select the appropriate cloud instance(s) based on these constraints. In the next section, we will discuss how the user can also specify a suggested AWS or GCP cloud instance by adding “hints.”
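    In CWL terms, these minimums map to a ResourceRequirement, and a dynamic expression (which needs InlineJavascriptRequirement) is one way to scale them with the inputs. A hedged sketch:

```yaml
requirements:
  - class: InlineJavascriptRequirement
  - class: ResourceRequirement
    coresMin: 1
    # Illustrative only: grow the RAM floor (MiB) with the number of reports
    ramMin: $(1024 + inputs.qc_reports.length * 256)
```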

    Tools can be tested by themselves, but in some cases, it makes more sense to test the tool in the context of the complete workflow. For simplicity, we will add MultiQC to the workflow and use the output from the tool upstream of MultiQC in the workflow to test the MultiQC tool.

    Finding appropriate test data is key to testing tools and workflows. Wherever possible, we recommend working with data that is small in size when testing tools and workflows. Small in size generally means a small size on disk and usually correlates to a smaller number of NGS reads, a smaller number of variants, or a smaller number of samples. Sometimes this small data is referred to as a “toy” dataset or a “subset” of data. Testing the tool wrapper will generally require multiple test runs using this small data set.

    Seven Bridges hosts a number of test files in the Public Files Gallery, ranging from reference files to small test input data. Users can link these test files to their project instead of uploading their own test data, avoiding storage costs. One of these test files is the human whole-exome sequencing sample merged-normal.bam, which we will use for testing here. You can view the provenance of this test file by clicking on the file name and then on "metadata":

    This file is a "subset" of the whole-exome data and is therefore a good choice for testing, since the cost per analysis will be less than if data from all chromosomes were used. Tools should always be tested separately, and when wrapping a tool the user should obtain access to data they can use for testing. The above metadata description also tells us the exact reference that was used for the read alignment; Seven Bridges hosts the same reference file in the Public Files Gallery, called human_g1k_v37_decoy.fasta.

    Make sure to copy both testing files to your development project. Because these files are hosted in the Public Files Gallery, linking these files to your project will not lead to any additional storage costs.

    hashtag
    Extending Into A Workflow

    The next step is to add our tool to a workflow with the upstream QC tools. We will use the pipeline editor to do this.

    Step 1. The first step is to create a new “blank canvas” in the workflow editor. Go to the Apps tab in the development project and click on “Add app.” This time select “Create a workflow”.

    Step 2. After creating the workflow, the next screen is a blank canvas in the Workflow Editor. From here, we can add multiple QC apps that are compatible with MultiQC to the canvas directly from the Public Apps Gallery. Search for “fastqc” and then for “picard alignment metrics” and use the mouse to drag them onto the workflow canvas.

    Add the MultiQC CWL tool from the current project via the "My Projects" tab; the canvas will now show all three apps:

    The next step is to connect the apps together. The nodes displayed on the workflow canvas represent apps. The input and output ports are represented by small circles on the perimeter of each node: circles on the left of a node represent input ports, while those on the right indicate output ports. Use the mouse to connect the nodes together. The completed workflow will look like this:

    This simple workflow highlights several advantages of the workflow editor. Notice that the “input file” input port node which represents an aligned bam file for this workflow feeds into both the Picard CollectWgsMetrics and FastQC tools. This means that the end-user only needs to specify this input one time when running the task and that the alignment metrics and FastQC tools will run in parallel, conserving time and funds.

    Take note that one of the FastQC outputs is not connected to any downstream tool. This output port creates a zip file of the raw report data; the MultiQC tool does not need this file, so it does not need to be moved or persisted outside the Docker container of the FastQC tool. In addition, although the CollectWgsMetrics and FastQC nodes feed into MultiQC, they do not have output nodes of their own. This workflow has only one output, the MultiQC HTML report. The intermediate reports will be saved temporarily in case the tool needs to be re-run, but will not persist in the files page of the user's workspace, highlighting another way the workflow conserves funding.
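    As a sketch, the wiring described above corresponds to CWL along these lines. Step and port names are illustrative, and the run fields would point at the actual wrapped apps:

```yaml
cwlVersion: v1.0
class: Workflow

requirements:
  - class: MultipleInputFeatureRequirement   # lets MultiQC take two sources

inputs:
  input_file: File            # the aligned BAM, specified once by the user

steps:
  fastqc:
    run: fastqc.cwl
    in:
      input_file: input_file
    out: [report_html, report_zip]           # report_zip stays unconnected

  picard_wgs_metrics:
    run: picard_collect_wgs_metrics.cwl
    in:
      input_bam: input_file
    out: [metrics_txt]

  multiqc:
    run: multiqc.cwl
    in:
      qc_reports:
        source: [fastqc/report_html, picard_wgs_metrics/metrics_txt]
        linkMerge: merge_flattened
    out: [multiqc_html]

outputs:
  qc_report:                  # the only file persisted back to the project
    type: File
    outputSource: multiqc/multiqc_html
```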

    We can test the workflow directly on the platform. Seven Bridges provides multiple reference files in the Public Files Gallery. A completed task of the workflow will have one interactive report as an output; see the completed task in Figure 12.

    The output of MultiQC is an interactive report that is viewable directly on the platform:

    For more information about the workflow editor and for other examples, please refer to the following materials in the Seven Bridges documentation:

    • About the workflow editor

    • Create a workflow

    • Workflow editor tutorial

    Scaling up your analysis

    There are two easy ways to scale your workflows on Seven Bridges, which we refer to as "batching" and "scattering." A batch analysis separates files into batches, or groups, when running your analysis. The batching is done according to specified metadata criteria of your input files or according to a file list you provide. A batch analysis can be defined at run time with no special setup in the tool; however, each batch runs on a separate instance. For more information on batch analyses, please see the Seven Bridges documentation.

    Using our NGS QC workflow example, we can create a batch task over the input file port, as shown in Figure 14. This batch task will create one child task for each input BAM file.

    Another method, called "scattering," operates inside a single task. This means a workflow can utilize multiple cores on a single compute instance, which is often more efficient than using multiple instances. Scattering can only be used at the workflow level, not at the tool level. To use scattering, we need to edit our workflow: make the input of type "array" with items of type "File," as shown in Figure 15.

    Click on each of our QC tools and select “Step.” In the “step” panel select the appropriate input to scatter on. In this case, we scatter by “input_bam” for the Picard Collect WGS Metrics tool and by “input_fastq” for the FastQC tool. When the workflow is run, the user can select multiple input files and each of them will be processed in parallel on separate compute nodes.
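    In raw CWL, scattering a step looks roughly like the fragment below (it requires ScatterFeatureRequirement; names are illustrative):

```yaml
requirements:
  - class: ScatterFeatureRequirement

inputs:
  input_bams: File[]          # the input port changed to an array of files

steps:
  picard_wgs_metrics:
    run: picard_collect_wgs_metrics.cwl
    scatter: input_bam        # run this step once per element of the array
    in:
      input_bam: input_bams
    out: [metrics_txt]
```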

    This was a brief introduction to the powerful scatter ability of the workflow editor. Please see the scattering section of the Seven Bridges documentation for more information.

    When running your custom workflow, you can define computational requirements so that there is enough memory and there are enough CPUs to run multiple jobs in parallel. For example, if your tool requires 4 GB of RAM and you select an instance with 8 CPUs and 32 GB of RAM, you will see 8 jobs running in parallel, as shown in Figure 18.

    If you have followed this guide, your tool has now been wrapped and added to a workflow. It has also been tested on a “toy” dataset and validated against real data for your project. In the next sections, you will learn how to export CWL from the platform, create a GitHub repository for version control, and also how to publish to Dockstore.

    Creating a tool from a script

    Not all tools need to be command-line binaries. Many researchers bring their shell, Python, and R scripts to Seven Bridges, and this is all possible using the Seven Bridges Tool Editor.

    For example, if we wanted to run an R script using the GENESIS Docker image we could do that without having to recreate the Docker image. To run a specific script that is not included in the Docker image, use the “File requirements” field shown in Figure 19. Specify a name for your file and paste in the file contents.

    Then enter the name of the file in the “Base command” section along with the command required to execute it (e.g. Rscript):

    Similarly, if you were using a Python script, the base command would be python. Using the File requirements section of the Tool Editor, we can execute any type of script without having to create a new Docker image.
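    Under the hood, the "File requirements" field corresponds to CWL's InitialWorkDirRequirement, which stages your script into the working directory at runtime. A minimal sketch, assuming a GENESIS-style R image (the image tag and script contents are placeholders):

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [Rscript, analysis.R]

requirements:
  - class: DockerRequirement
    dockerPull: uwgac/topmed-master:latest   # placeholder GENESIS image tag
  - class: InitialWorkDirRequirement
    listing:
      - entryname: analysis.R
        entry: |
          # Placeholder contents pasted into the "File requirements" field
          args <- commandArgs(trailingOnly = TRUE)
          writeLines(paste("Processed:", args), "results.txt")

inputs:
  input_file:
    type: File
    inputBinding:
      position: 1

outputs:
  results:
    type: File
    outputBinding:
      glob: "results.txt"
```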

    Additional resources

    • Getting Started with CWL

    • CWL Best Practices

    • Training Exercises (CWL solutions available for the same exercises)

    • Seven Bridges introduction to tool wrapping

    • Task Troubleshooting Guide

    • Comprehensive tips for reliable and efficient analysis set-up

    Figures referenced in this section:
    Figure 3. Overview of BDC tool wrapping process
    Figure 4. Visual interface for creation and editing of CWL workflows
    Figure 5. Showing how tools can be connected together into one workflow
    Figure 6. Create a project on Seven Bridges
    Figure 7. Project description space can be used to outline a tool
    Figure 8. Fill out details for the MultiQC tool
    Figure 9. MultiQC report as output port
    Figure 10. Completed tool
    Figure 11. Computational Resources section of the Tool Editor
    Figure 12. Completed task page where users can access output files
    Figure 13. Interactive report from the MultiQC tool
    Figure 14. Creating a batch task
    Figure 15. Input file settings for scatter
    Figure 16. Scatter by input_bam for the Picard tool
    Figure 17. Scatter by input_fastq for the FastQC tool
    Figure 18. Scatter
    Figure 19. File Requirements box

    Version Control, Publishing & Validation of Workflows

    Version Control

    Version control is vital to reproducibility because it tracks the changes you or contributors make to your code and documentation. We suggest using GitHub to host your workflows in an open-access repository so that the research community can benefit from your work, and your work can benefit from feedback from the research community. Below, find steps for getting started with GitHub:

    • Create a GitHub account

    • Read A primer on GitHub

    • Upload your descriptor file (workflow), parameter files, and source code to a GitHub repository (see an example)

    Publish your workflow in Dockstore

    We encourage users to publish their tools and workflows on Dockstore so that they can be used by the greater scientific community. Dockstore features allow users to build their pipelines to be open, reusable, and interoperable. Publishing your work in this way will enhance the value of your work and the resources available to the scientific community.

    Here is how to get started sharing your work on Dockstore:

    • Create a Dockstore account and link it to external services, such as GitHub

    • Register your tool on Dockstore

    • Automatically sync updates between GitHub and Dockstore with GitHub Apps

    • Link your Dockstore account to your ORCID to display your scientific identity

    • Create an Organization, invite your collaborators, and promote your work in collections

    • Request a DOI for your Dockstore workflow

    Best Practices for Secure and FAIR tools and workflows

    We believe we can enhance the security and reusability of the tools and workflows we share through open, community-driven best practices that exemplify the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles. We have established best practices for secure and FAIR workflows, published in Dockstore. We ask that users try to implement these practices in the workflows they share with the community.

    Findable, Accessible, Interoperable, and Reusable (FAIR) examples from the community

    Dockstore can help you create more accessible and transparent data science methods in your scientific publications. In this section, we want to provide some examples of FAIR workflows the community has shared.

    In the 2020 Science paper by Lemieux, et al., the researchers provided transparent methods by citing immutable DOI archives of their container-based workflows, and also shared a collection in the Broad Institute's Viral Genomics organization on Dockstore. This collection includes several workflows, a README, and a link to a public workspace tutorial in the Terra cloud environment where users can learn exactly how to recreate their methods.

    In the 2020 Nature paper by Li, et al., the authors shared their pipelines, written in the Workflow Description Language, in the Cumulus collection on Dockstore, and created a public Terra workspace where the community can recreate an exact analysis and figure from their publication.


    BYOT Glossary

    An introduction to terms used in this document

    Each platform within BDC may have slight variations on these definitions; you will find more specific definitions within each platform's section of the BYOT document. Below, we highlight a few terms to introduce you to before you get started.

    • App: 1) In Seven Bridges, an app is a general term to refer to both tools and workflows. 2) App may also refer to persistent software that is integrated into a platform.

    • Container: A standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another (for example, Docker).

    • Command: In workflow writing, the command specifies the literal command line run (akin to what you would run in the terminal).

    • Common Workflow Language (CWL): Simple scripting language for describing computational workflows for performing sequential operations on data. CWL is a way to describe command-line tools and connect them together to create workflows. CWL is well suited for describing large-scale workflows in cluster, cloud, and high-performance computing environments where tasks are scheduled in parallel across many nodes.

    • Docker: Software for building and running packaged, portable units of code and their dependencies, which run the same way across many computers. See also Container.

    • Dockerfile: A text document that contains all the commands a user could call on the command line to assemble an image.

    • Dockstore: An open platform developed by the Cancer Genome Collaboratory and used by the GA4GH for sharing Docker-based tools described with the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL).

    • Image: In the context of containers and Docker, this refers to the resting state of the software.

    • Instance: Refers to a virtual server instance from a public or private cloud network.

    • Task: In workflow writing, the term task encompasses all of the information necessary to execute a command, such as specifying input/output files and parameters.

    • Tool: In CWL, the term tool specifies a single command. This specification is not as discrete in other languages such as WDL.

    • Workflow Description Language (WDL): A way to specify data processing workflows with a human-readable and writable syntax. WDL lets you define complex analysis tasks, chain them together in workflows, and parallelize their execution.

    • Workflow: A sequence of processes, usually computational in this context, through which a user may analyze data.

    • Workspace: Areas to work on/with data within a platform. Examples: projects within Seven Bridges.

    • Wrapping: The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.

    • Virtual Machine (VM): An isolated computing environment with its own operating system.

    For other terms, you can reference the BioData Catalyst glossary.

    Creating, testing & scaling WDL workflows

    In this section, the reader will learn how to use the Terra and Dockstore platforms for the creation of WDL workflows for analysis and sharing with the scientific community. Below we have compiled community and BDC resources to help users get started learning WDL to create their own workflows.

    Helpful definitions of terms when working on Terra

    Workspace: A dedicated space where you and collaborators can access and organize the same data and tools and run analyses together. They can include: data, notebooks, and workflows. They can be public or controlled access.

    Workflow: Chains of connected tools to accomplish a full analysis. Tools are often connected in a specific way to enable maximum computational efficiency and are also constructed to accomplish a specific analysis goal. A workflow typically describes a full analysis (e.g. variant discovery, differential expression, or multiple variant association tests).

    Workflow Description Language (WDL): A community-driven standard for describing data analysis pipelines that is easily portable across different computing environments. It is the language currently used to run batch processes in Terra, which uses Cromwell as an executor. Like other descriptor languages, it is paired with Docker containers and can execute pipelines written in any language (bash, R, Python, etc.).
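    To give a feel for the syntax before diving into the resources below, here is a minimal, self-contained WDL 1.0 workflow; the task, names, and container are illustrative:

```wdl
version 1.0

task count_lines {
  input {
    File input_file
  }
  command <<<
    wc -l < ~{input_file} > line_count.txt
  >>>
  output {
    File line_count = "line_count.txt"
  }
  runtime {
    docker: "ubuntu:20.04"   # every task runs inside a container
    memory: "1 GB"
    cpu: 1
  }
}

workflow count_lines_wf {
  input {
    File input_file
  }
  call count_lines { input: input_file = input_file }
  output {
    File line_count = count_lines.line_count
  }
}
```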

    WDL Toolkit: All the tools you need to write and run WDLs

    • Syntax: the WDL 1.0 Specification and WDL Script Components.

    • Authoring: SublimeText offers a nice balance between usability and editing features. Syntax highlighters (plugins that color code elements based on their function) are available for SublimeText, Visual Studio, vim, and IntelliJ.

    • Visualization: Pipeline Builder is a web-based tool that creates an interactive graphical representation of any workflow written in WDL; it also includes WDL code generation functionality.

    • Execution engine: Cromwell is an execution engine co-developed with WDL; it can be used on multiple platforms through pluggable backends and offers sophisticated pipeline execution features. See the doc entry on execution for quickstart instructions.

    • Validation & inputs: WOMTool is a Java command-line tool co-developed with WDL that performs utility functions, including syntax validation and generation of input JSON templates. See the doc entries on validation and inputs for quickstart instructions.

    • Running tools:

      • Terra is a cloud-based analysis platform for running workflows written in WDL via Cromwell on Google Cloud; it is open to the public and offers sophisticated data and workflow management features. In this BYOT document, we walk through all of the steps to run a workflow in Terra.

      • Dockstore's Command Line Interface runs WDL workflows locally via Cromwell.

      • Wdl_runner is a lightweight command-line workflow submission system that runs WDLs via Cromwell on Google Cloud.

      • wdlRunR is a Bioconductor package, developed by Sean Davis, to manage WDL workflows from within R (see its documentation).

    Learning Resources for writing WDL

    Below are a few learning resource tutorials we have compiled from various sources:

    • Open WDL's Learn WDL course offers a comprehensive set of exercises for users who are just learning WDL.

    • Getting Started with WDL from Dockstore is an introductory guide.

    • These Dockstore training exercises, along with this accompanying video, provide more complex examples using common bioinformatics tools.

    • Once you are more familiar with writing workflows, we suggest you continue with WDL Best Practices from Dockstore.

    Writing your workflow locally using Dockstore's Command Line Interface

    You can start developing your WDL workflow locally with Dockstore’s CLI and a small test dataset. This route allows you to debug syntax errors while avoiding cloud costs. Once your workflow is debugged, you can launch in a cloud environment to test for permissions errors and scaling issues. The Dockstore CLI automatically installs the Cromwell execution engine for running WDL workflows locally.

    Instructions (a local test run is sketched after this list):

    • Create a Dockstore account

    • Install Dockstore's CLI locally

    • Install Docker locally
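    Once the CLI and Docker are installed, a local test run looks roughly like this (file names are illustrative; the CLI drives Cromwell under the hood):

```bash
# Run a WDL workflow locally against a small JSON of test inputs
dockstore workflow launch --local-entry my_workflow.wdl --json test_inputs.json
```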

    Building and releasing your workflow

    To transition your workflow from local development to Terra, a typical approach is to make the workflow available in a GitHub repository and build from there. Quay.io integrates with Dockstore and GitHub by automatically building images upon GitHub pushes, and the Quay.io build can then be registered on Dockstore. You can follow the steps for linking your Dockstore account to external services like Quay.io in this document. This example WDL exercise using Dockstore's CLI steps through creating a basic WDL workflow locally and pushing the tool to GitHub, triggering an automated build on Quay.io.

    You can find more information about this process in the section Version Control, Publishing & Validation of Workflows.

    Testing and using your workflow in the cloud with Terra

    Now that you have a workflow ready to run in a cloud environment, you can port it into Terra in two ways. First, if you are already using Dockstore and GitHub for version control, you can navigate to your Dockstore WDL workflow and use the "Launch with NHLBI BioData Catalyst" button. The article Importing a Dockstore workflow into Terra provides instructions for selecting a workflow in Dockstore and conveniently importing it into Terra.

    Figure 1. Dockstore’s “Launch with BioData Catalyst” button.

    If you haven't published your workflow to Dockstore, you can also upload a workflow directly into Terra using the Broad Methods Repository, which can easily be found in the "Add workflows" section of your Terra workspace. Similar to Dockstore, this repository hosts many WDL workflows created by the Terra community. These workflows are only public once a user has signed into Terra.

    Figure 2. In Terra workspaces, when you are in the "Workflows" tab you can “Find Additional Workflows” from Dockstore and the Broad Methods Repository.

    Once your workflow is in Terra, you may want to check out some of the learning resources below for configuring, troubleshooting, and optimizing your workflow. There are likely additional configuration and troubleshooting steps needed to get your workflow up and running on larger datasets hosted in the cloud:

    • Terra's Quickstart Workflows Workspace

    • Configure a workflow to process your data

    • Default runtime attributes for workflows

    • Troubleshooting workflows: tips & tricks

    • Troubleshooting workflows: Fail for an unknown reason

    • Retrieve the time and cost of workflow jobs

    Optimizing workflows on Terra

    Terra also has several tips for reducing costs and promoting workflow efficiency. These approaches include deleting intermediate files and returning only final outputs to limit storage costs. Virtual machines can be configured with cost-reducing settings, such as preemptible machines, which trade lower cost for the potential of interruption. Cost optimizations are described at the following links:

    • Controlling cloud costs

    • Enabling call-caching & deleting intermediate files

    • Experimental Jupyter Notebook for estimating costs of workflows run in Terra

    Once your workflow is working as expected, we ask that you publish your work to share with the research community. You can find resources for publishing your work on GitHub and Dockstore in the section Version Control, Publishing & Validation of Workflows.


    Advanced Topics

    Conversion tools

    If you want your workflow to be available to both the WDL and CWL communities, you can use conversion tools to aid in the process. It is best practice to review whether the conversion was done correctly.

    • cwl2wdl

    • wdl2cwl

    Running workflows outside of the BioData Catalyst ecosystem

    If you are interested in using Docker on your High-Performance Computing cluster, you may find the Singularity tool helpful.

    You can use the Toil workflow runner for large parallelized CWL jobs in the AWS and/or Google clouds, locally, on Kubernetes, and/or on high-performance compute clusters. Toil is built for researchers and should run any CWL 1.0 workflow from Dockstore at scale. Toil also has some experimental support for WDL.

    Bring Your Own Tool(s)

    Reproducibility is one of the biggest challenges facing science. Several issues associated with reproducibility have been well summarized in the FAIR (Findable, Accessible, Interoperable, Reusable) Guiding Principles. The BDC ecosystem promotes FAIR and reproducible analyses by leveraging Docker-based reproducible tools in two descriptor languages: the Common Workflow Language (CWL) is currently supported in Seven Bridges workspaces, while the Workflow Description Language (WDL) is currently supported in Terra workspaces.

    A combination of software containers (like Docker) and workflow languages wraps your bioinformatics pipeline, making your analysis portable across local and cloud execution environments. This allows researchers to reproduce your methods with exactly the same software, dependencies, and configurations. For example, BDC researchers have been able to reuse CWL and WDL versions of a Genome-Wide Association pipeline developed by the TOPMed Data Coordinating Center in multiple cloud workspaces.

    There are hundreds of CWL and WDL pipelines already available for researchers to run on BDC. Both CWL and WDL pipelines can be discovered in Dockstore's open-access catalog and then executed in the workspace environments. In addition, the Seven Bridges platform hosts CWL workflows directly in its Public Apps Gallery, and the Terra platform hosts WDL workflows in the Broad Methods Repository. However, many researchers will want to work with pipelines that do not yet have CWL or WDL versions, or will need to make changes to existing CWL and WDL pipelines. This guide describes the steps for how to "Bring Your Own Tool" to the BDC ecosystem.


    Whether you are working with WDL or CWL tools, all users begin by creating a containerized version of their pipeline. There are multiple methods for creating these tools, but we simplify the process by walking through two example paths. For researchers using the Terra workspace environment, we describe how to write your WDL tool locally and then configure and test it in the cloud workspace. For researchers performing analyses in the Seven Bridges workspace environment, we describe how to use the platform's web composer and web editor features to add a CWL wrapper to a Docker image. You may find it easiest to start by learning one language (for example, the one that works in your chosen workspace environment) and then expand to multiple languages if needed.
