An introduction to terms used in this document
Each platform within BioData Catalyst may define these terms slightly differently. You will find more specific definitions in the relevant sections of this BYOT document. Below, we introduce a few terms before you get started.
App: 1) In Seven Bridges, an app is a general term to refer to both tools and workflows. 2) App may also refer to persistent software that is integrated into a platform.
Container: A standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another (for example, Docker).
Command: In workflow writing, the command specifies the literal command line run (akin to what you would run in the terminal).
Common Workflow Language (CWL): Simple scripting language for describing computational workflows for performing sequential operations on data. CWL is a way to describe command-line tools and connect them together to create workflows. CWL is well suited for describing large-scale workflows in cluster, cloud, and high-performance computing environments where tasks are scheduled in parallel across many nodes.
Docker: Software for running packaged, portable units of code and their dependencies in the same way across many computers. See also Container.
Dockerfile: A text document that contains all the commands a user could call on the command line to assemble an image.
Dockstore: An open platform developed by the Cancer Genome Collaboratory and used by the GA4GH for sharing Docker-based tools described with the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow (NFL).
Image: In the context of containers and Docker, this refers to the resting state of the software.
Instance: Refers to a virtual server instance from a public or private cloud network.
Task: In workflow writing, the term task encompasses all of the information necessary to execute a command, such as specifying input/output files and parameters.
Tool: In CWL, the term tool specifies a single command. This specification is not as discrete in other languages such as WDL.
Workflow Description Language (WDL): Way to specify data processing workflows with a human-readable and writable syntax. Define complex analysis tasks, chain them together in workflows, and parallelize their execution.
Workflow: A sequence of processes, usually computational in this context, through which a user may analyze data.
Workspace: Areas to work on/with data within a platform. Examples: projects within Seven Bridges.
Wrapping: The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.
Virtual Machine (VM): An isolated computing environment with its own operating system.
For other terms, you can reference the BioData Catalyst glossary.
Docker technology has revolutionized reproducibility by creating a fast, portable, easily shareable method to generate the exact compute environment, with all dependencies and configurations, that was used to run a tool or workflow.
Below, we provide resources for finding public Docker images or creating your own image to use with your analysis. Docker is commonly used by software engineers, and learning material on the internet may be overly complex for the researcher use case. We compiled learning materials from each platform within BioData Catalyst to help you get started using Docker specifically for bioinformatics pipelines.
We highly recommend that users begin with an official or maintained image (for example, from BioContainers) to ensure they are using secure software.
Below, we have compiled learning resources from various sources to help you get started learning Docker:
Dockstore’s Getting Started with Docker
Version control is vital to reproducibility because it tracks the changes that you or your contributors make to code and documentation. We suggest using GitHub to host your workflows in an open-access repository so that the research community can benefit from your work, and your work can benefit from the community's feedback. Below, find steps for getting started with GitHub:
Upload your descriptor file (workflow), parameter files, and source code to a GitHub repository (see an )
We encourage users to publish their tools and workflows on Dockstore so that they can be used by the greater scientific community. Dockstore features allow users to build their pipelines to be open, reusable, and interoperable. Publishing your work in this way will enhance the value of your work and the resources available to the scientific community.
Here is how to get started sharing your work on Dockstore:
and link your account to external services, such as GitHub
Link your Dockstore account to your to display your scientific identity.
Create an , invite your collaborators, and promote your work in collections
Dockstore can help you create more accessible and transparent data science methods in your scientific publications. In this section, we want to provide some examples of FAIR workflows the community has shared.
Authors: Beth Sheets (UC Santa Cruz, Genomics Institute), Dave Roberson (Seven Bridges)
Contributors: Dan Vicente (Seven Bridges), Alison Leaf (Seven Bridges), Stephanie Gogarten (Fellow), Sheila Gaynor (Fellow), Jean Monlong (Fellow), Kenny Westermann (Fellow)
Reproducibility is one of the biggest challenges facing science. Several issues associated with reproducibility have been well summarized in the . The BioData Catalyst ecosystem promotes FAIR and reproducible analyses by leveraging Docker-based reproducible tools in two descriptor languages. The Common Workflow Language (CWL) is currently supported in Seven Bridges workspaces, while the Workflow Description Language (WDL) is currently supported in Terra workspaces.
A combination of software containers (like Docker) and workflow languages wraps your bioinformatics pipeline, making your analysis portable across local and cloud execution environments. This allows researchers to reproduce your method(s) with exactly the same software, dependencies, and configurations. For example, BioData Catalyst researchers have been able to reuse CWL and WDL versions of a Genome-Wide Association pipeline developed by the TOPMed Data Coordinating Center in multiple cloud workspaces.
There are hundreds of CWL and WDL pipelines already available for researchers to run on BioData Catalyst. Both CWL pipelines and WDL pipelines can be discovered in Dockstore’s open-access catalog and then executed in the workspace environments. In addition, the Seven Bridges platform hosts CWL workflows directly on the platform in the Public Apps Gallery, and the Terra platform hosts WDL workflows in the Broad Methods Repository. However, many researchers will want to work with pipelines that do not have CWL or WDL versions yet, or will need to make changes to existing CWL and WDL pipelines. This guide describes how to “Bring Your Own Tool” to the BioData Catalyst ecosystem.
Whether you are working with WDL or CWL tools, all users will begin by creating a containerized version of their pipeline. There are multiple methods users take to create these tools, but we simplify this process by walking through two example paths. For researchers utilizing the Terra workspace environment, we describe how to start by writing your WDL tool locally and then configuring and testing in the cloud workspace. For researchers performing analyses on the Seven Bridges workspace environment, we describe how to use the Seven Bridges platform web composer and web editor features to add a CWL wrapper to the Docker image. You may find it easiest to start with learning one language (for example, the one that works in your chosen workspace environment) and then expanding to multiple languages if needed.
We believe we can enhance the security and reusability of tools and workflows we share through open, community-driven best practices that exemplify the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles. We have established best practices, published in Dockstore. We ask that users try to implement these practices in the workflows that they share with the community.
In this , the researchers provided transparent methods by citing immutable DOI archives of their container-based workflows, and also shared a collection in the Broad Institute's organization on Dockstore. This collection includes several workflows, a README, and a link to a public workspace tutorial in the Terra cloud environment where users can learn exactly how to recreate their methods.
In this the authors shared their pipelines written in the Workflow Description Language in this collection on Dockstore, and created a where the community can recreate an exact analysis and figure from their publication.
In this section, the reader will learn how to use the Terra and Dockstore platforms for the creation of WDL workflows for analysis and sharing with the scientific community. Below we have compiled community and BioData Catalyst resources to help users get started learning WDL to create their own workflows.
Workspace: A dedicated space where you and collaborators can access and organize the same data and tools and run analyses together. They can include: data, notebooks, and workflows. They can be public or controlled access.
Workflow: Chains of connected tools to accomplish a full analysis. Tools are often connected in a specific way to enable maximum computational efficiency and are also constructed to accomplish a specific analysis goal. A workflow typically describes a full analysis (e.g. variant discovery, differential expression, or multiple variant association tests).
Workflow Description Language (WDL): A community-driven standard for describing data analysis pipelines that is easily portable across different computing environments. It is the language currently used to run batch processes in Terra, which uses Cromwell as an executor. Like other descriptor languages, it is paired with Docker containers and can execute pipelines written in any language (bash, R, Python, etc.). Below, we have compiled community and BioData Catalyst resources to help users get started learning WDL to create their own tools and workflows.
Authoring:
SublimeText offers a nice balance between usability and editing features.
Syntax highlighters: Plugins that enable syntax highlighting (i.e., coloring code elements based on their function) for supported text editors. Syntax highlighting has been developed for SublimeText, Visual Studio, vim, and IntelliJ.
Visualization: Pipeline Builder is a web-based tool that creates an interactive graphical representation of any workflow written in WDL; also includes WDL code generation functionality.
Validation & inputs: WOMTool is a Java command-line tool co-developed with WDL that performs utility functions, including syntax validation and generation of input JSON templates. See the doc entries on validation and inputs for quickstart instructions.
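As a rough sketch of the validation and inputs steps, the helper below assembles the two WOMTool invocations. The jar path womtool.jar and the workflow file name hello.wdl are placeholders, and the commands are only built here, not executed; running them requires Java and a downloaded WOMTool jar.

```python
import shlex


def womtool_command(action, wdl_path, jar_path="womtool.jar"):
    """Build the argv list for a WOMTool call.

    jar_path and wdl_path are placeholder locations; adjust them to
    wherever the jar and the workflow live on your machine.
    """
    if action not in ("validate", "inputs"):
        raise ValueError(f"unsupported action: {action}")
    return ["java", "-jar", jar_path, action, wdl_path]


# Syntax validation of the workflow file.
validate_cmd = womtool_command("validate", "hello.wdl")
# Generation of an input JSON template (printed to standard output by WOMTool).
inputs_cmd = womtool_command("inputs", "hello.wdl")

print(shlex.join(validate_cmd))
print(shlex.join(inputs_cmd))
```

Either list can be passed to subprocess.run(..., check=True) once the jar is in place; the output of the inputs command is typically redirected to a JSON file and then filled in by hand.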
Running tools:
Terra is a cloud-based analysis platform for running workflows written in WDL via Cromwell on Google Cloud; it is open to the public and offers sophisticated data and workflow management features. In this BYOT document, we walk through all of the steps to run a workflow in Terra.
Wdl_runner is a lightweight command-line workflow submission system that runs WDLs via Cromwell on Google Cloud.
Below are a few learning resource tutorials we have compiled from various sources:
Open WDL’s Learn WDL offers a comprehensive set of exercises for users that are just learning WDL.
Getting Started with WDL from Dockstore is an introductory guide.
These Dockstore training exercises along with this accompanying video provide more complex examples using common bioinformatics tools.
Once you are more familiar with writing workflows, we suggest you continue with WDL Best Practices from Dockstore.
You can start developing your WDL workflow locally with Dockstore’s CLI and a small test dataset. This route allows you to debug syntax errors while avoiding cloud costs. Once your workflow is debugged, you can launch in a cloud environment to test for permissions errors and scaling issues. The Dockstore CLI automatically installs the Cromwell execution engine for running WDL workflows locally.
Instructions:
Install Dockstore’s CLI locally
Install Docker locally
This example WDL exercise using Dockstore’s CLI steps through creating a basic WDL workflow locally and pushing the tool to GitHub, triggering an automated build on Quay.io.
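To make the local development loop more concrete, here is a minimal sketch that writes a tiny WDL workflow and a matching inputs JSON file, then assembles the Dockstore CLI command for a local launch. The workflow itself (hello, say_hello), the file names, and the ubuntu:20.04 image are illustrative assumptions, not part of the exercise linked above. The script does not call the CLI, so it runs without Dockstore or Docker installed.

```python
import json
import tempfile
from pathlib import Path

# A deliberately small WDL 1.0 workflow: one task that echoes a greeting.
HELLO_WDL = '''version 1.0

task say_hello {
  input {
    String name
  }
  command <<<
    echo "Hello, ~{name}"
  >>>
  output {
    String greeting = read_string(stdout())
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}

workflow hello {
  input {
    String name
  }
  call say_hello { input: name = name }
  output {
    String greeting = say_hello.greeting
  }
}
'''


def write_example(directory):
    """Write the WDL and its inputs JSON, and return the launch command."""
    directory = Path(directory)
    wdl_path = directory / "hello.wdl"
    json_path = directory / "hello.json"
    wdl_path.write_text(HELLO_WDL)
    # Workflow-level inputs are keyed as "<workflow name>.<input name>".
    json_path.write_text(json.dumps({"hello.name": "BioData Catalyst"}, indent=2))
    launch_cmd = [
        "dockstore", "workflow", "launch",
        "--local-entry", str(wdl_path),
        "--json", str(json_path),
    ]
    return wdl_path, json_path, launch_cmd


workdir = tempfile.mkdtemp()
wdl_path, json_path, launch_cmd = write_example(workdir)
print(" ".join(launch_cmd))
```

Running the printed command in a terminal with the Dockstore CLI and Docker installed should execute the workflow locally through Cromwell, which is a cheap way to catch syntax errors before moving to the cloud.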
To transition your workflow from local development to Terra, a typical approach is to make the workflow available in a GitHub repository and then build its Docker image on Quay.io. Quay.io integrates with Dockstore and GitHub by automatically building on each GitHub push. The Quay.io build can then be registered on Dockstore. You can follow the steps for linking your Dockstore account to external services like Quay.io in this document.
You can find more information about this process in the section Version Control, Publishing, and Validation of Workflows below.
Now that you have a workflow ready for running in a cloud environment, you can port your workflow into Terra in two ways. First, if you are already using Dockstore and GitHub for version control, you can navigate to your Dockstore WDL workflow and use the "Launch with NHLBI BioData Catalyst" button. The article Importing a Dockstore workflow into Terra provides instructions for selecting a workflow in Dockstore and then importing that workflow into Terra.
Figure 1. Dockstore’s “Launch with BioData Catalyst” button.
If you haven’t published your workflow to Dockstore, you can also upload a workflow directly into Terra using the Broad Methods Repository. The Broad Methods Repository can easily be found in the “Add workflows” section of your Terra workspace. Similar to Dockstore, this repository hosts many WDL workflows that have been created by the Terra community. These workflows are only public once a user has signed into Terra.
Figure 2. In Terra workspaces, when you are in the "Workflows" tab you can “Find Additional Workflows” from Dockstore and the Broad Methods Repository.
Once your workflow is in Terra, you may want to check out some of the learning resources below for configuring, troubleshooting, and optimizing your workflow. There are likely additional configuring and troubleshooting steps needed for getting your workflow up and running on larger datasets hosted in the cloud.
Terra also offers several tips for reducing the cost of running a workflow. These approaches include deleting intermediate files and returning only final outputs to limit storage costs. Virtual machines can also be configured with lower-cost settings, such as preemptible machines, which trade a lower price for the risk of interruption. Cost optimizations are described at the following links:
Once your workflow is working as expected, we ask that you publish your work to share with the research community. You can find resources for how to publish your work on GitHub and Dockstore in the section below titled Version Control, Publishing, and Validation of Workflows.
In this section, the reader will first learn how the Seven Bridges Software Development Kit (SDK) enables the easy creation of CWL workflows that will benefit the greater BioData Catalyst community. We will review the benefits of the SDK and then walk through an example of workflow creation, testing, and scaling. There are also links to more detailed resources for further reading.
Wrapping: The process of describing a command-line tool or custom script in Common Workflow Language (CWL) so that it can be easily run in multiple cloud environments, dynamically scale compute requirements based on inputs, and be inserted into distinct analysis pipelines.
Tool: A CWL description of a reusable piece of software that performs one specific function. An example is the bwa read alignment tool which can be applied to multiple workflows in different contexts. Tools need to have several things specified in the CWL description that includes Docker image, Linux base command, input files (or parameters), and output files. Tools can be used in completely disparate workflows and can be thought of as building blocks.
Workflow: Chains of connected tools to accomplish a full analysis. Tools are often connected in a specific way to enable maximum computational efficiency and are also constructed to accomplish a specific analysis goal. Whereas tools describe a single software step (e.g. alignment or read sorting), a workflow describes a full analysis (e.g. variant discovery, differential expression, or multiple variant association tests).
App: An app is a general term to refer to both tools and workflows. The platform user will typically only see the term “app” in reference to mixed groups of tools and workflows, such as in the Public Apps Gallery, the Apps collection tab, or within a workspace.
Throughout this guide, it will be useful for the reader to refer to our documentation found for each section. For the Seven Bridges Software Development Kit documentation, please see the following:
NHLBI BioData Catalyst Powered by Seven Bridges (Seven Bridges) provides a full Software Development Kit (SDK) that enables the BioData Catalyst community to easily create CWL apps that can be tested and scaled up to production level directly on the platform. Once validated by the user, these workflows can be exported and published on Dockstore so that other users can search for and find them.
The SDK consists of a tool editor and a pipeline editor. Both are based on the open-source project Rabix, a portmanteau of "Reproducible Analyses for Bioinformatics" (for more information, see ). The goal of the SDK is to guide the user through the process of creating fully functional analytical pipelines that can be tested, scaled up to population-scale analysis, and shared with the research community. The SDK also has built-in version control at the tool and workflow level to enable the full reproducibility of previous versions.
The Tool Editor guides the user through the creation of a portable CWL description by linking a pre-built Docker image (see section Working with Docker) to a command line or script that will be run inside the container. The above image shows the tool wrapping process. The Tool Editor enables users to easily create CWL by filling out the GUI template (Figure 4). This simplifies the technical aspects of this process and makes it as easy as possible for users to get their tools set up on the platform. The CWL code can also be edited directly in the tool editor if that is desired. For users working with JavaScript, JavaScript dynamic expressions can be tested without having to leave the tool editor.
The Workflow Editor enables users to create full pipelines by linking together multiple discrete tools. The workflow editor is a drag-and-drop visual interface that makes it easy to build even the most complex pipelines.
Before we dive into more detail on how to use the Tool Editor and the Workflow Editor, it is important to understand the distinction between tools and workflows. The distinction is only present in the CWL, and it is an important one. Wrapping a tool requires knowledge of Docker and Linux command lines. The Tool Editor helps the user get past even the most technical and dynamic of command-line and script issues, with the goal being the creation of a reusable and shareable component. For building workflows, the Docker and Linux command lines are abstracted away to enable less-technical users to build full analytical pipelines. We can refer to this as “separation of concerns.” Each tool should be designed to handle one functional aspect, and therefore will be able to operate in multiple analytical pipelines. For example, BWA-MEM or the Samtools suite can be used in both DNA analysis workflows and RNA analysis workflows.
Linking together multiple tools into a full computational pipeline can have many advantages. It is important to understand the benefits of building a full and robust workflow that includes each of the analysis steps. The most apparent benefit is that it makes the entire pipeline easier to share, as there will only be one resulting CWL file. The CWL file is a human-readable text file that can be distributed digitally in multiple ways, such as through Dockstore, Seven Bridges, GitHub, or over email. A novice user can easily reproduce the full analysis using one file. They can also use the SDK to make adjustments if necessary, or even decompose the workflow to get at the constituent tools for use in other contexts (more on this below in the section Version Control, Publishing, and Validation of Workflows). The Seven Bridges platform has built-in optimizations to execute a workflow for maximum efficiency and cost savings. For example, workflows only save final output files back to the project. Since intermediate files from earlier steps in the workflow are not saved, they do not accumulate cloud storage costs, saving funds that would otherwise be used for intermediate file object storage. Users can still make use of intermediate files for subsequent reruns of a task by simply turning on “memoization” for that task, and intermediate files will be reused where appropriate.
Finally, linking multiple tools together also has the added benefit of increasing computational efficiency. When running workflows, multiple tools can use the same compute instance if multiple CPU cores are available. This saves time and funds and increases the ability to run jobs in parallel with no additional configurations.
In the following sections, we will build the workflow in the above image. Here, we can visually see the importance of creating a workflow: running each of these tools separately would require more steps from the user and require more unnecessary data to be moved back and forth from the cloud computational instance to the user’s workspace. Therefore, running as a single workflow achieves the best efficiency.
Figure 6 shows all the options available when creating a project on Seven Bridges including selecting the Billing Group. If used conservatively, the NHLBI BioData Catalyst pilot funding is adequate to cover the costs associated with developing a tool or workflow on the platform.
For the purposes of this tutorial, we will create a Next-Generation Sequencing (NGS) alignment Quality Control (QC) workflow as an example problem. BioData Catalyst hosts data from TOPMed and TOPMed studies generally have the most up-to-date alignments to HG38. Therefore, for this example tutorial we will (1) create a pipeline that can be used to make sure these CRAMs have high-quality reads, and (2) perform alignment read depth QC. We will also show how to bring a new tool to the platform that will combine the outputs of the previous tools.
Researchers should outline their pipeline as individual steps. These steps should correspond to individual software executables (e.g., bwa, samtools) or scripts (e.g., R, Python, shell).
A great place to outline your tool is in your development project description, shown below:
Tools from the Seven Bridges Public Apps Gallery can be easily imported directly into your project. These apps have been validated and optimized for the cloud. By re-using existing tools, the development time is dramatically reduced.
To use the SDK tool editor to wrap MultiQC, we will follow these steps in the development project:
Step 1: In the development project, click on the Apps tab and select “Add app.”
Step 2: On the next screen, select “Create a Tool.”
Step 3: Name your tool “MultiQC” and create a version CWL 1.0 tool. This will automatically take you to the visual Tool Editor.
Step 4: To complete our wrapping of MultiQC, we need to fill in the Docker Image, Base Command, Input Ports, and Outputs sections in the Visual Editor.
The Docker image/repository is quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0.
For this example, the base command (see screenshot below) is simply: multiqc (see Creating a tool from a script for how this will look for your own custom script).
For this example, we just need one input port which is an array of our quality control files from the upstream apps. We do not need any additional parameters for this example. If you are wrapping your own custom script, you can configure multiple input ports of different types.
Please see Figure 8 for where to fill in the details.
Be sure to create an input of type “array” which has items of type “File.” The MultiQC executable does not require that inputs of different types be prefaced with any flags or indicators. When wrapping an executable that requires distinguishing inputs (e.g. “--arg1 --arg2”), multiple inputs would need to be added.
The tool editor gives the user a preview of the resulting command line:
This is a relatively simple one and the arguments are just the full paths in the input file array. Features such as file metadata and JavaScript expression can be used to create a more sophisticated Linux command line for other tools.
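As an illustration of how such a preview is assembled, the sketch below joins a base command with the paths of an input file array. The lowercase executable name multiqc and the two file paths are assumptions made up for this example; this is not how the Tool Editor is implemented internally.

```python
import shlex


def preview_command(base_command, input_paths):
    """Join a base command and positional file paths into one shell line."""
    return shlex.join([base_command, *input_paths])


# Hypothetical QC files produced by upstream apps.
example_inputs = [
    "/path/to/sample_fastqc.zip",
    "/path/to/sample.wgs_metrics.txt",
]

preview = preview_command("multiqc", example_inputs)
print(preview)
```

Because the inputs need no flags, every array element simply becomes one more positional argument; tools that require prefixes such as --arg1 would need one input port per flag.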
The output port is the comprehensive MultiQC report (Figure 9) that the software tool creates. For this output, we will use a wildcard inside the “glob” field. The “glob” is simply how the tool will select which files to keep from the current working directory. The user can create as many output ports as necessary. Since MultiQC is a simplification and summarization tool we will only have one HTML report which can be acquired using a glob of “*.html.”
The output “glob” field (like all fields in the tool editor) has the ability to use JavaScript expressions to dynamically search for files in a very specific manner such as the full path that is based on the input files or to scan through a deep folder structure.
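The effect of a glob can be sketched with shell-style pattern matching over the files left in the working directory. The file names below are invented for illustration, and Python's fnmatch module is used here only as a stand-in for the platform's own glob handling.

```python
from fnmatch import fnmatchcase


def select_outputs(filenames, pattern):
    """Return the file names that match a shell-style glob pattern."""
    return [name for name in filenames if fnmatchcase(name, pattern)]


# Hypothetical contents of the working directory after the tool finishes.
working_dir = ["multiqc_report.html", "multiqc_data.json", "run.log"]

kept = select_outputs(working_dir, "*.html")
print(kept)
```

Only the HTML report is captured by the output port; the other files are left behind in the container's working directory.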
The completed tool will look like this:
Finally, we should consider the Computational Resources section of the Tool Editor. Here it is important to specify the minimum compute required. Because our example tool is not computationally intensive we can require a minimal amount of RAM and CPU. Through JavaScript dynamic expressions we can customize these computational requirements to scale with either the input file sizes or user input parameters. The Seven Bridges job scheduler will select the appropriate cloud instance(s) based on these constraints. In the next section, we will discuss how the user can also specify a suggested AWS or GCP cloud instance by adding “hints.”
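Pulling together the Docker image, base command, input port, output glob, and resource settings described above, a CWL description of this tool might look roughly like the sketch below. It is built as a Python dictionary and printed as JSON (CWL documents can be written in JSON as well as YAML). The port names in_reports and out_html_report, the lowercase multiqc base command, and the 1 core / 1000 MiB resource values are assumptions for illustration; the CWL that the Tool Editor actually generates may differ.

```python
import json

multiqc_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": ["multiqc"],
    "requirements": [
        {
            # Docker image named in the walkthrough above.
            "class": "DockerRequirement",
            "dockerPull": "quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0",
        },
        {
            # Assumed minimal resources; ramMin is expressed in MiB.
            "class": "ResourceRequirement",
            "coresMin": 1,
            "ramMin": 1000,
        },
    ],
    "inputs": {
        # One input port: an array of files passed as positional arguments.
        "in_reports": {"type": "File[]", "inputBinding": {"position": 1}},
    },
    "outputs": {
        # One output port: the HTML report selected with a glob.
        "out_html_report": {"type": "File", "outputBinding": {"glob": "*.html"}},
    },
}

print(json.dumps(multiqc_tool, indent=2))
```

Saving the printed JSON to a file gives a starting point that can be compared against the code view in the Tool Editor.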
Tools can be tested by themselves, but in some cases, it makes more sense to test the tool in the context of the complete workflow. For simplicity, we will add MultiQC to the workflow and use the output from the tool upstream of MultiQC in the workflow to test the MultiQC tool.
Finding appropriate test data is key to testing tools and workflows. Wherever possible, we recommend working with data that is small in size when testing tools and workflows. Small in size generally means a small size on disk and usually correlates to a smaller number of NGS reads, a smaller number of variants, or a smaller number of samples. Sometimes this small data is referred to as a “toy” dataset or a “subset” of data. Testing the tool wrapper will generally require multiple test runs using this small data set.
Make sure to copy both testing files to your development project. Because these files are hosted in the Public Files Gallery, linking these files to your project will not lead to any additional storage costs.
The next step is to add our tool to a workflow with the upstream QC tools. We will use the pipeline editor to do this.
Step 1. The first step is to create a new “blank canvas” in the workflow editor. Go to the Apps tab in the development project and click on “Add app.” This time select “Create a workflow”.
Step 2. After creating the workflow, the next screen is a blank canvas in the Workflow Editor. From here, we can add multiple QC apps that are compatible with MultiQC to the canvas directly from the Public Apps Gallery. Search for “fastqc” and then for “picard alignment metrics” and use the mouse to drag them onto the workflow canvas.
Add the MultiQC CWL tool from the current project in the “My Projects” tab. The screen will look like this:
The next step is to connect the apps together. The nodes that are displayed on the workflow canvas represent apps. The input and output ports are represented by small circles on the perimeter of the node. Circles on the left of the node represent input ports whereas the ones on the right indicate output ports. Use the mouse to connect the wireframe together. The completed workflow will look like this:
This simple workflow highlights several advantages of the workflow editor. Notice that the “input file” input port node which represents an aligned bam file for this workflow feeds into both the Picard CollectWgsMetrics and FastQC tools. This means that the end-user only needs to specify this input one time when running the task and that the alignment metrics and FastQC tools will run in parallel, conserving time and funds.
Take note that one of the FastQC outputs is not connected to any downstream tool. In this case, this output port creates a zip file of the raw report data. However, the MultiQC tool does not need this output file and therefore it does not need to be moved or persisted outside the Docker container of the FastQC tool. In addition, although the CollectWgsMetrics and FastQC nodes feed into MultiQC, they do not have output nodes for themselves. This workflow has only 1 output which is the MultiQC HTML report. The intermediate reports will be saved temporarily in case the tool needs to be re-run, but will not persist in the file page of the user’s workspace, highlighting another way the workflow conserves funding.
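The wiring described above can be approximated in CWL as follows. This is a hedged sketch expressed as a Python dictionary: the file names passed to run, the step names, and the output port names (report, metrics, out_html_report) are hypothetical, and real versions of these tools may require extra inputs such as a reference genome. A small helper checks that every connection points at a workflow input or a declared step output.

```python
qc_workflow = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    # Needed because the MultiQC input merges two upstream sources.
    "requirements": [{"class": "MultipleInputFeatureRequirement"}],
    "inputs": {"input_file": "File"},
    "steps": {
        "fastqc": {
            "run": "fastqc.cwl",
            "in": {"input_fastq": "input_file"},
            # The unused zip output is simply not listed here.
            "out": ["report"],
        },
        "collect_wgs_metrics": {
            "run": "collect_wgs_metrics.cwl",
            "in": {"input_bam": "input_file"},
            "out": ["metrics"],
        },
        "multiqc": {
            "run": "multiqc.cwl",
            "in": {
                "in_reports": {
                    "source": ["fastqc/report", "collect_wgs_metrics/metrics"],
                    "linkMerge": "merge_flattened",
                }
            },
            "out": ["out_html_report"],
        },
    },
    # The only workflow output is the MultiQC report.
    "outputs": {
        "html_report": {"type": "File", "outputSource": "multiqc/out_html_report"}
    },
}


def dangling_sources(workflow):
    """List connections that do not point at a known input or step output."""
    known = set(workflow["inputs"])
    for step_name, step in workflow["steps"].items():
        known.update(f"{step_name}/{port}" for port in step["out"])
    used = []
    for step in workflow["steps"].values():
        for binding in step["in"].values():
            source = binding["source"] if isinstance(binding, dict) else binding
            used.extend(source if isinstance(source, list) else [source])
    used.extend(out["outputSource"] for out in workflow["outputs"].values())
    return [source for source in used if source not in known]


print(dangling_sources(qc_workflow))
```

Note how input_file appears as the source of two steps, which is what lets the end-user supply it once, and how only one entry appears under outputs.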
We can test the workflow directly on the platform. Seven Bridges has multiple reference files in the Public Gallery. A completed task of the workflow will have one interactive report as an output. See the completed task in Figure 12.
The output of MultiQC is an interactive report that is viewable directly on the platform:
For more information about the workflow editor and for other examples please refer to the following materials in the Seven Bridges documentation:
There are two easy ways to scale your workflows on Seven Bridges. We refer to these as “batching” and “scattering.” A batch analysis separates files into batches or groups when running your analysis. The batching is done according to the specified metadata criteria of your input files or according to a file list you provide. A batch analysis can be defined at run time with no special setup in the tool. However, each batch is run on a separate instance. For more information on batch analyses, please see here.
Using our NGS QC workflow example we can create a batch task for every file in the input file port, as shown in Figure 14. This batch task will create 1 child task for each input bam file.
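The grouping logic behind batching can be sketched in a few lines. This is plain Python for illustration only, not the Seven Bridges API; the file names and the sample_id metadata field are made-up examples.

```python
from collections import defaultdict


def plan_batches(files, batch_key):
    """Group file names by one metadata field; each group is one child task."""
    batches = defaultdict(list)
    for entry in files:
        batches[entry["metadata"][batch_key]].append(entry["name"])
    return dict(batches)


# Hypothetical input files with metadata attached.
files = [
    {"name": "sampleA.bam", "metadata": {"sample_id": "A"}},
    {"name": "sampleB.bam", "metadata": {"sample_id": "B"}},
    {"name": "sampleA_rerun.bam", "metadata": {"sample_id": "A"}},
]

# Batching by metadata: one child task per distinct sample_id.
by_sample = plan_batches(files, "sample_id")
# Batching by file: one child task per input file.
by_file = {entry["name"]: [entry["name"]] for entry in files}

print(by_sample)
print(len(by_file))
```

With three input files, batching by file yields three child tasks, while batching by sample_id yields two because the two files for sample A travel together.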
We can use another method called “scattering,” which operates inside a single task. This means that a workflow can utilize multiple cores in a single compute instance, which is often more efficient than using multiple instances. Scattering can only be used at the workflow level, not at the tool level. To use scattering, we need to edit our workflow. We make the input file of type “array” and the array type “file” as shown in Figure 15.
Click on each of our QC tools and select “Step.” In the “step” panel select the appropriate input to scatter on. In this case, we scatter by “input_bam” for the Picard Collect WGS Metrics tool and by “input_fastq” for the FastQC tool. When the workflow is run, the user can select multiple input files and each of them will be processed in parallel on separate compute nodes.
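Conceptually, scattering maps one step over every element of an input array and gathers the results in the same order. The sketch below imitates that behavior with a local thread pool; run_qc is a stand-in for a scattered job and the file names are invented. It says nothing about how the Seven Bridges scheduler actually places jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def run_qc(bam_name):
    """Stand-in for one scattered job; returns the name of its output."""
    return f"{bam_name}.qc.txt"


def scatter(step, inputs, max_parallel=4):
    """Apply a step to each input in parallel and keep the input order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(step, inputs))


outputs = scatter(run_qc, ["a.bam", "b.bam", "c.bam"])
print(outputs)
```

The gathered result is an array with one entry per input, which is why the workflow input (and anything downstream of a scattered step) has to be of an array type.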
When running your custom workflow, you can define computational requirements so that there is enough memory and enough CPUs to run multiple jobs in parallel. For example, if your tool requires 4 GB of RAM and you select an instance with 8 CPUs and 32 GB of RAM, you will see 8 jobs running in parallel when you run your workflow, as shown in Figure 18.
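The arithmetic behind this example can be checked with a short sketch. It assumes each job needs one CPU (the text only states the memory requirement) and ignores any overhead the platform may reserve on the instance; the function name is ours.

```python
def max_parallel_jobs(instance_cpus, instance_ram_gb, job_cpus, job_ram_gb):
    """Return how many jobs fit on one instance at the same time.

    The count is limited by whichever resource runs out first.
    """
    return min(instance_cpus // job_cpus, instance_ram_gb // job_ram_gb)


# 8 CPUs and 32 GB of RAM; each job needs 4 GB of RAM and (assumed) 1 CPU.
print(max_parallel_jobs(8, 32, job_cpus=1, job_ram_gb=4))  # 8
```

If the tool needed 2 CPUs per job instead, the same instance would fit only 4 jobs at once, because CPUs would become the limiting resource.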
If you have followed this guide, your tool has now been wrapped and added to a workflow. It has also been tested on a “toy” dataset and validated against real data for your project. In the next sections, you will learn how to export CWL from the platform, create a GitHub repository for version control, and also how to publish to Dockstore.
Not all tools need to be command-line binaries. Many researchers bring their shell, Python, and R scripts to Seven Bridges, and this is all possible using the Seven Bridges Tool Editor.
For example, if we wanted to run an R script using the GENESIS Docker image we could do that without having to recreate the Docker image. To run a specific script that is not included in the Docker image, use the “File requirements” field shown in Figure 19. Specify a name for your file and paste in the file contents.
Then enter the name of the file in the “Base command” section along with the command required to execute it (e.g. Rscript):
Similarly, if you were using a Python script, the base command would be “python.” Using the “File requirements” section of the Tool Editor, we can execute any type of script without having to create a new Docker image.
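In CWL terms, a file created this way is typically expressed with an `InitialWorkDirRequirement`, which writes the script into the working directory before the base command runs. The sketch below builds such a tool description as a Python dictionary and serializes it to JSON. It rests on several assumptions: the mapping from the “File requirements” field to `InitialWorkDirRequirement` is our reading, the script name `analysis.R` and its contents are made up, and the Docker image name is a placeholder to be replaced with the real GENESIS image.

```python
import json

# Hypothetical script contents, pasted in as plain text.
r_script = 'print("hello from R")\n'

tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "requirements": {
        # Placeholder: substitute the actual GENESIS image here.
        "DockerRequirement": {"dockerPull": "REPLACE_WITH_GENESIS_IMAGE"},
        # Creates analysis.R in the working directory at run time.
        "InitialWorkDirRequirement": {
            "listing": [{"entryname": "analysis.R", "entry": r_script}]
        },
    },
    # The command to execute, followed by the name of the created file.
    "baseCommand": ["Rscript", "analysis.R"],
    "inputs": {},
    "outputs": {},
}

print(json.dumps(tool, indent=2))
```

For a Python script, only the `entryname`, the `entry` contents, and the first element of `baseCommand` would change.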
Before getting started with this section, we recommend first creating a development workspace (called a project on Seven Bridges) to house the new tool(s) and workflow(s) while they are being created and tested. Please see the Seven Bridges documentation for detailed instructions about how to create projects.
It is important to determine if there are tools (steps in your outline) that have already been wrapped and are published in either Dockstore or the Seven Bridges Public Apps Gallery. This reduces the time spent porting analytical workflows to the cloud because these steps will not have to be re-validated or re-benchmarked. It also promotes developing with “separation of concerns,” which means that every tool (step) can be versioned, tested, and improved without adversely affecting the entire workflow. We recommend searching both the Seven Bridges Public Apps Gallery and Dockstore to find validated and reusable components.
Searching these sources reveals that CWL tools are already available for FastQC and Picard CollectWgsMetrics. Therefore, the only tool that needs to be wrapped is MultiQC.
As described previously, wrapping is the process of describing a command-line tool or script in CWL so that it can be run in a cloud environment, either by itself or in a larger workflow. Let us proceed with wrapping our MultiQC tool. The first step is to either (1) create a Docker image from a Dockerfile or (2) find one available to us on a hosted repository. Next, we should run the Docker container locally to test out the basic command line. If a Docker image was previously created and hosted for us, we can use that to save time. On the other hand, if the software programs are not available in a single Docker image, you will need to build one. Please see the section on creating images for more information.
For this example, a MultiQC Docker image is already available to us, hosted at quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0.
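A local test of the basic command line might look like the following sketch, which only assembles and prints a `docker run` command rather than executing it. The host directory `/path/to/qc_reports` and the `/data` mount point are hypothetical, and we assume the image lets us invoke `multiqc` directly as the container command.

```python
import shlex

IMAGE = "quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0"


def multiqc_docker_command(host_dir, container_dir="/data"):
    """Build a docker run command that runs MultiQC on a mounted directory."""
    return [
        "docker", "run", "--rm",              # remove the container when it exits
        "-v", f"{host_dir}:{container_dir}",  # mount the reports into the container
        "-w", container_dir,                  # run from the mounted directory
        IMAGE,
        "multiqc", ".",                       # scan the current directory for reports
    ]


cmd = multiqc_docker_command("/path/to/qc_reports")
print(" ".join(shlex.quote(part) for part in cmd))
```

Copying the printed command into a terminal with Docker installed, after pointing it at a directory of FastQC or Picard outputs, is a quick way to confirm the command line works before wrapping it in CWL.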
Seven Bridges hosts a number of test files in the Public Files Gallery that range from reference files to test-sized input data. Users can link these test files to their project instead of uploading their own test data, which avoids storage costs. One of these test files is the human whole-exome sequencing sample, which we will use for testing here. You can view the provenance of this test file by clicking on the file name and then on “metadata”:
This file is a “subset” of the whole-exome data and is, therefore, a good choice for testing, since the cost per analysis will be less than if data from all chromosomes were used. Tools should always be tested separately. When wrapping a tool, the user should obtain access to data they can use for testing. The above metadata description also tells us the exact reference that was used for the read alignment. Seven Bridges also provides the same reference file in the Public Files Gallery.
This was a brief introduction to the powerful scatter ability of the workflow editor. Please see the Seven Bridges documentation for more information.
(CWL solutions available for the same exercise)
If you want your workflow to be available to both the WDL and CWL communities, you can use conversion tools to aid in the process. It is best practice to verify that the conversion was done correctly.
If you are interested in using Docker on your high-performance computing (HPC) cluster, you may find the Singularity tool helpful.
You can use the workflow runner Toil for large parallelized CWL jobs in the AWS and/or Google clouds, locally, on Kubernetes, and/or on high-performance computing clusters. Toil is built for researchers and should run any CWL 1.0 workflow from Dockstore at scale. Toil also has some experimental support for WDL.