Instructions on transferring files between NHLBI BioData Catalyst Powered by Seven Bridges and NHLBI BioData Catalyst Powered by Terra
This tutorial guides users through the process of transferring files between the two workspace environments of NHLBI BioData Catalyst: NHLBI BioData Catalyst Powered by Seven Bridges and NHLBI BioData Catalyst Powered by Terra.
Most researchers select one of the workspaces as their primary analysis environment, and their labmates and collaborators typically work with them in the same workspace environment. However, there are cases where some collaborators work on Seven Bridges and others work on Terra. In these cases, researchers need to share data files between the two workspaces to facilitate collaboration. When researchers run analyses on Seven Bridges, the results, or derived data, are available only on Seven Bridges. Likewise, when researchers run analyses on Terra, the results are available only on Terra. This tutorial provides step-by-step guidance on how to share derived data between the workspace environments. These instructions can also be used to share private data that has been uploaded to Seven Bridges or Terra.
Both open access data and controlled access data can be shared across workspace environments. Importantly, if a researcher intends to share controlled access data, they must ensure that all recipients have the necessary dbGaP permissions for those files. In some cases, this may mean the researchers must be listed as collaborators on their respective dbGaP applications. These instructions are intended for sharing files under 1 terabyte (TB) in size. If you want to share data larger than 1 TB, contact the BioData Catalyst Help Desk to discuss your use case.
Transferring large amounts of data between cloud providers or regions is not recommended because of the cost; for example, moving data from AWS to Google Cloud costs approximately $100/TB.
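To make the arithmetic concrete, the sketch below estimates the cost of a transfer of a given size. It assumes the approximate $100/TB figure quoted above and treats 1 TB as 1000 GB; the function name estimate_transfer_cost_usd is hypothetical, and actual charges depend on the cloud provider's current pricing.

```shell
# Hypothetical helper: estimate cross-cloud transfer cost in US dollars.
# Assumes roughly $100 per TB (the approximate figure from the text)
# and 1 TB = 1000 GB.
estimate_transfer_cost_usd() {
  awk -v gb="$1" 'BEGIN { printf "%.2f\n", gb * 100 / 1000 }'
}

# Example: a 250 GB transfer.
estimate_transfer_cost_usd 250   # prints 25.00
```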
The first consideration is platform accounts. Moving data between Seven Bridges and Terra is currently a manual process and requires that one of the researchers involved in sharing has an account on both platforms. It is recommended that the recipient of the shared data be the person who holds accounts on both Seven Bridges and Terra.
Consider an example: Sebastian works on Seven Bridges and Teresa works on Terra. If Sebastian wants to share data with Teresa so that she can use the data on Terra, Teresa first needs to set up an account on Seven Bridges, so that she has accounts on both Terra and Seven Bridges. Sebastian then shares the data with Teresa on Seven Bridges by adding her, with Copy permissions, as a member of the project that contains the data he wants to share. For information on permissions, refer to the Seven Bridges Set permissions documentation. Once Teresa is added as a member of the project, she can move the data from the Seven Bridges project to a workspace on the Terra platform, following the instructions in the section titled Moving Data From Seven Bridges to Terra.
If Teresa (Terra) wants to share data with Sebastian (Seven Bridges) so that he can use the data on Seven Bridges, Sebastian first needs to create an account on Terra, so that he has accounts on both Seven Bridges and Terra. Teresa can then share the data with Sebastian on Terra by sharing the workspace that contains the data. For information on sharing workspaces, refer to the Terra How to share a workspace documentation.
To create a Terra account, refer to the Terra documentation.
To create a Seven Bridges account, refer to the Seven Bridges documentation. If you are new to Seven Bridges, you may find the Getting Started Guide helpful.
The second consideration is making sure the researcher moving data between the two workspaces has billing groups set up on both workspaces to cover cloud costs if necessary. Contact the BioData Catalyst Help Desk if you have questions about how to get a billing group on Seven Bridges or Terra.
The following steps describe how to use the Seven Bridges platform to pull data securely from a Terra workspace into a Seven Bridges project.
Refer to the Terra documentation for Moving data to/from a Google bucket (workspace or external), specifically the section Upload and download data files in a terminal using gsutil. This method:
Works well for transfers of all sizes.
Is ideal for large files or thousands of files.
Can be used for transfers between local storage and a bucket, between a workspace VM or persistent disk and a Google bucket, and between Google buckets (external and workspace).
You will use the terminal in JupyterLab on the Seven Bridges workspace environment because, although Seven Bridges can run on the Google Cloud Platform, the Google bucket API is not exposed in the same manner as it is on Terra. Therefore, start a JupyterLab notebook on Seven Bridges in the project that you want to be the destination for the copied data. Refer to the Seven Bridges documentation for launching JupyterLab notebooks on Seven Bridges and accessing the terminal in a JupyterLab environment.
After launching the notebook, the next step is to open the terminal and install gsutil, a Python program that lets end users add data to or copy data from a Google Cloud bucket. After opening the terminal, run the following commands:
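A minimal sketch of this step, assuming pip is available in the JupyterLab terminal. gsutil is published on PyPI, and gsutil config is the configuration command discussed in the paragraphs that follow; wrapping the two in a function named setup_gsutil is purely illustrative.

```shell
# Hypothetical wrapper around the two setup commands.
setup_gsutil() {
  # Install gsutil from PyPI (assumes pip is on the PATH); stop if this fails.
  pip install gsutil || return 1
  # Start the interactive configuration. It prints an authentication URL,
  # then prompts for the authentication code and the Google project ID.
  gsutil config
}

# To run both steps:
# setup_gsutil
```

Typing the two commands directly at the prompt works equally well; the function only keeps them together so the setup can be repeated in a new notebook session.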
Installing gsutil takes only a few seconds.
The config command provides a secure URL for you to open in the browser. You will authenticate with the same credentials that you used to log in to Terra. To copy the printed URL in the JupyterLab terminal, hold Shift and right-click, which displays the copy options. Copy the URL and open it in a new browser tab, which will direct you to Google authentication.
Google will provide an authentication code that you will copy and paste into the terminal.
Next, type in the Google project ID, which is found on the right side of the Terra Workspace Dashboard.
Next, run the command below to display the Google buckets that are attached to the project ID.
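A sketch of the listing command, with assumptions noted: gsutil ls with no bucket argument lists the buckets of a project, and its -p flag names the project explicitly instead of relying on the default set during configuration. The function name and the project ID in the example are placeholders.

```shell
# Hypothetical helper: list the buckets attached to a Google project.
# $1 is the project ID shown on the Terra Workspace Dashboard.
list_project_buckets() {
  gsutil ls -p "$1"
}

# Example (placeholder project ID):
# list_project_buckets my-terra-project-id
```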
The Google bucket name for the Terra project can be found in the lower right corner of the Terra Workspace.
Running gsutil ls on the Google bucket name will display the folders and files from the Terra workspace.
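For illustration, listing a bucket might look like the following. The function name and the bucket name in the example are placeholders; substitute the bucket name shown in your own Terra workspace.

```shell
# Hypothetical helper: list the top-level folders and files in a bucket.
# $1 is the bucket name without the gs:// prefix.
list_bucket_contents() {
  gsutil ls "gs://$1"
}

# Example (placeholder bucket name):
# list_bucket_contents fc-your-bucket-id
```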
To copy a folder to the Seven Bridges workspace environment, run the following command:
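A sketch of the copy command, under stated assumptions: the bucket and folder names are placeholders, the function name is hypothetical, and the default destination is the /sbgenomics/output-files folder that this tutorial names as the destination for data coming from Terra.

```shell
# Hypothetical helper: copy a folder from a Terra bucket into the
# Seven Bridges JupyterLab environment.
# $1 is the source folder URL; $2 is an optional destination folder.
pull_folder_from_terra() {
  # -R copies the folder recursively, including subfolders and files.
  gsutil cp -R "$1" "${2:-/sbgenomics/output-files}"
}

# Example (placeholder bucket and folder):
# pull_folder_from_terra gs://fc-your-bucket-id/my_folder
```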
There are a couple of important things to mention about the gsutil cp command. First, the -R flag for gsutil cp is used to recursively copy a folder and all of its subfolders and files. Most users will likely want to use the -R flag. This flag should be omitted if copying individual files or if using a wildcard such as “*.vcf”.
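The wildcard case can be sketched as follows. The function name and the bucket path are placeholders. Quoting the pattern is a precaution so that gsutil, rather than the local shell, expands the wildcard.

```shell
# Hypothetical helper: copy the files matching a pattern, without -R.
# $1 is a quoted gs:// pattern; $2 is an optional destination folder.
copy_matching_files() {
  gsutil cp "$1" "${2:-/sbgenomics/output-files}"
}

# Example (placeholder bucket and folder); note the single quotes:
# copy_matching_files 'gs://fc-your-bucket-id/my_folder/*.vcf'
```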
Additionally, /sbgenomics/output-files should be the destination folder when bringing in data from Terra, as this will ensure the files or folders get populated back to the Seven Bridges project. Refer to the Save analysis outputs documentation for information about working with files in Data Cruncher environments. After the JupyterLab instance is shut down, your files will automatically be populated in your project-files tab on Seven Bridges.
In this section we will discuss pushing data from a Seven Bridges project to a Terra workspace.
Moving data from Seven Bridges to Terra uses the same setup as the previous section, with one modification to the gsutil copy command: the arguments are reversed, so the Seven Bridges path is the source and the Terra bucket is the destination.
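A sketch of the reversed command, with assumptions noted: the function name is hypothetical, and the local path and bucket name in the example are placeholders for the folder you want to send and the Terra workspace bucket.

```shell
# Hypothetical helper: copy a local folder from the Seven Bridges
# JupyterLab environment to a Terra workspace bucket.
# $1 is the local folder; $2 is the bucket name without the gs:// prefix.
push_folder_to_terra() {
  gsutil cp -R "$1" "gs://$2"
}

# Example (placeholder path and bucket name):
# push_folder_to_terra /path/to/my_folder fc-your-bucket-id
```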
You will still use the -R flag, but the destination is a Terra bucket. The Terra workspace’s Google bucket name/ID can be found on the Terra workspace Dashboard tab. You can verify that the folder has been copied by navigating to the Files section of the Data tab in your Terra workspace.
Clicking on the folder, you will see that all three files have been copied.
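The same check can be made from the JupyterLab terminal. The sketch below assumes gsutil is still configured from the earlier steps and uses gsutil ls -r, which lists a bucket path recursively; the function, bucket, and folder names are placeholders.

```shell
# Hypothetical helper: recursively list a copied folder in a Terra bucket.
# $1 is the bucket name without the gs:// prefix; $2 is the folder name.
verify_terra_folder() {
  gsutil ls -r "gs://$1/$2"
}

# Example (placeholder bucket and folder):
# verify_terra_folder fc-your-bucket-id my_folder
```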