BDCatalyst-RFC-#: 11
BDCatalyst-RFC-Title: NHLBI BioData Catalyst Ecosystem Security Statement
BDCatalyst-RFC-Type: Consensus Building
Name of the person who is to be Point of Contact: Sarah Davis
Email of the person who is to be Point of Contact: sdavis@renci.org
Submitting Team: BDC3/NHLBI
Requested BDCatalyst-RFC posting start date: 6/14/2021
Date Emailed for consideration: 6/14/2021
BDCatalyst-RFC-Status: Comment only
URL Link to this document: https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/community/request-for-comments/nhlbi-biodata-catalyst-ecosystem-security-statement
URL Link to the website: https://www.biodatacatalyst.org/collaboration/rfcs/bdcatalyst-rfc-11/
License: This work is licensed under a CC-BY-4.0 license.
The purpose of this RFC is to provide the NHLBI BioData Catalyst Consortium and users of the NHLBI BioData Catalyst ecosystem with a clear statement on security mechanisms of the ecosystem that protect the confidentiality, integrity, provenance, and availability of the hosted data as well as any data that may be uploaded using the ecosystem’s “Bring Your Own Data” (BYOD) functionality.
Figure 1. The NHLBI BioData Catalyst ecosystem leverages separately developed and managed platforms to maximize flexibility for users based on their research needs, expertise, and backgrounds. Utilizing multiple Authorizations to Operate (ATO), these platforms combine to provide secure, cloud-based workspaces, user authentication and authorization, search, tools and workflows, applications, and new innovative features to address community needs.
The NHLBI and the BioData Catalyst Consortium recognize the importance of protecting both the privacy and security of the data and respecting the consent of the study participants whose data is stored within the BioData Catalyst ecosystem. Tackling these issues presents challenges beyond those faced by most Federal Information Systems. The BioData Catalyst Consortium has implemented many innovative approaches to enable compliance and ensure that users understand their responsibility to protect data as articulated in specific Data Use Agreements (DUA). These approaches and controls work to protect the confidentiality, integrity, and availability of the data; the privacy of the study participants who have contributed data; and data that may be uploaded to BioData Catalyst using the ecosystem’s “Bring Your Own Data” (BYOD) functionality. While the same general security controls are applied to both system and BYOD data, BYOD data is further protected because the ecosystem provides access only to the data’s uploaders and their designated collaborators.
From a Federal Information Security Modernization Act (FISMA) perspective, the BioData Catalyst ecosystem is a set of software systems with distinct security boundaries. Each system owner holds an Authority to Operate (ATO) issued by the NIH. The ATO is the result of a rigorous Security Assessment and Authorization (SA&A) process and third-party assessment consistent with guidance from the National Institute of Standards and Technology (NIST). The ecosystem operates via a set of Interconnection Security Agreements (ISA) (Reindl 1979) and utilizes several existing components of security infrastructure (Bridges 2017, Gutiérrez-Sacristán et al. 2018) developed for other NIH platforms. Whereas the documentation provided as part of the SA&A process describes how security controls are implemented based on the NIST Special Publication 800-53r4 (see Endnote), the ISAs describe the permitted exchange of data and establish ecosystem-wide incident response, logging, and auditing expectations that enable the consortium to respond in a unified manner to any suspected cybersecurity incident. The SA&A documentation provides for regular evaluation of the security of the component systems, including regular scanning for vulnerabilities and an annual penetration test. This level of security represents a baseline, and the BioData Catalyst ecosystem will extend protections over time.
While the processes, policies, and technical controls protect confidentiality, integrity, and availability of data in accordance with Federal statute and regulation, there are additional ways to ensure that data is used in a manner consistent with study participants’ wishes, as represented by the consent form participants sign when enrolling in a specific study. Respect for these consents is critical to maintaining the public’s trust and requires additional policy, process, and technical controls. The respect for consent in NHLBI BioData Catalyst is enforced using normative NIH policies and processes for data sharing and using the existing infrastructure provided by the National Center for Biotechnology Information’s (NCBI) Database of Genotypes and Phenotypes (dbGaP). All NHLBI-provided data within the NHLBI BioData Catalyst ecosystem are registered in dbGaP; in this process, data are assigned “consent groups” that describe in a machine-readable format the parameters of the consent for the data. These range from the most expansive, “General Research Use”, to more restrictive groups, such as those allowing only research related to “Health Medical Biomedical topics” or even to specific diseases, such as Chronic Obstructive Pulmonary Disease. Further, while secondary analysis of data is not considered human subjects research as described in the Common Rule (45 CFR Part 46), some datasets require the review of a research proposal by an Institutional Review Board (IRB) or a Letter of Collaboration (LOC) with the originating study Principal Investigator, as determined by the informed consents or by special considerations that the submitting institution has determined are needed. These measures provide additional protection for datasets with particular sensitivity or special criteria for use.
For instance, IRB review is required when 1) the informed consents signed by the study participants state that IRB oversight for secondary use of the data is required, and/or 2) the study IRB of record determines that the data may contain sensitive information that requires IRB oversight for secondary research. Collaboration letters are required when the informed consents indicate that researchers outside the study who make secondary use of the data must work with the study, so formal collaborations need to be put in place. While these are rarely used, they provide additional protection under special circumstances, such as where an indigenous population or sovereign nation requires direct control of how their data is used. Because consent is expressed at the individual level, there may be a variety of consents for a study, either because the study offered choices to its participants or because the study consent evolved over an extended longitudinal study, such as the Framingham Heart Study. These variations in consent are reflected as multiple “consent groups” within a study and may mean that an investigator receives permission for only a subset of study participants.
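The effect of multiple consent groups within one study can be illustrated with a minimal sketch. It assumes a simplified model in which each participant belongs to exactly one consent group and an investigator holds approvals for a set of (study, consent group) pairs. The accession `phs000001`, the codes `c1` and `c2`, and all class and function names are hypothetical placeholders, not actual dbGaP or BioData Catalyst interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    participant_id: str
    study: str          # hypothetical study accession
    consent_group: str  # hypothetical consent group code


def visible_participants(participants, approved):
    """Keep only participants whose (study, consent group) pair is approved."""
    return [p for p in participants if (p.study, p.consent_group) in approved]


# Hypothetical cohort: one study whose participants fall into two consent groups.
cohort = [
    Participant("P1", "phs000001", "c1"),
    Participant("P2", "phs000001", "c2"),
    Participant("P3", "phs000001", "c1"),
]

# The investigator is approved for consent group c1 only.
approved = {("phs000001", "c1")}

visible = visible_participants(cohort, approved)
print([p.participant_id for p in visible])  # ['P1', 'P3']
```

In this sketch the investigator sees two of the three participants, which mirrors the point that approval may cover only a subset of a study.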
BioData Catalyst uses telemetry provided by dbGaP to enforce compliance with consents. Accordingly, users of BioData Catalyst can see only the data for which they have completed a dbGaP Data Access Request (DAR) and received approval from an NIH Data Access Committee (DAC). DAC approval results in a Data Use Agreement (DUA) describing any Data Use Limitations asserted by the originating study Principal Investigator, including ensuring that any requirements for an IRB review or Letter of Collaboration are met. On the BioData Catalyst ecosystem, the Gen3 “Fence” service, developed and operated by the University of Chicago and utilized by other NIH platforms and many other non-federal data commons, ensures enforcement of data access requirements. To ensure that the NHLBI maintains control over the use of the data and has the ability to audit this use, the BioData Catalyst policy does not allow download of source data outside the cloud ecosystem. Instead, BioData Catalyst provides access to data in approved cloud environments where computation occurs. This access is documented in an ISA and is provided via standard Application Programming Interfaces (APIs) that use the authentication and authorization provided by the NIH Researcher Auth Service (RAS) (REF), together with Fence, to protect access and to allow for monitoring and auditing for appropriate data use (e.g. within scope of the approved project). These APIs include implementations of the protected GA4GH Data Repository Service (DRS) for access to data objects such as genomic data, and protected PIC-SURE interfaces for access to phenotypic, genotypic variant, and electronic health record data. The use of these APIs will, once fully implemented, enable other trusted cloud-based systems that meet equivalent security requirements to access data stored within BioData Catalyst for analysis using those systems’ tools, without the data being downloaded outside the security perimeters of the systems.
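As an illustration of API-based access, the sketch below builds (but does not send) a request for object metadata using the path layout defined by the GA4GH DRS v1 specification, `/ga4gh/drs/v1/objects/{object_id}`. The host `drs.example.org`, the function name, and the use of a bearer token in the `Authorization` header are assumptions for illustration only; the actual BioData Catalyst endpoints and token flow are not described in this document.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical DRS server; not a real BioData Catalyst endpoint.
DRS_BASE = "https://drs.example.org"


def build_drs_object_request(object_id: str, access_token: str) -> Request:
    """Build an unsent GET request for DRS object metadata.

    The access token is assumed to come from the platform's
    authentication and authorization flow.
    """
    # Percent-encode the identifier so characters such as "/" stay in one path segment.
    url = f"{DRS_BASE}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"
    return Request(url, headers={"Authorization": f"Bearer {access_token}"})


req = build_drs_object_request("example-object-id", "example-token")
print(req.get_method(), req.full_url)
```

Because the request carries the caller's credentials, the server side can authorize and log each access, which is the property the policy relies on for monitoring and auditing.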
This commitment to the use of APIs, together with the requirement that data stay within the designated security boundaries, is a critical component of making NHLBI data FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016, Corpas et al. 2018), while also ensuring confidentiality of data and respect for consent, regardless of the platform where the data is analyzed. BioData Catalyst has extended this model through the use of the NIH Researcher Auth Service (RAS) to provide authentication and authorization controls, which, together with the use of secure APIs, is enabling secure interoperability with other trusted NIH-funded platforms such as the NHGRI’s AnVIL and the Gabriella Miller Kids First Data Resource.
While NIST-800-53r4 is a thorough standard for many kinds of systems, it has some gaps for the most modern systems. There are additional standards to apply, either from the High column of NIST-800-53r4 or as addendums adopted as “best practice”.
In particular, the Standard does not give guidance on modern applications (“appsec”) -- especially when the Infrastructure is serverless and the entire security surface is the application itself. For instance, there are requirements for regular scanning but scanning tools do not scan APIs or modern Single Page Apps (SPA) sufficiently. A “web app scan” with tools like IBM’s Appscan would return no vulnerabilities on an API without actually testing it, yet it would satisfy RA-5 of NIST-800-53 to run such a scan. Nor does the standard require that the Infrastructure as a Service layer (AWS, GCP, Azure) abide by a continual scanning posture for misconfiguration -- only Networks and VMs are specified. That is one example of where NIST-800-53r4 doesn’t work for modern applications.
There’s also no guidance from NIST at large about running an API where external parties are building “clients” to that API. Does allowing clients extend the security boundary, so that all third-party applications must be evaluated as part of it? Is there a different consideration for these third-party applications? The standard is silent there. Companies like Apple and Google, through their app stores, require all third-party apps to undergo evaluation against their own security standards, and such an idea might be applicable here. NHLBI might consider adding some extra controls, what the AllOfUs Program calls FISMA+, for enhanced security.
BDCatalyst-RFC-#: 6
BDCatalyst-RFC-Title: Data Access Working Group Data Upload and Download Policy and Recommendations For Users
BDCatalyst-RFC-Type: Process
Name of the person who is to be Point of Contact: Kira Bradford, Jessica Lyons
Email of the person who is to be Point of Contact: kcbradford@renci.org, Jessica_Lyons@hms.harvard.edu
Submitting Team: Data Access Working Group
Requested BDCatalyst-RFC posting start date: 2020-03-31
BDCatalyst-RFC-Status: Comment only
URL Link to this document: https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/community/request-for-comments/data-upload-and-download-policy-and-recommendations-for-users
URL Link to the BDCatalyst-RFC: https://www.nhlbidatastage.org/collaboration/rfcs/bdcatalyst-rfc-6
License: This work is licensed under a CC-BY-4.0 license.
This document describes and defines data movement (data egress and ingress), explaining the types of data movement currently allowable on each platform and what kinds of data may be downloaded or uploaded. This document is meant to inform the BioData Catalyst Go Live users of the Data Access Working Group’s (DAWG) data upload and download recommendations and policies. The policies and recommendations herein are specific to BioData Catalyst users. The following terminology is in use throughout this document.
Ecosystem: BioData Catalyst
Platform: Piece of the BioData Catalyst ecosystem.
Examples: Terra, Gen3, Seven Bridges, PIC-SURE
Workspace: Areas to work on or with data within a platform.
Examples: Projects/workspaces within Seven Bridges or Terra
Individual-level Data: Data at the level of the individual.
Controlled-access data: Data that is not publicly accessible and requires specific credentials or approvals for access and use, primarily due to study participant consents.
External data/user uploaded data: Other data sources not hosted on BioData Catalyst, such as a user's own dataset or data from another source.
FISMA moderate: The Federal Information Security Management Act (FISMA) creates standards to ensure that all government partners handle confidential and sensitive data appropriately. FISMA moderate is the designated level that means if data is compromised there could be a serious adverse impact, such as a loss of confidentiality, integrity, or availability of data.
Security boundary: Refers to the technical infrastructure boundaries of a BioData Catalyst platform and Ecosystem.
We define data movement as the transfer of data, including controlled-access data, in and out of the FISMA moderate security boundaries of the BioData Catalyst ecosystem. Types of data movement are listed below along with available functionality by Go Live:
Type 1: Uploading and downloading data to a workspace within a platform
Uploading: Moving data that exists outside the BioData Catalyst security boundary inside the security boundary (i.e. uploading data) - available by Go Live (See more information below on Permissible Data Upload)
Downloading: Moving data from within the BioData Catalyst security boundary outside of the security boundary (i.e. downloading data) - available by Go Live (See Permissible Data Download for data download limitations)
Type 2: Moving data from one workspace to another workspace within the same platform - available by Go Live
Example: Moving data from one workspace to another workspace within Terra, such as when a user copies a workspace on Terra to create a new workspace and run a similar analysis there.
Type 3: Moving data from one platform workspace to another platform workspace - not available by Go Live from all platforms (see Table 1 below). Moving data using core services will be available, such as accessing data available from Gen3 and/or curated data from PIC-SURE and moving it to another platform for analysis.
Example: If the user generates result files from running an analysis in one workspace (e.g. Seven Bridges workspace), those work files are not currently able to be accessed by a different platform workspace (e.g. Terra workspace).
Type 4: Sharing with unauthorized users any controlled-access data brought into the BioData Catalyst ecosystem, or any datasets containing controlled-access data produced by users within it - prohibited for Go Live. This pertains to controlled-access data brought into the ecosystem by the user or data available through BioData Catalyst, including data that has been created or altered for analyses. For user-generated results files, the BioData Catalyst consortium does not currently have policies in place or technical implementations to track this. We do not have the technical implementation to track a user bringing their own data onto the BioData Catalyst ecosystem and sharing it with others. We also do not have any data provenance technical implementation for tracking how a dataset is transformed. Therefore, our policy is that the user is responsible for any externally uploaded or transformed data. Users must adhere to all regulations and data use agreements and are solely responsible for the use of any data uploaded and transformed within the ecosystem.
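The four types of data movement defined above can be summarized in a minimal classification sketch. It assumes a simplified model in which a location is either outside the security boundary or identified by a platform and workspace. The `Location` class, the `classify_movement` function, and the `shared_with_unauthorized_user` flag are hypothetical names introduced only for illustration; as noted above, the ecosystem does not actually track sharing of this kind.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    platform: Optional[str]          # None means outside the security boundary
    workspace: Optional[str] = None


def classify_movement(src: Location, dst: Location,
                      shared_with_unauthorized_user: bool = False) -> str:
    """Map a transfer onto the movement types defined in this document."""
    if shared_with_unauthorized_user:
        # Sharing controlled-access data with unauthorized users.
        return "Type 4"
    if src.platform is None and dst.platform is None:
        raise ValueError("transfer never touches the security boundary")
    if src.platform is None or dst.platform is None:
        # Upload into, or download out of, the security boundary.
        return "Type 1"
    if src.platform == dst.platform:
        # Workspace to workspace within the same platform.
        return "Type 2"
    # Workspace on one platform to a workspace on another platform.
    return "Type 3"


outside = Location(None)
print(classify_movement(outside, Location("Terra", "ws-a")))                           # Type 1
print(classify_movement(Location("Terra", "ws-a"), Location("Terra", "ws-b")))         # Type 2
print(classify_movement(Location("Seven Bridges", "proj"), Location("Terra", "ws-a"))) # Type 3
```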
The following table describes the current types of data movement allowed for BioData Catalyst users based on which platform they are using.
Table 1: Data Movement allowed in BioData Catalyst
Data Movement from core services to platforms is permissible. These include accessing data available from Gen3 and/or curated data from PIC-SURE and moving it to another platform for analysis.
Users are permitted to upload data not available in the BioData Catalyst ecosystem (i.e. external data) to their own workspace. Users may upload data into BioData Catalyst if they have the required approvals for such use. These approvals include approval of a Data Access Request for controlled-access data that includes a BioData Catalyst cloud use statement, and any approval required by the user's Institutional Review Board policies and guidelines. At all times, it is the user's responsibility to ensure they use the data they upload consistent with applicable Data Use Agreements, Data Use Limitations, IRB requirements, and any other restrictions on use.
Due to the sensitive nature of data available through the BioData Catalyst ecosystem, users are only allowed to download certain pieces of data/results as outlined in Table 2. The technical infrastructure of the BioData Catalyst platforms allows data download, so the responsibility for compliance with data download requirements lies with the user of BioData Catalyst. Results and data that the user would broadly share, such as in an academic publication, may be downloaded through shared workspaces on the BioData Catalyst ecosystem; however, users are strongly encouraged to keep results and data within the BioData Catalyst ecosystem. Users are prohibited from downloading any controlled-access, individual-level data (see Table 2). Users should be aware that, if they choose to download permitted data and results, the act of transferring that data through the BioData Catalyst security boundary may or may not be supported by their Data Use Agreement(s), Limitation(s), or Institutional Review Board policies and guidelines. BioData Catalyst users are solely responsible for adhering to the terms of these policies. Users should be aware that all data downloads are logged and regularly reviewed for compliance. See below for examples of data permissible and prohibited for download.
Table 2: BioData Catalyst Permissible and Prohibited Download
This list is not exhaustive of all possible scenarios and is subject to change. If you have questions about permissible data download please contact https://biodatacatalyst.nhlbi.nih.gov/contact.
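The download rules can be expressed as a small decision sketch. It assumes that each item is described by three flags and encodes only the cases this policy states explicitly; anything else is routed to the help desk, consistent with the note that the list is not exhaustive. The `DataItem` class, its field names, and the returned strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    name: str
    controlled_access: bool      # contains controlled-access data
    individual_level: bool       # contains individual-level data
    user_uploaded: bool = False  # the user's own data brought to the platform


def download_decision(item: DataItem) -> str:
    """Return the policy outcome for downloading a single item."""
    if item.user_uploaded:
        # The user's own data, following their DUAs and/or IRB protocols.
        return "permissible"
    if item.controlled_access and item.individual_level:
        return "prohibited"
    if not item.controlled_access and not item.individual_level:
        # Aggregate or summary results; download is allowed but discouraged.
        return "permissible"
    # Cases the policy does not spell out.
    return "contact help desk"


print(download_decision(DataItem("hosted CRAM file", True, True)))        # prohibited
print(download_decision(DataItem("summary statistics", False, False)))    # permissible
print(download_decision(DataItem("my own cohort", True, True, True)))     # permissible
```

Even a "permissible" outcome leaves the user responsible for checking their own Data Use Agreements and IRB policies before transferring data through the security boundary.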
Allowing users to share data with other collaborators is permissible per BioData Catalyst policy, but platforms are not required to have this sharing capability available on BioData Catalyst for Go Live. The BioData Catalyst user is ultimately responsible for maintaining the confidentiality, integrity, and availability of any data uploaded to or downloaded from the BioData Catalyst ecosystem. It is therefore essential that all users of the BioData Catalyst ecosystem accessing controlled-access data understand their responsibilities for ensuring appropriate information security controls and that they work with their institutions to effectively implement those responsibilities. Users can upload their data to a workspace; however, it is the responsibility of the uploader to ensure that data policies and permissions are in place to permit data transfer to any users of the shared workspace. Additionally, the uploader should make all collaborators aware of any Data Use Agreements or Limitations that apply to any newly uploaded data.
Users will be made aware and reminded of the data upload and download policy recommendations when working in the BioData Catalyst Ecosystem. The user will see this or similar messaging when working on BioData Catalyst platforms.
“You are transferring data through the BioData Catalyst security boundary. Downloading controlled-access, individual-level data through BioData Catalyst is prohibited and downloading other types of data is strongly discouraged, due to the sensitive nature of the data hosted on the platform. Please see the Permissible and Prohibited Data Download section of the Data Upload and Download Policy for more information. Additionally, transferring data may or may not be supported by your Data Use Agreement(s), Limitation(s), or your Institutional Review Board policies and guidelines. As a BioData Catalyst user, you are solely responsible for adhering to the terms of these policies.”
Table 1

| Platform | Type 1 | Type 2 | Type 3 | Type 4 |
| --- | --- | --- | --- | --- |
| Gen3 | ✔ | ✔ | ✔ (as part of core services) | prohibited |
| Terra | ✔ | ✔ | Not available by Go Live | prohibited |
| Seven Bridges | ✔ | ✔ | Not available by Go Live | prohibited |
| PIC-SURE | ✔ | ✔ | ✔ (as part of core services) | prohibited |
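For platform developers, the Table 1 entries above can be encoded as a simple lookup. This is only a sketch of one possible representation; the dictionary name, the `movement_status` function, and the status strings are hypothetical, while the values themselves are taken from the table.

```python
AVAILABLE = "available"
CORE_ONLY = "available (as part of core services)"
NOT_YET = "not available by Go Live"
PROHIBITED = "prohibited"

# Rows are platforms; keys 1-4 are the data movement types.
TABLE_1 = {
    "Gen3":          {1: AVAILABLE, 2: AVAILABLE, 3: CORE_ONLY, 4: PROHIBITED},
    "Terra":         {1: AVAILABLE, 2: AVAILABLE, 3: NOT_YET,   4: PROHIBITED},
    "Seven Bridges": {1: AVAILABLE, 2: AVAILABLE, 3: NOT_YET,   4: PROHIBITED},
    "PIC-SURE":      {1: AVAILABLE, 2: AVAILABLE, 3: CORE_ONLY, 4: PROHIBITED},
}


def movement_status(platform: str, movement_type: int) -> str:
    """Look up the Go Live status of a movement type on a platform."""
    try:
        return TABLE_1[platform][movement_type]
    except KeyError:
        raise ValueError(f"unknown platform or movement type: {platform!r}, {movement_type!r}") from None


print(movement_status("Terra", 3))  # not available by Go Live
```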
Table 2

Permissible to Download:
- Aggregate results/tables that would be publishable in an academic publication
- Summary data that does not include individual-level and/or controlled-access data
- Your own data that you brought to the platform for analyses following your DUAs and/or IRB protocols
To request special permission for other types of download, please contact the BioData Catalyst help desk at https://biodatacatalyst.nhlbi.nih.gov/contact.

Prohibited to Download:
Users are prohibited from downloading any controlled-access, individual-level data such as:
- Hosted TOPMed CRAM files with individual-level data
- Hosted TOPMed VCF files with individual-level data
- Hosted TOPMed or TOPMed-related phenotypic study data files with individual-level data