NHLBI BioData Catalyst Ecosystem Security Statement

BDCatalyst-RFC-#: 11
BDCatalyst-RFC-Title: NHLBI BioData Catalyst Ecosystem Security Statement
BDCatalyst-RFC-Type: Consensus Building
Name of the person who is to be Point of Contact: Sarah Davis
Email of the person who is to be Point of Contact: sdavis@renci.org
Submitting Team: BDC3/NHLBI
Requested BDCatalyst-RFC posting start date: 6/14/2021
Date Emailed for consideration: 6/14/2021
BDCatalyst-RFC-Status: Comment only
URL Link to this document: https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/community/request-for-comments/nhlbi-biodata-catalyst-ecosystem-security-statement
URL Link to the website: https://www.biodatacatalyst.org/collaboration/rfcs/bdcatalyst-rfc-11/
License: This work is licensed under a CC-BY-4.0 license.


The purpose of this RFC is to provide the NHLBI BioData Catalyst Consortium and users of the NHLBI BioData Catalyst ecosystem with a clear statement on security mechanisms of the ecosystem that protect the confidentiality, integrity, provenance, and availability of the hosted data as well as any data that may be uploaded using the ecosystem’s “Bring Your Own Data” (BYOD) functionality.

Figure 1. The NHLBI BioData Catalyst ecosystem leverages separately developed and managed platforms to maximize flexibility for users based on their research needs, expertise, and backgrounds. Utilizing multiple Authorizations to Operate (ATO), these platforms combine to provide secure, cloud-based workspaces, user authentication and authorization, search, tools and workflows, applications, and new innovative features to address community needs.

NHLBI BioData Catalyst Ecosystem Security Statement

The NHLBI and the BioData Catalyst Consortium recognize the importance of protecting both the privacy and security of the data and respecting the consent of the study participants whose data is stored within the BioData Catalyst ecosystem. Addressing these issues poses challenges beyond those faced by most Federal Information Systems. The BioData Catalyst Consortium has implemented many innovative approaches to enable compliance and ensure that users understand their responsibility to protect data as articulated in specific Data Use Agreements (DUA). These approaches and controls work to protect the confidentiality, integrity, and availability of the data; the privacy of the study participants who have contributed data; and data that may be uploaded to BioData Catalyst using the ecosystem’s “Bring Your Own Data” (BYOD) functionality. While the same general security controls are applied to both system and BYOD data, BYOD data is further protected because the ecosystem provides access only to the data’s uploaders and their designated collaborators.

From a Federal Information Security Modernization Act (FISMA) perspective, the BioData Catalyst ecosystem is a set of software systems with distinct security boundaries. Each system owner holds an Authority to Operate (ATO) issued by the NIH. The ATO is the result of a rigorous Security Assessment and Authorization (SA&A) process and third-party assessment consistent with guidance from the National Institute of Standards and Technology (NIST). The ecosystem operates via a set of Interconnection Security Agreements (ISA) (Reindl 1979) and utilizes several existing components of security infrastructure (Bridges 2017, Gutiérrez-Sacristán et al. 2018) developed for other NIH platforms. Whereas the documentation provided as part of the SA&A process describes how security controls are implemented based on the NIST Special Publication 800-53r4 (see Endnote), the ISAs describe the permitted exchange of data and establish ecosystem-wide incident response, logging, and auditing expectations that enable the consortium to respond in a unified manner to any suspected cybersecurity incident. The SA&A documentation provides for regular evaluation of the security of the component systems, including regular scanning for vulnerabilities and an annual penetration test. This level of security represents a baseline, and the BioData Catalyst ecosystem will extend protections over time.

While the processes, policies, and technical controls protect confidentiality, integrity, and availability of data in accordance with Federal statute and regulation, there are additional ways to ensure that data is used in a manner consistent with study participants’ wishes, as represented by the consent form participants sign when enrolling in a specific study. Respect for these consents is critical to maintaining the public’s trust and requires additional policy, process, and technical controls. The respect for consent in NHLBI BioData Catalyst is enforced using normative NIH policies and processes for data sharing and using the existing infrastructure provided by the National Center for Biotechnology Information’s (NCBI) Database of Genotypes and Phenotypes (dbGaP). All NHLBI-provided data within the NHLBI BioData Catalyst ecosystem are registered in dbGaP; in this process, data are assigned “consent groups” that describe in a machine-readable format the parameters of the consent for the data. These range from the most expansive, “General Research Use”, to more restrictive groups, such as those allowing only research related to “Health/Medical/Biomedical” topics or even to specific diseases, such as Chronic Obstructive Pulmonary Disease. Further, while secondary analysis of data is not considered human subjects research as described in the Common Rule (45 CFR Part 46), some datasets require the review of a research proposal by an Institutional Review Board (IRB) or a Letter of Collaboration (LOC) with the originating study Principal Investigator, as determined by the informed consents or by special considerations that the submitting institution has determined are needed. These measures provide additional protection for datasets with particular sensitivity or special criteria for use.

For instance, IRB review is required when 1) informed consents that were signed by the study participants state that IRB oversight for secondary use of the data is required, and/or 2) the study IRB of record determines that the data may contain sensitive information that requires IRB oversight for secondary research. A Letter of Collaboration is required when the informed consents indicate that researchers outside the study who make secondary use of the data will work with the study, so formal collaborations need to be put in place. While these are rarely used, they provide additional protection under special circumstances, such as where an indigenous population or sovereign nation requires direct control of how their data is used. Because consent is expressed at the individual level, there may be a variety of consents for a study, either because the study offered choices to its participants or because the study consent evolved over an extended longitudinal study, such as the Framingham Heart Study. These variations in consent are reflected as multiple “consent groups” within a study and can mean that an investigator receives permission for only a subset of study participants.
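The effect of multiple consent groups within one study can be illustrated with a minimal sketch. It is not the ecosystem's implementation: the `Participant` class, the `visible_participants` function, the participant identifiers, and the consent codes shown are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    participant_id: str   # hypothetical identifier
    consent_group: str    # illustrative consent code, e.g. "GRU", "HMB", "DS-COPD"


def visible_participants(participants, approved_groups):
    """Return only the participants whose consent group the investigator is approved for."""
    approved = set(approved_groups)
    return [p for p in participants if p.consent_group in approved]


# A hypothetical study whose participants fall into three consent groups.
study = [
    Participant("P001", "GRU"),
    Participant("P002", "HMB"),
    Participant("P003", "DS-COPD"),
    Participant("P004", "HMB"),
]

# An investigator approved only for the "HMB" group sees a subset of the study.
subset = visible_participants(study, {"HMB"})
print([p.participant_id for p in subset])  # prints ['P002', 'P004']
```

An investigator with no approved consent groups would receive an empty list, which mirrors the rule that no data is visible without an approved request.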

BioData Catalyst uses telemetry provided by dbGaP to enforce compliance with consents. Accordingly, users of BioData Catalyst can see only the data for which they have completed a dbGaP Data Access Request (DAR) and received approval from an NIH Data Access Committee (DAC). DAC approval results in a Data Use Agreement (DUA) describing any Data Use Limitations asserted by the originating study Principal Investigator, including ensuring that any requirements for an IRB review or Letter of Collaboration are met. On the BioData Catalyst ecosystem, the Gen3 “Fence” service, developed and operated by the University of Chicago and utilized by other NIH platforms and many non-federal data commons, enforces data access requirements. To ensure that the NHLBI maintains control over the use of the data and has the ability to audit this use, BioData Catalyst policy does not allow download of source data outside the cloud ecosystem. Instead, BioData Catalyst provides access to data in approved cloud environments where computation occurs, as documented in an ISA. Access is provided via standard Application Programming Interfaces (APIs) that use the authentication and authorization provided by the NIH Researcher Auth Service (RAS) (REF), together with Fence, to protect access and to allow for monitoring and auditing for appropriate data use (e.g., within the scope of the approved project). These APIs include implementations of the protected GA4GH Data Repository Service (DRS) for access to data objects such as genomic data, and protected PIC-SURE interfaces for access to phenotypic, genotypic variant, and electronic health record data. Once fully implemented, these APIs will enable other trusted cloud-based systems that meet equivalent security requirements to access data stored within BioData Catalyst for analysis using those systems’ tools, without the data being downloaded outside the security perimeters of the systems.
This commitment to the use of APIs, together with the requirement that data stay within the designated security boundaries, is a critical component of making NHLBI data FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016, Corpas et al. 2018), while also ensuring confidentiality of data and respect for consent, regardless of the platform where the data is analyzed. BioData Catalyst has extended this model through the use of the NIH Researcher Auth Service (RAS) to provide authentication and authorization controls, which, together with the use of secure APIs, enables secure interoperability with other trusted NIH-funded platforms such as the NHGRI’s AnVIL and the Gabriella Miller Kids First Data Resource.
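As a rough illustration of API-based access to a protected data object, the sketch below builds, but does not send, a request for object metadata using the path layout of the GA4GH DRS v1 specification. The host `drs.example.org`, the object identifier, and the token value are placeholders rather than real BioData Catalyst values, and the use of a bearer token stands in for whatever credential the authentication and authorization services actually issue.

```python
from urllib.parse import quote
from urllib.request import Request

# Placeholder host for illustration; not a real BioData Catalyst endpoint.
DRS_BASE = "https://drs.example.org"


def drs_object_request(object_id: str, access_token: str) -> Request:
    """Build (but do not send) a GET request for the metadata of one DRS object."""
    # Percent-encode the identifier so it is safe to place in a URL path segment.
    url = f"{DRS_BASE}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"
    # The token is assumed to have been obtained through the platform's login flow.
    headers = {"Authorization": f"Bearer {access_token}"}
    return Request(url, headers=headers, method="GET")


req = drs_object_request("example-object-id", "example-token")
print(req.get_method(), req.full_url)
```

Because every request carries the caller's credential, the serving system can check authorization and log the access, which is what makes monitoring and auditing possible without the data leaving the security boundary.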


While NIST-800-53r4 is a thorough standard for many kinds of systems, it has some gaps when applied to the most modern systems. Additional controls can be applied, either from the High column of NIST-800-53r4 or as “best practice” addenda.

In particular, the standard does not give guidance on modern application security (“appsec”), especially when the infrastructure is serverless and the entire security surface is the application itself. For instance, there are requirements for regular scanning, but scanning tools do not sufficiently scan APIs or modern Single Page Apps (SPA). A “web app scan” with a tool like IBM’s AppScan would return no vulnerabilities on an API without actually testing it, yet running such a scan would satisfy RA-5 of NIST-800-53. Nor does the standard require that the Infrastructure as a Service layer (AWS, GCP, Azure) be continually scanned for misconfiguration; only networks and VMs are specified. These are examples of where NIST-800-53r4 does not work for modern applications.
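To make the idea of scanning the cloud layer for misconfiguration concrete, here is a minimal sketch that checks a resource inventory against two simple rules. The inventory format, the field names `public_access` and `encrypted_at_rest`, and the resource names are hypothetical; real cloud providers expose configuration through their own APIs and tooling, which this sketch does not attempt to model.

```python
def find_misconfigurations(resources):
    """Return (resource name, issue) pairs for resources that violate the example rules."""
    findings = []
    for resource in resources:
        # Rule 1 (illustrative): storage must not be publicly accessible.
        if resource.get("public_access", False):
            findings.append((resource["name"], "publicly accessible"))
        # Rule 2 (illustrative): storage must be encrypted at rest.
        if not resource.get("encrypted_at_rest", False):
            findings.append((resource["name"], "not encrypted at rest"))
    return findings


# Hypothetical inventory, as it might be exported from a cloud account.
inventory = [
    {"name": "genomic-data-bucket", "public_access": False, "encrypted_at_rest": True},
    {"name": "scratch-bucket", "public_access": True, "encrypted_at_rest": False},
]

for name, issue in find_misconfigurations(inventory):
    print(f"{name}: {issue}")
```

Run on a schedule against a fresh inventory, a check of this kind would approximate the continual scanning posture that the standard does not currently require.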

There is also no guidance from NIST at large about running an API for which external parties build “clients”. Does allowing clients extend the security boundary, so that all third-party applications must be evaluated as part of it? Or is there a different consideration for these third-party applications? The standard is silent on this point. Companies like Apple and Google require all third-party apps in their app stores to undergo evaluation against their own security standards, and such an approach might be applicable here. NHLBI might consider adding some extra controls, which the All of Us program calls FISMA+, for enhanced security.
