BDCatalyst-RFC-#: 11
BDCatalyst-RFC-Title: NHLBI BioData Catalyst Ecosystem Security Statement
BDCatalyst-RFC-Type: Consensus Building
Name of the person who is to be Point of Contact: Sarah Davis
Email of the person who is to be Point of Contact: sdavis@renci.org
Submitting Team: BDC3/NHLBI
Requested BDCatalyst-RFC posting start date: 6/14/2021
Date Emailed for consideration: 6/14/2021
BDCatalyst-RFC-Status: Comment only
URL Link to this document: https://bdcatalyst.gitbook.io/biodata-catalyst-documentation/community/request-for-comments/nhlbi-biodata-catalyst-ecosystem-security-statement
URL Link to the website: https://www.biodatacatalyst.org/collaboration/rfcs/bdcatalyst-rfc-11/
License: This work is licensed under a CC-BY-4.0 license.
The purpose of this RFC is to provide the NHLBI BioData Catalyst Consortium and users of the NHLBI BioData Catalyst ecosystem with a clear statement on security mechanisms of the ecosystem that protect the confidentiality, integrity, provenance, and availability of the hosted data as well as any data that may be uploaded using the ecosystem’s “Bring Your Own Data” (BYOD) functionality.
Figure 1. The NHLBI BioData Catalyst ecosystem leverages separately developed and managed platforms to maximize flexibility for users based on their research needs, expertise, and backgrounds. Utilizing multiple Authorizations to Operate (ATO), these platforms combine to provide secure, cloud-based workspaces, user authentication and authorization, search, tools and workflows, applications, and new innovative features to address community needs.
The NHLBI and the BioData Catalyst Consortium recognize the importance of protecting both the privacy and security of the data and respecting the consent of the study participants whose data is stored within the BioData Catalyst ecosystem. Tackling these issues produces some challenges beyond those faced by most Federal Information Systems. The BioData Catalyst Consortium has implemented many innovative approaches to enable compliance and ensure that users understand their responsibility to protect data as articulated in specific Data Use Agreements (DUA). These approaches and controls work to protect the confidentiality, integrity, and availability of the data; the privacy of the study participants who have contributed data; and data that may be uploaded to BioData Catalyst using the ecosystem’s “Bring Your Own Data” (BYOD) functionality. While the same general security controls are applied to both system and BYOD data, BYOD data is further protected because the ecosystem provides access only to the data’s uploaders and their designated collaborators.
From a Federal Information Security Modernization Act (FISMA) perspective, the BioData Catalyst ecosystem is a set of software systems with distinct security boundaries. Each system owner holds an Authority to Operate (ATO) issued by the NIH. The ATO is the result of a rigorous Security Assessment and Authorization (SA&A) process and a third-party assessment consistent with guidance from the National Institute of Standards and Technology (NIST). The ecosystem operates via a set of Interconnection Security Agreements (ISA) (Reindl 1979) and utilizes several existing components of security infrastructure (Bridges 2017, Gutiérrez-Sacristán et al. 2018) developed for other NIH platforms. Where the documentation provided as part of the SA&A process describes how security controls are implemented based on NIST Special Publication 800-53r4 (see Endnote), the ISAs describe the permitted exchange of data and establish ecosystem-wide incident response, logging, and auditing expectations that enable the consortium to respond in a unified manner to any suspected cybersecurity incident. The SA&A documentation provides for regular evaluation of the security of the component systems, including regular scanning for vulnerabilities and an annual penetration test. This level of security represents a baseline, and the BioData Catalyst ecosystem will extend protections over time.
Where the processes, policies, and technical controls protect the confidentiality, integrity, and availability of data in accordance with Federal statute and regulation, there are additional ways to ensure that data is used in a manner consistent with study participants’ wishes, as represented by the consent form participants sign when enrolling in a specific study. Respect for these consents is critical to maintaining the public’s trust and requires additional policy, process, and technical controls. The respect for consent in NHLBI BioData Catalyst is enforced using normative NIH policies and processes for data sharing and using the existing infrastructure provided by the National Center for Biotechnology Information’s (NCBI) Database of Genotypes and Phenotypes (dbGaP). All NHLBI-provided data within the NHLBI BioData Catalyst ecosystem are registered in dbGaP; in this process, data are assigned “consent groups” that describe in a machine-readable format the parameters of the consent for the data. These range from the most expansive, “General Research Use,” to more restrictive groups that allow only research on health/medical/biomedical topics or even on specific diseases, such as Chronic Obstructive Pulmonary Disease. Further, while secondary analysis of data is not considered human subjects research as described in the Common Rule (45 CFR Part 46), some datasets require the review of a research proposal by an Institutional Review Board (IRB) or a Letter of Collaboration (LOC) with the originating study Principal Investigator, as determined by the informed consents or by special considerations that the submitting institution has determined are needed. These measures provide additional protection for datasets with particular sensitivity or special criteria for use.
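As a simple illustration of how machine-readable consent groups can gate access, the sketch below checks a proposed research use against a dataset’s consent group. The consent-group codes (GRU, HMB, DS-COPD) follow common dbGaP abbreviations, but the dataset accessions, function, and permitted-use logic are hypothetical and are not the ecosystem’s actual enforcement code (that role is served by dbGaP DAC review, RAS, and Fence, described below).

```python
# Illustrative only: a toy model of dbGaP-style consent groups and a
# permitted-use check. Real enforcement is performed by dbGaP DAC review,
# RAS, and Gen3 Fence, not by client-side code like this.

# Hypothetical datasets mapped to their machine-readable consent groups.
CONSENT_GROUPS = {
    "phs000001.v1.p1.c1": "GRU",      # General Research Use
    "phs000002.v1.p1.c2": "HMB",      # Health/Medical/Biomedical research only
    "phs000003.v1.p1.c1": "DS-COPD",  # Disease-Specific: COPD research only
}

def use_is_permitted(dataset: str, proposed_use: str) -> bool:
    """Return True if the proposed use is consistent with the dataset's consent group."""
    group = CONSENT_GROUPS[dataset]
    if group == "GRU":
        return True                                 # any research use
    if group == "HMB":
        return proposed_use.startswith("health:")   # biomedical research only
    if group.startswith("DS-"):
        disease = group.split("-", 1)[1]
        return proposed_use == f"health:{disease}"  # named disease only
    return False                                    # unknown group: deny

if __name__ == "__main__":
    print(use_is_permitted("phs000003.v1.p1.c1", "health:COPD"))       # True
    print(use_is_permitted("phs000002.v1.p1.c2", "ancestry:general"))  # False
```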
For instance, IRB review is required when 1) the informed consents signed by the study participants state that IRB oversight is required for secondary use of the data, and/or 2) the study IRB of record determines that the data may contain sensitive information that requires IRB oversight for secondary research. For collaboration letters, the informed consents indicate that researchers outside the study making secondary use of the data must work with the study, and therefore formal collaborations need to be put in place. While these are rarely used, they provide additional protection under special circumstances, such as when an indigenous population or sovereign nation requires direct control over how their data is used. Because consent is expressed at the individual level, there may be a variety of consents for a study, either because the study offered choices to its participants or because the study consent evolved over an extended longitudinal study, such as the Framingham Heart Study. These variations in consent are reflected as multiple “consent groups” within a study and may mean that an investigator receives permission only for subsets of study participants.
BioData Catalyst uses telemetry provided by dbGaP to enforce compliance with consents. Accordingly, users of BioData Catalyst can see only the data for which they have completed the process of a dbGaP Data Access Request (DAR) and received approval from an NIH Data Access Committee (DAC). DAC approval results in a Data Use Agreement (DUA) describing any Data Use Limitations asserted by the originating study Principal Investigator, including ensuring that any requirements for an IRB review or Letter of Collaboration are met. On the BioData Catalyst ecosystem, the Gen3 “Fence” service, developed and operated by the University of Chicago and utilized by other NIH platforms and many other non-federal data commons, ensures enforcement of data access requirements. In order to ensure that the NHLBI maintains control over the use of the data and has the ability to audit this use, BioData Catalyst policy does not allow download of source data outside the cloud ecosystem. Instead, BioData Catalyst provides access to data in approved cloud environments where computation occurs (as documented in an ISA), via standard Application Programming Interfaces (APIs) that use the authentication and authorization provided by the NIH Researcher Auth Service (RAS) (REF), together with Fence, to protect access and to allow monitoring and auditing for appropriate data use (e.g., within the scope of the approved project). These APIs include implementations of the protected GA4GH Data Repository Service (DRS) for access to data objects such as genomic data and protected PIC-SURE interfaces for access to phenotypic, genotypic variant, and electronic health record data. Once fully implemented, these APIs will enable other trusted cloud-based systems that meet equivalent security requirements to access data stored within BioData Catalyst for analysis using those systems’ tools, without the data being downloaded outside the security perimeters of the systems. This commitment to the use of APIs, together with the requirement that data stay within the designated security boundaries, is a critical component of making NHLBI data FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016, Corpas et al. 2018), while also ensuring confidentiality of data and respect for consent, regardless of the platform where the data is analyzed. BioData Catalyst has extended this model through the use of RAS to provide authentication and authorization controls, which, together with the use of secure APIs, enables secure interoperability with other trusted NIH-funded platforms such as NHGRI’s AnVIL and the Gabriella Miller Kids First Data Resource.
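To make the API-based access pattern concrete, the sketch below retrieves metadata for a single data object through the GA4GH DRS interface and then requests a signed access URL. The base URL, the object ID, and the assumption that a RAS-issued bearer token is available in an environment variable are placeholders for illustration; the real endpoints and authorization flow are those documented by the BioData Catalyst platforms.

```python
# Minimal sketch of GA4GH DRS object access, assuming a hypothetical DRS
# endpoint and a RAS-issued bearer token supplied via an environment variable.
import os
import requests

DRS_BASE = "https://drs.example.org/ga4gh/drs/v1"                # placeholder endpoint
OBJECT_ID = "dg.EXAMPLE/00000000-0000-0000-0000-000000000000"    # placeholder object ID
TOKEN = os.environ["RAS_ACCESS_TOKEN"]                           # obtained via the RAS login flow
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# 1. Fetch the DRS object record (size, checksums, available access methods).
obj = requests.get(f"{DRS_BASE}/objects/{OBJECT_ID}", headers=HEADERS, timeout=30)
obj.raise_for_status()
drs_object = obj.json()
print(drs_object["name"], drs_object["size"])

# 2. Exchange an access_id for a short-lived signed URL, then read the bytes
#    inside the approved cloud workspace (no download outside the boundary).
access_id = drs_object["access_methods"][0]["access_id"]
acc = requests.get(
    f"{DRS_BASE}/objects/{OBJECT_ID}/access/{access_id}", headers=HEADERS, timeout=30
)
acc.raise_for_status()
signed_url = acc.json()["url"]
```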
While NIST 800-53r4 is a thorough standard for many kinds of systems, it has gaps when applied to the most modern systems encountered. There are additional controls to apply, either from the High baseline of NIST 800-53r4 or as “best practice” addenda.
In particular, the standard does not give guidance on modern application security (“appsec”), especially when the infrastructure is serverless and the entire security surface is the application itself. For instance, there are requirements for regular scanning, but scanning tools do not scan APIs or modern Single Page Applications (SPAs) sufficiently. A “web app scan” with a tool like IBM’s AppScan would return no vulnerabilities on an API without actually testing it, yet running such a scan would satisfy control RA-5 of NIST 800-53. Nor does the standard require that the Infrastructure-as-a-Service layer (AWS, GCP, Azure) be continually scanned for misconfiguration; only networks and VMs are specified. That is one example of where NIST 800-53r4 does not work for modern applications.
There is also no guidance from NIST at large about running an API where external parties are building “clients” to that API. Does allowing clients extend the security boundary, such that all third-party applications must be evaluated as part of it? Is there a different consideration for these third-party applications? The standard is silent there. Companies like Apple and Google, through their app stores, require all third-party apps to undergo evaluation against their own security standards, and such an idea might be applicable here. NHLBI might consider adding some extra controls, what the All of Us Research Program calls “FISMA+”, for enhanced security.
BDC-RFC-#: 28
Title: DICOM Medical Image De-Identification Baseline Protocol
Type: Process
Contact Name and Email: Keyvan Farahani, farahank@mail.nih.gov
Submitting Teams: NHLBI, DMC
Date Sent to Consortium: Oct. 11, 2023
Status: Closed for comment
URL Link to this Google Document: https://docs.google.com/document/d/14-WfeMqgZz115DbBnFs-8AvcdRY1oIjCgi0K33pwMjE/edit?usp=sharing
License: This work is licensed under a CC-BY-4.0 license.
Contributors:
Zixin Nie (BDC Data Management Core)
Keyvan Farahani (NHLBI)
David Clunie (PixelMed Publishing)
De-identification of protected health information (PHI) is often a necessary step before potentially sensitive information, such as health data, can be shared. Many data repositories that allow human data to be deposited and shared require the data to be de-identified. Medical images and their associated metadata (i.e., DICOM headers) often contain PHI, such as patient names, dates of birth, or medical record numbers. The de-identification of these images is essential to minimize privacy risk and to comply with regulations and standards that require the protection of PHI. The overarching goal in medical image de-identification is to reduce the risk of identification as much as possible.
De-identification facilitates the sharing of medical imaging data, enabling greater access by researchers and the public and allowing secondary research to be conducted. Several standards exist for de-identification of medical images, including the confidentiality profile detailed in the DICOM Part 15 standard, HIPAA Safe Harbor, and Expert Determination. The BioData Catalyst Data Management Core (BDC DMC) performed an evaluation of these standards and used them to create the protocol detailed in this document. This document describes the de-identification processes and technical considerations for de-identifying medical images as they are added to BDC and made available to researchers using the BDC platform. The protocol, referred to as the “BDC Baseline Protocol for Image De-identification,” takes into account the data use cases of researchers accessing the BDC platform by defining a de-identification profile that strikes a balance between privacy protection and preservation of utility.
The Baseline protocol only applies to the metadata in radiologic (DICOM) images (see table below). It does not apply to image pixel information, other imaging formats, or other types of data that may be imported into BDC, such as clinical and omics data. It reflects the understanding of the de-identification needs of BDC as of October 2023. Future RFCs are planned that will address masking of unique identifiers, the details of how imaging pixel data will be de-identified, the de-identification process workflow, and quality management.
The focus of this RFC is on de-identification of DICOM images.
The de-identification protocol described in this section is intended to be a baseline for de-identification within BDC. The protocol is compliant with regulations such as the HIPAA Privacy Rule and the Common Rule, while retaining the maximal amount of research utility possible. It is designed based on the experiences from the HeartShare imaging pilot project. The protocol will evolve over time, with future iterations to address new issues as they arise, and customizations to address specific research use cases. These may involve Expert Determinations, which can both increase privacy protections and improve research utility. This protocol is to be used for all medical imaging data to be submitted to the BDC. The protocol may be implemented in an image de-identification tool at the submitter’s site, or in a central BDC-related data curation service. Any deviation from this protocol must be discussed with and approved by the BDC/DMC. The baseline de-identification protocol can be found at this link: .
Introduction to HIPAA Safe Harbor and DICOM Part 15
De-identification of DICOM data can be performed according to different standards. Two commonly accepted standards are HIPAA Safe Harbor and the confidentiality profiles detailed in DICOM Part 15 (referred to in the rest of this document as the DICOM Part 15 Standard).
HIPAA Safe Harbor de-identification calls for the removal of 18 types of identifiers (detailed here: ). The standard legally applies to PHI handled by HIPAA Covered Entities; however, as it has been in use for over 20 years, it is generally accepted as a standard for de-identification of other types of data as well.
The DICOM Part 15 Standard was developed through a careful review of all DICOM attributes, identifying any that could contain identifying information and creating a mitigation strategy for each. It is more extensive than HIPAA Safe Harbor, covering attributes that are not part of the 18 prescribed types of identifiers, such as ethnicity and biological sex. Various mitigation strategies are presented for treating the attributes detailed in the standard, with the Basic DICOM Part 15 Confidentiality Profile being the most conservative, calling for suppression of most of the attributes.
In order to have de-identified data that still possesses analytic utility for BDC researchers, while also providing a standardized implementation of de-identification that can be applied across most data to be ingested by BDC, an evaluation was performed to produce a set of de-identification rules that can be applied to DICOM header attributes. The evaluation leveraged the de-identification profiles detailed in the DICOM Part 15 standard, evaluating their contents and aligning them with the minimum requirements for compliance with HIPAA Safe Harbor. The resulting de-identification strategy should be sufficient to construct a de-identification profile that can be applied across all DICOM headers.
The steps for performing this evaluation were as follows:
Attributes from each profile were classified into the following categories: Direct Identifier (DI), Quasi-Identifier (QI), and Non-Identifier (NI), according to the classification framework detailed in the following diagram:
After classification, DIs and QIs were then aligned with the 18 types of identifiers specified for removal within the HIPAA Safe Harbor provision.
Each of the attributes that aligns with one of the HIPAA Safe Harbor identifiers was then assigned a mitigation technique to remove the identifying information that could appear in the field.
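For illustration, the sketch below shows how the output of these steps might be recorded as a mapping from DICOM attributes to their classification, the HIPAA Safe Harbor identifier they align with, and the assigned mitigation. The specific classifications and mitigations shown are hypothetical examples, not the full evaluation results.

```python
# Hypothetical excerpt of the evaluation output: DICOM attribute keyword ->
# (classification, aligned Safe Harbor identifier, assigned mitigation).
# Classifications: DI = Direct Identifier, QI = Quasi-Identifier, NI = Non-Identifier.
EVALUATION = {
    "PatientName":             ("DI", "Names",             "suppress"),
    "PatientAddress":          ("DI", "Geographic data",   "suppress"),
    "PatientTelephoneNumbers": ("DI", "Telephone numbers", "suppress"),
    "StudyDate":               ("QI", "Dates",             "generalize to year"),
    "PatientID":               ("DI", "Unique identifiers", "mask (see masking RFC)"),
    "ImageComments":           ("QI", "Free text (may contain identifiers)", "suppress"),
    "Modality":                ("NI", None,                "retain"),
}

# Example: list every attribute that still needs mitigation.
for attr, (cls, safe_harbor, action) in EVALUATION.items():
    if action != "retain":
        print(f"{attr}: {cls} -> {action}")
```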
Of the attributes within the DICOM Part 15 standard that must be removed for compliance with HIPAA Safe Harbor, there are:
4 name attributes
4 patient address attributes
122 date attributes
5 telephone number attributes
91 other unique ID attributes
Names, addresses, and telephone numbers should be suppressed from the data. Dates can be kept accurate to the year (a future BDC medical image de-identification RFC will address improving this approach for longitudinally acquired imaging studies). The other unique IDs can either be suppressed or masked in a way that prevents their original values from being recovered; the specifics of how these unique IDs will be masked will come in a separate RFC that describes the masking procedures. Additionally, there are 26 attributes that contain various forms of free text, such as comments, notes, labels, and text strings. Identifying information may be written in these attributes, so they should be suppressed to prevent the leakage of identifying information.
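A minimal sketch of these rules applied to a single DICOM header is shown below, using the open-source pydicom library. The handful of attributes listed is only a sample of the full profile, the file paths are placeholders, and masking of unique IDs is deliberately omitted because it is the subject of a separate RFC; this is not the BDC de-identification tool itself.

```python
# Minimal sketch (not the BDC tool): apply a sample of the baseline rules to one
# DICOM file with pydicom. Unique-ID masking is out of scope here (separate RFC).
import pydicom

SUPPRESS = [
    "PatientName", "PatientAddress", "PatientTelephoneNumbers",  # names/addresses/phones
    "ImageComments", "StudyComments",                            # free-text attributes
]

def deidentify(path_in: str, path_out: str) -> None:
    ds = pydicom.dcmread(path_in)

    # Suppress directly identifying and free-text attributes (sample list only).
    for keyword in SUPPRESS:
        if keyword in ds:
            delattr(ds, keyword)

    # Generalize dates: keep them accurate to the year only (set month/day to Jan 1).
    def generalize_dates(dataset, elem):
        if elem.VR == "DA" and isinstance(elem.value, str) and len(elem.value) >= 4:
            elem.value = elem.value[:4] + "0101"   # YYYYMMDD -> YYYY0101

    ds.walk(generalize_dates)
    ds.save_as(path_out)

deidentify("visit1_raw.dcm", "visit1_deid.dcm")   # placeholder file names
```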
The other attributes detailed in the DICOM Part 15 standard do not necessarily require mitigation for compliance with HIPAA Safe Harbor. However, if they do not have analytic usage, it is recommended to mitigate them according to the specifications detailed in the DICOM Part 15 standard in order to decrease the risk of re-identification represented by indirectly identifying fields not mentioned in HIPAA Safe Harbor.
Image pixel data can contain PHI, such as patient names, dates of birth, and hospital or imaging center names; this is often encountered in ultrasound (echo) imaging. This information can be shown either in labels on images, which usually occupy pre-specified areas, or in the form of burned-in text, which can appear anywhere on the image. Any identifying information contained within pixel data should be removed before the data is made available to researchers.
Methods for removing identifying information from image pixel data include the following:
Masking through opaque boxes over parts of the image
AI-assisted removal of identifying information, deploying optical character recognition (OCR)
Deletion of images from the dataset that contain identifying information
Image pixel de-identification will be performed as a service, using existing third-party tools provided by DMC contractors. After de-identification, images will still require review to ensure that the process captured and removed all identifying information on the images. This is a necessary quality control step to ensure that there is no leakage of identifying information.
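As an illustration of the OCR-assisted approach listed above, the sketch below finds text regions in an exported image frame and covers them with opaque boxes. It uses the open-source pytesseract and OpenCV libraries as stand-ins; the actual DMC contractor tooling, its confidence thresholds, and its handling of DICOM pixel data will differ, and human review is still required afterwards.

```python
# Illustrative only: OCR-assisted masking of burned-in text on an exported image
# frame, using pytesseract + OpenCV as stand-ins for the contractor tools.
import cv2
import pytesseract
from pytesseract import Output

def mask_burned_in_text(path_in: str, path_out: str, min_conf: int = 60) -> int:
    """Cover detected text regions with opaque boxes; return the number of boxes drawn."""
    img = cv2.imread(path_in)
    data = pytesseract.image_to_data(img, output_type=Output.DICT)

    boxes = 0
    for i, text in enumerate(data["text"]):
        conf = int(float(data["conf"][i]))
        if text.strip() and conf >= min_conf:        # likely burned-in text
            x, y, w, h = (data[k][i] for k in ("left", "top", "width", "height"))
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 0), thickness=-1)
            boxes += 1

    cv2.imwrite(path_out, img)
    return boxes

# After masking, every frame still needs human review before release.
print(mask_burned_in_text("echo_frame.png", "echo_frame_masked.png"))  # placeholder paths
```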
Metadata associated with images, such as filenames and file paths, can often include unique IDs and dates of medical events. This information is important for associating imaging data correctly with other types of data for linkage, processing, and analysis; however, it can also present a risk of leaking identifying information from de-identified data files. To prevent that from happening, the following rules should be followed:
Folder names should only include the study name and associated visit number, and no further information
e.g., for the first visit of the MESA study, the folder should be named MESA_V1
Image filenames are to be set to the following format: STUDYNAME_TYPE_VISITNN_YYYYMMDD_SEQ, where:
VISITNN: “VISIT” + visit number (the label “VISIT” is included so the investigator knows what the number refers to)
YYYYMMDD: the AcquisitionDate with month and day set to 01 (i.e., YYYY0101), where YYYY is the year of acquisition
SEQ: a sequence number to ensure the filename is unique
e.g., MESA_ECG_VISIT05_20220101_999.xml
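A minimal sketch of how a compliant filename could be composed is shown below; the function name and the zero-padding choices are hypothetical conventions for illustration, and the study and type vocabulary would come from the actual submission.

```python
# Minimal sketch: compose a de-identified image filename of the form
# STUDYNAME_TYPE_VISITNN_YYYYMMDD_SEQ. Function name and padding are hypothetical.
def build_filename(study: str, dtype: str, visit: int, year: int, seq: int, ext: str) -> str:
    visit_part = f"VISIT{visit:02d}"   # "VISIT" label + zero-padded visit number
    date_part = f"{year}0101"          # acquisition date generalized to Jan 1 of the year
    return f"{study}_{dtype}_{visit_part}_{date_part}_{seq:03d}.{ext}"

print(build_filename("MESA", "ECG", 5, 2022, 999, "xml"))
# -> MESA_ECG_VISIT05_20220101_999.xml
```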
The risks presented by using the de-identification methods detailed in this RFC are as follows:
HIPAA Safe Harbor, while an accepted standard for de-identification, does not cover all potentially identifying attributes (leaving out attributes such as race, employment, diagnoses, procedures, and treatments). Data de-identified under HIPAA Safe Harbor therefore carries a residual risk of re-identification.
Automated imaging de-identification solutions are not 100% accurate, leaving the potential for small amounts of identifying information to be retained.
Data made available through BDC is provided for research purposes to investigators who should not have ulterior motives to perform re-identification. HIPAA Safe Harbor represents a standard that has been in use for over 20 years, so the risks presented by using that standard are well understood and considered acceptable by BDC. The risk presented by leakage of identifying information from imaging data can be mitigated through human review of de-identified images to ensure that all identifying information has been removed.
In the event that PHI is discovered in de-identified imaging data in BDC, such data shall be taken offline and checked to confirm removal of the offending PHI before being posted again on BDC. In such cases, the data submitter shall be informed of the incident.
Local vs. Cloud-based Image De-Identification
Depending on the capabilities of the de-identification tool and the legal and logistical requirements for access to original identifiable images, de-identification may be done locally at the data-generating site or through a central cloud-based service. Although the latter is often more efficient (semi-automated and scalable), the transfer of identifiable (PHI-containing) images to a central cloud may require agreements between the data provider (submitter) and the de-identification service provider, stipulated through execution of a Data Transfer Agreement (DTA). Details of the image de-identification process that will be used will be provided in a future RFC.
Imaging Data Type | Conventional Formats
Radiologic (X-ray, PET/CT, MRI, ultrasound) | DICOM (Digital Imaging and Communications in Medicine)
Cardiac ECG | XML
Digital Pathology | Proprietary TIFF and DICOM Pathology