Peer-Review Record

A Privacy-Preserving and Standard-Based Architecture for Secondary Use of Clinical Data

Information 2022, 13(2), 87; https://doi.org/10.3390/info13020087

by Mario Ciampi *, Mario Sicuranza and Stefano Silvestri

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Submission received: 24 December 2021 / Revised: 28 January 2022 / Accepted: 10 February 2022 / Published: 13 February 2022

(This article belongs to the Special Issue Health Data Information Retrieval)

Round 1

Reviewer 1 Report

General remarks
===============
The manuscript "A Novel ETL Architecture for the Secondary Use of Health Data" presents an architecture for materialized integration of structured and unstructured data into a standardized representation (HL7 FHIR), from which the integrated information can be retrieved through a standard interface and re-used for secondary purposes. To access and provide the valuable information contained in unstructured text (e.g., discharge letters), the architecture utilizes NLP (Natural Language Processing) methods. To facilitate sharing these data with other institutions in accordance with national and international data protection regulations, the architecture provides means for data anonymization and pseudonymization.

The problems addressed by this proposal are highly relevant for federated biomedical research infrastructures. The structure of the paper is fine and it is well written. The overall architecture of the proposed solution is very comprehensible. However, I have some concerns regarding the novelty, generalizability and feasibility of the proposal. Some of these concerns may already be addressed by the existing architecture and may just not be described in sufficient detail in the manuscript. Therefore, the following questions should be answered in a new version of the manuscript before I can recommend accepting it:

Novelty
-------
- Based on what is recognizable in Figure 1 (large parts of the text are not readable, see below), the presented architecture is a standard OLAP architecture, in which data are extracted from heterogeneous sources, transformed into a common (often multidimensional) data model suited for analyses, and then loaded into a central data store. The authors propose to use a FHIR-based information model instead of a multidimensional data model, and a FHIR storage system instead of a warehouse, which can still be considered a standard architecture. If the novelty of the architecture lies in the combination of components used (e.g., [11, 21, 26]), or in the integration of de-identification methods into the ETL process, this should be elaborated more clearly and compared with existing approaches (see next point).
- Moreover, some sort of comparison with other applications or prototypes could help to better understand the quality of the proposal. Some of the problems outlined for ETL processes that integrate heterogeneous data in a privacy-preserving manner have already been solved, or at least addressed, by other architectures, e.g., OHDSI. What improvements does the proposed system architecture provide in comparison to other architectures and their components? See, e.g., https://www.ohdsi.org/analytic-tools/whiterabbit-for-etl-design, https://www.ohdsi.org/web/wiki/doku.php?id=documentation:software:usagi, https://doi.org/10.1093/jamia/ocw001.

Generalizability
----------------
- Is the source code of the implementation publicly available? Can it be re-used by other institutions or research networks?
- How can the proposed architecture be used to provide ETL-processes for i2b2- and tranSMART-Warehouses or for Warehouses based on the OMOP-CDM?

Feasibility
-----------
- Is it feasible to use a FHIR-Store for analytical Ad-Hoc-Queries on large amounts of data? Typically, data in a FHIR-Store (or FHIR-based data lake) would be transformed to a schema optimized for such analyses; this should be made clearer in the manuscript.
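
To illustrate the kind of transformation meant here, the following is a minimal sketch that projects nested FHIR Observation resources onto flat rows before aggregating them. The resources, the helper names `flatten_observation` and `mean_by_code`, and the choice of columns are assumptions made for illustration; they are not taken from the manuscript.

```python
from statistics import mean

# Hypothetical, hand-written FHIR Observation resources (illustrative only).
observations = [
    {
        "resourceType": "Observation",
        "subject": {"reference": "Patient/p1"},
        "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4"}]},
        "valueQuantity": {"value": 72, "unit": "beats/minute"},
    },
    {
        "resourceType": "Observation",
        "subject": {"reference": "Patient/p2"},
        "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4"}]},
        "valueQuantity": {"value": 78, "unit": "beats/minute"},
    },
    {
        "resourceType": "Observation",
        "subject": {"reference": "Patient/p1"},
        "code": {"coding": [{"system": "http://loinc.org", "code": "8480-6"}]},
        "valueQuantity": {"value": 120, "unit": "mm[Hg]"},
    },
]


def flatten_observation(resource):
    """Project one nested Observation onto a flat row suited to tabular analysis."""
    coding = resource["code"]["coding"][0]
    quantity = resource.get("valueQuantity", {})
    return {
        "patient": resource["subject"]["reference"],
        "code": coding["code"],
        "value": quantity.get("value"),
        "unit": quantity.get("unit"),
    }


def mean_by_code(rows):
    """Ad-hoc aggregate over the flat rows: mean value per observation code."""
    grouped = {}
    for row in rows:
        if row["value"] is not None:
            grouped.setdefault(row["code"], []).append(row["value"])
    return {code: mean(values) for code, values in grouped.items()}


rows = [flatten_observation(o) for o in observations]
averages = mean_by_code(rows)
```

Once the rows are flat, the aggregate no longer has to traverse nested resources, which is the reason such a projection is usually done before large analytical queries.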

Remarks regarding data protection aspects
=========================================
- In some European countries, local regulations permit the secondary use of clinical data for research and training purposes, given that the data does not leave the custody of the hospital – without the explicit obligation to de-identify these data. Therefore, the authors might restrict statements like "compliance with these laws requires that only de-identified clinical and personal data can be processed for secondary analyses" or "in compliance with these laws, only de-identified data can be processed" by mentioning that this is the case "when data are shared or exchanged across different institutions".
- "In fact, such laws, along with the ones of the various Countries, establish that the secondary use of clinical data is allowed only avoiding future associations with the people they refer to": this is only correct as long as the people the data refer to have not consented to such associations. Consider the example of national or international clinical registries, which must maintain associations with the people the data refer to while remaining compliant with privacy regulations.
- In Section 2.1, the six main principles addressed by the GDPR are highlighted. It would be helpful to make clear which of these principles are addressed by the article.

Remarks regarding Section 2.2
======================================
- ETL processes are often used in data warehousing where the warehouse is loaded in the loading phase, not in the extraction phase. To avoid confusion, I suggest not to use the term "data warehouse" in line 111.
- I would suggest renaming the "repository" in line 111 to "staging area" to avoid confusion with the use of the same term in line 123.
- Please clarify what is meant by "average personal data" in line 135.

Remarks regarding Section 2.3
=============================
- To me, it is not quite clear why the term "pseudonymous data" is used in line 147 instead of "anonymous data".
- It is common in the literature (see, e.g., https://doi.org/10.1145/1749603.1749605, https://doi.org/10.1016/j.jbi.2014.06.002) to distinguish, among other things (such as methods for measuring data utility), between
+ methods for quantifying privacy threats (e.g., identity disclosure, attribute disclosure), typically by providing specific (often combinable) privacy models (e.g., k-anonymity, l-diversity) for estimating (residual) privacy risks
+ algorithms (e.g., Incognito, Anatomize) which transform the original data for reducing the privacy risks estimated by these models. These algorithms typically employ basic data transformations (e.g., generalization, suppression, aggregation, shuffling).
- The overview of de-identification techniques in lines 156-192 mixes these methods, specifically
+ the privacy models k-anonymity, l-diversity (please note the single "l" throughout the manuscript), and t-closeness (t-proximity is unknown to me; this might be a translation error) are *not* generalization techniques; rather, they can be used *in combination with generalization techniques* to anonymize data.
+ aggregation (e.g., transforming all age values in a group of records to their arithmetic mean "55.3" ) is *not* a generalization technique (e.g., transforming each age value to a predefined range of values "[51-60]")
- Please note that "data masking" is widely used as a container term for applying elementary data transformations while ignoring quantification of (residual) privacy risks and data usefulness. Therefore, I would suggest using the term "suppression" or "character/record masking" (whatever the authors consider more suitable), instead of "data masking".
- Listing pseudonymization on the same level as basic data transformations like generalization or randomization does not adequately reflect the role of pseudonymization in data protection concepts, nor its complexity. I suggest deleting this entry, as it is already appropriately explained at the beginning of this section (lines 126-138).
- Reference [14] in the manuscript considers randomization as "the process of assigning participants to treatment and control groups, assuming that each participant has an equal chance of being assigned to any group". The authors might want to use the term "shuffling" instead and check the suitability of the reference.
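
The distinction drawn above between a privacy model (k-anonymity), generalization, and aggregation can be sketched as follows. This is only an illustration under assumed data: the records, the quasi-identifiers `age` and `zip`, and the function names are hypothetical and do not come from the manuscript or from any specific anonymization tool.

```python
from collections import Counter
from statistics import mean

# Hypothetical records; "age" and "zip" are treated as quasi-identifiers.
records = [
    {"age": 52, "zip": "80131", "diagnosis": "I10"},
    {"age": 56, "zip": "80131", "diagnosis": "E11"},
    {"age": 58, "zip": "80131", "diagnosis": "I10"},
    {"age": 63, "zip": "80125", "diagnosis": "J45"},
    {"age": 66, "zip": "80125", "diagnosis": "I10"},
]


def generalize_age(age, width=10):
    """Generalization: replace an exact age with a predefined range such as [51-60]."""
    low = (age - 1) // width * width + 1
    return f"[{low}-{low + width - 1}]"


def aggregate_age(group):
    """Aggregation: replace all ages in a group with their arithmetic mean."""
    return round(mean(r["age"] for r in group), 1)


def smallest_class(rows, quasi_identifiers):
    """Privacy model: the data are k-anonymous for k = size of the smallest equivalence class."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(classes.values())


# The transformation changes the data; the model only measures the result.
generalized = [{**r, "age": generalize_age(r["age"])} for r in records]
k_before = smallest_class(records, ["age", "zip"])
k_after = smallest_class(generalized, ["age", "zip"])
group_mean = aggregate_age(records[:3])
```

Here `smallest_class` quantifies the residual risk and does not alter any value, whereas `generalize_age` and `aggregate_age` are two different transformations that could each be applied to reach a target k.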

Remarks regarding Section 4
===========================
- Large parts of the text in Figure 1 are not readable. Please provide the figure in higher quality.
- It is not clear what is meant by "extrapolating and aggregating information even from single data" (lines 279f). These capabilities should be explained in more detail.
- The authors should check whether citation [34], which refers to the estimation of re-identification risks (line 284), is appropriate in the context of NLP.
- Are the methods used for pseudonymization and anonymization developed by the authors, or do they use existing implementations (e.g., [21])?

Remarks regarding Section 5
===========================
- It would be very helpful if the authors could describe in more detail how the mapping of the input data representation to FHIR resources is specified, particularly for structured input data like HL7 CDA and documents with a proprietary structure. Are the mappings hard-coded in the specific UDFs mentioned in line 335, or are they configurable?
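
As an illustration of what a configurable (rather than hard-coded) mapping could look like, here is a minimal sketch that turns a row from a proprietary table into a FHIR-style Observation using a declarative mapping table. The column names, the `MAPPING` table, and the helper functions are hypothetical; the sketch does not describe the authors' UDFs, and it omits the `code` element that a complete Observation would carry.

```python
# Hypothetical declarative mapping: source column -> (dotted target path, converter).
MAPPING = {
    "pid": ("subject.reference", lambda v: f"Patient/{v}"),
    "result": ("valueQuantity.value", float),
    "uom": ("valueQuantity.unit", str),
    "taken": ("effectiveDateTime", str),
}


def set_path(resource, dotted_path, value):
    """Write value at a dotted path, creating intermediate dictionaries as needed."""
    *parents, leaf = dotted_path.split(".")
    node = resource
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def to_observation(row, mapping=MAPPING):
    """Build an Observation-like dictionary from one source row, driven by the mapping."""
    resource = {"resourceType": "Observation", "status": "final"}
    for column, (path, convert) in mapping.items():
        if row.get(column) is not None:  # skip columns that are missing or empty
            set_path(resource, path, convert(row[column]))
    return resource


# Hypothetical source row from a proprietary laboratory table.
row = {"pid": "p1", "result": "5.4", "uom": "mmol/L", "taken": "2021-12-24"}
observation = to_observation(row)
```

With this design, supporting a new source layout means supplying a different mapping table instead of changing the transformation code.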

Remarks regarding Section 6
===========================
Please describe in more detail:
- the document types of heterogeneous data used for your preliminary tests (e.g., diagnostic reports, discharge summaries)
- the number of different data sources, number of entities (documents, patients, cases)
- what quality checks have been / are performed on these data
- the frequency of entities that could not be transformed due to quality issues

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This work represents an important contribution to an important problem, namely data management.

We have lots of data, but very often we cannot obtain results from them because of issues such as confidentiality or the fact that they come from different software systems.

The manuscript is well written and easy to follow, even though it covers technical aspects.

Some things to improve:

The quality of Figures 1 and 2 should be improved. They appear blurred, especially Figure 1, where some parts are difficult to read or see.
In Figure 4, the Y axis has no label; what is the unit of measure?

A frequent problem with databases is that some items may not be filled in. How is this managed?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The authors have presented an ETL Architecture for the Secondary Use of Health Data.

I have the following major concerns:

What is the novelty of this research? It is simply an enhancement of the previous study. Please change the title accordingly.

Explain in detail the need for this research, and highlight the gap in the literature review.

What are the major contributions of this research?

Explain the methodology in detail.

The analysis section is weak and needs further enhancement.

In Fig. 3, the authors present an incorrect E-R schema of the extracted data for the case study. There should be 1:M and N:M relationships; please reconsider Fig. 3.

Explain the dataset used.

Include the accuracy in the abstract, and mention future work in the conclusion.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The previous comments were properly answered and the corrections addressed. Thus, the paper is ready to be accepted.

Reviewer 2 Report

The authors have improved the figures and made some other improvements.

Reviewer 3 Report

I agree with the corrections. Accepted!