Article
Peer-Review Record

A Markov Chain Replacement Strategy for Surrogate Identifiers: Minimizing Re-Identification Risk While Preserving Text Reuse

Electronics 2025, 14(19), 3945; https://doi.org/10.3390/electronics14193945
by John D. Osborne 1,*, Andrew Trotter 1, Tobias O’Leary 1, Chris Coffee 1, Micah D. Cochran 1, Luis Mansilla-Gonzalez 1, Akhil Nadimpalli 1, Alex McAnnally 1, Abdulateef I. Almudaifer 2, Jeffrey R. Curtis 1, Salma M. Aly 1 and Richard E. Kennedy 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 7 September 2025 / Revised: 1 October 2025 / Accepted: 1 October 2025 / Published: 6 October 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper makes a significant contribution to the field of clinical text de-identification. The Markov strategy offers a pragmatic, evidence-backed improvement over existing surrogate replacement methods. With further refinement (parameter tuning, broader utility studies, and human validation), this approach could set a new standard for PHI de-identification in research and healthcare.

There are certain improvements that you can make.

For example: reliance on the Faker library may introduce risks if attackers know the surrogate pool. Can you propose a concrete mitigation beyond “held-out values”?

What about including human evaluation for validation?

Also, a concrete example of the type of leakage that can occur, shown in a figure, would help readers.

The replacement can sometimes generate meaningless sentences. Have you encountered such examples?

 

Author Response

Comments 1: The paper makes a significant contribution to the field of clinical text de-identification. The Markov strategy offers a pragmatic, evidence-backed improvement over existing surrogate replacement methods. With further refinement (parameter tuning, broader utility studies, and human validation), this approach could set a new standard for PHI de-identification in research and healthcare.

Response 1: We thank the reviewer for their kind words, and we agree that the work would benefit from broader utility studies and human validation. While this paper focuses on theoretical and statistical validation of the Markov surrogate substitution method and includes evaluation on a few select information extraction corpora, we have added a final sentence to the paper to reflect these shared concerns:

“While our work here shows the theoretical and statistical utility of the Markov-based surrogate substitution method, further evaluation of the software implementation on a wider range of benchmarks, including human validation, would be useful, as would an assessment of more complex re-identification attempts, including LLM-based attacks.”

Comments 2: There are certain improvements that you can make. For example: reliance on the Faker library may introduce risks if attackers know the surrogate pool. Can you propose a concrete mitigation beyond “held-out values”?

Response 2: We agree with the reviewer that this improvement is needed. We have clarified our original text, which used the confusing term “held-out values,” and we are working on a mitigation. To reflect this, we modified Section 4.3 (Limitations) on page 15, paragraph 3 to include:

 “This vulnerability will be mitigated by private, user-provided pools of PHI for use as replacement values in an updated version of BRATsynthetic.”

We believe this offers the best balance between security through obscurity (where the source code and pools are not published) and open-source transparency, where users control the injection of secret, unpublished PHI pools.

Comments 3: What about including human evaluation for validation?

Response 3: We did not perform a quantitative assessment as part of this work; however, qualitative reports from BRATsynthetic developers indicated that re-identification was most challenging with the Markov approach. We address this in the sentence added to the conclusion as part of our response to Comment 1.

Comments 4: Also, a concrete example of the type of leakage that can occur, shown in a figure, would help readers.

Response 4: We agree and added Figure 2 on page 5 to provide an example.

 

Reviewer 2 Report

Comments and Suggestions for Authors

This paper systematically studies and evaluates the application of a Markov chain-based surrogate identifier generation strategy to medical text de-identification, which has strong novelty and practical value. Through multiple sets of experiments, the study compares the privacy-utility trade-offs of different replacement strategies (Consistent, Random, Markov), filling a gap in the empirical evaluation of such generative de-identification methods. The structure of the paper is clear, the experimental design is sound, and the data presentation is detailed (including multiple charts and tables), providing an important reference for the privacy protection of clinical texts. However, there is still room for improvement in some details, in the quantitative analysis of safety, and in comparison with the latest research. Here are specific questions and suggestions for revision to enhance the rigor and impact of the paper:
1. Inadequate description of experimental setup and reproducibility (lines 110–137):
Although the paper mentions the use of multiple real clinical corpora such as UAB and MIMIC, and explains the IRB approval information, it does not describe in detail the specific preprocessing steps, the manual verification process, or the parameter configuration of BRATsynthetic in actual operation (such as the number of states of the Markov chain, or the specific version and generation rules of the Faker library). This can affect the reproducibility of the experiment. It is recommended to add data preprocessing details, manual review criteria, and the specific scope of code and corpus disclosure (e.g., providing sample ANN files or generation scripts) in Materials and Methods.
2. Privacy breach assessment lacks quantitative analysis of adversarial attack scenarios (Section 3.1):
Although the authors assessed the risk of document-level leakage under different strategies using the MSRS (Maximum Surrogate Repeat Size) and FNER indicators, they did not consider more complex attack models (such as context-based inference attacks or correlation attacks combining multi-source information). It is recommended to include quantitative analysis of potential attack models (such as parrot attacks or LLM-enhanced re-identification) in the "Discussion" section, and to refer to recent studies such as Patsakis & Lykoussas (2023) or Simancek & Vydiswaran (2024) to further verify the robustness of the Markov strategy using ROC curves or attack success rate (ASR).
3. The evaluation of information extraction tasks can be further expanded (Section 3.2 and Tables 7–9):
Although this paper compares the performance of various NLP tools (MedSpacy, BioBERT, Averbis SVM) under different replacement strategies, the selected tasks (such as NER, subject recognition, etc.) are still limited and do not cover more complex downstream applications (such as relation extraction, event detection, or temporal reasoning). It is recommended to include more clinical information extraction tasks (such as drug-disease relation identification or clinical timeline construction) in the "Assessment of HIPS Strategy", and to provide statistical test results (such as paired t-tests or ANOVA) to strengthen the persuasiveness of the conclusions.
4. Insufficient comparison with the latest de-identification efforts (Sections 1.1–1.3):
The literature review section covers most mainstream tools (such as Presidio, CliniDeID, nference De-Id, etc.), but does not fully discuss the emerging large language model (LLM)-based de-identification methods (such as GPT-4-assisted pseudonymization or differential privacy text generation) in the past three years. It is recommended to add a comparison with the latest methods in the "Introduction" or "Discussion", such as discussing the comparative advantages and limitations of BRATsynthetic with LLM-based methods in terms of generation quality, privacy protection strength, and computational efficiency.
5. Limitations of BRATsynthetic are not clearly stated (Section 4.5):
Although the author mentions that the tool's reliance on the Faker library may lead to bias in the distribution of synthetic data, it does not discuss its computational overhead and scalability in actual deployment (such as runtime performance during large-scale corpus processing). It is recommended to supplement a time complexity analysis (which can be combined with Table 6) in the "Limitations" section and discuss the feasibility of multi-document patient-level replacement. Additionally, it is recommended to provide alternative synthetic data sources, such as custom dictionaries or localized generation models, to enhance the tool's adaptability.
Overall suggestion: This paper is outstanding in terms of methodological innovation and experimental comprehensiveness, especially in its analysis of privacy-utility trade-offs. If the above problems can be addressed, especially by strengthening the quantitative analysis of safety, expanding the evaluation of downstream tasks, and comparing with recent SOTA methods, the academic value and practical impact of the paper will be significantly enhanced. Minor Revision is recommended.

Author Response

Comment 1. Inadequate description of experimental setup and reproducibility (lines 110–137):
Although the paper mentions the use of multiple real clinical corpora such as UAB and MIMIC, and explains the IRB approval information, it does not describe in detail the specific preprocessing steps, the manual verification process, or the parameter configuration of BRATsynthetic in actual operation (such as the number of states of the Markov chain, or the specific version and generation rules of the Faker library). This can affect the reproducibility of the experiment. It is recommended to add data preprocessing details, manual review criteria, and the specific scope of code and corpus disclosure (e.g., providing sample ANN files or generation scripts) in Materials and Methods.

Response 1: We agree with the reviewer that our description was inadequate for reproducibility and that some details should have been included or better organized. We have edited Section 2.1 “Software Implementation” on page 3 and added both a sample .ann file (Figure 1, page 4) and a sample .txt file (Figure 2, page 5) to the Methods, as suggested. We have also moved the reference to our code on GitHub (which contains some of these implementation and evaluation details) from the Contributions section to Methods Section 2.1, where it is more appropriate, and mention the inclusion of the evaluation. The updated lines 110–137 (but not the figures) are shown below:

“We make our tool and source code publicly available at https://github.com/uabnlp/BRATsynthetic, with the evaluation in its own subdirectory to allow for replicability. BRATsynthetic uses the BRAT [30] annotation format and the associated text file (if available) to create surrogate PHI independent of the software originally used to identify that PHI. BRATsynthetic generates realistic text for 27 categories of PHI as described in the Resynthesis Elements section below. Dates and ages are offset by a random number. PHI categories are stored as entities (“T” labels) in BRAT files; associated non-PHI annotations (attributes “A” and events “E”) may also be stored in these files, and BRATsynthetic properly maintains the mapping of non-PHI annotations to PHI categories after synthetic replacement. Prior to replacement, we normalize line endings, require paired .txt/.ann files (see Figures 1 and 2 for examples), restrict substitutions to the PHI tag set, and update spans in reverse-offset order to preserve alignment. Synthetic values are produced with Faker 13.7 under deterministic seeding from the configuration, with format- and case-preserving rules for IDs, names, and codes. Each PHI category uses a custom Maker class with unique rules for handling regular expression patterns, edge cases, and Faker function calls. Within-document mention repetition follows a two-state first-order Markov policy (reuse vs. resample) as shown in Figure 3. Automated and manual verification includes counting the number of entities, events, and attributes before and after synthetic replacement, and ensuring spans correctly align for non-PHI annotations with their corresponding synthetically replaced PHI categories.”
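For readers unfamiliar with the two-state reuse-vs-resample policy described in the quoted paragraph, it can be sketched as follows. This is an illustrative sketch only, not BRATsynthetic's actual code: the function names, the 0.5 transition probability, and the toy surrogate generator standing in for Faker are all assumptions for the example.

```python
import random

def markov_surrogates(mentions, make_surrogate, p_resample=0.5, seed=13):
    """Sketch of a two-state, first-order Markov replacement policy for
    repeated PHI mentions within one document: at each repeated mention,
    either reuse the previous surrogate or resample a fresh one.
    `mentions` is the ordered list of original PHI strings and
    `make_surrogate` produces a new synthetic value (e.g., via Faker).
    All names and parameters here are hypothetical."""
    rng = random.Random(seed)   # deterministic seeding, as in the paper
    current = {}                # original mention -> surrogate in use
    out = []
    for m in mentions:
        # First occurrence always samples; later occurrences resample
        # with probability p_resample, otherwise reuse the prior value.
        if m not in current or rng.random() < p_resample:
            current[m] = make_surrogate()
        out.append(current[m])
    return out

# Toy surrogate generator standing in for a Faker name provider:
names = iter(["Alice Ray", "Bob Chu", "Cara Lee", "Dan Fox"])
result = markov_surrogates(["John Doe", "John Doe", "John Doe"],
                           lambda: next(names))
```

Because the generator is seeded, the same configuration always yields the same replacement sequence, which is what makes the substitution auditable and reproducible.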

Comment 2: Privacy breach assessment lacks quantitative analysis of adversarial attack scenarios (Section 3.1): Although the authors assessed the risk of document-level leakage under different strategies using the MSRS (Maximum Surrogate Repeat Size) and FNER indicators, they did not consider more complex attack models (such as context-based inference attacks or correlation attacks combining multi-source information). It is recommended to include quantitative analysis of potential attack models (such as parrot attacks or LLM-enhanced re-identification) in the "Discussion" section, and to refer to recent studies such as Patsakis & Lykoussas (2023) or Simancek & Vydiswaran (2024) to further verify the robustness of the Markov strategy using ROC curves or attack success rate (ASR).

Response 2: We agree with the reviewer that assessing the Markov surrogate strategy against other potential attack models would be useful. We are aware of the good work done by both Patsakis & Lykoussas (2023) and Simancek & Vydiswaran (2024), which we cite in the Introduction and Discussion, respectively. We have added a final sentence to the Conclusions section, shown below, that points out the need for more advanced threat model testing.

“While our work here shows the theoretical and statistical utility of the Markov-based surrogate substitution method, further evaluation of the software implementation on a wider range of benchmarks, including human validation, would be useful, as would an assessment of more complex re-identification attempts, including parrot [26] or LLM-based attacks [27].”


Comment 3. The evaluation of information extraction tasks can be further expanded (Section 3.2 and Tables 7–9):
Although this paper compares the performance of various NLP tools (MedSpacy, BioBERT, Averbis SVM) under different replacement strategies, the selected tasks (such as NER, subject recognition, etc.) are still limited and do not cover more complex downstream applications (such as relation extraction, event detection, or temporal reasoning). It is recommended to include more clinical information extraction tasks (such as drug-disease relation identification or clinical timeline construction) in the "Assessment of HIPS Strategy", and to provide statistical test results (such as paired t-tests or ANOVA) to strengthen the persuasiveness of the conclusions.

Response 3: We agree with the reviewer that more tasks would be useful; the second paragraph of Section 4.3 “Limitations” recognizes this limitation. The availability of real clinical corpora with identified PHI that are also annotated for the tasks mentioned would facilitate this, but we used the resources we had on hand.


Comment 4. Insufficient comparison with the latest de-identification efforts (Sections 1.1–1.3):
The literature review section covers most mainstream tools (such as Presidio, CliniDeID, nference De-Id, etc.), but does not fully discuss the emerging large language model (LLM)-based de-identification methods (such as GPT-4-assisted pseudonymization or differential privacy text generation) in the past three years. It is recommended to add a comparison with the latest methods in the "Introduction" or "Discussion", such as discussing the comparative advantages and limitations of BRATsynthetic with LLM-based methods in terms of generation quality, privacy protection strength, and computational efficiency.

Response 4: The following text was added to page 15 of the Discussion under Section 4.5 “Comparison to LLM-based De-identification Methods” to address this concern:

LLM-based de-identification methods can produce contextually natural text, often exceeding template-driven systems in generation quality. However, they are computationally intensive and more expensive to operate, and they raise concerns about auditability and potential privacy leakage if safeguards are not rigorously applied. By contrast, BRATsynthetic performs inexpensive, efficient, and transparent deterministic surrogate substitution without specialized hardware, although at the cost of more constrained narrative variety. In practice, these approaches should be viewed as complementary: LLMs excel in realism and adaptability, while BRATsynthetic offers practical scalability and predictable control, and its relative weakness in creating smooth transitions is less problematic in clinical documentation, where narratives are typically fragmented.


Comment 5. Limitations of BRATsynthetic are not clearly stated (Section 4.5):
Although the author mentions that the tool's reliance on the Faker library may lead to bias in the distribution of synthetic data, it does not discuss its computational overhead and scalability in actual deployment (such as runtime performance during large-scale corpus processing). It is recommended to supplement a time complexity analysis (which can be combined with Table 6) in the "Limitations" section and discuss the feasibility of multi-document patient-level replacement.

Response 5: In terms of computational overhead and scalability, we have not performed a formal complexity analysis, but we have not had trouble scaling, since more than one instance of a surrogate pool can be used. We added the following to the Limitations section on page 16, paragraph 3: “Finally, we have not performed a complexity or runtime analysis on the Faker library, but scaling to large datasets is feasible, as shown in Table 6.”

Comment 6. Additionally, it is recommended to provide alternative synthetic data sources, such as custom dictionaries or localized generation models, to enhance the tool's adaptability.

Response 6: We agree and added this sentence to the Limitations section on page 16, paragraph 3: “This vulnerability will be mitigated by private, user-provided pools of PHI for use as replacement values in an updated version of BRATsynthetic.”

Comment 7: Overall suggestion: This paper is outstanding in terms of methodological innovation and experimental comprehensiveness, especially in its analysis of privacy-utility trade-offs. If the above problems can be addressed, especially by strengthening the quantitative analysis of safety, expanding the evaluation of downstream tasks, and comparing with recent SOTA methods, the academic value and practical impact of the paper will be significantly enhanced. Minor Revision is recommended.

Response 7: We agree and will work toward such enhancements. We thank Reviewer 2 for their thorough and helpful review.
