Large Language Models for Electronic Health Record De-Identification in English and German
Abstract
1. Introduction
2. Related Work
3. Materials and Methods
3.1. De-Identification
3.2. Protected Health Information Categories
3.3. De-Identification Datasets
3.4. In-Context Learning and Prompt Engineering
3.5. Full Fine-Tuning for LLMs
3.6. Large Language Models
- BERT [19] is an encoder-only LLM comprising an embedding module, a stack of transformer encoders, and a fully connected output layer [70]. It is pre-trained with joint masked language modeling and next-sentence prediction objectives [19] and has recently been used for privacy-preserving NLP tasks [14]. In this work, we fine-tuned a 110 M parameter BERTbase uncased model on the English N2C2 dataset.
- ClinicalBERT [77] is a BERT variant further trained on medical-domain text. In this work, we fine-tuned a 110 M parameter ClinicalBERT model on the English N2C2 dataset.
- DistilBERT [72] is a distilled version of BERT that is 40% smaller and 60% faster while retaining 97% of BERT’s performance, making it quicker and less memory-intensive to fine-tune than the original model. In this work, we fine-tuned a 66 M parameter DistilBERT model on the English N2C2 dataset.
- FLAN-T5 XXL [75] is an 11 B parameter encoder–decoder LLM with multilingual capabilities, instruction-fine-tuned on a wide range of NLP tasks. In this work, we used this model for in-context learning to de-identify the English N2C2 dataset.
- GPT-3.5 Turbo [76] is a decoder-only LLM that improves on GPT-3 [74] and belongs to the GPT model family developed by OpenAI (https://openai.com/, accessed on 27 January 2025). The model is closed-source and accessible only via an application programming interface (API) [70], and its precise parameter count is undisclosed [76]. In this work, we performed in-context learning on the English N2C2 dataset with GPT-3.5 Turbo version ‘gpt-3.5-turbo-0125’.
- GPT-4 [20] is a decoder-only LLM in the GPT family that far exceeds its predecessors across several benchmarks [70], offering state-of-the-art performance, an estimated 1.76 T parameters, multi-modal capabilities, and multilingual support. Like GPT-3.5 Turbo, it is closed-source and accessible only via APIs [20]. In this work, we performed in-context learning on both the English and German N2C2 datasets with GPT-4 version ‘gpt-4-0613’.
- GPT-4o [78] is a decoder-only LLM with cross-modal capabilities spanning video, audio, and text. It is a flagship, closed-source model in the GPT-4 family with improved performance on languages other than English and, like its predecessors, is accessible only via APIs [78]. In this work, we performed in-context learning on the German N2C2 and real-world German datasets with GPT-4o version ‘gpt-4o-2024-08-06’.
- LLaMA 3 [21] is a decoder-only, open-source model in the LLaMA family [73] that improves on its predecessors through higher-quality training data and a larger training scale [79]. In this work, we performed in-context learning on the English N2C2 dataset with the 8 B parameter LLaMA 3 model. (A minimal fine-tuning sketch for the BERT-family models follows this list.)
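As an illustration of the full fine-tuning setup used for the BERT-family models, the following minimal sketch frames de-identification as BIO token classification. The label set, the `train_ds`/`eval_ds` variables, and the learning rate are illustrative assumptions rather than our exact training script; batch size 1 and the epoch range follow the hyperparameters listed in Appendix A.

```python
# Minimal sketch (not the exact training script): full fine-tuning of a
# BERT-family checkpoint for de-identification as BIO token classification.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE"]  # assumed PHI tag set
checkpoint = "google-bert/bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

args = TrainingArguments(
    output_dir="deid-bert",
    learning_rate=2e-5,             # assumed value; see Appendix A
    per_device_train_batch_size=1,  # batch_size = 1, as in Appendix A
    num_train_epochs=5,             # epochs in {1, ..., 5}, as in Appendix A
)

# `train_ds` and `eval_ds` are assumed to be tokenized datasets whose label
# sequences are aligned to the word pieces produced by `tokenizer`.
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```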
3.7. Translation of the N2C2 Dataset
4. Experimental Setup
4.1. Data Pre-Processing
4.2. Large Language Model Settings
4.3. Experiments and Evaluation
5. Results
5.1. In-Context Learning
- The removal of tokens that explicitly contain parts of the original prompt.
- The removal of tokens containing medical terms, such as ‘pager’, ‘ultrasound’, or ‘nebulizer’.
- The removal of tokens that are medication names, such as ‘atenolol’, ‘hydroxychloroquine’, or ‘prednisone’.
- The removal of tokens that are medical conditions, such as ‘diabetes’.
- The removal of tokens that are incomplete name parts, such as ‘Dr.’ or ‘M.D.’.
- The removal of tokens that relate to gender.
- The removal of tokens that match a set of stop words.
- The removal of tokens that are floating-point numbers, percentages, fractions, or temperatures. (A minimal sketch of these filters follows this list.)
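The following minimal sketch shows how such removal rules can be applied to the PHI tokens predicted by a model. The word lists, the assumed instruction-prompt vocabulary, and the numeric regex are illustrative assumptions, not the exact rules used in our experiments.

```python
# Minimal sketch of the post-processing rules: predicted PHI tokens are
# dropped when they match one of the filters listed above.
import re

PROMPT_WORDS = {"de-identify", "identifier", "phi", "category"}   # assumed prompt terms
MEDICAL_TERMS = {"pager", "ultrasound", "nebulizer", "diabetes"}  # terms and conditions
MEDICATIONS = {"atenolol", "hydroxychloroquine", "prednisone"}
NAME_FRAGMENTS = {"dr.", "m.d."}
GENDER_TOKENS = {"male", "female"}
STOP_WORDS = {"the", "a", "an", "of", "and", "or"}
NUMERIC = re.compile(r"^\d+(\.\d+)?(%|/\d+|°[cf])?$", re.IGNORECASE)

def keep_token(token: str) -> bool:
    """Return True if a predicted PHI token survives every removal rule."""
    t = token.strip().lower()
    if t in PROMPT_WORDS | MEDICAL_TERMS | MEDICATIONS:
        return False  # prompt echoes, medical terms, medications, conditions
    if t in NAME_FRAGMENTS | GENDER_TOKENS | STOP_WORDS:
        return False  # incomplete name parts, gender tokens, stop words
    return not NUMERIC.match(t)  # floats, percentages, fractions, temperatures

predicted = ["Mustermann", "prednisone", "98.6", "15.10.2008", "Dr."]
print([tok for tok in predicted if keep_token(tok)])
# -> ['Mustermann', '15.10.2008']
```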
5.2. Fine-Tuning Large Language Models
5.3. De-Identification Results for the German N2C2 Dataset
5.4. Real-World Evaluation
5.5. Cost–Performance Trade-Offs for LLMs
6. Discussion
6.1. Limitations
6.2. Integration of Privacy-Enhancing Technologies
6.3. Deployment Considerations
6.4. Future Work Directions
7. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
API | Application programming interface |
BiLSTM | Bidirectional long short-term memory |
CD | Critical difference |
CNNs | Convolutional neural networks |
CPU | Central processing unit |
CRF | Conditional random fields |
DP | Differential privacy |
EHR | Electronic health record |
EU | European Union |
FL | Federated learning |
GDPR | General Data Protection Regulation |
GenAI | Generative artificial intelligence |
GPU | Graphics processing unit |
GRU | Gated recurrent units |
HIPAA | Health Insurance Portability and Accountability Act |
LLMs | Large language models |
LoRA | Low-rank adaptation |
LR | Learning rate |
LSTM | Long short-term memory |
MIST | MITRE Identification Scrubber Toolkit |
NER | Named-entity recognition |
NLP | Natural language processing |
PETs | Privacy-enhancing technologies |
PHI | Protected health information |
RAM | Random-access memory |
SVMs | Support vector machines |
UMLS | Unified Medical Language System |
US | United States |
Appendix A. Large Language Model Versions and Hyperparameters
- ‘temp.’ stands for temperature, which controls the sharpness of the next-token probability distribution.
- ‘top_p’ sets the cumulative probability threshold for nucleus sampling.
- ‘top_k’ restricts sampling to the k highest-probability tokens.
- ‘n’ sets the number of completions returned by GPT-family models.
- ‘seed’ is the random seed for sampling.
- ‘LR’ stands for the learning rate used for the BERT-family models.
- ‘batch_size’ defines the batch size for training and inference during full fine-tuning.
- ‘epochs’ defines the number of passes through the training set during full fine-tuning.
- BERT (‘google-bert/bert-base-uncased’).
- ClinicalBERT (‘medicalai/ClinicalBERT’).
- DistilBERT (‘distilbert/distilbert-base-uncased’).
- FLAN-T5 XXL (‘google/flan-t5-xxl’).
- LLaMA 3 (‘meta-llama/Meta-Llama-3-8B-Instruct’).
- Mistral-7B (‘mistralai/Mistral-7B-Instruct-v0.3’).
- RoBERTa (‘FacebookAI/roberta-base’).
- GPT-3.5 Turbo (‘gpt-3.5-turbo-0125’).
- GPT-4 (‘gpt-4-0613’).
- GPT-4o (‘gpt-4o-2024-08-06’).
Model | Version | Hyperparameters |
---|---|---|
GPT-3.5 Turbo | ‘gpt-3.5-turbo-0125’ | {temp. = 0.1, top_p = 0.1, n = 1, seed = 1234} |
GPT-4 | ‘gpt-4-0613’ | {temp. = 0.1, top_p = 0.1, n = 1, seed = 1234} |
GPT-4o | ‘gpt-4o-2024-08-06’ | {temp. = 0.1, top_p = 0.1, n = 1, seed = 1234} |
FLAN-T5 XXL | ‘google/flan-t5-xxl’ | {temp. = 0.1, top_p = 0.1, top_k = 1, seed = 1234} |
LLaMA 3 | ‘meta-llama/Meta-Llama-3-8B-Instruct’ | {temp. = 0.1, top_p = 0.1, top_k = 1, seed = 1234} |
Mistral-7B | ‘mistralai/Mistral-7B-Instruct-v0.3’ | {temp. = 0.1, top_p = 0.1, top_k = 1, seed = 1234} |
BERT | ‘google-bert/bert-base-uncased’ | {LR = , batch_size = 1, epochs = {1, …, 5}} |
ClinicalBERT | ‘medicalai/ClinicalBERT’ | {LR = , batch_size = 1, epochs = {1, …, 3}} |
DistilBERT | ‘distilbert/distilbert-base-uncased’ | {LR = , batch_size = 1, epochs = {1, …, 5}} |
RoBERTa | ‘FacebookAI/roberta-base’ | {LR = , batch_size = 1, epochs = {1, …, 5}} |
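As a minimal sketch of how the decoding settings above enter an in-context learning call, the snippet below issues a zero-shot request through the OpenAI chat completions API. The system instruction and the `ehr_text` placeholder are illustrative assumptions, and seed-based determinism is best-effort in the API; the open-weight models take the analogous `temperature`/`top_p`/`top_k` arguments through their generation interfaces.

```python
# Minimal sketch of a zero-shot in-context learning call with the decoding
# settings from the table above; prompt content is an assumption.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
ehr_text = "Mr. Mustermann was admitted on 15.10.2008 ..."  # placeholder record

response = client.chat.completions.create(
    model="gpt-4-0613",
    messages=[
        {"role": "system",
         "content": "List all protected health information (PHI) tokens "
                    "in the following clinical note."},  # assumed instruction
        {"role": "user", "content": ehr_text},
    ],
    temperature=0.1,  # near-deterministic decoding
    top_p=0.1,        # narrow nucleus sampling
    n=1,              # a single completion
    seed=1234,        # reproducibility where supported
)
print(response.choices[0].message.content)
```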
Appendix B. Post-Processing Evaluation
Model | Rules | Zero-Shot Precision | Zero-Shot Recall | Zero-Shot F1 | One-Shot Precision | One-Shot Recall | One-Shot F1 |
---|---|---|---|---|---|---|---|
FLAN-T5 XXL | No | 5.37% | 14.46% | 7.83% | 29.31% | 59.10% | 39.19% |
GPT-3.5 Turbo | No | 18.57% | 42.24% | 25.80% | 32.93% | 59.27% | 42.34% |
GPT-4 | No | 25.04% | 74.62% | 37.49% | 32.09% | 74.62% | 37.49% |
LLaMA 3 | No | 37.45% | 32.20% | 34.63% | 37.18% | 48.42% | 42.06% |
Mistral-7B | No | 11.61% | 58.34% | 19.37% | 19.49% | 68.70% | 30.37% |
FLAN-T5 XXL | Yes | 8.62% | 14.46% | 10.80% | 55.25% | 59.10% | 57.11% |
GPT-3.5 Turbo | Yes | 27.81% | 42.24% | 33.54% | 65.41% | 59.27% | 62.19% |
GPT-4 | Yes | 47.73% | 74.62% | 58.23% | 70.17% | 87.14% | 77.74% |
LLaMA 3 | Yes | 55.56% | 32.20% | 40.77% | 59.11% | 48.42% | 53.23% |
Mistral-7B | Yes | 15.55% | 58.34% | 24.56% | 38.33% | 68.70% | 49.20% |
Appendix C. Issues in Real-World EHRs in German
Health Record Snippets | Issues |
---|---|
“vom Pony getreten, Pat. schwanger!!!!. Abklärung (SG+NNH)? Patientin derzeit in der 27. Schwangerschaftswoche. Nach einem Aufklärungsgespräch bezüglich der Strahlenbelastung einer Schädel-CT-Untersuchung für den Fötus, lehnt die Patientin die Untersuchung zum derzeitigen Zeitpunkt ab. Derzeit ist die Patientin subjektiv und objektiv in klinischer Beschwerdefreiheit. Bei einer etwaigen Verschlechterung, ist eine jederzeitige Wiedervorstellung zum Schädel CT möglich.Das Gespräche wird in Beisein des diensthabenden RT Ass geführt. Der zuweisende Dr Mustermann wird über das Gespräch telefonisch in Kenntnis gesetzt.” | Inconsistent spacing and punctuation, excessive use of exclamation marks, sentence fragment, sentences with missing verbs or in indirect order, inconsistent article–noun agreement. |
“St.p.Trauma. Kontrolle? Im Vergleich zur VU vom 15.10.2008 geringfügige Zunahme der epiduralen oder subduralen Blutansammlung, diese nunmehr in einer maximalen Längsausdehnung von maximal etwa 4 cm und einer Breite von etwa 0.5 cm hoch parietookzipital rechts. Zunahme des cortical/subcorticalen Kontusionsherdes, dieser vormals etwa 1.7 cm, nunmehr etwa 2.3 cm haltend temperoparietal rechts. Neu aufgetreten eine etwa 5 cm lange und 0.6 cm breite konvex bogig berandete Blutansammlung offenbar epidural okzipital links. Neu aufgetreten eine SAB temperookzipital links, die SAB frontotemporal geringfügig regredient. Geringfügige Regredienz des Kopfschwartenhämatoms okzipital rechts.” | Unclear abbreviation, sentence fragment, word repetition, complex sentence. |
“SHT, SAB. Kontrolle nativ? Verlaufskontrolle zu einer auswärtigen VU vom 14.10.2009 (Klinikum am Südpark). CT des Gehirnschädels: Im Bildvergleich Umverteilung des Hämatocephalus internus in die Hinterhörner der Seitenventrikel und am Boden des 3. Ventrikels. Regredienz der Hämorrhagien in den basalen und perimesencephalen Zisternen mit Umverteilung der SAB nach parieto-frontal beidseits. Neu eingebrachte Ventrikeldrainage über frontal rechts mit der Spitze im Seitenventrikel im Vorderhorn. Etwas zunehmende Weite der Seitenventrikel. Kein Mittellinienshift. Die Zeichen des Hirnödems abnehmend. Neu demarkiert eine umschriebene Hypodensität cerebellär rechts, fragl. subakut ischämisch. Bekannte ausgedehnte Gesichtsschädelfrakturen. Geringe Zunahme des Hämatosinus sphenoidales.” | Unclear abbreviation, complex terminology, complex sentence, typo. |
Appendix D. File Names of the EHRs Sampled from the German N2C2 Dataset for the De-Identification Experiments
EHR File Names |
---|
112-02.xml, 112-03.xml, 130-03.xml, 132-01.xml, 132-03.xml, |
137-01.xml, 137-02.xml, 138-01.xml, 138-02.xml, 138-03.xml, |
138-04.xml, 160-04.xml, 161-01.xml, 163-01.xml, 163-03.xml, |
166-01.xml, 190-03.xml, 190-04.xml, 193-05.xml, 199-01.xml, |
199-05.xml, 200-04.xml, 202-03.xml, 209-01.xml, 210-04.xml, |
211-02.xml, 214-01.xml, 216-05.xml, 219-01.xml, 219-03.xml, |
219-04.xml, 219-05.xml, 234-01.xml, 310-02.xml, 314-01.xml, |
314-02.xml, 314-05.xml, 316-01.xml, 317-01.xml, 318-01.xml, |
318-03.xml, 318-04.xml, 319-01.xml, 340-01.xml, 340-03.xml, |
343-01.xml, 347-02.xml, 347-03.xml, 349-01.xml, 385-01.xml. |
References
- Liu, Z.; Tang, B.; Wang, X.; Chen, Q. De-identification of clinical notes via recurrent neural network and conditional random field. J. Biomed. Inform. 2017, 75, S34–S42. [Google Scholar] [CrossRef]
- Leevy, J.L.; Khoshgoftaar, T.M.; Villanustre, F. Survey on RNN and CRF models for de-identification of medical free text. J. Big Data 2020, 7, 1–22. [Google Scholar] [CrossRef]
- Health Insurance Portability and Accountability Act of 1996. Public Law 1996, 104, 191. [Google Scholar]
- European Commission. A New Era for Data Protection in the EU. 2018. Available online: https://commission.europa.eu/document/download/7fa5e36d-6412-4b44-9a2d-12d4838fd4c6_en?filename=data-protection-factsheet-changes_en.pdf (accessed on 30 December 2024).
- Liu, Z.; Huang, Y.; Yu, X.; Zhang, L.; Wu, Z.; Cao, C.; Dai, H.; Zhao, L.; Li, Y.; Shu, P.; et al. Deid-gpt: Zero-shot medical text de-identification by gpt-4. arXiv 2023, arXiv:2303.11032. [Google Scholar]
- Patil, H.K.; Seshadri, R. Big data security and privacy issues in healthcare. In Proceedings of the 2014 IEEE International Congress on Big Data, Anchorage, AK, USA, 27 June–2 July 2014; pp. 762–765. [Google Scholar]
- Henriksen-Bulmer, J.; Jeary, S. Re-identification attacks—A systematic literature review. Int. J. Inf. Manag. 2016, 36, 1184–1192. [Google Scholar] [CrossRef]
- Zhang, P.; Kamel Boulos, M.N. Generative AI in medicine and healthcare: Promises, opportunities and challenges. Future Internet 2023, 15, 286. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 6000–6010. [Google Scholar] [CrossRef]
- Denecke, K.; May, R.; Rivera-Romero, O. Transformer Models in Healthcare: A Survey and Thematic Analysis of Potentials, Shortcomings and Risks. J. Med. Syst. 2024, 48, 23. [Google Scholar] [CrossRef]
- Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Reis, E.P.; Seehofnerova, A.; et al. Clinical text summarization: Adapting large language models can outperform human experts. Res. Sq. 2023. [Google Scholar] [CrossRef]
- Chintagunta, B.; Katariya, N.; Amatriain, X.; Kannan, A. Medically aware GPT-3 as a data generator for medical dialogue summarization. In Proceedings of the Machine Learning for Healthcare Conference, PMLR, Virtual Event, 6–7 August 2021; pp. 354–372. [Google Scholar]
- Xu, B.; Gil-Jardiné, C.; Thiessard, F.; Tellier, E.; Avalos, M.; Lagarde, E. Pre-training a neural language model improves the sample efficiency of an emergency room classification model. In Proceedings of the FLAIRS-33-Thirty-Third International Flairs Conference, North Miami Beach, FL, USA, 17–20 May 2020. [Google Scholar]
- Sousa, S.; Kern, R. How to keep text private? A systematic review of deep learning methods for privacy-preserving natural language processing. Artif. Intell. Rev. 2023, 56, 1427–1492. [Google Scholar] [CrossRef]
- Trienes, J.; Trieschnigg, D.; Seifert, C.; Hiemstra, D. Comparing rule-based, feature-based and deep neural methods for de-identification of dutch medical records. arXiv 2020, arXiv:2001.05714. [Google Scholar]
- Kolditz, T.; Lohr, C.; Hellrich, J.; Modersohn, L.; Betz, B.; Kiehntopf, M.; Hahn, U. Annotating German clinical documents for de-identification. In MEDINFO 2019: Health and Wellbeing e-Networks for All; IOS Press BV: Amsterdam, The Netherlands, 2019; pp. 203–207. [Google Scholar]
- Rehm, G.; Uszkoreit, H. The German Language in the European Information Society. In The German Language in the Digital Age; Springer: Berlin/Heidelberg, Germany, 2012; pp. 47–53. [Google Scholar]
- Borchert, F.; Lohr, C.; Modersohn, L.; Witt, J.; Langer, T.; Follmann, M.; Gietzelt, M.; Arnrich, B.; Hahn, U.; Schapranow, M.P. GGPONC 2.0—The German clinical guideline corpus for oncology: Curation workflow, annotation policy, baseline NER taggers. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 3650–3660. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- OpenAI. GPT-4 Technical Report. 2023. Available online: https://openai.com/research/gpt-4 (accessed on 12 November 2024).
- AI@Meta. Llama 3 Model Card. 2024. Available online: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md (accessed on 27 January 2025).
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar] [CrossRef]
- Meystre, S.M.; Friedlin, F.J.; South, B.R.; Shen, S.; Samore, M.H. Automatic de-identification of textual documents in the electronic health record: A review of recent research. BMC Med. Res. Methodol. 2010, 10, 1–16. [Google Scholar] [CrossRef]
- Berman, J.J. Concept-match medical data scrubbing: How pathology text can be used in research. Arch. Pathol. Lab. Med. 2003, 127, 680–686. [Google Scholar] [CrossRef]
- Beckwith, B.A.; Mahaadevan, R.; Balis, U.J.; Kuo, F. Development and evaluation of an open source software tool for deidentification of pathology reports. BMC Med. Inform. Decis. Mak. 2006, 6, 1–9. [Google Scholar] [CrossRef] [PubMed]
- Friedlin, F.J.; McDonald, C.J. A software tool for removing patient identifying information from clinical documents. J. Am. Med. Inform. Assoc. 2008, 15, 601–610. [Google Scholar] [CrossRef]
- Uzuner, Ö.; Sibanda, T.C.; Luo, Y.; Szolovits, P. A de-identifier for medical discharge summaries. Artif. Intell. Med. 2008, 42, 13–35. [Google Scholar] [CrossRef]
- Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
- Wellner, B.; Huyck, M.; Mardis, S.; Aberdeen, J.; Morgan, A.; Peshkin, L.; Yeh, A.; Hitzeman, J.; Hirschman, L. Rapidly retargetable approaches to de-identification in medical records. J. Am. Med. Inform. Assoc. 2007, 14, 564–573. [Google Scholar] [CrossRef]
- Lafferty, J.; McCallum, A.; Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the ICML, Williamstown, MA, USA, 28 June–1 July 2001; Volume 1, p. 3. [Google Scholar]
- Aberdeen, J.; Bayer, S.; Yeniterzi, R.; Wellner, B.; Clark, C.; Hanauer, D.; Malin, B.; Hirschman, L. The MITRE Identification Scrubber Toolkit: Design, training, and assessment. Int. J. Med. Inform. 2010, 79, 849–859. [Google Scholar] [CrossRef]
- Yang, H.; Garibaldi, J.M. Automatic detection of protected health information from clinic narratives. J. Biomed. Inform. 2015, 58, S30–S38. [Google Scholar] [CrossRef]
- Stubbs, A.; Kotfila, C.; Uzuner, Ö. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. J. Biomed. Inform. 2015, 58, S11–S19. [Google Scholar] [CrossRef]
- Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019. [Google Scholar] [CrossRef] [PubMed]
- Haque, A.; Milstein, A.; Fei-Fei, L. Illuminating the dark spaces of healthcare with ambient intelligence. Nature 2020, 585, 193–202. [Google Scholar] [CrossRef]
- Fei, Z.; Ryeznik, Y.; Sverdlov, O.; Tan, C.W.; Wong, W.K. An overview of healthcare data analytics with applications to the COVID-19 pandemic. IEEE Trans. Big Data 2021, 8, 1463–1480. [Google Scholar] [CrossRef]
- Hang, C.N.; Tsai, Y.Z.; Yu, P.D.; Chen, J.; Tan, C.W. Privacy-enhancing digital contact tracing with machine learning for pandemic response: A comprehensive review. Big Data Cogn. Comput. 2023, 7, 108. [Google Scholar] [CrossRef]
- Chen, H.; Lin, Z.; Ding, G.; Lou, J.; Zhang, Y.; Karlsson, B. GRN: Gated relation network to enhance convolutional neural network for named entity recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6236–6243. [Google Scholar]
- Tomy, A.; Razzanelli, M.; Di Lauro, F.; Rus, D.; Della Santina, C. Estimating the state of epidemics spreading with graph neural networks. Nonlinear Dyn. 2022, 109, 249–263. [Google Scholar] [CrossRef]
- Tan, C.W.; Yu, P.D.; Chen, S.; Poor, H.V. Deeptrace: Learning to optimize contact tracing in epidemic networks with graph neural networks. arXiv 2022, arXiv:2211.00880. [Google Scholar]
- Obeid, J.S.; Heider, P.M.; Weeda, E.R.; Matuskowitz, A.J.; Carr, C.M.; Gagnon, K.; Crawford, T.; Meystre, S.M. Impact of de-identification on clinical text classification using traditional and deep learning classifiers. Stud. Health Technol. Inform. 2019, 264, 283. [Google Scholar] [PubMed]
- Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1746–1751. [Google Scholar] [CrossRef]
- Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610. [Google Scholar] [CrossRef]
- Dernoncourt, F.; Lee, J.Y.; Uzuner, O.; Szolovits, P. De-identification of patient notes with recurrent neural networks. J. Am. Med. Inform. Assoc. 2017, 24, 596–606. [Google Scholar] [CrossRef]
- Ahmed, T.; Aziz, M.M.A.; Mohammed, N. De-identification of electronic health record using neural network. Sci. Rep. 2020, 10, 18600. [Google Scholar] [CrossRef]
- Cho, K.; van Merriënboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Richter-Pechanski, P.; Amr, A.; Katus, H.A.; Dieterich, C. Deep Learning Approaches Outperform Conventional Strategies in De-Identification of German Medical Reports. In Proceedings of the GMDS, Dortmund, Germany, 8–11 September 2019; pp. 101–109. [Google Scholar]
- Baumgartner, M.; Schreier, G.; Hayn, D.; Kreiner, K.; Haider, L.; Wiesmüller, F.; Brunelli, L.; Pölzl, G. Impact analysis of De-identification in clinical notes classification. In dHealth 2022; IOS Press BV: Amsterdam, The Netherlands, 2022; pp. 189–196. [Google Scholar]
- Eder, E.; Krieg-Holz, U.; Hahn, U. CodE Alltag 2.0—A pseudonymized German-language email corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4466–4477. [Google Scholar]
- Kocaman, V.; Mellah, Y.; Haq, H.; Talby, D. Automated de-identification of arabic medical records. In Proceedings of the ArabicNLP 2023, Singapore, 7 December 2023; pp. 33–40. [Google Scholar]
- Zhao, Y.S.; Zhang, K.L.; Ma, H.C.; Li, K. Leveraging text skeleton for de-identification of electronic medical records. BMC Med. Inform. Decis. Mak. 2018, 18, 65–72. [Google Scholar] [CrossRef] [PubMed]
- Menger, V.; Scheepers, F.; van Wijk, L.M.; Spruit, M. DEDUCE: A pattern matching method for automatic de-identification of Dutch medical text. Telemat. Inform. 2018, 35, 727–736. [Google Scholar] [CrossRef]
- Bourdois, L.; Avalos, M.; Chenais, G.; Thiessard, F.; Revel, P.; Gil-Jardiné, C.; Lagarde, E. De-identification of emergency medical records in French: Survey and comparison of state-of-the-art automated systems. Int. Flairs Conf. Proc. 2021, 34. [Google Scholar] [CrossRef]
- Catelli, R.; Gargiulo, F.; Casola, V.; De Pietro, G.; Fujita, H.; Esposito, M. A novel covid-19 data set and an effective deep learning approach for the de-identification of italian medical records. IEEE Access 2021, 9, 19097–19110. [Google Scholar] [CrossRef]
- Kajiyama, K.; Horiguchi, H.; Okumura, T.; Morita, M.; Kano, Y. De-identifying free text of Japanese electronic health records. J. Biomed. Semant. 2020, 11, 1–12. [Google Scholar] [CrossRef]
- Shin, S.Y.; Park, Y.R.; Shin, Y.; Choi, H.J.; Park, J.; Lyu, Y.; Lee, M.S.; Choi, C.M.; Kim, W.S.; Lee, J.H. A de-identification method for bilingual clinical texts of various note types. J. Korean Med. Sci. 2015, 30, 7–15. [Google Scholar] [CrossRef]
- Bråthen, S.; Wie, W.; Dalianis, H. Creating and evaluating a synthetic Norwegian clinical corpus for de-identification. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), Reykjavik, Iceland (Online), 31 May–2 June 2021; pp. 222–230. [Google Scholar]
- Prado, C.B.; Gumiel, Y.B.; Schneider, E.T.R.; Cintho, L.M.M.; de Souza, J.V.A.; Oliveira, L.E.S.e.; Paraiso, E.C.; Rebelo, M.S.; Gutierrez, M.A.; Pires, F.A.; et al. De-Identification Challenges in Real-World Portuguese Clinical Texts. In Proceedings of the Latin American Conference on Biomedical Engineering, Florianópolis, Brazil, 24–28 October 2022; pp. 584–590. [Google Scholar]
- Marimon, M.; Gonzalez-Agirre, A.; Intxaurrondo, A.; Rodriguez, H.; Martin, J.L.; Villegas, M.; Krallinger, M. Automatic De-identification of Medical Texts in Spanish: The MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results. In Proceedings of the IberLEF@ SEPLN, Bilbao, Spain, 24 September 2019; pp. 618–638. [Google Scholar]
- Berg, H.; Dalianis, H. A Semi-supervised Approach for De-identification of Swedish Clinical Text. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4444–4450. [Google Scholar]
- Ramshaw, L.A.; Marcus, M.P. Text chunking using transformation-based learning. In Natural Language Processing Using Very Large Corpora; Springer: Berlin/Heidelberg, Germany, 1999; pp. 157–176. [Google Scholar]
- European Parliament and Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council. 2016. Available online: https://data.europa.eu/eli/reg/2016/679/oj (accessed on 5 November 2024).
- Stubbs, A.; Uzuner, Ö. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. J. Biomed. Inform. 2015, 58, S20–S29. [Google Scholar] [CrossRef]
- Jantscher, M.; Gunzer, F.; Kern, R.; Hassler, E.; Tschauner, S.; Reishofer, G. Information extraction from German radiological reports for general clinical text and language understanding. Sci. Rep. 2023, 13, 2353. [Google Scholar] [CrossRef]
- Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Ma, J.; Li, R.; Xia, H.; Xu, J.; Wu, Z.; Liu, T.; et al. A Survey on In-context Learning. arXiv 2024, arXiv:2301.00234. [Google Scholar] [CrossRef]
- Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N.A.; Khashabi, D.; Hajishirzi, H. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 13484–13508. [Google Scholar]
- Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T.B. Stanford Alpaca: An Instruction-Following Llama Model. 2023. Available online: https://crfm.stanford.edu/2023/03/13/alpaca.html (accessed on 27 January 2025).
- Lv, K.; Yang, Y.; Liu, T.; Guo, Q.; Qiu, X. Full Parameter Fine-tuning for Large Language Models with Limited Resources. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 8187–8198. [Google Scholar] [CrossRef]
- Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large language models: A survey. arXiv 2024, arXiv:2402.06196. [Google Scholar]
- Sun, C.; Yang, Z.; Wang, L.; Zhang, Y.; Lin, H.; Wang, J. Biomedical named entity recognition using BERT in the machine reading comprehension framework. J. Biomed. Inform. 2021, 118, 103799. [Google Scholar] [CrossRef] [PubMed]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2020, arXiv:1910.01108. [Google Scholar] [CrossRef]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling instruction-finetuned language models. J. Mach. Learn. Res. 2024, 25, 1–53. [Google Scholar]
- OpenAI. GPT-3.5 Turbo. 2023. Available online: https://platform.openai.com/docs/models/gpt-3-5#gpt-3-5-turbo (accessed on 12 November 2024).
- Wang, G.; Liu, X.; Ying, Z.; Yang, G.; Chen, Z.; Liu, Z.; Zhang, M.; Yan, H.; Lu, Y.; Gao, Y.; et al. Optimized glycemic control of type 2 diabetes with reinforcement learning: A proof-of-concept trial. Nat. Med. 2023, 29, 2633–2642. [Google Scholar] [CrossRef]
- OpenAI. Hello GPT-4o. 2024. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 24 January 2025).
- Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar]
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 2023, 43, 1–55. [Google Scholar]
- Eisinga, R.; Heskes, T.; Pelzer, B.; Te Grotenhuis, M. Exact p-values for pairwise comparison of Friedman rank sums, with application to comparing classifiers. BMC Bioinform. 2017, 18, 1–18. [Google Scholar] [CrossRef] [PubMed]
- Kanjirangat, V.; Antonucci, A.; Zaffalon, M. On the Limitations of Zero-Shot Classification of Causal Relations by LLMs (Work in Progress). CEUR Workshop Proceedings (ISSN 1613-0073), 2024. [Google Scholar]
- Gao, J.; Lu, C.; Ding, X.; Li, Z.; Liu, T.; Qin, B. Enhancing Complex Causality Extraction via Improved Subtask Interaction and Knowledge Fusion. arXiv 2024, arXiv:2408.03079. [Google Scholar]
- Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
- Ross, A.; Willson, V.L. Paired samples T-test. In Basic and Advanced Statistical Tests: Writing Results Sections and Creating Tables and Figures; Sense Publishers: Rotterdam, The Netherlands, 2017; pp. 17–19. [Google Scholar]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
- Bai, G.; Chai, Z.; Ling, C.; Wang, S.; Lu, J.; Zhang, N.; Shi, T.; Yu, Z.; Zhu, M.; Zhang, Y.; et al. Beyond efficiency: A systematic survey of resource-efficient large language models. arXiv 2024, arXiv:2401.00625. [Google Scholar]
- Dodge, J.; Ilharco, G.; Schwartz, R.; Farhadi, A.; Hajishirzi, H.; Smith, N. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv 2020, arXiv:2002.06305. [Google Scholar]
- Lin, Z.; Guan, S.; Zhang, W.; Zhang, H.; Li, Y.; Zhang, H. Towards trustworthy LLMs: A review on debiasing and dehallucinating in large language models. Artif. Intell. Rev. 2024, 57, 243. [Google Scholar] [CrossRef]
- You, Z.; Lee, H.; Mishra, S.; Jeoung, S.; Mishra, A.; Kim, J.; Diesner, J. Beyond Binary Gender Labels: Revealing Gender Bias in LLMs through Gender-Neutral Name Predictions. In Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP), Bangkok, Thailand, 16 August 2024; pp. 255–268. [Google Scholar]
- Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating Noise to Sensitivity in Private Data Analysis. In Proceedings of the Theory of Cryptography; Halevi, S., Rabin, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 265–284. [Google Scholar]
- Konečnỳ, J.; McMahan, H.B.; Yu, F.X.; Richtárik, P.; Suresh, A.T.; Bacon, D. Federated learning: Strategies for improving communication efficiency. In Proceedings of the NIPS Workshop on Private Multi-Party Machine Learning, Barcelona, Spain, 6–8 December 2016. [Google Scholar]
- Hu, L.; Yan, A.; Yan, H.; Li, J.; Huang, T.; Zhang, Y.; Dong, C.; Yang, C. Defenses to membership inference attacks: A survey. ACM Comput. Surv. 2023, 56, 1–34. [Google Scholar] [CrossRef]
- Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 308–318. [Google Scholar]
- He, Z.; Zhang, T.; Lee, R.B. Model inversion attacks against collaborative inference. In Proceedings of the 35th Annual Computer Security Applications Conference, San Juan, PR, USA, 9–13 December 2019; pp. 148–162. [Google Scholar]
- Hassan, M.U.; Rehmani, M.H.; Chen, J. Differential privacy techniques for cyber physical systems: A survey. IEEE Commun. Surv. Tutor. 2019, 22, 746–789. [Google Scholar] [CrossRef]
- Zhang, C.; Xie, Y.; Bai, H.; Yu, B.; Li, W.; Gao, Y. A survey on federated learning. Knowl.-Based Syst. 2021, 216, 106775. [Google Scholar] [CrossRef]
- Singh, A.; Chatterjee, K. Cloud security issues and challenges: A survey. J. Netw. Comput. Appl. 2017, 79, 88–115. [Google Scholar] [CrossRef]
- Bagdasaryan, E.; Poursaeed, O.; Shmatikov, V. Differential Privacy Has Disparate Impact on Model Accuracy. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
- Bagdasaryan, E.; Veit, A.; Hua, Y.; Estrin, D.; Shmatikov, V. How to backdoor federated learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Online, 26–28 August 2020; pp. 2938–2948. [Google Scholar]
- Floridi, L. Establishing the rules for building trustworthy AI. Nat. Mach. Intell. 2019, 1, 261–262. [Google Scholar] [CrossRef]
# | HIPAA PHI Categories | Personal Data per the EU’s GDPR |
---|---|---|
1. | Names | Any information relating to an identified or identifiable natural person. |
2. | Dates, except year | |
3. | Telephone numbers | |
4. | Geographic data | |
5. | FAX numbers | |
6. | Social security numbers | |
7. | E-mail addresses | |
8. | Medical record numbers | |
9. | Account numbers | |
10. | Health plan beneficiary numbers | |
11. | Certificate/license numbers | |
12. | Vehicle identifiers and serial numbers | |
13. | Web URLs | |
14. | Device identifiers and serial numbers | |
15. | Internet protocol addresses | |
16. | Full-face photos and comparable images | |
17. | Biometric identifiers | |
18. | Any unique identifying number or code |
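Several of the HIPAA categories above lend themselves to simple pattern matching. The sketch below shows illustrative, deliberately non-exhaustive regular expressions for telephone numbers, e-mail addresses, URLs, and IP addresses; these patterns are assumptions for demonstration and are not the detectors evaluated in this work.

```python
# Minimal sketch of pattern-based detectors for a few HIPAA identifier
# categories; the regexes are illustrative assumptions, not exhaustive rules.
import re

PHI_PATTERNS = {
    "telephone": re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "url": re.compile(r"\bhttps?://\S+"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def find_phi(text: str) -> list[tuple[str, str]]:
    """Return (category, match) pairs for every pattern hit in the text."""
    return [(cat, m.group()) for cat, rx in PHI_PATTERNS.items()
            for m in rx.finditer(text)]

print(find_phi("Call (555) 123-4567 or mail jane.doe@example.com"))
# -> [('telephone', '(555) 123-4567'), ('email', 'jane.doe@example.com')]
```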
Number of EHRs | Number of Tokens | PHI: Names | PHI: Dates |
---|---|---|---|
15 | 1454 | 10 | 15 |
Approach | Model | E | D | E-D | Size | Language | Year |
---|---|---|---|---|---|---|---|
In-context learning | FLAN-T5 XXL | ✓ | | | 11 B | Multilingual | 2022 |
In-context learning | GPT-3.5 Turbo | ✓ | | | N/A | Multilingual | 2023 |
In-context learning | GPT-4 | | | ✓ | 1.76 T | Multilingual | 2023 |
In-context learning | GPT-4o | | ✓ | | N/A | Multilingual | 2024 |
In-context learning | LLaMA 3 | ✓ | | | 8 B | Multilingual | 2024 |
In-context learning | Mistral-7B | ✓ | | | 7 B | Multilingual | 2023 |
Full fine-tuning | BERTbase | ✓ | | | 110 M | English | 2018 |
Full fine-tuning | ClinicalBERT | ✓ | | | 110 M | English | 2023 |
Full fine-tuning | DistilBERT | ✓ | | | 66 M | English | 2019 |
Full fine-tuning | RoBERTabase | ✓ | | | 125 M | English | 2019 |

E: English N2C2 dataset; D: German N2C2 dataset; E-D: both datasets.
Model | Zero-Shot Precision | Zero-Shot Recall | Zero-Shot F1 | One-Shot Precision | One-Shot Recall | One-Shot F1 |
---|---|---|---|---|---|---|
FLAN-T5 XXL | 8.62% | 14.46% | 10.80% | 55.25% | 59.10% | 57.11% |
GPT-3.5 Turbo | 27.81% | 42.24% | 33.54% | 65.41% | 59.27% | 62.19% |
GPT-4 | 47.73% | 74.62% | 58.23% | 70.17% | 87.14% | 77.74% |
LLaMA 3 | 55.56% | 32.20% | 40.77% | 59.11% | 48.42% | 53.23% |
Mistral-7B | 15.55% | 58.34% | 24.56% | 38.33% | 68.70% | 49.20% |
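The precision, recall, and F1 scores reported in these tables are computed at the token level, comparing predicted PHI tokens against the gold annotations. The sketch below shows one way to compute them; treating duplicate tokens as multisets is an assumption, and the exact matching procedure used in our evaluation may differ.

```python
# Minimal sketch of token-level precision/recall/F1 over PHI predictions;
# multiset matching of duplicate tokens is an assumption.
from collections import Counter

def token_prf(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    true_pos = sum((pred_counts & gold_counts).values())  # multiset intersection
    precision = true_pos / max(sum(pred_counts.values()), 1)
    recall = true_pos / max(sum(gold_counts.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1

p, r, f = token_prf(["Mustermann", "15.10.2008", "prednisone"],
                    ["Mustermann", "15.10.2008"])
# p = 0.667, r = 1.0, f = 0.8
```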
Model | Precision | Recall | F1 Score |
---|---|---|---|
Nottingham [33] | 0.990 | 0.964 | 0.976 |
MIST [44] | 0.914 | 0.927 | 0.920 |
BiLSTM-CRF [44] | 0.979 | 0.978 | 0.978 |
GRU [45] | 0.987 | 0.958 | 0.972 |
GRU-GRU [45] | 0.990 | 0.951 | 0.970 |
LSTM-GRU [45] | 0.987 | 0.952 | 0.969 |
Self-attention [45] | 0.980 | 0.984 | 0.982 |
BiLSTM-CRF [15] | 0.959 | 0.869 | 0.912 |
FLAN-T5 XXL (one-shot) | 0.552 | 0.591 | 0.571 |
GPT-3.5 Turbo (one-shot) | 0.654 | 0.592 | 0.621 |
GPT-4 (one-shot) | 0.701 | 0.871 | 0.777 |
LLaMA 3 (one-shot) | 0.591 | 0.484 | 0.532 |
Mistral-7B (one-shot) | 0.383 | 0.687 | 0.492 |
Model | Zero-Shot | One-Shot |
---|---|---|
ChatGPT [5] | 0.929 | – |
LLaMA 2 [5] | 0.612 | – |
FLAN-T5 XXL | 0.057 | 0.399 |
GPT-3.5 Turbo | 0.201 | 0.451 |
GPT-4 | 0.410 | 0.635 |
LLaMA 3 | 0.256 | 0.362 |
Mistral-7B | 0.140 | 0.326 |
Model | Precision | Recall | F1 Score |
---|---|---|---|
BiLSTM-CRF [44] | 0.979 | 0.978 | 0.978 |
GRU [45] | 0.987 | 0.958 | 0.972 |
GRU-GRU [45] | 0.990 | 0.951 | 0.970 |
LSTM-GRU [45] | 0.987 | 0.952 | 0.969 |
Self-attention [45] | 0.980 | 0.984 | 0.982 |
BiLSTM-CRF [15] | 0.959 | 0.869 | 0.912 |
BERTbase (5 epochs) | 0.929 ± 0.002 | 0.948 ± 0.001 | 0.938 ± 0.001 |
ClinicalBERT (3 epochs) | 0.842 ± 0.009 | 0.849 ± 0.005 | 0.845 ± 0.007 |
DistilBERT (5 epochs) | 0.904 ± 0.005 | 0.922 ± 0.005 | 0.913 ± 0.005 |
RoBERTabase (5 epochs) | 0.953 ± 0.001 | 0.964 ± 0.001 | 0.959 ± 0.001 |
Model | Zero-Shot Precision | Zero-Shot Recall | Zero-Shot F1 | One-Shot Precision | One-Shot Recall | One-Shot F1 |
---|---|---|---|---|---|---|
GPT-4 | 46.48% | 66.51% | 54.72% | 71.80% | 81.48% | 76.33% |
GPT-4o | 39.10% | 56.88% | 46.34% | 79.57% | 77.33% | 78.43% |
Model | Zero-Shot Precision | Zero-Shot Recall | Zero-Shot F1 | One-Shot Precision | One-Shot Recall | One-Shot F1 |
---|---|---|---|---|---|---|
GPT-4 | 41.66% | 60.00% | 49.18% | 78.94% | 60.00% | 68.18% |
GPT-4o | 50.00% | 60.00% | 54.54% | 83.33% | 60.00% | 69.76% |
Model | ZS | OS | FT | Time | Output Token Pricing |
---|---|---|---|---|---|
FLAN-T5 XXL | ✓ | | | 5.144 s ± 0.598 | – |
FLAN-T5 XXL | | ✓ | | 29.576 s ± 0.173 | – |
GPT-3.5 Turbo | ✓ | | | 0.405 s ± 0.027 | USD 1.50/1M tokens |
GPT-3.5 Turbo | | ✓ | | 0.481 s ± 0.052 | USD 1.50/1M tokens |
GPT-4 (English) | ✓ | | | 1.905 s ± 0.269 | USD 60.00/1M tokens |
GPT-4 (English) | | ✓ | | 1.424 s ± 0.431 | USD 60.00/1M tokens |
LLaMA 3 | ✓ | | | 3.035 s ± 0.057 | – |
LLaMA 3 | | ✓ | | 3.266 s ± 0.049 | – |
Mistral-7B | ✓ | | | 8.507 s ± 0.185 | – |
Mistral-7B | | ✓ | | 9.086 s ± 0.086 | – |
BERTbase | | | ✓ | 0.026 s ± 0.000 | – |
ClinicalBERT | | | ✓ | 0.016 s ± 0.000 | – |
DistilBERT | | | ✓ | 0.018 s ± 0.002 | – |
RoBERTabase | | | ✓ | 0.029 s ± 0.000 | – |
GPT-4 (German) | ✓ | | | 2.408 s ± 0.404 | USD 60.00/1M tokens |
GPT-4 (German) | | ✓ | | 2.644 s ± 0.548 | USD 60.00/1M tokens |
GPT-4o (German) | ✓ | | | 1.446 s ± 0.298 | USD 10.00/1M tokens |
GPT-4o (German) | | ✓ | | 1.634 s ± 0.530 | USD 10.00/1M tokens |
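For the priced API models, the corpus-level cost follows directly from the output-token price in the table. A minimal sketch, with illustrative record and token counts:

```python
# Minimal sketch of the corpus cost implied by the pricing column: generated
# tokens times the per-token output price. Record/token counts are assumptions.
PRICE_PER_MTOK = {  # USD per 1M output tokens, from the table above
    "gpt-3.5-turbo-0125": 1.50,
    "gpt-4-0613": 60.00,
    "gpt-4o-2024-08-06": 10.00,
}

def corpus_cost(model: str, num_records: int, avg_output_tokens: int) -> float:
    """Estimated USD cost of de-identifying a corpus with an API model."""
    return num_records * avg_output_tokens * PRICE_PER_MTOK[model] / 1_000_000

print(f"USD {corpus_cost('gpt-4-0613', 50, 200):.2f}")  # 50 records -> USD 0.60
```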
Model | Total Size | Backward Pass | Optimizer Step |
---|---|---|---|
FLAN-T5 XXL | 40.99 GB | 81.98 GB | 163.97 GB |
GPT-3.5 Turbo | N/A | N/A | N/A |
GPT-4 | N/A | N/A | N/A |
GPT-4o | N/A | N/A | N/A |
LLaMA 3 | 28.21 GB | 56.42 GB | 112.83 GB |
Mistral-7B | 27.5 GB | 55.0 GB | 110.0 GB |
BERTbase | 417.65 MB | 835.3 MB | 1.63 GB |
ClinicalBERT | 513.97 MB | 1.0 GB | 2.01 GB |
DistilBERT | 253.16 MB | 506.32 MB | 1012.63 MB |
RoBERTabase | 475.49 MB | 950.99 MB | 1.86 GB |
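The estimates in the memory table approximately correspond to storing each parameter as a 32-bit float (4 bytes), with the backward pass doubling and the optimizer step quadrupling the footprint. A minimal sketch of this calculation follows; the scaling factors match the table's ratios, while the estimation method itself is an assumption.

```python
# Minimal sketch approximately reproducing the memory table: fp32 weights,
# backward pass = 2x, optimizer step = 4x; the method is an assumption.
def memory_estimates(num_params: float, bytes_per_param: int = 4):
    """Return (weights, backward-pass, optimizer-step) sizes in bytes."""
    total = num_params * bytes_per_param
    return total, 2 * total, 4 * total

GIB = 1024 ** 3
for name, params in [("FLAN-T5 XXL", 11e9), ("LLaMA 3", 8e9), ("BERTbase", 110e6)]:
    weights, backward, optimizer = memory_estimates(params)
    print(f"{name}: {weights / GIB:.2f} / {backward / GIB:.2f} "
          f"/ {optimizer / GIB:.2f} GiB")
# FLAN-T5 XXL: 40.98 / 81.96 / 163.91 GiB (close to the tabulated values)
```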
Model | Zero-Shot Bias | Zero-Shot Hallucination | One-Shot Bias | One-Shot Hallucination |
---|---|---|---|---|
FLAN-T5 XXL | 16.92% | 85.21% | 56.22% | 100% |
GPT-3.5 Turbo | 39.29% | 97.85% | 19.94% | 99.46% |
GPT-4 (English) | 10.31% | 99.46% | 4.66% | 5.25% |
LLaMA 3 | 50% | 100% | 41.16% | 100% |
Mistral-7B | 55.44% | 74.70% | 34.43% | 83.26% |
GPT-4 (German) | 44% | 64% | 26% | 0% |
GPT-4o (German) | 34% | 52% | 22% | 2% |
Risk | Suitable PET |
---|---|
Linking attacks | DP |
Membership inference attacks | DP |
Model inversion attacks | DP |
Attacks on centralized cloud storage | FL |
Data leakage during transmission | FL |
Unauthorized data access | FL |
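As a minimal illustration of the DP entries above, the Laplace mechanism (Dwork et al.) adds noise calibrated to the query sensitivity and the privacy budget ε. The query, sensitivity, and ε values below are illustrative assumptions:

```python
# Minimal sketch of the Laplace mechanism underlying the DP rows above:
# noise with scale sensitivity/epsilon is added to a query result.
import numpy as np

rng = np.random.default_rng(1234)

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release true_value under epsilon-differential privacy."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A counting query over patient records has sensitivity 1:
noisy_count = laplace_mechanism(true_value=42.0, sensitivity=1.0, epsilon=0.5)
print(round(noisy_count, 2))
```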
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Sousa, S.; Jantscher, M.; Kröll, M.; Kern, R. Large Language Models for Electronic Health Record De-Identification in English and German. Information 2025, 16, 112. https://doi.org/10.3390/info16020112