Article

Large Language Models for Electronic Health Record De-Identification in English and German

1
Know Center Research GmbH, 8010 Graz, Austria
2
Institute of Human-Centred Computing, Graz University of Technology, 8010 Graz, Austria
*
Author to whom correspondence should be addressed.
Information 2025, 16(2), 112; https://doi.org/10.3390/info16020112
Submission received: 9 December 2024 / Revised: 26 January 2025 / Accepted: 28 January 2025 / Published: 6 February 2025
(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

Abstract
Electronic health record (EHR) de-identification is crucial for publishing or sharing medical data without violating the patient’s privacy. Protected health information (PHI) is abundant in EHRs, and privacy regulations worldwide mandate de-identification before downstream tasks are performed. The ever-growing data generation in healthcare and the advent of generative artificial intelligence have increased the demand for de-identified EHRs and highlighted privacy issues with large language models (LLMs), especially data transmission to cloud-based LLMs. In this study, we benchmark ten LLMs for de-identifying EHRs in English and German. We then compare de-identification performance for in-context learning and full model fine-tuning and analyze the limitations of LLMs for this task. Our experimental evaluation shows that LLMs effectively de-identify EHRs in both languages. Moreover, in-context learning with a one-shot setting boosts de-identification performance without the costly full fine-tuning of the LLMs.

1. Introduction

Electronic health record (EHR) de-identification can be considered a named-entity recognition (NER) task that identifies and removes protected health information (PHI) from medical data [1]. PHI in EHRs includes information such as names, social security numbers, e-mail addresses, dates, and biometric identifiers, instances of which are associated with an individual patient [2]. Health records are the backbone of many e-health applications yet are subject to regulations worldwide [3,4] that require the removal of PHI to preserve patients’ privacy prior to further data use [1]. Moreover, the growth of big data EHRs [2] has increased privacy issues for medical data, considering unauthorized access to PHI, cloud storage, breaches, and attacks [5,6,7].
Generative artificial intelligence (GenAI) has rapidly promoted significant advances in healthcare [8]. Transformer-based [9] large language models (LLMs) have succeeded in complex tasks [10], such as clinical text summarization [11], de-identification [5], medical dialogue summarization [12], and document classification [13]. However, generative LLMs often hallucinate, produce biased outputs, and have limitations from their training data [5]. The performance benefits of LLMs in healthcare have far outweighed the drawbacks, but privacy concerns remain.
This study evaluates ten recently proposed LLMs from all three model categories applied to the EHR de-identification task in English and German. Privacy is a central requirement in natural language processing (NLP) applications for healthcare since medical data are inherently private [14]. In addition, most de-identification methods focus on medical text in English, whereas fewer methods are available for other languages [15]. For instance, the de-identification of German data has only been addressed by a few works [16] despite German being spoken by over 100 million people worldwide [17]. Furthermore, there is a shortage of publicly available text data for the medical domain in German [18]. We tested three scenarios for de-identification by LLMs. In the first two, we performed in-context learning with zero-shot and one-shot settings, respectively, for encoder–decoder and decoder-only LLMs from a bilingual perspective for English and German. Figure 1 depicts the proposed EHR de-identification method based on generative LLMs with in-context learning and zero-shot settings. In the third, we fine-tuned encoder-only LLMs, such as BERT [19] family models, on the dataset used in our experimental evaluation.
Our contributions are the following: (1) an extensive benchmark of LLMs for EHR de-identification, including GPT-4 [20], LLaMA 3 [21], and RoBERTa [22] models; (2) an experimental evaluation of EHR de-identification for EHRs in English and German; (3) a de-identification performance assessment for in-context learning and full model fine-tuning; (4) a discussion of LLM limitations for de-identification in both English and German; (5) the contextualization of privacy issues for LLM-based EHR de-identification in real-world scenarios.

2. Related Work

The ever-growing data generation in healthcare has increased the need for EHR de-identification and has recently resulted in a wide variety of proposed methods. Early de-identification applications were mostly based on pattern-matching and machine learning [23]. Currently, de-identification is performed by models ranging from rule-based approaches to deep learning [2] and GenAI [5]. Pattern-matching models rely on manually crafted regular expressions, lists, and dictionaries. They need no labeled data, but they require domain experts to design and generalize poorly [23]. In contrast, supervised machine learning models tend not to grow in complexity or slow down in processing, but they require large annotated training databases [23]. Deep learning models generalize better yet depend heavily on large annotated datasets for model training [2]. More recently, GenAI models, especially LLMs, have demonstrated potential in processing textual data in zero-shot and few-shot learning scenarios [5]. Therefore, we present the historical perspective on EHR de-identification by reviewing pattern-matching, machine learning, deep learning, and LLM approaches for this task.
Initially, pattern-matching and machine learning approaches were developed to address EHR de-identification [23]. Berman [24] developed a general algorithm that de-identifies free text, removing identifiers and private information. The algorithm extracts all words from the text aside from stop words. Every term with a match in the Unified Medical Language System (UMLS) is replaced with an alternate term mapping to the same UMLS code. All the remaining words are replaced with symbols. The algorithm scrubbed a text corpus with over half a million sentences, but precision and recall for the de-identification were not reported [23]. Beckwith et al. [25] developed an open-source de-identification tool using pattern-matching with regular expressions to detect PHI patterns like dates and addresses. This tool achieved high recall on pathology records from three institutions. Friedlin and McDonald [26] developed a software tool using a series of regular expressions for pattern-matching and word lists as the main methods for de-identification, reaching high de-identification performance in the evaluation. Uzuner et al. [27] used support vector machines (SVMs) [28] to de-identify medical discharge summaries. De-identification was treated as a multi-class classification problem in which tokens were classified as PHI or non-PHI, achieving high precision and recall [23]. Wellner et al. [29] adapted NER toolkits for de-identification, namely, Carafe and LingPipe. Carafe is based on conditional random fields (CRFs) [30], whereas LingPipe uses hierarchical hidden Markov models [23]. Both toolkits were evaluated on i2b2 challenge data, on which Carafe outperformed LingPipe. Traditional off-the-shelf de-identification systems, such as the MITRE Identification Scrubber Toolkit (MIST) [31], also use Carafe. Finally, the CRF-based de-identification system Nottingham [32] was the best-performing method at the 2014 i2b2 de-identification challenge [33].
As deep learning architectures have gained popularity, convolutional neural networks (CNNs) [34] have also been extensively applied to healthcare tasks [35]. For instance, CNNs were used to process medical images and aid in diagnosing COVID-19 cases [36] and contact tracing [37] during the global outbreak of coronavirus. Chen et al. [38] proposed an efficient CNN architecture for NER, which yielded high performance with lower time costs than recurrent neural network models. Tomy et al. [39] proposed an architecture based on graph CNNs for estimating the spreading of epidemics, and Tan et al. [40] utilized graph neural networks for contact tracing. When it came to the de-identification task, Obeid et al. [41] tested machine and deep learning methods, in which a CNN-based architecture inspired by Kim [42] achieved the highest results.
More recently, deep learning approaches based on bidirectional long short-term memory (BiLSTM) networks [43] and transformer language models have attracted attention for EHR de-identification. Liu et al. [1] analyzed the results of CRF, BiLSTM, BiLSTM with handcrafted features, and a rule-based method for de-identification. Then, the authors proposed an ensemble of the first three methods and merged the ensemble results with those of the rule-based method, achieving high de-identification performance. Dernoncourt et al. [44] introduced the first deep learning-based de-identification model without handcrafted features or rules using BiLSTM, which outperformed a CRF baseline. Ahmed et al. [45] proposed a de-identification architecture based on bidirectional gated recurrent units (GRUs) [46], a stacked recurrent neural network structure with GRU and/or long short-term memory (LSTM) [47] components, and a self-attention mechanism. Their proposed methods performed faster and better than state-of-the-art baselines. Furthermore, the authors introduced utility metrics for the de-identified data. Trienes et al. [15] presented the first de-identification comparison across languages. The authors compared the performance of a rule-based de-identification system against that of a CRF model alone and a BiLSTM architecture combined with a CRF output layer, i.e., BiLSTM-CRF, for datasets in English and Dutch, showing that BiLSTM-CRF generalizes best, even under limited data scenarios. Recently, Liu et al. [5] explored the potential of LLMs for de-identification. The authors developed a GPT-4-based de-identification framework and benchmarked it against several LLM baselines using zero-shot learning and fine-tuning, showing high performance and reliability. Finally, a survey of neural network methods for de-identification was presented by Leevy et al. [2], describing recent baselines and challenges for this task.
The de-identification of German-language data is challenging due mainly to the lower public data availability compared to English. Richter-Pechanski et al. [48] performed an extensive evaluation for the de-identification of German medical reports using a CRF-based method and BiLSTM with pre-trained word embeddings. The latter method achieved remarkable recall scores in the evaluation. Kolditz et al. [16] proposed annotation guidelines for the de-identification of a corpus composed of German clinical documents. The authors trained a BiLSTM-based model, which yielded remarkable performance for many PHI categories on the annotated corpus. Other de-identification methods for German-language data were applied to clinical notes [49] and e-mails [50].
Approaches to de-identification have also been proposed for other languages worldwide, such as Arabic [51], Chinese [52], Dutch [53], French [54], Italian [55], Japanese [56], Korean [57], Norwegian [58], Portuguese [59], Spanish [60], and Swedish [61]. Our work extends those of Liu et al. [5], proposing an evaluation of LLMs for the de-identification of EHRs, and Trienes et al. [15], with experiments for two languages. Furthermore, we leveraged the potential of GenAI methods to overcome the scarcity of publicly available data in the German language for de-identification.

3. Materials and Methods

This section presents the methodology of this study. First, we describe the de-identification task (Section 3.1) and the PHI categories according to data protection regulations (Section 3.2). Second, we describe the datasets used in the experiments (Section 3.3). Third, we describe the in-context learning (Section 3.4) and full fine-tuning (Section 3.5) approaches for LLMs and list the LLMs used in the experiments (Section 3.6). Finally, we describe the evaluation procedure for the translation of a de-identification dataset from English to German (Section 3.7).

3.1. De-Identification

Given a set of EHRs $X = \{x_1, x_2, \ldots, x_n\}$, de-identification aims to identify a set of PHI instances $Y = \{y_1, y_2, \ldots, y_m\}$ and remove them from $X$. Every PHI instance $y_i$ is composed of a word $w_i$ and a tag $tg_i$, in which $tg_i$ is the PHI category assigned to $w_i$ in the BIO tagging scheme [15,62]. Similar to an NER task, the de-identification evaluation consists of verifying whether the predicted offsets for words and tags that indicate PHI instances match exactly [15]. De-identification performance is measured by the number of PHI instances from $X$, which the model identifies and removes from the EHRs afterward.
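To make the task definition concrete, the following minimal sketch tags a tokenized sentence in the BIO scheme and then removes the tagged PHI. The function names, the placeholder format, and the example sentence are illustrative assumptions and are not taken from the study.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], phi_spans: List[Tuple[int, int, str]]) -> List[str]:
    """Assign BIO tags to tokens.

    phi_spans holds hypothetical (start_token, end_token_exclusive, category) triples.
    """
    tags = ["O"] * len(tokens)
    for start, end, category in phi_spans:
        tags[start] = f"B-{category}"          # first token of the PHI instance
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"          # continuation tokens
    return tags


def remove_phi(tokens: List[str], tags: List[str]) -> List[str]:
    """Replace every PHI span with one category placeholder (an assumed convention)."""
    output = []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            output.append(token)
        elif tag.startswith("B-"):
            output.append(f"[{tag[2:]}]")
        # I- tokens are skipped: their span is already covered by the placeholder
    return output


# Invented example sentence, not from the N2C2 dataset
tokens = ["Mr.", "John", "Smith", "was", "admitted", "on", "March", "3", "."]
spans = [(1, 3, "NAME"), (6, 8, "DATE")]

tags = bio_tags(tokens, spans)
deidentified = remove_phi(tokens, tags)
print(tags)
print(" ".join(deidentified))
```

Here each PHI instance is collapsed into a single placeholder; other conventions, such as surrogate replacement, would fit the same definition.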

3.2. Protected Health Information Categories

Privacy regulations have been approved worldwide in order to protect patients’ privacy, but they differ in the number of PHI categories they consider. In the United States (US), the Health Insurance Portability and Accountability Act (HIPAA) [3] defines 18 categories of PHI [2], whereas the European Union (EU)’s General Data Protection Regulation (GDPR) [4,63] does not contain clear definitions of PHI categories [15] yet protects personal data to a broader extent. In this sense, any information that relates to an identified or identifiable individual is defined as personal data by the EU’s GDPR [63]. Table 1 presents the 18 PHI categories under HIPAA, such as names, dates, and telephone numbers, followed by the definition of personal data by the EU’s GDPR. This paper uses HIPAA PHI definitions since they have been more broadly employed in the de-identification literature [15,44].

3.3. De-Identification Datasets

In this work, we used two versions of the N2C2 (originally i2b2) 2014 dataset [64], which is a popular benchmark dataset for de-identification methods in English [1,15,44]. We refer to this dataset simply as N2C2 throughout this paper. First, we used the original N2C2 dataset, comprising 1304 files divided into two sets for training (790 files) and testing (514 files). The complete statistical overview of the dataset can be found in Stubbs et al. [64]. Second, we used a German translation of the N2C2 dataset for part of the de-identification experiments in German. The translation process and validation are described in Section 3.7.
In addition to the two N2C2 dataset versions, we also used a real-world dataset for benchmarking EHR de-identification in German. The real-world dataset is part of the MUG-CTHEAD dataset of radiological reports [65] curated by a team of medical experts in radiology. In this dataset, original PHI instances were replaced with fictitious names and dates, but the radiology terminology and language remained unchanged. The dataset comprises 15 EHRs containing text from radiological evaluations, such as computed tomography scans, as Table 2 shows. The 15 real-world EHRs have 1454 tokens in total, 25 of which are PHI instances of two categories: names (10 instances) and dates (15 instances). German-language challenges like complexity, as shown in Table A3 in Appendix C, are also present in this dataset.

3.4. In-Context Learning and Prompt Engineering

In-context learning is the ability of LLMs to make predictions from contexts augmented by a few examples [66]. It enables LLMs to perform new tasks without updating model weights by following instructions from the input prompts [67]. For instance, the LLMs receive demonstration examples written in natural language for a task, learn the pattern from the demonstration, and make a prediction that completes the input text [66]. Even zero-shot examples were found to be successful for LLM generalization during in-context learning [67]. Moreover, this learning paradigm incorporates human knowledge into LLMs and reduces computational costs for model adaptation [66].
In this work, we used self-instruct prompts [67] for in-context learning following the Alpaca prompt format [68]. We developed both zero-shot and one-shot prompts for de-identification. Figure 2 and Figure 3 illustrate the prompt designed for the encoder–decoder and decoder-only LLMs for the English and German datasets, respectively. The prompts encompass a task statement, a set of rules, and a template for the output format. The task statement describes the task and its main goal. The set of rules specifies the requirements for the PHI categories [5], the strings to avoid, and how to proceed in case no PHI category is in the health record. Finally, the template for the output format aims to preserve the ordering of the PHI instances in the LLM output as they appear in the health record text. For the experiments on German data, we used a translated version of this prompt revised by native German speakers.
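The prompt structure described above (task statement, rules, output template, and an optional example for the one-shot setting) can be sketched as follows. The template string follows the publicly documented Alpaca format with an input field; the task wording, the rules, and the example record are hypothetical stand-ins, since the actual prompts are those shown in Figure 2 and Figure 3.

```python
from typing import Optional, Tuple

# Alpaca-style template with an input field
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

# Hypothetical wording, NOT the prompts used in the study
TASK = "Extract all protected health information (PHI) from the health record."
RULES = [
    "Only output strings that appear verbatim in the health record.",
    "Do not output medical terms, medications, or conditions.",
    "If the record contains no PHI, output 'NONE'.",
]
OUTPUT_FORMAT = "One PHI instance per line, in order of appearance."


def build_prompt(record: str, example: Optional[Tuple[str, str]] = None) -> str:
    """Build a zero-shot prompt, or a one-shot prompt when an example is given."""
    lines = [TASK, "Rules:"]
    lines += [f"{i}. {rule}" for i, rule in enumerate(RULES, start=1)]
    lines.append(f"Output format: {OUTPUT_FORMAT}")
    if example is not None:
        example_record, example_output = example
        lines.append("Example record:\n" + example_record)
        lines.append("Example output:\n" + example_output)
    return ALPACA_TEMPLATE.format(instruction="\n".join(lines), input=record)


record = "Patient seen by Dr. Lee on 2/5."  # invented record
zero_shot = build_prompt(record)
one_shot = build_prompt(record, example=("Mary Jones visited on 1/1.", "Mary Jones\n1/1"))
print(one_shot)
```

Keeping the example inside the instruction block is one possible design choice; it leaves the input field reserved for the record to be de-identified.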

3.5. Full Fine-Tuning for LLMs

Full fine-tuning updates all LLM parameters for a new task [69]. BERT family models have long been subject to full fine-tuning for many NLP tasks [70]. This process is often time-consuming and computationally expensive yet leads to considerable gains in performance for the new task when compared to the original pre-trained models. In this work, we performed full fine-tuning for the encoder-only LLMs used for de-identification due to (1) their lower number of parameters, (2) their fixed output structure that prevents hallucinations, and (3) their demonstrated efficiency for NER tasks [71].

3.6. Large Language Models

Most of the recent advances in NLP have been powered by LLMs based on the transformer architecture [9]. These LLMs fall into three categories based on their model architecture: encoder-only, decoder-only, and encoder–decoder models [70]. Encoder-only models consist solely of an encoder network [70], such as BERT [19], RoBERTa [22], and DistilBERT [72]. Decoder-only models are built solely with the transformer architecture’s decoder component, such as LLaMA [73] and GPT-3 [74]. Encoder–decoder models encompass both encoder and decoder components, such as FLAN-T5 [75].
LLMs are advancing rapidly [70]. In this study, we benchmarked LLMs from the three main model categories. Table 3 presents the statistics of the LLMs used in our evaluation. We used decoder-only and encoder–decoder models for in-context learning, whereas we used encoder-only models for full fine-tuning. The model sizes range from 66 M to 1.76 T parameters, excluding GPT-3.5 Turbo and GPT-4o, whose numbers of parameters are not publicly available [76]. Together, the following LLMs covered both English and German. Additional information on the LLMs used in this study is given in Table A1 in Appendix A.
  • BERT [19] is an encoder-only LLM that comprises an embedding module, a stack of transformer encoders, and a fully connected layer [70]. This model relies on joint masked language modeling and next-sentence prediction objectives [19] and has recently been used for privacy-preserving NLP tasks [14]. In this work, we fine-tuned a 110 M parameter BERTbase uncased model for the English N2C2 dataset.
  • ClinicalBERT [77] is a version of BERT fine-tuned on medical-domain language. In this work, we fine-tuned a 110 M parameter ClinicalBERT model for the English N2C2 dataset.
  • DistilBERT [72] is a 40% smaller, 60% faster version of BERT that retains 97% of its performance. This model is faster and less memory-consuming to fine-tune than the original BERT model. In this work, we fine-tuned a 66 M parameter DistilBERT model for the English N2C2 dataset.
  • RoBERTa [22] improves BERT by introducing hyperparameter modifications, removing the next-sentence training objective, and using larger mini-batches and increased learning rates [70]. In this work, we fine-tuned a 125 M parameter RoBERTabase uncased model for the English N2C2 dataset.
  • FLAN-T5 XXL [75] is an encoder–decoder LLM with 11 B parameters and multilingual capabilities and is fine-tuned on a wide range of NLP tasks. In this work, we performed in-context learning for the de-identification of the English N2C2 dataset for this model.
  • GPT-3.5 Turbo [76] is a decoder-only LLM, improving on the GPT-3 model [74] and part of the GPT model family developed by OpenAI (https://openai.com/, accessed on 27 January 2025). This model is closed-source and only accessible via an application programming interface (API) [70]. Additionally, the precise number of model parameters for GPT-3.5 Turbo is undisclosed [76]. In this work, we performed in-context learning for the de-identification of the English N2C2 dataset for the GPT-3.5 Turbo model version ‘gpt-3.5-turbo-0125’.
  • GPT-4 [20] is a decoder-only LLM in the GPT family, which far exceeds its predecessors across several benchmarks [70], with state-of-the-art performance, 1.76 T parameters, multi-modal outputs, and multilingual capabilities. Like GPT-3.5 Turbo, this model is closed-source and only accessible via APIs [20]. In this work, we performed in-context learning for the de-identification of both English and German N2C2 datasets for the GPT-4 model version ‘gpt-4-0613’.
  • GPT-4o [78] is a decoder-only LLM with cross-modal capabilities in video, audio, and text. It is a flagship, closed-source model in the GPT-4 model family with enhanced performance for languages other than English. Like its preceding models, GPT-4o is only accessible via APIs [78]. In this work, we performed in-context learning for the de-identification of the N2C2 and real-world datasets in German for the GPT-4o model version ‘gpt-4o-2024-08-06’.
  • LLaMA 3 [21] is a decoder-only, open-source LLaMA [73] family model, which promotes gains in performance over its predecessors due to improved data quality and a larger trained scale [79]. In this work, we performed in-context learning for the de-identification of the English N2C2 dataset for the 8 B parameter LLaMA-3 model.
  • Mistral-7B [80] is a decoder-only, open-source, 7 B parameter LLM that succeeds across reasoning, mathematics, and code generation benchmarks [70]. In this work, we performed in-context learning for the de-identification of the English N2C2 dataset for this model.

3.7. Translation of the N2C2 Dataset

To evaluate the models in a bilingual setting, the N2C2 dataset was translated into German using the OpenAI REST API with the GPT-3.5 Turbo model. We designed both system and user prompts for this task, which Figure 4 and Figure 5 depict, respectively. Native German-speaking medical experts validated the translated documents to ensure accuracy. First, three radiologists evaluated the translation. One radiologist had two years of work experience at the time of dataset translation, while the other two were in the final year of medical studies. Second, each radiologist evaluated every report independently. Finally, a curation session was held in which the experts reached full agreement on the final version of the translated dataset. No discrepancies were identified during the individual evaluations. In total, 1304 clinical reports were translated.

4. Experimental Setup

This section describes the experimental setup of this study. Section 4.1 describes the pre-processing procedures for both English and German datasets. Section 4.2 describes the LLM parameter values used in the experiments. Finally, Section 4.3 describes the experiments and performance metrics.

4.1. Data Pre-Processing

Before de-identification by LLMs, we executed common pre-processing procedures for the English and German N2C2 datasets, as well as the real-world German dataset. First, we performed an entity alignment on both the training and test sets of the N2C2 datasets and the full content of the real-world dataset to treat overlapping entities. Second, we used the SpaCy (https://spacy.io, accessed on 27 January 2025) tokenizers for tokenization and sentence segmentation [15]. Finally, we tagged all tokens following the BIO tagging scheme, as described by Trienes et al. [15]. The original N2C2 dataset’s division into training and test sets was maintained throughout all study phases.

4.2. Large Language Model Settings

In this study, we used pre-trained LLMs, which are publicly available in the Hugging Face Transformer Library (https://huggingface.co/models, accessed on 27 January 2025) or accessible via the OpenAI API (https://platform.openai.com/docs/overview, accessed on 27 January 2025). We include detailed LLM information in Appendix A. We recommend that the reader refer to Table A1 to check model versions and hyperparameter values for each LLM used in the experiments.
Regarding the LLMs with in-context learning, we aimed to reduce variability in the outputs by applying the following strategies. First, we set a fixed random seed for the experiments. Second, we limited the number of generated new tokens to 50 since the models were instructed to output a small number of tokens at a time. Third, we set the temperature parameter to a small positive value in order to avoid high variability in the LLMs’ outputs. Further, we set the number of highest-probability vocabulary tokens (‘top_k’) to 1 and set the top-p sampling value to 0.1. Finally, we enabled sampling for the output generation by the LLMs.
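The combined effect of these decoding settings can be illustrated on a toy next-token distribution. The sketch below assumes the common processing order of temperature scaling, then top-k filtering, then top-p (nucleus) filtering; the function name, the logits, and the temperature of 0.1 are illustrative assumptions, since the exact temperature value is not stated here. With `top_k` set to 1, only the most probable token survives, so sampling, although enabled, behaves almost deterministically.

```python
import math
import random
from typing import Dict


def filter_distribution(logits: Dict[str, float], temperature: float,
                        top_k: int, top_p: float) -> Dict[str, float]:
    """Return the renormalized probabilities left after temperature, top-k, and top-p."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # Top-k: keep the k most probable tokens and renormalize
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    top_k_mass = sum(p for _, p in ranked)
    ranked = [(tok, p / top_k_mass) for tok, p in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    kept_mass = sum(p for _, p in kept)
    return {tok: p / kept_mass for tok, p in kept}


toy_logits = {"John": 2.0, "Smith": 1.5, "the": 0.3}  # invented values
kept = filter_distribution(toy_logits, temperature=0.1, top_k=1, top_p=0.1)

rng = random.Random(0)  # fixed seed, mirroring the fixed random seed in the setup
samples = rng.choices(list(kept), weights=list(kept.values()), k=5)
print(kept, samples)
```

Under these assumptions every sample is the same token, which is consistent with the stated goal of reducing variability in the outputs.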
When it came to the LLMs with full fine-tuning, we used a token classification head atop BERT family models and applied the following strategies. First, we set the maximum length parameter to 512 tokens. Second, we set the initial learning rate to $2 \times 10^{-5}$. Third, we used batch sizes of 1 for both training and validation sets. Further, we varied the number of training epochs between 1 and 5. Finally, we used a weight decay technique to prevent overfitting of the fine-tuned models.

4.3. Experiments and Evaluation

We developed de-identification methods using LLMs and benchmarked them on English and German de-identification datasets. We used transformer-based language models and executed the experiments on English and German versions of the N2C2 de-identification dataset, as well as the real-world German data. We measured the de-identification performance for all categories of PHI combined for both English and German data. The experiments were conducted as follows. First, we performed in-context learning by zero-shot and one-shot prompts for the pre-trained encoder–decoder and decoder-only LLMs. Second, we performed full fine-tuning for the encoder-only LLMs. Third, we report micro-averaged precision, recall, and F1 scores [1,15] for all LLMs. Finally, we also report accuracy for the decoder-only LLMs [5].
De-identification performance is assessed by precision ($P$), recall ($R$), and F1 score ($F_1$) given the number of true positives ($TP$), the number of false positives ($FP$), the number of true negatives ($TN$), and the number of false negatives ($FN$) [44]. Precision is defined as
$$P = \frac{TP}{TP + FP}.$$
Recall is defined as
$$R = \frac{TP}{TP + FN}.$$
F1 score is defined as
$$F_1 = \frac{2 \times P \times R}{P + R}.$$
Like Liu et al. [5], we also report accuracy ($A$), which is defined as
$$A = \frac{TP + TN}{TP + TN + FP + FN}.$$
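As a worked example of these definitions, the sketch below computes micro-averaged precision, recall, and F1 score by pooling exact-match PHI instances over all documents, and accuracy from raw counts. The entity representation as `(document, start, end, category)` tuples and all numbers are invented for illustration and are not results from this study.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, int, str]  # (document id, start offset, end offset, category)


def micro_prf(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exact-match entities."""
    tp = len(gold & predicted)   # predicted entities that match gold exactly
    fp = len(predicted - gold)   # predicted entities with no exact gold match
    fn = len(gold - predicted)   # gold entities the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy from the four confusion counts."""
    return (tp + tn) / (tp + tn + fp + fn)


# Invented annotations: the second prediction has a wrong end offset
gold = {(0, 1, 3, "NAME"), (0, 6, 8, "DATE"), (1, 0, 1, "NAME")}
predicted = {(0, 1, 3, "NAME"), (0, 6, 7, "DATE"), (1, 0, 1, "NAME")}

p, r, f1 = micro_prf(gold, predicted)
acc = accuracy(tp=2, tn=95, fp=1, fn=2)
print(round(p, 3), round(r, 3), round(f1, 3), acc)
```

Because matching is exact, the prediction with a wrong end offset counts as both a false positive and a false negative, which is why boundary errors are penalized twice under this evaluation.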

5. Results

This section describes the de-identification results for the LLMs used for in-context learning and full fine-tuning. Section 5.1 presents the in-context learning results for the encoder–decoder and decoder-only LLMs. Section 5.2 presents the full fine-tuning results for the encoder-only LLMs. Section 5.3 presents the results of the experiments on the German N2C2 dataset. Section 5.4 describes the results of a real-world evaluation of LLMs on the real-world German dataset. Finally, Section 5.5 analyzes the cost–performance trade-offs for LLMs in the healthcare domain.

5.1. In-Context Learning

In-context learning for encoder–decoder and decoder-only models presents solid performance for de-identification, especially for one-shot experiments. First, we used de-identification prompts composed of a task statement, a set of rules, and a template for the output format. Finally, we added an example EHR to the aforementioned prompt, as well as the expected output for the example EHR. The example EHR used for the one-shot experiments was the EHR with the highest number of PHI categories in the N2C2 dataset’s training set. When adding an example EHR to the de-identification prompts, we observed increases in all performance metrics for de-identification.
Validating the de-identification results achieved by LLMs with in-context learning poses additional challenges to the post-processing of the model outputs. For example, the LLM outputs might be unstructured and contain tokens generated by the models that are not in the original EHR’s text. Furthermore, checking model outputs is a necessary practice to detect hallucinations [81]. Therefore, we have applied the following rules to reinforce the requirements from the prompt for model output validation during the post-processing step.
  • The removal of tokens that explicitly contain parts of the original prompt.
  • The removal of tokens containing medical terms, such as ‘pager’, ‘ultrasound’, or ‘nebulizer’.
  • The removal of tokens that are medication names, such as ‘atenolol’, ‘hydroxychloroquine’, or ‘prednisone’.
  • The removal of tokens that are medical conditions, such as ‘diabetes’.
  • The removal of tokens that are incomplete name parts, such as ‘Dr.’ or ‘M.D.’.
  • The removal of tokens that relate to gender.
  • The removal of tokens containing sets of stop words.
  • The removal of tokens that are floating-point numbers, percentages, fractions, and temperatures.
The empirical evaluation of the proposed post-processing rules can be found in Table A2 in Appendix B.
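A subset of these post-processing rules can be sketched as a simple filter over candidate strings returned by an LLM. The word lists, the prompt fragments, the regular expression, and the function name below are hypothetical; they only mirror the examples named in the rules. Fractions are left out of the numeric pattern in this sketch because a form such as 2/5 is ambiguous with a date.

```python
import re
from typing import List

# Hypothetical lists built from the examples named in the rules
PROMPT_FRAGMENTS = ("### instruction", "### response", "output format")
BLOCKED_WORDS = {
    "pager", "ultrasound", "nebulizer",               # medical terms
    "atenolol", "hydroxychloroquine", "prednisone",   # medication names
    "diabetes",                                       # medical conditions
    "male", "female",                                 # gender-related terms
}
NAME_PARTS = {"dr.", "m.d."}
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on"}
NUMERIC = re.compile(
    r"^(\d+\.\d+"               # floating-point number
    r"|\d+(\.\d+)?\s?%"         # percentage
    r"|\d+(\.\d+)?\s?°?[CF]"    # temperature
    r")$"
)


def keep_candidate(candidate: str) -> bool:
    """Return True if the candidate string survives all filtering rules."""
    text = candidate.strip()
    lowered = text.lower()
    words = [word.strip(".,;:") for word in lowered.split()]
    if not words:
        return False                                   # empty output
    if any(fragment in lowered for fragment in PROMPT_FRAGMENTS):
        return False                                   # echoes part of the prompt
    if any(word in BLOCKED_WORDS for word in words):
        return False                                   # medical or gender term
    if lowered in NAME_PARTS:
        return False                                   # incomplete name part
    if all(word in STOP_WORDS for word in words):
        return False                                   # only stop words
    if NUMERIC.match(text):
        return False                                   # number, percentage, temperature
    return True


candidates: List[str] = [
    "John Smith", "atenolol", "Dr.", "98.6 F", "of the",
    "03/04/2090", "### Response: NONE", "45%",
]
filtered = [c for c in candidates if keep_candidate(c)]
print(filtered)
```

A production filter would need curated vocabularies, for instance from a drug dictionary, rather than the handful of terms used here.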
Table 4 presents precision, recall, and F1 scores for encoder–decoder and decoder-only LLMs with in-context learning in two settings, namely, zero-shot and one-shot de-identification. Overall, one-shot de-identification increased all performance metrics for all LLMs. These gains in performance were driven by a reduction in the false positives in the LLM outputs when an example EHR and the expected example outputs were included in the prompt. Decoder-only models also showed higher performance than the analyzed encoder–decoder model due to their autoregressive language modeling abilities.
Table 5 presents the performance metrics for the five LLMs used for in-context learning with one-shot settings in this study in comparison to two traditional de-identification systems (Nottingham [33] and MIST [44]) and state-of-the-art de-identification methods based on deep learning. The results in the table show that LLM-based de-identification is a promising approach, especially in terms of recall. To verify whether the performance differences between generative LLMs and the baselines in Table 5 were statistically significant, we performed the Friedman test [82]. First, we assumed the null hypothesis for the Friedman test that all methods performed equally. Second, we assumed the alternative hypothesis for the Friedman test that at least one method performed differently with statistical significance. Finally, we performed the Friedman test, which returned a test statistic equal to 32.82 and a p-value of 0.001. Therefore, the null hypothesis was rejected since the p-value was less than 0.05.
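For reference, the Friedman statistic ranks the methods within each block (for example, within each performance metric) and compares the rank sums across methods. The sketch below computes only the test statistic and assumes no tied scores within a block; the scores are invented toy values, not those of Table 5, and the blocking shown is an assumption. A p-value would additionally require the chi-square distribution with k - 1 degrees of freedom, where k is the number of methods.

```python
from typing import List


def friedman_statistic(scores: List[List[float]]) -> float:
    """Friedman chi-square statistic.

    scores[i][j] is the score of method j on block i; ties are assumed absent.
    """
    n = len(scores)        # number of blocks
    k = len(scores[0])     # number of methods
    rank_sums = [0.0] * k
    for row in scores:
        # Rank methods within the block: 1 = lowest score, k = highest score
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    sum_of_squares = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_of_squares - 3.0 * n * (k + 1)


# Toy data: 3 blocks (rows) and 3 methods (columns), ordered identically in every block
toy_scores = [
    [0.70, 0.80, 0.90],
    [0.60, 0.75, 0.95],
    [0.65, 0.85, 0.90],
]
statistic = friedman_statistic(toy_scores)
print(round(statistic, 6))
```

When every block ranks the methods identically, as in this toy input, the statistic reaches its maximum of n(k - 1); when the rank sums are equal across methods, it is zero.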
In terms of accuracy, one-shot de-identification also improved on zero-shot de-identification. Table 6 presents the accuracy scores for the encoder–decoder and decoder-only LLMs with in-context learning for both zero-shot and one-shot de-identification settings. Since Liu et al. [5] reported GPT-4 results on samples from the test set, we only compared our methods to the top-scoring models they tested on the full test set. As shown in the table, one-shot de-identification by GPT family models reduced the gap to top-scoring de-identification approaches based on LLMs. Overall, the design of prompts for de-identification is challenging, and including an example EHR from the dataset’s training set improved the final results for all models.

5.2. Fine-Tuning Large Language Models

BERT family models have been widely fine-tuned for NLP tasks. Furthermore, these models might be more effective for some information extraction tasks than in-context learning LLMs, which lack extensive training on structured output formats [83,84]. We fine-tuned the original BERTbase model and three BERT variants for the de-identification of the English N2C2 dataset and measured the performance for the predictions by the fine-tuned models. Overall, full fine-tuning resulted in competitive performance compared to state-of-the-art approaches.
Table 7 presents the precision, recall, and F1 scores and their standard deviations for de-identification by BERT family models on the English N2C2 dataset compared to state-of-the-art baselines. BERTbase, DistilBERT, and RoBERTabase were fine-tuned for five epochs, whereas ClinicalBERT was fine-tuned for three epochs. The models were benchmarked on performance metrics computed at the entity level, following the standard NER evaluation [15]. All experiments were repeated five times, and the average performance metrics and standard deviations were computed over the results of the five executions.
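Entity-level scoring and the aggregation over repeated runs can be sketched as follows. This is a minimal illustration under stated assumptions: entities are represented as hypothetical `(start, end, label)` tuples, a prediction counts as correct only on an exact match of span and label, and the sample standard deviation is used; the study's exact matching criteria and choice of standard deviation are not given in this section.

```python
from statistics import mean, stdev


def entity_prf(gold: set, predicted: set):
    """Precision, recall, and F1 over exact-match entities."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def summarize(scores):
    """Mean and sample standard deviation over repeated runs."""
    return mean(scores), stdev(scores)


# Hypothetical gold and predicted entities for one record.
gold = {(0, 2, "NAME"), (5, 6, "DATE"), (9, 10, "CITY")}
pred = {(0, 2, "NAME"), (5, 6, "AGE")}
p, r, f1 = entity_prf(gold, pred)
```

In this toy record, the `DATE` span predicted as `AGE` counts as both a false positive and a false negative, which is what distinguishes entity-level from token-level scoring.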
The results in Table 7 show that encoder-only models from the BERT family can be efficient for de-identification after full fine-tuning. Among the models, RoBERTabase presented the highest scores across all performance metrics. This leading performance might be attributed to RoBERTa’s larger number of parameters, dynamic word masking, and enhanced contextual embeddings. On the other hand, ClinicalBERT and DistilBERT presented lower scores, which might be attributed to limitations from the original pre-training and to the simplified architecture, respectively.
We evaluated BERTbase, ClinicalBERT, DistilBERT, and RoBERTabase with epochs in {1, 2, 3, 4, 5}. Since ClinicalBERT failed during training at four and five epochs, Figure 6 shows the F1 scores for the BERT family models fine-tuned over three epochs on the English dataset. DistilBERT and ClinicalBERT were the lowest-performing models at all epochs, even though their F1 scores improved as the number of epochs increased. Overall, increasing the number of epochs for full fine-tuning improved performance for all models, especially RoBERTabase and BERTbase.
In order to verify whether the performance differences between LLMs and deep learning baselines were statistically significant, we performed two statistical tests: the Friedman test [82] and the Nemenyi post hoc test [85]. First, we assumed the null hypothesis for the Friedman test that the performance outcomes for all methods were equal. Second, we assumed the alternative hypothesis for the Friedman test that at least one method’s performance was different. Finally, we conducted the Friedman test, which returned a test statistic equal to 21.97 and a p-value of 0.008. Since the p-value was less than 0.05, the null hypothesis was rejected, and we proceeded with the Nemenyi test to find which methods presented significant differences in performance. The Nemenyi test resulted in a critical difference (CD) of 7.82. Figure 7 shows the results of the Nemenyi test, with the CD represented as the bold lines along the x-axis. We assumed that models not connected by a CD line have average ranks that differ with statistical significance. Most of the average ranks of the encoder-only LLMs used for the de-identification task fell within the CD of the best-performing baseline. Overall, RoBERTabase was the best-performing LLM in our study.
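The two tests can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis code of the study: `scores[i][j]` is a hypothetical score of method `j` on block `i` (for example, a metric or dataset), the Friedman statistic is computed without tie correction, and the critical value `q_alpha` for the Nemenyi test must be looked up in a table of the studentized range statistic for the chosen significance level and number of methods.

```python
import math


def average_ranks(scores):
    """Average rank per method; rank 1 is best and ties share the mean rank."""
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / n for t in totals]


def friedman_statistic(scores):
    """Friedman chi-square statistic (no tie correction)."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_cd(q_alpha, k, n):
    """Critical difference between average ranks for k methods and n blocks."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Hypothetical scores: four blocks (rows) and three methods (columns).
scores = [
    [0.90, 0.80, 0.70],
    [0.95, 0.85, 0.60],
    [0.88, 0.90, 0.70],
    [0.92, 0.81, 0.75],
]
ranks = average_ranks(scores)
statistic = friedman_statistic(scores)
```

Two methods are then considered significantly different when their average ranks differ by more than the critical difference. In practice, a statistics library such as SciPy (`scipy.stats.friedmanchisquare`) would typically be used to obtain the p-value of the Friedman test.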

5.3. De-Identification Results for the German N2C2 Dataset

To investigate the de-identification performance of LLMs in German, we also performed an experimental evaluation on the translated dataset. First, we sampled the 50 EHRs with the largest number of PHI categories from the German N2C2 dataset’s testing set. The file names of the EHRs sampled from the German N2C2 dataset can be found in Table A4 in Appendix D. Second, we selected the best-performing generative LLM (GPT-4) from the experiments on the English dataset. Third, we selected GPT-4o for comparison with GPT-4 due to its enhanced capabilities for languages other than English. Since the BERT family models used in this study are monolingual, we constrained our de-identification experiments on German-language data to GPT-4 and GPT-4o, which share the same model version for both languages. Finally, we applied the in-context learning prompts in German to these two LLMs with zero-shot and one-shot settings, similar to the experiments on English-language data.
GPT-4 successfully handled the de-identification of the German dataset, especially in zero-shot settings, achieving high recall scores. In contrast, GPT-4o outperformed GPT-4 in one-shot settings. Table 8 presents the precision, recall, and F1 scores for GPT-4 and GPT-4o with in-context learning in two settings on the German N2C2 dataset. GPT-4 achieved recall scores above 65% for both settings. Similar to the experiments on the English N2C2 dataset, adding an example EHR and example output aided in improving performance across all metrics for both LLMs. The F1 score for one-shot de-identification by GPT-4 on the German dataset sample approached that on the original English dataset. When it came to GPT-4o, one-shot settings also resulted in a large improvement in precision, leading its F1 score to exceed that of GPT-4. To verify whether the differences in performance were statistically significant, we conducted a paired-samples t-test [86]. First, we assumed the null hypothesis for the paired-samples t-test that GPT-4 and GPT-4o performance outcomes were equal. Second, we assumed the alternative hypothesis for the paired-samples t-test that GPT-4 and GPT-4o performance outcomes were unequal. Finally, we conducted the paired-samples t-test, which returned a p-value of 0.294. Therefore, the null hypothesis could not be rejected since the test’s p-value was greater than 0.05.
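The test statistic of a paired-samples t-test can be sketched as follows. This is a minimal illustration: the function name `paired_t_statistic` and the example scores are hypothetical, and how the study paired the GPT-4 and GPT-4o scores is not stated in this section.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical paired scores for two models.
model_a = [0.70, 0.65, 0.80, 0.75]
model_b = [0.68, 0.66, 0.74, 0.70]
t_value, df = paired_t_statistic(model_a, model_b)
```

The two-sided p-value follows from the Student t distribution with `df` degrees of freedom; a statistics library such as SciPy (`scipy.stats.ttest_rel`) would typically be used for that step. A p-value above the significance level means that the null hypothesis cannot be rejected, which is weaker than showing that the two models perform equally.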

5.4. Real-World Evaluation

We conducted a real-world evaluation of GPT-4 and GPT-4o for the de-identification of German EHRs. This evaluation used both models for in-context learning in zero-shot and one-shot settings for the real-world dataset. Since real-world data present additional challenges for LLMs, such as complexity, incomplete sentence structures, and language uses deviating from the standard language grammar, the de-identification task was performed under more challenging conditions. At the end of the evaluation, the three performance metrics for de-identification and a paired-samples t-test [86] were computed.
Table 9 presents the performance metrics for de-identification by GPT-4 and GPT-4o with in-context learning for the real-world German EHR dataset. The results show that GPT-4o outperformed GPT-4 in both zero-shot and one-shot settings, especially in terms of precision. GPT-4o provided shorter outputs that avoided false positives to a greater extent than GPT-4. In line with our experiments on the English and German N2C2 datasets, one-shot settings resulted in performance gains for the real-world German data. We also conducted a paired-samples t-test to verify whether the differences in performance were statistically significant for the real-world evaluation. First, we assumed the null hypothesis for the paired-samples t-test that GPT-4 and GPT-4o performance outcomes on the real-world German dataset were equal. Second, we assumed the alternative hypothesis for the paired-samples t-test that GPT-4 and GPT-4o performance outcomes on the real-world German dataset were unequal. Finally, we conducted the paired-samples t-test, which returned a p-value of 0.060. Therefore, the null hypothesis could not be rejected in this case either since the test’s p-value was greater than 0.05.

5.5. Cost–Performance Trade-Offs for LLMs

We extended the evaluation of LLMs for de-identification to analyze cost–performance trade-offs since both in-context learning and full model fine-tuning have practical implications in constrained settings in the healthcare domain. For instance, larger LLMs have a larger memory footprint, and cloud-based LLMs may suffer from high-latency problems. In order to evaluate the costs of each LLM in terms of inference time and output token pricing, we conducted a series of tests and analyses. First, we selected a chunk of 100 tokens from an EHR in the test set of the English N2C2 dataset. Second, we measured the time each LLM took to generate the output for the English zero-shot and one-shot prompts containing the chunk of 100 tokens. Third, we performed the same measurement of the LLM output generation time for the LLMs used for de-identification in German. Fourth, we computed the prediction time for the fine-tuned BERT family models on the entire test set of the English N2C2 dataset and divided the total inference time by the number of EHRs in the test set (514). Finally, we repeated these evaluation steps five times for each LLM and computed the average times and standard deviations at the end. We conducted the cost evaluation on a cluster equipped with 4 × NVIDIA A40 GPUs with 48 GB of GPU RAM and 16 × 64 GB of CPU RAM.
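The timing procedure described above can be sketched as follows. This is a minimal illustration: `generate` stands for any hypothetical callable that sends a prompt to an LLM and returns its output, and the use of `time.perf_counter` and the sample standard deviation are assumptions rather than details reported in the study.

```python
import time
from statistics import mean, stdev


def time_inference(generate, prompt, repeats=5):
    """Average and standard deviation of the generation time over several runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        generate(prompt)
        durations.append(time.perf_counter() - start)
    return mean(durations), stdev(durations)


def per_record_time(total_seconds, n_records=514):
    """Average prediction time per EHR for a model run over a whole test set."""
    return total_seconds / n_records


# Stand-in for an LLM call; a real callable would query a model or an API.
avg_seconds, std_seconds = time_inference(lambda p: p.upper(), "example prompt")
```

For cloud-based models, a measurement of this kind includes network latency, so the reported times reflect the end-to-end delay seen by the caller and not the model's compute time alone.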
Table 10 presents the inference times of each LLM used for de-identification, as well as the output token pricing for the proprietary, closed-source models (https://openai.com/api/pricing/, accessed on 23 January 2025). The results in the table show that inference times for zero-shot de-identification are lower than those for one-shot de-identification for all LLMs, except for GPT-4 for the English dataset, which presented higher variance for the one-shot de-identification results. Overall, the fine-tuned BERT models presented the shortest average inference times. ClinicalBERT and DistilBERT were the fastest models, with 0.016 s and 0.018 s per full EHR in the N2C2 dataset’s test set, respectively. While the closed-source models were faster than the on-premises decoder-only LLMs in our evaluation, the usage of their APIs imposes financial costs for obtaining the predictions. For instance, GPT-4 output tokens had the highest price among the selected closed-source LLMs. The on-premises decoder-only LLMs evaluated in our study had no inference costs, regardless of the number of output tokens. LLaMA 3 was the fastest open-source decoder-only LLM, while FLAN-T5 XXL was the most time-consuming model used in our evaluation.
Healthcare is a domain in which computational resources for integrating transformer-based models are critical [10]. For this reason, we also measured the memory needed to load and fine-tune all open-source LLMs used in this study. Table 11 presents the memory needed to instantiate each pre-trained LLM, as well as to train each model on a batch size of 1 using the Adam optimizer. FLAN-T5 XXL is the most memory-demanding LLM in our evaluation, whereas the BERT family models have the lightest memory requirements due to their compact size and encoder-only architectures. In addition, the results in Table 11 highlight a limitation of closed-source models, whose memory footprint for fine-tuning is hard to estimate precisely, and model fine-tuning might not be an available option. Since the memory requirements for full fine-tuning of generative LLMs increase for larger LLMs, hybrid approaches for model fine-tuning can be feasible and promising options for improving de-identification performance. For instance, low-rank adaptation (LoRA) [87] reduces the number of trainable parameters for LLMs [70]. LoRA avoids re-training the original model weights by freezing them and introducing trainable lower-rank decomposition matrices into the transformer architecture layers [87]. Updating the lower-rank matrices is more computationally efficient than updating the full original model weight matrix [70]. The benefits of LoRA include reduced memory requirements, comparable or superior model performance for fine-tuning tasks, and avoiding increased inference latency [87].
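The parameter savings of LoRA can be illustrated with a small sketch. It assumes a single frozen weight matrix `W0` of shape d × k and a rank-r update `B A`, with `B` of shape d × r and `A` of shape r × k; the function names and the matrix sizes are hypothetical, and the scaling factor applied to the update in the LoRA paper is omitted for brevity.

```python
def full_params(d, k):
    """Trainable parameters when the full d x k weight matrix is updated."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters of the rank-r factors B (d x r) and A (r x k)."""
    return r * (d + k)


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W0, A, B, x):
    """h = W0 x + B (A x); W0 stays frozen and only A and B would be trained."""
    base = matvec(W0, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


# Hypothetical layer size and rank.
d, k, r = 4096, 4096, 8
share = lora_params(d, k, r) / full_params(d, k)
```

With these hypothetical sizes, the low-rank factors hold well under one percent of the parameters of the full matrix, which is where the reduced memory requirements for gradients and optimizer states come from.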

6. Discussion

EHR de-identification is a key task in preserving the privacy of patients and healthcare providers. The use of neural network-based methods for this task has promoted progress in terms of performance, and so have LLMs recently [5]. However, LLM adoption for de-identification has benefits and challenges that need to be considered. In this section, we discuss the results of our study and place their implications into a broader context.
Although feature engineering for de-identification remains costly and time-consuming, encoder–decoder and decoder-only LLMs with in-context learning were able to alleviate the need for it. Larger decoder-only LLMs provided satisfactory recall scores, even for zero-shot settings requiring no examples in the prompts. Adding an example to the prompts in a one-shot setting elevated performance scores without requiring changes to the task statements or set of rules in the prompts. Overall, the model size and model training corpus size had an influence on the final performance. However, larger LLMs pose additional requirements for deployment, including computation, energy consumption, storage capacity, infrastructure costs, and communication costs [88]. Generative LLMs might be appropriate for individualized de-identification applications like chatbots.
When it came to the full fine-tuning of LLMs, an extensive data pre-processing routine was required, including named-entity alignment and changes to the model architecture. BERT family LLMs provided competitive de-identification performance after full fine-tuning. However, data dependency is still an open challenge to overcome, especially for labeled training data [89]. Overall, encoder-only models are high-performing and unlikely to hallucinate since the model output is fixed by the inputs. Encoder-only LLMs might be appropriate for deployment on-premises for large-scale applications in healthcare organizations. Therefore, full fine-tuning can be advantageous for critical settings that require robustness to hallucinations and on-premises computing.

6.1. Limitations

LLM-based EHR de-identification faces numerous challenges from data, model, and evaluation perspectives. The reduced availability of de-identification datasets, especially real-world datasets, is a longstanding drawback for this task due to privacy risks for patients who generate the data and privacy regulations that prohibit data release [4,63]. Furthermore, the availability of de-identification datasets for languages other than English also contributes to difficulties in the development and evaluation of de-identification models [15,48]. Challenges from real-world EHRs in German, such as sentence fragments, unclear abbreviations, and complex sentences, might also complicate the de-identification evaluation by LLMs, both encoder-only and generative ones. We include examples of these challenges in Table A3 in Appendix C. Generative LLMs strongly rely on prompt engineering to provide high-quality outputs, whereas refining the input prompts might require many attempts [5]. In addition, closed-source LLMs might not be deployed on an organization’s premises, hence conflicting with data protection regulations. Even though closed-source model proprietary constraints contribute to limiting access to model specifications and hinder their customization, estimating their performance capabilities for de-identification is an essential step in defining baselines for open-source models. Finally, evaluating generative LLMs often requires the definition of rules in the post-processing step, contrary to encoder-only models.
Since generative LLMs might perpetuate biases toward social groups and produce deceiving outputs, debiasing and dehallucination approaches [90] can be integrated into model evaluation. We investigated whether the generative LLMs used in our study generated outputs containing biases or hallucinations. We considered the occurrence of gender bias and hallucinations as follows. First, we considered an LLM output to be biased toward gender if it contained female or male words [91] instead of avoiding them, as instructed by the prompts (see Figure 2 and Figure 3). We considered the following female and male words to quantify gender bias for the LLMs in English (‘female’, ‘male’, ‘Ms’, ‘Mrs’, ‘Mr’, ‘woman’, ‘man’) and German (‘weiblich’, ‘männlich’, ‘Frau’, ‘Herr’, ‘Mann’). Second, we considered a hallucination to be any model output longer than five words since five is the length of the longest PHI instance in the N2C2 dataset’s test set. So, lengthy outputs without a PHI instance were considered hallucinations since they either contradict the rules in the prompts or deviate from the input [90].
Table 12 presents a quantitative assessment of gender bias and hallucination rates of the six generative LLMs for the EHRs in the test set of the N2C2 dataset in English and German. The rates in the table correspond to the percentage of EHRs for which the outputs included at least one occurrence of the binary-gendered words or a hallucination. The results in the table suggest that one-shot settings for in-context learning aided in reducing gender bias for five out of the six models. In contrast, a reduction in the number of hallucinations was only noticed for the GPT-4 and GPT-4o models in one-shot settings. The results also suggest that hallucinations were less frequent for the experiments using German data. It is important to note that the post-processing rules proposed in Section 5.1 were able to detect such issues.
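The flagging criteria and the per-EHR rates described above can be sketched as follows. This is a minimal illustration and not the evaluation code of the study: the function names, the case-insensitive matching, and the whitespace-based word count are assumptions, and the length check does not verify whether a long output contains a PHI instance.

```python
import re

# Word lists taken from the text; matching is lower-cased (an assumption).
GENDER_WORDS = {
    "en": {"female", "male", "ms", "mrs", "mr", "woman", "man"},
    "de": {"weiblich", "männlich", "frau", "herr", "mann"},
}
MAX_PHI_WORDS = 5  # length of the longest PHI instance in the test set


def has_gender_word(outputs, lang="en"):
    """True if any output item of one EHR contains a binary-gendered word."""
    vocab = GENDER_WORDS[lang]
    return any(
        w.lower() in vocab for out in outputs for w in re.findall(r"\w+", out)
    )


def has_hallucination(outputs):
    """True if any output item of one EHR is longer than five words."""
    return any(len(out.split()) > MAX_PHI_WORDS for out in outputs)


def rate(per_ehr_outputs, flag):
    """Percentage of EHRs whose outputs are flagged at least once."""
    flagged = sum(1 for outputs in per_ehr_outputs if flag(outputs))
    return 100.0 * flagged / len(per_ehr_outputs)


# Hypothetical model outputs for four EHRs.
ehrs = [
    ["John Smith", "Mr. Smith"],
    ["The patient was admitted to the hospital yesterday"],
    ["Graz"],
    ["01/02/2080"],
]
bias_rate = rate(ehrs, has_gender_word)
hallucination_rate = rate(ehrs, has_hallucination)
```

Because each EHR is counted at most once per criterion, the rates are insensitive to how many flagged items a single record contains.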

6.2. Integration of Privacy-Enhancing Technologies

Since numerous privacy and security risks threaten de-identification models, the integration of additional privacy-enhancing technologies (PETs), such as differential privacy (DP) [92] and federated learning (FL) [93], can enhance the privacy guarantees for the de-identified EHRs. Table 13 presents six prominent privacy and security risks and suitable PETs for mitigating each risk for de-identification tasks. These risks target models and datasets for de-identification. Membership inference and model inversion attacks directly target models [94,95,96]. Similarly, linking attacks, data leakages during transmission, unauthorized data access, and attacks on centralized cloud storage affect datasets [10,97,98,99]. Both DP and FL are efficient PETs. However, using DP often involves balancing privacy–fairness trade-offs [100], and FL faces vulnerabilities to backdoor attacks [101].
DP adds controlled noise levels to data or computation outputs and offers a formal quantification of privacy protection via the privacy budget ϵ [92]. This PET can be used to hinder membership inference attacks, i.e., inferring whether a data instance was used to train a model [94], by obfuscating the de-identification model gradients with noise from differentially private optimizers like DP-SGD [95]. DP-SGD can also harden the defenses against model inversion attacks, an attack category that aims at recovering original training data from the trained model parameters [96]. Finally, linking attacks in which de-identified data are combined with external data to infer private information can be prevented by data perturbation with DP [97].
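The gradient obfuscation performed by a differentially private optimizer can be sketched as follows. This is a minimal illustration of one DP-SGD-style gradient computation (per-example clipping followed by Gaussian noise) on plain Python lists; the function names and parameters are hypothetical, and the privacy accounting that relates the noise level to the privacy budget ϵ is not shown.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a per-example gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, max_norm, noise_multiplier, rng=random):
    """Clip each example's gradient, sum, add Gaussian noise, and average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


# Hypothetical per-example gradients for a two-parameter model.
grads = [[3.0, 4.0], [0.6, 0.8]]
noisy_grad = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.1,
                             rng=random.Random(1234))
```

Clipping bounds the influence of any single training record on the update, and the added noise masks what remains of that influence, which is what makes membership inference and model inversion harder.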
FL enables models to be trained in a distributed manner, in which a global model is sent to client devices, which compute local updates on their locally stored datasets and return them to a centralized server for updating the global model [93]. This PET prevents data leakages during transmission since the locally stored data remain on their owner’s premises, and model parameters are exchanged instead [98]. The distributed nature of FL, which dispenses with the transmission of raw data [93], can also mitigate the issue of unauthorized data access [10]. Finally, FL for distributed training of de-identification models does not rely on centralized cloud storage for the training datasets, which could be affected by attacks on centralized cloud storage otherwise [99].
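One round of this training scheme can be sketched as follows. This is a minimal illustration that assumes federated averaging, i.e., a dataset-size-weighted mean of the client updates, as the aggregation rule; the function names, the flat list of weights, and the `local_update` callable are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average client models, weighting each by its local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def federated_round(global_weights, clients, local_update):
    """Send the global model to each client and aggregate the returned updates."""
    updates, sizes = [], []
    for data in clients:
        # Only model weights leave the client; the data stay local.
        updates.append(local_update(list(global_weights), data))
        sizes.append(len(data))
    return federated_average(updates, sizes)


# Hypothetical local training step: shift each weight by the local data mean.
def toy_update(weights, data):
    shift = sum(data) / len(data)
    return [w + shift for w in weights]


new_global = federated_round([0.0], [[1.0, 1.0], [3.0, 3.0, 3.0, 3.0]], toy_update)
```

The server only ever sees model parameters, which is the property that keeps the EHRs on their owner's premises.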

6.3. Deployment Considerations

The deployment of LLMs for de-identification involves a series of legal and technical requirements to be observed. Figure 8 illustrates the deployment considerations for both cloud-based and on-premises LLMs for de-identification. First, latency for inferences is a key requirement to be managed for deploying LLM-based applications since inference times affect the practical applicability of the models [88]. Cloud-based LLMs may suffer from higher latency and dependency on network bandwidth than on-premises models, hence affecting the speed at which de-identification outputs are generated. Second, cloud-based LLMs often require data transmission from local devices to the cloud, compromising the privacy and security of the transmitted data. For instance, using LLMs in healthcare tasks might lead to privacy implications, such as data leakages and unauthorized access to data [10]. So, deploying de-identification models on-premises would be preferable for stronger compliance with data protection regulations. Third, scalability should also be taken into account [88], especially for applications with high demand. Cloud-based LLMs may easily adapt to a growing number of users, while on-premises LLMs may require infrastructure improvements to catch up with increased demand for model usage. Fourth, cloud-based LLMs impose higher inference costs, as demonstrated by output token pricing practices, and impede model adaptations like fine-tuning. In contrast, on-premises LLMs can be customized according to an organization’s needs and deployed for use without inference costs. Further, hosting LLMs on-premises demands infrastructure resources and might cause overheads due to model size and memory requirements. Finally, integrating LLMs into healthcare systems can be challenging due to the costs for development, software integration, and operation, which may amplify inequalities in healthcare [10].
Therefore, the successful integration of LLMs with healthcare applications demands extensive engineering efforts for mitigating privacy risks [10], balancing model size and available resources [88], complying with data protection regulations [10], and mitigating biases and hallucinations in model outputs [90].

6.4. Future Work Directions

The rise of GenAI and its applications to healthcare inspire numerous future work directions. First, benchmarking further encoder-only, encoder–decoder, and decoder-only LLMs for de-identification might extend the knowledge about the capabilities of each model family and architecture. Second, increasing the number of LLMs for the German language can improve the understanding of challenges in bilingual settings. Third, additional languages can also be considered for de-identification evaluation, simulating international applications that run over organizations in different countries. A dataset for future experiments for German is GGPONC 2.0 [18], and potential datasets for other languages include NUT [15] for Dutch and the Stockholm EPR PHI Corpus [61] for Swedish, to name a few. Fourth, extending LLM-based de-identification to additional datasets, including real-world data, might aid in further understanding the generalization capabilities and limitations of each model and its suitability for real-world scenarios. The assessment of few-shot learning and fine-tuning procedures [70] for LLM optimization for de-identification can also be explored. Further, refining the post-processing evaluation steps and integrating debiasing and dehallucination approaches to mitigate potential issues in LLM outputs for de-identification is a promising research direction. Comparing such evaluations against de-identification tools, like Microsoft Presidio (https://microsoft.github.io/presidio/, accessed on 27 January 2025), might also be in the scope of future work. Finally, combining PETs like DP [92] and FL [93] and trustworthy artificial intelligence [102] requirements with the de-identification models can enrich both the scientific literature and applications.

7. Conclusions

This work compared ten recent LLMs for EHR de-identification in English and German, including encoder-only, encoder–decoder, and decoder-only models with in-context learning and full fine-tuning. Our experimental evaluation indicates that both in-context learning and full fine-tuning can yield competitive de-identification scores. While in-context learning eliminates the need for feature engineering for generative LLMs, full fine-tuning of BERT family models enables a streamlined model evaluation during the post-processing step. These findings reflect the application scenarios for such models. However, data and model availability issues require special attention and can be tackled by successive research. Overall, EHR de-identification can strongly benefit from LLMs that efficiently preserve privacy and yield high performance in a variety of applications and technical settings.

Author Contributions

Conceptualization, S.S., M.J. and R.K.; methodology, S.S.; validation, S.S., M.J. and R.K.; formal analysis, S.S.; investigation, S.S.; resources, M.K.; writing—original draft preparation, S.S. and M.J.; writing—review and editing, S.S., M.J., M.K. and R.K.; visualization, S.S.; supervision, R.K.; project administration, M.K.; funding acquisition, M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research is part of the project “Simplification of Medical Reports using Artificial Intelligence” (SimplifAI), funded by the Austrian Ministry of Transport, Innovation and Technology (BMVIT) and the Austrian Research Promotion Agency (FFG) within the strategic objective ICT of the Future (https://www.ffg.at/iktderzukunft, accessed on 27 January 2025). Know Center Research GmbH is a COMET center within the COMET—Competence Centers for Excellent Technologies Programme—and funded by BMK and BMAW, as well as the co-finance provinces Styria, Vienna, and Tyrol. COMET is managed by FFG.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from the DBMI Data Portal and are available at https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ (accessed on 27 January 2025) with the permission of the DBMI Data Portal. The MUG-CTHEAD dataset used within this study is not publicly available due to legal restrictions but is available from the corresponding author upon reasonable request. Code can be shared upon reasonable request by contacting the corresponding author.

Conflicts of Interest

Mark Kröll was employed by the company Know Center Research GmbH. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
API: Application programming interface
BiLSTM: Bidirectional long short-term memory
CD: Critical difference
CNNs: Convolutional neural networks
CPU: Central processing unit
CRF: Conditional random fields
DP: Differential privacy
EHR: Electronic health record
EU: European Union
FL: Federated learning
GDPR: General Data Protection Regulation
GenAI: Generative artificial intelligence
GPU: Graphics processing unit
GRU: Gated recurrent units
HIPAA: Health Insurance Portability and Accountability Act
LLMs: Large language models
LoRA: Low-rank adaptation
LR: Learning rate
LSTM: Long short-term memory
MIST: MITRE Identification Scrubber Toolkit
NER: Named-entity recognition
NLP: Natural language processing
PETs: Privacy-enhancing technologies
PHI: Protected health information
RAM: Random-access memory
SVMs: Support vector machines
UMLS: Unified Medical Language System
US: United States

Appendix A. Large Language Model Versions and Hyperparameters

This appendix provides detailed information on the LLMs used in this study. Table A1 presents the models, their versions, and the hyperparameter values set for the experimental evaluation. The table includes the following hyperparameters:
  • ‘temp.’ stands for temperature, which controls the probabilities for the next tokens.
  • ‘top_p’ sets the cumulative probability for nucleus sampling.
  • ‘top_k’ limits the number of highest-probability tokens.
  • ‘n’ limits the number of completions for GPT family models.
  • ‘seed’ is the random seed for sampling.
  • ‘LR’ stands for the learning rate for the BERT family models.
  • ‘batch_size’ defines the batch sizes for training and inference during full fine-tuning.
  • ‘epochs’ defines the number of passes through the training set during full fine-tuning.
The LLMs listed in Table A1 are either publicly available on the Hugging Face Hub or accessible via the OpenAI API. The following models are publicly available on the Hugging Face Hub:
  • BERTbase (‘google-bert/bert-base-uncased’).
  • ClinicalBERT (‘medicalai/ClinicalBERT’).
  • DistilBERT (‘distilbert/distilbert-base-uncased’).
  • FLAN-T5 XXL (‘google/flan-t5-xxl’).
  • LLaMA 3 (‘meta-llama/Meta-Llama-3-8B-Instruct’).
  • Mistral-7B (‘mistralai/Mistral-7B-Instruct-v0.3’).
  • RoBERTabase (‘FacebookAI/roberta-base’).
Finally, the following models are accessible via the OpenAI API:
  • GPT-3.5 Turbo (‘gpt-3.5-turbo-0125’).
  • GPT-4 (‘gpt-4-0613’).
  • GPT-4o (‘gpt-4o-2024-08-06’).
Table A1. The statistics of the large language models used in this study.
| Model | Version | Hyperparameters |
|---|---|---|
| GPT-3.5 Turbo | ‘gpt-3.5-turbo-0125’ | {temp. = 0.1, top_p = 0.1, n = 1, seed = 1234} |
| GPT-4 | ‘gpt-4-0613’ | {temp. = 0.1, top_p = 0.1, n = 1, seed = 1234} |
| GPT-4o | ‘gpt-4o-2024-08-06’ | {temp. = 0.1, top_p = 0.1, n = 1, seed = 1234} |
| FLAN-T5 XXL | ‘google/flan-t5-xxl’ | {temp. = 0.1, p = 0.1, top_k = 1, seed = 1234} |
| LLaMA 3 | ‘meta-llama/Meta-Llama-3-8B-Instruct’ | {temp. = 0.1, top_p = 0.1, top_k = 1, seed = 1234} |
| Mistral-7B | ‘mistralai/Mistral-7B-Instruct-v0.3’ | {temp. = 0.1, top_p = 0.1, top_k = 1, seed = 1234} |
| BERTbase | ‘google-bert/bert-base-uncased’ | {LR = 2 × 10⁻⁵, batch_size = 1, epochs = {1, …, 5}} |
| ClinicalBERT | ‘medicalai/ClinicalBERT’ | {LR = 2 × 10⁻⁵, batch_size = 1, epochs = {1, …, 3}} |
| DistilBERT | ‘distilbert/distilbert-base-uncased’ | {LR = 2 × 10⁻⁵, batch_size = 1, epochs = {1, …, 5}} |
| RoBERTabase | ‘FacebookAI/roberta-base’ | {LR = 2 × 10⁻⁵, batch_size = 1, epochs = {1, …, 5}} |

Appendix B. Post-Processing Evaluation

This appendix empirically evaluates the post-processing rules developed to evaluate generative LLMs for de-identification. Table A2 presents the impact of the proposed rules on precision, recall, and F1 scores for the five generative LLMs used for the de-identification of the English N2C2 dataset with in-context learning. The column ‘Rules’ indicates whether the post-processing rules were utilized to compute the performance scores. Overall, utilizing the post-processing rules resulted in a reduction in false positives and improvements in terms of precision.
Table A2. The impact of the proposed post-processing rules on evaluating the de-identification performance of LLMs with in-context learning for the English N2C2 dataset. The column ‘Rules’ indicates whether the post-processing rules were utilized to compute the scores.
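| Model | Rules | Zero-Shot Precision | Zero-Shot Recall | Zero-Shot F1 | One-Shot Precision | One-Shot Recall | One-Shot F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FLAN-T5 XXL | No | 5.37% | 14.46% | 7.83% | 29.31% | 59.10% | 39.19% |
| GPT-3.5 Turbo | No | 18.57% | 42.24% | 25.80% | 32.93% | 59.27% | 42.34% |
| GPT-4 | No | 25.04% | 74.62% | 37.49% | 32.09% | 74.62% | 37.49% |
| LLaMA 3 | No | 37.45% | 32.20% | 34.63% | 37.18% | 48.42% | 42.06% |
| Mistral-7B | No | 11.61% | 58.34% | 19.37% | 19.49% | 68.70% | 30.37% |
| FLAN-T5 XXL | Yes | 8.62% | 14.46% | 10.80% | 55.25% | 59.10% | 57.11% |
| GPT-3.5 Turbo | Yes | 27.81% | 42.24% | 33.54% | 65.41% | 59.27% | 62.19% |
| GPT-4 | Yes | 47.73% | 74.62% | 58.23% | 70.17% | 87.14% | 77.74% |
| LLaMA 3 | Yes | 55.56% | 32.20% | 40.77% | 59.11% | 48.42% | 53.23% |
| Mistral-7B | Yes | 15.55% | 58.34% | 24.56% | 38.33% | 68.70% | 49.20% |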
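
The relationship between the three scores in Table A2 can be checked with a short Python sketch. It computes F1 as the harmonic mean of precision and recall and shows, with made-up counts, why removing false positives raises precision while leaving recall untouched. The function names and the counts are illustrative assumptions; only the FLAN-T5 XXL one-shot percentages are taken from the table.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted PHI instances that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of gold PHI instances that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (same scale for both inputs)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# FLAN-T5 XXL, one-shot, with rules (values from Table A2, in percent).
print(round(f1_score(55.25, 59.10), 2))  # 57.11

# Hypothetical counts: the rules discard 25 of 50 false positives.
tp, fn = 50, 50
before = (precision(tp, fp=50), recall(tp, fn))
after = (precision(tp, fp=25), recall(tp, fn))
print(before, after)  # precision rises, recall stays at 0.5
```

Because the rules in this sketch only remove spurious predictions, the true-positive and false-negative counts are unaffected, which is why only precision (and therefore F1) changes.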

Appendix C. Issues in Real-World EHRs in German

This appendix provides examples of issues in real-world EHRs in German. Table A3 shows three snippets derived from real-world EHRs in German after the anonymization of the original EHRs. All personal data and PHI instances shown in the examples are fictitious, and the content was created for illustrative purposes only. Common issues in real-world EHRs in German include inconsistent spacing and punctuation, unclear abbreviations, sentence fragments, complex terminology, and complex sentence structures.
Table A3. Examples of issues in real-world EHRs in German.
| Health Record Snippets | Issues |
| --- | --- |
| “vom Pony getreten, Pat. schwanger!!!!. Abklärung (SG+NNH)? Patientin derzeit in der 27. Schwangerschaftswoche. Nach einem Aufklärungsgespräch bezüglich der Strahlenbelastung einer Schädel-CT-Untersuchung für den Fötus, lehnt die Patientin die Untersuchung zum derzeitigen Zeitpunkt ab. Derzeit ist die Patientin subjektiv und objektiv in klinischer Beschwerdefreiheit. Bei einer etwaigen Verschlechterung, ist eine jederzeitige Wiedervorstellung zum Schädel CT möglich.Das Gespräche wird in Beisein des diensthabenden RT Ass geführt. Der zuweisende Dr Mustermann wird über das Gespräch telefonisch in Kenntnis gesetzt.” | Inconsistent spacing and punctuation, excessive use of exclamation marks, sentence fragment, sentences with missing verbs or in indirect order, inconsistent article–noun agreement. |
| “St.p.Trauma. Kontrolle? Im Vergleich zur VU vom 15.10.2008 geringfügige Zunahme der epiduralen oder subduralen Blutansammlung, diese nunmehr in einer maximalen Längsausdehnung von maximal etwa 4 cm und einer Breite von etwa 0.5 cm hoch parietookzipital rechts. Zunahme des cortical/subcorticalen Kontusionsherdes, dieser vormals etwa 1.7 cm, nunmehr etwa 2.3 cm haltend temperoparietal rechts. Neu aufgetreten eine etwa 5 cm lange und 0.6 cm breite konvex bogig berandete Blutansammlung offenbar epidural okzipital links. Neu aufgetreten eine SAB temperookzipital links, die SAB frontotemporal geringfügig regredient. Geringfügige Regredienz des Kopfschwartenhämatoms okzipital rechts.” | Unclear abbreviation, sentence fragment, word repetition, complex sentence. |
| “SHT, SAB. Kontrolle nativ? Verlaufskontrolle zu einer auswärtigen VU vom 14.10.2009 (Klinikum am Südpark). CT des Gehirnschädels: Im Bildvergleich Umverteilung des Hämatocephalus internus in die Hinterhörner der Seitenventrikel und am Boden des 3. Ventrikels. Regredienz der Hämorrhagien in den basalen und perimesencephalen Zisternen mit Umverteilung der SAB nach parieto-frontal beidseits. Neu eingebrachte Ventrikeldrainage über frontal rechts mit der Spitze im Seitenventrikel im Vorderhorn. Etwas zunehmende Weite der Seitenventrikel. Kein Mittellinienshift. Die Zeichen des Hirnödems abnehmend. Neu demarkiert eine umschriebene Hypodensität cerebellär rechts, fragl. subakut ischämisch. Bekannte ausgedehnte Gesichtsschädelfrakturen. Geringe Zunahme des Hämatosinus sphenoidales.” | Unclear abbreviation, complex terminology, complex sentence, typo. |

Appendix D. File Names of the EHRs Sampled from the German N2C2 Dataset for the De-Identification Experiments

Table A4 lists the file names of the 50 EHRs sampled from the test set of the German N2C2 dataset and used to evaluate de-identification by LLMs in German.
Table A4. The file names of the 50 EHRs sampled from the German N2C2 dataset’s test set for the de-identification experiments.
EHR File Names
112-02.xml, 112-03.xml, 130-03.xml, 132-01.xml, 132-03.xml,
137-01.xml, 137-02.xml, 138-01.xml, 138-02.xml, 138-03.xml,
138-04.xml, 160-04.xml, 161-01.xml, 163-01.xml, 163-03.xml,
166-01.xml, 190-03.xml, 190-04.xml, 193-05.xml, 199-01.xml,
199-05.xml, 200-04.xml, 202-03.xml, 209-01.xml, 210-04.xml,
211-02.xml, 214-01.xml, 216-05.xml, 219-01.xml, 219-03.xml,
219-04.xml, 219-05.xml, 234-01.xml, 310-02.xml, 314-01.xml,
314-02.xml, 314-05.xml, 316-01.xml, 317-01.xml, 318-01.xml,
318-03.xml, 318-04.xml, 319-01.xml, 340-01.xml, 340-03.xml,
343-01.xml, 347-02.xml, 347-03.xml, 349-01.xml, 385-01.xml.

References

  1. Liu, Z.; Tang, B.; Wang, X.; Chen, Q. De-identification of clinical notes via recurrent neural network and conditional random field. J. Biomed. Inform. 2017, 75, S34–S42. [Google Scholar] [CrossRef]
  2. Leevy, J.L.; Khoshgoftaar, T.M.; Villanustre, F. Survey on RNN and CRF models for de-identification of medical free text. J. Big Data 2020, 7, 1–22. [Google Scholar] [CrossRef]
  3. Act. Health insurance portability and accountability act of 1996. Public Law 1996, 104, 191. [Google Scholar]
  4. European Commission. A New Era for Data Protection in the EU. 2018. Available online: https://commission.europa.eu/document/download/7fa5e36d-6412-4b44-9a2d-12d4838fd4c6_en?filename=data-protection-factsheet-changes_en.pdf (accessed on 30 December 2024).
  5. Liu, Z.; Huang, Y.; Yu, X.; Zhang, L.; Wu, Z.; Cao, C.; Dai, H.; Zhao, L.; Li, Y.; Shu, P.; et al. Deid-gpt: Zero-shot medical text de-identification by gpt-4. arXiv 2023, arXiv:2303.11032. [Google Scholar]
  6. Patil, H.K.; Seshadri, R. Big data security and privacy issues in healthcare. In Proceedings of the 2014 IEEE International Congress on Big Data, Anchorage, AK, USA, 27 June–2 July 2014; pp. 762–765. [Google Scholar]
  7. Henriksen-Bulmer, J.; Jeary, S. Re-identification attacks—A systematic literature review. Int. J. Inf. Manag. 2016, 36, 1184–1192. [Google Scholar] [CrossRef]
  8. Zhang, P.; Kamel Boulos, M.N. Generative AI in medicine and healthcare: Promises, opportunities and challenges. Future Internet 2023, 15, 286. [Google Scholar] [CrossRef]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 6000–6010. [Google Scholar] [CrossRef]
  10. Denecke, K.; May, R.; Rivera-Romero, O. Transformer Models in Healthcare: A Survey and Thematic Analysis of Potentials, Shortcomings and Risks. J. Med. Syst. 2024, 48, 23. [Google Scholar] [CrossRef]
  11. Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Reis, E.P.; Seehofnerova, A.; et al. Clinical text summarization: Adapting large language models can outperform human experts. Res. Sq. 2023. [Google Scholar] [CrossRef]
  12. Chintagunta, B.; Katariya, N.; Amatriain, X.; Kannan, A. Medically aware GPT-3 as a data generator for medical dialogue summarization. In Proceedings of the Machine Learning for Healthcare Conference, PMLR, Virtual Event, 6–7 August 2021; pp. 354–372. [Google Scholar]
  13. Xu, B.; Gil-Jardiné, C.; Thiessard, F.; Tellier, E.; Avalos, M.; Lagarde, E. Pre-training a neural language model improves the sample efficiency of an emergency room classification model. In Proceedings of the FLAIRS-33-Thirty-Third International Flairs Conference, North Miami Beach, FL, USA, 17–20 May 2020. [Google Scholar]
  14. Sousa, S.; Kern, R. How to keep text private? A systematic review of deep learning methods for privacy-preserving natural language processing. Artif. Intell. Rev. 2023, 56, 1427–1492. [Google Scholar] [CrossRef]
  15. Trienes, J.; Trieschnigg, D.; Seifert, C.; Hiemstra, D. Comparing rule-based, feature-based and deep neural methods for de-identification of dutch medical records. arXiv 2020, arXiv:2001.05714. [Google Scholar]
  16. Kolditz, T.; Lohr, C.; Hellrich, J.; Modersohn, L.; Betz, B.; Kiehntopf, M.; Hahn, U. Annotating German clinical documents for de-identification. In MEDINFO 2019: Health and Wellbeing e-Networks for All; IOS Press BV: Amsterdam, The Netherlands, 2019; pp. 203–207. [Google Scholar]
  17. Rehm, G.; Uszkoreit, H. The German Language in the European Information Society. In The German Language in the Digital Age; Springer: Berlin/Heidelberg, Germany, 2012; pp. 47–53. [Google Scholar]
  18. Borchert, F.; Lohr, C.; Modersohn, L.; Witt, J.; Langer, T.; Follmann, M.; Gietzelt, M.; Arnrich, B.; Hahn, U.; Schapranow, M.P. GGPONC 2.0—The German clinical guideline corpus for oncology: Curation workflow, annotation policy, baseline NER taggers. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 3650–3660. [Google Scholar]
  19. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  20. OpenAI. GPT-4 Technical Report. 2023. Available online: https://openai.com/research/gpt-4 (accessed on 12 November 2024).
  21. AI@Meta. Llama 3 Model Card. 2024. Available online: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md (accessed on 27 January 2025).
  22. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019. [Google Scholar] [CrossRef]
  23. Meystre, S.M.; Friedlin, F.J.; South, B.R.; Shen, S.; Samore, M.H. Automatic de-identification of textual documents in the electronic health record: A review of recent research. BMC Med. Res. Methodol. 2010, 10, 1–16. [Google Scholar] [CrossRef]
  24. Berman, J.J. Concept-match medical data scrubbing: How pathology text can be used in research. Arch. Pathol. Lab. Med. 2003, 127, 680–686. [Google Scholar] [CrossRef]
  25. Beckwith, B.A.; Mahaadevan, R.; Balis, U.J.; Kuo, F. Development and evaluation of an open source software tool for deidentification of pathology reports. BMC Med. Inform. Decis. Mak. 2006, 6, 1–9. [Google Scholar] [CrossRef] [PubMed]
  26. Friedlin, F.J.; McDonald, C.J. A software tool for removing patient identifying information from clinical documents. J. Am. Med. Inform. Assoc. 2008, 15, 601–610. [Google Scholar] [CrossRef]
  27. Uzuner, Ö.; Sibanda, T.C.; Luo, Y.; Szolovits, P. A de-identifier for medical discharge summaries. Artif. Intell. Med. 2008, 42, 13–35. [Google Scholar] [CrossRef]
  28. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  29. Wellner, B.; Huyck, M.; Mardis, S.; Aberdeen, J.; Morgan, A.; Peshkin, L.; Yeh, A.; Hitzeman, J.; Hirschman, L. Rapidly retargetable approaches to de-identification in medical records. J. Am. Med. Inform. Assoc. 2007, 14, 564–573. [Google Scholar] [CrossRef]
  30. Lafferty, J.; McCallum, A.; Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Icml, Williamstown, MA, USA, 28 June–1 July 2001; Volume 1, p. 3. [Google Scholar]
  31. Aberdeen, J.; Bayer, S.; Yeniterzi, R.; Wellner, B.; Clark, C.; Hanauer, D.; Malin, B.; Hirschman, L. The MITRE Identification Scrubber Toolkit: Design, training, and assessment. Int. J. Med. Inform. 2010, 79, 849–859. [Google Scholar] [CrossRef]
  32. Yang, H.; Garibaldi, J.M. Automatic detection of protected health information from clinic narratives. J. Biomed. Inform. 2015, 58, S30–S38. [Google Scholar] [CrossRef]
  33. Stubbs, A.; Kotfila, C.; Uzuner, Ö. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. J. Biomed. Inform. 2015, 58, S11–S19. [Google Scholar] [CrossRef]
  34. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019. [Google Scholar] [CrossRef] [PubMed]
  35. Haque, A.; Milstein, A.; Fei-Fei, L. Illuminating the dark spaces of healthcare with ambient intelligence. Nature 2020, 585, 193–202. [Google Scholar] [CrossRef]
  36. Fei, Z.; Ryeznik, Y.; Sverdlov, O.; Tan, C.W.; Wong, W.K. An overview of healthcare data analytics with applications to the COVID-19 pandemic. IEEE Trans. Big Data 2021, 8, 1463–1480. [Google Scholar] [CrossRef]
  37. Hang, C.N.; Tsai, Y.Z.; Yu, P.D.; Chen, J.; Tan, C.W. Privacy-enhancing digital contact tracing with machine learning for pandemic response: A comprehensive review. Big Data Cogn. Comput. 2023, 7, 108. [Google Scholar] [CrossRef]
  38. Chen, H.; Lin, Z.; Ding, G.; Lou, J.; Zhang, Y.; Karlsson, B. GRN: Gated relation network to enhance convolutional neural network for named entity recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6236–6243. [Google Scholar]
  39. Tomy, A.; Razzanelli, M.; Di Lauro, F.; Rus, D.; Della Santina, C. Estimating the state of epidemics spreading with graph neural networks. Nonlinear Dyn. 2022, 109, 249–263. [Google Scholar] [CrossRef]
  40. Tan, C.W.; Yu, P.D.; Chen, S.; Poor, H.V. Deeptrace: Learning to optimize contact tracing in epidemic networks with graph neural networks. arXiv 2022, arXiv:2211.00880. [Google Scholar]
  41. Obeid, J.S.; Heider, P.M.; Weeda, E.R.; Matuskowitz, A.J.; Carr, C.M.; Gagnon, K.; Crawford, T.; Meystre, S.M. Impact of de-identification on clinical text classification using traditional and deep learning classifiers. Stud. Health Technol. Inform. 2019, 264, 283. [Google Scholar] [PubMed]
  42. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1746–1751. [Google Scholar] [CrossRef]
  43. Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610. [Google Scholar] [CrossRef]
  44. Dernoncourt, F.; Lee, J.Y.; Uzuner, O.; Szolovits, P. De-identification of patient notes with recurrent neural networks. J. Am. Med. Inform. Assoc. 2017, 24, 596–606. [Google Scholar] [CrossRef]
  45. Ahmed, T.; Aziz, M.M.A.; Mohammed, N. De-identification of electronic health record using neural network. Sci. Rep. 2020, 10, 18600. [Google Scholar] [CrossRef]
  46. Cho, K.; van Merriënboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar]
  47. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  48. Richter-Pechanski, P.; Amr, A.; Katus, H.A.; Dieterich, C. Deep Learning Approaches Outperform Conventional Strategies in De-Identification of German Medical Reports. In Proceedings of the GMDS, Dortmund, Germany, 8–11 September 2019; pp. 101–109. [Google Scholar]
  49. Baumgartner, M.; Schreier, G.; Hayn, D.; Kreiner, K.; Haider, L.; Wiesmüller, F.; Brunelli, L.; Pölzl, G. Impact analysis of De-identification in clinical notes classification. In dHealth 2022; IOS Press BV: Amsterdam, The Netherlands, 2022; pp. 189–196. [Google Scholar]
  50. Eder, E.; Krieg-Holz, U.; Hahn, U. CodE Alltag 2.0—A pseudonymized German-language email corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4466–4477. [Google Scholar]
  51. Kocaman, V.; Mellah, Y.; Haq, H.; Talby, D. Automated de-identification of arabic medical records. In Proceedings of the ArabicNLP 2023, Singapore, 7 December 2023; pp. 33–40. [Google Scholar]
  52. Zhao, Y.S.; Zhang, K.L.; Ma, H.C.; Li, K. Leveraging text skeleton for de-identification of electronic medical records. BMC Med. Inform. Decis. Mak. 2018, 18, 65–72. [Google Scholar] [CrossRef] [PubMed]
  53. Menger, V.; Scheepers, F.; van Wijk, L.M.; Spruit, M. DEDUCE: A pattern matching method for automatic de-identification of Dutch medical text. Telemat. Inform. 2018, 35, 727–736. [Google Scholar] [CrossRef]
  54. Bourdois, L.; Avalos, M.; Chenais, G.; Thiessard, F.; Revel, P.; Gil-Jardiné, C.; Lagarde, E. De-identification of emergency medical records in French: Survey and comparison of state-of-the-art automated systems. Int. Flairs Conf. Proc. 2021, 34. [Google Scholar] [CrossRef]
  55. Catelli, R.; Gargiulo, F.; Casola, V.; De Pietro, G.; Fujita, H.; Esposito, M. A novel covid-19 data set and an effective deep learning approach for the de-identification of italian medical records. IEEE Access 2021, 9, 19097–19110. [Google Scholar] [CrossRef]
  56. Kajiyama, K.; Horiguchi, H.; Okumura, T.; Morita, M.; Kano, Y. De-identifying free text of Japanese electronic health records. J. Biomed. Semant. 2020, 11, 1–12. [Google Scholar] [CrossRef]
  57. Shin, S.Y.; Park, Y.R.; Shin, Y.; Choi, H.J.; Park, J.; Lyu, Y.; Lee, M.S.; Choi, C.M.; Kim, W.S.; Lee, J.H. A de-identification method for bilingual clinical texts of various note types. J. Korean Med. Sci. 2015, 30, 7–15. [Google Scholar] [CrossRef]
  58. Bråthen, S.; Wie, W.; Dalianis, H. Creating and evaluating a synthetic Norwegian clinical corpus for de-identification. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), Reykjavik, Iceland (Online), 31 May–2 June 2021; pp. 222–230. [Google Scholar]
  59. Prado, C.B.; Gumiel, Y.B.; Schneider, E.T.R.; Cintho, L.M.M.; de Souza, J.V.A.; Oliveira, L.E.S.e.; Paraiso, E.C.; Rebelo, M.S.; Gutierrez, M.A.; Pires, F.A.; et al. De-Identification Challenges in Real-World Portuguese Clinical Texts. In Proceedings of the Latin American Conference on Biomedical Engineering, Florianópolis, Brazil, 24–28 October 2022; pp. 584–590. [Google Scholar]
  60. Marimon, M.; Gonzalez-Agirre, A.; Intxaurrondo, A.; Rodriguez, H.; Martin, J.L.; Villegas, M.; Krallinger, M. Automatic De-identification of Medical Texts in Spanish: The MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results. In Proceedings of the IberLEF@ SEPLN, Bilbao, Spain, 24 September 2019; pp. 618–638. [Google Scholar]
  61. Berg, H.; Dalianis, H. A Semi-supervised Approach for De-identification of Swedish Clinical Text. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4444–4450. [Google Scholar]
  62. Ramshaw, L.A.; Marcus, M.P. Text chunking using transformation-based learning. In Natural Language Processing Using Very Large Corpora; Springer: Berlin/Heidelberg, Germany, 1999; pp. 157–176. [Google Scholar]
  63. European Parliament and Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council. 2016. Available online: https://data.europa.eu/eli/reg/2016/679/oj (accessed on 5 November 2024).
  64. Stubbs, A.; Uzuner, Ö. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. J. Biomed. Inform. 2015, 58, S20–S29. [Google Scholar] [CrossRef]
  65. Jantscher, M.; Gunzer, F.; Kern, R.; Hassler, E.; Tschauner, S.; Reishofer, G. Information extraction from German radiological reports for general clinical text and language understanding. Sci. Rep. 2023, 13, 2353. [Google Scholar] [CrossRef]
  66. Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Ma, J.; Li, R.; Xia, H.; Xu, J.; Wu, Z.; Liu, T.; et al. A Survey on In-context Learning. arXiv 2024. [Google Scholar] [CrossRef]
  67. Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N.A.; Khashabi, D.; Hajishirzi, H. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 13484–13508. [Google Scholar]
  68. Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T.B. Stanford Alpaca: An Instruction-Following Llama Model. 2023. Available online: https://crfm.stanford.edu/2023/03/13/alpaca.html (accessed on 27 January 2025).
  69. Lv, K.; Yang, Y.; Liu, T.; Guo, Q.; Qiu, X. Full Parameter Fine-tuning for Large Language Models with Limited Resources. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 8187–8198. [Google Scholar] [CrossRef]
  70. Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large language models: A survey. arXiv 2024, arXiv:2402.06196. [Google Scholar]
  71. Sun, C.; Yang, Z.; Wang, L.; Zhang, Y.; Lin, H.; Wang, J. Biomedical named entity recognition using BERT in the machine reading comprehension framework. J. Biomed. Inform. 2021, 118, 103799. [Google Scholar] [CrossRef] [PubMed]
  72. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2020. [Google Scholar] [CrossRef]
  73. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  74. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  75. Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling instruction-finetuned language models. J. Mach. Learn. Res. 2024, 25, 1–53. [Google Scholar]
  76. OpenAI. GPT-3.5 Turbo. 2023. Available online: https://platform.openai.com/docs/models/gpt-3-5#gpt-3-5-turbo (accessed on 12 November 2024).
  77. Wang, G.; Liu, X.; Ying, Z.; Yang, G.; Chen, Z.; Liu, Z.; Zhang, M.; Yan, H.; Lu, Y.; Gao, Y.; et al. Optimized glycemic control of type 2 diabetes with reinforcement learning: A proof-of-concept trial. Nat. Med. 2023, 29, 2633–2642. [Google Scholar] [CrossRef]
  78. OpenAI. Hello GPT-4o. 2024. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 24 January 2025).
  79. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar]
  80. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023. [Google Scholar] [CrossRef]
  81. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 2023, 43, 1–55. [Google Scholar]
  82. Eisinga, R.; Heskes, T.; Pelzer, B.; Te Grotenhuis, M. Exact p-values for pairwise comparison of Friedman rank sums, with application to comparing classifiers. BMC Bioinform. 2017, 18, 1–18. [Google Scholar] [CrossRef] [PubMed]
  83. Kanjirangat, V.; Antonucci, A.; Zaalon, M. On the Limitations of Zero-Shot Classification of Causal Relations by LLMs (Work in Progress). Proc. ISSN 2024, 1613, 0073. [Google Scholar]
  84. Gao, J.; Lu, C.; Ding, X.; Li, Z.; Liu, T.; Qin, B. Enhancing Complex Causality Extraction via Improved Subtask Interaction and Knowledge Fusion. arXiv 2024, arXiv:2408.03079. [Google Scholar]
  85. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
  86. Ross, A.; Willson, V.L.; Ross, A.; Willson, V.L. Paired samples T-test. In Basic and Advanced Statistical Tests: Writing Results Sections and Creating Tables and Figures; Sense Publishers: Rotterdam, The Netherlands, 2017; pp. 17–19. [Google Scholar]
  87. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  88. Bai, G.; Chai, Z.; Ling, C.; Wang, S.; Lu, J.; Zhang, N.; Shi, T.; Yu, Z.; Zhu, M.; Zhang, Y.; et al. Beyond efficiency: A systematic survey of resource-efficient large language models. arXiv 2024, arXiv:2401.00625. [Google Scholar]
  89. Dodge, J.; Ilharco, G.; Schwartz, R.; Farhadi, A.; Hajishirzi, H.; Smith, N. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv 2020, arXiv:2002.06305. [Google Scholar]
  90. Lin, Z.; Guan, S.; Zhang, W.; Zhang, H.; Li, Y.; Zhang, H. Towards trustworthy LLMs: A review on debiasing and dehallucinating in large language models. Artif. Intell. Rev. 2024, 57, 243. [Google Scholar] [CrossRef]
  91. You, Z.; Lee, H.; Mishra, S.; Jeoung, S.; Mishra, A.; Kim, J.; Diesner, J. Beyond Binary Gender Labels: Revealing Gender Bias in LLMs through Gender-Neutral Name Predictions. In Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP), Bangkok, Thailand, 16 August 2024; pp. 255–268. [Google Scholar]
  92. Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating Noise to Sensitivity in Private Data Analysis. In Proceedings of the Theory of Cryptography; Halevi, S., Rabin, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 265–284. [Google Scholar]
  93. Konečnỳ, J.; McMahan, H.B.; Yu, F.X.; Richtárik, P.; Suresh, A.T.; Bacon, D. Federated learning: Strategies for improving communication efficiency. In Proceedings of the NIPS Workshop on Private Multi-Party Machine Learning, Barcelona, Spain, 6–8 December 2016. [Google Scholar]
  94. Hu, L.; Yan, A.; Yan, H.; Li, J.; Huang, T.; Zhang, Y.; Dong, C.; Yang, C. Defenses to membership inference attacks: A survey. ACM Comput. Surv. 2023, 56, 1–34. [Google Scholar] [CrossRef]
  95. Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 308–318. [Google Scholar]
  96. He, Z.; Zhang, T.; Lee, R.B. Model inversion attacks against collaborative inference. In Proceedings of the 35th Annual Computer Security Applications Conference, San Juan, PR, USA, 9–13 December 2019; pp. 148–162. [Google Scholar]
  97. Hassan, M.U.; Rehmani, M.H.; Chen, J. Differential privacy techniques for cyber physical systems: A survey. IEEE Commun. Surv. Tutor. 2019, 22, 746–789. [Google Scholar] [CrossRef]
  98. Zhang, C.; Xie, Y.; Bai, H.; Yu, B.; Li, W.; Gao, Y. A survey on federated learning. Knowl.-Based Syst. 2021, 216, 106775. [Google Scholar] [CrossRef]
  99. Singh, A.; Chatterjee, K. Cloud security issues and challenges: A survey. J. Netw. Comput. Appl. 2017, 79, 88–115. [Google Scholar] [CrossRef]
  100. Bagdasaryan, E.; Poursaeed, O.; Shmatikov, V. Differential Privacy Has Disparate Impact on Model Accuracy. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  101. Bagdasaryan, E.; Veit, A.; Hua, Y.; Estrin, D.; Shmatikov, V. How to backdoor federated learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Online, 26–28 August 2020; pp. 2938–2948. [Google Scholar]
  102. Floridi, L. Establishing the rules for building trustworthy AI. Nat. Mach. Intell. 2019, 1, 261–262. [Google Scholar] [CrossRef]
Figure 1. A diagram of the proposed EHR de-identification method based on encoder–decoder and decoder-only LLMs with in-context learning and zero-shot settings. First, the LLMs receive a prompt that defines the task to be performed. Second, the health records are input into the models. Finally, the models return the PHI instances from the health record in their output.
Figure 2. The prompt designed for in-context learning for the encoder–decoder and decoder-only LLMs for the de-identification of the English dataset. First, the prompt defines a task, the objective of which is to identify and return all strings that are PHI instances from the health record text. Second, the set of rules specifies the requirements for each PHI category. Finally, the prompt provides a template for the output format, where n represents the number of PHI instances in a health record.
Figure 3. The prompt designed for in-context learning for the encoder–decoder and decoder-only LLMs for the de-identification of the German dataset. This prompt was translated and adapted from the prompt in English. The task statement, the set of rules, and the output format directly relate to those shown in Figure 2. Finally, n also represents the number of PHI instances in a health record in the output format (“AUSGABEFORMAT”, in German).
Figure 4. The system prompt designed for the translation of the de-identification dataset from English into German.
Figure 5. The user prompt designed for the translation of the de-identification dataset from English into German.
Figure 6. F1 scores of BERT family models over 3 epochs on the English dataset during full fine-tuning. Increasing the number of epochs improved F1 scores for all models.
Figure 7. A Nemenyi test diagram with a critical difference (CD) of 7.820. The statistical test shows that most of the average ranks of the encoder-only LLMs used for de-identification fall within the CD of the best-performing baseline.
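The critical difference in a Nemenyi diagram is conventionally computed as CD = q_α · sqrt(k(k + 1) / (6N)), where k is the number of compared methods and N is the number of measurements over which ranks are averaged. The following is a minimal sketch of that formula, not the authors' code. The caption does not state k or N; the values k = 10 and N = 3 used below are assumptions chosen because they yield a CD of about 7.82, close to the value reported in the caption. The q values are the standard tabulated critical values for α = 0.05.

```python
import math

# Tabulated critical values q_alpha for the two-tailed Nemenyi test at
# alpha = 0.05, indexed by the number of compared methods k.
Q_ALPHA_005 = {
    2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
    7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164,
}


def nemenyi_cd(k: int, n: int) -> float:
    """Critical difference for k methods ranked over n measurements."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6 * n))


# Assumed setting (not stated in the caption): k = 10 methods, N = 3.
print(f"CD = {nemenyi_cd(10, 3):.3f}")
```

Two methods whose average ranks differ by less than the CD are not considered significantly different at the chosen significance level.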
Figure 8. Deployment considerations for cloud-based and on-premises LLMs for de-identification.
Table 1. PHI categories according to the HIPAA and the definition of personal data by the EU’s GDPR.
| # | HIPAA PHI Categories | Personal Data per the EU’s GDPR |
| --- | --- | --- |
| 1 | Names | Any information relating to an identified or identifiable natural person. |
| 2 | Dates, except year | |
| 3 | Telephone numbers | |
| 4 | Geographic data | |
| 5 | FAX numbers | |
| 6 | Social security numbers | |
| 7 | E-mail addresses | |
| 8 | Medical record numbers | |
| 9 | Account numbers | |
| 10 | Health plan beneficiary numbers | |
| 11 | Certificate/license numbers | |
| 12 | Vehicle identifiers and serial numbers | |
| 13 | Web URLs | |
| 14 | Device identifiers and serial numbers | |
| 15 | Internet protocol addresses | |
| 16 | Full-face photos and comparable images | |
| 17 | Biometric identifiers | |
| 18 | Any unique identifying number or code | |
Table 2. The statistics of the real-world EHR dataset derived from radiological reports in German.
| Number of EHRs | Number of Tokens | PHI Category: Names | PHI Category: Dates |
| --- | --- | --- | --- |
| 15 | 1454 | 10 | 15 |
Table 3. The statistics of the LLMs used in this study. E stands for encoder-only models. D stands for decoder-only models. E-D stands for encoder–decoder models.
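| Approach | Model | E | D | E-D | Size | Language | Year |
| --- | --- | --- | --- | --- | --- | --- | --- |
| In-context learning | FLAN-T5 XXL | | | ✓ | 11 B | Multilingual | 2022 |
| In-context learning | GPT-3.5 Turbo | | ✓ | | N/A | Multilingual | 2023 |
| In-context learning | GPT-4 | | ✓ | | 1.76 T | Multilingual | 2023 |
| In-context learning | GPT-4o | | ✓ | | N/A | Multilingual | 2024 |
| In-context learning | LLaMA 3 | | ✓ | | 8 B | Multilingual | 2024 |
| In-context learning | Mistral-7B | | ✓ | | 7 B | Multilingual | 2023 |
| Full fine-tuning | BERT-base | ✓ | | | 110 M | English | 2018 |
| Full fine-tuning | ClinicalBERT | ✓ | | | 110 M | English | 2023 |
| Full fine-tuning | DistilBERT | ✓ | | | 66 M | English | 2019 |
| Full fine-tuning | RoBERTa-base | ✓ | | | 125 M | English | 2019 |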
Table 4. The de-identification performance of LLMs with in-context learning for the English N2C2 dataset.
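| Model | Zero-Shot Precision | Zero-Shot Recall | Zero-Shot F1 Score | One-Shot Precision | One-Shot Recall | One-Shot F1 Score |
| --- | --- | --- | --- | --- | --- | --- |
| FLAN-T5 XXL | 8.62% | 14.46% | 10.80% | 55.25% | 59.10% | 57.11% |
| GPT-3.5 Turbo | 27.81% | 42.24% | 33.54% | 65.41% | 59.27% | 62.19% |
| GPT-4 | 47.73% | 74.62% | 58.23% | 70.17% | 87.14% | 77.74% |
| LLaMA 3 | 55.56% | 32.20% | 40.77% | 59.11% | 48.42% | 53.23% |
| Mistral-7B | 15.55% | 58.34% | 24.56% | 38.33% | 68.70% | 49.20% |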
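The F1 scores in the table above are consistent with the usual definition of F1 as the harmonic mean of precision and recall. The following is a minimal sketch that recomputes F1 for two rows of the table; the function name `f1_score` is illustrative and does not come from the authors' code, and small deviations from the published values can arise from rounding of the reported precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit for both inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall in percent, copied from the table above.
rows = {
    "GPT-4 (one-shot)": (70.17, 87.14),
    "LLaMA 3 (zero-shot)": (55.56, 32.20),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2f}%")
```

Because the harmonic mean is dominated by the smaller of the two values, a model with high precision but low recall, such as zero-shot LLaMA 3, still receives a comparatively low F1 score.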
Table 5. The de-identification performance of LLMs with in-context learning (one-shot) compared to de-identification systems and deep learning baselines for the English N2C2 dataset.
| Model | Precision | Recall | F1 Score |
| --- | --- | --- | --- |
| Nottingham [33] | 0.990 | 0.964 | 0.976 |
| MIST [44] | 0.914 | 0.927 | 0.920 |
| BiLSTM-CRF [44] | 0.979 | 0.978 | 0.978 |
| GRU [45] | 0.987 | 0.958 | 0.972 |
| GRU-GRU [45] | 0.990 | 0.951 | 0.970 |
| LSTM-GRU [45] | 0.987 | 0.952 | 0.969 |
| Self-attention [45] | 0.980 | 0.984 | 0.982 |
| BiLSTM-CRF [15] | 0.959 | 0.869 | 0.912 |
| FLAN-T5 XXL (one-shot) | 0.552 | 0.591 | 0.571 |
| GPT-3.5 Turbo (one-shot) | 0.654 | 0.592 | 0.621 |
| GPT-4 (one-shot) | 0.701 | 0.871 | 0.777 |
| LLaMA 3 (one-shot) | 0.591 | 0.484 | 0.532 |
| Mistral-7B (one-shot) | 0.383 | 0.687 | 0.492 |
Table 6. The de-identification accuracy of LLMs with in-context learning for the English N2C2 dataset compared to LLM baselines from the literature.
| Model | Zero-Shot | One-Shot |
| --- | --- | --- |
| ChatGPT [5] | 0.929 | |
| LLaMA 2 [5] | 0.612 | |
| FLAN-T5 XXL | 0.057 | 0.399 |
| GPT-3.5 Turbo | 0.201 | 0.451 |
| GPT-4 | 0.410 | 0.635 |
| LLaMA 3 | 0.256 | 0.362 |
| Mistral-7B | 0.140 | 0.326 |
Table 7. De-identification performance for fully fine-tuned LLMs compared to deep learning baselines for the English N2C2 dataset.
| Model | Precision | Recall | F1 Score |
| --- | --- | --- | --- |
| BiLSTM-CRF [44] | 0.979 | 0.978 | 0.978 |
| GRU [45] | 0.987 | 0.958 | 0.972 |
| GRU-GRU [45] | 0.990 | 0.951 | 0.970 |
| LSTM-GRU [45] | 0.987 | 0.952 | 0.969 |
| Self-attention [45] | 0.980 | 0.984 | 0.982 |
| BiLSTM-CRF [15] | 0.959 | 0.869 | 0.912 |
| BERT-base (5 epochs) | 0.929 ± 0.002 | 0.948 ± 0.001 | 0.938 ± 0.001 |
| ClinicalBERT (3 epochs) | 0.842 ± 0.009 | 0.849 ± 0.005 | 0.845 ± 0.007 |
| DistilBERT (5 epochs) | 0.904 ± 0.005 | 0.922 ± 0.005 | 0.913 ± 0.005 |
| RoBERTa-base (5 epochs) | 0.953 ± 0.001 | 0.964 ± 0.001 | 0.959 ± 0.001 |
Table 8. The de-identification performance of GPT-4 and GPT-4o with in-context learning for the German N2C2 dataset.
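| Model | Zero-Shot Precision | Zero-Shot Recall | Zero-Shot F1 Score | One-Shot Precision | One-Shot Recall | One-Shot F1 Score |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 46.48% | 66.51% | 54.72% | 71.80% | 81.48% | 76.33% |
| GPT-4o | 39.10% | 56.88% | 46.34% | 79.57% | 77.33% | 78.43% |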
Table 9. The de-identification performance of GPT-4 and GPT-4o with in-context learning for the real-world German dataset.
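| Model | Zero-Shot Precision | Zero-Shot Recall | Zero-Shot F1 Score | One-Shot Precision | One-Shot Recall | One-Shot F1 Score |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 41.66% | 60.00% | 49.18% | 78.94% | 60.00% | 68.18% |
| GPT-4o | 50.00% | 60.00% | 54.54% | 83.33% | 60.00% | 69.76% |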
Table 10. Average inference time, standard deviation, and output token pricing for each LLM. ZS stands for zero-shot. OS stands for one-shot. FT stands for fine-tuning. All inference times are stated in seconds (s), and all model prices are stated in US dollars at standard pricing as of 23 January 2025.
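| Model | ZS | OS | FT | Time | Output Token Pricing |
| --- | --- | --- | --- | --- | --- |
| FLAN-T5 XXL | ✓ | | | 5.144 s ± 0.598 | |
| FLAN-T5 XXL | | ✓ | | 29.576 s ± 0.173 | |
| GPT-3.5 Turbo | ✓ | | | 0.405 s ± 0.027 | USD 1.50/1M tokens |
| GPT-3.5 Turbo | | ✓ | | 0.481 s ± 0.052 | USD 1.50/1M tokens |
| GPT-4 (English) | ✓ | | | 1.905 s ± 0.269 | USD 60.00/1M tokens |
| GPT-4 (English) | | ✓ | | 1.424 s ± 0.431 | USD 60.00/1M tokens |
| LLaMA 3 | ✓ | | | 3.035 s ± 0.057 | |
| LLaMA 3 | | ✓ | | 3.266 s ± 0.049 | |
| Mistral-7B | ✓ | | | 8.507 s ± 0.185 | |
| Mistral-7B | | ✓ | | 9.086 s ± 0.086 | |
| BERT-base | | | ✓ | 0.026 s ± 0.000 | |
| ClinicalBERT | | | ✓ | 0.016 s ± 0.000 | |
| DistilBERT | | | ✓ | 0.018 s ± 0.002 | |
| RoBERTa-base | | | ✓ | 0.029 s ± 0.000 | |
| GPT-4 (German) | ✓ | | | 2.408 s ± 0.404 | USD 60.00/1M tokens |
| GPT-4 (German) | | ✓ | | 2.644 s ± 0.548 | USD 60.00/1M tokens |
| GPT-4o (German) | ✓ | | | 1.446 s ± 0.298 | USD 10.00/1M tokens |
| GPT-4o (German) | | ✓ | | 1.634 s ± 0.530 | USD 10.00/1M tokens |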
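To illustrate how the output token prices in the table above translate into a budget, the following minimal sketch estimates the output-side cost of de-identifying a batch of records. The batch size and the number of output tokens per record are hypothetical values chosen only for illustration, and input token costs, which the table does not list, are ignored.

```python
def output_cost_usd(num_output_tokens: int, price_per_million_usd: float) -> float:
    """Cost of generated tokens at a given price per one million output tokens."""
    return num_output_tokens / 1_000_000 * price_per_million_usd


# Output token prices (USD per 1M tokens) taken from the table above.
PRICES = {"GPT-3.5 Turbo": 1.50, "GPT-4": 60.00, "GPT-4o": 10.00}

# Hypothetical workload, not taken from the article.
NUM_RECORDS = 1_000
OUTPUT_TOKENS_PER_RECORD = 200

total_tokens = NUM_RECORDS * OUTPUT_TOKENS_PER_RECORD
for model, price in PRICES.items():
    print(f"{model}: USD {output_cost_usd(total_tokens, price):.2f}")
```

Under these assumptions, the price gap between models scales linearly with the volume of generated text, so the choice of model matters most for large record collections.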
Table 11. Memory usage of each LLM with dtype = ‘float32’ when training with a batch size of 1 and the Adam optimizer.
| Model | Total Size | Backward Pass | Optimizer Step |
| --- | --- | --- | --- |
| FLAN-T5 XXL | 40.99 GB | 81.98 GB | 163.97 GB |
| GPT-3.5 Turbo | N/A | N/A | N/A |
| GPT-4 | N/A | N/A | N/A |
| GPT-4o | N/A | N/A | N/A |
| LLaMA 3 | 28.21 GB | 56.42 GB | 112.83 GB |
| Mistral-7B | 27.5 GB | 55.0 GB | 110.0 GB |
| BERT-base | 417.65 MB | 835.3 MB | 1.63 GB |
| ClinicalBERT | 513.97 MB | 1.0 GB | 2.01 GB |
| DistilBERT | 253.16 MB | 506.32 MB | 1012.63 MB |
| RoBERTa-base | 475.49 MB | 950.99 MB | 1.86 GB |
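In the table above, the backward pass and optimizer step columns are roughly two and four times the total model size. This matches a common rule of thumb for float32 training with Adam: the weights alone, then weights plus gradients, then weights plus gradients plus the two Adam moment buffers. The following minimal sketch applies that rule of thumb to a parameter count. It is an approximation under these assumptions, not the tool used by the authors, and it ignores activations and framework overhead. The 110 M parameter count is the rounded BERT-base size listed in Table 3.

```python
BYTES_PER_PARAM_FP32 = 4  # float32 stores each parameter in 4 bytes


def estimate_training_memory_mb(num_params: int) -> dict[str, float]:
    """Rule-of-thumb memory estimate in MB (1 MB = 1024**2 bytes here)."""
    weights_mb = num_params * BYTES_PER_PARAM_FP32 / 1024**2
    return {
        "total_size": weights_mb,          # model weights only
        "backward_pass": 2 * weights_mb,   # weights + gradients
        "optimizer_step": 4 * weights_mb,  # weights + gradients + 2 Adam moments
    }


# Rounded parameter count for BERT-base (110 M, as listed in Table 3).
estimate = estimate_training_memory_mb(110_000_000)
for stage, megabytes in estimate.items():
    print(f"{stage}: {megabytes:.1f} MB")
```

With the rounded parameter count, the estimated total size is about 420 MB, close to the 417.65 MB reported for BERT-base; the difference stems from the rounding of the parameter count.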
Table 12. The assessment of gender bias and hallucination rates of each LLM for EHRs in the N2C2 dataset’s test set in English and German.
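| Model | Zero-Shot Bias | Zero-Shot Hallucination | One-Shot Bias | One-Shot Hallucination |
| --- | --- | --- | --- | --- |
| FLAN-T5 XXL | 16.92% | 85.21% | 56.22% | 100% |
| GPT-3.5 Turbo | 39.29% | 97.85% | 19.94% | 99.46% |
| GPT-4 (English) | 10.31% | 99.46% | 4.66% | 5.25% |
| LLaMA 3 | 50% | 100% | 41.16% | 100% |
| Mistral-7B | 55.44% | 74.70% | 34.43% | 83.26% |
| GPT-4 (German) | 44% | 64% | 26% | 0% |
| GPT-4o (German) | 34% | 52% | 22% | 2% |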
Table 13. Privacy and security risks for de-identification tasks and suitable PETs for risk mitigation.
| Risk | Suitable PET |
| --- | --- |
| Linking attacks | DP |
| Membership inference attacks | DP |
| Model inversion attacks | DP |
| Attacks on centralized cloud storage | FL |
| Data leakage during transmission | FL |
| Unauthorized data access | FL |

MDPI and ACS Style

Sousa, S.; Jantscher, M.; Kröll, M.; Kern, R. Large Language Models for Electronic Health Record De-Identification in English and German. Information 2025, 16, 112. https://doi.org/10.3390/info16020112