Detecting Personally Identifiable Information Through Natural Language Processing: A Step Forward
Abstract
1. Introduction
Research Context
2. Related Work and Reasons to Continue Research in PII Detection
2.1. Essential Literature Review
2.2. The Need to Continue Research in PII Detection
2.3. The Main Aim of the Research
3. Materials and Methods
3.1. Research Methodology
3.2. Configuration of the PII Detection Architecture
- The PiiDataLoader class handles the loading of data from CSV files, performs tokenization, and divides the data into batches.
- The DataHandler class is responsible for managing and preparing the data by splitting them into training (80%), test (10%), and validation (10%) sets, ensuring that the data are appropriately structured and formatted.
- The PiiDataset class associates the tokenized data with corresponding labels, indicating which portions correspond to personal information.
- The DetectionModelTrainer class is responsible for training, evaluating, testing, and managing the pre-trained model, which is provided by the Hugging Face Transformer library [31].
- Finally, the Main class orchestrates the other classes to control the execution of the PII detection flow.
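The responsibilities of the data-handling classes above can be sketched in a minimal Python skeleton. The class names follow the paper's description; the exact splitting logic and the token-label pairing shown here are illustrative assumptions, not the authors' implementation:

```python
import random


class DataHandler:
    """Splits records into training (80%), test (10%), and validation (10%) sets."""

    def __init__(self, records, seed=42):
        self.records = list(records)
        # Deterministic shuffle so repeated runs produce the same split
        # (the seed value here is an assumption).
        random.Random(seed).shuffle(self.records)

    def split(self):
        n = len(self.records)
        n_train = int(n * 0.8)
        n_test = int(n * 0.1)
        train = self.records[:n_train]
        test = self.records[n_train:n_train + n_test]
        validation = self.records[n_train + n_test:]  # remaining ~10%
        return train, test, validation


class PiiDataset:
    """Pairs tokenized texts with token-level labels (1 = PII, 0 = not PII)."""

    def __init__(self, tokenized_texts, labels):
        assert len(tokenized_texts) == len(labels)
        self.items = list(zip(tokenized_texts, labels))

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]
```

For example, `DataHandler(range(100)).split()` yields subsets of 80, 10, and 10 records; a `PiiDataset` entry such as `(["John", "lives", "here"], [1, 0, 0])` marks only the first token as PII.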
3.3. Pre-Trained Model Selection
- MNLI (Multi-Genre Natural Language Inference), which evaluates the model’s ability to determine the relationships between a pair of sentences;
- QQP (Quora Question Pairs), which assesses whether two questions are semantically equivalent;
- QNLI (Question Natural Language Inference), which determines if a paragraph of text contains sufficient information to answer a specific question;
- SST-2 (Stanford Sentiment Treebank), which classifies the sentiment of sentences as positive or negative;
- CoLA (Corpus of Linguistic Acceptability), which determines the grammatical acceptability of sentences;
- MRPC (Microsoft Research Paraphrase Corpus), which recognizes paraphrased sentences;
- RTE (Recognizing Textual Entailment), which determines the entailment relationship between sentences.
3.4. Pre-Trained Model Training
3.5. Model Testing
- “pii-masking-43k” [41]: This dataset, released by Ai4Privacy, is similar to the larger pii-masking-200k dataset. We chose it because it is known for its reliable data. It has been used to optimize the DistilBERT model, achieving impressive results: precision (99.86%), recall (99.89%), and accuracy (99.45%). We included it as the core part of the testing dataset since it provides different types of data compared to the larger pii-masking-200k dataset. In past research, the pii-masking-43k dataset has been used to fine-tune various models, including RoBERTa and GPT-2. The dataset provides pre-trained models, which serve as a starting point to build effective PII detection tools.
- “Dialog Dataset” [42]: This dataset contains chatbot conversations without any PII. We used it to add variety to our test data, selecting 3726 records to make sure the model could handle different formats of conversational data.
- “Movie Review” [43]: Known as the “IMDB Dataset of 50K Movie Reviews,” this is a popular dataset for text classification. It is often used for sentiment analysis of movie reviews, with comments labeled as “positive” or “negative.” We checked that no PII was included, and we selected 10,000 records to make sure the model handles sentiment detection correctly while avoiding privacy issues.
- “Chat-GPT 4 Sentences” [44]: We asked GPT-4 to generate 1000 sentences without PII. This portion of the dataset evaluates the model’s ability to handle artificially generated content, providing us insight into its performance with AI-generated data.
4. Results
4.1. Experimental Results Obtained from Model Training
- GPU: MSI GeForce RTX 4090 VENTUS 3X 24G OC;
- CPU: AMD Ryzen 9 7900X;
- RAM: Corsair DDR5 Vengeance 2 × 32 GB 5600;
- Disk: Crucial P3 Plus 4TB M.2 SSD;
- Motherboard: Asrock X670E Pro RS;
- Operating System: Ubuntu 24.
- Training Loss: This metric represents the model’s error as calculated on the training set. It is monitored throughout the training process and quantifies the incorrect predictions the model makes. A lower training loss indicates better performance in terms of fitting the training data.
- Validation Accuracy: This measures the percentage of correct predictions made by the model on the validation subset. The validation set is a held-out 10% of the dataset used to evaluate, during the training process, how well the model generalizes beyond the training data.
- Test Accuracy: This measures the percentage of correct predictions made by the model on the test subset. The test set is a held-out 10% of the dataset used after the training phase to assess the model’s overall performance and provide a more detailed analysis of how well it generalizes to new, unseen data.
4.2. Experimental Results Obtained from Model Testing
- Accuracy: 99.558%;
- Precision: 99.564%;
- Recall: 99.558%;
- F1 score: 99.559%.
- Accuracy: 80.845%;
- Precision: 89.393%;
- Recall: 80.845%;
- F1 score: 82.219%.
- The model trained with the training-validation-test technique provides the most reliable results overall, and the stratified k-fold is a better alternative than the regular k-fold for ensuring balanced performance in classifying PII. The training-validation-test split leads to more accurate and stable classification outcomes.
- The models trained with the stratified k-fold technique tend to have higher performance across all the metrics compared to the models trained with the standard k-fold method. This suggests that ensuring a balanced distribution of classes within the folds helps improve the model’s performance in detecting PII.
- The stratified k-fold technique makes sure that each fold has a similar proportion of each class, which is especially important when dealing with unbalanced data. In contrast, with regular k-fold cross-validation, the data might not be as well balanced across the different folds, which could lead to poorer generalization and performance.
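The difference between the two cross-validation schemes can be illustrated with a small, self-contained sketch (a simplified stand-in for library implementations such as scikit-learn's `KFold` and `StratifiedKFold`; the round-robin assignment is one possible stratification strategy, not the paper's exact procedure):

```python
from collections import defaultdict


def stratified_kfold(labels, k):
    """Assign each sample index to one of k folds so that every fold
    receives a near-equal share of each class (round-robin per class)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # spread each class evenly
    return folds


def plain_kfold(n, k):
    """Split indices 0..n-1 into k contiguous chunks, ignoring labels."""
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(k)]
```

With a sorted, unbalanced dataset of 90 non-PII samples followed by 10 PII samples and k = 5, plain k-fold puts all 10 PII samples into the last fold, while the stratified variant gives each fold exactly 2, which is why the stratified folds generalize more evenly.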
5. Discussion
5.1. Evaluation
Comparison
- Device unique ID;
- Account or person identifier;
- Demographic data;
- Commercial or financial information;
- Biometric data;
- Employment information;
- Education level information;
- Job positions held.
- Accuracy = (True Positives + True Negatives)/(Total Samples);
- Precision = True Positives/(True Positives + False Positives);
- Recall = True Positives/(True Positives + False Negatives);
- F1-Score = 2 * (Precision * Recall)/(Precision + Recall).
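For the binary case, the four formulas above translate directly into code (a minimal sketch; the multi-class results reported in the tables additionally average these metrics across classes):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total                      # correct over all samples
    precision = tp / (tp + fp)                        # correct over predicted positives
    recall = tp / (tp + fn)                           # correct over actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, with tp = 90, tn = 5, fp = 3, fn = 2, accuracy is 95/100 = 0.95, precision is 90/93, and recall is 90/92.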
5.2. Limitations
5.3. Future Work
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Cost of a Data Breach 2024|IBM. Available online: https://www.ibm.com/reports/data-breach (accessed on 9 January 2025).
- IT Governance USA. Data Breaches and Cyber Attacks–USA Report 2024. IT Governance USA Blog. Available online: https://www.itgovernanceusa.com/blog/data-breaches-and-cyber-attacks-in-2024-in-the-usa (accessed on 1 January 2025).
- General Data Protection Regulation (GDPR)—Legal Text. General Data Protection Regulation (GDPR). Available online: https://gdpr-info.eu/ (accessed on 1 January 2025).
- Jahan, M.S.; Oussalah, M. A Systematic Review of Hate Speech Automatic Detection Using Natural Language Processing. Neurocomputing 2023, 546, 126232. [Google Scholar] [CrossRef]
- Liu, Y.; Song, H.H.; Bermudez, I.; Mislove, A.; Baldi, M.; Tongaonkar, A. Identifying Personal Information in Internet Traffic. In Proceedings of the 2015 ACM on Conference on Online Social Networks, Palo Alto, CA, USA, 2–3 November 2015; ACM: Palo Alto, CA, USA, 2015; pp. 59–70. [Google Scholar] [CrossRef]
- Go, S.J.Y.; Guinto, R.; Festin, C.A.M.; Austria, I.; Ocampo, R.; Tan, W.M. An SDN/NFV-Enabled Architecture for Detecting Personally Identifiable Information Leaks on Network Traffic. In Proceedings of the 2019 Eleventh International Conference on Ubiquitous and Future Networks (ICUFN), Zagreb, Croatia, 2–5 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 306–311. [Google Scholar] [CrossRef]
- Ren, J.; Rao, A.; Lindorfer, M.; Legout, A.; Choffnes, D. ReCon: Revealing and Controlling PII Leaks in Mobile Network Traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, Singapore, 26–30 June 2016; ACM: Singapore, 2016; pp. 361–374. [Google Scholar] [CrossRef]
- Noever, D. The Enron Corpus: Where the Email Bodies Are Buried? arXiv 2020. [Google Scholar] [CrossRef]
- Chan, K.-H.; Im, S.-K.; Ke, W. Multiple Classifier for Concatenate-Designed Neural Network. Neural Comput. Applic. 2022, 34, 1359–1372. [Google Scholar] [CrossRef]
- Bader, M.D.M.; Mooney, S.J.; Rundle, A.G. Protecting Personally Identifiable Information When Using Online Geographic Tools for Public Health Research. Am. J. Public Health 2016, 106, 206–208. [Google Scholar] [CrossRef]
- Alnemari, A.; Raj, R.K.; Romanowski, C.J.; Mishra, S. Protecting Personally Identifiable Information (PII) in Critical Infrastructure Data Using Differential Privacy. In Proceedings of the 2019 IEEE International Symposium on Technologies for Homeland Security (HST), Woburn, MA, USA, 5–6 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar] [CrossRef]
- AL Ghazo, A.T.; Abu Mallouh, M.; Alajlouni, S.; Almalkawi, I.T. Securing Cyber Physical Systems: Lightweight Industrial Internet of Things Authentication (LI2A) for Critical Infrastructure and Manufacturing. Appl. Syst. Innov. 2025, 8, 11. [Google Scholar] [CrossRef]
- Onik, M.M.H.; Kim, C.-S.; Lee, N.-Y.; Yang, J. Personal Information Classification on Aggregated Android Application’s Permissions. Appl. Sci. 2019, 9, 3997. [Google Scholar] [CrossRef]
- Majeed, A.; Ullah, F.; Lee, S. Vulnerability- and Diversity-Aware Anonymization of Personally Identifiable Information for Improving User Privacy and Utility of Publishing Data. Sensors 2017, 17, 1059. [Google Scholar] [CrossRef]
- Venkatanathan, J.; Kostakos, V.; Karapanos, E.; Goncalves, J. Online Disclosure of Personally Identifiable Information with Strangers: Effects of Public and Private Sharing. Interact. Comput. 2014, 26, 614–626. [Google Scholar] [CrossRef]
- Tesfay, W.B.; Serna, J.; Pape, S. Challenges in Detecting Privacy Revealing Information in Unstructured Text. In Proceedings of the 4th Workshop on Society, Privacy and the Semantic Web—Policy and Technology (PrivOn2016), Kobe, Japan, 18 October 2016; Brewster, C., Cheatham, M., d’Aquin, M., Decker, S., Kirrane, S., Eds.; CEUR Workshop Proceedings. CEUR: Kobe, Japan, 2016; Volume 1750. [Google Scholar]
- Liu, Y.; Song, T.; Liao, L. TPII: Tracking Personally Identifiable Information via User Behaviors in HTTP Traffic. Front. Comput. Sci. 2020, 14, 143801. [Google Scholar] [CrossRef]
- Vishwamitra, N.; Li, Y.; Wang, K.; Hu, H.; Caine, K.; Ahn, G.-J. Towards PII-Based Multiparty Access Control for Photo Sharing in Online Social Networks. In Proceedings of the 22nd ACM on Symposium on Access Control Models and Technologies, Indianapolis, IN, USA, 21–23 June 2017; ACM: Indianapolis, IN, USA, 2017; pp. 155–166. [Google Scholar] [CrossRef]
- Conti, M.; Li, Q.Q.; Maragno, A.; Spolaor, R. The Dark Side(-Channel) of Mobile Devices: A Survey on Network Traffic Analysis. IEEE Commun. Surv. Tutor. 2018, 20, 2658–2713. [Google Scholar] [CrossRef]
- Wongwiwatchai, N.; Pongkham, P.; Sripanidkulchai, K. Detecting Personally Identifiable Information Transmission in Android Applications Using Light-Weight Static Analysis. Comput. Secur. 2020, 99, 102011. [Google Scholar] [CrossRef]
- Lee, J.; De Guzman, M.C.; Wang, J.; Gupta, M.; Rao, H.R. Investigating Perceptions about Risk of Data Breaches in Financial Institutions: A Routine Activity-Approach. Comput. Secur. 2022, 121, 102832. [Google Scholar] [CrossRef]
- Tavana, M.; Khalili Nasr, A.; Ahmadabadi, A.B.; Amiri, A.S.; Mina, H. An Interval Multi-Criteria Decision-Making Model for Evaluating Blockchain-IoT Technology in Supply Chain Networks. Internet Things 2023, 22, 100786. [Google Scholar] [CrossRef]
- Tavana, M. Decision Analytics in the World of Big Data and Colorful Choices. Decis. Anal. J. 2021, 1, 100002. [Google Scholar] [CrossRef]
- Ayyagari, R. An Exploratory Analysis of Data Breaches from 2005-2011: Trends and Insights. J. Inf. Priv. Secur. 2012, 8, 33–56. [Google Scholar] [CrossRef]
- Stiennon, R. Breach Level Index. Categorising Data Breach Severity with a Breach Level Index. 2013. Available online: http://breachlevelindex.com/pdf/Breach-Level-Index-WP.pdf (accessed on 30 January 2025).
- Ayaburi, E.W. Understanding Online Information Disclosure: Examination of Data Breach Victimization Experience Effect. ITP 2023, 36, 95–114. [Google Scholar] [CrossRef]
- Zadeh, A.; Lavine, B.; Zolbanin, H.; Hopkins, D. A Cybersecurity Risk Quantification and Classification Framework for Informed Risk Mitigation Decisions. Decis. Anal. J. 2023, 9, 100328. [Google Scholar] [CrossRef]
- Pool, J.; Akhlaghpour, S.; Fatehi, F.; Burton-Jones, A. A Systematic Analysis of Failures in Protecting Personal Health Data: A Scoping Review. Int. J. Inf. Manag. 2024, 74, 102719. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
- Im, S.-K.; Chan, K.-H. Neural Machine Translation with CARU-Embedding Layer and CARU-Gated Attention Layer. Mathematics 2024, 12, 997. [Google Scholar] [CrossRef]
- Transformers. Available online: https://huggingface.co/docs/transformers/index (accessed on 20 January 2025).
- Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. SpanBERT: Improving Pre-Training by Representing and Predicting Spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77. [Google Scholar] [CrossRef]
- Clark, K.; Luong, M.-T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-Training Text Encoders as Discriminators Rather Than Generators. arXiv 2020. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019. [Google Scholar] [CrossRef]
- ai4privacy/pii-Masking-200k|Datasets at Hugging Face. Available online: https://huggingface.co/datasets/ai4privacy/pii-masking-200k (accessed on 22 January 2025).
- Holmes, L.; Crossley, S.; Wang, J.; Zhang, W. The Cleaned Repository of Annotated Personally Identifiable Information. 2024. Available online: https://zenodo.org/records/12729952 (accessed on 30 January 2025).
- Wen, Y.; Marchyok, L.; Hong, S.; Geiping, J.; Goldstein, T.; Carlini, N. Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-Trained Models. arXiv 2024. [Google Scholar] [CrossRef]
- Rashid, M.R.U.; Liu, J.; Koike-Akino, T.; Mehnaz, S.; Wang, Y. Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage. arXiv 2024. [Google Scholar] [CrossRef]
- Hastings, J.D.; Weitl-Harms, S.; Doty, J.; Myers, Z.J.; Thompson, W. Utilizing Large Language Models to Synthesize Product Desirability Datasets. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 5352–5360. [Google Scholar] [CrossRef]
- Generic Sentiment|Multidomain Sentiment Dataset. Available online: https://www.kaggle.com/datasets/akgeni/generic-sentiment-multidomain-sentiment-dataset (accessed on 22 January 2025).
- ai4privacy/pii-Masking-43k|Datasets at Hugging Face. Available online: https://huggingface.co/datasets/ai4privacy/pii-masking-43k (accessed on 24 January 2025).
- Dataset for Chatbot. Available online: https://www.kaggle.com/datasets/grafstor/simple-dialogs-for-chatbot (accessed on 24 January 2025).
- IMDB Dataset of 50K Movie Reviews. Available online: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews (accessed on 24 January 2025).
- Open AI|GPT-4. Available online: https://openai.com/index/gpt-4/ (accessed on 22 January 2025).
- Chen, Y.; Arunasalam, A.; Celik, Z.B. Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions. In Proceedings of the Annual Computer Security Applications Conference, Austin, TX, USA, 4–8 December 2023; ACM: Austin, TX, USA, 2023; pp. 366–378. [Google Scholar] [CrossRef]
- Singhal, S.; Zambrano, A.F.; Pankiewicz, M.; Liu, X.; Porter, C.; Baker, R.S. De-Identifying Student Personally Identifying Information with GPT-4. In Proceedings of the EDM2024, Atlanta, GA, USA, 14–17 July 2024. [Google Scholar] [CrossRef]
Model | MNLI-(m/mm) (392k) | QQP (363k) | QNLI (108k) | SST-2 (67k) | CoLA (8.5k) | STS-B (5.7k) | MRPC (3.5k) | RTE (2.5k) | Avg
---|---|---|---|---|---|---|---|---|---
Pre-OpenAI SOTA | 80.6/80.1 | 66.1 | 82.3 | 93.2 | 35.0 | 81.0 | 86.0 | 61.7 | 74.0
BiLSTM + ELMo + Attn | 76.4/76.1 | 64.8 | 79.8 | 90.4 | 36.0 | 73.3 | 84.9 | 56.8 | 71.0
OpenAI GPT | 82.1/81.4 | 70.3 | 87.4 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | 75.1
BERT-Base | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6
BERT-Large | 86.7/85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 82.1

Corpus sizes in parentheses denote the number of training examples in each GLUE task.
Model | Epoch | Training Loss | Validation Accuracy | Training Time | Allocated Memory | Test Accuracy
---|---|---|---|---|---|---
Training-Validation-Test | 1 | 1.66% | 99.93% | 31.92 min | 1689.06 MB | 99.86%
 | 2 | 0.26% | 99.92% | | |
 | 3 | 0.20% | 99.85% | | |
Model | Epoch | Training Loss | Training Time | Allocated Memory | Test Accuracy
---|---|---|---|---|---
K-fold 1 | 1 | 1.44% | 31.39 min | 1688.71 MB | 99.92%
 | 2 | 0.28% | | |
 | 3 | 0.20% | | |
K-fold 2 | 1 | 1.42% | 30.78 min | 1689.42 MB | 99.86%
 | 2 | 0.25% | | |
 | 3 | 0.19% | | |
K-fold 3 | 1 | 1.45% | 30.55 min | 1689.58 MB | 98.91%
 | 2 | 0.27% | | |
 | 3 | 0.25% | | |
K-fold 4 | 1 | 1.43% | 30.75 min | 1690.58 MB | 99.84%
 | 2 | 0.30% | | |
 | 3 | 0.31% | | |
K-fold 5 | 1 | 1.33% | 30.70 min | 1690.91 MB | 99.87%
 | 2 | 0.28% | | |
 | 3 | 0.23% | | |
Stratified k-fold 1 | 1 | 1.62% | 30.59 min | 1688.71 MB | 99.62%
 | 2 | 0.21% | | |
 | 3 | 0.22% | | |
Stratified k-fold 2 | 1 | 1.29% | 30.63 min | 1689.42 MB | 99.93%
 | 2 | 0.32% | | |
 | 3 | 0.17% | | |
Stratified k-fold 3 | 1 | 1.50% | 30.54 min | 1689.58 MB | 99.78%
 | 2 | 0.20% | | |
 | 3 | 0.29% | | |
Stratified k-fold 4 | 1 | 1.42% | 30.60 min | 1690.58 MB | 99.84%
 | 2 | 0.28% | | |
 | 3 | 0.28% | | |
Stratified k-fold 5 | 1 | 1.45% | 30.52 min | 1690.91 MB | 99.91%
 | 2 | 0.21% | | |
 | 3 | 0.25% | | |
Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
Training-Validation-Test | 99.558% | 99.564% | 99.558% | 99.559% |
K-fold 1 | 99.240% | 99.262% | 99.240% | 99.244% |
K-fold 2 | 98.904% | 98.945% | 98.904% | 98.912% |
K-fold 3 | 80.845% | 89.393% | 80.845% | 82.219% |
K-fold 4 | 97.968% | 98.106% | 97.968% | 97.993% |
K-fold 5 | 99.346% | 99.361% | 99.346% | 99.349% |
Stratified k-fold 1 | 98.480% | 98.496% | 98.480% | 98.466% |
Stratified k-fold 2 | 98.657% | 98.723% | 98.657% | 98.669% |
Stratified k-fold 3 | 95.953% | 96.542% | 95.953% | 96.057% |
Stratified k-fold 4 | 99.454% | 99.512% | 99.387% | 99.485% |
Stratified k-fold 5 | 99.417% | 99.431% | 99.417% | 99.419% |
PII Name | Total Classes | TVT False Predictions | K-Fold 1 False Predictions | Strat. k-Fold 4 False Predictions
---|---|---|---|---
Full name | 10,430 | 26 (0.249%) | 9 (0.086%) | 17 (0.163%)
Surname | 6428 | 24 (0.373%) | 8 (0.124%) | 17 (0.264%)
No PII | 14,404 | 10 (0.069%) | 7 (0.049%) | 10 (0.069%)
Address | 1131 | 7 (0.619%) | 5 (0.442%) | 5 (0.442%)
Job area | 236 | 2 (0.847%) | 4 (1.695%) | 4 (1.695%)
First name | 1384 | 2 (0.145%) | 2 (0.145%) | 5 (0.361%)
Job area + Full name | 1896 | 2 (0.105%) | 2 (0.105%) | 2 (0.105%)
State | 464 | 1 (0.216%) | 3 (0.647%) | 2 (0.431%)
Surname + Address | 985 | 3 (0.305%) | 0 (0.000%) | 1 (0.102%)
Address + Full name | 1728 | 3 (0.174%) | 0 (0.000%) | 1 (0.058%)
Share and Cite
Mainetti, L.; Elia, A. Detecting Personally Identifiable Information Through Natural Language Processing: A Step Forward. Appl. Syst. Innov. 2025, 8, 55. https://doi.org/10.3390/asi8020055