Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset
2.2. NLP Toolkit
2.3. Performance Metrics
2.4. Identification of the Best Feature Set
2.5. Determination of the Minimum Size of the Training Set
- N = total number of reports of a type
- n(A) ≈ 0.75N = number of reports in the training pool
- n(B) ≈ 0.25N = number of reports in the test set
- n(A_k) = j, with j = 25, 50, 100, 200, …, n(A) = number of reports in each training dataset
- k = 1, 2, 3, …, n(A)/j = number of iterations for each training dataset, A_k
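The sampling scheme defined by the list above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the random seed, the doubling rule used to continue the size sequence past 200, and the choice of consecutive disjoint chunks of the pool as the subsets A_k are all assumptions made for the example.

```python
import random


def split_pool_and_test(report_ids, train_fraction=0.75, seed=0):
    """Shuffle the N reports and split them into pool A (~0.75N) and test set B (~0.25N)."""
    rng = random.Random(seed)  # fixed seed is an assumption, used only for repeatability
    shuffled = list(report_ids)
    rng.shuffle(shuffled)
    n_a = round(train_fraction * len(shuffled))
    return shuffled[:n_a], shuffled[n_a:]


def training_sizes(n_a, start=25):
    """Return the sizes j = 25, 50, 100, 200, ..., n(A).

    Doubling between steps is an assumption inferred from the first four values.
    """
    sizes = []
    j = start
    while j < n_a:
        sizes.append(j)
        j *= 2
    sizes.append(n_a)
    return sizes


def training_subsets(pool, j):
    """Yield floor(n(A)/j) training datasets A_k, each holding j reports.

    Using consecutive, non-overlapping chunks of the pool is an assumption.
    """
    for k in range(len(pool) // j):
        yield pool[k * j:(k + 1) * j]


# Example with N = 273, the number of Interventional Radiology reports.
pool_a, test_b = split_pool_and_test(range(273))
sizes = training_sizes(len(pool_a))
subsets_of_50 = list(training_subsets(pool_a, 50))
```

Each subset A_k would then be used to train one model that is scored on the same held-out set B, so that performance can be compared across training sizes.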
3. Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Report Type | Abbreviation | Number of Reports |
---|---|---|
Interventional Radiology | IR | 273 |
Mammography | MA | 167 |
Magnetic Resonance Imaging | MRI | 1010 |
Nuclear Medicine Technique | NM | 655 |
Ultrasound | US | 644 |
Computed Tomography | CT | 2741 |
X-ray | XR | 4749 |

Report Type | Metric | LOCATION (M10) | LOCATION (M4) | DATE (M10) | DATE (M4) | HOSPITAL (M10) | HOSPITAL (M4) | NAME (M10) | NAME (M4) | ID (M10) | ID (M4) |
---|---|---|---|---|---|---|---|---|---|---|---|
IR | P | 0.71 | 0.72 | 0.88 | 0.88 | 0.71 | 0.71 | 0.84 | 0.84 | 0.73 | 0.73 |
IR | R | 0.68 | 0.69 | 0.87 | 0.86 | 0.70 | 0.71 | 0.78 | 0.78 | 0.60 | 0.61 |
IR | F1 | 0.70 | 0.70 | 0.87 | 0.87 | 0.71 | 0.71 | 0.80 | 0.80 | 0.66 | 0.66 |
MA | P | 0.67 | 0.69 | 0.87 | 0.87 | 0.66 | 0.68 | 0.86 | 0.87 | 0.91 | 0.94 |
MA | R | 0.60 | 0.64 | 0.82 | 0.82 | 0.63 | 0.66 | 0.81 | 0.81 | 0.91 | 0.91 |
MA | F1 | 0.63 | 0.67 | 0.84 | 0.85 | 0.64 | 0.67 | 0.83 | 0.84 | 0.91 | 0.92 |
MRI | P | 0.82 | 0.83 | 0.88 | 0.89 | 0.83 | 0.83 | 0.91 | 0.91 | 0.86 | 0.86 |
MRI | R | 0.82 | 0.83 | 0.85 | 0.85 | 0.82 | 0.82 | 0.87 | 0.87 | 0.84 | 0.84 |
MRI | F1 | 0.82 | 0.83 | 0.87 | 0.87 | 0.82 | 0.82 | 0.89 | 0.89 | 0.85 | 0.85 |
NM | P | 0.82 | 0.81 | 0.90 | 0.90 | 0.83 | 0.82 | 0.91 | 0.92 | 0.88 | 0.89 |
NM | R | 0.80 | 0.80 | 0.85 | 0.85 | 0.83 | 0.82 | 0.81 | 0.81 | 0.88 | 0.87 |
NM | F1 | 0.81 | 0.80 | 0.88 | 0.88 | 0.83 | 0.82 | 0.86 | 0.86 | 0.88 | 0.88 |
US | P | 0.78 | 0.78 | 0.90 | 0.91 | 0.79 | 0.80 | 0.97 | 0.97 | 0.86 | 0.86 |
US | R | 0.77 | 0.77 | 0.87 | 0.87 | 0.78 | 0.78 | 0.95 | 0.95 | 0.86 | 0.85 |
US | F1 | 0.77 | 0.78 | 0.89 | 0.89 | 0.78 | 0.79 | 0.96 | 0.96 | 0.86 | 0.86 |
CT | P | 0.94 | 0.94 | 0.82 | 0.83 | 0.94 | 0.94 | 0.94 | 0.94 | 0.80 | 0.80 |
CT | R | 0.93 | 0.93 | 0.75 | 0.75 | 0.94 | 0.94 | 0.93 | 0.93 | 0.71 | 0.71 |
CT | F1 | 0.94 | 0.94 | 0.79 | 0.79 | 0.94 | 0.94 | 0.94 | 0.94 | 0.75 | 0.75 |
XR | P | 0.94 | 0.94 | 0.90 | 0.90 | 0.93 | 0.93 | 0.95 | 0.95 | 0.88 | 0.88 |
XR | R | 0.93 | 0.93 | 0.88 | 0.88 | 0.93 | 0.93 | 0.95 | 0.95 | 0.80 | 0.80 |
XR | F1 | 0.93 | 0.93 | 0.89 | 0.89 | 0.93 | 0.93 | 0.95 | 0.95 | 0.84 | 0.84 |
ALL | P | 0.90 | 0.90 | 0.88 | 0.88 | 0.90 | 0.90 | 0.93 | 0.93 | 0.86 | 0.87 |
ALL | R | 0.89 | 0.90 | 0.86 | 0.87 | 0.89 | 0.90 | 0.91 | 0.92 | 0.85 | 0.86 |
ALL | F1 | 0.89 | 0.89 | 0.87 | 0.87 | 0.88 | 0.90 | 0.92 | 0.91 | 0.85 | 0.86 |
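
The P, R, and F1 rows in the table above are precision, recall, and F1 score. The excerpt does not spell out how they were computed, so the sketch below assumes the standard definitions from counts of true positives, false positives, and false negatives; the function name and the counts in the example are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from entity-level counts.

    tp: PHI entities correctly tagged
    fp: spans tagged as PHI that are not PHI
    fn: PHI entities the model missed
    A zero denominator yields 0.0 rather than raising an error.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 91 correct tags, 9 spurious tags, 9 missed entities.
p, r, f1 = precision_recall_f1(tp=91, fp=9, fn=9)
```

Because F1 is the harmonic mean of precision and recall, it equals both when they are equal. Values that were averaged over several runs before rounding need not satisfy the formula exactly.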
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Paul, T.; Islam, H.; Singh, N.; Jampani, Y.; Kotapati, T.V.P.; Tautam, P.A.; Rana, M.K.Z.; Mandhadi, V.; Sharma, V.; Barnes, M.; et al. Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients. Appl. Sci. 2022, 12, 9976. https://doi.org/10.3390/app12199976