Language Model-Based Text Augmentation System for Cerebrovascular Disease Related Medical Report
Abstract
1. Introduction
2. Related Work
3. Methodology
3.1. Generative Model
3.2. Annotator
3.3. Evaluation
4. Experimental Result
4.1. Experimental Setup
4.2. Cerebrovascular Disease
- Stroke: occurs when blood flow to part of the brain is interrupted, causing brain tissue damage. It is further categorized into ischemic stroke (brain infarction) and hemorrhagic stroke (brain hemorrhage).
- Transient Ischemic Attack (TIA): similar to a stroke, but the blood flow blockage is temporary, and symptoms fully resolve.
- Cerebral Aneurysm: a condition where a part of a brain blood vessel weakens and bulges. If it ruptures, it can cause severe bleeding.
4.3. Medical Reports
4.4. Result
4.4.1. BioBERT
4.4.2. ClinicalBERT
4.4.3. BiomedBERT
4.4.4. Comparison
4.4.5. Comparison to Another Technique
5. Discussion
5.1. Discussion of Previous Research Results
5.2. Data Drift and Its Risks
- We filtered the generated synthetic text using a classification model. By setting the top and bottom 30% of the classification model’s prediction scores as thresholds, we reduced the differences between the original text and the synthetic text.
- We combined actual medical report data with synthetic data to ensure the model retained the critical patterns from the original data, thereby helping to reduce the impact of data drift.
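The two mitigation steps above can be illustrated with a minimal sketch. It rests on assumptions that the text does not confirm: that "top and bottom 30%" means keeping synthetic samples whose classifier score lies in the highest or lowest 30% of all scores and discarding the ambiguous middle, and that the mixing ratio is expressed relative to the size of the real dataset. The function names `keep_confident_indices` and `mix_real_and_synthetic` are hypothetical and are not taken from the authors' implementation.

```python
import random
from typing import List, Sequence


def keep_confident_indices(scores: Sequence[float], fraction: float = 0.30) -> List[int]:
    """Return indices whose score falls in the top or bottom `fraction` of all scores.

    Assumption: the middle of the score distribution is treated as ambiguous
    and dropped; only the two tails are kept.
    """
    if not 0.0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    n = len(scores)
    k = round(n * fraction)  # number of samples kept at each end
    order = sorted(range(n), key=lambda i: scores[i])  # indices, ascending by score
    kept = set(order[:k]) | set(order[n - k:])
    return sorted(kept)  # preserve the original ordering of the samples


def mix_real_and_synthetic(
    real: Sequence[str],
    synthetic: Sequence[str],
    ratio: float,
    seed: int = 0,
) -> List[str]:
    """Append a random subset of synthetic texts to the real texts.

    Assumption: `ratio` is relative to the number of real samples, so
    ratio=0.2 adds synthetic samples equal to 20% of the real data.
    """
    n_extra = min(len(synthetic), round(len(real) * ratio))
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    return list(real) + rng.sample(list(synthetic), n_extra)


if __name__ == "__main__":
    # Hypothetical classifier scores for ten synthetic reports.
    scores = [0.05, 0.95, 0.5, 0.1, 0.9, 0.45, 0.55, 0.2, 0.8, 0.6]
    synthetic_texts = [f"synthetic report {i}" for i in range(len(scores))]
    kept = [synthetic_texts[i] for i in keep_confident_indices(scores, 0.30)]
    real_texts = [f"real report {i}" for i in range(5)]
    print(mix_real_and_synthetic(real_texts, kept, ratio=0.4))
```

Keeping the real reports intact and only appending filtered synthetic samples means the original patterns always remain in the training set, which is the property the second item relies on.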
5.3. Risk of Bias Introduction
5.4. Negative Impacts
5.5. Ethical Considerations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Poalelungi, D.G.; Musat, C.L.; Fulga, A.; Neagu, M.; Neagu, A.I.; Piraianu, A.I.; Fulga, I. Advancing Patient Care: How Artificial Intelligence Is Transforming Healthcare. J. Pers. Med. 2023, 13, 1214.
- Lee, S.; Kim, H.-S. Prospect of Artificial Intelligence Based on Electronic Medical Record. J. Lipid Atheroscler. 2021, 10, 282.
- Jeun, Y.-J. EMR System and Patient Medical Information Protection. Korean J. Health Serv. Manag. 2013, 7, 213–224.
- Goncalves, A.; Ray, P.; Soper, B.; Stevens, J.; Coyle, L.; Sales, A.P. Generation and Evaluation of Synthetic Patient Data. BMC Med. Res. Methodol. 2020, 20, 108.
- Gonzales, A.; Guruswamy, G.; Smith, S.R. Synthetic data in health care: A narrative review. PLoS Digit. Health 2023, 2, e0000082.
- Clark, E.; August, T.; Serrano, S.; Haduong, N.; Gururangan, S.; Smith, N.A. All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2021; pp. 7282–7296.
- van der Lee, C.; Gatt, A.; van Miltenburg, E.; Wubben, S.; Krahmer, E. Best Practices for the Human Evaluation of Automatically Generated Text. In Proceedings of the 12th International Conference on Natural Language Generation, Tokyo, Japan, 29 October–1 November 2019; van Deemter, K., Lin, C., Takamura, H., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 355–368.
- Howcroft, D.M.; Belz, A.; Clinciu, M.-A.; Gkatzia, D.; Hasan, S.A.; Mahamood, S.; Mille, S.; van Miltenburg, E.; Santhanam, S.; Rieser, V. Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions. In Proceedings of the 13th International Conference on Natural Language Generation, Dublin, Ireland, 8–13 December 2020; Davis, B., Graham, Y., Kelleher, J., Sripada, Y., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2020; pp. 169–182.
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv 2020, arXiv:1910.01108.
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining. Bioinformatics 2020, 36, 1234–1240.
- Wang, G.; Liu, X.; Ying, Z.; Yang, G.; Chen, Z.; Liu, Z.; Zhang, M.; Yan, H.; Lu, Y.; Gao, Y.; et al. Optimized Glycemic Control of Type 2 Diabetes with Reinforcement Learning: A Proof-of-Concept Trial. Nat. Med. 2023, 29, 2633–2642.
- Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthc. 2021, 3, 1–23.
- Wei, J.; Zou, K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 6382–6388.
- Huong, T.H.; Hoang, V.T. A Data Augmentation Technique Based on Text for Vietnamese Sentiment Analysis. In Proceedings of the 11th International Conference on Advances in Information Technology, New York, NY, USA, 1–3 July 2020; IAIT ’20; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–5.
- Qiu, S.; Xu, B.; Zhang, J.; Wang, Y.; Shen, X.; de Melo, G.; Long, C.; Li, X. EasyAug: An Automatic Textual Data Augmentation Platform for Classification Tasks. In Companion Proceedings of the Web Conference 2020, New York, NY, USA, 20–24 April 2020; WWW ’20; Association for Computing Machinery: New York, NY, USA, 2020; pp. 249–252.
- Kumar, A.; Bhattamishra, S.; Bhandari, M.; Talukdar, P. Submodular Optimization-Based Diverse Paraphrasing and Its Effectiveness in Data Augmentation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 3609–3619.
- Kolomiyets, O.; Bethard, S.; Moens, M.-F. Model-Portability Experiments for Textual Temporal Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; Lin, D., Matsumoto, Y., Mihalcea, R., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2011; pp. 271–276.
- Xiang, R.; Chersoni, E.; Lu, Q.; Huang, C.-R.; Li, W.; Long, Y. Lexical Data Augmentation for Sentiment Analysis. J. Assoc. Inf. Sci. Technol. 2021, 72, 1432–1447.
- Coulombe, C. Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs. arXiv 2018, arXiv:1812.04718.
- Belinkov, Y.; Bisk, Y. Synthetic and Natural Noise Both Break Neural Machine Translation. arXiv 2018, arXiv:1711.02173.
- Alzantot, M.; Sharma, Y.; Elgohary, A.; Ho, B.-J.; Srivastava, M.; Chang, K.-W. Generating Natural Language Adversarial Examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2018; pp. 2890–2896.
- Wang, W.Y.; Yang, D. That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors Using #petpeeve Tweets. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; Màrquez, L., Callison-Burch, C., Su, J., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2015; pp. 2557–2563.
- Sugiyama, A.; Yoshinaga, N. Data Augmentation Using Back-Translation for Context-Aware Neural Machine Translation. In Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019), Hong Kong, China, 3 November 2019; Popescu-Belis, A., Loáiciga, S., Hardmeier, C., Xiong, D., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 35–44.
- Guo, H.; Mao, Y.; Zhang, R. Augmenting Data with Mixup for Sentence Classification: An Empirical Study. arXiv 2019, arXiv:1905.08941.
- Chen, J.; Yang, Z.; Yang, D. MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2020; pp. 2147–2157.
- Anaby-Tavor, A.; Carmeli, B.; Goldbraich, E.; Kantor, A.; Kour, G.; Shlomov, S.; Tepper, N.; Zwerdling, N. Do Not Have Enough Data? Deep Learning to the Rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7383–7390.
- Bayer, M.; Kaufhold, M.-A.; Buchhold, B.; Keller, M.; Dallmeyer, J.; Reuter, C. Data Augmentation in Natural Language Processing: A Novel Text Generation Approach for Long and Short Text Classifiers. Int. J. Mach. Learn. Cybern. 2023, 14, 135–150.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9.
- Oh, B.-D.; Kim, G.-Y.; Kim, C.; Kim, Y.-S. How to Use Language Models for Synthetic Text Generation in Cerebrovascular Disease-Specific Medical Reports. In Proceedings of the 1st Workshop on Personalization of Generative AI Systems (PERSONALIZE 2024), St. Julians, Malta, 22 March 2024; Deshpande, A., Hwang, E., Murahari, V., Park, J.S., Yang, D., Sabharwal, A., Narasimhan, K., Kalyan, A., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2024; pp. 10–17.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 4171–4186.
- Cerebrovascular Disease: Types, Causes & Symptoms. Cleveland Clinic. Available online: https://my.clevelandclinic.org/health/diseases/24205-cerebrovascular-disease (accessed on 19 July 2024).
- AbuRahma, A.F. Overview of Cerebrovascular Disease. In Noninvasive Vascular Diagnosis: A Practical Textbook for Clinicians; AbuRahma, A.F., Perler, B.A., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 103–139. ISBN 978-3-030-60626-8.
- Sarker, S.; Qian, L.; Dong, X. Medical Data Augmentation via ChatGPT: A Case Study on Medication Identification and Medication Event Classification. arXiv 2023, arXiv:2306.07297.
- Lu, Q.; Dou, D.; Nguyen, T.H. Textual Data Augmentation for Patient Outcomes Prediction. In Proceedings of the 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Online, 9–12 December 2021; pp. 2817–2821.
- Chen, H.; Dan, L.; Lu, Y.; Chen, M.; Zhang, J. An Improved Data Augmentation Approach and Its Application in Medical Named Entity Recognition. BMC Med. Inform. Decis. Mak. 2024, 24, 221.
Original Text | Augmented Text |
---|---|
Cystic encephalomalatic changes in left temporal, frontal lobes, and right inferior cerebellar hemisphere. | Cystic encephalomalatic ipsilateral ventricular effacement. Diffuse obliteration of fourth ventricle. R/O: diffuse hydrocephalus. |
Intracerebral Hemorrhage Patient | Normal Patient |
---|---|
downward and lateral dispalced right middle cerebral artery d/t right frontal hematoma. Otherwise, unremarkable. | Unremarkable finding of brain parenchymal and CSF space. |
Model | Data Size | Baseline | +20% | +40% | +60% | +80% | +100% | +120% | +140% | +160% | +180% | +200% |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BioBERT | 5000 | 77.52 | 72.82 | 69.04 | 75.28 | 79.88 | 76.66 | 80.78 | 80.36 | 81.02 | 80.78 | 81.7 |
BioBERT | 10,000 | 80.46 | 77.52 | 80.58 | 81.71 | 81.94 | 82.03 | 83 | 82.83 | 83.53 | 83.25 | 83.23 |
BioBERT | 15,000 | 82.78 | 80.67 | 81.88 | 82.76 | 83.34 | 83.65 | 83.75 | 83.71 | 83.68 | 83.73 | 83.65 |
BioBERT | 20,000 | 83.7 | 82.3 | 81.24 | 83.8 | 83.86 | 84.06 | 83.9 | 84.03 | 84.21 | 84.28 | 84.59 |
ClinicalBERT | 5000 | 80.12 | 80.16 | 79.42 | 81.16 | 80.86 | 81.68 | 81.46 | 81.82 | 82.56 | 82.48 | 82.8 |
ClinicalBERT | 10,000 | 82.77 | 80.69 | 81.75 | 82.7 | 83.32 | 83.65 | 83.09 | 84.62 | 84.73 | 84.53 | 84.74 |
ClinicalBERT | 15,000 | 84.13 | 81.8 | 82.58 | 83.81 | 84.91 | 84.43 | 85 | 84.93 | 84.75 | 85.39 | 84.75 |
ClinicalBERT | 20,000 | 84.88 | 83.14 | 83.89 | 84.27 | 85.03 | 85.48 | 84.74 | 85.62 | 85.56 | 85.32 | 85.65 |
BiomedBERT | 5000 | 61.16 | 68.08 | 66.38 | 71 | 79.2 | 75.58 | 80.02 | 80.92 | 81.44 | 81.4 | 79.48 |
BiomedBERT | 10,000 | 78.08 | 81.75 | 74.83 | 81.57 | 78.16 | 81.73 | 82.33 | 82.57 | 82.77 | 82.57 | 83.04 |
BiomedBERT | 15,000 | 82.09 | 75.25 | 80.72 | 82.35 | 82.47 | 83.03 | 83.13 | 83.06 | 83.39 | 82.81 | 83 |
BiomedBERT | 20,000 | 83.78 | 80.43 | 79.48 | 80.35 | 83.57 | 85.01 | 86 | 85.7 | 85.99 | 85.35 | 85 |
Model | Data Size | Baseline | +20% | +40% | +60% | +80% | +100% | +120% | +140% | +160% | +180% | +200% |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BioBERT | 5000 | 0.66 | 0.63 | 0.63 | 0.69 | 0.77 | 0.55 | 0.79 | 0.78 | 0.78 | 0.52 | 0.79 |
BioBERT | 10,000 | 0.79 | 0.8 | 0.8 | 0.81 | 0.78 | 0.81 | 0.81 | 0.79 | 0.7 | 0.78 | 0.74 |
BioBERT | 15,000 | 0.81 | 0.81 | 0.79 | 0.83 | 0.82 | 0.78 | 0.83 | 0.8 | 0.84 | 0.82 | 0.82 |
BioBERT | 20,000 | 0.82 | 0.84 | 0.85 | 0.84 | 0.83 | 0.85 | 0.83 | 0.81 | 0.8 | 0.8 | 0.85 |
ClinicalBERT | 5000 | 0.8 | 0.8 | 0.74 | 0.81 | 0.79 | 0.81 | 0.8 | 0.8 | 0.68 | 0.69 | 0.75 |
ClinicalBERT | 10,000 | 0.8 | 0.81 | 0.81 | 0.81 | 0.82 | 0.82 | 0.84 | 0.83 | 0.82 | 0.8 | 0.81 |
ClinicalBERT | 15,000 | 0.82 | 0.82 | 0.84 | 0.84 | 0.82 | 0.82 | 0.83 | 0.85 | 0.82 | 0.84 | 0.81 |
ClinicalBERT | 20,000 | 0.84 | 0.85 | 0.83 | 0.85 | 0.82 | 0.85 | 0.84 | 0.85 | 0.82 | 0.84 | 0.85 |
BiomedBERT | 5000 | 0.64 | 0.61 | 0.67 | 0.75 | 0.76 | 0.76 | 0.78 | 0.78 | 0.76 | 0.65 | 0.71 |
BiomedBERT | 10,000 | 0.8 | 0.81 | 0.67 | 0.73 | 0.65 | 0.82 | 0.81 | 0.82 | 0.71 | 0.82 | 0.82 |
BiomedBERT | 15,000 | 0.74 | 0.81 | 0.84 | 0.82 | 0.83 | 0.84 | 0.72 | 0.83 | 0.82 | 0.82 | 0.82 |
BiomedBERT | 20,000 | 0.84 | 0.82 | 0.83 | 0.82 | 0.82 | 0.83 | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 |
Technique | Model | Metric | Baseline | +20% | +40% | +60% | +80% | +100% | +120% | +140% | +160% | +180% | +200% |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Ours | BioBERT | Acc | 77.52 | 72.82 | 69.04 | 75.28 | 79.88 | 76.66 | 80.78 | 80.36 | 81.02 | 80.78 | 81.7 |
Ours | BioBERT | F1-score | 0.66 | 0.63 | 0.63 | 0.69 | 0.77 | 0.55 | 0.79 | 0.78 | 0.78 | 0.52 | 0.79 |
Ours | ClinicalBERT | Acc | 80.12 | 80.16 | 79.42 | 81.16 | 80.86 | 81.68 | 81.46 | 81.82 | 82.56 | 82.48 | 82.8 |
Ours | ClinicalBERT | F1-score | 0.8 | 0.8 | 0.74 | 0.81 | 0.79 | 0.81 | 0.8 | 0.8 | 0.68 | 0.69 | 0.75 |
Ours | BiomedBERT | Acc | 61.16 | 68.08 | 66.38 | 71 | 79.2 | 75.58 | 80.02 | 80.92 | 81.44 | 81.4 | 79.48 |
Ours | BiomedBERT | F1-score | 0.64 | 0.61 | 0.67 | 0.75 | 0.76 | 0.76 | 0.78 | 0.78 | 0.76 | 0.65 | 0.71 |
EDA [13] | BioBERT | Acc | 79.18 | 79.56 | 79.28 | 79.82 | 78.78 | 75.56 | 79.5 | 79.9 | 77.64 | 78.24 | 79.32 |
EDA [13] | BioBERT | F1-score | 0.82 | 0.82 | 0.82 | 0.82 | 0.81 | 0.8 | 0.82 | 0.82 | 0.81 | 0.81 | 0.82 |
EDA [13] | ClinicalBERT | Acc | 82.52 | 81.98 | 82.32 | 81.48 | 82.14 | 82.1 | 81.12 | 81.84 | 82.46 | 80.86 | 82.0 |
EDA [13] | ClinicalBERT | F1-score | 0.84 | 0.84 | 0.84 | 0.83 | 0.84 | 0.84 | 0.83 | 0.84 | 0.84 | 0.83 | 0.84 |
EDA [13] | BiomedBERT | Acc | 69.9 | 77.42 | 78.2 | 70.84 | 79.12 | 69.56 | 74.5 | 75.24 | 77.2 | 70.82 | 64.52 |
EDA [13] | BiomedBERT | F1-score | 0.78 | 0.81 | 0.82 | 0.65 | 0.82 | 0.64 | 0.79 | 0.79 | 0.81 | 0.67 | 0.62 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Kim, Y.-H.; Kim, C.; Kim, Y.-S. Language Model-Based Text Augmentation System for Cerebrovascular Disease Related Medical Report. Appl. Sci. 2024, 14, 8652. https://doi.org/10.3390/app14198652