Natural Language Processing to Extract Information from Portuguese-Language Medical Records
Abstract
1. Introduction
2. Results
2.1. Model Results
2.2. Data Quality Analysis and Model Validation
2.3. Multivariate Statistical Methods—Cluster Analysis
3. Discussion
3.1. Extracted Entities
3.2. Data Quality Analysis and Model Validation
3.3. Multivariate Statistical Methods—Cluster Analysis
4. Materials and Methods
4.1. Related Works
4.2. Data Presentation
4.3. Methods
4.4. Named Entity Recognition Tools
4.5. Clinical Corpus
- medication: any medication in the medical record, whether prescribed in the current consultation or in the patient’s history
- condition: previous conditions of the patient, including physiological characteristics considered normal (not symptomatic of any disease)
- treatment: a treatment related to the diagnosis, either previously prescribed or still to be performed
- symptom: subjective phenomenon or physiological characteristic reported by a patient and usually related to a disease
- exam: medical procedure intended to support diagnosis
- diagnosis: identification of disease from the descriptions of symptoms and the tests performed
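As a minimal sketch, the six entity types above can be held in a small lookup structure. The uppercase Portuguese tags are the ones that appear in the annotated examples of Appendix A; the names `EntityLabel`, `PT_TO_EN`, and `translate_label` are hypothetical and are not taken from the authors' code.

```python
from enum import Enum


class EntityLabel(Enum):
    """The six entity types annotated in the clinical corpus."""
    MEDICATION = "medication"
    CONDITION = "condition"
    TREATMENT = "treatment"
    SYMPTOM = "symptom"
    EXAM = "exam"
    DIAGNOSIS = "diagnosis"


# Portuguese tags as written in the Appendix A examples,
# mapped to the English entity types listed above.
PT_TO_EN = {
    "MEDICAÇÃO": EntityLabel.MEDICATION,
    "CONDIÇÃO": EntityLabel.CONDITION,
    "TRATAMENTO": EntityLabel.TREATMENT,
    "SINTOMA": EntityLabel.SYMPTOM,
    "EXAME": EntityLabel.EXAM,
    "DIAGNÓSTICO": EntityLabel.DIAGNOSIS,
}


def translate_label(tag: str) -> str:
    """Return the English entity type for a Portuguese tag (case-insensitive)."""
    return PT_TO_EN[tag.upper()].value
```

Keeping the label set in one place makes it harder for an annotation file or a report to introduce a seventh, misspelled label unnoticed.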
4.6. Model Description and Settings
4.7. Data Post-Processing
4.8. Multivariate Statistical Methods—Cluster Analysis
- age, in years ((entry date − birth date)/365.25)
- sex (0 = female and 1 = male)
- race/color (0 = unknown/did not declare; 1 = white; 2 = mixed race; 3 = Black; 4 = Asian; and 5 = Indigenous)
- education (0 = unknown/did not declare; 1 = illiterate; 2 = basic literacy; 3 = first–fourth grade elementary school complete or incomplete (early primary); 4 = fifth–eighth grade elementary school complete or incomplete (primary); 5 = complete or incomplete high school (secondary); and 6 = complete or incomplete higher education, master’s, or doctorate (higher education/postgraduate))
- marital status (0 = unknown/did not declare; 1 = single; 2 = married or common law; 3 = separated or divorced; and 4 = widowed)
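The variable coding above could be applied to a patient record roughly as follows. This is a sketch under stated assumptions: the record layout (a dictionary with the keys `birth_date`, `entry_date`, `sex`, `race`, `education`, `marital_status`) and the function names are hypothetical, and unrecognized category values are assumed to fall back to code 0 (unknown/did not declare).

```python
from datetime import date
from typing import Dict, List

# Category codes copied from the variable list above.
SEX: Dict[str, int] = {"female": 0, "male": 1}
RACE: Dict[str, int] = {
    "white": 1, "mixed race": 2, "black": 3, "asian": 4, "indigenous": 5,
}
EDUCATION: Dict[str, int] = {
    "illiterate": 1, "basic literacy": 2, "early primary": 3,
    "primary": 4, "secondary": 5, "higher education/postgraduate": 6,
}
MARITAL_STATUS: Dict[str, int] = {
    "single": 1, "married or common law": 2,
    "separated or divorced": 3, "widowed": 4,
}


def age_in_years(birth_date: date, entry_date: date) -> float:
    """Age as (entry date - birth date) / 365.25, as defined in the list."""
    return (entry_date - birth_date).days / 365.25


def encode_patient(record: dict) -> List[float]:
    """Turn one record into [age, sex, race, education, marital status]."""
    return [
        age_in_years(record["birth_date"], record["entry_date"]),
        SEX[record["sex"]],  # no 'unknown' code is defined for sex
        RACE.get(record["race"], 0),  # 0 = unknown/did not declare
        EDUCATION.get(record["education"], 0),
        MARITAL_STATUS.get(record["marital_status"], 0),
    ]
```

The resulting numeric vector is the kind of input a cluster analysis would consume; how the authors actually scaled or combined these variables is not shown here.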
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
- Example 1
- Portuguese version: Paciente encaminhada para avaliação de pseudoartrose DIAGNÓSTICO. tc EXAME e rx EXAME ok, com melhora de padrão prévio.
- Translated to English: Patient was referred for pseudarthrosis assessment DIAGNOSIS. ct EXAM and xr EXAM ok, with improvement of the previous pattern.
- Example 2
- Portuguese version: Paciente com lesão osteocondral DIAGNÓSTICO, lesão de mm DIAGNÓSTICO e lesão deg DIAGNÓSTICO, onde tem dor SINTOMA. CD paracetamol MEDICAÇÃO + tramadol MEDICAÇÃO, fisio TRATAMENTO.
- Translated to English: Patient with osteochondral lesion DIAGNOSIS, mm lesion DIAGNOSIS and deg lesion DIAGNOSIS, location of pain SYMPTOM. CD paracetamol MEDICATION + tramadol MEDICATION, physio TREATMENT.
- Example 3
- Portuguese version: Paciente consciente CONDIÇÃO, orientado CONDIÇÃO em tempo e espaço, descorada SINTOMA 1+/4+, hidratada CONDIÇÃO, acianótica CONDIÇÃO, anictérica CONDIÇÃO. Dispnéica SINTOMA, com cateter de O2 contínuo, mantendo saturação de O2 a 93.
- Translated to English: Patient conscious CONDITION, oriented CONDITION in time and space, pale SYMPTOM 1+/4+, hydrated CONDITION, acyanotic CONDITION, anicteric CONDITION. Dyspneic SYMPTOM, with continuous O2 catheter, maintaining O2 saturation at 93.
- Example 4
- Portuguese version: PS mãe refere que criança foi picada por inseto hoje, por volta das 12 h. Está com alergia local SINTOMA. EF beg CONDIÇÃO, corado CONDIÇÃO, hidratado CONDIÇÃO, afebril CONDIÇÃO, peso 16,1 kg, área central puntiforme com pequena vesícula, halo de cerca de 4 cm com hiperemia SINTOMA e calor SINTOMA em região do maleolo medial esquerdo. CD hidroxizine MEDICAÇÃO por 7 dias. compressa fria local TRATAMENTO.
- Translated to English: ER mother said that the child was bitten by an insect today, around 12 pm. Child has local allergy SYMPTOM. PE ggc CONDITION, flushed CONDITION, hydrated CONDITION, afebrile CONDITION, weight 16.1 kg, central puncture area with small vesicle, halo of about 4 cm with hyperemia SYMPTOM and heat SYMPTOM in the region of the left medial malleolus. CD hydroxyzine MEDICATION for 7 days. Local cold compress TREATMENT.
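The inline tags in the examples above mark which words were labeled. For model training, such annotations are usually stored as character offsets rather than inline tags. The sketch below converts Example 1 into a `(text, {"entities": [(start, end, label)]})` tuple, a layout commonly used for spaCy NER training data. It is an assumption that the corpus was stored this way, and `to_offsets` is a hypothetical helper written for illustration.

```python
from typing import Dict, List, Tuple


def to_offsets(
    text: str, mentions: List[Tuple[str, str]]
) -> Tuple[str, Dict[str, List[Tuple[int, int, str]]]]:
    """Locate each (surface form, label) pair in text, scanning left to right."""
    entities = []
    cursor = 0
    for surface, label in mentions:
        start = text.find(surface, cursor)
        if start == -1:
            raise ValueError(f"{surface!r} not found after position {cursor}")
        end = start + len(surface)
        entities.append((start, end, label))
        cursor = end  # later mentions must come after this one
    return text, {"entities": entities}


# Example 1 from Appendix A with the inline tags removed.
text = (
    "Paciente encaminhada para avaliação de pseudoartrose. "
    "tc e rx ok, com melhora de padrão prévio."
)
mentions = [
    ("pseudoartrose", "DIAGNÓSTICO"),
    ("tc", "EXAME"),
    ("rx", "EXAME"),
]
example = to_offsets(text, mentions)
```

Searching from a moving cursor keeps repeated abbreviations such as `tc` from being matched at an earlier, unrelated position.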
Appendix B

| Factors | Categories | n | % |
|---|---|---|---|
| Demographics | | | |
| Sex | Male | 13,823 | 46.1 |
| | Female | 16,177 | 53.9 |
| Age group | 0–12 years | 4494 | 15.0 |
| | 13–19 years | 1431 | 4.8 |
| | 20–59 years | 13,730 | 45.8 |
| | 60 or older | 10,345 | 34.5 |
| Race | White | 26,348 | 87.8 |
| | Black | 1016 | 3.4 |
| | Mixed race | 2311 | 7.7 |
| | Asian | 39 | 0.1 |
| | Indigenous | 2 | 0.0 |
| | Unknown/did not declare | 284 | 0.9 |
| Socioeconomics | | | |
| Education | Illiterate | 3029 | 10.1 |
| | Basic literacy | 1993 | 6.6 |
| | Early primary | 4598 | 15.3 |
| | Primary | 10,225 | 34.1 |
| | Secondary | 6946 | 23.2 |
| | Higher education/undergraduate/postgraduate | 2586 | 8.6 |
| | Unknown/did not declare | 623 | 2.1 |
| Marital status | Single | 11,360 | 37.9 |
| | Married/common law | 13,031 | 43.4 |
| | Divorced | 2356 | 7.9 |
| | Widowed | 3017 | 10.1 |
| | Unknown/did not declare | 236 | 0.8 |


Medical Specialty | n | Value (%) |
---|---|---|
Internal medicine | 4099 | 13.4 |
Pediatrics | 3886 | 13.0 |
Internist | 3684 | 12.3 |
General surgery | 1977 | 6.6 |
Orthopedics/traumatology | 1319 | 4.4 |
Nephrology | 1295 | 4.3 |
Obstetrics | 1263 | 4.2 |
Cardiology | 1165 | 3.9 |
Clinical neurology | 1159 | 3.9 |
Infectiology | 697 | 2.3 |
Ophthalmology | 632 | 2.1 |
Otolaryngology | 617 | 2.1 |
Others | 8297 | 27.7 |

ICD Description | n | Value (%) |
---|---|---|
General examination | 527 | 1.8 |
Acute pain | 513 | 1.7 |
Other specified septicemias | 509 | 1.7 |
Congestive heart failure | 486 | 1.6 |
Unspecified acute myocardial infarction | 409 | 1.4 |
End-stage kidney disease | 368 | 1.2 |
Stroke not specified as hemorrhagic or ischemic | 337 | 1.1 |
Status epilepticus unspecified | 325 | 1.1 |
Other cerebral infarctions | 296 | 1.0 |
Unspecified bacterial pneumonia | 294 | 1.0 |
Eye and vision exam | 290 | 1.0 |
Bronchopneumonia unspecified | 283 | 0.9 |
Others | 25,363 | 84.5 |

Entities | F-Score | Precision | Recall |
---|---|---|---|
Condition | 82.652 | 90.298 | 76.201 |
Diagnosis | 49.272 | 54.424 | 45.011 |
Exam | 54.664 | 72.609 | 43.832 |
Medication | 80.966 | 87.474 | 75.360 |
Symptom | 58.863 | 68.768 | 54.451 |
Treatment | 47.312 | 57.592 | 40.146 |
Model | 63.867 | 72.725 | 56.932 |
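
The F-Score column can be related to the other two columns through the usual harmonic mean of precision and recall. The sketch below assumes the balanced F1 variant; the function name `f_score` is hypothetical, and the Medication row is used only as a worked example.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1), on the same scale as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Medication row of the table: precision 87.474, recall 75.360.
medication_f = f_score(87.474, 75.360)
print(round(medication_f, 3))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, entities with low recall (such as Treatment and Exam) end up with a low F-score even when precision is moderate.
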

Number | Extracted | Correct Form |
---|---|---|
1 | Popranolol | Propranolol |
2 | Porpranolol | Propranolol |
3 | Prapranolol | Propranolol |
4 | Pronalol | Propranolol |
5 | Propanalol | Propranolol |
6 | Propanol | Propranolol |
7 | Propanolol | Propranolol |
8 | Proparanolol | Propranolol |
9 | Propalol | Propranolol |
10 | Propanol | Propranolol |
11 | Propanolo | Propranolol |
12 | Propranolol | Propranolol |
13 | Proranolol | Propranolol |
14 | Prorpanolol | Propranolol |
15 | Prpranolol | Propranolol |
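
Spelling variants like those in the table can be mapped to a canonical form by approximate string matching. The following is one possible approach using Python's standard `difflib` module; it is not necessarily the method used by the authors. The vocabulary `KNOWN_DRUGS` (drug names that occur in this article), the `cutoff` value, and the function name are assumptions made for illustration.

```python
import difflib
from typing import Sequence

# Hypothetical reference vocabulary, limited to drug names mentioned in this article.
KNOWN_DRUGS = ("propranolol", "paracetamol", "tramadol", "hidroxizine")


def normalize_drug(
    name: str, vocabulary: Sequence[str] = KNOWN_DRUGS, cutoff: float = 0.7
) -> str:
    """Return the closest vocabulary entry, or the input unchanged if none is close."""
    matches = difflib.get_close_matches(
        name.lower(), vocabulary, n=1, cutoff=cutoff
    )
    return matches[0].capitalize() if matches else name


for variant in ("Popranolol", "Propanolol", "Prpranolol"):
    print(variant, "->", normalize_drug(variant))
```

A higher `cutoff` reduces the risk of merging two distinct drugs with similar names, at the cost of leaving more misspellings uncorrected; in a clinical setting the threshold would need to be validated against a real formulary.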
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
da Rocha, N.C.; Barbosa, A.M.P.; Schnr, Y.O.; Machado-Rugolo, J.; de Andrade, L.G.M.; Corrente, J.E.; de Arruda Silveira, L.V. Natural Language Processing to Extract Information from Portuguese-Language Medical Records. Data 2023, 8, 11. https://doi.org/10.3390/data8010011