1. Introduction
An electronic medical record (EMR) is a textual record of medical activities [1,2]. The development of information technology has promoted the growth of electronic medical records, and the value of the medical information they contain is becoming increasingly important. Because clinical notes are unstructured, extracting information from them is very difficult [3]. Effectively extracting patients' disease information is therefore the primary requirement for studying the causes, development, and evolution of patient morbidity. Entity extraction from EMRs is the basis of studying patients' diseases, and it has a wide range of application scenarios, such as medical information retrieval, question answering systems, and clinical decision support. Accurate extraction of medical entities is thus crucial for obtaining and using medical information.
As a country with a large population, China has produced more and more electronic medical records in recent years. Therefore, the effective use of Chinese EMRs and the effective extraction of information from them are of great significance to public health. The resident admit note (RAN) is a type of electronic medical record that contains a large number of descriptions of the patient's condition; it is first-hand information for studying the patient's diseases. The information in original Chinese RANs is unstructured, so accurately extracting the entity information they contain is highly significant.
Figure 1 shows part of one annotated Chinese RAN collected from a famous hospital in Hunan, China. Its main contents include the chief complaint, present illness, history of past disease, personal history, family history, and physical examination. Words marked in different colors represent different medical entities: for example, 身体部位 (body part) is marked in blue, 医学发现 (medical discovery) in green, 疾病 (disease) in red, and 治疗 (treatment) in yellow. Our task is to extract these important medical entities from the original Chinese RANs. Specific annotation information can be found in Section 3.
Like biomedical named entity recognition (NER) in English, medical NER in Chinese text poses several challenges. First, one entity can contain multiple words, so the NER system must identify entity boundaries; for example, the entity “各瓣膜听诊区 (valve auscultation area)” consists of the words “各”, “瓣膜”, “听诊”, and “区”. Second, the same word can be an entity or part of an entity: the two entities “腹壁 (abdominal wall)” and “腹壁静脉 (abdominal wall vein)” both contain “腹壁 (abdominal wall)”. Third, medical texts contain abbreviations, which increase the difficulty of extracting entities. In addition, unlike English, every Chinese word has its own meaning, and every character in a word also has its own meaning, so it is better to combine character and word information. For example, “腹壁静脉 (abdominal wall vein)” can be divided into the words “腹壁 (abdominal wall)” and “静脉 (vein)”, which can in turn be divided into the characters “腹”, “壁” and “静”, “脉”, respectively. Chinese entity recognition should therefore consider the words “腹壁 (abdominal wall)” and “静脉 (vein)” and the characters “腹”, “壁”, “静”, “脉” at the same time.
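The boundary problem described above can be made concrete with the standard BIO tagging scheme commonly used for NER. The following minimal sketch (the entity type name `body_part` is illustrative; the paper's exact label set is given in Section 3) converts span annotations over a character sequence into BIO labels, showing that the nested word “腹壁” gets no separate label once “腹壁静脉” is annotated as one entity:

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, type) span annotations over a token sequence
    into BIO labels. `end` is exclusive, as in Python slicing."""
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"          # entity beginning
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"          # entity continuation
    return labels

# "腹壁静脉" (abdominal wall vein) annotated as a single body-part entity:
chars = ["腹", "壁", "静", "脉", "无", "曲", "张"]
print(spans_to_bio(chars, [(0, 4, "body_part")]))
# → ['B-body_part', 'I-body_part', 'I-body_part', 'I-body_part', 'O', 'O', 'O']
```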
In this study, we propose a medical entity recognition model for Chinese RANs based on a character and word attention-enhanced (CWAE) neural network. First, we obtain Chinese word embeddings and character-based embeddings through character-enhanced word embedding (CWE) and a convolutional neural network (CNN). Then, we use an attention mechanism to weight the character-based embedding and word embedding together, producing a new word embedding that fully combines character and word information. The new word embedding is then fed to a bidirectional long short-term memory (BI-LSTM) network to capture the contextual semantic information of the entities. Finally, the BI-LSTM is combined with a conditional random field (CRF) to predict medical entities. We annotated nine types of medical entities on 355 RANs from a famous hospital in Hunan Province, China: 医学发现 (medical discovery), 时间词 (temporal word), 检查 (inspection), 检验 (laboratory test), 治疗 (treatment), 疾病 (disease), 药物 (medication), 身体部位 (body part), and 测量数据 (measurement). We then comparatively evaluated our model on these nine entity types. The results show that our model performs better than the baselines, reaching an F1-score of 94.44%.
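The attention-weighting step above can be sketched in a few lines of numpy. This is a minimal illustration only, assuming a scalar sigmoid gate computed from the concatenated views; the paper's exact attention formulation and parameter shapes are given in Section 3, and the weight names here are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50                              # embedding dimension (illustrative)

# Pretrained word embedding and CNN-derived character-based embedding
# for one token (random stand-ins for the trained vectors).
word_emb = rng.standard_normal(dim)
char_emb = rng.standard_normal(dim)

# Attention gate: a learned weight vector scores the concatenation of
# the two views; a sigmoid turns the score into a mixing weight in (0, 1).
W = rng.standard_normal(2 * dim)      # learned parameters (illustrative)
b = 0.0
score = W @ np.concatenate([word_emb, char_emb]) + b
alpha = 1.0 / (1.0 + np.exp(-score))

# Attention-enhanced embedding: convex combination of character and word
# information, fed onward to the BI-LSTM layer.
combined = alpha * char_emb + (1.0 - alpha) * word_emb
assert combined.shape == (dim,)
```

During training, gradients flow through `alpha`, so the model learns how much character versus word information to keep for each token.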
There are three main contributions in this paper:
To address the characteristics of Chinese RANs, we propose using an attention mechanism to combine character and word information, which further improves the expressive ability of word representations.
We annotated nine types of entities on Chinese RANs, including medical discovery, temporal word, inspection, laboratory test, treatment, disease, medication, body part, and measurement.
We achieved an F1-score of 95.93% for medical discovery, 86.83% for temporal word, 94.61% for inspection, 83.54% for laboratory test, 87.48% for treatment, 89.56% for disease, 78.82% for medication, 97.02% for body part, 94.73% for measurement, and 94.44% for all the medical entities combined.
The remainder of the paper is organized as follows: Section 2 introduces related work on entity extraction. Section 3 describes the experimental dataset, the medical named entity recognition task, and our model in detail. Section 4 provides the experimental results and analysis. Finally, Section 5 presents our conclusions and future work.
2. Related Work
With the advent of the medical big data era, more and more attention has been paid to knowledge mining and the utilization of electronic medical records. Information extraction from clinical free text is the most fundamental task [3]. The medical entity is the carrier of important information, and extracting medical entities is very important to public health. In recent years, with the rapid development of deep learning and natural language processing, information extraction technology has matured considerably, and more and more models have been built to process biomedical texts.
There are many mature methods for recognizing biomedical text entities in English. Named entity recognition methods mainly comprise traditional machine learning and deep learning. Traditional machine learning mainly includes logistic regression (LR), support vector machines (SVMs), the hidden Markov model (HMM), and CRF. Li et al. [4] used an artificial feature-based CRF for gene entity recognition, achieving an F1-score of 87.28% on the BioCreative II GM corpus. The BioCreative II GM corpus is provided by the gene mention (GM) tagging task, which is concerned with extracting mentions of genes and gene products from text. Wang et al. [5] used an SVM for biomedical named entity recognition on the JNLPBA corpus, which is provided by the BioNLP/JNLPBA Shared Task 2004 organized by the GENIA Project. They used many artificial features, including local features, full-text features, and external resource features, and achieved an F1-score of 71.7%. However, most of these methods rely on feature engineering, which is labor intensive. Deep learning does not require manual features, so it has become popular. Yao et al. [6] proposed a biomedical named entity recognition (Bio-NER) method based on a multi-layer deep neural network architecture. In their model, a CNN extracts sentence-level features, but it cannot capture the dependencies between characters in sentences. To make efficient use of the sequence features of sentences, Li et al. [7] constructed a BI-LSTM for entity recognition. They built twin word embeddings and a sentence vector to enrich the input information, achieving an F1-score of 88.6% on the BioCreative II GM corpus. Habibi et al. [8] proposed a BI-LSTM-CRF with word embeddings to improve biomedical named entity recognition. They evaluated their model on 24 corpora covering five entity types; the average F1-score was 81.11%. To detect both word-level and character-level features, Chiu et al. [9] constructed a BI-LSTM-CNNs named entity recognition model, which achieved a 91.62% F1-score on the CoNLL-2003 corpus and an 86.28% F1-score on the OntoNotes corpus. Li et al. [10] proposed a CNN-BILSTM-CRF neural network model. They used a CNN to train character-level representations of words, combined them with word vectors obtained from large-scale corpus training, and fed the combined word vectors to a BLSTM-CRF network for training. The F1-score was 89.09% on the BioCreative II GM corpus but only 74.40% on the JNLPBA corpus. To use widely available unlabeled text data to improve the performance of NER models, Sachan et al. [11] proposed the effective use of bidirectional language modeling for medical named entity recognition. They trained a bidirectional language model (Bi-LM) on unlabeled data and transferred its weights to an NER model with the same architecture as the Bi-LM; the best F1-score on a clinical notes corpus was 86.11%. To attend to the significant areas when capturing features, Wei et al. [12] proposed an attention-based BILSTM-CRF model, which obtained an F1-score of 73.50% on the JNLPBA corpus.
For Chinese named entity recognition, Ouyang et al. [13] proposed named entity recognition based on a BI-LSTM neural network with additional features. Their experiments showed that the BI-LSTM with word embeddings trained on a large corpus achieved the highest F1-score of 92.47%; however, they did not incorporate a CRF or consider the character information within words. Xiang et al. [14] proposed a Chinese NER method based on character-word mixed embedding (CWME), averaging the word embedding and character embedding when the word contains only one character. Yang et al. [15] proposed deep neural networks for medical entity recognition in Chinese online health consultations. They used BI-LSTM-CRF as the basic architecture and concatenated character embeddings and context word embeddings to learn effective features.
Different from the methods mentioned above, we propose a medical entity recognition model for Chinese RANs based on a character and word attention-enhanced neural network. We initialize the embedding layer through the CWE model to obtain word embeddings and character embeddings, and we use an attention mechanism to combine character and word information; through training, the model learns the best weights for this combination. In our experiments, we used 355 Chinese RANs from a famous hospital in Hunan Province, China, in which we annotated 医学发现 (medical discovery), 时间词 (temporal word), 检查 (inspection), 检验 (laboratory test), 治疗 (treatment), 疾病 (disease), 药物 (medication), 身体部位 (body part), and 测量数据 (measurement). We then evaluated our model on these nine entity types.
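The CRF layer used on top of the BI-LSTM in our model (and in the related work above) chooses the best label sequence jointly rather than per token, which is what lets it rule out illegal transitions such as an I label with no preceding B label. A self-contained Viterbi decoding sketch, with illustrative hand-set scores rather than any trained model's parameters:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence as a list of label ids.
    emissions: (seq_len, n_labels) per-token scores, e.g. from a BI-LSTM.
    transitions: (n_labels, n_labels) score of moving from label i to j."""
    seq_len, _ = emissions.shape
    score = emissions[0].copy()
    backpointers = []
    for t in range(1, seq_len):
        # total[i, j]: best score ending in label j at step t via previous label i
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for bp in reversed(backpointers):
        best.append(int(bp[best[-1]]))
    return best[::-1]

# Toy 3-label scheme: 0 = O, 1 = B, 2 = I.  The large negative
# transition score forbids the illegal move O -> I.
emissions = np.array([[0., 2., 0.],
                      [0., 0., 1.],
                      [3., 0., 0.]])
transitions = np.array([[0., 0., -1e9],   # from O
                        [0., 0., 1.],     # from B
                        [0., 0., 1.]])    # from I
print(viterbi_decode(emissions, transitions))  # → [1, 2, 0], i.e. B, I, O
```

Even though the middle token's emission slightly favors I on its own, the decoder picks the sequence B, I, O because the transition scores reward that path jointly.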
5. Conclusions and Future Work
Medical named entity recognition is relatively mature for English, but research on Chinese medical entity recognition started relatively late. For entity extraction from Chinese RANs, we proposed a medical entity recognition model based on a character and word attention-enhanced neural network. First, we obtained Chinese word embeddings and character-based embeddings through the CWE and CNN models. Then, an attention mechanism weighted the character-based embeddings and word embeddings together to produce new word embeddings, which were fed to the BI-LSTM and CRF to compute the training loss. Finally, the trained model was used to predict medical named entities. In our experiments, we annotated nine types of medical entities on 355 RANs from a famous hospital in Hunan Province. To illustrate the superiority of our model, we compared it with traditional machine learning methods (SVM, CRF); the experiments showed that our model's recognition results surpass the traditional methods, with the best F1-score. We also compared our model against related deep learning models, and the results showed that our model performs better.
Our model also has some limitations. It mainly considers the semantic features of Chinese. Chinese differs from English; for example, a Chinese word can contain multiple characters that each carry meaning, so this model is not suitable for English. However, our model can be applied to other languages whose expression is similar to Chinese. In addition, the word embeddings in our experiments were trained on 500 RANs, the People's Daily, and Wikipedia. The number of RANs is not large, so the word embeddings may lack expressive ability. In the next stage, we will strive to acquire more medical data and train character and word embeddings that express words better. The extraction of medical entities is only the first step of intelligent medicine; in-depth study of medical entity modifiers and entity relationships is the goal of our next work. Furthermore, since RANs are the original data of a patient's condition, we will build on our current work to further study similar medical records and predict preliminary diagnostic results.