1. Introduction
Named entity recognition (NER) is a fundamental technology in natural language processing (NLP) whose main function is to identify and extract named entities, such as person names, location names, organization names, and domain-specific terms, from large volumes of unstructured text; it is one of the subtasks of knowledge extraction [1]. Nowadays, named entity recognition technology is widely applied in tasks such as machine translation [2], information extraction [3], intelligent question answering [4], and knowledge graph construction [5]. In natural language processing tasks, the accuracy of named entity recognition has a significant influence on the performance of subsequent tasks.
Early medical NER work mostly relied on rule-based and dictionary-based methods [6]. Experts in the medical field constructed rule templates, and pattern matching and string matching were applied to identify the relevant entities, such as disease names, drug names, inspection indicators, and related symptoms. Many rules had to be developed by professional medical experts, making these methods costly in labor [7] and poor in portability. At present, with the development of deep learning and increasing computing power, more and more scholars make use of neural network models to process medical NER tasks [8], treating them as sequence labeling problems. Nevertheless, the most pressing problems in medical NER are the lack of labeled data and the prevalence of polysemy and entity co-reference in specialized vocabulary, which lead to low accuracy in recognizing medical terms and limit the performance of network models, hindering the development of Chinese medical entity recognition [9].
In the medical field, taking diabetes as an example, China has become the country with the largest number of diabetes patients in the world: the prevalence of diabetes among adults is 11.7%, and this figure is still rising [10,11]. For patients with diabetes, most of their conditions are recorded in the form of electronic documents [12]; nonetheless, there is a lack of efficient methods to accurately identify diabetes named entities. Accordingly, improving the accuracy of named entity recognition has become a key issue in the field of diabetes medicine.
At present, there is little research on named entities in the field of diabetes medicine, and labeled datasets are rare; thus, this paper crawls domestic public medicine websites and related encyclopedia websites, such as “XYWY.COM”, “QIUYI.CN”, and Baidu Encyclopedia, to obtain a large quantity of diabetes-related data and constructs a Chinese diabetes corpus. Furthermore, entity naming in the medical field is complex and entity nesting is common, which makes entity recognition difficult, and most current models feed the generated character vectors directly into a Bidirectional Long Short-Term Memory (BiLSTM) or Bidirectional Gated Recurrent Unit (BiGRU) network to obtain global features while giving no consideration to locally optimal features. To address these issues, this paper proposes a Chinese medical NER model that unites a local context-wise module with a self-attention BiGRU, taking both local and global features into account, and constructs the deep learning model RMBC (RoBERTa Multi-scale CNN BiGRU Self-attention CRF) for diabetes entity recognition, providing effective extraction of six types of medical entities.
2. Related Work
There are four main methods for named entity recognition research: rule-based methods, statistics-based machine learning methods, deep learning methods, and NER methods using pre-trained models.
The origins of NER were mainly based on rule-based and dictionary-based methods [
13,
14], which relied on domain experts to manually construct corresponding rule templates and use matching methods to process text. This method requires a lot of human resources and cannot be easily transferred between different domains. After that, the methods based on statistical machine learning were introduced, including Hidden Markov Model (HMM) [
15], Maximum Entropy Model (MEM) [
16], Support Vector Machine (SVM) [
17], and Conditional Random Fields (CRF) [
18]. Among them, CRF performs global normalization over the entire label sequence, fully utilizing internal and contextual feature information and alleviating the label bias problem [
19]. However, the above-mentioned machine learning methods require a large quantity of manually annotated data for feature extraction, and the size of the corpus seriously affects the recognition performance [
20]. In recent years, many deep learning methods have been applied to research on named entity recognition [
21]. For example, Hammerton et al. [
22] used a unidirectional long short-term memory (LSTM) network for text recognition and achieved very good results, and LSTM-CRF subsequently became a basic structure for named entity recognition. Later, Lample et al. [
23] built on this structure and proposed a neural network model that combines bidirectional long short-term memory (BiLSTM) with CRF. This structure effectively captures the sequential information of the context and achieved an F1 score of 90.94% on the CoNLL-2003 dataset; it has since been widely used in tasks such as named entity recognition. Collobert et al. first proposed combining the convolutional neural network (CNN) with CRF [
24]. This method assigns a fixed-size window to each word, and it extracts local information more effectively, but lacks consideration of long-distance word information. Chiu and Nichols [
25] proposed a BiLSTM-CNN model that uses a CNN to learn character features and completes sequence labeling through BiLSTM. The F1 score of this model on the CoNLL-2003 dataset reaches 91.62%. Ma et al. [
26] proposed a model based on LSTM-CNNs-CRF to address sequence labeling problems. The model combines LSTM, CNN, and CRF models to establish an end-to-end model that does not require a large amount of data or specific task knowledge. Zhu et al. [
27] proposed a named entity recognition model that combines CNN with BiGRU. In addition, the model introduced a multi-task learning framework to simultaneously handle entity type classification and entity boundary recognition. Experiments on multiple datasets showed that it achieved higher accuracy than traditional named entity recognition models. Strubell et al. [
28] proposed the Iterated Dilated Convolutional Neural Network (IDCNN), which improves on traditional CNNs by introducing regularization to address the overfitting caused by increasing the number of layers. IDCNN also significantly improves speed and performs well in named entity recognition tasks. Zhang et al. [
29] proposed a new type of Lattice LSTM that incorporates potential word information into the traditional word-based LSTM-CRF model, which avoids error propagation caused by word segmentation errors. This method achieved good results on multiple public datasets.
However, all of the aforementioned methods share a common problem: they cannot handle polysemous words. These methods focus only on feature extraction between words or characters, ignoring contextual semantic information, so the static word vectors they extract lead to lower accuracy for named entity recognition. To address this issue, Google proposed the pre-trained language model BERT [30] in 2018 for producing contextual word embeddings. The BERT model uses bidirectional Transformer encoders, which enhances the generalization ability of the word vector model, fully describes the relationships between characters, words, and sentences, and effectively represents the semantic information of the context. BERT has become the mainstream model in the field of NLP [
31,
32].
Currently, named entity recognition technology is gradually being applied in the medical field. Compared with other fields, the medical entities in Chinese named entity recognition tasks are more specialized, making them more challenging to recognize. Chinese medical texts usually contain a large number of medical terms with complex word formation, nested entities, and fuzzy boundaries between entities. A large number of entities mix digits, symbols, Chinese, and English, such as GLP-1 receptor agonists, COX-2 inhibitors, and DPP-4 inhibitors. In addition, beyond the inherent difficulty of named entity recognition, the medical field poses further problems: medical entities are described in diverse ways, there is a lack of unified naming rules, and new entities continuously emerge as medical technology develops, which gives deep learning models poor transferability. Moreover, publicly available annotated datasets for named entities in the medical field are limited and manual annotation is expensive, so the training data available for deep learning approaches to medical entity recognition are insufficient. These issues undoubtedly increase the difficulty of named entity recognition in the medical field and limit entity recognition performance.
To solve the current problems, some scholars have conducted research on named entity recognition in the medical field. For example, Chai et al. [
33] proposed a novel biomedical named entity recognition method by combining XLNet and CRF for noise reduction and entity recognition. This method achieved an F1 score of 89.01% in the BioCreative V CDR task, outperforming other models. In the JNLPBA task, the method achieved an F1 score of 77.39%. Guo et al. [
34] proposed a method for named entity recognition of Chinese electronic medical records using multi-task learning and transfer learning. This method used a shared deep neural network to learn multiple related tasks, including disease and drug entity recognition, and used pre-training and fine-tuning for transfer learning to improve the model’s generalization ability and robustness. The method achieved excellent performance on the CEMR-NER task of the public dataset CCKS2017, with an F1 score of 87.36%. Lee et al. [
35] proposed a multi-graph neural network method with multi-embedding enhancement for Chinese medical named entity recognition. The method achieved excellent performance on the public dataset CEMR-NER, with an F1 score of 86.95%, outperforming multiple baseline models. Liang et al. [
36] proposed a transfer learning-based method that transfers the pre-trained model in the textual entailment task to the biomedical named entity recognition task to improve the performance of biomedical NER. On the BioCreative IV dataset, the authors’ method achieved the best performance, with an F1 score of 70.63%. Chen et al. [
37] proposed a knowledge-adaptive multi-path matching network based on machine reading comprehension for the biomedical named entity recognition task. The method combines medical knowledge with text features and achieves accurate entity recognition through a multi-path matching network. On the BC2GM dataset, the method achieved the best performance, with an F1 score of 87.02%. Liu et al. [
38] designed a Med-BERT pre-training framework that combines medical corpora and specific tasks related to the field to improve the model’s performance in medical named entity recognition (NER). On the i2b2-2010 dataset, Med-BERT achieved the best F1 score of 87.02%.
3. Method
In this section, the structure of each part of the proposed RMBC model is introduced, including the embedding from the pre-trained model, the local context-wise module, the BiGRU layer combined with self-attention, and the CRF layer. The overall framework of the RMBC model is shown in
Figure 1. The model first employs the RoBERTa-wwm pre-trained model to extract character vectors from the text data; a local context-wise module then effectively extracts local information from these vectors; next, a BiGRU combined with a self-attention mechanism captures the global feature information of the text; finally, the features are fed into the CRF layer for decoding, which outputs the tag sequence with the highest probability and thereby the tag category of each character.
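To make the overall data flow concrete, the following is a minimal PyTorch sketch of how the four stages could be composed. It is an illustrative sketch only: the embedding layer stands in for the RoBERTa-wwm encoder, the dimensions, kernel sizes, and tag count are arbitrary, and CRF decoding (Section 3.5) is left out, with the final linear layer producing the emission scores that a CRF would consume.

```python
# Minimal sketch of an RMBC-style forward pass (assumed composition, not the authors' code).
import torch
import torch.nn as nn

class RMBCSketch(nn.Module):
    def __init__(self, vocab_size=21128, emb_dim=768, hidden=256, num_tags=25,
                 kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, emb_dim)      # stand-in for RoBERTa-wwm
        self.convs = nn.ModuleList(                           # multi-scale local convolutions
            [nn.Conv1d(emb_dim, emb_dim, k, padding=k // 2) for k in kernel_sizes])
        self.bigru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden, num_tags)      # per-token scores for a CRF layer

    def forward(self, token_ids):
        x = self.encoder(token_ids)                           # (B, T, emb_dim)
        local = x
        for conv in self.convs:                               # residual multi-scale CNN
            c = torch.relu(conv(local.transpose(1, 2)).transpose(1, 2))[:, : x.size(1), :]
            local = local + c
        g, _ = self.bigru(local)                              # (B, T, 2*hidden) global features
        attn = torch.softmax(g @ g.transpose(1, 2) / g.size(-1) ** 0.5, dim=-1)
        return self.emissions(attn @ g)                       # scores to be decoded by a CRF

scores = RMBCSketch()(torch.randint(0, 21128, (2, 16)))
print(scores.shape)                                           # torch.Size([2, 16, 25])
```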
3.1. RoBERTa-wwm Pre-Trained Language Model
Compared with the ELMo [
39] and OpenAI-GPT [
40] pre-trained models, BERT is an unsupervised deep bidirectional language representation model that uses stacked Transformer blocks as its main architecture and is trained with two pre-training tasks, namely, the Masked Language Model (MLM) and Next Sentence Prediction (NSP). BERT is widely used in named entity recognition tasks to provide pre-trained semantic representations.
The Joint Laboratory of HIT and iFLYTEK Research has launched the Chinese RoBERTa-wwm pre-trained language model [
41] that has been improved on the basis of RoBERTa and Chinese whole word mask technology [
42], with two main improvements. First, RoBERTa-wwm uses dynamic masking instead of the static masking of BERT: different tokens are randomly selected for [MASK] each time, which increases the randomness of the model input and allows the model to learn more diverse language representations. Second, RoBERTa-wwm adopts whole word masking instead of the individual character masking of BERT: the entire word is masked rather than individual characters, which helps improve the model’s understanding of the vocabulary. As shown in
Figure 2, BERT randomly masks individual characters in a sentence, while the whole word masking of RoBERTa-wwm masks all characters belonging to a word. For the sentence “Patients with diabetes who experience high blood sugar levels”, the character-level masking scheme of BERT masks some individual characters in the word “high blood sugar levels”, such as the character “high”, and then learns character-level semantic representations by predicting the masked characters. RoBERTa-wwm, in contrast, first segments a sentence into words and then randomly masks a portion of the words for prediction, such as “high blood sugar levels” and “clinical practice”; with this training method, RoBERTa-wwm can learn semantic representations at the word level, thereby improving the overall performance of the model.
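As a toy illustration of the difference between the two masking schemes, the sketch below masks either individual characters or whole pre-segmented words; the masking ratio, the example words, and the [MASK] handling are assumptions chosen for illustration and do not reproduce the actual RoBERTa-wwm training procedure.

```python
import random

def char_mask(chars, ratio=0.15, mask="[MASK]"):
    # BERT-style masking: each character is masked independently.
    return [mask if random.random() < ratio else c for c in chars]

def whole_word_mask(words, ratio=0.15, mask="[MASK]"):
    # wwm-style masking: when a word is chosen, every character in it is masked.
    out = []
    for w in words:
        out.extend([mask] * len(w) if random.random() < ratio else list(w))
    return out

sentence_words = ["糖尿病", "患者", "出现", "高血糖"]   # pre-segmented example words
print(char_mask(list("".join(sentence_words))))
print(whole_word_mask(sentence_words))
```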
3.2. Local Context-Wise Module
Convolutional neural networks (CNNs) show excellent performance in extracting local features; although they are primarily used for image feature extraction, in recent years more and more scholars have begun to employ CNNs to solve natural language processing problems such as named entity recognition.
This paper constructs a local context-wise module to extract multi-scale local features of diabetes texts. A multi-window attention mechanism is applied to the input character vectors so that the important semantic components of local features can be effectively captured under different window sizes. The convolution layer improves the local feature awareness of the CNN by setting multiple convolution kernels of different sizes, which are computed efficiently in parallel, thereby fully extracting local feature information at different scales. Next, a residual structure is leveraged to fuse the semantic information at different scales and to avoid the network degradation caused by overly deep networks, thus improving the performance of entity recognition.
The local context-wise module encodes the character sequence output by the RoBERTa-wwm model while implicitly grouping related characters to capture the relevance in the local context. The sequence $X = \{x_1, x_2, \ldots, x_n\}$ is utilized as the input representation, where $x_i$ is the embedded representation of the $i$-th character.

The convolution window size of the CNN is set to $k$, and each character embedding includes a position embedding of the same size as the window $k$. The index range of this position embedding is from 0 to $k-1$, where the value is 1 if the current index corresponds to the position of the corresponding character within the window and 0 otherwise; in this way, the CNN can encode the position information of each character in its context into the embedding vector, thereby capturing the sequential dependency of the characters in the sequence. The embedding dimension is $d$. In order to capture the semantic relationship between the central character and the surrounding characters, CNNs with different convolution window sizes are combined with a multi-window attention mechanism. This method can effectively focus on the local context of each character and strengthen the semantic connection between the central character and its surrounding characters.

In the multi-window attention layer, the central character $x_i$ with a window size of $k$ is taken as the center, and its input together with the surrounding characters is represented by $X_i^{k} = \{x_{i-\lfloor k/2 \rfloor}, \ldots, x_i, \ldots, x_{i+\lfloor k/2 \rfloor}\}$; such inputs ultimately generate $n$ hidden vectors $h_i$ with a length of $d$, with the calculation method as follows:

$h_i = \sum_{x_j \in X_i^{k}} \alpha_{ij} x_j$

where $x_j$ denotes a character in the window $X_i^{k}$ and $\alpha_{ij}$ represents the attention weight, and the calculation formula of $\alpha_{ij}$ is as follows:

$\alpha_{ij} = \dfrac{\exp\left(\mathrm{score}(x_i, x_j)\right)}{\sum_{x_{j'} \in X_i^{k}} \exp\left(\mathrm{score}(x_i, x_{j'})\right)}$

and the score function is defined as follows:

$\mathrm{score}(x_i, x_j) = \dfrac{(W_q x_i)^{\top} (W_k x_j)}{\sqrt{d}}$

where $W_q \in \mathbb{R}^{d \times d}$ and $W_k \in \mathbb{R}^{d \times d}$ are trainable parameter matrices and $d$ is the embedding dimension.

The resulting vector sequence $H = \{h_1, h_2, \ldots, h_n\}$ undergoes convolution operations with different convolution kernel sizes, and the extracted local semantic features are represented by the following equation:

$C^{(k)} = \mathrm{ReLU}\left(W_c^{(k)} * H + b_c^{(k)}\right)$

where $*$ denotes the convolution operation and $W_c^{(k)}$ and $b_c^{(k)}$ are the convolution kernel weights and bias for window size $k$.
Next, to better fuse the multi-scale local context information to obtain the more effective feature information and ensure that the network depth is increased without network degradation, this paper designs a multi-scale residual convolution network structure, as shown in
Figure 3. Except for the first CNN layer, the input of each CNN layer is a fused feature vector obtained from the input and output of the previous CNN layer after residual concatenation, and finally, the multi-scale feature vectors from the output of each of its CNN layers are concatenated to obtain the output of the local context-wise module.
The final output can be written as $M = C^{(k_1)} \oplus C^{(k_2)} \oplus \cdots \oplus C^{(k_m)}$, where $\oplus$ represents the concatenation operation.
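A compact PyTorch sketch of this module is given below, assuming a single attention window size, query/key projections for the score function, and three convolution kernel sizes; the window size, kernel sizes, and the unfold-based windowing are illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalContextWise(nn.Module):
    """Sketch of the local context-wise module (assumed input shape: batch, seq_len, dim)."""
    def __init__(self, dim=768, window=3, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.window = window
        self.wq = nn.Linear(dim, dim, bias=False)   # projections used by the score function
        self.wk = nn.Linear(dim, dim, bias=False)
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes])

    def forward(self, x):
        b, t, d = x.shape
        pad = self.window // 2
        padded = F.pad(x, (0, 0, pad, pad))                              # pad along the time axis
        # windows[:, i] holds the `window` neighbours of character i
        windows = padded.unfold(1, self.window, 1).permute(0, 1, 3, 2)   # (b, t, window, d)
        scores = torch.einsum("btd,btwd->btw", self.wq(x), self.wk(windows)) / d ** 0.5
        alpha = torch.softmax(scores, dim=-1)                            # attention weights
        h = torch.einsum("btw,btwd->btd", alpha, windows)                # attended local context
        # multi-scale residual convolutions over the attended sequence
        out, scales = h, []
        for conv in self.convs:
            c = torch.relu(conv(out.transpose(1, 2)).transpose(1, 2))[:, :t, :]
            scales.append(c)
            out = out + c                          # residual input to the next CNN layer
        return torch.cat(scales, dim=-1)           # concatenate the multi-scale features

print(LocalContextWise()(torch.randn(2, 10, 768)).shape)  # torch.Size([2, 10, 2304])
```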
3.3. BiGRU Layer
The gated recurrent unit (GRU) and the long short-term memory (LSTM) network are both improvements of the recurrent neural network (RNN). They can effectively alleviate the gradient vanishing and exploding problems faced by traditional RNNs. LSTM uses three gate units: an input gate, a forget gate, and an output gate. In contrast, the structure of GRU is simpler, requires less computation, and trains faster. GRU includes two gate units: an update gate and a reset gate. The update gate replaces the input gate and forget gate of LSTM. The update gate $z_t$ controls how much information $h_{t-1}$ from the previous time step is passed to the current time step $h_t$ and selectively accepts information from the candidate state $\tilde{h}_t$ according to its gate state. The reset gate $r_t$ is responsible for controlling how the candidate state $\tilde{h}_t$ merges the information from the previous time step $h_{t-1}$. The formulae for calculating the various states of a GRU unit are as follows:

$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

where $x_t$ is the input at time step $t$, $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $W$, $U$, and $b$ are trainable parameters,
and the structure of a GRU is shown in
Figure 4.
A unidirectional GRU can only process a text sequence in the forward direction and obtain information from the preceding context while ignoring the following context. However, in text, the preceding and following information are interrelated, so unidirectional processing easily loses important information. To solve this problem, this paper adopts the Bidirectional Gated Recurrent Unit (BiGRU) structure. BiGRU consists of a forward hidden layer and a backward hidden layer, which simultaneously produce two different vector representations of the current input and combine them into the representation at the current time step, thereby better performing deep feature extraction on the text and capturing the dependency relationships between contexts.
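The sketch below first evaluates the GRU update equations above for a single toy step and then shows how a bidirectional GRU in PyTorch concatenates forward and backward hidden states at each position; all sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 4                                   # toy feature size
x_t, h_prev = torch.randn(d), torch.randn(d)
Wz, Uz, bz = torch.randn(d, d), torch.randn(d, d), torch.zeros(d)
Wr, Ur, br = torch.randn(d, d), torch.randn(d, d), torch.zeros(d)
Wh, Uh, bh = torch.randn(d, d), torch.randn(d, d), torch.zeros(d)

z = torch.sigmoid(Wz @ x_t + Uz @ h_prev + bz)            # update gate
r = torch.sigmoid(Wr @ x_t + Ur @ h_prev + br)            # reset gate
h_tilde = torch.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)   # candidate state
h_t = (1 - z) * h_prev + z * h_tilde                      # new hidden state

# A bidirectional GRU concatenates forward and backward states at each time step.
out, _ = nn.GRU(8, 16, batch_first=True, bidirectional=True)(torch.randn(1, 5, 8))
print(h_t.shape, out.shape)            # torch.Size([4]) torch.Size([1, 5, 32])
```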
3.4. Self-Attention Layer
During the feature extraction process, BiGRU uses gating mechanisms to selectively retain or discard context information at the current time step, without explicitly distinguishing the influence of different characters on the current position. For example, in the sentence “Even mild hypoglycemia can potentially cause physical injuries such as falls and fractures in patients, leading to hospitalization and increasing their psychological and economic burden”, the contribution of the word “hypoglycemia” to identifying the entities “falls” and “fractures” is obviously greater than that of the word “mild”. Therefore, we believe that the BiGRU model suffers from attention dispersion during recognition: it does not assign different weights to characters according to their importance, so important and ordinary information are not properly distinguished during encoding.
To address this issue, this paper employs a self-attention mechanism to filter key information from the input text. Similar to the self-attention mechanism in Transformer, it only focuses on the relationships between characters within the input sequence in order to identify the connections between different characters and select the most representative and critical words and phrases. Its calculation formula is:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{\top}}{\sqrt{d_k}}\right)V$

where $Q$ represents the query matrix, $K$ represents the key matrix, $V$ represents the value matrix, and $d_k$ represents the dimension of $Q$ and $K$. The attention mechanism calculates the similarity scores between the query matrix $Q$ and all key matrices $K$. To prevent these scores from becoming too large in high dimensions, they are scaled by dividing them by the scaling factor $\sqrt{d_k}$; this avoids numerical issues that may occur when calculating the softmax function. Next, the softmax function transforms these scores into normalized weights, which are applied to the value matrix $V$ to obtain a weighted vector representation. Finally, these weighted vector representations express the importance of different parts of the input sequence. The calculation process of the self-attention mechanism is shown in
Figure 5.
We take the input feature sequence $X = \{x_1, x_2, \ldots, x_n\}$ as an example to explain the calculation process of the self-attention mechanism. Here, $W^{Q}$, $W^{K}$, and $W^{V}$ are weight matrices initialized for the self-attention mechanism, which are, respectively, multiplied with the input feature vector $x_i$ to obtain the query, key, and value vectors $q_i$, $k_i$, and $v_i$. The attention score $\alpha_{i,j}$ between feature $x_i$ and feature $x_j$ is calculated using Formula (11):

$\alpha_{i,j} = \dfrac{q_i \cdot k_j}{\sqrt{d_k}} \qquad (11)$

Finally, the correlation $b_i$ between $x_i$ and the other features is obtained using Formula (12):

$b_i = \sum_{j} \mathrm{softmax}(\alpha_{i,j})\, v_j \qquad (12)$
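A minimal sketch of this computation is shown below, applying randomly initialized $W^Q$, $W^K$, and $W^V$ projections to a toy feature sequence and combining Formulas (11) and (12); the dimensions are arbitrary.

```python
import torch

torch.manual_seed(0)
T, d_model, d_k = 6, 16, 8
X = torch.randn(T, d_model)                       # BiGRU output features, one per character
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v               # query / key / value matrices
scores = Q @ K.T / d_k ** 0.5                     # alpha_ij = q_i . k_j / sqrt(d_k)   (Formula 11)
weights = torch.softmax(scores, dim=-1)           # normalize over all characters
B = weights @ V                                   # b_i = sum_j softmax(alpha_ij) v_j  (Formula 12)
print(B.shape)                                    # torch.Size([6, 8])
```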
3.5. CRF Layer
CRF is a discriminative model based on conditional probability, which models the conditional probability distribution of a set of output random variables given the input variables. Because CRF can fully consider the relationships between contextual labels, it is widely used in sequence labeling tasks. In particular, in named entity recognition, there are strong constraint relationships between labels, and CRF can obtain the globally optimal label sequence. The probability score of output label sequence $y = (y_1, y_2, \ldots, y_n)$ given input sequence $X = (x_1, x_2, \ldots, x_n)$ is shown in Formula (13):

$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \qquad (13)$

where $A$ is the transition matrix, $A_{y_i, y_{i+1}}$ represents the probability of transitioning from label $y_i$ to label $y_{i+1}$, and $P_{i, y_i}$ represents the probability of labeling the $i$-th character of input sequence $X$ with label $y_i$. The conditional probability distribution of sequence $y$ is shown in Formula (14):

$P(y \mid X) = \dfrac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}} \qquad (14)$

where $Y_X$ denotes the set of all possible label sequences for $X$. In the training process of the CRF, the maximum likelihood method is used to maximize the log-probability of the correct label sequence $y^{*}$, as shown in Formula (15):

$\log P(y^{*} \mid X) = s(X, y^{*}) - \log\!\left(\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}\right) \qquad (15)$

The highest-scoring label sequence obtained using the Viterbi algorithm is the globally optimal result output by the CRF, as shown in Formula (16):

$y^{*} = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y}) \qquad (16)$
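The following sketch illustrates the path score of Formula (13) and Viterbi decoding of Formula (16) with toy emission and transition matrices; start/end transitions and the log-partition term needed for training (Formulas (14) and (15)) are omitted for brevity.

```python
import torch

torch.manual_seed(0)
T, num_tags = 5, 4
emissions = torch.randn(T, num_tags)               # P[i, y_i]: per-character tag scores
transitions = torch.randn(num_tags, num_tags)      # A[y_i, y_{i+1}]: tag transition scores

def sequence_score(emissions, transitions, tags):
    # Formula (13): sum of emission scores plus transition scores along a tag path.
    score = emissions[list(range(len(tags))), tags].sum()
    score = score + sum(transitions[tags[i], tags[i + 1]] for i in range(len(tags) - 1))
    return score

def viterbi_decode(emissions, transitions):
    # Formula (16): dynamic programming over the highest-scoring tag path.
    score = emissions[0]
    backpointers = []
    for t in range(1, emissions.size(0)):
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)          # best previous tag for each current tag
        backpointers.append(best_prev)
    best_path = [int(score.argmax())]
    for bp in reversed(backpointers):                # trace the best path backwards
        best_path.insert(0, int(bp[best_path[0]]))
    return best_path

path = viterbi_decode(emissions, transitions)
print(path, float(sequence_score(emissions, transitions, path)))
```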
4. Experiments
This section introduces the related experimental processes, including the construction of the dataset, annotation strategy, experimental environment, evaluation metrics, and finally, the analysis and comparison of the results and models.
4.1. Dataset Construction and Strategy
In the existing Chinese named entity recognition datasets, medical datasets specific to diabetes are relatively scarce. Therefore, we are committed to creating a named entity recognition dataset specifically for the diabetes field. The data for this experiment mainly come from domestic public medical and health websites and related encyclopedia websites. Websites such as “XYWY.COM” and “QIUYI.CN” were used to collect 9000 data records, and we preprocessed the raw data to facilitate text recognition in subsequent work. In our experiments, text preprocessing mainly includes two stages: data cleaning and sentence segmentation. Data cleaning involves removing meaningless characters, normalizing text formats, handling missing data, and splitting long sentences to obtain text that can be labeled with entities; its goal is to convert the continuous raw text into a sequence suitable for annotation that contains only words, punctuation marks, numbers, and spaces. Sentence segmentation then identifies sentence boundaries using punctuation marks such as periods and divides the cleaned text into standardized sentences.
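A simplified sketch of these two preprocessing stages is shown below; the character whitelist and the set of sentence-ending punctuation marks are assumptions chosen for illustration rather than the exact rules used to build the corpus.

```python
import re

def clean_text(raw: str) -> str:
    # Remove meaningless characters and collapse whitespace; keep Chinese text,
    # common punctuation, letters, and digits needed for entity annotation.
    text = re.sub(r"<[^>]+>", "", raw)                                   # strip residual HTML tags
    text = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9，。；：、？！%．./\-\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text: str):
    # Use sentence-ending punctuation (。！？；) as sentence boundaries.
    parts = re.split(r"(?<=[。！？；])", text)
    return [p.strip() for p in parts if p.strip()]

raw = "<p>糖尿病患者应定期监测血糖。  空腹血糖正常值为3.9-6.1mmol/L！</p>"
for s in split_sentences(clean_text(raw)):
    print(s)
```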
The data were annotated using the BIOES annotation mode, where B stands for the beginning of the entity, I stands for inside the entity, E stands for the end of the entity, S stands for a single character that is itself an entity, and O stands for outside the entity, indicating that it does not belong to any type. An example of the BIOES format is shown in
Figure 6. Finally, the dataset was divided into training, validation, and test sets in a ratio of 8:1:1. The number distribution of all types of entities in the training, validation, and test sets is shown in
Table 1. For the classification of medical entities, they were divided into examination indicators, drug names, adverse reactions, anatomy, operation, and diseases. Some detailed examples of the DNER dataset are shown in
Figure 7.
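As an illustration of the BIOES scheme, the toy function below converts annotated character-level entity spans into tags; the example sentence, span indices, and entity type label are hypothetical and are not taken from the dataset.

```python
def bioes_tags(chars, spans):
    """spans: list of (start, end_inclusive, entity_type) over character indices."""
    tags = ["O"] * len(chars)
    for start, end, etype in spans:
        if start == end:
            tags[start] = f"S-{etype}"          # single-character entity
        else:
            tags[start] = f"B-{etype}"          # beginning of the entity
            tags[end] = f"E-{etype}"            # end of the entity
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"          # inside the entity
    return tags

chars = list("二甲双胍可降低血糖")
# hypothetical annotation: "二甲双胍" (metformin) labeled as a drug name entity
print(list(zip(chars, bioes_tags(chars, [(0, 3, "Drug")]))))
```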
In order to verify the generalization ability of the model, this article also conducted experiments on two public datasets: the diabetes dataset provided by Ruijin Hospital and released through the Alibaba Cloud Tianchi Laboratory, and CLUENER2020 [
43].
The Ruijin Hospital diabetes dataset is a resource containing clinical data, sourced from a well-known Chinese diabetes research journal, and spanning over seven years. It includes the most extensive research and hot topics in the field of diabetes. This dataset includes 15 medical entities, such as examination methods, etiology, clinical manifestations, drug administration methods, and locations, among others. The release of this dataset can provide important reference information for researchers and medical workers in the field of diabetes and has a positive significance for promoting the prevention, treatment, and management of diabetes.
The CLUENER2020 dataset is a fine-grained Chinese named entity recognition (NER) dataset that includes ten entity types: address, book, company, game, government, movie, name, organization, position, and scene. This dataset covers scenarios in which the same entity may belong to different categories. Compared to other Chinese NER datasets such as People’s Daily and Weibo, the CLUENER2020 dataset is more challenging and better reflects real-world scenarios.
4.2. Evaluation Metrics
This experiment uses three evaluation metrics commonly used in named entity recognition, namely precision ($P$), recall ($R$), and F1 score ($F1$), to evaluate the performance of the model. The specific formulae are as follows:

$P = \dfrac{TP}{TP + FP}$
$R = \dfrac{TP}{TP + FN}$
$F1 = \dfrac{2 \times P \times R}{P + R}$

where $TP$ represents true positives, which is the number of positive samples predicted as positive; $FP$ represents false positives, which is the number of negative samples predicted as positive; and $FN$ represents false negatives, which is the number of positive samples predicted as negative.
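A small sketch of computing these entity-level metrics from predicted and gold entity sets is given below, assuming exact matching on span boundaries and entity type; the example spans are illustrative.

```python
def prf1(pred_entities, gold_entities):
    pred, gold = set(pred_entities), set(gold_entities)
    tp = len(pred & gold)                      # correctly predicted entities
    fp = len(pred - gold)                      # predicted entities not in the gold annotation
    fn = len(gold - pred)                      # gold entities that were missed
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [(0, 3, "Drug"), (7, 8, "Disease")]
pred = [(0, 3, "Drug"), (5, 6, "Disease")]
print(prf1(pred, gold))   # (0.5, 0.5, 0.5)
```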
4.3. Experimental Environment
All experiments in this article were conducted on a Linux system, using the development environment of Python 3.6 and the development framework of PyTorch 1.6.0. The optimizer used was AdamW, and the training was conducted on a single RTX A4000 GPU.
4.4. Experimental Results and Analysis
In order to verify the performance of the proposed model for named entity recognition in the field of diabetes, a comparative experiment has been conducted with six mainstream models on the private diabetes dataset. The experimental results are shown in
Table 2.
The BiLSTM-CRF model is our baseline method, which is widely used in named entity recognition tasks. The BERT-CRF model utilizes the pre-trained BERT model to learn features of character sequences; the obtained sequence scores are then fed into a CRF decoder to generate the entity label sequence. The BERT-BiLSTM-CRF model extends the BiLSTM-CRF model by incorporating the pre-trained BERT model. Its recognition performance improves significantly over BiLSTM-CRF and BERT-CRF, mainly because BERT can consider contextual information but ignores the sequential dependencies between words; combining BERT with BiLSTM therefore learns the sequential relationships among the observations and performs better than BiLSTM-CRF and BERT-CRF. Compared with BiGRU, BiLSTM has slightly more parameters and runs somewhat more slowly, although this has only a small impact on accuracy. BERT-BiLSTM-IDCNN-CRF adds an IDCNN layer on top of BERT-BiLSTM-CRF to obtain local information from sentences, alleviating the deficiency of BERT-BiLSTM-CRF in considering only global information. The RMBC model proposed herein takes full account of both local and global feature extraction: it relies on the local context-wise module to extract local information from sentences, providing better local feature extraction than the IDCNN layer, and adds a self-attention mechanism on top of the BiGRU layer to solve the problem of unreasonable character weight allocation in BiGRU.
Table 2 shows that the model proposed herein is superior to the various comparison models in terms of F1 score. Compared with the baseline model BiLSTM-CRF, the F1 score increases by 14.79%; compared with the other models, it also improves to different degrees, suggesting that the proposed model achieves good recognition performance in the field of diabetes medicine.
The recognition results on the Ruijin diabetes dataset are shown in
Table 3. In the Ruijin Hospital diabetes dataset, the entity “hypoglycemia” in the sentence “For patients who are older, have had diabetes for a long time, and have high-risk factors for hypoglycemia, the HbA1c target should be controlled to <7.5% or <8.0%.” is recognized as a disease, which conforms to its entity category in that sentence. Yet the entity “hypoglycemia” belongs to multiple categories; for example, in the sentence “Insulin secretagogues also have potential adverse reactions of hypoglycemia and weight gain, so it is not recommended to use them in combination with insulin other than basal insulin.”, the entity “hypoglycemia” belongs to the adverse reaction category. For difficult-to-recognize entities such as these, the RMBC model proposed in this paper can fully consider the semantic relationship between contexts by combining multi-scale local features and can overcome the limitation of long-distance dependencies. Compared with other models, it shows a significant performance improvement in dealing with difficult-to-recognize entities. The experimental results further verify that the model proposed in this paper has excellent recognition performance on the public diabetes dataset.
In order to verify the generalization ability of RMBC, we also performed comparisons on the Chinese fine-grained dataset CLUENER2020, which contains high-quality, accurately annotated Chinese texts from ten different news domains. Based on the recognition results in
Table 4, the RMBC model proposed in this paper shows a significant performance improvement over other mainstream models. Our model achieves an F1 score of 81.79% on the CLUENER2020 dataset, which is 2.97% higher than the public baseline model BERT-CRF and 11.79% higher than the baseline model BiLSTM-CRF used in this paper. This further validates that the proposed model can effectively capture both local and global information through the multi-scale local context-wise module and the combination of the self-attention mechanism with BiGRU, better distinguishing the categories of different entities in different scenarios. The experimental results indicate that the proposed RMBC model not only achieves excellent recognition performance in the field of diabetes medicine but also generalizes well to datasets from other fields.
4.5. Ablation Study
In order to verify the effectiveness of the local context-wise module, this paper conducts an ablation study. Model1 denotes the RoBERTa-BiGRU-CRF model without the local context-wise module. Model2 denotes the local context-wise module with multi-scale residual convolution only, without the multi-window attention. Model3 denotes the local context-wise module with single-scale convolution only.
The results of the ablation study are shown in
Table 5; the F1 scores of Model1, which omits the local context-wise module, decrease by 3.7%, 3.83%, and 0.95% on the three datasets, respectively. Moreover, the F1 scores of Model2 and Model3, which add parts of the local context-wise module, are higher than those of Model1 on all three datasets, yet still lower than those of RMBC. This indicates that each part of the proposed local context-wise module improves recognition performance: the multi-window attention effectively captures the important semantic components of local features, and the multi-scale residual convolution fully integrates context information at different scales, reflecting the importance of both components within the module. This proves that the local context-wise module can effectively capture the semantic information of the local context and plays a crucial role in the recognition process.