4.1. Dataset Description
In this experiment, we used widely recognized publicly available Chinese NER datasets, including Resume, Weibo, CMeEE, and CLUENER2020. The entity types, along with the number of entities in the training, test, and validation sets for each dataset, are presented in Table 1.
The Resume dataset, collected by Sina Finance, consists of resumes of executives from Chinese publicly listed companies. The text in this dataset is relatively formal. The Weibo dataset is a corpus of Weibo posts annotated for NER, characterized by a more colloquial tone and a significant amount of slang. In this dataset, NOM refers to generic references, while NAM indicates specific references. The CMeEE dataset is derived from the Chinese Medical Information Processing Challenge, focusing on medical NER and featuring some nested entities. Finally, the CLUENER2020 dataset, based on Tsinghua University’s open-source THUCTC text classification dataset, includes a diverse range of entity types, making it well-suited for fine-grained NER tasks.
As shown in Table 2, examples of the data formats used in two of the datasets for this experiment are provided. In the Weibo example, “老师” is used as a generic reference to a person (PER.NOM), while “王晶” and “刘掌门” are specific references to individuals (PER.NAM). The CMeEE example includes nested entities, with the outer entity “明显的小管萎缩和间质炎症” classified as “sym (symptom)”, while the inner entities “小管” and “间质” are classified as “bod (body)”. Nested entities more accurately reflect complex scenarios in real-world applications, making it more challenging for the model to handle multi-level and overlapping information.
4.2. Evaluation Metrics and Experimental Setup
The evaluation metrics for NER models primarily include Precision, Recall, and F1-score, which are used to assess the model’s performance in entity recognition. These metrics are defined as follows:
Precision (P) is the ratio of True Positives (TP) to the total predicted positives (TP + FP).
Recall (R) is the ratio of True Positives (TP) to the total actual positives (TP + FN).
F1-score (F1) is the harmonic mean of Precision and Recall.
TP represents the number of entities correctly predicted by the model, FP represents the number of non-entities incorrectly predicted as entities, and FN represents the number of entities incorrectly predicted as non-entities. These metrics help assess the model’s performance and effectiveness.
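Written out explicitly, the three metrics follow the standard definitions consistent with the descriptions above:

```latex
P   = \frac{TP}{TP + FP}, \qquad
R   = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```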
The experiments were conducted on a single NVIDIA GeForce RTX 4090 GPU (Manufacturer: Colorful; City: Shenzhen; Country: China) with 24 GB of memory. The software environment consisted of PyTorch 1.12.0 + cu116, Python 3.8.18, and Transformers 4.3.2. The maximum input text length was set to 256 tokens.
To ensure that performance differences arise primarily from the characteristics of the datasets and models themselves, rather than from parameter tuning, we kept the parameters consistent across all experiments. The detailed parameter settings are provided in Table 3.
4.3. Comparative Experiments and Analysis
As shown in Table 4, we use bold values to represent the highest values for the corresponding metrics in the table. To verify the effectiveness of the proposed WP-XAG model in Chinese NER tasks, we conducted extensive comparative experiments using the aforementioned four public Chinese datasets. These datasets cover a wide range of domains and text types, providing a comprehensive evaluation of the model’s performance and generalization capabilities across different scenarios. For comparison, we selected several models commonly used in NER due to their established performance in Chinese NER tasks. This selection allows for a robust assessment of WP-XAG’s capabilities. The baseline and comparison models used are listed below:
BiLSTM-CRF: A widely used NER baseline in which a BiLSTM captures contextual information and outputs, for each word, the probability of each label, while the CRF component ensures the legality of the predicted entity label sequence, thereby improving the accuracy of entity boundary detection (an illustrative sketch of this baseline is provided below).
BERT-CRF: Combines BERT’s deep context understanding with CRF’s sequence modeling for enhanced NER, offering improved accuracy and handling of complex entities.
RBC: Combines the RoBERTa model to extract deep semantic features, BiLSTM to learn sequence context information, and CRF for label decoding to achieve fine-grained entity recognition.
BIC: An integrated model combining BERT-wwm, IDCNN, and CRF in a composite neural network for Chinese NER.
RIC: Integrates a pre-trained language model, an improved deep convolutional network IDCNN, and CRF sequence tagging technology to achieve efficient NER.
RWGNN [30]: This model incorporates lexical information and uses a random graph algorithm to automatically generate multi-directional connection patterns, overcoming the limitations of manually designed graph structures. It also introduces enhanced word embeddings by combining reconstructed word information with the original word data to capture global dependencies.
LB-BMBC [22]: Leverages the word-learning capabilities of pre-trained models, observes spatial relationships between adjacent spans in span-based recognition, and uses CNN to model these relationships.
W2NER [31]: A state-of-the-art (SOTA) model that uses BERT and BiLSTM as input encoders and employs multi-granularity 2D convolution to refine word-pair representations. This method treats NER as a word-to-word relationship classification problem, replacing both sequence labeling and span-based methods.
Among these models, BiLSTM-CRF, BERT-CRF, RBC, BIC, and RIC are based on sequence labeling methods. LB-BMBC is a span-based method, while W2NER is a SOTA model proposing a new word-to-word relationship classification scheme. Additionally, RWGNN introduces an innovative approach by designing a unique random graph structure and enhanced word embeddings, offering new insights into NER architecture.
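As referenced in the BiLSTM-CRF description above, the following is a minimal, illustrative PyTorch sketch of such a tagger, using the third-party pytorch-crf package for the CRF layer; the module names and dimensions are assumptions for illustration rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    """Illustrative BiLSTM-CRF tagger: embeddings -> BiLSTM -> per-tag emissions -> CRF."""
    def __init__(self, vocab_size, num_tags, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim // 2, batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(hidden_dim, num_tags)  # per-token label scores
        self.crf = CRF(num_tags, batch_first=True)        # enforces valid tag transitions

    def forward(self, token_ids, tags=None, mask=None):
        h, _ = self.bilstm(self.embedding(token_ids))
        scores = self.emissions(h)
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence under the CRF
            return -self.crf(scores, tags, mask=mask, reduction="mean")
        # Inference: Viterbi decoding of the best tag sequence
        return self.crf.decode(scores, mask=mask)
```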
From Table 4, it is evident that for the Resume dataset, which features relatively uniform text formats and structured content, models tend to perform well. This is because the standardized nature of the data minimizes the complexity and noise introduced by unstructured text, thereby reducing the chances of error. The WP-XAG model proposed in this paper leverages ADMHA, which allows dynamic adjustment of attention weights based on contextual information. This flexibility enables the model to capture information at different levels and granularities more effectively, leading to superior performance in comparison to other models.
The BIOES tagging scheme, which clearly distinguishes between the beginning, inside, end, and singleton entities, provides finer-grained labeling information, allowing models to capture entity boundaries more accurately. This is especially relevant for datasets like Weibo, which feature colloquial language, ambiguous words, and a higher level of noise. In such cases, fine-grained labeling is crucial to reducing decoding errors. Even when compared to models like BiLSTM-CRF, BERT-CRF, RBC, BIC, and RIC, which use the BIOES scheme for more detailed labeling, WP-XAG demonstrates a distinct performance advantage. WP-XAG employs WoBERT to mitigate the effects of polysemy, providing richer semantic representations. The incorporation of adversarial training improves the model’s robustness to noisy data, while the feature fusion layer further refines disambiguated semantic information. Additionally, GlobalPointer improves the model’s ability to capture global information, resulting in superior performance on the Weibo dataset.
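To make the scheme concrete, the following constructed example (not taken from the dataset) shows how a Weibo-style sentence would be labeled character by character under BIOES; the tag names mirror the Weibo annotation style used above.

```python
# Illustrative BIOES tagging (B = begin, I = inside, O = outside, E = end, S = single).
chars = ["王", "晶", "说", "我", "在", "京"]
tags  = ["B-PER.NAM", "E-PER.NAM", "O", "O", "O", "S-LOC.NAM"]
# "王晶" is a two-character specific person name, so its boundary is marked by B/E;
# the single-character location is tagged S; all non-entity characters receive O.
```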
In the nested entity dataset CMeEE and the fine-grained dataset CLUENER2020, capturing contextual dependencies and appropriately allocating attention becomes even more critical. For nested entities, entity boundaries are not solely determined by local information but also depend on the context of the entire sentence. In WP-XAG, the XLSTM effectively captures longer contextual dependencies, while the adaptive weights in ADMHA dynamically adjust the attention distribution across different heads, improving the model’s ability to focus on various types of entities. Additionally, GlobalPointer captures the positional relationships among entities throughout the text, allowing the model to identify longer-span or more complex entity structures efficiently. The complementary strengths of these modules enable WP-XAG to achieve strong performance across datasets, further demonstrating its effectiveness in NER tasks.
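For reference, the span-scoring idea behind GlobalPointer can be sketched as follows; this is a simplified, illustrative PyTorch fragment for a single entity type, omitting the RoPE injection and the exact masking details of the implementation used in WP-XAG.

```python
import torch
import torch.nn as nn

class GlobalPointerHead(nn.Module):
    """Simplified GlobalPointer-style span scorer for one entity type:
    every candidate span (i, j) receives a score q_i . k_j, so nested and
    overlapping entities can be read directly off the score matrix."""
    def __init__(self, hidden_size, head_size=64):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, head_size)
        self.k_proj = nn.Linear(hidden_size, head_size)

    def forward(self, token_states):                 # (batch, seq_len, hidden_size)
        q = self.q_proj(token_states)                # start representations
        k = self.k_proj(token_states)                # end representations
        scores = torch.einsum("bih,bjh->bij", q, k)  # score for span (i, j)
        # Keep only spans with start <= end; mask out the lower triangle.
        seq_len = scores.size(1)
        tri_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                         device=scores.device))
        return scores.masked_fill(~tri_mask, float("-inf"))
```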
When compared to the SOTA model W2NER, WP-XAG also shows performance advantages across various datasets: F1-scores improve by 0.24% on Resume, 2.66% on Weibo, 0.58% on CMeEE, and 0.45% on CLUENER2020. Compared with the span-based model LB-BMBC, WP-XAG likewise achieves clear improvements, outperforming it on Resume and Weibo with F1-score gains of 0.60% and 6.25%, respectively.
In summary, the proposed WP-XAG model for Chinese NER, based on multi-level representation learning, effectively addresses the issue of polysemy, enhances model robustness, and improves its ability to capture long-range dependencies. Its performance on Resume, Weibo, and the nested entity dataset CMeEE, as well as the fine-grained dataset CLUENER2020, surpasses that of the baseline model BiLSTM-CRF and the seven other comparison models.
4.4. Ablation Study
As shown in Table 5, an ablation study was conducted to further verify the significance of each module in the WP-XAG model and its contribution to performance improvement. The study used a stepwise removal strategy across the four publicly available datasets used in the previous experiments. This approach helps to explore how each module affects the model’s performance across different types of datasets. We use bold values to represent the highest values for the corresponding metrics in the table.
The details of the ablation study are as follows:
- PGD: The PGD applied to the embedding layer is removed.
- XLSTM: The XLSTM network in the WP-XAG model is removed.
- ADMHA: The ADMHA mechanism is removed.
- RoPE: The RoPE in GlobalPointer is removed.
From the results of the ablation experiments shown in Table 5, we can observe the following:
After removing the PGD module, the F1-score on the Resume dataset decreased by 1.42%, indicating that PGD adversarial training enhances model robustness even in the more structured text of the Resume dataset. On the Weibo dataset, the F1-score dropped by 3.66%, which can be attributed to the more colloquial and polysemous nature of Weibo texts. PGD helps the model adapt to these polysemous words and irregular language structures, and the significant drop in F1 after its removal demonstrates PGD’s effectiveness in handling noise and complex semantics. For the CMeEE dataset, the F1-score decreased by 1.37%, suggesting that while nested entities in medical texts are complex, PGD still improves the model’s handling of noise and challenging annotations. On the CLUENER2020 dataset, the F1-score dropped by 3.98%, illustrating that PGD contributes significantly to the recognition of complex entities in fine-grained NER tasks.
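For context, PGD adversarial training on the embedding layer is typically implemented as an iterative gradient-based perturbation projected back into an epsilon ball. The sketch below is a minimal illustration under standard assumptions (parameter names, step sizes, and the exact integration into WP-XAG training are illustrative, not the precise configuration used here).

```python
import torch

class PGDPerturbation:
    """Minimal PGD helper: iteratively perturbs the word-embedding matrix along
    its gradient and projects the accumulated perturbation into an L2 ball."""
    def __init__(self, model, emb_name="word_embeddings", epsilon=1.0, alpha=0.3):
        self.model, self.emb_name = model, emb_name
        self.epsilon, self.alpha = epsilon, alpha
        self.backup = {}

    def attack(self, first_step=False):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                if first_step:
                    self.backup[name] = param.data.clone()   # save clean embeddings
                step = self.alpha * param.grad / (param.grad.norm() + 1e-12)
                param.data.add_(step)
                # Project the total perturbation back onto the epsilon ball.
                delta = param.data - self.backup[name]
                if delta.norm() > self.epsilon:
                    delta = self.epsilon * delta / delta.norm()
                param.data = self.backup[name] + delta

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}
```

In a typical training loop, the clean loss is backpropagated first, `attack` plus an additional backward pass is repeated for a small number of steps, and `restore` returns the embeddings to their clean values before the optimizer update.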
When the XLSTM module was removed, the F1-score on the Resume dataset dropped by 1.62%, showing that XLSTM contributes to capturing global semantics and long-range dependencies. On the Weibo dataset, the F1-score fell by 3.82%. Weibo contains not only complex contexts, but also many polysemous words, and XLSTM’s enhanced memory structure helps the model better distinguish between different meanings of the same word in such intricate contexts. The removal of XLSTM weakened the model’s ability to handle these complex dependencies and polysemy. On the CMeEE dataset, the F1-score dropped by 1.88%, indicating that XLSTM aids in the modeling of nested entities. The F1-score on CLUENER2020 decreased by 4.62%, demonstrating that XLSTM’s multi-layer sequence modeling capability plays an important role in capturing subtle variations in fine-grained datasets.
After removing the ADMHA module, the F1-score on the Resume dataset dropped by 1.67%. ADMHA helps capture semantic information in more structured texts. On the Weibo dataset, the F1-score decreased by 3.39%, with ADMHA playing a relatively important role in processing polysemous words and complex semantics in Weibo texts. By employing adaptive weighting mechanisms, the model can better understand the different meanings of polysemous words in varying contexts. Removing this module diminished the model’s ability to differentiate between such words. On the CMeEE dataset, the F1-score dropped by 1.65%, indicating that ADMHA helps in modeling the complex semantic relationships in medical texts. The F1-score on CLUENER2020 decreased by 2.97%, indicating that ADMHA contributes to handling the complex relationships between fine-grained entities.
After the removal of the RoPE module, the F1-score on the Resume dataset decreased by 2.01%. RoPE helps the model capture positional relationships between entities more effectively, still proving useful in structured texts. On the Weibo dataset, the F1-score dropped by 6.19%, as polysemous words appear frequently in Weibo, and RoPE’s rotational position encoding enhances the model’s ability to recognize these words in different positions and contexts. Removing RoPE significantly weakened the model’s handling of polysemy. On the CMeEE dataset, the F1-score dropped by 2.36%, showing that RoPE contributes to handling the positional relationships of nested entities. On CLUENER2020, the F1-score decreased by 4.5%, illustrating that RoPE contributes significantly to recognizing complex positional relationships in fine-grained NER tasks.
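For clarity, rotary position embedding applies a position-dependent rotation to paired feature dimensions of the start and end vectors before their dot product, so that relative position enters the span scores. The sketch below follows the standard RoPE formulation (even feature dimension assumed), not the exact code used in our implementation.

```python
import torch

def apply_rope(x):
    """Apply rotary position embedding to x of shape (batch, seq_len, dim).
    Feature dimensions are grouped into pairs and rotated by an angle that grows
    with token position, so the dot product q_i . k_j depends on (i - j)."""
    batch, seq_len, dim = x.shape
    half = dim // 2
    # Frequencies for each feature pair (standard 10000^(-2t/dim) schedule).
    freqs = 10000 ** (-torch.arange(0, half, dtype=torch.float32) * 2 / dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    sin, cos = angles.sin(), angles.cos()          # (seq_len, half)
    x1, x2 = x[..., 0::2], x[..., 1::2]            # paired dimensions
    rot1 = x1 * cos - x2 * sin                     # 2D rotation of each pair
    rot2 = x1 * sin + x2 * cos
    return torch.stack([rot1, rot2], dim=-1).reshape(batch, seq_len, dim)
```

In a GlobalPointer-style scorer, the rotation is applied to both the start (q) and end (k) representations before computing the span score matrix.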
In summary, the WP-XAG model leverages multi-level representation learning, incorporating the advantages of each module, thus enhancing the model’s understanding of complex semantics, contexts, and positional relationships, ultimately improving its entity recognition performance. The performance degradation after the removal of any module highlights the effectiveness of each component.
4.5. Iterative Comparison Experiments
In this section, we conduct experiments using the Weibo and CMeEE datasets. The Weibo dataset is characterized by its frequent use of popular terms, polysemous words, and colloquial language, making it challenging for many models to effectively learn features in the early stages of training. By utilizing the Weibo dataset, we can better assess the capability of different models in handling word ambiguity and capturing complex information. The CMeEE dataset is derived from the Chinese Biomedical Language Understanding Evaluation (CBLUE) and contains a large number of nested entities, adding to its complexity. By experimenting with the CMeEE dataset, we aim to evaluate the models’ performance in processing technical terms and recognizing complex entity relationships.
As shown in Figure 5a, the non-standard sentence structures and irregular formatting in the Weibo dataset make it difficult for models to capture sentence structure and dependencies, leading to consistently lower performance for the BiLSTM-CRF model. While the RBC model achieves better results, it still falls short compared to WP-XAG, possibly due to insufficient robustness in handling the noise and complexity present in the Weibo dataset. The BIC model’s F1-score exhibits instability during training, with significant fluctuations, and at later stages, the F1-score even drops to zero within certain epochs. This suggests that BIC struggles with capturing and processing complex features and is particularly susceptible to the dataset’s noise. In contrast, WP-XAG achieves an F1-score above 60% by the second epoch and stabilizes at around 70%, indicating that it quickly learns effective features early in training and is more adept at handling the complex information within the Weibo dataset.
From the results displayed in Figure 5b, WP-XAG also performs exceptionally well on the CMeEE dataset compared to other models. It reaches a high F1-score early in training and maintains stability throughout, outperforming other models. The BiLSTM-CRF model struggles with complex medical terms and nested entities, resulting in a lower F1-score. Although the RBC model can achieve a high F1-score, it performs poorly in the later stages of training, suggesting that it may have learned noise and details from the training data, leading to suboptimal performance on unseen data. The BIC model also shows instability when handling complex features, indicating that it has certain limitations in modeling complex medical texts. Overall, WP-XAG demonstrates superior performance in capturing and processing complex medical information, outperforming other models across the board.
As shown in Figure 5, although our model has achieved better results compared to other models, further efforts are needed to reach a higher F1-score (above 90%). This may require more comprehensive solutions to address the misclassification caused by complex sequences and nested entities, as well as difficulties in recognizing polysemous words in context. Additionally, when dealing with complex medical terminology, the model sometimes struggles to accurately identify proper nouns, leading to inaccuracies. By systematically analyzing these errors, we can gain insights for future improvements, such as integrating external dictionaries to enhance the recognition of proper nouns and improving the model’s handling of polysemy.
4.6. Performance Analysis of ADMHA
In this section, we designed comparative experiments to evaluate the performance of ADMHA compared to MHA and to explore the optimization effect of ADMHA with varying numbers of attention heads. In the experiment comparing ADMHA and MHA, all other modules and parameters of the WP-XAG model were kept constant, with only ADMHA replaced by MHA; the results are shown in Table 6, where bold values indicate the highest value for the corresponding metric.
Furthermore, we conducted experiments to assess the model’s performance with different numbers of attention heads (1, 2, 4, 8, and 16) across various datasets. The results of these experiments are detailed in Table 7. Similarly, we use bold values to represent the highest values for the corresponding metrics in the table.
As shown in Table 6, by comparing the performance of ADMHA and MHA across the four datasets, it is evident that ADMHA demonstrates a clear advantage on all datasets. First, in the more structured Resume dataset, ADMHA improves the F1-score by 1.86% compared to MHA. This shows that ADMHA, through its adaptive weighting mechanism, dynamically adjusts the importance of different attention heads, enhancing the model’s ability to capture fine-grained entities. Second, in the Weibo dataset, which contains a large number of polysemous words and noise, the F1-score increases by 3.49%. This indicates that ADMHA’s dynamic weighting mechanism allows it to better handle complex contexts and the semantic variations of polysemous words. In the CMeEE dataset, the F1-score improves by 1.26%, as ADMHA can more flexibly model the complex dependencies between nested entities. Finally, in the CLUENER2020 dataset, which contains a rich set of fine-grained entities, ADMHA improves the F1-score by 2.05%, demonstrating its superior flexibility in modeling complex entity relationships compared to MHA. ADMHA’s adaptive attention mechanism makes the model more adaptable and flexible when addressing complex dependencies, fine-grained entities, and polysemy issues.
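One way to realize such adaptive head weighting, sketched here only to make the idea concrete (the exact ADMHA formulation may differ), is to compute an input-dependent gate over the heads and rescale each head's output before the final projection:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMultiHeadAttention(nn.Module):
    """Illustrative adaptive multi-head self-attention: each head's output is
    rescaled by an input-dependent weight before the output projection, so the
    model can emphasize different heads for different inputs."""
    def __init__(self, hidden_size, num_heads=4):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads, self.head_dim = num_heads, hidden_size // num_heads
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.out = nn.Linear(hidden_size, hidden_size)
        # Gate network: pooled sentence representation -> one weight per head.
        self.gate = nn.Sequential(nn.Linear(hidden_size, num_heads), nn.Softmax(dim=-1))

    def forward(self, x):                                # (batch, seq_len, hidden)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, num_heads, seq_len, head_dim)
        shape = (b, s, self.num_heads, self.head_dim)
        q, k, v = (t.view(*shape).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        heads = attn @ v                                 # per-head context vectors
        # Adaptive weights, one per head, conditioned on the pooled input
        # (scaled so the average weight stays close to 1).
        w = self.gate(x.mean(dim=1)) * self.num_heads    # (batch, num_heads)
        heads = heads * w[:, :, None, None]              # rescale each head
        merged = heads.transpose(1, 2).reshape(b, s, -1)
        return self.out(merged)
```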
As shown in Table 7, the F1-score increases across all datasets as the number of attention heads rises from 1 to 4. This is likely because ADMHA dynamically adjusts the importance of each attention head based on the input data. With more heads, the model can learn diverse semantic features from various subspaces. The adaptive weight mechanism ensures that the model effectively utilizes these diverse features, avoiding the limitations of fixed weight allocation. At four attention heads, ADMHA strikes a balance between capturing both local and global information while avoiding issues like information redundancy or unnecessary computational resource consumption.
However, as the number of attention heads increases beyond four, the F1-score begins to decrease. This suggests that despite ADMHA’s ability to adaptively adjust the weights of each head, having too many heads can lead to information redundancy. Even with the adaptive mechanism, some heads may fail to capture useful new information, potentially introducing noise instead. Moreover, an excessive number of attention heads may result in an over-parameterized model, which could negatively impact training performance. In future work, dynamically identifying attention heads that contribute little or contain redundant information during training—and reducing their weights, or even pruning them during inference—could improve the model’s efficiency.