4.1. Dataset Description
In this experiment, we used widely recognized publicly available Chinese NER datasets, including Resume, Weibo, CMeEE, and CLUENER2020. The entity types, along with the number of entities in the training, test, and validation sets for each dataset, are presented in Table 1.
The Resume dataset, collected by Sina Finance, consists of resumes of executives from Chinese publicly listed companies. The text in this dataset is relatively formal. The Weibo dataset is a corpus of Weibo posts annotated for NER, characterized by a more colloquial tone and a significant amount of slang. In this dataset, NOM refers to generic references, while NAM indicates specific references. The CMeEE dataset is derived from the Chinese Medical Information Processing Challenge, focusing on medical NER and featuring some nested entities. Finally, the CLUENER2020 dataset, based on Tsinghua University’s open-source THUCTC text classification dataset, includes a diverse range of entity types, making it well-suited for fine-grained NER tasks.
As shown in Table 2, examples of the data formats used in two of the datasets for this experiment are provided. In the Weibo example, “老师” is used as a generic reference to a person (PER.NOM), while “王晶” and “刘掌门” are specific references to individuals (PER.NAM). The CMeEE example includes nested entities, with the outer entity “明显的小管萎缩和间质炎症” classified as “sym (symptom)”, while the inner entities “小管” and “间质” are classified as “bod (body)”. Nested entities more accurately reflect complex scenarios in real-world applications, making it more challenging for the model to handle multi-level and overlapping information.
4.2. Evaluation Metrics and Experimental Setup
The evaluation metrics for NER models primarily include Precision, Recall, and F1-score, which are used to assess the model’s performance in entity recognition. These metrics are defined as follows:
Precision (P) is the ratio of True Positives (TP) to the total predicted positives (TP + FP).
Recall (R) is the ratio of True Positives (TP) to the total actual positives (TP + FN).
F1-score (F1) is the harmonic mean of Precision and Recall.
TP represents the number of entities correctly predicted by the model, FP represents the number of non-entities incorrectly predicted as entities, and FN represents the number of entities incorrectly predicted as non-entities. These metrics help assess the model’s performance and effectiveness.
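Written out explicitly, the three metrics follow the standard definitions consistent with the descriptions above:

```latex
P   = \frac{TP}{TP + FP}, \qquad
R   = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```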
The experiments were conducted on a single NVIDIA GeForce RTX 4090 GPU (Manufacturer: Colorful; City: Shenzhen; Country: China) with 24 GB of memory. The software environment consisted of PyTorch 1.12.0 + cu116, Python 3.8.18, and Transformers 4.3.2. The maximum input text length was set to 256 tokens.
To ensure that performance differences arise primarily from the characteristics of the datasets and models themselves, rather than from parameter tuning, we kept the parameters consistent across all experiments. The detailed parameter settings are provided in Table 3.
4.3. Comparative Experiments and Analysis
As shown in Table 4, we use bold values to represent the highest values for the corresponding metrics in the table. To verify the effectiveness of the proposed WP-XAG model in Chinese NER tasks, we conducted extensive comparative experiments using the aforementioned four public Chinese datasets. These datasets cover a wide range of domains and text types, providing a comprehensive evaluation of the model’s performance and generalization capabilities across different scenarios. For comparison, we selected several models commonly used in NER due to their established performance in Chinese NER tasks. This selection allows for a robust assessment of WP-XAG’s capabilities. The baseline and comparison models used are listed below:
BiLSTM-CRF: A widely used NER baseline in which a BiLSTM captures contextual information and outputs, for each word, the probability of each label, while the CRF component ensures the legality of the predicted entity label sequence, thereby improving the accuracy of entity boundary detection (an illustrative sketch of this baseline is provided below).
BERT-CRF: Combines BERT’s deep context understanding with CRF’s sequence modeling for enhanced NER, offering improved accuracy and handling of complex entities.
RBC: Combines the RoBERTa model to extract deep semantic features, BiLSTM to learn sequence context information, and CRF for label decoding to achieve fine-grained entity recognition.
BIC: An integrated model combining BERT-wwm, IDCNN, and CRF in a composite neural network for Chinese NER.
RIC: Integrates a pre-trained language model, an improved deep convolutional network IDCNN, and CRF sequence tagging technology to achieve efficient NER.
RWGNN [30]: This model incorporates lexical information and uses a random graph algorithm to automatically generate multi-directional connection patterns, overcoming the limitations of manually designed graph structures. It also introduces enhanced word embeddings by combining reconstructed word information with the original word data to capture global dependencies.
LB-BMBC [22]: Leverages the word-learning capabilities of pre-trained models, observes spatial relationships between adjacent spans in span-based recognition, and uses CNN to model these relationships.
W2NER [31]: A state-of-the-art (SOTA) model that uses BERT and BiLSTM as input encoders and employs multi-granularity 2D convolution to refine word-pair representations. This method treats NER as a word-to-word relationship classification problem, replacing both sequence labeling and span-based methods.
Among these models, BiLSTM-CRF, BERT-CRF, RBC, BIC, and RIC are based on sequence labeling methods. LB-BMBC is a span-based method, while W2NER is a SOTA model proposing a new word-to-word relationship classification scheme. Additionally, RWGNN introduces an innovative approach by designing a unique random graph structure and enhanced word embeddings, offering new insights into NER architecture.
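As referenced in the BiLSTM-CRF description above, the following is a minimal, illustrative PyTorch sketch of such a tagger, using the third-party pytorch-crf package for the CRF layer; the module names and dimensions are assumptions for illustration rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    """Illustrative BiLSTM-CRF tagger: embeddings -> BiLSTM -> per-tag emissions -> CRF."""
    def __init__(self, vocab_size, num_tags, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim // 2, batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(hidden_dim, num_tags)  # per-token label scores
        self.crf = CRF(num_tags, batch_first=True)        # enforces valid tag transitions

    def forward(self, token_ids, tags=None, mask=None):
        h, _ = self.bilstm(self.embedding(token_ids))
        scores = self.emissions(h)
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence under the CRF
            return -self.crf(scores, tags, mask=mask, reduction="mean")
        # Inference: Viterbi decoding of the best tag sequence
        return self.crf.decode(scores, mask=mask)
```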
From Table 4, it is evident that for the Resume dataset, which features relatively uniform text formats and structured content, models tend to perform well. This is because the standardized nature of the data minimizes the complexity and noise introduced by unstructured text, thereby reducing the chances of error. The WP-XAG model proposed in this paper leverages ADMHA, which allows dynamic adjustment of attention weights based on contextual information. This flexibility enables the model to capture information at different levels and granularities more effectively, leading to superior performance in comparison to other models.
The BIOES tagging scheme, which clearly distinguishes between the beginning, inside, end, and singleton entities, provides finer-grained labeling information, allowing models to capture entity boundaries more accurately. This is especially relevant for datasets like Weibo, which feature colloquial language, ambiguous words, and a higher level of noise. In such cases, fine-grained labeling is crucial to reducing decoding errors. Even when compared to models like BiLSTM-CRF, BERT-CRF, RBC, BIC, and RIC, which use the BIOES scheme for more detailed labeling, WP-XAG demonstrates a distinct performance advantage. WP-XAG employs WoBERT to mitigate the effects of polysemy, providing richer semantic representations. The incorporation of adversarial training improves the model’s robustness to noisy data, while the feature fusion layer further refines disambiguated semantic information. Additionally, GlobalPointer improves the model’s ability to capture global information, resulting in superior performance on the Weibo dataset.
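To make the scheme concrete, the following constructed example (not taken from the dataset) shows how a Weibo-style sentence would be labeled character by character under BIOES; the tag names mirror the Weibo annotation style used above.

```python
# Illustrative BIOES tagging (B = begin, I = inside, O = outside, E = end, S = single).
chars = ["王", "晶", "说", "我", "在", "京"]
tags  = ["B-PER.NAM", "E-PER.NAM", "O", "O", "O", "S-LOC.NAM"]
# "王晶" is a two-character specific person name, so its boundary is marked by B/E;
# the single-character location is tagged S; all non-entity characters receive O.
```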
In the nested entity dataset CMeEE and the fine-grained dataset CLUENER2020, capturing contextual dependencies and appropriately allocating attention becomes even more critical. For nested entities, entity boundaries are not solely determined by local information but also depend on the context of the entire sentence. In WP-XAG, the XLSTM effectively captures longer contextual dependencies, while the adaptive weights in ADMHA dynamically adjust the attention distribution across different heads, improving the model’s ability to focus on various types of entities. Additionally, GlobalPointer captures the positional relationships among entities throughout the text, allowing the model to identify longer-span or more complex entity structures efficiently. The complementary strengths of these modules enable WP-XAG to achieve strong performance across datasets, further demonstrating its effectiveness in NER tasks.
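For reference, the span-scoring idea behind GlobalPointer can be sketched as follows; this is a simplified, illustrative PyTorch fragment for a single entity type, omitting the RoPE injection and the exact masking details of the implementation used in WP-XAG.

```python
import torch
import torch.nn as nn

class GlobalPointerHead(nn.Module):
    """Simplified GlobalPointer-style span scorer for one entity type:
    every candidate span (i, j) receives a score q_i . k_j, so nested and
    overlapping entities can be read directly off the score matrix."""
    def __init__(self, hidden_size, head_size=64):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, head_size)
        self.k_proj = nn.Linear(hidden_size, head_size)

    def forward(self, token_states):                 # (batch, seq_len, hidden_size)
        q = self.q_proj(token_states)                # start representations
        k = self.k_proj(token_states)                # end representations
        scores = torch.einsum("bih,bjh->bij", q, k)  # score for span (i, j)
        # Keep only spans with start <= end; mask out the lower triangle.
        seq_len = scores.size(1)
        tri_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                         device=scores.device))
        return scores.masked_fill(~tri_mask, float("-inf"))
```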
When compared to the SOTA model W2NER, WP-XAG also shows performance advantages across various datasets: F1-scores improve by 0.24% on Resume, 2.66% on Weibo, 0.58% on CMeEE, and 0.45% on CLUENER2020. Compared with the span-based model LB-BMBC, WP-XAG likewise achieves clear improvements, outperforming it on Resume and Weibo with F1-score gains of 0.60% and 6.25%, respectively.
In summary, the proposed WP-XAG model for Chinese NER, based on multi-level representation learning, effectively addresses the issue of polysemy, enhances model robustness, and improves its ability to capture long-range dependencies. Its performance on Resume, Weibo, and the nested entity dataset CMeEE, as well as the fine-grained dataset CLUENER2020, surpasses that of the baseline model BiLSTM-CRF and the seven other comparison models.
4.4. Ablation Study
As shown in Table 5, an ablation study was conducted to further verify the significance of each module in the WP-XAG model and its contribution to performance improvement. The study used a stepwise removal strategy across the four publicly available datasets used in the previous experiments. This approach helps to explore how each module affects the model’s performance across different types of datasets. We use bold values to represent the highest values for the corresponding metrics in the table.
The details of the ablation study are as follows:
- PGD: The PGD applied to the embedding layer is removed.
- XLSTM: The XLSTM network in the WP-XAG model is removed.
- ADMHA: The ADMHA mechanism is removed.
- RoPE: The RoPE in GlobalPointer is removed.
From the results of the ablation experiments shown in Table 5, we can observe the following:
After removing the PGD module, the F1-score on the Resume dataset decreased by 1.42%, indicating that PGD adversarial training enhances model robustness even in the more structured text of the Resume dataset. On the Weibo dataset, the F1-score dropped by 3.66%, which can be attributed to the more colloquial and polysemous nature of Weibo texts. PGD helps the model adapt to these polysemous words and irregular language structures, and the significant drop in F1 after its removal demonstrates PGD’s effectiveness in handling noise and complex semantics. For the CMeEE dataset, the F1-score decreased by 1.37%, suggesting that while nested entities in medical texts are complex, PGD still improves the model’s handling of noise and challenging annotations. On the CLUENER2020 dataset, the F1-score dropped by 3.98%, illustrating that PGD contributes significantly to the recognition of complex entities in fine-grained NER tasks.
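For context, PGD adversarial training on the embedding layer is typically implemented as an iterative gradient-based perturbation projected back into an epsilon ball. The sketch below is a minimal illustration under standard assumptions (parameter names, step sizes, and the exact integration into WP-XAG training are illustrative, not the precise configuration used here).

```python
import torch

class PGDPerturbation:
    """Minimal PGD helper: iteratively perturbs the word-embedding matrix along
    its gradient and projects the accumulated perturbation into an L2 ball."""
    def __init__(self, model, emb_name="word_embeddings", epsilon=1.0, alpha=0.3):
        self.model, self.emb_name = model, emb_name
        self.epsilon, self.alpha = epsilon, alpha
        self.backup = {}

    def attack(self, first_step=False):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                if first_step:
                    self.backup[name] = param.data.clone()   # save clean embeddings
                step = self.alpha * param.grad / (param.grad.norm() + 1e-12)
                param.data.add_(step)
                # Project the total perturbation back onto the epsilon ball.
                delta = param.data - self.backup[name]
                if delta.norm() > self.epsilon:
                    delta = self.epsilon * delta / delta.norm()
                param.data = self.backup[name] + delta

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}
```

In a typical training loop, the clean loss is backpropagated first, `attack` plus an additional backward pass is repeated for a small number of steps, and `restore` returns the embeddings to their clean values before the optimizer update.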
When the XLSTM module was removed, the F1-score on the Resume dataset dropped by 1.62%, showing that XLSTM contributes to capturing global semantics and long-range dependencies. On the Weibo dataset, the F1-score fell by 3.82%. Weibo contains not only complex contexts, but also many polysemous words, and XLSTM’s enhanced memory structure helps the model better distinguish between different meanings of the same word in such intricate contexts. The removal of XLSTM weakened the model’s ability to handle these complex dependencies and polysemy. On the CMeEE dataset, the F1-score dropped by 1.88%, indicating that XLSTM aids in the modeling of nested entities. The F1-score on CLUENER2020 decreased by 4.62%, demonstrating that XLSTM’s multi-layer sequence modeling capability plays an important role in capturing subtle variations in fine-grained datasets.
After removing the ADMHA module, the F1-score on the Resume dataset dropped by 1.67%. ADMHA helps capture semantic information in more structured texts. On the Weibo dataset, the F1-score decreased by 3.39%, with ADMHA playing a relatively important role in processing polysemous words and complex semantics in Weibo texts. By employing adaptive weighting mechanisms, the model can better understand the different meanings of polysemous words in varying contexts. Removing this module diminished the model’s ability to differentiate between such words. On the CMeEE dataset, the F1-score dropped by 1.65%, indicating that ADMHA helps in modeling the complex semantic relationships in medical texts. The F1-score on CLUENER2020 decreased by 2.97%, indicating that ADMHA contributes to handling the complex relationships between fine-grained entities.
After the removal of the RoPE module, the F1-score on the Resume dataset decreased by 2.01%. RoPE helps the model capture positional relationships between entities more effectively, still proving useful in structured texts. On the Weibo dataset, the F1-score dropped by 6.19%, as polysemous words appear frequently in Weibo, and RoPE’s rotational position encoding enhances the model’s ability to recognize these words in different positions and contexts. Removing RoPE significantly weakened the model’s handling of polysemy. On the CMeEE dataset, the F1-score dropped by 2.36%, showing that RoPE contributes to handling the positional relationships of nested entities. On CLUENER2020, the F1-score decreased by 4.5%, illustrating that RoPE contributes significantly to recognizing complex positional relationships in fine-grained NER tasks.
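For clarity, rotary position embedding applies a position-dependent rotation to paired feature dimensions of the start and end vectors before their dot product, so that relative position enters the span scores. The sketch below follows the standard RoPE formulation (even feature dimension assumed), not the exact code used in our implementation.

```python
import torch

def apply_rope(x):
    """Apply rotary position embedding to x of shape (batch, seq_len, dim).
    Feature dimensions are grouped into pairs and rotated by an angle that grows
    with token position, so the dot product q_i . k_j depends on (i - j)."""
    batch, seq_len, dim = x.shape
    half = dim // 2
    # Frequencies for each feature pair (standard 10000^(-2t/dim) schedule).
    freqs = 10000 ** (-torch.arange(0, half, dtype=torch.float32) * 2 / dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    sin, cos = angles.sin(), angles.cos()          # (seq_len, half)
    x1, x2 = x[..., 0::2], x[..., 1::2]            # paired dimensions
    rot1 = x1 * cos - x2 * sin                     # 2D rotation of each pair
    rot2 = x1 * sin + x2 * cos
    return torch.stack([rot1, rot2], dim=-1).reshape(batch, seq_len, dim)
```

In a GlobalPointer-style scorer, the rotation is applied to both the start (q) and end (k) representations before computing the span score matrix.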
In summary, the WP-XAG model leverages multi-level representation learning, incorporating the advantages of each module, thus enhancing the model’s understanding of complex semantics, contexts, and positional relationships, ultimately improving its entity recognition performance. The performance degradation after the removal of any module highlights the effectiveness of each component.
4.5. Iterative Comparison Experiments
In this section, we conduct experiments using the Weibo and CMeEE datasets. The Weibo dataset is characterized by its frequent use of popular terms, polysemous words, and colloquial language, making it challenging for many models to effectively learn features in the early stages of training. By utilizing the Weibo dataset, we can better assess the capability of different models in handling word ambiguity and capturing complex information. The CMeEE dataset is derived from the Chinese Biomedical Language Understanding Evaluation (CBLUE) and contains a large number of nested entities, adding to its complexity. By experimenting with the CMeEE dataset, we aim to evaluate the models’ performance in processing technical terms and recognizing complex entity relationships.
As shown in Figure 5a, the non-standard sentence structures and irregular formatting in the Weibo dataset make it difficult for models to capture sentence structure and dependencies, leading to consistently lower performance for the BiLSTM-CRF model. While the RBC model achieves better results, it still falls short compared to WP-XAG, possibly due to insufficient robustness in handling the noise and complexity present in the Weibo dataset. The BIC model’s F1-score exhibits instability during training, with significant fluctuations, and at later stages, the F1-score even drops to zero within certain epochs. This suggests that BIC struggles with capturing and processing complex features and is particularly susceptible to the dataset’s noise. In contrast, WP-XAG achieves an F1-score above 60% by the second epoch and stabilizes at around 70%, indicating that it quickly learns effective features early in training and is more adept at handling the complex information within the Weibo dataset.
From the results displayed in Figure 5b, WP-XAG also performs exceptionally well on the CMeEE dataset compared to other models. It reaches a high F1-score early in training and maintains stability throughout, outperforming other models. The BiLSTM-CRF model struggles with complex medical terms and nested entities, resulting in a lower F1-score. Although the RBC model can achieve a high F1-score, it performs poorly in the later stages of training, suggesting that it may have learned noise and details from the training data, leading to suboptimal performance on unseen data. The BIC model also shows instability when handling complex features, indicating that it has certain limitations in modeling complex medical texts. Overall, WP-XAG demonstrates superior performance in capturing and processing complex medical information, outperforming other models across the board.
As shown in Figure 5, although our model has achieved better results compared to other models, further efforts are needed to reach a higher F1-score (above 90%). This may require more comprehensive solutions to address the misclassification caused by complex sequences and nested entities, as well as difficulties in recognizing polysemous words in context. Additionally, when dealing with complex medical terminology, the model sometimes struggles to accurately identify proper nouns, leading to inaccuracies. By systematically analyzing these errors, we can gain insights for future improvements, such as integrating external dictionaries to enhance the recognition of proper nouns and improving the model’s handling of polysemy.
4.6. Performance Analysis of ADMHA
In this section, we designed comparative experiments to evaluate the performance of ADMHA compared to MHA and to explore the optimization effect of ADMHA with varying numbers of attention heads. In the experiment comparing ADMHA and MHA, all other modules and parameters of the WP-XAG model were kept constant, with only ADMHA replaced by MHA; the results are shown in Table 6, where bold values indicate the highest value for the corresponding metric.
Furthermore, we conducted experiments to assess the model’s performance with different numbers of attention heads (1, 2, 4, 8, and 16) across various datasets. The results of these experiments are detailed in Table 7. Similarly, we use bold values to represent the highest values for the corresponding metrics in the table.
As shown in Table 6, by comparing the performance of ADMHA and MHA across the four datasets, it is evident that ADMHA demonstrates a clear advantage on all datasets. First, in the more structured Resume dataset, ADMHA improves the F1-score by 1.86% compared to MHA. This shows that ADMHA, through its adaptive weighting mechanism, dynamically adjusts the importance of different attention heads, enhancing the model’s ability to capture fine-grained entities. Second, in the Weibo dataset, which contains a large number of polysemous words and noise, the F1-score increases by 3.49%. This indicates that ADMHA’s dynamic weighting mechanism allows it to better handle complex contexts and the semantic variations of polysemous words. In the CMeEE dataset, the F1-score improves by 1.26%, as ADMHA can more flexibly model the complex dependencies between nested entities. Finally, in the CLUENER2020 dataset, which contains a rich set of fine-grained entities, ADMHA improves the F1-score by 2.05%, demonstrating its superior flexibility in modeling complex entity relationships compared to MHA. ADMHA’s adaptive attention mechanism makes the model more adaptable and flexible when addressing complex dependencies, fine-grained entities, and polysemy issues.
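One way to realize such adaptive head weighting, sketched here only to make the idea concrete (the exact ADMHA formulation may differ), is to compute an input-dependent gate over the heads and rescale each head's output before the final projection:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMultiHeadAttention(nn.Module):
    """Illustrative adaptive multi-head self-attention: each head's output is
    rescaled by an input-dependent weight before the output projection, so the
    model can emphasize different heads for different inputs."""
    def __init__(self, hidden_size, num_heads=4):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads, self.head_dim = num_heads, hidden_size // num_heads
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.out = nn.Linear(hidden_size, hidden_size)
        # Gate network: pooled sentence representation -> one weight per head.
        self.gate = nn.Sequential(nn.Linear(hidden_size, num_heads), nn.Softmax(dim=-1))

    def forward(self, x):                                # (batch, seq_len, hidden)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, num_heads, seq_len, head_dim)
        shape = (b, s, self.num_heads, self.head_dim)
        q, k, v = (t.view(*shape).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        heads = attn @ v                                 # per-head context vectors
        # Adaptive weights, one per head, conditioned on the pooled input
        # (scaled so the average weight stays close to 1).
        w = self.gate(x.mean(dim=1)) * self.num_heads    # (batch, num_heads)
        heads = heads * w[:, :, None, None]              # rescale each head
        merged = heads.transpose(1, 2).reshape(b, s, -1)
        return self.out(merged)
```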
As shown in Table 7, the F1-score increases across all datasets as the number of attention heads rises from 1 to 4. This is likely because ADMHA dynamically adjusts the importance of each attention head based on the input data. With more heads, the model can learn diverse semantic features from various subspaces. The adaptive weight mechanism ensures that the model effectively utilizes these diverse features, avoiding the limitations of fixed weight allocation. At four attention heads, ADMHA strikes a balance between capturing both local and global information while avoiding issues like information redundancy or unnecessary computational resource consumption.
However, as the number of attention heads increases beyond four, the F1-score begins to decrease. This suggests that despite ADMHA’s ability to adaptively adjust the weights of each head, having too many heads can lead to information redundancy. Even with the adaptive mechanism, some heads may fail to capture useful new information, potentially introducing noise instead. Moreover, an excessive number of attention heads may result in an over-parameterized model, which could negatively impact training performance. In future work, dynamically identifying attention heads that contribute little or contain redundant information during training—and reducing their weights, or even pruning them during inference—could improve the model’s efficiency.