Article

Reduced Forgetfulness in Continual Learning for Named Entity Recognition Through Confident Soft-Label Imitation

1
School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
2
Institute of Machine Intelligence, University of Shanghai for Science and Technology, Shanghai 200093, China
*
Author to whom correspondence should be addressed.
Mathematics 2024, 12(24), 3964; https://doi.org/10.3390/math12243964
Submission received: 13 November 2024 / Revised: 12 December 2024 / Accepted: 13 December 2024 / Published: 17 December 2024
(This article belongs to the Special Issue Robust Perception and Control in Prognostic Systems)

Abstract

Continual Learning for Named Entity Recognition (CL-NER) is a crucial task in recognizing emerging concepts when constructing real-world natural language processing applications. It involves sequentially updating an existing NER model with new entity types while retaining previously learned information. However, current CL methods struggle with a major challenge called catastrophic forgetting, which is further intensified in NER by the semantic shift of the non-entity type. Most existing CL-NER methods rely on knowledge distillation through the output probabilities of previously learned entities, resulting in excessive stability (recognition of old entities) at the expense of plasticity (recognition of new entities). Some recent works extend these methods by improving the distinction between old entities and non-entity types. Although these methods improve overall performance, the preserved knowledge does not necessarily retain task-related information for the oldest entities, which can lead to significant performance drops. To address this issue while maintaining overall performance, we propose a method called Confident Soft-Label Imitation (ConSOLI) for continual learning in NER. Inspired by methods that balance stability and plasticity, ConSOLI incorporates a soft-label distillation process and confident soft-label imitation learning. The former gathers the task-related knowledge in the old model, and the latter prevents that knowledge from being diluted during the step-wise continual learning process. Moreover, ConSOLI demonstrates significant improvements in recognizing the oldest entity types, improving Micro-F1 and Macro-F1 by up to 8.72 and 9.72, respectively, thus mitigating catastrophic forgetting in CL-NER.

1. Introduction

As an essential task in natural language understanding, Named Entity Recognition (NER) aims to extract character sequences from unstructured text and assign them to predefined entity categories (name, address, organization, etc.). For instance, given the input sentence “He was seen at Tri-City Hospital and was told he had a fracture.”, a well-trained NER model should tag “Tri-City Hospital” as a hospital entity. With the development of information technology, especially the rise of big data and artificial intelligence, NER has gradually become a fundamental element in constructing various Natural Language Processing (NLP) applications [1,2], such as event detection [3], task-oriented dialogue robots [4], and recommendation systems [5].
Generally, the NER task is formulated as a sequence tagging problem, in which the model produces a tag for each character or token. Early works in NER follow classic supervised training schemas, in which the model is trained on annotated samples before being deployed for inference. In this setting, all the entity categories are fixed and available during training. Therefore, traditional methods focus on improving the overall recognition performance using machine learning and deep learning techniques [6,7]. Later, with the advancement of Pre-trained Language Models (PLMs), the encoder–decoder architecture became the mainstream in the field [8], and overall performance has improved substantially over the past few years. Despite these improvements, such methods fail to adapt to real application scenarios in which new entities and data appear constantly. Moreover, the scarcity of annotated data is another obstacle for traditional NER methods. Therefore, a new paradigm is needed that allows NER models to continually recognize emerging entity types.
Continual Learning (CL, also known as incremental learning or lifelong learning) is a technique that allows a model to be trained and updated in a continuous and efficient manner. Inspired by earlier works in computer vision, researchers have successfully combined CL with NER models to recognize entities in a streaming manner [9,10]. Due to its potential in practical applications, this new learning paradigm, also known as Continual Named Entity Recognition (CNER) or Continual Learning for NER (CL-NER), has attracted increasing attention. However, existing CL methods [11,12,13] still struggle with a main challenge called catastrophic forgetting: simply fine-tuning the model on newly annotated data often degrades its performance on the old data. Meanwhile, CL-NER differs from conventional CL in that new datasets only annotate new entities [10,14], leaving the majority of the tokens tagged as non-entities. For instance, about 89% of the tokens in OntoNotes5 belong to the non-entity class [15]. Therefore, unlike other accuracy-oriented tasks, such as image/text classification [16,17,18], this characteristic inevitably introduces a strong bias toward the non-entity type. Moreover, since only the new entity types are annotated in the current CL step, the non-entity tokens contain both historical entities and true non-entities. The semantic meaning of the non-entity label therefore changes from step to step, which amplifies forgetting-related problems.
To mitigate the aforementioned issues, Zhang et al. [10] recently proposed a method called CPFD, which combines confidence-based pseudo-labeling and pooled feature distillation. Despite the overall performance improvements, CPFD suffers from severe performance drops in recognizing the oldest entities, indicating that the forgetting issue remains. To tackle this issue while maintaining the strengths of CPFD, we propose a new continual learning method for NER called Confident Soft-Label Imitation (ConSOLI), which uses the historical models in two different ways. Firstly, CPFD only uses hard pseudo-labels, i.e., the confident labels, to guide the supervised training, and neglects the hidden states that the new model should memorize. We incorporate imitation learning on the latent space to restrain the network’s behavior by forcing the new model to learn the hidden states of the old one, thereby emphasizing its ability on historical tasks. Secondly, to imitate the fine-grained behavior of the old model, we extend the original confidence-based pseudo-labeling strategy into a mixture labeling process that balances the hard labels (in terms of entity types) against the soft-labels (in terms of pooled latent features). Our contributions can be summarized as follows:
  • We propose a continual learning method for NER through feature imitation. The approach includes a knowledge distillation loss, which improves the original CPFD by strengthening the knowledge learned from the old task.
  • We propose a confidence-based soft pseudo-labeling method to preserve the knowledge learned from old tasks. To work alongside the original pseudo-labeling in CPFD, we incorporate a balancing factor that weighs the hard labels against the proposed soft-labels.
  • Extensive experiments on four benchmark datasets are conducted, and the results show that our method improves the Micro-F1 and Macro-F1 by an average of 1.28% and 1.18%, respectively.
The manuscript is organized as follows. Section 2 summarizes the concept of NER and CL. Section 3 provides the problem formulation and details the method. Experiments and an analysis of the results are given in Section 4, followed by a conclusion in Section 6.

2. Related Work

In this section, we first summarize related works on NER in Section 2.1. Subsequently, we introduce recent works on using CL in NER in Section 2.2.

2.1. Named Entity Recognition

As previously introduced in Section 1, Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP). Traditional feature-based methods rely on manual feature engineering and domain knowledge. With the development of deep neural networks, researchers first combined CNN, LSTM, or BiLSTM to provide word- or character-level embeddings for NER [19,20]. Later, with the developments of pre-trained language models (PLMs), the representations of words were enhanced. Existing PLM-based NER models [21,22,23] typically adopt an “encoder–decoder” architecture, in which the encoder of PLM transforms tokenized sentences into token or word-level representations. Recent works have incorporated sentence- or document-level representations to further enhance semantics. For instance, Luo et al. [24] utilize BiLSTM with label embedding to learn the sentence representation, which is then used to calculate document-level representation for the NER task. Schweter et al. [25] used contextual information to help enhance the information in sentence representation, thereby improving the performance of NER.
Another trend in NER is to incorporate syntactic information, such as dependency trees, combinatory categorial grammar, etc. To name a few, Luo et al. [26] proposed a bipartite flat-graph network to model the interaction and dependency between entities; Xu et al. [27] constructed syntactic adjacency matrices to develop syntactic representations for entities; Singh et al. [5] adopted a graph-attention network to construct a multi-task learning framework to help determine the entities in complaints. These methods often adopt graph neural networks, and the improvements they make largely rely on the construction of valid graphs.

2.2. Continual Learning in NER

Continual Learning (CL) aims to learn a sequence of tasks without reducing the performance on historical tasks. Despite the development in computer vision [28,29,30,31], existing methods focus on improving accuracy and neglect the preservation of previously learned knowledge. In real-world application scenarios, new entity types emerge rapidly. Dealing with this requires updating the NER model constantly, i.e., Continual Named Entity Recognition (CNER). However, due to the characteristics of NER and the catastrophic forgetting problem, it is not straightforward to combine conventional NER methods with existing CL methods. To tackle this problem, prior researchers have relied on knowledge distillation to extract valuable information from the old model, which is then transferred to the new model. These methods choose either output logits or latent representations as the main source of knowledge preservation. To name a few, ExtendNER [32] uses the logits from a previous model to force the new model to yield similar logits. Xia et al. [33] further adopt a two-stage framework, in which synthesized samples containing the old entity types are included to help update the new model. To enhance the correlation between the old entity types and current non-entity tokens, CFNER [14] constructs a causal graph to help distinguish old entity types from true non-entity types. Although these methods have achieved significant advancements, they strictly require distilling the knowledge from the output of the previous model, resulting in over-confidence in the old task but limited adaptability to new tasks. Later, Zhang et al. [10] extended the method by introducing pooled feature distillation and confidence-based type-balanced learning, yielding a state-of-the-art method (CPFD) for dealing with catastrophic forgetting.
However, the distilled feature from CPFD only ensures the preservation of the linguistic knowledge in the old model, neglecting the fact that the distilled knowledge may not be relevant to the NER tasks. One symptom is that the new model tends to show large performance degradation in most previous tasks. Inspired by the use of soft-labels [34,35] to distill knowledge, our method extends CPFD by incorporating soft-label distillation and confidence-based soft-label imitation to help ensure the preservation of task-relevant knowledge.

3. Continual Learning for Named Entity Recognition Through Confident Soft-Label Imitation

In this section, we first describe the task formulation of CL-NER in Section 3.1, which is followed by an overview of ConSOLI in Section 3.2. Then, implementation details, including soft-label distillation and confident imitation, are provided in Section 3.3 and Section 3.4, respectively.

3.1. Task Formulation

In this paper, we improve CNER by mitigating the catastrophic forgetting problem. The objective is to train a model $M$ sequentially through steps $t = 1, \ldots, T$ to maintain high recognition performance for an expanding set of entity types. Specifically, at each step $t$, a unique training set $D_t$ is provided, which includes multiple pairs $(X^t, Y^t)$. Here, $X^t$ is an input token sequence with length $|X^t|$, and $Y^t$ refers to the corresponding ground truth label sequence. Note that $Y^t$ only comprises the current entity types $\varepsilon_t$, which means the entity categories from the previous CL steps ($\varepsilon_1$ to $\varepsilon_{t-1}$) are annotated as the non-entity type $e_o$. Therefore, at time step $t$ ($t > 1$), let the entity types that can be recognized by the old model $M_{t-1}$ be $\bigcup_{i=1}^{t-1} \varepsilon_i$. Our objective is to obtain the new model $M_t$, which can recognize all the entity types seen thus far (denoted as $\bigcup_{i=1}^{t} \varepsilon_i$). Initially, $M_0$ is trained on an initial dataset $D_0$ using the conventional supervised training paradigm. To avoid catastrophic forgetting, the current model $M_t$ must retain a high recognition accuracy on the previous tasks $D_i$ ($i < t$) with respect to $\varepsilon_1$ to $\varepsilon_{t-1}$, especially on the oldest entities $\varepsilon_1$.

3.2. Framework Overview

As discussed in CPFD [10], one major challenge is catastrophic forgetting, which refers to a performance drop on older tasks as new tasks are learned. Another challenge is the preservation of task-specific knowledge in the CL process. To address these issues, the authors of CPFD tackle catastrophic forgetting by leveraging distilled features in high-confidence tokens. Such an operation improves the average Macro-F1 and Micro-F1 on challenging benchmarks. Despite these improvements, we observe that catastrophic forgetting persists in previous tasks. More specifically, although pooled feature distillation helps preserve linguistic knowledge across step-wise tasks, whether it contributes to the relevant NER task remains questionable. In particular, when $M_t$ is updated, the recognition of the oldest tasks suffers from performance degradation. For instance, in I2B2, the scores for the entity “AGE” dropped significantly from 87.90% to 63.50%, indicating evident catastrophic forgetting.
To mitigate the problem while preserving knowledge, we propose a variant of CPFD called Confident Soft-Label Imitation (ConSOLI). The overview of each CL step is shown in Figure 1. Similar to CPFD, ConSOLI uses a pooled feature distillation loss $\mathcal{L}_{PFD}$ and a balanced pseudo-labeling loss $\mathcal{L}_{balance\text{-}pseudo}$ to constrain the updating of $M_t$.
As previously introduced in CPFD, a recent study [36] suggests that the attention weights of PLM-based models contain substantial linguistic knowledge. This finding led Zhang et al. to propose a feature distillation loss $\mathcal{L}_{FD}$ that encourages the transfer of such knowledge by maintaining similar attention weights. As shown in Equation (1), $A^{t}_{\ell,k,i,j}$ and $A^{t-1}_{\ell,k,i,j}$ refer to the attention weights of layer $\ell$ for $M_t$ and $M_{t-1}$, respectively, $K$ is the number of attention heads, and $|X^t|$ is the sequence length.
$$\mathcal{L}_{FD} = \sum_{k=1}^{K} \sum_{i=1}^{|X^t|} \sum_{j=1}^{|X^t|} \left\| A^{t}_{\ell,k,i,j} - A^{t-1}_{\ell,k,i,j} \right\|^{2} \tag{1}$$
$\mathcal{L}_{FD}$ strictly compels $M_t$ to align with $M_{t-1}$ on every value of the attention weights, which yields over-stability and limits the potential to learn new entity types. To balance stability and learning capacity, CPFD extends $\mathcal{L}_{FD}$ by pooling along different dimensions, which gives the variants $\mathcal{L}_{PFD\text{-}lax}$ and $\mathcal{L}_{PFD}$ (Equation (2)).
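$$\begin{aligned}
\mathcal{L}_{PFD\text{-}lax} ={}& \sum_{k=1}^{K} \left\| \sum_{i,j=1}^{|X^t|} A^{t}_{\ell,k,i,j} - \sum_{i,j=1}^{|X^t|} A^{t-1}_{\ell,k,i,j} \right\|^{2} \\
\mathcal{L}_{PFD} ={}& \sum_{i=1}^{|X^t|} \sum_{j=1}^{|X^t|} \left\| \sum_{k=1}^{K} A^{t}_{\ell,k,i,j} - \sum_{k=1}^{K} A^{t-1}_{\ell,k,i,j} \right\|^{2} \\
&+ \sum_{k=1}^{K} \sum_{j=1}^{|X^t|} \left\| \sum_{i=1}^{|X^t|} A^{t}_{\ell,k,i,j} - \sum_{i=1}^{|X^t|} A^{t-1}_{\ell,k,i,j} \right\|^{2} \\
&+ \sum_{k=1}^{K} \sum_{i=1}^{|X^t|} \left\| \sum_{j=1}^{|X^t|} A^{t}_{\ell,k,i,j} - \sum_{j=1}^{|X^t|} A^{t-1}_{\ell,k,i,j} \right\|^{2}
\end{aligned} \tag{2}$$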
Instead of constraining each value in the attention weights, $\mathcal{L}_{PFD\text{-}lax}$ pools the feature along the sequence dimensions. Compared with $\mathcal{L}_{FD}$, it is a more permissive loss, since only the head dimension is preserved. To balance plasticity and stability, $\mathcal{L}_{PFD}$ pools the attention weights along each dimension in turn. Compared with $\mathcal{L}_{PFD\text{-}lax}$, it gives up some plasticity by pooling only one dimension at a time, while summing the three pooling results to enhance stability. The empirical study of CPFD also confirms that $\mathcal{L}_{PFD}$ outperforms the other two losses. Therefore, in ConSOLI, we use $\mathcal{L}_{PFD}$ by default to retain the strengths of CPFD.
Despite the improved performance of CPFD, its contributions do not guarantee that catastrophic forgetting is removed when recognizing the oldest entities. As shown in Section 4.2, CPFD results in a significant performance drop in the oldest tasks on the I2B2 dataset. The reason is that pooled feature distillation along the different dimensions of the attention weights is a balanced strategy for preserving linguistic knowledge. Such an operation does not explicitly extract the task-specific information in $M_{t-1}$, resulting in a steady performance degradation on the old tasks.
To preserve the relevant task-specific knowledge, ConSOLI extends CPFD by incorporating a soft-label distillation process, in which the output of the hidden layers is used to construct soft-labels. To constrain the soft-label similarity between the new model $M_t$ and the old model $M_{t-1}$, ConSOLI adopts a confidence-based soft-label imitation loss $\mathcal{L}_{soft}$. Details of the modifications are introduced in the following subsections.
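The difference between the three distillation losses can be made concrete with a small pure-Python sketch for a single layer. It assumes the attention weights are stored as nested lists indexed `[head][i][j]`; the function names `l_fd`, `l_pfd_lax`, and `l_pfd` are illustrative and are not taken from the CPFD code base. Because each pooled quantity is a scalar, the squared norm reduces to an ordinary square.

```python
from typing import List

Attention = List[List[List[float]]]  # assumed layout: [head k][row i][column j]


def l_fd(a_new: Attention, a_old: Attention) -> float:
    """Element-wise squared difference, following Equation (1)."""
    return sum(
        (a_new[k][i][j] - a_old[k][i][j]) ** 2
        for k in range(len(a_new))
        for i in range(len(a_new[k]))
        for j in range(len(a_new[k][i]))
    )


def l_pfd_lax(a_new: Attention, a_old: Attention) -> float:
    """Pool over both sequence axes so that only the head axis is preserved."""
    total = 0.0
    for head_new, head_old in zip(a_new, a_old):
        pooled_new = sum(sum(row) for row in head_new)
        pooled_old = sum(sum(row) for row in head_old)
        total += (pooled_new - pooled_old) ** 2
    return total


def l_pfd(a_new: Attention, a_old: Attention) -> float:
    """Pool one axis at a time (heads, rows, columns) and add the three terms."""
    heads, n = len(a_new), len(a_new[0])
    # Pooling is a plain sum, so pooling the difference equals the difference of pooled values.
    diff = [[[a_new[k][i][j] - a_old[k][i][j] for j in range(n)]
             for i in range(n)] for k in range(heads)]
    over_heads = sum(sum(diff[k][i][j] for k in range(heads)) ** 2
                     for i in range(n) for j in range(n))
    over_rows = sum(sum(diff[k][i][j] for i in range(n)) ** 2
                    for k in range(heads) for j in range(n))
    over_cols = sum(sum(diff[k][i][j] for j in range(n)) ** 2
                    for k in range(heads) for i in range(n))
    return over_heads + over_rows + over_cols


# Toy values (not real softmax outputs): one head shifts weight between two columns.
old = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
new = [[[1.0, -1.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
print(l_fd(new, old), l_pfd_lax(new, old), l_pfd(new, old))
```

For this toy perturbation the element-wise loss is 2.0, the lax loss is 0.0 because the change cancels once the whole head is pooled, and the full pooled loss is 4.0 because two of its three pooling directions still see the change.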

3.3. Soft-Label Distillation

As mentioned in Section 3.2, CPFD focuses on regularizing the preservation of linguistic knowledge in terms of pooled attention weights. Our experiments in Section 4.2 show that such knowledge does not necessarily contain task-related information, thereby degrading the performance in early tasks. To compensate for such information loss, we aim to construct a semantic representation that contains knowledge of previous NER tasks.
Consider a typical encoder–decoder architecture, defined in Equation (3): the outputs of the encoder $f_{enc}$ are referred to as features, while the decoder $f_{dec}$ is implemented as a classifier. Given the input sentence $x$, the output $y$ is the corresponding NER tag sequence.
$$y = f_{dec}(f_{enc}(x)) \tag{3}$$
In a supervised training process, we conventionally use cross-entropy to measure the difference between the prediction and the ground truth labels. Specifically, we compel the model to learn the conditional probability $P(y \mid x)$ by back-propagating the gradient of the cross-entropy. In CPFD, the model refines the process through confidence-based pseudo-labels, which encourage it to replicate the predictions of the old model $M_{t-1}$. However, transforming latent features into predictions can blur distinctions between varied features of the same label. For instance, given two similar features $x_i$ and $x_j$ with minimal variation $\epsilon$, Equation (4) holds as $\epsilon \to 0$. Here, $y_i$ and $y_j$ are the predictions for $x_i$ and $x_j$, respectively. To retain the knowledge acquired by $M_{t-1}$, an intuitive approach is to harness valuable information from the latent features as soft-labels.
$$y_i = f_{dec}(f_{enc}(x_i) + \epsilon) \approx f_{dec}(f_{enc}(x_j)) = y_j \tag{4}$$
To compensate for the information loss, we first propose a soft-label distillation strategy to preserve the knowledge in $M_{t-1}$. Specifically, we use $M_{t-1}$ to obtain pseudo-labels and the corresponding token-level representations $e_i$. As shown in Figure 1, CPFD applies a confidence-based selection over the pseudo-labels. Following this process, we select the token-level representations in the same way. For a pre-trained language model such as BERT, an earlier study [37] shows that the hidden states close to the output layer contain task-specific information, while those of the shallow layers contain more linguistic knowledge. Motivated by the pooled feature distillation in CPFD, we propose a pooled soft-label, given by Equation (5). Here, $e^{o}_{i,j}$ is the $i$-th hidden state of the $j$-th non-entity token, and $|L_e|$ refers to the number of layers used to construct the soft-label.
$$e^{o}_{j,p} = \frac{1}{|L_e|} \sum_{i=1}^{|L_e|} e^{o}_{i,j} \tag{5}$$
Soft-labeling is a strategy widely used in recent studies. Intuitively, it relaxes the hard-label constraints by forcing samples with similar features to cluster in the same class. In this paper, we start with the last hidden states to examine the effects of $L_e$. Initially, when only the last hidden states are included, the soft-label for the $j$-th token is $e^{o}_{j,p} = e^{o}_{j}$.
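The pooled soft-label of Equation (5) is an average of hidden-state vectors for one token. The sketch below assumes that the hidden states are ordered from shallow to deep and that the layers in $L_e$ are the last ones; the source only states that it starts with the last hidden states, so the choice of the last `num_layers` layers and the name `pooled_soft_label` are assumptions made for illustration.

```python
from typing import List, Sequence


def pooled_soft_label(hidden_states: Sequence[Sequence[float]], num_layers: int) -> List[float]:
    """Average the last `num_layers` hidden-state vectors of a single token.

    `hidden_states[i]` is the token's vector at layer i (shallow to deep).
    """
    if not 1 <= num_layers <= len(hidden_states):
        raise ValueError("num_layers must be between 1 and the number of layers")
    selected = hidden_states[-num_layers:]
    dim = len(selected[0])
    return [sum(layer[d] for layer in selected) / num_layers for d in range(dim)]


# Three layers with 2-dimensional hidden states for one token (toy values).
layers = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0]]
print(pooled_soft_label(layers, 1))  # only the last hidden state is used
print(pooled_soft_label(layers, 2))  # mean of the last two layers
```

With `num_layers=1` the function returns the last hidden state unchanged, which matches the initial setting described above.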

3.4. Confident Soft-Label Imitation

Soft-label learning comprises two similarity computations. One forces the predicted label of a given sample to resemble those of nearby samples, while the other forces the sample to remain similar to the given samples [38,39]. Therefore, typical soft-label learning aims to solve the following problem:
$$\min \sum_{i,j}^{n} s_{i,j} \left\| v_i - v_j \right\|_2^2 + \gamma \left\| \hat{Y}_l - Y_l \right\|_F^2 \tag{6}$$
where $\gamma$ is a trade-off parameter, and $s_{i,j}$ is the similarity of the $i$-th and $j$-th samples. $Y_l$ is the ground truth label, $\hat{Y}_l$ is the predicted label, and $v_i$ and $v_j$ are the features of the $i$-th and $j$-th samples.
We follow a similar idea, using soft-labels to let the new model $M_t$ balance learning the new task against consolidating the old model $M_{t-1}$. Similar to CPFD, we apply a confidence threshold $\tau$ to filter confident tokens, which are then used to calculate the token-wise similarity loss $\mathcal{L}_{soft}$.
$$\mathcal{L}_{soft} = \min \sum_{i}^{|X_{conf}|} S(e_i^o, e_i^n) \tag{7}$$
Here, $S(e_i^o, e_i^n)$ refers to a measure of similarity between $e_i^o$ and $e_i^n$, the soft-labels of the $i$-th confident token in the old and new models, respectively. In CL, the prediction of the new model $M_t$ contains both non-entity labels and old entity labels. CPFD splits the non-entity label of the current task into two classes, i.e., the current non-entity class and the old entity class (Equation (8)). Here, $\hat{Y}_i^t$ refers to the one-hot pseudo-label. If a token is not marked as the non-entity type ($Y_i^t = 0$), we replicate the ground truth. Otherwise, if the prediction of $M_{t-1}$ marks the token as an old entity type ($e = \arg\max_{e \in e_o \cup \varepsilon_{1:t-1}} Y_i^{t-1}$), we replicate the prediction.
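$$\hat{Y}_i^t = \begin{cases} 1 & \text{if } Y_i^t = 0 \ \&\ e = \arg\max_{e \in \varepsilon_t} Y_i^t \\ 1 & \text{if } Y_i^t = 1 \ \&\ e = \arg\max_{e \in e_o \cup \varepsilon_{1:t-1}} Y_i^{t-1} \ \&\ u < \tau \\ 0 & \text{otherwise} \end{cases} \tag{8}$$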
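The case analysis of Equation (8) can be read as a per-token decision rule. The following sketch is one such reading, written with string tags instead of one-hot vectors. It assumes that $u$ is an uncertainty score attached to the old model's prediction and that an all-zero target means the token is skipped; both points, as well as the function name `pseudo_label`, are assumptions and are not confirmed by the text.

```python
from typing import Optional


def pseudo_label(gold: str, old_pred: str, uncertainty: float, tau: float,
                 outside: str = "O") -> Optional[str]:
    """Return the hard pseudo-label of one token, or None for an all-zero target."""
    if gold != outside:
        # Annotated token of a current entity type: replicate the ground truth.
        return gold
    if uncertainty < tau:
        # Confident old-model prediction: an old entity type or the non-entity type.
        return old_pred
    # Unconfident prediction: all-zero one-hot vector, so no label is assigned.
    return None


print(pseudo_label("CITY", "O", 0.9, 0.5))  # current-step annotation is kept
print(pseudo_label("O", "AGE", 0.1, 0.5))   # confident old prediction is copied
print(pseudo_label("O", "AGE", 0.9, 0.5))   # unconfident prediction is dropped
```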
Let $y_i^o$ and $y_i^n$ be the predictions of the old model $M_{t-1}$ and the new model $M_t$, respectively. Inspired by contrastive learning, we propose the similarity measurement in Equation (9).
$$S(e_i^o, e_i^n) = \begin{cases} 1 - \cos(e_i^o, e_i^n) & \text{if } y_i^o \neq y_i^n \\ \max\left(0,\ 1 - \cos(e_i^o, e_i^n) + margin\right) & \text{if } y_i^o = y_i^n \end{cases} \tag{9}$$
When the prediction $y_i^o$ of the old model $M_{t-1}$ equals the prediction $y_i^n$ of the new model $M_t$, we use a margin to ignore the differences between similar features. In this manner, we force $M_t$ to focus on those features with large differences. When the predicted labels are different, $S(e_i^o, e_i^n)$ compels the model to minimize the differences.
Given the imbalanced distribution of labels in the current task, we follow CPFD and incorporate a weighting strategy for pseudo-label learning. Therefore, $\| \hat{Y}_l - Y_l \|_F^2$ in Equation (6) can be replaced by a balanced pseudo-labeling cross-entropy loss, summarized as Equation (10).
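As a minimal sketch of Equation (9) and the sum in Equation (7), the functions below compute the cosine-based measure for one token and add it up over the confident tokens. The names `similarity` and `soft_label_loss` and the margin value are illustrative assumptions. With the sign convention of Equation (9), the hinge only zeroes out small distances when the margin is negative, so the example uses a hypothetical margin of -0.1.

```python
import math
from typing import Iterable, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity(e_old: Sequence[float], e_new: Sequence[float],
               y_old: str, y_new: str, margin: float) -> float:
    """Per-token measure S following Equation (9)."""
    distance = 1.0 - cosine(e_old, e_new)
    if y_old != y_new:
        # The two models disagree: penalise the full cosine distance.
        return distance
    # The two models agree: apply the hinge with the margin.
    return max(0.0, distance + margin)


Token = Tuple[Sequence[float], Sequence[float], str, str]  # (e_old, e_new, y_old, y_new)


def soft_label_loss(tokens: Iterable[Token], margin: float) -> float:
    """Sum of S over the confident tokens (the quantity minimised in Equation (7))."""
    return sum(similarity(e_o, e_n, y_o, y_n, margin) for e_o, e_n, y_o, y_n in tokens)


confident_tokens = [
    ([1.0, 0.0], [0.0, 1.0], "AGE", "O"),    # disagreement, orthogonal features
    ([1.0, 0.0], [1.0, 0.0], "AGE", "AGE"),  # agreement, identical features
]
print(soft_label_loss(confident_tokens, margin=-0.1))
```

In this toy input only the first token contributes: the second token has identical features and matching predictions, so its hinge term is clipped to zero.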
$$\mathcal{L}_{balance\text{-}pseudo} = -\frac{1}{|X^t|} \sum_{i}^{|X^t|} \eta_i \hat{Y}_i^t \log \hat{Y}_i^t \tag{10}$$
where $X^t$ and $\hat{Y}_i^t$ refer to the sentence and the $i$-th prediction in the current task, and $|X^t|$ is the number of tokens. $\eta_i$ denotes the balance factor, which is calculated using Equation (11). Here, $N_{old}$ and $N_{new}$ are the numbers of labels predicted by $M_{t-1}$ and $M_t$, respectively.
$$\eta_i = \begin{cases} 0.5 + \sigma\left(\frac{N_{old}}{N_{new}}\right) & \text{if } \hat{Y}_i^t \in \varepsilon_{1:t-1} \\ 1 & \text{otherwise} \end{cases} \tag{11}$$
Finally, the total loss of our proposed method is shown in Equation (12), in which $\lambda$ is a trade-off parameter and $\theta_t$ is the set of learnable parameters in $M_t$. Here, we use $\mathcal{L}_{PFD}$ in Equation (12) for its best performance in general. For different choices of the pooled features, we provide a comparison in Section 4.3 by replacing $\mathcal{L}_{PFD}$ with $\mathcal{L}_{FD}$ and $\mathcal{L}_{PFD\text{-}lax}$, respectively.
$$\mathcal{L}(\theta_t) = \mathcal{L}_{balance\text{-}pseudo} + \lambda \mathcal{L}_{PFD} + (1 - \lambda) \mathcal{L}_{soft} \tag{12}$$
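The balance factor and the total loss are simple scalar computations, sketched below. The sketch reads the argument of $\sigma$ in Equation (11) as the ratio $N_{old} / N_{new}$ and takes $\sigma$ to be the logistic sigmoid; both readings, and the function names, are assumptions. The default `lam=0.6` mirrors the balancing weight reported in the experimental settings.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid (assumed meaning of sigma in Equation (11))."""
    return 1.0 / (1.0 + math.exp(-x))


def balance_factor(is_old_type: bool, n_old: int, n_new: int) -> float:
    """Weight eta_i for one token, assuming the sigmoid argument is N_old / N_new."""
    if not is_old_type:
        return 1.0
    return 0.5 + sigmoid(n_old / n_new)


def total_loss(l_balance_pseudo: float, l_pfd: float, l_soft: float, lam: float = 0.6) -> float:
    """Combine the three losses as in Equation (12)."""
    return l_balance_pseudo + lam * l_pfd + (1.0 - lam) * l_soft


print(balance_factor(True, n_old=3, n_new=5))  # weight of a token pseudo-labelled with an old type
print(total_loss(1.0, 2.0, 3.0, lam=0.5))      # toy loss values
```

Under this reading a token pseudo-labelled with an old type always receives a weight between 1.0 and 1.5, since the ratio is non-negative.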
To provide a comprehensive description of the training process, we present the pseudocode for each CL step in Algorithm 1. Initially, when no old model $M_{t-1}$ is given, ConSOLI trains $M_t$ via conventional supervised training by minimizing the classic cross-entropy loss over $D_t$ (lines 1–3). The corresponding inference performance on the evaluation set $D_{dev}$ is recorded as the current best score $F_b$. If $M_{t-1}$ exists, then ConSOLI performs CL training. More specifically, ConSOLI first uses $M_{t-1}$ to make inferences on $D_t$. In the process, the algorithm sequentially calculates $\mathcal{L}_{PFD}$, $\mathcal{L}_{balance\text{-}pseudo}$, and $\mathcal{L}_{soft}$ (lines 6–15). Finally, $M_t$ is updated by minimizing the combination of the three losses. For model selection, we use Micro-F1 as the measurement, which is identical to the implementation of CPFD. If the performance $F_t$ of $M_t$ on the evaluation set $D_{dev}$ is better than the current best score $F_b$, $M_t$ is saved. Otherwise, the model is discarded.
Algorithm 1: Pseudocode for CL-NER through confident soft-label imitation.

4. Experiments

In this section, we first introduce the datasets and experimental settings relevant for this paper in Section 4.1. Comparisons with other CL methods are summarized as the main results in Section 4.2, together with the effects of different soft-label combinations. Section 4.3 shows the ablation study and Section 4.4 gives a case study.

4.1. Datasets and Settings

To evaluate the effectiveness of ConSOLI, we conducted comprehensive experiments on two challenging and widely used datasets: I2B2 [40] and OntoNotes5 [15]. Meanwhile, we additionally selected two datasets (CoNLL2003 [41] and BioNLP11ID [42]) with relatively smaller sizes for comparison. The statistics of the datasets are summarized in Table 1.
We follow the same split strategy as in CPFD and CFNER. Specifically, the training set is divided evenly into several disjoint subsets, each of which corresponds to one continual learning step. In each subset, we keep the labels of the entities for the current CL step, while relabeling the others as the non-entity type (the “O” tag in our implementation). Detailed descriptions of the sampling algorithm used to divide the datasets can be found in Appendix B of [14]. The entity labels are arranged in alphabetical order for I2B2, OntoNotes5, and CoNLL2003. Unlike the other datasets, entities in BioNLP11ID are extremely imbalanced. For instance, “Regulon” has only 50 samples in the test set. To model real-world scenarios, we order the entity types of BioNLP11ID by their first appearance in the dataset.
Experimental Settings. We follow settings similar to those introduced in [10]. Specifically, we use “FG” to denote the entities used to initialize the base model. For the CL process, we use “PG” to indicate the entity types per task. Therefore, the settings for experiments can be denoted by “FG-a-PG-b”, in which a refers to the number of entity types for the base model and b is the number of entity types used for each CL task. Considering the different entity types in the benchmark datasets, and for a fair comparison with other baselines, we follow the same CL settings as CPFD and apply two CNER settings for CoNLL2003 and BioNLP11ID (i.e., FG-1-PG-1 and FG-2-PG-1), and four settings for I2B2 and OntoNotes5 (i.e., FG-1-PG-1, FG-2-PG-2, FG-8-PG-1, and FG-8-PG-2). The settings “FG-1-PG-1”, “FG-2-PG-1”, and “FG-2-PG-2” evaluate the impact of a small number of entity types in the initial task, and they also have fewer data than the settings “FG-8-PG-1” and “FG-8-PG-2”. In contrast, “FG-8-PG-1” and “FG-8-PG-2” analyze the effects of a larger number of entity types in the initial data. For each task, we kept the current entity types in the validation set and masked the others as non-entity types. In each CL step, we kept the labels for the current and all previous entity types in the test set while masking the rest as non-entity types. For a fair comparison, we used the same BERT-base-cased model as the encoder, which has 12 layers and an attention head count $K$ of 12. The classifier was constructed through a fully connected layer. For each experiment, we allowed the model to train for 20 epochs in every CL step. The training batch size, learning rate, and balancing weight $\lambda$ were set to 8, 4e-2, and 0.6, respectively. All experiments were conducted on an NVIDIA A6000 GPU with 48 GB memory. Each experiment reported in this section was run 5 times for statistical purposes.
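The relabeling step described above can be sketched as a small helper. It assumes tags are either plain entity types or carry a "B-"/"I-" prefix; the tagging scheme and the name `mask_labels` are assumptions made for illustration, not details given in the source.

```python
from typing import List, Set


def mask_labels(tags: List[str], current_types: Set[str], outside: str = "O") -> List[str]:
    """Keep tags whose entity type belongs to the current CL step; map the rest to `outside`."""
    masked = []
    for tag in tags:
        # Strip an optional "B-" / "I-" prefix to recover the entity type.
        entity_type = tag.split("-", 1)[1] if tag[:2] in ("B-", "I-") else tag
        masked.append(tag if entity_type in current_types else outside)
    return masked


tags = ["O", "B-AGE", "O", "B-CITY", "I-CITY", "O"]
print(mask_labels(tags, {"CITY"}))  # AGE is not a current type here, so it becomes "O"
```

This also makes the semantic shift of the non-entity label visible: after masking, the "O" tag covers both true non-entities and the entity types that are not annotated in the current step.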
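To make the “FG-a-PG-b” notation concrete, the sketch below splits an ordered list of entity types into a base task and the subsequent CL tasks. The function name `task_schedule` is hypothetical, and the handling of a final task with fewer than b types is a choice of this sketch rather than something stated in the source. The example uses the four CoNLL2003 entity types in alphabetical order.

```python
from typing import List


def task_schedule(entity_types: List[str], fg: int, pg: int) -> List[List[str]]:
    """Split ordered entity types into a base task of `fg` types and CL tasks of `pg` types."""
    if fg <= 0 or pg <= 0:
        raise ValueError("fg and pg must be positive")
    tasks = [entity_types[:fg]]
    for start in range(fg, len(entity_types), pg):
        # The last task may hold fewer than `pg` types if the split is uneven.
        tasks.append(entity_types[start:start + pg])
    return tasks


conll_types = ["LOC", "MISC", "ORG", "PER"]
print(task_schedule(conll_types, fg=1, pg=1))  # FG-1-PG-1
print(task_schedule(conll_types, fg=2, pg=1))  # FG-2-PG-1
```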
Baseline Methods. We evaluate ConSOLI against recent works using CL for NER. The baselines include ExtendNER [32], CFNER [14], and CPFD [10]. Similar to the experiments in [14] and [10], we also compare our method with implementations of CL in other fields, i.e., Self-Training (ST) [28,29], LUCIR [30], and PODNet [31]. In the meantime, fine-tuning (FT) without any anti-forgetting strategies is also included as a lower bound for comparison. A detailed introduction to the baselines can be found in Appendix A.
Evaluation metrics. For a consistent comparison with existing baselines, especially CPFD, we select the same metrics, i.e., Macro-F1 (Ma-F1) and Micro-F1 (Mi-F1), to evaluate the overall performance. In addition, we evaluate the performance of each entity using F1 scores. We gather the mean results across all steps to form our main findings, similar to the approach used in CPFD. To provide a more detailed analysis of catastrophic forgetting, we offer step-wise performance comparison in line plots.
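The two aggregate metrics differ in how per-type results are combined, which matters when entity types are imbalanced. The sketch below computes both from per-type counts of true positives, false positives, and false negatives; how those counts are obtained (for example, entity-level exact matching) is not specified here and is left as an assumption, as is the function name `micro_macro_f1`.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(per_type: Dict[str, Counts]) -> Tuple[float, float]:
    """Return (Micro-F1, Macro-F1) over the given entity types."""
    # Macro-F1: average the per-type F1 scores, so every type weighs the same.
    macro = sum(f1(*counts) for counts in per_type.values()) / len(per_type)
    # Micro-F1: pool the counts first, so frequent types dominate.
    tp = sum(c[0] for c in per_type.values())
    fp = sum(c[1] for c in per_type.values())
    fn = sum(c[2] for c in per_type.values())
    return f1(tp, fp, fn), macro


counts = {"FREQUENT": (9, 1, 1), "RARE": (0, 1, 1)}  # toy counts
micro, macro = micro_macro_f1(counts)
print(round(micro, 4), round(macro, 4))
```

In this toy case the rare type is never recognized, which halves Macro-F1 relative to the frequent type's score while Micro-F1 stays much higher.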

4.2. Main Results

The average performances on I2B2 and OntoNotes5 are summarized in Table 2. To offer a fair comparison, we chose the same CL settings as CPFD for the results in Table 2 and Table 3 and Figure 2 and Figure 3. For the reproduced results, we use the codes and model settings reported in the respective paper. We also ran the process 5 times for each CL setting in Table 2 to maintain consistency with the results in our method. To provide a detailed analysis of the improvements on historical tasks, we also visualize the step-wise performance for CPFD and our model in Figure 2. For more results regarding the baselines, we visualize their step-wise performances in Appendix B.
In general, our method surpasses the other baselines, yielding improvements ranging from 38.07 (35.39) to 61.14 (51.39) in Micro-F1 (Macro-F1). Moreover, when compared with CPFD, ConSOLI maintains improvements ranging from 0.75 (0.85) to 1.04 (3.42) in Micro-F1 (Macro-F1) on I2B2. Similar observations hold for OntoNotes5, except for the setting “FG-8-PG-2”, where the improvements are marginal, as is also confirmed by a significance test. Compared with the results in “FG-8-PG-1”, the improvements drop as the amount of annotated data in each CL step grows. Considering the data as the only factor, the reason for this phenomenon is the distribution of different entity types in each CL step. At the same time, the overall performance of ConSOLI increases as the volume of the initial dataset increases, as can be seen by comparing “FG-1-PG-1” (“FG-2-PG-2”) with “FG-8-PG-1” (“FG-8-PG-2”). The same trend applies to CPFD in Table 2. These results validate the superiority of our method in learning a robust CL-NER model. In addition, as shown in Figure 2, our method shows smaller performance degradation in most previous tasks. For instance, when identifying the entity “AGE” of I2B2 in Figure 2a, the performance of our method decreases from 87.90 to 71.96, while that of CPFD falls to 63.5. The results indicate that our method preserves more valid knowledge from the old tasks than CPFD, which helps the model retain its ability to recognize old entities. For new tasks with limited old entity tokens in the sentences, such as “COUNTRY” in Figure 2a–d, the performance of our method drops, as does that of CPFD. The reason is that ConSOLI uses the supervised signal from pseudo-labels to update $M_{t-1}$. When $D_t$ contains limited old entity tokens, the corresponding values in $L_{\text{soft}}$ and $L_{\text{balance}}$ contribute less than those of the current entity types. Therefore, the minimization of $L(\theta_t)$ may shift toward learning the current entity types. An example can be found in Figure 2b, as “CITY” has limited tokens in future tasks. As a result, the performance of ConSOLI decreases drastically from over 40 to 0. However, when there are fewer data in each CL step (e.g., Figure 2c,d), our method still outperforms CPFD. These results further indicate that ConSOLI preserves more valid knowledge for identifying old entities by mitigating the performance drop in each CL step. Note that the performance improvements in Table 2 are smaller than those for the old entities in Figure 2. Although we do not show the step-wise performance for every entity, CPFD performed better in recognizing new entities, thereby narrowing the overall performance gap with our method. Considering that we focus on the anti-forgetting mechanism, our method still yields a superior CL-NER approach, as the overall performance of ConSOLI (as shown in Table 2) was generally better than that of CPFD.
We also summarize the results on CoNLL2003 and BioNLP11ID in Table 3, and the corresponding step-wise performance is visualized in Figure 3. With smaller datasets, the performance differences become smaller. However, ConSOLI still outperforms CPFD, especially on BioNLP11ID. Owing to the small volume of CoNLL2003 and BioNLP11ID, the changes in each CL step are relatively small. Except for the first entity “REGULON”, which has a limited number of samples, ConSOLI leads CPFD by up to 8.72 (9.72) in Micro-F1 (Macro-F1). For “REGULON”, the validation set contains only 27 out of 624 and the test set 50 out of 624. This is a typical few-shot learning scenario, in which both CPFD and ConSOLI fail, which explains why the Micro-F1 scores in Figure 3c,d are zero. At the same time, ConSOLI maintains its advantage on the oldest entities. For the entity “ORGANISM”, the performance of our method drops from 76.84 to 66.31, while that of CPFD falls to 33.96. These findings verify that our method is better able to resist catastrophic forgetting than CPFD. To compare with the other baselines, we also visualize their results in Figure A1, Figure A2 and Figure A3. Although the baselines yield diverse results, the same conclusion can be drawn, since our method still maintains better performance on the old entities. A detailed analysis and comparison can be found in Appendix B.
To evaluate the effects of $|L_e|$ in Equation (5), we further conducted experiments using the last three hidden states. The results are summarized in Table 4. “−1” indicates that the soft-label is constructed using only the last hidden state. “−2:−1” and “−3:−1” refer to the combinations of the last two and the last three hidden states, respectively. As shown in Table 4, ConSOLI generally performs well in the “−1” setting. For I2B2 and OntoNotes5, “−1” outperforms the other two settings by less than 1%. For the smaller dataset BioNLP11ID, the “−2:−1” setting outperforms the others by over 0.93 in Mi-F1 and 0.9 in Ma-F1. For CoNLL2003, although the “−1” setting outperforms the others, the lead is only 0.43 over “−2:−1”.
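To make the layer specifications concrete, the sketch below selects the hidden states named by “−1”, “−2:−1”, or “−3:−1” from a stack of encoder layers. How the selected layers are combined in Equation (5) is not restated in this section, so the token-wise mean used here is purely an assumption for illustration, as are the function names and the nested-list data layout (layers, then tokens, then features).

```python
from typing import List

Vector = List[float]
Layer = List[Vector]  # one vector per token


def select_layers(hidden_states: List[Layer], spec: str) -> List[Layer]:
    """Return the layers named by `spec`, e.g. "-1", "-2:-1" or "-3:-1".

    Both ends of the range are inclusive and counted from the top layer,
    so "-2:-1" yields the last two layers.
    """
    if ":" in spec:
        first, last = (int(part) for part in spec.split(":"))
    else:
        first = last = int(spec)
    n = len(hidden_states)
    if not (first <= last <= -1) or -first > n:
        raise ValueError(f"invalid layer spec {spec!r} for {n} layers")
    return hidden_states[n + first:n + last + 1]


def mean_pool_layers(layers: List[Layer]) -> Layer:
    """Average the selected layers token by token (an assumed combination rule)."""
    num_layers = len(layers)
    seq_len = len(layers[0])
    dim = len(layers[0][0])
    return [
        [sum(layer[t][d] for layer in layers) / num_layers for d in range(dim)]
        for t in range(seq_len)
    ]
```

With a real encoder, `hidden_states` would be the per-layer outputs of the model; the nested lists stand in for tensors so that the example has no external dependencies.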

4.3. Ablation Study

To examine the effectiveness of each component in our method, we conducted an ablation study and summarized the results in Table 5. For consistency with Table 2 and Figure 3, we chose the implementation setting of ConSOLI presented in Table 2. Therefore, the soft-label imitation only uses the last hidden states. There are four components in ConSOLI, i.e., pooled feature distillation ($L_{\text{PFD}}$), confident pseudo-labeling (CPL), adaptive re-weighting type-balanced learning (ART), and soft-label imitation (SL). We therefore remove each component separately to analyze its impact on performance.
As shown in Table 5, we first evaluate the effects of different feature distillation strategies by replacing $L_{\text{PFD}}$ with its alternatives $L_{\text{FD}}$ (“w/ $L_{\text{FD}}$”) and $L_{\text{PFD-lax}}$ (“w/ $L_{\text{PFD-lax}}$”), respectively. In general, replacing $L_{\text{PFD}}$ with either alternative causes a performance drop on all benchmarks. The maximum drop reaches 4.95 (6.77) in Micro-F1 and 3.11 (4.84) in Macro-F1 for “w/ $L_{\text{FD}}$” (“w/ $L_{\text{PFD-lax}}$”). Similar to the results in CPFD, this indicates that either plasticity or stability is diminished when no pooling is applied (Equation (1)) or when the pooling is excessive ($L_{\text{PFD-lax}}$). To further analyze the impact of each component in ConSOLI, we removed them one at a time and analyzed the performance changes. Removing CPL corresponds to removing $L_{\text{balance}}^{\text{pseudo}}$ from Equation (12). Compared with “Ours”, there is a significant performance drop. For example, the Micro-F1 of ConSOLI decreases by 18.55 on I2B2, and the Macro-F1 is reduced by 13.71. Similar results can also be observed on CoNLL2003 and BioNLP11ID. The results indicate that CPL is a vital component of the method. Without CPL, the new model no longer receives the confident pseudo-labels predicted by the old model and therefore loses track of its knowledge of the historical entity labels. Removing ART corresponds to removing the balancing factor $\eta_i$ in Equation (10). The resulting performance drop is smaller than that for CPL, especially on small datasets such as CoNLL2003 and BioNLP11ID. Removing SL from ConSOLI corresponds to removing $L_{\text{soft}}$ from Equation (12), which reduces the method to our re-implementation of CPFD in Table 2 and Table 3. The performance drops, although not as much as it does without CPL, suggesting that confident soft-label imitation does contribute to the overall improvements. The visualization in Section 4.4 also shows that ConSOLI helps to increase the model’s ability to recognize old entities. These results further indicate that by preserving latent feature similarity, the new model becomes more resilient to catastrophic forgetting.

4.4. Case Study

Figure 4 shows a case study on OntoNotes5. Note that PODNet and LUCIR were originally designed for image classification, and that ST and FT have relatively low performance in Table 2. Therefore, we only compare CPFD, CFNER, ExtendNER, and ConSOLI in this subsection. For all the methods, we used the models trained over all the CL steps on OntoNotes5 and selected two sentences that contain the oldest entities (i.e., DATE and CARDINAL).
For the first sentence, all four methods successfully recognized the DATE tokens. However, CFNER and ExtendNER had trouble identifying the tokens of the CARDINAL type. Both misclassified them as DATE or QUANTITY, which indicates that these models suffer from forgetting of older entity types. CPFD recognized only one token that belonged to CARDINAL and misclassified the other as QUANTITY. In comparison, our method successfully recognized both CARDINAL tokens. Similar observations can be made in the second case, in which ConSOLI was the only method that predicted CARDINAL tokens. Moreover, in recognizing DATE tokens, our method outperformed CPFD by three successive tokens (“Friday the 13th”). Although CPFD and the other methods correctly tagged the first DATE token “the”, their results are debatable, since they split the semantically complete phrase “the only other Friday” into two or more entities. These results demonstrate the advantage of our method in recognizing older entity types.

5. Discussion

As validated in our experiments, ConSOLI alleviates the forgetting issue of CPFD in recognizing older entity categories. However, its overall performance is only equivalent to, or slightly better than, that of CPFD. This indicates that ConSOLI is less effective than CPFD at learning new entity types. Part of the reason may be that the pooled features and the soft-labeling process capture different levels of linguistic knowledge. Neither our method nor CPFD has fully explored how to quantify such linguistic knowledge in the different layers of the NER model. Moreover, for both ConSOLI and CPFD, the preservation of knowledge in the old model relies entirely on inference over the current data. When the new data contain few ground-truth instances of the old entities, this knowledge is gradually lost during the update process. Future directions include exploring valid knowledge for entities with limited samples, integrating large language models to provide valuable prior knowledge, and scaling the method to other tasks.

6. Conclusions

In this paper, we studied CL-NER with the aim of alleviating the catastrophic forgetting problem. Inspired by the previous method CPFD, we proposed a confidence-based soft-label imitation approach, namely ConSOLI. To reduce forgetting in recognizing the old entities while maintaining equivalent or better overall performance, we incorporated soft-label distillation to summarize the knowledge learned by old, non-updated models. In addition, we constructed a confidence-based imitation task to help preserve knowledge, thereby decreasing the effects of catastrophic forgetting. To validate the effectiveness of ConSOLI, comprehensive experiments were conducted on two large benchmarks and two small datasets. The results under different CL settings show that ConSOLI outperforms the other baselines. Specifically, when compared with CPFD, ConSOLI achieves improvements of up to 8.72 and 9.72 in Micro-F1 and Macro-F1, respectively.

Author Contributions

H.Z. was involved in the conceptualization, methodology, formal analysis, original draft, review and editing, and supervision. L.Z. contributed to the methodology, validation, original draft and editing. M.G. contributed to the implementation, data curation, and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful comments.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Baseline Introductions

The baseline methods compared in Section 4 are introduced as follows. We follow the same experimental settings as CPFD. For a fair comparison, no old samples are reserved for LUCIR and PODNet, and we keep the same hyper-parameter settings for LUCIR and ExtendNER as those used in [14].
  • Fine-Tuning (FT): FT refers to the fine-tuning process of a pre-trained model. In CL-NER, we adopt FT over $D_t$ for updating $M_{t-1}$. Therefore, it has no anti-forgetting mechanism. In our implementation, we use a pre-trained BERT model as the base model and update it in each CL step to evaluate the performance.
  • Self-Training (ST): ST uses the old model $M_{t-1}$ to annotate the non-entity tokens in the current dataset $D_t$ with the old entity types. The predicted labels of the old entities are treated as the ground truth during the update of the new model $M_t$. In the CL process, the new model $M_t$ is updated by minimizing the cross-entropy loss over the annotated data and the ground truth in $D_t$.
  • ExtendNER: Similar to ST, ExtendNER uses the old model $M_{t-1}$ to infer each non-entity token. Instead of directly using the predicted entity type (the hard label), the probability distribution over the old entity categories is used. Then, the KL divergence loss is calculated for the non-entity class and combined with the cross-entropy loss over the tokens of the current entity types. The new model $M_t$ is updated by minimizing these losses. Although both our method and ExtendNER use soft-labels, the soft-label in ExtendNER is constructed from the classifier, while our method extracts the linguistic knowledge in the hidden states for soft-labels.
  • LUCIR: LUCIR was originally designed for image classification. In addition to the cross-entropy loss on the new categories, it uses a distillation loss to constrain the features from $M_{t-1}$ and $M_t$. Moreover, to maintain the knowledge from $M_{t-1}$, LUCIR reserves some samples for the old classes, which are then used to compute a margin-ranking loss. In our implementation, no reserved samples are given. Therefore, the margin-ranking loss is calculated over the non-entity tokens in $D_t$ instead of the reserved samples. Our method uses the same idea as LUCIR to preserve knowledge using features. The difference is that our method explores different combinations of hidden states and the soft-label is only calculated over confident non-entity tokens.
  • PODNet: PODNet is another CL method for image classification. Similar to LUCIR, PODNet also relies on knowledge distillation from $M_{t-1}$. The difference is that PODNet uses a distillation loss to constrain the output of each convolutional layer, while LUCIR only considers the output of the last one. Moreover, PODNet replaces the cross-entropy loss with the NCA loss for classification. In the re-implementation of PODNet for CL-NER, the distillation uses the intermediate outputs of BERT.
  • CFNER: Similar to ExtendNER, CFNER calculates the probability distribution over the old entity types from $M_{t-1}$. The difference is that CFNER focuses on distilling the causal effects from the non-entity tokens as an anti-forgetting measure. Similar to CPFD, it also has a dynamic balancing strategy to distinguish between new entity and non-entity types.
  • CPFD: CPFD balances stability and plasticity by using a pooled feature distillation loss to constrain the updating of $M_{t-1}$. In addition, CPFD addresses the label imbalance problem in $D_t$ and proposes a balanced pseudo loss. Unlike CFNER, only the confident pseudo-labels in the non-entity class are used to compute the loss, thereby limiting the noise caused by $M_{t-1}$ and the label shift. Our method differs from CPFD in that we further introduce soft feature distillation and a confidence-based loss to help preserve the knowledge from $M_{t-1}$, thereby increasing the performance on most previous entity types.
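The difference between the hard pseudo-labels of ST and the soft-label distillation of ExtendNER, as described in the list above, can be sketched as follows. This is a simplified illustration rather than either baseline's released implementation: the function names, the class layout (index 0 is the non-entity class, old classes come first, new classes are appended), the renormalization of the new model's probabilities over the old classes, and the per-token averaging are all assumptions.

```python
import math
from typing import List, Sequence

EPS = 1e-12  # numerical guard for logarithms and divisions


def self_training_labels(gold: Sequence[int], old_probs: Sequence[Sequence[float]],
                         non_entity: int = 0) -> List[int]:
    """ST: relabel each non-entity token with the old model's argmax (hard label)."""
    labels = []
    for label, probs in zip(gold, old_probs):
        if label == non_entity:
            labels.append(max(range(len(probs)), key=probs.__getitem__))
        else:
            labels.append(label)
    return labels


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + EPS) / (qi + EPS)) for pi, qi in zip(p, q) if pi > 0)


def extendner_style_loss(gold: Sequence[int], old_probs: Sequence[Sequence[float]],
                         new_probs: Sequence[Sequence[float]], non_entity: int = 0) -> float:
    """KL term on non-entity tokens plus cross-entropy on current-entity tokens."""
    n_old = len(old_probs[0])
    total = 0.0
    for label, p_old, p_new in zip(gold, old_probs, new_probs):
        if label == non_entity:
            # Compare the old model's soft label with the new model's
            # distribution restricted to the old classes (assumed renormalized).
            mass = max(sum(p_new[:n_old]), EPS)
            q = [x / mass for x in p_new[:n_old]]
            total += kl_divergence(p_old, q)
        else:
            # Gold-labelled token of a current entity type: cross-entropy.
            total += -math.log(p_new[label] + EPS)
    return total / len(gold)
```

In this layout, a token that the old model confidently assigns to an old entity type receives that type as a hard label under ST, whereas the ExtendNER-style loss keeps the full distribution and penalizes the new model in proportion to how far it drifts from it.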

Appendix B. Visualizations of Step-Wise Performance for Other Baseline Methods

To complement Figure 2 and Figure 3, we visualize the results of the other baselines to evaluate the step-wise performance on the oldest tasks. Figure A1, Figure A2 and Figure A3 show the visualizations for I2B2, OntoNotes5, and CoNLL2003 together with BioNLP11ID, respectively. For each dataset, we select the first entity type to analyze the forgetting issue in CL-NER, except in Figure A2a. In Figure A2a, the performance on the first entity drops drastically to 0 after the initial step for all baselines. Therefore, we select the second entity type “DATE” for better demonstration.
As reported in CPFD, the old entity types have fewer samples than the new entity types. Fewer samples lead to a weaker supervised signal in the learning process, thereby resulting in a performance drop. This phenomenon can be observed in Figure 2, Figure 3, Figure A1, Figure A2 and Figure A3. In addition, the forgetting problem is alleviated as the amount of data available at the initial CL step and at each subsequent CL step increases. For instance, the performance drop in the setting “FG-8-PG-1” (Figure A1c) is mitigated at each step when compared with the setting “FG-1-PG-1” (Figure A1a). Similar results can also be found in Figure A2 and Figure A3. Compared with these baselines, the results in Figure 2 and Figure 3 show that ConSOLI achieves relatively better performance across all CL steps. Take the setting “FG-1-PG-1” on I2B2 as an example (Figure 2a): ConSOLI keeps the performance on “AGE” above 70, while the baselines in Figure A1a fall to 0 after the 5th step. The performance gap is over 70. Combined with the main results in Table 2, we can conclude that, compared with the other baselines, ConSOLI mitigates catastrophic forgetting, especially in early tasks; this also holds for the results in Figure A2 and Figure A3.
Figure A1. The step-wise performance on I2B2. To monitor the step-wise performance changes, we visualize the entity type “AGE” for all settings.
When comparing the results on the small datasets, i.e., CoNLL2003 and BioNLP11ID, ST, LUCIR, ExtendNER, and CFNER perform relatively better than FT and PODNet. For instance, CFNER and LUCIR achieve 77.68 and 76.87 on CoNLL2003 under the setting “FG-1-PG-1”, respectively. However, compared with these results, ConSOLI in Figure 3a still leads by over 9. Note that for BioNLP11ID we select the second entity “ORGANISM” instead of “REGULON”. Owing to the limited number of sentences that contain “REGULON”, all the baselines and our method have F1 scores of 0 across all CL steps. Therefore, we visualize the step-wise performance of the second entity type for better demonstration. As shown in Figure 3c,d, when the future tasks contain limited old entity tokens, the preservation of the knowledge in $M_{t-1}$ becomes challenging. In particular, under the setting “FG-2-PG-1”, all baselines in Figure A3d fail to maintain the capability of recognizing “ORGANISM” after the second task. In Figure 3d, CPFD also suffers from forgetting, as its score is reduced from 76.84 to 33.96, while ConSOLI maintains a score of over 60 for all CL steps. These results further validate the improvements of our method.
Figure A2. The step-wise performance on OntoNotes5. To monitor the step-wise performance changes, we visualize the entity type “CARDINAL” for all settings except for the first one (“FG-1-PG-1”). In “FG-1-PG-1”, all baselines reach 0 right after the second task. Therefore, we visualize the second entity type “DATE” instead.
Figure A3. The step-wise performance on CoNLL2003 and BioNLP11ID. To monitor the step-wise performance changes, we visualize the entity type “LOCATION” for CoNLL2003 and “ORGANISM” for BioNLP11ID.

References

  1. Gligic, L.; Kormilitzin, A.; Goldberg, P.; Nevado-Holgado, A. Named entity recognition in electronic health records using transfer learning bootstrapped Neural Networks. Neural Netw. 2020, 121, 132–139.
  2. Wang, D.; Feng, X.; Liu, Z.; Wang, C. 2M-NER: Contrastive learning for multilingual and multimodal NER with language and modal fusion. Appl. Intell. 2024, 54, 6252–6268.
  3. Liu, X.; Huang, H.; Zhang, Y. Open Domain Event Extraction Using Neural Latent Variable Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics; Korhonen, A., Traum, D., Màrquez, L., Eds.; Association for Computational Linguistics: Florence, Italy, 2019; pp. 2860–2871.
  4. Wang, Y.; Han, X.; Zhou, F.; Wang, Y.; Deng, C.; Feng, J. Distill-AER: Fine-Grained Address Entity Recognition from Spoken Dialogue via Knowledge Distillation. In Proceedings of the Natural Language Processing and Chinese Computing; Lu, W., Huang, S., Hong, Y., Zhou, X., Eds.; Springer: Cham, Switzerland, 2022; pp. 643–655.
  5. Singh, A.; Saha, S. GraphIC: A graph-based approach for identifying complaints from code-mixed product reviews. Expert Syst. Appl. 2023, 216, 119444.
  6. Zhang, Y.; Yang, J. Chinese NER Using Lattice LSTM. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Gurevych, I., Miyao, Y., Eds.; Association for Computational Linguistics: Melbourne, Australia, 2018; pp. 1554–1564.
  7. Žukov-Gregorič, A.; Bachrach, Y.; Coope, S. Named Entity Recognition With Parallel Recurrent Neural Networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Gurevych, I., Miyao, Y., Eds.; Association for Computational Linguistics: Melbourne, Australia, 2018; pp. 69–74.
  8. Zhou, R.; Xie, Z.; Wan, J.; Zhang, J.; Liao, Y.; Liu, Q. Attention and Edge-Label Guided Graph Convolutional Networks for Named Entity Recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 6499–6510.
  9. Chen, Y.; He, L. SKD-NER: Continual Named Entity Recognition via Span-based Knowledge Distillation with Reinforcement Learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 6689–6700.
  10. Zhang, D.; Cong, W.; Dong, J.; Yu, Y.; Chen, X.; Zhang, Y.; Fang, Z. Continual Named Entity Recognition without Catastrophic Forgetting. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 8186–8197.
  11. Wang, Z.; Wang, X.; Hu, W. Continual Event Extraction with Semantic Confusion Rectification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 11945–11955.
  12. Xiong, W.; Song, Y.; Wang, P.; Li, S. Rationale-Enhanced Language Models are Better Continual Relation Learners. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 15489–15497.
  13. Song, Y.; Wang, P.; Xiong, W.; Zhu, D.; Liu, T.; Sui, Z.; Li, S. InfoCL: Alleviating Catastrophic Forgetting in Continual Text Classification from An Information Theoretic Perspective. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 14557–14570.
  14. Zheng, J.; Liang, Z.; Chen, H.; Ma, Q. Distilling Causal Effect from Miscellaneous Other-Class for Continual Named Entity Recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 3602–3615.
  15. Pradhan, S.S.; Xue, N. OntoNotes: The 90% Solution. In Proceedings of the Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts; Chelba, C., Kantor, P., Roark, B., Eds.; Association for Computational Linguistics: Boulder, CO, USA, 2009; pp. 11–12.
  16. Liu, Y.; Schiele, B.; Vedaldi, A.; Rupprecht, C. Continual Detection Transformer for Incremental Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23799–23808.
  17. Chaudhary, Y.; Rai, P.; Schubert, M.; Schütze, H.; Gupta, P. Federated Continual Learning for Text Classification via Selective Inter-client Transfer. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 4789–4799.
  18. Zhang, Z.; Yu, T.; Zhao, H.; Xie, K.; Yao, L.; Li, S. Exploring Soft Prompt Initialization Strategy for Few-Shot Continual Text Classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 14–19 April 2024; pp. 12106–12110.
  19. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv 2015, arXiv:1508.01991.
  20. Zhao, Z.; Yang, Z.; Luo, L.; Zhang, Y.; Wang, L.; Lin, H.; Wang, J. ML-CNN: A novel deep learning based disease named entity recognition architecture. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, Shenzhen, China, 15–18 December 2016; p. 794.
  21. He, Y.; Tang, B. SetGNER: General Named Entity Recognition as Entity Set Generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 3074–3085.
  22. Jeong, M.; Kang, J. Consistency enhancement of model prediction on document-level named entity recognition. Bioinformatics 2023, 39, btad361.
  23. Yan, Y.; Zhu, P.; Cheng, D.; Yang, F.; Luo, Y. Adversarial Multi-task Learning for Efficient Chinese Named Entity Recognition. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 193.
  24. Luo, Y.; Xiao, F.; Zhao, H. Hierarchical Contextualized Representation for Named Entity Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 8441–8448.
  25. Schweter, S.; Akbik, A. FLERT: Document-Level Features for Named Entity Recognition. arXiv 2020, arXiv:2011.06993.
  26. Luo, Y.; Zhao, H. Bipartite Flat-Graph Network for Nested Named Entity Recognition. In Proceedings of the Annual Meeting of the Association for Computational Linguistics; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6408–6418.
  27. Xu, L.; Jie, Z.; Lu, W.; Bing, L. Better Feature Integration for Named Entity Recognition. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 3457–3469.
  28. Rosenberg, C.; Hebert, M.; Schneiderman, H. Semi-Supervised Self-Training of Object Detection Models. In Proceedings of the IEEE Workshops on Applications of Computer Vision, Breckenridge, CO, USA, 5–7 January 2005; pp. 29–36.
  29. De Lange, M.; Aljundi, R.; Masana, M.; Parisot, S.; Jia, X.; Leonardis, A.; Slabaugh, G.; Tuytelaars, T. A Continual Learning Survey: Defying Forgetting in Classification Tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3366–3385.
  30. Hou, S.; Pan, X.; Loy, C.C.; Wang, Z.; Lin, D. Learning a Unified Classifier Incrementally via Rebalancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 831–839.
  31. Douillard, A.; Cord, M.; Ollion, C.; Robert, T.; Valle, E. PODNet: Pooled Outputs Distillation for Small-Tasks Incremental Learning. In Proceedings of the European Conference on Computer Vision; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 86–102.
  32. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. Comput. Sci. 2015, 14, 38–39.
  33. Xia, Y.; Wang, Q.; Lyu, Y.; Zhu, Y.; Wu, W.; Li, S.; Dai, D. Learn and Review: Enhancing Continual Named Entity Recognition via Reviewing Synthetic Samples. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 2291–2300.
  34. Wang, S.; Shuai, H.; Liu, C.; Liu, Q. Bias-Based Soft Label Learning for Facial Expression Recognition. IEEE Trans. Affect. Comput. 2023, 14, 3257–3268.
  35. Wu, B.; Li, Y.; Mu, Y.; Scarton, C.; Bontcheva, K.; Song, X. Don’t Waste a Single Annotation: Improving Single-Label Classifiers Through Soft Label. arXiv 2023, arXiv:2311.05265.
  36. Clark, K.; Khandelwal, U.; Levy, O.; Manning, C.D. What Does BERT Look at? An Analysis of BERT’s Attention. In Proceedings of the ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP; Linzen, T., Chrupała, G., Belinkov, Y., Hupkes, D., Eds.; Association for Computational Linguistics: Florence, Italy, 2019; pp. 276–286.
  37. Jawahar, G.; Sagot, B.; Seddah, D. What Does BERT Learn about the Structure of Language? In Proceedings of the Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Florence, Italy, 2019; pp. 3651–3657.
  38. Liang, N.; Yang, Z.; Chen, J.; Li, Z.; Xie, S. Label-Weighted Graph-Based Learning for Semi-Supervised Classification Under Label Noise. IEEE Trans. Big Data 2024, 10, 55–65.
  39. Lou, Q.; Deng, Z.; Sang, Q.; Xiao, Z.; Choi, K.S.; Wang, S. A Robust Multilabel Method Integrating Rule-Based Transparent Model, Soft Label Correlation Learning and Label Noise Resistance. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 454–473.
  40. Murphy, S.N.; Griffin, W.; Michael, M.; Vivian, G.; Chueh, H.C.; Susanne, C.; Isaac, K. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J. Am. Med Inform. Assoc. 2010, 17, 124–130.
  41. Tjong Kim Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Conference on Natural Language Learning at HLT-NAACL, Edmonton, AB, Canada, 31 May–1 June 2003; pp. 142–147.
  42. ShafieiBavani, E.; Jimeno Yepes, A.; Zhong, X.; Martinez Iraola, D. Global Locality in Biomedical Relation and Event Extraction. In Proceedings of the SIGBioMed Workshop on Biomedical Language Processing; Demner-Fushman, D., Cohen, K.B., Ananiadou, S., Tsujii, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 195–204.
Figure 1. Overview of Confident Soft-Label Imitation (ConSOLI) at each continual learning step. In the example “Bin was in Japan yesterday”, the current entity label is [PER], which refers to persons. [GPE] is an old entity label, while [DATE] is an entity type that becomes relevant at some future step. Non-entity tokens are tagged as [O], [IGN] is the ignored label, and [X] means not applicable.
Figure 2. Step-wise Mi-F1 on I2B2 and OntoNotes5. Entries marked with * denote the reproduced results of CPFD. To monitor step-wise performance changes, we visualize the first two entity types in all four settings. For OntoNotes5, the entity types are CARDINAL and DATE. For I2B2, the entity types are AGE and CITY.
Figure 3. The step-wise Mi-F1 on CoNLL2003 and BioNLP11ID. We visualize the first two entity types to demonstrate the effects of catastrophic forgetting. * indicates the reproduction of CPFD results.
Figure 4. Pseudo-labels for two cases on OntoNotes5 under FG-1-PG-1. Inference is performed after all entity types have been learned. “DATE” and “CARDINAL” are the first two entity types; “PERSON” is another entity type in the first sentence. Wrong predictions are marked in red.
Table 1. Statistics of datasets. “#” refers to the number of each value.

| Dataset | # Entity Types | # Samples | Entity Sequence in Recognition |
|---|---|---|---|
| I2B2 | 16 | 141k | AGE, CITY, COUNTRY, DATE, DOCTOR, HOSPITAL, IDNUM, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, PROFESSION, STATE, STREET, USERNAME, ZIP |
| OntoNotes5 | 18 | 77k | CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART |
| CoNLL2003 | 4 | 21k | LOCATION, MISC, ORGANISATION, PERSON |
| BioNLP11ID | 4 | 25k | REGULON, ORGANISM, PROTEIN, CHEMICAL |
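
The entity sequences in Table 1 fix the order in which types are introduced. As a minimal sketch, assuming the usual reading of the FG-a-PG-b notation (the first step learns a entity types and every later step learns b), the per-step groups could be derived as follows. The function name `split_into_steps` is hypothetical and is not taken from the authors' code.

```python
from typing import List


def split_into_steps(entity_sequence: List[str], first_group: int, per_group: int) -> List[List[str]]:
    """Split an ordered entity sequence into continual-learning steps.

    Assumed convention (FG-a-PG-b): the first step learns `first_group`
    types and each later step learns `per_group` types.
    """
    if first_group <= 0 or per_group <= 0:
        raise ValueError("group sizes must be positive")
    steps = [entity_sequence[:first_group]]
    rest = entity_sequence[first_group:]
    for start in range(0, len(rest), per_group):
        steps.append(rest[start:start + per_group])
    return steps


# Entity order for I2B2 as listed in Table 1.
I2B2_SEQUENCE = [
    "AGE", "CITY", "COUNTRY", "DATE", "DOCTOR", "HOSPITAL", "IDNUM",
    "MEDICALRECORD", "ORGANIZATION", "PATIENT", "PHONE", "PROFESSION",
    "STATE", "STREET", "USERNAME", "ZIP",
]

fg1_pg1 = split_into_steps(I2B2_SEQUENCE, first_group=1, per_group=1)
fg8_pg2 = split_into_steps(I2B2_SEQUENCE, first_group=8, per_group=2)

print(len(fg1_pg1), fg1_pg1[:2])  # number of steps and the two oldest groups
print(len(fg8_pg2), fg8_pg2[1])   # number of steps and the first group after the base step
```

Under this reading, FG-1-PG-1 on I2B2 yields one step per entity type, so the types learned first (AGE and CITY) pass through the largest number of later updates, which is why they are the ones tracked for forgetting.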
Table 2. Overall results on I2B2 and OntoNotes5. “ExtendNER” and “CFNER” refer to original scores reported in [32] and [14], respectively. “ExtendNER†” and “CFNER†” represent results implemented in [10]. “CPFD*” refers to reproduced results using code published in [10]. Other baselines are cited from CPFD [10]. ∘ indicates statistical significance (p-value < 0.05) compared with CPFD. “Imp.” denotes improvements of our method over CPFD.

| Data | Baseline | FG-1-PG-1 Mi-F1 | FG-1-PG-1 Ma-F1 | FG-2-PG-2 Mi-F1 | FG-2-PG-2 Ma-F1 | FG-8-PG-1 Mi-F1 | FG-8-PG-1 Ma-F1 | FG-8-PG-2 Mi-F1 | FG-8-PG-2 Ma-F1 |
|---|---|---|---|---|---|---|---|---|---|
| I2B2 | FT [10] | 17.43 ± 0.54 | 13.81 ± 1.14 | 28.57 ± 0.26 | 21.43 ± 0.41 | 20.83 ± 1.78 | 18.11 ± 1.16 | 23.60 ± 0.15 | 23.54 ± 0.38 |
| I2B2 | PODNet [31] | 12.31 ± 0.35 | 17.14 ± 1.03 | 34.67 ± 2.65 | 24.62 ± 1.76 | 39.26 ± 1.38 | 27.23 ± 0.93 | 36.22 ± 12.9 | 26.08 ± 7.42 |
| I2B2 | LUCIR [30] | 43.86 ± 2.43 | 31.31 ± 1.62 | 64.32 ± 0.76 | 43.53 ± 0.59 | 57.86 ± 0.87 | 33.04 ± 0.39 | 68.54 ± 0.27 | 46.94 ± 0.63 |
| I2B2 | ST [28,29] | 31.98 ± 2.12 | 14.76 ± 1.31 | 55.44 ± 4.78 | 33.38 ± 3.13 | 49.51 ± 1.35 | 23.77 ± 1.01 | 48.94 ± 6.78 | 29.00 ± 3.04 |
| I2B2 | ExtendNER† [10] | 41.65 ± 10.11 | 23.11 ± 2.70 | 67.60 ± 1.15 | 42.58 ± 1.59 | 45.14 ± 2.91 | 27.41 ± 0.88 | 56.48 ± 2.41 | 38.88 ± 1.38 |
| I2B2 | ExtendNER [32] | 42.85 ± 2.86 | 24.05 ± 1.35 | 57.01 ± 4.14 | 35.29 ± 3.38 | 43.95 ± 2.01 | 23.12 ± 1.79 | 52.25 ± 5.36 | 30.93 ± 2.77 |
| I2B2 | CFNER† [10] | 64.79 ± 0.26 | 37.79 ± 0.65 | 72.58 ± 0.59 | 51.71 ± 0.84 | 56.66 ± 3.22 | 36.84 ± 1.35 | 69.12 ± 0.94 | 51.61 ± 0.87 |
| I2B2 | CFNER [14] | 62.73 ± 3.62 | 36.26 ± 2.24 | 71.98 ± 0.50 | 49.09 ± 1.38 | 59.79 ± 1.70 | 37.30 ± 1.15 | 69.07 ± 0.89 | 51.09 ± 1.05 |
| I2B2 | CPFD* | 72.70 ± 0.97 | 48.35 ± 1.44 | 78.45 ± 0.55 | 56.54 ± 1.02 | 76.16 ± 0.85 | 56.19 ± 2.17 | 80.99 ± 0.93 | 63.69 ± 1.59 |
| I2B2 | Ours | 73.45 ± 1.02 | 49.20 ± 1.13 | 79.18 ± 0.77 | 59.96 ± 1.30 | 77.20 ± 1.03 | 57.92 ± 1.92 | 81.67 ± 1.07 | 65.80 ± 1.83 |
| I2B2 | Imp. | 0.75 | 0.85 | 0.73 | 3.42 | 1.04 | 1.73 | 0.68 | 2.11 |
| OntoNotes5 | FT [10] | 15.27 ± 0.26 | 10.85 ± 1.11 | 25.85 ± 0.11 | 20.55 ± 0.24 | 17.63 ± 0.57 | 12.23 ± 1.08 | 29.81 ± 0.12 | 20.05 ± 0.16 |
| OntoNotes5 | PODNet [31] | 9.06 ± 0.56 | 8.36 ± 0.57 | 19.04 ± 1.08 | 16.93 ± 0.85 | 29.00 ± 0.86 | 20.54 ± 0.91 | 37.78 ± 0.26 | 25.85 ± 0.29 |
| OntoNotes5 | LUCIR [30] | 28.18 ± 1.15 | 21.11 ± 0.84 | 56.40 ± 1.79 | 40.58 ± 1.11 | 66.46 ± 0.46 | 46.29 ± 0.38 | 76.17 ± 0.09 | 55.58 ± 0.55 |
| OntoNotes5 | ST [28,29] | 50.71 ± 0.79 | 33.24 ± 1.06 | 68.93 ± 1.67 | 50.63 ± 1.66 | 73.59 ± 0.66 | 49.41 ± 0.77 | 77.07 ± 0.62 | 53.32 ± 0.63 |
| OntoNotes5 | ExtendNER† [10] | 51.35 ± 0.77 | 33.38 ± 0.98 | 63.03 ± 9.39 | 47.64 ± 5.15 | 73.65 ± 0.19 | 50.55 ± 0.56 | 77.86 ± 0.10 | 55.21 ± 0.51 |
| OntoNotes5 | ExtendNER [32] | 50.53 ± 0.86 | 32.84 ± 0.84 | 67.61 ± 1.53 | 49.26 ± 1.49 | 73.12 ± 0.93 | 49.55 ± 0.90 | 76.85 ± 0.77 | 54.37 ± 0.57 |
| OntoNotes5 | CFNER† [10] | 58.94 ± 0.57 | 42.22 ± 1.10 | 72.59 ± 0.48 | 55.96 ± 0.69 | 78.92 ± 0.58 | 57.51 ± 1.32 | 80.68 ± 0.25 | 60.52 ± 0.84 |
| OntoNotes5 | CFNER [14] | 58.94 ± 0.57 | 42.22 ± 1.10 | 72.59 ± 0.48 | 55.96 ± 0.69 | 78.92 ± 0.58 | 57.51 ± 1.32 | 80.68 ± 0.25 | 60.52 ± 0.84 |
| OntoNotes5 | CPFD* | 66.73 ± 0.70 | 54.12 ± 0.30 | 73.74 ± 0.31 | 57.78 ± 0.30 | 81.54 ± 0.55 | 59.20 ± 1.25 | 83.41 ± 0.18 | 65.81 ± 0.75 |
| OntoNotes5 | Ours | 69.65 ± 1.20 | 54.41 ± 1.35 | 75.33 ± 0.51 | 57.73 ± 0.45 | 82.09 ± 0.52 | 63.62 ± 1.31 | 83.49 ± 0.28 | 65.86 ± 0.89 |
| OntoNotes5 | Imp. | 2.92 | 0.29 | 1.59 | −0.05 | 0.55 | 4.42 | 0.08 | 0.05 |
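
For readers who want to check the “Imp.” rows, the short sketch below recomputes them for I2B2 as the mean score of our method minus the mean score of CPFD*, using the values printed in Table 2. It assumes that “Imp.” is this plain difference of means. The helper name `improvement` is illustrative only, and `Decimal` is used so the subtraction is exact to two decimals.

```python
from decimal import Decimal
from typing import Dict, Tuple

Scores = Dict[str, Tuple[str, str]]  # setting -> (Mi-F1 mean, Ma-F1 mean)

# Mean scores for I2B2 copied from Table 2 (standard deviations omitted).
CPFD_I2B2: Scores = {
    "FG-1-PG-1": ("72.70", "48.35"),
    "FG-2-PG-2": ("78.45", "56.54"),
    "FG-8-PG-1": ("76.16", "56.19"),
    "FG-8-PG-2": ("80.99", "63.69"),
}
OURS_I2B2: Scores = {
    "FG-1-PG-1": ("73.45", "49.20"),
    "FG-2-PG-2": ("79.18", "59.96"),
    "FG-8-PG-1": ("77.20", "57.92"),
    "FG-8-PG-2": ("81.67", "65.80"),
}


def improvement(ours: Scores, baseline: Scores) -> Dict[str, Tuple[Decimal, Decimal]]:
    """Return (Mi-F1, Ma-F1) differences, ours minus baseline, per setting."""
    return {
        setting: tuple(Decimal(o) - Decimal(b) for o, b in zip(ours[setting], baseline[setting]))
        for setting in ours
    }


imp = improvement(OURS_I2B2, CPFD_I2B2)
for setting, (mi, ma) in imp.items():
    print(f"{setting}: Mi-F1 {mi:+}, Ma-F1 {ma:+}")
```

A signed difference makes it explicit when the method falls below the baseline in a given setting, which an unsigned “Imp.” value would hide.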
Table 3. The main results on CoNLL2003 and BioNLP11ID. ExtendNER† and CFNER† refer to scores cited from the original papers. “CPFD*” refers to reproduced results using code published in [10]. ExtendNER and CFNER refer to reproduced results. ∘ indicates statistical significance (p-value < 0.05) compared with CPFD. “Imp.” denotes improvements of our method over CPFD.

| Data | Baseline | FG-1-PG-1 Mi-F1 | FG-1-PG-1 Ma-F1 | FG-2-PG-1 Mi-F1 | FG-2-PG-1 Ma-F1 |
|---|---|---|---|---|---|
| CoNLL2003 | FT | 50.84 ± 0.10 | 40.64 ± 0.16 | 57.45 ± 0.05 | 43.58 ± 0.18 |
| CoNLL2003 | PODNet | 36.74 ± 0.52 | 29.43 ± 0.28 | 59.12 ± 0.54 | 58.39 ± 0.99 |
| CoNLL2003 | LUCIR | 74.15 ± 0.43 | 70.48 ± 0.66 | 80.53 ± 0.31 | 77.33 ± 0.31 |
| CoNLL2003 | ST | 76.17 ± 0.91 | 72.88 ± 1.12 | 76.65 ± 0.24 | 66.72 ± 0.11 |
| CoNLL2003 | ExtendNER† | 76.07 ± 0.35 | 73.06 ± 0.29 | 77.89 ± 0.42 | 69.92 ± 1.12 |
| CoNLL2003 | ExtendNER | 76.36 ± 0.98 | 73.04 ± 1.80 | 76.66 ± 0.66 | 66.36 ± 0.64 |
| CoNLL2003 | CFNER† | 80.29 ± 0.21 | 78.44 ± 0.24 | 81.52 ± 0.43 | 77.20 ± 0.82 |
| CoNLL2003 | CFNER* | 80.91 ± 0.29 | 79.11 ± 0.50 | 80.83 ± 0.36 | 75.20 ± 0.32 |
| CoNLL2003 | CPFD | 82.24 ± 0.63 | 79.94 ± 0.66 | 85.56 ± 0.94 | 83.34 ± 0.51 |
| CoNLL2003 | Ours | 83.47 ± 0.82 | 81.44 ± 0.91 | 85.63 ± 1.21 | 83.02 ± 1.30 |
| CoNLL2003 | Imp. | 1.23 | 1.50 | 0.07 | −0.32 |
| BioNLP11ID | FT | 33.37 ± 0.11 | 16.82 ± 0.09 | 7.91 ± 0.10 | 9.48 ± 0.08 |
| BioNLP11ID | PODNet | 6.35 ± 0.49 | 4.21 ± 0.31 | 39.79 ± 0.44 | 19.48 ± 0.37 |
| BioNLP11ID | LUCIR | 53.97 ± 0.41 | 28.27 ± 0.33 | 57.71 ± 0.35 | 28.20 ± 0.34 |
| BioNLP11ID | ST | 47.30 ± 0.96 | 28.27 ± 1.03 | 60.86 ± 0.45 | 28.53 ± 0.59 |
| BioNLP11ID | ExtendNER† | - | - | - | - |
| BioNLP11ID | ExtendNER | 46.37 ± 0.11 | 27.75 ± 0.97 | 60.69 ± 0.55 | 29.36 ± 0.61 |
| BioNLP11ID | CFNER† | - | - | - | - |
| BioNLP11ID | CFNER | 57.44 ± 0.51 | 35.04 ± 0.70 | 60.37 ± 0.47 | 29.84 ± 0.41 |
| BioNLP11ID | CPFD* | 55.53 ± 1.05 | 33.96 ± 1.19 | 63.13 ± 0.94 | 33.77 ± 0.51 |
| BioNLP11ID | Ours | 57.12 ± 1.23 | 35.06 ± 1.43 | 71.85 ± 1.21 | 43.48 ± 1.30 |
| BioNLP11ID | Imp. | 1.59 | 1.10 | 8.72 | 9.72 |
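
The tables report both Mi-F1 and Ma-F1, and the two can diverge sharply when entity types are imbalanced. The sketch below shows the standard definitions: Micro-F1 pools true positives, false positives, and false negatives over all types, while Macro-F1 averages the per-type F1 scores. The counts are invented for illustration and are not taken from the paper, and the paper's exact evaluation script is not shown here.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(counts: Dict[str, Counts]) -> float:
    """Pool the counts over all entity types, then compute a single F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


def macro_f1(counts: Dict[str, Counts]) -> float:
    """Compute F1 per entity type, then take the unweighted mean."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


# Hypothetical counts: one frequent, well-recognized type and one rare, forgotten type.
toy_counts = {
    "DATE": (900, 50, 50),
    "CITY": (10, 20, 20),
}

print(f"Mi-F1 = {100 * micro_f1(toy_counts):.2f}")
print(f"Ma-F1 = {100 * macro_f1(toy_counts):.2f}")
```

Because Macro-F1 weights every type equally, a drop on a rare, early-learned type lowers it far more than it lowers Micro-F1, which fits the pattern of Ma-F1 scores sitting below Mi-F1 scores throughout the tables.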
Table 4. Performance when using different hidden states to calculate the soft labels. Results are summarized on I2B2, OntoNotes5, CoNLL2003, and BioNLP11ID under setting FG-1-PG-1.

| # of Layers | I2B2 Mi-F1 | I2B2 Ma-F1 | OntoNotes5 Mi-F1 | OntoNotes5 Ma-F1 | CoNLL2003 Mi-F1 | CoNLL2003 Ma-F1 | BioNLP11ID Mi-F1 | BioNLP11ID Ma-F1 |
|---|---|---|---|---|---|---|---|---|
| 1 | 73.45 ± 1.02 | 49.20 ± 1.13 | 69.65 ± 1.20 | 54.41 ± 1.35 | 83.47 ± 0.63 | 81.44 ± 0.91 | 57.12 ± 1.23 | 35.06 ± 1.43 |
| −2:−1 | 72.59 ± 1.15 | 48.75 ± 1.57 | 68.73 ± 0.97 | 53.99 ± 0.73 | 83.04 ± 0.91 | 80.99 ± 0.79 | 58.05 ± 1.21 | 35.96 ± 1.33 |
| −3:−1 | 72.61 ± 0.97 | 48.02 ± 1.34 | 67.55 ± 1.03 | 54.02 ± 1.21 | 82.15 ± 0.77 | 79.52 ± 0.81 | 56.90 ± 1.92 | 34.01 ± 1.55 |
Table 5. Ablation study on I2B2, OntoNotes5, CoNLL2003, and BioNLP11ID under setting FG-1-PG-1. “w/” refers to replacement of $L_{PFD}$ by $L_{FD}$ or $L_{PFD}^{lax}$. “w/o” refers to removal of the corresponding component in ConSOLI.

| Methods | I2B2 Mi-F1 | I2B2 Ma-F1 | OntoNotes5 Mi-F1 | OntoNotes5 Ma-F1 | CoNLL2003 Mi-F1 | CoNLL2003 Ma-F1 | BioNLP11ID Mi-F1 | BioNLP11ID Ma-F1 |
|---|---|---|---|---|---|---|---|---|
| Ours | 73.45 ± 1.02 | 49.20 ± 1.13 | 69.65 ± 1.20 | 54.41 ± 1.35 | 83.47 ± 0.82 | 81.44 ± 0.91 | 57.12 ± 1.23 | 35.06 ± 1.43 |
| w/ $L_{FD}$ | 70.65 ± 1.10 | 46.09 ± 1.32 | 64.70 ± 0.51 | 52.31 ± 1.14 | 81.33 ± 0.75 | 78.91 ± 1.07 | 55.09 ± 1.12 | 32.57 ± 1.76 |
| w/ $L_{PFD}^{lax}$ | 67.85 ± 0.82 | 44.36 ± 1.21 | 62.88 ± 0.93 | 51.21 ± 1.03 | 79.90 ± 0.91 | 79.55 ± 1.32 | 54.88 ± 0.77 | 31.17 ± 1.08 |
| w/o $L_{PFD}$ | 69.54 ± 0.91 | 42.28 ± 0.73 | 64.80 ± 1.29 | 50.09 ± 1.15 | 79.81 ± 0.75 | 78.92 ± 1.07 | 53.97 ± 0.52 | 30.88 ± 1.15 |
| w/o CPL | 54.90 ± 4.56 | 35.49 ± 2.15 | 59.97 ± 0.71 | 49.48 ± 0.54 | 76.91 ± 0.94 | 75.35 ± 1.01 | 49.52 ± 1.31 | 27.44 ± 2.31 |
| w/o ART | 72.40 ± 1.26 | 46.21 ± 1.57 | 64.19 ± 1.01 | 51.53 ± 0.51 | 79.13 ± 0.89 | 79.05 ± 1.47 | 54.95 ± 1.20 | 30.99 ± 2.05 |
| w/o SL | 72.70 ± 0.97 | 48.35 ± 1.44 | 66.73 ± 0.70 | 54.12 ± 0.30 | 82.24 ± 0.63 | 79.94 ± 0.66 | 55.53 ± 1.05 | 33.96 ± 1.19 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, H.; Zhou, L.; Gu, M. Reduced Forgetfulness in Continual Learning for Named Entity Recognition Through Confident Soft-Label Imitation. Mathematics 2024, 12, 3964. https://doi.org/10.3390/math12243964
