Reduced Forgetfulness in Continual Learning for Named Entity Recognition Through Confident Soft-Label Imitation
Abstract
1. Introduction
- We propose a continual learning method for NER based on feature imitation. The approach adds a knowledge distillation loss that improves on the original CPFD by strengthening retention of the knowledge learned from old tasks.
- We propose a confidence-based soft pseudo-labeling method to preserve the knowledge learned from old tasks. To remain compatible with the original pseudo-labeling in CPFD, we incorporate a balancing factor that weighs the hard labels against the proposed soft-labels.
- Extensive experiments on four benchmark datasets show that our method improves Micro-F1 and Macro-F1 by an average of 1.28% and 1.18%, respectively.
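
The second contribution can be illustrated with a minimal sketch. It is one plausible reading of "confidence-based soft pseudo-labeling with a balancing factor", not the formulation used in the paper: the function names, the confidence `threshold`, and the balancing factor `alpha` are all hypothetical, and the loss is computed for a single non-entity token using only the standard library.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    # Cross-entropy between a target distribution and a predicted distribution.
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def confident_soft_label_loss(old_logits, new_logits, threshold=0.7, alpha=0.5):
    """Loss for one non-entity token (hypothetical sketch).

    old_logits: old model scores over the old label space.
    new_logits: new model scores; the first len(old_logits) entries are
                assumed to correspond to the old label space.
    Returns 0.0 when the old model is not confident enough.
    """
    old_probs = softmax(old_logits)
    if max(old_probs) < threshold:
        return 0.0  # assumption: unconfident tokens are simply skipped
    # Compare the two models on the old label space only.
    new_probs = softmax(new_logits[:len(old_logits)])
    # Hard pseudo-label: one-hot vector at the old model's argmax.
    best = old_probs.index(max(old_probs))
    hard = [1.0 if i == best else 0.0 for i in range(len(old_probs))]
    hard_term = cross_entropy(hard, new_probs)
    # Soft label: the old model's full distribution.
    soft_term = cross_entropy(old_probs, new_probs)
    return alpha * hard_term + (1.0 - alpha) * soft_term
```

With `alpha=1.0` the sketch reduces to hard pseudo-labeling on confident tokens, and with `alpha=0.0` it uses only the soft distribution, which is the trade-off a balancing factor is meant to control.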
2. Related Work
2.1. Named Entity Recognition
2.2. Continual Learning in NER
3. Continual Learning for Named Entity Recognition Through Confident Soft-Label Imitation
3.1. Task Formulation
3.2. Framework Overview
3.3. Soft-Label Distillation
3.4. Confident Soft-Label Imitation
Algorithm 1: Pseudocode for CL-NER through confident soft-label imitation.
4. Experiments
4.1. Datasets and Settings
4.2. Main Results
4.3. Ablation Study
4.4. Case Study
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Baseline Introductions
- Fine-Tuning (FT): FT refers to the fine-tuning process of a pre-trained model. In CL-NER, we apply FT on the current dataset to update the model at each step, so it includes no anti-forgetting measure. In our implementation, we use a pre-trained BERT model as the base model and update it in each CL step to evaluate the performance.
- Self-Training (ST): ST uses the old model to annotate the non-entity tokens in the current dataset with the old entity types. The predicted labels of the old entities are treated as the ground truth when the new model is updated. In the CL process, the new model is updated by minimizing the cross-entropy loss over the annotated data and the ground truth in the current dataset.
- ExtendNER: Similar to ST, ExtendNER uses the old model to run inference on each non-entity token. Instead of directly using the predicted entity type (the hard label), it uses the probability distribution over the old entity categories. The KL divergence loss is then calculated for the non-entity class and combined with the cross-entropy loss over the tokens of the current entity types. The new model is updated by minimizing these losses. Although both our method and ExtendNER use soft-labels, the soft-label in ExtendNER is constructed from the classifier, while our method extracts the linguistic knowledge in the hidden states for soft-labels.
- LUCIR: LUCIR was originally designed for image classification. In addition to the cross-entropy loss on the new categories, it uses a distillation loss to constrain the features from the old and new models. Moreover, to maintain the knowledge from the old model, LUCIR reserves some samples for the old classes, which are then used to compute a margin-ranking loss. In our implementation, no reserved samples are given. Therefore, the margin-ranking loss is calculated over the non-entity tokens in the current dataset instead of the reserved samples. Our method uses the same idea as LUCIR to preserve knowledge using features. The difference is that our method explores different combinations of hidden states, and the soft-label is only calculated over confident non-entity tokens.
- PODNet: PODNet is another CL method from image classification. Similar to LUCIR, PODNet also relies on knowledge distillation from the old model. The difference is that PODNet uses a distillation loss to constrain the output of each convolutional layer, while LUCIR only considers the output of the last one. Moreover, PODNet replaces the cross-entropy loss with the NCA loss for classification. In the re-implementation of PODNet for CL-NER, the distillation uses the intermediate outputs of BERT.
- CFNER: Similar to ExtendNER, CFNER calculates the probability distribution of the old entity types from the old model. The difference is that CFNER focuses on distilling the causal effects from the non-entity tokens as an anti-forgetting measure. Similar to CPFD, it also has a dynamic balancing strategy to distinguish between new entity and non-entity types.
- CPFD: CPFD balances stability and plasticity by using a pooled feature distillation loss to constrain the update of the new model. CPFD also addresses the label imbalance problem in the current dataset and proposes a balanced pseudo loss. Unlike CFNER, only the confident pseudo-labels in the non-entity class are used to compute the loss, thereby limiting the noise caused by the old model and the label shift. Our method differs from CPFD in that we further introduce soft feature distillation and a confidence-based loss to help preserve the knowledge from the old model, thereby improving the performance on most previous entity types.
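
The difference between the hard labels of ST and the soft labels of ExtendNER, as described above, can be sketched as follows. This is a simplified illustration rather than either baseline's reference implementation: the function names, the `"O"` tag for non-entity tokens, and the renormalisation of the new model's probabilities over the old label space are assumptions made for the example.

```python
import math

NON_ENTITY = "O"  # assumed tag for non-entity tokens

def self_training_labels(gold_labels, old_predictions):
    # ST: every non-entity token takes the old model's hard prediction;
    # tokens annotated with a current entity type keep their gold label.
    return [old if gold == NON_ENTITY else gold
            for gold, old in zip(gold_labels, old_predictions)]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def extendner_style_loss(gold_labels, old_probs, new_probs, label_to_id):
    """Mean per-token loss: KL on non-entity tokens, CE on current entities.

    old_probs[i]: old model distribution over the old label space.
    new_probs[i]: new model distribution; its first entries are assumed
                  to line up with the old label space.
    """
    total = 0.0
    for gold, p_old, p_new in zip(gold_labels, old_probs, new_probs):
        if gold == NON_ENTITY:
            # Restrict the new distribution to the old labels and renormalise.
            head = p_new[:len(p_old)]
            mass = sum(head)
            q = [x / mass for x in head]
            total += kl_divergence(p_old, q)
        else:
            # Cross-entropy against the gold current entity type.
            total += -math.log(p_new[label_to_id[gold]] + 1e-12)
    return total / len(gold_labels)

# Toy example: the old model knows O and B-PER; the new step adds B-ORG.
label_to_id = {"O": 0, "B-PER": 1, "B-ORG": 2}
gold = ["O", "B-ORG", "O"]
old_pred = ["B-PER", "O", "O"]
old_probs = [[0.1, 0.9], [0.8, 0.2], [0.95, 0.05]]
new_probs = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.9, 0.05, 0.05]]

st_labels = self_training_labels(gold, old_pred)
loss = extendner_style_loss(gold, old_probs, new_probs, label_to_id)
```

In the toy example, ST turns the first token into a `B-PER` training target, whereas the ExtendNER-style loss keeps the old model's full distribution for that token and penalises the new model only for deviating from it.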
Appendix B. Visualizations of Step-Wise Performance for Other Baseline Methods
References
1. Gligic, L.; Kormilitzin, A.; Goldberg, P.; Nevado-Holgado, A. Named entity recognition in electronic health records using transfer learning bootstrapped Neural Networks. Neural Netw. 2020, 121, 132–139.
2. Wang, D.; Feng, X.; Liu, Z.; Wang, C. 2M-NER: Contrastive learning for multilingual and multimodal NER with language and modal fusion. Appl. Intell. 2024, 54, 6252–6268.
3. Liu, X.; Huang, H.; Zhang, Y. Open Domain Event Extraction Using Neural Latent Variable Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics; Korhonen, A., Traum, D., Màrquez, L., Eds.; Association for Computational Linguistics: Florence, Italy, 2019; pp. 2860–2871.
4. Wang, Y.; Han, X.; Zhou, F.; Wang, Y.; Deng, C.; Feng, J. Distill-AER: Fine-Grained Address Entity Recognition from Spoken Dialogue via Knowledge Distillation. In Proceedings of the Natural Language Processing and Chinese Computing; Lu, W., Huang, S., Hong, Y., Zhou, X., Eds.; Springer: Cham, Switzerland, 2022; pp. 643–655.
5. Singh, A.; Saha, S. GraphIC: A graph-based approach for identifying complaints from code-mixed product reviews. Expert Syst. Appl. 2023, 216, 119444.
6. Zhang, Y.; Yang, J. Chinese NER Using Lattice LSTM. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Gurevych, I., Miyao, Y., Eds.; Association for Computational Linguistics: Melbourne, Australia, 2018; pp. 1554–1564.
7. Žukov-Gregorič, A.; Bachrach, Y.; Coope, S. Named Entity Recognition With Parallel Recurrent Neural Networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Gurevych, I., Miyao, Y., Eds.; Association for Computational Linguistics: Melbourne, Australia, 2018; pp. 69–74.
8. Zhou, R.; Xie, Z.; Wan, J.; Zhang, J.; Liao, Y.; Liu, Q. Attention and Edge-Label Guided Graph Convolutional Networks for Named Entity Recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 6499–6510.
9. Chen, Y.; He, L. SKD-NER: Continual Named Entity Recognition via Span-based Knowledge Distillation with Reinforcement Learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 6689–6700.
10. Zhang, D.; Cong, W.; Dong, J.; Yu, Y.; Chen, X.; Zhang, Y.; Fang, Z. Continual Named Entity Recognition without Catastrophic Forgetting. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 8186–8197.
11. Wang, Z.; Wang, X.; Hu, W. Continual Event Extraction with Semantic Confusion Rectification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 11945–11955.
12. Xiong, W.; Song, Y.; Wang, P.; Li, S. Rationale-Enhanced Language Models are Better Continual Relation Learners. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 15489–15497.
13. Song, Y.; Wang, P.; Xiong, W.; Zhu, D.; Liu, T.; Sui, Z.; Li, S. InfoCL: Alleviating Catastrophic Forgetting in Continual Text Classification from An Information Theoretic Perspective. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 14557–14570.
14. Zheng, J.; Liang, Z.; Chen, H.; Ma, Q. Distilling Causal Effect from Miscellaneous Other-Class for Continual Named Entity Recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 3602–3615.
15. Pradhan, S.S.; Xue, N. OntoNotes: The 90% Solution. In Proceedings of the Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts; Chelba, C., Kantor, P., Roark, B., Eds.; Association for Computational Linguistics: Boulder, CO, USA, 2009; pp. 11–12.
16. Liu, Y.; Schiele, B.; Vedaldi, A.; Rupprecht, C. Continual Detection Transformer for Incremental Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23799–23808.
17. Chaudhary, Y.; Rai, P.; Schubert, M.; Schütze, H.; Gupta, P. Federated Continual Learning for Text Classification via Selective Inter-client Transfer. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 4789–4799.
18. Zhang, Z.; Yu, T.; Zhao, H.; Xie, K.; Yao, L.; Li, S. Exploring Soft Prompt Initialization Strategy for Few-Shot Continual Text Classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 14–19 April 2024; pp. 12106–12110.
19. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv 2015, arXiv:1508.01991.
20. Zhao, Z.; Yang, Z.; Luo, L.; Zhang, Y.; Wang, L.; Lin, H.; Wang, J. ML-CNN: A novel deep learning based disease named entity recognition architecture. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, Shenzhen, China, 15–18 December 2016; p. 794.
21. He, Y.; Tang, B. SetGNER: General Named Entity Recognition as Entity Set Generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 3074–3085.
22. Jeong, M.; Kang, J. Consistency enhancement of model prediction on document-level named entity recognition. Bioinformatics 2023, 39, btad361.
23. Yan, Y.; Zhu, P.; Cheng, D.; Yang, F.; Luo, Y. Adversarial Multi-task Learning for Efficient Chinese Named Entity Recognition. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 193.
24. Luo, Y.; Xiao, F.; Zhao, H. Hierarchical Contextualized Representation for Named Entity Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 8441–8448.
25. Schweter, S.; Akbik, A. FLERT: Document-Level Features for Named Entity Recognition. arXiv 2020, arXiv:2011.06993.
26. Luo, Y.; Zhao, H. Bipartite Flat-Graph Network for Nested Named Entity Recognition. In Proceedings of the Annual Meeting of the Association for Computational Linguistics; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6408–6418.
27. Xu, L.; Jie, Z.; Lu, W.; Bing, L. Better Feature Integration for Named Entity Recognition. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 3457–3469.
28. Rosenberg, C.; Hebert, M.; Schneiderman, H. Semi-Supervised Self-Training of Object Detection Models. In Proceedings of the IEEE Workshops on Applications of Computer Vision, Breckenridge, CO, USA, 5–7 January 2005; pp. 29–36.
29. De Lange, M.; Aljundi, R.; Masana, M.; Parisot, S.; Jia, X.; Leonardis, A.; Slabaugh, G.; Tuytelaars, T. A Continual Learning Survey: Defying Forgetting in Classification Tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3366–3385.
30. Hou, S.; Pan, X.; Loy, C.C.; Wang, Z.; Lin, D. Learning a Unified Classifier Incrementally via Rebalancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 831–839.
31. Douillard, A.; Cord, M.; Ollion, C.; Robert, T.; Valle, E. PODNet: Pooled Outputs Distillation for Small-Tasks Incremental Learning. In Proceedings of the European Conference on Computer Vision; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 86–102.
32. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. Comput. Sci. 2015, 14, 38–39.
33. Xia, Y.; Wang, Q.; Lyu, Y.; Zhu, Y.; Wu, W.; Li, S.; Dai, D. Learn and Review: Enhancing Continual Named Entity Recognition via Reviewing Synthetic Samples. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 2291–2300.
34. Wang, S.; Shuai, H.; Liu, C.; Liu, Q. Bias-Based Soft Label Learning for Facial Expression Recognition. IEEE Trans. Affect. Comput. 2023, 14, 3257–3268.
35. Wu, B.; Li, Y.; Mu, Y.; Scarton, C.; Bontcheva, K.; Song, X. Don’t Waste a Single Annotation: Improving Single-Label Classifiers Through Soft Label. arXiv 2023, arXiv:2311.05265.
36. Clark, K.; Khandelwal, U.; Levy, O.; Manning, C.D. What Does BERT Look at? An Analysis of BERT’s Attention. In Proceedings of the ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP; Linzen, T., Chrupała, G., Belinkov, Y., Hupkes, D., Eds.; Association for Computational Linguistics: Florence, Italy, 2019; pp. 276–286.
37. Jawahar, G.; Sagot, B.; Seddah, D. What Does BERT Learn about the Structure of Language? In Proceedings of the Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Florence, Italy, 2019; pp. 3651–3657.
38. Liang, N.; Yang, Z.; Chen, J.; Li, Z.; Xie, S. Label-Weighted Graph-Based Learning for Semi-Supervised Classification Under Label Noise. IEEE Trans. Big Data 2024, 10, 55–65.
39. Lou, Q.; Deng, Z.; Sang, Q.; Xiao, Z.; Choi, K.S.; Wang, S. A Robust Multilabel Method Integrating Rule-Based Transparent Model, Soft Label Correlation Learning and Label Noise Resistance. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 454–473.
40. Murphy, S.N.; Griffin, W.; Michael, M.; Vivian, G.; Chueh, H.C.; Susanne, C.; Isaac, K. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J. Am. Med. Inform. Assoc. 2010, 17, 124–130.
41. Tjong Kim Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Conference on Natural Language Learning at HLT-NAACL, Edmonton, AB, Canada, 31 May–1 June 2003; pp. 142–147.
42. ShafieiBavani, E.; Jimeno Yepes, A.; Zhong, X.; Martinez Iraola, D. Global Locality in Biomedical Relation and Event Extraction. In Proceedings of the SIGBioMed Workshop on Biomedical Language Processing; Demner-Fushman, D., Cohen, K.B., Ananiadou, S., Tsujii, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 195–204.
Dataset | # Entity Types | # Samples | Entity Sequence in Recognition |
---|---|---|---|
I2B2 | 16 | 141k | AGE, CITY, COUNTRY, DATE, DOCTOR, HOSPITAL, IDNUM, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, PROFESSION, STATE, STREET, USERNAME, ZIP |
OntoNotes5 | 18 | 77k | CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART |
CoNLL2003 | 4 | 21k | LOCATION, MISC, ORGANISATION, PERSON |
BioNLP11ID | 4 | 25k | REGULON, ORGANISM, PROTEIN, CHEMICAL |
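
The settings named FG-a-PG-b in the result tables follow a convention that is common in the CL-NER literature: the first step learns a entity types and every later step learns b more. The sketch below shows how an entity sequence such as the ones listed above could be partitioned under that convention. The helper `split_entity_types` is hypothetical, and the assumption that types are consumed in the listed order is ours, not something this section states.

```python
def split_entity_types(entity_types, fg, pg):
    """Partition an ordered entity sequence into continual-learning steps.

    fg: number of entity types learned in the first step (the "FG" value).
    pg: number of entity types added in each later step (the "PG" value).
    """
    steps = [entity_types[:fg]]
    for start in range(fg, len(entity_types), pg):
        steps.append(entity_types[start:start + pg])
    return steps

# Entity sequence for CoNLL2003 as listed in the dataset table.
CONLL2003 = ["LOCATION", "MISC", "ORGANISATION", "PERSON"]

fg1_pg1 = split_entity_types(CONLL2003, fg=1, pg=1)
fg2_pg1 = split_entity_types(CONLL2003, fg=2, pg=1)
```

Under these assumptions, FG-1-PG-1 on CoNLL2003 yields four steps of one type each, while FG-2-PG-1 yields three steps: `["LOCATION", "MISC"]`, then `["ORGANISATION"]`, then `["PERSON"]`. For a 16-type dataset such as I2B2, FG-8-PG-2 gives five steps.
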
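Data | Baseline | FG-1-PG-1 Mi-F1 | FG-1-PG-1 Ma-F1 | FG-2-PG-2 Mi-F1 | FG-2-PG-2 Ma-F1 | FG-8-PG-1 Mi-F1 | FG-8-PG-1 Ma-F1 | FG-8-PG-2 Mi-F1 | FG-8-PG-2 Ma-F1 |
---|---|---|---|---|---|---|---|---|---|
I2B2 | FT [10] | | | | | | | | |
I2B2 | PODNet [31] | | | | | | | | |
I2B2 | LUCIR [30] | | | | | | | | |
I2B2 | ST [28,29] | | | | | | | | |
I2B2 | ExtendNER† [10] | | | | | | | | |
I2B2 | ExtendNER [32] | | | | | | | | |
I2B2 | CFNER† [10] | | | | | | | | |
I2B2 | CFNER [14] | | | | | | | | |
I2B2 | CPFD* | | | | | | | | |
I2B2 | Ours | | | | | | | | |
I2B2 | Imp. | | | | | | | | |
OntoNotes5 | FT [10] | | | | | | | | |
OntoNotes5 | PODNet [31] | | | | | | | | |
OntoNotes5 | LUCIR [30] | | | | | | | | |
OntoNotes5 | ST [28,29] | | | | | | | | |
OntoNotes5 | ExtendNER† [10] | | | | | | | | |
OntoNotes5 | ExtendNER [32] | | | | | | | | |
OntoNotes5 | CFNER† [10] | | | | | | | | |
OntoNotes5 | CFNER [14] | | | | | | | | |
OntoNotes5 | CPFD* | | | | | | | | |
OntoNotes5 | Ours | | | | | | | | |
OntoNotes5 | Imp. | | | | | | | | |
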
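Data | Baseline | FG-1-PG-1 Mi-F1 | FG-1-PG-1 Ma-F1 | FG-2-PG-1 Mi-F1 | FG-2-PG-1 Ma-F1 |
---|---|---|---|---|---|
CoNLL2003 | FT | | | | |
CoNLL2003 | PODNet | | | | |
CoNLL2003 | LUCIR | | | | |
CoNLL2003 | ST | | | | |
CoNLL2003 | ExtendNER† | | | | |
CoNLL2003 | ExtendNER | | | | |
CoNLL2003 | CFNER† | | | | |
CoNLL2003 | CFNER | | | | |
CoNLL2003 | CPFD* | | | | |
CoNLL2003 | Ours | | | | |
CoNLL2003 | Imp. | | | | |
BioNLP11ID | FT | | | | |
BioNLP11ID | PODNet | | | | |
BioNLP11ID | LUCIR | | | | |
BioNLP11ID | ST | | | | |
BioNLP11ID | ExtendNER† | - | - | - | - |
BioNLP11ID | ExtendNER | | | | |
BioNLP11ID | CFNER† | - | - | - | - |
BioNLP11ID | CFNER | | | | |
BioNLP11ID | CPFD* | | | | |
BioNLP11ID | Ours | | | | |
BioNLP11ID | Imp. | | | | |
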
# of Layers | I2B2 Mi-F1 | I2B2 Ma-F1 | OntoNotes5 Mi-F1 | OntoNotes5 Ma-F1 | CoNLL2003 Mi-F1 | CoNLL2003 Ma-F1 | BioNLP11ID Mi-F1 | BioNLP11ID Ma-F1 |
---|---|---|---|---|---|---|---|---|
−2:−1 | | | | | | | | |
−3:−1 | | | | | | | | |

Methods | I2B2 Mi-F1 | I2B2 Ma-F1 | OntoNotes5 Mi-F1 | OntoNotes5 Ma-F1 | CoNLL2003 Mi-F1 | CoNLL2003 Ma-F1 | BioNLP11ID Mi-F1 | BioNLP11ID Ma-F1 |
---|---|---|---|---|---|---|---|---|
Ours | | | | | | | | |
w/ | | | | | | | | |
w/ | | | | | | | | |
w/o | | | | | | | | |
w/o CPL | | | | | | | | |
w/o ART | | | | | | | | |
w/o SL | | | | | | | | |

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, H.; Zhou, L.; Gu, M. Reduced Forgetfulness in Continual Learning for Named Entity Recognition Through Confident Soft-Label Imitation. Mathematics 2024, 12, 3964. https://doi.org/10.3390/math12243964