Article

PNER: Applying the Pipeline Method to Resolve Nested Issues in Named Entity Recognition

Center for Artificial Intelligence Research, Pusan National University, Busan 46241, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(5), 1717; https://doi.org/10.3390/app14051717
Submission received: 20 January 2024 / Revised: 14 February 2024 / Accepted: 19 February 2024 / Published: 20 February 2024

Abstract:
Named entity recognition (NER) in natural language processing encompasses three primary entity types: flat, nested, and discontinuous. While flat NER receives the most research attention, nested NER remains a significant challenge. Current approaches to nested NER include sequence labeling with merged label layers, cascaded models, and methods rooted in reading comprehension. Among these, sequence labeling with merged label layers stands out for its simplicity and ease of implementation, yet it suffers from well-documented drawbacks, which we aim to remedy. In this study, we enhance the sequence labeling approach with a pipeline model split into a sequence labeling task and a text classification task. Instead of annotating specific entity categories, we merged all types into main and sub-categories for uniform treatment; these categories were then embedded as identifiers in the recognized text for the text classification task. We adopted BERT+BiLSTM+CRF for sequence labeling and a BERT model for text classification. Experiments were conducted on three nested NER datasets: GENIA, CMeEE, and GermEval 2014, with annotation depths ranging from four levels to two. Before model training, we performed separate statistical analyses of nested entities in the medical dataset CMeEE and the general-domain dataset GermEval 2014. These analyses revealed that a single entity category dominates the nested entities in each dataset, suggesting that labeling primary and subsidiary entities can support effective category recognition. Model performance was evaluated with F1 scores, counting a prediction as correct only when both the complete entity span and its category were identified. The proposed modifications yielded substantial performance gains over the original method, and the improved model is competitive with existing models: F1 scores on the GENIA, CMeEE, and GermEval 2014 datasets reached 79.21, 66.71, and 87.81, respectively. Our results show that, while preserving the original method’s simplicity and ease of implementation, the enhanced model achieves higher performance and strong competitiveness compared with other methodologies.

1. Introduction

In natural language processing (NLP), the identification of specific words or phrases, such as individual names, locations, and organizations, is essential for tasks including machine translation [1], question answering [2,3], entity linking [4], and relation extraction [5]. These terms, known as named entities (NEs), are identified through named entity recognition (NER), a foundational task in NLP.
However, certain texts contain named entities that are nested or overlapping, giving rise to what is known as nested entities or entity nesting [6]. These complexities are prevalent in particular domain-specific datasets. Most researchers concentrate on resolving flat entities [7]; the existence of nested entities in text often goes unaddressed. Refer to Figure 1 for an illustrative sentence extracted from the GENIA dataset [8]: “Mouse interleukin-2 receptor alpha gene expression”. In this sentence, “interleukin-2” is categorized as a “protein” entity type. Notably, “interleukin-2” also exists within the “DNA” entity type “Mouse interleukin-2 receptor alpha gene”. This scenario exemplifies a nested entity, where “interleukin-2” holds multiple entity-type associations within the same text span. Sequence labeling methods are commonly employed to tackle NER tasks. However, these methods confine each token to representing only one entity category, disregarding the possibility of multiple entity categories associated with a single token within nested entities. Addressing this issue, some researchers proposed solutions such as the “merged label layer” [9], a method involving sequence labeling models annotating multiple entity categories on tokens with nesting challenges. Although straightforward, this method leads to label expansion, resulting in model sparsity.
In our study, we address the significant challenges inherent in processing nested entities with sequence labeling methods, a task complicated by the intricate hierarchies and overlapping entities in text data. Traditional sequence labeling approaches primarily focus on identifying single-layer entities, often inadequately handling or entirely overlooking nested structures. Although some researchers have explored solutions to this problem, their methods either deviate from the simplicity of the sequence labeling framework or require the annotation of an extensive number of labels, leading to model sparsity. Our work seeks to navigate these issues, proposing an innovative approach that retains the simplicity of sequence labeling models while effectively managing the complexity of nested entities.
Our research endeavors to optimize this approach by mitigating label proliferation. We devised a pipeline approach to tackle nested entity complexities. This method consists of two stages. Initially, we utilized a sequence labeling model incorporating a merged label layer; however, we streamlined entity categories by introducing a hierarchical entity level. This hierarchical level, labeled as ent, sub1, sub2, and so forth, is determined based on primary and secondary relationships within nested entities, with the outermost entity assigned the highest importance (ent). This hierarchical structure significantly reduces label volume. Subsequently, we employed a classification model to ascertain the entity category.
In a prior state-of-the-art model for entity relationship extraction proposed by Chen et al. [10], a pipeline approach was utilized. They proposed an enhancement involving entity boundaries and types as identifiers around entity spans within the classification model. Building upon this notion, our research incorporates entity levels as identifiers around entity spans within the classification model. We hypothesize that entity levels may exert influence on the entity category.
We conducted experiments on three datasets encompassing nested entity challenges across diverse languages: the Chinese medical dataset CMeEE [11], the English medical dataset GENIA [8], and the German dataset GermEval 2014 [12]. For the sequence labeling task, we employed the BERT-Bi-LSTM-CRF model, a state-of-the-art approach. Leveraging BERT [13] for pre-training, this model constructs word embeddings, then applies a bi-directional long short-term memory (Bi-LSTM) network for feature extraction, capturing both long- and short-range dependencies. Finally, a conditional random field (CRF) layer produces the NER labels, a conventional choice for NER tasks. For the classification task, we used transfer learning, specifically fine-tuning a pre-trained BERT model. Our main contributions are listed as follows:
  • We refined the sequence labeling method to effectively tackle nested entity challenges. Our proposed approach first uses the merged label layer method to identify entity boundaries, then classifies entity categories. This strategy preserves the benefits of the merged label layer method while mitigating the exponential growth in label counts.
  • Introducing an entity hierarchical system allowed us to discern between different entities nested within each other. This hierarchical grading of entities, from outermost to innermost, ensures the precise identification of entity boundaries in instances with nesting complexities.
  • In our approach to entity classification, we were inspired by the pipeline method employed in entity relationship extraction. Our innovation involved introducing entity hierarchies as text identifiers. This decision was rooted in our belief that modifications in entity hierarchies would significantly influence the identification of entity classes.
  • Following experimentation across three diverse datasets, our findings underscore that our methodology not only refines the merged label layer sequence labeling method but also maintains robust competitiveness when compared to existing models.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the three datasets used in our experiments, along with a statistical analysis of each. Section 4 presents the proposed method: Section 4.1 details the labeling method for the fusion labeling layer, Section 4.2 discusses the labeling method that unifies entity categories, and Section 4.3 explains the insertion of identifiers for entity category recognition. Section 5 then presents our experimental results, including comparisons and analyses relative to both the unimproved model and other baseline models. Section 6 concludes the paper.

2. Related Work

Named entities, a concept first introduced by Grishman and Sundheim in 1996 [14], have played a pivotal role in the field of natural language processing. The development and refinement of named entity datasets since then have been instrumental in advancing the domain of flat named entity recognition (NER) [15]. The most widely adopted model in NER has been the sequence labeling model [7], with the LSTM-CRF model standing out as a classic example of this approach [16,17]. Long short-term memory (LSTM) is a specialized form of recurrent neural network (RNN) introduced by Goller and Kuchler in 1996 [18]; it features a unique memory cell designed to retain past information over extended periods. In sequence tagging tasks, a bidirectional LSTM network is used so that both past and future input features inform the prediction [19]. Alex (2007) presented a variety of strategies for integrating multiple conditional random fields (CRFs) [20] to address such tasks [21]. The subsequent introduction of the pre-trained model BERT marked a significant leap forward in NER efficiency and accuracy; through fine-tuning, BERT can be adapted for various tasks, including named entity recognition. In the evaluation task of the CMeEE dataset, various versions of BERT were employed as benchmarks for assessment [22]. Furthermore, labeling schemes such as BIO, BIOES, and BILOU have emerged from these sequence labeling models, further enriching the tools available for NER tasks.
In the realm of biology, researchers have identified a unique challenge with certain entities that demonstrate nesting issues, where one entity encompasses other entities. Tackling these nested entity problems initially posed significant challenges, but over time, various innovative methods have been proposed to address them. A notable solution is the cascading model, pioneered by Ju et al. [16]. This model adopts a hierarchical strategy, building multiple layers of flat NER recognition that systematically identify entities from the innermost layer outward until no further entities are detectable. This approach ensures that information about inner entities is fully utilized by outer entities. However, it has a limitation in the one-way flow of information, as it does not effectively harness information from outer entities. Consequently, when misidentifications occur at the innermost levels, they can result in cascading errors throughout the system.
To overcome this limitation, Luo (2020) implemented neural network techniques to facilitate a mutual interaction between inner and outer entities [23]. This innovation marked a significant advancement in handling nested entities. Additionally, Wang (2020) introduced a pyramid structure to refine and optimize the concept of the hierarchical model [24], further enhancing the effectiveness of nested entity recognition.
Several researchers have innovated named entity recognition (NER) by proposing the multi-head approach. In this method, token pairs, comprising head and tail tokens, define a span. A multi-head matrix is then constructed using various techniques like dot-product scoring [25], additive methods [26], and a blend of multiplicative and additive approaches [27]. This approach also integrates the concept of Biaffine [28] into the field.
Another significant advancement is the span-based method. This technique identifies all entities within a sequence by considering every possible span of characters in the text, employing a neural network-based classifier for this purpose [29,30]. An alternative perspective treats NER as a reading comprehension task [31]. This involves incorporating entity category information into the input, presenting a unified framework suitable for a range of information extraction tasks [32].
Finally, traditional sequence labeling methods struggle with nested entities. To address this, the “merged label layer” approach has been proposed, which consolidates multiple labels into a single layer. This enables nested entities to be treated as flat entities, making them recognizable using sequence labeling models.
The pipeline method, used in information extraction tasks, has received mixed reviews from researchers. Some, like Chen (2020), have successfully applied this method to entity relationship extraction, achieving notable results [10]. A key improvement in this approach is the incorporation of entity boundaries and types as identifiers before and after entity spans, proving that, when applied appropriately, the pipeline model can outperform joint models.
In summary, various methods have been proposed to tackle the nesting problem in NER, each achieving significant results. However, these methods often face challenges related to complexity or implementation issues. In response, we have improved the simple and easily implementable sequence labeling method with a merged label layer, transforming it into a pipeline approach.

3. Datasets

In the task of named entity recognition, open-source datasets containing nested entities are very scarce. Therefore, in this experiment, we utilized datasets in three different languages to demonstrate the feasibility of the model in nested named entity recognition tasks across various languages. The open-source datasets for these three languages are: the Chinese medical dataset CMeEE, the English medical dataset GENIA, and the German daily news dataset GermEval 2014. All three datasets were divided into training sets (Train), validation sets (Dev), and test sets (Test), and each dataset included nested entities. Detailed descriptions of the datasets are provided below.

3.1. CMeEE Dataset

The CMeEE dataset, tailored specifically for the Chinese medical domain, is a rich and comprehensive resource for researchers and practitioners. This dataset, encompassing a training set, validation set, and test set, was meticulously sourced from the AliTianchi platform and is presented in a JSON format for ease of use. The training and validation sets are particularly robust, featuring not only sentences but also detailed annotations. These include entities, the start index and end index of each entity, and entity categories.
The test set, while providing only sentences, plays a crucial role in evaluating the effectiveness of models trained on this dataset. Once a model is trained, its performance can be rigorously tested using this set, and the results can be submitted to the Ali platform for a comprehensive evaluation. This process ensures the robustness and applicability of models in real-world medical scenarios.
In our thorough analysis, we delved into the intricate details of the main entities and nested entities (sub-entities) present in the dataset. The findings of this detailed statistical examination are depicted in the table below.
Table 1 reveals insightful data about the CMeEE dataset. It includes the counts of each category label as main entities and sub-entities. The items in the table are described in Table 2. It shows that the dataset is composed of a total of 79,111 entities. A closer look at the distribution of these entities across different categories reveals a significant imbalance. When considering the entities as main categories, the categories “bod” (body parts) and “dis” (diseases) dominate in terms of occurrences, with the “sym” (symptoms) category following closely. Interestingly, the scenario shifts when we analyze the entities as sub-entities. Here, the “bod” category overwhelmingly constitutes about 84.7% of the total entities in both the training and test sets, underscoring its predominant role in the dataset’s structure. The “ite” category, although smaller in quantity, stands out due to its proportional significance. Despite its lower overall count, it ranks second in frequency of occurrence as a sub-entity, highlighting its importance in the context of nested entities. This pattern suggests a characteristic feature of the Chinese medical dataset: if an entity contains nested entities, there is a high probability that these nested entities belong to the “bod” category.
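The sub-entity proportions discussed above can be computed directly from span annotations. The sketch below is a framework-free illustration on toy data; the function name and the toy spans are our own and are not drawn from the CMeEE release.

```python
from collections import Counter

def nested_category_shares(entities):
    """Given (start, end, category) spans for a dataset, count how often each
    category occurs as a sub-entity, i.e. a span strictly contained inside
    another annotated span, and return each category's share of that total."""
    sub_counts = Counter()
    for i, (s1, e1, c1) in enumerate(entities):
        for j, (s2, e2, c2) in enumerate(entities):
            if i != j and s2 <= s1 and e1 <= e2 and (s1, e1) != (s2, e2):
                sub_counts[c1] += 1
                break  # count each entity at most once as a sub-entity
    total = sum(sub_counts.values())
    return {c: n / total for c, n in sub_counts.items()} if total else {}

# Toy CMeEE-style annotations: "bod" spans nested inside "dis"/"sym" spans.
spans = [(0, 10, "dis"), (0, 4, "bod"), (12, 25, "sym"), (12, 16, "bod"), (30, 35, "dis")]
print(nested_category_shares(spans))  # {'bod': 1.0} — all sub-entities are "bod"
```

Run over the full CMeEE annotations, this kind of tally yields the roughly 84.7% share of the “bod” category reported above.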

3.2. GermEval 2014

The GermEval 2014 dataset stands out as a comprehensive German named entity recognition resource, notable for its inclusion of nested entities. This dataset, meticulously compiled, derives its content from German news corpora and Wikipedia, ensuring a rich and varied linguistic representation. The dataset’s annotation rigorously adheres to the labeling scheme developed by D. Benikova et al. [12], employing the “BIO” (beginning, inside, outside) tagging method for precision and clarity in entity identification.
Mirroring the structure commonly found in domain-specific datasets, the GermEval 2014 dataset is methodically segmented into training, validation, and test sets. This segmentation facilitates a systematic approach to model training and evaluation, ensuring that the models developed are robust and effective across different data splits.
A distinct feature of the GermEval 2014 dataset is its composition of predominantly everyday language corpora, as opposed to the specialized language found in medical datasets. This characteristic necessitated an in-depth statistical analysis of the main entities and sub-entities present within the dataset, with the findings detailed below.
Table 3 provides the statistics of the GermEval 2014 dataset. It includes the counts of each category label as main entities and sub-entities. The items that appear in the table are described in detail in Table 4. It highlights that the dataset encompasses a total of 39,558 entities. An analysis of the distribution of these entities across various categories reveals an imbalance. However, an interesting observation emerges when the “deriv” (derivatives) and “part” (parts) categories are amalgamated with broader categories. When examining the dataset in terms of sub-entities, a significant pattern is observed. Entities categorized as “LOC” (locations) represent approximately 70.7% of the total entities across the training, validation, and test sets. This prevalence suggests a notable tendency within the German language corpus dataset: entities that act as nested entities are highly likely to fall under the “LOC” category. Such insights not only enhance our understanding of the dataset’s structure but also inform strategies for effective entity recognition and extraction in the German language context.

3.3. GENIA

The GENIA dataset, a specialized resource tailored for the English medical domain, represents a pivotal tool in advancing natural language processing within this field. Our pre-processing strategy for this dataset is grounded in the methodologies proposed by Finkel et al. [33] and Lu et al. [34]. By closely adhering to these established approaches, we ensure that our data handling and preparation processes align with the best practices in the field. This dataset is meticulously partitioned into training, validation, and test sets, adhering to a carefully considered ratio of 8.1:0.9:1. This allocation ensures a comprehensive and balanced distribution of data, facilitating effective training and rigorous evaluation of models.
Detailed information about the training, validation, and test sets for the three datasets is presented in Table 5. It includes the number of sentences in three types of datasets: the training set (Train), the development set (Dev), and the test set (Test), as well as the number of entity categories in each dataset.

4. Method

In our paper, we introduce a novel pipeline approach specifically designed to tackle the complex challenge of nested structures in named entity recognition. The flowchart of our model is depicted in Figure 2, where the sequence labeling model can be chosen as BERT-CRF or BERT-Bi-LSTM-CRF, with the model frameworks shown in Figure 3. In our subsequent experiments, we constructed the PNER I and PNER II models for them, respectively. For the flowchart, initially, the model adopts a sequential labeling technique to accurately identify entity boundaries. To address the complexities of nested entities, we integrate a merging label layer into the sequence labeling framework. To simplify the labeling process, which often involves a daunting variety of categories, we adopt a unified approach to entity types. Furthermore, we implement hierarchical annotations while maintaining a manageable number of labels, thereby reducing the complexity of entity categorization.
In the subsequent phase of our model development, we introduce a text classification system designed to categorize entity types using an advanced multi-class classification strategy. Our method involves the strategic use of hierarchical entity categories as markers within the text, along with precise annotations of the locations of these entities. Importantly, our approach pays special attention to nested entities, marking not only the primary entities but also the nested ones. This detailed and inclusive annotation strategy greatly improves the model’s capacity to recognize and understand the relationships between various entities.
The subsequent sections of the paper provide an in-depth and detailed discussion of each component of our methodology. This comprehensive explanation is intended to offer a clear and nuanced insight into the sophisticated processes and techniques we have employed in developing our approach.

4.1. Fusion Labeling Layer

In our study, we adopted a methodology inspired by prior research, focusing on organizing nested entity structures in the “BIO” format. To illustrate, take the example of the sentence “Mouse interleukin-2 receptor alpha gene expression”. In this case, the “BIO” labeling system provides a structured approach: “B-” indicates the beginning of an entity, “I-” marks the intermediate part of an entity, and “O” signifies segments without any entity association.
Our technique for integrating label layers follows a specific hierarchy, prioritizing longer entities over shorter ones. This hierarchy ranges from the primary entity to nested entities, arranged from left to right. We use a “|” separator to clearly differentiate the labels at each level, thus enabling precise delineation of entity hierarchies. Table 6 in our paper provides detailed insights into this process, with the sample sentence from the GENIA dataset along with nested-level annotation (label1 and label2) and fusion labeling layers. This method was first proposed by Straková et al. [9]. The pseudo-code for this method is shown in Algorithm 1.
A key feature of our labeling approach is how we handle sub-level label segments that align with non-entity sections. In the final annotation phase, these segments are hidden instead of being marked as “O”. This approach results in a more streamlined and accurate labeling scheme. Our systematic method simplifies the complexity involved in understanding and recognizing nested entities within texts, presenting an organized framework for effective entity recognition and hierarchy delineation.
Algorithm 1 Fusion labeling layer
Input: label1, label2
Output: result
1: if label2 is empty then
2:     result ← label1
3: else
4:     result ← label1 + “|” + label2
5: end if
6: return result
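Algorithm 1 can be transliterated into a few lines of Python. The token alignment and label names below are illustrative, following the GENIA example sentence:

```python
def fuse_labels(label1, label2):
    """Algorithm 1: merge a primary-layer label with a nested-layer label,
    joining them with '|' when the nested layer carries an entity tag."""
    if not label2:  # sub-level segment aligned with a non-entity section
        return label1
    return label1 + "|" + label2

# Per-token labels for "Mouse interleukin-2 receptor alpha gene expression":
# layer 1 annotates the DNA entity, layer 2 the nested protein entity.
layer1 = ["B-DNA", "I-DNA",     "I-DNA", "I-DNA", "I-DNA", "O"]
layer2 = ["",      "B-protein", "",      "",      "",      ""]
fused = [fuse_labels(a, b) for a, b in zip(layer1, layer2)]
print(fused)  # ['B-DNA', 'I-DNA|B-protein', 'I-DNA', 'I-DNA', 'I-DNA', 'O']
```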

4.2. Unified Entity Categories

The merging label layer technique in the sequence labeling model significantly increases the number of labels. To address this issue, we further refined our method by consolidating entity categories; the improved labeling scheme is illustrated in Table 7 below. While we continue to use the unchanged “BIO” labeling rule, our modified merging label layer method applies a coarser categorization of entity types: regardless of its specific category, the longer entity is uniformly labeled “-ent”, and, in the same vein, shorter (nested) entities are uniformly labeled “-sub”. This system yields a streamlined set of six possible labels in cases with only a single nesting layer: “B-ent”, “I-ent”, “B-ent|B-sub”, “I-ent|B-sub”, “I-ent|I-sub”, and “O”. It is important to note that, due to our labeling conventions, the label “B-ent|I-sub” cannot occur. After pre-processing with the unified label layer, the resulting dataset sizes are shown in Table 8. The pseudocode for the unified entity categories is shown in Algorithm 2.
During the decoding phase, we use a systematic method to unravel arrays marked by “|” as the delimiter. We then meticulously merge these arrays in sequential order, starting with “B-” and ending when an “O” is encountered. This precise process allows for the complete construction of an entity. These refinements make the labeling process more streamlined, enabling a more effective and organized approach to entity recognition. They also maintain clarity and accuracy in decoding complex nested structures within text data.
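The decoding step described above can be sketched as follows. This is our own illustrative decoder, assuming well-formed fused tags; the paper's exact implementation may differ in detail.

```python
def decode_fused(labels):
    """Decode fused unified tags (e.g. 'I-ent|B-sub') into token spans.
    Splitting each tag on '|' gives one sub-tag per nesting level: level 0 is
    the main entity ('ent'), level 1 the nested one ('sub'). Returns
    (start, end, level) triples with end exclusive."""
    spans, open_spans = [], {}  # open_spans: nesting level -> start index
    for i, tag in enumerate(labels):
        parts = tag.split("|")
        # close every open span whose level does not continue with an 'I-' tag
        for level in list(open_spans):
            if not (level < len(parts) and parts[level].startswith("I-")):
                spans.append((open_spans.pop(level), i, level))
        # open a new span wherever a 'B-' tag appears
        for level, part in enumerate(parts):
            if part.startswith("B-"):
                open_spans[level] = i
    for level, start in open_spans.items():  # close spans reaching the end
        spans.append((start, len(labels), level))
    return sorted(spans)

tags = ["B-ent", "I-ent|B-sub", "I-ent|I-sub", "I-ent", "O"]
print(decode_fused(tags))  # [(0, 4, 0), (1, 3, 1)]
```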
Algorithm 2 Unified entity categories
Input: label1, label2
Output: result
1: Begin
2: if label1 starts with “B-” then
3:     result ← “B-ent”
4: else if label1 starts with “I-” then
5:     result ← “I-ent”
6: else
7:     result ← label1
8: end if
9: if label2 is not empty then
10:     result ← result + “|”
11:     if label2 starts with “B-” then
12:         result ← result + “B-sub”
13:     else if label2 starts with “I-” then
14:         result ← result + “I-sub”
15:     else
16:         result ← result + label2
17:     end if
18: end if
19: Return result
20: End
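A direct Python transliteration of Algorithm 2, applied to per-token label pairs from the GENIA example sentence (label names illustrative):

```python
def unify(label1, label2=""):
    """Algorithm 2: collapse category-specific BIO tags into the unified
    'ent'/'sub' scheme, fusing the two layers with '|'."""
    if label1.startswith("B-"):
        result = "B-ent"
    elif label1.startswith("I-"):
        result = "I-ent"
    else:
        result = label1  # e.g. 'O' passes through unchanged
    if label2:
        result += "|"
        if label2.startswith("B-"):
            result += "B-sub"
        elif label2.startswith("I-"):
            result += "I-sub"
        else:
            result += label2
    return result

pairs = [("B-DNA", ""), ("I-DNA", "B-protein"), ("I-DNA", ""), ("O", "")]
print([unify(a, b) for a, b in pairs])  # ['B-ent', 'I-ent|B-sub', 'I-ent', 'O']
```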

4.3. Entity Category Classification

Given our approach of unifying entity categories in the initial sequence annotation process, we have encountered limitations in achieving more precise entity classifications. As a result, incorporating an additional text classification model is crucial for effectively distinguishing between different entity categories. Figure 4 showcases the use of identifiers for text annotation within the entity category classification phase, employing example sentences from the GENIA dataset for illustration. The text includes category tags representing distinct entity types, along with position markers that indicate the boundaries of the entities, all integrated within sentences. Our category tags are divided into two main classifications: “main entity” (ent) and “nested entity” (sub). The position markers, like <ent> and </ent> for starting and ending recognized main entities and <sub> and </sub> for marking the beginning and end of recognized nested entities, act as our guide for annotations. For the annotation protocol of the classification model, the following applies:
  • We begin by injecting the category tag (“ent” or “sub”) at the start of each sentence.
  • When the main entity category is identified within a sentence, we strategically place markers to denote the beginning and end of this main entity.
  • To identify nested entity categories, our initial step involves inserting markers that indicate the start and finish of the main entity enclosing the nested one. Following this, we insert additional markers to highlight the start and end positions of the nested entity itself. It is important to note that the markers for the main entity take priority in this process.
Let Xmain denote this modified main entity sequence with text markers inserted, and Xsub denote this modified sub-entity sequence with text markers inserted:
X_main = [main entity], …, <ent>, x_START(main), …, x_END(main), </ent>, …
X_sub = [sub-entity], …, <ent>, x_START(main), …, <sub>, x_START(sub), …, x_END(sub), </sub>, …, x_END(main), </ent>, …
Each entity requires careful text annotation, not just during the training phase but also in subsequent prediction tasks. This detailed annotation is vital, as it forms the cornerstone of improving our model’s ability to understand and categorize entities within text data. After the pre-processing of inserting identifiers, the statistical results of the dataset size are as shown in Table 9.
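The identifier-insertion protocol above can be sketched as a small helper. The function name and the (start, end) token-index convention are our own assumptions; the sketch does not handle entities whose marker positions coincide.

```python
def insert_markers(tokens, main_span, sub_span=None):
    """Build a classifier input: prepend the category tag ('ent' for a main
    entity, 'sub' for a nested one), then wrap the main entity in
    <ent>...</ent> and, when the target is nested, its span in <sub>...</sub>.
    Spans are (start, end) token indices with end exclusive."""
    tag = "sub" if sub_span else "ent"
    out = [tag] + list(tokens)  # +1 offset from the prepended category tag
    inserts = [(main_span[0] + 1, "<ent>"), (main_span[1] + 1, "</ent>")]
    if sub_span:
        inserts += [(sub_span[0] + 1, "<sub>"), (sub_span[1] + 1, "</sub>")]
    # insert from right to left so earlier offsets remain valid
    for pos, marker in sorted(inserts, key=lambda x: -x[0]):
        out.insert(pos, marker)
    return " ".join(out)

tokens = "Mouse interleukin-2 receptor alpha gene expression".split()
print(insert_markers(tokens, (0, 5)))
# ent <ent> Mouse interleukin-2 receptor alpha gene </ent> expression
print(insert_markers(tokens, (0, 5), (1, 2)))
# sub <ent> Mouse <sub> interleukin-2 </sub> receptor alpha gene </ent> expression
```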

5. Results and Discussion

Our experimental setup was conducted on a platform equipped with Intel i7-12700 and NVIDIA GeForce RTX 3050 Ti. For the development and execution of our models, we employed the PaddlePaddle 2.1 framework, which provides a comprehensive and efficient environment for deep learning applications. Meanwhile, in our experiment, we adopted a highly rigorous evaluation approach. An entity match was considered successful only if it accurately matched the start and end indices of the entity and correctly classified its category. We used the micro-F1 score to evaluate our model’s performance.
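The strict matching criterion can be made concrete with a minimal micro-F1 sketch over (start, end, category) triples; in practice the true-positive, prediction, and gold counts would be aggregated over the whole test set before computing precision and recall. The data below are toy values, not results from the paper.

```python
def micro_f1(gold, pred):
    """Strict span-level micro-F1: a predicted entity counts as correct only
    if its (start, end, category) triple exactly matches a gold entity."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 5, "DNA"), (1, 2, "protein"), (7, 9, "RNA")]
pred = [(0, 5, "DNA"), (1, 2, "DNA"), (7, 9, "RNA")]  # one category error
print(round(micro_f1(gold, pred), 3))  # tp=2, P=R=2/3, F1 ≈ 0.667
```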
Initially, on the CMeEE dataset, we assessed our model’s performance using various pre-trained BERT models and compared its effectiveness in recognizing entities against traditional models. We then explored the impact of incorporating identifiers in the classification of entity categories and examined how this influenced the outcomes.
Furthermore, we conducted a comparison involving two approaches against other baseline models. The first method used a pre-trained BERT model as the encoding layer, complemented by an additional conditional random field (CRF) layer. The second approach integrated contextual word embeddings from the pre-trained BERT model, coupled with a bi-directional LSTM for encoding and a CRF for decoding.
After completing our experiments on the CMeEE dataset, we extended our analysis to compare the most effective methods with other baseline models on datasets in two additional languages: GermEval 2014 and GENIA.

5.1. Discussion of Results on the CMeEE Dataset

Firstly, we referred to the performance of three pre-trained models on the CMeEE dataset provided by Zhang et al. in CBLUE. These three pre-trained models are as follows:
  • BERT-wwm-ext-base [35]. This pre-trained model adopts the whole-word masking approach, masking the entire word as the smallest unit, and is trained on a Chinese language corpus.
  • BERT-base [13]. This is the basic BERT model, comprising 110 million parameters, and is trained on a Chinese language corpus.
  • MacBERT-large [36]. This is an enhanced BERT model that employs the masked language model (MLM) as a pre-training task.
In our PNER model, specifically for the entity boundary recognition task, we utilized these three pre-trained models and benchmarked the results against those reported by Zhang et al. [22]. The outcomes are detailed in Table 10 below; the same pre-trained models and parameters were used in each comparison, and the baseline figures were taken from CBLUE [22]. Across the board, every enhanced version of our model outperformed its baseline counterpart. The improvement was largest with the MacBERT pre-trained model, which also achieved the highest overall score. This indicates that while entity boundary recognition benefits from our approach, the gain is even more pronounced for entity category recognition, yielding a more substantial overall improvement in model performance.
During the classification task of our study, we introduced identifiers in the pre-processing stage of the dataset to indicate whether the entity being classified was a main entity or a sub-entity. Additionally, we annotated the dataset with position index identifiers for both the main and sub-entities. For training, we employed the BERT-base pre-trained model on the pre-processed CMeEE training set, and we carried out testing on the validation set. Alongside this, we monitored the classification performance across the various entity categories; the per-category results are presented in Table 11. Since it was unclear whether the inclusion of identifiers would affect the classification model’s performance, we conducted an experiment to determine this.
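The identifier-insertion pre-processing can be sketched as below. The marker strings and their exact format are our own illustration; the paper's pre-processing encodes the same information (the span's role as main or sub-entity plus its start and end indices) directly in the text.

```python
def insert_identifiers(text, span, role):
    """Wrap the target entity span in role-specific marker tokens so the
    classifier sees both its position and whether it is a main or sub entity.
    `span` holds character offsets (start, end), end exclusive; `role` is
    'main' or 'sub'. Marker syntax is illustrative only."""
    start, end = span
    open_marker = f"<{role}:{start}>"
    close_marker = f"</{role}:{end}>"
    return text[:start] + open_marker + text[start:end] + close_marker + text[end:]
```

Each entity found in the boundary recognition stage yields one marked copy of the sentence, which is then fed to the BERT classifier to assign the original fine-grained category.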
Our primary objective in the classification task was to achieve higher accuracy. To this end, we conducted experiments to compare scenarios with and without the inclusion of identifiers. We evaluated the accuracy of both approaches, and the comparative findings are depicted in Table 12. The results clearly demonstrate that including identifiers significantly enhances the model’s classification precision. This underscores the vital role these identifiers play in effectively classifying different entity categories.
We proceeded to conduct entity boundary recognition tasks using two distinct models: BERT-CRF and BERT-Bi-LSTM-CRF. Upon completing these tasks, we introduced identifiers for entity category classification and proceeded to classify the entities accordingly. We designated these two models as PNER I and PNER II, respectively. To evaluate their effectiveness, we compared their performance on the CMeEE dataset against three benchmark models.
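The two-stage design shared by PNER I and PNER II can be summarized as a simple pipeline: a boundary model proposes (start, end, main/sub) spans, and a classifier assigns a category to each marked span. The sketch below uses placeholder callables rather than the paper's PaddlePaddle models, and the marker format is our own assumption.

```python
def pner_pipeline(text, boundary_model, classify):
    """Pipeline sketch. boundary_model(text) yields (start, end, role) spans,
    where role is 'main' or 'sub'; classify(marked_text) returns the entity
    category for the span wrapped in identifier tokens."""
    results = []
    for start, end, role in boundary_model(text):
        # Mark the target span so the classifier knows its position and role.
        marked = (text[:start] + f"<{role}:{start}>" + text[start:end]
                  + f"</{role}:{end}>" + text[end:])
        results.append((start, end, classify(marked)))
    return results
```

In practice, `boundary_model` stands in for the BERT-CRF or BERT-Bi-LSTM-CRF sequence labeler and `classify` for the BERT text classifier; errors in the first stage propagate to the second, which the strict evaluation criterion accounts for.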
The traditional BERT-CRF model employs the BERT pre-trained model [13] for encoding and a CRF [20] layer for decoding. The RICON model [37] is a span-based named entity recognition model that enhances span representation by integrating regularized information about entities. The Global Pointer model [25] approaches entity recognition by considering the starting and ending positions of entities from a global perspective.
The results of these comparisons are detailed in Table 13, where “*” denotes our implementation via their code. From the findings, we observed that our enhanced PNER I model, based on BERT-CRF, shows some improvement. Moreover, PNER II, which incorporates LSTM in the encoding layer, performs even better, slightly surpassing the other models in effectiveness.

5.2. Discussion of Results on Other Datasets

When we tested our model on the GENIA dataset, we applied the PNER II framework and initially compared it with the model introduced by Straková et al. [9]. In their research, they presented two named entity recognition methods. The first method employs an LSTM-CRF neural model framework, which was designed to tackle nested entities through the integration of label layers. Our model builds upon this approach, capitalizing on its simplicity and efficiency for improvement. The second method they proposed uses a sequence-to-sequence (seq2seq) approach, and both methods have shown commendable performance. We compared our model against both of these methods, as detailed in Table 14.
The comparison results revealed that, using the same LSTM-CRF framework, our model achieved significant performance improvements. These improvements are attributed to optimizations in our annotation method during sequence labeling and to enhanced accuracy in identifying entity categories. This underscores that effectively containing the growth of named entity category labels in nested entities can notably boost model performance, while retaining the simplicity and efficiency of the original method. Furthermore, our model shows notable improvements over the second, seq2seq-based framework proposed by Straková et al. [9].
The comparison results in Table 14 indicate that our enhanced model achieved significant improvements while maintaining the original framework; “#” denotes the baseline model before improvement. We then compared our model with various benchmark models on the GENIA dataset. These benchmarks include the sequence labeling method of Ju et al. [16], which recognizes nested entities via stacked flat NER layers; the hypergraph-based approach of Wang and Lu [38]; the seq2seq-based methods of Straková et al. [9] and Yan et al. [39]; the span-based method of Wang et al. [24]; and the span-level graphs introduced by Wan et al. [40]. The results of these comparisons are presented in Table 15. Although our model does not outperform all benchmark models on the GENIA dataset, it remains competitive in performance while being notably simple and easy to implement.
In Table 16, we compare our model against benchmarks on the GermEval 2014 dataset, a German language dataset. For this experiment, we utilized the multilingual version of BERT-base to obtain contextual word vectors. The results indicate that our model performs better on the GermEval 2014 dataset compared to other models.
Table 17 displays the execution times of the model on the GENIA dataset, measured per epoch. Because the pipeline method requires running two models in sequence, it takes longer to execute than traditional sequence labeling models. However, given the inherently simple framework of each stage, we believe this method does not take significantly longer than other, more complex models.
It is important to highlight that, while addressing nested named entities often leads to more complex models, the sequence labeling method remains the most traditional and straightforward approach in named entity recognition. This method requires only simple annotations for identifying entities. Our proposed method for fusing label layers represents a straightforward enhancement to this labeling technique, facilitating the recognition of nested named entities. This enhancement optimizes the model through the unification of label layers, maintaining its simplicity and effectiveness. By incorporating identifiers, we ensure the precision of entity category recognition. Our model stands out for its strong competitiveness compared to other more complex models.
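Decoding the fused, unified label layers back into nested spans is itself straightforward, which is part of what keeps the method simple. The sketch below assumes, following the example in Table 7, that per-token labels are joined with "|" and that the first field is the main-entity (outer) tag and the second the sub-entity (inner) tag.

```python
def decode_fused(labels):
    """Decode a fused BIO label sequence (e.g. 'I-ent|B-sub') into spans.
    Returns {layer: [(start, end, type), ...]}, end exclusive; layer 0 holds
    the main (outer) entities and layer 1 the nested sub-entities."""
    layers = {}
    for i, label in enumerate(labels):
        for level, tag in enumerate(label.split("|")):
            if tag == "O":
                continue
            prefix, _, etype = tag.partition("-")
            if prefix == "B":
                layers.setdefault(level, []).append([i, i + 1, etype])
            elif prefix == "I" and layers.get(level):
                # Extend the most recent span on this layer (a real decoder
                # would also check that the types match).
                layers[level][-1][1] = i + 1
    return {lv: [tuple(s) for s in spans] for lv, spans in layers.items()}
```

For the Table 7 sentence, this recovers the full main entity "Mouse interleukin-2 receptor alpha gene expression" (layer 0) and the nested sub-entity "interleukin-2" (layer 1).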

6. Conclusions

Our model builds upon the fused labeling approach pioneered by Straková et al. [9]. To tackle the rapid growth in the number of entity category labels, we implemented a pipeline approach that first recognizes entity boundaries and then classifies entity categories, while largely avoiding the error propagation problem typical of pipeline methods. This feasibility stems from our rigorous evaluation criteria, which require the exact identification of both entity boundaries and categories for a match to be deemed successful. In the entity boundary recognition task, we use a unified entity category scheme, labeling entities as either main entities or sub-entities. This unification substantially decreases the number of required labels, thereby improving the model’s accuracy in identifying entity boundaries. Distinguishing main from sub-entities is crucial: with only a single category labeled “entity”, label fusion would make it difficult to separate the boundaries of individual entities. For the entity category identification task, we adopted a methodology inspired by Chen et al.’s work in relation extraction, introducing identifiers (including start and end indices) in the text to distinguish main entities from sub-entities. To validate this method, we first analyzed the distribution of main and sub-entities across datasets from two different domains and observed a consistent trend: in each dataset, one category invariably accounts for the majority of sub-entities. We then ran tests before and after adding the identifiers; the results confirm that including identifiers improves the recognition of entity categories. Tested on named entity recognition datasets in three languages, our model demonstrates excellent performance.
Additionally, it retains the simplicity and efficiency of the traditional sequence labeling method while remaining highly competitive with other benchmark models. However, the execution times show that the PNER model, because of its pipeline design, inevitably runs two models, resulting in a considerably longer processing time than traditional sequence labeling models; the method also demands more time for data annotation. Furthermore, our current method does not address discontinuous entities, a notable aspect of named entity recognition. In the future, we aim to extend our method to discontinuous entities, thereby providing a straightforward, comprehensive solution to all types of named entity recognition challenges.

Author Contributions

Conceptualization, H.Y.; methodology, H.Y.; software, Q.Z.; validation, H.Y. and Q.Z.; investigation, H.Y.; resources, H.Y.; data curation, H.Y. and Q.Z.; writing—original draft preparation, H.Y.; writing—review and editing, H.Y. and H.-C.K.; visualization, H.Y.; supervision, H.-C.K.; funding acquisition, H.-C.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development grant (IITP-2024-RS-2023-00254177) funded by the Korea government (MSIT).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that are used in this research work are openly available at the following links: 1. CMeEE dataset: https://tianchi.aliyun.com/dataset/95414 (accessed on 29 June 2023); 2. GermEval 2014 dataset: https://sites.google.com/site/germeval2014ner/data (accessed on 17 July 2023); 3. GENIA dataset: http://www.geniaproject.org/genia-corpus (accessed on 17 July 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Babych, B.; Hartley, A. Improving Machine Translation Quality with Automatic Named Entity Recognition. In Proceedings of the 7th International EAMT Workshop on MT and Other Language Technology Tools, Improving MT through Other Language Technology Tools, Resource and Tools for Building MT, Budapest, Hungary, 13 April 2003. [Google Scholar]
  2. Mollá, D.; van Zaanen, M.; Smith, D. Named Entity Recognition for Question Answering. In Proceedings of the Australasian Language Technology Workshop, Sydney, Australia, 30 November–1 December 2006; Cavedon, L., Zukerman, I., Eds.; pp. 51–58. [Google Scholar]
  3. Alzubi, J.A.; Jain, R.; Singh, A.; Parwekar, P.; Gupta, M. COBERT: COVID-19 Question Answering System Using BERT. Arab. J. Sci. Eng. 2023, 48, 11003–11013. [Google Scholar] [CrossRef]
  4. Le, P.; Titov, I. Improving Entity Linking by Modeling Latent Relations between Mentions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; Association for Computational Linguistics: Melbourne, Australia, 2018; pp. 1595–1604. [Google Scholar]
  5. Wei, Z.; Su, J.; Wang, Y.; Tian, Y.; Chang, Y. A Novel Cascade Binary Tagging Framework for Relational Triple Extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1476–1488. [Google Scholar]
  6. Shen, Y.; Ma, X.; Tan, Z.; Zhang, S.; Wang, W.; Lu, W. Locate and Label: A Two-Stage Identifier for Nested Named Entity Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2782–2794. [Google Scholar]
  7. Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; Association for Computational Linguistics: San Diego, CA, USA, 2016; pp. 260–270. [Google Scholar]
  8. Kim, J.-D.; Ohta, T.; Tateisi, Y.; Tsujii, J. GENIA Corpus—A Semantically Annotated Corpus for Bio-Textmining. Bioinformatics 2003, 19 (Suppl. 1), i180–i182. [Google Scholar] [CrossRef]
  9. Straková, J.; Straka, M.; Hajic, J. Neural Architectures for Nested NER through Linearization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Florence, Italy, 2019; pp. 5326–5331. [Google Scholar]
  10. Zhong, Z.; Chen, D. A Frustratingly Easy Approach for Entity and Relation Extraction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 50–61. [Google Scholar]
  11. Hongying, Z.; Wenxin, L.; Kunli, Z.; Yajuan, Y.; Baobao, C.; Zhifang, S. Building a Pediatric Medical Corpus: Word Segmentation and Named Entity Annotation. In Proceedings of the Chinese Lexical Semantics, Nanjing, China, 15 May 2021; Liu, M., Kit, C., Su, Q., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 652–664. [Google Scholar]
  12. Benikova, D.; Biemann, C.; Reznicek, M. NoSta-D Named Entity Annotation for German: Guidelines and Dataset. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 2524–2531. [Google Scholar]
  13. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar]
  14. Grishman, R.; Sundheim, B. Message Understanding Conference-6: A Brief History. In Proceedings of the COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics, Copenhagen, Denmark, 5–9 August 1996. [Google Scholar]
  15. Sang, E.F.T.K.; De Meulder, F. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. arXiv 2003, arXiv:cs/0306050. [Google Scholar]
  16. Ju, M.; Miwa, M.; Ananiadou, S. A Neural Layered Model for Nested Named Entity Recognition. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; Walker, M., Ji, H., Stent, A., Eds.; Association for Computational Linguistics: New Orleans, LA, USA, 2018; pp. 1446–1459. [Google Scholar]
  17. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv 2015, arXiv:1508.01991. [Google Scholar]
  18. Goller, C.; Kuchler, A. Learning Task-Dependent Distributed Representations by Backpropagation through Structure. In Proceedings of the International Conference on Neural Networks (ICNN’96), Washington, DC, USA, 3–6 June 1996; Volume 1, pp. 347–352. [Google Scholar]
  19. Graves, A.; Mohamed, A.; Hinton, G. Speech Recognition with Deep Recurrent Neural Networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649. [Google Scholar]
  20. Lafferty, J.; McCallum, A.; Pereira, F.C.N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, San Francisco, CA, USA, 28 June–1 July 2001. [Google Scholar]
  21. Alex, B.; Haddow, B.; Grover, C. Recognising Nested Named Entities in Biomedical Text. In Proceedings of the Biological, Translational, and Clinical Language Processing, Prague, Czech Republic, 29 June 2007; pp. 65–72. [Google Scholar]
  22. Zhang, N.; Chen, M.; Bi, Z.; Liang, X.; Li, L.; Shang, X.; Yin, K.; Tan, C.; Xu, J.; Huang, F.; et al. CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark. arXiv 2022, arXiv:2106.08087. [Google Scholar]
  23. Luo, Y.; Zhao, H. Bipartite Flat-Graph Network for Nested Named Entity Recognition. arXiv 2020, arXiv:2005.00436. [Google Scholar]
  24. Wang, J.; Shou, L.; Chen, K.; Chen, G. Pyramid: A Layered Model for Nested Named Entity Recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 5918–5928. [Google Scholar]
  25. Su, J.; Murtadha, A.; Pan, S.; Hou, J.; Sun, J.; Huang, W.; Wen, B.; Liu, Y. Global Pointer: Novel Efficient Span-Based Approach for Named Entity Recognition. arXiv 2022, arXiv:2208.03054. [Google Scholar]
  26. Bekoulis, G.; Deleu, J.; Demeester, T.; Develder, C. Joint Entity Recognition and Relation Extraction as a Multi-Head Selection Problem. Expert Syst. Appl. 2018, 114, 34–45. [Google Scholar] [CrossRef]
  27. Yu, J.; Bohnet, B.; Poesio, M. Named Entity Recognition as Dependency Parsing. arXiv 2020, arXiv:2005.07150. [Google Scholar]
  28. Dozat, T.; Manning, C.D. Deep Biaffine Attention for Neural Dependency Parsing. arXiv 2017, arXiv:1611.01734. [Google Scholar]
  29. Sohrab, M.G.; Miwa, M. Deep Exhaustive Model for Nested Named Entity Recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J., Eds.; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 2843–2849. [Google Scholar]
  30. Fu, J.; Huang, X.; Liu, P. SpanNER: Named Entity Re-/Recognition as Span Prediction. arXiv 2021, arXiv:2106.00641. [Google Scholar]
  31. Li, X.; Yin, F.; Sun, Z.; Li, X.; Yuan, A.; Chai, D.; Zhou, M.; Li, J. Entity-Relation Extraction as Multi-Turn Question Answering. arXiv 2019, arXiv:1905.05529. [Google Scholar]
  32. Li, X.; Feng, J.; Meng, Y.; Han, Q.; Wu, F.; Li, J. A Unified MRC Framework for Named Entity Recognition. arXiv 2022, arXiv:1910.11476. [Google Scholar]
  33. Finkel, J.R.; Manning, C.D. Nested Named Entity Recognition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–7 August 2009; Koehn, P., Mihalcea, R., Eds.; Association for Computational Linguistics: Singapore, 2009; pp. 141–150. [Google Scholar]
  34. Lu, W.; Roth, D. Joint Mention Extraction and Classification with Mention Hypergraphs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; Màrquez, L., Callison-Burch, C., Su, J., Eds.; Association for Computational Linguistics: Lisbon, Portugal, 2015; pp. 857–867. [Google Scholar]
  35. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z. Pre-Training with Whole Word Masking for Chinese BERT. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3504–3514. [Google Scholar] [CrossRef]
  36. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Wang, S.; Hu, G. Revisiting Pre-Trained Models for Chinese Natural Language Processing. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16–20 November 2020; pp. 657–668. [Google Scholar]
  37. Gu, Y.; Qu, X.; Wang, Z.; Zheng, Y.; Huai, B.; Yuan, N.J. Delving Deep into Regularity: A Simple but Effective Method for Chinese Named Entity Recognition. arXiv 2022, arXiv:2204.05544. [Google Scholar]
  38. Wang, B.; Lu, W.; Wang, Y.; Jin, H. A Neural Transition-Based Model for Nested Mention Recognition. arXiv 2018, arXiv:1810.01808. [Google Scholar]
  39. Yan, H.; Gui, T.; Dai, J.; Guo, Q.; Zhang, Z.; Qiu, X. A Unified Generative Framework for Various NER Subtasks. arXiv 2021, arXiv:2106.01223. [Google Scholar]
  40. Wan, J.; Ru, D.; Zhang, W.; Yu, Y. Nested Named Entity Recognition with Span-Level Graphs. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 892–903. [Google Scholar]
  41. Zheng, C.; Cai, Y.; Xu, J.; Leung, H.; Xu, G. A Boundary-Aware Neural Model for Nested Named Entity Recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3 November 2019; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 357–366. [Google Scholar]
  42. Wang, Y.; Li, Y.; Tong, H.; Zhu, Z. HIT: Nested Named Entity Recognition via Head-Tail Pair and Token Interaction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online Event, 16–20 November 2020; Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6027–6036. [Google Scholar]
  43. Pikuliak, M.; Simko, M.; Bielikova, M. Towards Combining Multitask and Multilingual Learning. In SOFSEM 2019: Theory and Practice of Computer Science; Catania, B., Královič, R., Nawrocki, J., Pighizzini, G., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; Volume 11376, pp. 435–446. ISBN 978-3-030-10800-7. [Google Scholar]
  44. Agrawal, A.; Tripathi, S.; Vardhan, M.; Sihag, V.; Choudhary, G.; Dragoni, N. BERT-Based Transfer-Learning Approach for Nested Named-Entity Recognition Using Joint Labeling. Appl. Sci. 2022, 12, 976. [Google Scholar] [CrossRef]
  45. Marcińczuk, M.; Radom, J. A Single-Run Recognition of Nested Named Entities with Transformers. Procedia Comput. Sci. 2021, 192, 291–297. [Google Scholar] [CrossRef]
Figure 1. An example of nested entities from GENIA dataset; protein and DNA are categories of entities.
Figure 2. The flowchart of the method.
Figure 3. Two kinds of sequence labeling model frameworks.
Figure 4. Data annotation method in text classification model.
Table 1. Statistics on the number of various category labels in the CMeEE dataset.

Item | Train Main | Train Sub | Dev Main | Dev Sub | Overall
dis | 15,664 | 155 | 4880 | 47 | 20,746
sym | 10,115 | 2 | 3388 | 0 | 13,505
pro | 6291 | 40 | 2039 | 16 | 8386
equ | 878 | 9 | 235 | 3 | 1125
dru | 3900 | 29 | 1433 | 7 | 5369
ite | 2265 | 301 | 813 | 104 | 3483
bod | 14,609 | 3063 | 4857 | 1019 | 23,548
dep | 348 | 0 | 108 | 1 | 457
mic | 1889 | 19 | 580 | 4 | 2492
Overall | 55,959 | 3618 | 18,333 | 1201 | 79,111
Table 2. Details of labels written in the CMeEE dataset.

Label | Named Entity (NE) Type | Subclasses
dis | Disease | Diseases or syndromes; poisoning or injury; organ or cell damage
sym | Symptom | Symptoms; signs
pro | Medical procedure | Diagnostic procedure; treatment or prevention procedure
equ | Medical equipment | Diagnostic equipment; treatment equipment
dru | Drug | Drugs
ite | Medical test item | Medical test items
bod | Body | Bodily substances; body parts
dep | Department | Departments
mic | Microorganism | Microorganisms
Table 3. Statistics on the number of various category labels in the GermEval 2014 dataset.

Item | Train Main | Train Sub | Dev Main | Dev Sub | Test Main | Test Sub | Overall
PER | 7679 | 173 | 711 | 18 | 1639 | 47 | 10,267
PERderiv | 62 | 5 | 2 | 0 | 11 | 1 | 81
PERpart | 184 | 12 | 18 | 3 | 44 | 3 | 264
LOC | 8281 | 907 | 763 | 78 | 1706 | 166 | 11,901
LOCderiv | 2808 | 76 | 235 | 6 | 561 | 26 | 3712
LOCpart | 513 | 21 | 52 | 1 | 109 | 2 | 698
ORG | 5255 | 41 | 496 | 3 | 1150 | 8 | 6953
ORGderiv | 41 | 0 | 3 | 0 | 8 | 1 | 53
ORGpart | 805 | 2 | 91 | 1 | 172 | 0 | 1071
OTH | 3024 | 20 | 269 | 0 | 697 | 4 | 4014
OTHderiv | 236 | 2 | 16 | 0 | 39 | 0 | 293
OTHpart | 190 | 0 | 18 | 0 | 42 | 1 | 251
Overall | 29,078 | 1259 | 2674 | 110 | 6178 | 259 | 39,558
Table 4. Details of labels written in the GermEval 2014 dataset.

Label | NE Type | Example Sentence | Entity
PER | Person | Für Erika Ziltener besteht angesichts dieser Entwicklung die Gefahr eines Zweiklassen-Gesundheitssystems. | Erika Ziltener
PERderiv | The derivation of the person | In der ersten Hälfte des 9. Jahrhunderts erstarkten die Sanjaya wieder. | Sanjaya
PERpart | The part of the person | Die folgenden vier Spielzeiten war er für Motor IFA Karl-Marx-Stadt aktiv. | Karl-Marx-Stadt
LOC | Location | In Deutschland ist nach StGB eine Anwerbung für die Fremdenlegion strafba. | Deutschland
LOCderiv | The derivation of the location | Die 75 Kampfjets warden im englischen Hatfield hergestellt und 1949 in die Schweiz geflogen. | englischen
LOCpart | The part of the location | Ancient City, die Altstadt, ist das alte Chinesenviertel. | Chinesenviertel
ORG | Organization | Das war ein tolles Ergebnis für Porsche. | Porsche
ORGderiv | The derivation of the organization | Die Republikaner starten bei Portillo de Suano einen Gegenangriff. | Republikaner
ORGpart | The part of the organization | Heute soll Asahari ohne die JI-Struktur auskommen. | JI-Struktur
OTH | Others | Der Haushalt belief sich im vergangenen Jahr auf rund 40.2 Millionen Euro. | Euro
OTHderiv | The derivation of the others | Görts wahre Leidenschaft gilt der klassischen und romantischen Musik. | klassischen; romantischen
OTHpart | The part of the others | 1938 wurde der Rajon aufgelöst und die deutschsprachige Zeitung Rote Fahne verboten. | deutschsprachige
Table 5. Statistics on the number of training, validation, and test sets in three datasets.

Dataset | Train | Dev | Test | Entity Types
CMeEE | 15,000 | 5000 | 3000 | 9
GermEval 2014 | 24,000 | 2200 | 5100 | 12
GENIA | 15,023 | 1669 | 1854 | 5
Table 6. Sample fusion annotation from GENIA dataset.

Sentence | Label 1 | Label 2 | Fused Labeling Layers
Mouse | B-DNA | O | B-DNA
interleukin | I-DNA | B-protein | I-DNA|B-protein
- | I-DNA | I-protein | I-DNA|I-protein
2 | I-DNA | I-protein | I-DNA|I-protein
receptor | I-DNA | O | I-DNA
alpha | I-DNA | O | I-DNA
gene | I-DNA | O | I-DNA
expression | I-DNA | O | I-DNA
Table 7. Uniform label layer example from GENIA dataset.

Sentence | Label 1 | Label 2 | Fused Labeling Layers
Mouse | B-ent | O | B-ent
interleukin | I-ent | B-sub | I-ent|B-sub
- | I-ent | I-sub | I-ent|I-sub
2 | I-ent | I-sub | I-ent|I-sub
receptor | I-ent | O | I-ent
alpha | I-ent | O | I-ent
gene | I-ent | O | I-ent
expression | I-ent | O | I-ent
Table 8. Dataset size statistics after the unified label layer.

Dataset | Train | Dev | Test | Labeling Types
CMeEE | 15,000 | 5000 | - | 6
GermEval 2014 | 24,000 | 2200 | 5100 | 6
GENIA | 15,023 | 1669 | 1854 | 6
Table 9. Dataset size statistics after inserting identifiers.

Dataset | Train | Dev | Test | Labeling Types
CMeEE | 59,577 | 19,534 | - | 9
GermEval 2014 | 30,337 | 2784 | 6437 | 12
GENIA | 45,929 | 4337 | 5474 | 5
Table 10. Comparison results on the CMeEE dataset.

Group | Model | F1 Score
Baseline | BERT-wwm-ext-base [35] | 61.7
Baseline | BERT-base [13] | 62.1
Baseline | MacBERT-large [36] | 62.4
Our model | PNER (BERT-wwm-ext-base) | 62.2
Our model | PNER (BERT-base) | 62.9
Our model | PNER (MacBERT) | 63.4
Table 11. The performance of the model in classifying each entity category on the validation set.

Category | P | R | F1 Score
dis | 83.95 | 93.44 | 88.44
sym | 91.28 | 82.64 | 86.74
pro | 90.52 | 95.07 | 92.74
equ | 93.37 | 65.13 | 76.73
dru | 92.42 | 95.69 | 94.03
ite | 85.96 | 65.94 | 74.63
bod | 95.95 | 94.63 | 95.28
dep | 98.00 | 92.45 | 95.15
mic | 86.81 | 94.69 | 90.58
Table 12. Comparing model accuracy in entity classification with and without identifiers.

Model | Accuracy
With identifier | 90.36%
Without identifier | 88.53%
Table 13. Comparison of results with existing approaches for CMeEE dataset.

Model | P | R | F1 Score
BERT-CRF * | - | - | 64.41
RICON [37] | 66.25 | 64.89 | 65.57
Global Pointer [25] | - | - | 66.54
PNER I (ours) | 66.97 | 63.94 | 65.42
PNER II (ours) | 67.53 | 65.91 | 66.71
Table 14. Comparison results with the baseline model on GENIA dataset.

Model | F1 Score
LSTM-CRF+BERT [9] # | 77.80
seq2seq+BERT+Flair [9] | 78.31
PNER (ours) | 79.21
Table 15. Comparison of results with existing approaches for GENIA dataset.

Model | P | R | F1 Score
Sequence labeling-based [16] | 78.50 | 71.30 | 74.70
Hypergraph-based [38] | 77.00 | 73.30 | 75.10
Seq2seq framework-based [9] | - | - | 78.31
Span-based [24] | 79.45 | 78.94 | 79.19
Seq2seq unified generative framework [39] | 78.87 | 79.60 | 79.23
Span-level graphs [40] | 77.92 | 80.74 | 79.30
PNER (ours) | 79.34 | 79.08 | 79.21
Table 16. Comparison of results with existing approaches for GermEval 2014 dataset.

Model | P | R | F1 Score
Boundary aware-based [41] | 74.50 | 69.10 | 71.70
HIT [42] | 74.80 | 70.50 | 72.60
Bi-LSTM-CRF [43] | - | - | 75.30
Transfer-learning [44] | - | - | 85.29
PolDeepNer2 [45] | 87.86 | 87.53 | 87.69
PNER (ours) | 86.97 | 88.67 | 87.81
Table 17. The execution times of our model on the GENIA dataset.

Model | Time/Epoch
Sequence labeling model | 92 s
Text classification model | 402 s
Overall | 494 s
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, H.; Zhang, Q.; Kwon, H.-C. PNER: Applying the Pipeline Method to Resolve Nested Issues in Named Entity Recognition. Appl. Sci. 2024, 14, 1717. https://doi.org/10.3390/app14051717


