Article

PNER: Applying the Pipeline Method to Resolve Nested Issues in Named Entity Recognition

Center for Artificial Intelligence Research, Pusan National University, Busan 46241, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(5), 1717; https://doi.org/10.3390/app14051717
Submission received: 20 January 2024 / Revised: 14 February 2024 / Accepted: 19 February 2024 / Published: 20 February 2024

Abstract:
Named entity recognition (NER) in natural language processing encompasses three primary entity types: flat, nested, and discontinuous. While flat NER receives the most research attention, nested NER remains a significant challenge. Current approaches to nested NER include sequence labeling with merged label layers, cascaded models, and methods rooted in reading comprehension. Among these, sequence labeling with merged label layers stands out for its simplicity and ease of implementation, yet it suffers from well-documented drawbacks, which we aim to remedy. In this study, we enhance the sequence labeling approach with a pipeline model split into a sequence labeling task and a text classification task. Instead of annotating specific entity categories, we merged all types into main and sub-categories for uniform treatment; these categories were then embedded as identifiers in the recognized text for the text classification task. We adopted BERT+BiLSTM+CRF for sequence labeling and a BERT model for text classification. Experiments were conducted on three nested NER datasets: GENIA, CMeEE, and GermEval 2014, with annotation depths ranging from four levels to two. Before model training, we performed separate statistical analyses of nested entities in the medical dataset CMeEE and the general-domain dataset GermEval 2014. These analyses revealed that a single entity category dominates the nested entities in each dataset, suggesting that labeling primary and subsidiary entities can support effective category recognition. Model performance was evaluated with F1 scores, counting a prediction as correct only when both the complete entity span and its category were identified. The proposed modifications yielded substantial performance gains over the original method, and the improved model is competitive with existing models: F1 scores on the GENIA, CMeEE, and GermEval 2014 datasets reached 79.21, 66.71, and 87.81, respectively. Our results show that, while preserving the original method’s simplicity and ease of implementation, the enhanced model achieves higher performance and strong competitiveness compared with other methodologies.

1. Introduction

In natural language processing (NLP), the identification of specific words or phrases, such as individual names, locations, and organizations, is essential for tasks including machine translation [1], question answering [2,3], entity linking [4], and relation extraction [5]. These terms, known as named entities (NEs), are identified through named entity recognition (NER), a foundational task in NLP.
However, certain texts contain named entities that are nested or overlapping, giving rise to what is known as nested entities or entity nesting [6]. These complexities are prevalent in particular domain-specific datasets. Most researchers concentrate on resolving flat entities [7]; the existence of nested entities in text often goes unaddressed. Refer to Figure 1 for an illustrative sentence extracted from the GENIA dataset [8]: “Mouse interleukin-2 receptor alpha gene expression”. In this sentence, “interleukin-2” is categorized as a “protein” entity type. Notably, “interleukin-2” also exists within the “DNA” entity type “Mouse interleukin-2 receptor alpha gene”. This scenario exemplifies a nested entity, where “interleukin-2” holds multiple entity-type associations within the same text span. Sequence labeling methods are commonly employed to tackle NER tasks. However, these methods confine each token to representing only one entity category, disregarding the possibility of multiple entity categories associated with a single token within nested entities. Addressing this issue, some researchers proposed solutions such as the “merged label layer” [9], a method involving sequence labeling models annotating multiple entity categories on tokens with nesting challenges. Although straightforward, this method leads to label expansion, resulting in model sparsity.
In our study, we address the significant challenges inherent in processing nested entities with sequence labeling methods, a task complicated by the intricate hierarchies and overlapping entities in text data. Traditional sequence labeling approaches primarily focus on identifying single-layer entities, often inadequately handling or entirely overlooking nested structures. Although some researchers have explored solutions to this problem, their methods either deviate from the simplicity of the sequence labeling framework or require the annotation of an extensive number of labels, leading to model sparsity. Our work seeks to navigate these issues, proposing an innovative approach that retains the simplicity of sequence labeling models while effectively managing the complexity of nested entities.
Our research endeavors to optimize this approach by mitigating label proliferation. We devised a pipeline approach to tackle nested entity complexities. This method consists of two stages. Initially, we utilized a sequence labeling model incorporating a merged label layer; however, we streamlined entity categories by introducing a hierarchical entity level. This hierarchical level, labeled as ent, sub1, sub2, and so forth, is determined based on primary and secondary relationships within nested entities, with the outermost entity assigned the highest importance (ent). This hierarchical structure significantly reduces label volume. Subsequently, we employed a classification model to ascertain the entity category.
In a prior state-of-the-art model for entity relationship extraction proposed by Chen et al. [10], a pipeline approach was utilized. They proposed an enhancement involving entity boundaries and types as identifiers around entity spans within the classification model. Building upon this notion, our research incorporates entity levels as identifiers around entity spans within the classification model. We hypothesize that entity levels may exert influence on the entity category.
We conducted experiments on three datasets encompassing nested entity challenges across diverse languages: the Chinese medical dataset CMeEE [11], the English medical dataset GENIA [8], and the German dataset GermEval 2014 [12]. For the sequence labeling task, we employed the BERT-Bi-LSTM-CRF model, a state-of-the-art approach. Leveraging BERT [13] for pre-training, this model constructs word embeddings, then applies a bi-directional long short-term memory (Bi-LSTM) network for feature extraction, capturing both long- and short-range dependencies. Finally, a conditional random field (CRF) layer produces the NER labels, a conventional choice for NER tasks. For the classification task, we used transfer learning, specifically fine-tuning a pre-trained BERT model. Our main contributions are listed as follows:
  • We refined the sequence labeling method to effectively tackle nested entity challenges. Our proposed approach first uses the merged label layer method to identify entity boundaries, then classifies entity categories. This strategy preserves the benefits of the merged label layer method while mitigating the exponential growth in label counts.
  • Introducing an entity hierarchical system allowed us to discern between different entities nested within each other. This hierarchical grading of entities, from outermost to innermost, ensures the precise identification of entity boundaries in instances with nesting complexities.
  • In our approach to entity classification, we were inspired by the pipeline method employed in entity relationship extraction. Our innovation involved introducing entity hierarchies as text identifiers. This decision was rooted in our belief that modifications in entity hierarchies would significantly influence the identification of entity classes.
  • Following experimentation across three diverse datasets, our findings underscore that our methodology not only refines the merged label layer sequence labeling method but also maintains robust competitiveness when compared to existing models.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the three datasets used in our experiments, along with a statistical analysis of each. Section 4 presents the proposed method: Section 4.1 details the labeling method for the fusion labeling layer, Section 4.2 discusses the labeling method that unifies entity categories, and Section 4.3 explains the insertion of identifiers for entity category recognition. Section 5 then presents our experimental results, including comparisons and analyses relative to both the unimproved model and other baseline models. Section 6 concludes the paper.

2. Related Work

Named entities, a concept first introduced by Grishman and Sundheim in 1996 [14], have played a pivotal role in the field of natural language processing. The development and refinement of named entity datasets since then have been instrumental in advancing the domain of flat named entity recognition (NER) [15]. The most widely adopted model in NER has been the sequence labeling model [7], with the LSTM-CRF model standing out as a classic example of this approach [16,17]. Long short-term memory (LSTM) is a specialized form of recurrent neural network (RNN) introduced by Goller and Kuchler in 1996 [18]; it features a unique memory cell designed to retain past information over extended periods. In sequence tagging tasks, a bidirectional LSTM network is used so that both past and future input features inform the prediction [19]. Alex (2007) presented a variety of strategies for integrating multiple conditional random fields (CRFs) [20] to address such tasks [21]. The subsequent introduction of the pre-trained model BERT marked a significant leap forward in NER efficiency and accuracy; through fine-tuning, BERT can be adapted for various tasks, including named entity recognition. In the evaluation task of the CMeEE dataset, various versions of BERT were employed as benchmarks for assessment [22]. Furthermore, labeling schemes such as BIO, BIOES, and BILOU have emerged from these sequence labeling models, further enriching the tools available for NER tasks.
In the realm of biology, researchers have identified a unique challenge with certain entities that demonstrate nesting issues, where one entity encompasses other entities. Tackling these nested entity problems initially posed significant challenges, but over time, various innovative methods have been proposed to address them. A notable solution is the cascading model, pioneered by Ju et al. [16]. This model adopts a hierarchical strategy, building multiple layers of flat NER recognition that systematically identify entities from the innermost layer outward until no further entities are detectable. This approach ensures that information about inner entities is fully utilized by outer entities. However, it has a limitation in the one-way flow of information, as it does not effectively harness information from outer entities. Consequently, when misidentifications occur at the innermost levels, they can result in cascading errors throughout the system.
To overcome this limitation, Luo (2020) implemented neural network techniques to facilitate a mutual interaction between inner and outer entities [23]. This innovation marked a significant advancement in handling nested entities. Additionally, Wang (2020) introduced a pyramid structure to refine and optimize the concept of the hierarchical model [24], further enhancing the effectiveness of nested entity recognition.
Several researchers have innovated named entity recognition (NER) by proposing the multi-head approach. In this method, token pairs, comprising head and tail tokens, define a span. A multi-head matrix is then constructed using various techniques like dot-product scoring [25], additive methods [26], and a blend of multiplicative and additive approaches [27]. This approach also integrates the concept of Biaffine [28] into the field.
Another significant advancement is the span-based method. This technique identifies all entities within a sequence by considering every possible span of characters in the text, employing a neural network-based classifier for this purpose [29,30]. An alternative perspective treats NER as a reading comprehension task [31]. This involves incorporating entity category information into the input, presenting a unified framework suitable for a range of information extraction tasks [32].
Finally, traditional sequence labeling methods struggle with nested entities. To address this, the “merged label layer” approach has been proposed, which consolidates multiple labels into a single layer. This enables nested entities to be treated as flat entities, making them recognizable using sequence labeling models.
The pipeline method, used in information extraction tasks, has received mixed reviews from researchers. Some, like Chen (2020), have successfully applied this method to entity relationship extraction, achieving notable results [10]. A key improvement in this approach is the incorporation of entity boundaries and types as identifiers before and after entity spans, proving that, when applied appropriately, the pipeline model can outperform joint models.
In summary, various methods have been proposed to tackle the nesting problem in NER, each achieving significant results. However, these methods often face challenges related to complexity or implementation issues. In response, we have improved the simple and easily implementable sequence labeling method with a merged label layer, transforming it into a pipeline approach.

3. Datasets

In the task of named entity recognition, open-source datasets containing nested entities are very scarce. Therefore, in this experiment, we utilized datasets in three different languages to demonstrate the feasibility of the model in nested named entity recognition tasks across various languages. The open-source datasets for these three languages are: the Chinese medical dataset CMeEE, the English medical dataset GENIA, and the German daily news dataset GermEval 2014. All three datasets were divided into training sets (Train), validation sets (Dev), and test sets (Test), and each dataset included nested entities. Detailed descriptions of the datasets are provided below.

3.1. CMeEE Dataset

The CMeEE dataset, tailored specifically for the Chinese medical domain, is a rich and comprehensive resource for researchers and practitioners. This dataset, encompassing a training set, validation set, and test set, was meticulously sourced from the AliTianchi platform and is presented in a JSON format for ease of use. The training and validation sets are particularly robust, featuring not only sentences but also detailed annotations. These include entities, the start index and end index of each entity, and entity categories.
The test set, while providing only sentences, plays a crucial role in evaluating the effectiveness of models trained on this dataset. Once a model is trained, its performance can be rigorously tested using this set, and the results can be submitted to the Ali platform for a comprehensive evaluation. This process ensures the robustness and applicability of models in real-world medical scenarios.
In our thorough analysis, we delved into the intricate details of the main entities and nested entities (sub-entities) present in the dataset. The findings of this detailed statistical examination are depicted in the table below.
Table 1 reveals insightful data about the CMeEE dataset. It includes the counts of each category label as main entities and sub-entities. The items in the table are described in Table 2. It shows that the dataset is composed of a total of 79,111 entities. A closer look at the distribution of these entities across different categories reveals a significant imbalance. When considering the entities as main categories, the categories “bod” (body parts) and “dis” (diseases) dominate in terms of occurrences, with the “sym” (symptoms) category following closely. Interestingly, the scenario shifts when we analyze the entities as sub-entities. Here, the “bod” category overwhelmingly constitutes about 84.7% of the total entities in both the training and test sets, underscoring its predominant role in the dataset’s structure. The “ite” category, although smaller in quantity, stands out due to its proportional significance. Despite its lower overall count, it ranks second in frequency of occurrence as a sub-entity, highlighting its importance in the context of nested entities. This pattern suggests a characteristic feature of the Chinese medical dataset: if an entity contains nested entities, there is a high probability that these nested entities belong to the “bod” category.
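The sub-entity proportions discussed above can be computed directly from span annotations. The sketch below is a framework-free illustration on toy data; the function name and the toy spans are our own and are not drawn from the CMeEE release.

```python
from collections import Counter

def nested_category_shares(entities):
    """Given (start, end, category) spans for a dataset, count how often each
    category occurs as a sub-entity, i.e. a span strictly contained inside
    another annotated span, and return each category's share of that total."""
    sub_counts = Counter()
    for i, (s1, e1, c1) in enumerate(entities):
        for j, (s2, e2, c2) in enumerate(entities):
            if i != j and s2 <= s1 and e1 <= e2 and (s1, e1) != (s2, e2):
                sub_counts[c1] += 1
                break  # count each entity at most once as a sub-entity
    total = sum(sub_counts.values())
    return {c: n / total for c, n in sub_counts.items()} if total else {}

# Toy CMeEE-style annotations: "bod" spans nested inside "dis"/"sym" spans.
spans = [(0, 10, "dis"), (0, 4, "bod"), (12, 25, "sym"), (12, 16, "bod"), (30, 35, "dis")]
print(nested_category_shares(spans))  # {'bod': 1.0} — all sub-entities are "bod"
```

Run over the full CMeEE annotations, this kind of tally yields the roughly 84.7% share of the “bod” category reported above.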

3.2. GermEval 2014

The GermEval 2014 dataset stands out as a comprehensive German named entity recognition resource, notable for its inclusion of nested entities. This dataset, meticulously compiled, derives its content from German news corpora and Wikipedia, ensuring a rich and varied linguistic representation. The dataset’s annotation rigorously adheres to the labeling scheme developed by D. Benikova et al. [12], employing the “BIO” (beginning, inside, outside) tagging method for precision and clarity in entity identification.
Mirroring the structure commonly found in domain-specific datasets, the GermEval 2014 dataset is methodically segmented into training, validation, and test sets. This segmentation facilitates a systematic approach to model training and evaluation, ensuring that the models developed are robust and effective across different data splits.
A distinct feature of the GermEval 2014 dataset is its composition of predominantly everyday language corpora, as opposed to the specialized language found in medical datasets. This characteristic necessitated an in-depth statistical analysis of the main entities and sub-entities present within the dataset, with the findings detailed below.
Table 3 provides the statistics of the GermEval 2014 dataset. It includes the counts of each category label as main entities and sub-entities. The items that appear in the table are described in detail in Table 4. It highlights that the dataset encompasses a total of 39,558 entities. An analysis of the distribution of these entities across various categories reveals an imbalance. However, an interesting observation emerges when the “deriv” (derivatives) and “part” (parts) categories are amalgamated with broader categories. When examining the dataset in terms of sub-entities, a significant pattern is observed. Entities categorized as “LOC” (locations) represent approximately 70.7% of the total entities across the training, validation, and test sets. This prevalence suggests a notable tendency within the German language corpus dataset: entities that act as nested entities are highly likely to fall under the “LOC” category. Such insights not only enhance our understanding of the dataset’s structure but also inform strategies for effective entity recognition and extraction in the German language context.

3.3. GENIA

The GENIA dataset, a specialized resource tailored for the English medical domain, represents a pivotal tool in advancing natural language processing within this field. Our pre-processing strategy for this dataset is grounded in the methodologies proposed by Finkel et al. [33] and Lu et al. [34]. By closely adhering to these established approaches, we ensure that our data handling and preparation processes align with the best practices in the field. This dataset is meticulously partitioned into training, validation, and test sets, adhering to a carefully considered ratio of 8.1:0.9:1. This allocation ensures a comprehensive and balanced distribution of data, facilitating effective training and rigorous evaluation of models.
Detailed information about the training, validation, and test sets for the three datasets is presented in Table 5. It includes the number of sentences in three types of datasets: the training set (Train), the development set (Dev), and the test set (Test), as well as the number of entity categories in each dataset.

4. Method

In our paper, we introduce a novel pipeline approach specifically designed to tackle the complex challenge of nested structures in named entity recognition. The flowchart of our model is depicted in Figure 2, where the sequence labeling model can be chosen as BERT-CRF or BERT-Bi-LSTM-CRF, with the model frameworks shown in Figure 3. In our subsequent experiments, we constructed the PNER I and PNER II models for them, respectively. For the flowchart, initially, the model adopts a sequential labeling technique to accurately identify entity boundaries. To address the complexities of nested entities, we integrate a merging label layer into the sequence labeling framework. To simplify the labeling process, which often involves a daunting variety of categories, we adopt a unified approach to entity types. Furthermore, we implement hierarchical annotations while maintaining a manageable number of labels, thereby reducing the complexity of entity categorization.
In the subsequent phase of our model development, we introduce a text classification system designed to categorize entity types using an advanced multi-class classification strategy. Our method involves the strategic use of hierarchical entity categories as markers within the text, along with precise annotations of the locations of these entities. Importantly, our approach pays special attention to nested entities, marking not only the primary entities but also the nested ones. This detailed and inclusive annotation strategy greatly improves the model’s capacity to recognize and understand the relationships between various entities.
The subsequent sections of the paper provide an in-depth and detailed discussion of each component of our methodology. This comprehensive explanation is intended to offer a clear and nuanced insight into the sophisticated processes and techniques we have employed in developing our approach.

4.1. Fusion Labeling Layer

In our study, we adopted a methodology inspired by prior research, focusing on organizing nested entity structures in the “BIO” format. To illustrate, take the example of the sentence “Mouse interleukin-2 receptor alpha gene expression”. In this case, the “BIO” labeling system provides a structured approach: “B-” indicates the beginning of an entity, “I-” marks the intermediate part of an entity, and “O” signifies segments without any entity association.
Our technique for integrating label layers follows a specific hierarchy, prioritizing longer entities over shorter ones. This hierarchy ranges from the primary entity to nested entities, arranged from left to right. We use a “|” separator to clearly differentiate the labels at each level, thus enabling precise delineation of entity hierarchies. Table 6 in our paper provides detailed insights into this process, with the sample sentence from the GENIA dataset along with nested-level annotation (label1 and label2) and fusion labeling layers. This method was first proposed by Straková et al. [9]. The pseudo-code for this method is shown in Algorithm 1.
A key feature of our labeling approach is how we handle sub-level label segments that align with non-entity sections. In the final annotation phase, these segments are hidden instead of being marked as “O”. This approach results in a more streamlined and accurate labeling scheme. Our systematic method simplifies the complexity involved in understanding and recognizing nested entities within texts, presenting an organized framework for effective entity recognition and hierarchy delineation.
Algorithm 1 Fusion labeling layer
Input: label1, label2
Output: result
1: if label2 is empty then
2:     result ← label1
3: else
4:     result ← label1 + “|” + label2
5: end if
6: return result
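Algorithm 1 can be transliterated into a few lines of Python. The token alignment and label names below are illustrative, following the GENIA example sentence:

```python
def fuse_labels(label1, label2):
    """Algorithm 1: merge a primary-layer label with a nested-layer label,
    joining them with '|' when the nested layer carries an entity tag."""
    if not label2:  # sub-level segment aligned with a non-entity section
        return label1
    return label1 + "|" + label2

# Per-token labels for "Mouse interleukin-2 receptor alpha gene expression":
# layer 1 annotates the DNA entity, layer 2 the nested protein entity.
layer1 = ["B-DNA", "I-DNA",     "I-DNA", "I-DNA", "I-DNA", "O"]
layer2 = ["",      "B-protein", "",      "",      "",      ""]
fused = [fuse_labels(a, b) for a, b in zip(layer1, layer2)]
print(fused)  # ['B-DNA', 'I-DNA|B-protein', 'I-DNA', 'I-DNA', 'I-DNA', 'O']
```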

4.2. Unified Entity Categories

The merging label layer technique in the sequence labeling model significantly increases the number of labels. To address this issue, we further refined our method by consolidating entity categories; the improved labeling scheme is illustrated in Table 7 below. While we continue to use the unchanged “BIO” labeling rule, our modified merging label layer method applies a coarser categorization of entity types: regardless of its specific category, the longer entity is uniformly labeled “-ent”, and, in the same vein, shorter (nested) entities are uniformly labeled “-sub”. This system yields a streamlined set of six possible labels in cases with only a single nesting layer: “B-ent”, “I-ent”, “B-ent|B-sub”, “I-ent|B-sub”, “I-ent|I-sub”, and “O”. It is important to note that, due to our labeling conventions, the label “B-ent|I-sub” cannot occur. After pre-processing with the unified label layer, the resulting dataset sizes are shown in Table 8. The pseudocode for the unified entity categories is shown in Algorithm 2.
During the decoding phase, we use a systematic method to unravel arrays marked by “|” as the delimiter. We then meticulously merge these arrays in sequential order, starting with “B-” and ending when an “O” is encountered. This precise process allows for the complete construction of an entity. These refinements make the labeling process more streamlined, enabling a more effective and organized approach to entity recognition. They also maintain clarity and accuracy in decoding complex nested structures within text data.
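The decoding step described above can be sketched as follows. This is our own illustrative decoder, assuming well-formed fused tags; the paper's exact implementation may differ in detail.

```python
def decode_fused(labels):
    """Decode fused unified tags (e.g. 'I-ent|B-sub') into token spans.
    Splitting each tag on '|' gives one sub-tag per nesting level: level 0 is
    the main entity ('ent'), level 1 the nested one ('sub'). Returns
    (start, end, level) triples with end exclusive."""
    spans, open_spans = [], {}  # open_spans: nesting level -> start index
    for i, tag in enumerate(labels):
        parts = tag.split("|")
        # close every open span whose level does not continue with an 'I-' tag
        for level in list(open_spans):
            if not (level < len(parts) and parts[level].startswith("I-")):
                spans.append((open_spans.pop(level), i, level))
        # open a new span wherever a 'B-' tag appears
        for level, part in enumerate(parts):
            if part.startswith("B-"):
                open_spans[level] = i
    for level, start in open_spans.items():  # close spans reaching the end
        spans.append((start, len(labels), level))
    return sorted(spans)

tags = ["B-ent", "I-ent|B-sub", "I-ent|I-sub", "I-ent", "O"]
print(decode_fused(tags))  # [(0, 4, 0), (1, 3, 1)]
```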
Algorithm 2 Unified entity categories
Input: label1, label2
Output: result
1: Begin
2: if label1 starts with “B-” then
3:     result ← “B-ent”
4: else if label1 starts with “I-” then
5:     result ← “I-ent”
6: else
7:     result ← label1
8: end if
9: if label2 is not empty then
10:     result ← result + “|”
11:     if label2 starts with “B-” then
12:         result ← result + “B-sub”
13:     else if label2 starts with “I-” then
14:         result ← result + “I-sub”
15:     else
16:         result ← result + label2
17:     end if
18: end if
19: Return result
20: End
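A direct Python transliteration of Algorithm 2, applied to per-token label pairs from the GENIA example sentence (label names illustrative):

```python
def unify(label1, label2=""):
    """Algorithm 2: collapse category-specific BIO tags into the unified
    'ent'/'sub' scheme, fusing the two layers with '|'."""
    if label1.startswith("B-"):
        result = "B-ent"
    elif label1.startswith("I-"):
        result = "I-ent"
    else:
        result = label1  # e.g. 'O' passes through unchanged
    if label2:
        result += "|"
        if label2.startswith("B-"):
            result += "B-sub"
        elif label2.startswith("I-"):
            result += "I-sub"
        else:
            result += label2
    return result

pairs = [("B-DNA", ""), ("I-DNA", "B-protein"), ("I-DNA", ""), ("O", "")]
print([unify(a, b) for a, b in pairs])  # ['B-ent', 'I-ent|B-sub', 'I-ent', 'O']
```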

4.3. Entity Category Classification

Given our approach of unifying entity categories in the initial sequence annotation process, we have encountered limitations in achieving more precise entity classifications. As a result, incorporating an additional text classification model is crucial for effectively distinguishing between different entity categories. Figure 4 showcases the use of identifiers for text annotation within the entity category classification phase, employing example sentences from the GENIA dataset for illustration. The text includes category tags representing distinct entity types, along with position markers that indicate the boundaries of the entities, all integrated within sentences. Our category tags are divided into two main classifications: “main entity” (ent) and “nested entity” (sub). The position markers, like <ent> and </ent> for starting and ending recognized main entities and <sub> and </sub> for marking the beginning and end of recognized nested entities, act as our guide for annotations. For the annotation protocol of the classification model, the following applies:
  • We begin by injecting the category tag (“ent” or “sub”) at the start of each sentence.
  • When the main entity category is identified within a sentence, we strategically place markers to denote the beginning and end of this main entity.
  • To identify nested entity categories, our initial step involves inserting markers that indicate the start and finish of the main entity enclosing the nested one. Following this, we insert additional markers to highlight the start and end positions of the nested entity itself. It is important to note that the markers for the main entity take priority in this process.
Let Xmain denote this modified main entity sequence with text markers inserted, and Xsub denote this modified sub-entity sequence with text markers inserted:
X_main = [main entity], …, <ent>, x_START(main), …, x_END(main), </ent>, …
X_sub = [sub-entity], …, <ent>, x_START(main), …, <sub>, x_START(sub), …, x_END(sub), </sub>, …, x_END(main), </ent>, …
Each entity requires careful text annotation, not just during the training phase but also in subsequent prediction tasks. This detailed annotation is vital, as it forms the cornerstone of improving our model’s ability to understand and categorize entities within text data. After the pre-processing of inserting identifiers, the statistical results of the dataset size are as shown in Table 9.
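The identifier-insertion protocol above can be sketched as a small helper. The function name and the (start, end) token-index convention are our own assumptions; the sketch does not handle entities whose marker positions coincide.

```python
def insert_markers(tokens, main_span, sub_span=None):
    """Build a classifier input: prepend the category tag ('ent' for a main
    entity, 'sub' for a nested one), then wrap the main entity in
    <ent>...</ent> and, when the target is nested, its span in <sub>...</sub>.
    Spans are (start, end) token indices with end exclusive."""
    tag = "sub" if sub_span else "ent"
    out = [tag] + list(tokens)  # +1 offset from the prepended category tag
    inserts = [(main_span[0] + 1, "<ent>"), (main_span[1] + 1, "</ent>")]
    if sub_span:
        inserts += [(sub_span[0] + 1, "<sub>"), (sub_span[1] + 1, "</sub>")]
    # insert from right to left so earlier offsets remain valid
    for pos, marker in sorted(inserts, key=lambda x: -x[0]):
        out.insert(pos, marker)
    return " ".join(out)

tokens = "Mouse interleukin-2 receptor alpha gene expression".split()
print(insert_markers(tokens, (0, 5)))
# ent <ent> Mouse interleukin-2 receptor alpha gene </ent> expression
print(insert_markers(tokens, (0, 5), (1, 2)))
# sub <ent> Mouse <sub> interleukin-2 </sub> receptor alpha gene </ent> expression
```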

5. Results and Discussion

Our experimental setup was conducted on a platform equipped with Intel i7-12700 and NVIDIA GeForce RTX 3050 Ti. For the development and execution of our models, we employed the PaddlePaddle 2.1 framework, which provides a comprehensive and efficient environment for deep learning applications. Meanwhile, in our experiment, we adopted a highly rigorous evaluation approach. An entity match was considered successful only if it accurately matched the start and end indices of the entity and correctly classified its category. We used the micro-F1 score to evaluate our model’s performance.
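The strict matching criterion can be made concrete with a minimal micro-F1 sketch over (start, end, category) triples; in practice the true-positive, prediction, and gold counts would be aggregated over the whole test set before computing precision and recall. The data below are toy values, not results from the paper.

```python
def micro_f1(gold, pred):
    """Strict span-level micro-F1: a predicted entity counts as correct only
    if its (start, end, category) triple exactly matches a gold entity."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 5, "DNA"), (1, 2, "protein"), (7, 9, "RNA")]
pred = [(0, 5, "DNA"), (1, 2, "DNA"), (7, 9, "RNA")]  # one category error
print(round(micro_f1(gold, pred), 3))  # tp=2, P=R=2/3, F1 ≈ 0.667
```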
Initially, on the CMeEE dataset, we assessed our model’s performance using various pre-trained BERT models and compared its effectiveness in recognizing entities against traditional models. We then explored the impact of incorporating identifiers in the classification of entity categories and examined how this influenced the outcomes.
Furthermore, we conducted a comparison involving two approaches against other baseline models. The first method used a pre-trained BERT model as the encoding layer, complemented by an additional conditional random field (CRF) layer. The second approach integrated contextual word embeddings from the pre-trained BERT model, coupled with a bi-directional LSTM for encoding and a CRF for decoding.
After completing our experiments on the CMeEE dataset, we extended our analysis to compare the most effective methods with other baseline models on datasets in two additional languages: GermEval 2014 and GENIA.

5.1. Discussion of Results on the CMeEE Dataset

Firstly, we referred to the performance of three pre-trained models on the CMeEE dataset provided by Zhang et al. in CBLUE. These three pre-trained models are as follows:
  • BERT-wwm-ext-base [35]. This pre-trained model adopts the whole-word masking approach, masking the entire word as the smallest unit, and is trained on a Chinese language corpus.
  • BERT-base [13]. This is the basic BERT model, comprising 110 million parameters, and is trained on a Chinese language corpus.
  • MacBERT-large [36]. This is an enhanced BERT model that employs the masked language model (MLM) as a pre-training task.
In our PNER model, specifically for the entity boundary recognition task, we utilized these three pre-trained models and benchmarked the results against those reported by Zhang et al. [22]. The outcomes are detailed in Table 10 below; the same pre-trained models and parameters were used in each comparison, and the baseline figures were taken from CBLUE [22]. Across the board, every enhanced version of our model outperformed its baseline counterpart. The improvement was largest with the MacBERT pre-trained model, which also achieved the highest overall score. This indicates that while entity boundary recognition benefits from our approach, the gain is even more pronounced for entity category recognition, yielding a more substantial overall improvement in model performance.
During the classification task of our study, we introduced identifiers in the pre-processing stage of the dataset to indicate whether the entity being classified was a main entity or a sub-entity. Additionally, we annotated the dataset with position index identifiers for both the main and sub-entities. For training, we employed the BERT-base pre-trained model on the pre-processed CMeEE training set, and we carried out testing on the validation set. Alongside this, we monitored the classification performance across the various entity categories; the per-category results are presented in Table 11. Since it was unclear whether the inclusion of identifiers would affect the classification model’s performance, we conducted an experiment to determine this.
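The identifier-insertion pre-processing can be sketched as below. The marker strings and their exact format are our own illustration; the paper's pre-processing encodes the same information (the span's role as main or sub-entity plus its start and end indices) directly in the text.

```python
def insert_identifiers(text, span, role):
    """Wrap the target entity span in role-specific marker tokens so the
    classifier sees both its position and whether it is a main or sub entity.
    `span` holds character offsets (start, end), end exclusive; `role` is
    'main' or 'sub'. Marker syntax is illustrative only."""
    start, end = span
    open_marker = f"<{role}:{start}>"
    close_marker = f"</{role}:{end}>"
    return text[:start] + open_marker + text[start:end] + close_marker + text[end:]
```

Each entity found in the boundary recognition stage yields one marked copy of the sentence, which is then fed to the BERT classifier to assign the original fine-grained category.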
Our primary objective in the classification task was to achieve higher accuracy. To this end, we conducted experiments to compare scenarios with and without the inclusion of identifiers. We evaluated the accuracy of both approaches, and the comparative findings are depicted in Table 12. The results clearly demonstrate that including identifiers significantly enhances the model’s classification precision. This underscores the vital role these identifiers play in effectively classifying different entity categories.
We proceeded to conduct entity boundary recognition tasks using two distinct models: BERT-CRF and BERT-Bi-LSTM-CRF. Upon completing these tasks, we introduced identifiers for entity category classification and proceeded to classify the entities accordingly. We designated these two models as PNER I and PNER II, respectively. To evaluate their effectiveness, we compared their performance on the CMeEE dataset against three benchmark models.
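The two-stage design shared by PNER I and PNER II can be summarized as a simple pipeline: a boundary model proposes (start, end, main/sub) spans, and a classifier assigns a category to each marked span. The sketch below uses placeholder callables rather than the paper's PaddlePaddle models, and the marker format is our own assumption.

```python
def pner_pipeline(text, boundary_model, classify):
    """Pipeline sketch. boundary_model(text) yields (start, end, role) spans,
    where role is 'main' or 'sub'; classify(marked_text) returns the entity
    category for the span wrapped in identifier tokens."""
    results = []
    for start, end, role in boundary_model(text):
        # Mark the target span so the classifier knows its position and role.
        marked = (text[:start] + f"<{role}:{start}>" + text[start:end]
                  + f"</{role}:{end}>" + text[end:])
        results.append((start, end, classify(marked)))
    return results
```

In practice, `boundary_model` stands in for the BERT-CRF or BERT-Bi-LSTM-CRF sequence labeler and `classify` for the BERT text classifier; errors in the first stage propagate to the second, which the strict evaluation criterion accounts for.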
The traditional BERT-CRF model employs the BERT pre-trained model [13] for encoding and a CRF [20] layer for decoding. The RICON model [37] is a span-based named entity recognition model that enhances span representation by integrating regularized information about entities. The Global Pointer model [25] approaches entity recognition by considering the starting and ending positions of entities from a global perspective.
The results of these comparisons are detailed in Table 13, where “*” denotes our implementation via their code. From the findings, we observed that our enhanced PNER I model, based on BERT-CRF, shows some improvement. Moreover, PNER II, which incorporates LSTM in the encoding layer, performs even better, slightly surpassing the other models in effectiveness.

5.2. Discussion of Results on Other Datasets

When we tested our model on the GENIA dataset, we applied the PNER II framework and initially compared it with the model introduced by Straková et al. [9]. In their research, they presented two named entity recognition methods. The first method employs an LSTM-CRF neural model framework, which was designed to tackle nested entities through the integration of label layers. Our model builds upon this approach, capitalizing on its simplicity and efficiency for improvement. The second method they proposed uses a sequence-to-sequence (seq2seq) approach, and both methods have shown commendable performance. We compared our model against both of these methods, as detailed in Table 14.
The comparison results revealed that, using the same LSTM-CRF framework, our model achieved significant performance improvements. These improvements are attributed to optimizations in our annotation method during sequence labeling and to enhanced accuracy in identifying entity categories. This underscores that effectively containing the growth of named entity category labels in nested entities can notably boost model performance, while retaining the simplicity and efficiency of the original method. Furthermore, our model shows notable improvements over the second, seq2seq-based framework proposed by Straková et al. [9].
The comparison results in Table 14 indicate that our enhanced model achieved significant improvements while maintaining the original framework; “#” denotes the baseline model before improvement. We then compared our model with various benchmark models on the GENIA dataset. These benchmarks include the sequence labeling method of Ju et al. [16], which recognizes nested entities via stacked flat NER layers; the hypergraph-based approach of Wang and Lu [38]; the seq2seq-based methods of Straková et al. [9] and Yan et al. [39]; the span-based method of Wang et al. [24]; and the span-level graphs introduced by Wan et al. [40]. The results of these comparisons are presented in Table 15. Although our model does not outperform all benchmark models on the GENIA dataset, it remains competitive in performance while being notably simple and easy to implement.
In Table 16, we compare our model against benchmarks on the GermEval 2014 dataset, a German language dataset. For this experiment, we utilized the multilingual version of BERT-base to obtain contextual word vectors. The results indicate that our model performs better on the GermEval 2014 dataset compared to other models.
Table 17 displays the execution times of the model on the GENIA dataset, measured per epoch. Because the pipeline method requires running two models in sequence, it takes longer to execute than traditional sequence labeling models. However, given the inherently simple framework of each stage, we believe this method does not take significantly longer than other, more complex models.
It is important to highlight that, while addressing nested named entities often leads to more complex models, the sequence labeling method remains the most traditional and straightforward approach in named entity recognition. This method requires only simple annotations for identifying entities. Our proposed method for fusing label layers represents a straightforward enhancement to this labeling technique, facilitating the recognition of nested named entities. This enhancement optimizes the model through the unification of label layers, maintaining its simplicity and effectiveness. By incorporating identifiers, we ensure the precision of entity category recognition. Our model stands out for its strong competitiveness compared to other more complex models.
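Decoding the fused, unified label layers back into nested spans is itself straightforward, which is part of what keeps the method simple. The sketch below assumes, following the example in Table 7, that per-token labels are joined with "|" and that the first field is the main-entity (outer) tag and the second the sub-entity (inner) tag.

```python
def decode_fused(labels):
    """Decode a fused BIO label sequence (e.g. 'I-ent|B-sub') into spans.
    Returns {layer: [(start, end, type), ...]}, end exclusive; layer 0 holds
    the main (outer) entities and layer 1 the nested sub-entities."""
    layers = {}
    for i, label in enumerate(labels):
        for level, tag in enumerate(label.split("|")):
            if tag == "O":
                continue
            prefix, _, etype = tag.partition("-")
            if prefix == "B":
                layers.setdefault(level, []).append([i, i + 1, etype])
            elif prefix == "I" and layers.get(level):
                # Extend the most recent span on this layer (a real decoder
                # would also check that the types match).
                layers[level][-1][1] = i + 1
    return {lv: [tuple(s) for s in spans] for lv, spans in layers.items()}
```

For the Table 7 sentence, this recovers the full main entity "Mouse interleukin-2 receptor alpha gene expression" (layer 0) and the nested sub-entity "interleukin-2" (layer 1).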

6. Conclusions

Our model builds upon the fused labeling approach pioneered by Straková et al. [9]. To tackle the rapid growth in the number of entity category labels, we implemented a pipeline approach that first recognizes entity boundaries and then classifies entity categories, while largely avoiding the error propagation problem typical of pipeline methods. This feasibility stems from our rigorous evaluation criteria, which require the exact identification of both entity boundaries and categories for a match to be deemed successful. In the entity boundary recognition task, we use a unified entity category scheme, labeling entities as either main entities or sub-entities. This unification substantially decreases the number of required labels, thereby improving the model’s accuracy in identifying entity boundaries. Distinguishing main from sub-entities is crucial: with only a single category labeled “entity”, label fusion would make it difficult to separate the boundaries of individual entities. For the entity category identification task, we adopted a methodology inspired by Chen et al.’s work in relation extraction, introducing identifiers (including start and end indices) in the text to distinguish main entities from sub-entities. To validate this method, we first analyzed the distribution of main and sub-entities across datasets from two different domains and observed a consistent trend: in each dataset, one category invariably accounts for the majority of sub-entities. We then ran tests before and after adding the identifiers; the results confirm that including identifiers improves the recognition of entity categories. Tested on named entity recognition datasets in three languages, our model demonstrates excellent performance.
Additionally, it retains the simplicity and efficiency of the traditional sequence labeling method while remaining highly competitive with other benchmark models. However, the execution times show that the PNER model, because of its pipeline design, inevitably runs two models, resulting in a considerably longer processing time than traditional sequence labeling models; the method also demands more time for data annotation. Furthermore, our current method does not address discontinuous entities, a notable aspect of named entity recognition. In the future, we aim to extend our method to discontinuous entities, thereby providing a straightforward, comprehensive solution to all types of named entity recognition challenges.

Author Contributions

Conceptualization, H.Y.; methodology, H.Y.; software, Q.Z.; validation, H.Y. and Q.Z.; investigation, H.Y.; resources, H.Y.; data curation, H.Y. and Q.Z.; writing—original draft preparation, H.Y.; writing—review and editing, H.Y. and H.-C.K.; visualization, H.Y.; supervision, H.-C.K.; funding acquisition, H.-C.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development grant (IITP-2024-RS-2023-00254177) funded by the Korea government (MSIT).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that are used in this research work are openly available at the following links: 1. CMeEE dataset: https://tianchi.aliyun.com/dataset/95414 (accessed on 29 June 2023); 2. GermEval 2014 dataset: https://sites.google.com/site/germeval2014ner/data (accessed on 17 July 2023); 3. GENIA dataset: http://www.geniaproject.org/genia-corpus (accessed on 17 July 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Babych, B.; Hartley, A. Improving Machine Translation Quality with Automatic Named Entity Recognition. In Proceedings of the 7th International EAMT Workshop on MT and Other Language Technology Tools, Improving MT through Other Language Technology Tools, Resource and Tools for Building MT, Budapest, Hungary, 13 April 2003. [Google Scholar]
  2. Mollá, D.; van Zaanen, M.; Smith, D. Named Entity Recognition for Question Answering. In Proceedings of the Australasian Language Technology Workshop, Sydney, Australia, 30 November–1 December 2006; Cavedon, L., Zukerman, I., Eds.; pp. 51–58. [Google Scholar]
  3. Alzubi, J.A.; Jain, R.; Singh, A.; Parwekar, P.; Gupta, M. COBERT: COVID-19 Question Answering System Using BERT. Arab. J. Sci. Eng. 2023, 48, 11003–11013. [Google Scholar] [CrossRef]
  4. Le, P.; Titov, I. Improving Entity Linking by Modeling Latent Relations between Mentions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; Association for Computational Linguistics: Melbourne, Australia, 2018; pp. 1595–1604. [Google Scholar]
  5. Wei, Z.; Su, J.; Wang, Y.; Tian, Y.; Chang, Y. A Novel Cascade Binary Tagging Framework for Relational Triple Extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1476–1488. [Google Scholar]
  6. Shen, Y.; Ma, X.; Tan, Z.; Zhang, S.; Wang, W.; Lu, W. Locate and Label: A Two-Stage Identifier for Nested Named Entity Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2782–2794. [Google Scholar]
  7. Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; Association for Computational Linguistics: San Diego, CA, USA, 2016; pp. 260–270. [Google Scholar]
  8. Kim, J.-D.; Ohta, T.; Tateisi, Y.; Tsujii, J. GENIA Corpus—A Semantically Annotated Corpus for Bio-Textmining. Bioinformatics 2003, 19 (Suppl. 1), i180–i182. [Google Scholar] [CrossRef]
  9. Straková, J.; Straka, M.; Hajic, J. Neural Architectures for Nested NER through Linearization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Florence, Italy, 2019; pp. 5326–5331. [Google Scholar]
  10. Zhong, Z.; Chen, D. A Frustratingly Easy Approach for Entity and Relation Extraction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 50–61. [Google Scholar]
  11. Hongying, Z.; Wenxin, L.; Kunli, Z.; Yajuan, Y.; Baobao, C.; Zhifang, S. Building a Pediatric Medical Corpus: Word Segmentation and Named Entity Annotation. In Proceedings of the Chinese Lexical Semantics, Nanjing, China, 15 May 2021; Liu, M., Kit, C., Su, Q., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 652–664. [Google Scholar]
  12. Benikova, D.; Biemann, C.; Reznicek, M. NoSta-D Named Entity Annotation for German: Guidelines and Dataset. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 2524–2531. [Google Scholar]
  13. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar]
  14. Grishman, R.; Sundheim, B. Message Understanding Conference-6: A Brief History. In Proceedings of the COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics, Copenhagen, Denmark, 5–9 August 1996. [Google Scholar]
  15. Sang, E.F.T.K.; De Meulder, F. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. arXiv 2003, arXiv:cs/0306050. [Google Scholar]
  16. Ju, M.; Miwa, M.; Ananiadou, S. A Neural Layered Model for Nested Named Entity Recognition. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; Walker, M., Ji, H., Stent, A., Eds.; Association for Computational Linguistics: New Orleans, LA, USA, 2018; pp. 1446–1459. [Google Scholar]
  17. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv 2015, arXiv:1508.01991. [Google Scholar]
  18. Goller, C.; Kuchler, A. Learning Task-Dependent Distributed Representations by Backpropagation through Structure. In Proceedings of the International Conference on Neural Networks (ICNN’96), Washington, DC, USA, 3–6 June 1996; Volume 1, pp. 347–352. [Google Scholar]
  19. Graves, A.; Mohamed, A.; Hinton, G. Speech Recognition with Deep Recurrent Neural Networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649. [Google Scholar]
  20. Lafferty, J.; McCallum, A.; Pereira, F.C.N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, San Francisco, CA, USA, 28 June–1 July 2001. [Google Scholar]
  21. Alex, B.; Haddow, B.; Grover, C. Recognising Nested Named Entities in Biomedical Text. In Proceedings of the Biological, Translational, and Clinical Language Processing, Prague, Czech Republic, 29 June 2007; pp. 65–72. [Google Scholar]
  22. Zhang, N.; Chen, M.; Bi, Z.; Liang, X.; Li, L.; Shang, X.; Yin, K.; Tan, C.; Xu, J.; Huang, F.; et al. CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark. arXiv 2022, arXiv:2106.08087. [Google Scholar]
  23. Luo, Y.; Zhao, H. Bipartite Flat-Graph Network for Nested Named Entity Recognition. arXiv 2020, arXiv:2005.00436. [Google Scholar]
  24. Wang, J.; Shou, L.; Chen, K.; Chen, G. Pyramid: A Layered Model for Nested Named Entity Recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 5918–5928. [Google Scholar]
  25. Su, J.; Murtadha, A.; Pan, S.; Hou, J.; Sun, J.; Huang, W.; Wen, B.; Liu, Y. Global Pointer: Novel Efficient Span-Based Approach for Named Entity Recognition. arXiv 2022, arXiv:2208.03054. [Google Scholar]
  26. Bekoulis, G.; Deleu, J.; Demeester, T.; Develder, C. Joint Entity Recognition and Relation Extraction as a Multi-Head Selection Problem. Expert Syst. Appl. 2018, 114, 34–45. [Google Scholar] [CrossRef]
  27. Yu, J.; Bohnet, B.; Poesio, M. Named Entity Recognition as Dependency Parsing. arXiv 2020, arXiv:2005.07150. [Google Scholar]
  28. Dozat, T.; Manning, C.D. Deep Biaffine Attention for Neural Dependency Parsing. arXiv 2017, arXiv:1611.01734. [Google Scholar]
  29. Sohrab, M.G.; Miwa, M. Deep Exhaustive Model for Nested Named Entity Recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J., Eds.; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 2843–2849. [Google Scholar]
  30. Fu, J.; Huang, X.; Liu, P. SpanNER: Named Entity Re-/Recognition as Span Prediction. arXiv 2021, arXiv:2106.00641. [Google Scholar]
  31. Li, X.; Yin, F.; Sun, Z.; Li, X.; Yuan, A.; Chai, D.; Zhou, M.; Li, J. Entity-Relation Extraction as Multi-Turn Question Answering. arXiv 2019, arXiv:1905.05529. [Google Scholar]
  32. Li, X.; Feng, J.; Meng, Y.; Han, Q.; Wu, F.; Li, J. A Unified MRC Framework for Named Entity Recognition. arXiv 2022, arXiv:1910.11476. [Google Scholar]
  33. Finkel, J.R.; Manning, C.D. Nested Named Entity Recognition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–7 August 2009; Koehn, P., Mihalcea, R., Eds.; Association for Computational Linguistics: Singapore, 2009; pp. 141–150. [Google Scholar]
  34. Lu, W.; Roth, D. Joint Mention Extraction and Classification with Mention Hypergraphs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; Màrquez, L., Callison-Burch, C., Su, J., Eds.; Association for Computational Linguistics: Lisbon, Portugal, 2015; pp. 857–867. [Google Scholar]
  35. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z. Pre-Training with Whole Word Masking for Chinese BERT. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3504–3514. [Google Scholar] [CrossRef]
  36. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Wang, S.; Hu, G. Revisiting Pre-Trained Models for Chinese Natural Language Processing. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16–20 November 2020; pp. 657–668. [Google Scholar]
  37. Gu, Y.; Qu, X.; Wang, Z.; Zheng, Y.; Huai, B.; Yuan, N.J. Delving Deep into Regularity: A Simple but Effective Method for Chinese Named Entity Recognition. arXiv 2022, arXiv:2204.05544. [Google Scholar]
  38. Wang, B.; Lu, W.; Wang, Y.; Jin, H. A Neural Transition-Based Model for Nested Mention Recognition. arXiv 2018, arXiv:1810.01808. [Google Scholar]
  39. Yan, H.; Gui, T.; Dai, J.; Guo, Q.; Zhang, Z.; Qiu, X. A Unified Generative Framework for Various NER Subtasks. arXiv 2021, arXiv:2106.01223. [Google Scholar]
  40. Wan, J.; Ru, D.; Zhang, W.; Yu, Y. Nested Named Entity Recognition with Span-Level Graphs. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 892–903. [Google Scholar]
  41. Zheng, C.; Cai, Y.; Xu, J.; Leung, H.; Xu, G. A Boundary-Aware Neural Model for Nested Named Entity Recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3 November 2019; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 357–366. [Google Scholar]
  42. Wang, Y.; Li, Y.; Tong, H.; Zhu, Z. HIT: Nested Named Entity Recognition via Head-Tail Pair and Token Interaction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online Event, 16–20 November 2020; Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6027–6036. [Google Scholar]
  43. Pikuliak, M.; Simko, M.; Bielikova, M. Towards Combining Multitask and Multilingual Learning. In SOFSEM 2019: Theory and Practice of Computer Science; Catania, B., Královič, R., Nawrocki, J., Pighizzini, G., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; Volume 11376, pp. 435–446. ISBN 978-3-030-10800-7. [Google Scholar]
  44. Agrawal, A.; Tripathi, S.; Vardhan, M.; Sihag, V.; Choudhary, G.; Dragoni, N. BERT-Based Transfer-Learning Approach for Nested Named-Entity Recognition Using Joint Labeling. Appl. Sci. 2022, 12, 976. [Google Scholar] [CrossRef]
  45. Marcińczuk, M.; Radom, J. A Single-Run Recognition of Nested Named Entities with Transformers. Procedia Comput. Sci. 2021, 192, 291–297. [Google Scholar] [CrossRef]
Figure 1. An example of nested entities from GENIA dataset; protein and DNA are categories of entities.
Figure 2. The flowchart of the method.
Figure 3. Two kinds of sequence labeling model frameworks.
Figure 4. Data annotation method in text classification model.
Table 1. Statistics on the number of various category labels in the CMeEE dataset.

Item | Train Main | Train Sub | Dev Main | Dev Sub | Overall
dis | 15,664 | 155 | 4880 | 47 | 20,746
sym | 10,115 | 2 | 3388 | 0 | 13,505
pro | 6291 | 40 | 2039 | 16 | 8386
equ | 878 | 9 | 235 | 3 | 1125
dru | 3900 | 29 | 1433 | 7 | 5369
ite | 2265 | 301 | 813 | 104 | 3483
bod | 14,609 | 3063 | 4857 | 1019 | 23,548
dep | 348 | 0 | 108 | 1 | 457
mic | 1889 | 19 | 580 | 4 | 2492
Overall | 55,959 | 3618 | 18,333 | 1201 | 79,111
Table 2. Details of labels written in the CMeEE dataset.

Label | Named Entity (NE) Type | Subclasses
dis | Disease | Diseases or syndromes; poisoning or injury; organ or cell damage
sym | Symptom | Symptoms; signs
pro | Medical procedure | Diagnostic procedure; treatment or prevention procedure
equ | Medical equipment | Diagnostic equipment; treatment equipment
dru | Drug | Drugs
ite | Medical test item | Medical test items
bod | Body | Bodily substances; body parts
dep | Department | Departments
mic | Microorganism | Microorganisms
Table 3. Statistics on the number of various category labels in the GermEval 2014 dataset.

Item | Train Main | Train Sub | Dev Main | Dev Sub | Test Main | Test Sub | Overall
PER | 7679 | 173 | 711 | 18 | 1639 | 47 | 10,267
PERderiv | 62 | 5 | 2 | 0 | 11 | 1 | 81
PERpart | 184 | 12 | 18 | 3 | 44 | 3 | 264
LOC | 8281 | 907 | 763 | 78 | 1706 | 166 | 11,901
LOCderiv | 2808 | 76 | 235 | 6 | 561 | 26 | 3712
LOCpart | 513 | 21 | 52 | 1 | 109 | 2 | 698
ORG | 5255 | 41 | 496 | 3 | 1150 | 8 | 6953
ORGderiv | 41 | 0 | 3 | 0 | 8 | 1 | 53
ORGpart | 805 | 2 | 91 | 1 | 172 | 0 | 1071
OTH | 3024 | 20 | 269 | 0 | 697 | 4 | 4014
OTHderiv | 236 | 2 | 16 | 0 | 39 | 0 | 293
OTHpart | 190 | 0 | 18 | 0 | 42 | 1 | 251
Overall | 29,078 | 1259 | 2674 | 110 | 6178 | 259 | 39,558
Table 4. Details of labels written in the GermEval 2014 dataset.

Label | NE Type | Example Sentence | Entity
PER | Person | Für Erika Ziltener besteht angesichts dieser Entwicklung die Gefahr eines Zweiklassen-Gesundheitssystems. | Erika Ziltener
PERderiv | The derivation of the person | In der ersten Hälfte des 9. Jahrhunderts erstarkten die Sanjaya wieder. | Sanjaya
PERpart | The part of the person | Die folgenden vier Spielzeiten war er für Motor IFA Karl-Marx-Stadt aktiv. | Karl-Marx-Stadt
LOC | Location | In Deutschland ist nach StGB eine Anwerbung für die Fremdenlegion strafba. | Deutschland
LOCderiv | The derivation of the location | Die 75 Kampfjets warden im englischen Hatfield hergestellt und 1949 in die Schweiz geflogen. | englischen
LOCpart | The part of the location | Ancient City, die Altstadt, ist das alte Chinesenviertel. | Chinesenviertel
ORG | Organization | Das war ein tolles Ergebnis für Porsche. | Porsche
ORGderiv | The derivation of the organization | Die Republikaner starten bei Portillo de Suano einen Gegenangriff. | Republikaner
ORGpart | The part of the organization | Heute soll Asahari ohne die JI-Struktur auskommen. | JI-Struktur
OTH | Others | Der Haushalt belief sich im vergangenen Jahr auf rund 40.2 Millionen Euro. | Euro
OTHderiv | The derivation of the others | Görts wahre Leidenschaft gilt der klassischen und romantischen Musik. | klassischen; romantischen
OTHpart | The part of the others | 1938 wurde der Rajon aufgelöst und die deutschsprachige Zeitung Rote Fahne verboten. | deutschsprachige
Table 5. Statistics on the number of training, validation, and test sets in three datasets.

Dataset | Train | Dev | Test | Entity Types
CMeEE | 15,000 | 5000 | 3000 | 9
GermEval 2014 | 24,000 | 2200 | 5100 | 12
GENIA | 15,023 | 1669 | 1854 | 5
Table 6. Sample fusion annotation from GENIA dataset.

Sentence | Label 1 | Label 2 | Fused Labeling Layers
Mouse | B-DNA | O | B-DNA
interleukin | I-DNA | B-protein | I-DNA|B-protein
- | I-DNA | I-protein | I-DNA|I-protein
2 | I-DNA | I-protein | I-DNA|I-protein
receptor | I-DNA | O | I-DNA
alpha | I-DNA | O | I-DNA
gene | I-DNA | O | I-DNA
expression | I-DNA | O | I-DNA
Table 7. Uniform label layer example from GENIA dataset.

Sentence | Label 1 | Label 2 | Fused Labeling Layers
Mouse | B-ent | O | B-ent
interleukin | I-ent | B-sub | I-ent|B-sub
- | I-ent | I-sub | I-ent|I-sub
2 | I-ent | I-sub | I-ent|I-sub
receptor | I-ent | O | I-ent
alpha | I-ent | O | I-ent
gene | I-ent | O | I-ent
expression | I-ent | O | I-ent
Table 8. Dataset size statistics after the unified label layer.

Dataset | Train | Dev | Test | Labeling Types
CMeEE | 15,000 | 5000 | - | 6
GermEval 2014 | 24,000 | 2200 | 5100 | 6
GENIA | 15,023 | 1669 | 1854 | 6
Table 9. Dataset size statistics after inserting identifiers.

Dataset | Train | Dev | Test | Labeling Types
CMeEE | 59,577 | 19,534 | - | 9
GermEval 2014 | 30,337 | 2784 | 6437 | 12
GENIA | 45,929 | 4337 | 5474 | 5
Table 10. Comparison results on the CMeEE dataset.

Group | Model | F1 Score
Baseline | BERT-wwm-ext-base [35] | 61.7
Baseline | BERT-base [13] | 62.1
Baseline | MacBERT-large [36] | 62.4
Our model | PNER (BERT-wwm-ext-base) | 62.2
Our model | PNER (BERT-base) | 62.9
Our model | PNER (MacBERT) | 63.4
Table 11. The performance of the model in classifying each entity category on the validation set.

Category | P | R | F1 Score
dis | 83.95 | 93.44 | 88.44
sym | 91.28 | 82.64 | 86.74
pro | 90.52 | 95.07 | 92.74
equ | 93.37 | 65.13 | 76.73
dru | 92.42 | 95.69 | 94.03
ite | 85.96 | 65.94 | 74.63
bod | 95.95 | 94.63 | 95.28
dep | 98.00 | 92.45 | 95.15
mic | 86.81 | 94.69 | 90.58
Table 12. Comparing model accuracy in entity classification with and without identifiers.

Model | Accuracy
With identifier | 90.36%
Without identifier | 88.53%
Table 13. Comparison of results with existing approaches for CMeEE dataset.

Model | P | R | F1 Score
BERT-CRF * | - | - | 64.41
RICON [37] | 66.25 | 64.89 | 65.57
Global Pointer [25] | - | - | 66.54
PNER I (ours) | 66.97 | 63.94 | 65.42
PNER II (ours) | 67.53 | 65.91 | 66.71
Table 14. Comparison results with the baseline model on GENIA dataset.

Model | F1 Score
LSTM-CRF+BERT [9] # | 77.80
seq2seq+BERT+Flair [9] | 78.31
PNER (ours) | 79.21
Table 15. Comparison of results with existing approaches for GENIA dataset.

Model | P | R | F1 Score
Sequence labeling-based [16] | 78.50 | 71.30 | 74.70
Hypergraph-based [38] | 77.00 | 73.30 | 75.10
Seq2seq framework-based [9] | - | - | 78.31
Span-based [24] | 79.45 | 78.94 | 79.19
Seq2seq unified generative framework [39] | 78.87 | 79.60 | 79.23
Span-level graphs [40] | 77.92 | 80.74 | 79.30
PNER (ours) | 79.34 | 79.08 | 79.21
Table 16. Comparison of results with existing approaches for GermEval 2014 dataset.

Model | P | R | F1 Score
Boundary aware-based [41] | 74.50 | 69.10 | 71.70
HIT [42] | 74.80 | 70.50 | 72.60
Bi-LSTM-CRF [43] | - | - | 75.30
Transfer-learning [44] | - | - | 85.29
PolDeepNer2 [45] | 87.86 | 87.53 | 87.69
PNER (ours) | 86.97 | 88.67 | 87.81
Table 17. The execution times of our model on the GENIA dataset.

Model | Time/Epoch
Sequence labeling model | 92 s
Text classification model | 402 s
Overall | 494 s
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, H.; Zhang, Q.; Kwon, H.-C. PNER: Applying the Pipeline Method to Resolve Nested Issues in Named Entity Recognition. Appl. Sci. 2024, 14, 1717. https://doi.org/10.3390/app14051717


