A Chinese Nested Named Entity Recognition Model for Chicken Disease Based on Multiple Fine-Grained Feature Fusion and Efficient Global Pointer

Wang, Xiajun; Peng, Cheng; Li, Qifeng; Yu, Qinyang; Lin, Liqun; Li, Pingping; Gao, Ronghua; Wu, Wenbiao; Jiang, Ruixiang; Yu, Ligen; Ding, Luyu; Zhu, Lei

doi:10.3390/app14188495


by Xiajun Wang ^1,2, Cheng Peng ^1,3,4,*, Qifeng Li ^1,3,4, Qinyang Yu ^1,3,4, Liqun Lin ^2, Pingping Li ^1,2, Ronghua Gao ^1,3,4, Wenbiao Wu ^1,3,4, Ruixiang Jiang ^1,3,4, Ligen Yu ^1,3,4, Luyu Ding ^1,3,4 and Lei Zhu ^1,3,4

^1 Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
^2 Faculty of Resources and Environmental Science, Hubei University, Wuhan 430061, China
^3 National Innovation Center of Digital Technology in Animal Husbandry, Beijing 100097, China
^4 National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
^* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(18), 8495; https://doi.org/10.3390/app14188495

Submission received: 21 August 2024 / Revised: 17 September 2024 / Accepted: 19 September 2024 / Published: 20 September 2024

(This article belongs to the Special Issue Applied Intelligence in Natural Language Processing)



Featured Application

This study proposes a multiple fine-grained nested named entity recognition model, which provides a solution for other specialized fields and lays the foundation for subsequent knowledge graph construction and intelligent inquiry system construction.

Abstract

Extracting entities from large volumes of chicken epidemic texts is crucial for knowledge sharing, integration, and application. However, named entity recognition (NER) encounters significant challenges in this domain, particularly due to the prevalence of nested entities and domain-specific named entities, coupled with a scarcity of labeled data. To address these challenges, we compiled a corpus from 50 books on chicken diseases, covering 28 different disease types. Utilizing this corpus, we constructed the CDNER dataset and developed a nested NER model, MFGFF-BiLSTM-EGP. This model integrates the multiple fine-grained feature fusion (MFGFF) module with a BiLSTM neural network and employs an efficient global pointer (EGP) to predict the entity location encoding. In the MFGFF module, we designed three encoders: the character encoder, word encoder, and sentence encoder. This design effectively captured fine-grained features and improved the recognition accuracy of nested entities. Experimental results showed that the model performed robustly, with F1 scores of 91.98%, 73.32%, and 82.54% on the CDNER, CMeEE V2, and CLUENER datasets, respectively, outperforming other commonly used NER models. Specifically, on the CDNER dataset, the model achieved an F1 score of 79.68% for nested entity recognition. This research not only advances the development of a knowledge graph and intelligent question-answering system for chicken diseases, but also provides a viable solution for extracting disease information that can be applied to other livestock species.

Keywords:

nested named entity recognition; chicken disease; multiple fine-grained feature fusion; RoBERTa; efficient global pointer

1. Introduction

The egg and broiler industries are pivotal to global food production, with China playing a crucial role as the world’s largest egg producer and a major supplier of broiler meat. China accounts for 36% of the global egg production and 14% of chicken meat, making these industries crucial for both food security and as a significant income source for millions of people in urban and rural areas. However, the rapid expansion of these industries has heightened the threat of various poultry diseases, which can severely impact production and livelihoods. As a result, effective disease prevention and control have become increasingly critical. Despite the availability of information on poultry diseases, it is often fragmented and poorly organized, making it difficult for farmers to access the necessary knowledge when needed. This paper addresses this issue by proposing a method that utilizes deep learning techniques to accurately identify and extract essential information on poultry diseases from extensive text sources. This approach lays the foundation for developing a comprehensive knowledge graph, which can support advanced applications such as intelligent Q&A systems and efficient knowledge retrieval platforms [1]. These tools will equip farmers with the information they need to protect their flocks and maintain their livelihoods.

Named entity recognition (NER) is a critical task in natural language processing that involves the identification of specific entities within text. Its importance has grown significantly with the rapid expansion of biomedical literature and data. In the biomedical domain, biomedical named entity recognition (BioNER) targets the recognition of entities such as disease names, symptoms, drugs, and anatomical parts [2]. Traditionally, NER models have approached this task as a sequence labeling problem, employing both rule-based and machine learning techniques. Rule-based methods necessitate the manual creation of extensive rule sets that rely heavily on domain expertise and are often restricted to narrowly defined areas [3]. In contrast, machine learning approaches, including hidden Markov models (HMMs), maximum entropy models (MEMs), support vector machines (SVMs), and conditional random fields (CRFs) [4,5,6,7], rely on feature engineering. This reliance poses challenges, particularly in selecting appropriate features and capturing long-term dependencies between entities [8]. The advent of deep learning has revolutionized the field: architectures such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and long short-term memory (LSTM) networks have been applied to NER tasks [9,10,11], significantly improving both efficiency and recognition accuracy. Notably, the BiLSTM-CRF model has gained widespread use [12,13,14,15] because it automates feature extraction, avoiding the large-scale feature engineering required by earlier methods [16].

In recent years, several tools have been developed to enhance NER tasks. SpaCy, a widely-used Python library, offers pre-trained models for the efficient processing and customization of domain-specific tasks including biomedical entities [17]. NLTK, while primarily research-focused, provides a flexible platform with sequence labeling tools, making it well-suited for early-stage NER implementations [18]. Stanford NLP offers a broad range of natural language processing tools [19], with RegexNER standing out as a rule-based tool that leverages regular expressions, ideal for identifying well-defined entity patterns such as chemical names and drug abbreviations [20]. More recently, advanced libraries like Flair have further improved NER performance by incorporating contextualized embeddings like BERT [21], which capture deeper semantic relationships within text. These tools, along with deep learning models, particularly transformer-based architectures, are advancing NER, significantly boosting accuracy in both general tasks and specialized applications like BioNER.

As research in NER has advanced, increasing attention has been directed toward the recognition of nested entities. The concept of “nested region names” was first introduced in the task definition of automatic content extraction (ACE) research [22] and refers to entities that exist within other entities [23]. For instance, in the study of chicken diseases, a symptom such as ‘spleen swelling’ includes the nested entities ‘spleen’, representing a body part, and ‘swelling’, representing a symptom. Recognizing these nested entities, whether the inner and outer entities are of different types (heterogeneous) or of the same type (homogeneous), is complex and presents a significant challenge for accurate identification [24].

To address the challenges posed by nested named entities, researchers have proposed various methods including hypergraph-based, region-based, transition-based, and span-based techniques. For example, Huang et al. [25] introduced a hyper graph network (HGN) structure to manage nested entities by representing each sentence as a hypergraph, with words as nodes and entities as hyper-edges, thereby transforming the recognition task into hyper-edge classification. Region-based methods treat nested NER as a multi-class classification problem by first representing potential regions (subsequences) and then classifying them. Jiang et al. [26] proposed a candidate region-aware model that utilized binary sequence labeling followed by candidate region classification, demonstrating significant performance on public datasets. Transition-based methods, inspired by dependency parsers, incrementally construct trees through greedy decoding. Wang et al. [27] developed a neural transition model using Stack LSTM, which effectively captured character-level representations and efficiently represented the system’s state. Span-based methods enumerate all possible text segments (spans) and then determine their entity status [28], which naturally suits the nested entity recognition task. Li et al. [29] enhanced this approach by developing a segment-enhanced span-based model (SESNER) that improved the model performance while accurately handling complex nested entities. Additionally, the global pointer (GP) model, which leverages relative positions through a multiplicative attention mechanism, offers a global perspective on the start and end positions for predicting entities and has demonstrated outstanding performance across various nested NER tasks [30,31,32,33].

Despite these advances, current NER models still rely heavily on large training datasets, making pre-trained models crucial for embedding layers in NER tasks [34,35]. Models such as BERT, MCBERT, RoBERTa, XLNet, and ERNIE, trained on extensive corpora, have significantly improved entity recognition accuracy [36,37,38,39]. However, challenges remain in accurately recognizing rare entities and domain-specific terms. To address these issues, researchers have sought to enhance pre-trained models with lexicon information. Techniques like Softword, ExSoftword, and SoftLexicon incorporate lexicons to improve the recognition of domain-specific named entities [40]. For instance, Zhao et al. [41] introduced the BERT-IDCNN-CRF model, which integrates the SoftLexicon method, demonstrating impressive efficiency across multiple datasets. Similarly, Zhang et al. [42] improved recognition by incorporating lexicons and similar words into character representations. Liu et al. [43] further enhanced the model capabilities by dynamically updating custom lexicon segmentation methods, thereby improving the identification of domain-specific terms and new entities. Additionally, incorporating syntactic information has been shown to significantly enhance NER performance [44]. For example, Tian et al. [45] developed the BioKMNER model, which employed a key-value memory network (KVMN) to integrate syntactic information, achieving excellent results on biomedical datasets. Luoma et al. [46] demonstrated that adding context through additional sentences in BERT input systematically improved the NER performance. These methods, whether introducing lexicon information, syntactic data, sentence context, or even stroke information of Chinese characters [47], are aimed at improving the NER model performance in specialized domains.

In summary, NER continues to face significant challenges in the field of biological disease research, particularly in accurately recognizing nested entities and domain-specific terms. These challenges are especially pronounced in the study of chicken diseases, where the acquisition of relevant corpora is difficult and the labor cost of data annotation is high. To address these issues, we collected and organized a chicken disease corpus comprising 20 million characters, trained a specialized word vector tailored to chicken disease terminology, and annotated a portion of these data with high precision. Building on this foundation, we developed a nested NER model, MFGFF-BiLSTM-EGP, which leverages multiple fine-grained feature fusion (MFGFF) and the efficient global pointer (EGP). The primary contributions of this paper are as follows:

We constructed the MFGFF-BiLSTM-EGP model, which feeds the fused multi-fine-grained features into a BiLSTM layer and then passes the result through a fully connected layer into the EGP to predict entity positions.
In the MFGFF module, the character encoder obtains character features by fine-tuning the RoBERTa pre-trained model, the word encoder acquires word features through word-character matching, word frequency weighting, and a multi-head attention mechanism, and the sentence encoder outputs sentence features using SBERT. MFGFF thus integrates features at multiple granularities. In addition, the introduction of EGP enables the prediction of nested entities by means of positional encoding.
We constructed a comprehensive knowledge base for chicken diseases that included a 20-million-character corpus, a vocabulary containing 6760 specialized terms, a 200-dimensional word vector in the field of chicken diseases, and a high-quality annotated dataset CDNER that was curated under the guidance of veterinarians.

The remainder of this paper is organized as follows. Section 2 provides an overview of the materials and methods including data collection, annotation, and the detailed architecture of the proposed MFGFF-BiLSTM-EGP model. Section 3 presents the experimental setup, evaluation metrics, and the results of model performance across different datasets. Section 4 discusses the significance of the findings, the implications for real-world applications, and a comparison with existing models. In Section 5, we conclude the paper with a summary of key contributions, limitations, and directions for future research.

2. Materials and Methods

2.1. Data and Lexicon

To address the lack of available datasets on chicken epidemic diseases, we systematically compiled a comprehensive 20-million-character corpus from various sources including 50 professional books, medical records, technical reports, academic literature, and reputable websites. We first used automated Python scripts to eliminate duplicates, redundancies, and any obviously incorrect information. Following this initial automated screening, we conducted a manual review to ensure the authenticity and accuracy of the data. Additionally, with guidance from veterinary experts, we compiled a specialized lexicon comprising 6760 terms specific to chicken diseases. We integrated this lexicon into the Jieba (https://pypi.org/project/jieba/) segmentation tool to facilitate the training of word vectors. Through these steps, we developed a high-quality knowledge base on chicken diseases, providing a robust foundation for further research and practical applications.

2.2. Entity Labeling

Based on the chicken disease corpus, we selected 5000 high-quality samples and manually annotated them on the Label-Studio platform (https://labelstud.io/) under the guidance of veterinary experts. We established five distinct labels: type, disease, symptom, body part, and drug. Table 1 provides examples and the frequency of each label.

Given the prevalence of nested entities in the task of labeling named entities related to chicken diseases, we developed a unified labeling standard to ensure accurate and comprehensive annotation. For example, in Figure 1, the body part entity “脾脏 (spleen)” and the symptom entity “肿大 (swelling)” are nested within the symptom entity “脾脏肿大 (spleen swelling)”, illustrating nesting of both different and identical entity types. Under the guidance of veterinary experts, we meticulously distinguished and labeled entities at different levels, ensuring that all of the named entities were identified and processed independently.
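The nesting described above can be made concrete with a span-based representation. The following is a minimal sketch, assuming annotations are stored as (start, end, label) triples with an exclusive end offset; the storage format and the helper `nested_pairs` are illustrative assumptions, not the authors' annotation schema.

```python
# Hypothetical span annotations for one phrase; offsets are character
# indices with an inclusive start and an exclusive end (an assumption).
text = "脾脏肿大"
annotations = [
    (0, 4, "symptom"),    # 脾脏肿大 (spleen swelling)
    (0, 2, "body part"),  # 脾脏 (spleen)
    (2, 4, "symptom"),    # 肿大 (swelling)
]


def nested_pairs(spans):
    """Return (outer, inner) pairs where inner lies fully inside outer."""
    pairs = []
    for outer in spans:
        for inner in spans:
            if outer is inner:
                continue
            if outer[0] <= inner[0] and inner[1] <= outer[1]:
                pairs.append((outer, inner))
    return pairs


for outer, inner in nested_pairs(annotations):
    print(text[outer[0]:outer[1]], outer[2], "contains",
          text[inner[0]:inner[1]], inner[2])
```

A flat sequence-labeling scheme can assign only one tag per character, so it cannot encode all three spans at once, whereas a list of spans can.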

2.3. MFGFF-BiLSTM-EGP

The pseudo-code for the MFGFF-BiLSTM-EGP model developed in this study is presented in Algorithm 1. As illustrated in Figure 2, the model integrates three key modules within its embedding layer: character encoding, word encoding, and sentence encoding.

Algorithm 1. Pseudo-code for the MFGFF-BiLSTM-EGP nested named entity recognition model.
Input: pre-trained model RoBERTa, word vectors GloVe, sentence embeddings SBERT, network BiLSTM, efficient global pointer EGP, number of iterations $E$.
Output: predicted entity spans $\hat{Y}$, optimal model weights $\hat{\theta}$.
1:	For $e = 1, \dots, E$ do
2:		$h_{i} \leftarrow \mathrm{RoBERTa}(c_{i}), \forall c_{i} \in X$
3:		$w_{j}' \leftarrow \log(1 + f_{j}) \times \mathrm{GloVe}(w_{j})$, according to Equations (4) and (5).
4:		$h_{i}' \leftarrow \mathrm{MultiHead}(h_{i}, w_{j}'), \forall w_{j} \text{ containing } c_{i}$, according to Equations (6)–(11).
5:		$u \leftarrow \mathrm{SBERT}(X)$, according to Equations (12)–(14).
6:		$H_{i} \leftarrow \mathrm{Concat}(h_{i}, h_{i}', u), \forall i$
7:		$\mathrm{ContextualOutput} \leftarrow \mathrm{BiLSTM}(\{H_{1}, H_{2}, \dots, H_{n}\})$
8:		$s_{i,j}^{\alpha} \leftarrow (W_{q} H_{i} + b_{q}) \times (W_{k} H_{j} + b_{k}), \forall (c_{i}, c_{j})$, according to Equations (17)–(20).
9:		$L \leftarrow \mathrm{CircleLoss}(s_{i,j}^{\alpha}, \hat{Y})$, according to Equations (21) and (22).
10:		$\hat{\theta} \leftarrow \arg\min_{\theta} L$
11:		If $F1 > F1_{\max}$
12:			$F1_{\max} \leftarrow F1$
13:			save $\hat{\theta}$
14:	End for
15:	Output: final predicted spans $\hat{Y}$, best weights $\hat{\theta}$.

For character encoding, we fine-tuned the RoBERTa pre-trained model and used its last hidden state output to generate character vectors. The word encoding module utilized word vectors trained with GloVe, which were aligned with the output from RoBERTa. When a word contained the current input character, a multi-head attention mechanism merged the corresponding word vectors to form the final word-level vector for that character. For sentence encoding, each sentence was processed through the SBERT (Sentence-BERT) model, generating a corresponding sentence vector. These three vectors (character, word, and sentence) were concatenated to construct the model’s embedding layer. A BiLSTM network was then employed to capture the contextual information within the text. Finally, the EGP component predicted the positions of named entities within the text.

2.3.1. Character Encoder

As shown in Figure 3, in the character encoding module, we fine-tuned RoBERTa and used RoBERTa’s last hidden state as the character vector. The pre-trained RoBERTa model served as our character encoder, building upon the BERT architecture. BERT is a language model based on the transformer framework, which leverages an attention mechanism to dynamically focus on different parts of the input data, enabling effective sequence processing [48]. RoBERTa enhances BERT’s capabilities, particularly for tasks like NER, by introducing several key modifications [49]. As an embedding layer, RoBERTa transforms input character sequences into high-dimensional feature vectors that not only capture individual word characteristics, but also rich contextual information. This context-awareness is crucial for accurately identifying and classifying named entities as it allows for a nuanced understanding of the text surrounding each word. RoBERTa’s capacity to capture these contextual nuances makes it especially effective for NER tasks.

2.3.2. Word Encoder

We developed a 200-dimensional word vector for chicken diseases using the GloVe model. This process began with the construction of a co-occurrence matrix $X$, where $X_{ij}$ represents the frequency with which word $i$ co-occurs with word $j$ within a predefined context window. The probability of co-occurrence between words $i$ and $j$ is then calculated using the formula:

$$P_{ij} = \frac{X_{ij}}{\sum_{k} X_{ik}} \tag{1}$$

Here, $P_{ij}$ indicates the likelihood of word $j$ appearing within the context of word $i$.
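Equation (1) can be illustrated on a toy corpus. The sketch below assumes already-segmented sentences, a symmetric window, and unweighted counts (GloVe implementations typically also weight counts by distance, which is omitted here); the function names and the sample sentences are illustrative only.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count X[i][j]: how often word j occurs within `window` positions of word i."""
    counts = defaultdict(lambda: defaultdict(float))
    for words in sentences:
        for pos, center in enumerate(words):
            lo = max(0, pos - window)
            hi = min(len(words), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    counts[center][words[ctx_pos]] += 1.0
    return counts


def cooccurrence_prob(counts, i, j):
    """P_ij = X_ij / sum_k X_ik, as in Equation (1)."""
    row_total = sum(counts[i].values())
    return counts[i][j] / row_total if row_total else 0.0


# Toy, pre-segmented corpus (illustrative only).
corpus = [["鸡", "脾脏", "肿大"], ["鸡", "肝脏", "肿大"]]
X = cooccurrence_counts(corpus, window=2)
print(cooccurrence_prob(X, "鸡", "肿大"))  # 2 of the 4 context words of "鸡"
```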

The GloVe model refines word vectors to accurately reflect these co-occurrence probabilities, aiming to minimize the following objective function:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left(w_{i}^{T} \tilde{w}_{j} + b_{i} + \tilde{b}_{j} - \log(X_{ij})\right)^{2} \tag{2}$$

In this equation, $w_{i}$ is the word vector for word $i$ and $\tilde{w}_{j}$ is the context vector for word $j$, while $b_{i}$ and $\tilde{b}_{j}$ are the corresponding bias terms and $V$ is the vocabulary size.

The term $\log(X_{ij})$ is the logarithm of the co-occurrence count between the two words. The weighting function $f(X_{ij})$ reduces the influence of rare co-occurrences and caps the weight of very frequent ones. It is defined as:

$$f(X_{ij}) = \begin{cases} \left(\frac{X_{ij}}{X_{max}}\right)^{\alpha} & \text{if } X_{ij} < X_{max} \\ 1 & \text{otherwise} \end{cases} \tag{3}$$

In this context, $X_{max}$ and $\alpha$ are hyperparameters that shape the weighting function, ensuring that both frequent and rare co-occurrences are appropriately weighted.
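The behavior of the weighting function in Equation (3) is easy to check numerically. In this sketch the defaults `x_max=100` and `alpha=0.75` are the values suggested in the original GloVe paper; they are an assumption here, since this section does not state the values used for the chicken disease vectors.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """Equation (3): (x / x_max) ** alpha below the cutoff, 1 otherwise.

    The default x_max and alpha are assumed, not taken from this article.
    """
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0


for count in (1, 10, 50, 100, 1000):
    print(count, round(glove_weight(count), 4))
```

Counts at or above the cutoff all receive weight 1, while smaller counts are scaled down smoothly toward 0.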

As shown in Figure 4, for each sentence $X = \{c_{1}, c_{2}, \dots, c_{n}\}$ with $n$ tokens, each token $c_{i}$ is represented by combining the character with GloVe word vectors. The set $W_{c_{i}} = \{w_{j} \mid c_{i} \in w_{j}\}$ represents the collection of word vectors $w_{j}$ for words containing the character $c_{i}$. To balance the influence of word vectors according to their frequencies, logarithmic scaling is applied to the word frequencies:

$$f_{j}' = \log(1 + f_{j}) \tag{4}$$

where $f_{j}'$ is the logarithmically scaled frequency of word $j$, and $f_{j}$ is the original frequency. The adjusted word vector for word $j$ is then calculated as:

$$w_{j}' = f_{j}' w_{j} \tag{5}$$

Here, $w_{j}'$ is the weighted word vector, and $w_{j}$ is the original word vector.
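The word-character matching and frequency weighting of Equations (4) and (5) can be sketched as follows. The matched spans, frequencies, and two-dimensional vectors are made-up illustrative values (the article uses 200-dimensional GloVe vectors), and the helper names are hypothetical.

```python
import math


def words_containing(char_index, matched_words):
    """Select lexicon words whose span covers the given character position."""
    return [w for (start, end, w) in matched_words if start <= char_index < end]


def weighted_vector(vector, frequency):
    """Equations (4)-(5): scale a word vector by log(1 + f_j)."""
    scale = math.log(1.0 + frequency)
    return [scale * v for v in vector]


# Hypothetical lexicon matches for "脾脏肿大" as (start, end, word),
# with an exclusive end offset.
matched = [(0, 2, "脾脏"), (2, 4, "肿大"), (0, 4, "脾脏肿大")]
freq = {"脾脏": 120, "肿大": 300, "脾脏肿大": 15}
vec = {"脾脏": [0.2, -0.1], "肿大": [0.5, 0.3], "脾脏肿大": [0.1, 0.4]}

# Words matched to the first character "脾" and their weighted vectors.
for w in words_containing(0, matched):
    print(w, weighted_vector(vec[w], freq[w]))
```

The logarithm compresses the frequency range, so a word that is 20 times more frequent receives a larger weight, but not a 20 times larger one.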

The integration of character vectors with weighted word vectors is achieved through a multi-head attention mechanism. The query, key, and value vectors are first calculated as follows:

$$Q = W_{Q} h_{i} \tag{6}$$

$$K_{j} = W_{K} w_{j}' \tag{7}$$

$$V_{j} = W_{V} w_{j}' \tag{8}$$

where $Q$ represents the query vector, $K_{j}$ is the key vector, and $V_{j}$ is the value vector. The matrices $W_{Q}$, $W_{K}$, and $W_{V}$ are the projection matrices for the query, key, and value, respectively.

The attention mechanism computes the attention weights to evaluate the relevance of each key-value pair to the query:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \tag{9}$$

where $d_{k}$ is the dimension of the key vector, and the softmax function normalizes the similarity scores, converting them into a probability distribution.
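For a single character query attending over the words that contain it, Equation (9) reduces to a few lines. This is a dependency-free sketch with toy two-dimensional vectors; the learned projection matrices of Equations (6) to (8) are omitted (treated as identity), which is a simplification made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Equation (9) for one query: softmax(q . k_j / sqrt(d_k)) applied to the values."""
    d_k = len(query)
    scores = [dot(query, k) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# One character query and two candidate word vectors (toy values).
weights, fused = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]])
print(weights, fused)
```

The key that aligns with the query receives the larger weight, so its value vector dominates the fused output.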

For each attention head, the computed attention weights are used to form a weighted sum of the projected value vectors, which yields the output vector of that head:

$$\mathrm{head}_{i} = \sum_{j=1}^{n} \mathrm{Attention}(Q W_{Qi}, K_{j} W_{Ki}, V_{j} W_{Vi}) \tag{10}$$

where $n$ is the length of the input sequence, and $W_{Qi}$, $W_{Ki}$, $W_{Vi}$ are the projection matrices for the $i$-th attention head. Within $\mathrm{Attention}(Q W_{Qi}, K_{j} W_{Ki}, V_{j} W_{Vi})$, the attention weight is calculated from the projected query $Q W_{Qi}$ and the projected key $K_{j} W_{Ki}$, and is then applied to the corresponding projected value $V_{j} W_{Vi}$ before summation.

These probabilities are used to weigh the value vectors, determining their contribution to the final output. The multi-head attention mechanism allows the model to focus on different parts of the input sequence simultaneously. By employing multiple attention heads, the model captures a wider range of relationships within the data. The outputs from each attention head are concatenated and undergo a linear transformation:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \mathrm{head}_{2}, \dots, \mathrm{head}_{h}) W_{O} \tag{11}$$

where $\mathrm{MultiHead}(Q, K, V)$ denotes the final weighted output of the word vector through the multi-head attention mechanism, and the concatenated outputs are transformed using the projection matrix $W_{O}$ to produce the final output.

2.3.3. Sentence Encoder

SBERT (Sentence-BERT) enhances BERT by utilizing Siamese and triplet network architectures to generate fixed-size sentence embeddings. It employs a BERT model with shared weights to process pairs of sentences, creating sentence vectors through pooling operations.

As shown in Figure 5, initially each input sentence is processed by BERT, producing token representations denoted as $\{h_{1}, h_{2}, \dots, h_{n}\}$, where $h_{i}$ is the vector representation of the $i$-th token. SBERT then applies a mean pooling operation to these token vectors to derive a fixed-length sentence vector:

$$u = \frac{1}{n} \sum_{i=1}^{n} h_{i} \tag{12}$$

Here, $h_{i}$ is the vector of the $i$-th token, and $n$ is the total number of tokens in the sentence.
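Mean pooling as in Equation (12) is a per-dimension average over token vectors. The sketch below uses plain lists and toy values; practical implementations usually also exclude padding tokens from the average, which is not modeled here.

```python
def mean_pool(token_vectors):
    """Equation (12): average the token vectors dimension by dimension."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


# Three toy token vectors of dimension 2.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(tokens))
```

The result has the same dimension as a single token vector regardless of sentence length, which is what makes the sentence vector fixed-size.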

SBERT uses a Siamese network structure to process sentence pairs, as shown in Figure 5, where each sentence is passed through a BERT model with shared weights to produce its corresponding sentence vector. Given a pair of sentences $s_{1}$ and $s_{2}$, their respective sentence vectors $u_{1}$ and $u_{2}$ are generated as follows:

$$u_{1} = \mathrm{Pooling}(\mathrm{BERT}(s_{1})) \tag{13}$$

$$u_{2} = \mathrm{Pooling}(\mathrm{BERT}(s_{2})) \tag{14}$$

This structure allows SBERT to efficiently compare and measure the similarity between sentence pairs by producing consistent and comparable sentence embeddings.
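Sentence vectors produced this way are commonly compared with cosine similarity. The following sketch shows that comparison on toy vectors; it illustrates the general practice and is not a step of the proposed model, which uses the sentence vector only as an input feature.

```python
import math


def cosine_similarity(u1, u2):
    """Cosine of the angle between two sentence vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u1, u2))
    norm1 = math.sqrt(sum(a * a for a in u1))
    norm2 = math.sqrt(sum(b * b for b in u2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
```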

2.3.4. BiLSTM

LSTM networks are a type of recurrent neural network (RNN) designed to overcome issues like gradient vanishing in traditional RNNs when dealing with long sequences and are effective for time series data and capturing long-term dependencies. However, standard LSTMs only process information in one direction, which limits their ability to fully capture context in text sequences. Bidirectional LSTM (BiLSTM) solves this by processing the sequence in both forward and backward directions. It uses two LSTM networks, one for the original sequence and one for the reversed sequence. The combined output captures information from both past and future contexts, making BiLSTM particularly useful for tasks like NER, where understanding context from both directions is important. In this article, BiLSTM was used to optimize the output of the embedding layer.

2.3.5. Efficient Global Pointer

The global pointer (GP) method enhances NER by employing a span classification approach that treats the head and tail of an entity as a unified pair. This approach provides a more holistic view of the entity, ensuring that training, prediction, and evaluation processes are consistently conducted at the entity level. This global perspective allows the GP to effectively recognize both nested and non-nested entities. The efficient global pointer (EGP) builds upon the GP approach by addressing the issue of parameter explosion, a common challenge in complex models. EGP decomposes the NER task into two subtasks: extraction and classification. This approach enables parameter sharing during extraction, which significantly reduces the number of parameters needed for classification. Consequently, the EGP is more scalable and efficient compared to its predecessor.

Furthermore, the EGP introduces a specialized loss function, inspired by circle loss, to manage class imbalance. This loss function ensures that the scores of the target classes consistently exceed those of the non-target classes, enhancing the model’s ability to distinguish between different entities. These improvements make EGP a robust and scalable solution for NER tasks, capable of delivering superior performance across various datasets and entity types.

Figure 6 illustrates the entity prediction process using EGP for the sentence “Swelling of the spleen in chickens infected with Newcastle disease”. EGP accurately labels the entity by employing positional coding, where the end position of the entity is marked as 1. This approach allows the model to effectively identify nested named entities.

After obtaining the integrated token representation $H$, the next step is to predict spans for entity recognition, which involves identifying the start and end positions of potential entity spans within a sentence. For each token $x_{i}$, the span prediction process utilizes feedforward layers to compute the start and end representations as follows:

$$q_{i,\alpha} = W_{q,\alpha} H_{i} + b_{q,\alpha} \tag{15}$$

$$k_{i,\alpha} = W_{k,\alpha} H_{i} + b_{k,\alpha} \tag{16}$$

Here, $q_{i,\alpha}$ and $k_{i,\alpha}$ represent the vector representations for the start and end tokens of the span for entity type $\alpha$.

The span score $s_{\alpha}(i, j)$ for entity type $\alpha$ is then calculated by:

$$s_{\alpha}(i, j) = q_{i,\alpha}^{T} k_{j,\alpha} \tag{17}$$

To incorporate relative positional information, rotary position embedding (RoPE) is applied, modifying the span score calculation as follows:

$$s_{\alpha}(i, j) = (R_{i} q_{i,\alpha})^{T} (R_{j} k_{j,\alpha}) = q_{i,\alpha}^{T} R_{j-i} k_{j,\alpha} \tag{18}$$

In this equation, $R_{i}$ and $R_{j}$ are the position encodings for the start and end tokens, respectively. Because $R_{i}^{T} R_{j} = R_{j-i}$, the score depends only on the relative offset $j - i$ between the two positions.

The EGP approach splits the NER task into extraction and classification, sharing parameters across entity types to reduce the total parameter count. Extraction finds the entity spans, and classification assigns the correct type. The scoring function that integrates both tasks is formulated as follows:

$$s_{\alpha}(i, j) = (W_{q} H_{i})^{T} (W_{k} H_{j}) + w_{\alpha}^{T} [H_{i}; H_{j}] \tag{19}$$

Here, $W_{q}$ and $W_{k}$ are the projection matrices used for span extraction, and $w_{\alpha}^{T}$ is the parameter vector associated with entity classification.

To further optimize the use of parameters, we employed concatenated span representations $(q_{i}; k_{i})$ instead of the full token representation $H_{i}$. The scoring function is then reformulated as:

$$s_{\alpha}(i, j) = q_{i}^{T} k_{j} + w_{\alpha}^{T} [q_{i}; k_{i}; q_{j}; k_{j}] \tag{20}$$

In this equation, $w_{\alpha}^{T}$ is a parameter vector specific to the entity type $\alpha$.
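The scoring rule in Equation (20) and the way it yields nested spans can be sketched with plain Python lists. This is a minimal illustration under several assumptions: RoPE is omitted, vectors are one-dimensional toy values, only spans with start ≤ end are scored, and spans with a score above 0 are kept (the usual global pointer decoding threshold, assumed here rather than stated in this section).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def egp_scores(q, k, w_alpha):
    """s_alpha(i, j) = q_i . k_j + w_alpha . [q_i; k_i; q_j; k_j] for all i <= j."""
    n = len(q)
    scores = {}
    for i in range(n):
        for j in range(i, n):
            extraction = dot(q[i], k[j])                               # shared across types
            classification = dot(w_alpha, q[i] + k[i] + q[j] + k[j])   # type-specific term
            scores[(i, j)] = extraction + classification
    return scores


def decode(scores, threshold=0.0):
    """Keep every span whose score exceeds the threshold; spans may overlap or nest."""
    return sorted(span for span, s in scores.items() if s > threshold)


# Two tokens with 1-dimensional q and k vectors (toy values).
q = [[1.0], [-1.0]]
k = [[1.0], [1.0]]
w_alpha = [0.0, 0.0, 0.0, 0.0]
print(decode(egp_scores(q, k, w_alpha)))
```

Because every (start, end) pair is scored independently, the span (0, 0) and the enclosing span (0, 1) can both be predicted, which is what allows nested entities to be recovered.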

To improve the NER performance with imbalanced datasets, EGP introduces a specialized loss function inspired by circle loss. This ensures that the target class scores are higher than non-target ones. The loss function is defined as follows:

$$L = \log\left(1 + \sum_{i \in \Omega_{neg}} e^{s^{i}} \sum_{j \in \Omega_{pos}} e^{-s^{j}}\right) \tag{21}$$

Here, $\Omega_{pos}$ and $\Omega_{neg}$ represent the sets of positive and negative samples, respectively.

This formulation is extended to handle the imbalance in entity types using the following loss function for a specific entity type $\alpha$:

$$L_{\alpha} = \log\left(1 + \sum_{(q,k) \in P_{\alpha}} e^{-s_{\alpha}(q, k)}\right) + \log\left(1 + \sum_{(q,k) \in Q_{\alpha}} e^{s_{\alpha}(q, k)}\right) \tag{22}$$

In this equation, $P_{\alpha}$ denotes the set of entity spans for entity type $\alpha$ (positive spans), while $Q_{\alpha}$ denotes the set of spans that are not entities of type $\alpha$ (negative spans). The term $s_{\alpha}(q, k)$ is the span score for entity type $\alpha$ between the start token $q$ and the end token $k$.

The loss function $L_{\alpha}$ effectively balances the classification of both the entity and non-entity spans, ensuring that the model accurately distinguishes between them, thereby addressing class imbalance and improving the overall NER performance.
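Equation (22) can be evaluated directly for one entity type. The sketch below treats the first sum as running over the gold entity spans of that type and the second over all remaining candidate spans; the dictionary-of-scores input and the function name are illustrative assumptions, and a real implementation would operate on batched tensors.

```python
import math


def span_loss(scores, positive_spans):
    """Equation (22) for one entity type.

    scores: dict mapping (start, end) to the span score for this type.
    positive_spans: set of gold (start, end) spans of this type.
    """
    pos_sum = sum(math.exp(-s) for span, s in scores.items() if span in positive_spans)
    neg_sum = sum(math.exp(s) for span, s in scores.items() if span not in positive_spans)
    return math.log(1.0 + pos_sum) + math.log(1.0 + neg_sum)


gold = {(0, 1)}
well_separated = {(0, 0): -5.0, (0, 1): 5.0, (1, 1): -5.0}
confused = {(0, 0): 5.0, (0, 1): -5.0, (1, 1): 5.0}
print(span_loss(well_separated, gold))
print(span_loss(confused, gold))
```

The loss approaches 0 when gold spans score well above 0 and all other spans well below 0, independently of how many negative spans there are, which is why it tolerates the heavy imbalance between entity and non-entity spans.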

2.4. Experimental Environment and Hyperparameter

The experiments were conducted using a Tesla P100 GPU and PyTorch 2 as the deep learning framework. Table 2 outlines the specific experimental parameters.

2.5. Evaluation Indicators

The model was evaluated using a confusion matrix, which categorizes the prediction results into four groups: true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs). The confusion matrix provides a detailed insight into the model’s performance across these categories, allowing us to assess its predictive accuracy and identify both strengths and areas for improvement.

Additionally, three key metrics were used to evaluate the model: precision, recall, and F1 score. Precision measures the proportion of correctly predicted positive samples out of all samples predicted as positive, with higher values indicating greater prediction accuracy. Recall assesses the model’s ability to identify all actual positive samples, with higher values reflecting greater sensitivity. F1 score, the harmonic mean of precision and recall, offers a balanced assessment of the model’s performance, where higher values indicate a more reliable and well-rounded model.

P = \frac{TP}{TP + FP}

(23)

R = \frac{TP}{TP + FN}

(24)

F1 = \frac{2 \times P \times R}{P + R}

(25)
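Equations (23) to (25) can be computed directly from sets of gold and predicted entities. The sketch below assumes exact span-and-label matching, which is a common convention for NER but is not stated explicitly in this section; the function name and the example entities are hypothetical.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, label)


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Compute P, R, and F1 as in Eqs. (23)-(25) under exact-match scoring."""
    tp = len(gold & predicted)   # predictions that match a gold entity exactly
    fp = len(predicted - gold)   # predictions with no matching gold entity
    fn = len(gold - predicted)   # gold entities the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Nested example: "body part" (0, 1) lies inside "disease" (0, 2).
gold = {(0, 2, "disease"), (0, 1, "body part"), (5, 7, "symptom")}
predicted = {(0, 2, "disease"), (5, 7, "symptom"), (9, 10, "drug")}
p, r, f1 = precision_recall_f1(gold, predicted)
```

Here two of the three predictions are correct and two of the three gold entities are found, so precision, recall, and F1 all equal 2/3. True negatives do not appear in any of the three formulas.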

3. Results

We conducted experiments on three datasets, each divided into training, validation, and testing sets with a 6:2:2 ratio:

CDNER (ours): The Chicken Disease Named Entity Recognition dataset comprising 5000 high-quality samples labeled by veterinary experts, containing five types of entities with nested structures.
CMeEE V2 [50]: A Chinese medical entity recognition dataset with 20,000 annotated samples across nine categories including nested entities. The categories include diseases (dis), clinical symptoms (sym), drugs (dru), medical equipment (equ), medical procedures (pro), body (bod), medical examination items (ite), microbiology (mic), and departments (dep).
CLUENER [51]: A general domain NER dataset with 12,091 annotated samples across ten categories: addresses, books, companies, games, government, movies, names, organizations, positions, and scenes.
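A 6:2:2 split such as the one described above could be produced as in the following sketch. Whether the authors shuffled the data, and with which seed, is not stated; the function name split_dataset, the seed value, and the use of integer sample IDs are assumptions for illustration only.

```python
import random
from typing import Sequence, Tuple


def split_dataset(samples: Sequence, ratios: Tuple[int, int, int] = (6, 2, 2), seed: int = 42):
    """Shuffle the samples and split them into train/validation/test parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)             # fixed seed for a reproducible split
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total      # integer arithmetic avoids rounding drift
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]                 # remainder goes to the test set
    return train, val, test


# 5000 placeholder sample IDs, matching the size reported for CDNER.
train, val, test = split_dataset(range(5000))
```

With 5000 samples this yields 3000 training, 1000 validation, and 1000 test samples.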

For the CDNER dataset, we employed chicken disease word vectors trained on a 20-million-character corpus. For the CMeEE V2 and CLUENER datasets, we utilized 200-dimensional Chinese word vectors from Tencent AI Lab [52].

3.1. Main Results Compared with Other Models

We evaluated the performance of several mainstream NER models on these datasets:

Lattice LSTM [53]: Encodes input characters and all potential words in the matching dictionary.
FLAT [54]: A transformer-based model that utilizes unique position encoding to integrate the lattice structure, seamlessly introducing lexical information.
BERT-CRF: Combines a traditional CRF decoder with the BERT pre-trained model.
BERT-MRC [55]: Reframes the NER task as a reading comprehension problem.
SLRBC [56]: Integrates lexical boundary information using the SoftLexicon method with RoBERTa, BiLSTM, and CRF.

As summarized in Table 3, the MFGFF-BiLSTM-EGP model consistently outperformed other models across all three datasets, achieving the highest F1 scores: 91.98% on CDNER, 73.32% on CMeEE V2, and 82.54% on CLUENER. The superior performance of the MFGFF-BiLSTM-EGP model can be attributed to its integration of character, word, and sentence vectors, fused with BiLSTM and EGP, which enhanced the recognition accuracy by incorporating specialized vocabulary and contextual information.

The SLRBC model also exhibited strong performance, particularly on the CDNER dataset, with an F1 score of 89.64%. This success was largely due to the SoftLexicon method, which enhanced vocabulary representation, combined with RoBERTa’s robust contextual embeddings, BiLSTM’s capacity to capture sequence data, and CRF’s sequence tagging capabilities. Although BERT-CRF lacked the SoftLexicon and BiLSTM modules present in the SLRBC model, it still performed competitively due to BERT’s powerful representation capabilities, albeit with slightly lower metrics across all datasets. BERT-MRC, which reinterprets NER as a reading comprehension task, delivered adequate but not outstanding results, with F1 scores of 82.93% on CDNER, 67.97% on CMeEE V2, and 76.89% on CLUENER. Its performance could potentially be improved by refining the description of entity types during MRC parameter settings.

3.2. Entity Level Evaluation

Figure 7 and Table 4 present the entity-level evaluation results of the MFGFF-BiLSTM-EGP model across the CDNER, CMeEE V2, and CLUENER datasets, detailing the precision, recall, and F1 scores. In the CDNER dataset, the model performed strongly in recognizing the “symptom”, “body part”, “disease”, “drug”, and “type” categories, achieving F1 scores of 87.6%, 92.43%, 92.98%, 93.22%, and 93.68%, respectively. This consistency suggests that the annotations in our dataset are balanced and precise, which facilitates robust model training.

For the CMeEE V2 dataset, the model’s performance varied across different categories. It excelled in the “dis” (disease) and “dru” (drug) categories, with F1 scores of 83.81% and 81.1%, respectively. However, the model encountered difficulties in the “equ” (equipment) and “ite” (medical examination items) categories, where the F1 scores dropped to 62.62% and 62.04%, respectively. Notably, in the “equ” category, the model’s precision was only 55.23%, likely due to the limited representation of specialized vocabulary within the general domain word vectors, resulting in weaker recognition performance in these areas.

In the CLUENER dataset, the model showed a relatively balanced performance across categories, achieving high F1 scores in the “company”, “government”, and “position” categories of 86.05%, 84.28%, and 84.33%, respectively. This indicates strong recognition capabilities with minimal disparity between precision and recall, reflecting good stability. However, the model’s performance in the “movie” and “book” categories was less satisfactory, with F1 scores of 75.76% and 80.2%, respectively. Overall, the model maintained a balanced performance across multiple categories.

3.3. Ablation Study

Table 5 shows the results of the ablation experiments performed with the MFGFF-BiLSTM-EGP model on the CDNER dataset to evaluate the impact of different module combinations on the model’s F1 score. The study investigated the contributions of the pre-trained model, word encoder, and sentence encoder. The results indicate that each module improved the model’s performance, with the pre-trained model and word encoder contributing most and the highest F1 score of 91.98% achieved when all three modules were integrated.

3.3.1. Effect of Pre-Trained Model

The pre-trained model substantially improved the overall performance. When only the pre-trained model (Model 1) was employed, the F1 score reached 88.01%, which was 5.69% higher than that of the model without the pre-trained model (82.32% for Model 2). This finding underscores the pre-trained model’s effectiveness in capturing underlying features. The combination of the pre-trained model with the word encoder (Model 4) further increased the F1 score to 91.33%, representing a 9.01% improvement over Model 2 (which used only the word encoder). This emphasizes the significance of pre-trained models in complex feature representation. When the pre-trained model was combined with both the word and sentence encoders (Model 6), the F1 score peaked at 91.98%, showing a 9.31% improvement over Model 3, which combined the word and sentence encoders. This result further demonstrates the pre-trained model’s capacity to maximize the overall model performance.

3.3.2. Effect of Word Encoder

The word encoder plays a crucial role in enhancing the model’s performance, especially when integrated with other modules. The F1 score for the word encoder alone (Model 2) was 82.32%, which, although lower than that of Model 1, which utilized only the pre-trained model, still illustrates the word encoder’s value in word-level feature extraction. When the word encoder was combined with the pre-trained model (Model 4), the F1 score rose to 91.33%, a 3.32% improvement over Model 1. This finding indicates that the word encoder significantly contributes to refining the pre-trained model’s fine-grained feature representation. In Model 6, where the word encoder was combined with both the pre-trained model and the sentence encoder, the F1 score further improved to 91.98%, a 3.64% increase over Model 5, which combined the pre-trained model and the sentence encoder. This demonstrates the pivotal role of the word encoder in a multi-module combination.

3.3.3. Effect of Sentence Encoder

The sentence encoder’s impact on model performance is more nuanced and depends on its combination with other modules. When combined with the word encoder (Model 3), the F1 score reached 82.67%, a modest increase of 0.35% compared to Model 2, which used only the word encoder. This slight improvement may be due to the introduction of the sentence encoder, which could add redundant information in the absence of a pre-trained model. When the sentence encoder was combined with the pre-trained model (Model 5), the F1 score increased slightly to 88.34%, just 0.33% higher than Model 1 (88.01%), which only used the pre-trained model. In Model 6, the F1 score achieved its maximum of 91.98% when all three modules were used together, an improvement of 0.65% compared to Model 4. These results suggest that the sentence encoder, when used alongside the pre-trained model and word encoder, can marginally enhance the global semantic representation of the model.
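The differences quoted in Sections 3.3.1 to 3.3.3 are absolute differences in F1, i.e., percentage points. The short sketch below recomputes them from the F1 scores reported above; the dictionary layout and the helper name gain are illustrative choices, not part of the original experiments.

```python
# F1 scores (%) on CDNER as reported in the ablation discussion above.
ablation_f1 = {
    "Model 1": 88.01,  # pre-trained model only
    "Model 2": 82.32,  # word encoder only
    "Model 3": 82.67,  # word encoder + sentence encoder
    "Model 4": 91.33,  # pre-trained model + word encoder
    "Model 5": 88.34,  # pre-trained model + sentence encoder
    "Model 6": 91.98,  # all three modules
}


def gain(with_module: str, without_module: str) -> float:
    """F1 difference in percentage points between two configurations."""
    return round(ablation_f1[with_module] - ablation_f1[without_module], 2)


# Each comparison adds exactly one module to the baseline configuration.
pretrained_gains = {"4 vs 2": gain("Model 4", "Model 2"), "6 vs 3": gain("Model 6", "Model 3")}
word_gains = {"4 vs 1": gain("Model 4", "Model 1"), "6 vs 5": gain("Model 6", "Model 5")}
sentence_gains = {
    "3 vs 2": gain("Model 3", "Model 2"),
    "5 vs 1": gain("Model 5", "Model 1"),
    "6 vs 4": gain("Model 6", "Model 4"),
}
```

Adding the pre-trained model yields gains of about 9 points, the word encoder about 3.3 to 3.6 points, and the sentence encoder less than 1 point in every pairing.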

4. Discussion

4.1. Confusion Matrix Analysis

Figure 8 presents confusion matrices for the nested named entity recognition model on the three datasets (CDNER, CMeEE V2, and CLUENER). The confusion matrices revealed a recurring issue of missing classifications, recorded in the “Other” category, where the model failed to assign certain entities to any specific category. This problem was most prominent in the medical-related CDNER and CMeEE V2 datasets, where entities like “symptoms”, “body parts”, and “diseases” were frequently missing classifications. For instance, in the CDNER dataset, 245 “symptom” entities were missing classifications, indicating the model’s limited accuracy in recognizing these medical terms. Similarly, in the CMeEE V2 dataset, 1958 “bod” and 964 “sym” entities were missing classifications.

The CLUENER dataset also showed instances of missing classification, particularly in categories like “position” and “address”, further demonstrating the model’s difficulty in correctly identifying these entities. In addition to missing classifications, the issue of misclassification was also noteworthy. The model sometimes incorrectly assigned entities to the wrong category such as classifying “company” as “position” or “name”. This misclassification may have stemmed from semantic overlaps between categories and the model’s limited ability to process complex contexts.
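The two error types discussed above, missing classifications and misclassifications, can be separated with a span-level confusion count. The sketch below is one possible way to build such a matrix and is not taken from the authors' code: it assumes at most one label per span boundary, and the function name and example spans are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple

Span = Tuple[int, int]


def span_confusion(gold: Dict[Span, str], predicted: Dict[Span, str],
                   missing_label: str = "Other") -> Counter:
    """Count (gold_label, predicted_label) pairs over entity spans.

    A gold span with no prediction is counted under (gold_label, "Other"),
    i.e., a missing classification; a predicted span with no gold entity
    is counted under ("Other", predicted_label).
    """
    counts: Counter = Counter()
    for span, gold_label in gold.items():
        counts[(gold_label, predicted.get(span, missing_label))] += 1
    for span, pred_label in predicted.items():
        if span not in gold:
            counts[(missing_label, pred_label)] += 1
    return counts


gold = {(0, 2): "symptom", (4, 6): "body part", (8, 10): "disease"}
predicted = {(0, 2): "symptom", (4, 6): "symptom", (12, 13): "drug"}
confusion = span_confusion(gold, predicted)
```

In this example, the “body part” span labeled as “symptom” is a misclassification, while the “disease” span with no prediction is a missing classification.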

4.2. Visualization of Token Representations in Feature Space

We conducted feature visualization and analysis across three datasets: CDNER, CMeEE V2, and CLUENER. By labeling 50 entities per category and extracting their features, we applied t-SNE for dimensionality reduction to facilitate visualization. Our analysis focused on two approaches: word vectors and MFGFF.

Figure 9 illustrates that the effectiveness of MFGFF varied significantly across different datasets. In the CDNER dataset, the original word vector representations exhibited minimal distinction between feature vectors across entity categories, leading to substantial overlap and dispersion. However, following MFGFF, the feature representations improved markedly, with data points clustering more centrally within their respective categories. This resulted in more distinct category clusters and enhanced inter-category distinguishability. Despite these improvements, challenges remained, such as the observed similarity between categories like ‘body part’ and ‘symptom’. This overlap likely arose from semantic similarities or nested relationships within the text.

Similarly, the CMeEE V2 dataset initially showed poor category aggregation, with blurred boundaries and significant overlap under the original word vectors. The application of MFGFF significantly clarified these boundaries and improved data point aggregation, highlighting its effectiveness in enhancing feature differentiation. However, certain categories such as medical examination item, medical procedure, and body still exhibited similarities, likely due to semantic overlaps in medical terminology.

In contrast, the CLUENER dataset, a general domain corpus, displayed a more even distribution of entities under the original word vectors, though noticeable category overlap was still present. The application of MFGFF greatly improved differentiation between entity types, which may be attributed to the dataset’s inherent diversity in entity text, allowing fused features to better segregate categories.

In summary, our analysis highlights the complexities inherent in biomedical NER including challenges like semantic overlap, domain-specific vocabulary, and nested entities. While MFGFF demonstrated significant advantages in enhancing feature representation, it still faced challenges, particularly in addressing category similarity. The varying performance of this technique across different datasets is closely tied to the characteristics of the datasets and the semantic features of the entity categories, indicating a need for further optimization and adaptation in specific applications.

4.3. Nested Entity Prediction Analysis

Based on the experimental results presented in Figure 10, we conducted a detailed evaluation of the effectiveness of nested entity recognition by comparing the performance of fine-tuning (FT) RoBERTa and MFGFF across different datasets. For the CDNER dataset, fine-tuning RoBERTa achieved a precision of 79.68%, a recall of 72.54%, and an F1 score of 75.95%. On the CMeEE V2 dataset, these metrics were notably lower, with a precision of 42.13%, a recall of 47.15%, and an F1 score of 44.50%. Compared to the overall entity recognition, the F1 scores for CDNER and CMeEE V2 decreased by 16.03 and 28.82 percentage points, respectively, highlighting the significant challenges posed by nested entities in NER.

To address this challenge, we developed a high-quality nested entity dataset to enhance model training with more representative data. Implementing the MFGFF method further improved the model’s capability to recognize nested entities. On the CDNER dataset, MFGFF achieved precision, recall, and F1 scores of 71.90%, 72.13%, and 71.91%, respectively, closely aligning with the performance of fine-tuning RoBERTa. On the more complex CMeEE V2 dataset, MFGFF yielded precision, recall, and F1 scores of 39.56%, 41.10%, and 40.32%, respectively. Although MFGFF’s absolute values were slightly lower than those of the fine-tuned RoBERTa, its stability and robustness in recognizing nested entities were more pronounced, particularly in the challenging CMeEE V2 dataset.

In conclusion, nested entity recognition continues to present significant challenges in NER tasks, and constructing high-quality datasets remains important for addressing them. In Figure 10, the MFGFF method showed a smaller gap between precision and recall than fine-tuning RoBERTa on both datasets (0.23 versus 7.14 percentage points on CDNER, and 1.54 versus 5.02 on CMeEE V2), which is consistent with the stability noted above, although its absolute F1 scores were about 4 percentage points lower.
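Evaluating nested entities separately requires deciding which annotated entities count as nested. The criterion used for Figure 10 is not spelled out in this section, so the sketch below adopts one plausible definition as an assumption: an entity is nested if its span contains, or is contained in, the span of another entity with different boundaries. The names contains and nested_entities are hypothetical.

```python
from typing import Iterable, Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, label), end inclusive


def contains(outer: Entity, inner: Entity) -> bool:
    """True if inner lies within outer and the two boundaries are not identical."""
    same_boundaries = (outer[0], outer[1]) == (inner[0], inner[1])
    return outer[0] <= inner[0] and inner[1] <= outer[1] and not same_boundaries


def nested_entities(entities: Iterable[Entity]) -> Set[Entity]:
    """Return the entities that take part in at least one containment relation."""
    items = list(entities)
    return {e for e in items if any(contains(e, o) or contains(o, e) for o in items)}


# Hypothetical sentence: a "body part" mention inside a "disease" mention, plus a flat "drug".
gold = [(0, 5, "disease"), (0, 1, "body part"), (8, 10, "drug")]
nested = nested_entities(gold)
```

Precision, recall, and F1 for nested entities can then be computed on this subset alone, which is how a separate nested-entity score can be much lower than the overall score on the same dataset.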

4.4. Comparative Analysis of Pre-Trained Models

The choice of a pre-trained model for embedding is crucial to the recognition effectiveness of NER models. We compared the performance of five pre-trained models—BERT, RoBERTa, XLNet, MCBERT, and ERNIE—as illustrated in Figure 11.

RoBERTa outperformed the other models across all metrics, particularly in recall (92.18%) and F1 score (91.98%), where it led significantly. This superior performance can be attributed to RoBERTa’s use of a larger dataset, extended training time, and the removal of the next sentence prediction task from BERT. These modifications enhanced RoBERTa’s ability to capture contextual information and manage long dependencies effectively. MCBERT, a model pre-trained specifically for the Chinese medical domain, performed comparably to BERT in medical text classification tasks, with accuracy and F1 scores of 90.25% and 90.49%, respectively. MCBERT’s strength lies in its pre-training on a vast corpus of medical domain texts, enabling it to handle medical terminology and specialized sentence structures effectively. However, compared to RoBERTa and ERNIE, MCBERT was slightly weaker in precision and recall, likely due to the more extensive pre-training data and optimization strategies employed by RoBERTa and ERNIE, which resulted in better performance even in specialized areas like chicken disease NER. XLNet, on the other hand, lagged behind the other models in precision (88.87%), recall (90.08%), and F1 score (89.44%). Despite its autoregressive architecture, which aims to merge the benefits of BERT and Transformer-XL to capture richer contextual information, XLNet’s performance may degrade when handling shorter texts or when there is insufficient contextual information.

5. Conclusions

In this study, we constructed the CDNER dataset, specifically designed for nested named entity recognition (NER) in the context of chicken diseases. Additionally, we developed a novel model, termed MFGFF-BiLSTM-EGP. This model achieved strong results across multiple datasets, with F1 scores of 91.98%, 73.32%, and 82.54% on the CDNER, CMeEE V2, and CLUENER datasets, respectively, highlighting its broad generalization capabilities. Notably, in the ablation study, the character encoder (the pre-trained model) contributed the largest improvement (9.31 percentage points), while the sentence encoder offered a modest boost (0.65 percentage points). The MFGFF module further enhanced the model’s performance by integrating fine-grained feature vectors. For nested entities, the model achieved F1 scores of 75.95% on the CDNER dataset and 44.50% on the CMeEE V2 dataset, improvements of 4.04 and 4.18 percentage points over using word vector embeddings. This supports the effectiveness of the EGP module in handling nested structures.

Despite achieving promising results, challenges remain. The model exhibited suboptimal performance in distinguishing closely related entities, with instances of misclassification and missed classifications highlighting limitations in semantic understanding and contextual processing. Additionally, the feature fusion process may introduce information redundancy. Future work will focus on addressing issues of misclassification and missed classifications, optimizing feature selection, reducing information redundancy, and enhancing the overall model performance.

This research not only advances the theoretical application of deep learning to complex NER tasks, particularly for nested entity recognition in animal disease contexts, but also offers practical benefits. In the field of agriculture and animal health, the model could be integrated into intelligent diagnostic tools, enabling veterinarians and farmers to develop disease management plans quickly and accurately. Additionally, it lays the foundation for constructing animal disease knowledge graphs, which can be applied in question–answer systems to boost decision-making efficiency.

Author Contributions

Conceptualization, X.W. and C.P.; Methodology, X.W.; Software, X.W.; Validation, X.W., Q.L. and Q.Y.; Formal analysis, L.L. and L.D.; Investigation, P.L.; Resources, C.P.; Data curation, C.P.; Writing—original draft preparation, X.W.; Writing—review and editing, X.W. and L.Z.; Visualization, X.W. and R.J.; Supervision, R.G. and W.W.; Project administration, Q.L. and L.Y.; Funding acquisition, Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Major Project (2021ZD0113802) and the Technological Innovation Capacity Construction of Beijing Academy of Agricultural and Forestry Sciences (KJCX20240318).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in https://github.com/wangxiajun68/chicken-disease (accessed on 18 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Han, H.; Xue, X.; Li, Q.; Gao, H.; Wang, R.; Jiang, R.; Ren, Z.; Meng, R.; Li, M.; Guo, Y.; et al. Pig-Ear Detection from the Thermal Infrared Image Based on Improved YOLOv8n. Intell. Robot. 2024, 4, 20–38.
Hou, G.; Jian, Y.; Zhao, Q.; Quan, X.; Zhang, H. Language Model Based on Deep Learning Network for Biomedical Named Entity Recognition. Methods 2024, 226, 71–77.
Jehangir, B.; Radhakrishnan, S.; Agarwal, R. A Survey on Named Entity Recognition—Datasets, Tools, and Methodologies. Nat. Lang. Process. J. 2023, 3, 100017.
Li, Y.; Song, L.; Zhang, C. Sparse Conditional Hidden Markov Model for Weakly Supervised Named Entity Recognition. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 978–988.
De Martino, A.; De Martino, D. An Introduction to the Maximum Entropy Approach and Its Application to Inference Problems in Biology. Heliyon 2018, 4, e00596.
Ramachandran, R.; Arutchelvan, K. Optimized Version of Tree Based Support Vector Machine for Named Entity Recognition in Medical Literature. In Proceedings of the 2020 3rd International Conference on Intelligent Sustainable Systems (ICISS), Thoothukudi, India, 3 December 2020; pp. 357–361.
Liu, K.; Hu, Q.; Liu, J.; Xing, C. Named Entity Recognition in Chinese Electronic Medical Records Based on CRF. In Proceedings of the 2017 14th Web Information Systems and Applications Conference (WISA), Liuzhou, China, 11–12 November 2017; pp. 105–110.
Dash, A.; Darshana, S.; Yadav, D.K.; Gupta, V. A Clinical Named Entity Recognition Model Using Pretrained Word Embedding and Deep Neural Networks. Decis. Anal. J. 2024, 10, 100426.
Zhang, R.; Zhao, P.; Guo, W.; Wang, R.; Lu, W. Medical Named Entity Recognition Based on Dilated Convolutional Neural Network. Cogn. Robot. 2022, 2, 13–20.
Lerner, I.; Paris, N.; Tannier, X. Terminologies Augmented Recurrent Neural Network Model for Clinical Named Entity Recognition. J. Biomed. Inform. 2020, 102, 103356.
Jia, C.; Zhang, Y. Multi-Cell Compositional LSTM for NER Domain Adaptation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 28 July–2 August 2020; pp. 5906–5917.
Chang, C.; Tang, Y.; Long, Y.; Hu, K.; Li, Y.; Li, J.; Wang, C.-D. Multi-Information Preprocessing Event Extraction with BiLSTM-CRF Attention for Academic Knowledge Graph Construction. IEEE Trans. Comput. Soc. Syst. 2023, 10, 2713–2724.
An, Y.; Xia, X.; Chen, X.; Wu, F.-X.; Wang, J. Chinese Clinical Named Entity Recognition via Multi-Head Self-Attention Based BiLSTM-CRF. Artif. Intell. Med. 2022, 127, 102282.
Deng, N.; Fu, H.; Chen, X. Named Entity Recognition of Traditional Chinese Medicine Patents Based on BiLSTM-CRF. Wirel. Commun. Mob. Comput. 2021, 2021, 6696205.
Ma, P.; Jiang, B.; Lu, Z.; Li, N.; Jiang, Z. Cybersecurity Named Entity Recognition Using Bidirectional Long Short-Term Memory with Conditional Random Fields. Tsinghua Sci. Technol. 2021, 26, 259–265.
Baigang, M.; Yi, F. A Review: Development of Named Entity Recognition (NER) Technology for Aeronautical Information Intelligence. Artif. Intell. Rev. 2023, 56, 1515–1542.
Fantechi, A.; Gnesi, S.; Livi, S.; Semini, L. A spaCy-Based Tool for Extracting Variability from NL Requirements. In Proceedings of the 25th ACM International Systems and Software Product Line Conference-Volume B, Leicester, UK, 6 September 2021; pp. 32–35.
Wang, M.; Hu, F. The Application of NLTK Library for Python Natural Language Processing in Corpus Research. Theory Pract. Lang. Stud. 2021, 11, 1041–1049.
Kumar, S.; Alam, M.S.; Khursheed, Z.; Bashar, S.; Kalam, N. Enhancing Relational Database Interaction through Open AI and Stanford Core NLP-Based on Natural Language Interface. In Proceedings of the 2024 5th International Conference on Recent Trends in Computer Science and Technology (ICRTCST), Jamshedpur, India, 9 April 2024; pp. 589–602.
Pendleton, S.C.; Slater, K.; Karwath, A.; Gilbert, R.M.; Davis, N.; Pesudovs, K.; Liu, X.; Denniston, A.K.; Gkoutos, G.V.; Braithwaite, T. Development and Application of the Ocular Immune-Mediated Inflammatory Diseases Ontology Enhanced with Synonyms from Online Patient Support Forum Conversation. Comput. Biol. Med. 2021, 135, 104542.
ElDin, H.G.; AbdulRazek, M.; Abdelshafi, M.; Sahlol, A.T. Med-Flair: Medical Named Entity Recognition for Diseases and Medications Based on Flair Embedding. Procedia Comput. Sci. 2021, 189, 67–75.
Wang, Y.; Tong, H.; Zhu, Z.; Li, Y. Nested Named Entity Recognition: A Survey. ACM Trans. Knowl. Discov. Data 2022, 16, 1–29.
Yang, J.; Zhang, T.; Tsai, C.-Y.; Lu, Y.; Yao, L. Evolution and Emerging Trends of Named Entity Recognition: Bibliometric Analysis from 2000 to 2023. Heliyon 2024, 10, e30053.
Ming, H.; Yang, J.; Gui, F.; Jiang, L.; An, N. Few-Shot Nested Named Entity Recognition. Knowl.-Based Syst. 2024, 293, 111688.
Huang, H.; Lei, M.; Feng, C. Hypergraph Network Model for Nested Entity Mention Recognition. Neurocomputing 2021, 423, 200–206.
Jiang, D.; Ren, H.; Cai, Y.; Xu, J.; Liu, Y.; Leung, H. Candidate Region Aware Nested Named Entity Recognition. Neural Netw. 2021, 142, 340–350.
Wang, B.; Lu, W.; Wang, Y.; Jin, H. A Neural Transition-Based Model for Nested Mention Recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, Brussels, Belgium, 31 October–4 November 2018; pp. 1011–1017.
Huang, P.; Zhao, X.; Hu, M.; Tan, Z.; Xiao, W. T2-NER: A Two-Stage Span-Based Framework for Unified Named Entity Recognition with Templates. Trans. Assoc. Comput. Linguist. 2023, 11, 1265–1282.
Li, F.; Wang, Z.; Hui, S.C.; Liao, L.; Zhu, X.; Huang, H. A Segment Enhanced Span-Based Model for Nested Named Entity Recognition. Neurocomputing 2021, 465, 26–37.
Jiang, W. A Method for Ancient Book Named Entity Recognition Based on BERT-Global Pointer. Int. J. Comput. Sci. Inf. Technol. 2024, 2, 443–450.
Zhang, P.; Liang, W. Medical Name Entity Recognition Based on Lexical Enhancement and Global Pointer. Int. J. Adv. Comput. Sci. Appl. 2023, 14.
Zhang, X.; Luo, X.; Wu, J. A RoBERTa-GlobalPointer-Based Method for Named Entity Recognition of Legal Documents. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18 June 2023; pp. 1–8.
Cong, L.; Cui, R.; Dou, Z.; Huang, C.; Zhao, L.; Zhang, Y.; Chen, C.; Su, C.; Li, J.; Qu, C. Named Entity Recognition for Power Data Based on Lexical Enhancement and Global Pointer. In Proceedings of the Third International Conference on Electronic Information Engineering, Big Data, and Computer Technology (EIBDCT 2024), Beijing, China, 19 July 2024; Zhang, J., Sun, N., Eds.; p. 61.
Yadav, V.; Bethard, S. A Survey on Recent Advances in Named Entity Recognition from Deep Learning Models. arXiv 2019, arXiv:1910.11470.
Liu, Z.; Jiang, F.; Hu, Y.; Shi, C.; Fung, P. NER-BERT: A Pre-Trained Model for Low-Resource Entity Tagging. arXiv 2021, arXiv:2112.00405.
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv 2019, arXiv:1906.08237.
Zhang, N.; Jia, Q.; Yin, K.; Dong, L.; Gao, F.; Hua, N. Conceptualized Representation Learning for Chinese Biomedical Text Mining. arXiv 2020, arXiv:2008.10813.
Zhang, Z.; Han, X.; Liu, Z.; Jiang, X.; Sun, M.; Liu, Q. ERNIE: Enhanced Language Representation with Informative Entities. arXiv 2019, arXiv:1905.07129.
Ma, R.; Peng, M.; Zhang, Q.; Wei, Z.; Huang, X. Simplify the Usage of Lexicon in Chinese NER. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 28 July–2 August 2020; pp. 5951–5960.
Zhao, J.; Cui, M.; Gao, X.; Yan, S.; Ni, Q. Chinese Named Entity Recognition Based on BERT and Lexicon Enhancement. In Proceedings of the 2022 4th International Conference on Robotics, Intelligent Control and Artificial Intelligence, Dongguan, China, 16 December 2022; pp. 597–604.
Zhang, J.; Guo, M.; Geng, Y.; Li, M.; Zhang, Y.; Geng, N. Chinese Named Entity Recognition for Apple Diseases and Pests Based on Character Augmentation. Comput. Electron. Agric. 2021, 190, 106464.
Liu, Q.; Zhang, L.; Ren, G.; Zou, B. Research on Named Entity Recognition of Traditional Chinese Medicine Chest Discomfort Cases Incorporating Domain Vocabulary Features. Comput. Biol. Med. 2023, 166, 107466.
Sun, M.; Yang, Q.; Wang, H.; Pasquine, M.; Hameed, I.A. Learning the Morphological and Syntactic Grammars for Named Entity Recognition. Information 2022, 13, 49.
Tian, Y.; Shen, W.; Song, Y.; Xia, F.; He, M.; Li, K. Improving Biomedical Named Entity Recognition with Syntactic Information. BMC Bioinform. 2020, 21, 539.
Luoma, J.; Pyysalo, S. Exploring Cross-Sentence Contexts for Named Entity Recognition with BERT. arXiv 2020, arXiv:2006.01563.
Jia, B.; Wu, Z.; Wu, B.; Liu, Y.; Zhou, P. Enhanced Character Embedding for Chinese Named Entity Recognition. Meas. Control. 2020, 53, 1669–1681.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
Hongying, Z.; Wenxin, L.; Kunli, Z.; Yajuan, Y.; Baobao, C.; Zhifang, S. Building a Pediatric Medical Corpus: Word Segmentation and Named Entity Annotation. In Chinese Lexical Semantics; Liu, M., Kit, C., Su, Q., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2021; Volume 12278, pp. 652–664. ISBN 978-3-030-81196-9.
Xu, L.; Liu, W.; Li, L.; Liu, C.; Zhang, X. Cluener2020: Fine-Grained Named Entity Recognition Dataset and Benchmark for Chinese. arXiv 2020, arXiv:2001.04351.
Song, Y.; Shi, S.; Li, J.; Zhang, H. Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 175–180.
Zhang, Y.; Yang, J. Chinese NER Using Lattice LSTM. arXiv 2018, arXiv:1805.02023.
Li, X.; Yan, H.; Qiu, X.; Huang, X. FLAT: Chinese NER Using Flat-Lattice Transformer. arXiv 2020, arXiv:2004.11795.
Li, X.; Feng, J.; Meng, Y.; Han, Q.; Wu, F.; Li, J. A Unified MRC Framework for Named Entity Recognition. arXiv 2019, arXiv:1910.11476.
Cui, X.; Yang, Y.; Li, D.; Qu, X.; Yao, L.; Luo, S.; Song, C. Fusion of SoftLexicon and RoBERTa for Purpose-Driven Electronic Medical Record Named Entity Recognition. Appl. Sci. 2023, 13, 13296.

Figure 1. An example of nested entity annotation, with different colored lines representing different entities.

Figure 2. MFGFF-BiLSTM-EGP model framework.

Figure 3. Character encoder framework.

Figure 4. Word encoder framework.

Figure 5. Sentence encoder framework.

Figure 6. An example of EGP prediction for nested entities, where the end position of the entity part of the label is coded 1, and the non-entity part is coded 0.
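
One common way to realize the labeling illustrated in Figure 6, following the general global-pointer idea, is a per-type grid indexed by start and end token, where a cell is 1 when that span is an entity of that type and 0 otherwise. The sketch below assumes this layout; the function names, the type ids, and the example spans are hypothetical and are not taken from the paper.

```python
def encode_spans(seq_len, num_types, entities):
    """Build a num_types x seq_len x seq_len grid of 0/1 labels.

    entities is a list of (type_id, start, end) with an inclusive end index.
    """
    grid = [[[0] * seq_len for _ in range(seq_len)] for _ in range(num_types)]
    for type_id, start, end in entities:
        grid[type_id][start][end] = 1  # mark the (start, end) cell for this type
    return grid


def decode_spans(grid):
    """Recover (type_id, start, end) triples from every cell that equals 1."""
    spans = []
    for type_id, plane in enumerate(grid):
        for start, row in enumerate(plane):
            # Only cells with end >= start can describe a valid span.
            for end in range(start, len(row)):
                if row[end] == 1:
                    spans.append((type_id, start, end))
    return spans


# Hypothetical nested example: a type-0 span over tokens 0..5
# that contains a type-1 span over tokens 0..1.
entities = [(0, 0, 5), (1, 0, 1)]
grid = encode_spans(seq_len=6, num_types=2, entities=entities)
print(decode_spans(grid))
```

Because each entity type has its own grid, an inner span and the outer span that contains it occupy different cells and can both be labeled, which is why this encoding suits nested entities.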

Figure 7. Radar plots of the entity-level precision, recall, and F1 results of the MFGFF-BiLSTM-EGP model on the three datasets.

Figure 8. Visualization of the confusion matrix, where the ‘Other’ category represents missed classifications.

Figure 9. Visualization of token representations in feature space, using t-SNE for dimensionality reduction. Differently colored points represent labeled entities of different types. Word vector features and MFGFF features are visualized on the three datasets.

Figure 10. P, R, and F1 evaluation results of the fine-tuning method and MFGFF method on the CDNER and CMeEE V2 datasets.

Figure 11. Comparison of the evaluation results of 5 pre-trained models under the MFGFF-BiLSTM-EGP model.

Table 1. Examples of labels for five entity types.

Entity	Label	Examples	Count
Type	TY	Adult chickens, laying hens, flocks, sick chickens	4503
Disease	DI	Newcastle disease, avian influenza	4098
Symptom	SY	Clinical warming, edema, congestion	9265
Body part	BO	Head, eyes, liver, lymphocyte	5744
Drug	DR	Oxytetracycline, gentamycin, tetracycline	3201
Total			26,811

Table 2. Experimental parameters.

Hyperparameter	Value
Optimizer	Adam
Warm up	0.1
LSTM units	256
Batch size	32
Dropout	0.4
Max len	128
Epoch	20

Table 3. Comparison of the results of five commonly used models and our model on three datasets.

Model	CDNER P/%	CDNER R/%	CDNER F1/%	CMeEE V2 P/%	CMeEE V2 R/%	CMeEE V2 F1/%	CLUENER P/%	CLUENER R/%	CLUENER F1/%
Lattice LSTM	82.43	84.51	83.46	61.26	62.33	61.79	71.69	70.78	71.23
FLAT	80.55	82.99	81.75	58.18	61.11	59.61	64.64	68.45	66.49
BERT-CRF	87.86	89.53	88.69	70.93	73.24	71.98	79.15	81.86	80.48
BERT-MRC	80.16	85.91	82.93	68.00	68.35	67.97	75.60	78.22	76.89
SLRBC	88.58	90.72	89.64	71.54	72.59	72.06	81.55	80.51	81.03
MFGFF-BiLSTM-EGP (ours)	91.42	92.18	91.98	73.12	73.68	73.32	85.21	80.38	82.54

Table 4. Entity-level evaluation.

Data	Category	P/%	R/%	F1/%	Macro P/%	Macro R/%	Macro F1/%
CDNER	symptom	86.99	88.26	87.6	91.42	92.18	91.98
	body part	92.19	92.70	92.43
	disease	92.29	91.70	92.98
	drug	92.50	93.97	93.22
	type	93.11	94.25	93.68
CMeEE V2	bod	74.56	79.05	77.30	73.12	73.68	73.32
	dis	78.73	87.85	83.81
	sym	77.10	65.75	70.9
	ite	70.87	55.58	62.04
	dru	77.94	83.23	81.10
	pro	76.09	66.30	70.84
	mic	69.56	86.60	78.19
	dep	77.97	68.78	73.1
	equ	55.23	70.01	62.62
CLUENER	address	76.81	80.71	78.71	85.21	80.38	82.54
	name	92.67	84.76	88.54
	organization	84.52	84.02	84.27
	game	87.42	80.98	84.08
	scene	81.75	76.87	79.23
	book	88.04	73.64	80.20
	company	84.09	88.10	86.05
	position	86.92	81.88	84.33
	government	82.72	85.90	84.28
	movie	87.21	66.96	75.76
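
The macro columns in Table 4 appear to be the unweighted means of the per-category scores. A minimal sketch, assuming that reading, recomputes the CDNER macro values from the per-category rows; the function name `macro_average` and the dictionary layout are illustrative and not from the paper.

```python
# Per-category CDNER scores copied from Table 4: (P, R, F1) in percent.
cdner = {
    "symptom": (86.99, 88.26, 87.6),
    "body part": (92.19, 92.70, 92.43),
    "disease": (92.29, 91.70, 92.98),
    "drug": (92.50, 93.97, 93.22),
    "type": (93.11, 94.25, 93.68),
}


def macro_average(scores):
    """Return the unweighted mean of each metric across categories, rounded to 2 decimals."""
    n = len(scores)
    columns = zip(*scores.values())  # regroup as (all P), (all R), (all F1)
    return tuple(round(sum(col) / n, 2) for col in columns)


print(macro_average(cdner))  # (91.42, 92.18, 91.98)
```

Under this assumption the result matches the Macro P, Macro R, and Macro F1 entries reported for CDNER. Macro F1 here is the mean of the per-category F1 scores, so it need not equal the harmonic mean of Macro P and Macro R.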

Table 5. Ablation study.

Model	Pre-Trained Model	Word Encoder	Sentence Encoder	F1/%
1	✓			88.01
2		✓		82.32
3		✓	✓	82.67
4	✓	✓		91.33
5	✓		✓	88.34
6	✓	✓	✓	91.98

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
