Reassembling Fragmented Entity Names: A Novel Model for Chinese Compound Noun Processing

Pan, Yuze; Fu, Xiaofeng

doi:10.3390/electronics12204251


by Yuze Pan ¹ and Xiaofeng Fu ²,*

¹ School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China

² School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China

* Author to whom correspondence should be addressed.

Electronics 2023, 12(20), 4251; https://doi.org/10.3390/electronics12204251

Submission received: 16 September 2023 / Revised: 11 October 2023 / Accepted: 12 October 2023 / Published: 14 October 2023

(This article belongs to the Section Artificial Intelligence)


Abstract

In the process of classifying intelligent assets, we encountered challenges with a limited dataset dominated by complex compound noun phrases. Training classifiers directly on this dataset posed risks of overfitting and potential misinterpretations due to inherent ambiguities in these phrases. Recognizing the gap in the current literature for tailored methods addressing this challenge, this paper introduces a refined approach for the accurate extraction of entity names from such structures. We leveraged the Chinese pre-trained BERT model combined with an attention mechanism, ensuring precise interpretation of each token’s significance. This was followed by employing both a multi-layer perceptron (MLP) and an LSTM-based Sequence Parsing Model, tailored for sequence annotation and rule-based parsing. With the aid of a rule-driven decoder, we reconstructed comprehensive entity names. Our approach adeptly extracts structurally coherent entity names from fragmented compound noun phrases. Experiments on a manually annotated dataset of compound noun phrases demonstrate that our model consistently outperforms rival methodologies. These results compellingly validate our method’s superiority in extracting entity names from compound noun phrases.

Keywords: compound noun phrases; entity name extraction; fragmentation; sequence labeling; sequence parsing

1. Introduction

In recent times, significant advancements in deep learning and the emergence of pre-trained models have substantially propelled research and practical applications within the domain of natural language processing (NLP). NLP research centered on the Chinese language presents significant challenges, attributed to its inherent complexity and ambiguities [1,2,3,4,5]. Notably, within the sector of intelligent asset classification, updates such as the ‘Standard Classification & Codes for Fixed Assets’ require periodic reclassification of extant assets. Historically, this endeavor has been labor-intensive. However, there is an increasing interest in automating the classification of such assets. The primary goal is the autonomous reclassification of assets upon feeding the classification norms and current asset specifics into computational systems.

In the realm of intelligent asset classification, two primary challenges confront researchers, namely the lack of expansive datasets and the frequency of compound noun phrases with significant complexities. To harness the potential of available data and mitigate ambiguities inherent to compound noun phrases, it becomes imperative to extract comprehensive entities from these phrases. Such extraction could augment the dataset’s scope and address its latent ambiguities. The demanding nature of this task is emphasized by information fragmentation and the variety of entity name combinations.

The seamless integration of fragmented data into cohesive structures is pivotal for a spectrum of practical applications. Examples include product categorization on e-commerce interfaces and product querying in intelligent assistants [6,7,8], with a particular emphasis on intelligent asset classification in asset management platforms. In intelligent asset classification, the main challenge is using computational methods to match assets with predefined standards to determine their categories. The inherent ambiguity in compound noun phrases can lead to classification errors. To improve accuracy, it is essential to extract fully structured entity names from these compound noun phrases, avoiding potential issues from this ambiguity. The methodology of carefully extracting and reconstituting complete entities from fragmented datasets might find relevance across various sectors. For example, in the field of bioinformatics, the reassembly of fragmented genetic sequences could lead to a more holistic understanding of genomes. Similarly, in Geographic Information Systems (GISs), the integration of piecemeal geographic data could facilitate the formulation of refined maps and geographical constructs. Such multidisciplinary implications underline the potential widespread impact of our methodology across sophisticated data mining and integration ventures.

The task of deriving complete entity names from compound noun phrases transcends a mere sequential tagging challenge. Given the nuances and multifaceted phrase combinations intrinsic to the Chinese language, conventional methodologies [9,10,11] are markedly suboptimal.

To navigate this problem, our research delineates a sequence tagging methodology rooted in a pre-trained BERT model [12] combined with an LSTM-centric Sequence Parsing Model. This integration efficiently discerns and extracts comprehensive entity names from Chinese compound noun phrases. A noteworthy aspect of our methodology is its training on a limited dataset, which underscores not only its cost-efficiency but also its efficacy and robustness under data-restricted conditions. The pre-trained BERT model, after meticulous refinement, was calibrated to align with the unique demands of this task. The Sequence Parsing Model, capitalizing on its adept contextual modeling capabilities combined with customized decoding strategies, meticulously discerned the entirety of potential entity names embedded within compound noun phrases.

A meticulous review of the existing literature revealed a dearth of concerted efforts dedicated to the comprehensive extraction of entity names from Chinese compound noun phrases. Consequently, the objectives and the methodological underpinnings of our investigation appear to be innovative in this specific domain.

2. Related Work

2.1. Pre-Trained Models in Sequence Labeling Tasks

In the recent developments of natural language processing (NLP), pre-trained models have emerged as major advancements, achieving significant results across myriad tasks. These models, trained on expansive unlabeled textual corpora, have been capable of internalizing intricate language representations. Specifically, they demonstrate a notable ability for understanding complex interactions of semantic nuances, lexemes, and their encompassing contexts, laying the foundation for their unparalleled efficacy in a variety of downstream NLP tasks.

A significant field wherein researchers have extensively harnessed pre-trained models is sequence labeling. The paradigm typically entails fine-tuning these models to optimize their performance metrics on bespoke tasks. Among the range of pre-trained models, BERT (Bidirectional Encoder Representations from Transformers) is prominent [13]. Anchored in a bidirectional Transformer architecture [14], BERT is efficient in synthesizing both word-level and context-level semantic representations. For the Chinese-specific pre-training of the BERT model, data from the Chinese Wikipedia serve as the primary source [12]. During its training phase, the model employs the Masked Language Model (MLM) approach: words from input sentences are randomly masked, prompting the model to predict them. For instance, in the sentence “I like to eat [MASK].”, the model is tasked with predicting the word in the [MASK] position, such as “apple”. Additionally, BERT incorporates the Next Sentence Prediction (NSP) task, which ascertains the coherence between two sentences by determining if the latter naturally follows the former. Through training on Chinese Wikipedia articles, the BERT model has achieved a profound understanding of the Chinese language. This is primarily achieved through the innovative mechanism of predicting occluded words within sequences. By virtue of self-supervised learning on vast, unlabeled text corpora, BERT has not only established its significance in NLP but also recalibrated benchmarks across a multitude of datasets [15,16].

The appeal of BERT in sequence labeling has inspired a slew of research endeavors. For instance, the authors of [17] proposed a sequence labeling architecture supported by BERT, seamlessly amalgamating Bi-LSTM and CRF strata to the foundational BERT layers. This combination harnesses the strengths of contextual data and label relationalities, culminating in impressive results. Building on this, the authors of [18] integrated the BERT architecture with an attention mechanism, bolstering its prowess in contextual information modeling.

2.2. Sequence Parsing Models

2.2.1. Deep Learning-Based Sequence Parsing Models

Recent advances in computing power and deep learning have increased research interest in using deep neural networks to parse annotated sequences and extract entity names. In particular, sequence parsing architectures based on Recurrent Neural Networks (RNNs) and their successor, Long Short-Term Memory networks (LSTMs), have proven effective for such tasks, as supported by numerous empirical studies [19,20,21]. These architectures can capture both semantic nuances and contextual relationships within sequences, which enables them to identify entity names from the model’s hidden states and outputs. One example is the BiLSTM-CRF model, which combines a bidirectional LSTM with Conditional Random Fields and has achieved strong results in entity name parsing and extraction [22,23].

Yet when datasets are small, simplifying the model’s input and output becomes important. A lack of data risks reducing the model’s generalizability, so model complexity must be balanced against the ability to learn complex input–output patterns. Our methodology therefore uses a simple yet effective input–output design tailored to limited datasets while maintaining precision in entity name extraction.

2.2.2. Rule-Based Sequence Parsing Model

The rule-based sequence parsing paradigm primarily addresses the intricate challenges of sequential data processing within specialized domains. Its foundational architecture is based on carefully crafted rules, known for its precision and potent interpretability. Such models predominantly find their niche in fields requiring granular parsing of specific patterns and architectures, including but not limited to natural language processing, bioinformatics, and the like.

Earlier studies show the widespread use of rule-based Sequence Parsing Models for extracting structured information from complex sequence data. For instance, in natural language processing, researchers have used rule-based approaches to identify entities and the relationships among them within textual corpora [24,25]. Similarly, in bioinformatics, such models have been instrumental in delineating gene structures from complex DNA sequences [26].

Despite their precision and transparency, rule-based models have clear limitations. Chief among them is the need for manual adjustment whenever the input data evolve or new tasks must be supported, which is labor-intensive.

Yet, in the range of computational models, the rule-based sequence parsing paradigm maintains a distinct advantage in select domains [27], particularly where the emphasis is on interpretative clarity and high precision. In practical applications, these models often find themselves synergistically paired with statistical or machine learning methodologies, aiming for better results. Thus, such models undeniably deserve our careful consideration and judicious application.

3. Methods

3.1. Task Description

This research aims to examine selected Chinese compound noun phrases, with the main goal of algorithmically extracting complete entity names via model-based parsing. Typically, compound noun phrases are combinations of multiple lexical units, each with distinct meanings. Our primary goal lies in using effective algorithms to identify and extract every complete entity name within these phrases. Table 1 presents examples of entity name extractions.

The main challenge of this research is the limited scale of the dataset available for training. Directly deploying complex deep learning architectures on such limited data may cause overfitting. To counteract this, our methodology integrates the pre-trained Chinese BERT model with a deep learning sequence parser, supplemented by rule-based decoding. By simplifying the inputs and outputs of the deep learning model and then applying a rule-based decoder to parse the streamlined model output, we adapt to the constraints of the small dataset while maintaining parsing precision.

The main goal of this endeavor is the accurate extraction of entity names embedded within compound noun phrases, thereby enriching the semantic information captured. Such advancements will provide a foundation, supporting both research and practical uses across pertinent domains. As an example, in the domain of asset classification, our method can expand the classification criteria, populated with varied compound noun phrases, into a more expansive and trustworthy dataset. This, in turn, ensures increased accuracy in the concluding asset classification tasks.

3.2. Model Architecture

3.2.1. Overall Model Overview

The entity name extraction framework described in this research includes several key components, carefully designed to identify and extract entity names embedded within compound noun phrases with precision.

Sequence Tagging Part:

(1): Utilization of the Pre-trained Chinese BERT Model with Fine-tuning: we employ the pre-trained Chinese BERT model, combined with character-level segmentation, an attention mechanism, and a multi-layer perceptron classifier, to enable accurate sequence tagging for each character embedded within the compound noun phrases.
(2): Attention Strategy: we employ a multi-layer perceptron to compute the attention weights associated with each character, fusing the character’s positional data with its contextual representation and thereby improving the effectiveness of sequence tagging.
(3): Character Classification Module (MLP): a multi-layer perceptron serves as a ternary classifier, designed to predict potential categories for each character, resulting in the annotated sequence of the compound noun phrases.

Sequence Parsing Part:

(1): Sequence Parsing Framework: The sequence tagging component provides annotations for characters within the compound noun phrases. Subsequent sequence parsing refines these annotations, converting them into a consolidated sequence using a pre-trained Seq2Seq model.
(2): Decoding Approach: through a cohesive combination of sequence tagging, sequence parsing, and entity name decoding, the desired entity names are delineated from the compound noun phrases.

By integrating these components, our model effectively extracts entity names from compound noun phrases. The system balances decoding complexity with data processing, which makes it practical for real applications. The detailed architecture is depicted in Figure 1.

In subsequent sections, we will discuss the details and implementation principles of these components.

3.2.2. Sequence Tagging Part

(1) Pre-trained Chinese BERT Model and Its Fine-tuning:

This research uses the BERT (Bidirectional Encoder Representations from Transformers) model as its foundational architecture. BERT is distinguished for its significant results in natural language processing (NLP) tasks, largely attributed to its self-supervised learning and profound semantic modeling capabilities. Its encoder design incorporates multiple layers of self-attention mechanisms intertwined with feed-forward neural networks, thereby understanding global semantics.

In our sequence tagging task, detailed examination is applied to every character within compound noun phrases. The initial step involves segmenting these phrases on a character-by-character basis and embedding them within the pre-trained BERT model. Subsequently, the attention mechanism used in this research dynamically weights each position using the positional data of the predicting character, as described in [28]. This process combines the hidden state vectors into a richer representation. Through a multi-layer perceptron (MLP) classifier, labels are then assigned to the character in focus, resulting in precise sequence annotation.

This methodology not only benefits from the pre-training of the BERT model on voluminous unlabeled text corpora but also improves labeling accuracy for individual characters within compound noun phrases by combining the attention mechanism with an MLP classifier. The workflow pairs the strong semantic extraction of pre-trained models with an efficient attention distribution process, facilitating the assignment of exact labels to each character within the phrases. A visualization of this architecture can be found in Figure 2.

Our design effectively combines BERT’s pre-training advantages with fine-tuning strategies. This, when paired with the attention mechanism and an MLP classifier, culminates in an optimized and precision-oriented framework for sequence labeling of compound noun phrases.

(2) Attention Mechanism:

In the model described in this paper, the prediction process for every character uses specific attention weights. These weights are determined by processing the output sequence obtained from the pre-trained BERT model.

To ensure accurate predictions of individual characters within compound noun phrases, it is essential to understand the full contextual meaning. While traditional attention mechanisms are commonly used in models, they may not be detailed enough for specific tasks, especially those focused on character-level predictions. In such tasks, it is vital to not only understand the general meaning but also to consider the positional importance of the currently predicted character. To address this, we propose a new attention mechanism based on a multi-layer perceptron (MLP) that combines the positional data of the character being considered. A key advantage of this method is its ability to dynamically assign appropriate contextual weights to each predicted character, improving prediction accuracy.

Initially, the hidden state vectors tied to the “[CLS]” and “[SEP]” tokens are removed from the BERT’s output sequence, preserving solely the hidden state vectors corresponding to each character within the compound noun phrases. In this study, our sequence tagging model is designed to predict an individual character within the sequence during each iteration. Therefore, the attention mechanism focuses on the character being predicted at each step. To clearly determine the tagging position of the character predicted during each iteration, we use the position embedding technique from the Transformer architecture [14]. This technique gives a specific vector representation to each character that is to be predicted, combining it with the BERT output vector. This combination captures both the positional details and the general character context. The formulation for the positional encoding is provided in the following equation:

\begin{array}{l} PE_{(pos, 2i)} = \sin (pos / 10000^{2i / d_{model}}) \\ PE_{(pos, 2i + 1)} = \cos (pos / 10000^{2i / d_{model}}) \end{array}

(1)

where $pos$ denotes the position of the character targeted for prediction within the sequence, $i$ represents the index of the encoding dimension, and $d_{model}$ corresponds to the model’s dimensionality. Unlike the Transformer’s conventional position embedding approach, our model applies identical positional encoding information across all output vectors for a single prediction, representing the positional data of the character being predicted.
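As a minimal sketch of Equation (1), the following Python computes the encoding of the single target position and applies it to every BERT output vector. The helper names `positional_encoding` and `add_target_position`, the toy dimensionality, and the toy input are hypothetical. The text does not say how the two vectors are combined; element-wise addition is assumed here because the combined representation keeps the dimensionality $d_{model}$.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position, following Equation (1)."""
    pe = [0.0] * d_model
    for k in range(0, d_model, 2):  # k plays the role of 2i
        angle = pos / (10000 ** (k / d_model))
        pe[k] = math.sin(angle)
        if k + 1 < d_model:
            pe[k + 1] = math.cos(angle)
    return pe

def add_target_position(bert_rows, target_pos):
    """Add the SAME encoding (that of target_pos) to every row.

    Assumption: the combination is an element-wise sum.
    """
    d_model = len(bert_rows[0])
    pe = positional_encoding(target_pos, d_model)
    return [[a + p for a, p in zip(row, pe)] for row in bert_rows]

# Toy stand-in for the BERT output: 3 characters, d_model = 4.
A = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
theta = add_target_position(A, target_pos=0)
```

Because the encoding depends only on the target position, every row receives the same offset, which is what distinguishes this scheme from the per-token encoding of the standard Transformer.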

As illustrated in Figure 3, the combined vector is fed into the multi-layer perceptron, producing an attention weight for each character. These attention weights are then used to reduce BERT’s two-dimensional contextual representation matrix to a one-dimensional vector. Each row in this matrix corresponds to a character within the compound noun phrase and contains its contextual data. Multiplying each row by its attention weight and summing the results yields a combined representation, $V = \sum_{i = 1}^{n} w_{i} \cdot a_{i}$. Here, $a_{i}$ is the BERT output vector for the $i$-th character of the compound noun phrase, $w_{i}$ is the attention weight generated for it by the attention mechanism, and $n$ is the number of characters.

Given the custom attention mechanism used in this study to combine the hidden state vectors, we provide a detailed explanation of this mechanism in the following sections.

Let $A \in \mathbb{R}^{n \times d_{model}}$ be the semantic matrix from BERT, represented by $A = {[\begin{matrix} a_{1} & a_{2} & \dots & a_{n} \end{matrix}]}^{T}$. By combining this matrix with the positional encoding of the character currently being predicted, we produce a semantic representation, $\Theta \in \mathbb{R}^{n \times d_{model}}$, which includes the immediate prediction position and is given by $\Theta = {[\begin{matrix} \theta_{1} & \theta_{2} & \dots & \theta_{n} \end{matrix}]}^{T}$. For each component $\theta_{i}$, its transformation after processing by the Attention MLP is as follows:

h_{i} = \sigma \{U_{3}^{T} \cdot \sigma [U_{2}^{T} \cdot \sigma (U_{1}^{T} \cdot \theta_{i}^{T} + b_{1}) + b_{2}] + b_{3}\}

(2)

where $U_{i}$ are weight matrices, with $U_{1} \in \mathbb{R}^{d_{model} \times 256}$, $U_{2} \in \mathbb{R}^{256 \times 64}$, and $U_{3} \in \mathbb{R}^{64 \times 1}$; $b_{i}$ are biases, with $b_{1} \in \mathbb{R}^{256}$ and $b_{2} \in \mathbb{R}^{64}$; $\sigma (x) = \frac{1}{1 + e^{- x}}$ is the sigmoid activation function; and $e$ is the base of natural logarithms.
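The scoring function of Equation (2) can be sketched in plain Python as three stacked sigmoid layers that end in a single scalar. This is an illustration under stated assumptions, not the authors’ implementation: the names `dense_sigmoid`, `attention_score`, and `init_params` are hypothetical, the weights are random rather than learned, and toy layer sizes (4, 8, 4, 1) stand in for $d_{model}$, 256, 64, and 1.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense_sigmoid(x, U, b):
    """Compute sigma(U^T x + b) for U of shape (len(x), len(b))."""
    return [
        sigmoid(sum(U[r][c] * x[r] for r in range(len(x))) + b[c])
        for c in range(len(b))
    ]

def attention_score(theta_i, params):
    """Apply the stacked sigmoid layers; the last layer has one output, h_i."""
    h = theta_i
    for U, b in params:
        h = dense_sigmoid(h, U, b)
    return h[0]

def init_params(sizes, seed=0):
    """Random toy weights (assumption: real weights are learned in training)."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        U = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]
        params.append((U, [0.0] * n_out))
    return params

# Toy sizes stand in for d_model -> 256 -> 64 -> 1.
params = init_params([4, 8, 4, 1])
theta = [[0.0, 1.0, 0.0, 1.0], [1.0, 2.0, 1.0, 2.0], [2.0, 3.0, 2.0, 3.0]]
H = [attention_score(row, params) for row in theta]  # one score per character
```

Since the final activation is a sigmoid, every score lies strictly between 0 and 1 before normalization.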

After processing matrix $\Theta$ through the Attention MLP, the output is $H = [\begin{matrix} h_{1} & h_{2} & \dots & h_{n} \end{matrix}]$. Using this attention mechanism to obtain a one-dimensional representation of the semantic matrix, we get the following result:

V = W \cdot A

(3)

where $W$ represents the attention weights:

W = [\begin{matrix} \frac{e^{h_{1}}}{\sum_{i = 1}^{n} e^{h_{i}}} & \frac{e^{h_{2}}}{\sum_{i = 1}^{n} e^{h_{i}}} & \dots & \frac{e^{h_{n}}}{\sum_{i = 1}^{n} e^{h_{i}}} \end{matrix}]

(4)
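Equations (3) and (4) amount to a softmax over the scores followed by a weighted sum of the rows of $A$. The sketch below shows this on toy numbers; the function names `softmax` and `attention_pool` and the input values are hypothetical.

```python
import math

def softmax(scores):
    """Equation (4): normalize the scores h_1..h_n into weights that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(H, A):
    """Equation (3): V = W . A, a weighted sum of the rows of A."""
    W = softmax(H)
    d = len(A[0])
    return [sum(W[i] * A[i][j] for i in range(len(A))) for j in range(d)]

# Equal scores give equal weights, so V is the plain average of the rows.
H = [0.5, 0.5, 0.5]
A = [[0.0, 3.0], [3.0, 0.0], [6.0, 3.0]]
V = attention_pool(H, A)
```

The result has length $d_{model}$ regardless of how many characters the phrase contains, which is what lets a fixed-size classifier consume it.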

In summary, our model computes attention weights for every character by excluding specific tokens, incorporating positional embeddings, compacting the BERT output leveraging the attention weights, and utilizing a multi-layer perceptron. This architecture efficiently uses positional data, augmenting the efficacy of sequence labeling endeavors.

(3) Classification Using Multilayer Perceptron:

In this phase, the hidden state vector, denoted as $V$, is input into a multi-layer perceptron (MLP), which acts as a ternary classifier to predict the category of the character in question. The output tags correspond to three distinct categories: ‘g’, ‘z’, and ‘f’. These categories, respectively, signify characters shared across multiple entity names, characters exclusive to a single entity name, and characters serving as separators. The MLP maps the hidden state vector into an appropriate feature space, using a series of non-linear transformations combined with weight matrix multiplications. This optimizes the classification of character classes [29].

z^{(l + 1)} = \sigma (U^{(l)} z^{(l)} + b^{(l)})

(5)

where $z^{(l)}$ denotes the hidden state vector at the $l$-th layer of the MLP, $U^{(l)}$ is the corresponding weight matrix within the MLP, and $b^{(l)}$ represents the bias vector. The symbol $\sigma$ designates the activation function. By stacking multiple hidden layers, the MLP can extract more intricate feature representations. Subsequently, the classifier in the output layer predicts character categories.
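A minimal sketch of the layer update in Equation (5) and of the three-way decision is given below. The names `mlp_layer` and `classify` and the hand-picked toy weights are hypothetical. The text does not specify the output-layer activation or the decision rule, so a sigmoid output followed by an argmax over the three tags is assumed.

```python
import math

LABELS = ["g", "z", "f"]  # shared, exclusive to one entity, separator

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_layer(z, U, b):
    """Equation (5): z_next = sigma(U z + b), with U of shape (out_dim, in_dim)."""
    return [
        sigmoid(sum(u * v for u, v in zip(row, z)) + bias)
        for row, bias in zip(U, b)
    ]

def classify(V, layers):
    """Run V through the layers; the last layer must have three outputs."""
    z = V
    for U, b in layers:
        z = mlp_layer(z, U, b)
    best = max(range(len(LABELS)), key=lambda k: z[k])  # assumed argmax rule
    return LABELS[best]

# Hand-picked toy weights for a single 2 -> 3 layer (not learned values).
layers = [([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0])]
tag_a = classify([2.0, -1.0], layers)
tag_b = classify([-1.0, 2.0], layers)
tag_c = classify([-2.0, -2.0], layers)
```

Running the classifier once per character position, each time with the pooled vector for that position, yields the annotated sequence for the whole phrase.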

3.2.3. Sequence Parsing Section

(1) Sequence Parsing Model

Given the constrained size of our dataset, we tailored the input–output structure to capture all the entity name information within the compound noun phrases in the most succinct format. Specifically, the input sequence is reduced to two basic elements, “g” and “z”. Each element in the simplified sequence corresponds to a segment of characters from the original sequence, as illustrated in Figure 4. The output sequence is built from six possible combinations of entity nouns in the compound noun phrase: “z”, “g”, “zg”, “gz”, “gzg”, and “zgz”. Subsequent extraction is handled by a rule-based decoder that uses the compound noun phrases and their respective annotation sequences to extract the complete entity names from the model’s combinatorial outputs.

Initially, as illustrated in Figure 4, we encode the characters embedded within the compound noun phrase according to their designated categories. This category sequence is then streamlined to accommodate the model’s input requirements. Post this step, we deploy an LSTM-based Seq2Seq model (as visualized in Figure 5) to predict the patterns of character amalgamation [19].
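The streamlining step can be sketched as a run-length collapse of the tag sequence. This is a hypothetical reading, since Figure 4 is not reproduced here: it assumes that each run of identical tags becomes one symbol and that separator runs (‘f’) are dropped so that only ‘g’ and ‘z’ remain. The function names `segment_labels` and `simplify` are illustrative.

```python
from itertools import groupby

def segment_labels(phrase, labels):
    """Split the phrase into runs of identical tags: [(tag, text), ...]."""
    if len(phrase) != len(labels):
        raise ValueError("phrase and labels must have the same length")
    segments = []
    pos = 0
    for label, run in groupby(labels):
        n = len(list(run))
        segments.append((label, phrase[pos:pos + n]))
        pos += n
    return segments

def simplify(labels):
    """Collapse each run to one symbol and drop separators (assumed rule)."""
    return "".join(label for label, _ in groupby(labels) if label != "f")

segments = segment_labels("半导体直、变流设备", "gggzfzggg")
compact = simplify("gggzfzggg")
```

Under these assumptions the nine-character tag sequence shrinks to four symbols, each tied to one segment of the original phrase, which keeps the parser’s input vocabulary and sequence length very small.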

After the model’s prediction, a rule-based decoding strategy is employed to derive the comprehensive entity names from the compound noun phrases. This approach translates the model’s streamlined output to produce the entirety of the entity names embedded within the compound noun phrases. Table 2 presents several examples of Sequence Labeling and Sequence Parsing.

(2) Decoding Strategy

The decoding strategy is pivotal in the extraction of entity names from compound noun phrases. Starting with the outputs from both the Sequence Labeling and Sequence Parsing Models, it thoroughly extracts the requisite entity names employing a structured set of rules.

The decoding strategy takes the outputs of both the Sequence Labeling and Sequence Parsing Models as input. For example, the input phrase “半导体直、变流设备” (translated as “semiconductor DC and converter equipment”) yields a sequence labeling outcome of ‘gggzfzggg’. Further processing by the Sequence Parsing Model refines this to ‘gzg, gzg’.

(a) Decoding Mechanism for Entity Composition:

Each symbol output by the Sequence Parsing Model (such as ‘g’ or ‘z’) corresponds to a specific section of the original compound noun phrase. For the phrase “半导体直、变流设备” (translated as “semiconductor DC and converter equipment”), the Sequence Parsing Model produces ‘gzg, gzg’. The first ‘gzg’ pattern denotes the structure of the first entity within the compound noun phrase. Its two ‘g’ symbols align with the leading ‘ggg’ tags for “半导体” and the trailing ‘ggg’ tags for “流设备” in the sequence labeling results. A ‘z’ refers to characters specific to one entity, so the ‘z’ in the first ‘gzg’ matches the first ‘z’ label from sequence tagging, representing “直”.

(b) Entity Name Reconstruction:

Sequential Reconstruction via Parsing Model Output: Using the patterns produced by the Sequence Parsing Model, we reconstruct segments from the original compound noun phrase. For the compound noun phrase “半导体直、变流设备” (translated as “semiconductor DC and converter equipment”), the parsed output “gzg, gzg” indicates that the first combination aligns sequentially with segments “半导体”, “直”, and “流设备”. Following this arrangement, we derive the entity “半导体直流设备” (translated as “semiconductor DC Equipment”) as the primary representation.

Entity Extraction via Iterative Decoding: Using iterative decoding across the full combinatorial sequence allows for the extraction of multiple entity names. Referring again to “半导体直、变流设备” and using the decoding strategy, we can extract the entities “半导体直流设备” (translated as “semiconductor DC Equipment”) and “半导体变流设备” (translated as “semiconductor converter equipment”).
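The worked example above can be reproduced with a small rule-based decoder. The sketch below is illustrative only: the function name `decode_entities` is hypothetical, and the two matching rules it uses are assumptions inferred from the example rather than the full rule set of the paper. Shared (‘g’) segments are matched in order and restart for every entity, while exclusive (‘z’) segments are consumed once, from left to right.

```python
from itertools import groupby

def decode_entities(phrase, labels, parsed):
    """Rebuild entity names from a parse such as 'gzg, gzg' (illustrative rules)."""
    runs = {"g": [], "z": []}
    pos = 0
    for label, run in groupby(labels):
        n = len(list(run))
        if label in runs:  # separator runs ('f') are skipped
            runs[label].append(phrase[pos:pos + n])
        pos += n

    entities = []
    z_used = 0  # exclusive segments are consumed once, left to right
    for pattern in (p.strip() for p in parsed.split(",")):
        g_used = 0  # shared segments restart for every entity
        parts = []
        for symbol in pattern:
            if symbol == "g":
                parts.append(runs["g"][g_used])
                g_used += 1
            elif symbol == "z":
                parts.append(runs["z"][z_used])
                z_used += 1
            else:
                raise ValueError(f"unexpected symbol {symbol!r}")
        entities.append("".join(parts))
    return entities

names = decode_entities("半导体直、变流设备", "gggzfzggg", "gzg, gzg")
```

With these rules, `names` contains “半导体直流设备” followed by “半导体变流设备”, matching the two entities given in the text. Phrases whose shared segments must be matched out of order would need additional rules that this sketch does not cover.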

The decoding strategy introduced in this study presents a notable advantage: it allows the Sequence Parsing Model to operate on streamlined sequences, substantially reducing the task’s complexity. As a consequence, even when working with a limited dataset, we are able to train a dependable model that delivers effective entity extraction outcomes. This adaptability is particularly crucial for smaller datasets, enabling us to attain high-precision entity extraction with reduced resource expenditure.

In essence, our approach facilitates the efficient extraction of entity names from compound noun phrases within the constraints of a smaller dataset. While the methodology does involve data streamlining and an intricate decoding strategy, it results in the development of a model adept at producing entity names, thereby providing significant value for tasks in entity recognition and generation.

4. Experiments

4.1. Dataset

In this study, we address the intricacies of asset classification using the 《固定资产等资产基础分类与代码》, known in English as the ‘Standard Classification and Codes for Fixed Assets and Related Assets’, and subsequently referred to as the ‘Classification Standard’. Based on this standard, we compiled a dataset encompassing all 212 compound noun phrases. We used this dataset to study the challenge of systematically extracting complete entity names from these compound phrases. The effectiveness of our approach was rigorously ascertained, ensuring that our findings retain their relevance for practical applications.

4.2. Evaluation Metrics

For the sequence labeling task, the F1 score is utilized as the primary evaluation metric. This metric integrates precision and recall, facilitating a measure of both the accuracy and coverage of each character label. Furthermore, we use the Sequence Accuracy (SA) as a metric to assess the model’s accuracy in predicting complete sequences.

SA = \frac{1}{|S|} \sum_{i = 1}^{|S|} I (i)

(6)

where $|S|$ denotes the size of the compound noun phrase set $S$ and $I (i)$ is an indicator function. Assuming the true label for the $i$-th sequence is $y^{(i)} = [y_{1}^{(i)}, y_{2}^{(i)}, \dots, y_{T_{i}}^{(i)}]$, where $T_{i}$ is the length of the $i$-th sequence, and the model’s predictive output is ${\hat{y}}^{(i)} = [{\hat{y}}_{1}^{(i)}, {\hat{y}}_{2}^{(i)}, \dots, {\hat{y}}_{T_{i}}^{(i)}]$, $I (i)$ is defined as:

I (i) = \begin{cases} 1 & \text{if } y_{t}^{(i)} = {\hat{y}}_{t}^{(i)}, \ \forall t \in \{1, \dots, T_{i}\} \\ 0 & \text{otherwise} \end{cases}

(7)
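Equations (6) and (7) translate directly into a short function: a sequence counts as correct only if every tag matches. The function name `sequence_accuracy` and the second toy tag sequence are hypothetical; the first is the tag sequence from the worked example.

```python
def sequence_accuracy(gold, predicted):
    """Equation (6): share of sequences whose tags all match (Equation (7))."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of sequences")
    if not gold:
        raise ValueError("the phrase set S must not be empty")
    hits = sum(1 for y, y_hat in zip(gold, predicted) if list(y) == list(y_hat))
    return hits / len(gold)

gold = ["gggzfzggg", "zfzgg"]
pred = ["gggzfzggg", "zfzzg"]  # one wrong tag in the second sequence
sa = sequence_accuracy(gold, pred)
```

A single wrong tag makes the whole second sequence count as incorrect, so SA is stricter than a per-character score such as F1.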

For predicting combination methods for complete entity names, we employ the BLEU (Bilingual Evaluation Understudy) score [30], a recognized standard for machine translation evaluation. The strength of the BLEU score is its ability to compare generated sequences with reference sequences. In this context, this comparison involves evaluating predicted combination methods against the original complete entity names, providing insight into the model’s precision in generation. Let $c$ be the predicted combination length from the Sequence Parsing Model; $r$ be the actual combination length; $N$ be the maximum n-gram order, set to 2 in this case; $\alpha_{n}$ represent the weight of the n-gram, with $\alpha_{1} = \alpha_{2} = 0.5$; and $R_{n}$ be the trimmed accuracies for 1-grams and 2-grams, respectively. The penalty for sentence length, $BP$, is calculated as follows:

BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - \frac{r}{c})} & \text{otherwise} \end{cases}

(8)

The trimmed accuracy of the n-gram is:

R_{n} = \frac{Number of n - grams in candidate that match any reference}{Total number of n - grams in candidate}

(9)

Ultimately, the BLEU score is:

B L E U = B P \times \exp (\sum_{n = 1}^{N} α_{n} \log R_{n})

(10)
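The following is a minimal sketch of Equations (8) to (10) with $N = 2$ and equal weights. It assumes character-level tokens, a single reference, clipped n-gram counts in the style of standard BLEU (our reading of "trimmed"), and a score of 0 whenever some $R_{n}$ is 0; none of these choices is confirmed by the text, and the helper names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def trimmed_accuracy(candidate, reference, n):
    """R_n: share of candidate n-grams found in the reference, with clipped counts."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def bleu(candidate, reference, weights=(0.5, 0.5)):
    """BLEU = BP * exp(sum_n alpha_n * log R_n), following Eqs. (8)-(10)."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    bp = 1.0 if c > r else math.exp(1 - r / c)  # length penalty, Eq. (8)
    log_sum = 0.0
    for n, alpha in enumerate(weights, start=1):
        r_n = trimmed_accuracy(candidate, reference, n)
        if r_n == 0.0:
            return 0.0  # log(0) is undefined; treated as a zero score (assumption)
        log_sum += alpha * math.log(r_n)
    return bp * math.exp(log_sum)


# Parsing outputs in the style of Table 2, split into characters
perfect = bleu(list("zg,zg"), list("zg,zg"))
partial = bleu(list("z,z"), list("zg,zg"))
```

In the second call the candidate is shorter than the reference ($c = 3$, $r = 5$), so the length penalty $e^{1 - 5/3}$ applies on top of the reduced 2-gram accuracy of $1/2$.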

To comprehensively assess the model’s ability in deriving and extracting entity names from compound noun phrases, this study presents three novel evaluation metrics:

(1) Entity Name Extraction Coverage Rate (ENEC)

Consider a set $S$ of compound noun phrases. For each phrase $s_{i}$ within this set, assume it encapsulates $n_{actual,i}$ entity names (hereafter denoted as $n_{a,i}$). If the model successfully extracts $m_{i}$ entity names out of these, the completeness of entity name extraction for the given compound noun phrase can be articulated as:

$$ENEC_{i} = \frac{m_{i}}{n_{a,i}} \tag{11}$$

The global extraction coverage metric, represented as ENEC, is defined as follows:

$$ENEC = \frac{\sum_{i=1}^{|S|} ENEC_{i}}{|S|} \tag{12}$$

In this context, the variable $|S|$ represents the size of set $S$, which contains compound noun phrases. This evaluation metric emphasizes the model’s ability to identify and capture all entities within compound noun phrases, underscoring comprehensive detection and coverage of entities.

(2) Entity Name Quantity Matching Rate (ENQM)

Let $n_{a,i}$ represent the actual count of entity names within a compound noun phrase $s_{i}$ and let $n_{predicted,i}$ (subsequently denoted as $n_{p,i}$) signify the number of entity names identified by the model. We introduce a matching function in the manner described below:

$$match(n_{a,i}, n_{p,i}) = \begin{cases} 1 & \text{if } n_{a,i} = n_{p,i} \\ 0 & \text{else} \end{cases} \tag{13}$$

The comprehensive degree of entity name quantity alignment, represented by ENQM, is formulated as follows:

$$ENQM = \frac{1}{|S|} \sum_{i=1}^{|S|} match(n_{a,i}, n_{p,i}) \tag{14}$$

This metric assesses the alignment between the predicted number of entity names by the model and the actual count in compound noun phrases.
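A minimal sketch of Equations (13) and (14) follows; the function name and the example counts are illustrative assumptions. Note that the metric compares counts only, so a phrase scores 1 even if the predicted names themselves are wrong.

```python
def enqm(actual_counts, predicted_counts):
    """Share of phrases whose predicted entity count equals the true count (Eqs. 13-14)."""
    if not actual_counts or len(actual_counts) != len(predicted_counts):
        raise ValueError("expected one count per phrase in both arguments")
    matches = [
        1 if n_a == n_p else 0
        for n_a, n_p in zip(actual_counts, predicted_counts)
    ]
    return sum(matches) / len(matches)


# Hypothetical counts for three phrases; the third prediction is off by one
quantity_match = enqm([4, 2, 2], [4, 2, 3])
```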

(3) Entity Name Extraction Matching Rate (ENEM)

Consider a set $S$ comprising compound noun phrases. For each phrase $s_{i}$ within $S$, the model deduces it encompasses $n_{p,i}$ entity names. For every such phrase denoted by $s_{i}$, we determine a matching score $M_{ij}$ between each predicted entity name $e_{p,ij}$ and its true counterpart $e_{a,ij}$. The computation of $M_{ij}$ can adopt various techniques, ranging from Edit Distance [31] to Embedding Vector Similarity Measurement. The latter technique involves converting text into embedding vectors via pre-trained language models like BERT and subsequently measuring the cosine similarity or Euclidean distance between these vectors. For the purposes of this study, we utilize Edit Distance to derive $M_{ij}$:

$$M_{ij} = 1 - \frac{L(e_{p,ij}, e_{a,ij})}{\max(|e_{p,ij}|, |e_{a,ij}|)} \tag{15}$$

where $L(e_{p,ij}, e_{a,ij})$ represents the Edit Distance between entity name $e_{p,ij}$ and entity name $e_{a,ij}$, defined as follows:

$$L(e_{p,ij}, e_{a,ij}) = \begin{cases} \max(|e_{p,ij}|, |e_{a,ij}|) & \text{if } \min(|e_{p,ij}|, |e_{a,ij}|) = 0 \\ \min \begin{cases} L(e_{p,ij} - 1, e_{a,ij}) + 1 \\ L(e_{p,ij}, e_{a,ij} - 1) + 1 \\ L(e_{p,ij} - 1, e_{a,ij} - 1) + 1_{(e_{p,ij}[|e_{p,ij}|] \neq e_{a,ij}[|e_{a,ij}|])} \end{cases} & \text{else} \end{cases} \tag{16}$$

Here $e - 1$ denotes the string $e$ with its last character removed, $e[|e|]$ denotes the last character of $e$, and the indicator $1_{(e_{p,ij}[|e_{p,ij}|] \neq e_{a,ij}[|e_{a,ij}|])}$ is set to 1 if $e_{p,ij}[|e_{p,ij}|] \neq e_{a,ij}[|e_{a,ij}|]$ is true; otherwise, it assumes a value of 0.
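The recursion in Equation (16) is usually evaluated with dynamic programming rather than literally. The sketch below computes the same Edit Distance row by row and then the matching score of Equation (15). Function names are hypothetical, and scoring two empty names as a perfect match is an assumption made only to avoid a division by zero.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (same value as Eq. 16)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1  # the indicator term of Eq. (16)
            curr.append(min(
                prev[j] + 1,         # delete the last character of a[:i]
                curr[j - 1] + 1,     # insert the last character of b[:j]
                prev[j - 1] + cost,  # substitute, or keep if the characters match
            ))
        prev = curr
    return prev[-1]


def match_score(predicted, actual):
    """M_ij = 1 - L / max(|e_p|, |e_a|), as in Eq. (15)."""
    longest = max(len(predicted), len(actual))
    if longest == 0:
        return 1.0  # both names empty (assumption: treated as identical)
    return 1 - edit_distance(predicted, actual) / longest


# A hypothetical prediction that drops one character of a Table 1 entity name
score = match_score("半导体直设备", "半导体直流设备")  # distance 1, longest length 7
```

Because Python strings are sequences of Unicode code points, each Chinese character counts as one unit, which matches the character-level view taken in the text.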

The metric for entity precision extraction, denoted as $P_{i}$, for each compound noun phrase, is computed as follows:

$$P_{i} = \frac{\sum_{j=1}^{n_{p,i}} M_{ij}}{n_{p,i}} \tag{17}$$

Let $\omega_{i}$ represent the weight assigned to each compound noun phrase, influenced by factors such as the complexity or length of the entity names it contains. Considering the study’s focus on accurate entity name extraction, all instances of $\omega_{i}$ are set to 1. The overall metric for entity name extraction, labeled as ENEM, for the set of compound noun phrases is calculated as:

$$ENEM = \frac{\sum_{i=1}^{|S|} \omega_{i} \cdot P_{i}}{\sum_{i=1}^{|S|} \omega_{i}} \tag{18}$$

This metric combines the matching score with the assigned weight, reflecting the model’s accuracy in extracting entity names. Using these evaluation criteria, we can accurately assess the performance of Sequence Labeling and Parsing Models in extracting entity names and their combinations. These metrics serve two main functions, namely they measure the model’s precision and consistency and provide a basis for comparing different models or methods. Additionally, they validate the feasibility and stability of the proposed approach.
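Equations (17) and (18) can be sketched as follows, taking the matching scores $M_{ij}$ as already computed. The function names are hypothetical, the example scores are invented, and giving a phrase with no predicted entity names a score of 0 is an assumption the text does not address.

```python
def phrase_score(match_scores):
    """P_i: mean matching score over the predicted entity names of one phrase (Eq. 17)."""
    if not match_scores:
        return 0.0  # no predicted names for this phrase (assumption: scored as 0)
    return sum(match_scores) / len(match_scores)


def enem(scores_per_phrase, weights=None):
    """Weighted mean of P_i over all phrases (Eq. 18)."""
    if not scores_per_phrase:
        raise ValueError("expected at least one phrase")
    if weights is None:
        weights = [1.0] * len(scores_per_phrase)  # all omega_i = 1, as in the study
    weighted = sum(w * phrase_score(s) for w, s in zip(weights, scores_per_phrase))
    return weighted / sum(weights)


# Hypothetical M_ij values for two phrases with two predicted names each
scores = [[1.0, 1.0], [1.0, 0.5]]
uniform = enem(scores)                     # (1.0 + 0.75) / 2
weighted = enem(scores, weights=[1, 3])    # (1 * 1.0 + 3 * 0.75) / 4
```

With all weights equal to 1 the metric reduces to the plain average of the per-phrase scores; the second call only illustrates how non-uniform weights would shift the result toward more heavily weighted phrases.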

4.3. Model Comparison

4.3.1. Sequence Tagging Model

(1) Model A: BERT-Based Sequence Labeling without Attention (BSL-woA)

In Model A, we use a pre-trained Chinese BERT model as input. We extract hidden state vectors corresponding to the target characters and feed them directly into a multi-layer perceptron (MLP) classifier for label prediction. This model does not use an attention mechanism, simplifying its architecture and computation. It is referred to as BERT-Based Sequence Labeling without Attention (BSL-woA). Figure 6 shows a diagram of the network designed for predicting the second character in a compound noun phrase.

(2) Model B: BERT-CLS-Based Sequence Labeling Model (BCSL)

In Model B, we use the outputs from a pre-trained Chinese BERT model, focusing only on the hidden state vector for the ‘[CLS]’ token. For labeling characters in compound noun phrases, this vector is combined with specific positional data. This combined vector is then fed into a classification network for label prediction. This approach intends to utilize the overall semantic context from the BERT model to improve character-level labeling. The model is named the BERT-CLS-Based Sequence Labeling Model (BCSL). Figure 7 shows the model’s architecture.

(3) Model C: BERT-Based Sequence Labeling with Attention and CRF (BSL-A-CRF)

In Model C, we use the Conditional Random Field (CRF) [32] instead of the multi-layer perceptron (MLP) classifier previously used for sequence tagging in our initial study. This change is driven by CRF’s widespread use and excellent performance in sequence labeling tasks.

In detail, Model C combines BERT with an attention mechanism and a CRF, a configuration named ‘BERT-Based Sequence Labeling with Attention and CRF’, or BSL-A-CRF for short. In this model, BERT is responsible for extracting feature representations from the text, while the attention mechanism identifies long-term dependencies in the text. The CRF is used to model relationships among labels in the sequence, directing the final label assignment.

By using this enhanced model, we aim to see if the commonly used CRF in sequence tagging can improve sequence labeling predictions relevant to our study.

(4) Model D: BERT-Based Sequence Labeling with Attention and MLP Classifier (BSL-A-MC)

For our baseline, we used the Sequence Labeling Model we initially proposed, combining attention mechanisms with an MLP classifier. This model processes compound noun phrases and utilizes the BERT model and attention mechanisms to obtain context-rich representations. These representations are then used for label prediction. The model is named the BERT-Based Sequence Labeling with Attention and MLP Classifier, or BSL-A-MC for short.

4.3.2. Sequence Parsing Model

(1) Model E: LSTM-Based Sequence Parsing Model (LSP)

Model E uses an LSTM-based Seq2Seq framework designed for parsing character structures in compound noun phrases. The encoder transforms the input sequence into vector representations, capturing its semantic information. The decoder then uses the encoder’s hidden states to produce the output characters sequentially. This model is referred to as the LSTM-Based Sequence Parsing Model, or LSP.

(2) Model F: RNN-Based Sequence Parsing Model (RSP)

Model F employs an RNN-based Seq2Seq framework to parse character structures in compound noun phrases. Compared to Model E, Model F has a simpler architecture, making it suitable for handling shorter sequences and simpler structures. This model is named the RNN-Based Sequence Parsing Model, or RSP.

4.4. Parameter Configuration

In our experimental setup, we utilized the pre-trained Chinese BERT model, specifically BERT-Base-Chinese [12]. Moreover, we incorporated dropout operations preceding each layer.

The BERT module’s details are in Table 3.

In the baseline model, the attention weight computation includes L_max parallel multi-layer perceptrons (MLPs), all of which are identical. Each MLP has three fully connected layers, followed by an additional layer to apply weights to the BERT’s output. It is worth noting that L_max denotes the maximum length of the input sentence. Details of these MLPs are provided in Table 4.

In this study, output vectors from BERT are combined using attention weights into one vector, which is then processed by a fully connected layer. Details of this layer can be found in Table 5.

In the Sequence Labeling Model, the multi-layer perceptron (MLP) used for classification has uniform parameters across its layers. This MLP has two fully connected layers, and their parameter details can be found in Table 6.

The Sequence Parsing Model based on the Long Short-Term Memory (LSTM) architecture has specific parameters. Their details are provided in Table 7.

The parameters of the RNN-based Sequence Parsing Model are provided in Table 8.

5. Results and Analysis

5.1. Sequence Labeling Section

The Accuracy and F1 scores of four Sequence Labeling Models are illustrated in Figure 8.

We compare the performance of four distinct models in terms of accuracy and F1 scores, as visualized in the bar charts of Figure 9.

Together, the line graph of the F1 curves and the comparative bar chart show the performance differences between the BSL-woA, BCSL, BSL-A-CRF, and BSL-A-MC models for sequence labeling.

To evaluate the performance of the four Sequence Labeling Models introduced in this manuscript on entire compound noun phrase sequences, Figure 10 compares the Sequence Accuracy (SA) among these models.

The BSL-woA model uses the output from a pre-trained Chinese BERT directly, without adding an attention mechanism. It sends the hidden state vector, corresponding to the targeted character, to an MLP classifier for label prediction. Experimental results show that BSL-woA performs slightly worse than BSL-A-MC, suggesting that an attention mechanism can improve the model’s understanding of context and increase prediction accuracy.

In contrast, BCSL only uses the hidden state vector associated with BERT’s ‘[CLS]’ token. To predict labels for each character in compound noun phrases, this vector is concatenated with positional cues and then fed into a classification network. Results show that BCSL performs the worst among the four models, suggesting that relying on the ‘[CLS]’ hidden state vector alone may lose contextual information necessary for accurate label prediction.

The BSL-A-CRF model, which integrates the well-known Conditional Random Field (CRF) for sequence labeling, did not perform as expected. A CRF considers not only local features but also the transition relationships between labels, i.e., the probability of one label following another, which ensures meaningful label combinations across a sentence. For instance, in named entity recognition, names typically consist of a beginning, middle, and end, and a CRF prevents nonsensical combinations like “start-end-middle.” In this paper, however, given the uniqueness of the entity information and to facilitate subsequent sequence parsing, we designed a specific label system for compound noun phrases, which attenuates the transition relationships between labels. The task therefore emphasizes the discrete classification of characters rather than inter-character label dynamics. In addition, the CRF increases the number of parameters, especially through the label transition matrix, and with a constrained dataset this increase might lead to overfitting. By contrast, the model proposed in this paper uses BERT and its attention mechanism to ensure that the information fed into the MLP classifier is richly contextualized, which improves classification accuracy. Therefore, compared to a traditional CRF, our model is better suited to this task.

BSL-A-MC achieves the best performance with the highest accuracy. This model processes compound noun phrases by using both BERT and attention mechanisms to extract relevant contextual representations, leading to label prediction. Within its architecture, the attention mechanism enhances its ability to model individual characters in compound noun phrases. The adaptive attention adjustment allows the model to focus on key information. By adjusting attention weights based on position, the model can assign weights to characters based on contextual relevance, refining the label prediction process.

In summary, our results clearly show that BSL-A-MC, combining the attention mechanism and MLP classifier, achieves the best performance and accuracy among the tested models. The integrated attention mechanism gives the model the ability to focus on relevant parts of the input, improving its ability to understand contextual details. By adjusting attention weights effectively across characters, the model demonstrates enhanced efficacy in label prediction, highlighting important contextual cues.

5.2. Sequence Parsing Section

Through our experiments, we found that both RSP and LSP had similar performances in parsing, achieving good results, as shown in Figure 11.

Figure 11 shows the performance of the RSP and LSP models on three evaluation metrics: BLEU score, precision, and F1 score. Interestingly, the two models show very similar results on all three.

The main reason for this similarity is the nature of the sequences used: both input and output sequences are short and the vocabulary is limited. Moreover, the structure of the input–output pairs is simple. Given these conditions, the performance of the RNN closely matches that of the LSTM, as the shortness of the sequences reduces potential issues from long-term dependencies.

While LSTMs might have inherent advantages in some contexts, in cases with short and simple input–output sequences, the performance of RNNs and LSTMs appears to be similar.

5.3. Overall Model Entity Extraction Effectiveness

After introducing the ENEC, ENQM, and ENEM evaluation metrics, we assessed the six model architectures proposed in this study. Using these metrics, we get a clear understanding of the models’ effectiveness in extracting entity names from compound noun phrases.

Table 9 shows the performance outcomes based on the mentioned metrics for our test dataset. Examining these results reveals clear performance differences across the evaluated models. Table 10 presents the differences in performance for a specific example when evaluated using various models.

In all assessed metrics, the top-performing model is the baseline BSL-A-MC + LSP introduced in this study. Conversely, BCSL + RSP has the lowest performance. While there are some differences in the performance metrics across the eight model combinations in Table 9, these variations confirm the earlier observations in the sequence tagging and Sequence Parsing Model evaluations.

6. Discussion

In sequence labeling, various methods address the challenges of tagging compound noun phrases. The BSL-woA model is outperformed by the BSL-A-MC model, highlighting the crucial role of attention mechanisms in capturing context [33,34]. This trend is especially evident in complex sentence structures. The limited effectiveness of BCSL suggests that relying only on the semantic vector of compound noun phrases might not provide sufficient contextual cues. The BSL-A-CRF model uses the widely adopted Conditional Random Field (CRF) for sequence label prediction. While CRFs have performed well in many sequence tagging studies [35,36], label transition dynamics are attenuated in this specific context. This suggests that sequence tagging here depends more on individual character classification than on preserving the relationships between adjacent character labels, so a CRF might not be the optimal choice for tasks involving compound noun phrases. On the other hand, BSL-A-MC, which combines attention mechanisms with a deep-learning character classifier, performed best, emphasizing the value of attention mechanisms in understanding context.

Regarding sequence parsing, both RSP and LSP models display similar outcomes. Our observations are consistent with current studies across languages, suggesting a comparable effectiveness between RNNs and LSTMs, particularly for shorter sequences [37,38].

Examining the entity extraction capabilities and the introduction of the ENEC, ENQM, and ENEM metrics allowed a clearer understanding of differences in model performance. The outstanding performance of the BSL-A-MC + LSP combination demonstrates its effective architecture.

However, it is important to note the limitations of this research. Model generalization is a challenge, given their training on specific, limited datasets. Future work should focus on improving the models’ generalizability and ensuring their flexibility across various data contexts.

7. Conclusions

In this study, we devised methods to efficiently extract complete entity names from compound noun phrases, relying on sequence tagging and parsing approaches. Leveraging the Chinese BERT model coupled with attention mechanisms, our techniques demonstrated proficiency in handling compound noun phrases. A pivotal milestone was achieved using an LSTM-based Sequence Parsing Model in tandem with a rule-based decoder to effectively reconstruct intact entity names. This advancement not only addressed the fragmentation issue of entity names but also enabled the expansion of limited datasets, partially mitigating the data scarcity challenge.

Though our models displayed commendable performance on specific domain datasets, their restricted training raises concerns about their broader generalizability. Their domain-specific nature might limit their versatility across diverse datasets, marking an area warranting future investigation.

The implications of our findings are profound, benefiting areas like information processing, knowledge graph construction, semantic searches, and other natural language processing applications. These contributions enrich text understanding, semantic reasoning, and question answering, providing a foundational tool for semantic and contextual interpretation. Our insights could also be instrumental for interdisciplinary research, such as piecing fragmented genetic data into comprehensive genetic structures.

In summary, our primary contributions are: (1) development of efficient methods to extract complete entity names from compound noun phrases using advanced models; (2) significant mitigation of the entity name fragmentation issue, eliminating potential ambiguities in compound noun phrase applications; (3) enhancement of natural language processing tasks, from semantic searches to knowledge graph creation; and (4) highlighting the need for future research in enhancing model adaptability across varied datasets.

Author Contributions

Conceptualization, Y.P.; Methodology, Y.P. and X.F.; Software, Y.P.; Validation, Y.P.; Formal analysis, Y.P. and X.F.; Investigation, Y.P.; Resources, X.F.; Data curation, Y.P.; Writing—original draft preparation, Y.P. and X.F.; Writing—review and editing, Y.P. and X.F.; Visualization, Y.P.; Y.P. designed the study; Y.P. analyzed and interpreted the data; Y.P. conducted the experiments; Y.P. and X.F. provided the technical and material support. All authors contributed to the writing of the manuscript and final approval. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 61672199).

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgments

The authors thank the anonymous reviewers for their valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Wang, Q.; Wen, Z.; Ding, K.; Zhao, Q.; Yang, M.; Yu, X.; Xu, R. Improving Sequence Labeling with Labeled Clue Sentences. Knowl. Based Syst. 2022, 257, 109828.
2. Wan, F.; Yang, Y.; Zhu, D.; Yu, H.; Zhu, A.; Che, G.; Ma, N. Semantic Role Labeling Integrated with Multilevel Linguistic Cues and Bi-LSTM-CRF. Math. Probl. Eng. 2022, 2022, 9871260.
3. Hady, M.F.A.; Schwenker, F. Semi-Supervised Learning. In Handbook on Neural Information Processing; Springer: Berlin/Heidelberg, Germany, 2013; pp. 215–239.
4. Alec, R.; Karthik, N.; Tim, S.; Ilya, S. Improving Language Understanding with Unsupervised Learning. Citado 2018, 17, 1–12.
5. Wu, S.; He, Y. Enriching Pre-trained Language Model with Entity Information for Relation Classification. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Birmingham, UK, 3 November 2019; pp. 2361–2364.
6. Liu, P.; Guo, Y.; Wang, F.; Li, G. Chinese Named Entity Recognition: The State of the Art. Neurocomputing 2022, 473, 37–53.
7. Wu, F.; Liu, J.; Wu, C.; Huang, Y.; Xie, X. Neural Chinese Named Entity Recognition via CNN-LSTM-CRF and Joint Training with Word Segmentation. In Proceedings of the World Wide Web Conference, Singapore, 13 May 2019; pp. 3342–3348.
8. Geng, R.; Chen, Y.; Huang, R.; Qin, Y.; Zheng, Q. Planarized Sentence Representation for Nested Named Entity Recognition. Inf. Process. Manag. 2023, 60, 103352.
9. Catelli, R.; Gargiulo, F.; Casola, V.; De Pietro, G.; Fujita, H.; Esposito, M. Crosslingual named entity recognition for clinical de-identification applied to a COVID-19 Italian data set. Appl. Soft Comput. 2020, 97, 106779.
10. Cheng, J.; Liu, J.; Xu, X.; Xia, D.; Liu, L.; Sheng, V.S. A Review of Chinese Named Entity Recognition. KSII Trans. Internet Inf. Syst. 2021, 15, 2012–2019.
11. Chen, A.; Peng, F.; Shan, R.; Sun, G. Chinese Named Entity Recognition with Conditional Probabilistic Models. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, Sydney, Australia, 22–23 July 2006; pp. 173–176.
12. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
13. Rogers, A.; Kovaleva, O.; Rumshisky, A. A Primer in BERTology: What We Know About How BERT Works. Trans. Assoc. Comput. Linguist. 2020, 8, 842–866.
14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 4 December 2017; pp. 6000–6010.
15. Yang, W.; Xie, Y.; Lin, A.; Li, X.; Tan, L.; Xiong, K.; Li, M.; Lin, J. End-to-End Open-Domain Question Answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, MN, USA, 3–5 June 2019; pp. 72–77.
16. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3982–3992.
17. Gan, L.; Zhang, Y. Investigating Self-Attention Network for Chinese Word Segmentation. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2933–2941.
18. Huang, Y.; Li, Z.; Deng, W.; Wang, G.; Lin, Z. D-BERT: Incorporating Dependency-Based Attention into BERT for Relation Extraction. CAAI Trans. Intell. Technol. 2021, 6, 417–425.
19. Graves, A.; Graves, A. Long Short-Term Memory. In Supervised Sequence Labelling with Recurrent Neural Networks; MIT Press: Cambridge, MA, USA, 2012; pp. 37–45.
20. Qi, Q.; Wang, X.; Sun, H.; Wang, J.; Liang, X.; Liao, J. A Novel Multi-Task Learning Framework for Semi-Supervised Semantic Parsing. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2552–2560.
21. Chiu, J.P.; Nichols, E. Named Entity Recognition with Bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguist. 2016, 4, 357–370.
22. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF Models for Sequence Tagging. Available online: https://arxiv.org/abs/1508.01991 (accessed on 23 July 2023).
23. Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 11–16 December 2016; pp. 260–270.
24. Manning, C.; Schutze, H. Foundations of Statistical Natural Language Processing; MIT Press: Cambridge, MA, USA, 1999.
25. Hearst, M.A. Text Tiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Comput. Linguist. 1997, 23, 33–64.
26. Durbin, R.; Eddy, S.R.; Krogh, A.; Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids; Cambridge University Press: Cambridge, UK, 1998.
27. Hart, P.E.; Stork, D.G.; Duda, R.O. Pattern Classification; Wiley: Hoboken, NJ, USA, 2000.
28. Lin, Z.; Feng, M.; Santos, C.N.D.; Yu, M.; Xiang, B.; Zhou, B.; Bengio, Y. A Structured Self-Attentive Sentence Embedding. arXiv 2017, arXiv:1703.03130.
29. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
30. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318.
31. Levenshtein, V.I. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Sov. Phys. Dokl. 1965, 10, 707–710.
32. Chang, Y.; Kong, L.; Jia, K.; Meng, Q. Chinese Named Entity Recognition Method Based on BERT. In Proceedings of the 2021 IEEE International Conference on Data Science and Computer Application (ICDSCA), Dalian, China, 29–31 October 2021; pp. 294–299.
33. Yadav, V.; Bethard, S. A Survey on Recent Advances in Named Entity Recognition from Deep Learning models. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 2145–2158.
34. Wang, B.; Chai, Y.; Xing, S. Attention-Based Recurrent Neural Model for Named Entity Recognition in Chinese Social Media. In Proceedings of the 2019 2nd International Conference on Algorithms, Computing and Artificial Intelligence, Sanya, China, 20–22 December 2020; pp. 291–296.
35. Settles, B. Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications, Geneva, Switzerland, 28 August 2004; pp. 104–107.
36. Finkel, J.R.; Grenager, T.; Manning, C. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, Michigan, 25–30 June 2005; pp. 363–370.
37. Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (ICML’13), Atlanta, GA, USA, 17–19 June 2013; pp. 1310–1318.
38. Britz, D.; Goldie, A.; Luong, M.-T.; Le, Q. Massive Exploration of Neural Machine Translation Architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 1442–1451.

Figure 1. Overall Model Architecture.

Figure 2. Sequence Labeling Model.

Figure 3. Structure of the Attention Mechanism Used in the Sequence Labeling Model.

Figure 4. Simplification of the Annotated Sequence.

Figure 5. LSTM-based Sequence Parsing Model.

Figure 6. The BSL-woA Sequence Labeling Model.

Figure 7. The BERT-CLS-Based Sequence Labeling Model (BCSL) for Sequence Labeling.

Figure 8. The graphical representations depict the variations in accuracy and F1 scores across different Sequence Labeling Models.

Figure 9. Comparison of accuracy and F1 scores of the Sequence Labeling Model.

Figure 10. Comparison of Sequence Accuracy of Sequence Labeling Models.

Figure 11. Comparison of sequential analytic models.

Table 1. Example of Entity Name Extraction.

Chinese Compound Noun Phrases	Entity Name
起重用吊斗、铲、抓斗和夹钳 (lifting hoist, shovel, grab, and clamp)	起重用吊斗; 起重用铲; 起重用抓斗; 起重用夹钳 (lifting hoist; lifting shovel; lifting grab; lifting clamp)
半导体直、变流设备 (semiconductor DC and converter equipment)	半导体直流设备; 半导体变流设备 (semiconductor DC Equipment; semiconductor converter devices)

Table 2. Examples of Sequence Labeling and Sequence Parsing.

Compound Noun Phrases	Sequence Labeling and Simplification Results	Sequence Parsing Results
水泥及水泥制品设备 (Cement and Cement Product Equipment)	zzfzzzzzz (zz)	z,z
镜头及器材 (Lenses and Equipment)	ggfzz (gz)	g,gz
分离及干燥设备 (Separation and Drying Equipment)	zzfzzgg (zzg)	zg,zg
气体分离及液化设备 (Gas Separation and Liquefaction Equipment)	ggzzfzzgg (gzzg)	gzg,gzg
防疫、防护卫生装备及器具 (Epidemic Prevention, Protective Sanitary Equipment, and Instruments)	zzfzzggzzfzz (zzgzz)	zgz,zgz,zgz,zgz
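
The simplification step shown in parentheses in Table 2 is not spelled out in this part of the paper. The sketch below is one rule that reproduces all five rows of the table: collapse each run of identical labels into a single label, then drop the f labels. Treating f as a separator label is an inference from the examples, and the function name is hypothetical; this is not the authors' stated procedure.

```python
from itertools import groupby


def simplify_labels(labels, separator="f"):
    """Collapse runs of identical labels, then remove separator labels."""
    # groupby yields one key per run of consecutive identical characters
    collapsed = [label for label, _ in groupby(labels)]
    return "".join(label for label in collapsed if label != separator)


# The five labeled sequences of Table 2 and their simplified forms
examples = {
    "zzfzzzzzz": "zz",
    "ggfzz": "gz",
    "zzfzzgg": "zzg",
    "ggzzfzzgg": "gzzg",
    "zzfzzggzzfzz": "zzgzz",
}
simplified = {labels: simplify_labels(labels) for labels in examples}
```

Runs are collapsed before the separators are removed, so two z runs on either side of an f stay distinct (for example, zzfzz becomes zz rather than z).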

Table 3. Parameters of the pre-trained BERT Model.

Model	“Bert-Base-Chinese”
Vocabulary Size	21,128
Hidden Size (H)	768
Number of Layers in BERT (L)	12
Number of Attention Heads (A)	12
BERT’s Intermediate Size (I)	3072

Table 4. Attention Mechanism Parameters.

Component	Description	Number of Nodes	Activation Function	Initialization	Regularization
Fully Connected Layer 1	MLP Layer 1	256	Sigmoid	He	L2 (λ = 0.01)
Fully Connected Layer 2	MLP Layer 2	64	Sigmoid	He	L2 (λ = 0.01)
Fully Connected Layer 3	MLP Layer 3	1	Sigmoid	He	L2 (λ = 0.01)

Table 5. Parameters of the Final Layer of the Attention Mechanism.

Component	Description	Number of Nodes	Activation Function	Initialization	Regularization
Fully Connected Layer	Attention Compression Layer	768	None	BERT’s Output	None

Table 6. Parameters of the multi-layer perceptron (MLP) for Classification in Sequence Labeling.

Component	Description	Number of Nodes	Activation Function	Initialization	Regularization
Fully Connected Layer 1	MLP Layer 1	512	ReLU	He	L2 (λ = 0.01)
Fully Connected Layer 2	MLP Layer 2	3	Softmax	He	L2 (λ = 0.01)
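
As a rough sanity check on the layer sizes in Tables 4 and 6, the number of trainable parameters in each MLP can be estimated from the node counts. This is a sketch under two assumptions that the tables do not state explicitly: both MLPs receive a 768-dimensional input (the BERT hidden size from Table 3), and every fully connected layer has a bias term. The helper names `dense_params` and `mlp_params` are hypothetical.

```python
from typing import Sequence


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(input_dim: int, layer_sizes: Sequence[int]) -> int:
    """Total parameters of a stack of fully connected layers."""
    total = 0
    for n_out in layer_sizes:
        total += dense_params(input_dim, n_out)
        input_dim = n_out  # the next layer consumes this layer's output
    return total


BERT_HIDDEN = 768  # Table 3; ASSUMED to be the input width of both MLPs

attention_mlp = mlp_params(BERT_HIDDEN, [256, 64, 1])  # node counts from Table 4
classifier_mlp = mlp_params(BERT_HIDDEN, [512, 3])     # node counts from Table 6

if __name__ == "__main__":
    print("attention MLP parameters:", attention_mlp)
    print("classification MLP parameters:", classifier_mlp)
```

Under these assumptions both heads are small relative to the 12-layer BERT encoder they sit on.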

Table 7. Parameters of the LSTM-based Sequence Parsing Model.

Parameter	Value
Input Dimension	8
Output Dimension	9
Hidden Dimension	64
Number of Hidden Layers	2
Sequence Length	6

Table 8. Parameters of the RNN-based Sequence Parsing Model.

Parameter	Value
Input Dimension	8
Output Dimension	9
Hidden Dimension	64
Number of Hidden Layers	2
Sequence Length	6

Table 9. ENEC, ENQM, and ENEM scores for all model combinations.

Model	Entity Name Extraction Coverage Rate (ENEC)	Entity Name Quantity Matching Rate (ENQM)	Entity Name Extraction Matching Rate (ENEM)
BSL-woA + LSP	0.803	0.782	0.935
BCSL + LSP	0.711	0.701	0.813
BSL-A-MC + LSP	0.806	0.792	0.946
BSL-A-CRF + LSP	0.726	0.708	0.825
BSL-woA + RSP	0.799	0.783	0.882
BCSL + RSP	0.708	0.695	0.796
BSL-A-MC + RSP	0.804	0.787	0.937
BSL-A-CRF + RSP	0.716	0.712	0.816

Table 10. Performance differences for a compound noun phrase across various models.

Model	Entity Name Extraction and Reassembly Results	Correct?
Original compound noun phrase	防疫、防护卫生装备及器具 (Epidemic Prevention, Protective Sanitary Equipment, and Instruments)
BSL-woA + LSP	防疫卫生装备; 防疫卫生器具; 防护卫生装备; 防护卫生器具 (Epidemic Prevention Sanitary Equipment; Epidemic Prevention Sanitary Instruments; Protective Sanitary Equipment; Protective Sanitary Instruments)	√
BCSL + LSP	防疫器具; 防护卫生装备器具 (Epidemic Prevention Instruments; Protective Sanitary Equipment Instruments)	×
BSL-A-MC + LSP	防疫卫生装备; 防疫卫生器具; 防护卫生装备; 防护卫生器具 (Epidemic Prevention Sanitary Equipment; Epidemic Prevention Sanitary Instruments; Protective Sanitary Equipment; Protective Sanitary Instruments)	√
BSL-A-CRF + LSP	防疫防护卫生装备; 防护卫生装备器具 (Epidemic Prevention Protective Sanitary Equipment; Protective Sanitary Equipment Instruments)	×
BSL-woA + RSP	防疫卫生装备; 防疫卫生器具; 防护卫生装备; 防护卫生器具 (Epidemic Prevention Sanitary Equipment; Epidemic Prevention Sanitary Instruments; Protective Sanitary Equipment; Protective Sanitary Instruments)	√
BCSL + RSP	防疫器具; 防护卫生装备器具 (Epidemic Prevention Instruments; Protective Sanitary Equipment Instruments)	×
BSL-A-MC + RSP	防疫卫生装备; 防疫卫生器具; 防护卫生装备; 防护卫生器具 (Epidemic Prevention Sanitary Equipment; Epidemic Prevention Sanitary Instruments; Protective Sanitary Equipment; Protective Sanitary Instruments)	√
BSL-A-CRF + RSP	防疫防护卫生装备; 防护卫生装备器具 (Epidemic Prevention Protective Sanitary Equipment; Protective Sanitary Equipment Instruments)	×

“√” denotes correct extraction, while “×” indicates incorrect extraction.


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

