Article

A Sentence-Matching Model Based on Multi-Granularity Contextual Key Semantic Interaction

1 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
2 Computer Technology Application Key Lab of the Yunnan Province, Kunming 650500, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(12), 5197; https://doi.org/10.3390/app14125197
Submission received: 13 May 2024 / Revised: 7 June 2024 / Accepted: 11 June 2024 / Published: 14 June 2024

Abstract

In the task of matching Chinese sentences, the key semantics within sentences and the deep interaction between them significantly affect the matching performance. However, previous studies mainly relied on shallow interactions based on a single semantic granularity, which left them vulnerable to interference from overlapping terms. It is particularly challenging to distinguish between positive and negative examples within datasets from the same thematic domain. This paper proposes a sentence-matching model that incorporates multi-granularity contextual key semantic interaction. The model combines multi-scale convolution and multi-level convolution to extract different levels of contextual semantic information at word, phrase, and sentence granularities. It employs multi-head self-attention and cross-attention mechanisms to align the key semantics between sentences. Furthermore, the model integrates the original, similarity, and dissimilarity information of sentences to establish deep semantic interaction. Experimental results on both open- and closed-domain datasets demonstrate that the proposed model outperforms existing baseline models in terms of matching performance. Additionally, the model achieves matching effectiveness comparable to large-scale pre-trained language models while utilizing a lightweight encoder.

1. Introduction

Sentence matching is essential in various natural language processing tasks and finds applications in question answering, information retrieval, recommendation systems, and natural language inference. For example, in question-answering systems, it is used for paragraph retrieval to match relevant paragraphs from a text corpus with the given question. Additionally, it is employed in answer selection tasks to choose the most suitable answer from a set of candidates. Information retrieval and recommendation system tasks rely on sentence matching to establish relationships between queries and documents and rank candidate results accordingly. In natural language inference tasks, sentence matching is utilized to evaluate the semantic relationship between hypotheses and premises.
In recent years, pre-trained language models have shown great success in diverse natural language processing tasks. For instance, Garg et al. [1] employed a fully connected layer to combine the classification vector of BERT [2]. Reimers et al. [3] introduced SBERT, which encodes each sentence into a dense vector and measures the similarity between two vectors to assess the textual matching degree. These BERT-based models effectively capture overlapping terms in sentences. However, these methods are limited by their reliance on single-granularity vectors, which cannot fully represent textual information, and by the limited interaction they allow between sentences.
Inspired by the matching aggregation framework [4], Tang et al. [5] enhanced the effectiveness of short text matching by considering multiple aspects of interaction that capture the similarities and differences between two sentences comprehensively. Wang et al. [6] utilized ChineseBERT [7] to incorporate the visual and phonetic information of Chinese characters into the pre-training model, which served as encoders for questions and answers. This approach enabled multi-granularity feature encoding. They introduced a multi-scale convolutional neural network [8] to extract local semantic information and developed a context-aware interactive module to model sentence interaction. Furthermore, they proposed a multi-perspective fusion mechanism to integrate local and interactive information, with the goal of capturing rich features from questions and answers. However, these methods have limitations in their focus on low-level local semantic interactions and their limited discriminative ability when dealing with candidate options that contain a high degree of overlapping terms.
Lin et al. [9] introduced BERT-SMAP (BERT with self-matching attention pooling), which incorporates a self-matching attention-pooling mechanism into sentence pairs. This mechanism allows the model to prioritize the key terms within the sentence pairs rather than the overlapping terms. Lu et al. [10] proposed MKPM (multi-keyword pair matching), a method for addressing the sentence-matching task. It utilizes attention mechanisms on sentence pairs to identify the most important keyword pairs from the two sentences, representing their semantic relationship and avoiding interference from redundancy and noise. However, these methods have limitations in terms of their performance on closed-domain datasets. In datasets with a high semantic similarity between texts in the same thematic domain, solely focusing on key terms may make it challenging for the model to distinguish between positive and negative examples.
Humans first consider the presence of matching keywords and then synthesize the overall meaning of sentences to assess text alignment. To address these issues, this paper proposes a sentence-matching model that incorporates multi-granularity contextual key semantic interaction. The model combines pre-trained language model embeddings with BiGRU contextual encoding to thoroughly consider sentence contextual information. It utilizes a multi-scale CNN and a multi-level CNN to extract different levels of semantic information at the word and phrase granularities. The model introduces multi-head self-attention and cross-attention to emphasize key semantics within sentences and align crucial information between sentence pairs. By integrating three interaction methods, it incorporates interactive information on similarity and dissimilarity, enabling focus on key semantics and local interaction information at different granularity levels. The sentence’s overall representation is obtained through max pooling, with the pre-trained language model’s CLS vector serving as the coarse-grained semantics of the entire sentence pair. By combining these components, the model comprehensively captures the semantic information of the entire sentence pair.
In Chinese, a language lacking natural word delimiters, semantic information is conveyed through lexical units. Additionally, the polysemy of Chinese words can introduce ambiguity in various contexts. Phrases formed by word combinations carry specific underlying semantics, and Chinese sentences with different word orders can exhibit entirely different meanings. Approaches that rely on single-granularity text semantic matching can only analyze semantics at a single level and may not accurately capture the semantic relationship between texts. Furthermore, they often overlook the fusion of features from different granularities. In contrast, MCKI explores key semantic information at the levels of words, phrases, and sentences, while fully considering the interaction and fusion of semantics across different granularities. This enables MCKI to achieve superior performance.
The main contributions of this paper can be summarized as follows:
  • Introduction of the multi-granularity contextual key semantic interaction model (MCKI) for sentence matching, which combines multi-granularity feature extraction and key semantic interaction fusion. This enables the model to capture comprehensive semantic information in the text.
  • Development of a semantic feature extraction mechanism that combines multi-scale CNN and multi-level CNN to extract contextual semantic information at different granularities and levels.
  • Construction of a key semantic alignment mechanism by integrating multi-head self-attention and cross-attention, incorporating interaction information at different levels from low level to high level.
  • The proposed model demonstrates superior matching performance compared to existing baseline models on the LCQMC and BQ datasets. Additionally, it achieves comparable results to large-scale pre-trained language models while using a lightweight encoder.

2. Related Work

With the advancement of deep learning techniques, attention mechanisms have been extensively explored by researchers to capture semantic interactions and determine the degree of matching. For example, Zhao et al. [11] presented an interactive attention network that employed a matching matrix to model the semantic exchange between source and target texts, improving the matching capability for low-frequency keywords. Fei et al. [12] introduced a deep neural communication model that utilized BiLSTM and tree encoders to capture semantic and syntactic features. Through attention mechanisms, they designed deep communication modules to capture local and global interactions between semantic and syntactic features, thereby enhancing semantic comprehension. However, these studies primarily focus on modeling English sentences and mainly address single-granularity text semantic matching. They analyze semantics at only a single level and do not take into account the characteristics of the Chinese language. As a result, they struggle to accurately measure semantic relationships between Chinese texts.
Several studies have investigated multi-granularity text semantic matching to comprehensively assess the semantic similarity and matching degree between texts. These studies have explored deep fusion techniques at the character and word levels. For example, Zhang et al. [13] utilized a soft alignment attention mechanism to enhance local information in sentences at different levels, capturing feature information and correlations from multiple perspectives. Yu et al. [14] employed interactive information to model diverse text pairs across various tasks and languages. Wang et al. [15] concurrently considered deep and shallow semantic similarity, as well as granularity at the lexical and character levels, enabling a deeper exploration of similarity information. Chang et al. [16] proposed semantic similarity analysis through the fusion of words and phrases.
Considering the characteristics of Chinese text, several research studies have employed character-level and word-level approaches to develop semantic matching models. For instance, Zhang et al. [17] proposed a deep feature fusion model that combines various deep learning structures to capture semantic information. They generate a similarity tensor using three matching strategies. Lai et al. [18] designed a lattice-based approach that constructs word grids by selecting character and word paths and utilized CNN to generate sentence representations. Chen et al. [19] introduced a neural graph matching network for Chinese text matching, constructing a graph that includes all possible characters and words in a sentence. They employed graph neural networks to generate graph representations for predicting the matching degree. Zhang et al. [20] presented a multi-granularity fusion model that combines LSTM encoding structures to extract features at different granularities. They employed an interactive matching layer to capture and fuse more interaction features. Zhao et al. [21] proposed a multi-granularity alignment model that utilizes attention mechanisms to capture interaction features between different granularities and merge them for obtaining matching representations. Lyu et al. [22] introduced a language knowledge-enhanced graph transformer to update character representations and semantic representations, obtaining the final sentence representation. Some studies attempted to model text representations using pinyin (phonetic transcription) and radicals. Tao et al. [23] proposed a radical-guided correlation model that extracts features from characters, radicals, and character–radical associations for Chinese text classification. Zhao et al. [24] proposed a multi-granularity interaction model based on pinyin and radicals for Chinese semantic matching. In addition to character and word information, they further incorporated multi-granularity features of Chinese pinyin and radicals. The interaction features between different granularities and sentences were aggregated to generate the final matching representation and predict the matching degree. Although these methods utilized multi-granularity information to comprehend text semantics from different perspectives, they faced limitations with the Word2Vec model for text embedding representation. Performance tended to decline when dealing with rare words or domain-specific terms.
In recent years, the widespread use of pre-trained language models has permeated various domains, benefiting tasks involving the matching of textual semantics. These models undergo extensive pre-training on large textual corpora, equipping them with the ability to understand grammar, semantics, and contextual intricacies in natural language. Wu et al. [25] merged BERT with tree models to investigate matching relationships in question-answering data. Zhang et al. [26] utilized BERT models and multi-feature convolutions to extract semantic features for textual semantic matching. Zou et al. [27] employed RoBERTa as the backbone model and adopted a divide-and-conquer strategy by decomposing keywords and intents for textual semantic matching. Wang et al. [6] achieved multi-granularity feature encoding using ChineseBERT. They also developed a context-aware interaction module to model interactive information between sentences and utilized a multi-perspective fusion mechanism to integrate local and interaction information, capturing rich features of questions and answers. However, these approaches primarily focus on low-level local semantic interactions and overlook crucial semantic information within the text, making them susceptible to irrelevant terminologies. Lin et al. [9] introduced a self-matching attention-pooling mechanism in sentence pairs, prioritizing key terms rather than overlapping terms. Lu et al. [10] proposed a matching method based on multiple pairs of key terms, selecting the most significant word pairs from two sentences to avoid redundancy and noise interference. While emphasizing key terms helps eliminate irrelevant semantic information, its effectiveness is limited in domain-specific datasets where semantic similarity is higher within texts of the same theme. Solely focusing on key terms poses challenges for the model in distinguishing positive and negative examples.

3. Multi-Granularity Contextual Key Semantic Interaction Model

This section presents the detailed components of the proposed model, as illustrated in Figure 1. The multi-granularity contextual key semantic interaction model (MCKI) consists of several layers: the contextual representation layer, the multi-granularity feature extraction layer, the key semantic alignment layer, the multi-way interaction fusion layer, and the similarity prediction layer.
The contextual representation layer utilizes a pre-trained language model to obtain embedded representations of the sentence pairs. The BiGRU is then employed to capture contextual features of the sequences.
In the multi-granularity feature extraction layer, a combination of multi-scale CNN and multi-level CNN is used to extract semantic information at different granularities and levels from the contextual features.
The key semantic alignment layer first utilizes a multi-head self-attention mechanism to extract key semantic information within each sentence. Then, a cross-attention mechanism is applied to align the key semantic information between sentence pairs.
Next, the multi-way interaction fusion layer combines the interaction features of key semantics through multiple comparison operations.
Finally, the similarity prediction layer employs max pooling to obtain the final semantic vectors at different granularity levels. These vectors are concatenated with the CLS token vector to measure the semantic matching relationship.

3.1. Contextual Representation Layer

The contextual representation layer consists of two steps: embedding representation and context feature extraction. For two sentence sequences, denoted as $P = \{p_1, p_2, \dots, p_m\}$ and $Q = \{q_1, q_2, \dots, q_n\}$, a pre-trained language model (such as BERT) is utilized as the encoder for the sentence pair. Special tokens, namely [CLS] and [SEP], are added to the sequences to distinguish them. The embedding representation of the sentence pair, denoted as $X_{emb}$, is obtained through the following encoding process:
$$X = \left[\mathrm{CLS}, P, \mathrm{SEP}, Q, \mathrm{SEP}\right]$$
$$X_{emb} = \mathrm{BERT}(X) = \left[h_{cls}, h_{p_1}, \dots, h_{p_m}, h_{sep}, h_{q_1}, \dots, h_{q_n}, h_{sep}\right]$$
where $X$ represents the input sequence containing the special markers and the sentence pair $P$ and $Q$. These sequences are encoded using a pre-trained language model, such as BERT, resulting in the embedded representation $X_{emb} \in \mathbb{R}^{h \times (m+n+3)}$ of the entire sequence. Each $h_i \in \mathbb{R}^{h}$ is an embedding vector, with $h$ denoting the dimensionality of the encoder’s hidden layer. Notably, $h_{cls} \in \mathbb{R}^{h}$ can serve as a coarse-grained feature, summarizing the semantic information of the entire sentence pair. Recurrent neural networks (RNNs) are commonly employed for modeling sequential data. However, RNN models are susceptible to problems like vanishing or exploding gradients. Hence, this study adopts an efficient model called a Bidirectional Gated Recurrent Unit (BiGRU) to further encode the embedded sequences and extract context features for each of the two sentences. Additionally, a linear layer is utilized to merge the embedding representations, facilitating the comprehensive utilization of both the original and contextual information of the sequences. The representation process is illustrated as follows:
$$P_{context} = \mathrm{BiGRU}\left(\left[h_{p_1}, \dots, h_{p_m}\right]\right) + \mathrm{Linear}\left(\left[h_{p_1}, \dots, h_{p_m}\right]\right)$$
$$Q_{context} = \mathrm{BiGRU}\left(\left[h_{q_1}, \dots, h_{q_n}\right]\right) + \mathrm{Linear}\left(\left[h_{q_1}, \dots, h_{q_n}\right]\right)$$
$$\mathrm{BiGRU}(H) = \mathrm{Concat}\left(\overrightarrow{\mathrm{GRU}}(H), \overleftarrow{\mathrm{GRU}}(H)\right)$$
where $P_{context} \in \mathbb{R}^{m \times 2d}$, $Q_{context} \in \mathbb{R}^{n \times 2d}$, $m$ and $n$ represent the lengths of the sentences, and $d$ denotes the hidden dimension of the unidirectional GRU.
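To make the computation concrete, the following is a minimal PyTorch-style sketch of this layer (the paper’s implementation uses PaddleNLP, so the class and variable names here are illustrative assumptions). The BiGRU output and a linear projection of the original embeddings are summed as in the equations above, with 768-dimensional embeddings and a unidirectional hidden size of 150 taken from Section 4.2.

```python
import torch
import torch.nn as nn

class ContextualRepresentation(nn.Module):
    def __init__(self, emb_dim=768, hidden_dim=150):
        super().__init__()
        # BiGRU with unidirectional hidden size d; the bidirectional output is 2d
        self.bigru = nn.GRU(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Linear layer projecting the original embeddings to 2d so that
        # P_context = BiGRU(H) + Linear(H) can be summed element-wise
        self.proj = nn.Linear(emb_dim, 2 * hidden_dim)

    def forward(self, token_embeddings):           # (batch, seq_len, emb_dim)
        context, _ = self.bigru(token_embeddings)  # (batch, seq_len, 2d)
        return context + self.proj(token_embeddings)

# usage: embeddings of sentence P taken from the pre-trained encoder
p_emb = torch.randn(2, 20, 768)                    # (batch, m, h)
p_context = ContextualRepresentation()(p_emb)      # (batch, m, 300)
```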

3.2. Multi-Granularity Feature Extraction Layer

In the field of natural language processing (NLP), convolutional neural networks (CNNs) have proven to be effective in extracting local features from sentences. They operate by applying convolutional operations to input sentences, which allows them to capture n-gram features and generate latent semantic representations that are rich in information for various downstream tasks. However, when dealing with Chinese text, words and phrases can vary in length and carry different meanings based on the number of characters they consist of. Using a fixed-size convolutional kernel might not be sufficient to capture the diverse range of phrase features, potentially resulting in the omission of crucial semantic information.
To address the limitations of a single fixed-size convolutional kernel in capturing diverse phrase features in Chinese text, this study proposes the use of multiple-scale convolutions to capture semantic features at different granularity levels, including individual words, phrases, and entire sentences. This allows for a comprehensive representation of sequence samples.
The multi-scale CNN performs convolutions on the input sentence using a series of differently sized kernels, each targeting specific n-gram semantic features. By employing multiple CNNs, the model can capture distinct features contained in different combinations of n-grams, overcoming the limitations of a single CNN. The network architecture is depicted in Figure 2.
Moreover, combinations of different phrases and expressions carry higher-level semantic information. By stacking different-scale CNNs, the multi-level CNN can capture advanced semantic features from low-level character sequences. This allows for the extraction of deeper semantic information. The network structure is illustrated in Figure 3.
The multi-scale CNN and multi-level CNN have the ability to capture different aspects of sentences. The multi-scale CNN captures a broader range of semantic information, while the multi-level CNN captures deeper semantic information. These two approaches complement each other. Similar to other CNN models used in NLP tasks, this study employs fixed-width convolutions to extract local features. The width of the convolutional kernel matches the hidden dimension of the input sequence.
For a given sentence sequence, let us assume $c_i \in \mathbb{R}^{d}$ represents the contextual representation of the $i$-th word in the sentence, where $d$ denotes the dimension of the representation. The input sentence consists of $n$ words. To extract meaningful local features, we employ convolutional filters of different scales, denoted as $S = \{s_1, s_2, \dots, s_t\}$, with a fixed kernel width of $l$. For example, the feature $m_{ij}^{S_i}$ is generated from the window $c_{i:i+S_i-1}$:
$$m_{ij}^{S_i} = \mathrm{GELU}\left(W_j^{S_i} \cdot c_{i:i+S_i-1} + b_j^{S_i}\right)$$
where $W_j^{S_i} \in \mathbb{R}^{S_i \times d}$ represents the weight of the $j$-th feature map at scale $s_i$, $\mathrm{GELU}$ denotes the activation function, and $b_j^{S_i}$ represents the bias of the feature map.
The convolutional kernel, denoted as $s_i$, slides over the entire padded contextual representation matrix, denoted as $C$, to generate feature maps. This process is illustrated below:
$$m_j^{S_i} = \left[m_{1j}^{S_i}, m_{2j}^{S_i}, \dots, m_{nj}^{S_i}\right]$$
The local n-gram features of the sentences are extracted using the multi-scale CNN. By concatenating the outputs of the convolutional kernel $s_i$, we obtain the input to the key semantic alignment layer, which is represented as follows:
$$M^{S_i} = \left[m_1^{S_i}, m_2^{S_i}, \dots, m_h^{S_i}\right]$$
where $h$ represents the number of feature maps and $M^{S_i} \in \mathbb{R}^{n \times h}$.
For the multi-level CNN, the above operation is repeated with the convolutional kernel $s_j$ applied on top of $M^{S_i}$ to extract higher-level semantics.
After applying the multi-granularity feature extraction layer to the contextual features of the sentences, the resulting multi-granularity features of sentences $P$ and $Q$ are as follows:
$$M_P = \left[M_P^{S_0}, M_P^{S_1}, M_P^{S_2}, \dots, M_P^{S_t}\right]$$
$$M_Q = \left[M_Q^{S_0}, M_Q^{S_1}, M_Q^{S_2}, \dots, M_Q^{S_t}\right]$$
where $M_P^{S_i} \in \mathbb{R}^{m \times h}$, $M_Q^{S_i} \in \mathbb{R}^{n \times h}$, $t$ represents the number of convolutional kernels, $M^{S_0}$ represents the output features of the multi-level CNN, and $M^{S_i}$ ($i \geq 1$) represents the output features of the different-scale convolutional kernels.
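As an illustration of how the two branches can be combined, the sketch below applies one “same”-padded 1D convolution per scale (kernel sizes 3, 4, 5) and a stacked pair of convolutions (kernel sizes 3, 4) for the multi-level branch, with GELU activations as in the equation above. It is a PyTorch-style sketch under these assumptions, not the paper’s PaddleNLP implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularityCNN(nn.Module):
    def __init__(self, in_dim=300, n_maps=300, scales=(3, 4, 5), levels=(3, 4)):
        super().__init__()
        # one Conv1d per scale; "same" padding (stride 1, PyTorch >= 1.9) keeps seq length
        self.scale_convs = nn.ModuleList(
            nn.Conv1d(in_dim, n_maps, k, padding="same") for k in scales)
        # stacked convolutions for the multi-level branch (deeper semantics)
        self.level_convs = nn.ModuleList(
            nn.Conv1d(in_dim if i == 0 else n_maps, n_maps, k, padding="same")
            for i, k in enumerate(levels))

    def forward(self, context):                  # (batch, seq_len, in_dim)
        x = context.transpose(1, 2)              # Conv1d expects (batch, dim, seq)
        outputs = []
        level = x                                # M^{S_0}: multi-level branch
        for conv in self.level_convs:
            level = F.gelu(conv(level))
        outputs.append(level.transpose(1, 2))
        for conv in self.scale_convs:            # M^{S_1..S_t}: multi-scale branch
            outputs.append(F.gelu(conv(x)).transpose(1, 2))
        return outputs                           # list of (batch, seq_len, n_maps)

# usage: contextual features of sentence P from the previous layer
m_p = MultiGranularityCNN()(torch.randn(2, 20, 300))   # 4 granularities
```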

3.3. Key Semantic Alignment Layer

The key semantic alignment layer plays a crucial role in enabling the model to focus on important semantic features at different granularities within a sentence and align key semantic information across sentence pairs. The key semantic features for sentence pairs should meet the following conditions:
  • They should possess rich semantic characteristics.
  • They should be highly significant within the sentences and directly impact the overall meaning of the sentences.
  • As key semantic information for sentence pairs, they should exhibit substantial correlations with each other.
These three features are determined by the contextual features at different granularities within the sentences, self-attention mechanisms within the sentences, and cross-attention mechanisms between sentence pairs. Specifically, the contextual features of the sentences are derived from the outputs of the multi-granularity feature extraction layer, represented by $M_P^{S_i} \in \mathbb{R}^{m \times h}$ and $M_Q^{S_i} \in \mathbb{R}^{n \times h}$, where $h$ denotes the number of feature maps. The overall structure of the key semantic alignment layer is illustrated in Figure 4.

3.3.1. Multi-Head Self-Attention

The transformer model utilizes the scaled dot-product attention (SDA) mechanism for self-attention. As transformer-based pre-trained language models have gained popularity, the use of the scaled dot-product function has become more prevalent. The process begins by calculating the dot product between $Q$ and $K$, which serves as a measure of similarity between them. A higher value indicates a stronger correlation. This result is then divided by the scaling factor $\sqrt{d}$, where $d$ represents the dimension of the vectors. The introduction of the scaling factor addresses the issue of gradient vanishing or exploding that may occur when dealing with large vector dimensions during the dot product computation. This ensures smoother and more manageable attention scores, promoting training stability. Subsequently, the softmax function is applied to ensure that the weighted coefficients sum up to one, achieving attention normalization and distribution. Finally, the attention expression is obtained by multiplying the result with the matrix $V$. The SDA function is defined as follows:
$$\mathrm{SDA}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
By introducing the multi-head mechanism, sequence features are mapped into different linear spaces, allowing for the extraction of relevant information from various representation spaces. This integration of information from different spaces enhances the expressive power of attention. The formula can be represented as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{Head}_1, \dots, \mathrm{Head}_t\right)W$$
$$\mathrm{Head}_i = \mathrm{SDA}\left(QW_i^Q, KW_i^K, VW_i^V\right)$$
where $W$, $W_i^Q$, $W_i^K$, and $W_i^V$ are learnable parameters, with $t$ denoting the number of attention heads.
Using multi-head self-attention, the key semantic features $S_P^{S_i}$ for the sentence $P$ itself can be computed as follows:
$$S_P^{S_i} = \mathrm{Concat}\left(S_{P1}^{S_i}, \dots, S_{Pt}^{S_i}\right)W$$
$$S_{Pi}^{S_i} = \mathrm{SDA}\left(M_P^{S_i}W_i^Q, M_P^{S_i}W_i^K, M_P^{S_i}W_i^V\right) = \mathrm{Softmax}\left(\frac{M_P^{S_i}W_i^Q \left(M_P^{S_i}W_i^K\right)^{\top}}{\sqrt{h}}\right)M_P^{S_i}W_i^V$$
In this equation, the dimension of $S_P^{S_i}$ matches that of the input feature $M_P^{S_i}$, i.e., $S_P^{S_i} \in \mathbb{R}^{m \times h}$. Similarly, the key semantic features $S_Q^{S_i} \in \mathbb{R}^{n \times h}$ for the sentence $Q$ can be calculated.
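The multi-head self-attention described by the equations above can be sketched as follows; the 20 heads over a 300-dimensional feature space follow Section 4.2, while the module layout and names are assumptions made for illustration (PyTorch-style rather than the paper’s PaddleNLP code).

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    # SDA(Q, K, V) = Softmax(Q K^T / sqrt(d)) V
    d = q.size(-1)
    weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return weights @ v

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim=300, heads=20):
        super().__init__()
        assert dim % heads == 0, "feature dim must divide evenly across heads"
        self.heads, self.head_dim = heads, dim // heads
        self.wq, self.wk, self.wv = (nn.Linear(dim, dim) for _ in range(3))
        self.wo = nn.Linear(dim, dim)  # output projection W

    def _split(self, t):
        b, n, _ = t.shape
        return t.view(b, n, self.heads, self.head_dim).transpose(1, 2)

    def forward(self, m):                                  # (batch, seq_len, dim)
        b, n, _ = m.shape
        heads = scaled_dot_product_attention(
            self._split(self.wq(m)), self._split(self.wk(m)), self._split(self.wv(m)))
        return self.wo(heads.transpose(1, 2).reshape(b, n, -1))

# usage: key semantics of sentence P at one granularity
s_p = MultiHeadSelfAttention()(torch.randn(2, 20, 300))    # (2, 20, 300)
```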

3.3.2. Cross-Attention

Cross-attention calculates the attention between the sentence pair in both directions, allowing the key semantic information of the two sentences to be aligned. This attention is computed from the similarity matrix $F^{S_i} \in \mathbb{R}^{m \times n}$, which considers the lengths of sentence $P$ ($m$) and sentence $Q$ ($n$). The computation of the similarity matrix $F^{S_i}$ is as follows:
$$F^{S_i} = S_P^{S_i}\left(S_Q^{S_i}\right)^{\top}$$
Afterwards, the softmax function is applied to normalize the similarity matrix $F^{S_i}$ in both directions. This generates the row-wise attention weights $F_{P \to Q}^{S_i} \in \mathbb{R}^{m \times n}$ for the attention from sentence $P$ to sentence $Q$ and the column-wise attention weights $F_{Q \to P}^{S_i} \in \mathbb{R}^{n \times m}$ for the attention from sentence $Q$ to sentence $P$. The calculation is as follows:
$$F_{P \to Q}^{S_i} = \mathrm{Softmax}\left(F_{i:}^{S_i}\right)$$
$$F_{Q \to P}^{S_i} = \mathrm{Softmax}\left(F_{:j}^{S_i}\right)$$
To obtain the attention-aware representation $R_P^{S_i} \in \mathbb{R}^{m \times h}$ for sentence $P$ with respect to sentence $Q$, the attention weights are multiplied with the representations. Similarly, the attention-aware representation $R_Q^{S_i} \in \mathbb{R}^{n \times h}$ for sentence $Q$ with respect to sentence $P$ can be obtained. The calculation is as follows:
$$R_P^{S_i} = F_{P \to Q}^{S_i} S_Q^{S_i}$$
$$R_Q^{S_i} = F_{Q \to P}^{S_i} S_P^{S_i}$$
where $R_P^{S_i}$ and $R_Q^{S_i}$ denote the attention-aware representations of sentence $P$ with respect to sentence $Q$ and of sentence $Q$ with respect to sentence $P$, respectively, after the $i$-th scale convolution and attention mechanism.
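The bidirectional cross-attention reduces to a batched matrix product followed by row-wise and column-wise softmax, as in the sketch below; the tensor names are illustrative assumptions.

```python
import torch

def cross_attention(s_p, s_q):
    """s_p: (batch, m, h) key semantics of P; s_q: (batch, n, h) key semantics of Q."""
    f = s_p @ s_q.transpose(1, 2)                          # similarity matrix (batch, m, n)
    r_p = torch.softmax(f, dim=-1) @ s_q                   # P attends to Q  -> (batch, m, h)
    r_q = torch.softmax(f.transpose(1, 2), dim=-1) @ s_p   # Q attends to P  -> (batch, n, h)
    return r_p, r_q

# usage: align the key semantics of the two sentences at one granularity
r_p, r_q = cross_attention(torch.randn(2, 20, 300), torch.randn(2, 24, 300))
```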

3.4. Multi-Way Interaction Fusion Layer

To capture the interaction information between sentences, the key semantic features $S^{S_i}$ and the attention-aware representations $R^{S_i}$ can be effectively integrated through three different comparison operations. These operations facilitate a deep interaction between the key semantic features of each sentence, enabling the integration of interaction information across different granularities, from low level to high level. The structure of the multi-way interaction fusion layer is illustrated in Figure 5.
For sentence P, the three comparison operations are defined as follows:
$$G_{P1}^{S_i} = \mathrm{FFN}_1\left(\left[S_P^{S_i}, R_P^{S_i}\right]\right)$$
$$G_{P2}^{S_i} = \mathrm{FFN}_2\left(S_P^{S_i} - R_P^{S_i}\right)$$
$$G_{P3}^{S_i} = \mathrm{FFN}_3\left(S_P^{S_i} \odot R_P^{S_i}\right)$$
where $G_{P1}^{S_i} \in \mathbb{R}^{m \times 2h}$ (obtained through concatenation along the column direction), $G_{P2}^{S_i} \in \mathbb{R}^{m \times h}$, and $G_{P3}^{S_i} \in \mathbb{R}^{m \times h}$ represent the interaction features obtained after the three different operations:
  • $G_{P1}^{S_i}$ represents the concatenation of the key semantic features $S_P^{S_i}$ and the attention-aware representation $R_P^{S_i}$. In the interaction operation, concatenation is used to preserve all the information.
  • $G_{P2}^{S_i}$ performs element-wise subtraction, approximating the calculation of Euclidean distance and measuring the relevance between the two representations using difference information.
  • $G_{P3}^{S_i}$ performs element-wise multiplication, approximating the calculation of cosine similarity and measuring the similarity between the two representations.
These comparison operations enhance semantic information and capture semantic relationships. To optimize the model’s efficiency, a feed-forward neural network is employed to integrate the interaction information, resulting in the feature interaction matrix $G_P^{S_i} \in \mathbb{R}^{m \times l}$:
$$G_P^{S_i} = \mathrm{FFN}_4\left(\left[G_{P1}^{S_i}, G_{P2}^{S_i}, G_{P3}^{S_i}\right]\right)$$
where $\mathrm{FFN}_1$, $\mathrm{FFN}_2$, $\mathrm{FFN}_3$, and $\mathrm{FFN}_4$ denote distinct feed-forward neural networks with non-shared parameters, conventionally represented as follows:
$$\mathrm{FFN}(Z) = \mathrm{GELU}(ZW + b)$$
and $\left[G_{P1}^{S_i}, G_{P2}^{S_i}, G_{P3}^{S_i}\right] \in \mathbb{R}^{m \times 4h}$ is concatenated along the column dimension. The parameters $W_4 \in \mathbb{R}^{4h \times l}$ and $b \in \mathbb{R}^{l}$ are adjustable, and $l$ represents the output dimension of the network. This operation enables the model to capture deeper levels of interaction information and reduces the complexity of the representation. Likewise, the feature interaction matrix $G_Q^{S_i} \in \mathbb{R}^{n \times l}$ can be obtained for sentence $Q$.
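A minimal sketch of the three comparison operations and their fusion is given below, assuming each FFN is a single GELU-activated linear layer as in the formula above, with h = 300 and l = 150 following Section 4.2; it is an illustrative PyTorch-style sketch rather than the paper’s implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiWayFusion(nn.Module):
    def __init__(self, h=300, out_dim=150):
        super().__init__()
        self.ffn1 = nn.Linear(2 * h, 2 * h)   # G1: concatenation [S, R]
        self.ffn2 = nn.Linear(h, h)           # G2: difference S - R
        self.ffn3 = nn.Linear(h, h)           # G3: element-wise product S * R
        self.ffn4 = nn.Linear(4 * h, out_dim) # fuses [G1, G2, G3] down to l = out_dim

    def forward(self, s, r):                  # both (batch, seq_len, h)
        g1 = F.gelu(self.ffn1(torch.cat([s, r], dim=-1)))
        g2 = F.gelu(self.ffn2(s - r))
        g3 = F.gelu(self.ffn3(s * r))
        return F.gelu(self.ffn4(torch.cat([g1, g2, g3], dim=-1)))  # (batch, seq_len, l)

# usage: key semantics S_P and attention-aware representation R_P of sentence P
g_p = MultiWayFusion()(torch.randn(2, 20, 300), torch.randn(2, 20, 300))
```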

3.5. Similarity Prediction Layer

By applying the max-pooling operation, we map the feature interaction matrices $G_P^{S_i}$ and $G_Q^{S_i}$ to vector representations of the same dimension. This process allows for the extraction of valuable features at various granularities while reducing computational costs. The max-pooling operation for each granularity is defined as follows:
$$p^{S_i} = \mathrm{MaxPooling}\left(G_P^{S_i}\right)$$
$$q^{S_i} = \mathrm{MaxPooling}\left(G_Q^{S_i}\right)$$
where $p^{S_i} \in \mathbb{R}^{l}$ and $q^{S_i} \in \mathbb{R}^{l}$ represent the semantic feature vectors produced by the model at scale $S_i$. By concatenating all semantic feature vectors from the different scales with the CLS token vector, we obtain the final semantic interaction vector $f$:
$$f = \mathrm{Concat}\left(h_{cls}, p^{S_0}, p^{S_1}, \dots, p^{S_t}, q^{S_0}, q^{S_1}, \dots, q^{S_t}\right)$$
where $f \in \mathbb{R}^{h + 2(t+1)l}$, $h$ represents the dimension of the CLS token vector, $t+1$ is the total number of scales, and $l$ denotes the dimension of the feature vector at each scale. Finally, a fully connected layer is used to predict the matching relationship between the two sentences based on this vector.
The chosen loss function is the cross-entropy loss, represented by the following formula:
$$L = -\frac{1}{N}\sum^{N}\left[y \log p + \left(1 - y\right)\log\left(1 - p\right)\right]$$
In the equation, $y$ represents the true label, which is set to one when there is semantic matching between the sentence pairs and zero otherwise; $p$ denotes the predicted result, and $N$ represents the number of training samples.
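Putting the prediction step together, the sketch below performs max pooling over each granularity, concatenates the pooled vectors with the CLS vector, and feeds the result to a linear classifier trained with cross-entropy. The shapes (a 312-dimensional CLS vector as with ERNIE 3.0-nano, four granularities, l = 150) follow Section 4.2; the function and variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn

def predict_similarity(h_cls, g_p_list, g_q_list, classifier):
    """h_cls: (batch, h); g_*_list: one (batch, seq_len, l) tensor per granularity."""
    pooled = [g.max(dim=1).values for g in g_p_list + g_q_list]  # max pooling per scale
    f = torch.cat([h_cls] + pooled, dim=-1)                      # final interaction vector
    return classifier(f)                                         # (batch, 2) matching logits

# usage with t + 1 = 4 granularities, l = 150, and a 312-dimensional CLS vector
classifier = nn.Linear(312 + 2 * 4 * 150, 2)
logits = predict_similarity(torch.randn(8, 312),
                            [torch.randn(8, 20, 150)] * 4,
                            [torch.randn(8, 24, 150)] * 4,
                            classifier)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))  # cross-entropy objective
```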

4. Experimental Results and Analysis

4.1. Dataset

The experimental evaluation of the model in this study was conducted on the following two publicly available datasets, and the dataset statistics are presented in Table 1.
Large-scale Chinese Question Matching Corpus (LCQMC) [28]: LCQMC is a dataset designed for sentence pair matching in an open-domain context. It focuses on semantic intent matching, making it more generalizable than sentence rewriting datasets. Each sentence pair in LCQMC is labeled with zero or one to indicate whether the sentences match. The dataset consists of 149,226 positive sentence pairs and 110,842 negative sentence pairs.
Bank Question Dataset (BQ) [29]: The BQ dataset contains customer service logs from online banks spanning one year. It serves as a dataset for identifying sentence semantic equivalence, specifically for sentence pair semantic matching. Similar to LCQMC, each sentence pair in BQ is labeled with zero or one to indicate whether the sentences match. The positive and negative sentence pairs in BQ are balanced with a ratio of 1:1.

4.2. Implementation

The model architecture and system experiments in this study were implemented using the PaddleNLP framework. The framework’s pre-trained models and their initialized weights were utilized. The experiments were conducted using two NVIDIA GeForce RTX 3090 GPUs (Taiwan Semiconductor Manufacturing Company Limited, Hsinchu, Taiwan), each with a memory capacity of 24 GB.
In terms of the model architecture, for the input encoding stage, this study utilized 768-dimensional BERT-base, 312-dimensional ERNIE 3.0-nano, and 768-dimensional ERNIE 3.0-base for text embedding, respectively. For the feature extraction stage, the BiGRU unidirectional hidden dimension was set to 150. The multi-scale CNN adopted convolutional filters of size (3, 4, 5), while the multi-level CNN used convolutional filters of size (3, 4). The feature map dimension for both was 300. The multi-head self-attention employed 20 heads. In the interaction fusion stage, the hidden dimension of the FFN was set to 150.
Regarding model training, the maximum sentence length was set to 128, and the batch size was 32. The optimization was performed using AdamW with $\beta_1$ = 0.9 and $\beta_2$ = 0.999. The learning rate was set to 5 × 10−5, and the dropout parameter was 0.1. The model was trained for 3 epochs.
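For reference, the training setup above can be summarized in the following configuration sketch; it uses PyTorch’s AdamW purely for illustration (the experiments were run with the PaddleNLP framework) and reflects only the hyperparameters listed in this subsection.

```python
import torch

MAX_SEQ_LEN = 128       # maximum sentence-pair length
BATCH_SIZE = 32
EPOCHS = 3
DROPOUT = 0.1
LEARNING_RATE = 5e-5

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # AdamW with beta1 = 0.9 and beta2 = 0.999, as reported above
    return torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, betas=(0.9, 0.999))
```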
In terms of model testing, the evaluation metric used was accuracy (Acc), and the optimal result was chosen as the experimental outcome. The following definitions were used: True Positives (TPs) represent the cases where the original sentences were matched and correctly classified, while True Negatives (TNs) indicate the cases where the original sentences were unmatched and correctly classified. False Positives (FPs) refer to the cases where the original sentences were unmatched but mistakenly classified, and False Negatives (FNs) denote the cases where the original sentences were matched but erroneously classified. The accuracy was calculated using the following formula:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

4.3. Baseline Models

In this study, representative methods from three categories, namely neural network-based methods, attention-based methods, and pre-trained language model-based methods, were selected as comparative models. The chosen models are as follows:
  • Neural Network-based Methods:
Text-CNN [30]: a convolutional neural network model commonly used for sentence classification and matching tasks.
BiLSTM [31]: a variant of recurrent neural networks that processes sentences using LSTM units, capturing long-term and short-term dependencies from both forward and backward contexts.
BiMPM [32]: A sentence-matching model based on BiLSTM. It utilizes BiLSTM to represent contextual information and performs matching from two directions and four different perspectives. The matching results are then aggregated using BiLSTM.
  • Attention-based Methods:
DRCN [33]: a densely connected co-attention recurrent neural network where each layer utilizes attention features and hidden features from all preceding recurrent layers.
MIPR: A multi-granularity interaction model based on pinyin and radicals for Chinese semantic matching. The interaction features between different granularities and sentences are aggregated to generate the final matching representation and predict the matching degree.
MKPM: A sentence-matching method based on multi-keyword pair matching. It employs attention mechanisms for sentence pairs and selects the most important keyword pairs from both sentences to represent their semantic relationships. This approach helps to avoid interference from redundancy and noise.
  • Pre-trained Language Model-based Methods:
BERT: A renowned pre-trained language model. This study utilizes BERT-base as the encoder to generate representations for sentence pairs. The CLS vector output is fed into a linear layer to predict the matching relationship between the pairs.
SBERT: A semantic textual similarity method based on BERT. It employs a Siamese and triplet network structure to obtain semantically meaningful sentence representations.
MAGE: A method for multi-scale context-aware interaction in sentence pair matching, based on multi-granularity embeddings. It aligns semantic relationships using context-aware interaction modules and combines local semantic features with attention representations using a multi-perspective fusion approach. This study utilizes BERT-base as the encoder and predicts the matching relationship between sentence pairs using a linear layer.
MacBERT [34]: An enhanced Chinese pre-trained language model based on BERT. It replaces the MLM (Masked Language Model) task in BERT with the Mac (MLM as correction) task, which helps alleviate the discrepancy between pre-training and fine-tuning stages.
ERNIE 3.0 [35]: A large-scale knowledge-enhanced pre-trained language model that combines autoregressive and autoencoding networks. This study conducts experiments using both ERNIE 3.0-base and ERNIE 3.0-nano as encoders. Similar to BERT, the CLS vector output is fed into a linear layer to predict the matching relationship between sentence pairs.

4.4. Performance Comparison

Table 2 presents the accuracy results of various models on the test sets of LCQMC and BQ datasets.
Overall, pre-trained language model-based methods outperformed the other two approaches. The performance improvement of these models was more pronounced on open-domain datasets compared to closed-domain datasets. This can be attributed to the emphasis on capturing essential semantic information, which allows for the elimination of irrelevant semantic details. As a result, the models demonstrated significant enhancements when handling sentence pairs with more noise. However, it is worth noting that the BQ dataset consisted of samples with a common theme, resulting in higher similarity between positive and negative instances. This increased similarity posed a challenge in distinguishing between them.
Among the neural network-based methods, BiLSTM demonstrated a substantial accuracy improvement compared to Text-CNN. This is attributed to the fact that text data can be viewed as sequential data, and the recurrent structure of BiLSTM is better suited for modeling sequence features. Unlike Text-CNN, which extracts diverse text features through convolutions of various sizes, BiLSTM takes into account the dependencies between contextual information present in the text. To further enhance accuracy, BiMPM, which builds upon BiLSTM, introduces bidirectional different-perspective matching to capture contextual features and aggregates matching information using BiLSTM.
Methods incorporating attention mechanisms, which enable the model to concentrate on crucial interaction information between sentences, also exhibit enhanced performance. DRCN introduced an attention mechanism in recurrent neural networks. MKPM employed bidirectional attention to select crucial words for keyword matching in sentence pairs, leading to significant improvements on the LCQMC dataset. This highlighted the importance of key semantics in sentence matching within an open-domain context. MIPR extracted semantic interaction features within and between sentences at four different granularities, resulting in notable improvements on the BQ dataset. This showcases the influence of multi-granularity semantics in sentence matching within a closed-domain setting.
Pre-trained language models demonstrate exceptional performance in language modeling by acquiring contextual representations from extensive data. BERT achieved an accuracy of 86.75% on the LCQMC dataset and 84.67% on the BQ dataset. In contrast, SBERT, which employs a dual-encoder architecture to model sentence representations but lacks explicit interaction information between sentences, exhibited lower effectiveness on both datasets when compared to the cross-encoder architecture of BERT. While attention mechanisms enable models to concentrate on important words or phrases in text, they can pose challenges in distinguishing between positive and negative examples in texts with similar themes. MAGE effectively handles this challenge through the use of a context-aware interaction module and a multi-perspective fusion module. Additionally, pre-trained language models such as MacBERT and Ernie 3.0, which have undergone extensive training on large-scale Chinese data, enhance performance by improving the training process.
The proposed method in this study demonstrated significant performance improvements compared to both neural network-based models and attention-based models, highlighting the effectiveness of building models upon pre-trained language models. Specifically, when compared to pre-trained language models, the method in this study achieved improvements of 1.15% and 0.61% on top of BERT, 1.53% and 0.67% on top of Ernie 3.0-nano, and 1.44% and 0.25% on top of Ernie 3.0-base, respectively, surpassing all baseline models. However, the improvement observed for the model based on Ernie 3.0-base in the BQ dataset is relatively smaller. This can be attributed to two factors. Firstly, as mentioned earlier, the BQ dataset poses greater difficulty in terms of matching compared to open-domain datasets. Secondly, Ernie 3.0, being a more advanced pre-trained language model, has undergone extensive knowledge-enhanced pre-training, equipping it with certain capabilities for modeling semantic matching relationships in sentence pairs. Consequently, the potential for improvement relative to the BERT-based models is relatively smaller.
Additionally, this study achieved performance that is comparable to BERT-base and even MacBERT-large by leveraging the capabilities of the lightweight Ernie 3.0-nano model on the LCQMC dataset. This observation suggests that, instead of relying on larger pre-trained language models, the approach presented in this study can attain similar performance levels by utilizing a more lightweight pre-trained language model. This strategy proves effective, particularly when resources are limited.
In comparison to the MAGE model, the proposed method achieved improvements of 0.32% and 0.44% on the two datasets. The approach presented in this study utilizes a fusion of multi-scale and multi-level CNNs to extract comprehensive contextual features, enabling the model to capture a wider and deeper range of information. By combining multi-head self-attention and cross-attention mechanisms, the model not only emphasizes key semantic information but also aligns significant details across different levels of semantics. As a result, significant improvements are observed compared to the MAGE model.

4.5. Ablation Study

To further evaluate the effectiveness of each module, this study included an ablation study on the model based on Ernie 3.0-nano using the BQ dataset. This study involved removing specific modules to assess their impact on performance, as displayed in Table 3. In the table, the symbol “√” denotes the inclusion of the corresponding module, while “-” denotes its removal.
To assess the effectiveness of the contextual representation module, the study conducted an ablation experiment by removing this module and directly extracting multi-granularity features from the original embeddings. The experimental results showed that the model’s accuracy decreased by 0.44%. Furthermore, this study investigated the impact of using different contextual representation modules, and the comparative results are presented in Table 4. Compared to unidirectional LSTM and GRU structures, the bidirectional structure demonstrated an improvement of approximately 0.7% in accuracy. This improvement can be attributed to two factors. Firstly, the hidden dimension of the unidirectional structure is only half that of the bidirectional structure, leading to a decrease in overall model performance. Secondly, the bidirectional structure captures both forward and backward semantic dependencies in the text, effectively capturing contextual semantic information. Additionally, the GRU structure outperformed the LSTM structure, indicating that GRU is more effective in modeling sequential information in the text. Moreover, better results were obtained by combining the original embeddings, which allows for the comprehensive utilization of both the original information and contextual information from the text.
Additionally, to verify the effectiveness of the multi-granularity feature extraction module, its removal resulted in a decrease of 0.37% in model accuracy. This study also explored the replacement of this module with different convolutional methods, as depicted in Table 5. It was observed that a single-scale CNN can only capture features at a specific scale, limiting its ability to incorporate a broader range of semantic information. Conversely, a multi-scale CNN compensates for this limitation by capturing semantic information at multiple scales, while a multi-level CNN focuses on capturing higher-level semantic information. These two approaches possess distinct capabilities in capturing different semantic features. The multi-scale CNN extracts a wider range of semantic information, while the multi-level CNN delves into deeper semantic layers. By combining both approaches, the model achieves improved performance as they complement each other in capturing diverse semantic features.
This study also examined the impact of attention mechanisms on the model’s performance. By removing self-attention and directly fusing the multi-granularity contextual features with the aligned features from cross-attention, the model’s accuracy declined by 0.34%. This decrease can be attributed to the absence of important semantic information at various granularities. Similarly, removing cross-attention and fusing the multi-granularity contextual features with the output features from self-attention led to a decrease in model accuracy by 0.31%. This reduction occurred because the model solely focused on its own semantic information and lacked interaction between sentences. Furthermore, simultaneous removal of cross-attention and the multi-path interaction fusion module, relying solely on key semantic information at different granularities, resulted in a decrease in model accuracy by 0.48%. This finding indicates that aligning crucial semantic information between sentences and considering semantic interactions at various granularities significantly contribute to enhancing semantic matching capabilities.
Additionally, solely removing the multi-way interaction fusion module resulted in a decrease of 0.43% in accuracy. Furthermore, to further validate the effectiveness of different comparison methods, as presented in Table 6, using only the concatenation of sentence semantic features achieved an accuracy of 83.35%. Using only semantic difference information led to an accuracy of 82.91%, and using only semantic similarity information resulted in an accuracy of 83.05%. These results indicate that combining all three types of comparison methods is an effective approach for capturing matching interactions. Moreover, the impact of feature fusion in different directions on performance was explored. In the table, “,” represents concatenation along the column direction of the tensor, expanding the hidden state dimension, while “;” represents concatenation along the row direction of the tensor, expanding the sentence length. The results showed that concatenation in the row direction did not effectively merge the compared information from multiple ways, which failed to reflect the interaction between sentences and resulted in decreased performance.
Lastly, a comparative experiment was conducted to validate the impact of the CLS token vector on model performance, as presented in Table 7. The experiment involved using different prediction modules. The results demonstrated that the best performance is achieved when the CLS token vector is concatenated with the output vector from the max-pooling operation. This finding suggests that in addition to fine-grained features, incorporating the CLS token as a coarse-grained representation of semantic information for the entire sentence pair effectively enhances semantic matching performance. Furthermore, the additional concatenation of the output vector from the average-pooling operation leads to a decrease in model performance. This decrease can be attributed to the introduction of redundant information, which affects the model’s ability to distinguish important features.

4.6. Parameter Sensitivity Analysis

In this section, a further evaluation was carried out to assess the impact of different combinations of multi-scale CNNs on model performance, as presented in Table 8. The results indicated that the best model performance was achieved when using a combination of multi-scale convolutional filters (3, 4, 5) and multi-level convolutional filters (3, 4). Increasing or decreasing the number of convolutional filters resulted in varying degrees of information redundancy or loss, which led to a decrease in performance.
In addition, the impact of the number of self-attention heads on model performance was investigated, as illustrated in Figure 6. It was observed that as the number of heads increased, the model performance improved, reaching its peak when the number of heads was 20. This improvement can be attributed to the average sentence length in the BQ dataset being 11.9 tokens, and with a model hidden dimension of 300, having 20 attention heads ensures that each attention unit has a dimension of 15, which is sufficient to capture most of the sentence patterns in the dataset. However, as the number of heads continues to increase beyond this point, the dimension per attention unit decreases, leading to a low-rank bottleneck problem [36]. This ultimately results in a decline in model performance.

5. Conclusions

This paper presents a sentence-matching model called the multi-granularity contextual key semantic interaction model (MCKI). The model incorporates pre-trained language model embeddings and BiGRU to capture contextual information effectively. It employs a semantic feature extraction mechanism that combines a multi-scale CNN and multi-level CNN to extract semantic contextual information at different granularities and levels. The introduction of a key semantic alignment mechanism by integrating multi-head self-attention and cross-attention mechanisms emphasizes key semantic information between sentence pairs. By combining three interaction methods, the model comprehensively considers both the similarity and difference between sentences, resulting in enhanced matching performance.
The effectiveness of the proposed model and its components was evaluated on public datasets. The findings demonstrated that in open-domain datasets, matching key semantic elements effectively eliminates irrelevant information, leading to significant improvements, especially when sentence pairs contain high levels of noise. Moreover, the proposed methodology in this study achieved comparable performance to large-scale pre-trained models by utilizing lightweight pre-training language models, which proves effective when resources are limited. However, in closed-domain datasets where sentence pairs share the same theme, considering semantic nuances at different levels and granularities aids the model in distinguishing overlapping terminologies within the subject. Nevertheless, the effectiveness of model enhancements remains limited in such scenarios. Additionally, marginal effects are observed when applying improvements to more advanced pre-trained models. The limitations and shortcomings of the current research should be acknowledged. This study primarily focused on combining existing models without thoroughly exploring improvements to these models. Future research will aim to enhance known models in order to achieve better performance. Furthermore, there is a consideration to integrate graph neural networks, which can leverage both the key semantic information and structural aspects of sentences to optimize matching performance.

Author Contributions

Conceptualization, J.L. and Y.L.; methodology, J.L.; software, J.L.; validation, J.L.; formal analysis, J.L.; investigation, J.L.; resources, Y.L.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, Y.L.; supervision, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

The research was funded by the Key projects of science and technology plan of Yunnan Provincial Department of Science and Technology (grant number 202201AS070029).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen), and are available at http://icrc.hitsz.edu.cn/info/1037/1146.htm and http://icrc.hitsz.edu.cn/info/1037/1162.htm with the permission of Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen), accessed on 1 October 2023.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Garg, S.; Ramakrishnan, G. BAE: BERT-based Adversarial Examples for Text Classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6174–6181. [Google Scholar]
  2. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  3. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3982–3992. [Google Scholar]
  4. Wang, S.; Jiang, J. A compare-aggregate model for matching text sequences. In Proceedings of the ICLR 2017: International Conference on Learning Representations, Toulon, France, 24–26 April 2017; pp. 1–15. [Google Scholar]
  5. Tang, X.; Luo, Y.; Xiong, D.; Yang, J.; Li, R.; Peng, D. Short text matching model with multiway semantic interaction based on multi-granularity semantic embedding. Appl. Intell. 2022, 52, 15632–15642. [Google Scholar] [CrossRef]
  6. Wang, M.; He, X.; Liu, Y.; Qing, L.; Zhang, Z.; Chen, H. MAGE: Multi-scale context-aware interaction based on multi-granularity embedding for Chinese medical question answer matching. Comput. Methods Programs Biomed. 2023, 228, 107249. [Google Scholar] [CrossRef]
  7. Sun, Z.; Li, X.; Sun, X.; Meng, Y.; Ao, X.; He, Q.; Wu, F.; Li, J. ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual, 1–6 August 2021; pp. 2065–2075. [Google Scholar]
  8. Zhang, S.; Zhang, X.; Wang, H.; Cheng, J.; Li, P.; Ding, Z. Chinese medical question answer matching using end-to-end character-level multi-scale CNNs. Appl. Sci. 2017, 7, 767. [Google Scholar] [CrossRef]
  9. Lin, D.; Tang, J.; Li, X.; Pang, K.; Li, S.; Wang, T. BERT-SMAP: Paying attention to Essential Terms in passage ranking beyond BERT. Inf. Process. Manag. 2022, 59, 102788. [Google Scholar] [CrossRef]
  10. Lu, X.; Deng, Y.; Sun, T.; Gao, Y.; Feng, J.; Sun, X.; Sutcliffe, R. MKPM: Multi keyword-pair matching for natural language sentences. Appl. Intell. 2022, 52, 1878–1892. [Google Scholar] [CrossRef]
  11. Zhao, S.; Huang, Y.; Su, C.; Li, Y.; Wang, F. Interactive attention networks for semantic text matching. In Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), Sorrento, Italy, 17–20 November 2020; pp. 861–870. [Google Scholar]
  12. Fei, H.; Ren, Y.; Ji, D. Improving text understanding via deep syntax-semantics communication. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 84–93. [Google Scholar]
  13. Zhang, X.; Li, Y.; Lu, W.; Jian, P.; Zhang, G. Intra-correlation encoding for Chinese sentence intention matching. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), 8–13 December 2020; pp. 5193–5204. [Google Scholar]
  14. Yu, C.; Xue, H.; Jiang, Y.; An, L.; Li, G. A simple and efficient text matching model based on deep interaction. Inf. Process. Manag. 2021, 58, 102738. [Google Scholar] [CrossRef]
  15. Wang, X.; Yang, H. MGMSN: Multi-Granularity Matching Model Based on Siamese Neural Network. Front. Bioeng. Biotechnol. 2022, 10, 839586. [Google Scholar] [CrossRef] [PubMed]
  16. Chang, G.; Wang, W.; Hu, S. MatchACNN: A multi-granularity deep matching model. Neural Process. Lett. 2023, 55, 4419–4438. [Google Scholar] [CrossRef]
  17. Zhang, X.; Lu, W.; Li, F.; Peng, X.; Zhang, R. Deep feature fusion model for sentence semantic matching. Comput. Mater. Contin. 2019, 61, 601–616. [Google Scholar] [CrossRef]
  18. Lai, Y.; Feng, Y.; Yu, X.; Wang, Z.; Xu, K.; Zhao, D. Lattice CNNs for matching based Chinese question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 6634–6641. [Google Scholar]
  19. Chen, L.; Zhao, Y.; Lyu, B.; Jin, L.; Chen, Z.; Zhu, S.; Yu, K. Neural graph matching networks for Chinese short text matching. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6152–6158. [Google Scholar]
  20. Zhang, X.; Lu, W.; Zhang, G.; Li, F.; Wang, S. Chinese sentence semantic matching based on multi-granularity fusion model. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, 11–14 May 2020; pp. 246–257. [Google Scholar]
  21. Zhao, P.; Lu, W.; Li, Y.; Yu, J.; Jian, P.; Zhang, X. Chinese semantic matching with multi-granularity alignment and feature fusion. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar]
  22. Lyu, B.; Chen, L.; Zhu, S.; Yu, K. LET: Linguistic knowledge enhanced graph transformer for Chinese short text matching. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 13498–13506. [Google Scholar]
  23. Tao, H.; Tong, S.; Zhang, K.; Xu, T.; Liu, Q.; Chen, E.; Hou, M. Ideography leads us to the field of cognition: A radical-guided associative model for Chinese text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 13898–13906. [Google Scholar]
  24. Zhao, P.; Lu, W.; Wang, S.; Peng, X.; Jian, P.; Wu, H.; Zhang, W. Multi-granularity interaction model based on pinyins and radicals for Chinese semantic matching. World Wide Web 2022, 25, 1703–1723. [Google Scholar] [CrossRef]
  25. Wu, Z.; Liang, J.; Zhang, Z.; Lei, J. Exploration of text matching methods in Chinese disease Q&A systems: A method using ensemble based on BERT and boosted tree models. J. Biomed. Inform. 2021, 115, 103683. [Google Scholar]
  26. Zhang, Z.; Zhang, Y.; Li, X.; Qian, Y.; Zhang, T. BMCSA: Multi-feature spatial convolution semantic matching model based on BERT. J. Intell. Fuzzy Syst. 2022, 43, 4083–4093. [Google Scholar] [CrossRef]
  27. Zou, Y.; Liu, H.; Gui, T.; Wang, J.; Zhang, Q.; Tang, M.; Li, H.; Wang, D. Divide and Conquer: Text Semantic Matching with Disentangled Keywords and Intents. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 3622–3632. [Google Scholar]
  28. Liu, X.; Chen, Q.; Deng, C.; Zeng, H.; Chen, J.; Li, D.; Tang, B. LCQMC: A large-scale Chinese question matching corpus. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 1952–1962. [Google Scholar]
  29. Chen, J.; Chen, Q.; Liu, X.; Yang, H.; Lu, D.; Tang, B. The BQ corpus: A large-scale domain-specific Chinese corpus for sentence semantic equivalence identification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4946–4951. [Google Scholar]
  30. He, T.; Huang, W.; Qiao, Y.; Yao, J. Text-attentional convolutional neural network for scene text detection. IEEE Trans. Image Process. 2016, 25, 2529–2541. [Google Scholar] [CrossRef]
  31. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The performance of LSTM and BiLSTM in forecasting time series. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 3285–3292. [Google Scholar]
  32. Wang, Z.; Hamza, W.; Florian, R. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017. [Google Scholar]
  33. Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1637–1645. [Google Scholar]
  34. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3504–3514. [Google Scholar] [CrossRef]
  35. Sun, Y.; Wang, S.; Feng, S.; Ding, S.; Pang, C.; Shang, J.; Liu, J.; Chen, X.; Zhao, Y.; Lu, Y. Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv 2021, arXiv:2107.02137. [Google Scholar]
  36. Bhojanapalli, S.; Yun, C.; Rawat, A.S.; Reddi, S.; Kumar, S. Low-rank bottleneck in multi-head attention models. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 864–873. [Google Scholar]
Figure 1. The overall architecture of the MCKI model.
Figure 2. Structure of the multi-scale CNN.
Figure 3. Structure of the multi-level CNN.
Figure 4. Structure of the key semantic alignment layer.
Figure 5. Structure of the multi-way interactive fusion layer.
Figure 6. Effect of the number of self-attention heads on model performance.
Table 1. Dataset statistics.
Datasets    Total      Train      Validation    Test
LCQMC       260,068    238,766    8,802         12,500
BQ          120,000    100,000    10,000        10,000
Table 2. Accuracy (%) of different models on LCQMC and BQ.
Methods                   LCQMC    BQ
Text-CNN [28,29]          72.80    68.52
BiLSTM [28,29]            76.10    73.51
BiMPM [28,29]             83.40    81.85
DRCN [33]                 85.90    83.30
MIPR [24]                 85.29    84.45
MKPM [10]                 86.71    84.11
BERT-base                 86.75    84.67
SBERT                     84.44    83.64
MAGE                      87.58    84.84
MacBERT-base [34]         87.00    85.20
MacBERT-large [34]        87.60    85.60
Ernie 3.0-nano            86.12    83.02
Ernie 3.0-base            87.64    85.83
MCKI (BERT-base)          87.90    85.28
MCKI (Ernie 3.0-nano)     87.65    83.69
MCKI (Ernie 3.0-base)     89.08    86.08
Table 3. Ablation study on BQ over the contextual representation, multi-granularity feature extraction, multi-head self-attention, cross-attention, and multi-way interaction fusion modules. Row 1 is the full model; each subsequent row removes one or two of these modules.
Index    Acc (%)
1        83.69
2        83.25
3        83.32
4        83.35
5        83.38
6        83.21
7        83.26
Table 4. Accuracy of different contextual representation modules.
Modules              Acc (%)
LSTM                 82.66
GRU                  82.83
BiLSTM               83.44
BiGRU                83.57
BiGRU + embedding    83.69
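
To connect the table rows to code, the following minimal PyTorch sketch shows the best-performing variant, "BiGRU + embedding", under the assumption that it concatenates the BiGRU hidden states with the original token embeddings; the class name and layer sizes are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class ContextualEncoder(nn.Module):
    """Illustrative "BiGRU + embedding" contextual representation:
    BiGRU hidden states concatenated with the original token embeddings.
    Hidden sizes are placeholders, not values from the paper."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.bigru = nn.GRU(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, embed_dim), e.g., encoder token vectors
        context, _ = self.bigru(embeddings)            # (batch, seq_len, 2 * hidden_dim)
        return torch.cat([context, embeddings], dim=-1)
```
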
Table 5. Accuracy of different multi-granularity feature extraction modules.
Modules                              Filters               Acc (%)
Single-scale CNN                     (3)                   83.35
Multi-scale CNN                      (3, 4, 5)             83.41
Multi-level CNN                      (3, 4)                83.37
Multi-scale CNN + Multi-level CNN    (3, 4, 5) + (3, 4)    83.69
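
The feature extractors compared above can likewise be sketched in PyTorch. The snippet below combines a multi-scale branch (parallel convolutions with kernel sizes 3, 4, 5) with a multi-level branch (stacked convolutions with kernel sizes 3 and 4), matching the best row of the table; channel counts, padding, and the trimming step are assumptions made only to keep the sketch runnable.

```python
import torch
import torch.nn as nn

class MultiGranularityCNN(nn.Module):
    """Parallel multi-scale convolutions (kernel sizes 3, 4, 5) combined with
    a stacked multi-level branch (kernel sizes 3 then 4). Channel counts and
    padding are illustrative assumptions."""

    def __init__(self, dim: int = 256, channels: int = 128):
        super().__init__()
        self.scales = nn.ModuleList(
            [nn.Conv1d(dim, channels, k, padding=k // 2) for k in (3, 4, 5)]
        )
        self.levels = nn.Sequential(
            nn.Conv1d(dim, channels, 3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, 4, padding=2), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); Conv1d expects (batch, dim, seq_len)
        h = x.transpose(1, 2)
        seq_len = h.size(-1)
        # Even kernel sizes produce one extra position, so trim every branch
        # back to the input length before concatenating channel-wise.
        feats = [torch.relu(conv(h))[..., :seq_len] for conv in self.scales]
        feats.append(self.levels(h)[..., :seq_len])
        return torch.cat(feats, dim=1).transpose(1, 2)  # (batch, seq_len, 4 * channels)
```
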
Table 6. Accuracy of different feature fusion methods.
Methods                 Acc (%)
(M, R)                  83.35
(M − R)                 82.91
(M ⊙ R)                 83.05
(M − R, M ⊙ R)          83.24
(M, R, M − R)           83.30
(M, R, M ⊙ R)           83.51
(M, R, M − R, M ⊙ R)    83.69
(M; R; M − R; M ⊙ R)    83.19
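
As a reading aid for the notation above, M and R are taken here to denote the original and the aligned (interaction) representations. The best-performing variant (M, R, M − R, M ⊙ R) can then be interpreted as concatenating the two vectors with their element-wise difference and product; the sketch below illustrates that assumed reading and is not the authors' exact implementation.

```python
import torch

def fuse(m: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """(M, R, M - R, M ⊙ R): concatenate the original and aligned
    representations with their element-wise difference and product.
    The reading of the table's notation is an assumption; m and r
    must share a shape, e.g., (batch, seq_len, dim)."""
    return torch.cat([m, r, m - r, m * r], dim=-1)
```
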
Table 7. Accuracy of different prediction modules.
Modules                            Acc (%)
Max pooling                        83.23
Avg pooling                        83.18
(CLS, Max pooling)                 83.69
(CLS, Avg pooling)                 83.51
(CLS, Max pooling, Avg pooling)    83.09
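
For the prediction module, the best configuration concatenates the [CLS] vector with a max pooling over the token representations before classification. A minimal sketch of such a head is given below; the hidden size and the single linear classifier are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """(CLS, Max pooling) prediction head: the [CLS] vector is concatenated
    with the max-pooled token features, then classified. The layer sizes and
    single linear classifier are illustrative assumptions."""

    def __init__(self, dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, cls_vec: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # cls_vec: (batch, dim); tokens: (batch, seq_len, dim)
        pooled = tokens.max(dim=1).values              # (batch, dim)
        return self.classifier(torch.cat([cls_vec, pooled], dim=-1))
```
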
Table 8. Effect of different scale CNN combinations on model performance.
Multi-Scale CNN    Multi-Level CNN    Acc (%)
(3, 4, 5)          (3, 4)             83.69
(3, 4, 5)          (3, 4, 5)          83.47
(3, 4)             (3, 4)             83.54
(1, 2, 3)          (3, 4)             83.62
(2, 3, 4)          (3, 4)             83.65
(3, 4, 5)          (1, 2)             83.50
(3, 4, 5)          (2, 2)             83.49
(3, 4, 5)          (2, 3)             83.57
(3, 4, 5)          (3, 2)             83.44
(3, 4, 5)          (3, 3)             83.64
(3, 4, 5)          (4, 2)             83.40
(3, 4, 5)          (4, 3)             83.44
(3, 4, 5)          (4, 4)             83.46
(3, 4, 5)          (4, 5)             83.53
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
