Article

A Short-Text Similarity Model Combining Semantic and Syntactic Information

1 School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
2 Guangxi Key Laboratory of Image and Graphic Intelligent Processing, Guilin University of Electronic Technology, Guilin 541004, China
* Authors to whom correspondence should be addressed.
Electronics 2023, 12(14), 3126; https://doi.org/10.3390/electronics12143126
Submission received: 20 June 2023 / Revised: 9 July 2023 / Accepted: 13 July 2023 / Published: 18 July 2023
(This article belongs to the Section Computer Science & Engineering)

Abstract

As one of the prominent research directions in the field of natural language processing (NLP), short-text similarity is widely used in search, recommendation, and question-answering systems. Most existing short-text similarity models focus on semantic similarity while overlooking the importance of syntactic similarity. In this paper, we first propose a knowledge-enhanced language representation model based on graph convolutional networks (KEBERT-GCN), which effectively uses fine-grained word relations in the knowledge base to assess semantic similarity and to model the relationship between knowledge structure and text structure. To fully leverage the syntactic information of sentences, we also propose a computational model of constituency parse trees based on tree kernels (CPT-TK), which combines syntactic information, semantic features, and an attentional weighting mechanism to evaluate syntactic similarity. Finally, we propose a comprehensive model that integrates both semantic and syntactic information to evaluate short-text similarity. The experimental results demonstrate that our proposed short-text similarity model outperforms models proposed in recent years, achieving a Pearson correlation coefficient of 0.8805 on the STS-B dataset.

1. Introduction

Measuring the similarity between two short texts is known as short-text similarity. It is a special form of the text-matching or textual-entailment task and is used in various downstream applications of natural language processing, for example, information retrieval [1], question answering [2], and document classification [3].
Deep learning algorithms accomplish this by learning linguistic knowledge from massive amounts of textual data and expressing text (words and sentences) as continuously trainable embedding vectors [4]. Among them, BERT [5] has demonstrated excellent performance through unsupervised pre-training [6] on a huge corpus [7], learning contextualized word representations with its multi-headed attention mechanism. However, while ordinary readers can only understand words based on context when reading a domain-specific text, experts can also reason with relevant domain knowledge. External knowledge therefore plays a particularly crucial role in helping pre-trained language models understand domain-specific information.
Pre-trained language models are increasingly being enhanced with external knowledge [8]. Adding external knowledge to the original BERT model supplements the short text with more semantic information as well as fine-grained word-relationship information. Examples include SemBERT [9] and KIM [10], which consistently improve semantic text-matching performance. To address the inconsistency of the embedding vector spaces between words in short texts and entities in the knowledge graph (KG), Baidu ERNIE 2.0 [11] uses two knowledge strategies, a phrase-level strategy and an entity-level strategy, to implicitly learn the prior knowledge of phrases and entities during training. To mitigate the knowledge noise that external knowledge bases may introduce, K-BERT [12] loads pre-trained BERT models and introduces soft-position and visible matrices that limit the influence of knowledge, thereby overcoming knowledge noise and better integrating external knowledge with the model. These studies demonstrate that introducing external knowledge allows a model to better understand short text and learn additional external information.
However, even though these studies recognize the importance of external knowledge for understanding short texts in relevant domains and integrate it into pre-trained language models, they leave two issues open: (1) how to effectively use the information in the knowledge base for short-text similarity computation; (2) how to model the relationship between knowledge structure and text structure. To overcome these challenges, in this paper we propose a knowledge-enhanced language representation model based on graph convolutional networks (KEBERT-GCN). We form a new score matrix by taking the Hadamard product of the matrix constructed from the external knowledge base and the attention matrix in BERT's multi-headed attention mechanism, and we use it as the adjacency matrix of a GCN [13], which effectively exploits the fine-grained word-relationship information in the knowledge base while avoiding knowledge noise. A graph is constructed with each semantically informed token in BERT as a node, and the relational features between tokens are represented by the edges of the graph. To capture the relational features of knowledge structure and text structure, we extract the relational information between tokens using two GCN layers to obtain the representation vector of each sentence. In addition, we embed relative positions into the node representations to make the GCN position-aware [14]. Experimental results on four popular short-text similarity datasets show that the proposed KEBERT-GCN model consistently improves short-text semantic similarity performance and significantly outperforms BERT.
Most current research models only the semantic similarity of short texts using language models and does not fully utilize the sentences' syntactic information. It has previously been shown that syntactic information is beneficial for short-text similarity performance [15,16]. Therefore, we also propose a computational model of constituency parse trees based on tree kernels (CPT-TK), which combines syntactic information, semantic features, and an attentional weighting mechanism to obtain syntactic structural information and judge the syntactic similarity of short texts.
In summary, the main contributions of this paper are the following:
(1) A knowledge-enhanced language representation model based on graph convolutional networks, the KEBERT-GCN model, is proposed. It incorporates external knowledge into BERT, effectively uses fine-grained word-relationship information in the knowledge base to avoid knowledge noise, and captures the relational features of knowledge structure and text structure to judge the semantic similarity of sentences. Experiments show that KEBERT-GCN significantly outperforms BERT on four publicly available short-text similarity datasets.
(2) A syntactic similarity modeling structure, CPT-TK, is proposed that combines syntactic information, semantic features, and an attentional weighting mechanism to judge the syntactic similarity of sentences.
(3) Making full use of lexical, syntactic, and knowledge information, we propose a short-text similarity model that considers both semantic and syntactic similarity; experimental results show that it achieves better performance in computing short-text similarity on the STS-B dataset.

2. Related Work

2.1. Semantic Similarity

Semantic similarity is one of the important tasks in natural language processing; it aims to measure the semantic distance between given content blocks (words, sentences, or short texts). It plays an important role in various natural language processing tasks, such as text summarization [17], machine translation [18], keyword extraction [19], and question answering [20].
The emergence of pre-trained language models has driven the rapid development of natural language processing and achieved excellent results on semantic similarity tasks. A series of pre-trained language models such as ELMo [21], GPT [22], BERT [5], and XLNet [23] can be applied to semantic similarity tasks through fine-tuning, with excellent results. Among them, BERT is the most prominent pre-trained language model, and ALBERT [24], RoBERTa [25], Sentence-BERT [26], and DeBERTa [27] are variant models derived from it that achieve better semantic similarity performance by fine-tuning or modifying BERT.
Although fine-tuned language representation models have been very successful, when the average person reads a domain-specific text they can only understand words in context and cannot reason using relevant domain knowledge. Integrating knowledge information is therefore especially important for pre-trained models to understand short texts. To address the problem of the low information content of short text in semantic similarity calculation, many researchers have expanded the information content of short text by introducing external knowledge to improve semantic matching performance. Tsinghua ERNIE [28] is a pioneer in this direction: it is a pre-trained model that incorporates entity information but ignores the relationships between entities. SemBERT [9] adds knowledge to the original BERT model by adding explicit contextual semantics from pre-trained semantic role annotations. These models introduce external knowledge but do not make good use of the fine-grained word-relationship information in the knowledge base. WordNet (https://wordnet.princeton.edu/ accessed on 23 December 2022) [29] is a semantic concept knowledge base. It does not just arrange words in alphabetical order but also forms a ‘network of words’ based on their meanings; it is a semantic network covering a wide range of English words. The main relationship between words is synonymy; for example, car and automobile are synonyms. Synonyms are grouped into unordered synonym sets: words with the same meaning are contained in the same concept node (synonym set), and the different synonym sets are connected by various semantic relations, forming a topological network structure. Synonymy is the most fundamental semantic relation in WordNet; it holds symmetrically between words, and synonym sets are made up of several synonyms linked by pointers to synonymy relations. In addition to synonymy, common semantic relations include subordination, antonymy, and part-whole relations, the most common of which is subordination; e.g., tree is a subordinate word of plant, and plant is the superordinate word of tree. In our work, we primarily use WordNet's knowledge of synonymy and the topological network structure formed by the various semantic relations. We use them to construct a word similarity matrix, which allows the model to better capture the fine-grained word relations in the short text and to efficiently perform short-text semantic similarity calculations.
The problem of knowledge noise (some words are unimportant to the semantic understanding of the whole sentence, and mixing in too much knowledge can make the sentence deviate from its original meaning) may arise when retrieving the words of the short text from the knowledge base, interfering with the model's judgment and reducing performance. To avoid knowledge noise, our proposed KEBERT-GCN model extracts the fine-grained word-relationship information of the short text from the knowledge base and constructs a word similarity matrix, which is combined with BERT's attention matrix to enhance the model's attention to semantically similar word pairs.
The KIM model [10] builds on the baseline ESIM model [30] and uses word relations to help determine attention weights, improving ESIM's performance on the semantic similarity task by making the model place different emphasis on different words as a way of introducing external knowledge. Although KIM integrates external knowledge to capture word-relationship information, it ignores the relational features between knowledge structure and text structure. The GCN [13] model can capture the relational features between nodes, and many studies have shown that it effectively encodes the contextual information of input sentences [31,32]. Our GCN operates on the word similarity matrix constructed with WordNet to extract word synonymy information in the short text, allowing the model to pay more attention to semantically similar word pairs. However, a graph is topological and cannot by itself perceive sequential information about the words. In many works, relative position encoding (for each node, an encoding of its position with respect to every other node) has proved more effective than BERT's absolute position encoding (a sequential encoding of all node positions, in order, starting from 0) [27,33,34] and is more conducive to capturing contextual information about words.

2.2. Syntactic Similarity

Previously popular neural network approaches learned sentence embeddings. Le and Mikolov proposed Doc2vec [35], extending the word2vec [36] idea to learn semantic representations of whole sentences. Tien et al. [37] combined LSTM and CNN models to form sentence embeddings from pre-trained word embeddings. Tai et al. [38] proposed a Tree-LSTM model to evaluate sentence similarity. Most of these methods represent short text as a vector by learning its semantic information, without mining the syntactic information embedded in it. Moreover, most neural-network-based sentence representation models treat each word in a sentence equally [35,39], which is not how people normally read and understand sentences. Therefore, we take advantage of the attentional weighting mechanism [40] and use smooth inverse frequency (SIF) [41] to focus on the more important words in a sentence by assigning different weights to words. We also use pre-trained word embeddings, so that our model avoids time-consuming learning and training.
We construct a constituency parse tree for each short text to obtain its syntactic structure information, calculate the similarity between structure trees by counting the number of common substructures between two trees $T_1$ and $T_2$ with the tree kernel method (a method for quantifying the similarity between tree-structured data), and combine the attention weight mechanism with pre-trained word embeddings to calculate the syntactic similarity of two short texts. Existing tree kernel methods fall into three types: the subtree [42] (ST), subset tree [43] (SST), and partial tree [44] (PT) kernels. A subtree (ST) kernel captures the structure of all descendants from the target root node down to the leaf nodes. In contrast, subset trees (SSTs) allow internal subtrees without any leaf nodes. The partial tree (PT) kernel further relaxes the constraints of the SST, resulting in more flexible substructures that can have non-terminal leaf nodes. The tree kernel function on $T_1$ and $T_2$ is generally defined as:
$$TK(T_1, T_2) = \sum_{node_1 \in Set(T_1)} \sum_{node_2 \in Set(T_2)} \Delta(node_1, node_2) \tag{1}$$
where $Set(T_1)$ and $Set(T_2)$ denote the sets of nodes in the structure trees $T_1$ and $T_2$, respectively, and $node_1$ and $node_2$ denote nodes in the two trees. Different $\Delta(\cdot)$ functions represent different kernel spaces, and therefore different tree kernels can be generated.
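As a schematic illustration, the template above reduces to a double sum over all node pairs. In the sketch below, `delta` stands for whichever kernel-specific $\Delta(\cdot)$ (ST, SST, or PT) is plugged in, and the `nodes()` traversal helper is a hypothetical interface of the tree structure, not an API from the paper:

```python
# Generic tree kernel: sum Delta over all node pairs of the two trees.
# `delta` is the kernel-specific comparison function; `tree.nodes()` is an
# assumed helper that yields every node of a parse tree.
def tree_kernel(tree1, tree2, delta):
    return sum(delta(n1, n2) for n1 in tree1.nodes() for n2 in tree2.nodes())
```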
Therefore, we also propose a syntactic similarity modeling structure, CPT-TK, which combines syntactic information, semantic features, and attentional weighting mechanisms to discriminate the syntactic similarity of sentences.

3. Methodology

3.1. Semantic Similarity Model Framework

In this section, we first construct the word similarity matrix, then introduce the multi-headed attention in BERT and how external knowledge is incorporated into the BERT model, and finally model the relationship between knowledge structure and text structure using two GCN layers on the matrix obtained from BERT. As shown in Figure 1, our model (KEBERT-GCN) consists of three parts: the Encoder module, the Transformer module, and the GCN module. In the input phase, a [CLS] token is added before the first sentence, and a [SEP] token is added between the two sentences and at the end of the second sentence. In the Encoder module, our model is the same as BERT: the sum of the token, position, and segment embeddings is used. The word similarity matrix constructed from an external knowledge base is added to BERT's Transformer module to obtain the score matrix and the token embeddings carrying semantic information, and these are input into the two-layer GCN module to obtain vector representations of the two sentences.

3.1.1. Construction of the Word Similarity Matrix

We build the word similarity matrix with the help of external knowledge bases. This paper first uses the knowledge base WordNet to build the word similarity matrix. Of course, our model is generic and compatible with other knowledge bases such as Wikipedia (https://en.wikipedia.org accessed on 22 December 2022) and Probase (Available online at https://www.microsoft.com/en-us/research/project/probase/ accessed on 23 December 2022) [45], and we can also use the word/concept similarity knowledge from these knowledge bases to construct word similarity matrices.
Given two short texts (i.e., sentences) $S_1 = (W_1, W_2, \ldots, W_i, \ldots, W_{L(S_1)})$ and $S_2 = (W_1, W_2, \ldots, W_i, \ldots, W_{L(S_2)})$, we construct the word similarity matrix $S$. We calculate the value of each element of $S$ based on the semantic relationships in the WordNet knowledge base. For two words $W_a$ and $W_b$: if the pair of words are identical or synonymous in WordNet, the element value $S_{ab}$ is set to 1; if they are not synonyms, we calculate $S_{ab}$ using the Wu-Palmer method [46], which computes word similarity from topological distance in WordNet (in this case, $S_{ab}$ is a value between 0 and 1); if one or both of the words are not found in WordNet, we set their similarity value to 0.
We preprocessed all the datasets and calculated the word similarity matrix for each sentence pair in the dataset. The resulting matrix $S$ was used to enhance BERT's attention to semantically similar word pairs.
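To make the construction concrete, below is a minimal sketch using NLTK's WordNet interface (an assumption on our part; the paper does not name its WordNet toolkit). Synonym pairs score 1, other in-vocabulary pairs fall back to Wu-Palmer similarity, and out-of-vocabulary words score 0, matching the rules above:

```python
# Minimal sketch of the word similarity matrix S; requires nltk.download('wordnet').
import numpy as np
from nltk.corpus import wordnet as wn

def word_similarity(w_a: str, w_b: str) -> float:
    if w_a == w_b:
        return 1.0
    syns_a, syns_b = wn.synsets(w_a), wn.synsets(w_b)
    if not syns_a or not syns_b:
        return 0.0  # one or both words missing from WordNet
    if set(syns_a) & set(syns_b):
        return 1.0  # shared synset: the words are synonyms
    # Otherwise take the best Wu-Palmer score over all synset pairs
    return max((a.wup_similarity(b) or 0.0) for a in syns_a for b in syns_b)

def similarity_matrix(tokens_1, tokens_2):
    # One joint matrix over the concatenated sentence pair, as fed to BERT
    tokens = tokens_1 + tokens_2
    n = len(tokens)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            S[i, j] = word_similarity(tokens[i], tokens[j])
    return S
```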
Figure 2 shows the heat maps of two word similarity matrices. The sentence pairs are drawn from the STS-B dataset (each sentence pair has a rating from 0 to 5, with higher ratings indicating that the two sentences are more similar). The first pair (“A man is cycling” and “A man is talking”) has a true similarity score of 0.6, and the second pair (“A deer jumps a fence” and “A deer is jumping over a fence”) has a true similarity score of 5.0.

3.1.2. External Knowledge Integration into BERT for Multi-Headed Attention

In this section, we fine-tune the BERT-base model. The external knowledge matrix constructed above is added to the multi-headed attention of BERT. In the encoding phase, our model is the same as BERT: the sum of the token, position, and segment embeddings is used. In the Transformer phase, our approach is also similar to BERT. Multi-head attention is a process of mapping the query, key, and value vectors to an output vector. First, they undergo a linear transformation and are then fed into scaled dot-product attention (this is performed $h$ times, hence “multi-head”, with each repetition counting as one head and with different parameter matrices $W$ for each linear transformation of $Q$, $K$, and $V$); the results of the $h$ scaled dot-product attentions are then concatenated, and the value obtained from another linear transformation is the result of multi-head attention:
$$\begin{aligned} head_i &= \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \\ \mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(head_1, \ldots, head_h)W^O \end{aligned} \tag{2}$$
where $W_i^Q$, $W_i^K$, and $W_i^V$ are the parameter matrices of the $i$-th attention head for the query, key, and value, respectively, and $W^O$ is the weight matrix applied when the $h$ attention heads are concatenated.
The BERT attention is computed using the scaled dot-product:
$$\begin{aligned} Scores &= QK^T + MASK \\ \mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\left(\frac{Scores}{\sqrt{d_k}}\right)V \end{aligned} \tag{3}$$
where $MASK$ is the masked-language-model mask matrix and $\sqrt{d_k}$ serves as a scaling factor that keeps the inner product from becoming too large.
Unlike BERT, our model is injected with external knowledge: we take the Hadamard product with the word similarity matrix $S$ so that the model focuses more on word pairs with higher similarity in the sentence pair. The attention is adjusted as follows:
$$\begin{aligned} Scores &= QK^T \odot S + MASK \\ \mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\left(\frac{Scores}{\sqrt{d_k}}\right)V \end{aligned} \tag{4}$$
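For illustration, the following is a minimal PyTorch sketch of the knowledge-injected attention in Equation (4); the function name `knowledge_attention` and the tensor shapes are our assumptions, not details from the paper:

```python
# Knowledge-injected scaled dot-product attention: S modulates QK^T via a
# Hadamard product before the mask and softmax are applied.
import math
import torch
import torch.nn.functional as F

def knowledge_attention(Q, K, V, S, mask):
    # Q, K, V: (batch, heads, seq_len, d_k)
    # S, mask: (batch, 1, seq_len, seq_len); mask holds large negatives at
    # padded positions so that softmax assigns them ~0 weight
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) * S + mask  # QK^T ∘ S + MASK
    attn = F.softmax(scores / math.sqrt(d_k), dim=-1)
    return torch.matmul(attn, V)
```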

3.1.3. Scores Matrix Integration into the GCN

The GCN is a widely used architecture for encoding graph information: in each GCN layer, the information at each node is communicated to neighboring nodes through the connections between them. Many previous studies have demonstrated the effectiveness of the GCN model for encoding contextual information over input sentence graphs. Using the GCN model, the relationships between words in a text can be modeled as a graph structure, enabling better capture of both local and global information in the text.
Typically, graphs in the standard GCN model are constructed from word dependencies, represented by an adjacency matrix $A_{i,j}$. Based on $A_{i,j}$, for a given node $i$, the $l$-th GCN layer collects the relevant information (semantic information, syntactic information, etc.) carried by its contextual words in $A_{i,j}$ and computes the output representation $h_i^{(l)}$ for $i$ by:
$$h_i^{(l)} = \mathrm{ReLU}\left(\sum_{j=1}^{n} A_{i,j}\, W^{(l)} h_j^{(l-1)} + b^{(l)}\right) \tag{5}$$
where $h_j^{(l-1)}$ denotes the output representation of node $j$ in the $(l-1)$-th GCN layer, $W^{(l)}$ and $b^{(l)}$ are the trainable matrix and bias of the $l$-th GCN layer, respectively, and ReLU is the activation function.
Before applying the GCN, each sentence first passes through BERT to obtain the embedding $h_i$ of each token carrying semantic information; we then input $h_i$ into the GCN model. Unlike the standard GCN model, we use the score matrix $Scores$ as the adjacency matrix $A_{i,j}$ and each token as a node in the GCN, and we add relative position encoding [14] to the GCN so that it learns the relative position information of the tokens. Based on the score matrix, for a particular node $i$, we specify the first GCN layer as follows:
$$h_i^{(1)} = \mathrm{ReLU}\left(\frac{1}{d_i}\sum_{j=1}^{n} A_{i,j}\left(h_j + h_{P_j}\right) W^{(1)} + b^{(1)}\right) \tag{6}$$
where $d_i$ represents the degree of node $i$ (division by $d_i$ provides normalization), $A_{i,j}$ is the score matrix, $h_j$ is the token embedding of node $j$, and $h_{P_j}$ represents the relative position representation of $i$ and $j$.
Similar to the first GCN layer, we specify the second GCN layer as follows:
$$h_i^{(2)} = \mathrm{ReLU}\left(\frac{1}{d_i}\sum_{j=1}^{n} A_{i,j}\left(h_j^{(1)} + h_{P_j}\right) W^{(2)} + b^{(2)}\right) \tag{7}$$
where $h_j^{(1)}$ represents the output representation of node $j$ from the first layer.
After the two GCN layers, the vector $h_i^{(2)}$ of each token is obtained, and the sentence vector is then obtained by average pooling:
$$h_s = \frac{1}{m}\sum_{i=1}^{m} h_i^{(2)} \tag{8}$$
where $m$ is the total number of tokens, i.e., the length of the sentence sequence.
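To make the layer computation concrete, here is a hedged PyTorch sketch of one position-aware GCN layer following Equations (6) and (7); the class name `PositionAwareGCNLayer`, the clipping window `max_rel_pos`, and the tensor shapes are our assumptions, not details from the paper:

```python
# One position-aware GCN layer: the score matrix A acts as the adjacency
# matrix, relative position embeddings are added to node features, and each
# row's aggregation is normalized by the node degree.
import torch
import torch.nn as nn

class PositionAwareGCNLayer(nn.Module):
    def __init__(self, dim: int, max_rel_pos: int = 128):
        super().__init__()
        self.W = nn.Linear(dim, dim)  # carries the trainable W^(l) and b^(l)
        self.rel_pos = nn.Embedding(2 * max_rel_pos + 1, dim)
        self.max_rel_pos = max_rel_pos

    def forward(self, h, A):
        # h: (batch, seq_len, dim); A: (batch, seq_len, seq_len) score matrix
        n = h.size(1)
        # Relative offsets j - i, clipped and shifted to valid embedding ids
        pos = torch.arange(n, device=h.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_pos, self.max_rel_pos)
        h_p = self.rel_pos(rel + self.max_rel_pos)   # (seq_len, seq_len, dim)
        # Neighbor messages h_j + h_Pj, aggregated as sum_j A_ij * (...)
        msgs = h.unsqueeze(1) + h_p.unsqueeze(0)     # (batch, i, j, dim)
        agg = torch.einsum('bij,bijd->bid', A, msgs)
        deg = A.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # degree d_i
        return torch.relu(self.W(agg / deg))
```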
We present the proposed method using “A man is cycling” and “A man is talking” as examples.
  • The dataset is preprocessed and the word similarity matrix $S$ of sentences $S_1$ and $S_2$ is constructed according to Section 3.1.1;
  • In the BERT embedding stage, a [CLS] token is added at the beginning of the first sentence $S_1$, and a [SEP] token is added between the two sentences and at the end of the second sentence; the sum of the token, position, and segment embeddings is used as input. In the Transformer stage, external knowledge is incorporated into the BERT multi-headed attention, and the Hadamard product of $QK^T$ and $S$ is calculated to obtain the score matrix $Scores$;
  • The score matrix is incorporated into the GCN: $Scores$ acts as the adjacency matrix of the standard GCN model. The token embeddings of sentences $S_1$ and $S_2$ obtained from BERT are used as input to the two-layer GCN, the vector representation of each token is obtained after the two layers, and the vectors of the two sentences are then obtained separately by average pooling. Finally, as is conventional, a fully connected layer produces the logits, a Softmax layer produces probabilities, and the loss is calculated against the true labels.

3.1.4. Setting of Loss Function

For the classification and regression tasks, we experimented with the following structures and objective functions, respectively.
Classification objective function: focal loss was originally proposed to address sample imbalance in object detection; it modifies the standard cross-entropy loss by adjusting the category weights and the weights of easy- and hard-to-classify samples to improve the classification accuracy of the model. The cross-entropy loss function is as follows:
$$L = \begin{cases} -\ln y', & y = 1 \\ -\ln(1 - y'), & y = 0 \end{cases} \tag{9}$$
where $y$ denotes the true label and $y'$ denotes the predicted value.
The focal loss function introduces a category weight factor $\alpha \in (0, 1)$ to adjust the weight of samples from different categories. To address easy versus hard samples, focal loss also adds a factor $\gamma$ ($\gamma > 0$) to the loss function so that the algorithm pays more attention to hard-to-distinguish samples. The loss function after adding the adjustment factors is:
$$L_{FL} = \begin{cases} -\alpha (1 - y')^{\gamma} \ln y', & y = 1 \\ -(1 - \alpha)\, (y')^{\gamma} \ln(1 - y'), & y = 0 \end{cases} \tag{10}$$
Adjusting the values of $\alpha$ and $\gamma$ changes the sample weights, making the model focus more on minority and hard-to-classify samples and mitigating category imbalance at the algorithmic level, which further improves the accuracy of the classification model.
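For illustration, a minimal PyTorch sketch of the binary focal loss in Equation (10); the default values $\alpha = 0.25$ and $\gamma = 2.0$ are common choices from the focal loss literature, not values reported in this paper:

```python
# Binary focal loss: y_pred is a predicted probability, y_true a {0,1} label.
import torch

def focal_loss(y_pred, y_true, alpha=0.25, gamma=2.0, eps=1e-8):
    pos = -alpha * (1 - y_pred) ** gamma * torch.log(y_pred + eps)        # y = 1
    neg = -(1 - alpha) * y_pred ** gamma * torch.log(1 - y_pred + eps)    # y = 0
    return torch.where(y_true == 1, pos, neg).mean()
```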
Regression objective function: the similarity between the two sentence embeddings is calculated, and we use the mean squared error as the objective function:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2 \tag{11}$$
where $n$ denotes the number of samples, and $Y_i$ and $\hat{Y}_i$ denote the true and predicted labels, respectively. In particular, we calculate the final loss function as:
$$L_{FL} = L_{bert}^{fl} + L_{gcn}^{fl} \tag{12}$$
Because of the large number of parameters in the GCN and BERT models, BERT also participates in the loss through its [CLS] token, which allows the model to learn better. Moreover, since BERT sits in the middle of the model, supervising BERT's output with labels prevents overfitting and provides better node inputs to the GCN, allowing the model to learn semantic information better.
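A minimal sketch of how the joint objective in Equation (12) can be assembled; `loss_fn` stands for the focal loss (classification) or MSE (regression) defined above, and the function name is ours:

```python
# Joint objective: the same loss is applied to the BERT [CLS]-based prediction
# and to the GCN-based prediction, and the two terms are summed.
def joint_loss(loss_fn, bert_pred, gcn_pred, labels):
    return loss_fn(bert_pred, labels) + loss_fn(gcn_pred, labels)
```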

3.2. Syntactic Similarity Model Framework

The above methods compute the semantic similarity between short texts and achieve good performance. However, the syntactic information of the sentence is ignored. Take “She's very beautiful” and “Is she very beautiful”, for example: according to the above calculation method the two texts are similar, yet by human understanding they are not exactly similar. Therefore, we also propose a computational model of constituency parse trees based on tree kernels (CPT-TK) to judge the syntactic similarity of short texts.

3.2.1. Constructing the Constituency Parse Trees of the Short Text

First, we construct the CPT for each short text. In this paper, we adopt the method of previous work [47], which is simple to deploy and parallelizes quickly and easily to generate constituency parse trees for short text, obtaining state-of-the-art results on the Penn Treebank. Constituency parse trees $T_1$ and $T_2$ for sentences $S_1$ and $S_2$ were constructed using this method, as shown in Figure 3.

3.2.2. Similarity between CPTs Is Calculated Using the Tree Kernel Method

Next, syntactic similarity is calculated using the tree kernel method: the similarity between CPTs serves as the syntactic similarity. Most tree kernels consider only syntactic information; few consider semantic information and attentional weighting as well. Our model combines syntactic information, semantic information, and an attentional weighting mechanism, absorbing the advantages of all three.
Our model uses the PT kernel to compute syntactic similarity. Similarly to other tree kernel calculation methods, the PT tree kernel function PTK between trees $T_1$ and $T_2$ is defined as Equation (13):
$$PTK(T_1, T_2) = \sum_{node_1 \in Set(T_1)} \sum_{node_2 \in Set(T_2)} \Delta(node_1, node_2) \tag{13}$$
where $Set(T_1)$ and $Set(T_2)$ are the sets of nodes in trees $T_1$ and $T_2$, respectively, $node_1$ and $node_2$ are nodes in trees $T_1$ and $T_2$, respectively, and $\Delta(node_1, node_2)$ is the number of common fragments rooted at $node_1$ and $node_2$.
For the PT tree kernel, we define $\Delta(node_1, node_2)$ as in Equation (14):
$$\Delta(node_1, node_2) = \begin{cases} \mathrm{similarity}(vector_1, vector_2) \times weight_1 \times weight_2, & node_1, node_2 \in \text{leaf nodes} \\ \alpha\left(\beta^2 + \sum_{l=1}^{l_{\min}} \Delta_l(c_{node_1}, c_{node_2})\right), & node_1 = node_2 \ \text{and} \ node_1, node_2 \in \text{non-leaf nodes} \\ 0, & \text{otherwise} \end{cases} \tag{14}$$
In the above equation, $\alpha$ and $\beta$ are two decay factors, associated with the height of the tree and the length of child subsequences, respectively; $c_{node_1}$ and $c_{node_2}$ are the lists of child nodes of $node_1$ and $node_2$, respectively; $l_{\min}$ is the length of the smaller of the two child-node lists; $\Delta_l(\cdot)$ denotes the number of common subsequences of length $l$ in the child-node lists; $vector_1$ and $vector_2$ are the word vectors of $node_1$ and $node_2$, respectively; $weight_1$ and $weight_2$ represent the weights of $node_1$ and $node_2$, respectively; $\mathrm{similarity}(vector_1, vector_2)$ is the cosine similarity of the two vectors; and leaf nodes and non-leaf nodes denote the leaf and non-leaf nodes of the tree. In this paper, the weights are calculated following the SIF method, with the specific formula given in Equation (15):
$$\mathrm{SIF}(node) = \frac{\varphi}{\varphi + p(node)} \tag{15}$$
where $\varphi$ is the smoothing parameter in SIF and $p(node)$ is the word frequency of the word in the leaf node.
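For illustration, a small sketch of the leaf-node case of Equation (14) together with the SIF weight of Equation (15); the `embed` and `word_freq` lookups are assumed inputs, and `phi` defaults to the value of 0.0001 reported in Section 4.3:

```python
# Leaf-node Delta: cosine similarity of the two word vectors, scaled by the
# SIF weights of the two words (Equations (14) and (15)).
import numpy as np

def sif_weight(word: str, word_freq: dict, phi: float = 1e-4) -> float:
    return phi / (phi + word_freq.get(word, 0.0))

def leaf_delta(w1: str, w2: str, embed: dict, word_freq: dict) -> float:
    v1, v2 = embed[w1], embed[w2]
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return cos * sif_weight(w1, word_freq) * sif_weight(w2, word_freq)
```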
To compute Equation (14) in the non-leaf case, we define $\Delta_l(\cdot)$ through a recursive function, shown in Equation (16):
$$\Delta_l(c_{node_1}, c_{node_2}) = \Delta(p, q) \sum_{i=1}^{|n_1|} \sum_{r=1}^{|n_2|} \beta^{|n_1| - i + |n_2| - r} \, \Delta_{l-1}\big(n_1[1{:}i],\, n_2[1{:}r]\big) \tag{16}$$
In the above equation, $p$ and $q$ are the last child nodes, and $\Delta(p, q)$ is calculated by Equation (14); $n_1$ and $n_2$ are the child-node lists of $c_{node_1}$ and $c_{node_2}$, respectively, and $|n_1|$ and $|n_2|$ denote their lengths; $n_1[1{:}i]$ and $n_2[1{:}r]$ denote the subsequences from 1 to $i$ in $n_1$ and from 1 to $r$ in $n_2$, respectively; and $\Delta_{l-1}(\cdot)$ is calculated recursively using Equation (16), stopping when a leaf node is reached.
In summary, the syntactic similarity of two sentences is calculated as follows. First, the CPTs of the two sentences are constructed. Second, the similarity between the CPTs is calculated according to Equation (13). Finally, all similarity values are summed and normalized to obtain the final syntactic similarity.
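As a sketch of this final step, one common normalization divides the raw kernel value by the self-kernels of the two trees so the score falls in [0, 1]; the paper does not spell out its normalization, so this is an assumption:

```python
# Normalized partial tree kernel: `ptk` is assumed to implement Equation (13).
import math

def normalized_ptk(t1, t2, ptk):
    return ptk(t1, t2) / math.sqrt(ptk(t1, t1) * ptk(t2, t2))
```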

3.3. Short Textual Similarity Model Framework

In this paper, the similarity of short texts is calculated by fusing the semantic similarity model and the syntactic similarity model; the formula is shown in Equation (17).
$$\mathrm{Sentencesim}(S_1, S_2) = \theta \cdot \mathrm{Semanticsim} + (1 - \theta) \cdot \mathrm{Syntacticsim} \tag{17}$$
where Semanticsim denotes the semantic similarity of the sentence pair, calculated by the model described in Section 3.1; Syntacticsim denotes the syntactic similarity of the sentence pair, calculated by the model described in Section 3.2; and $\theta \in [0, 1]$ is a weighting parameter.
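The fusion itself is a one-liner; the default $\theta = 0.85$ below anticipates the value selected in Section 4.3:

```python
# Linear fusion of semantic and syntactic similarity (Equation (17)).
def sentence_similarity(semantic_sim: float, syntactic_sim: float,
                        theta: float = 0.85) -> float:
    return theta * semantic_sim + (1 - theta) * syntactic_sim
```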

4. Experiments

In the experimental section, we first describe the datasets and experimental setup, then describe the baseline model and the comparison methods, then design an ablation experiment and a comparison experiment to evaluate our proposed model, and finally analyze the experimental results and discuss the effects of two parameters.

4.1. Dataset and Experimental Setup

4.1.1. Dataset

In our experiments, we used four popular publicly available benchmark short-text similarity datasets: MRPC, STS-B, QQP, and SICK.
  • MRPC (the dataset can be downloaded from https://www.microsoft.com/en-us/download/details.aspx?id=52398 accessed on 11 November 2022) [48]: Microsoft Research's standard dataset for paraphrase recognition, extracted from news sources on the web, with human-annotated sentence pairs; each text pair carries a manual binary judgment of whether the two sentences are paraphrases. It contains a total of 5801 text pairs, with 4076 training pairs and 1725 test pairs. In this experiment, we split off 10% of the training data as the validation set, following the GLUE [49] criterion. Accuracy and F1 values are reported in the experiments.
  • QQP (the dataset can be downloaded from https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs accessed on 11 November 2022) [50]: a collection of question pairs from the community question-and-answer site Quora. The task is to determine whether a pair of questions is semantically equivalent. The QQP dataset consists of 404,350 question pairs, each annotated with a binary value indicating whether the two questions are paraphrases. Following the partitioning approach of [51], we randomly selected 5000 paraphrase and 5000 non-paraphrase pairs as the validation set, and another 5000 paraphrase and 5000 non-paraphrase pairs as the test set, keeping the remaining instances for training. As with MRPC, we report accuracy and F1 values.
  • STS-B (the dataset can be downloaded from http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark accessed on 11 November 2022) [52]: text from image captions, news headlines, and user forums, comprising 5749 training, 1500 development, and 1379 test pairs. Each sentence pair was annotated by humans with a similarity score of 0–5 (a floating-point number between 0 and 5, where 0 indicates that the two sentences are unrelated and 5 indicates that they are very related); each score is the average of 10 scores from 10 different human annotators. We report the Pearson correlation coefficient (PCC) and Spearman correlation coefficient (SCC) as evaluation metrics.
  • SICK (the dataset can be downloaded from http://clic.cimec.unitn.it/composes/sick.html accessed on 11 November 2022) [53]: contains 10,000 English sentence pairs drawn from two pre-existing datasets: the 8K ImageFlickr dataset and the SemEval-2012 semantic textual similarity video description dataset. It consists of 4439 training, 495 development, and 4906 test pairs. Each sentence pair was annotated by humans with a relatedness score of 1–5 (a floating-point number between 1 and 5, where 1 indicates that the two sentences are unrelated and 5 indicates that they are significantly related). Again, we report the Pearson correlation coefficient (PCC) and Spearman correlation coefficient (SCC).
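For reference, one common way to load these four benchmarks is through the Hugging Face datasets library (an assumption; the paper does not state its data-loading tooling):

```python
# Loading the four benchmarks via the Hugging Face `datasets` library.
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")   # accuracy / F1
qqp = load_dataset("glue", "qqp")     # accuracy / F1
stsb = load_dataset("glue", "stsb")   # Pearson / Spearman
sick = load_dataset("sick")           # Pearson / Spearman
```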

4.1.2. Baseline and Comparison Methods

To evaluate our proposed approach, we first designed an ablation experiment to evaluate the semantic similarity model, comparing our proposed model with the state-of-the-art pre-trained model BERT-base. We used the same Transformer setup as the baseline BERT-base model: 12 layers, 12 self-attention heads, and a hidden dimension of 768. The ablation compares our two proposed models, without external knowledge (BERT-GCN) and with external knowledge (KEBERT-GCN).
We then designed a comparison experiment in which we compared our approach with the following methods.
  • Neural network models include:
    • Skip-Thought [39]: a sentence-embedding model for predicting contextual sentences from central sentences.
    • Constituency Tree-LSTM [38]: a model that analyzes the syntactic information of sentences and calculates the similarity of sentences using constructed constituency parse trees.
    • SIF [41]: a sentence-embedding model considering word frequency and attention weighting mechanisms.
  • Pre-trained neural network models include:
    • SemBERT [9]: a model for language representation using explicit semantic role annotation.
    • Tsinghua ERNIE [28]: a pre-trained language model based on BERT that introduces prior knowledge of named entities from the knowledge graph.
    • Baidu ERNIE 2.0 [11]: a continual learning framework with external knowledge and multi-task training.
    • CF-BERT [54]: a language model that introduces a weighting factor by emphasizing syntactic relations.
    • FEFS3C [55]: a joint method for computing frame and item-focused sentence similarity.
    • PromptBERT [56]: a prompt-based sentence embedding method using denoising techniques for unsupervised training objectives.

4.1.3. Other Settings and Implementation Details

We implemented the entire model in the PyTorch framework and trained it on an NVIDIA GeForce RTX 3090 graphics card. We used the Adam optimizer; the parameter settings for the four datasets are shown in Table 1.
In the semantic similarity model, we considered SemBERT, Tsinghua ERNIE, Baidu ERNIE 2.0, CF-BERT, FEFS3C, and PromptBERT for comparison experiments with our model; for a fair comparison, all of them are BERT-base-based variants in the experiments.
In the syntactic similarity model, we used word vectors provided by paragram-sl999 (PSL) (https://www.cs.cmu.edu/~jwieting/ accessed on 10 January 2023). The wiki (https://dumps.wikimedia.org/enwiki/latest/ accessed on 10 January 2023) dataset was used to compute the SIF weights. The two decay factors of Equation (14) did not have a significant impact on model performance, and we set them to 0.1 by default.

4.2. Experimental Results and Analysis

To evaluate the performance of the model KEBERT-GCN, we performed ablation experiments on four publicly available datasets, and the results are shown in Table 2.
We conducted ablation experiments with our two proposed models, without external knowledge (BERT-GCN) and with external knowledge (KEBERT-GCN). The score matrix injected with external knowledge is constructed as in Equation (4) in Section 3.1.2, while the matrix without external knowledge is BERT's attention matrix, constructed as in Equation (3) in Section 3.1.2.
The experiments show that BERT-GCN outperforms BERT on all experimental datasets, because the attention matrix itself contains semantic, syntactic, and inter-token contextual information. Compared with BERT-GCN, KEBERT-GCN constructs word similarity matrices from external knowledge and, with the addition of relative position information in the GCN, models the relationships between tokens and the synonymy information obtained from external knowledge as a graph structure. It thus better captures the local information, fine-grained word-relationship information, and word position information in the short text, and performs better on all four datasets. This also demonstrates that external knowledge contributes to the pre-trained model's understanding of short texts.
In conclusion, our proposed models BERT-GCN and KEBERT-GCN show a better performance than the state-of-the-art BERT model on all four datasets.
We designed a comparison experiment to compare our proposed model with models that have performed relatively well in recent years. We experimented on the STS-B dataset and the experimental results are shown in Table 3.
The experiments showed that the pre-trained models significantly outperform the neural network models Skip-Thought, Constituency Tree-LSTM, and SIF on the short-text similarity task. Skip-Thought, which predicts the context of a central sentence following the Skip-gram idea, achieved good results on STS-B. SIF takes the attentional weighting mechanism into account, but not the syntactic information of the sentence. Constituency Tree-LSTM captures the syntactic structure of the sentence through a constituency parse tree, but ignores the importance of the attentional weighting mechanism. Both are deficient compared to our proposed syntactic similarity model (CPT-TK), which, as the results show, performs better than they do on the STS-B dataset. Compared to the pre-trained models, however, its performance is not outstanding, possibly because numerical items or special characters in the short text weaken the model's performance.
SemBERT, Tsinghua ERNIE, and Baidu ERNIE all add external knowledge on top of BERT. They achieve good results compared with BERT, demonstrating that adding external knowledge can significantly enhance the original model, address the lack of semantic information in short texts, and improve text-matching performance. However, SemBERT, Tsinghua ERNIE, and Baidu ERNIE do not consider the noise problem that the introduction of an external knowledge base may bring. To avoid this noise, KEBERT-GCN constructs the word similarity matrix and combines it with BERT's attention matrix to enhance the model's attention to semantically similar word pairs. The experiments show that, compared with these models, KEBERT-GCN uses fine-grained word-relationship information from the knowledge base more effectively for short-text matching and captures the relational features of knowledge structure and text structure with the GCN model, showing better performance on the STS-B dataset. CF-BERT introduces weighting factors by emphasizing syntactic relations, considering the enhancement of complete sentences with their key components, but ignores fine-grained word-relationship information. FEFS3C considers ‘sentence meaning’ and ‘key sentence information’. PromptBERT uses prompts to improve sentence representation and treats the sentence as a whole, but focuses neither on the importance of external knowledge to the language model nor on the relationships between words. Although these models fine-tune or improve BERT and achieve good semantic similarity performance, our model's focus is more comprehensive, and it achieves a higher PCC on the STS-B dataset.
Meanwhile, the experimental results show that the KEBERT-GCN+CPT-TK model, which combines semantic and syntactic information, performs better on the dataset than KEBERT-GCN alone, demonstrating that the CPT-TK model compensates for the lack of syntactic structure information in KEBERT-GCN. Compared with the classical models of recent years, our model mines both the semantic and syntactic information of short texts, allowing it to judge short-text similarity more comprehensively, with a Pearson correlation coefficient of 88.05%.

4.3. Influence of Parameters

We explored the effect of the two model parameters $\theta$ and $\varphi$ on performance. To find the parameter values that achieve optimal performance, we judge the optimal values of $\theta$ and $\varphi$ by the Pearson correlation coefficient (PCC) on the STS-B dataset.
As shown in Equation (15), we introduce the SIF attention mechanism, where $\varphi$ is a smoothing parameter. As the line graph of the STS-B experiments shows, when the value of $\varphi$ is small the model does not achieve its best performance, because $\varphi$ is smaller than the word frequency $p(node)$. As $\varphi$ increases, the PCC first rises to its maximum and then decreases, consistent with the principle that lower-frequency words carry more information. Based on the graph, we set $\varphi = 0.0001$ in this paper.
As shown in Equation (17), we use a linear strategy to combine semantic and syntactic similarity, where $\theta$ is a weighting parameter that balances the two components. Figure 4 shows the results of the experiments on STS-B. As the value of $\theta$ increases, the PCC also increases to its maximum and then decreases, with the optimum between 0.8 and 0.9. Semantic similarity thus accounts for a large proportion of short-text similarity judgments. This is because short texts usually do not strictly adhere to the syntactic structure of written language, so semantic similarity dominates. After fine-tuning over a small range, we set $\theta = 0.85$ in this paper.

5. Conclusions and Future Work

Short-text similarity is one of the important applications of natural language processing. Most previous studies have focused only on the semantic similarity of short texts, and little attention has been paid to their syntactic similarity. In this paper, we first propose the KEBERT-GCN semantic similarity model, which improves semantic similarity performance by modeling the relationship between knowledge structure and text structure and effectively using the fine-grained word-relationship information in the knowledge base. Experiments show that it outperforms existing models that integrate external knowledge on the STS-B dataset. Then, we propose a tree-kernel-based computation method for constituency parse trees (CPT-TK) that combines syntactic information, semantic features, and an attentional weighting mechanism to obtain syntactic structure information and judge the syntactic similarity of short texts. Finally, we combine these two models into a short-text similarity model that integrates semantic and syntactic information. Our model achieves significant performance gains on the short-text similarity task, which shows that the combination of semantic and syntactic information is effective. These findings provide useful hints and directions for further exploring and improving short-text similarity techniques.
In the future, the following points are worth attention: (1) When constructing word similarity matrices, WordNet does not include many proper nouns, nor does it resolve word ambiguity. (2) KEBERT-GCN is a generic model based on fine-tuning the backbone framework of BERT, and BERT could be replaced with RoBERTa [25], DeBERTa [27], or other BERT variants; this is worth further investigation in subsequent work. (3) CPT-TK is not very sensitive to numbers, such as telephone numbers and the prices of goods mentioned in short texts.

Author Contributions

Conceptualization, C.L.; Formal Analysis—data curation, C.L., Y.Z. and G.H.; Investigation, H.L., Q.G. and X.W.; Writing—Original Draft Preparation, C.L. and G.H.; Writing—Review and Editing, Y.Z. and G.H.; Visualization, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (No. 62066009), Guangxi Key Research & Development Program (Grant No. Gui Ke AB22080047), and the Key Research and Development Project of Guilin (No. 2020010308).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to further research plans.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Martinez-Rodriguez, J.L.; Hogan, A.; Lopez-Arevalo, I. Information extraction meets the semantic web: A survey. Semant. Web 2020, 11, 255–335. [Google Scholar] [CrossRef]
  2. Karpukhin, V.; Oğuz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.T. Dense passage retrieval for open-domain question answering. arXiv 2020, arXiv:2004.04906. [Google Scholar]
  3. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep learning–based text classification: A comprehensive review. ACM Comput. Surv. CSUR 2021, 54, 1–40. [Google Scholar] [CrossRef]
  4. Chandrasekaran, D.; Mago, V. Evolution of semantic similarity—A survey. ACM Comput. Surv. CSUR 2021, 54, 1–37. [Google Scholar] [CrossRef]
  5. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar]
  6. Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A survey on contrastive self-supervised learning. Technologies 2020, 9, 2. [Google Scholar] [CrossRef]
  7. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the NIPS, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  8. Zhang, Z.; Wu, Y.; Zhou, J.; Duan, S.; Zhao, H.; Wang, R. SG-Net: Syntax-Guided Machine Reading Comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2019. [Google Scholar]
  9. Zhang, Z.; Wu, Y.; Hai, Z.; Li, Z.; Zhang, S.; Zhou, X.; Zhou, X. Semantics-aware BERT for Language Understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2019. [Google Scholar]
  10. Chen, Q.; Zhu, X.D.; Ling, Z.; Inkpen, D.; Wei, S. Neural Natural Language Inference Models Enhanced with External Knowledge. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017. [Google Scholar]
  11. Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Tian, H.; Wu, H.; Wang, H. Ernie 2.0: A continual pre-training framework for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 34, pp. 8968–8975. [Google Scholar]
  12. Liu, W.; Zhou, P.; Zhao, Z.; Wang, Z.; Ju, Q.; Deng, H.; Wang, P. K-bert: Enabling language representation with knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 2901–2908. [Google Scholar]
  13. Kipf, T.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  14. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-Attention with Relative Position Representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics, New Orleans, LA, USA, 1–6 June 2018. [Google Scholar]
  15. Severyn, A.; Nicosia, M.; Moschitti, A. Building structures from classifiers for passage reranking. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, San Francisco, CA, USA, 27 October–1 November 2013. [Google Scholar]
  16. Croce, D.; Moschitti, A.; Basili, R. Structured Lexical Similarity via Convolution Kernels on Dependency Trees. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, 27–31 July 2011. [Google Scholar]
  17. Mohamed, M.; Oussalah, M. SRL-ESA-TextSum: A text summarization approach based on semantic role labeling and explicit semantic analysis. Inf. Process. Manag. 2019, 56, 1356–1372. [Google Scholar] [CrossRef]
  18. Zou, W.Y.; Socher, R.; Cer, D.; Manning, C.D. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1393–1398. [Google Scholar]
  19. Chen, L.C. An Improved Corpus-Based NLP Method for Facilitating Keyword Extraction: An Example of the COVID-19 Vaccine Hesitancy Corpus. Sustainability 2023, 15, 3402. [Google Scholar] [CrossRef]
  20. Lopez-Gazpio, I.; Maritxalar, M.; Gonzalez-Agirre, A.; Rigau, G.; Uria, L.; Agirre, E. Interpretable semantic textual similarity: Finding and explaining differences between sentences. Knowl. Based Syst. 2017, 119, 186–199.
  21. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics, New Orleans, LA, USA, 1–6 June 2018.
  22. Radford, A.; Narasimhan, K. Improving Language Understanding by Generative Pre-Training; OpenAI: San Francisco, CA, USA, 2018.
  23. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.G.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019.
  24. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv 2019, arXiv:1909.11942.
  25. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
  26. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084.
  27. He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-Enhanced BERT with Disentangled Attention. arXiv 2020, arXiv:2006.03654.
  28. Zhang, Z.; Han, X.; Liu, Z.; Jiang, X.; Sun, M.; Liu, Q. ERNIE: Enhanced Language Representation with Informative Entities. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019.
  29. Miller, G.A. WordNet: A lexical database for English. Commun. ACM 1995, 38, 39–41.
  30. Chen, Q.; Zhu, X.; Ling, Z.; Wei, S.; Jiang, H.; Inkpen, D. Enhanced LSTM for Natural Language Inference. arXiv 2016, arXiv:1609.06038.
  31. Tian, Y.; Chen, G.; Song, Y. Aspect-based Sentiment Analysis with Type-aware Graph Convolutional Networks and Layer Ensemble. In Proceedings of the North American Chapter of the Association for Computational Linguistics, Online, 6–11 June 2021.
  32. Mandya, A.; Bollegala, D.; Coenen, F. Graph Convolution over Multiple Dependency Sub-graphs for Relation Extraction. In Proceedings of the International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020.
  33. Wei, J.; Ren, X.; Li, X.; Huang, W.; Liao, Y.; Wang, Y.; Lin, J.; Jiang, X.; Chen, X.; Liu, Q. NEZHA: Neural Contextualized Representation for Chinese Language Understanding. arXiv 2019, arXiv:1909.00204.
  34. Su, J.; Lu, Y.; Pan, S.; Wen, B.; Liu, Y. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv 2021, arXiv:2104.09864.
  35. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1188–1196.
  36. Mikolov, T.; Chen, K.; Corrado, G.S.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations, Scottsdale, AZ, USA, 2–4 May 2013.
  37. Tien, H.N.; Le, N.M.; Tomohiro, Y.; Tatsuya, I. Sentence Modeling via Multiple Word Embeddings and Multi-level Comparison for Semantic Textual Similarity. Inf. Process. Manag. 2019, 56, 102090.
  38. Tai, K.S.; Socher, R.; Manning, C.D. Improved Semantic Representations from Tree-Structured Long Short-Term Memory Networks. arXiv 2015, arXiv:1503.00075.
  39. Kiros, R.; Zhu, Y.; Salakhutdinov, R.R.; Zemel, R.; Urtasun, R.; Torralba, A.; Fidler, S. Skip-Thought Vectors. arXiv 2015, arXiv:1506.06726.
  40. Wang, S.; Zhang, J.; Zong, C. Learning Sentence Representation with Guidance of Human Attention. arXiv 2016, arXiv:1609.09189.
  41. Arora, S.; Liang, Y.; Ma, T. A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
  42. Vishwanathan, S.V.N.; Smola, A. Fast Kernels for String and Tree Matching. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 9–14 December 2002.
  43. Moschitti, A. Making Tree Kernels Practical for Natural Language Learning. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, 5–6 April 2006.
  44. Moschitti, A. Efficient Convolution Kernels for Dependency and Constituent Syntactic Trees. In Proceedings of the European Conference on Machine Learning, Berlin, Germany, 18–22 September 2006.
  45. Wu, W.; Li, H.; Wang, H.; Zhu, K.Q. Probase: A probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Scottsdale, AZ, USA, 20–24 May 2012.
  46. Wu, Z.; Palmer, M. Verb Semantics and Lexical Selection. arXiv 1994, arXiv:cmp-lg/9406033.
  47. Mrini, K.; Dernoncourt, F.; Bui, T.; Chang, W.; Nakashole, N. Rethinking Self-Attention: Towards Interpretability in Neural Parsing. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020.
  48. Dolan, W.B.; Brockett, C. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the International Joint Conference on Natural Language Processing, Jeju Island, Republic of Korea, 11–13 October 2005.
  49. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv 2018, arXiv:1804.07461.
  50. Chandra, A.; Stefanus, R. Experiments on Paraphrase Identification Using Quora Question Pairs Dataset. arXiv 2020, arXiv:2006.02648.
  51. Wang, Z.; Hamza, W.; Florian, R. Bilateral Multi-Perspective Matching for Natural Language Sentences. In Proceedings of the International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017.
  52. Cer, D.M.; Diab, M.T.; Agirre, E.; Lopez-Gazpio, I.; Specia, L. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the International Workshop on Semantic Evaluation, Vancouver, BC, Canada, 3–4 August 2017.
  53. Marelli, M.; Menini, S.; Baroni, M.; Bentivogli, L.; Bernardi, R.; Zamparelli, R. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the International Conference on Language Resources and Evaluation, Reykjavik, Iceland, 26–31 May 2014.
  54. Yin, X.; Zhang, W.; Zhu, W.; Liu, S.; Yao, T. Improving Sentence Representations via Component Focusing. Appl. Sci. 2020, 10, 958.
  55. Wang, T.; Shi, H.; Liu, W.; Yan, X. A joint FrameNet and element focusing Sentence-BERT method of sentence similarity computation. Expert Syst. Appl. 2022, 200, 117084.
  56. Jiang, T.; Jiao, J.; Huang, S.; Zhang, Z.; Wang, D.; Zhuang, F.; Wei, F.; Huang, H.; Deng, D.; Zhang, Q. PromptBERT: Improving BERT Sentence Embeddings with Prompts. arXiv 2022, arXiv:2201.04337.
Figure 1. KEBERT-GCN.
Figure 2. Heat map of the word similarity matrix.
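A word-similarity matrix like the one visualized in Figure 2 can be derived from knowledge-base relations such as WordNet [29] scored with Wu-Palmer similarity [46]. The snippet below is a minimal sketch of that computation under our own assumptions (the word list and plotting details are illustrative); it is not the paper's implementation.

```python
# Sketch: build and plot a word-similarity matrix from WordNet Wu-Palmer scores.
# Assumes nltk's WordNet data is installed: nltk.download('wordnet')
import numpy as np
import matplotlib.pyplot as plt
from nltk.corpus import wordnet as wn

def wup_sim(w1: str, w2: str) -> float:
    """Best Wu-Palmer similarity over all synset pairs of the two words."""
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

words = ["dog", "cat", "car", "truck"]  # hypothetical example words
sim = np.array([[wup_sim(a, b) for b in words] for a in words])

plt.imshow(sim, cmap="viridis")
plt.xticks(range(len(words)), words)
plt.yticks(range(len(words)), words)
plt.colorbar(label="Wu-Palmer similarity")
plt.show()
```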
Figure 3. Constituency parse trees T1 and T2.
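Constituency parse trees such as T1 and T2 in Figure 3 are compared in CPT-TK with a convolution tree kernel [42,43,44]. As a minimal sketch of the underlying idea only, the snippet below counts common subtree fragments between two toy parses in the classic Collins-Duffy style; the paper's CPT-TK additionally weights nodes with semantic features and attention, which is omitted here.

```python
# Sketch: a Collins-Duffy-style subtree kernel over two constituency parses.
from nltk import Tree

def common_subtrees(n1: Tree, n2: Tree, lam: float = 0.5) -> float:
    """Count common subtree fragments rooted at nodes n1 and n2,
    with decay factor lam down-weighting larger fragments."""
    prod1 = [c.label() if isinstance(c, Tree) else c for c in n1]
    prod2 = [c.label() if isinstance(c, Tree) else c for c in n2]
    if n1.label() != n2.label() or prod1 != prod2:
        return 0.0  # productions differ: no shared fragments here
    score = lam
    for c1, c2 in zip(n1, n2):
        if isinstance(c1, Tree) and isinstance(c2, Tree):
            score *= 1.0 + common_subtrees(c1, c2, lam)
    return score

def tree_kernel(t1: Tree, t2: Tree) -> float:
    """Sum fragment counts over all node pairs of the two parse trees."""
    return sum(common_subtrees(a, b)
               for a in t1.subtrees() for b in t2.subtrees())

t1 = Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))")
t2 = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBZ sleeps)))")
print(tree_kernel(t1, t2))
```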
Figure 4. Effect of different θ and φ on PCC on the STS-B dataset.
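Figure 4 sweeps the weights θ and φ that balance the semantic (KEBERT-GCN) and syntactic (CPT-TK) scores. A weighted-sum fusion is one plausible reading of that sweep; the sketch below (the combined_similarity helper is hypothetical) assumes that form for illustration and may differ from the paper's exact fusion.

```python
# Sketch: weighted fusion of semantic and syntactic similarity scores,
# assuming a simple weighted sum with weights theta and phi.
def combined_similarity(sem_score: float, syn_score: float,
                        theta: float, phi: float) -> float:
    return theta * sem_score + phi * syn_score

# Toy usage: one sentence pair with hypothetical per-model scores.
print(combined_similarity(sem_score=0.91, syn_score=0.76, theta=0.8, phi=0.2))
```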
Table 1. The main parameter settings on the four datasets.

| Dataset | Learning Rate | Epochs | Max Length | Batch Size |
|---------|---------------|--------|------------|------------|
| MRPC    | 2 × 10⁻⁵      | 3      | 128        | 32         |
| QQP     | 3 × 10⁻⁵      | 3      | 75         | 16         |
| STS-B   | 5 × 10⁻⁵      | 3      | 128        | 64         |
| SICK    | 5 × 10⁻⁵      | 3      | 80         | 64         |
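Purely as an illustration, the Table 1 hyperparameters could be wired into a standard fine-tuning setup as below. The Hugging Face classes are our assumption; the paper does not state its training framework here.

```python
# Sketch: plug the Table 1 hyperparameters into a common fine-tuning setup.
from transformers import AutoTokenizer, TrainingArguments

# (learning_rate, epochs, max_length, batch_size) per dataset, from Table 1.
CONFIGS = {
    "MRPC":  (2e-5, 3, 128, 32),
    "QQP":   (3e-5, 3, 75, 16),
    "STS-B": (5e-5, 3, 128, 64),
    "SICK":  (5e-5, 3, 80, 64),
}

lr, epochs, max_len, batch = CONFIGS["STS-B"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
args = TrainingArguments(
    output_dir="out",
    learning_rate=lr,
    num_train_epochs=epochs,
    per_device_train_batch_size=batch,
)
# Encode a sentence pair with the dataset's max length.
enc = tokenizer("A man is playing a guitar.", "A person plays guitar.",
                truncation=True, max_length=max_len, padding="max_length")
```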
Table 2. Ablation experiments on four datasets.

| Model \ Dataset    | MRPC            | SICK            | STS-B           | QQP             |
|--------------------|-----------------|-----------------|-----------------|-----------------|
| BERT               | 84.80/88.90     | 87.83/81.40     | 87.10/85.80     | 91.93/91.70     |
| BERT-GCN (ours)    | 87.01/90.33     | 87.86/82.78     | 87.39/86.62     | 92.74/92.49     |
| KEBERT-GCN (ours)  | **87.99/90.58** | **88.33/82.83** | **87.87/87.07** | **94.75/94.60** |

For a fair comparison, each value in the table is the average of five runs with different random seeds, and the best average in each column is indicated in bold. MRPC and QQP are reported as accuracy/F1 (macro F1); SICK and STS-B are reported as Pearson correlation coefficient (PCC)/Spearman correlation coefficient (SCC).
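For reference, the four metrics in the note above correspond to standard scipy and scikit-learn calls; the labels, predictions, and scores below are hypothetical.

```python
# Sketch: the Table 2 evaluation metrics with standard library calls.
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import accuracy_score, f1_score

# Classification datasets (MRPC, QQP): accuracy and macro F1.
y_true, y_pred = [1, 0, 1, 1], [1, 0, 0, 1]
print(accuracy_score(y_true, y_pred), f1_score(y_true, y_pred, average="macro"))

# Regression datasets (SICK, STS-B): PCC and SCC between gold and predicted scores.
gold, pred = [4.5, 2.0, 3.8, 1.0], [4.2, 2.4, 3.9, 1.5]
print(pearsonr(gold, pred)[0], spearmanr(gold, pred)[0])
```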
Table 3. Results from different models on the STS-B dataset.

| Model                      | PCC       | SCC       |
|----------------------------|-----------|-----------|
| Skip-Thought               | 71.80     | 69.70     |
| Constituency Tree-LSTM     | 71.90     | -         |
| SIF                        | 72.00     | -         |
| BERT                       | 87.10     | 85.80     |
| SemBERT (base)             | 87.30     | -         |
| Tsinghua ERNIE (base)      | -         | 83.20     |
| Baidu ERNIE (base)         | 87.60     | 86.50     |
| CF-BERT (base)             | 87.58     | 86.29     |
| FEFS3C (base)              | -         | 86.78     |
| PromptBERT (base)          | 84.56     | -         |
| KEBERT-GCN (ours)          | 87.74     | 87.07     |
| CPT-TK (ours)              | 76.42     | 75.31     |
| KEBERT-GCN+CPT-TK (ours)   | **88.05** | **87.30** |

KEBERT-GCN and CPT-TK denote our proposed semantic-similarity and syntactic-similarity models, respectively; KEBERT-GCN+CPT-TK is our proposed short-text similarity model. Each result in the table is the average of five runs with different random seeds, and the best average in each column is indicated in bold. PCC: Pearson correlation coefficient; SCC: Spearman correlation coefficient.