Article

A Dynamic Attention and Multi-Strategy-Matching Neural Network Based on Bert for Chinese Rice-Related Answer Selection

Haoriqin Wang, Huarui Wu, Qinghu Wang, Shicheng Qiao, Tongyu Xu and Huaji Zhu
1 School of Information and Electrical Engineering, Shenyang Agricultural University, Shenyang 110866, China
2 College of Computer Science and Technology, Inner Mongolia Minzu University, Tongliao 028043, China
3 National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
4 Research Center for Information Technology, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
5 Intelligent Equipment Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
* Authors to whom correspondence should be addressed.
Agriculture 2022, 12(2), 176; https://doi.org/10.3390/agriculture12020176
Submission received: 27 December 2021 / Revised: 24 January 2022 / Accepted: 24 January 2022 / Published: 26 January 2022
(This article belongs to the Section Digital Agriculture)

Abstract

To allow the intelligent detection of correct answers in the rice-related question-and-answer (Q&A) community of the “China Agricultural Technology Extension Information Platform”, we propose an answer selection model with dynamic attention and multi-strategy matching (DAMM). According to the characteristics of the rice-related dataset, the twelve-layer Chinese Bert pre-training model was employed to vectorize the text data and was compared with the Word2vec, GloVe, and TF-IDF (Term Frequency–Inverse Document Frequency) methods. The results showed that Bert could effectively alleviate the high dimensionality and sparsity of agricultural text, as well as the problem of polysemy, in which a word takes different meanings in different contexts. In the attention layer, dynamic attention with two different filtering strategies was used to remove noise from the sentences and obtain the representations of the question and candidate answer. Secondly, two matching strategies (Full matching and Attentive matching) were introduced in the matching layer to complete the interaction between the sentence vectors. Thirdly, a bi-directional gated recurrent unit (BiGRU) network spliced the sentence vectors obtained from the matching layer. Finally, a classifier was employed to calculate the similarity of the spliced vectors, from which the semantic correlation between the question and answer sentences was acquired. The experimental results showed that DAMM achieved the best performance on the rice-related answer selection dataset, with an MAP (Mean Average Precision) of 85.7% and an MRR (Mean Reciprocal Rank) of 88.9%, establishing a new state of the art compared with six other answer selection models.

1. Introduction

Rice is one of the essential food crops in China, with a wide planting area. Diseases and insect pests are critical factors affecting rice yield and quality; thus, it is essential to obtain treatment methods for rice-related problems quickly and accurately during the planting process. With the rapid development of the internet, asking, answering, and discussing questions in online question-and-answer (Q&A) communities [1] has become an essential way for users to seek answers and meet their own information needs. The “China Agricultural Technology Extension Information Platform” is a comprehensive service platform that provides an agricultural technology Q&A community, expert guidance, online learning, achievement delivery, and knowledge exchange, among other services, and it plays a vital role in helping farmers find solutions to problems. Its rice-related Q&A community has accumulated a large number of users and a large amount of content, including many low-quality texts, which greatly reduces the efficiency with which users retrieve satisfactory answers. Therefore, improving the performance of answer quality prediction has become particularly important. Redundancy, sparsity [2], and the poor standardization of agricultural texts lead to inaccurate text feature extraction and make it difficult to determine the relationships between features. The critical technical step in realizing an intelligent agricultural Q&A community is to detect the correct answer in the candidate answer set and return it to the user quickly, automatically, and accurately. Traditional answer selection [3] relies on manual screening, which cannot process text data efficiently; because it depends on human feature selection, it cannot automatically and accurately judge the correct answer from a large amount of agricultural text data. Therefore, using deep learning [4] and natural language processing technology [5] to realize the intelligent selection of answers to rice-related questions is a significant problem to be solved for the “China Agricultural Technology Extension Information Platform”.
In recent years, deep neural networks have made remarkable achievements in many natural language processing tasks, including answer selection. The core problem is to obtain distinctive semantic features from question-and-answer sentences with deep neural networks. Lei et al. [6] proposed an answer selection model based on a CNN (Convolutional Neural Network) [7] to extract semantic information from question-and-answer sentences, map them into low-dimensional distributed representation vectors, and learn a semantic matching function for the semantic matching of question-and-answer pairs. Severyn et al. [8] also proposed an answer selection model based on a CNN, which simultaneously learns the intermediate representation and the final representation of question-and-answer pairs to generate a more refined semantic representation for semantic similarity matching. Kalchbrenner et al. [9] proposed a DCNN (Dynamic Convolutional Neural Network) for answer selection. To accurately present the semantic information in question-and-answer sentences, the DCNN adopts multi-layer wide convolution and a dynamic K-max pooling operation, which retains the word order information in the sentence and the relative positions between words and can dynamically deal with question-and-answer pairs of different lengths. Compared with traditional statistical learning models, the performance of CNN-based models has improved significantly. However, a single CNN cannot effectively extract the contextual semantic correlation information in question-and-answer pairs, which is vital for selecting the most suitable answer from the answer sequence. Some models therefore use RNNs (Recurrent Neural Networks) for answer selection. Wang et al. [10] obtained semantic representations of different granularities in the text through an RNN and then extracted semantic matching information from the semantic interaction information of different granularities in the question-and-answer text to calculate the semantic matching degree of the question-and-answer pairs. Wang et al. [11] applied a stacked bi-directional LSTM (Long Short-Term Memory) [12] network to capture the forward and backward context information in the question-and-answer sentences. Cai et al. [13] used a bi-directional LSTM network to extract question-and-answer features at various scales and used three different similarity matrix learning models to obtain the overall similarity of question-and-answer pairs from the local feature similarities.
With the good performance of the attention mechanism [14] in sentence representation tasks, researchers have applied it to the answer selection task. Tan et al. [15] proposed QA-LSTM-CNN with attention, in which the sentence vector representation obtained through the LSTM-CNN network is fed into an attention network to obtain the attention distribution over the different units of the sentence vector and the correlation information between sentences. Secondly, there are also different methods and strategies for applying attention mechanisms. Santos et al. [16] proposed attentive pooling networks and the concept of attention pooling: when a neural network performs pairwise ranking or classification, attention pooling can perceive the current input pair, so the information from the two input items can directly affect the computation of each other’s representations; in the answer selection task, the target task is regarded as a triple-similarity problem. After the pre-training model [17] was proposed, researchers fine-tuned it to solve downstream target tasks, and its effectiveness has been verified to a great extent. Thirdly, researchers have also optimized pre-training models on large corpora, used them only as the underlying language representation structure to obtain sentence representations, and combined them with other network structures to model the target task [18].
However, mainstream neural networks have recently focused more on the representation of sentence interaction, and they have shortcomings in model performance. This paper proposes an effective method to address incomplete sentence representation and insufficient sentence interaction. To address incomplete sentence representation, this paper embeds word vectors into the sentence representation using a twelve-layer Chinese Bert pre-training model [19]; the word vectors generated by the pre-training language model contain contextual semantic information. Secondly, this paper also introduces a dynamic attention mechanism to filter irrelevant information out of the sentence vector and thus represent the sentence better. To address insufficient information interaction between sentences, this paper introduces multi-strategy matching in the matching layer: the obtained sentence vectors interact with each other through two different matching strategies to better capture the semantic correlation between the question and the candidate answer. The main work of this paper is as follows:
  • A method based on dynamic attention mechanisms and multi-strategy matching was proposed.
  • The pre-training model was cleverly transferred to the context embedding layer, which provides a new idea of sentence pair representation.
  • The dynamic attention mechanism can effectively screen the irrelevant information in sentence representation and improve sentence expression.
  • The model introduced two matching strategies to capture the interaction between the question and candidate answer fully.

2. Materials and Methods

2.1. Corpus Preparation

The rice-related dataset used in this paper was derived from the Q&A community of the China Agricultural Technology Extension Information Platform [20]. First, 5000 common rice-related questions were selected and classified into five categories (diseases and pests, weeds and pesticides, cultivation management, storage and transportation, and OTHERS), containing 1519, 475, 1503, 376, and 1127 questions, respectively. Each question had multiple answers in the Q&A community, and five candidate answers were chosen for each question: only one of them was positive, marked as 1, and the other four were negative, marked as 0. In this way, 25,000 rice-related Q&A pairs were obtained, of which 5000 were positive examples and 20,000 were negative examples. Examples of rice-related answer selection samples are shown in Table 1.
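As a rough illustration of how such pairs can be grouped for listwise training, the sketch below builds samples of one question with five candidate answers. This is not the authors' preprocessing code; the field names ("question", "answer", "label") are assumptions made for illustration.

```python
# Illustrative sketch: group labeled Q&A pairs into listwise samples of one
# question with five candidate answers (one positive, four negative).
from collections import defaultdict

def build_listwise_samples(pairs, candidates_per_question=5):
    """pairs: iterable of dicts with assumed keys 'question', 'answer', 'label'."""
    groups = defaultdict(list)
    for p in pairs:
        groups[p["question"]].append((p["answer"], p["label"]))
    samples = []
    for question, answers in groups.items():
        if len(answers) == candidates_per_question:
            samples.append({"question": question,
                            "candidates": [a for a, _ in answers],
                            "labels": [lbl for _, lbl in answers]})
    return samples

# Each resulting sample holds one positive and four negative candidates,
# matching the 5000 positive / 20,000 negative split described above.
```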
The constructed rice-related answer selection dataset differed from answer selection datasets in the general domain and had the following characteristics:
  • Highly field-specific. All samples were Q&A pairs from rice-related fields, and the semantic boundaries between sentences were fuzzy; therefore, it was difficult to transfer a general-domain model to rice diseases and pests for training.
  • The distribution of the question types was not uniform: diseases and pests, cultivation management, and OTHERS question-and-answer pairs were the most numerous, while weeds and pesticides, and storage and transportation question pairs were relatively limited.
  • The question-and-answer sequences were not long. Statistical analysis found that the maximum sample length in the rice-related question-and-answer set was 150 characters and the average length was 79 characters, whereas the maximum and average sample lengths in the Sogou news dataset are 38,872 and 1250 characters, respectively [21]. It was challenging to extract semantic information from the features generated when training on such short texts, which increased the difficulty of model recognition.

2.2. Methods

This paper utilized the Dynamic Attention and Multi-Strategy Matching model (DAMM), as shown in Figure 1. The model consists of four parts: the embedding layer, the multi-strategy interaction layer, the polymerization layer, and the output layer. We utilized the twelve-layer Chinese Bert model to expand the text feature words and calculated the weighted word vectors according to their importance. Dynamic attention with two different filtering strategies was used in the attention layer to effectively remove noise from the sentences, and the multi-strategy interaction layer and polymerization layer were used to extract local features of different granularities from the text. Finally, the Softmax function was applied to the extracted feature vectors to produce the output.

2.2.1. Embedding Layer

Text semantic representation has developed from the one-hot representation to the current mainstream neural network methods, including Word2vec [22] and GloVe [23]. Although these models solve the problem of word context to a certain extent, the problem of polysemy, in which words have different meanings in different contexts, still exists. This paper used a twelve-layer Chinese Bert for language feature extraction and representation, which can capture the rich grammatical and semantic features of agricultural texts and address the polysemy problem.
Devlin of the Google team proposed Bert [24] in 2018 and applied it to various natural language processing tasks. Bert adopts the transformer language model [25], which has an encoder–decoder structure, abandons recursion, and uses the attention mechanism to determine the relationship between input and output. The transformer model structure is shown in Figure 2.
Each base layer of the transformer encoder contains two sub-layers: an attention layer that adopts multi-head attention and a fully connected feedforward neural network. In addition to these two sub-layers, another attention layer is added to the decoder. A residual connection and layer normalization (LN) are introduced into each sub-layer; that is, there is a residual connection around each sub-layer in each encoder and decoder block, followed by an LN operation. The transformer is the first model built entirely on the self-attention mechanism; unlike the traditional encoder–decoder architecture, it does not need to be combined with the inherent structures of CNNs or RNNs [26]. Compared with recurrent neural networks, the transformer captures longer-range information and supports parallelization, which improves computing speed.
Bert’s pre-training objective function adopts a masked language model (MLM); some words are masked randomly and then predicted in the pre-training process. In this way, the representation of texts in two different directions can be learned. The input of the Bert model is the word embedding, generated by the addition of token embedding, segment embedding, and positional embedding, as shown in Figure 3.
The first token of each input sentence is [CLS], whose corresponding transformer output represents the whole sentence and can be used for downstream classification tasks. The [SEP] token is used to separate two sentences. For a single-sentence classification task, only one sentence is the input, so only one segment vector is used. The language technology platform (LTP) tool developed by the Harbin Institute of Technology was used as the word segmentation tool, which groups the Chinese characters that make up the same word and then trains on the segmented words. This paper used the word vectors obtained from the Bert pre-training model as the context embedding vectors, spliced them with the word vectors in the word representation layer, and normalized the result to obtain new sentence vectors representing Q and A.
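As a rough illustration of this embedding step, the following sketch uses the HuggingFace Transformers library with the publicly released bert-base-chinese checkpoint as a stand-in for the twelve-layer Chinese Bert described above; the checkpoint name, the maximum length of 150 characters, and the function names are assumptions for illustration, not the authors' code.

```python
# Minimal sketch: obtain contextual token vectors and a [CLS] sentence vector
# for a question/answer pair with a twelve-layer Chinese Bert checkpoint.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

def encode(question: str, answer: str, max_len: int = 150):
    # Builds "[CLS] question [SEP] answer [SEP]"; segment ids separate the sentences.
    inputs = tokenizer(question, answer, truncation=True,
                       max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # Token-level vectors feed the attention/matching layers; the [CLS] vector
    # can serve a sentence-level classification head.
    return outputs.last_hidden_state, outputs.last_hidden_state[:, 0]

token_vecs, cls_vec = encode("水稻僵苗是什么原因?", "水稻僵苗由气候、水质、缺锌等引起。")
```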

2.2.2. Dynamic Attention Mechanism

After the sentence vectorization of Q and A, a dynamic attention mechanism was introduced to remove redundant information from the sentence representations. When processing sequence information with the traditional static attention mechanism, each position in the question needs to be compared with every position in the answer, which is repetitive and time-consuming. Therefore, this paper applied two different filtering mechanisms (K-threshold filtering and K-max filtering) on top of dynamic attention to remove noise. By applying the filtering strategy in the attention layer, the weights of irrelevant information were set to zero, and the final sentence representation was then obtained from the new weights. Important information thereby occupied a larger proportion of the new sentence representation, which also improved the computational efficiency of the matching layer.
The initial attention weight matrix was obtained with the traditional attention mechanism. The semantic similarity between the question unit $\alpha_i$ to be matched and the candidate answer unit $\beta_j$ was calculated; for the similarity scoring function $\omega_{ij}$ (also known as the attention scoring function), this paper adopted the dot-product operation, expressed as follows:
$$\omega_{ij} = \alpha_i \cdot \beta_j$$
After obtaining the similarity, the Softmax function was used for normalization, and the normalized result was taken as the attention weight of the question unit in the candidate answer sequence.
Similarly, the attention weight of candidate answer unit $\beta_j$ in the question sequence was calculated. Assuming the normalized weight coefficients are $\varphi_{ij}^{\beta}$ and $\varphi_{ij}^{\alpha}$, the expressions are as follows:
$$\varphi_{ij}^{\beta} = \frac{\exp(\omega_{ij})}{\sum_{k=1}^{l_\alpha} \exp(\omega_{kj})}, \qquad \varphi_{ij}^{\alpha} = \frac{\exp(\omega_{ij})}{\sum_{k=1}^{l_\beta} \exp(\omega_{ik})}$$
After obtaining the weight coefficient of each word in the sentence, the filtering strategy is applied as described above. For question Q, the obtained weight matrix can be expressed as $\varphi = [\varphi_{1j}, \ldots, \varphi_{ij}, \ldots, \varphi_{l_\alpha j}]^{T}$, where $\varphi_{ij}$ is the abbreviation of $\varphi_{ij}^{\beta}$. In K-threshold filtering, a threshold K representing the correlation strength is assumed: weights not less than K are retained, weights less than K are set to zero, and the weight coefficient of each word is then recalculated from the new weights. The expression is as follows:
$$\varphi'_{ij} = \begin{cases} \varphi_{ij}, & \varphi_{ij} \geq K \\ 0, & \varphi_{ij} < K \end{cases}, \qquad \varphi''_{ij} = \frac{\varphi'_{ij}}{\sum_{k=1}^{l_\alpha} \varphi'_{kj}}$$
Irrelevant information in the sentences is filtered out by setting different values of the threshold K; moreover, the size of K can be adjusted dynamically.
In the K-max filtering method, the weight coefficients in the weight matrix are arranged in descending order to obtain the weight sequence table T. The weights ranked in the first K positions are retained, the weights after the K-th position are set to zero, and the weight coefficient of each word is then recalculated from the new weights. The expression is as follows:
$$\varphi'_{ij} = \begin{cases} \dfrac{\varphi_{ij}}{\sum_{k \in T} \varphi_{kj}}, & i \in T \\ 0, & i \notin T \end{cases}$$
where $T$ denotes the positions of the first K weights in the descending order.
For answer A, the same method was used to obtain the new weight coefficient of each word. According to the new weight coefficients, the soft alignment vectors $\sigma_j^{\beta}$ and $\sigma_i^{\alpha}$ of each unit in Q and A were calculated as weighted sums. The calculation formulas are as follows:
$$\sigma_j^{\beta} = \sum_{i=1}^{l_\alpha} \varphi_{ij}^{\beta} \alpha_i, \qquad \sigma_i^{\alpha} = \sum_{j=1}^{l_\beta} \varphi_{ij}^{\alpha} \beta_j$$
where $\sigma_j^{\beta}$ is the soft alignment vector corresponding to $\beta_j$ in $\alpha$, and $\sigma_i^{\alpha}$ is defined analogously. Finally, the output of this layer is expressed as Q: $[\sigma_1^{\beta}, \sigma_2^{\beta}, \ldots, \sigma_j^{\beta}, \ldots, \sigma_{l_\beta}^{\beta}]$ and A: $[\sigma_1^{\alpha}, \sigma_2^{\alpha}, \ldots, \sigma_i^{\alpha}, \ldots, \sigma_{l_\alpha}^{\alpha}]$.
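The following PyTorch sketch illustrates how the dot-product attention, the two filtering strategies, and the soft-alignment step described above could be combined; the tensor shapes, the function name, and the numerical safeguards are assumptions, not the authors' implementation.

```python
# Sketch of dynamic attention with K-threshold and K-max filtering (one direction).
import torch
import torch.nn.functional as F

def dynamic_attention(Q, A, k_threshold=None, k_max=None):
    """Q: (l_q, d) question units; A: (l_a, d) answer units."""
    # Dot-product similarity omega_ij between every question/answer unit.
    omega = Q @ A.t()                               # (l_q, l_a)
    # Attention weights of question units for each answer unit (softmax over i).
    phi = F.softmax(omega, dim=0)                   # (l_q, l_a)

    if k_threshold is not None:
        # K-threshold filtering: zero out weights below the threshold K.
        phi = torch.where(phi >= k_threshold, phi, torch.zeros_like(phi))
    if k_max is not None:
        # K-max filtering: keep only the K largest weights per answer unit.
        kth = torch.topk(phi, k=min(k_max, phi.size(0)), dim=0).values[-1]
        phi = torch.where(phi >= kth, phi, torch.zeros_like(phi))
    # Recompute the weight coefficients from the retained weights.
    phi = phi / phi.sum(dim=0, keepdim=True).clamp(min=1e-9)

    # Soft-alignment vectors: weighted sums of question units per answer unit.
    sigma = phi.t() @ Q                             # (l_a, d)
    return sigma, phi
```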

2.2.3. Multi-Strategy Interaction Layer

Most previous sentence matching methods use a single operation to enable interaction between sentence units, which makes it challenging to capture the deep semantic association between sentences. Therefore, this paper used multi-strategy matching to exchange information between the question and the answer. Applying several matching strategies to compute the information interaction between sentence units produces better matching results. Two matching strategies that compare in different directions were adopted here: Full matching and Attentive matching.
First, we define a multi-dimensional cosine matching function $f_x$ that compares the matching degree between two vectors and returns the result $x$:
$$x = f_x(a_1, a_2; V)$$
where $a_1$ and $a_2$ are d-dimensional vectors, $V \in \mathbb{R}^{l \times d}$ is a trainable parameter matrix, $l$ represents the number of matching angles, the matching result $x = [x_1, \ldots, x_k, \ldots, x_l]$ is an $l$-dimensional vector, and the element $x_k \in x$ represents the matching result of the k-th matching angle, obtained by calculating the cosine similarity between the two weighted vectors:
$$x_k = \mathrm{cosine}(V_k \cdot a_1,\ V_k \cdot a_2)$$
where $V_k$ represents the k-th row of the parameter matrix $V = [V_1, \ldots, V_k, \ldots, V_l]^{T}$, controls the k-th matching angle, and assigns weights to the different dimensions of the d-dimensional space.
To thoroughly compare each sentence unit vector of the question with all unit vectors of the candidate answer, this layer introduces two different matching strategies based on the matching function $f_x$. In the Full matching strategy, each sentence unit vector of the question is compared with the last sentence unit vector of the candidate answer.
In the whole matching layer, a bi-directional gated recurrent unit network [27] is used to realize parameter transfer, so the forward and backward directions are distinguished during comparison. Assuming that comparing the sentence unit $\sigma_j^{\beta}$ in question Q with the last sentence unit in answer A returns the result $x_j^{\beta}$, then:
$$\overrightarrow{x}_{j,\mathrm{full}}^{\beta} = f_x(\overrightarrow{\sigma}_j^{\beta}, \overrightarrow{\sigma}_{l_\alpha}^{\alpha}; V^{1}), \qquad \overleftarrow{x}_{j,\mathrm{full}}^{\beta} = f_x(\overleftarrow{\sigma}_j^{\beta}, \overleftarrow{\sigma}_{1}^{\alpha}; V^{2})$$
In the Attentive matching strategy, the cosine similarity between each question sentence unit $\sigma_j^{\beta}$ and every unit of answer A is first calculated; these similarities are taken as the attention weights of the question unit over answer sentence A. The obtained weights $\beta_{ij}$ are then normalized and, finally, all sentence units in the answer are weighted and summed to obtain the attention vector representation $\sigma_j^{mean}$ corresponding to the question unit. In the forward and backward directions:
$$\overrightarrow{\beta}_{ij} = \mathrm{cosine}(\overrightarrow{\sigma}_j^{\beta}, \overrightarrow{\sigma}_i^{\alpha}), \qquad \overleftarrow{\beta}_{ij} = \mathrm{cosine}(\overleftarrow{\sigma}_j^{\beta}, \overleftarrow{\sigma}_i^{\alpha}), \qquad i = 1, \ldots, l_\alpha$$
$$\overrightarrow{\sigma}_j^{mean} = \frac{\sum_{i=1}^{l_\alpha} \overrightarrow{\beta}_{ij} \cdot \overrightarrow{\sigma}_i^{\alpha}}{\sum_{i=1}^{l_\alpha} \overrightarrow{\beta}_{ij}}, \qquad \overleftarrow{\sigma}_j^{mean} = \frac{\sum_{i=1}^{l_\alpha} \overleftarrow{\beta}_{ij} \cdot \overleftarrow{\sigma}_i^{\alpha}}{\sum_{i=1}^{l_\alpha} \overleftarrow{\beta}_{ij}}$$
The question unit $\sigma_j^{\beta}$ to be matched is then compared with the obtained attention vector representation $\sigma_j^{mean}$, and the expressions are as follows:
$$\overrightarrow{x}_{j,\mathrm{att}}^{\beta} = f_x(\overrightarrow{\sigma}_j^{\beta}, \overrightarrow{\sigma}_j^{mean}; V^{3}), \qquad \overleftarrow{x}_{j,\mathrm{att}}^{\beta} = f_x(\overleftarrow{\sigma}_j^{\beta}, \overleftarrow{\sigma}_j^{mean}; V^{4})$$
The above two matching strategies are applied to every sentence unit of question Q, so each sentence unit produces four vectors. These four vectors are spliced to obtain the final representation $x_j^{\beta}$ of the sentence unit:
$$x_j^{\beta} = \mathrm{concatenate}\left[\overrightarrow{x}_{j,\mathrm{full}}^{\beta},\ \overleftarrow{x}_{j,\mathrm{full}}^{\beta},\ \overrightarrow{x}_{j,\mathrm{att}}^{\beta},\ \overleftarrow{x}_{j,\mathrm{att}}^{\beta}\right]$$
Similarly, the result $x_i^{\alpha}$ is calculated for answer A. Finally, the output of this layer is Q: $[x_1^{\beta}, x_2^{\beta}, \ldots, x_j^{\beta}, \ldots, x_{l_\beta}^{\beta}]$ and A: $[x_1^{\alpha}, x_2^{\alpha}, \ldots, x_i^{\alpha}, \ldots, x_{l_\alpha}^{\alpha}]$.
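A sketch of the two matching strategies built on the multi-dimensional cosine matching function $f_x$ is given below (single direction only, for brevity); the parameter shapes and the element-wise weighting by each row $V_k$ are assumptions consistent with the description above, not the authors' code.

```python
# Sketch of Full matching and Attentive matching with a multi-perspective
# cosine matching function.
import torch
import torch.nn.functional as F

def multi_perspective_match(a1, a2, V):
    """f_x: compare two d-dim vectors from l matching angles; V has shape (l, d)."""
    return F.cosine_similarity(V * a1, V * a2, dim=-1)        # (l,)

def full_matching(sigma_q, sigma_a, V):
    # Compare each question unit with the last answer unit (forward direction).
    last_a = sigma_a[-1]
    return torch.stack([multi_perspective_match(u, last_a, V) for u in sigma_q])

def attentive_matching(sigma_q, sigma_a, V):
    # Cosine similarities serve as attention weights over all answer units.
    beta = F.cosine_similarity(sigma_q.unsqueeze(1), sigma_a.unsqueeze(0), dim=-1)
    beta = beta / beta.sum(dim=1, keepdim=True).clamp(min=1e-9)
    sigma_mean = beta @ sigma_a                                # (l_q, d)
    return torch.stack([multi_perspective_match(u, m, V)
                        for u, m in zip(sigma_q, sigma_mean)])
```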

2.2.4. Polymerization Layer

The purpose of this layer is to integrate the forward and backward information obtained by the matching layer, splice the question representation and the candidate answer representation, and generate a sentence vector that contains the information of both. In this paper, a bi-directional gated recurrent unit (BiGRU) network was used to fuse the sequence information of the forward and backward sentence unit vectors $x_j^{\beta}$ and $x_i^{\alpha}$ transmitted by the matching layer. Each sentence unit is passed through the BiGRU network to obtain a new sentence unit vector, and the last unit vector of the forward pass and of the backward pass is taken as the connection vector. The connection vectors $r^{\alpha}$ and $r^{\beta}$ are then spliced to obtain the output representation of this layer, the scoring vector Score. In the output layer, this scoring vector is used to rank the candidate answers, where M is the trainable parameter matrix of the output layer.
$$\overrightarrow{x'}_j^{\beta} = \mathrm{GRU}(\overrightarrow{x}_1^{\beta}, \ldots, \overrightarrow{x}_j^{\beta}), \qquad \overleftarrow{x'}_j^{\beta} = \mathrm{GRU}(\overleftarrow{x}_{l_\beta}^{\beta}, \ldots, \overleftarrow{x}_j^{\beta}), \qquad j = 1, \ldots, l_\beta$$
$$r^{\beta} = \left[\overleftarrow{x'}_1^{\beta},\ \overrightarrow{x'}_{l_\beta}^{\beta}\right], \qquad r^{\alpha} = \left[\overleftarrow{x'}_1^{\alpha},\ \overrightarrow{x'}_{l_\alpha}^{\alpha}\right]$$
$$Score = [r^{\alpha}, r^{\beta}]^{T} M$$
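The following PyTorch sketch shows one plausible reading of this polymerization step: a shared BiGRU runs over the matched question and answer sequences, the last forward and backward hidden states form the connection vectors, and a trainable matrix M produces the scoring vector. The layer sizes and the use of a linear layer for M are assumptions, not the authors' exact configuration.

```python
# Sketch of the polymerization layer: BiGRU aggregation plus a trainable scoring matrix.
import torch
import torch.nn as nn

class Aggregator(nn.Module):
    def __init__(self, match_dim, hidden_dim, score_dim):
        super().__init__()
        self.bigru = nn.GRU(match_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.M = nn.Linear(4 * hidden_dim, score_dim)          # trainable parameter M

    def forward(self, x_q, x_a):
        # x_q: (1, l_q, match_dim), x_a: (1, l_a, match_dim) from the matching layer.
        _, h_q = self.bigru(x_q)        # h_q: (2, 1, hidden_dim) last fwd/bwd states
        _, h_a = self.bigru(x_a)
        r_q = torch.cat([h_q[0], h_q[1]], dim=-1)              # connection vector for Q
        r_a = torch.cat([h_a[0], h_a[1]], dim=-1)              # connection vector for A
        return self.M(torch.cat([r_q, r_a], dim=-1))           # scoring vector Score
```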

2.2.5. Output Layer

To consider the correlation between the question and the candidate answers when matching, the listwise method is used. The candidate answer set is represented as $A = \{A_1, A_2, \ldots, A_N\}$ and the label set as $Y = \{Y_1, Y_2, \ldots, Y_N\}$, and a score vector is calculated from question Q and each candidate answer. Finally, the Softmax classifier is applied to obtain the final output vector S:
$$S = \mathrm{Softmax}([Score_1, \ldots, Score_j, \ldots, Score_N]), \qquad Score_j = \mathrm{model}(Q, A_j)$$
Finally, the Kullback–Leibler (KL) divergence loss function was used, and the expression is as follows:
$$Y' = \frac{Y}{\sum_{i=1}^{N} y_i}, \qquad Loss = \frac{1}{n} \sum_{1}^{n} \mathrm{KL}(S \,\|\, Y')$$
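A minimal sketch of the listwise output layer and KL divergence loss is given below, assuming one question with N candidate answers and binary labels; the argument order follows PyTorch's kl_div convention, and the reduction choice is an assumption rather than the authors' setting.

```python
# Sketch of the listwise output: Softmax over candidate scores plus KL divergence loss.
import torch
import torch.nn.functional as F

def listwise_kl_loss(scores, labels):
    """scores: (N,) model scores for N candidates; labels: (N,) 0/1 relevance."""
    S = F.log_softmax(scores, dim=-1)                           # predicted distribution (log space)
    Y = labels.float() / labels.float().sum().clamp(min=1e-9)   # normalized label distribution
    # KL divergence between the label and predicted distributions over the list.
    return F.kl_div(S, Y, reduction="sum")

loss = listwise_kl_loss(torch.tensor([2.1, 0.3, -0.5, 0.1, -1.2]),
                        torch.tensor([1, 0, 0, 0, 0]))
```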

3. Results

3.1. Hardware, Software Environment, and Evaluation Indicators

The server was equipped with an NVIDIA GeForce RTX 2080 Ti GPU (NVIDIA Corporation device 1e04, Rev A1), and both the research and control experiments were performed in the Ubuntu 18.04 environment. The deep learning framework PyTorch was used for training in conjunction with CUDA 10.1. During the experiment design and control phase, the network batch sizes for the training and validation sets were set to 16 and 32, respectively, and all network models were trained for 50 iterations. The 25,000 pairs of the rice-related Q&A dataset were divided into a training set and a test set at a ratio of 9:1, giving 22,500 training pairs and 2500 test pairs.
In the experiment, Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) were used as the evaluation indexes of model performance.
MRR: the reciprocal of the rank at which the standard answer appears in the returned ranking is taken as the evaluation index and averaged over all questions. The calculation formula is as follows:
$$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$$
where $|Q|$ represents the number of questions, and $rank_i$ represents the position at which the correct answer first appears among the predicted answers returned by the model for question $i$. If no correct answer appears, $\frac{1}{rank_i}$ is set to 0.
MAP: first, the average precision of the standard answers is calculated for each question; the MAP is then the mean of these average precisions over all questions. MAP is calculated as follows:
$$MAP = \frac{1}{|Q|} \sum_{i=1}^{|Q|} AveP(C_i, A_i), \qquad AveP(C, A) = \frac{\sum_{k=1}^{n} P(k) \cdot rel(k)}{\min(m, n)}$$
$$rel(k) = \begin{cases} 1, & C_k \in \text{StandardAnswer} \\ 0, & C_k \notin \text{StandardAnswer} \end{cases}$$
where $AveP$ represents the average precision; $k$ is the rank among the $n$ predicted answers returned; $m$ is the number of standard answers; $n$ is the number of predicted answers; if $\min(m, n) = 0$, $AveP(C, A)$ is set to 0; $P(k)$ is the prediction precision up to rank $k$; and $rel(k)$ indicates whether the k-th predicted answer is a correct answer, recorded as 1 if yes and 0 otherwise.
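The two evaluation indexes can be computed as in the following sketch, assuming each question provides a list of binary relevance labels ordered by the model's predicted scores; the variable names are illustrative.

```python
# Sketch of the MRR and MAP evaluation indexes for ranked candidate answers.
def mean_reciprocal_rank(ranked_labels_per_question):
    total = 0.0
    for labels in ranked_labels_per_question:          # labels ordered by model score
        rr = 0.0
        for rank, rel in enumerate(labels, start=1):
            if rel == 1:
                rr = 1.0 / rank                        # first correct answer
                break
        total += rr                                    # 0 if no correct answer appears
    return total / len(ranked_labels_per_question)

def mean_average_precision(ranked_labels_per_question):
    total = 0.0
    for labels in ranked_labels_per_question:
        hits, precisions = 0, []
        for rank, rel in enumerate(labels, start=1):
            if rel == 1:
                hits += 1
                precisions.append(hits / rank)         # P(k) at each correct answer
        m = sum(labels)
        total += sum(precisions) / min(m, len(labels)) if m else 0.0
    return total / len(ranked_labels_per_question)

# Example: one question whose only correct answer is ranked second.
print(mean_reciprocal_rank([[0, 1, 0, 0, 0]]), mean_average_precision([[0, 1, 0, 0, 0]]))
```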

3.2. Text Vectorization Processing and Analysis

We used the twelve-layer Chinese Bert model to vectorize the rice-related question-and-answer dataset and compared it with the GloVe, TF-IDF [28], and Word2vec vectorization models. The text features obtained from training the four models were put directly into a Softmax classifier with a fully connected layer. As Table 2 shows, among the four word vector conversion tools used in the embedding layer, the twelve-layer Chinese Bert pre-training model achieved the highest MAP and MRR, reaching 75.3% and 78.2%, respectively. The TF-IDF method performed worst, which indicates that TF-IDF mainly considers word frequency and word position but ignores the relationships between words. Bert pre-training improved MAP and MRR over Word2vec by 2.7% and 3.5%, respectively, which indicates that Word2vec considers the surrounding information of words but ignores word order; because of the window size limitation, it cannot consider the correlation among all words in the whole sentence. The text representation obtained by Bert considers context and word order simultaneously, which improves the MAP and MRR of the neural network and shows that Bert can solve the problem of words having different meanings in different contexts. Therefore, the Bert pre-training model was used to transform the rice-related answer selection dataset in this paper, and the resulting word vectors were used as input to the neural network models.

3.3. Experiment

The DAMM model proposed in this paper was compared with six other answer selection models (LSTM [12], Coattention-BiLSTM [13], CNN [29], Attention-CNN [30], RNN [10], and Attention_BiRNN [31]) on the rice-related answer selection dataset, with all models using the twelve-layer Chinese Bert model to vectorize the text. Table 3 compares the MAP and MRR of the seven deep learning models. Compared with LSTM, the MAP and MRR of the BiLSTM based on the coattention mechanism increased by 4.7 and 4.4 percentage points, respectively; the MAP and MRR of the convolutional neural network model based on the attention mechanism (Attention-CNN) increased by 3.2 and 3.9 percentage points compared with the CNN model, respectively; and the MAP and MRR of the BiRNN based on the attention mechanism increased by 1.9% and 3.6% compared with the RNN model, respectively. The DAMM model proposed in this paper obtained the highest MAP and MRR, reaching 85.7% and 88.9%, respectively, significantly better than the other six deep learning models. Figure 4 compares the MAP of the seven answer selection models during the training process.
It can be seen from Table 4 that, compared with LSTM, Coattention-BiLSTM, CNN, Attention-CNN, RNN, and Attention_BiRNN, DAMM had the highest matching performance on all five categories of the dataset (diseases and insect pests, weeds and pesticides, cultivation management, storage and transportation, and OTHER). The MAP and MRR of DAMM were no lower than 82.8% and 83.7%, respectively, and its overall performance was better than those of the other models. The MAP of the DAMM model was slightly higher than those of the other models on the categories with sufficient data (diseases and pests, cultivation management, and OTHER), because, during iterative training, the larger the dataset, the better the training effect of the model. In the two categories with fewer data (weeds and pesticides, and storage and transportation), the MAP and MRR of DAMM were significantly higher than those of the other six models, which indicates that the DAMM model can still effectively extract features from short texts when data are insufficient, and that the model has good robustness.
Table 5 shows the response time, MAP, and MRR of the four attention-based neural network models on the 2500-pair test set; all of them meet the requirement for quick answer selection on the rice-related Q&A dataset. Attention-CNN has the fastest response time owing to its simple structure, fewer training layers, and fewer model parameters. The DAMM model proposed in this paper accurately judged the 2500 rice-related question-and-answer pairs in the test set within 11 s, with MAP and MRR reaching 83.7% and 86.9%, respectively, the best performance among the four compared models.

4. Discussions

4.1. Effectiveness of the Embedding Layer

Table 6 shows the MAP and MRR of the seven neural network models (LSTM, Coattention-BiLSTM, CNN, Attention-CNN, RNN, Attention_BiRNN, and DAMM) with Bert text representation and with Word2vec text representation on the rice-related question-and-answer dataset. The Bert text representation method proposed in this paper achieved a higher MAP and MRR than the Word2vec text representation method under all seven neural network models. The DAMM model achieved the highest MAP and MRR with both the Bert and the Word2vec text representations, reaching 85.7% and 88.9%, and 82.5% and 83.7%, respectively; its answer selection effect was significantly better than those of the other six neural network models. It can also be seen from Table 6 that the Bert text representation method improved MAP and MRR in every group of comparative experiments. This is because the Word2vec text representation method ignores polysemy in different contexts and long-distance semantic association information, whereas the Bert text representation method can solve these problems and thereby improve the effectiveness of answer selection on the rice-related question-and-answer dataset.

4.2. Effectiveness of Dynamic Attention Mechanism

Table 7 shows a set of experiments undertaken to verify the effectiveness of the dynamic attention mechanism in the proposed DAMM model. Removing the K-threshold and K-max attention mechanisms from DAMM gave model 1 (DAMM without K-threshold and K-max). It can be seen from Table 7 that, compared with the DAMM model, the MAP and MRR of model 1 on the rice-related answer selection dataset decreased by 9.1% and 10.4%, respectively. Adding an ordinary attention mechanism to model 1 gave model 2 ((DAMM without K-threshold and K-max) + attention); compared with model 1, the MAP and MRR of model 2 increased by 2.9% and 2%, respectively, which proves that adding an attention mechanism improves the model because it strengthens the weights of keywords in the answer selection process. Finally, removing only the K-threshold or only the K-max attention mechanism from DAMM gave model 3 and model 4; compared with model 2, their MAP and MRR increased by 1.6%, 2%, 2.7%, and 4.2%, respectively. This is because the filtering strategies reset the weights of irrelevant information to zero and then derive the final sentence representation from the new weights; compared with an ordinary attention mechanism, the dynamic mechanism lets important information occupy a larger proportion of the weight in the new sentence representation, which also raises the computational efficiency of the matching layer.

4.3. Effectiveness of Multi-Strategy Interaction Layer

In order to explore the impact of the matching layer, both strategies were removed from the matching layer to obtain model 1; then, each strategy was removed in turn to obtain model 2 and model 3 while all other factors remained unchanged. The impact of the different matching strategies was evaluated from the differences in the resulting MAP and MRR indicators. It can be seen from Table 8 that, compared with the DAMM model, the MAP and MRR of model 1 on the rice-related answer selection dataset decreased by 5.8% and 7.8%. Compared with model 1, the MAP and MRR of model 2 and model 3 increased by 2.7% and 3.4%, and 2.8% and 4.6%, respectively, which indicates that adding either of the two strategies improves effectiveness. The Full matching and Attentive matching strategies strengthen the interaction between the knowledge representations of question-and-answer pairs, allowing the model to accurately obtain the critical information related to the semantic information of the question-and-answer pairs. Compared with the DAMM model, the MAP and MRR of model 2 and model 3 decreased by 3.1% and 4.4%, and 3% and 3.2%, respectively, which shows that applying both matching strategies together further improves the model's effectiveness.

5. Conclusions

In this paper, a dynamic attention and multi-strategy matching model was proposed to complete the rice-related answer selection task. The model combines the advantages of the pre-training model in language representation. At the same time, a dynamic attention mechanism was introduced to remove irrelevant and redundant information and obtain sentence representations efficiently. Secondly, multi-strategy matching was introduced to compare the different units between sentences so that the semantic association between the target sentences was fully captured. The experimental results showed that the overall performance of the proposed model was better than that of the baseline models. In addition, the model proposed in this paper can be applied not only to answer selection but also to other matching and ranking tasks, for example, automatic question answering, machine reading comprehension, and dialogue systems. In this work, we migrated a Bert pre-training model trained elsewhere, so the out-of-vocabulary (OOV) problem in this text field cannot be ignored. In future work, we will first train the Elmo model on our dataset and embed the new word vector representations obtained from the training into the context embedding layer, and we will explore the influence of the corpus knowledge field on word vector representation. In addition, we will further explore the impact of other existing pre-training models on downstream tasks, such as generative pre-training (GPT) and universal language model fine-tuning (ULMFiT), which are currently widely used, and then analyze and study the representation mechanisms, applicable scenarios, and migration strategies of these language models. Finally, we will also explore matching strategies for deep interaction between sentences.

Author Contributions

Conceptualization, T.X. and H.W. (Haoriqin Wang); methodology, H.W. (Huarui Wu); software, H.W. (Huarui Wu); validation, H.W. (Haoriqin Wang), H.Z. and T.X.; formal analysis, H.W. (Haoriqin Wang); investigation, Q.W.; resources, T.X.; data curation, H.W. (Huarui Wu); writing—original draft preparation, H.W. (Huarui Wu) and H.Z.; writing—review and editing, H.W. (Haoriqin Wang) and H.W. (Huarui Wu); visualization, H.W. (Haoriqin Wang).; supervision, S.Q. and H.Z.; project administration, S.Q.; funding acquisition, Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China, grant number of 2019YFD1101105; Key technologies of Liaoning large-scale rice production in 5G smart unmanned farm, grant number of LSNZD202005; Science and Technology Plan Project of Inner Mongolia Autonomous Region of China, grant number of 2020GG0189; The Central Government Guided Local Science and Technology Development Fund project, grant number of 2020ZY0003; Higher Education Science Research Project of Inner Mongolia Autonomous Region of China, grant number of NJZY21419; Natural Science Foundation of Inner Mongolia Autonomous Region, grant number of 2021LHMS06006. The APC was funded by the National Key Research and Development Program of China, grant number of 2019YFD1101105.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available, due to the privacy policy of the Authors’ Institution.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (2019YFD1101105), Key technologies of Liaoning large-scale rice production in 5G smart unmanned farm (LSNZD202005), Science and Technology Plan Project of Inner Mongolia Autonomous Region of China (2020GG0189), The Central Government Guided Local Science and Technology Development Fund project (2020ZY0003), Higher Education Science Research Project of Inner Mongolia Autonomous Region of China (NJZY21419), and Natural Science Foundation of Inner Mongolia Autonomous Region (2021LHMS06006).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, M.; Li, Y.; Peng, Q.; Wang, J.; Yu, C. Evaluating community question-answering websites using interval-valued intuitionistic fuzzy DANP and TODIM methods. Appl. Soft Comput. 2020, 99, 106918. [Google Scholar] [CrossRef]
  2. Li, C.; Liu, F.; Li, P. Text Similarity Computation Model for Identifying Rumor Based on Bayesian Network in Microblog. Int. Arab. J. Inf. Technol. 2020, 17, 731–741. [Google Scholar] [CrossRef]
  3. Xiaoqiang, Z.; Baotian, H.; Qingcai, C.; Xiaolong, W. Recurrent convolutional neural network for answer selection in community question answering. Neurocomputing 2018, 274, 8–18. [Google Scholar]
  4. Jürgen, S. Deep Learning in Neural Networks: An Overview. Neural Netw. 2015, 61, 85–117. [Google Scholar]
  5. Young, T.; Hazarika, D.; Poria, S.; Cambria, E. Recent Trends in Deep Learning Based Natural Language Processing. Comput. Intell. Mag. 2018, 13, 55–75. [Google Scholar] [CrossRef]
  6. Lei, Y.; Hermann, K.M.; Blunsom, P.; Pulman, S. Deep Learning for Answer Sentence Selection. arXiv 2014, arXiv:1412.1632. [Google Scholar]
  7. Kim, Y. Convolutional Neural Networks for Sentence Classification. arXiv 2014, arXiv:1408.5882. [Google Scholar]
  8. Severyn, A.; Moschitti, A. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, 9–13 August 2015; pp. 373–382. [Google Scholar]
  9. Kalchbrenner, N.; Grefenstette, E.; Blunsom, P. A Convolutional Neural Network for Modelling Sentences. arXiv 2014, arXiv:1404.2188. [Google Scholar]
  10. Wang, B.; Liu, K.; Zhao, J. Inner attention based recurrent neural networks for answer selection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1, pp. 1288–1297, Long Papers. [Google Scholar]
  11. Wang, D.; Nyberg, E. A long short-term memory model for answer sentence selection in question answering. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, July 2015; Volume 2, pp. 707–712, Short Papers. [Google Scholar]
  12. Tan, M.; dos Santos, C.; Xiang, B.; Zhou, B. Lstm-based deep learning models for non-factoid answer selection. arXiv 2015, arXiv:1511.04108. [Google Scholar]
  13. Cai, L.; Zhou, S.; Yan, X.; Yuan, R. A stacked BiLSTM neural network based on coattention mechanism for question answering. Comput. Intell. Neurosci. 2019, 9, 1–12. [Google Scholar] [CrossRef]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  15. Tan, M.; Dos Santos, C.; Xiang, B.; Zhou, B. Improved representation learning for question answer matching. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1, pp. 464–473, Long Papers. [Google Scholar]
  16. dos Santos, C.; Tan, M.; Xiang, B.; Zhou, B. Attentive pooling networks. arXiv 2016, arXiv:1602.03609. [Google Scholar]
  17. Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; Hon, H.-W. Unified language model pre-training for natural language understanding and generation. arXiv 2019, arXiv:1905.03197. [Google Scholar]
  18. Laskar, M.T.R.; Huang, X.; Hoque, E. Contextualized embeddings based transformer encoder for sentence similarity modeling in answer selection task. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 5505–5514. [Google Scholar]
  19. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z.; Wang, S.; Hu, G. Pre-training with whole word masking for chinese bert. arXiv 2019, arXiv:1906.08101. [Google Scholar] [CrossRef]
  20. Wang, H.; Zhu, H.; Wu, H.; Wang, X.; Han, X.; Xu, T. A Densely Connected GRU Neural Network Based on Coattention Mechanism for Chinese Rice-Related Question Similarity Matching. Agronomy 2021, 11, 1307. [Google Scholar] [CrossRef]
  21. Wang, R.; Li, Z.; Cao, J.; Chen, T.; Wang, L. Convolutional recurrent neural networks for text classification. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–6. [Google Scholar]
  22. Goldberg, Y.; Levy, O. word2vec Explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv 2014, arXiv:1402.3722. [Google Scholar]
  23. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  24. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  25. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. arXiv 2021, arXiv:2103.00112. [Google Scholar]
  26. Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent neural network regularization. arXiv 2014, arXiv:1409.2329. [Google Scholar]
  27. Liu, X.; Wang, Y.; Wang, X.; Xu, H.; Li, C.; Xin, X. Bi-directional gated recurrent unit neural network based nonlinear equalizer for coherent optical communication system. Opt. Express 2021, 29, 5923–5933. [Google Scholar] [CrossRef]
  28. Qaiser, S.; Ali, R. Text mining: Use of TF-IDF to examine the relevance of words to documents. Int. J. Comput. Appl. 2018, 181, 25–29. [Google Scholar] [CrossRef]
  29. Sequiera, R.; Baruah, G.; Tu, Z.; Mohammed, S.; Rao, J.; Zhang, H.; Lin, J. Exploring the effectiveness of convolutional neural networks for answer selection in end-to-end question answering. arXiv 2017, arXiv:1707.07804. [Google Scholar]
  30. Xiang, Y.; Chen, Q.; Wang, X.; Qin, Y. Answer selection in community question answering via attentive neural networks. IEEE Signal Process. Lett. 2017, 24, 505–509. [Google Scholar] [CrossRef]
  31. Ma, J.; Che, C.; Zhang, Q. Medical answer selection based on two attention mechanisms with birnn. MATEC Web Conf. 2018, 176, 01024. [Google Scholar] [CrossRef]
Figure 1. DAMM model architecture diagram.
Figure 2. Transformer model structure.
Figure 3. Bert input example.
Figure 4. Comparison of the MAP with seven answer selection models.
Table 1. Sample of rice-related answer selection dataset.

Question | Answer | Label
What is the reason for the stiff seedling of rice? | Rice seedling stiffness is caused by climate, water quality, temperature and humidity, lack of zinc and so on. | 1
What is the reason for the stiff seedling of rice? | Mainly pay attention to the treatment time and solution concentration. | 0
What is the reason for the stiff seedling of rice? | The dead seedlings of rice are mainly in the cold soaked field with high groundwater level. | 0
What is the reason for the stiff seedling of rice? | This phenomenon is very common and does no harm to rice. | 0
What is the reason for the stiff seedling of rice? | Rice seedling needs, disinfection, seed drying, seed soaking, Northern Greenhouse breeding. | 0
What are the transmission routes of Rice False Smut? | The main transmission routes of Rice False Smut are seed, air flow and soil. | 1
What are the transmission routes of Rice False Smut? | Rice false smut usually occurs in rice. | 0
What are the transmission routes of Rice False Smut? | Rice false smut mainly occurs in the panicle and is mainly infected by bacteria. | 0
What are the transmission routes of Rice False Smut? | The control time of Rice False Smut is mainly chemical seed dressing during seedling raising. | 0
What are the transmission routes of Rice False Smut? | According to the characteristics of rice false smut, the control effect can reach more than 85%. | 0
What are the symptoms of bud rot in rice fungal bacterial wilt? | The tooth root is withered and yellow, with brown mold layer, rotten and easy to break. | 1
What are the symptoms of bud rot in rice fungal bacterial wilt? | Strengthen water and fertilizer management and timely prevent and control diseases and pests. | 0
What are the symptoms of bud rot in rice fungal bacterial wilt? | Fungal bacterial wilt in rice is a dead seedling of bacterial wilt. | 0
What are the symptoms of bud rot in rice fungal bacterial wilt? | Bacterial base rot of rice is mainly transmitted by overwintering pathogens on rice straw, rice pile and weeds. | 0
What are the symptoms of bud rot in rice fungal bacterial wilt? | Bacteria are solitary, short rod-shaped and blunt at both ends. | 0
Table 2. Model matching effect under different embedding layers.

Model | MAP (%) | MRR (%)
TF-IDF | 63.7 | 65.9
GloVe | 71.8 | 73.9
Word2vec | 72.6 | 74.7
BERT | 75.3 | 78.2
Table 3. Effects of different models on rice-related answer selection dataset.

Model | MAP (%) | MRR (%)
LSTM | 76.6 | 78.5
Coattention-BiLSTM | 82.7 | 83.7
CNN | 78.6 | 79.9
Attention-CNN | 79.5 | 81.3
RNN | 81.7 | 82.1
Attention_BiRNN | 83.6 | 85.7
DAMM | 85.7 | 88.9
Table 4. Effect of different models on rice-related answer selection dataset.

Model | MAP (%): 1, 2, 3, 4, 5 | MRR (%): 1, 2, 3, 4, 5
LSTM | 84.2, 73.7, 75.8, 79.7, 79.7 | 83.7, 77.8, 79.7, 75.1, 81.1
Coattention-BiLSTM | 83.1, 81.9, 81.6, 78.5, 78.7 | 85.7, 81.1, 84.1, 81.1, 86.7
CNN | 74.1, 78.4, 79.1, 79.7, 76.7 | 82.1, 74.1, 77.7, 79.6, 84.1
Attention-CNN | 86.6, 81.2, 81.9, 72.6, 75.3 | 84.7, 78.6, 83.9, 78.7, 85.6
RNN | 82.3, 77.7, 81.5, 79.3, 77.7 | 83.5, 79.3, 82.3, 78.1, 82.3
Attention_BiRNN | 83.3, 79.9, 82, 83.3, 79.9 | 85.1, 78.9, 83.4, 82, 83.9
DAMM | 89.1, 82.8, 87.1, 83.6, 86.1 | 93.1, 83.7, 92.9, 86.7, 92.9
Note: 1, 2, 3, 4, and 5 represent the data of the diseases and pests, weeds and pesticides, cultivation management, storage and transportation, and OTHER categories, respectively.
Table 5. Response time and precision of four network models.

Model | Response Time (s) | MAP (%) | MRR (%)
Coattention-BiLSTM | 14 | 80.7 | 82.6
Attention-CNN | 10 | 79.6 | 81.5
Attention_BiRNN | 13 | 81.5 | 82.7
DAMM | 11 | 83.7 | 86.9
Table 6. Effect of different models on rice-related answer selection dataset.

Model | BERT MAP (%) | BERT MRR (%) | Word2vec MAP (%) | Word2vec MRR (%)
LSTM | 76.6 | 78.5 | 74.7 | 75.7
Coattention-BiLSTM | 82.7 | 83.7 | 77.6 | 79.2
CNN | 78.6 | 79.9 | 75.3 | 76.2
Attention-CNN | 79.5 | 81.3 | 76.5 | 77.2
RNN | 81.7 | 82.1 | 78.1 | 79.8
Attention_BiRNN | 83.6 | 85.7 | 79.7 | 81.2
DAMM | 85.7 | 88.9 | 82.5 | 83.7
Table 7. Effect of different models on rice-related answer selection dataset.

Label | Model | MAP (%) | MRR (%)
1 | DAMM without K-threshold and K-max | 76.6 | 78.5
2 | (DAMM without K-threshold and K-max) + attention | 79.5 | 80.5
3 | DAMM without K-threshold | 81.1 | 83.2
4 | DAMM without K-max | 82.5 | 84.7
5 | DAMM | 85.7 | 88.9
Table 8. Effect of different models on rice-related answer selection dataset.

Label | Model | MAP (%) | MRR (%)
1 | DAMM without Full-Matching and Attentive-Matching | 79.9 | 81.1
2 | DAMM without Full-Matching | 82.6 | 84.5
3 | DAMM without Attentive-Matching | 82.7 | 85.7
4 | DAMM | 85.7 | 88.9

