Article

An Idiom Reading Comprehension Model Based on Multi-Granularity Reasoning and Paraphrase Expansion

1 Software College, Northeastern University, Shenyang 110819, China
2 Key Laboratory of Intelligent Computing in Medical Image of Ministry of Education, College of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(9), 5777; https://doi.org/10.3390/app13095777
Submission received: 8 April 2023 / Revised: 26 April 2023 / Accepted: 3 May 2023 / Published: 8 May 2023

Abstract: Idioms are a unique class of words in the Chinese language that pose a challenge for Chinese machine reading comprehension because of their concise form and the potential mismatch between their literal and figurative meanings. To address this issue, guided by the linguistic theory of idioms, this paper adopts the “2 + 2” structure as the representation model for extracting idiom structure features. To enhance the model’s ability to learn idiom semantics, we propose a two-stage semantic expansion method that leverages semantic knowledge during the pre-training stage and extracts idiom interpretation information during the fine-tuning stage to improve the model’s understanding of idioms. Moreover, considering the inferential interaction between global and local information during context fusion, which current works neglect, we propose a method that uses a multi-headed attention mechanism with three selected attention patterns in the fine-tuning stage to fully extract global and local information. This enables the fusion of semantic information extracted by the model at different granularities, leading to improved accuracy. The experimental results demonstrate that our proposed BERT-IDM model outperformed the baseline BERT model, achieving a 4.1% accuracy improvement.

1. Introduction

The development of machine reading comprehension (MRC) and natural language processing (NLP) [1] is attracting increasing attention. In Chinese machine reading comprehension, one important task is the reading comprehension of Chinese idioms. Because of their concise form, idioms often have literal meanings that differ from their actual meanings, and it is difficult for machines to select the correct answer by simple matching.
To improve the accuracy of Chinese machine reading comprehension, researchers have carried out many studies. For example, Cui [2] et al. proposed BERT-WWM, which improves the accuracy of Chinese machine reading by replacing the single-character masking strategy with whole-word masking. To better study fill-in-the-blank idiom machine reading comprehension, researchers proposed the ChID [3] dataset, which focuses on Chinese idioms and requires the model to select the option that best fits the context given the story background. However, due to the complexity of the idiom machine reading comprehension problem, it still faces the following problems:
(1) Idioms often carry deep semantic information, and external knowledge is usually required to enhance the understanding of their semantics. However, most current studies [4] rely on the ChID dataset for idiom machine reading comprehension research. Relying on this dataset alone makes it difficult to improve the accuracy of idiom machine reading comprehension, because ChID focuses only on the idioms themselves and gives little consideration to external knowledge such as idiom interpretations.
(2) The treatment of idioms in current research is not reasonable enough. Most existing methods [2] either split four-character idioms into individual Chinese characters or feed them in as unprocessed complete words, which prevents the model from parsing the internal structure of idioms and extracting structural features, thus reducing the model’s ability to learn idiom semantics.
(3) Cloze-style machine reading comprehension requires models to understand the passage and options deeply. Existing studies have neither reasoned about the given information from multiple perspectives, such as global and local views, nor taken full advantage of the end-to-end information extraction capability of deep models.
In response to the above problems, this paper improves three aspects, namely idiom representation, interpretation expansion, and multi-granularity semantic capture, and proposes an idiom-oriented machine reading comprehension model, BERT-IDM (IDM abbreviates “idiom”). The main research work includes the following three aspects:
(1) To address the problem that the current models lack external knowledge expansion and corpus data support, in this paper, we constructed a pre-trained idiom interpretation sentence corpus in the pre-training stage and initially incorporated idiom interpretation information as external knowledge support based on this training. In the fine-tuning stage, we introduced an autoregressive model, XLNet, which integrated the two mainstream pre-training models to complement each other and form a dual-stream model structure, so that the idiom interpretation could be encoded and interact with the main model to achieve the integration of interpretation, options, and contextual semantics, thus improving the model’s ability to understand idioms.
(2) To improve the model’s idiom reading comprehension, we used idiom mask training to obtain a representation (the “two plus two” structure) that is more in line with the inherent characteristics of idioms, allowing the model to understand the idiom and its context more deeply and naturally.
(3) To address the problem that the model could not efficiently extract semantic information from candidates and contexts, we designed a self-masking self-attention mechanism (SMSA) that prevents inter-layer information leakage and reduces model complexity by improving the attention mechanism in the continued pre-training phase. In the fine-tuning phase, we then combined global attention, window attention, and the purpose-designed idiom masks to modify the Transformer’s multi-headed attention structure, obtaining the multi-granularity integrated attention (MGIA) mechanism for information fusion in the dual-stream model, which effectively improves model robustness and multidimensional information capture.
The rest of the paper is organized as follows: Section 2 introduces the work related to the study of idiom machine reading comprehension; Section 3 presents the idiom reading comprehension model; Section 4 compares the model proposed in this paper with other benchmark models used for the experiments; Section 5 presents an analysis of the experimental results and conclusions; and Section 6 summarizes the entire work.

2. Related Work

Machine reading comprehension is not a new natural language processing topic; as early as 1977, Lehnert [5] built the question-answering program QUALM to comprehend stories, bringing contextual comprehension under study for the first time. In 1999, Hirschman [6] et al. created a reading comprehension system containing a development set and a test set of 60 discourse items, each using reading material from grades 3 to 6. At that time, the Deep Read baseline system obtained 30–40% accuracy on eleven subtasks, while most machine reading comprehension systems of the same period were rule-based [7] or statistically based models [8]. Riloff [9] et al. designed a rule-based machine reading comprehension system, Quarc, containing multiple heuristic rules and morphological analysis as a means of providing answers. In 2010, Poon [10] et al. started to use machine learning methods for machine reading, combining techniques such as bootstrap sampling, Markov logic, and self-supervised learning. However, the lack of high-quality, large-scale reading comprehension datasets and the heavy reliance on manually constructed rule sets or features led to a long period during which research in the field of machine reading comprehension did not receive sufficient attention.
The above situation was resolved in 2015 with the emergence of neural machine reading comprehension and new large-scale benchmark datasets. On the one hand, deep-learning-based machine reading comprehension began to show its unique advantage in capturing contextual information, and on the other hand, the availability of datasets such as the CNN/Daily Mail dataset [11], Stanford Question-Answering Dataset (SQuAD) [12], and MS MARCO [13] dataset provided data support for deep neural network architectures and evaluation testbeds. Deep learning does not rely on linguistic feature tools, does not require manual feature construction, and has more robust generalization than traditional methods.
Hermann [11] et al. proposed the Attentive Reader, a supervised attentional bidirectional long short-term memory (LSTM) model evaluated on the CNN/Daily Mail dataset in cloze form, with an accuracy of 63.8%, an improvement of more than 10% over traditional models. The following year, Chen [14] et al. introduced a bilinear layer to replace the tanh layer, further improving the accuracy to over 70%. Although neural models have been successful in NLP tasks, their improvement in MRC accuracy was still limited. A significant reason is that the datasets for most supervised NLP tasks (except machine translation) were small, while deep neural networks usually have a large number of parameters, and small training datasets lead to overfitting. As a result, early neural models for NLP tasks were relatively shallow, often containing only one to three layers.
A large body of research in recent years has shown that pre-trained models (PTMs) based on large corpora can learn generic linguistic representations and benefit downstream NLP tasks while avoiding training models from scratch. With the growth of computational power, the emergence of deep models such as the Transformer [17], and improved training techniques, PTMs have evolved from shallow to deep.
In 2018, ELMo [16] used a bidirectional LSTM as a feature extractor to learn deep contextual word representations. A new paradigm of pre-training plus fine-tuning gradually formed and opened a new era of pre-trained language models (PLMs). Current deep PLMs have shown a powerful ability to learn universal language representations. The Transformer [17] structure was widely adopted in a series of large PLMs, such as BERT [18] and OpenAI GPT [15], in subsequent studies owing to its more powerful feature extraction ability. At the same time, the emergence of the attention mechanism allowed models to learn to focus on certain information. With this combination of features, PLMs continue to be applied in various NLP subfields, setting new SOTA scores. BERT, based on the Transformer structure and pre-trained with the masked language model (MLM) and next sentence prediction (NSP) tasks on a large unlabeled corpus, brought machine reading comprehension research into the BERT-based era. Among these models, RoBERTa [19] removed BERT’s NSP pre-training task, showing that it was not beneficial for downstream tasks, and further improved performance by adopting dynamic masking in place of BERT’s static random masking. Joshi [20] et al. proposed SpanBERT, a pre-training model that transformed the input into a set of spans and achieved good results by pre-training on a span-level masked language modeling task. Zhang [21] et al. introduced CPM (Chinese pre-trained language model), a large-scale generative model pre-trained on a corpus of over 10 billion Chinese tokens with strong language generation capabilities.
Unlike English, Chinese is unique in its syntax, vocabulary, and pronunciation. Li [22] et al. proposed in 2019 that deep learning of Chinese representations should use Chinese characters as the basic unit in the segmentation process, rather than words or subwords as in English under the standard proposed by Wu [23] et al. Xu [24] et al. constructed a standard dataset for evaluating Chinese natural language processing models in 2020. Subsequently, ERNIE [25] proposed three masking strategies (character-level masking, phrase-level masking, and entity-level masking) to enhance the model’s ability to capture multi-granularity semantics. Cui [2] et al. proposed BERT-WWM based on BERT, pre-training the model with a whole-word masking strategy that masks all characters belonging to the same Chinese word instead of masking single characters. As a result, the model was forced to learn the words themselves at the masked positions instead of predicting word components, as in the original BERT model, which improved the model’s understanding of Chinese inputs.
Idioms differ from ordinary Chinese words in that they have semantic unity and structural fixedness; that is, they are semantically indivisible as a whole, and their overall meaning cannot be inferred from the individual characters that make up the idiom. In terms of structure, the character order cannot be changed casually, nor can the grammatical structure. This leads to the low accuracy of existing models for idiom machine reading.
The Modern Chinese Dictionary defines an “idiom” as a concise and incisive fixed phrase or short sentence established through long use; most Chinese idioms consist of four characters. More than 95% of Chinese idioms comprise four characters (the rest range from 3 to 16 characters), and among these four-character idioms, the majority follow the “two plus two” structure, which treats the first and second characters as one unit and the third and fourth characters as another. Such idioms are the most widely distributed [26] in existing corpora and in everyday discourse. In 1995, Goldberg [27] proposed a theoretical framework of construction grammar, arguing that some aspects of the form or meaning of a construction cannot be fully predicted from its components or from established constructions.
Based on Goldberg’s theory, Wang [28] et al. investigated the antonymic co-occurrence construction “no A no B”, which fixes the first and third characters and fills the second and fourth positions with characters of opposite meaning, and explored the intrinsic, non-compositional meaning formation mechanism of this construction. Xie [29] et al. introduced the event-related brain potential technique to study how the human brain understands Chinese idioms from a neuroscientific perspective. They found that idiom processing patterns differ between the East and the West and that Chinese idiom comprehension obeys idiom constructivity theory; they also confirmed that the process of Chinese idiom comprehension cannot be reduced to a simple linear extraction of the construction. They further discussed the relationship between rhythmic structure and syntactic construction at the level of the Chinese utterance; the two-plus-two rhythmic unit of the four-character idiom is considered a feature of Chinese idiom comprehension not shared by foreign languages, especially English. At the same time, Yang [30] explicitly proposed that idiom comprehension needs to be split into two-plus-two structures under the construction theory framework and explored the relationship between these components. However, how to incorporate two-plus-two structural semantics into the training process for idiom semantic understanding remains a fundamental problem to be solved in machine learning. The interaction with contextual semantics also needs to be considered for idiomatic semantic understanding, yet current methods still lack the capability for multi-dimensional semantic acquisition, which limits the understanding of idiomatic semantics. Madabushi [31] et al. proposed a cross-lingual idiom detection and sentence embedding evaluation task, which aimed to evaluate whether computers could accurately identify idioms in multiple languages and embed them contextually into semantic space.
This paper improves the BERT-Chinese [25] model by using paraphrase expansion and external knowledge introduction to enhance the model’s understanding of idioms, changing the masking method according to the structure of Chinese idioms, and introducing multi-granularity semantic reasoning to obtain more contextual information to improve the accuracy of the model.

3. Model

In this paper, the initial improvement of BERT-Chinese was first achieved via the BERT-IDMbase model. Then, the overall model of this paper, BERT-IDM-FULL, was obtained by adding a multi-granularity inference mechanism. The two models were compared with a series of benchmark models in the experimental phase. The overall architecture of the model is shown in Figure 1.

3.1. BERT-IDMbase Model

In contrast to classical machine learning methods, the “pre-training + fine-tuning” paradigm enables larger, better-performing, and more generalizable models using large-scale unlabeled data without the need for data labeling. The in-domain further pre-training (IDFP) method continues to pre-train an existing model on specific types of data, allowing increasingly large pre-training models to avoid starting from scratch each time. With the support of a sufficiently large unlabeled corpus, IDFP can effectively improve the performance of the original pre-trained model on a particular task. In this work, we first crawled the idiom entries in the “Idioms in Sentences” section of the Sentences website as an index and then used the site’s “Sentence Search” function to retrieve the example sentences for the idioms in the index dictionary, concatenating them with English commas. In addition, since idioms paired with their corresponding meanings also give the model an opportunity to learn the mechanism of idiom comprehension, we queried the definitions of the crawled idioms from an online idiom dictionary website; the idioms and their corresponding definitions were likewise separated by commas and concatenated.
The BERT-IDMbase model proposed in this paper is shown in Figure 2. In Figure 2, [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token. We denote the input embedding as E, the final hidden vector of the special [CLS] token as $C \in \mathbb{R}^H$, and the final hidden vector of the i-th input token as $T_i \in \mathbb{R}^H$. To address the problem that the current idiom segmentation approach does not enable the model to fully understand idioms, the BERT-IDMbase model uses a two-plus-two mask for idiom input, masking two characters at a time and treating the first and second characters and the third and fourth characters of each idiom as two input units. This mode enables the model to attend to the association information contained in the idiom structure, effectively supporting the extraction of additional structural information about the idiom; it avoids both character-based prediction, which would destroy the structural information, and whole-word masking, which would mask the whole idiom at once and ignore the structural information. To conform to the structural features of the idioms themselves, the model is required both to infer the missing part from the given part of the idiom (simulating the understanding of the idiom itself) and to absorb the advantage of whole-word masking, which places the target as a whole word into its context for overall contextual inference. The Transformer model structure, however, not only requires a large number of matrix operations in the self-attention module, greatly increasing the computational load, but also introduces the problem of information leakage in the MLM target task.
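To make the two-plus-two masking concrete, the following minimal Python sketch (the example idiom, sentence, and helper names are illustrative assumptions, not the authors' released preprocessing code) splits a four-character idiom into its two two-character units and masks one unit while keeping the other visible, so that the model must infer the missing half from the visible half and the context.

```python
# Illustrative sketch of the "two plus two" idiom mask described above.
MASK = "[MASK]"

def two_plus_two_units(idiom: str):
    """Split a four-character idiom into its first-half and second-half units."""
    assert len(idiom) == 4, "the scheme targets four-character idioms"
    return idiom[:2], idiom[2:]

def mask_idiom_unit(tokens, idiom_start, mask_second_half=True):
    """Mask one two-character unit of the idiom starting at `idiom_start`.

    The other unit stays visible, so the model must infer the missing half
    from both the visible half and the surrounding context.
    """
    masked = list(tokens)
    offset = 2 if mask_second_half else 0
    masked[idiom_start + offset:idiom_start + offset + 2] = [MASK, MASK]
    return masked

# Example: the idiom 画蛇添足 embedded in a character-level token sequence.
tokens = list("他这样做纯属画蛇添足毫无必要")
print(two_plus_two_units("画蛇添足"))            # ('画蛇', '添足')
print(mask_idiom_unit(tokens, tokens.index("画")))
```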
In this paper, the structure of the Transformer is improved, and the self-masked self-attention mechanism (SMSA) is proposed. In the self-attention masking process, SMSA adopts two critical steps to ensure the efficiency and security of the model. Firstly, in the information aggregation stage, SMSA removes the token content information from the query vector Q and retains only the positional information, because including raw content in the query vector Q may cause information leakage and threaten the security of the model. Retaining only the positional information effectively avoids this situation and thereby ensures the security of the model. Secondly, in the self-attention calculation, SMSA normalizes the attention weights through a scaling operation and then adds an attention masking matrix to the self-attention mechanism. This attention masking matrix is a diagonal matrix with 1s on the main diagonal and 0s elsewhere; its role is to mask the matrix elements on the main diagonal, thereby shielding the self-information of each vector. This ensures that, when calculating attention weights, each vector focuses only on the information of other vectors, reducing computational complexity and improving the efficiency of the model. At the same time, the attention masking matrix further ensures the security of the model and avoids interference from each vector’s self-information with the model’s performance. The attention mask pattern is shown in Figure 3. On this basis, the compatibility of the model with multi-layer deep stacking architectures and residual connections was restored by setting up an isolation mechanism for the vectors.
The key to the Transformer, the core structure of BERT, is the self-attentive mechanism used to compute the $\mathrm{Attention}(Q, K, V)$ vector. In BERT’s self-attentive mechanism, equivalent Q, K, and V vectors are selected. If the original BERT layer-wise self-attention propagation is used to select the K and V vectors in subsequent hidden layers, the content information in the token vectors will diffuse into the query vector Q in those layers, because the K and V vectors have already been fused in the upper layers. The final vector used for the prediction output is the query vector Q, so if this problem is not addressed, the SMSA mechanism will fail after the first layer of a deep model. To solve this problem, we set up an isolation mechanism for the K and V vectors to restore the compatibility of the model with a multi-layer deep stacking architecture and residual connections. Specifically, all K and V vectors in the encoding layers were fixed to a constant value, and only the query vector Q, which is used for the final prediction output and does not contain self-content information, was updated across layers. The fixed values of the K and V vectors were determined by the combination of the input embedding sequence E and the position vector P. The attention calculation process of the modified SMSA is shown in Figure 4. Algorithm 1 formally defines the attention flow isolation of SMSA and describes the flow in detail. By fixing these vectors, the algorithm also reduces the time complexity of the large number of linear operations in the self-attentive computation.
Algorithm 1: BERT-IDMbase based on the SMSA mechanism
Input: query vector Q, key vector K, value vector V
Output: query vector Q used for predicting the output
1. The sequence of segmented token embeddings E and the corresponding position vector P are loaded;
2. The query vector $Q_1$ of the first-layer input is initialized as the position vector P, and the key vector K and value vector V are fixed as the vector $E + P$ and are not updated in subsequent layers;
3. The input $Q_m$ and K vectors of the m-th layer pass through the attention mask module after matrix multiplication and scaling, proceed through the Softmax layer, and are then multiplied with the vector V to produce the output of the attention module, the intermediate result $T_m = Q_{m+1}$, in accordance with Equation (1):
$T_m = \mathrm{SMSA}(Q_m, K, V) = \mathrm{Softmax}\!\left(\mathrm{AttnMask}\!\left(\frac{Q_m K^{\top}}{\sqrt{d}}\right)\right) V;$
4. The output of this layer, denoted $S_m$, is obtained by forward propagation through the fully connected layer (FFN). After residual connection and normalization of the result of step 3, $S_m$ satisfies Equation (2):
$S_m = f(\mathrm{Norm}(\mathrm{Add}(Q_m, T_m))),$
where $f(\cdot)$ satisfies Equation (3) for input x:
$f(x) = \mathrm{Norm}(\mathrm{Add}(x, \mathrm{FFN}(x)));$
5. The process loops through the Transformer layers of the configured stacking architecture, repeating steps 3 to 4 until the computation ends;
6. The final layer outputs the trained query vector Q, and the algorithm ends.
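The following is a minimal PyTorch-style sketch of Algorithm 1 under one plausible reading of its propagation rule; the layer width, activation function, and the choice to feed each layer's full output to the next layer's query stream are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMSALayer(nn.Module):
    """One Transformer layer with the self-masked self-attention of Algorithm 1.

    K and V are built once from E + P and kept fixed; only the query stream
    is propagated across layers.
    """
    def __init__(self, d_model: int, d_ff: int = 2048):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, q_m, k, v):
        d = q_m.size(-1)
        scores = self.w_q(q_m) @ k.transpose(-2, -1) / d ** 0.5
        # AttnMask of Eq. (1): block the main diagonal so each position cannot
        # attend to its own content and leak information.
        n = scores.size(-1)
        eye = torch.eye(n, dtype=torch.bool, device=scores.device)
        scores = scores.masked_fill(eye, float("-inf"))
        t_m = F.softmax(scores, dim=-1) @ v        # T_m in Eq. (1)
        s_m = self.norm1(q_m + t_m)                # Norm(Add(Q_m, T_m)), Eq. (2)
        return self.norm2(s_m + self.ffn(s_m))     # f(.) of Eq. (3)

def run_smsa_stack(token_emb, pos_emb, layers):
    """token_emb (E) and pos_emb (P): tensors of shape (batch, n, d_model)."""
    q, kv = pos_emb, token_emb + pos_emb   # step 2: Q_1 = P; K = V = E + P, fixed
    for layer in layers:                   # step 5: loop over the stacked layers
        q = layer(q, kv, kv)               # steps 3-4 inside each layer
    return q                               # step 6: final query used for prediction
```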

3.2. Multi-Granularity Integrated Attention Mechanism

We found that existing work focuses on optimizing the network and model structure from an engineering perspective rather than modeling the realistic behavior of humans in reading comprehension. Human reading comprehension, in contrast, is characterized by jump reading, global grasp, and local focus. We therefore attempted to build, on top of the BERT-IDMbase model proposed in the previous section, a new idiom reading comprehension model with greater explanatory capability, using multi-granularity reasoning and introducing idiom interpretations for expansion.
Our model is based on the observation that, when performing reading comprehension, people tend to focus their attention on the main idea of a passage once they have grasped it, rather than memorizing every detail. Therefore, we improved the attention mechanism of the BERT-IDMbase model in the fine-tuning phase by introducing a multi-granularity integrated attention (MGIA) mechanism.
In MGIA, global attention (GA) is used to simulate the general reading process of human reading comprehension, stride attention (SA) is used to simulate the jump reading process of reading comprehension, and window attention (WA) is used to simulate fine reading focusing on local information. MGIA is a kind of sparse attention, which could effectively reduce the number of parameters and improve the operational efficiency compared with the fully connected dense matrix used by the original self-attentive mechanism.
In terms of implementation, MGIA makes use of the multi-headed attention mechanism of the Transformer structure, wherein the stacked attention heads are assigned to different attentions according to a predefined pattern. This allows the model to use different approaches to the reading comprehension inference process simultaneously, and the importance (i.e., weight) of different inference patterns can be adjusted by setting the distribution of attention heads. The combined and integrated attention mask matrix is shown in Figure 5, where the pink squares are global attention (GA), the blue squares are stride attention (SA), and the yellow squares are window attention (WA).
Let $S_i$ denote the set of positions attended to by the attention mechanism when the current position is i. The three types of attention used by MGIA can then be specified as follows (using the lower-triangular setting).
There are two specific ways to implement global attention (GA): one is to select individual tokens at the head of the sequence as global tokens, and the other is to add new tokens without real meaning to the beginning of the sentence to aggregate the information of the whole sequence. In this section, we chose the latter and uniformly added a [CLS] token to avoid bias caused by the data samples. After the global tokens are set, all tokens compute attention scores with the global tokens, and the global tokens compute attention scores with all tokens. Thus, all tokens in the sequence (i.e., the distributed overall information) are linked by the global tokens, and extracting the global tokens has the effect of grasping the main idea of the passage. The attention set corresponding to global attention can be expressed as in Equation (4):
$S_i = \{ j \mid j < i < n \}.$
The stride attention mechanism (SA) was originally used for image generation and was later applied to language modeling. Through SA, long-range dependencies in sequences can be modeled and vector representations can be compressed to reduce computational complexity, which effectively increases the receptive field (analogous to a convolution operation) for the same computational cost at longer sequence lengths. For a given stride length k, the attention matrix attends to every k-th element counting back from the current element and aggregates them into the set $S_i$. Increasing the stride length can further reduce the computational complexity; for example, when $k = \sqrt{n}$, the complexity of the attention mechanism decreases from $O(n^2)$ to $O(n\sqrt{n})$. The attention set corresponding to stride attention can be expressed as in Equation (5):
$S_i = \{ j \mid (i - j) \bmod k = 0,\ j < i \}.$
Window attention (WA) focuses on the information around the current time step. Assuming a window size of k, window attention attends to the elements within distance k of the current position. Window attention can also be considered a kind of local attention, which enables the model to focus fully on the semantic information embedded in the vicinity of the position being processed and can significantly reduce the diversion of attention to information that is theoretically irrelevant to the current subword. The attention set corresponding to window attention can be expressed as in Equation (6):
$S_i = \{ j \mid i - k \le j < i \}.$
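To illustrate, the following sketch builds the three attention sets of Equations (4)–(6) as boolean mask matrices. The function names are ours; the global pattern follows the textual description of the [CLS] global token rather than a literal reading of Equation (4), and the lower-triangular restriction mentioned above can be applied on top by intersecting each mask with a lower-triangular matrix.

```python
import torch

def global_mask(n: int) -> torch.Tensor:
    """Global attention (GA): every position attends to the leading [CLS]
    global token, and the global token attends to all positions."""
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, 0] = True   # all tokens attend to the global token
    mask[0, :] = True   # the global token attends to all tokens
    return mask

def stride_mask(n: int, k: int) -> torch.Tensor:
    """Stride attention (Eq. 5): position i attends to every k-th earlier position."""
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    return ((i - j) % k == 0) & (j < i)

def window_mask(n: int, k: int) -> torch.Tensor:
    """Window attention (Eq. 6): position i attends to the k preceding positions."""
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    return (j >= i - k) & (j < i)
```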
With these three predefined templates, the MGIA attention mechanism can allocate computational resources to the key parts efficiently, avoiding distracted attention while combining global and local information at multiple granularities, accounting for long-range dependency modeling, and reducing the complexity of the attention computation. The general structure of the MGIA mechanism is shown in Figure 6.
The multi-headed attention mechanism is an extension of the single-headed attention mechanism; it splits the Q, K, and V vectors according to the number of attention heads. The attention heads are isolated from each other and invisible to one another, computing attention scores independently in parallel. Different attention masks were chosen so that the same input yielded different weight coefficients, and the head outputs were concatenated head-to-tail in the output. In this way, the semantic information learned by attention at different granularities was integrated, simulating the process by which humans read and reason in multiple modes. This can be expressed formally as in Equation (7).
For an input sequence $X = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^{n \times d}$, with $N(i)$ denoting the set of activated elements in the attention mask of the MGIA mechanism for position i, the update of the multi-headed self-attentive mechanism is
$\mathrm{Attention}_{MGIA}(X)_i = x_i + \sum_{j=1}^{J} \sigma\!\left(Q_j(x_i) K_j(X_{N(i)})\right) \cdot V_j(X_{N(i)}).$
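As a rough illustration of how the predefined patterns can be wired into multi-headed attention, the sketch below assigns one mask per head and concatenates the head outputs with a residual update in the spirit of Equation (7). It reuses the mask builders from the previous sketch; the head split in the trailing comment is one of the configurations evaluated in Section 5.2, while the window size and other details are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGIAAttention(nn.Module):
    """Multi-headed attention in which each head uses one predefined sparse mask
    (GA / SA / WA), so that the heads reason at different granularities."""
    def __init__(self, d_model: int, n_heads: int, head_masks):
        super().__init__()
        assert d_model % n_heads == 0 and len(head_masks) == n_heads
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        masks = torch.stack(list(head_masks))              # (n_heads, n, n) boolean masks
        eye = torch.eye(masks.size(-1), dtype=torch.bool)
        # Keep the diagonal active so no row is fully masked (avoids NaN in softmax).
        self.register_buffer("masks", masks | eye)

    def forward(self, x):                                  # x: (batch, n, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.reshape(b, n, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # (b, heads, n, n)
        scores = scores.masked_fill(~self.masks.unsqueeze(0), float("-inf"))
        ctx = F.softmax(scores, dim=-1) @ v
        ctx = ctx.transpose(1, 2).reshape(b, n, -1)             # concatenate heads
        return x + self.out(ctx)                                # residual, as in Eq. (7)

# One possible configuration from Section 5.2 (4 GA + 4 SA + 4 WA heads):
# masks = [global_mask(n)] * 4 + [stride_mask(n, 1)] * 4 + [window_mask(n, 3)] * 4
# mgia  = MGIAAttention(d_model=768, n_heads=12, head_masks=masks)
```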

3.3. Embedded Layer

In the embedding layer, a pre-trained model that could sufficiently extract semantic information and syntactic structure needed to be selected as the encoder of the component used to encode the input in this layer. Although both BERT-IDMbase and XLNET are pre-trained models that contain a large amount of linguistic knowledge, have been trained by learning on a large-scale corpus, and are capable of performing this task independently, it was possible to combine the advantages of both to obtain a better idiomatic representation by building a dual-stream pre-training model.
A major drawback of BERT is that the downstream task does not contain artificially added symbols such as [MASK], leading to a mismatch between pre-training and fine-tuning. This is a common problem in the design mechanism of BERT models and is a drawback brought about by the introduction of MLM target tasks. However, this drawback was not significant in the downstream task involved in this study, because the completion task precisely required the model to select the correct candidates from the masked integrity-breaking input to recover the original passage, not only with artificially added marker symbols, but also with a pattern of tasks consistent with MLM. XLNet was not suitable as a backbone model because of the large difference between the target task and completion task; however, because of its own powerful semantic information extraction ability and the native advantage that the autoregressive model could take into account the interrelationship between sequence subwords, it was suitable for use as a secondary model.
The additional idioms introduced in this work were data that did not contain any artificial symbols and were inherently flawed when encoded using BERT-IDMbase; thus, using XLNet as a black box in an end-to-end fashion was appropriate for this scenario. In addition, since the pre-training task of BERT-IDM learned the idiom reading pattern so that the model processed the idioms in a “two-plus-two” structure, to further enhance the multi-granularity setting of the model, the input of XLNet was decomposed into a simple sequence of characters, i.e., the idioms were not entered into the model as whole words or in a two-plus-two structure.
For a given passage $P = [p_1, p_2, \ldots, [\mathrm{MASK}], \ldots, p_n]$ and seven options $d_1, \ldots, d_7$ (the ChID dataset provides seven-choice completion questions), the input for XLNet was $\{[\mathrm{CLS}], p_1, p_2, \ldots, d_{i1}, d_{i2}, d_{i3}, d_{i4}, \ldots, p_n, [\mathrm{SEP}]\}$, where $d_{ij}$ denotes the j-th character of the i-th option, while for BERT-IDM the candidate $d_i$ was entered as a whole word in the sequence.
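As an illustration of this dual-stream input construction (the helper below is hypothetical, not the paper's preprocessing code), the fragment builds the two views for one candidate: a character-level sequence for XLNet and a sequence in which the candidate enters as a whole-word unit for BERT-IDM.

```python
def build_dual_inputs(passage_chars, blank_pos, candidate):
    """Build the two input views for one candidate idiom (illustrative only).

    passage_chars: list of passage characters with '[MASK]' at blank_pos;
    candidate: a four-character idiom d_i.
    Returns (xlnet_input, bert_idm_input).
    """
    assert passage_chars[blank_pos] == "[MASK]" and len(candidate) == 4
    # XLNet stream: the idiom is decomposed into single characters d_i1 ... d_i4.
    xlnet_body = passage_chars[:blank_pos] + list(candidate) + passage_chars[blank_pos + 1:]
    # BERT-IDM stream: the idiom enters as a whole-word unit; its "two plus two"
    # reading is handled by BERT-IDM's own masking scheme.
    bert_body = passage_chars[:blank_pos] + [candidate] + passage_chars[blank_pos + 1:]
    wrap = lambda body: ["[CLS]"] + body + ["[SEP]"]
    return wrap(xlnet_body), wrap(bert_body)

# Example:
# chars = list("这件事做得") + ["[MASK]"] + list("令人佩服")
# xlnet_in, bert_in = build_dual_inputs(chars, 5, "恰到好处")
```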

3.4. Feature Extraction Layer

After acquiring the embedding representation of the original input for initialization in the embedding layer, the idioms were encoded at the character level and the word level using XLNet and BERT-IDMbase, respectively, potentially introducing the ability to understand idioms in multiple dimensions. In the feature extraction layer, features were then extracted directly from the two input streams using MGIA, the multi-granularity integrated attention proposed in this paper. The models in this layer performed multi-granularity inference on top of their respective encoders, which was equivalent to testing two subjects who had learned the language in different ways, both using the same inference paradigm, consistent with human thinking patterns, to complete the test questions. The multi-granularity inference approach further amplified the differences between the multi-dimensional comprehension modes in the embedding layer, increasing the model’s potential in terms of multi-perspective reading inference capability and generalization performance.
Let the hidden-layer representation of the attention mechanism be $h_i$, and let the composite vector obtained by passing option $d_k$ through the embedding layer into the feature extraction layer also be denoted $d_k$. The computation between the hidden layers is shown in Equation (8):
$Q_j(d_k) = \mathrm{MGIA\_Mask}\!\left(\frac{Q_j(d_k) K_j(h_j)}{\sqrt{n}}\right).$

3.5. Semantic Interaction Layer

The vector flow passed through the feature extraction layer and entered the semantic interaction layer, where the context was fused with the combination of options plus interpretation. In this layer, multiple attention heads computed attention scores in parallel and integrated them in the manner shown in Equation (9). Two reading comprehension methods with different granularity grouped the results into a unified model in this layer and prepared the final output after splicing the vectors to obtain the overall attention in the form of Equation (10). The fusion vector at this point already contained the results after the expansion of multi-granularity inference and interpretation information.
$\mathrm{Attention}_{MGIA}(d_k)_i = h_i + \sum_{j=1}^{J} \sigma(Q_j(d_k)) \cdot V_j(h_j),$
$\mathrm{Attention}_{Total} = \mathrm{Concat}\!\left(\mathrm{Attention}_{MGIA}(d_{\mathrm{BERT\text{-}IDMbase}}), \mathrm{Attention}_{MGIA}(d_{\mathrm{XLNet}})\right).$

3.6. Answer Prediction Layer

Finally, the probability of the fused semantic information $\mathrm{Attention}_{Total}$ from the upper layer was computed through a linear fully connected layer, the model prediction was output, and the inferred answer was derived as shown in Equation (11):
$p_k = \frac{\exp\!\left(w \cdot \mathrm{Attention}_{Total}(d_k) + b\right)}{\sum_{j=1}^{K} \exp\!\left(w \cdot \mathrm{Attention}_{Total}(d_j) + b\right)}.$
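A minimal sketch of this prediction head is given below; the class name, the per-candidate input layout, and the seven-way shape are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    """Linear scoring plus softmax over the seven candidates, as in Eq. (11)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)   # weight w and bias b of Eq. (11)

    def forward(self, fused):
        # fused: (batch, 7, d_model), one fused Attention_Total vector per candidate d_k.
        logits = self.linear(fused).squeeze(-1)   # w · Attention_Total(d_k) + b
        return torch.softmax(logits, dim=-1)      # probabilities p_k over the options
```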
This concludes the introduction of the BERT-IDM-Full model proposed in this paper. Experiments comparing it with BERT-IDMbase and other baseline models are presented in the next section.

4. Experiment

4.1. Experimental Dataset

The dataset used in this paper, ChID, is a large-scale idiom completion dataset created in 2019. The idioms in the dataset all consist of four characters, and it contains a total of 580,807 text passages and 728,713 blanks to be filled. To ensure sample diversity, the texts in the dataset were mainly extracted from novels, essays, and newspapers. The out-of-domain data contain more characters per paragraph on average than the in-domain data (127 characters vs. 99 characters), and the number of blanks per passage is also higher (1.49 vs. 1.25). Other specific statistics are shown in Table 1.
Each vacancy to be filled is masked by the #idiom# marker, and seven candidates are provided. The candidates contain one correct choice, the ground truth. Three of the other six confusing items are randomly selected from the top ten in the set of idioms with a cosine similarity of 0.7 or less to the respective word embedding of the correct choice, and the other three are randomly sampled from other dissimilar idioms in the idiom database. This excludes choices whose meaning is too close to the correct answer, but ensures that the meaning is close enough to assess model comprehension, as shown in Figure 7. It is worth noting that the construction of the ChID dataset shows that there is a specific semantic connection between the candidate set and the correct choice, which is also consistent with the characteristics of real-world completion questions. The model needed to read and comprehend texts of a moderate length, make inferences about the meaning of the candidate set with idioms that were similar or irrelevant to the correct option, and output the answer.
The training, test, and development sets of the ChID dataset all consist of news and fiction passages, while only the out-of-domain dataset includes prose-type passages. This means that idiom usage in the out-of-domain dataset is more stylistically variable and harder to infer from the literary style of the genre. In addition, the frequency of idioms in the out-of-domain dataset is lower than in the in-domain data, and the passages are longer; thus, the idioms in the out-of-domain dataset are rarer and less exposed.

4.2. Evaluation Metrics

For ordinary binary classification tasks, accuracy, precision, recall, and F1 values can be calculated from the TP, FP, TN, and FN values in the confusion matrix, and P–R curves can then be plotted. In this paper, the cloze-style idiom reading comprehension problem is a seven-category problem in which the option with the highest predicted probability is selected from the seven options provided in the question. In this type of multi-class problem, the most straightforward metric is the number of correctly classified samples divided by the overall sample size, i.e., the accuracy. Since the seven categories corresponded to the seven alternatives, and different categories had no proximity relationships, there was no need to set up additional per-category binary evaluation metrics such as precision and recall. The experiments in this section therefore measured the accuracy over a total sample size of N, counting the cases in which the model’s predicted category $\hat{y}_i$ matched the ground truth $y_i$. The formulae are shown in Equations (12) and (13):
$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} f(y_i, \hat{y}_i),$
$f(y_i, \hat{y}_i) = \begin{cases} 1, & \hat{y}_i = y_i \\ 0, & \hat{y}_i \neq y_i \end{cases}.$
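Equations (12) and (13) amount to exact-match accuracy over the N questions, which can be computed as in the following illustrative snippet.

```python
def accuracy(y_true, y_pred):
    """Exact-match accuracy of Eqs. (12)-(13): the fraction of questions whose
    predicted option index equals the ground-truth option index."""
    assert len(y_true) == len(y_pred) and len(y_true) > 0
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Example: accuracy([3, 0, 5, 2], [3, 1, 5, 2]) returns 0.75
```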

4.3. Experimental Settings

Hyperparameters cannot be learned from the standard training process, and good hyperparameter settings can effectively improve model performance. When setting the hyperparameters for the experiments, we selected the best fine-tuning learning rate on the dev set [18] and trained with this value, ultimately choosing a learning rate of 5 × 10⁻⁵. Owing to the high computational and disk space requirements of the pre-trained model, the IDFP training stage performed sampling and incremental backup on the ChID development set in each round. If performance declined continuously, the training process was terminated early, and the model with the highest accuracy was selected to continue pre-training. Since the BERT-IDM-FULL model proposed in this paper is an improvement on BERT-IDMbase, its hyperparameters remained the same except for its specific components. The hyperparameter settings are shown in Table 2. MGIA uses the same hyperparameter settings as BERT-IDMbase, while the hyperparameter configuration of XLNet is recorded in Table 3.
In order to prevent the overfitting of the model, the validation set was tested in each round during the training phase of the model. If the accuracy did not improve further after two consecutive rounds on the validation set, the training process of the model was stopped, and the model with the highest accuracy round was used as the final model.
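The early-stopping rule described above (stop after two consecutive validation rounds without improvement and keep the best checkpoint) can be sketched as follows; the training and evaluation callables are placeholders, not the authors' code.

```python
def train_with_early_stopping(train_one_epoch, evaluate_dev, max_epochs=10, patience=2):
    """Stop training when dev accuracy fails to improve for `patience` consecutive
    rounds, and report the best round (whose checkpoint is kept as the final model)."""
    best_acc, best_epoch, rounds_without_gain = 0.0, -1, 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        acc = evaluate_dev(epoch)            # accuracy on the validation set this round
        if acc > best_acc:
            best_acc, best_epoch, rounds_without_gain = acc, epoch, 0
        else:
            rounds_without_gain += 1
            if rounds_without_gain >= patience:
                break
    return best_epoch, best_acc
```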

5. Results and Analysis

5.1. Experimental Results

This section presents the experimental investigations conducted on the proposed models using the ChID dataset and the CNCID dataset constructed in this study. From the perspectives of reading comprehension and out-of-domain generalization ability, commonly used baseline models were compared with the proposed BERT-IDMbase and BERT-IDM-FULL models. The experimental results indicated that the pre-trained models (PTMs) achieved significant improvements over the traditional models: BERT-base, RoBERTa, and BERT-IDM-FULL all outperformed the classic baseline models by a large margin. The overall experimental results are shown in Table 4.
This experiment retained the same BiLSTM baseline as XLNet for regression prediction, while XLNet was proposed as a large pre-training model with RoBERTa as the baseline and obtained better results for several tasks. The performance of XLNet was indeed stronger than that of RoBERTa and BERT-IDMbase in the experiments, but it was slightly weaker than that of the other two models on both the ChID out-of-domain and CNCID datasets, which indicated that the inference generalization ability of XLNet was slightly weaker than that of the other two baselines. This may have been due to the fact that the inference process modeled by the AR model only computed the log-conditional probabilities of each random variable and then summed them to obtain the log-likelihood, unlike the AE model, which learned the unsupervised representation of the data in the self-coding process.
Furthermore, because of XLNet’s powerful performance and slightly deficient generalization reasoning ability, using it as one of the two streams of fused semantic integration could effectively assist in the performance improvement of the model, while the self-encoding model used in the other stream provided the basis for the high generalization ability of the model. The integrated AR and AE models dealt with idioms as characters (character embedding) and idioms (idiom embedding), respectively, into which the paraphrase information of the idioms was expanded, bringing accuracy improvements of 1.2% and 1.6%, respectively, for the overall BERT-IDM-Full model proposed in this paper. Although this model achieved the best performance on all three datasets, there was still a substantial gap of close to 4% in reading comprehension ability compared with trained native Chinese speakers.
The accuracy convergence of the baseline models and the overall model in this paper is shown in Figure 8. The experimental results show that all models converged before the seventh training round and that all large-scale pre-trained models improved in performance. The baseline BERT converged noticeably worse because it lacked linguistic knowledge and the ability to process idioms, while the simple Bi-LSTM model converged faster but left a significant performance gap. The overall model in this paper achieved better convergence speed and prediction accuracy because it could fully parse idiom structure and acquire knowledge of idioms in the target domain through further pre-training, which demonstrated that the overall BERT-IDM-Full model proposed in this work had better robustness and better performance in reading and understanding idioms.

5.2. MGIA Experimental Analysis

There are two configurable parts in MGIA: one is the scheme for assigning the three types of attention to the 12 attention heads, and the other is the stride value of the stride attention. To find the attention head allocation scheme that improved performance, the following three attention combinations were chosen for this experiment; the results obtained by fixing each combination and varying the stride on the ChID test set are shown in Table 5. The three combinations were: 12 stride attention heads (12SA); 6 global attention heads and 6 stride attention heads (6GA + 6SA); and 4 global attention heads, 4 stride attention heads, and 4 window attention heads (4GA + 4SA + 4WA).
As can be seen from the table, the model performed better when the step size was 1, while any increase in the step size setting led to a decrease in performance. This may have been due to the fact that the sample sequence of the ChID dataset was too short, and a larger step size could not be chosen. The first pure cross-step attention scheme lacking global attention performed poorly due to the lack of full-text information, although it picked up later, probably as a result of model oscillation. With the addition of global attention and windowed attention as an expansion of global and local information, the third scheme achieved the best performance among all the schemes. Although the model performance decreased when the step size increased, it could be observed that the third scheme decreased more slowly and with fewer oscillations, which indicated that the multi-granularity inference approach designed in this section was more robust.
However, an important property of a stride of 1 was that each idiom was divided into two subwords, following the idiom mask setting in Section 3, and a stride of 1 kept the two subwords invisible to each other so that they provided no information to one another. Similar to BERT-WWM, which predicts whole words instead of characters, isolating the subwords in this way created a more challenging setting for model learning. The experiments in this section observed the effect of the three schemes on the robustness and oscillation resistance of the model under stride settings of 1 and 3. The experimental results are shown in Figure 9, where 1S and 3S represent the settings of stride 1 and stride 3, respectively.
From Figure 9, it can be seen that the final training losses under the step size 3 setting were all greater than when the step size was 1. The MGIA module loss drop for the hybrid attention head mechanism under the single step size setting was smoother, and the accuracy rate corresponding to the combination of loss settings in training was also better, as shown in Table 5, which demonstrated that higher robustness improved performance results and convergence.

5.3. Ablation Experiments

In this section, we present the ablation experiments conducted on the XLNet interpretation expansion module, a constituent element of BERT-IDM-Full, and the MGIA multi-granularity integrated attentional reasoning module to investigate the impact of both on the performance of the overall model and the effectiveness of model improvement. The experimental results obtained are shown in Table 6.
From the experimental results, it can be seen that the model with the MGIA module added for enhanced inference achieved a significant improvement of more than 1% over the baseline BERT-IDM model, which indicated that multi-granularity inference enabled BERT-IDM to catch up with XLNet in terms of performance; however, the generalization ability of the model under this setting was slightly weakened. The generalization ability of XLNet combined with the BERT-IDMbase model was greater than that of XLNet alone, and the substantial improvement over the baseline was a good indication of the effectiveness of incorporating idiom interpretation; however, these results were only comparable to the BERT-IDM model with MGIA added, which again was not the upper limit of XLNet’s capability. The results of both experiments indicated that only after the multi-granularity integration of MGIA could the AE and AR models each play to their strengths and achieve a better overall model structure.
In contrast, the overall model proposed in this work still significantly outperformed the other models on the new web-based idiom dataset CNCID, which indicated that the multi-granularity splitting of idioms followed by multi-granularity inference could actually enhance the model reading comprehension and fully improve the potential of the overall model.

6. Conclusions

This paper explored and studied problems in the field of Chinese idiom reading comprehension (CIRC), such as data scarcity, insufficient contextual interaction, and the lack of model interpretability. To address these issues, a masked-language-model-based machine reading comprehension (MRC) method for Chinese idioms was proposed, comprising pre-training and fine-tuning phases. To improve on the current practice of treating idioms as disjointed character sequences or unprocessed complete words, we proposed the BERT-IDM method based on a “2 + 2” masking pattern grounded in linguistic research on idiom structure. The model was further pre-trained to incorporate idiom definitions and extract internal structural information, enhancing idiom reading comprehension performance. During pre-training, we modified the self-attention mechanism in the Transformer structure used in BERT to enhance the model’s resistance to overfitting and its feature extraction ability. To address data scarcity and improve context–option interaction in the fine-tuning phase, we used multi-granularity integrated attention to simulate the various ways in which humans perform reading comprehension. Using the two pre-training models, XLNet and BERT-IDM, to learn idiom and context representations at different granularities and fusing them with multi-granularity attention, we introduced idiom definitions as external knowledge for multi-granularity inference and prediction. A new cross-domain idiom dataset was also constructed to test the model’s generalization and inference abilities. The experimental results demonstrated the effectiveness of the proposed model on the CIRC task. In future work, we will improve the model’s ability to adapt to different idiomatic expressions by selecting appropriate inference paradigms and autonomously adjusting key factors such as the stride and the attention head allocation template.

Author Contributions

Conceptualization, Y.L., L.Y. and Y.F.; methodology, Y.D. and Y.F.; writing—original draft, Y.D.; writing—review and editing, Y.L. and L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key Research and Development Program of China (no. 2021YFF0901200).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The model was trained and tested on the ChID dataset and the self-built CNCID dataset. ChID dataset link: https://github.com/chujiezheng/ChID-Dataset. The CNCID dataset is not yet open source; if you need it, please contact the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv 2016, arXiv:1606.05250.
2. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z. Pre-Training with Whole Word Masking for Chinese BERT. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3504–3514.
3. Cui, Y.; Liu, T.; Yang, Z.; Chen, Z.; Ma, W.; Che, W.; Wang, S.; Hu, G. A Sentence Cloze Dataset for Chinese Machine Reading Comprehension. arXiv 2020, arXiv:2004.03116.
4. Zheng, C.; Huang, M.; Sun, A. ChID: A Large-scale Chinese IDiom Dataset for Cloze Test. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019.
5. Lehnert, W.G. The Process of Question Answering; Yale University: New Haven, CT, USA, 1978.
6. Hirschman, L.; Light, M.; Breck, E.; Burger, J.D. Deep Read: A Reading Comprehension System. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, MD, USA, 20–26 June 1999; Association for Computational Linguistics: Stroudsburg, PA, USA, 1999; pp. 325–332.
7. Choi, I.C.; Kim, K.S.; Boo, J. Comparability of a paper-based language test and a computer-based language test. Lang. Test. 2003, 20, 295–320.
8. Charniak, E.; Altun, Y.; de Salvo Braz, R.; Garrett, B.; Kosmala, M.; Moscovich, T.; Pang, L.; Pyo, C.; Sun, Y.; Wy, W.; et al. Reading Comprehension Programs in a Statistical-Language-Processing Class. In ANLP-NAACL 2000 Workshop: Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems; Association for Computational Linguistics: Stroudsburg, PA, USA, 2000.
9. Riloff, E.; Thelen, M. A Rule-Based Question Answering System for Reading Comprehension Tests. In Proceedings of the ANLP/NAACL Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems, Seattle, WA, USA, 4 May 2000.
10. Poon, H.; Christensen, J.; Domingos, P.; Etzioni, O.; Hoffmann, R.; Kiddon, C.; Lin, T.; Ling, X.; Ritter, A.; Schoenmackers, S.; et al. Machine Reading at the University of Washington. In Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, Los Angeles, CA, USA, 6 June 2010; pp. 87–95.
11. Hermann, K.M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; Blunsom, P. Teaching Machines to Read and Comprehend; MIT Press: Cambridge, MA, USA, 2015.
12. Rajpurkar, P.; Jia, R.; Liang, P. Know What You Don't Know: Unanswerable Questions for SQuAD. arXiv 2018, arXiv:1806.03822.
13. Bajaj, P.; Campos, D.; Craswell, N.; Deng, L.; Gao, J.; Liu, X.; Majumder, R.; McNamara, A.; Mitra, B.; Nguyen, T.; et al. MS MARCO: A Human Generated Machine Reading Comprehension Dataset. arXiv 2016, arXiv:1611.09268.
14. Chen, D.; Bolton, J.; Manning, C.D. A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016.
15. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 1 May 2023).
16. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. arXiv 2018, arXiv:1802.05365.
17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
18. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
19. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
20. Guu, K.; Hashimoto, T.B.; Oren, Y.; Liang, P. Generating Sentences by Editing Prototypes. Trans. Assoc. Comput. Linguist. 2018, 6, 437–450.
21. Zhang, Z.; Han, X.; Zhou, H.; Ke, P.; Gu, Y.; Ye, D.; Qin, Y.; Su, Y.; Ji, H.; Guan, J.; et al. CPM: A Large-Scale Generative Chinese Pre-Trained Language Model. AI Open 2021, 2, 93–99.
22. Li, X.; Meng, Y.; Sun, X.; Han, Q.; Yuan, A.; Li, J. Is Word Segmentation Necessary for Deep Learning of Chinese Representations? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019.
23. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144.
24. Xu, L.; Hu, H.; Zhang, X.; Li, L.; Cao, C.; Li, Y.; Xu, Y.; Sun, K.; Yu, D.; Yu, C.; et al. CLUE: A Chinese Language Understanding Evaluation Benchmark. arXiv 2020, arXiv:2004.05986.
25. Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Tian, H.; Wu, H.; Wang, H. ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding. Proc. AAAI Conf. Artif. Intell. 2020, 34, 8968–8975.
26. Yang, H.R. Study on the Composition Grammar of Four-Character Idioms; Wuhan University of Technology: Wuhan, China, 2018.
27. Goldberg, A.E. Constructions: A Construction Grammar Approach to Argument Structure; University of Chicago Press: Chicago, IL, USA, 1995.
28. Lei, W.; Guojun, L. A Cognitive Interpretation of the Co-occurrence Pattern of Antonyms in Modern Chinese: "Mei A Mei B". Foreign Lang. Stud. 2016, 38, 5–12.
29. Xie, X.Y.; Bai, C. Meaning or Form: A Study on the Cognitive Mechanism of Chinese Idioms. Nat. Dialectics Lett. 2017, 39, 6.
30. Yang, H.R. Cognitive Interpretation of the 2 + 2 Structure of Chinese Idioms from the Perspective of Constructional Grammar. J. Hubei Univ. Econ. Humanit. Soc. Sci. Ed. 2018, 15, 3.
31. Madabushi, H.T.; Gow-Smith, E.; Garcia, M.; Scarton, C.; Idiart, M.; Villavicencio, A. SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding. arXiv 2022, arXiv:2204.10050.
Figure 1. Overall structure of the proposed model.
Figure 2. Diagram of the pretraining process of BERT-IDM.
Figure 3. Schematic diagram of the SMSA mechanism.
Figure 4. Calculation process of the SMSA mechanism.
Figure 5. Attention mask under multi-head settings.
Figure 6. Schematic diagram of the MGIA mechanism.
Figure 7. Example of data in the ChID dataset.
Figure 8. Experimental results of the models.
Figure 9. Experimental results of training loss.
Table 1. Statistical information regarding the ChID dataset.
Statistical Item | In-Domain Training Set | In-Domain Development Set | In-Domain Test Set | Out-of-Domain Set | Total
Number of paragraphs | 520,711 | 20,000 | 20,000 | 20,096 | 580,807
Average length of text segments | 99 | 99 | 99 | 127 | 100
Number of idioms included | 3848 | 3458 | 3502 | 3626 | 3848
Average word frequency | 168.6 | 7.2 | 7.1 | 8.3 | 189.6
Total number of blanks | 648,920 | 24,822 | 24,948 | 30,023 | 728,713
Average number of blanks per paragraph | 1.25 | 1.24 | 1.25 | 1.49 | 1.25
Table 2. Hyperparameter configuration of Experiment 1.
Parameter Name | Parameter Value
Learning rate | 5 × 10⁻⁵
Batch size | 32
Number of training epochs | 10
Dropout probability | 0.1
Optimizer | Adam (β₁ = 0.9, β₂ = 0.999)
Transformer layers | 12
Number of attention heads | 12
Maximum sequence length | 256
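For orientation only, the following is a minimal sketch of how the Table 2 settings could be expressed in PyTorch. The model object is a placeholder rather than the BERT-IDM implementation, and all variable names are illustrative assumptions, not identifiers from the released code.

```python
import torch

# Pre-training hyperparameters as listed in Table 2 (Experiment 1).
pretrain_config = {
    "learning_rate": 5e-5,
    "batch_size": 32,
    "num_epochs": 10,
    "dropout": 0.1,
    "num_layers": 12,
    "num_attention_heads": 12,
    "max_seq_length": 256,
}

# Placeholder module so the optimizer call below is runnable as written;
# the actual model is a 12-layer, 12-head Transformer encoder.
model = torch.nn.Linear(768, 768)

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=pretrain_config["learning_rate"],
    betas=(0.9, 0.999),  # beta_1 and beta_2 from Table 2
)
```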
Table 3. Hyperparameter configuration of Experiment 2.
Parameter Name | Parameter Value
Learning rate | 2 × 10⁻⁵
Batch size | 32
Hidden dimensions | 768
Dropout probability | 0.1
Optimizer | RAdam
Transformer layers | 12
Activation function | ReLU
Maximum sequence length | 256
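Analogously, a hedged sketch of the Experiment 2 fine-tuning setup is given below, assuming a recent PyTorch version that provides torch.optim.RAdam. The two-layer head and its output size are illustrative assumptions; only the hidden size (768), dropout, ReLU activation, and RAdam optimizer follow Table 3.

```python
import torch

# Fine-tuning settings from Table 3 (Experiment 2).
finetune_config = {"learning_rate": 2e-5, "hidden_dim": 768, "dropout": 0.1}

# Illustrative classification head over 768-dimensional encoder outputs;
# the final output size (number of candidate idioms) is a hypothetical value.
head = torch.nn.Sequential(
    torch.nn.Linear(finetune_config["hidden_dim"], finetune_config["hidden_dim"]),
    torch.nn.ReLU(),                               # activation from Table 3
    torch.nn.Dropout(finetune_config["dropout"]),  # dropout from Table 3
    torch.nn.Linear(finetune_config["hidden_dim"], 10),
)

optimizer = torch.optim.RAdam(head.parameters(), lr=finetune_config["learning_rate"])
```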
Table 4. Overall experimental results (accuracy, %).
Model Name | ChID Test | ChID Out | CNCID
BiLSTM | 71.5 | 61.3 | 25.8
BERT-base | 78.2 | 64.3 | 37.6
RoBERTa | 81.8 | 65.9 | 43.8
BERT-IDM-base | 81.6 | 66.3 | 44.2
XLNet | 82.0 | 65.7 | 41.6
BERT-IDM-Full | 83.2 | 67.5 | 46.7
Human performance | 87.1 | 86.2 | 78.0
Table 5. Stride–performance relationships.
Scheme | Step Length 1 | Step Length 2 | Step Length 3 | Step Length 4 | Step Length 5
12SA | 81.2 | 80.5 | 79.4 | 79.1 | 79.9
6GA + 6SA | 82.8 | 82.1 | 81.7 | 81.0 | 80.2
4GA + 4SA + 4WA | 83.0 | 82.8 | 82.5 | 81.9 | 81.4
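To make the row labels in Table 5 more concrete, the sketch below shows one plausible way to partition twelve attention heads into global, local, and word-level patterns for a given step length. It is an illustration of the idea only: the expansions of GA, SA, and WA as global, sliding-window, and word-level attention are assumptions on our part, the step length is treated here as the half-width of a local window, and the mask construction actually used in BERT-IDM may differ.

```python
import torch

def build_head_masks(seq_len: int, step: int, n_global: int, n_local: int, n_word: int):
    """Illustrative boolean attention masks for one 12-head layer.

    Assumptions (not taken from the paper's code): GA heads attend to the whole
    sequence, SA heads attend to a local window whose half-width equals the step
    length, and WA heads are approximated by a window of 1. True marks key
    positions a query position may attend to.
    """
    idx = torch.arange(seq_len)
    dist = (idx[None, :] - idx[:, None]).abs()

    global_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)  # GA pattern
    local_mask = dist <= step                                     # SA pattern
    word_mask = dist <= 1                                         # WA placeholder

    masks = [global_mask] * n_global + [local_mask] * n_local + [word_mask] * n_word
    return torch.stack(masks)  # shape: (num_heads, seq_len, seq_len)

# The "4GA + 4SA + 4WA" row of Table 5 with step length 2, on a toy length-8 sequence.
masks = build_head_masks(seq_len=8, step=2, n_global=4, n_local=4, n_word=4)
print(masks.shape)  # torch.Size([12, 8, 8])
```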
Table 6. Ablation study of BERT-IDM (accuracy, %).
Model Name | ChID Test | ChID Out | CNCID
BERT-Chinese | 79.1 | 63.5 | 37.9
BERT-IDM-base | 81.6 | 66.3 | 44.2
BERT-IDM-MGIA | 82.7 | 65.9 | 44.5
BERT-IDM-XLNet | 82.6 | 66.0 | 42.5
BERT-IDM-Full | 83.2 | 67.5 | 46.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
