DaGATN: A Type of Machine Reading Comprehension Based on Discourse-Apperceptive Graph Attention Networks

Wu, Mingli; Sun, Tianyu; Wang, Zhuangzhuang; Duan, Jianyong

doi:10.3390/app132212156


School of Information, North China University of Technology, Beijing 100144, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(22), 12156; https://doi.org/10.3390/app132212156

Submission received: 6 October 2023 / Revised: 23 October 2023 / Accepted: 3 November 2023 / Published: 9 November 2023

(This article belongs to the Special Issue AI for Computational Vision, Natural Language Processing, and Geoinformatics)


Abstract

In recent years, with the advancement of natural language processing techniques and the release of models like ChatGPT, how language models understand questions has become a hot topic. However, the performance of pre-trained models on complex logical reasoning still has room for improvement. Inspired by DAGN, we propose an improved model, DaGATN (Discourse-apperceptive Graph Attention Networks). To learn logical clues in the text, we decompose the context, question, and answer into elementary discourse units (EDUs) and connect them with discourse relations to construct a discourse relation graph. The text features are learned through a discourse graph attention network and applied to downstream multiple-choice tasks. Evaluated on the ReClor dataset, our method achieved an accuracy of 74.3%, surpassing the best-known methods that use deberta-xlarge-level pre-trained models, and it also performed better than ChatGPT (zero-shot).

Keywords:

machine reading comprehension; graph attention network; logical reasoning

1. Introduction

Artificial intelligence has deeply influenced people’s work and daily lives. For instance, voice assistants like Siri and Cortana can now help users operate their devices, and ChatGPT, with its impressive capabilities, provides users with inspiration, references, and aids in decision-making. All of these rest on the ability of AI to correctly understand requirements expressed in natural language, which is closely related to the goal of machine reading comprehension tasks.

Machine reading comprehension (MRC) is a fundamental task in the field of natural language processing that requires models to respond to a given passage of text and related questions. Just as we assess human comprehension of a passage of text through reading comprehension tests, MRC can be used to assess a computer system’s ability to understand human language.

As machine reading comprehension is one of the important research tasks in natural language processing, a large number of reading comprehension datasets have been proposed, such as SQuAD [1], an extractive reading comprehension dataset based on Wikipedia; HotpotQA [2], a multi-hop reading comprehension dataset that requires extracting information from multiple distinct text passages; and DROP [3], a generative reading comprehension dataset that assesses discrete reasoning abilities. As the datasets continue to evolve, their difficulty is gradually increasing. Logical reasoning ability has long been considered a key thinking ability of the human brain [4], and its importance has also been recognized by many researchers in the field. Several challenging multiple-choice logical reasoning reading comprehension datasets have been built, such as LogiQA [5] and ReClor [6].

Following Google’s proposal of the BERT [7] model, methods based on pre-trained language models, which can fully exploit the predictive information and prior knowledge obtained from massive training data, have achieved substantial performance gains on 11 downstream tasks in the natural language processing domain, including machine reading comprehension. The Transformer architecture was originally proposed to address sequence transformation and machine translation tasks [8]. Its encoding layer employs a self-attention mechanism and significantly improves performance compared with RNN-based methods. Subsequently, an increasing number of NLP tasks have adopted methods based on pre-trained models, including named entity recognition [9,10,11], machine translation [12,13], and machine reading comprehension [14,15,16,17,18].

Logical reasoning requires a correct understanding of the logical relationships between different sentences, for example, pointing out a positive example that strengthens a conclusion or a negative example that weakens it. This requirement places higher demands on existing reading comprehension models, since the inference capability of many models relies heavily on entities and their numerical weights. Owing to the complexity of logical reasoning machine reading comprehension problems, pre-trained language models still do not perform sufficiently well on such tasks and struggle to reach the average human level.

In recent years, researchers have worked on designing specific model architectures for integrating logical structures. Jiao et al. [4] proposed introducing symbolic logic as data expansion into neural network models using self-supervised and contrastive learning. Wang et al. [17] proposed LReasoner, a context extension and data enhancement framework based on parsing logical expressions.

DAGN [18] introduced a novel approach to this task: it uses graph structures to model the abstract logical relationships in logical reasoning tasks and employs computational methods such as GNNs or graph Transformers to simulate the reasoning process. After that, Li et al. [16] and Ouyang et al. [19] also proposed implicit inference of the logical information of articles using graph structures, and these approaches have made a certain degree of progress on various datasets. Previous research suggests that an intuitive way to identify logical relationships between text units is to use discourse relations [20], such as words like “because” and “therefore” for cause-effect relationships, “if” for hypothetical relations, and the implicit logical relations brought about by punctuation. Modeling logical structure has so far proven to be one of the effective methods for enhancing the logical reasoning of widely used pre-trained models.

However, all of the above approaches suffer from a certain degree of suboptimal generalization ability and overfitting. First, logical expression-based data enhancement methods still focus on sentence-level interactions between tokens, cannot capture abstract logical relations well, and are prone to overfitting during training. As a result, they sacrifice accuracy on certain types of questions, such as deductive reasoning, in order to improve accuracy on other specific types, such as inductive reasoning. Second, in neural networks based on graph structures, position information is lost during processing because of the structural specificity of graph data. For language models built on the Transformer architecture, absolute and relative position embeddings are both important components of the encoding layer [21], because the parallel computation of the Transformer also discards position information, which is important for understanding semantics. Third, most existing graph-based methods use graph convolution, and the computed EDU embedding is aggregated with the token embedding of the pre-trained model and sent to a Bi-GRU [18,22] or a multi-head attention mechanism [8,23]. This simple aggregation of self and neighbor node information by averaging does not combine the EDU-level granularity of discourse units with the attention mechanism.

To address the above problems, we improve three aspects: the construction of the discourse graph, the node encoding method, and the graph reasoning computation. We propose DaGATN, a machine reading comprehension model for logical reasoning multiple-choice tasks. The main contributions are summarized as follows:

(1): To improve the logical reasoning process, we propose a discourse graph construction approach that uses punctuation and explicit connectives for node segmentation, aided by positional encoding. It uses PDTB2.0 [24] to partition the context and answers into nodes. The nodes are then positionally encoded according to their order in the original text, which supplements the positional information missing in DAGN.
(2): So that the node position encoding carries both absolute and relative position information, we use a periodic-function-based approach that encodes the dimensions of each node position vector in pairs. This ensures that each position vector is unique and bounded, while different position vectors can be obtained from one another by linear transformation. The model therefore generalizes more easily when dealing with position vectors and better handles sequences whose lengths differ from those in the training data.
(3): Existing models such as DAGN still leave room to improve message passing among nodes during graph reasoning. We therefore design a discourse graph attention network based on a multi-head attention mechanism. Attention weight coefficients adaptively gather information from neighboring nodes, simulating the different degrees of focus placed on each condition during reasoning, which enhances performance on the inductive task.

2. Related Work

2.1. Logical Reasoning Machine Reading Comprehension

In recent years, with the success of pre-trained language models in NLP, many pre-trained language models (e.g., BERT [7], RoBERTa [25], XLNet [26], GPT-3 [27], etc.) have met or exceeded human performance on popular MRC datasets.

However, those MRC datasets contain little or no data that examines logical reasoning ability. For example, according to Sugawara and Aizawa et al. [28], there is no logical reasoning content in the MCTest [29] dataset, while only 1.2% of the SQuAD dataset requires logical reasoning to answer questions. Therefore, Yu et al. [6] proposed the ReClor dataset, which focuses on examining logical reasoning ability. A task related to logical inference MRC is Natural Language Inference (NLI), which requires the model to classify the logical relationship of a given sentence pair. However, the NLI task only considers three simple logical relations (implication, contradiction, and irrelevance) at the sentence level, whereas logical reasoning MRC is more challenging because it needs to predict multiple complex logical relations at the passage level to determine the answer.

2.2. Research Actuality

As shown in Table 1, the approaches for logical reasoning machine reading comprehension in recent years can be divided into the following categories:

The first category takes the pre-training perspective: heuristic rules capture logical relations in large corpora, and corresponding training tasks are designed to further train existing pre-trained language models, as in MERIt [4] and LogiGAN [33]. MERIt uses rules to construct data from a large amount of unlabeled text, modeled after the form of the logical inference MRC task, for self-supervised pre-training with contrastive learning. LogiGAN first uses pre-specified logical indicators (e.g., “therefore”, “due to”, “we may infer that”) to identify logical inference phenomena in large-scale unlabeled text, and then masks the expressions that follow the logical indicators and trains the generative model to recover the masked expressions.

The second category takes the data enhancement perspective: implicit expressions are inferred symbolically from logical equivalence laws, and the given text is expanded to match the answers, as in LReasoner [17]. LReasoner proposes a logic-driven context extension framework that integrates three steps: logical identification to parse out logical expressions from the context, logical extension to derive implicit expressions, and logical verbalization to predict the answer.

The third category uses predefined rules to construct a graph structure based on the content of the text and options. The nodes of the graph correspond to logical units in the text, i.e., meaningful sentences or text fragments, and the edges of the graph represent the relationships between the logical units. The logical reasoning process is then modeled with methods such as Graph Neural Networks (GNN) and Graph Transformers [36], thereby enhancing logical reasoning performance.

As the difficulty of the logical reasoning machine reading comprehension task continues to increase, merely focusing on the interaction between tokens at the sentence-level granularity is far from sufficient. Models need to establish relationships between sentences at a holistic level consisting of context, questions, and answers. However, logical relationships are difficult to extract as implicit structures hidden in the context, and the existing datasets are not labeled with logical structures. Therefore, DAGN proposed by Huang et al. [18] and Logiformer proposed by Xu et al. [34] both use graph structure to represent logical information in the context. DAGN uses discourse relations in PDTB2.0 [24] as separators to divide articles into multiple elementary discourse units (EDUs). The graph structure is obtained by using EDUs as nodes and discourse relations as edges, and the graph network is used to learn logical features of the text from EDUs to improve its reasoning ability.

Graph neural networks (GNNs) are currently used successfully in logical reasoning tasks, but node-to-node message passing in these models is still inadequate, so articles and options continue to lack an adequate means of interaction. To address this challenge, AdaLoGN, proposed by Li et al. [16], employs directed textual logical graphs with predefined logical relations and lets these relations reason with each other according to certain rules, adaptively extending the constructed discourse graph so as to enhance symbolic reasoning capabilities. Logiformer, proposed by Xu et al. [34], uses a graph transformer to model the dependency relations in logical and syntactic graphs, respectively, and injects the structural information of each graph by introducing its adjacency matrix into the attention computation.

Although the above methods improve the reasoning ability of GNN models, there is still room for improving the inter-node message passing mechanism. Logiformer only constructs logical graphs based on logical words and punctuation marks that represent causal relationships in the text. This approach easily overlooks other logical relationship information, such as coordination, continuation, contrast, and comparison. The graph convolution networks of DAGN and AdaLoGN use a fixed neighbor node sampling strategy, which might lead to information loss or redundancy.

3. Methodology

In this paper, we propose a new approach to the logical reasoning multiple-choice task that integrates discourse-based and punctuation-based information through a graph attention network. The text is first separated into elementary discourse units (EDUs) using the discourse relations of the Penn Discourse Treebank (PDTB2.0). Next, the discourse units are positionally encoded to reinforce their position information, and discourse graphs and punctuation graphs are constructed with discourse units as nodes and with connectives and punctuation marks as edges. Finally, higher-level discourse logic features are learned from the discourse graph and the punctuation graph using discourse-apperceptive graph attention networks (DaGATN). By fusing discourse logic features, punctuation features, and the token-level contextual representations of the pre-trained model, DaGATN predicts answers to multiple-choice logical inference questions.

3.1. Overall Architecture

The overall structure of DaGATN is shown in Figure 1. The model can be divided into two parts: the construction of a discourse graph and the acquisition of discourse features from this graph structure. The data are pre-processed and fed into a large pre-trained language model, which serves as the backbone, in the form “[CLS]context[SEP]question + answer[SEP]”, and token-level feature representations are obtained through its encoding layer. Elementary discourse units (EDUs), the basic units of logical reasoning, are linked by discourse relations that are signaled by either “explicit” or “implicit” connectives. Explicit connectives are taken from PDTB2.0 [24], which contains manually annotated explicit discourse relations on the 1-million-word Wall Street Journal (WSJ) corpus. Implicit connectives are four types of punctuation marks: periods, commas, semicolons, and colons. Details of “explicit” and “implicit” connectives can be found in Appendix A.

Clearly, “explicit” connectives embody very distinct logical relationships, determined by the intrinsic meanings of the connectives themselves. Discourse connectives in the context, such as “because” and “if”, are considered to correspond to logical relations in the text, where “because” represents a causal relation and “if” a hypothetical relation.

However, segmenting EDUs solely based on this does not resonate with human linguistic habits. As is widely recognized, humans typically express themselves in units of sentences, and there is a slight pause between sentences. This is intuitively represented in text by a series of punctuation marks such as “periods”, “commas”, “semicolons”, and “colons”. The logical relationships implied by these punctuation marks are often diverse and hard to define, hence we categorize them as “implicit” connectives.

The context and answers are divided into EDUs using PDTB2.0, and the embedding of each EDU is derived from its constituent tokens. Considering the significant difference between the two kinds of connectives, the argument graph and the punctuation graph are constructed separately over the pre-processed EDUs, using connectives and punctuation marks as edges, respectively, so that the two types of edges have independent weight calculations when attention coefficients are computed. We believe that explicit connectives have a higher priority. Therefore, if both a punctuation mark and an explicit connective exist between two EDUs, the explicit connective takes precedence.

Based on the information in the graph, we use graph attention networks for graph reasoning in order to obtain higher-level representations of EDUs that enhance the token features and to address long-distance dependencies between nodes. Meanwhile, since a graph neural network ignores the sequential order of EDUs during reasoning, we encode the position information of the EDU nodes and add it to the EDU embeddings.

After message passing in the graph attention network, the EDU features are fused back into the embeddings of their corresponding constituent token sequences, which yields token features enhanced with discourse logic. For the resulting discourse-enhanced token embeddings, the token sequence representing the context and the token sequence representing the question and option are each pooled via weighted summation, and the pooled vectors are concatenated with the “[CLS]” embedding output by the backbone model, which represents the global features. Finally, the answer is predicted after mapping by a feedforward neural network.

3.2. Discourse Graph Construction and Positional Encoding of EDUs

Inspired by previous research [16,18], the text is segmented into EDUs using PDTB 2.0, and the sentences are segmented in the following way:

The sentences in the text and in each option are segmented at conjunctions and punctuation marks, and each conjunction or punctuation mark is treated as an edge connecting the EDUs before and after it. For example, the option in Figure 1 is separated into two linguistic units, E5 and E6, where E5 is “Only very careful drivers use headlights”, E6 is “their use is not legally required”, and the edge r connecting them is “when”. For each sample with one context and multiple options, a graph is constructed separately for each option; the graph corresponding to option k is denoted by $G_{k} = (V^{k}, E^{k})$.
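The segmentation rule above, together with the precedence of explicit connectives over punctuation, can be sketched roughly as follows. This is only an illustration under stated assumptions: `segment_edus` is a hypothetical helper, the connective set is a tiny stand-in for the full PDTB 2.0 list in Appendix A, and tokenization is a simple regular expression rather than the backbone model's tokenizer.

```python
import re

# Hypothetical, deliberately tiny delimiter sets; the full lists are in Appendix A.
EXPLICIT_CONNECTIVES = {"because", "if", "when", "therefore"}
PUNCTUATION = {".", ",", ";", ":"}


def segment_edus(text):
    """Split text into EDUs and return (edus, edges).

    Each edge is (left_index, right_index, delimiter, kind), where kind is
    "explicit" for a connective and "implicit" for a punctuation mark.
    """
    tokens = re.findall(r"\w+|[.,;:]", text)
    edus, edges = [], []
    current, pending = [], None  # pending: delimiter waiting for its right-hand EDU

    def close_edu():
        nonlocal current, pending
        if not current:
            return
        edus.append(" ".join(current))
        current = []
        if pending is not None and len(edus) > 1:
            edges.append((len(edus) - 2, len(edus) - 1) + pending)
        pending = None

    for tok in tokens:
        low = tok.lower()
        if low in EXPLICIT_CONNECTIVES:
            delim = (low, "explicit")
        elif tok in PUNCTUATION:
            delim = (tok, "implicit")
        else:
            current.append(tok)
            continue
        close_edu()
        if not edus:
            continue  # a leading delimiter has no EDU on its left; this sketch drops it
        # An explicit connective takes precedence over punctuation between the same two EDUs.
        if pending is None or delim[1] == "explicit":
            pending = delim
    close_edu()
    return edus, edges


edus, edges = segment_edus(
    "Only very careful drivers use headlights when their use is not legally required."
)
```

Splitting `edges` by their `kind` field would give the edge sets of the argument graph and the punctuation graph, respectively.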

Since the embedding of an EDU depends on the tokens that make it up, the original text is first input to the pre-trained language model in the form $[CLS] + Context + [SEP] + Question + Option + [SEP]$, where $[CLS]$ and $[SEP]$ are special tokens for DeBERTa, to obtain the output sequence $S = \{t_{0}, t_{1}, \dots, t_{n-1}, t_{n}\}$ of its encoder layer. The output embedding sequence is segmented at the pre-marked positions, and the embedding $e_{n}$ of the n-th EDU is obtained in the following way:

$$e_{n} = \sum_{l \in s_{n}} t_{l} + PE(n), \tag{1}$$

where $s_{n}$ is the index set of the tokens that make up the n-th EDU node and $PE(n)$ is the position embedding of the node. The meaning of this position information and the way to obtain it are described in the remainder of this subsection.
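As a rough illustration of Eq. (1), the sketch below sums the token vectors of each EDU and adds a position vector that is taken as given. Plain Python lists stand in for tensors; `edu_embeddings`, its argument names, and the toy values are assumptions made for this example only.

```python
def edu_embeddings(token_vectors, edu_token_indices, position_vectors):
    """Eq. (1): e_n = sum over l in s_n of t_l, plus PE(n).

    token_vectors[l]      -> encoder output t_l for token l
    edu_token_indices[n]  -> index set s_n of the n-th EDU
    position_vectors[n]   -> position embedding PE(n), supplied by the caller
    """
    embeddings = []
    for n, indices in enumerate(edu_token_indices):
        dim = len(position_vectors[n])
        summed = [0.0] * dim
        for l in indices:
            for d in range(dim):
                summed[d] += token_vectors[l][d]
        embeddings.append([summed[d] + position_vectors[n][d] for d in range(dim)])
    return embeddings


# Toy input: four 2-dimensional token vectors grouped into two EDUs.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 1.0]]
spans = [[0, 1], [2, 3]]
positions = [[0.0, 1.0], [0.5, 0.5]]
e = edu_embeddings(tokens, spans, positions)
```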

Graph neural networks usually process data without considering the position information of the data, only the abstract concept of neighbors. In contrast, the example in Figure 2 demonstrates that even if the neighbors are the same, the order of precedence is actually crucial for semantics.

The two sentences in Figure 2 represent completely opposite meanings, and the reason is that the EDUs are in opposite sequential order. When performing graph attention computation, since there is no sequential relationship between each node, shuffling all the EDUs in a sentence will not cause any change in the embedding of an EDU. In previous approaches, when faced with the lack of positional information, there was a risk of ambiguity in reasoning. The typical solution was to feed the enhanced token sequence into a recurrent neural network before predicting the answer. However, this method increased computational complexity and reduced efficiency, while also elevating the risk of overfitting. Therefore, position encoding is added to address the lack of positional relationships between nodes during graph computation so that the same data in different positions are distinguished from each other, but the representation of position information needs to satisfy the following conditions:

First, it can represent the absolute position of a token in the sequence.

Second, the relative positions/distances of tokens in different sequences should be consistent in the case of different sequence lengths.

Inspired by Transformer [8], for the EDU at position $t \in pos$, suppose its position embedding is $PE_{t} \in R^{d}$, where $R^{d}$ denotes all vectors of dimension d. $PE_{t}$ can be expressed in the following way:

$$PE_{t}^{(i)} = \begin{cases} \sin(pos / 10000^{2i / d_{model}}), & i \bmod 2 = 0 \\ \cos(pos / 10000^{2i / d_{model}}), & i \bmod 2 = 1 \end{cases} \tag{2}$$

where $PE_{t}^{(i)}$ is the i-th dimension of the position vector and $d_{model}$ is the dimension of the EDU embedding. The primary purpose of choosing the constant 10,000 as the denominator is to extend the wavelength of the periodic function to a very high level. From the above formula for positional encoding, we can identify the following characteristics:

Due to the very low periodic change frequency of the function, it ensures that even if the EDU sequence is very long, the position vector for each token remains unique.

The characteristics of the periodic function ensure that every value in the positional vector is bounded and lies in a continuous space. This makes it easier for the model to generalize when handling positional vectors and better handle sequences whose lengths differ from the training data.

The vectors representing different positions can be obtained from one another through linear transformation. We will prove this in the following paragraph.
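The first two characteristics can be checked numerically with a short sketch. It assumes the standard Transformer pairing, in which dimensions 2k and 2k+1 share one frequency (the "two-by-two" grouping mentioned in the contributions); `positional_encoding` and its `base` parameter are names chosen for this example.

```python
import math


def positional_encoding(position, d_model, base=10000.0):
    """Sinusoidal encoding of one EDU position.

    Assumption of this sketch: dimensions 2k and 2k+1 form a sin/cos pair
    that shares the frequency 1 / base ** (2k / d_model).
    """
    vector = []
    for i in range(d_model):
        pair = i // 2
        angle = position / (base ** (2 * pair / d_model))
        vector.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vector


# 50 EDU positions with 128 dimensions each, matching the setting of Figure 3.
vectors = [positional_encoding(p, 128) for p in range(50)]
all_bounded = all(-1.0 <= v <= 1.0 for vec in vectors for v in vec)
all_unique = len({tuple(vec) for vec in vectors}) == len(vectors)
```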

Figure 3 visualizes the first 128 dimensions of the position encoding for a sequence of 50 EDUs. Owing to the nature of the sin/cos functions, every value of the position vector lies in the bounded continuous interval [−1, 1]. Moreover, from the angle-addition identities of trigonometric functions,

$$\begin{cases} \sin(\alpha + \beta) = \sin\alpha \cos\beta + \cos\alpha \sin\beta \\ \cos(\alpha + \beta) = \cos\alpha \cos\beta - \sin\alpha \sin\beta \end{cases} \tag{3}$$

we find that

$$\begin{cases} PE^{(2i)}(t + k) = PE^{(2i)}(t) \times PE^{(2i+1)}(k) + PE^{(2i+1)}(t) \times PE^{(2i)}(k) \\ PE^{(2i+1)}(t + k) = PE^{(2i+1)}(t) \times PE^{(2i+1)}(k) - PE^{(2i)}(t) \times PE^{(2i)}(k) \end{cases} \tag{4}$$

It can be seen that for any position $t + k \in pos$, the position embedding is a linear combination of the components of the vector at position t, with coefficients given by the vector at position k. The position vector therefore also contains relative position information.
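The same relation can be written in matrix form, which makes the linearity explicit. The following is a worked restatement of the identities in Eq. (3), assuming that dimensions $2i$ and $2i+1$ share one frequency; the symbol $\omega_{i}$ is introduced here for convenience and does not appear in the original notation.

```latex
\begin{pmatrix} PE^{(2i)}(t+k) \\ PE^{(2i+1)}(t+k) \end{pmatrix}
=
\begin{pmatrix}
\cos(\omega_{i} k) & \sin(\omega_{i} k) \\
-\sin(\omega_{i} k) & \cos(\omega_{i} k)
\end{pmatrix}
\begin{pmatrix} PE^{(2i)}(t) \\ PE^{(2i+1)}(t) \end{pmatrix},
\qquad
\omega_{i} = 10000^{-2i / d_{model}}
```

The matrix depends only on the offset $k$ and not on $t$, so moving by a fixed relative distance corresponds to the same linear map (a rotation) at every absolute position.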

3.3. Graph Attention Networks

Previous studies have shown that for transductive tasks, where both the training and testing phases are based on the same graph structure, convolution-style message passing over the graph can effectively capture local structural information between nodes [37]. For inductive tasks, however, this method uses a fixed neighbor node sampling strategy in which each node only considers information from its one-hop neighbor nodes. This message passing method, which DAGN adopts, may lead to loss or redundancy of information.

In contrast, graph attention networks (GAT) use an adaptive attention mechanism that can better capture the relationships between nodes by assigning different weights to each node’s neighbor nodes during message passing [38]. This approach allows for a better understanding of local and global information in the graph structure.

By analyzing logical reasoning tasks, we can conclude that, as each question’s passage and answer are distinct, every prediction is inferred based on different graph structures. Thus, this is a very typical Inductive task. For such tasks, compared to Graph Convolutional Networks that uniformly aggregate neighboring node features, Graph Attention Networks, which can automatically capture and learn heterogeneous relationships between nodes and assign unique weights to each adjacency relationship using the attention mechanism, are a more effective choice. Furthermore, Graph Attention Networks can utilize multi-head attention, implying they can learn multiple sets of attention weights in parallel, further enhancing the model’s expressive capability.

Figure 4 shows the process of calculating the attention coefficients of GAT. For discourse unit $i$, the raw correlation coefficient $e_{ij}$ is calculated for each of its neighbors $j \in N_{i}$ and for itself, and the attention coefficient $a_{ij}$ is then calculated from $e_{ij}$.

3.3.1. Node Feature Transformation

Learning the correlation between EDUs $i$ and $j$ first requires a learnable linear mapping $W$ with shared parameters, which is essentially a matrix of size $F \times F'$, where $F$ is the input node feature dimension and $F'$ is the output dimension. For an EDU with feature representation $h$, the feature correlation coefficients with its neighbors can be written as follows:

$$e_{ij} = a([W h_{i} \| W h_{j}]), \quad j \in N_{i}, \tag{5}$$

For EDUs $i$ and $j$, the mapped features are concatenated to obtain $[W h_{i} \| W h_{j}]$, and the concatenated high-dimensional feature is mapped to a scalar by a single-layer FNN $a(\cdot)$ to obtain $e_{ij}$.
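A minimal sketch of Eq. (5) follows. It makes two simplifying assumptions that are not stated in the text: the single-layer FNN $a(\cdot)$ is modeled as a bias-free weight vector of length $2F'$, and `W` is stored with $F'$ rows and $F$ columns so that $W h$ is an ordinary matrix-vector product. The names `matvec` and `raw_scores` are illustrative.

```python
def matvec(W, h):
    """Multiply a matrix stored as a list of rows by a vector."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def raw_scores(i, neighbors, H, W, a):
    """Eq. (5): e_ij = a([W h_i || W h_j]) for every j in N_i.

    H[j] is the feature vector h_j; `a` is the weight vector of the scoring
    layer (assumed bias-free here). Returns a dict mapping j to e_ij.
    """
    Wh_i = matvec(W, H[i])
    scores = {}
    for j in neighbors:
        concatenated = Wh_i + matvec(W, H[j])  # list concatenation plays the role of ||
        scores[j] = sum(a_k * c_k for a_k, c_k in zip(a, concatenated))
    return scores


# Toy input: three EDUs with 2-dimensional features; node 0 attends to itself and both others.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, 0.0, 1.0, -1.0]
e_0 = raw_scores(0, [0, 1, 2], H, W, a)
```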

3.3.2. Attention Mechanism

Then all the correlation coefficients $e_{ik}$, $k \in N_{i}$, are activated by LeakyReLU and normalized by the Softmax function to obtain the attention coefficients $a_{ij}$:

$$a_{ij} = \frac{\exp(\mathrm{LeakyReLU}(e_{ij}))}{\sum_{k \in N_{i}} \exp(\mathrm{LeakyReLU}(e_{ik}))}, \tag{6}$$

The core idea behind the attention mechanism is that not all neighboring nodes are equally important to a given node. By computing attention coefficients between each node and its neighbors, we can allocate different weights to each neighbor. We compared the model performance after normalization using the raw correlation coefficient and the correlation coefficient after LeakyReLU activation. It turns out that the latter performs better.
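The normalization step can be sketched as below: each raw coefficient is passed through LeakyReLU and the results are normalized with a softmax over all neighbors of the node. The negative slope of 0.2 is the value commonly used for GAT and is an assumption here, since the text does not state it; `attention_coefficients` is an illustrative name.

```python
import math


def leaky_relu(x, negative_slope=0.2):
    """LeakyReLU; the slope 0.2 is an assumed default, not taken from the paper."""
    return x if x > 0 else negative_slope * x


def attention_coefficients(scores, negative_slope=0.2):
    """Softmax over LeakyReLU(e_ik) for all k in N_i.

    `scores` maps each neighbor index to its raw coefficient e_ij.
    """
    activated = {j: leaky_relu(e, negative_slope) for j, e in scores.items()}
    largest = max(activated.values())  # subtracted for numerical stability
    exps = {j: math.exp(v - largest) for j, v in activated.items()}
    total = sum(exps.values())
    return {j: v / total for j, v in exps.items()}


alpha = attention_coefficients({0: 1.5, 1: -0.5, 2: 0.5})
```

Because the softmax is monotonic, a neighbor with a larger raw coefficient always receives a larger weight, and the weights of one node sum to 1.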

3.3.3. Message Aggregation and Multi-Head Operation

Based on the calculated attention coefficients $a_{ij}$, the features of the neighbors and of the node itself are weighted and summed to obtain $h_{i}'$, which is the updated feature representation output by GAT for EDU $i$:

$$h_{i}' = \mathrm{ReLU}\Big(\sum_{j \in N_{i}} a_{ij} W h_{j}\Big), \tag{7}$$
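A single-head version of this aggregation, Eq. (7), can be sketched as follows. As in the text, the neighbor set is assumed to include the node itself; `aggregate` is an illustrative name, and `W` is stored as a list of rows.

```python
def aggregate(neighbors, alpha, W, H):
    """Eq. (7): h_i' = ReLU(sum over j in N_i of a_ij * W h_j).

    alpha[j] is the attention coefficient a_ij and H[j] the feature vector h_j.
    """
    dim_out = len(W)
    total = [0.0] * dim_out
    for j in neighbors:
        Wh_j = [sum(w * x for w, x in zip(row, H[j])) for row in W]
        for d in range(dim_out):
            total[d] += alpha[j] * Wh_j[d]
    return [max(0.0, v) for v in total]  # ReLU


# Toy input: node 0 with neighbors {0, 1, 2}, where 0 is the node itself.
H = [[1.0, 0.0], [0.0, 1.0], [2.0, -4.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
alpha = {0: 0.25, 1: 0.25, 2: 0.5}
h_new = aggregate([0, 1, 2], alpha, W, H)
```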

In this section, we introduce a multi-head mechanism with $K$ heads to stabilize the learning process of attention. For each head $k$, the above process is repeated to obtain multiple outputs, which are then concatenated. Thus, the output of each intermediate layer for EDU $i$ is as follows:

$$h_{i}' = \Big\|_{k=1}^{K} \mathrm{ReLU}\Big(\sum_{j \in N_{i}} a_{ij}^{k} W^{k} h_{j}\Big), \tag{8}$$

In the final layer of the graph attention network, concatenation is no longer advisable. Instead, we employ averaging. Thus, the final output representation of EDU $i$ is as follows:

$$h_{i}' = \mathrm{ReLU}\Big(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N_{i}} a_{ij}^{k} W^{k} h_{j}\Big), \tag{9}$$

The updated EDU feature representation is then used to enhance the token embeddings $t_{l}$ of the article: according to the set $s_{n}$ of token indexes that make up the n-th EDU node, the EDU feature is added to the token embedding at each corresponding position.

$$t_{l}' = t_{l} + h_{n}', \quad l \in s_{n} \tag{10}$$
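The two ways of combining heads and the final fusion step can be sketched as follows. The per-head messages $\sum_{j} a_{ij}^{k} W^{k} h_{j}$ are taken as already computed. Leaving tokens that belong to no EDU unchanged is an assumption of this sketch, and Eq. (10) requires the final EDU feature to have the same dimension as the token embedding. Function names are illustrative.

```python
def concat_heads(head_messages):
    """Eq. (8): apply ReLU to each head's message and concatenate the results."""
    output = []
    for message in head_messages:
        output.extend(max(0.0, v) for v in message)
    return output


def average_heads(head_messages):
    """Eq. (9): average the K head messages, then apply ReLU."""
    num_heads = len(head_messages)
    dim = len(head_messages[0])
    return [
        max(0.0, sum(message[d] for message in head_messages) / num_heads)
        for d in range(dim)
    ]


def enhance_tokens(token_vectors, edu_features, edu_token_indices):
    """Eq. (10): t_l' = t_l + h_n' for every token index l in s_n."""
    enhanced = [list(t) for t in token_vectors]  # tokens outside every EDU stay as they are
    for n, indices in enumerate(edu_token_indices):
        for l in indices:
            enhanced[l] = [t + h for t, h in zip(token_vectors[l], edu_features[n])]
    return enhanced


# Toy input: two heads, each producing a 2-dimensional message for one EDU.
messages = [[0.5, 0.5], [0.0, -2.0]]
intermediate = concat_heads(messages)   # length K * F'
final = average_heads(messages)         # length F'
tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
fused = enhance_tokens(tokens, [final], [[0, 1]])
```

Concatenation multiplies the feature width by the number of heads, which is why averaging is the natural choice in the last layer when the output has to match the token dimension.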

3.4. Answer Prediction

The answer prediction module is shown in Figure 5. The model is trained end-to-end using a cross-entropy loss function. The enhanced article token embedding is combined with the original embedding through a residual structure and normalized; it then passes through a one-layer linear mapping that keeps the hidden dimension unchanged and is activated with ReLU to obtain the final encoded sequence.

Next, the overall features of the subject and the token-level representations are fused, first by mapping the high-dimensional embedding to a numerical value after a single-layer linear transformation and obtaining its token-level weights after softmax.

After that, the token embeddings of context and question-option are weighted and summed to produce their individual feature vectors, respectively. Finally, the above feature vectors are concatenated with the [CLS] embedding representing the whole subject, and the new vectors are fed into a two-layer perceptron with a GELU activation function to obtain the final output features for classification.
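The pooling and classification steps can be sketched as follows. Several details are assumptions of this example rather than statements from the text: the same pooling weights are shared between the context and question-option segments, the concatenation order is [CLS], context, question-option, and the parameter dictionary keys (`w_pool`, `W1`, `b1`, `W2`, `b2`) are hypothetical names.

```python
import math


def softmax(values):
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(token_vectors, w):
    """Map each token to a scalar, softmax the scalars, and return the weighted sum."""
    scores = [sum(wi * xi for wi, xi in zip(w, t)) for t in token_vectors]
    weights = softmax(scores)
    dim = len(token_vectors[0])
    return [
        sum(weights[n] * token_vectors[n][d] for n in range(len(token_vectors)))
        for d in range(dim)
    ]


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(W, b, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def option_score(cls_vector, context_tokens, question_option_tokens, params):
    """Score one answer option; the predicted answer is the option with the highest score."""
    context = attention_pool(context_tokens, params["w_pool"])
    question_option = attention_pool(question_option_tokens, params["w_pool"])
    features = cls_vector + context + question_option  # list concatenation
    hidden = [gelu(v) for v in linear(params["W1"], params["b1"], features)]
    return linear(params["W2"], params["b2"], hidden)[0]


# Toy parameters: zero pooling weights give uniform pooling over the tokens.
params = {"w_pool": [0.0, 0.0], "W1": [[1.0] * 6], "b1": [0.0], "W2": [[1.0]], "b2": [0.0]}
pooled = attention_pool([[1.0, 3.0], [3.0, 5.0]], params["w_pool"])
score = option_score([0.0, 0.0], [[1.0, 3.0], [3.0, 5.0]], [[-2.0, -4.0]], params)
```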

4. Experiment

4.1. Dataset and Evaluation Metrics

Experiments were first conducted on the ReClor [6] dataset, which consists of 6138 multiple-choice questions, each with a context and four answer options, only one of which is correct. The majority of the questions in the dataset are logical reasoning questions from the Graduate Management Admission Test (GMAT) and the Law School Admission Test (LSAT), accounting for 91.2% of the questions, with the remainder being supplemented by high-quality practice exams. Answering these questions requires complex logical reasoning. The dataset is divided into three parts: a training set of 4638 questions, a validation set of 500 questions, and a testing set of 1000 questions. The questions in the test set are classified as EASY or HARD according to their difficulty: questions whose correct answer can be predicted fairly consistently from the options alone, with the context and question removed, form the EASY set of 440 questions, and the remaining 560 questions form the HARD set. The training and validation sets have labeled answers, while the testing set does not include answers, and the test results are obtained by submitting the predictions to the leaderboard.

Then we also conducted experiments on the LogiQA [5] dataset. Different to ReClor, the problems in LogiQA are generated based on the Nation Civil Servants Examination of China. A total of 8678 samples are divided into three sets at 8:1:1. The training set has 7376 questions, and both the validation and test set have 651 questions. Unlike ReClor, LogiQA has no need to submit predictions; the answer of each question is open. Also, the dataset does not differentiate between the easy set and the hard set.

ReClor is more diverse in the number of logical reasoning types, while LogiQA contains more examples. Both of them are challenging for the task of logical reasoning.

Accuracy, i.e., the percentage of questions whose answer is predicted correctly, is used as the evaluation metric.
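As a small worked example of the metric, the sketch below computes accuracy and shows how an overall ReClor test accuracy relates to the EASY and HARD subset accuracies through their sizes (440 and 1000 − 440 = 560). The function names are illustrative; the two subset accuracies in the last line are the DaGATN values reported in Table 3.

```python
def accuracy(predictions, labels):
    """Percentage of questions whose predicted option equals the gold option."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)

def combine_subsets(acc_easy, acc_hard, n_easy=440, n_hard=560):
    """Overall accuracy as the size-weighted average of the subset accuracies."""
    return (acc_easy * n_easy + acc_hard * n_hard) / (n_easy + n_hard)

toy_acc = accuracy([0, 2, 1, 3], [0, 2, 2, 3])    # 3 of 4 correct -> 75.0
overall = round(combine_subsets(85.22, 65.71), 1)  # Test-E and Test-H -> 74.3
```

The weighted average reproduces the 74.3% overall test accuracy, which is a quick consistency check on the reported subset figures.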

4.2. Experimental Configuration

The experiments were run on a platform with an Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50 GHz processor, 32 GB of RAM, and an NVIDIA GeForce RTX 3090 GPU with 24 GB of video memory. We use DeBERTa-xlarge [39] fine-tuned on the MNLI task as the backbone model, which has 48 hidden layers and a hidden size of 1024, as implemented by Hugging Face [40].

The training hyperparameters are listed in Table 2. The model was trained end-to-end for 10 epochs with a batch size of 1 and a gradient accumulation of 8 steps, using the AdamW optimizer [41] with β1 = 0.9 and β2 = 0.99 and a max sequence length of 200.
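To make the batch configuration concrete, the following sketch works out the effective batch size and the number of optimizer updates implied by these settings on the ReClor training set. It assumes that a trailing partial group of examples still triggers an update at the end of each epoch; the source does not state how the remainder is handled, and the function name is illustrative.

```python
import math

TRAIN_SIZE = 4638        # ReClor training questions
MICRO_BATCH_SIZE = 1     # examples per forward/backward pass
ACCUMULATION_STEPS = 8   # backward passes per optimizer update
EPOCHS = 10

effective_batch_size = MICRO_BATCH_SIZE * ACCUMULATION_STEPS

def count_optimizer_steps(num_examples, accumulation_steps):
    """Walk through one epoch and count how often the optimizer would step."""
    steps = 0
    for i in range(1, num_examples + 1):
        # loss.backward() would accumulate gradients here.
        if i % accumulation_steps == 0 or i == num_examples:
            steps += 1  # optimizer.step() followed by zeroing the gradients
    return steps

updates_per_epoch = count_optimizer_steps(TRAIN_SIZE, ACCUMULATION_STEPS)
total_updates = updates_per_epoch * EPOCHS
```

Under these assumptions each update averages gradients over 8 questions, giving 580 updates per epoch and 5800 in total.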

We tuned the hyperparameters on the validation set. The selected values are a learning rate of 5 × 10⁻⁵ (from {3 × 10⁻⁵, 4 × 10⁻⁵, 5 × 10⁻⁵, 5 × 10⁻⁶, 5 × 10⁻⁷}), a weight decay of 0.01, a seed of 42 (from {3407, 42, 123}), 1 iteration step of graph reasoning (from {1, 2}), and 5 graph attention heads.

4.3. Comparative Experiments

On the ReClor and LogiQA datasets, using the same experimental setup, the DaGATN model was compared with baseline models and with other logical reasoning machine reading comprehension models on the leaderboard, i.e., the graph neural network-based DAGN [18], AdaLoGN [16], Logiformer [34], and LoCSGN [35], the pre-training-based MERIt [4], and the data enhancement-based LReasoner [17]. For a fair comparison with our approach, we selected the results of MERIt using deberta-v2-xlarge and the best results of LReasoner under non-ensemble conditions. The experimental results are shown in Table 3.

The experiments show that on the ReClor dataset, our proposed DaGATN achieves an overall accuracy of 74.3% on the test set and 65.71% on the HARD test subset, which approaches human performance on the HARD subset (67.20%). Compared with other graph neural network-based (AdaLoGN, Logiformer, DAGN, LoCSGN), pre-training-based (MERIt), and data enhancement-based (LReasoner) methods, our model has a higher accuracy on the HARD subset. Moreover, the gap between validation and test accuracy is smaller than for the other methods, which suggests a lower degree of overfitting.

On the LogiQA dataset, DaGATN also demonstrated competitive performance. Its validation accuracy is higher than that of its competitors, and its test accuracy is slightly behind that of MERIt.

Due to the more detailed categorization in the ReClor dataset, we have chosen this dataset for the analysis of experimental results, facilitating the discussion of the strengths and weaknesses of DaGATN.

The detailed results for each logical reasoning type are shown in Table 4. The ReClor dataset covers 17 logical reasoning types, and we selected DeBERTa-xlarge as the baseline model to compare and analyze the performance of our system on different types of questions. We also list the detailed performance of two other kinds of approaches to logical reasoning machine reading comprehension: MERIt, which represents the pre-training perspective and captures the logical relationships of text based on heuristic rules, and LReasoner, which represents the data enhancement perspective and captures the logical symbolic relationships in text and options. These allow us to compare and analyze our strengths and weaknesses.

In terms of overall performance, our model shows significant performance improvements over the baseline model for most types of reasoning tasks, with DaGATN achieving a 4.45% accuracy improvement on conditional classification questions and a 6.46% accuracy improvement on strengthening/weakening argument type questions. The degree of improvement is also higher than that of the logical symbol-based data enhancement method (LReasoner) and the heuristic rule-based pre-training method (MERIt).

This is because, although logical notation is a good aid for highly structured logical reasoning tasks, LReasoner essentially summarizes the logical structure given by the context, matches it with the structure of each answer, and selects the option with the highest similarity. It is therefore not good at generalizing what is sufficient and what is necessary from the existing logical expressions. Similarly, it does not perform well when it is required to identify the flaws of the arguments presented in the passage. In Table 4, LReasoner is less accurate on NA, SA, W, and MF type questions but performs excellently on MS type questions.

The heuristic rule-based approach rests mainly on the assumption that an arbitrarily complex logical structure in a text can be stitched together from a series of A→B→C tuples of logical relations, with the correct option being logically consistent with the context and the wrong ones logically contradictory. However, when the task is to find sufficient conditions for an argument or strong support for it, the presence of necessary conditions or of multiple positive supports of different degrees among the options can confuse MERIt. In Table 4, MERIt is not competitive on NA and S type questions and has lower accuracy on MSS type questions, whereas its excellent performance on ER and MS type questions is consistent with this explanation.

Since our approach introduces prior knowledge about logical relationships into the model, DaGATN achieves better results than models based on pre-training and data enhancement. Due to the introduction of graph attention (GAT) that allows different weight assignments to each node’s neighboring nodes, thus better capturing local and global information in processing discourse graph data, DaGATN also outperforms AdaLoGN, Logiformer, and LoCSGN.

4.4. Ablation Experiments

To verify the contributions of the discourse graph attention network module and the positional encoding module in this paper, we conducted a series of ablation experiments on the ReClor and LogiQA datasets. In these experiments, the baseline DeBERTa-xlarge model is trained with a batch size of 1, a gradient accumulation of 8, and a max sequence length of 200. To assess the graph attention network, two computation methods, convolution and attention, are implemented in the graph neural network, with the graph iteration step set to 1. In the convolution variant, each node is updated with the average value of its one-hop neighbors. The results of each configuration are shown in Table 5.
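The difference between the two node-update rules compared in this ablation can be sketched as follows. The mean rule follows the description above; the attention rule uses the standard single-head GAT formulation (a shared linear map, a LeakyReLU-scored pair of nodes, and a softmax over neighbors), which is assumed here to correspond to the model's graph attention layer. Function names, the self-loops, the LeakyReLU slope of 0.2, and the toy sizes are illustrative assumptions.

```python
import numpy as np

def mean_aggregate(H, adj):
    """Convolution-style update: average of each node's one-hop neighbours."""
    degree = adj.sum(axis=1, keepdims=True)
    return (adj @ H) / np.maximum(degree, 1)

def attention_aggregate(H, adj, W, a):
    """Single-head GAT-style update with a learned weight per neighbour.
    adj must give every node at least one neighbour (e.g. a self-loop)."""
    Z = H @ W                                    # shared linear transformation
    d_out = Z.shape[1]
    # e[i, j] = LeakyReLU(a^T [z_i || z_j]), computed via the two halves of a.
    e = (Z @ a[:d_out])[:, None] + (Z @ a[d_out:])[None, :]
    e = np.where(e > 0, e, 0.2 * e)
    e = np.where(adj > 0, e, -np.inf)            # only attend along edges
    e = e - e.max(axis=1, keepdims=True)         # stabilise the softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ Z, alpha

# Toy chain graph 0-1-2-3 with 4-dimensional node features.
chain = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
H = np.eye(4)
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
a = rng.normal(size=6)

H_mean = mean_aggregate(H, chain)
H_att, alpha = attention_aggregate(H, chain + np.eye(4, dtype=int), W, a)
```

With the mean rule every neighbor contributes equally, whereas `alpha` gives each edge its own normalized weight, which is the property the ablation attributes the improved generalization to.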

In Table 5, we can see that with the addition of the discourse graph module, the accuracy of the DeBERTa model on the validation set improves by 3.4% compared to the baseline. However, the accuracy on the test set improves by only 1.19%, so the gain on the test set is much smaller than on the validation set. The accuracy improvements on the easy and hard subsets are 0.39% and 1.85%, respectively, suggesting a certain degree of overfitting. After switching the method of graph computation from convolution to attention, the accuracy on the validation set improves by 1.6% compared to the baseline, while the accuracy on the test set improves by 2.4%. The accuracy improves by 0.68% on the easy subset and by 3.75% on the hard subset. This demonstrates that node information can be adaptively propagated to neighboring nodes with different weight assignments, which better captures local and global information, alleviates overfitting, and improves the generalization ability of the model. The addition of positional encoding contributes further performance improvements for both graph computation methods. It improves the validation set accuracy by 0.3% and the test set accuracy by 0.45% when using graph convolution, and it improves the validation set accuracy by 0.4% and the test set accuracy by 0.9% when using the graph attention mechanism. This demonstrates that the complementary position information for the nodes of the discourse graph compensates for the ambiguity of the graph structure with respect to position relations and enhances the ability of the discourse graph to express logical relations.
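The baseline-relative gains quoted for the two graph variants without positional encoding can be recomputed from the ReClor columns of Table 5 (the ablation table). The sketch below only restates that arithmetic; the dictionary layout and key names are illustrative.

```python
# ReClor accuracies (%) copied from Table 5.
table5_reclor = {
    "DeBERTa-xlarge":     {"val": 72.60, "test": 71.00, "test_e": 83.41, "test_h": 61.25},
    "+DGCN w/o PosEmbed": {"val": 76.00, "test": 72.19, "test_e": 83.80, "test_h": 63.10},
    "+DGAT w/o PosEmbed": {"val": 74.20, "test": 73.40, "test_e": 84.09, "test_h": 65.00},
}

def gains(method, baseline="DeBERTa-xlarge"):
    """Accuracy difference (percentage points) of a method over the baseline."""
    base = table5_reclor[baseline]
    return {split: round(acc - base[split], 2)
            for split, acc in table5_reclor[method].items()}

gcn_gains = gains("+DGCN w/o PosEmbed")  # val 3.4, test 1.19, easy 0.39, hard 1.85
gat_gains = gains("+DGAT w/o PosEmbed")  # val 1.6, test 2.4, easy 0.68, hard 3.75
```

The convolution variant gains more on the validation set than on the test set, while the attention variant shows the opposite pattern, which is the basis for the overfitting argument.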

4.5. Case Studies

To better analyze the advantages and disadvantages of DaGATN in the logical reasoning MRC task, we selected two cases from the validation set for the case study.

Case 1 comes from MF type logical reasoning, and the reasoning process of DaGATN is shown in Figure 6:

As can be seen, for the context and candidate answers, DaGATN's modeling of the reasoning process closely follows the human thinking pattern when comparing one answer with another. The flawed pattern of reasoning in the context is that the protagonist claims that his opinion is confirmed, while the confirmation comes from a group that has nothing to do with that opinion. First, DaGATN extracts 6 EDUs based on relation connectives and punctuation marks for each branch, and five pairs of discourse relationships are detected among them. For the argument branch, there are two referential relations (EDU1–EDU2, EDU4–EDU5), and for the punctuation branch there are three implicit relations (EDU2–EDU3, EDU3–EDU4, EDU5–EDU6). The topology of the two graphs provides an explicit understanding of the text. Additionally, we display the attention maps from the final layer of the graph attention network for both branches, as well as the overall attention map after merging according to the strategy (blue for the argument branch, red for the punctuation branch, and purple for the merged map). The values in the attention matrices are mapped to the range [0, 1] for a clearer illustration. The darker the color, the stronger the correlation between two logical units. These weighted attention maps showcase the relationships and offer a broader perspective for interpretability. Through the argument graph, the model understands that this pattern of flaws requires the protagonist to claim that a point of view is confirmed. Afterwards, through the punctuation graph, the model understands that the view is shared by practitioners in unrelated domains. Finally, the model selects the correct option D based on this feature.

Although the overall performance of DaGATN is better than that of other models, there is still potential for improvement on some specific types of logical reasoning problems. After analyzing the test results, we found that the model is not competitive on MS and ER type problems because it does not model the logical symbolic aspect. Our model also performs poorly on IF type problems, which require an understanding of the more abstract conceptual deficiencies of flawed arguments.
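The two discourse graphs of Case 1 can be written down directly from the relations listed above. The sketch below builds their adjacency matrices and applies a min-max rescaling to [0, 1]. Treating the edges as undirected, taking the element-wise union as the merged graph, and using min-max scaling for the attention maps are assumptions made for illustration; the source does not specify these details in this section.

```python
import numpy as np

NUM_EDUS = 6
argument_edges = [(1, 2), (4, 5)]             # referential relations
punctuation_edges = [(2, 3), (3, 4), (5, 6)]  # implicit relations

def adjacency(edges, n=NUM_EDUS):
    """Symmetric 0/1 adjacency matrix from 1-indexed EDU pairs."""
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i - 1, j - 1] = A[j - 1, i - 1] = 1
    return A

def min_max(M):
    """Rescale a matrix linearly so its entries lie in [0, 1]."""
    lo, hi = M.min(), M.max()
    if hi == lo:
        return np.zeros_like(M, dtype=float)
    return (M - lo) / (hi - lo)

A_arg = adjacency(argument_edges)
A_punc = adjacency(punctuation_edges)
A_merged = np.maximum(A_arg, A_punc)  # union of both branches

# Hypothetical raw attention scores, only to show the rescaling.
raw_attention = np.array([[0.2, 1.4], [-0.6, 0.4]])
scaled_attention = min_max(raw_attention)
```

Taken together, the five relations link EDU1 through EDU6 into a single chain, so the merged graph connects every unit of the passage even though neither branch does so on its own.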

Case 2 comes from MF type logical reasoning, and the reasoning process of DaGATN is shown in Figure 7:

As can be seen, based on the context, DaGATN disentangles two reasoning processes for the flawed viewpoint, the first focusing on “All sides of all Stories”, and the second focusing on “All sides of important stories”. The model’s modeling of the reasoning processes led it to focus more on the explicit differences between the two reasoning outcomes and to select option C based on this feature, ignoring the fact that it was viewed as a whole and categorized as a type of error. For this example, option A is the correct answer.

5. Conclusions

In this paper, we propose DaGATN, a high-performance multiple-choice machine reading comprehension method for logical reasoning. It addresses the unsatisfactory performance of existing pre-trained language models on complex logical reasoning problems in downstream tasks, as well as the tendency of graph-based fine-tuning approaches to overfit. We improve the existing approach in two ways. First, for discourse node units, the introduction of positional encoding enables graph neural networks to improve logical relation representation using positional information. Second, a graph attention mechanism is used to update node information; it handles local and global information in graph data better, addresses the long-distance dependency problem among nodes, and adaptively assigns different weights to each node's neighbors to better capture logical information in the context and options.

DaGATN can offer an effective solution for current chatbots and intelligent voice assistants in understanding user commands. Moreover, due to its performance in challenging test sets, showing human-like logical reasoning capabilities, it can provide valuable references for the difficulty level of test questions.

Our series of experiments on the ReClor and LogiQA datasets demonstrates the promising performance of DaGATN, but there is still room for improvement. In the future, we would like to reduce the overall number of training parameters and the computational resources used by DaGATN. In addition, because it lacks modeling of the logical symbolic aspect and contrastive learning over positive and negative examples, DaGATN does not perform well on abstract conceptual flaws and structural matching problems. We will try to supplement the DaGATN structure with logical-symbolic reasoning capabilities to enhance its understanding of highly structured logical reasoning questions. We will also try to enhance DaGATN's ability to comprehend more abstract, concept-like questions using reinforcement learning from human feedback.

Author Contributions

Conceptualization, M.W. and T.S.; data curation, T.S.; formal analysis, J.D.; investigation, T.S. and Z.W.; methodology, M.W. and T.S.; project administration, T.S.; resources, M.W.; software, T.S.; supervision, T.S. and Z.W.; validation, M.W.; visualization, J.D.; writing—original draft, T.S.; writing—review and editing, M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (61972003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All relevant data are within the paper. You can find our test results at https://eval.ai/web/challenges/challenge-page/503/leaderboard/1347 (accessed on 5 June 2023).

Acknowledgments

We would like to thank the anonymous reviewers and referees for their helpful comments, which helped improve this paper considerably.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Detailed table regarding Explicit Connectives and Implicit Connectives.

Explicit Connectives

“once”, “although”, “though”, “but”, “because”,
“nevertheless”, “before”, “for example”, “until”, “if”,
“previously”, “when”, “and”, “so”, “then”, “while”,
“as long as”, “however”, “also”, “after”, “separately”,
“still”, “so that”, “or”, “moreover”, “in addition”,
“instead”, “on the other hand”, “as”, “for instance”,
“nonetheless”, “unless”, “meanwhile”, “yet”, “since”,
“rather”, “in fact”, “indeed”, “later”, “ultimately”,
“as a result”, “either or”, “therefore”, “in turn”,
“thus”, “in particular”, “further”, “afterward”,
“next”, “similarly”, “besides”, “if and when”,
“nor”, “alternatively”, “whereas”, “overall”, “by comparison”,
“till”, “in contrast”, “finally”, “otherwise”, “as if”,
“thereby”, “now that”, “before and after”, “additionally”,
“meantime”, “by contrast”, “if then”, “likewise”,
“in the end”, “regardless”, “thereafter”, “earlier”,
“in other words”, “as soon as”, “except”, “in short”,
“neither nor”, “furthermore”, “lest”, “as though”,
“specifically”, “conversely”, “consequently”, “as well”,
“much as”, “plus”, “And”, “hence”, “by then”, “accordingly”,
“on the contrary”, “simultaneously”, “for”, “in sum”,
“when and if”, “insofar as”, “else”,
“as an alternative”, “on the one hand on the other hand”.

Implicit Connectives
(Punctuation Marks)

“ . ” “ , ” “ ; ” “ : ”
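To illustrate how connective and punctuation lists like those in Table A1 can delimit elementary discourse units, the sketch below splits a sentence on a small subset of the explicit connectives and on the four punctuation marks. It is a simplified illustration rather than the paper's segmentation procedure: the function name, the chosen subset, the case-insensitive matching, and the handling of both delimiter kinds in a single pass are assumptions.

```python
import re

# A small subset of the explicit connectives from Table A1.
EXPLICIT_CONNECTIVES = ["because", "but", "if", "although", "therefore", "so that", "unless"]
PUNCTUATION_MARKS = [".", ",", ";", ":"]

def split_edus(text, connectives=EXPLICIT_CONNECTIVES, punctuation=PUNCTUATION_MARKS):
    """Split text into candidate EDUs and return them with the delimiters found."""
    # Longer connectives first so multi-word phrases are matched whole.
    ordered = sorted(connectives, key=len, reverse=True)
    conn_pat = r"\b(?:" + "|".join(re.escape(c) for c in ordered) + r")\b"
    punct_pat = "[" + "".join(re.escape(p) for p in punctuation) + "]"
    pattern = re.compile(f"({conn_pat}|{punct_pat})", flags=re.IGNORECASE)

    edus, delimiters = [], []
    # With one capturing group, re.split alternates text and delimiter pieces.
    for index, piece in enumerate(pattern.split(text)):
        piece = piece.strip()
        if not piece:
            continue
        (delimiters if index % 2 == 1 else edus).append(piece)
    return edus, delimiters

edus, delimiters = split_edus(
    "The plan failed because funding was cut, but the team continued."
)
# edus       -> ['The plan failed', 'funding was cut', 'the team continued']
# delimiters -> ['because', ',', 'but', '.']
```

In a discourse graph, the returned units would become nodes, and each delimiter would determine the relation type of the edge between the units it separates.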

References

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv 2016, arXiv:1606.05250. Available online: http://arxiv.org/pdf/1606.05250v3 (accessed on 15 September 2021).
Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.W.; Salakhutdinov, R.; Manning, C.D. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. arXiv 2018, arXiv:1809.09600. Available online: http://arxiv.org/pdf/1809.09600v1 (accessed on 16 September 2021).
Dua, D.; Wang, Y.; Dasigi, P.; Stanovsky, G.; Singh, S.; Gardner, M. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. arXiv 2019, arXiv:1903.00161. Available online: http://arxiv.org/pdf/1903.00161v2 (accessed on 17 September 2021).
Jiao, F.; Guo, Y.; Song, X.; Nie, L. MERIt: Meta-Path Guided Contrastive Learning for Logical Reasoning. arXiv 2022, arXiv:2203.00357. Available online: http://arxiv.org/pdf/2203.00357v1 (accessed on 14 February 2023).
Liu, J.; Cui, L.; Liu, H.; Huang, D.; Wang, Y.; Zhang, Y. LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning. arXiv 2020, arXiv:2007.08124. Available online: http://arxiv.org/pdf/2007.08124v1 (accessed on 22 September 2021).
Yu, W.; Jiang, Z.; Dong, Y.; Feng, J. ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning. arXiv 2020, arXiv:2002.04326. Available online: http://arxiv.org/pdf/2002.04326v3 (accessed on 25 September 2021).
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. Available online: http://arxiv.org/pdf/1810.04805v2 (accessed on 4 September 2021).
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. Available online: http://arxiv.org/pdf/1706.03762v5 (accessed on 7 September 2021).
Zhang, J.; Liu, L.; Gao, K.; Hu, D. Few-shot Class-incremental Pill Recognition. arXiv 2023, arXiv:2304.11959. Available online: http://arxiv.org/pdf/2304.11959v1 (accessed on 17 April 2023).
Cui, L.; Wu, Y.; Liu, J.; Yang, S.; Zhang, Y. Template-Based Named Entity Recognition Using BART. arXiv 2021, arXiv:2106.01760. Available online: http://arxiv.org/pdf/2106.01760v1 (accessed on 23 June 2022).
Qin, Y.; Lin, Y.; Takanobu, R.; Liu, Z.; Li, P.; Ji, H.; Huang, M.; Sun, M.; Zhou, J. ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning. arXiv 2020, arXiv:2012.15022. Available online: http://arxiv.org/pdf/2012.15022v2 (accessed on 11 June 2022).
Chen, G.; Ma, S.; Chen, Y.; Zhang, D.; Pan, J.; Wang, W.; Wei, F. Towards Making the Most of Cross-Lingual Transfer for Zero-Shot Neural Machine Translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 142–157.
Adelani, D.I.; Alabi, J.O.; Fan, A.; Kreutzer, J.; Shen, X.; Reid, M.; Ruiter, D.; Klakow, D.; Nabende, P.; Chang, E.; et al. A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation. arXiv 2022, arXiv:2205.02022. Available online: http://arxiv.org/pdf/2205.02022v2 (accessed on 6 June 2023).
Sun, Y.; Guo, D.; Tang, D.; Duan, N.; Yan, Z.; Feng, X.; Qin, B. Knowledge Based Machine Reading Comprehension. arXiv 2018, arXiv:1809.04267. Available online: http://arxiv.org/pdf/1809.04267v1 (accessed on 20 May 2022).
Tan, C.; Wei, F.; Zhou, Q.; Yang, N.; Lv, W.; Zhou, M. I Know There Is No Answer: Modeling Answer Validation for Machine Reading Comprehension. In Natural Language Processing and Chinese Computing; Zhang, M., Ng, V., Zhao, D., Li, S., Zan, H., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 85–97. ISBN 978-3-319-99494-9.
Li, X.; Cheng, G.; Chen, Z.; Sun, Y.; Qu, Y. AdaLoGN: Adaptive Logic Graph Network for Reasoning-Based Machine Reading Comprehension. arXiv 2022, arXiv:2203.08992. Available online: http://arxiv.org/pdf/2203.08992v1 (accessed on 4 November 2022).
Wang, S.; Zhong, W.; Tang, D.; Wei, Z.; Fan, Z.; Jiang, D.; Zhou, M.; Duan, N. Logic-Driven Context Extension and Data Augmentation for Logical Reasoning of Text. arXiv 2021, arXiv:2105.03659. Available online: http://arxiv.org/pdf/2105.03659v1 (accessed on 19 October 2022).
Huang, Y.; Fang, M.; Cao, Y.; Wang, L.; Liang, X. DAGN: Discourse-Aware Graph Network for Logical Reasoning. arXiv 2021, arXiv:2103.14349. Available online: http://arxiv.org/pdf/2103.14349v2 (accessed on 13 November 2022).
Ouyang, S.; Zhang, Z.; Zhao, H. Fact-driven Logical Reasoning for Machine Reading Comprehension. arXiv 2021, arXiv:2105.10334. Available online: http://arxiv.org/pdf/2105.10334v2 (accessed on 22 November 2022).
Gao, Y.; Wu, C.-S.; Li, J.; Joty, S.; Hoi, S.C.H.; Xiong, C.; King, I.; Lyu, M.R. Discern: Discourse-Aware Entailment Reasoning Network for Conversational Machine Reading. arXiv 2020, arXiv:2010.01838. Available online: http://arxiv.org/pdf/2010.01838v3 (accessed on 29 November 2022).
Yan, H.; Deng, B.; Li, X.; Qiu, X. TENER: Adapting Transformer Encoder for Named Entity Recognition. arXiv 2019, arXiv:1911.04474. Available online: http://arxiv.org/pdf/1911.04474v3 (accessed on 7 September 2021).
Cho, K.; van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. arXiv 2014, arXiv:1409.1259. Available online: http://arxiv.org/pdf/1409.1259v2 (accessed on 7 September 2021).
Wu, M.; Wang, Z.; Duan, J. Machine Reading Comprehension Based on Deberta and Discourse Graph Neural Networks. Comput. Appl. Softw. 2023; accepted.
Prasad, R.; Dinesh, N.; Lee, A.; Miltsakaki, E.; Robaldo, L.; Joshi, A.; Webber, B. The Penn Discourse TreeBank 2.0. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, Marrakech, Morocco, 26 May–1 June 2008.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. Available online: http://arxiv.org/pdf/1907.11692v1 (accessed on 23 September 2021).
Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv 2019, arXiv:1906.08237. Available online: http://arxiv.org/pdf/1906.08237v2 (accessed on 17 October 2021).
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. Available online: http://arxiv.org/pdf/2005.14165v4 (accessed on 23 March 2021).
Sugawara, S.; Aizawa, A. An Analysis of Prerequisite Skills for Reading Comprehension. In Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods, Austin, TX, USA, 5 November 2016; Louis, A., Roth, M., Webber, B., White, M., Zettlemoyer, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 1–5.
Richardson, M. MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), Seattle, WA, USA, 18–21 October 2013.
MacCartney, B.; Manning, C.D. Natural Logic for Textual Inference. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, Prague, Czech Republic, 28–29 June 2007; pp. 193–200.
MacCartney, B.; Grenager, T.; de Marneffe, M.-C.; Cer, D.; Manning, C.D. Learning to recognize features of valid textual entailments. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, New York, NY, USA, 4–9 June 2006; pp. 41–48.
Li, T.; Srikumar, V. Augmenting Neural Networks with First-order Logic. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Korhonen, A., Traum, D., Màrquez, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 292–302.
Pi, X.; Zhong, W.; Gao, Y.; Duan, N.; Lou, J.-G. LogiGAN: Learning Logical Reasoning via Adversarial Pre-training. arXiv 2022, arXiv:2205.08794. Available online: http://arxiv.org/pdf/2205.08794v2 (accessed on 2 March 2023).
Xu, F.; Liu, J.; Lin, Q.; Pan, Y.; Zhang, L. Logiformer: A Two-Branch Graph Transformer Network for Interpretable Logical Reasoning. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; Volume 35, pp. 1055–1065.
Zhao, X.; Zhang, T.; Lu, Y.; Liu, G. LoCSGN: Logic-Contrast Semantic Graph Network for Machine Reading Comprehension. In Natural Language Processing and Chinese Computing; Lu, W., Huang, S., Hong, Y., Zhou, X., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 405–417. ISBN 978-3-031-17119-2.
Ying, C.; Cai, T.; Luo, S.; Zheng, S.; Ke, G.; He, D.; Shen, Y.; Liu, T.-Y. Do Transformers Really Perform Bad for Graph Representation? arXiv 2021, arXiv:2106.05234.
Pope, P.E.; Kolouri, S.; Rostami, M.; Martin, C.E.; Hoffmann, H. Explainability Methods for Graph Convolutional Neural Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10764–10773. ISBN 978-1-7281-3293-8.
Lee, J.B.; Rossi, R.A.; Kim, S.; Ahmed, N.K.; Koh, E. Attention Models in Graphs. ACM Trans. Knowl. Discov. Data 2019, 13, 62.
He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv 2020, arXiv:2006.03654.
Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv 2019, arXiv:1910.03771. Available online: http://arxiv.org/pdf/1910.03771v5 (accessed on 27 March 2022).
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. Available online: http://arxiv.org/pdf/1711.05101v3 (accessed on 17 October 2021).

Figure 1. Overview Structure of DaGATN.

Figure 2. Example of two sentences with the same neighbor node but with opposite semantics.

Figure 3. Visualization results of Position Encoding.

Figure 4. Illustration of calculating the attention coefficient.

Figure 5. Illustration of answer prediction module.

Figure 6. Illustration of reasoning process of case 1.

Figure 7. Illustration of reasoning process of case 2.

Table 1. Summary of related work.

Rule-Based	Data Enhancement	Pre-Training Based	GNN Based
NatLog [30], Stanford RTE [31]	LReasoner [17], L-Datt [32]	MERIt [4], LogiGAN [33]	DAGN [18], AdaLoGN [16], Logiformer [34], LoCSGN [35]

Table 2. The training hyperparameters.

Parameters	Value
Dropout	0.1
Learning rate	5 × 10⁻⁵
Optimizer	AdamW
Batch size	1
Epoch	10
Weight decay	0.01
Max sequence length	200
Graph attention head	5
Graph reasoning iteration step	1
Gradient accumulation step	8

Table 3. Overview of the results from the comparative experiment.

Method	ReClor Val (%)	ReClor Test (%)	ReClor Test-E (%)	ReClor Test-H (%)	LogiQA Val (%)	LogiQA Test (%)
Human Performance *	-	63.00	57.10	67.20	-	86.00
ChatGPT (zero-shot) *	-	60.90	64.55	58.04	-	38.44
DeBERTa-xlarge *	72.60	71.00	83.41	61.25	44.40	41.50
AdaLoGN	65.80	60.20	79.32	45.18	39.94	40.71
Logiformer	68.40	63.50	79.09	51.25	42.24	42.55
DAGN (deberta-large)	72.40	66.80	79.80	56.58	42.40	41.70
LoCSGN	78.60	73.20	84.77	64.11	43.70	43.20
MERIt	78.00	73.10	85.23	63.57	43.90	45.30
LReasoner	76.40	70.70	81.10	62.50	41.60	41.20
DaGATN	74.60	74.30	85.22	65.71	44.71	44.50

* Baseline. The human performance results are taken from Yu et al. (2020) [6], Liu et al. (2020) [5], and the leaderboard.

Table 4. Detailed results on different logical reasoning types. Base is the DeBERTa-xlarge model. ↑, ↓ and –, respectively, mean that our performance is better, worse than, or equal to the baseline model. Reasoning Type is represented by acronyms. NA: Necessary Assumptions, SA: Sufficient Assumptions, S: Strengthen, W: Weaken, E: Evaluation, I: Implication, CMP: Conclusion/Main Point, MSS: Most Strongly Supported, ER: Explain or Resolve, P: Principle, D: Dispute, T: Technique, R: Role, IF: Identify a Flaw, MF: Match Flaws, MS: Match the Structure, O: Others. Percentages of different reasoning types are in parentheses.

Reasoning Type	Base (%)	DaGATN (%)	LReasoner (%)	MERIt (%)
Test ¹	71.00	74.30	70.07	73.10
NA (11.0%)	77.19	84.21 (↑)	76.30	79.82
SA (3.6%)	73.33	70.00 (↓)	70.00	70.00
S (9.0%)	68.09	78.72 (↑)	70.20	71.28
W (10.6%)	63.72	66.37 (↑)	59.30	67.26
E (1.6%)	67.92	84.61 (↑)	69.20	69.23
I (6.2%)	54.35	56.52 (↑)	54.30	56.52
CMP (3.1%)	88.89	86.11 (↓)	77.80	83.33
MSS (6.7%)	71.43	78.57 (↑)	71.40	66.07
ER (8.0%)	66.67	72.61 (↑)	67.90	78.57
P (5.7%)	72.31	78.46 (↑)	76.90	72.31
D (2.5%)	70.00	70.00 (–)	80.00	63.33
T (3.8%)	86.11	86.11 (–)	80.60	83.33
R (3.7%)	68.75	65.62 (↓)	68.80	62.50
IF (11.3%)	76.07	70.94 (↓)	71.80	76.07
MF (4.9%)	58.06	74.19 (↑)	61.30	70.97
MS (2.7%)	76.67	83.33 (↑)	86.70	86.67
O (5.5%)	68.49	67.12 (↓)	72.60	75.34

¹ The overall accuracy of the test set.

Table 5. Overview of the results from the ablation experiments.

Method	ReClor Val (%)	ReClor Test (%)	ReClor Test-E (%)	ReClor Test-H (%)	LogiQA Val (%)	LogiQA Test (%)
DeBERTa-xlarge	72.60	71.00	83.41	61.25	44.40	41.50
+DGCN w/o PosEmbed	76.00	72.19	83.80	63.10	43.62	42.71
+DGAT w/o PosEmbed	74.20	73.40	84.09	65.00	44.90	44.25
+DGCN, PosEmbed	76.30	72.54	83.61	61.90	43.40	43.20
DaGATN	74.60	74.30	85.22	65.71	44.71	44.50

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

