Article

Retrieving Chinese Questions and Answers Based on Deep-Learning Algorithm

1 Beijing Modern Manufacturing Industry Development Research Base, College of Economics and Management, Beijing University of Technology, Beijing 100124, China
2 Information Technology Department, Xiaomi Inc., Beijing 100085, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(18), 3843; https://doi.org/10.3390/math11183843
Submission received: 10 August 2023 / Revised: 29 August 2023 / Accepted: 3 September 2023 / Published: 7 September 2023

Abstract

Chinese open-domain reading comprehension question answering is a task in the field of natural language processing. Traditional neural network-based methods lack interpretability in answer reasoning when addressing open-domain reading comprehension questions. Grounded in the dual-process theory of cognitive science, in which System One reads the question and System Two performs the reasoning, this research proposes a novel Chinese open-domain question-answering retrieval algorithm. Experiments on the publicly available WebQA dataset compare the proposed approach against other reading comprehension methods; its F1-score reaches 78.66%, confirming its effectiveness. A reading comprehension question-answering model based on cognitive graphs can therefore effectively address Chinese reading comprehension questions.

1. Introduction

1.1. The Core Concept of Cognitive Graph

The cognitive graph is inspired by the dual-process theory of human cognitive processes [1]. This theory regards human reading comprehension as comprising two distinct cognitive processes: “quickly focusing attention on relevant entities” and “analyzing sentence semantics for inference”. In cognitive science, the well-known “dual-process theory” posits that human cognition operates through two systems. System One functions as an intuitive, unconscious thinking system, relying on experiences and associations. System Two, in contrast, represents the unique logical reasoning ability of humans, using knowledge stored in working memory to perform slower but reliable logical reasoning; it is explicit, requires conscious control, and represents the manifestation of human higher intelligence. The cognitive graph leverages System One to query states and construct the graph through relevant entity recognition models. Subsequently, System Two learns hidden representations of contextual information on graph nodes and performs interpretable relationship reasoning.
The essence of the cognitive graph lies in minimizing information loss during graph construction while retaining the graph structure for interpretable relationship reasoning. Simultaneously, it transfers the burden of information processing to retrieval and natural language understanding algorithms.

1.2. Basic Concept of Chinese Reading Comprehension Question Answering

With the development of information technology, available data resources have grown explosively, and users need powerful retrieval tools to find the desired information in large datasets. Various data retrieval systems, represented by search engines, have had a significant impact and provided great convenience to users. However, they also exhibit several drawbacks: such systems return a ranked list of document links, and users must browse through them themselves to locate genuinely useful information and find answers. Consequently, the quality of the query terms constructed by users profoundly affects the efficiency and performance of the retrieval system.
Open-domain question answering and reading comprehension [2], also known as OpenQA, aims to provide accurate answers to natural language questions without being limited to a specific domain. Its distinctive feature is that users can express their queries in natural language, and the system automatically retrieves precise answers from various data resources. The scope of user questions is not confined to a specific application or domain. In contrast to traditional machine reading comprehension (MRC) tasks [3], where, given a question, the system provides answers from a single passage or document, OpenQA requires searching for the answer within a collection of documents or the entire web.
Early open-domain question-answering systems utilized non-parametric models, such as TF-IDF or BM25 [4], to retrieve answers from a fixed set of documents, with the answer span extracted using neural reading comprehension models [5]. These methods performed well on single-hop questions, where they could answer questions based on individual paragraphs. However, they often struggled to retrieve the evidence required for answering multi-hop questions. Multi-hop question answering typically involves finding multiple supporting paragraphs, where one supporting paragraph might have little lexical overlap or semantic relationship with the original question. Subsequent open-domain QA approaches employed end-to-end models to jointly retrieve and read documents. These methods trained and unified various modules in the neural network to retrieve answers from given documents [6]. However, these approaches compressed the necessary information into the embedding space, so they failed to capture the lexical or terminological semantics of entities. Consequently, challenges persisted in entity-centered question-answering tasks [7], and these model-based methods still faced issues concerning answer interpretability.
In recent years, with the development of knowledge graphs [8], some methods have attempted to utilize existing facts or relationships within the knowledge graph to infer new relationships and obtain answers, thereby addressing interpretability concerns. Currently, knowledge graph reasoning can be broadly divided into two categories: methods based on logical symbols [9,10] (ontology axioms or symbolic rules) and methods based on representation learning [11]. While traditional methods based on logical symbols offer interpretability, they struggle to handle implicit and uncertain knowledge. On the other hand, representation learning-based methods can capture implicit knowledge, significantly improving reasoning efficiency, making them the mainstream technique for knowledge graph reasoning. However, when using knowledge graphs for knowledge base question answering (KBQA) [12], it is often assumed that there are enough triple instances of entities or relationships in the existing knowledge graph to train vector representations. In open-domain question answering, the existing knowledge graph may not contain entities or relationships present in the questions, leading to a lack of corresponding training instances. This presents a challenge for knowledge graph reading comprehension question-answering methods.
Chinese reading comprehension question answering aims to find answers to questions from a large collection of documents. Among current methods, end-to-end approaches can achieve satisfactory results but cannot reason over answer paths. On the other hand, knowledge-graph-based question answering with reasoning paths requires a knowledge graph that contains such paths, which existing large-scale open-domain knowledge graphs do not provide for assisted reasoning.
As described in Section 1.1, the cognitive graph is inspired by human cognitive processes [7], which separate question answering into two distinct thinking processes: “quickly focusing attention on relevant entity information” (System 1, an intuition-based, unconscious system relying on experience and associations) and “analyzing sentence semantics for inference” (System 2, the explicit, consciously controlled logical reasoning ability that draws on working memory to perform slow but reliable inference).
In response to the limitations of previous Chinese reading comprehension methods, this paper proposes a cognitive graph-based Chinese reading comprehension retrieval approach. This approach utilizes Wikipedia as a source to find evidence documents as reasoning paths for answering complex questions. Subsequently, existing reading comprehension models are employed to answer questions given the identified reasoning paths. The strong interaction between the retrieval of reasoning paths and the reading of answers within these paths enables robust pipeline processing. Figure 1 provides an overview of the proposed cognitive graph-based reading comprehension retrieval model (QARCG) in this paper.
In the experimental section, we selected MemN2N [13], LSTM [14], DrQA [15], BIDAF [16], R-net [17], Bert [18], SRQA [19], and the Attentive and Impatient Reader [20] as the comparison methods. We conducted experiments using Baidu’s open-source Chinese reading comprehension question-answering dataset, WebQA [21]. The evaluation was performed against both single-fact-answer and complete-answer methods, and the model’s effectiveness was validated using strict matching and fuzzy matching approaches. The results analysis indicates that the proposed model outperforms existing methods, achieving better performance in terms of accuracy and efficiency.

1.3. Triple Extraction

Extracting relationship triples from unstructured natural language texts is an extensively studied topic in the field of information extraction [22], and it constitutes a fundamental basis for various artificial intelligence applications, including information retrieval, intelligent question answering, and dialog systems. The key components of knowledge graphs are factual relationships, with a significant portion represented by relationship triples. A triple is composed of two entities connected by a semantic relationship; these facts take the form of (subject, relationship, object) or (s, r, o), commonly referred to as relationship triples. Extracting relationship triples from natural language texts is a crucial step in constructing cognitive graphs for retrieval, which is a focal point of this study.
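To make the (s, r, o) form concrete, the short Python sketch below shows one way such triples could be represented before their subjects and objects are used as retrieval keywords; the class name and example sentence are ours, not from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationTriple:
    """A fact of the form (subject, relationship, object), i.e., (s, r, o)."""
    subject: str
    relation: str
    obj: str

# Hypothetical example: a triple extracted from a short Chinese sentence.
sentence = "鲁迅的代表作是《呐喊》"
triples = [RelationTriple(subject="鲁迅", relation="代表作", obj="呐喊")]

# The subject and object later serve as retrieval keywords for the next hop.
keywords = {t.subject for t in triples} | {t.obj for t in triples}
print(keywords)
```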
Early efforts in relationship triple extraction processed this task in a pipeline manner [22,23,24,25]. They extracted relationship triples through two distinct steps: first performing named entity recognition on the text to identify all entities, and then classifying the relationships between the identified entities. While this segregated framework simplifies task handling, it overlooks the interdependency between the two subtasks; errors in entity recognition may impact relationship classification, often leading to issues such as error propagation.
The cascade binary tagging framework [22] effectively addresses the challenge of overlapping triples. The triple extraction method employed in this study is derived from that work and is implemented using the bert4keras framework.

1.4. Main Contribution

This paper’s key contribution is the introduction of the QARCG model, a cognitive graph-based approach for enhancing reading comprehension. The model seamlessly merges retrieval and reasoning systems, facilitating efficient inference and interpretability in open-domain question answering. The QARCG model partitions the open-domain reading comprehension task into two systems: retrieval and reasoning, utilizing a cognitive graph representation to ensure effective information interaction.
In the retrieval system, the model employs triplet extraction to establish reasoning pathways, enhancing coherence between paragraphs and enabling the flow of information. The reasoning system employs an RNN to capture dynamic interactions among paragraphs, enabling reordering and scoring of reasoning pathways to generate answers. The model’s efficacy is substantiated through WebQA dataset evaluations, demonstrating superior performance in entity-level and comprehensive answers. Moreover, the paper conducts experiments and analyses, shedding light on component functions and offering insights for model optimization. Future prospects include exploring cognitive science theories, integrating memory mechanisms, and incorporating external feedback, paving innovative directions in cognitive graph representation research.

2. Model Definitions and Core Concepts

Definition 1
(Cognitive Graph). In the QARCG model, the cognitive graph is defined as $G = [P_1, P_2, \ldots, P_k]$, where each node in G corresponds to a paragraph $P_i$. The retrieval system (System 1) reads each paragraph $P_i$ and extracts triples from the paragraph as the next-hop paragraphs. These new nodes are then used to expand G, providing an explicit structure for the reasoning module (System 2).
Definition 2
(Bert-wwm Embedding [8]). System 2 requires transforming the paragraph $P_i$ into vector representations when learning reasoning paths, so Bert-wwm is used. Here, wwm stands for whole-word masking, an improvement over BERT’s masking technique: a complete word is replaced with a mask label instead of subword tokens. In contrast to English, where the smallest token is a word, in Chinese the smallest token is a character, and words are composed of one or more characters with no obvious delimiters. Words carry more information, and whole-word masking masks the entire word. In the model, the input for Bert-wwm is: [CLS] Question [SEP] clues$[p, G]$ [SEP] Para$[x]$.
Here, clues$[p, G]$ denote the passage p propagated from the preceding node in the cognitive graph G. During the first hop, Bert-wwm encodes the concatenation of the question and the paragraph. The output vector representation of BERT is denoted as $T \in \mathbb{R}^{L \times H}$, where L is the length of the input sequence and H is the dimensionality of the hidden representation.
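As an illustration of this input layout, the sketch below encodes a question, clues, and a paragraph with the HuggingFace transformers library and the public hfl/chinese-bert-wwm checkpoint. The tooling, the checkpoint name, and the way the second segment is concatenated are our assumptions; only the [CLS]/[SEP] layout comes from Definition 2.

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumed checkpoint; the paper only specifies "Bert-wwm" for Chinese.
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm")
model = BertModel.from_pretrained("hfl/chinese-bert-wwm")

def encode(question: str, clues: str, paragraph: str) -> torch.Tensor:
    # First segment: the question; second segment: clues[p, G] and Para[x],
    # joined by an explicit [SEP] so the final sequence reads
    # [CLS] question [SEP] clues [SEP] paragraph [SEP].
    second = clues + tokenizer.sep_token + paragraph
    inputs = tokenizer(question, second, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state  # T with shape (1, L, H)

T = encode("北京大学的校长是谁？", "北京大学是一所综合性大学。", "该校现任校长于2022年上任。")
print(T.shape)
```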

3. Model Framework

The model consists of two systems: the retrieval reasoning system and the reading comprehension system. System 1 is responsible for constructing the inference path of the paragraph graph, while System 2 involves scoring the reasoning paths and extracting answers from the highest-scoring paragraphs. The two systems are jointly trained and integrated to provide the final objective function.

3.1. Retrieval Reasoning Paths

System 1 of the cognitive graph method requires constructing and retrieving reasoning paths. For some complex questions, the evidence paragraphs may not directly have lexical relevance to the question, and independent retrieval of a given document list may not be sufficient for inferring the answer. However, it is highly likely to find the answer through text related to the answer (as shown in Figure 2). To perform such multi-hop reasoning, it is necessary to build a paragraph graph covering Wikipedia paragraphs relevant to the question. The Wikipedia graph is defined as G, where each node  P i  represents an individual paragraph.

3.1.1. Paragraph Graph Construction

In Wikipedia, there is a wealth of entity entry information, which can be regarded as a knowledge resource for constructing multiple linguistic corpora. QARCG uses Wikipedia entries to construct directed edges in G, enabling navigation from one paragraph to another. Based on the given question, the model initially retrieves the top F highest-scoring paragraphs using the TF-IDF method as the initial nodes. Then, starting from these F paragraphs, it uses the extracted triples to retrieve and point to other paragraphs in Wikipedia, continuously expanding the tree structure. The iteration stops when the tree depth reaches the upper limit or no more triples can be found in the paragraphs. Once the paragraph graph construction is complete, the next step is to model the reasoning paths based on this paragraph graph.
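A minimal sketch of the initialization step is given below; it uses scikit-learn’s TfidfVectorizer with character n-grams as a stand-in for the paper’s TF-IDF retriever, and the function name and parameters are ours.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def initial_nodes(question: str, paragraphs: list[str], top_f: int = 3) -> list[int]:
    # Character n-grams work reasonably for Chinese text without word segmentation.
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
    matrix = vectorizer.fit_transform(paragraphs + [question])
    q_vec = matrix[len(paragraphs)]        # TF-IDF vector of the question
    para_vecs = matrix[:len(paragraphs)]   # TF-IDF vectors of all paragraphs
    scores = cosine_similarity(q_vec, para_vecs).ravel()
    # Indices of the F highest-scoring paragraphs, used to initialize the graph G.
    return scores.argsort()[::-1][:top_f].tolist()

paragraphs = ["北京大学位于北京市海淀区。", "清华大学成立于1911年。", "海淀区是北京的一个市辖区。"]
print(initial_nodes("北京大学在哪个区？", paragraphs, top_f=2))
```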

3.1.2. Reasoning Path Modeling

In the constructed paragraph graph, its sequential nature plays a crucial role in text localization, while RNN (recurrent neural network) is proficient in learning hidden state connections in relationships, enabling the learning of reasoning paths. Therefore, utilizing RNN for reasoning path modeling can achieve a more effective representation [26].
In the reasoning path, [EOE] is used as the end-of-path control symbol when selecting the next-hop paragraph $p_i$ related to the question.
Research has shown that when there is explicit interaction between paragraphs and questions, the necessary information from paragraphs that are lexically unrelated to the question but can lead to the answer can be compressed into vectors, resulting in improved performance on entity-centered questions. Therefore, the model concatenates paragraph $p_i$ and question q and encodes them using Bert-wwm (whole-word masking), a BERT variant that masks entire words rather than subword-level tokens. The [CLS] token is used to independently encode the paragraph.
$w_i = \mathrm{BERT}_{[CLS]}(q, p_i) \in \mathbb{R}^d$
At the t-th time step, the model selects a paragraph $p_i$ from the candidate set C based on the current hidden state $h_t$ of the RNN. The initial hidden state $h_1$ is independent of any question or paragraph and is a parameterized vector. The probability score of selecting $p_i$ for the reasoning path is calculated as $P(p_i \mid h_t)$:
$P(p_i \mid h_t) = \sigma(w_i^{\top} h_t + b)$
where $b \in \mathbb{R}^d$ is a bias term.
The RNN selection process captures the relationships between paragraphs in the reasoning path by conditioning on the history of selections. For each time step, the top K paragraphs with the highest probabilities are selected as candidate paragraphs. The inference path retrieval process stops when the end-of-path symbol [EOE] is selected, capturing reasoning paths of arbitrary lengths for each given question. The resulting form of the reasoning paths is shown in Equation (4).
$\alpha_{t+1} = W_r [h_t; w_i] + b_r$
$h_{t+1} = \dfrac{\alpha_{t+1}}{\lVert \alpha_{t+1} \rVert}$
$\mathrm{Path} = [p_1, p_2, \ldots, [\mathrm{EOE}]]$
Finally, the score of each reasoning path is defined as the product of the probability scores of all paragraphs along that path. The top B reasoning paths with the highest scores are then selected and passed to the reasoning module of System 2 for answer prediction.
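The PyTorch sketch below mirrors the paragraph-scoring and state-update formulas above as we read them: candidate encodings $w_i$ are scored against the hidden state with a sigmoid, and the state is updated from the concatenation $[h_t; w_i]$ and normalized. Module names, dimensions, and initialization are our assumptions.

```python
import torch
import torch.nn as nn

class PathSelector(nn.Module):
    def __init__(self, d: int = 1024):
        super().__init__()
        self.h1 = nn.Parameter(torch.randn(d))    # parameterized initial hidden state
        self.bias = nn.Parameter(torch.zeros(1))  # b in P(p_i | h_t) = sigma(w_i . h_t + b)
        self.update = nn.Linear(2 * d, d)         # W_r [h_t; w_i] + b_r

    def score(self, w: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # Probability of selecting each candidate paragraph given the current state.
        return torch.sigmoid(w @ h + self.bias)   # shape: (num_candidates,)

    def step(self, w_selected: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # alpha_{t+1} = W_r [h_t; w_i] + b_r ;  h_{t+1} = alpha / ||alpha||
        alpha = self.update(torch.cat([h, w_selected], dim=-1))
        return alpha / alpha.norm()

selector = PathSelector(d=8)
w = torch.randn(5, 8)                             # five candidate paragraph encodings
probs = selector.score(w, selector.h1)
h_next = selector.step(w[probs.argmax()], selector.h1)
print(probs.shape, h_next.shape)
```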

3.1.3. Model Optimization

In the main text, to improve the retrieval of reasoning paths, the model is optimized as follows:
(1) Before the reasoning-path results are passed to System 2, determining whether each Wikipedia paragraph can be included in a reasoning path is computationally expensive. The score of a reasoning path is calculated as the product of the probability scores of all paragraphs along that path [27]. Beam search [28] is then used for pruning to select the top B reasoning paths.
In the specific approach, to construct an effective cognitive graph, the TF-IDF retrieval method is first used to initialize the candidate paragraphs and guide their search on Wikipedia. The top F paragraphs with the highest TF-IDF scores relevant to the question are selected as the initial candidate set $C_1$. The candidate paragraph sets $C_t$ ($t \geq 2$) are then expanded, augmented with the end-of-path marker [EOE]. The time complexity of processing the candidate sets is $O(|C_1| + B\,\overline{|C_t|})$, where B is the beam size and $\overline{|C_t|}$ is the average size of the candidate sets $C_t$ ($t \geq 2$).
Beam search is used to retrieve the reasoning paths in the cognitive graph. For a reasoning path E, the probabilities of selecting paragraphs are multiplied together:
$E = [p_1, \ldots, p_k]$
$P(p_1 \mid h_1) \cdots P(p_k \mid h_{|E|})$
By using the beam search method, the top B reasoning paths with the highest scores are selected from the set of candidate reasoning paths and passed to System 2, the reader model.
$\mathbb{E} = [E_1, \ldots, E_B]$
$S(q, E, a) = S_2(q, E, a), \quad E \in \mathbb{E}$
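A compact sketch of this pruning step is shown below; the beam keeps the top-B partial paths ranked by the running product of selection probabilities, and `candidates_fn` and `prob_fn` are hypothetical hooks standing in for graph expansion and the System 1 scorer.

```python
from typing import Callable

def beam_search_paths(initial: list[str],
                      candidates_fn: Callable[[str], list[str]],
                      prob_fn: Callable[[list[str], str], float],
                      beam_size: int = 3,
                      max_depth: int = 5) -> list[tuple[list[str], float]]:
    # Each beam entry is (path, score); score is the running product of probabilities.
    beams = [([p], prob_fn([], p)) for p in initial]
    beams = sorted(beams, key=lambda x: x[1], reverse=True)[:beam_size]
    for _ in range(max_depth - 1):
        expanded = []
        for path, score in beams:
            if path[-1] == "[EOE]":
                expanded.append((path, score))            # finished path keeps its score
                continue
            for p in candidates_fn(path[-1]) + ["[EOE]"]:  # [EOE] ends the path
                new_score = score if p == "[EOE]" else score * prob_fn(path, p)
                expanded.append((path + [p], new_score))
        beams = sorted(expanded, key=lambda x: x[1], reverse=True)[:beam_size]
        if all(path[-1] == "[EOE]" for path, _ in beams):
            break
    return beams   # top-B reasoning paths passed to System 2
```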
(2) To effectively control the depth of retrieval when searching existing paragraphs in Wikipedia, the approach adopts the extraction of triples from paragraphs. The subject and object of the extracted triples are then used as keywords for retrieval. Subsequently, the relations matching the subject and object entities in the retrieved Wikipedia entries are considered as the next hop in the retrieval process. Experimental results demonstrate that this triple extraction approach effectively reduces the retrieval depth.
(3) The training is performed using negative sampling. During the training of the retrieval reasoning path model, it is necessary to distinguish between relevant and irrelevant paragraphs. Therefore, the model uses the “no_answer” paragraphs from the WebQA dataset [21], which cannot derive the answer, as negative examples for training. Since the number of paragraphs provided for each question is limited, there is no specific threshold set for the quantity of negative examples. The training loss function is represented by formula (17).
The specific algorithmic procedure is given in Algorithm 1:

Algorithm 1: Cognitive Graph-Based Question Answering with Reasoning
1: Input: System 1 model, System 2 model, question Q, predicted value F, Wikipedia database W
2: Use Q and the given paragraphs to select the top K paragraphs P with the TF-IDF algorithm and initialize the cognitive graph G
3: repeat
4:   Pop the outermost paragraph p from graph G as node x
5:   Extract the preceding paragraphs of node x as reasoning-path clues clues[x, G]
6:   if p is not None: continue retrieving paragraphs from Wikipedia W
7:   if x is a reasoning-path node then
8:     Use triple extraction to find a new paragraph p as y in Wikipedia
9:     for each y as a reasoning-path node do
10:      if y ∉ G and y belongs to Wikipedia W then
11:        Create a new node for y in the cognitive graph
12:      if y ∈ G then
13:        pass
14:  end
15: until there are no boundary nodes in G, or the threshold is reached
16: return Path
The algorithm’s space complexity is $O(d|V| + d|W| \times N)$, where V represents the number of supporting evidence documents in the dataset, W represents the number of paragraphs retrieved from each Wiki search, and N represents the retrieval depth. As the sentences in the model are encoded by BERT into d-dimensional vectors, the required storage space is $d|V| + d|W| \times N$. The algorithm’s time complexity is $O(t \cdot V(K + W \cdot N))$, where t denotes the number of model training iterations and K is the number of negative sampling iterations.
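The sketch below restates the expansion loop of Algorithm 1 in Python under our own simplifications: `extract_triples` and `search_wiki` are hypothetical stand-ins for the triple extractor and the Wikipedia lookup, and the graph is stored as a map from each paragraph to the clue paragraph it was reached from.

```python
from collections import deque

def build_cognitive_graph(initial_paragraphs, extract_triples, search_wiki, max_depth=5):
    graph = {p: None for p in initial_paragraphs}          # node -> predecessor (clue)
    frontier = deque((p, 1) for p in initial_paragraphs)   # (paragraph, depth)
    while frontier:
        x, depth = frontier.popleft()                      # pop the outermost paragraph
        if depth >= max_depth:
            continue                                       # depth threshold reached
        for subject, relation, obj in extract_triples(x):
            # subject and object of each triple serve as keywords for the next hop
            for y in search_wiki(subject) + search_wiki(obj):
                if y not in graph:                         # create a new node for y
                    graph[y] = x
                    frontier.append((y, depth + 1))
                # if y is already in G: pass
    return graph
```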

3.2. Reading and Answering Based on Reasoning Paths

System 2 of the cognitive graph needs to read and answer the already constructed reasoning paths. The model first scores the reasoning paths from E, selecting the highest-scoring path. Then, from the most reasonable path, the model extracts the answer span from the paragraphs of the path.
(1) Scoring the Reasoning Paths
Answer extraction heavily relies on the paragraphs within the reasoning paths. Therefore, the model’s initial task is to re-rank the reasoning paths in the path set based on their relevance to the question. Both the paragraphs and the question in the reasoning paths are encoded using the BERT method, as shown below:
$u_E = \mathrm{BERT}_{[CLS]}(q, E) \in \mathbb{R}^D$
$P(E \mid q) = \delta(w_n, u_E)$
Consequently, in the inference set, the highest-scoring inference path $E_{best}$ is selected as the final basis for deriving the answer, where $E_{best} \in \mathbb{E}$ and $w_n \in \mathbb{R}^D$ is a weight vector:
$E_{best} = \operatorname*{argmax}_{E \in \mathbb{E}} P(E \mid q)$
(2) Extracting Answers from the Highest-Scoring Paragraph
After obtaining the highest-scoring reasoning path from the candidate set of reasoning paths, the next step is to extract the answer span from this path to predict the answer. Similarly, the reasoning path is encoded using BERT. For the BERT output, a linear layer and softmax function are applied to compute the probabilities of each position in the paragraph as the start of the answer. The top K positions with the highest probabilities are selected. Subsequently, the model continues to find the end position of the answer by considering each position in the paragraph and computing the probabilities of being the end.
$P_i^{start} = \dfrac{e^{T_i W_{ans}}}{\sum_{j}^{m} e^{T_j W_{ans}}}$
$P_j^{end} = \dfrac{e^{T_j W_{ans}}}{\sum_{k}^{m} e^{T_k W_{ans}}}$
$end_k = \operatorname*{argmax}_{start_k \leq j \leq start_k + maxL} P_j^{end}$
where $W_{ans}$ is a parameter learned during training, $P_i^{start}$ represents the probability of the i-th position in the paragraph being the start of the answer span, $P_j^{end}$ represents the probability of the j-th position in the paragraph being the end of the answer span, and maxL is the maximum answer span length set in advance. In this way, K combinations of start and end predictions are found. $E_{best}$ contains the i-th and j-th tokens representing the probabilities of the start and end positions of the answer span, respectively. The final answer is selected by taking the maximum product of probabilities among the K combinations. The calculation is as follows, where $S_2$ represents reasoning System 2:
$S_2 = \operatorname*{argmax}_{i,j,\; i \leq j} P_i^{start} P_j^{end}$
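A hedged PyTorch sketch of this span-extraction step follows: a linear head over the BERT output produces start and end logits, softmax yields the position probabilities, and the answer is the pair (i, j) with i ≤ j and bounded length that maximizes the product of probabilities. Shapes and parameter names are our assumptions.

```python
import torch
import torch.nn as nn

def extract_span(T: torch.Tensor, span_head: nn.Linear, max_len: int = 10):
    # T: (L, H) token representations; span_head: Linear(H, 2) producing start/end logits.
    logits = span_head(T)                          # (L, 2)
    p_start = logits[:, 0].softmax(dim=0)          # P_i^start over positions
    p_end = logits[:, 1].softmax(dim=0)            # P_j^end over positions
    scores = p_start.unsqueeze(1) * p_end.unsqueeze(0)   # (L, L): P_i^start * P_j^end
    # keep only pairs with i <= j and j - i <= max_len, then take the best pair
    ones = torch.ones_like(scores)
    mask = torch.triu(ones) - torch.triu(ones, diagonal=max_len + 1)
    best = (scores * mask).argmax()
    i, j = divmod(best.item(), scores.size(1))
    return i, j

T = torch.randn(32, 768)
start, end = extract_span(T, nn.Linear(768, 2))
print(start, end)
```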
(3) Model Optimization
To better distinguish paragraphs that do not contain answers, a negative sampling strategy is introduced during training. If a paragraph can directly lead to an answer, its probability is labeled as $P_r$; otherwise, it is labeled as $1 - P_r$. The loss function is defined in Formula (18).

3.3. Joint Training

QARCG uses Wikipedia for open-domain paragraph retrieval, where each article is divided into multiple paragraphs, resulting in millions of paragraphs. Each paragraph p is treated as the retrieval target for the retriever. Given a question q, the QARCG framework aims to derive the answer a through retrieval and reading of reasoning paths, as shown in Equation (5). Each reasoning path is represented by a sequence of paragraphs. QARCG formulates the task by decomposing the objective into the retriever objective $S_1(q, E)$, which selects reasoning paths E relevant to the question, and the reader objective $S_2(q, E, a)$, which finds the answer a within E:
$\operatorname*{argmax}_{E, a} S(q, E, a) \quad \text{s.t.} \quad S(q, E, a) = S_1(q, E) + S_2(q, E, a)$
$L_1(p_t \mid h_t) = -\log P(p_t \mid h_t) - \sum_{\tilde{p} \in \tilde{C}_t} \log\left(1 - P(\tilde{p} \mid h_t)\right)$
$L_2 = L_{answer} + L_{no} = -\left(\log P_{y^{start}}^{start} + \log P_{y^{end}}^{end}\right) - \log P_r$
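Read literally, the two objectives above can be computed as in the sketch below; tensor names and the toy values are ours, and the probabilities are assumed to come from the System 1 selector and the System 2 span and relevance heads.

```python
import torch

def retriever_loss(p_pos: torch.Tensor, p_negs: torch.Tensor) -> torch.Tensor:
    # L1 = -log P(p_t | h_t) - sum over negative paragraphs of log(1 - P(p~ | h_t))
    return -torch.log(p_pos) - torch.log(1.0 - p_negs).sum()

def reader_loss(p_start: torch.Tensor, p_end: torch.Tensor,
                y_start: int, y_end: int, p_r: torch.Tensor) -> torch.Tensor:
    # L2 = L_answer + L_no = -(log P^start_{y_start} + log P^end_{y_end}) - log P_r
    l_answer = -(torch.log(p_start[y_start]) + torch.log(p_end[y_end]))
    return l_answer - torch.log(p_r)

# toy usage with made-up probabilities
l1 = retriever_loss(torch.tensor(0.9), torch.tensor([0.1, 0.2]))
l2 = reader_loss(torch.full((32,), 1 / 32), torch.full((32,), 1 / 32), 3, 5, torch.tensor(0.8))
print(l1.item(), l2.item())
```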
After encoding paragraphs and questions, the retriever model of QARCG captures paragraph interactions through the [CLS] representation of BERT, learning interactions among paragraphs to enhance the credibility of predicting reasoning paths. Additionally, the reordering process reduces uncertainty during reasoning path selection, making the model framework more robust.

4. Experimental Analysis

This section validates and evaluates the effectiveness of the proposed model using a Chinese reading comprehension question-answering dataset, which consists of questions with one or multiple documents as context. To ensure the authenticity of the experiments, the Baidu WebQA dataset, an open-source dataset, is used for experimentation and testing. The experimental results of QARCG and the compared methods are presented in a unified environment. For single-fact answer questions, the strict matching F1 score on the validation set increases from 74.28% to 75.99%, and on the test set, it increases from 73.53% to 74.98%. Furthermore, for the complete dataset, both the strict matching and fuzzy matching scores for annotated and retrieved types also show improvement, confirming the effectiveness of the proposed model.

4.1. Experimental Configuration

In this section, the model experiments were conducted in a unified experimental environment with a GeForce RTX 2080Ti GPU and Ubuntu 16.04.6 operating system. The hyperparameters were set following the principle of controlling variables uniformly, and benchmark parameters were defined to compare and analyze the model’s accuracy and time efficiency. In the experimental analysis section, we will further elaborate on the model’s performance and the impact of its hyperparameters.

4.2. Dataset

To ensure the authenticity and effectiveness of the experiments, we selected the Baidu WebQA dataset [21] for experimentation. This dataset comprises questions, annotated evidence, retrieved evidence, and answers. Unlike SQuAD [29], the questions in WebQA are derived from user queries in search engines, while the provided passages are extracted from web pages. WebQA provides document passages for each question retrieved from the evidence. Therefore, we used this dataset to evaluate the model’s answer-locating ability in reading comprehension tasks.
The statistical information of the WebQA dataset is presented in Table 1. The experiments were conducted separately on annotated and retrieved evidence for model training and evaluation. In the “Annotated” setting, each question is provided with one evidence passage, while in the “Retrieved” setting, multiple evidence passages are provided for each question. The data distribution of the validation set (validation) is closer to the training set (train). Generally, the validation set is used to assess the model’s accuracy, while the test set (test) is used to evaluate its transferability. In this section, we primarily focus on evaluating the model’s accuracy.

4.3. Comparative Algorithms

In this section, we compare our model with other algorithms from two dimensions:
(1) For questions with fact-based entity-type answers, we compare our model with the following end-to-end approaches: MemN2N, LSTM with attention mechanism, and end-to-end sequence-based baseline model.
MemN2N [13] is one of the implementations of end-to-end models with memory networks [30]. It uses the bag-of-words method to encode the question and evidence, and stores the representation of evidence in an external memory. The recurrent attention model is used to retrieve relevant information from memory to answer questions.
Attentive and Impatient Readers [20] use bidirectional LSTM [14] to encode the question and evidence and employ a model that classifies the vocabulary based on these two encodings. It is a classic application of a simple fine-grained attention mechanism in machine reading comprehension tasks. The simpler “attention reader” computes the attention to evidence documents in an attention-based manner, while more complex readers calculate attention after processing each query word.
The end-to-end sequence-based baseline model [21], proposed by Baidu, uses sequence labeling to generate answers by classifying over a very large vocabulary. However, this approach incurs high computational cost and struggles to handle unseen words. The model uses an end-to-end trainable sequence labeling technique to process question answering, ultimately predicting answers, and is evaluated on the WebQA test dataset.
(2) For all types of answers in the WebQA dataset, we conduct experiments and compare our model with the following methods: LSTM [21], DrQA [15], BIDAF [16], R-net [17], BERT [18], and SRQA [19].
LSTM is a baseline model that uses sequence labeling to label answers. The simple usage of LSTM leads to limited text representation capacity, and its accuracy in sequence labeling is low, resulting in a fuzzy matching answer score below 70% for datasets with single texts.
The rest of the models are based on attention mechanisms and pointer networks. DrQA, BIDAF, and R-Net propose innovative attention methods. DrQA simply uses a bilinear term to calculate attention weights to obtain word-level question-merged paragraph representations, capturing the similarity between paragraphs and questions, and computing the probability of each word being the start or end position of the answer.
BIDAF introduces a memory-less attention mechanism to generate bidirectional attention flow and obtains word-level representations through a multi-stage hierarchy for different granularities of context. It still uses span probability prediction for answer extraction.
R-Net extends self-matching attention to fully exploit information from the paragraph itself to distinguish different meanings of the same word, thereby enhancing the context information. It ultimately extracts answers by predicting the positions of the answers.
BERT adopts the Transformer as the attention mechanism model, and is trained on a large-scale corpus. It uses span probability prediction as the final answer extraction.
SRQA utilizes a multi-layer attention network to learn better representations. The multi-layer attention network learns the interaction between questions and documents in each layer, and each layer’s document representation corresponds to the needs of the question. It conducts experiments using three different approaches: multi-layer attention (MA), cross evidence (CE), and adversarial training (AT).

4.4. Evaluation Metrics

The evaluation of open-domain reading comprehension question answering is of paramount importance [31]. In the WebQA dataset, the majority of answers are entity names, such as names of people, locations, and time. Therefore, it is appropriate to directly measure the accuracy of predicted answers as the evaluation criterion, instead of using approximate evaluation methods such as Bleu [32] and Rouge [33]. In Section 4.6, the model performance is evaluated using three metrics: precision (P), recall (R), and F1-score, which are defined as follows in Formula (20):
$P = \dfrac{|C|}{|A|}, \quad R = \dfrac{|C|}{|Q|}, \quad F1 = \dfrac{2PR}{P+R}$
where $|C|$ denotes the number of correctly predicted answers (true positives), $|A|$ denotes the total number of answers predicted by the model, and $|Q|$ denotes the total number of questions. These metrics are used to compare the predicted answers with the given answers in the dataset and assess the model’s performance.
Because the WebQA dataset is collected from the internet, answers with the same meaning may appear in different forms, such as “Beijing” and “Beijing city”. To properly evaluate the correctness of these cases, two methods are used in the experiments to determine correct answers: strict matching and fuzzy matching. Strict matching considers the model’s predicted answer correct only if it is exactly the same as the given correct answer in the dataset, while fuzzy matching considers the predicted answer correct if it contains the correct answer or is a synonym of it.
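A small sketch of the two matching rules, under our reading of the description above; the synonym case mentioned for fuzzy matching would require an external lexical resource and is omitted here.

```python
def strict_match(predicted: str, gold: str) -> bool:
    # Correct only if the prediction is exactly the gold answer.
    return predicted.strip() == gold.strip()

def fuzzy_match(predicted: str, gold: str) -> bool:
    # Correct if either string contains the other (e.g., "北京" vs. "北京市").
    p, g = predicted.strip(), gold.strip()
    return p == g or g in p or p in g

print(strict_match("北京", "北京市"))   # False
print(fuzzy_match("北京", "北京市"))    # True
```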
Moreover, the WebQA dataset includes questions with a single evidence document (annotated) and questions with multiple evidence documents (retrieved). To fully consider these scenarios, the model is evaluated separately on the Annotated and Retrieved subsets of the dataset.
In the algorithm performance analysis in Section 5, the exact match (EM) is used to calculate whether the predicted results match the standard answers exactly, and the F1 score is used to measure the word-level matching between the predicted results and the standard answers. EM is a common evaluation criterion in question-answering systems, used to assess the percentage of correctly matched answers in the predictions. The calculation is represented by Formula (21), where C represents the number of correct answers predicted, and A represents the total number of answers.
$EM = \dfrac{C}{A}$
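Under the interpretation of |C|, |A|, and |Q| given above, the metrics and EM can be computed as follows (toy numbers, not results from the paper):

```python
def precision_recall_f1(num_correct: int, num_answered: int, num_questions: int):
    # P = |C| / |A|, R = |C| / |Q|, F1 = 2PR / (P + R)
    p = num_correct / num_answered if num_answered else 0.0
    r = num_correct / num_questions if num_questions else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def exact_match(num_correct: int, num_answers: int) -> float:
    # EM = C / A, Formula (21)
    return num_correct / num_answers if num_answers else 0.0

print(precision_recall_f1(75, 95, 100))   # toy numbers
print(exact_match(75, 100))
```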

4.5. Parameter Settings

In the construction of the cognitive graph, QARCG adopts the TF-IDF algorithm [34] to select initial nodes. The number of initial paragraphs retrieved by TF-IDF is set to 3, and the maximum depth for retrieving from Wikipedia is set to 5. The pretrained Bert-wwm model is used, and both the retrieval system and the reasoning system use the same wwm configuration (d = 1024). For the other comparative methods, the best parameter settings provided in the original papers are used for comparison. To explore the impact of different dimensional parameters on model performance, various parameter settings are compared within the QARCG model.

4.6. Algorithm Comparison

Experiments are conducted on the entity-answer data in the WebQA dataset, and the results are shown in Table 2. Experiments are also conducted on all data in the WebQA dataset, and the results are presented in Table 3.
Entity-Answer Experiment: End-to-end models are more suitable for entity-centered questions. Therefore, in the WebQA dataset, a comparative experiment is conducted on questions with entity-type answers. The validation set, which includes both annotated evidence and retrieved evidence, is combined for validation. Likewise, the test set, which includes both annotated evidence and retrieved evidence, is combined for testing. The experiments are conducted using strict matching. The results in Table 2 show that QARCG consistently outperforms MemN2N, the Attentive Reader, the Impatient Reader, and the baseline model, demonstrating the effectiveness of the proposed model.
WebQA Dataset Experiment: Experiments are conducted on both annotated evidence and retrieved evidence subsets of the WebQA dataset. For the annotated evidence validation set, strict matching is used for the experiments, while for the test sets of both annotated and retrieved evidence, experiments are conducted using both strict matching and fuzzy matching. The results in Table 3 show that QARCG outperforms different models, such as LSTM, DrQA, BIDAF, R-net, Bert, and SRQA, achieving the best F1 values under various conditions, further demonstrating the effectiveness of the proposed model.
In summary, there exist three key distinctions between Table 2 and Table 3. Firstly, Table 2 utilizes a subset of the WebQA dataset, specifically focusing on instances where answers are expressed as entities. In contrast, Table 3 encompasses the entirety of the WebQA dataset, comprising a broader scope. Therefore, the dataset employed in Table 2 can be regarded as a subset of the dataset utilized in Table 3.
Secondly, the dataset used for Table 2 is characterized by answers in the form of entities, whereas Table 3 distinguishes the two evidence settings, “annotated” and “retrieved”. Consistent with the dataset description in Section 4.2, “annotated” refers to questions provided with a single human-annotated evidence passage, while “retrieved” refers to questions provided with multiple retrieved evidence passages.
Thirdly, our experimental methodology encompasses two distinct approaches: “Strict” and “Fuzzy”. The “Strict” approach signifies instances where the generated answers exhibit precise alignment with the dataset answers, thus reflecting a complete match. On the other hand, the “Fuzzy” approach pertains to situations where the generated answers demonstrate partial correspondence with the dataset answers, indicating a nuanced level of alignment. The distinctions presented highlight the intricate dataset compositions, as well as the evaluation methodologies employed, contributing to a deeper understanding of the experimental framework.

5. Algorithm Performance Analysis

5.1. Ablation Experiments

Ablation experiments are particularly important for studying the role of each module in the model [35]. As shown in Table 4, removing certain components from the model significantly decreases its performance. We analyze the components used for retrieval in System 1 and those used for reading comprehension in System 2. In the experiments, EM values indicate whether the predicted answers match the correct answers exactly, and F1 values represent the overlap between predicted answers and correct answers.
In System 1, QARCG constructs reasoning paths using the RNN module. As shown in Figure 1 of the model diagram, the retrieval process in the construction of cognitive graph often depends on the information mentioned in the previous paragraph. Therefore, if the model does not condition the next paragraph on the previous one, it will fail to retrieve the connection between paragraphs. Additionally, following the approach of the CogQA model [7], experiments were conducted using CNN and GNN to replace the RNN module.
The triple extraction module provides evidence paragraphs more efficiently for the reasoning paths. When pure entity recognition is used for paragraph retrieval, a drop in performance is observed, indicating the importance of the triple extraction module for reasoning paths. The beam search module is designed to facilitate the search for better reasoning paths. Replacing it with the greedy module for search leads to a performance drop of approximately 4 points, demonstrating that the graph structure is more conducive to finding the optimal reasoning path. The negative sampling training is also an essential module that cannot be neglected. Experimental results show that without negative sampling as part of the training, the QARCG model is easily misled by reasoning paths without the correct answer, resulting in the retrieval of supporting paragraphs with incorrect answers.
In System 2, the module for reordering reasoning paths demonstrates the importance of finding the best reasoning path among multiple ones to identify the answer span. The use of negative sampling for answer prediction [36] also indicates that negative sampling is beneficial to prevent the model from overly relying on selecting answers related to the question.

5.2. Parameter Experiments

5.2.1. Retrieval Length

The maximum allowed retrieval length in Wikipedia during the construction of the cognitive graph in System 1 can affect the model’s performance. Due to limitations in machine performance, we evaluated the QARCG model only with the reasoning path length fixed to 1, 2, 3, or 4. As shown in Figure 3, although the retrieval performance with a reasoning path length of 2 is slightly lower than that with a length of 3 on the WebQA dataset, the reasoning time is significantly reduced. Considering the trade-off between retrieval efficiency and performance, we set the reasoning path length to 2.

5.2.2. Pretrained Models

Considering that different pretrained models can also have an impact on the model’s performance, several variants based on BERT have been proposed in recent years. Therefore, we conducted experiments to encode the QARCG model using different pretrained models for comparison, including Albert [37], Roberta [38], and GPT [39]. As shown in Figure 4, the influence of different pretrained models on the model’s performance is negligible in the same batch training. Therefore, in the QARCG model, we continue to use Bert-wwm as the pretrained model.

5.2.3. Top-K Inference Paths

In the model, we set three top thresholds for inference paths. First, we obtain the top-K inference paths from the RNN; then the top B paths selected by beam search are passed to System 2 for reading comprehension; finally, the single top inference path is determined in System 2 based on its relevance to the question. The thresholds are set to commonly used empirical values, representing a gradually narrowing scope: $K = 5$, $B = 3$, and finally 1.

5.3. Data Augmentation Experiment

To validate the QARCG model, we conducted additional experiments using the Sogou question-answering competition dataset. Since the Sogou competition dataset is not publicly available, we performed data augmentation experiments by mixing the training data from the WebQA dataset with that from SogouQA. We then tested the QARCG model on the WebQA’s Annotated and Retrieved datasets (as shown in Table 3). The experimental results, as depicted in Figure 5, demonstrate that the F1 score of the model improved after data augmentation, showcasing the universality and scalability of the model in open-domain question answering.

6. Conclusions and Future Work

6.1. Conclusions

In light of the current state of open-domain reading comprehension question-answering methods, this study proposed the QARCG method based on the dual-process theory of cognitive science. The QARCG approach views open-domain question answering as a combination of retrieval and reasoning systems. System 1, responsible for retrieval, extracts triples from given supporting text and iteratively retrieves information from Wikipedia, constructing a cognitive graph with reasoning paths. System 2, responsible for reasoning, learns the interaction information between paragraphs using RNN based on the built cognitive graph. It reorders and scores different reasoning paths, and predicts the answer’s span based on the highest-scoring reasoning path’s paragraphs. The integration of retrieval and reasoning reduces the loss in graph construction and maintains graph structure, thereby enhancing interpretability and overcoming the lack of reasoning interpretability in traditional end-to-end reading comprehension methods. Additionally, it addresses the requirement for existing large-scale knowledge graphs in knowledge graph question answering.

6.2. Future Work

The open-domain reading comprehension question-answering method based on cognitive graphs integrates two systems to attain precise answers while maintaining interpretability. Nevertheless, there exists considerable potential for advancing and refining this approach.
(1) Dual-process theory: Currently, the cognitive graph is based on the dual-process theory of cognitive science, dividing reading comprehension into System 1 for retrieval and System 2 for reasoning. However, there may be other relevant theories in cognitive science that could provide support. Exploring how to construct a novel learning architecture that combines symbolic reasoning and deep learning is an important future task.
(2) Memory mechanisms in cognitive graphs: The retrieval system in the cognitive graph question-answering method simulates memory models in reading comprehension. However, human memory mechanisms encompass both long-term and short-term memory, operating with distinct modes and mechanisms. Considering how to build a memory model that reflects long-term memory storage is a challenge that needs to be addressed.
(3) Integration of cognitive graphs and external feedback: While the cognitive graph question-answering method achieves answer extraction through the tight integration of retrieval and reasoning systems, incorporating human cognitive processes may benefit from reinforcement learning to learn feedback and interact with the external world. Thus, the integration of cognitive graphs and external feedback is a topic worth exploring.
In conclusion, the research and application of cognitive graphs offer great potential for future exploration. Further work will focus on how to perform text reasoning based on complex knowledge, as there is still considerable room for improvement in addressing reading comprehension question-answering challenges.

Author Contributions

Conceptualization and validation, J.W. and H.W.; methodology, software, data curation, writing—original draft preparation, J.W.; writing—review and editing, H.W.; supervision, project administration, funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [grant number 71932002], and by the Youth Beijing Scholars Program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We sincerely appreciate the comments and suggestions from the editor-in-chief, the review team, and the editors, which improved the quality of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sloman, S. The Empirical Case for Two Systems of Reasoning. Psychol. Bull. 1996, 119, 3–22. [Google Scholar] [CrossRef]
  2. Chen, D. Neural Reading Comprehension and Beyond; Stanford University: Stanford, CA, USA, 2018. [Google Scholar]
  3. Lee, K.; Chang, M.W.; Toutanova, K. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, 28 July–2 August 2019; Korhonen, A., Traum, D., Marquez, L., Eds.; Association for Computational Linguistics (ACL): Cedarville, OH, USA, 2019; pp. 6086–6096. [Google Scholar]
  4. Robertson, S.E.; Jones, K.S. Relevance Weighting of Search Terms. J. Am. Soc. Inf. Sci. 1976, 27, 129–146. [Google Scholar] [CrossRef]
  5. Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.W.; Salakhutdinov, R.; Manning, C.D. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium, 31 October–4 November 2018; pp. 2369–2380. [Google Scholar]
  6. Seo, M.; Lee, J.; Kwiatkowski, T.; Parikh, A.P.; Farhadi, A.; Hajishirzi, H. Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, 28 July–2 August 2019; Korhonen, A., Traum, D., Marquez, L., Eds.; Association for Computational Linguistics (ACL): Cedarville, OH, USA, 2019; pp. 4430–4441. [Google Scholar]
  7. Ding, M.; Zhou, C.; Chen, Q.; Yang, H.; Tang, J. Cognitive Graph for Multi-Hop Reading Comprehension at Scale. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, 28 July–2 August 2019; Korhonen, A., Traum, D., Marquez, L., Eds.; Association for Computational Linguistics (ACL): Cedarville, OH, USA, 2019; pp. 2694–2703. [Google Scholar]
  8. Berant, J.; Chou, A.; Frostig, R.; Liang, P. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1533–1544. [Google Scholar]
  9. He, S.; Liu, C.; Liu, K.; Zhao, J. Generating Natural Answers by Incorporating Copying and Retrieving Mechanisms in Sequence-to-Sequence Learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, BC, Canada, 30 July–4 August 2017; Barzilay, R., Kan, M., Eds.; Association for Computational Linguistics (ACL): Cedarville, OH, USA, 2017; Volume 1, pp. 199–208. [Google Scholar]
  10. Chen, Y.; Wu, L.; Zaki, M.J. Bidirectional Attentive Memory Networks for Question Answering over Knowledge Bases. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2019), Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 2913–2923. [Google Scholar]
  11. Mohammed, S.; Shi, P.; Lin, J. Strong Baselines for Simple Question Answering over Knowledge Graphs with and without Neural Networks. arXiv 2017, arXiv:1712.01969. [Google Scholar]
  12. Zhang, Y.; Dai, H.; Kozareva, Z.; Smola, A.J.; Song, L. Variational Reasoning for Question Answering with Knowledge Graph. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence/Thirtieth Innovative Applications of Artificial Intelligence Conference/Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 6069–6076. [Google Scholar]
  13. Sukhbaatar, S.; Weston, J.; Fergus, R. End-to-End Memory Networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar]
  14. Staudemeyer, R.C.; Morris, E.R. Understanding LSTM–a Tutorial into Long Short-Term Memory Recurrent Neural Networks. arXiv 2019, arXiv:1909.09586. [Google Scholar]
  15. Chen, D.; Fisch, A.; Weston, J.; Bordes, A. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, BC, Canada, 30 July–4 August 2017; Barzilay, R., Kan, M., Eds.; Association for Computational Linguistics: Cedarville, OH, USA, 2017; Volume 1, pp. 1870–1879. [Google Scholar]
  16. Seo, M.; Kembhavi, A.; Farhadi, A.; Hajishirzi, H. Bidirectional Attention Flow for Machine Comprehension. arXiv 2016, arXiv:1611.01603. [Google Scholar]
  17. Park, C.; Lee, C.; Hong, L.; Hwang, Y.; Yoo, T.; Jang, J.; Hong, Y.; Bae, K.H.; Kim, H.K. S2-Net: Machine Reading Comprehension with SRU-based Self-Matching Networks. ETRI J. 2019, 41, 371–382. [Google Scholar] [CrossRef]
  18. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of The Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2019), Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  19. Wang, J.; Xu, W.; Fu, X.; Wei, Y.; Jin, L.; Chen, Z.; Xu, G.; Wu, Y. SRQA: Synthetic Reader for Factoid Question Answering. Knowledge-Based Syst. 2020, 193, 105415. [Google Scholar] [CrossRef]
  20. Hermann, K.M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; Blunsom, P. Teaching Machines to Read and Comprehend. Adv. Neural Inf. Process. Syst. 2015, 28, 1693–1701. [Google Scholar]
  21. Li, P.; Li, W.; He, Z.; Wang, X.; Cao, Y.; Zhou, J.; Xu, W. Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering. arXiv 2016, arXiv:1607.06275. [Google Scholar]
  22. Wei, Z.; Su, J.; Wang, Y.; Tian, Y.; Chang, Y. A novel cascade binary tagging framework for relational triple extraction. arXiv 2019, arXiv:1909.03227. [Google Scholar]
  23. Nadeau, D.; Sekine, S. A survey of named entity recognition and classification. Lingvisticae Investig. 2007, 30, 3–26. [Google Scholar] [CrossRef]
  24. Rink, B.; Harabagiu, S. Classifying semantic relations by combining lexical and semantic resources. In Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden, 15–16 July 2010; pp. 256–259. [Google Scholar]
  25. Li, Q.; Ji, H. Incremental joint extraction of entity mentions and relations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, 23–24 June 2014; Volume 1: Long Papers. pp. 402–412. [Google Scholar]
  26. Das, R.; Neelakantan, A.; Belanger, D.; McCallum, A. Chains of Reasoning over Entities, Relations, and Text using Recurrent Neural Networks. In Proceedings of the 15TH Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), Valencia, Spain, 3–7 April 2017; Volume 1: Long papers. pp. 132–141. [Google Scholar]
  27. Hu, M.; Peng, Y.; Huang, Z.; Qiu, X.; Wei, F.; Zhou, M. Reinforced Mnemonic Reader for Machine Reading Comprehension. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; Lang, J., Ed.; AAAI Press: Washington, DC, USA, 2018; pp. 4099–4106. [Google Scholar]
  28. Wiseman, S.; Rush, A.M. Sequence-to-Sequence Learning as Beam-Search Optimization. arXiv 2016, arXiv:1606.02960. [Google Scholar]
  29. Rajpurkar, P.; Jia, R.; Liang, P. Know What You Don’t Know: Unanswerable Questions for SQuAD. arXiv 2018, arXiv:1806.03822. [Google Scholar]
  30. Weston, J.; Chopra, S.; Bordes, A. Memory Networks. arXiv 2014, arXiv:1410.3916. [Google Scholar]
  31. Rodrigo, A.; Penas, A. A Study about the Future Evaluation of Question-Answering Systems. Knowledge-Based Syst. 2017, 137, 83–93. [Google Scholar] [CrossRef]
  32. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  33. Lin, C.Y. Rouge: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
  34. Sparck Jones, K. A Statistical Interpretation of Term Specificity and Its Application in Retrieval. J. Doc. 1972, 28, 11–21. [Google Scholar] [CrossRef]
  35. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time Object Detection with Region Proposal Networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
  36. Clark, C.; Gardner, M. Simple and Effective Multi-Paragraph Reading Comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia, 15–20 July 2018; Gurevych, I., Miyao, Y., Eds.; Association for Computational Linguistics (ACL): Cedarville, OH, USA, 2018; Volume 1, pp. 845–855. [Google Scholar]
  37. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A Lite Bert for Self-Supervised Learning of Language Representations. arXiv 2019, arXiv:1909.11942. [Google Scholar]
  38. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A Robustly Optimized Bert Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  39. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training 2018. Available online: https://www.mikecaptain.com/resources/pdf/GPT-1.pdf (accessed on 1 January 2022).
Figure 1. QARCG model framework.
Figure 2. Examples of problematic cases.
Figure 3. F1 and EM scores with different retrieval pathways.
Figure 4. F1 and EM scores with different pretrained models.
Figure 5. Comparison of F1 scores with data augmentation.
Table 1. Statistical information of the WebQA dataset. Each cell gives the number of items and the number of words (# / Words).

Dataset    | Question (# / Words) | Annotated Evidence (# / Words) | Retrieved Evidence (# / Words)
Train      | 36,145 / 374,500     | 140,897 / 10,757,652           | 171,838 / 7,233,543
Validation | 3018 / 36,666        | 5412 / 233,911                 | 60,351 / 3,633,540
Test       | 3024 / 36,815        | 5445 / 234,258                 | 60,645 / 3,620,391
Table 2. Comparison results for entity answers. Each cell gives P / R / F1.

System           | Validation (Strict)   | Test (Strict)
MemN2N           | 52.61 / 52.61 / 52.61 | 50.14 / 50.14 / 50.14
Impatient Reader | 63.05 / 63.05 / 63.05 | 59.83 / 59.83 / 59.83
Base             | 64.46 / 87.62 / 74.28 | 63.30 / 87.70 / 73.53
QARCG            | 76.62 / 75.13 / 75.86 | 75.74 / 74.23 / 74.97

Bold in the original highlights the scores of the algorithm proposed in this paper (QARCG).
Table 3. Comparison results for all answers. The first three column groups report on annotated evidence and the last on retrieved evidence; each cell gives P / R / F1.

Model             | Annotated: Strict (Val.) | Annotated: Strict (Test) | Annotated: Fuzzy (Test) | Retrieved: Fuzzy (Test)
LSTM+softmax      | 59.74 / 69.11 / 64.08    | 59.38 / 68.77 / 63.73    | 63.58 / 73.63 / 68.24   | 69.75 / 74.72 / 72.15
LSTM+softmax(k-1) | 59.84 / 67.51 / 63.44    | 59.76 / 67.61 / 63.44    | 64.02 / 72.44 / 67.97   | 69.11 / 73.93 / 71.44
LSTM+CRF          | 64.42 / 75.84 / 69.67    | 63.72 / 76.09 / 69.36    | 67.53 / 80.63 / 73.50   | 72.66 / 76.83 / 74.69
DrQA              | –                        | 69.62 / 69.62 / 69.62    | 72.86 / 72.86 / 72.86   | 75.24 / 75.24 / 75.24
BIDAF             | –                        | 70.04 / 70.04 / 70.04    | 74.43 / 74.43 / 74.43   | 75.62 / 75.62 / 75.62
R-net             | –                        | 70.48 / 70.48 / 70.48    | 74.82 / 74.82 / 74.82   | 76.06 / 76.06 / 76.06
Bert              | –                        | 71.36 / 71.36 / 71.36    | 75.58 / 75.58 / 75.58   | 77.84 / 77.84 / 77.84
SRQA (MA)         | –                        | 71.03 / 71.03 / 71.03    | 75.46 / 75.46 / 75.46   | 77.23 / 77.23 / 77.23
SRQA (MA+RN)      | –                        | 71.28 / 71.28 / 71.28    | 75.89 / 75.89 / 75.89   | 77.84 / 77.84 / 77.84
SRQA (MA+AT)      | –                        | 72.51 / 72.51 / 72.51    | 77.01 / 77.01 / 77.01   | 78.56 / 78.56 / 78.56
QARCG             | 79.73 / 72.66 / 76.03    | 74.66 / 73.86 / 74.26    | 77.78 / 76.86 / 77.32   | 79.09 / 78.23 / 78.66

Bold in the original highlights the scores of the algorithm proposed in this paper (QARCG).
Table 4. Ablation experiments: evaluating different variants of the model on the WebQA dataset.

Setting                                  | F1   | EM
Complete model                           | 78.3 | 74.9
Retrieval System 1: No RNN               | 56.7 | 52.9
Retrieval System 1: CNN                  | 72.6 | 69.3
Retrieval System 1: GNN                  | 74.5 | 72.6
Retrieval System 1: No triple extraction | 70.8 | 68.6
Retrieval System 1: No beam search       | 74.2 | 70.7
Retrieval System 1: No negative sampling | 69.3 | 66.2
Reading System 2: No path reranking      | 75.8 | 72.2
Reading System 2: No negative sampling   | 62.3 | 60.6
