1. Introduction
Machine reading comprehension (MRC) is one of the most attractive and long-standing tasks in natural language processing (NLP). Compared with single-paragraph MRC, multi-hop MRC is more challenging since multiple confusing answer candidates are contained in different passages [1,2]. Models designed for this task are expected to traverse multiple passages reasonably and discover reasoning clues that follow the given questions. For complex multi-hop MRC tasks, more understandable, reliable, and analyzable methodologies are required to improve reading performance.
A better understanding of biological brains could play a vital role in building artificially intelligent systems [3]. Previous cognitive research on reading can benefit challenging multi-hop MRC tasks. The concept of grandmother cells can be traced back to a 1969 academic lecture given by the neuroscientist Jerome Lettvin [4]; the physiologist Horace Barlow later defined them as cells in the brain that respond specifically to a single familiar person or object. In experiments on primates, researchers discovered individual neurons that responded specifically to a particular person, image, or concept after differentiation [5]. A study of a patient with epilepsy found a neuron in the patient's anterior temporal lobe that responded specifically to the Hollywood star Jennifer Aniston [6]. Any form of stimulation related to Aniston, whether a color photograph, a close-up of her face, a cartoon portrait, or even her name written on paper, could and would only stimulate that neuron to produce an excited signal. As research into the concept of grandmother cells progressed, the underlying mechanism of their response became clearer: the signal output from a single grandmother cell in response to a specific stimulus actually stems from the coordinated computation of a large-scale neural network behind the grandmother cells [5]. This suggests that a single neuron can respond to only one out of thousands of stimulations, which is intuitively similar to reading and inference in multi-hop MRC:
Selectivity. The grandmother-cell concept organizes neurons in a hierarchical "sparse" coding scheme: only specific neurons are activated to respond to a stimulation. This is similar to the manner in which we store reasoning evidence maps (neurons) in our minds during reading and recall the related evidence maps to reason out the answer under the constraint of a question (stimulation).
Specificity. The concept implies that brains contain grandmother neurons so specialized that they appear dedicated to a single object, which is similar to a particular MRC question resulting in a specific answer among multiple reading passages and their complex reasoning evidence.
Class character. The amazing selectivity captured in grandmother cells nevertheless results from computation by much larger networks and the collective operations of many functionally different low-level cells, similar to human multi-hop reading, in which evidence is gathered from as many different levels as possible and the final answer is decided among several candidate endpoints.
To imitate grandmother cells in multi-hop MRC, the reading evidence should be organized as level-classified neurons, and selections must be performed in response to a specific question stimulation. In multi-hop MRC tasks, the hops between two entities can be connected as node pairs and gradually assembled into a reasoning evidence graph that takes all related entities as nodes. This reasoning evidence graph is naturally represented as a graph structure, which can be empirically considered to contain the implicit reasoning chains from the start of the question to the final answer nodes (entities). We generally recall considerable related evidence as a node, whatever form it takes (such as a paragraph, a short sentence, or a phrase), to satisfy the class character, and we coordinate the inter-relationships of these nodes before obtaining the results.
Graph neural networks (GNNs) inspire us to posit that operating on graphs and manipulating structured knowledge can support relational reasoning [7,8] in a sophisticated and flexible pattern, similar to the implementation of grandmother cells when the cells are regarded as nodes in a graph and evidence is collected in multi-classified aspects of node representations. Furthermore, spatial graph attention networks (GATs) perform the selectivity in reasoning evidence graphs in the manner of grandmother cells using attention mechanisms. This work has the following main contributions:
In order to construct a more reasonable graph, ClueReader draws inspiration from the concept of grandmother cells in information cognition, in which cells in the brain respond only to specific entities. This leads to the creation of heterogeneous graph attention networks with multiple types of nodes.
By taking the subjects of queries as starting points, potential reasoning entities in multiple documents as bridge points, and mention entities consistent with candidate answers as endpoints, the proposed ClueReader constructs multi-hop reasoning chains in a heuristic way.
Before outputting predicted answers, ClueReader innovatively visualizes the internal states of the heterogeneous graph attention network, providing intuitive quantitative displays for analyzing the effectiveness, rationality, and explainability of the model.
The remainder of the article is organized as follows. Section 2 describes the work related to multi-hop MRC, and Section 3 proposes ClueReader, which imitates grandmother cells for multi-hop MRC. Experimental evaluations are conducted in Section 4, and conclusions are summarized in Section 5.
3. Methodology
We introduce the design and implementation of the proposed model, ClueReader, which is shown in Figure 1.
3.1. Task Formalization
A given query is in the triple form $q = \langle s, r, ? \rangle$, where $s$ is the subject entity and $r$ is the query relation (i.e., the predicate); $q$ can also be converted into the sequential form $q = \{q_1, q_2, \dots, q_m\}$, where $m$ is the number of tokens in the query $q$. Then, a set of candidates $C_q = \{c_1, c_2, \dots, c_z\}$ and a series of supporting documents $S_q = \{s_1, s_2, \dots, s_n\}$ containing the candidates are also provided, where $z$ is the number of the given candidates, $n$ is the number of the given supporting documents, and the subscript $q$ means that the two sets are constrained by the query $q$. Moreover, $S_q$ is provided in a random order, and without $C_q$, the answer to the query $q$ could be multiple. Our goal is to identify the single correct answer $a^\ast \in C_q$ by reading $S_q$.
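To make the setting concrete, the following minimal sketch shows the shape of one sample; the field names follow the public WikiHop/QAngaroo convention, and all concrete values are hypothetical:

```python
# A minimal sketch of one multi-hop MRC sample under the formalization above.
# Field names follow the WikiHop/QAngaroo convention; all values are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class MultiHopSample:
    query: str             # triple-form query as text: relation r plus subject s
    candidates: List[str]  # candidate set C_q (z candidates)
    supports: List[str]    # supporting documents S_q (n documents, random order)
    answer: str            # the single correct answer, an element of C_q

sample = MultiHopSample(
    query="based_in mike dinunno",  # subject s = "mike dinunno", relation r = "based_in"
    candidates=["chicago", "helsinki", "tallinn"],
    supports=["Mike DiNunno plays for ...", "... is a basketball club based in ..."],
    answer="tallinn",  # hypothetical value for illustration only
)
```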
3.2. Encoding Layer
We utilize the pre-trained GloVe [32] model to initialize word embeddings and then employ Bidirectional Long Short-Term Memory (Bi-LSTM) [33,34] to encode sequence representations as:

$$
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right), \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right), \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right), \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh\left(c_t\right),
\end{aligned}
$$

where the subscripts $t$ and $t-1$ denote the indexes of the encoding time steps; $d_i$ and $d_h$ are the hyperparameters of the input and the hidden layer (i.e., $x_t \in \mathbb{R}^{d_i}$ and $h_t \in \mathbb{R}^{d_h}$); $i$, $f$, $o$, $\tilde{c}$, $h$, and $c$ respectively represent the input, forget, output, content, hidden, and cell states; $x$ represents the word embedding; and $\sigma$ and $\tanh$ are the sigmoid activation and the hyperbolic tangent activation, respectively.
We use $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ to denote the forward-pass (i.e., left-to-right) and the backward-pass (i.e., right-to-left) sequence representations encoded by the Bi-LSTM, respectively. Then, the representation of the entire sequential context obtained from the encoding layer can be expressed as follows:

$$
h_t = \left[\overrightarrow{h}_t \,\|\, \overleftarrow{h}_t\right],
$$

where the symbol $\|$ denotes the concatenation of $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$. To encode the sequence representations of support documents $S$, candidates $C$, and query $q$, it is desirable to use three independent Bi-LSTMs. Their outputs are $H_{s_i} \in \mathbb{R}^{l \times d}$, $H_{c_j} \in \mathbb{R}^{l \times d}$, and $H_q \in \mathbb{R}^{l \times d}$, respectively, where $i$ and $j$ are the indexes of the documents and the candidates, $l$ is the sequence length, and $d$ is the output dimension of the representations.
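As an illustration of this layer, here is a minimal PyTorch sketch; the class name, dimensions, and the randomly initialized embedding table (a stand-in for the pre-trained GloVe vectors) are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """Bi-LSTM encoder sketch for Section 3.2 (names and sizes are assumptions)."""
    def __init__(self, vocab_size: int, d_in: int = 300, d_hidden: int = 150):
        super().__init__()
        # The paper initializes embeddings from pre-trained GloVe; a randomly
        # initialized table stands in here for self-containedness.
        self.embed = nn.Embedding(vocab_size, d_in)
        self.bilstm = nn.LSTM(d_in, d_hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)  # (batch, l, d_in)
        h, _ = self.bilstm(x)      # forward/backward states concatenated
        return h                   # (batch, l, 2 * d_hidden) = (batch, l, d)

# Three independent encoders for supports S, candidates C, and query q:
enc_s, enc_c, enc_q = (SequenceEncoder(vocab_size=10000) for _ in range(3))
H_q = enc_q(torch.randint(0, 10000, (1, 12)))  # e.g., a 12-token query
```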
3.3. Heterogeneous Reasoning Graph
The concept of grandmother cells reveals that the brains of monkeys, like those of humans, contain neurons that are so specialized they appear to be dedicated to a single person, image, or concept. This amazing selectivity is uncovered in a single neuron, yet it must result from computation by a much larger network [5]. We heuristically consider that this procedure in multi-hop reading could be summarized in three steps:
The query (or the question) locates the related neurons at a low level, which then stimulate higher-level neurons to trigger computation;
The higher-level neurons begin to respond to increasingly broader portions of other neurons for reasoning, and to avoid a broadcast storm, informative selectivity takes place in this step;
At the top level, some independent neurons are responsible for the computations that occurred in step 2. We refer to these neurons as grandmother cells and expect them to provide the appropriate results corresponding to the query.
We attempt to imitate grandmother cells in our reading procedure and present our reasoning graph as consistently as possible with the three steps mentioned above. The heterogeneous reasoning graph, which is illustrated in Figure 2, simulates a heuristic chain of comprehension that starts from the subject entity in query $q$, goes through the reasoning entities in the supporting document set $S_q$, then through the mention entities in $S_q$ that are consistent with the candidate answers, and finally arrives at the candidates in set $C_q$ (referred to as the grandmother cells).
3.3.1. Node Definitions
To construct the graph, we define five different types of nodes, which are similar to neurons, and ten kinds of edges among the nodes [15,24].
Subject Nodes—The subject entity $s$ is given in the form of the query $q$. For example, the subject entity of the query sequence Where is the basketball team that Mike DiNunno plays for based? is certainly Mike DiNunno. We extract all the named entities that match $s$ from the documents and regard them as the subject nodes that open up the reading clues and trigger the further computations. The subject nodes are colored in gray in Figure 2.
Reasoning Nodes—In light of the requirements of multi-hop MRC, there are gaps between the subject entities and the candidates. To bridge the two and make the reasoning clues as complete as possible, we replenish those clues with the named entities and nominal phrases recognized from the documents that contain the question subjects and answer candidates. The reasoning nodes are colored in orange in Figure 2.
Mention Nodes—A series of candidate entities is given in $C_q$; they may occur multiple times within the document set $S_q$. As a result, we traverse the documents and extract the named entities corresponding to each candidate as mention nodes, which serve as the soft endpoints of the reasoning chains. It should be noted that the mention nodes participate in the semi-supervised learning process and are involved in the final answer prediction. The mention nodes are colored in green in Figure 2.
Support Nodes—As described in [5], we consider that multi-type representations may contribute to the reading process, and thus the support documents containing the above nodes are introduced into the graph as support nodes, which are colored in red in Figure 2.
Candidate Nodes—To imitate grandmother cells, we consider candidate nodes as the hard endpoints of the reasoning chains that gather relevant information from the heterogeneous reasoning graph. For each candidate answer that has at least one mention node in the documents, a candidate node is established as a grandmother cell to provide the final prediction. The candidate nodes are colored in blue in Figure 2.
3.3.2. Edge Definitions
To learn the entity relationships between different nodes, we define 10 kinds of edges between the nodes in the heterogeneous reasoning graph, inspired by the literature [24,26,35], as shown in Table 1.
3.3.3. Graph Construction
In the heterogeneous reasoning graph, the clue-reading chain is represented by the path from subject nodes through reasoning nodes and mention nodes to candidate nodes. The edges that connect entities across documents give the model the ability to transfer information between documents, while the edges attached to the support nodes are responsible for supplementing the multi-angle textual information from the documents. Furthermore, the edges from mention nodes to candidate nodes gather all the information of the mention nodes corresponding to each candidate and then pass their representations to the output layer to realize the imitation of grandmother cells.
Specifically, this multi-hop MRC process of clue-based reasoning starts with the subject node, connects reasoning nodes from the support documents, then connects the mention nodes as the soft endpoints of the clue chain, and finally connects the candidate nodes (grandmother cells) as the hard endpoints of the clue chain. For example, for the question Which country is the location of the United Nations Headquarters?, the answer candidate set includes China, France, UK, USA, and Russia. One correct and reasonable clue chain can be represented as Location of United Nations Headquarters (subject node) ↔ Manhattan ↔ New York City ↔ New York State ↔ USA (mention node) ↔ USA (candidate node). In practice, multiple clue chains are included within the heterogeneous reasoning graph, and under the constraints of the query, the selection of the soft and hard endpoints is required to output the final prediction. A simplified construction sketch follows.
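The sketch below, in plain Python, assembles node lists and typed edge groups in the spirit of this construction; the node-type and edge-type names are illustrative stand-ins rather than the paper's exact ten edge definitions:

```python
from collections import defaultdict

# Simplified heterogeneous graph construction (Section 3.3.3).
# Node/edge type names are illustrative, not the paper's exact notation.
NODE_TYPES = ("subject", "reasoning", "mention", "support", "candidate")

def build_reasoning_graph(subject_spans, reasoning_spans, mention_spans,
                          support_docs, candidates):
    nodes, edges = [], defaultdict(list)  # edges grouped by relation type

    def add_node(ntype, payload):
        nodes.append({"id": len(nodes), "type": ntype, "payload": payload})
        return len(nodes) - 1

    cand_ids = {c: add_node("candidate", c) for c in candidates}
    for doc_idx, doc in enumerate(support_docs):
        sup = add_node("support", doc_idx)
        for span in subject_spans.get(doc_idx, []):       # entities matching s
            v = add_node("subject", span)
            edges["support-subject"].append((sup, v))
        for span in reasoning_spans.get(doc_idx, []):     # bridging entities/phrases
            v = add_node("reasoning", span)
            edges["support-reasoning"].append((sup, v))
        for cand, span in mention_spans.get(doc_idx, []): # candidate mentions
            v = add_node("mention", span)
            edges["support-mention"].append((sup, v))
            # mention -> candidate edges let the "grandmother cell" gather evidence
            edges["mention-candidate"].append((v, cand_ids[cand]))
    return nodes, edges
```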
3.4. Heterogeneous Graph Attention Network for Multi-Hop Reading
3.4.1. Query-Aware Contextual Information
Following HDE [28], we use the co-attention and self-attention mechanisms [36] to combine the query contextual information and the documents. Moreover, this is applied to the other semantic representations that require reasoning consistent with the query. The query-aware support documents are calculated as follows:

$$
A = H_{s_i} H_q^{\top} \in \mathbb{R}^{l_{s_i} \times l_q},
$$

where $A$ is the similarity matrix of the two sequences, i.e., the $i$-th support document $H_{s_i} \in \mathbb{R}^{l_{s_i} \times d}$ and the query $H_q \in \mathbb{R}^{l_q \times d}$, and $d$ is the dimension of the context. Then, the query-aware representation of the support documents $\bar{H}_{s_i}$ is computed as follows:

$$
\bar{H}_{s_i} = \operatorname{softmax}(A)\, H_q \in \mathbb{R}^{l_{s_i} \times d}.
$$
To project the sequence into a fixed dimension and output the representation $\hat{h}_{s_i}$ of $\bar{H}_{s_i}$ for graph optimization, a self-attention is utilized to summarize the contextual information:

$$
\alpha = \operatorname{softmax}\left(w_2 \tanh\left(W_1 \bar{H}_{s_i}^{\top}\right)\right), \qquad \hat{h}_{s_i} = \alpha\, \bar{H}_{s_i},
$$

where $W_1$ and $w_2$ are learnable parameters.
In addition to the query-aware support documents, the co-attention and self-attention are applied to generate query-aware node representations from the other sequential representations, as sketched below.
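A minimal PyTorch sketch of the two mechanisms follows; the fusion layer and the attentive-pooling parameterization are assumptions consistent with the equations above, not the exact formulation of [28,36]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAwareSummary(nn.Module):
    """Co-attention + self-attentive pooling sketch (Section 3.4.1).
    Parameterization is an assumption, not the paper's exact formulation."""
    def __init__(self, d: int):
        super().__init__()
        self.proj = nn.Linear(4 * d, d)  # fuse document and attended query features
        self.w1 = nn.Linear(d, d)
        self.w2 = nn.Linear(d, 1)

    def forward(self, H_s: torch.Tensor, H_q: torch.Tensor) -> torch.Tensor:
        # H_s: (l_s, d) one support document; H_q: (l_q, d) the query
        A = H_s @ H_q.t()                    # (l_s, l_q) similarity matrix
        ctx_q = F.softmax(A, dim=-1) @ H_q   # query context for each document token
        fused = torch.tanh(self.proj(
            torch.cat([H_s, ctx_q, H_s * ctx_q, H_s - ctx_q], dim=-1)))
        # self-attentive pooling to a fixed-size node representation
        alpha = F.softmax(self.w2(torch.tanh(self.w1(fused))), dim=0)  # (l_s, 1)
        return (alpha * fused).sum(dim=0)    # (d,)
```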
3.4.2. Message Passing in the Heterogeneous Graph Attention Network
We present message passing in the heterogeneous graph attention network for reading over multiple relations among diverse nodes. The input of this module is a graph $\mathcal{G}$ and node representations $\{x_1, x_2, \dots, x_r\}$, where $r$ is the number of nodes. Initially, a shared weight matrix $W$ is applied to the nodes; then, the attention coefficients and the normalized node attention coefficients are computed as

$$
e_{ij} = \operatorname{LeakyReLU}\left(\mathbf{a}^{\top}\left[W x_i \,\|\, W x_j\right]\right), \qquad
\alpha_{ij} = \frac{\exp\left(e_{ij}\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(e_{ik}\right)},
$$

where $e_{ij}$ are the attention coefficients indicating the importance of the features of node $j$ to node $i$, and $\alpha_{ij}$ is normalized across all structural neighbors $\mathcal{N}_i$ of node $i$. The attention mechanism is responsible for selectivity with node interdependence, which enables us to show how the nodes take effect during the reasoning.
Considering the 10 different types of edges defined in Section 3.3.2, we model the relational edges based on the vanilla GAT [37]:

$$
h_i^{(l+1)} = \sigma\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^{r}} \alpha_{ij}^{rk}\, W_r^{k}\, h_j^{(l)}\right),
$$

where $h_i^{(l)}$ is the hidden state of node $i$ in the $l$-th layer, all the GAT layers are parameter-shared, $k$ is the index of the $k$-th head following [15,37], $\mathcal{R}$ is the set of all types of edges in $\mathcal{G}$, $\alpha_{ij}^{rk}$ are the normalized attention coefficients computed by the $k$-th attention mechanism with relation $r$, as presented in [37], and $\sigma(\cdot)$ is a nonlinear activation.
Message passing is a key component of our model. To echo the selectivity of grandmother cells, we use the attention mechanism to select (i.e., activate or deactivate) key node pairs in our reasoning graph, and we empirically regard this process as the reasoning that takes place during reading in the graph. A single-head sketch follows.
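The following single-head PyTorch sketch illustrates per-relation attention and aggregation; looping over relations (and the parameter layout) is a simplification of the multi-head formulation above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationalGATLayer(nn.Module):
    """Single-head sketch of per-relation GAT message passing (Section 3.4.2).
    A loop over relations replaces the multi-head formulation for clarity."""
    def __init__(self, d: int, num_relations: int):
        super().__init__()
        self.W = nn.ModuleList(nn.Linear(d, d, bias=False) for _ in range(num_relations))
        self.a = nn.ParameterList(nn.Parameter(torch.randn(2 * d)) for _ in range(num_relations))

    def forward(self, h: torch.Tensor, edges_by_rel: dict) -> torch.Tensor:
        # h: (num_nodes, d); edges_by_rel[r] is a list of (src, dst) index pairs
        out = torch.zeros_like(h)
        for r, edge_list in edges_by_rel.items():
            if not edge_list:
                continue
            src, dst = map(torch.tensor, zip(*edge_list))
            z = self.W[r](h)
            e = F.leaky_relu(torch.cat([z[dst], z[src]], dim=-1) @ self.a[r])
            # normalize attention over each destination node's incoming edges
            alpha = torch.zeros_like(e)
            for i in dst.unique():
                mask = dst == i
                alpha[mask] = F.softmax(e[mask], dim=0)
            out.index_add_(0, dst, alpha.unsqueeze(-1) * z[src])
        return F.elu(out)
```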
3.4.3. Gating Mechanism
A previous study [19] showed that GNNs suffer from the over-smoothing problem when many layers are stacked; thus, we mitigate this issue by applying question-aware [27] and general gating mechanisms [38] to optimize the procedure:

$$
\bar{q} = \frac{1}{m} \sum_{j=1}^{m} q_j, \qquad
g_i = \sigma\left(W_g\left[\bar{q} \,\|\, h_i\right] + b_g\right), \qquad
h_i^{\prime} = g_i \odot \tanh\left(h_i\right) + \left(1 - g_i\right) \odot h_i,
$$

where $q_j$ is the representation of the $j$-th query word given by a dedicated Bi-LSTM encoder to keep consistency with the dimension of the node features $h_i$, $j$ indicates the order of the query words, $m$ is the query length, $\sigma$ is a sigmoid function, and $\odot$ indicates element-wise multiplication. Then, the general gating mechanism is introduced as follows:

$$
z_i = \sigma\left(W_z h_i^{\prime} + b_z\right), \qquad
h_i^{(l+1)} = z_i \odot \tanh\left(W_h h_i^{\prime}\right) + \left(1 - z_i\right) \odot h_i^{(l)}.
$$
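For intuition, here is a PyTorch sketch of the question-aware gate; the mean-pooling of the query and the layer names are assumptions consistent with the equations above rather than the exact implementation of [27,38]:

```python
import torch
import torch.nn as nn

class QueryAwareGate(nn.Module):
    """Question-aware gating sketch (Section 3.4.3): the pooled query decides
    how much of each node's updated state to keep. An assumption-laden
    illustration; the exact form in [27,38] may differ."""
    def __init__(self, d: int):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)

    def forward(self, h_old: torch.Tensor, h_new: torch.Tensor,
                H_q: torch.Tensor) -> torch.Tensor:
        # h_old/h_new: (num_nodes, d); H_q: (m, d) query from a dedicated Bi-LSTM
        q_bar = H_q.mean(dim=0, keepdim=True).expand_as(h_old)   # pooled query
        g = torch.sigmoid(self.gate(torch.cat([q_bar, h_new], dim=-1)))
        return g * torch.tanh(h_new) + (1.0 - g) * h_old         # gated residual update
```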
3.5. Output Layer
After updating the node representations, we use two multilayer perceptrons, $f_C$ and $f_M$, to transform the node features into prediction scores. All the candidate nodes (grandmother cells) and mention nodes from $\mathcal{G}$ are employed to output the prediction score distribution $a$ as

$$
a = f_C\left(H_C\right) + \lambda \max_{v \in M_{c_j}} f_M\left(h_v\right),
$$

where the $\max$ operation takes the maximum mention node score over the mention nodes $M_{c_j}$ of each candidate $c_j$; then, the two parts are summed with the effect of a harmonic factor $\lambda$ to form the final prediction score distribution.
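A minimal PyTorch sketch of this scoring step follows; the function name, the `mention_to_cand` index map, and the mixing weight `lam` are hypothetical stand-ins for the symbols above:

```python
import torch
import torch.nn as nn

def predict_scores(h_cand: torch.Tensor, h_ment: torch.Tensor,
                   mention_to_cand: torch.Tensor,
                   f_C: nn.Module, f_M: nn.Module, lam: float = 0.5) -> torch.Tensor:
    """Output-layer sketch (Section 3.5): candidate-node scores plus the
    max-pooled mention scores, mixed by a harmonic factor `lam` (name assumed)."""
    s_cand = f_C(h_cand).squeeze(-1)   # (num_candidates,)
    s_ment = f_M(h_ment).squeeze(-1)   # (num_mentions,)
    pooled = torch.full_like(s_cand, float("-inf"))
    for j in range(s_cand.numel()):    # max over each candidate's mention nodes
        mask = mention_to_cand == j
        if mask.any():
            pooled[j] = s_ment[mask].max()
    # candidates without mentions contribute nothing from the mention branch
    pooled = torch.where(torch.isinf(pooled), torch.zeros_like(pooled), pooled)
    return s_cand + lam * pooled       # final prediction score distribution a
```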
5. Conclusions
We present ClueReader, a heterogeneous graph attention network for multi-hop MRC inspired by the concept of grandmother cells from cognitive neuroscience. The network contains several clue-reading paths that start from the subject of the question and end with candidate entities. We use reasoning and mention nodes to complete the process and support nodes to add supernumerary semantic information. We applied our methodology to QAngaroo, a multi-hop MRC dataset, and the official evaluations support the effectiveness of our model in both open-domain QA and the molecular biology domain. Several potential issues could be further addressed, such as introducing intermediate supervision signals during semi-supervised graph learning, enhancing the use of external knowledge, and devising a dedicated word-embedding methodology for the medical context, all of which could further improve model performance in multi-hop MRC tasks.