Article

Case-Based Deduction for Entailment Tree Generation

School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(18), 2893; https://doi.org/10.3390/math12182893
Submission received: 14 August 2024 / Revised: 12 September 2024 / Accepted: 15 September 2024 / Published: 17 September 2024
(This article belongs to the Special Issue Explainable and Trustworthy AI Models for Data Analytics)

Abstract

Maintaining logical consistency in structured explanations is critical for understanding and troubleshooting the reasoning behind a system’s decisions. However, existing methods for entailment tree generation often struggle with logical consistency, resulting in erroneous intermediate conclusions and reducing the overall accuracy of the explanations. To address this issue, we propose case-based deduction (CBD), a novel approach that retrieves cases with similar logical structures from a case base and uses them as demonstrations for logical deduction. This method guides the model toward logically sound conclusions without the need for manually constructing logical rule bases. By leveraging a prototypical network for case retrieval and information entropy for reranking the retrieved cases, CBD introduces diversity to improve in-context learning. Our experimental results on the EntailmentBank dataset show that CBD significantly improves entailment tree generation, achieving performance improvements of 1.7% in Task 1, 0.6% in Task 2, and 0.8% in Task 3 under the strictest Overall AllCorrect metric. These findings confirm that CBD enhances the logical consistency and overall accuracy of AI systems in structured explanation tasks.

1. Introduction

Neural networks have demonstrated exceptional performance across various tasks [1]. However, these models often remain opaque to human understanding, prompting researchers to explore methods for providing more interpretable explanations. Three main approaches have emerged: highlighting relevant features in the input data to explain decisions [2], generating natural language explanations alongside predicted answers [3], and employing structured reasoning processes [4].
Our work focuses on leveraging structured reasoning processes to explain decisions in a clear and understandable manner. Traditionally, structured explanations have been constructed using reasoning chains, as demonstrated in earlier studies [2,5]. More recent research has explored entailment trees as an alternative method for generating explanations. Entailment trees offer a more structured and manageable way of conducting reasoning, making them a promising tool for explanation generation [6,7,8]. Entailment trees, as proposed by [6], illustrate the logical connections between multiple premises, systematically organizing the reasoning chains that underpin question-answering tasks. This approach provides a transparent means of understanding how hypotheses are derived from a series of entailment steps based on multiple premises.
Previous methods for constructing entailment trees typically employed a single, sequential process that generated the entire tree at once [9]. For instance, the sequential form of the entailment tree in Figure 1b can be represented as “$sent_1 \,\&\, sent_6 \rightarrow int_1$: Chocolate is a kind of solid substance; $int_1 \,\&\, sent_2 \rightarrow int_2$: Chocolate in solid state has definite shape; $\ldots$; $int_2 \,\&\, int_4 \rightarrow$ Hypothesis”. However, this approach often struggled to generate complete and accurate entailment trees, leading to unreliable or hallucinatory results. To mitigate these issues, recent studies [7,8,10,11] have adopted an iterative approach, where one-step entailments are generated first and then iteratively expanded to form the entire tree. Despite these improvements, a common challenge persists: ensuring logical consistency within individual steps. As shown in Figure 1c, given the sentences “Chocolate is usually a solid” and “Chocolate is a kind of substance”, the correct entailment is “Chocolate is a kind of solid substance”. However, a pretrained language model might incorrectly predict that “Substance is usually a solid”, highlighting the issue of logical inconsistency.
To address these challenges, researchers have explored symbolic methods [12,13,14] known for their logical reasoning capabilities. However, these methods face difficulties in enumerating all first-order logical rules due to the diverse nature of natural language expressions, and constructing rule bases through human engineering is time-consuming. A promising alternative is case-based learning [15,16,17,18,19], where systems operate on the principle that similar problems have similar solutions. Case-based reasoning is a method in artificial intelligence where solutions to new problems are derived by referring to similar, previously solved cases. By leveraging past cases to solve new problems, case-based reasoning allows for guiding logical generation using patterns from past gold cases without the need for explicit rule induction.
The case-based reasoning framework typically follows a retrieve-reuse-refine paradigm [20]. A key challenge in this approach is retrieving similar cases, as logical patterns can manifest in various natural language forms. To address this, we propose using prototypical networks [21] to represent logical patterns as prototype embeddings. Prototypical networks are a type of neural network that learns a metric space in which classification is performed by computing the distance to prototype representations of each class. This method captures the essence of logical patterns, enhancing the retrieval process within the case-based reasoning framework. Our research focuses on three prevalent logical patterns: conjunction, if–then, and substitution. The prototypical network generates three prototype embeddings, one for each pattern, by averaging the embeddings of cases in each category. During reasoning, these prototype embeddings help identify the logical pattern of the current step, narrowing the retrieval scope within our proposed framework, CBD.
Once similar cases are retrieved, they are used as demonstrations for in-context learning. In-context learning benefits from diverse demonstrations [22,23,24] and is sensitive to their order [25,26]. To enhance diversity, we incorporate information entropy [27], a measure from information theory used to quantify the diversity of information in a set, which is applied here to rerank the retrieved cases. We then select the top-n cases as demonstrations. Inspired by curriculum learning [28], we organize the demonstrations by difficulty, arranging them from easy (low information entropy) to difficult (high information entropy). The deduction process is carried out iteratively until the target hypothesis is proven or the maximum deduction depth is reached. To facilitate iterative generation, we utilize the controller from MetGen [8].
Our contributions are as follows:
  • Case-based learning for step generation: We implement case-based learning to eliminate the need for manual construction of logical rules. The retrieved cases are utilized as demonstrations for in-context learning, effectively guiding large language models in logical text generation.
  • Prototypical networks for inducing abstract logical structures: To improve the efficiency of identifying cases with similar logical structures, we employ a prototypical network to capture implicit logical patterns. This enables more targeted case retrieval and provides insights into logical pattern attribution.
  • Integration of information entropy to enhance diversity: We incorporate information entropy to increase the diversity of retrieved cases, enhancing the model’s ability to learn underlying logical patterns and improving in-context learning.
The rest of this paper is organized as follows: Section 2 defines the problem and provides background knowledge on case-based reasoning and prototypical networks. Section 3 describes the architecture and implementation of CBD, detailing the case-based deduction process for entailment tree generation. Section 4 outlines our experimental setup, including the benchmark dataset, baseline models, evaluation metrics, and implementation details. Section 5 presents our results, discusses their implications, and highlights the advantages of CBD over existing methods. Section 6 reviews related work in entailment tree generation and explanation for question-answering tasks, situating our contributions within the broader research context. Section 7 discusses the limitations of our approach. Finally, Section 8 concludes the paper by summarizing our contributions and discussing potential directions for future work.

2. Background

2.1. Task Definition

Entailment tree generation, as illustrated in Figure 1, involves creating an entailment tree from a hypothesis ($hyp$), which is a declarative form of a question and answer, and a set of natural language sentences $K = \{s_1, s_2, \ldots, s_n\}$, referred to as facts. Here, $n$ represents the number of candidate facts. Entailment tree generation is a process that organizes reasoning into steps to demonstrate how a conclusion is derived from a set of premises. The objective is to generate a valid entailment tree $T_{pred}$, where the hypothesis ($hyp$) serves as the root node, the leaves are facts selected from $K$, and the intermediate nodes consist of novel intermediate facts ($int_m$) representing conclusions generated by the model. The tree $T_{pred}$ is considered valid if each nonleaf node is a valid entailment of its immediate children. The annotated gold tree is denoted as $T_{gold}$, with its leaf facts forming $S_{gold}$. Following [6], we consider three increasingly challenging tasks based on the scope of candidate facts $K$ (a minimal representation sketch follows the task list):
  • Task 1 (no-distractor): $K$ contains only the gold standard facts ($K_{gold}$).
  • Task 2 (distractor): $K$ includes $K_{gold}$ along with 15–20 distractor facts, challenging the model’s ability to distinguish relevant information.
  • Task 3 (full-corpus): This scenario uses the entire corpus as $K$, presenting the most comprehensive and challenging setting for entailment tree generation. In our experiments, the entire corpus refers to the WorldTree corpus.
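To make the setting concrete, the following is a minimal sketch, under our own illustrative assumptions (the `Step` and `EntailmentTree` classes are not part of EntailmentBank’s API), of how a tree instance could be represented:

```python
# Illustrative representation of an entailment tree instance: the hypothesis,
# the candidate facts K, and the steps deriving intermediates from premises.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Step:
    premises: List[str]    # identifiers of the children, e.g., ["sent1", "sent6"]
    conclusion_id: str     # identifier of the new intermediate, e.g., "int1"
    conclusion: str        # its natural language text

@dataclass
class EntailmentTree:
    hypothesis: str                # declarative form of question + answer
    facts: Dict[str, str]          # candidate facts K: {"sent1": "...", ...}
    steps: List[Step] = field(default_factory=list)

    @property
    def leaves(self) -> set:
        """Leaf facts actually selected from K (premises named sent*)."""
        return {p for s in self.steps for p in s.premises if p.startswith("sent")}
```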

2.2. Case-Based Reasoning

Case-based reasoning (CBR) is a well-established problem-solving approach where new problems are addressed by adapting solutions from past cases that share similar characteristics [16,17,29]. It follows the retrieve-reuse-refine paradigm [30], which has been widely applied in fields such as legal question answering [31] and numerical reasoning [32], where reasoning from past experiences proves valuable for addressing new challenges [33]. CBR mimics human learning through analogies, functioning similarly to analogical reasoning [34]. Numerous studies [35,36,37,38] have demonstrated the effectiveness of CBR in various domains. In knowledge base question answering (KBQA), CBR has been applied to predict answers in a Knowledge Base (KB) given a query node and a relation [35,36,37]. Additionally, CBR has shown promise in semantic parsing tasks, where logical forms of similar cases are retrieved and passed through seq2seq models along with a target query to generate semantic outputs [39,40,41].
Building on prior CBR research [20,29,42], our work, CBD, extends the principles of CBR to the task of entailment tree generation. Rather than merely solving new problems, CBD retrieves cases with similar logical structures from a preannotated case base and adapts them as demonstrations for generating logical entailment steps. This approach ensures that the generated entailment trees maintain both structural and logical consistency, making it particularly suitable for complex, multistep reasoning tasks. CBD introduces unique contributions such as logical pattern retrieval and the integration of innovative techniques like prototypical networks and information entropy-based reranking, which we detail in the Methodology section. To our knowledge, this is the first application of case-based reasoning to entailment tree generation.

2.3. Prototypical Networks

Prototypical networks (ProtoNet) [21] pioneered the introduction of prototypes into deep learning. This approach suggests that each category can be succinctly represented by a single prototype within the feature space. ProtoNet computes prototype vectors by averaging instance vectors within each category, leveraging metric-based comparisons between prototypes and query instances to make predictions [43,44,45]. For instance, Yue et al. [46] explored cross-domain instance-to-prototype matching for Unsupervised Domain Adaptation (UDA), while Pan et al. [47] used prototype learning to bridge domain gaps and construct classifiers in the target domain for UDA. The effectiveness of prototype-based models highlights their ability to capture class-level semantic features through representative prototype embeddings of instances from the same classes.
Given the intrinsic similarity between prototypes and the underlying logical patterns of entailment steps, it is both natural and effective to introduce prototypes into case-based learning for modeling logical patterns. Researchers [43,48,49,50] have further explored the interpretability of prototypical networks. By applying this concept to text classification, they demonstrated how prototypical texts can aid in interpreting predictions. Building on this foundation, we employ a prototypical network to capture abstract logical patterns, which not only facilitates a more focused search for relevant cases but also provides valuable insights into the attribution of logical patterns, thereby contributing to clearer decision-making processes.

3. Case-Based Deduction

The objective of CBD is to construct an entailment tree based on a set of knowledge $K = \{s_1, s_2, \ldots, s_n\}$ and a declarative hypothesis $hyp$, which combines a question and its correct answer. For example, given the hypothesis “the shape of the chocolate changes when the chocolate melts”, our task is to build an explanation that justifies $hyp$ by selecting and combining pairs of sentences to generate new intermediate conclusions, thereby forming an entailment tree. This is achieved through a case-based reasoning approach, which involves three main phases:
  • Retrieval Phase: Relevant cases are retrieved to serve as demonstrations, ensuring the logical consistency of the generated results. A prototypical network predicts the logical pattern of promising steps, categorizing them into one of three types: if–then, substitution, or conjunction relations. After determining the logical pattern, cosine similarity is used to identify the most relevant cases for the current fact pairs.
  • Reuse Phase: To mitigate bias toward entities that might result from the cosine similarity method and to encourage diversity among the selected cases, we integrate information entropy into the ranking procedure. Retrieved cases are then ranked based on their information entropy scores.
  • Refine Phase: The ordered cases are concatenated and used as demonstrations to guide a pretrained language model for logical text generation.
Through iterative application of this process, the entire entailment tree is constructed. Figure 2 provides a comprehensive overview of the CBD framework. Further details are discussed in the following sections.
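The following sketch summarizes this loop in code. It is a schematic reading of the framework rather than the released implementation; the callables (`controller`, `predict_pattern`, `retrieve`, `rerank`, `deduce`, `entails`) stand in for the components detailed in Sections 3.1, 3.2 and 3.3.

```python
# Schematic CBD loop under the retrieve-reuse-refine paradigm (illustrative).
def cbd_generate_tree(hyp, facts, case_base, controller,
                      predict_pattern, retrieve, rerank, deduce, entails,
                      max_depth=5):
    state = list(facts)   # available premises: facts plus generated intermediates
    steps = []            # accumulated (premise pair, conclusion) entailment steps
    for _ in range(max_depth):
        pair = controller(hyp, state)               # promising step (MetGen controller [8])
        if pair is None:
            break                                   # no further step can be proposed
        pattern = predict_pattern(pair)             # ProtoNet: conj / if / sub
        cases = retrieve(pair, case_base[pattern])  # retrieve: top-alpha by cosine similarity
        demos = rerank(cases)                       # reuse: top-beta by entropy, easy to hard
        conclusion = deduce(pair, demos)            # refine: in-context LLM generation
        steps.append((pair, conclusion))
        state.append(conclusion)
        if entails(conclusion, hyp):                # stop once the hypothesis is proven
            break
    return steps
```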

3.1. Retrieve Phase

Using the controller proposed by Hong et al. [8], we identify promising steps in the reasoning process. For instance, the sentences “Chocolate is usually a solid” and “Chocolate is a kind of substance” exemplify such a promising step. The case-based deduction module then generates intermediate conclusions for each identified step. This process begins by retrieving analogous cases that follow the same logical pattern, leveraging a prototypical network to capture these implicit patterns. The prototypical network helps recognize abstract logical patterns in the data, which ensures that retrieved cases align with the input logical structures. By doing so, the network facilitates the retrieval of cases that are more semantically relevant and helps improve the logical consistency of intermediate conclusions. We categorize the types of inferences into three prevalent categories, as shown in Table 1. The case base, denoted as $C$, consists of three logical categories: $C_{conj}$, $C_{if}$, and $C_{sub}$.
To encode the semantics of promising steps into embeddings, we use the DeBERTa model [51], chosen for its effectiveness in capturing sentence semantics. Although other neural architectures, such as recurrent neural networks (RNNs), could also be utilized for this purpose, DeBERTa is our preferred choice. After computing the sentence pair embeddings within the case base $C$, we use a prototypical network to generate a prototype for each logical pattern based on these embeddings. By measuring the distance between the embedding of a promising step and each prototype, we can accurately determine the logical pattern corresponding to that step.
Given a promising step $S_{pair} = (s_1, s_2)$, where $s_1, s_2 \in K$ are represented as $s_1 = \{w_1, \ldots, w_k\}$ and $s_2 = \{w'_1, \ldots, w'_{k'}\}$, we apply the DeBERTa model to encode the sentence pair into a continuous low-dimensional embedding $H_{pair}$, capturing the semantics of the sentence pair. For the input tokens, we concatenate the two sentences using the special token [SEP], forming $S_{pair} = \{[\mathrm{CLS}], w_1, \ldots, w_k, [\mathrm{SEP}], w'_1, \ldots, w'_{k'}, [\mathrm{SEP}]\}$. The embedding is computed as follows:
$$H_{pair} = \mathrm{DeBERTa}_{\phi_{emb}}(S_{pair}),$$
where $\phi_{emb}$ refers to the embedding layer of DeBERTa, which maps discrete words into continuous embeddings. The output layer is represented as $\{h_{pair}, h_1, \ldots, h_k, h_{[\mathrm{SEP}]}, h'_1, \ldots, h'_{k'}\}$. The vector representation of the sentence pair, corresponding to the token [CLS], is denoted as $h_{pair}$.
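In code, this step maps directly onto the Huggingface transformers API; the following minimal sketch assumes the `microsoft/deberta-base` checkpoint (the exact checkpoint used is not specified here):

```python
# Encoding a premise pair into h_pair via DeBERTa's [CLS] token.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
encoder = AutoModel.from_pretrained("microsoft/deberta-base")

s1 = "Chocolate is usually a solid."
s2 = "Chocolate is a kind of substance."

# Tokenizing a sentence pair yields [CLS] s1 [SEP] s2 [SEP].
inputs = tokenizer(s1, s2, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

h_pair = outputs.last_hidden_state[:, 0]  # [CLS] embedding, shape (1, 768)
```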
In our approach, the prototypical network represents each logical pattern with a single vector, known as a prototype. The logical patterns $L = \{conj, if, sub\}$ correspond to conjunction, if–then, and substitution. We calculate prototypes by averaging all sentence pair embeddings within the annotated entailment step set for each category:
$$\bar{h}_{conj} = \frac{1}{N_{conj}} \sum_{\ell=1}^{N_{conj}} h_{conj}^{\ell}, \qquad \bar{h}_{if} = \frac{1}{N_{if}} \sum_{\ell=1}^{N_{if}} h_{if}^{\ell}, \qquad \bar{h}_{sub} = \frac{1}{N_{sub}} \sum_{\ell=1}^{N_{sub}} h_{sub}^{\ell},$$
where $\bar{h}_{conj}$ is the prototype embedding for the conjunction pattern, and $h_{conj}^{\ell}$ represents the $\ell$-th sentence pair embedding within the conjunction category. The same process is applied to calculate the prototypes for the other two logical categories: $\bar{h}_{if}$ for the if–then pattern and $\bar{h}_{sub}$ for the substitution pattern.
Next, we calculate the probabilities $\hat{p}$ for predicting the logical pattern of the promising step $x = (s_o, s_t)$ using the following equations:
$$h_x = \mathrm{DeBERTa}_{\phi_{emb}}(x)_{[\mathrm{CLS}]}, \qquad D_{euc}^{i} = \sqrt{\sum_{m=1}^{d} \big([h_x]_m - [\bar{h}_i]_m\big)^2}, \qquad \hat{p}_i = \frac{\exp(-D_{euc}^{i})}{\sum_{j=1}^{|L|} \exp(-D_{euc}^{j})},$$
where $h_x$ is the vector representation of the promising step, $\bar{h}_i$ is the prototype embedding for the $i$-th logical category, and $D_{euc}^{i}$ is the Euclidean distance between the two vectors. $\hat{p}_i$ represents the probability of the $i$-th logical category, and the category attaining $\hat{p} = \max_i \hat{p}_i$ is taken as the prediction. Euclidean distance is used here, following findings from [21], where it was shown to outperform other distance metrics.
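A compact sketch of this prototype construction and pattern prediction, assuming pair embeddings computed as in the previous snippet:

```python
# Prototype construction and logical pattern prediction (cf. the equations above).
import torch
import torch.nn.functional as F

def build_prototype(case_embeddings: torch.Tensor) -> torch.Tensor:
    """Average the (N, d) sentence-pair embeddings of one logical category."""
    return case_embeddings.mean(dim=0)

def predict_pattern(h_x: torch.Tensor, prototypes: torch.Tensor) -> int:
    """h_x: (d,) step embedding; prototypes: (|L|, d) stacked [conj, if, sub].
    Returns the index of the most probable logical pattern."""
    d_euc = torch.cdist(h_x.unsqueeze(0), prototypes).squeeze(0)  # Euclidean distances
    p_hat = F.softmax(-d_euc, dim=-1)  # softmax over negative distances
    return int(p_hat.argmax())
```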
The loss function is then defined as
$$\mathcal{L} = -\sum_{i=1}^{N} t_i \log \hat{p}_i,$$
where $t_i$ is the ground-truth indicator of the $i$-th sample (1 for the correct category and 0 otherwise), and $N$ is the total number of samples. We minimize the cross-entropy loss between the prediction and the ground truth. Through the prototypical network, we narrow the retrieval scope to focus on cases with analogous logical patterns. For instance, as shown in Figure 2, the promising step $(s_1, s_6)$ exemplifies a conjunction pattern. Once the logical pattern is determined, we proceed to gather similar cases for demonstration.
We hypothesize that employing analogous single-step deductions can effectively assist a large language model in performing logical deductions. Through these illustrative examples, the model can better comprehend underlying logical patterns, enabling it to extrapolate from these patterns and generate logical outcomes for promising steps. Specifically, we retrieve cases from the corresponding case base (e.g., $C_{conj}$) based on the logical pattern identified in the previous phase. The top-$\alpha$ cases are selected based on their cosine similarity:
$$\mathrm{sim}(h_x, h_{c_i}) = \frac{h_x \cdot h_{c_i}}{\|h_x\| \, \|h_{c_i}\|},$$
where $h_x$ represents the vector representation of the promising step, and $h_{c_i}$ represents the vector representation of the $i$-th case in the matched logical category.
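Retrieval then reduces to a top-$\alpha$ nearest-neighbor search; a minimal sketch, assuming precomputed embedding tensors:

```python
# Top-alpha case retrieval by cosine similarity within the matched category.
import torch
import torch.nn.functional as F

def retrieve_top_alpha(h_x: torch.Tensor, case_embs: torch.Tensor, alpha: int = 10):
    """h_x: (d,) query embedding; case_embs: (N, d) embeddings of one category.
    Returns the indices of the alpha most similar cases."""
    sims = F.cosine_similarity(h_x.unsqueeze(0), case_embs, dim=-1)  # (N,)
    return sims.topk(min(alpha, case_embs.size(0))).indices
```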

3.2. Reuse Phase

Cosine similarity tends to emphasize entities within the query, often resulting in retrieval outcomes dominated by cases with similar entities. This homogeneity can limit the model’s performance in in-context generation. Studies [25,26] have shown that model performance can vary significantly based on the selection of in-context examples, with diverse demonstrations leading to better outcomes. To address this limitation, we apply an information entropy-based reranking mechanism to prioritize and rank the retrieved cases, introducing greater diversity into the results. The information entropy-based reranking ensures that the model encounters a broader range of distinct cases within the same logical pattern. By diversifying the retrieved cases, this approach enhances in-context learning, enabling the model to better learn from these varied demonstrations. This ultimately improves the model’s performance in generating logically consistent entailment trees.
The information entropy of a sentence $s = \{w_1, w_2, \ldots, w_k\}$ is calculated as follows:
$$H(s) = -\sum_{i=1}^{k} p(w_i) \log_2 p(w_i),$$
where $p(w_i)$ denotes the probability of each word $w_i$, calculated based on its frequency in the case base. Higher entropy indicates a sentence with more information content. Previous studies [52,53] have demonstrated that diverse demonstrations significantly improve the performance of large language models.
After retrieving a set of cases $\{c_1, c_2, \ldots, c_\alpha\}$ using cosine similarity, we rerank them based on information entropy to enhance diversity. The information entropy of each case $c_i$ is calculated as
$$IE_{c_i} = H(s_1) + H(s_2) + H(int).$$
We then select the top-$\beta$ cases with the highest information entropy scores. The information entropy score of a case provides insights into its complexity, indicating the amount of information it contains. Recognizing that in-context learning is sensitive to the order of demonstrations, we organize the filtered cases by difficulty, arranging them from easy (low entropy) to difficult (high entropy), following a curriculum learning approach [28].
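A minimal sketch of this reranking, assuming whitespace tokenization and frequency-based word probabilities (both simplifying assumptions):

```python
# Information entropy-based reranking of retrieved cases (Section 3.2).
import math
from collections import Counter

def word_probs(case_base_sentences):
    """Estimate p(w) from word frequencies over all case base sentences."""
    counts = Counter(w for s in case_base_sentences for w in s.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def entropy(sentence, p, eps=1e-9):
    """H(s) = -sum p(w_i) log2 p(w_i) over the words of the sentence."""
    return -sum(p.get(w, eps) * math.log2(p.get(w, eps)) for w in sentence.lower().split())

def rerank_by_entropy(cases, p, beta=3):
    """cases: list of (s1, s2, int) triples. Keep the top-beta by total entropy,
    then order them from easy (low entropy) to difficult (high entropy)."""
    score = lambda c: sum(entropy(s, p) for s in c)
    top = sorted(cases, key=score, reverse=True)[:beta]
    return sorted(top, key=score)
```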

3.3. Refine Phase

In the refine phase, the ordered cases are concatenated and used as demonstrations to guide a pretrained language model for logical text generation. We use the Llama 3 [54] model as the backbone. Each selected case (i.e., $s_1 + s_2 \rightarrow int$) is translated into a natural language format as follows: $s_1$ [AND] $s_2$ [ENTAIL] $int$. Through in-context learning, the LLM generates logical text based on these demonstrations. Figure 3 shows an example prompt for case-based deduction using five cases.
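The prompt assembly itself is straightforward; the sketch below is illustrative, and the exact formatting is the one shown in Figure 3:

```python
# Assembling the in-context prompt from ordered demonstration cases.
def build_prompt(demos, s1, s2):
    """demos: ordered list of (s1, s2, int) cases; s1, s2: the promising step.
    The model is expected to complete the conclusion after the final [ENTAIL]."""
    lines = [f"{d1} [AND] {d2} [ENTAIL] {d_int}" for d1, d2, d_int in demos]
    lines.append(f"{s1} [AND] {s2} [ENTAIL]")
    return "\n".join(lines)
```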
Following this case-based deduction process, the controller facilitates reasoning. Since entailment trees are built iteratively, with each step expanding the reasoning space, we employ beam search for effective reasoning. The process continues until the target hypothesis is validated or the maximum reasoning depth is reached.

4. Experiment

4.1. Benchmark Dataset

We conducted experiments using the EntailmentBank dataset [6], which contains 1840 expert-annotated entailment trees. Each tree corresponds to a hypothesis derived from a question and its correct answer in the ARC dataset [55]. The leaf facts of these trees are selected from the WorldTreeV2 corpus [56]. The statistics for the EntailmentBank dataset are detailed in Table 2.
We chose EntailmentBank because it offers a structured entailment reasoning task, where multistep reasoning is crucial to constructing logically consistent entailment trees. This dataset provides a well-established benchmark for evaluating models that generate logical entailments and intermediate reasoning steps, aligning with the capabilities of our proposed method. EntailmentBank’s tasks involve reasoning from multiple premises to conclusions, which allows us to test CBD’s ability to retrieve, reuse, and refine logical patterns, making it an ideal dataset for assessing multistep reasoning models.
Following [6], three explanation tasks are defined based on a hypothesis (question + answer), each with increasing difficulty: (a) generating a valid entailment tree given all relevant sentences (the leaves of the gold entailment tree), (b) generating the tree given all relevant sentences along with some irrelevant ones, or (c) generating the tree given a full corpus. The objective is to produce explanations in the form of entailment trees that represent a sequence of multipremise entailment steps from known facts through intermediate conclusions to the final hypothesis (the question + answer).
The case base used in our study comprises the entailment steps from the training dataset. We utilized the manually annotated dataset from [8], which includes 400 annotated reasoning steps in the training split (Train-manual) and 275 steps in the development split (Dev-manual). Table 3 presents the statistics of the logical pattern annotations. These entailment steps are categorized into three types: conjunction, if–then, and substitution patterns. We used the training dataset to train the prototypical network and the development dataset to evaluate and optimize its performance.

4.2. Baseline Models

To evaluate the effectiveness of our proposed method, we conducted a comparative analysis between CBD and several existing methods:
  • EntailmentWriter [6]: This method generates the entire entailment tree “all at once”, directly producing linearized trees. Hong et al. [8] extended this approach by developing EntailmentWriter-Iter, which iteratively generates individual steps of the tree and then concatenates them to form the final tree.
  • IRGR [10]: This method designs an iterative retrieval-generation framework that enhances retrieval outcomes. It iteratively retrieves premises and constructs step-by-step explanations from textual evidence.
  • MetGen [8]: This method employs a reasoning controller and a T5 model to independently select premises and generate conclusions. For a fair comparison, we also experimented with replacing the T5 model with Llama 3 [54].
In our comparison, we integrated CBD with the controller from MetGen to effectively select steps, which then facilitated the generation of logical intermediate conclusions.

4.3. Evaluation Metrics

We evaluate entailment trees in a two-step process. First, nodes in the predicted tree $T_{pred}$ are aligned with nodes in the gold tree $T_{gold}$ using the $sent$ labels and Jaccard similarity for intermediate nodes, aiming to capture semantic equivalences rather than exact matches. Next, we assess the generated tree by comparing it to the gold tree across three dimensions, following established methodologies from previous studies [6,7,8]:
  • Leaves: This dimension evaluates the use of correct leaf facts. We calculate the $F_1$ score by comparing the predicted leaf facts with those in the gold tree. Additionally, we report the AllCorrect score, which indicates exact matches. The AllCorrect score is 1 when the $F_1$ score is 1, and 0 otherwise.
  • Steps: This dimension assesses the structural correctness of entailment steps. We compare the steps in both trees using the $F_1$ and AllCorrect scores. A predicted step is considered correct if its children’s identifiers perfectly match those in the gold tree.
  • Intermediates: This dimension measures the accuracy of intermediate conclusions, also using the $F_1$ and AllCorrect scores. A predicted intermediate conclusion is considered correct if the BLEURT-Large-512 [57] score between the aligned predicted intermediate and the corresponding gold intermediate is greater than 0.28. Using BLEURT allows us to account for semantic equivalence between the generated and gold conclusions beyond surface-level matching, which is critical for assessing complex multistep reasoning tasks.
Finally, we employ a strict metric called Overall AllCorrect. This score is 1 only if all leaves, steps, and intermediates in the generated tree are correct, meaning the tree exactly matches $T_{gold}$. Any error in the tree results in a score of 0 under this metric.
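As an illustration, the Leaves scores reduce to the following computation (a sketch under the assumption that node alignment has already been performed):

```python
# Leaves F1 and AllCorrect scores for aligned predicted/gold leaf sets.
def leaves_scores(pred_leaves: set, gold_leaves: set):
    tp = len(pred_leaves & gold_leaves)
    precision = tp / len(pred_leaves) if pred_leaves else 0.0
    recall = tp / len(gold_leaves) if gold_leaves else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    all_correct = 1 if f1 == 1.0 else 0  # exact match only when F1 is 1
    return f1, all_correct
```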
We chose these evaluation metrics because they directly measure both the structural and semantic accuracy of the generated entailment trees. By using a combination of F1 and AllCorrect scores, we ensure that the model is evaluated on both partial correctness and exact matches, providing a nuanced view of its performance. Additionally, the inclusion of BLEURT-Large-512 for intermediate conclusions allows us to capture semantic similarity, which is essential for assessing the quality of multistep reasoning. These metrics have been widely used in prior work, ensuring that our evaluation is comparable with other models in the field and that our method’s improvements are consistently measured across all reasoning dimensions.

4.4. Implementation Details

All models and baselines were implemented using the PyTorch framework [58] and Huggingface transformers [59]. During the logical generation in the refine phase, we used Llama 3 [54] as the pretrained language model. For Task 1, we iterated until all facts in $K$ were used. For Tasks 2 and 3, we applied the same settings as MetGen to ensure a fair comparison, using a fact score threshold of 0.001 to filter out distractors and a maximum reasoning depth of 5. We selected the top 10% of steps for each state and set the beam size to 10. The prototypical network architecture is based on the DeBERTa model [51], with the final layer producing a 768-dimensional embedding for each case. The network was trained for 30 epochs with a batch size of 8, minimizing the loss function with the AdamW optimizer [60] at an initial learning rate of 2 × 10⁻⁵.
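A minimal training-setup sketch matching these hyperparameters follows; the `model` and dataloader interfaces are our own assumptions, not the released code:

```python
# Training setup: AdamW, lr 2e-5, batch size 8 (via the dataloader), 30 epochs.
import torch.nn.functional as F
from torch.optim import AdamW

def train_protonet(model, loader, epochs=30, lr=2e-5, device="cuda"):
    """model maps a batch of encoded sentence pairs to logits over the three
    logical patterns (e.g., negative distances to the prototypes)."""
    optimizer = AdamW(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):
        for inputs, labels in loader:
            logits = model(inputs.to(device))
            loss = F.cross_entropy(logits, labels.to(device))  # the loss L above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```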

5. Results and Analysis

In this section, we evaluate the performance of CBD and provide a comprehensive analysis of the results, including the impact of varying the number of selected cases.

5.1. Main Results

The evaluation results, summarized in Table 4, show that multistep methods (IRGR, EntailmentWriter-Iter, and MetGen) outperform the single-step method (EntailmentWriter), highlighting the effectiveness of multistep approaches in selecting premises and generating intermediate conclusions that closely align with the ground truth. When comparing CBD with MetGen for intermediate conclusion generation, we observe a notable performance improvement, particularly in Task 1. Specifically, under the strict Overall AllCorrect metric, CBD shows performance gains of 1.7% in Task 1, 0.6% in Task 2, and 0.8% in Task 3. A key advantage of CBD is its use of a prototypical network to recognize abstract logical patterns in promising steps. Unlike traditional methods that often focus on surface-level similarities, the prototypical network allows CBD to capture the underlying logical patterns, which is particularly valuable in retrieving similar cases. By leveraging these similar cases, CBD ensures that the intermediate conclusions maintain logical consistency, contributing to higher performance in the Intermediates AllCorrect metric, where it achieves a top score of 42.1% on Task 1. In contrast, methods like EntailmentWriter and IRGR, which lack the capability to abstract these logical patterns, produce less accurate intermediate conclusions.
EntailmentWriter and EntailmentWriter-Iter, particularly the single-step EntailmentWriter, struggle to generate consistent intermediate conclusions, as they produce the entire entailment tree “all at once”. This makes it difficult to maintain logical coherence across multiple reasoning steps. In contrast, CBD’s multistep reasoning, supported by case-based learning, leads to a more accurate generation of intermediate conclusions. While MetGen benefits from its reasoning controller and performs well across most tasks, CBD outperforms it in both the Intermediates and Overall AllCorrect metrics. The integration of the prototypical network and information entropy-based reranking in case-based reasoning enables CBD to demonstrate strong performance, effectively handling a wide range of reasoning complexities and excelling across diverse tasks.

5.2. Ablation Study

In the ablation study, we evaluate the effectiveness of three key components of CBD: the prototypical network (ProtoNet), ranking by information entropy (IE), and curriculum reasoning (Curriculum). The results, based on experiments conducted on Task 1, are presented in Table 5. First, we assess the model without ProtoNet, which directly searches the case base without distinguishing among different logical patterns. This version shows a significant performance drop, underscoring the crucial role of ProtoNet in selecting relevant cases. Next, we examine the model without reranking by information entropy scores. While this model maintains the same architecture as the full CBD, it ranks analogical cases solely based on cosine similarity during the reuse phase. The observed performance degradation highlights the importance of information entropy in introducing diversity into the demonstrations. Despite this, the model still outperforms MetGen (36.8 vs. 36.5), indicating that case-based deduction contributes to generating better conclusions. Additionally, we observe that the order of selected cases affects overall performance, as captured by the curriculum reasoning component. Importantly, CBD (w/o ProtoNet) closely mirrors traditional case-based reasoning methods, which rely purely on similarity-based retrieval without the advanced features we propose. The performance drop shows that our enhancements in case retrieval and diversification significantly improve model performance over standard case-based reasoning techniques.

5.3. Prototypical Network Results

To improve the efficiency of identifying cases with similar logical patterns, we use a prototypical network to determine the logical pattern of each promising step. The DeBERTa model [51] serves as the encoder for the prototypical network. The statistics for logical pattern annotations are shown in Table 3. Our model achieves an accuracy of 81% on the development set. We also experimented with a direct classifier approach, which produced results comparable to the prototypical network. However, the prototypical network is preferred due to its more interpretable decision-making process.
While evaluating the performance of our method, it is important to consider the distinction between our approach and clustering algorithms such as ECA* [61] and iECA* [62]. These clustering algorithms are designed to identify clusters without predefined centers, making them ideal for applications where categories are unknown in advance. In contrast, our approach leverages a prototypical network, where each data point’s category is known beforehand, enabling targeted logical pattern recognition.

5.4. Comparative Analysis of Retrieval and Reranking Method Combinations

This section presents a comparative analysis of different retrieval and reranking method combinations within the CBD framework. Our experiments focused on comparing the proposed method, which employs a prototypical network in the retrieval phase and information entropy for reranking in the reuse phase (ProtoNet + IE), against commonly used approaches, such as Dense Retrieval combined with a RoBERTa-based reranker (Dense + Reranker), on the EntailmentBank dataset for Task 1. Dense Retrieval operates by encoding both the query and candidate cases into dense vector representations and then calculating their cosine similarity within the same vector space. The RoBERTa-based Reranker refines the ranking of candidate cases by jointly encoding the query and a candidate case, scoring their relevance based on the [CLS] token’s embedding.
As summarized in Table 6, our ProtoNet + IE approach outperforms all other methods across all metrics, achieving the highest overall accuracy of 38.2%. These results underscore the effectiveness of our method in both the retrieval and reuse stages, delivering more accurate results than alternative approaches. Specifically, in the retrieval phase, the prototypical network effectively narrows the retrieval scope, reducing noise from irrelevant cases. Additionally, the information entropy-based reranker enhances diversity, which is crucial for in-context learning. This comparative analysis highlights the significant advantages of integrating ProtoNet with information entropy in improving the performance of entailment tree generation within our framework.

5.5. Effect of Number of Selected Cases

We next investigate the influence of the number of selected cases in the reuse phase on Task 1. This experiment was conducted using the top 10 results based on cosine similarity, since this pool likely includes a more diverse set of entities. To explore this, we tested CBD with varying numbers of selected cases. The results, presented in Figure 4, show a clear improvement in the Overall AllCorrect accuracy of our model when three cases are selected. We hypothesize that selecting three cases strikes a balance between the diversification provided by information entropy ranking and the curriculum learning process from easy to hard. The improvement with one or two cases is less significant, possibly due to insufficient diversity, which may cause the model to confuse in-context cases with knowledge. It is important to note that accuracy declines when more than three cases are selected, suggesting that noise may be introduced into the demonstrations, leading the model to focus more on entities than on the logical pattern. Nevertheless, even with just one selected case, the model’s performance is comparable to MetGen, indicating that CBD is beneficial for the entailment task.

5.6. Visualization of Logical Patterns

We visualized the prototype embeddings of different logical patterns (conjunction, if–then, substitution) using t-SNE [63], as shown in Figure 5. Each color represents a prototype logical pattern induced by the prototypical network, with the prototypical features clearly organized into distinct clusters. This clustering suggests that the model can effectively differentiate among the different logical patterns, enhancing its ability to accurately classify promising steps.

6. Related Work

6.1. Explanation for Question Answering

Extensive research has explored interpretability in question-answering (QA) systems, with a focus on understanding how models make predictions and provide explanations in various forms [64,65]. These efforts include techniques such as using attention maps over passages [66], extracting snippets of textual evidence [2], and selecting answer-supporting sentences from input paragraphs [56,67]. With the advent of large language models (LLMs) [52,68,69], recent research has shifted toward improving interpretability. One prominent approach is generating reasoning steps before producing answers, often referred to as chain-of-thought (CoT) reasoning [3,70,71,72,73]. However, several studies [74,75] have pointed out the limitations of this strategy. For example, Turpin et al. [76] highlight that CoT explanations can misrepresent the true reasoning behind a model’s prediction. Additionally, evaluations of CoT explanations [77] reveal several issues, such as contradictions and mathematical errors [78]. These findings underscore the risk that CoT-generated answers may not align with the reasoning process, which can erode user trust and complicate the identification of errors. Unlike free-form text explanations, which are prone to hallucinations [70,79,80,81] or rationales that only provide supporting evidence [2,82], structured explanations like entailment trees [6,83] are gaining popularity as a more reliable approach. These methods use a tree structure to outline how information is combined to derive the answer systematically. In our work, we adopt case-based deduction to iteratively construct entailment trees, ensuring that each step logically follows from the previous one, thus enhancing interpretability and minimizing errors.

6.2. Entailment Tree Generation

Entailment trees [6], which illustrate the logical relationships between multiple premises, have proven effective for structuring text-only question-answering tasks by systematically representing logical reasoning chains. They provide a transparent way to understand how hypotheses logically follow from a series of entailment steps based on multiple premises [84]. Recent advancements in the field have focused on the reconstruction task [6,10,85] and on leveraging entailment trees for neuro-symbolic reasoning [11,84]. For example, Weir et al. [84] introduced a QA system that employs backward chaining to search for entailment trees, allowing users to inspect and debug the system’s reasoning process. This approach holds promise for developing interactive and educational QA systems [86]. In our work, we focus on the reconstruction task, specifically on entailment tree generation.
EntailmentBank [6] proposes representing QA system reasoning steps as multistep textual entailments, offering detailed and informative explanations. Various methods have been developed to reconstruct tree-structured explanations for correct answers, such as those introduced by Hong et al. [8], Yang et al. [7], and Ribeiro et al. [10]. Recent studies [7,8,10,11] have adopted an iterative strategy for performing single-step neural inferences and combining these results to construct multistep reasoning chains. This iterative technique has been successful, leading to improved performance. Hong et al. [8] use concise prompts, such as “deductive conjunction”, tailored to reasoning patterns. Their method employs supervised learning to train one-step generation. In contrast, our method adopts an unsupervised in-context learning approach without parameter fine-tuning. By providing retrieved cases with promising steps during each inference, our method effectively guides the model toward the desired outputs. On the other hand, Ribeiro et al. [10] directly input fact pairs, prompting the language model to deduce based on these pairs. However, this approach does not consider the underlying logical pattern, making it susceptible to inaccuracies or hallucinations.

7. Limitations

While the case-based deduction method demonstrates improvements in entailment tree generation, its current design has some limitations. First, the method relies on prototypical networks, which can introduce biases reflecting the distribution of the training data: if the training data emphasize certain logical patterns more than others, the model may show a preference for these patterns. With a sufficiently diverse and well-balanced dataset, this issue can be mitigated, allowing the model to generalize more effectively. Second, the method is tailored to binary tree structures, which makes it robust in handling such reasoning tasks but less adaptable to more complex, nonbinary tree structures. Nonetheless, for tasks that adhere to binary structures, the approach provides significant advantages in terms of logical consistency and accuracy.

8. Conclusions and Future Research

Existing methods for entailment tree generation often struggle to maintain logical consistency within individual steps. To address this challenge, we introduce a novel approach called case-based deduction (CBD), which retrieves cases with similar logical patterns to the input pairs, ensuring consistent logical generation. Within the retrieve-reuse-refine paradigm, we use a prototypical network to identify logical patterns, enabling the accurate selection of analogous cases. By ranking these cases based on information entropy scores, we diversify the demonstrations and effectively teach a large language model the foundational logical patterns for reasoning. Our experimental results show that this method significantly improves the quality of entailment tree generation compared with state-of-the-art approaches. Additionally, using a prototypical network in the retrieval phase provides interpretable logical pattern induction. While our current approach has demonstrated effectiveness in entailment tree generation tasks, its potential extends beyond this specific application. The principles of CBD, retrieving and reusing logical patterns from similar cases, can be applied to a broader range of reasoning domains. For example, CBD could be leveraged in commonsense reasoning to infer conclusions based on everyday scenarios by retrieving cases with similar logical structures. In legal reasoning, the method could assist in drawing parallels between legal precedents and current cases. Additionally, mathematical proof generation could benefit from CBD’s ability to retrieve similar logical steps from previously solved proofs, aiding in new problem-solving tasks.
In future work, we plan to address the limitations identified in our current approach and focus on more specific improvements to enhance the performance and flexibility of case-based deduction. One promising direction is the exploration of alternative retrieval mechanisms. While our current method leverages a prototypical network to retrieve cases based on logical patterns, we aim to investigate more flexible retrieval strategies that adapt to diverse reasoning contexts. For instance, context-aware retrieval mechanisms or reinforcement learning techniques could allow the model to better select cases tailored to specific problem domains. Another area for improvement is integrating more complex and diverse logical patterns into the prototypical network. This enhancement would allow the model to capture a broader range of reasoning strategies, particularly for more challenging entailment tasks. By expanding the variety of logical patterns the network can recognize and retrieve, we expect the model to handle diverse and nonbinary reasoning structures more effectively. We also see value in incorporating human evaluation to complement the current automated metrics. Human evaluators can provide additional insights into aspects like interpretability, logical flow, and the practical utility of generated entailment trees, which may not be fully captured by automated evaluation methods.

Author Contributions

Conceptualization, J.S., X.D. and T.L.; methodology, J.S., X.D. and T.L.; software, J.S.; validation, X.D. and T.L.; formal analysis, J.S.; investigation, J.S.; resources, X.D. and T.L.; data curation, X.D. and T.L.; writing—original draft preparation, J.S.; writing—review and editing, X.D. and T.L.; visualization, J.S.; supervision, X.D. and T.L.; project administration, X.D. and T.L.; funding acquisition, X.D. and T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (U22B2059 and 62176079) and the Natural Science Foundation of Heilongjiang Province (YQ2022F005).

Data Availability Statement

The data presented in this study are openly available at https://allenai.org/data/entailmentbank, accessed on 7 November 2021.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  2. DeYoung, J.; Jain, S.; Rajani, N.F.; Lehman, E.; Xiong, C.; Socher, R.; Wallace, B.C. ERASER: A Benchmark to Evaluate Rationalized NLP Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4443–4458. [Google Scholar]
  3. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.V.; Chi, E.H.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  4. Rocktäschel, T.; Riedel, S. End-to-end differentiable proving. Adv. Neural Inf. Process. Syst. 2017, 30, 3788–3800. [Google Scholar]
  5. Tafjord, O.; Dalvi, B.; Clark, P. ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 3621–3634. [Google Scholar]
  6. Dalvi, B.; Jansen, P.; Tafjord, O.; Xie, Z.; Smith, H.; Pipatanangkura, L.; Clark, P. Explaining Answers with Entailment Trees. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7358–7370. [Google Scholar]
  7. Yang, K.; Deng, J.; Chen, D. Generating Natural Language Proofs with Verifier-Guided Search. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 89–105. [Google Scholar]
  8. Hong, R.; Zhang, H.; Yu, X.; Zhang, C. METGEN: A Module-Based Entailment Tree Generation Framework for Answer Explanation. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022; pp. 1887–1905. [Google Scholar]
  9. Krishna, A.; Riedel, S.; Vlachos, A. Proofver: Natural logic theorem proving for fact verification. Trans. Assoc. Comput. Linguist. 2022, 10, 1013–1030. [Google Scholar]
  10. Ribeiro, D.N.; Wang, S.; Ma, X.; Dong, R.; Wei, X.; Zhu, H.; Chen, X.; Xu, P.; Huang, Z.; Arnold, A.; et al. Entailment Tree Explanations via Iterative Retrieval-Generation Reasoner. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022; pp. 465–475. [Google Scholar]
  11. Tafjord, O.; Dalvi, B.; Clark, P. Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 2078–2093. [Google Scholar]
  12. Liu, Z.; Wang, Z.; Lin, Y.; Li, H. A Neural-Symbolic Approach to Natural Language Understanding. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 2159–2172. [Google Scholar]
  13. Zhang, H.; Huang, J.; Li, Z.; Naik, M.; Xing, E. Improved Logical Reasoning of Language Models via Differentiable Symbolic Programming. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 3062–3077. [Google Scholar]
  14. Nye, M.; Tessler, M.; Tenenbaum, J.; Lake, B.M. Improving coherence and consistency in neural sequence models with dual-system, neuro-symbolic reasoning. Adv. Neural Inf. Process. Syst. 2021, 34, 25192–25204. [Google Scholar]
  15. Slade, S. Case-based reasoning: A research paradigm. AI Mag. 1991, 12, 42. [Google Scholar]
  16. Kolodner, J.L. An introduction to case-based reasoning. Artif. Intell. Rev. 1992, 6, 3–34. [Google Scholar]
  17. Aamodt, A.; Plaza, E. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Commun. 1994, 7, 39–59. [Google Scholar]
  18. Watson, I.; Marir, F. Case-based reasoning: A review. Knowl. Eng. Rev. 1994, 9, 327–354. [Google Scholar]
  19. Reza Montazemi, A.; Moy Gupta, K. A framework for retrieval in case-based reasoning systems. Ann. Oper. Res. 1997, 72, 51–73. [Google Scholar]
  20. Valentino, M.; Thayaparan, M.; Freitas, A. Case-Based Abductive Natural Language Inference. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 1556–1568. [Google Scholar]
  21. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. Adv. Neural Inf. Process. Syst. 2017, 30, 4077–4087. [Google Scholar]
  22. Gao, T.; Fisch, A.; Chen, D. Making Pre-trained Language Models Better Few-shot Learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 3816–3830.
  23. Zhao, Z.; Wallace, E.; Feng, S.; Klein, D.; Singh, S. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 18–24 July 2021; pp. 12697–12706.
  24. Levy, I.; Bogin, B.; Berant, J. Diverse Demonstrations Improve In-context Compositional Generalization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 1401–1422.
  25. Lu, Y.; Bartolo, M.; Moore, A.; Riedel, S.; Stenetorp, P. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 8086–8098.
  26. Liu, J.; Shen, D.; Zhang, Y.; Dolan, W.B.; Carin, L.; Chen, W. What Makes Good In-Context Examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, Dublin, Ireland, 27 May 2022; pp. 100–114.
  27. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
  28. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 41–48.
  29. Schank, R.C.; Kass, A.; Riesbeck, C.K. Inside Case-Based Explanation; Psychology Press: London, UK, 2014.
  30. López de Mántaras, R.; McSherry, D.; Bridge, D.; Leake, D.; Smyth, B.; Craw, S.; Faltings, B.; Maher, M.L.; Cox, M.T.; Forbus, K.; et al. Retrieval, reuse, revision and retention in case-based reasoning. Knowl. Eng. Rev. 2005, 20, 215–240.
  31. Wiratunga, N.; Abeyratne, R.; Jayawardena, L.; Martin, K.; Massie, S.; Nkisi-Orji, I.; Weerasinghe, R.; Liret, A.; Fleisch, B. CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering. In Proceedings of the International Conference on Case-Based Reasoning, Merida, Mexico, 1–4 July 2024; pp. 445–460.
  32. Feng, B.; Gao, H.; Zhang, P.; Zhang, J. CBR-Ren: A Case-Based Reasoning Driven Retriever-Generator Model for Hybrid Long-Form Numerical Reasoning. In Proceedings of the International Conference on Case-Based Reasoning, Merida, Mexico, 1–4 July 2024; pp. 111–126.
  33. Watson, I. Applying Case-Based Reasoning: Techniques for Enterprise Systems; Morgan Kaufmann Publishers Inc.: Burlington, MA, USA, 1998.
  34. Kolodner, J.L. Educational implications of analogy: A view from case-based reasoning. Am. Psychol. 1997, 52, 57.
  35. Das, R.; Godbole, A.; Monath, N.; Zaheer, M.; McCallum, A. Probabilistic Case-based Reasoning for Open-World Knowledge Graph Completion. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 4752–4765.
  36. Das, R.; Godbole, A.; Dhuliawala, S.; Zaheer, M.; McCallum, A. A Simple Approach to Case-Based Reasoning in Knowledge Bases. In Proceedings of Automated Knowledge Base Construction, Online, 22–24 June 2020.
  37. Das, R.; Godbole, A.; Naik, A.; Tower, E.; Zaheer, M.; Hajishirzi, H.; Jia, R.; McCallum, A. Knowledge base question answering by case-based reasoning over subgraphs. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 4777–4793.
  38. Orozco-del Castillo, M.G.; Recio-Garcia, J.A.; Orozco-del Castillo, E.C. Item-Specific Similarity Assessments for Explainable Depression Screening. In Proceedings of the International Conference on Case-Based Reasoning, Merida, Mexico, 1–4 July 2024; pp. 430–444.
  39. Das, R.; Zaheer, M.; Thai, D.; Godbole, A.; Perez, E.; Lee, J.Y.; Tan, L.; Polymenakos, L.; McCallum, A. Case-based Reasoning for Natural Language Queries over Knowledge Bases. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 9594–9611.
  40. Pasupat, P.; Zhang, Y.; Guu, K. Controllable Semantic Parsing via Retrieval Augmentation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7683–7698.
  41. Awasthi, A.; Chakrabarti, S.; Sarawagi, S. Structured case-based reasoning for inference-time adaptation of text-to-SQL parsers. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 12536–12544.
  42. Schank, R. Explanation Patterns: Understanding Mechanically and Creatively; Psychology Press: London, UK, 2013.
  43. Das, A.; Gupta, C.; Kovatchev, V.; Lease, M.; Li, J.J. ProtoTEx: Explaining Model Decisions with Prototype Tensors. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022.
  44. Van Aken, B.; Papaioannou, J.M.; Naik, M.; Eleftheriadis, G.; Nejdl, W.; Gers, F.; Loeser, A. This Patient Looks Like That Patient: Prototypical Networks for Interpretable Diagnosis Prediction from Clinical Text. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 20–23 November 2022; pp. 172–184.
  45. Gao, T.; Han, X.; Liu, Z.; Sun, M. Hybrid attention-based prototypical networks for noisy few-shot relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6407–6414.
  46. Yue, X.; Zheng, Z.; Zhang, S.; Gao, Y.; Darrell, T.; Keutzer, K.; Vincentelli, A.S. Prototypical cross-domain self-supervised learning for few-shot unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 13834–13844.
  47. Pan, Y.; Yao, T.; Li, Y.; Wang, Y.; Ngo, C.W.; Mei, T. Transferrable prototypical networks for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2239–2247.
  48. Ming, Y.; Xu, P.; Qu, H.; Ren, L. Interpretable and steerable sequence learning via prototypes. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 903–913.
  49. Li, O.; Liu, H.; Chen, C.; Rudin, C. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
  50. Chen, C.; Li, O.; Tao, D.; Barnett, A.; Rudin, C.; Su, J.K. This looks like that: Deep learning for interpretable image recognition. Adv. Neural Inf. Process. Syst. 2019, 32, 8928–8939.
  51. He, P.; Gao, J.; Chen, W. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
  52. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
  53. Lester, B.; Al-Rfou, R.; Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 3045–3059.
  54. AI@Meta. Llama 3 Model Card. Available online: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct (accessed on 17 April 2024).
  55. Clark, P.; Cowhey, I.; Etzioni, O.; Khot, T.; Sabharwal, A.; Schoenick, C.; Tafjord, O. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv 2018, arXiv:1803.05457.
  56. Xie, Z.; Thiem, S.; Martin, J.; Wainwright, E.; Marmorstein, S.; Jansen, P. WorldTree V2: A corpus of science-domain structured explanations and inference patterns supporting multi-hop inference. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 5456–5473.
  57. Sellam, T.; Das, D.; Parikh, A. BLEURT: Learning Robust Metrics for Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7881–7892.
  58. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8024–8035.
  59. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45.
  60. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
  61. Hassan, B.A.; Rashid, T.A. A multidisciplinary ensemble algorithm for clustering heterogeneous datasets. Neural Comput. Appl. 2021, 33, 10987–11010.
  62. Hassan, B.A.; Rashid, T.A.; Hamarashid, H.K. A novel cluster detection of COVID-19 patients and medical disease conditions using improved evolutionary clustering algorithm star. Comput. Biol. Med. 2021, 138, 104866.
  63. van der Maaten, L.; Hinton, G. Visualizing High-Dimensional Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
  64. Wiegreffe, S.; Marasovic, A. Teach Me to Explain: A Review of Datasets for Explainable Natural Language Processing. In Proceedings of the Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), Online, 6–14 December 2021.
  65. Lamm, M.; Palomaki, J.; Alberti, C.; Andor, D.; Choi, E.; Soares, L.B.; Collins, M. QED: A framework and dataset for explanations in question answering. Trans. Assoc. Comput. Linguist. 2021, 9, 790–806.
  66. Seo, M.; Kembhavi, A.; Farhadi, A.; Hajishirzi, H. Bidirectional Attention Flow for Machine Comprehension. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
  67. Jansen, P.; Ustalov, D. TextGraphs 2019 Shared Task on Multi-Hop Inference for Explanation Regeneration. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), Hong Kong, China, 4 November 2019; pp. 63–77.
  68. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258.
  69. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213.
  70. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
  71. Creswell, A.; Shanahan, M.; Higgins, I. Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
  72. Zelikman, E.; Wu, Y.; Mu, J.; Goodman, N. STaR: Bootstrapping reasoning with reasoning. Adv. Neural Inf. Process. Syst. 2022, 35, 15476–15488.
  73. Zhou, D.; Schärli, N.; Hou, L.; Wei, J.; Scales, N.; Wang, X.; Schuurmans, D.; Cui, C.; Bousquet, O.; Le, Q.V.; et al. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023.
  74. Liang, X.; Song, S.; Zheng, Z.; Wang, H.; Yu, Q.; Li, X.; Li, R.H.; Xiong, F.; Li, Z. Internal consistency and self-feedback in large language models: A survey. arXiv 2024, arXiv:2407.14507.
  75. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv 2023, arXiv:2311.05232.
  76. Turpin, M.; Michael, J.; Perez, E.; Bowman, S. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Adv. Neural Inf. Process. Syst. 2023, 36, 74952–74965.
  77. Lanham, T.; Chen, A.; Radhakrishnan, A.; Steiner, B.; Denison, C.; Hernandez, D.; Li, D.; Durmus, E.; Hubinger, E.; Kernion, J.; et al. Measuring faithfulness in chain-of-thought reasoning. arXiv 2023, arXiv:2307.13702.
  78. Golovneva, O.; Chen, M.; Poff, S.; Corredor, M.; Zettlemoyer, L.; Fazel-Zarandi, M.; Celikyilmaz, A. ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023.
  79. Hron, J.; Culp, L.A.; Elsayed, G.F.; Liu, R.; Snoek, J.; Kornblith, S.; Rizkowsky, A.; Simpson, I.; Sohl-Dickstein, J.; Fiedel, N.; et al. Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024.
  80. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38.
  81. Li, K.; Patel, O.; Viégas, F.; Pfister, H.; Wattenberg, M. Inference-time intervention: Eliciting truthful answers from a language model. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 41451–41530.
  82. Valentino, M.; Thayaparan, M.; Freitas, A. Unification-based Reconstruction of Multi-hop Explanations for Science Questions. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; pp. 200–211.
  83. Song, J.; Wu, X.; Cai, Y. Step Feasibility-Aware and Error-Correctable Entailment Tree Generation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Torino, Italy, 20–25 May 2024; pp. 15296–15308.
  84. Weir, N.; Van Durme, B. Dynamic generation of interpretable inference rules in a neuro-symbolic expert system. arXiv 2022, arXiv:2209.07662.
  85. Bostrom, K.; Sprague, Z.; Chaudhuri, S.; Durrett, G. Natural Language Deduction through Search over Statement Compositions. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 4871–4883.
  86. Dalvi, B.; Tafjord, O.; Clark, P. Towards teachable reasoning systems: Using a dynamic memory of user feedback for continual system improvement. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 9465–9480.
Figure 1. Illustration of entailment tree generation: (a) Given a hypothesis (grey) and relevant text (or a corpus), the goal is to generate an entailment tree. (b) The entailment tree includes basic facts (sent) and novel intermediate conclusions (int), all connected by entailment steps. (c) The predicted step produces an illogical result.
Figure 2. Overview of the proposed CBD framework for entailment tree generation. The left side of the figure illustrates the entire iterative generation process, where circles represent sentences, rounded rectangles represent intermediate conclusions, and dotted rounded rectangles indicate sets of sentences or intermediate conclusions being considered together. In the initial iteration, the controller identifies promising steps, such as (s1, s6). The right side of the figure provides a detailed illustration of the process using the sentence pair ("s1: Chocolate is usually a solid" and "s6: Chocolate is a kind of substance"). Through the retrieve–reuse–refine paradigm, CBD derives a logical intermediate fact ("int1: Chocolate is a kind of solid substance"). Each promising step undergoes case-based deduction, after which the controller validates the results and determines the next reasoning step.
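For readers who prefer code, the retrieve–reuse–refine loop of Figure 2 can be summarized by the following minimal Python sketch. The controller, case base, and language model interfaces (select_step, retrieve, deduce, validate, entails) are hypothetical placeholders standing in for the components in the figure, not the released implementation.

# Minimal sketch of the CBD retrieve-reuse-refine loop
# (hypothetical interfaces, not the official implementation).

def generate_entailment_tree(hypothesis, sentences, controller, case_base, llm, max_iters=20):
    """Iteratively build an entailment tree via case-based deduction."""
    facts = list(sentences)   # leaf sentences plus accumulated intermediate conclusions
    steps = []                # (premises, conclusion) pairs forming the tree
    for _ in range(max_iters):
        # The controller proposes a promising premise pair, e.g., (s1, s6).
        premises = controller.select_step(hypothesis, facts)
        if premises is None:
            break
        # Retrieve: fetch cases whose logical structure resembles the premises.
        cases = case_base.retrieve(premises, k=5)
        # Reuse: prompt the language model with the cases as in-context demonstrations.
        conclusion = llm.deduce(premises, demonstrations=cases)
        # Refine: keep the intermediate conclusion only if the controller validates it.
        if controller.validate(premises, conclusion):
            facts.append(conclusion)
            steps.append((premises, conclusion))
            # Stop once the derived conclusion entails the hypothesis.
            if controller.entails(conclusion, hypothesis):
                break
    return steps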
Figure 3. Example prompt for case-based deduction, where five cases guide the language model to generate logical text through in-context learning.
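A prompt of the kind shown in Figure 3 can be assembled programmatically. The sketch below illustrates the general format only; the instruction wording and the case field names are our assumptions, not the paper's verbatim template.

# Illustrative builder for a Figure 3-style prompt (instruction wording
# and case fields are assumptions, not the paper's verbatim template).

def build_deduction_prompt(cases, premise1, premise2):
    lines = ["Combine the two premises into one logically valid conclusion.", ""]
    for i, case in enumerate(cases, 1):
        lines += [f"Example {i}:",
                  f"Premise 1: {case['s1']}",
                  f"Premise 2: {case['s2']}",
                  f"Conclusion: {case['int1']}",
                  ""]
    lines += [f"Premise 1: {premise1}", f"Premise 2: {premise2}", "Conclusion:"]
    return "\n".join(lines)

For the running example of Figure 2, calling build_deduction_prompt with the retrieved cases and the premises "Chocolate is usually a solid" and "Chocolate is a kind of substance" yields a prompt whose expected completion is the intermediate conclusion "Chocolate is a kind of solid substance".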
Figure 4. Performance of CBD under different numbers of cases on Task 1.
Figure 5. Visualization of prototype embeddings for different logical patterns (Conjunction, If-then, Substitution). Each dot represents an instance of a specific logical pattern, with different colors corresponding to different patterns. The red crosses indicate the central prototype embedding for each logical pattern.
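A plot in the style of Figure 5 can be reproduced with t-SNE [63]. The sketch below assumes the step embeddings and their pattern labels have already been exported as NumPy arrays; the file names are hypothetical.

# Sketch of a Figure 5-style plot: project step embeddings with t-SNE and mark
# each pattern's mean embedding as its prototype. File names are hypothetical.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.load("step_embeddings.npy")  # shape (n_steps, hidden_dim)
labels = np.load("step_labels.npy")          # 0=Conjunction, 1=If-then, 2=Substitution

points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
for idx, name in enumerate(["Conjunction", "If-then", "Substitution"]):
    mask = labels == idx
    plt.scatter(points[mask, 0], points[mask, 1], s=8, label=name)
    cx, cy = points[mask].mean(axis=0)       # 2-D centroid plotted as the prototype
    plt.scatter(cx, cy, marker="x", color="red", s=80)
plt.legend()
plt.savefig("prototype_embeddings.png")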
Table 1. The distribution of logical categories in individual entailment tree steps based on a sample of 100 random steps from the training corpus. Here, s1 and s2 represent input sentences, while int1 represents the deduction outcome (intermediate nodes in the entailment tree).

Categories | Logical Pattern | Proportion | Example
Substitution | s1: ∀x ∈ X, P(x); s2: a ∈ X; int1: P(a) | 46% | s1: coal is a kind of fossil fuel. s2: fossil fuels are acquired by mining the lithosphere. int1: coal is acquired by mining the lithosphere.
If-then | s1: P(x) → Q(x); s2: P(x); int1: Q(x) | 33% | s1: if a thing converts something into something else then that thing uses that something. s2: bees convert nectar into honey for food. int1: bees use nectar for food.
Conjunction | s1: P(x); s2: Q(x); int1: (P ∧ Q)(x) | 21% | s1: sunlight is a kind of electromagnetic radiation. s2: sunlight can shine through a window. int1: sunlight is a kind of electromagnetic radiation that can shine through a window.
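To illustrate how steps of these three kinds might be stored as retrievable cases, consider the following hypothetical record layout, populated with the examples from Table 1. This is a sketch of one plausible case-base format, not the paper's actual data schema.

# Hypothetical case records covering the three logical patterns of Table 1.
from dataclasses import dataclass

@dataclass
class Case:
    pattern: str   # "Substitution", "If-then", or "Conjunction"
    s1: str        # first premise
    s2: str        # second premise
    int1: str      # deduced intermediate conclusion

case_base = [
    Case("Substitution",
         "coal is a kind of fossil fuel.",
         "fossil fuels are acquired by mining the lithosphere.",
         "coal is acquired by mining the lithosphere."),
    Case("If-then",
         "if a thing converts something into something else then that thing uses that something.",
         "bees convert nectar into honey for food.",
         "bees use nectar for food."),
    Case("Conjunction",
         "sunlight is a kind of electromagnetic radiation.",
         "sunlight can shine through a window.",
         "sunlight is a kind of electromagnetic radiation that can shine through a window."),
]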
Table 2. Summary statistics for the EntailmentBank dataset splits.

 | Train | Dev | Test | All
Questions | 1313 | 187 | 340 | 1840
Entailment Steps | 4175 | 597 | 1109 | 5881
Table 3. Statistics of logical pattern annotations for steps.

Split | Substitution | Conjunction | If-then | All
Train | 211 | 105 | 84 | 400
Dev | 153 | 71 | 51 | 275
Table 4. Automatic evaluation results on the EntailmentBank test split. ◊ indicates results reported in published papers. † denotes results obtained using the official public code with the Llama 3 language model. The best results are highlighted in bold, and the second-best results are underlined.

Method | Leaves F1 (%) | Leaves AllCorrect (%) | Steps F1 (%) | Steps AllCorrect (%) | Intermediates F1 (%) | Intermediates AllCorrect (%) | Overall AllCorrect (%)

Task 1 (no-distractor)
EntailmentWriter [6] ◊ | 98.4 | 84.1 | 50.0 | 38.5 | 67.0 | 35.9 | 34.4
EntailmentWriter-Iter [6] ◊ | 99.8 | 97.6 | 51.6 | 38.5 | 68.3 | 36.5 | 35.0
IRGR [10] ◊ | 97.6 | 89.4 | 50.2 | 36.8 | 62.1 | 31.8 | 32.4
MetGen (T5) [8] ◊ | 100.0 | 100.0 | 57.7 | 41.9 | 70.8 | 39.2 | 36.5
MetGen (Llama 3) [8] † | 99.8 | 99.4 | 58.6 | 42.4 | 71.6 | 40.3 | 36.5
CBD (Ours) | 99.8 | 99.4 | 58.5 | 42.4 | 72.2 | 42.1 | 38.2

Task 2 (distractor)
EntailmentWriter [6] ◊ | 83.2 | 35.0 | 39.5 | 24.7 | 62.2 | 28.2 | 23.2
EntailmentWriter-Iter [6] ◊ | 85.2 | 40.9 | 38.9 | 26.8 | 63.5 | 29.1 | 25.0
IRGR [10] ◊ | 69.9 | 23.8 | 30.5 | 22.4 | 47.7 | 26.5 | 21.8
MetGen (T5) [8] ◊ | 82.7 | 46.1 | 41.3 | 29.6 | 61.4 | 32.4 | 27.7
MetGen (Llama 3) [8] † | 80.9 | 43.2 | 41.2 | 29.4 | 60.3 | 32.9 | 27.9
CBD (Ours) | 81.3 | 44.1 | 42.5 | 30.3 | 60.9 | 33.5 | 28.5

Task 3 (full-corpus)
EntailmentWriter [6] ◊ | 30.9 | 1.2 | 4.4 | 1.2 | 28.8 | 5.6 | 1.2
EntailmentWriter-Iter [6] ◊ | 32.4 | 1.8 | 4.4 | 1.5 | 29.7 | 6.5 | 1.5
IRGR [10] ◊ | 46.6 | 10.0 | 11.3 | 8.2 | 38.7 | 20.9 | 8.2
MetGen (T5) [8] ◊ | 34.8 | 8.7 | 9.8 | 8.6 | 36.6 | 20.4 | 8.6
MetGen (Llama 3) [8] † | 36.2 | 9.1 | 10.6 | 8.2 | 36.1 | 18.2 | 8.2
CBD (Ours) | 35.7 | 9.7 | 11.5 | 9.4 | 36.7 | 20.9 | 9.4
Table 5. Ablation study on the EntailmentBank dataset for Task 1. "w/o" indicates the removal of the corresponding module from the model. Δ represents the percentage difference compared with CBD.

Method | Intermediates F1 (%) | Intermediates AllCorrect (%) | Overall AllCorrect (%) | Δ
CBD (Ours) | 72.2 | 42.1 | 38.2 | -
CBD (w/o ProtoNet) | 67.8 | 37.1 | 34.7 | −3.5
CBD (w/o IE) | 71.5 | 40.9 | 36.8 | −1.4
CBD (w/o Curriculum) | 71.8 | 40.6 | 37.1 | −1.1
Note: Bold values indicate the best performance.
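The IE component ablated in Table 5 (and combined with different retrievers in Table 6 below) is the information-entropy reranking of retrieved cases [27]. As a rough sketch, diversity-aware reranking can be realized by greedily selecting the cases that maximize the Shannon entropy of the logical-pattern distribution among the chosen demonstrations; the greedy rule below is our assumed reading of that idea, not the paper's exact algorithm.

# Minimal sketch of entropy-based reranking: greedily pick retrieved cases so the
# Shannon entropy of the selected logical-pattern distribution is maximized
# (the greedy rule is an assumption, not the paper's exact IE reranker).
import math
from collections import Counter

def pattern_entropy(patterns):
    counts = Counter(patterns)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rerank_by_entropy(candidates, k=5):
    """candidates: list of (case, pattern) pairs, sorted by retrieval score."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        # Choose the candidate whose addition yields the most diverse pattern mix.
        best = max(remaining,
                   key=lambda c: pattern_entropy([p for _, p in selected] + [c[1]]))
        selected.append(best)
        remaining.remove(best)
    return [case for case, _ in selected]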
Table 6. Experimental results of different retrieval and reranking method combinations for Task 1.

Method | Intermediates F1 (%) | Intermediates AllCorrect (%) | Overall AllCorrect (%)
Dense + Reranker | 68.0 | 36.2 | 33.5
Dense + IE | 67.8 | 37.1 | 34.7
ProtoNet + Reranker | 70.6 | 39.4 | 35.9
ProtoNet + IE (Ours) | 72.2 | 42.1 | 38.2
Note: Bold values indicate the best performance.