1. Introduction
Neural networks have demonstrated exceptional performance across various tasks [1]. However, these models often remain opaque to human understanding, prompting researchers to explore methods for providing more interpretable explanations. Three main approaches have emerged: highlighting relevant features in the input data to explain decisions [2], generating natural language explanations alongside predicted answers [3], and employing structured reasoning processes [4].
Our work focuses on leveraging structured reasoning processes to explain decisions in a clear and understandable manner. Traditionally, structured explanations have been constructed using reasoning chains, as demonstrated in earlier studies [2,5]. More recent research has explored entailment trees as an alternative method for generating explanations. Entailment trees offer a more structured and manageable way of conducting reasoning, making them a promising tool for explanation generation [6,7,8]. Entailment trees, as proposed in [6], illustrate the logical connections between multiple premises, systematically organizing the reasoning chains that underpin question-answering tasks. This approach provides a transparent means of understanding how hypotheses are derived from a series of entailment steps based on multiple premises.
Previous methods for constructing entailment trees typically employed a single, sequential process that generated the entire tree at once [9]. For instance, the sequential form of the entailment tree in Figure 1b can be represented as “Chocolate is a kind of solid substance → Chocolate in solid state has definite shape”. However, this approach often struggled to generate complete and accurate entailment trees, leading to unreliable or hallucinatory results. To mitigate these issues, recent studies [7,8,10,11] have adopted an iterative approach, where one-step entailments are generated first and then iteratively expanded to form the entire tree. Despite these improvements, a common challenge persists: ensuring logical consistency within individual steps. As shown in Figure 1c, given the sentences “Chocolate is usually a solid” and “Chocolate is a kind of substance”, the correct entailment is “Chocolate is a kind of solid substance”. However, a pretrained language model might incorrectly predict that “Substance is usually a solid”, highlighting the issue of logical inconsistency.
To address these challenges, researchers have explored symbolic methods [12,13,14] known for their logical reasoning capabilities. However, these methods face difficulties in enumerating all first-order logical rules due to the diverse nature of natural language expressions, and constructing rule bases through human engineering is time-consuming. A promising alternative is case-based learning [15,16,17,18,19], where systems operate on the principle that similar problems have similar solutions. Case-based reasoning is a method in artificial intelligence where solutions to new problems are derived by referring to similar, previously solved cases. By leveraging past cases to solve new problems, case-based reasoning allows for guiding logical generation using patterns from past gold cases without the need for explicit rule induction.
The case-based reasoning framework typically follows a retrieve-reuse-refine paradigm [20]. A key challenge in this approach is retrieving similar cases, as logical patterns can manifest in various natural language forms. To address this, we propose using prototypical networks [21] to represent logical patterns as prototype embeddings. Prototypical networks are a type of neural network that learns a metric space in which classification is performed by computing the distance to prototype representations of each class. This method captures the essence of logical patterns, enhancing the retrieval process within the case-based reasoning framework. Our research focuses on three prevalent logical patterns: conjunction, if–then, and substitution. The prototypical network generates three prototype embeddings, one for each pattern, by averaging the embeddings of cases in each category. During reasoning, these prototype embeddings help identify the logical pattern of the current step, narrowing the retrieval scope within our proposed framework, CBD.
Once similar cases are retrieved, they are used as demonstrations for in-context learning. In-context learning benefits from diverse demonstrations [22,23,24] and is sensitive to their order [25,26]. To enhance diversity, we incorporate information entropy [27], a measure from information theory used to quantify the diversity of information in a set, which is applied here to rerank the retrieved cases. We then select the top-n cases as demonstrations. Inspired by curriculum learning [28], we organize the demonstrations by difficulty, arranging them from easy (low information entropy) to difficult (high information entropy). The deduction process is carried out iteratively until the target hypothesis is proven or the maximum deduction depth is reached. To facilitate iterative generation, we utilize the controller from MetGen [8].
Our contributions are as follows:
Case-based learning for step generation: We implement case-based learning to eliminate the need for manual construction of logical rules. The retrieved cases are utilized as demonstrations for in-context learning, effectively guiding large language models in logical text generation.
Prototypical networks for inducing abstract logical structures: To improve the efficiency of identifying cases with similar logical structures, we employ a prototypical network to capture implicit logical patterns. This enables more targeted case retrieval and provides insights into logical pattern attribution.
Integration of information entropy to enhance diversity: We incorporate information entropy to increase the diversity of retrieved cases, enhancing the model’s ability to learn underlying logical patterns and improving in-context learning.
The rest of this paper is organized as follows: Section 2 defines the problem and provides background knowledge on case-based reasoning and prototypical networks. Section 3 describes the architecture and implementation of CBD, detailing the case-based deduction process for entailment tree generation. Section 4 outlines our experimental setup, including the benchmark dataset, baseline models, evaluation metrics, and implementation details. Section 5 presents our results, discusses their implications, and highlights the advantages of CBD over existing methods. Section 6 reviews related work in entailment tree generation and explanation for question-answering tasks, situating our contributions within the broader research context. Finally, Section 8 concludes the paper by summarizing our contributions and discussing potential directions for future work.
3. Case-Based Deduction
The objective of CBD is to construct an entailment tree based on a set of known facts X and a declarative hypothesis h, which consists of a question and its correct answer. For example, given the hypothesis h = “the shape of the chocolate changes when the chocolate melts”, our task is to build an explanation that justifies h by selecting and combining pairs of sentences from X to generate new intermediate conclusions, thereby forming an entailment tree. This is achieved through a case-based reasoning approach, which involves three main phases:
Retrieval Phase: Relevant cases are retrieved to serve as demonstrations, ensuring the logical consistency of the generated results. A prototypical network predicts the logical pattern of promising steps, categorizing them into one of three types: if–then, substitution, or conjunction relations. After determining the logical pattern, cosine similarity is used to identify the most relevant cases for the current fact pairs.
Reuse Phase: To mitigate bias toward entities that might result from the cosine similarity method and to encourage diversity among the selected cases, we integrate information entropy into the ranking procedure. Retrieved cases are then ranked based on their information entropy scores.
Refine Phase: The ordered cases are concatenated and used as demonstrations to guide a pretrained language model for logical text generation.
Through iterative application of this process, the entire entailment tree is constructed. Figure 2 provides a comprehensive overview of the CBD framework. Further details are discussed in the following sections.
3.1. Retrieve Phase
Using the controller proposed by Hong et al. [8], we identify promising steps in the reasoning process. For instance, the sentences “Chocolate is usually a solid” and “Chocolate is a kind of substance” exemplify such a promising step. The case-based deduction module then generates intermediate conclusions for each identified step. This process begins by retrieving analogous cases that follow the same logical pattern, leveraging a prototypical network to capture these implicit patterns. The prototypical network helps recognize abstract logical patterns in the data, which ensures that retrieved cases align with the input logical structures. By doing so, the network facilitates the retrieval of cases that are more semantically relevant and helps improve the logical consistency of intermediate conclusions. We categorize the types of inferences into three prevalent categories, as shown in Table 1. The case base, denoted as C, consists of three logical categories: C_con (conjunction), C_if (if–then), and C_sub (substitution).
To encode the semantics of promising steps into embeddings, we use the DeBERTa model [51], chosen for its effectiveness in capturing sentence semantics; other encoders, such as recurrent neural networks (RNNs), could in principle be used instead. After computing the sentence pair embeddings within the case base C, we use a prototypical network to generate a prototype for each logical pattern based on these embeddings. By measuring the distance between the embedding of a promising step and each prototype, we can accurately determine the logical pattern corresponding to that step.
Given a promising step s = (s_1, s_2), where s_1 and s_2 are the two premise sentences, we apply the DeBERTa model to encode the sentence pair into a continuous low-dimensional embedding v, capturing the semantics of the sentence pair. For all input tokens of the sentence pairs, we concatenate the two sentences using the special token [SEP], forming x = ([CLS], s_1, [SEP], s_2). The embedding is computed as follows:

H = DeBERTa(Emb(x)),   v = H_[CLS],

where Emb(·) refers to the embedding layer of DeBERTa, which maps discrete words into continuous embeddings, and H denotes the output of the final layer. The vector representation of the sentence pair, corresponding to the token [CLS], is denoted as v.
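As a concrete illustration, the input construction above can be sketched as follows (a minimal sketch; in practice the DeBERTa tokenizer inserts the special tokens automatically, and the token names follow the notation above):

```python
def build_pair_input(s1: str, s2: str) -> str:
    """Join the two premises of a promising step with [SEP] and prefix
    [CLS], mirroring the input format x = ([CLS], s1, [SEP], s2).
    The DeBERTa tokenizer normally adds these special tokens itself;
    they are written out here only for clarity."""
    return f"[CLS] {s1} [SEP] {s2} [SEP]"

x = build_pair_input("Chocolate is usually a solid",
                     "Chocolate is a kind of substance")
```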
In our approach, the prototypical network represents each logical pattern with a single vector, known as a prototype. The logical patterns {con, if, sub} correspond to conjunction, if–then, and substitution. We calculate prototypes by averaging all sentence pair embeddings within the annotated entailment step set for each category:

c_con = (1 / |S_con|) Σ_ℓ v_ℓ^con,

where c_con is the prototype embedding for the conjunction pattern, S_con is the set of annotated conjunction steps, and v_ℓ^con represents the ℓ-th sentence pair embedding within the conjunction category. The same process is applied to calculate the prototypes for the other two logical categories: c_if for the if–then pattern and c_sub for the substitution pattern.
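A minimal sketch of the prototype computation, using toy 2-d vectors in place of the 768-d DeBERTa embeddings (category names follow Table 1; all values are illustrative):

```python
import numpy as np

# Toy step embeddings per logical category, standing in for the
# DeBERTa [CLS] vectors of annotated entailment steps.
case_embeddings = {
    "conjunction":  np.array([[1.0, 0.1], [0.9, 0.0], [1.1, -0.1]]),
    "if-then":      np.array([[0.0, 1.0], [0.1, 0.9]]),
    "substitution": np.array([[-1.0, 0.0], [-0.9, 0.2]]),
}

# Each prototype is simply the mean embedding of all annotated
# steps in its category, as in the equation above.
prototypes = {k: v.mean(axis=0) for k, v in case_embeddings.items()}
```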
Next, we calculate the probabilities p_i for predicting the logical pattern of the promising step s using the following equations:

p_i = exp(−d(v, c_i)) / Σ_j exp(−d(v, c_j)),   ŷ = argmax_i p_i,

where v is the vector representation of the promising step, c_i is the prototype embedding for the i-th logical category, and d(·,·) is the Euclidean distance between two vectors. p_i represents the probability of the i-th logical category, with ŷ as the predicted category. Euclidean distance is used here, following findings from [21], where it was shown to outperform other distance metrics.
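This pattern classification can be sketched as follows (toy 2-d vectors in place of the learned prototypes; all values are illustrative):

```python
import numpy as np

def classify_step(v, prototypes):
    """Predict the logical pattern of a promising step: softmax over
    negative Euclidean distances to the prototypes, as in the
    equations above. Returns (probabilities, predicted label)."""
    labels = list(prototypes)
    dists = np.array([np.linalg.norm(v - prototypes[k]) for k in labels])
    logits = -dists
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return dict(zip(labels, p)), labels[int(np.argmax(p))]

prototypes = {
    "conjunction":  np.array([1.0, 0.0]),
    "if-then":      np.array([0.0, 1.0]),
    "substitution": np.array([-1.0, 0.0]),
}
probs, pattern = classify_step(np.array([0.8, 0.1]), prototypes)
```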
The loss function is then defined as

L = −(1/N) Σ_{i=1}^{N} Σ_k y_{i,k} log p_{i,k},

where y_{i,k} is 1 if the k-th category is the gold logical pattern of the i-th case and 0 otherwise, and N is the total number of samples. We minimize this cross-entropy loss between the prediction and the ground truth. Through the prototypical network, we narrow the retrieval scope to focus on cases with analogous logical patterns. For instance, as shown in Figure 2, the promising step (s_1, s_2) exemplifies a conjunction pattern. Once the logical pattern is determined, we proceed to gather similar cases for demonstration.
We hypothesize that employing analogous single-step deductions can effectively assist a large language model in performing logical deduction. Through these illustrative examples, the model can better comprehend underlying logical patterns, enabling it to extrapolate from these patterns and generate logical outcomes for promising steps. Specifically, we retrieve cases from the corresponding case base based on the logical pattern identified in the previous phase. The top-k cases are selected based on their cosine similarity:

sim(v, v_c) = (v · v_c) / (‖v‖ ‖v_c‖),

where v represents the vector representation of the promising step, and v_c represents the vector representation of a case in the selected logical category.
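The retrieval step can be sketched as follows (toy vectors; in the paper, the case vectors come from the category matched by the prototypical network):

```python
import numpy as np

def retrieve_top_k(query_vec, case_vecs, k):
    """Rank the cases of the selected category by cosine similarity
    to the promising step and return the indices of the top-k,
    together with their similarity scores."""
    q = query_vec / np.linalg.norm(query_vec)
    C = case_vecs / np.linalg.norm(case_vecs, axis=1, keepdims=True)
    sims = C @ q                    # cosine similarity to each case
    top = np.argsort(-sims)[:k]     # indices sorted by descending score
    return top, sims[top]

cases = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, sims = retrieve_top_k(np.array([1.0, 0.1]), cases, k=2)
```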
3.2. Reuse Phase
Cosine similarity tends to emphasize entities within the query, often resulting in retrieval outcomes dominated by cases with similar entities. This homogeneity can limit the model’s performance in in-context generation. Studies [25,26] have shown that model performance can vary significantly based on the selection of in-context examples, with diverse demonstrations leading to better outcomes. To address this limitation, we apply an information entropy-based reranking mechanism to prioritize and rank the retrieved cases, introducing greater diversity into the results. The information entropy-based reranking ensures that the model encounters a broader range of distinct cases within the same logical pattern. By diversifying the retrieved cases, this approach enhances in-context learning, enabling the model to better learn from these varied demonstrations. This ultimately improves the model’s performance in generating logically consistent entailment trees.
The information entropy of a sentence s is calculated as follows:

H(s) = − Σ_{w ∈ s} p(w) log p(w),

where p(w) denotes the probability of each word w, calculated based on its frequency in the case base. Higher entropy indicates a sentence with more information content. Previous studies [52,53] have demonstrated that diverse demonstrations significantly improve the performance of large language models.
After retrieving a set of cases using cosine similarity, we rerank them based on information entropy to enhance diversity. The information entropy of each case is calculated as the sum of the entropies of its constituent sentences. We then select the top-n cases with the highest information entropy scores. The information entropy score of a case provides insight into its complexity, indicating the amount of information it contains. Recognizing that in-context learning is sensitive to the order of demonstrations, we organize the filtered cases by difficulty, arranging them from easy (low entropy) to difficult (high entropy), following a curriculum learning approach [28].
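The reranking and curriculum ordering can be sketched as follows (a minimal sketch; we assume here that a case is scored by the entropy of its text, with unigram probabilities estimated from case base frequencies as described above):

```python
import math
from collections import Counter

def word_probs(case_base_text):
    """Unigram probabilities p(w) from word frequencies in the case base."""
    words = case_base_text.lower().split()
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}

def entropy(text, probs):
    """H(s) = -sum over the distinct words of s of p(w) * log p(w)."""
    return -sum(probs[w] * math.log(probs[w])
                for w in set(text.lower().split()) if w in probs)

def select_and_order(cases, probs, top_n):
    """Keep the top_n highest-entropy cases, then present them from
    easy (low entropy) to hard (high entropy), curriculum-style."""
    kept = sorted(cases, key=lambda s: entropy(s, probs),
                  reverse=True)[:top_n]
    return sorted(kept, key=lambda s: entropy(s, probs))
```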
3.3. Refine Phase
In the refine phase, the ordered cases are concatenated and used as demonstrations to guide a pretrained language model for logical text generation. We use the Llama 3 [54] model as the backbone. Each selected case (s_1, s_2, c) is translated into a natural language format as follows: s_1 [AND] s_2 [ENTAIL] c. Through in-context learning, the LLM generates logical text based on these demonstrations. An example prompt for case-based deduction using five cases is shown in Figure 3.
Following this case-based deduction process, the controller facilitates reasoning. Since entailment trees are built iteratively, with each step expanding the reasoning space, we employ beam search for effective reasoning. The process continues until the target hypothesis is proven or the maximum reasoning depth is reached.
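Assembling the in-context prompt can be sketched as follows (the exact wording of the prompt in Figure 3 may differ; the [AND]/[ENTAIL] verbalization and the demonstration example here are illustrative):

```python
def build_prompt(demonstrations, s1, s2):
    """Concatenate the ordered cases as demonstrations, each verbalized
    as "p1 [AND] p2 [ENTAIL] conclusion", then append the current
    promising step for the LLM to complete."""
    lines = [f"{a} [AND] {b} [ENTAIL] {c}" for a, b, c in demonstrations]
    lines.append(f"{s1} [AND] {s2} [ENTAIL]")  # conclusion left blank
    return "\n".join(lines)

demo = [("metal is a conductor", "copper is a metal",
         "copper is a conductor")]
prompt = build_prompt(demo, "Chocolate is usually a solid",
                      "Chocolate is a kind of substance")
```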
4. Experiment
4.1. Benchmark Dataset
We conducted experiments using the EntailmentBank dataset [6], which contains 1840 expert-annotated entailment trees. Each tree corresponds to a hypothesis derived from a question and its correct answer in the ARC dataset [55]. The leaf facts of these trees are selected from the WorldTreeV2 corpus [56]. The statistics for the EntailmentBank dataset are detailed in Table 2.
We chose EntailmentBank because it offers a structured entailment reasoning task, where multistep reasoning is crucial to constructing logically consistent entailment trees. This dataset provides a well-established benchmark for evaluating models that generate logical entailments and intermediate reasoning steps, aligning with the capabilities of our proposed method. EntailmentBank’s tasks involve reasoning from multiple premises to conclusions, which allows us to test CBD’s ability to retrieve, reuse, and refine logical patterns, making it an ideal dataset for assessing multistep reasoning models.
Following [6], three explanation tasks are defined based on a hypothesis (question + answer), each with increasing difficulty: (a) generating a valid entailment tree given all relevant sentences (the leaves of the gold entailment tree), (b) generating the tree given all relevant sentences along with some irrelevant ones, or (c) generating the tree given a full corpus. The objective is to produce explanations in the form of entailment trees that represent a sequence of multipremise entailment steps from known facts through intermediate conclusions to the final hypothesis (the question + answer).
The case base used in our study comprises the entailment steps from the training dataset. We utilized the manually annotated dataset from [8], which includes 400 annotated reasoning steps in the training split (Train-manual) and 275 steps in the development split (Dev-manual). Table 3 presents the statistics of the logical pattern annotations. These entailment steps are categorized into three types: conjunction, if–then, and substitution patterns. We used the training dataset to train the prototypical network and the development dataset to evaluate and optimize its performance.
4.2. Baseline Models
To evaluate the effectiveness of our proposed method, we conducted a comparative analysis between CBD and several existing methods:
EntailmentWriter [6]: This method generates the entire entailment tree “all at once”, directly producing linearized trees. Hong et al. [8] extended this approach by developing EntailmentWriter-Iter, which iteratively generates individual steps of the tree and then concatenates them to form the final tree.
IRGR [10]: This method designs an iterative retrieval-generation framework that enhances retrieval outcomes. It iteratively retrieves premises and constructs step-by-step explanations from textual evidence.
MetGen [8]: This method employs a reasoning controller and a T5 model to independently select premises and generate conclusions. For a fair comparison, we also experimented with replacing the T5 model with Llama 3 [54].
In our comparison, we integrated CBD with the controller from MetGen to effectively select steps, which then facilitated the generation of logical intermediate conclusions.
4.3. Evaluation Metrics
We evaluate entailment trees in a two-step process. First, nodes in the predicted tree T_pred are aligned with nodes in the gold tree T_gold, using their labels for leaf nodes and Jaccard similarity for intermediate nodes, aiming to capture semantic equivalences rather than exact matches. Next, we assess the generated tree by comparing it to the gold tree across three dimensions, following established methodologies from previous studies [6,7,8]:
Leaves: This dimension evaluates the use of correct leaf facts. We calculate the F1 score by comparing the predicted leaf facts with those in the gold tree. Additionally, we report the AllCorrect score, which indicates exact matches: the AllCorrect score is 1 when the F1 score is 1, and 0 otherwise.
Steps: This dimension assesses the structural correctness of entailment steps. We compare the steps in both trees using the F1 and AllCorrect scores. A predicted step is considered correct if its children’s identifiers perfectly match those in the gold tree.
Intermediates: This dimension measures the accuracy of intermediate conclusions, also using the F1 and AllCorrect scores. A predicted intermediate conclusion is considered correct if the BLEURT-Large-512 [57] score between the aligned predicted intermediate and the corresponding gold intermediate is greater than 0.28. Using BLEURT allows us to account for semantic equivalence between the generated and gold conclusions beyond surface-level matching, which is critical for assessing complex multistep reasoning tasks.
Finally, we employ a strict metric called Overall AllCorrect. This score is 1 only if all leaves, steps, and intermediates in the generated tree are correct, meaning the tree exactly matches the gold tree. Any error in the tree results in a score of 0 under this metric.
We chose these evaluation metrics because they directly measure both the structural and semantic accuracy of the generated entailment trees. By using a combination of F1 and AllCorrect scores, we ensure that the model is evaluated on both partial correctness and exact matches, providing a nuanced view of its performance. Additionally, the inclusion of BLEURT-Large-512 for intermediate conclusions allows us to capture semantic similarity, which is essential for assessing the quality of multistep reasoning. These metrics have been widely used in prior work, ensuring that our evaluation is comparable with other models in the field and that our method’s improvements are consistently measured across all reasoning dimensions.
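As an illustration, the leaf-level metrics can be sketched as follows (a minimal sketch under the definitions above; node alignment and the step/intermediate dimensions are omitted):

```python
def leaf_scores(pred_leaves, gold_leaves):
    """F1 over predicted vs. gold leaf facts, plus the strict
    AllCorrect score, which is 1 only when F1 equals 1."""
    pred, gold = set(pred_leaves), set(gold_leaves)
    tp = len(pred & gold)          # leaves appearing in both trees
    if tp == 0:
        return 0.0, 0
    precision, recall = tp / len(pred), tp / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, int(f1 == 1.0)
```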
4.4. Implementation Details
All models and baselines were implemented using the PyTorch framework [58] and Huggingface Transformers [59]. During the logical generation in the refine phase, we used Llama 3 [54] as the pretrained language model. For Task 1, we iterated until all given facts were used. For Tasks 2 and 3, we applied the same settings as MetGen to ensure a fair comparison, using a fact score threshold of 0.001 to filter out distractors and a maximum reasoning depth of 5. We selected the top 10% of steps for each state and set the beam size to 10. The prototypical network architecture is based on the DeBERTa model [51], with the final layer producing a 768-dimensional embedding for each case. The network was trained for 30 epochs with an initial learning rate of 2 × 10⁻⁵ and a batch size of 8, minimizing the cross-entropy loss with the AdamW optimizer [60].
8. Conclusions and Future Research
Existing methods for entailment tree generation often struggle to maintain logical consistency within individual steps. To address this challenge, we introduce a novel approach called case-based deduction (CBD), which retrieves cases with similar logical patterns to the input pairs, ensuring consistent logical generation. Within the retrieve-reuse-refine paradigm, we use a prototypical network to identify logical patterns, enabling the accurate selection of analogous cases. By ranking these cases based on information entropy scores, we diversify the demonstrations and effectively teach a large language model the foundational logical patterns for reasoning. Our experimental results show that this method significantly improves the quality of entailment tree generation compared with state-of-the-art approaches. Additionally, using a prototypical network in the retrieval phase provides interpretable logical pattern induction.
While our current approach has demonstrated effectiveness in entailment tree generation tasks, its potential extends beyond this specific application. The principles of CBD, retrieving and reusing logical patterns from similar cases, can be applied to a broader range of reasoning domains. For example, CBD could be leveraged in commonsense reasoning to infer conclusions based on everyday scenarios by retrieving cases with similar logical structures. In legal reasoning, the method could assist in drawing parallels between legal precedents and current cases. Additionally, mathematical proof generation could benefit from CBD’s ability to retrieve similar logical steps from previously solved proofs, aiding in new problem-solving tasks.
In future work, we plan to address the limitations identified in our current approach and focus on specific improvements to enhance the performance and flexibility of case-based deduction. One promising direction is the exploration of alternative retrieval mechanisms. While our current method leverages a prototypical network to retrieve cases based on logical patterns, we aim to investigate retrieval strategies that adapt to diverse reasoning contexts. For instance, context-aware retrieval mechanisms or reinforcement learning techniques could allow the model to better select cases tailored to specific problem domains. Another area for improvement is integrating more complex and diverse logical patterns into the prototypical network. This enhancement would allow the model to capture a broader range of reasoning strategies, particularly for more challenging entailment tasks. By expanding the variety of logical patterns the network can recognize and retrieve, we expect the model to handle diverse and nonbinary reasoning structures more effectively. We also see value in incorporating human evaluation to complement the current automated metrics. Human evaluators can provide additional insights into aspects like interpretability, logical flow, and the practical utility of generated entailment trees, which may not be fully captured by automated evaluation methods.