Article

Investigating Self-Rationalizing Models for Commonsense Reasoning

Fanny Rancourt, Paula Vondrlik, Diego Maupomé and Marie-Jean Meurs
Department of Computer Science, Université du Québec à Montréal (UQAM), Montreal, QC H2L 2C4, Canada
* Authors to whom correspondence should be addressed.
Stats 2023, 6(3), 907-919; https://doi.org/10.3390/stats6030056
Submission received: 3 August 2023 / Revised: 19 August 2023 / Accepted: 23 August 2023 / Published: 29 August 2023
(This article belongs to the Special Issue Machine Learning and Natural Language Processing (ML & NLP))

Abstract

The rise of explainable natural language processing has spurred a wealth of work on datasets augmented with human explanations, as well as technical approaches to leverage them. Notably, generative large language models offer new possibilities, as they can output a prediction together with an explanation in natural language. This work investigates the capabilities of fine-tuned text-to-text transfer Transformer (T5) models for commonsense reasoning and explanation generation. Our experiments suggest that while self-rationalizing models achieve interesting results, a significant gap remains: classifiers consistently outperform self-rationalizing models, and a substantial fraction of model-generated explanations are not valid. Furthermore, training with expressive free-text explanations substantially altered the inner representation of the model, suggesting that they supply additional information and may bridge the knowledge gap. Our code is publicly available, and the experiments were run on open-access datasets, allowing full reproducibility.

1. Introduction

Over recent years, the increased predictive power of artificial intelligence (AI) methods has heralded the practical deployment of AI systems in various sectors of human activity. In turn, this has renewed research interest in the relationship between these systems and their human operators. A key component of this relationship is trust. Inadequate trust dynamics between automated systems and their operators can result in suboptimal performance; for instance, human over-reliance on automated systems can lead to poor outcomes. Trust in AI systems, however, is not merely a matter of reliability, i.e., decreased error. Instead, favorable trust dynamics are best served by frameworks allowing the appropriate calibration of trust between human and autonomous agents [1,2]. Improving the ability of humans to calibrate trust in automated tools is a matter of transparency: human operators should understand the inner workings of the tool in question, either in general or for a given prediction [3,4].
Exercising transparency is a key concern in explainable AI (XAI) research, which seeks to make AI methods and their models more interpretable by human observers in order to maintain intellectual oversight over them [5,6,7]. Although XAI research concerns a variety of types of data and fields of application, the present work focuses on explainability in natural language processing (NLP) methods. Modern NLP methods leverage heavy representation learning in order to grapple with the irregularities of human language, making explainability in NLP a burgeoning field of research, with several avenues for development. One such avenue is the use of models explicitly trained to provide textual explanations for their predictions [8,9,10]. Indeed, models can be trained to replicate explanations provided by human annotators for individual data points, much in the same manner as they are trained for the task of interest. Thus, models need not be structured in an inherently interpretable manner but may extract patterns of explanation from the data. There are several limitations to such a framework. A primary concern is whether model-generated explanations indeed reflect the decision-making process of the model—are faithful—or shallowly replicate explanation formulae. It has been argued that faithfulness can be achieved by imposing structural constraints on models [11]. For example, models can be structured to make predictions based on explanations, thus becoming faithful by construction [12]. Another concern is how these explanations might alter the learning of the primary task they intend to explain. Continuing the previous example, an inaccurate explanation may hinder the ability of a model to produce an accurate prediction. On the contrary, training with explanations has been suggested to help task acquisition in some instances by providing supplementary information. This could be the case for tasks such as commonsense reasoning and natural language inference, where the input is not sufficient to complete the task, and world knowledge is needed to bridge the gap [11].
The present work concerns itself with self-rationalizing models, which jointly produce predictions and their associated explanations, achieving promising results [13]. In doing so, self-rationalizing models do not constrain their prediction and explanation components to rely directly on the output of one another. However, it is unclear how considering explanations in the training process affects a model. For example, gaps in prediction quality between self-rationalizing and prediction-only models have been reported [11]. The present work seeks to further study this issue.
We begin by measuring the effects of explanations and their type on performance in commonsense reasoning and natural language inference tasks. In order to make an equitable comparison between approaches with different explanation configurations, we unite them under a flexible common base, the text-to-text transfer Transformer (T5) model [14]. This allows the treatment of the task target and accompanying explanation as a single target sequence. Our results indicate, in agreement with previous work, that explanations, whether given as excerpts or as free-text, hurt prediction performance. This stands in contrast with human learning, where self-explanation helps learning [15].
Subsequently, we examine whether the presence of explanations causes models to rely on data artifacts, as has been observed in natural language inference (NLI) tasks [16]. Model-generated free-text explanations tend to be structured [8] and appear to rely on a handful of rare formulae from the training data. Lastly, we investigate whether the presence and type of explanations considered in training alter the influence that individual examples may have on the training process, specifically, which examples become more influential. We observe that some types of explanations have a significant effect on the inner representation of the model.
This paper is structured as follows. Section 2 discusses human explanations and their usage in explainable NLP. Section 3 provides technical information on our experiments. Section 4 presents our results, and Section 5 concludes this paper.

2. Background

2.1. Explainable NLP

Many approaches and models in modern NLP are not built with interpretability in mind. Therefore, much work has been devoted to developing means of explaining model behavior a posteriori, by examining the models themselves or their predictions. For example, one can employ feature attribution [17,18] or instance attribution methods [19] to probe models for the significance of specific features or observations. Nonetheless, the faithfulness of these post hoc explanations can vary [20,21]. Similarly, a widespread approach in NLP is to interrogate the values produced by the attention mechanism [22]—a ubiquitous content-based weighting of concurrent parts, e.g., the words in a sentence. Simply put, the distribution of attention can be attributed an explanatory value: parts receiving larger attention values are deemed to play a larger role in a prediction. However, the soundness of this attribution is debated [23,24,25,26].
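To make the attention-reading approach concrete, the following is a minimal sketch assuming a HuggingFace encoder; the model name and the choice to average over layers and heads are illustrative assumptions, not a procedure used in this paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Minimal sketch: rank input tokens by the attention they receive.
# The model name and the averaging over layers and heads are assumptions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The trophy did not fit in the suitcase because it was too big.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of (batch, heads, seq, seq) tensors, one per layer.
attn = torch.stack(outputs.attentions).mean(dim=(0, 2))  # average over layers and heads
received = attn[0].mean(dim=0)                           # attention received by each token

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in sorted(zip(tokens, received.tolist()), key=lambda p: -p[1])[:5]:
    print(f"{tok:>12s}  {score:.3f}")
```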
Rather than probing models a posteriori in order to elucidate their inner workings, one can constrain models to produce explanations for their predictions. That is, models are built to accompany their output with a human-interpretable justification. These explanations justify an individual occurrence rather than teach generalized theories, the latter being out of scope for most of the explainability literature. In NLP, these explanations—or rationales—are text sequences. We use the terms rationales and explanations interchangeably in this work, referring the reader to  Jacovi and Goldberg [20] for a discussion on the nuances of both terms. There exist several competing notions regarding the use of rationales, namely surrounding the constraints on explanations and their relationship to the output. Indeed, the explanations to be produced can be constrained in their structure in different manners. Similarly, different dependencies can be established between the output and accompanying rationale. For example, a model can be trained to produce its output as a function of the rationale or vice versa. These choices in model and explanation constraints articulate different considerations regarding the role of explanations in prediction.
In any case, models are trained to produce these explanations based on supervision by human-generated explanations. This has the advantage of favoring model explanations that are adequate by construction. Of course, training models on human-generated explanations exposes the framework to the quality of explanation data, which may be difficult to verify. Further, it requires an understanding of the nature of human explanations in order to gain perspective on model behavior. Lastly, such a framework assumes that explanations are intended to justify a prediction to a human observer rather than to teach them what the model has inferred. In other words, the model is the explainee, rather than the user. The latter paradigm is out of the scope of the present work, which focuses on commonsense reasoning.

2.2. Human Explanations

Human explanations tend to be contrastive in order to fit the decision border [27]. Thus, it is natural for an explanation of a classification prediction to not be self-contained, referring instead to other classes. This is particularly true for commonsense reasoning tasks with distractor choices: explanations tend to highlight why these distractors are unreasonable alternatives [9]. Further, explanation is also a co-adaptive process in which the explainer and the explainee collaborate to obtain a satisfactory explanation [15]. Since humans have a limited capacity to process information, they tend to prefer simple explanations, which are less specific and cite fewer causes, over more plausible ones [27]. As explanations are seldom self-contained, they are social: the explainer states what is necessary and the explainee leverages their knowledge of the situation to contextualize the explanation [27]. While earlier work in explainability focused on delivering one explanation, advances in reinforcement learning with human feedback are a promising avenue for large language models (LLMs) to align with user preferences [28], thus approaching a co-adaptive process.

2.3. Rationales

It is well established that humans can improve their learning and understanding of a given context through self-explanation [15]. However, it is unclear if this holds for machine learning models. Differences can arise from the format constraints of rationales. The two most prominent choices are highlights and free-text rationales. A highlight is a (noncontiguous) subsequence of the input text. These excerpts are the key portions of the input supporting the prediction. As such, they constitute extractive explanations. In contrast, abstractive explanations produce new text intended to explain a decision. Namely, a free-text explanation is one written in natural language without restricting its format. This terminology is borrowed from Wiegreffe and Marasović [8]. Previous results suggest that models considering human-generated highlights achieve higher accuracy [29], require fewer observations to achieve similar performance [30], and improve model explanations [31]. For free-text explanations, strong results can be achieved through a generative pre-trained Transformer (GPT) [32] model fine-tuned on human-annotated explanations [9], while a gap subsists for T5 [11].
Explanations also benefit data quality: the need to explain a decision tends to improve its accuracy [33]. Although the time required for annotation with both label and explanation is greater at first, it largely decreases over time to nearly equal the time required to annotate labels only [34]. Furthermore, when restricting explanations to highlights, strong interannotator agreement can be observed [30], which can in turn be easily monitored to ensure explanation quality.

2.4. Self-Rationalizing Models

The production of datasets augmented with human-generated explanations gained prominence recently [33,34], enabling explainable NLP work. Indeed, two paradigms emerged: pipeline models (see [35])—one model generates the decision and another the explanation—and self-rationalizing models [9,13]—which jointly output both. This definition is adapted from the work of [11]. These approaches are illustrated in Figure 1. Multiple configurations can be used for pipeline models: input → output → explanation—any post hoc explainability method [17,18,19] can provide a plausible explanation, but its faithfulness is not guaranteed [20]—and input → explanation → output—deemed faithful by construction, as the first model acts as an “evidence extractor” and the prediction model can only rely on this evidence. The main limitation of this latter approach is that the quality of the first model may limit the accuracy of the whole [36].
In contrast, self-rationalizing models produce labels and explanations jointly. Because they generate the label and an explanation from the same representation, this explanation is deemed introspective [37] and can be faithful [11]. Early work revolved around attention supervision with human-generated highlights (see [23]). While sufficient highlights [35] are favored in annotation guidelines over comprehensive ones [8], they may be too sparse to understand model behavior [38]. With the rise of generative LLMs, self-rationalizing models now have the ability to output the label as well as an explanation in natural language [9,13]. Moreover, LLMs such as GPT [39] and T5 [14] benefit from their large pretraining corpora to distill world knowledge, which is helpful when explaining commonsense reasoning decisions.

3. Experimental Setup

3.1. Data

The CommonsenseQA [40] dataset supports a multiple-choice question-answering task which seeks to evince the capacity of language models to leverage real-world knowledge to make inferences. The first version of the dataset, v1.0, comprises 7.5 k examples with 3 options per question, only 1 of which is correct. Version 1.11 incorporates 2 additional so-called distractor choices per question and provides additional observations, for a total of 11 k. The Commonsense Explanation (Cos-E) datasets [9] are augmented versions of these datasets, introducing human-generated rationales, highlights and free-text alike. Although the authors performed quality checks during collection, the free-text explanations of v1.11 tend to be of lesser quality [13]. Our experiments are carried out on both versions. All of our analyses were conducted on the validation sets to access ground-truth annotations. Unanswered test sets are available at https://github.com/salesforce/cos-e (accessed on 1 August 2023) for v1.0 and https://www.tau-nlp.sites.tau.ac.il/commonsenseqa (accessed on 1 August 2023) for v1.11.
Similarly, the e-SNLI dataset [10] is an extension of the Stanford NLI dataset [41]. It comprises 570 k pairs of sentences, a premise and a hypothesis, together with a label categorizing their logical relationship. This label designates whether the hypothesis entails, contradicts, or is neutral with respect to the premise. Thus, the task consists in parsing the sentence pair and placing it in one of these three classes. Free-text rationales were collected through crowdsourcing [10]. Annotation guidelines encouraged annotators to provide self-contained rationales. Three separate rationales are provided for each observation in the validation and test sets, while training examples only have one. Furthermore, as a first step of the annotation process, before composing a free-text rationale, annotators were required to highlight the words of the input (both premise and hypothesis) that they deemed essential to categorizing their relationship. These selections are used as highlight rationales [35].
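For reference, both corpora are distributed through the HuggingFace Hub; the sketch below assumes the dataset identifiers cos_e (configurations v1.0 and v1.11) and esnli, as well as their field names, which may differ from the exact loading code used for our experiments.

```python
from datasets import load_dataset

# Hedged sketch: dataset identifiers and field names are assumptions about the
# HuggingFace Hub versions, not necessarily the loading path used in this work.
cose_v10 = load_dataset("cos_e", "v1.0")    # question, choices, answer, extractive/abstractive explanation
cose_v111 = load_dataset("cos_e", "v1.11")
esnli = load_dataset("esnli")               # premise, hypothesis, label, explanation_1..3

print(cose_v10["validation"][0])
print(esnli["validation"][0])
```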

3.2. Model

The text-to-text transfer Transformer (T5) model is an approach to multitask learning based on reframing multiple natural language tasks as text-to-text tasks [14]. As a sequence-to-sequence model, it provides a unified framework for different rationale formats—label-only, label + highlights, and label + free-text—as the label and rationale can be presented as a single sequence, thus eliminating the need for adjustments in model architecture. This is illustrated in Figure 2. Such a framework can obtain strong performance [13]. For the present experiments, T5-base was fine-tuned on each dataset and with each rationale configuration, resulting in nine different models. An overview of the methodology is presented in Figure 3. Hyperparameters are constant across these models [11]: dropout is set to 0.1, batch size to 64, and patience for early stopping to 10. Implementation, software, and hardware details are provided in Appendix A. At test time, output sequences are decoded greedily, with a length cut-off set to the 95th percentile of the length of target sequences in the training set (20 for Cos-E; 33 for e-SNLI). The quality assessment of the models is two-fold. First, classification power is measured with accuracy, as answer choices are question-dependent. Second, output validity is assessed according to the following criteria: an answer is deemed valid if it is one of the available choices, and a valid highlight is one that is a subsequence of the input, while free-text explanations do not follow any particular structure [8].
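The following is a minimal sketch of this single-sequence framing with the HuggingFace implementation of T5; the prompt wording, the explanation separator, and the example itself are illustrative assumptions rather than the exact templates used in our experiments.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def build_example(question, choices, answer, rationale=None, mode="label-only"):
    # Hedged sketch: the prompt wording and the "explanation:" separator are
    # assumptions, not the verbatim templates used for the reported results.
    source = f"question: {question} choices: {', '.join(choices)}"
    target = answer
    if mode in ("label + highlight", "label + free-text") and rationale is not None:
        target = f"{answer} explanation: {rationale}"
    return source, target

src, tgt = build_example(
    question="Where would a great teacher be an inspiration?",
    choices=["school", "concert", "office"],
    answer="school",
    rationale="teachers inspire students in schools",   # hypothetical rationale
    mode="label + free-text",
)

batch = tokenizer(src, return_tensors="pt")
labels = tokenizer(tgt, return_tensors="pt").input_ids
loss = model(**batch, labels=labels).loss       # standard seq2seq cross-entropy
print(float(loss))

# Greedy decoding at test time, with a length cut-off (e.g., 20 tokens for Cos-E).
generated = model.generate(**batch, max_length=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```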

3.3. Instance Attribution

The importance of training examples is estimated with influence functions (IFs) [19,42]. Influence functions seek to measure the effect of particular training data on the resulting model. This is carried out by upweighting training examples and observing the change in loss value, similarly to jackknife resampling. A training example incurring a large loss variation will be deemed influential. Furthermore, the sign of said variation will indicate whether this example helps or hurts model predictions with regard to the loss. Given a training set $D = \{ x_k : 1 \le k \le n \}$, a machine learning algorithm will seek to minimize the loss, $L$, a function of model parameters, $\theta$: $\arg\min_{\theta} \sum_{k} L(x_k, \theta)$. Upweighting a given observation, $x_i \in D$, by $\epsilon$, this objective becomes $\arg\min_{\theta} \sum_{k} L(x_k, \theta) + \epsilon L(x_i, \theta)$. Let $\hat{\theta}$ and $\hat{\theta}_{\epsilon,i}$ be the solutions to, respectively, the original and amended problems. One can compute the influence of $x_i$ on the model prediction for some unseen test observation, $x_t$, by
$$I(x_t, x_i) := \left. \frac{\mathrm{d} L(x_t, \hat{\theta}_{\epsilon,i})}{\mathrm{d}\epsilon} \right|_{\epsilon = 0} = -\nabla_{\theta} L(x_t, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_{\theta} L(x_i, \hat{\theta})$$
This equation, in which $H_{\hat{\theta}}$ denotes the Hessian of the training loss at $\hat{\theta}$, allows the computation of influence without explicitly determining $\hat{\theta}_{\epsilon,i}$. Although it only holds for loss functions that are twice differentiable and convex, which will seldom be the case in machine learning settings, it can serve as an adequate approximation for Transformer models [16,43].
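To illustrate the formula in a setting where it holds exactly, the following is a minimal sketch on a small convex problem (L2-regularized logistic regression) with an explicit Hessian; it is not the FastIF pipeline used in our experiments, which replaces the explicit inverse with stochastic Hessian-vector-product estimates.

```python
import torch

torch.manual_seed(0)

# Toy convex setting: L2-regularized logistic regression on synthetic data.
n, d = 200, 5
X = torch.randn(n, d)
y = (torch.rand(n) < torch.sigmoid(X @ torch.randn(d))).float()
theta = torch.zeros(d, requires_grad=True)

def point_loss(theta, x, y):
    # Per-example negative log-likelihood L(x, theta).
    return torch.nn.functional.binary_cross_entropy_with_logits(x @ theta, y, reduction="sum")

def total_loss(theta):
    return point_loss(theta, X, y) + 1e-2 * theta.dot(theta)

# Fit theta_hat on the full training set.
optimizer = torch.optim.LBFGS([theta], max_iter=200)
def closure():
    optimizer.zero_grad()
    loss = total_loss(theta)
    loss.backward()
    return loss
optimizer.step(closure)

# Influence of training point x_i on test point x_t:
#   I(x_t, x_i) = -grad L(x_t, theta_hat)^T  H^{-1}  grad L(x_i, theta_hat)
x_t = torch.randn(d)
y_t = torch.tensor(1.0)
grad_t = torch.autograd.grad(point_loss(theta, x_t.unsqueeze(0), y_t.unsqueeze(0)), theta)[0]
H = torch.autograd.functional.hessian(total_loss, theta.detach())

influences = []
for i in range(n):
    grad_i = torch.autograd.grad(point_loss(theta, X[i:i + 1], y[i:i + 1]), theta)[0]
    influences.append(-grad_t @ torch.linalg.solve(H, grad_i))

influences = torch.stack(influences)
print(influences.topk(3))   # largest scores: points whose upweighting most increases the loss on x_t
```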
However, a difficulty in its application is its computational complexity. We rely on FastIF [43] to speed up computation and consider the 1000 closest neighbors. We use the average of token embeddings as a sentence embedder [44]. Other hyperparameters follow the recommendations of [43]: $10^3$ iterations for the Hessian–vector product estimation with a batch size of 1, a scale parameter of $10^4$, and a damping parameter of $5 \times 10^{-3}$. Given the large size of the e-SNLI test set (∼10 k data points), we perform instance attribution on a sample of 500 data points.

4. Results and Discussion

4.1. Task Performance

Task performance results for all datasets are presented in Table 1. For Cos-E, the accuracy of each model is far from human performance, while remaining substantially better than random (v1.0: 33%; v1.11: 20%). For both versions of the dataset, the models can be ranked as per their accuracy in the following manner:
label-only > label + free-text > label + highlight.
This ordering corresponds to previous results from [9] (though they did not consider a T5 model) and [11], which also report a decrease in task performance for self-rationalizing models. These results further suggest that highlights may be inappropriate explanations for commonsense reasoning tasks. For e-SNLI, however, these differences become very small, consistent with previous work [11]. The increased task performance on this dataset relative to Cos-E may be due to a variety of factors. Some simple, numerical considerations are the difference in dataset size (e-SNLI being ∼57 times larger) and its labels forming a fixed rather than variable set. A more complex consideration, and one more difficult to quantify, is the degree to which the so-called world knowledge embedded in the task may be required at test time or may be delivered in training.

4.2. Output Validity

Because T5 models frame their tasks as text-to-text tasks, the validity of the output—answer, highlight, and free-text explanation—must be verified. In particular, there is no guarantee that a model will produce a valid label. In practice, this is seldom a problem: regardless of the generated explanation (or lack thereof), all models are appropriate classifiers that seldom predict invalid answers (<0.5%).
Similarly, highlight explanations produced by the models must also be verified. Highlight explanations are subsequences of their input sequence. In training, this format is only enforced through supervision, i.e., the target explanations constitute valid highlights. However, the model is not otherwise constrained to adhere to this format. In particular, at prediction time, there is no guarantee that the explanation sequence provided will be a valid highlight. Nonetheless, the proportion of invalid highlights observed is small across datasets: 1.9% for Cos-E v1.0, 1.4% for v1.11, and 0.1% for e-SNLI. It should be noted that validity is computed for our purposes by verifying independently whether each token is present in the input, ignoring character case. This does not invalidate overlapping highlights. Examining the relationship between prediction correctness and highlight validity reveals that accuracy for examples resulting in invalid highlights is fairly close to overall accuracy (v1.0: 61.1%, v1.11: 52.9%, and e-SNLI: 100%). Further, manual inspection of these examples reveals that a majority of these discrepancies are attributable to differences in word inflection, e.g., “spending all your time” rather than “spend all your time”. We contend that the nature of highlights is at odds with model pretraining and conjecture that this conflict causes the above behavior: the models were pretrained to produce not arbitrary sequences of words but grammatically sound ones (or at least linguistically plausible ones). This would drive models to “correct” would-be highlights into proper phrases.
As for valid highlights, their overlap with ground-truth highlights is fairly low, with a mean Jaccard index of 36.9% for Cos-E v1.0, 42.8% for v1.11, and 60.8% for e-SNLI. Interestingly, the overlap increases drastically from incorrect predictions to correct ones for e-SNLI (Jaccard index: 30.4% to 64.1%), which is not the case for Cos-E. This may be due to Cos-E ground-truth highlights covering a larger portion of the input (v1.0: 47.7%, v1.11: 56.7%, e-SNLI: 19.8%). Moreover, Cos-E highlights are more concentrated. We compute the ratio of the distance in words between the last and first highlighted words to the total number of highlighted words; a ratio of 1 thus indicates that the highlighted words form a contiguous subsequence. Cos-E highlights span 1.07 times their own length on average, compared with 2.95 for e-SNLI. These two observations—the greater length and higher concentration of highlights in Cos-E—may indicate summarily selected excerpts, which we contend to be less informative than sparser, noncontiguous highlights. This could be addressed in annotation by way of a highlighting budget.
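The highlight checks described above can be summarized as follows; the sketch assumes whitespace tokenization and lowercasing, which simplifies the actual preprocessing.

```python
def highlight_metrics(input_text, predicted_highlight, gold_highlight):
    """Sketch of the highlight checks used in this section; whitespace
    tokenization and lowercasing are simplifying assumptions."""
    inp = input_text.lower().split()
    pred = predicted_highlight.lower().split()
    gold = gold_highlight.lower().split()

    # Validity: every predicted token must appear somewhere in the input
    # (repeated or overlapping tokens are not penalized).
    valid = all(tok in inp for tok in pred)

    # Overlap with the ground-truth highlight, as a token-set Jaccard index.
    jaccard = len(set(pred) & set(gold)) / max(len(set(pred) | set(gold)), 1)

    # Concentration: span covered by the gold highlight in the input, divided by
    # the number of highlighted words (a ratio of 1 means a contiguous highlight).
    positions = [i for i, tok in enumerate(inp) if tok in set(gold)]
    concentration = ((positions[-1] - positions[0] + 1) / len(positions)
                     if positions else float("nan"))
    return valid, jaccard, concentration

print(highlight_metrics(
    "a man is playing an electric guitar on stage",
    "electric guitar",
    "playing an electric guitar"))
# (True, 0.5, 1.0)
```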
In turn, free-text explanations are not required to adhere to a specific format. However, generated explanations do appear to follow certain patterns. For Cos-E, the most prominent one is “<answer> is the only <something> that […]”, which is generated for 46.3% of examples in v1.0, although it appears in only 4.2% of the training data. The same pattern is noticeable for v1.11, with 10.8% of generated explanations following the template even though only 0.7% of the training data follow it. This observation is surprising, as a very similar template, “<answer> is the only option that is correct|obvious”, was targeted for reannotation in [9]; it also suggests that structured explanations may be more appropriate for this task than they are for NLI [10]. Nevertheless, this is consistent with observed annotation practices: annotators tended to justify an answer with a contrastive explanation [9]. Unfortunately, explanations following this template tend to be uninformative—even nonsensical—as shown in Table 2. In e-SNLI, prominent patterns are “Not all <something> are <something else>” and “A <something> is [not] <something else>”. Selected examples are presented in Table 3. These patterns, akin to predicate quantification, are similarly overrepresented in generated explanations: they make up, respectively, 8.6% of generated explanations against 2.8% of training examples, and 31.2% against 11.7%.
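A simple way to quantify such patterns is regular-expression matching over generated and training explanations; the regexes below are rough approximations of the quoted templates, not the exact rules behind the reported percentages.

```python
import re

# Approximate patterns for the templates quoted in the text (assumptions).
TEMPLATES = {
    "only_x_that": re.compile(r"^.+ is the only \w+ that .+", re.IGNORECASE),
    "not_all": re.compile(r"^not all .+ are .+", re.IGNORECASE),
    "a_is": re.compile(r"^an? .+ is (not )?.+", re.IGNORECASE),
}

def template_rates(explanations):
    """Fraction of explanations matching each template."""
    counts = dict.fromkeys(TEMPLATES, 0)
    for text in explanations:
        for name, pattern in TEMPLATES.items():
            if pattern.match(text.strip()):
                counts[name] += 1
    total = max(len(explanations), 1)
    return {name: count / total for name, count in counts.items()}

generated = [
    "bible is the only book that is a book",
    "Not all couples are married.",
    "A banjo is not an electric guitar.",
]
print(template_rates(generated))   # each template matches one of the three examples
```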

4.3. Probing Model Behavior

Our experiments indicate that the embedding space differs widely when the model is trained with certain types of explanations, depending on the task. Indeed, analyzing the closest neighbors of test (or validation) examples reveals sharp drops in overlap for specific explanation types. In the case of Cos-E v1.0, the closest neighbors as per the label-only and label + highlights models are fairly constant (average overlap of 86%). This proportion drops to roughly 20% when considering the model generating free-text explanations. In contrast, the average neighbor overlaps for v1.11 range from 63% to 77%. This suggests that free-text explanations add substantial information to the inner representation of the model, while highlights do not. Additionally, though the label-only and label + highlights models have similar inner representations, their respective influence functions do not correlate, which indicates the presence of the Rashomon effect [45]: the multiplicity of interpretations of the same facts. Finally, our results invalidate the hypothesis that training examples whose free-text explanation follows the aforementioned templates are highly influential. Indeed, the observed correlations (Pearson and Spearman) neared 0, indicating the need for a deeper analysis of earlier checkpoints.
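The neighbor-overlap analysis can be sketched as follows: embed each example as the average of its encoder token embeddings under two fine-tuned models, retrieve the k closest training examples for each test example under each embedding, and average the proportion of shared neighbors. The checkpoint paths and the value of k below are placeholders.

```python
import numpy as np
import torch
from transformers import T5EncoderModel, T5Tokenizer

def embed(texts, model, tokenizer):
    """Average of encoder token embeddings as a sentence embedding (Section 3.3)."""
    vectors = []
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt", truncation=True)
            hidden = model(**ids).last_hidden_state[0]     # (seq_len, dim)
            vectors.append(hidden.mean(dim=0).numpy())
    return np.stack(vectors)

def knn_indices(test_vecs, train_vecs, k=10):
    """Indices of the k nearest training examples (Euclidean) for each test example."""
    dists = ((test_vecs[:, None, :] - train_vecs[None, :, :]) ** 2).sum(-1)
    return np.argsort(dists, axis=1)[:, :k]

def average_overlap(neighbors_a, neighbors_b):
    """Mean proportion of nearest neighbors shared by two models."""
    return float(np.mean([len(set(a) & set(b)) / len(a)
                          for a, b in zip(neighbors_a, neighbors_b)]))

# Illustrative usage; checkpoint paths are placeholders, not released artifacts.
# tok = T5Tokenizer.from_pretrained("t5-base")
# model_a = T5EncoderModel.from_pretrained("path/to/label-only-checkpoint")
# model_b = T5EncoderModel.from_pretrained("path/to/free-text-checkpoint")
# overlap = average_overlap(
#     knn_indices(embed(test_texts, model_a, tok), embed(train_texts, model_a, tok)),
#     knn_indices(embed(test_texts, model_b, tok), embed(train_texts, model_b, tok)))
```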
As for e-SNLI, the model trained on highlight explanations appears to differ the most in its inner representation from the other models. Indeed, the proportion of shared neighbors drops from 7.5% between label-only and label + free-text to 0.7% when matching against highlight explanations. Such small overlaps do not allow for significant estimation of influence correlations. Further influence analysis is required to study this issue.

5. Conclusions

This work has aimed to investigate the effect of training NLP models on rationales—textual explanations for individual predictions—in natural language inference and commonsense question-answering settings. This was carried out by training models to produce their prediction and the appropriate explanation as a single text sequence, using a common text-to-text pretrained model.
The tasks under consideration require world knowledge to be completed adequately. Indeed, knowledge of real-world concepts and their relationships is embedded in the association between a prompt and its target response. As such, evaluating models on these tasks provides an assessment of the presence of this knowledge. It has been argued [9,10] that training a model on data with human-generated explanations may help the model acquire world knowledge more so than without said explanations, due to the supplemental information that they provide. This is referred to as “bridging the knowledge gap”. Our results indicate that the use of explanations hurts performance on the commonsense question-answering datasets, all the while being ineffectual in natural language inference.
It is difficult to determine how much of these performance discrepancies is attributable to the nature of the tasks as opposed to the nature of the explanations present in the data. That is, it remains unclear to what extent commonsense question-answering is resistant to or at odds with self-rationalization, and to what extent the particular sets of explanations may be detrimental. While the annotation guidelines differ, what matters are arguably the material explanations that are present in the data. Their differences are difficult to quantify in a meaningful manner. Highlight explanations are more readily analyzed given their expected origin in the input and lack of added linguistic structure. We noted that highlight explanations in Cos-E are longer and more concentrated compared with those in e-SNLI, which we contend to be less informative to the model. However, one limitation of our study is that this lack of linguistic structure in highlight explanations is contrary to model pretraining. In contrast, free-text explanations align with model pretraining, but their structure is more difficult to analyze. Further work in this direction is needed to characterize the difference in explanations between these datasets and to address the inherent skew in the comparison between highlight and free-text explanations.
Of course, the primary goal of self-rationalization is not to improve task acquisition but to increase explainability. This remains challenging, as free-text explanations appear to follow certain noninformative or nonsensical formulae. Although some of these patterns were present in the training data, they are overrepresented in predictions. Further, our analyses did not show examples of them to be influential in the training process.
To improve our investigation of self-rationalizing models, considering a more complete annotation that includes negative and positive properties [46] could further bridge the knowledge gap and improve the quality of free-text explanations. To the best of our knowledge, this annotation scheme is the most thorough and self-contained, which could in turn greatly improve the quality of model-generated explanations.

Author Contributions

Conceptualization, F.R. and D.M.; methodology, F.R., D.M. and M.-J.M.; software, F.R. and D.M.; validation, formal analysis, and investigation, F.R., P.V., D.M. and M.-J.M.; resources, M.-J.M.; writing—original draft preparation, F.R., P.V. and D.M.; writing—review and editing, F.R., D.M. and M.-J.M.; supervision, D.M. and M.-J.M.; funding acquisition, D.M. and M.-J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) [M.-J. Meurs, NSERC Grant number 06487-2017] and the Government of Canada’s New Frontiers in Research Fund (NFRF), [M.-J. Meurs, NFRFE-2018-00484].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Cos-E test sets are available on https://github.com/salesforce/cos-e (accessed on 1 August 2023) for v1.0 and https://www.tau-nlp.sites.tau.ac.il/commonsenseqa (accessed on 1 August 2023) for v1.11. The e-SNLI dataset is available at https://github.com/OanaMariaCamburu/e-SNLI (accessed on 1 August 2023). Our source code is available under a GPLv3.0 license at https://gitlab.labikb.ca/ikb-lab/nlp/self-rationalizing-commonsense-reasoning/self-rationalizing-models-for-commonsense-reasoning (accessed on 1 August 2023).

Acknowledgments

This research was enabled in part by support provided by Calcul Québec (https://www.calculquebec.ca (accessed on 1 August 2023)) and The Digital Research Alliance of Canada (https://alliancecan.ca (accessed on 1 August 2023)).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: artificial intelligence
Cos-E: Commonsense Explanation (dataset)
GPT: generative pre-trained Transformer
IF: influence function
LLM: large language model
NLP: natural language processing
NLI: natural language inference
T5: text-to-text transfer Transformer
XAI: explainable AI

Appendix A. Computation Details

Appendix A.1. Implementation

The software for the experiments was developed for Python 3.8 on Linux (Ubuntu 20, 22) systems. Models were trained and tested with HuggingFace libraries (Transformers v2.9.1, Datasets v2.5.1) on PyTorch (v1.12) [47,48,49]. Influence function computation was largely based on the FastIF software put forth by Guo et al. [43]. We refer the reader to the code repository for a full list of dependencies (https://gitlab.labikb.ca/ikb-lab/nlp/self-rationalizing-commonsense-reasoning/self-rationalizing-models-for-commonsense-reasoning (accessed on 1 August 2023)).

Appendix A.2. Hardware and Runtimes

Experiments were run on single nodes in the Narval high-performance cluster [50] equipped with AMD Milan 7413 processors with Solid-State Drives (3.85TB) and NVidia A100 (40GB) GPUs. Training on a single GPU took approximately 1.5 h per epoch for Cos-E and 2 h per epoch on e-SNLI, for each model. Early stopping occurred before epoch 15 for all models. Prediction ran on a single GPU for less than 5 min per model for Cos-E and 1 h for e-SNLI.

References

  1. Lyons, J.B.; Clark, M.A.; Wagner, A.R.; Schuelke, M.J. Certifiable Trust in Autonomous Systems: Making the Intractable Tangible. AI Mag. 2017, 38, 37–49. [Google Scholar] [CrossRef]
  2. Nor, A.K.M.; Pedapati, S.R.; Muhammad, M.; Leiva, V. Abnormality Detection and Failure Prediction Using Explainable Bayesian Deep Learning: Methodology and Case Study with Industrial Data. Mathematics 2022, 10, 554. [Google Scholar] [CrossRef]
  3. Dzindolet, M.T.; Peterson, S.A.; Pomranky, R.A.; Pierce, L.G.; Beck, H.P. The role of trust in automation reliance. Int. J. Hum.-Comput. Stud. 2003, 58, 697–718. [Google Scholar] [CrossRef]
  4. Mercado, J.E.; Rupp, M.A.; Chen, J.Y.; Barnes, M.J.; Barber, D.; Procci, K. Intelligent Agent Transparency in Human–Agent Teaming for Multi-UxV Management. Hum. Factors 2016, 58, 401–415. [Google Scholar] [CrossRef]
  5. Héder, M. Explainable AI: A brief History of the Concept. ERCIM News 2023, 134, 9–10. [Google Scholar]
  6. La Rocca, M.; Perna, C. Opening the Black Box: Bootstrapping Sensitivity Measures in Neural Networks for Interpretable Machine Learning. Stats 2022, 5, 440–457. [Google Scholar] [CrossRef]
  7. Hulsen, T. Explainable Artificial Intelligence (XAI): Concepts and Challenges in Healthcare. AI 2023, 4, 652–666. [Google Scholar] [CrossRef]
  8. Wiegreffe, S.; Marasović, A. Teach Me to Explain: A Review of Datasets for Explainable NLP. In Proceedings of the Thirty-Fifth Conference on Neural Information Processing Systems, NeurIPS, Datasets and Benchmarks Track, Virtual, 6–14 December 2021. [Google Scholar]
  9. Rajani, N.F.; McCann, B.; Xiong, C.; Socher, R. Explain Yourself! Leveraging Language Models for Commonsense Reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 4932–4942. [Google Scholar] [CrossRef]
  10. Camburu, O.M.; Rocktäschel, T.; Lukasiewicz, T.; Blunsom, P. e-SNLI: Natural Language Inference with Natural Language Explanations. In Advances in Neural Information Processing Systems 31, NeurIPS; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 9539–9549. [Google Scholar]
  11. Wiegreffe, S.; Marasović, A.; Smith, N.A. Measuring Association Between Labels and Free-Text Rationales. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 10266–10284. [Google Scholar] [CrossRef]
  12. Jain, S.; Wiegreffe, S.; Pinter, Y.; Wallace, B.C. Learning to Faithfully Rationalize by Construction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL, Virtual, 6–8 July 2020; pp. 4459–4473. [Google Scholar] [CrossRef]
  13. Narang, S.; Raffel, C.; Lee, K.; Roberts, A.; Fiedel, N.; Malkan, K. WT5?! Training Text-to-Text Models to Explain their Predictions. arXiv 2020, arXiv:2004.14546. [Google Scholar]
  14. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  15. Hoffman, R.R.; Klein, G.; Mueller, S.T. Explaining Explanation for “Explainable AI”. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 2018, 62, 197–201. [Google Scholar] [CrossRef]
  16. Han, X.; Wallace, B.C.; Tsvetkov, Y. Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL, Virtual, 6–8 July 2020; pp. 5553–5563. [Google Scholar] [CrossRef]
  17. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You? ”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd International Conference on Knowledge Discovery and Data Mining, ACM SIGKDD, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  18. Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In Proceedings of the International Conference on Learning Representations, ICLR (Workshop Poster), Banff, Canada, 14–16 April 2014. [Google Scholar]
  19. Koh, P.W.; Liang, P. Understanding Black-box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning, ICML, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1885–1894. [Google Scholar]
  20. Jacovi, A.; Goldberg, Y. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL, Virtual, 6–8 July 2020; pp. 4198–4205. [Google Scholar] [CrossRef]
  21. Pezeshkpour, P.; Jain, S.; Wallace, B.; Singh, S. An Empirical Comparison of Instance Attribution Methods for NLP. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, Virtual, 6–11 June 2021; pp. 967–975. [Google Scholar] [CrossRef]
  22. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2016, arXiv:1409.0473. [Google Scholar]
  23. Bibal, A.; Cardon, R.; Alfter, D.; Wilkens, R.; Wang, X.; François, T.; Watrin, P. Is Attention Explanation? An Introduction to the Debate. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, ACL, Dublin, Ireland, 22–27 May 2022; Volume 1, pp. 3889–3900. [Google Scholar] [CrossRef]
  24. Bastings, J.; Filippova, K. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Online, 20 November 2020; pp. 149–155. [Google Scholar] [CrossRef]
  25. Wiegreffe, S.; Pinter, Y. Attention is not not Explanation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 11–20. [Google Scholar] [CrossRef]
  26. Jain, S.; Wallace, B.C. Attention is not Explanation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3543–3556. [Google Scholar] [CrossRef]
  27. Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 2019, 267, 1–38. [Google Scholar] [CrossRef]
  28. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. arXiv 2022, arXiv:2203.02155. [Google Scholar]
  29. Mathew, B.; Saha, P.; Yimam, S.M.; Biemann, C.; Goyal, P.; Mukherjee, A. HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Special Track on AI for Social Impact, Virtual, 2–9 February 2021; Volume 35, pp. 14867–14875. [Google Scholar] [CrossRef]
  30. Zaidan, O.F.; Eisner, J.; Piatko, C.D. Using “Annotator Rationales” to Improve Machine Learning for Text Categorization. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT, Rochester, NY, USA, 22–27 April 2007; pp. 260–267. [Google Scholar]
  31. Strout, J.; Zhang, Y.; Mooney, R. Do Human Rationales Improve Machine Explanations? In Proceedings of the ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy, 1 August 2019; pp. 56–62. [Google Scholar] [CrossRef]
  32. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. [Google Scholar]
  33. McDonnell, T.; Lease, M.; Kutlu, M.; Elsayed, T. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments. In Proceedings of the Conference on Human Computation and Crowdsourcing, AAAI-HCOMP, Austin, Texas, USA, 30 October–3 November 2016; Volume 4, pp. 139–148. [Google Scholar] [CrossRef]
  34. Kutlu, M.; McDonnell, T.; Elsayed, T.; Lease, M. Annotator Rationales for Labeling Tasks in Crowdsourcing. J. Artif. Intell. Res. 2020, 69, 143–189. [Google Scholar] [CrossRef]
  35. DeYoung, J.; Jain, S.; Rajani, N.F.; Lehman, E.; Xiong, C.; Socher, R.; Wallace, B.C. ERASER: A Benchmark to Evaluate Rationalized NLP Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL, Virtual, 6–8 July 2020; pp. 4443–4458. [Google Scholar] [CrossRef]
  36. Jacovi, A.; Goldberg, Y. Aligning Faithful Interpretations with their Social Attribution. Trans. Assoc. Comput. Linguist. 2021, 9, 294–310. [Google Scholar] [CrossRef]
  37. Sheh, R.; Monteath, I. Defining Explainable AI for Requirements Analysis. KI Künstliche Intell. 2018, 32, 261–266. [Google Scholar] [CrossRef]
  38. Meister, C.; Lazov, S.; Augenstein, I.; Cotterell, R. Is Sparse Attention more Interpretable? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP, Virtual, 1–6 August 2021; Volume 2, pp. 122–129. [Google Scholar] [CrossRef]
  39. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  40. Talmor, A.; Herzig, J.; Lourie, N.; Berant, J. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4149–4158. [Google Scholar] [CrossRef]
  41. Bowman, S.R.; Angeli, G.; Potts, C.; Manning, C.D. A Large Annotated Corpus for Learning Natural Language Inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP, Lisbon, Portugal, 17–21 September 2015. [Google Scholar] [CrossRef]
  42. Rancourt, F.; Maupomé, D.; Meurs, M.J. On the Influence of Annotation Quality in Suicidal Risk Assessment from Text. In Proceedings of the Canadian Conference on Artificial Intelligence, CAI, Toronto, ON, Canada, 30 May–3 June 2022. [Google Scholar] [CrossRef]
  43. Guo, H.; Rajani, N.; Hase, P.; Bansal, M.; Xiong, C. FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 10333–10350. [Google Scholar] [CrossRef]
  44. Ni, J.; Hernandez Abrego, G.; Constant, N.; Ma, J.; Hall, K.; Cer, D.; Yang, Y. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 1864–1874. [Google Scholar] [CrossRef]
  45. Breiman, L. Statistical Modeling: The Two Cultures. Stat. Sci. 2001, 16, 199–231. [Google Scholar] [CrossRef]
  46. Aggarwal, S.; Mandowara, D.; Agrawal, V.; Khandelwal, D.; Singla, P.; Garg, D. Explanations for CommonsenseQA: New Dataset and Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP, Virtual, 1–6 August 2021; Volume 1, pp. 3050–3065. [Google Scholar] [CrossRef]
  47. Lhoest, Q.; Villanova del Moral, A.; Jernite, Y.; Thakur, A.; von Platen, P.; Patil, S.; Chaumond, J.; Drame, M.; Plu, J.; Tunstall, L.; et al. Datasets: A Community Library for Natural Language Processing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 175–184. [Google Scholar] [CrossRef]
  48. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Virtual, 16–20 November 2020; pp. 38–45. [Google Scholar] [CrossRef]
  49. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035. [Google Scholar]
  50. Calcul Québec. Narval, a 5.9 Petaflops Supercomputer for Scientific Researchers in Canada. Available online: https://docs.alliancecan.ca/wiki/Narval/en (accessed on 1 August 2023).
Figure 1. Two alternative paradigms of explainable natural language processing (NLP) models in a classification example. Pipeline models produce explanations and outputs through separate, sequentially dependent components. Self-Rationalizing models produce outputs and explanations jointly.
Figure 2. Framing different types of rationale as sequence-to-sequence tasks. Highlighted explanations extract sequences from the input. Free-text explanations have free form. Example from Commonsense Explanation (Cos-E) v1.0, correct label is house.
Figure 3. Overview of the methodology. The data are adapted into different input–output pairs for each explanation type. A single sequence-to-sequence model, text-to-text transfer Transformer (T5), serves as a common base for models producing the different explanations. Test-predicted labels and explanations are analyzed and compared. Models are probed to assess differences in training data influence and inner representation.
Table 1. Accuracy (%) of T5-base models on the Cos-E validation sets and e-SNLI test set. § indicates results from [11]; † indicates results from [13].
Model | Cos-E v1.0 | Cos-E v1.11 | e-SNLI
T5 label-only | 69.4 (69.2 §) | 60.9 (61.8 §) | 91.1 (90.9 §)
T5 label + highlight | 60.6 | 51.2 | 90.1
T5 label + free-text | 65.1 (64.8 §) | 56.8 (55.6 §, 59.4 †) | 91.0 (90.8 §, 90.9 †)
human [40] | 95.3 | 88.9 | -
Table 2. Selected model-generated free-text explanations following the template “<answer> is the only <something> that […]” for Cos-E v1.0 and v1.11.
Question | Answer Choices | Model Explanation
Cos-E v1.0
A great teacher can be what when you are attending school? | graduate, boredom, inspiration | inspiration is the only thing that can be done when you are attending school.
According to what book did an apple tree lead to the downfall of man? | woods, Spain, bible | bible is the only book that is a book
After he got hired he hoped for success at his what? | michigan, new job, vocation | new job is the only option that is not a vocation.
Aliens, humans and everything is a part of what? | outer space, universe, roswell | the universe is the only place that is a part of the universe.
Cos-E v1.11
He has lactose intolerant, but was eating dinner made of cheese, what followed for him? | digestive, feel better, sleepiness, indigestion, illness | indigestion is the only answer that is not intolerant
If a person were going to bed, what would be their goal? | lazy, insomnia, rest, falling asleep, dreaming of | falling asleep is the only option that would help you fall asleep.
If you are hungry and going fishing, why would you be going fishing? | to see the fish, have fun, catching fish, wet clothes, killing | catching fish is the only option that would make you have fun.
If you’re remembering something, it’s because of your what of it to begin with? | knowledge, knowing, forgetful, pleasure, depression | knowing is the only thing that is remembering
Table 3. Selected model-generated free-text explanations following the templates “Not all <something> are <something else>” or “A <something> is [not] <something else>” for e-SNLI.
Premise | Hypothesis | Label / Prediction | Model Explanation
A couple walk hand in hand down a street. | The couple is married. | neutral / neutral | Not all couples are married.
Two women are observing something together. | Two women are looking at a flower together. | neutral / neutral | Not all things are a flower.
Two people using a water buffalo to cultivate a watery field. | A couple is farming with their pet buffalo. | entailment / neutral | Not all water buffalos are pet buffalos.
A man playing an electric guitar on stage. | A man playing banjo on the floor. | contradiction / contradiction | A banjo is not an electric guitar.
A little girl with pointy pigtails is sitting in the sand on the beach. | The girl is sitting on the beach. | entailment / entailment | A little girl is a girl.
A speaker is talking with a TV in the background. | There is a live bear in the background. | neutral / contradiction | A speaker is not a bear.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

