Article

CLICK: Integrating Causal Inference and Commonsense Knowledge Incorporation for Counterfactual Story Generation

1
Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
2
School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China
3
Aerospace Information Research Institute of QiLu, Jinan 250132, China
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(19), 4173; https://doi.org/10.3390/electronics12194173
Submission received: 7 September 2023 / Revised: 24 September 2023 / Accepted: 4 October 2023 / Published: 8 October 2023
(This article belongs to the Special Issue Emerging Theory and Applications in Natural Language Processing)

Abstract

Counterfactual reasoning explores what could have happened if the circumstances were different from what actually occurred. As a crucial subtask, counterfactual story generation integrates counterfactual reasoning into the generative narrative chain, which requires the model to preserve minimal edits and ensure narrative consistency. Previous work prioritizes conflict detection as a first step, and then replaces conflicting content with appropriate words. However, these methods mainly face two challenging issues: (a) the causal relationship between story event sequences is not fully utilized in the conflict detection stage, leading to inaccurate conflict detection, and (b) the absence of proper planning in the content rewriting stage results in a lack of narrative consistency in the generated story ending. In this paper, we propose a novel counterfactual generation framework called CLICK based on causal inference in event sequences and commonsense knowledge incorporation. To address the first issue, we utilize the correlation between adjacent events in the story ending to iteratively calculate the contents from the original ending affected by the condition. The content with the original condition is then effectively prevented from carrying over into the new story ending, thereby avoiding causal conflict with the counterfactual conditions. Considering the second issue, we incorporate structural commonsense knowledge about counterfactual conditions, equipping the framework with comprehensive background information on the potential occurrence of counterfactual conditional events. By leveraging a rich hierarchical data structure, CLICK gains the ability to establish a more coherent and plausible narrative trajectory for subsequent storytelling. Experimental results show that our model outperforms previous unsupervised state-of-the-art methods and achieves gains of 2.65 in BLEU, 4.42 in ENTScore, and 3.84 in HMean on the TIMETRAVEL dataset.

1. Introduction

Counterfactual reasoning has attracted significant attention in the field of natural language processing due to its wide range of applications in improving model robustness [1,2], interpretability [3,4,5], and data augmentation [6,7,8]. One significant aspect where counterfactual reasoning holds pivotal importance is in text generation, leading to advancements in various applications such as dialogue systems [9], answer feedback generation [10], and creative content generation [11]. Building upon the foundation of counterfactual reasoning in text generation, researchers have recently explored a novel task known as counterfactual story generation [12]. By leveraging counterfactual thinking, the objective is to generate alternative narratives that explore different outcomes or events based on hypothetical changes to the initial context or plot. This area of research aligns with the current trend in large-scale language models [13,14,15], which is dedicated to pushing the boundaries of creative text generation and enabling more human-like interactions with AI systems. The significance of this task extends far beyond the theoretical realm, offering promising applications in diverse real-world scenarios. In education, it offers the potential to revolutionize learning materials. By rewriting stories under different counterfactual conditions, students can gain a deeper understanding of historical events, scientific phenomena, and complex cause-and-effect relationships. In the legal and ethical realms, the task can aid lawyers, judges, and policymakers. It allows for the exploration of alternative scenarios, thereby facilitating informed decisions and assessments of potential outcomes in legal cases and ethical dilemmas. 
From enhancing educational materials to guiding legal decisions, simulating business strategies, and aiding medical treatment planning, counterfactual story generation presents an invaluable tool for understanding causality and exploring multifaceted narratives in a multitude of domains.
As shown in Figure 1, in this generation task, the input is an entire original story (consisting of a one-sentence premise, a one-sentence condition, and a three-sentence ending) together with an intervening counterfactual condition, and some phrases in the original ending may conflict with the new condition. For example, an old man built a zoo where many of the animals were set free due to a hurricane. When the story condition is changed to a hotel, the subsequent story is adapted to portray the destruction of the rooms, ensuring narrative consistency with the counterfactual condition. Furthermore, the hurricane event in sentence s4 remains unaffected by both conditions, making it optimal to retain the event entirely to ensure minimal editing. Traditional models [16,17,18] are based on understanding the given context to generate fluent and logically sound text. Therefore, leveraging a pre-trained language model enables the generation of fluent endings under counterfactual conditions. However, challenges arise when attempting to achieve accurate reasoning while making only minimal modifications to the ending and ensuring it remains natural.
Recent relevant research [19,20] adopts a two-stage framework aimed at accurate reasoning with minimal editing. In the first stage, each token in the story context is examined individually to determine whether it requires modification. This approach enables accurate detection of each word, ensuring minimal editing. In the second stage, the words identified in the previous stage are modified to align with the story logic under the counterfactual condition, thereby ensuring narrative consistency. For instance, Hao et al. [19] trained a binary classifier using supervised learning to classify whether each token in the story ending represents causal content, while Chen et al. [20] calculated causal risk ratios to detect causal conflicts. However, despite these promising improvements, two primary challenges remain, elaborated as follows:
Conflict Detection: In previous methods, the causal invariance of a word in the story ending is determined by assessing its relevance to both the original condition and the counterfactual condition. However, it is challenging to compare the correlation of the phrase round them all up in s5 with the original condition against its correlation with the counterfactual condition, because them refers to content in s4, and its relevance to the condition gradually weakens as the event progresses. In such a scenario, relying solely on the conditions is insufficient to precisely locate the causally invariant content. The narrative of events in a story evolves gradually, so the impact of preceding information diminishes, whereas the latest plot developments bear a more immediate correlation with future events. Consequently, during story rewriting, it is crucial to consider not only the influence of the initial conditions but also a more comprehensive account of the entire story.
Causal Continuity: Previous approaches rely exclusively on the modeling capabilities of the language model to predict the output by leveraging the provided contextual information. However, although current generative models are capable of producing coherent text, they are also prone to expose defects such as self-contradiction and topic drifting [21,22,23]. In the context of a story condition, subsequent events can unfold in numerous directions. Allowing the language model to select words for the next position solely based on statistical probabilities and the aforementioned information is not an effective mode of control to ensure the generated content maintains narrative consistency with counterfactual conditions.
In terms of the conflict detection issue, we leverage the causal relationships within the event sequences instead of relying solely on the story conditions. For example, the story in Figure 1 exhibits the following causal relationships: zoo in s2 is related to zoo, animal in s3; zoo, animal in s3 is related to animals, set free in s4; and animals, set free in s4 is related to round them all up in s5. Therefore, from an intuitive standpoint, a more effective approach for assessing the causal correlation between outcomes and conditions is to leverage the correlation of cause words among consecutive events. To explicitly explain the causal relationship, we formulate the story ending generation process as a causal graph. Considering the causal continuity issue, we integrate commonsense event knowledge into the rewriting process. Specifically, we introduce COMET [24], a powerful tool capable of generating diverse and structured commonsense knowledge specifically tailored for counterfactual conditions. We fine-tune a GPT-style [16] model on a large corpus of stories paired with corresponding commonsense knowledge about counterfactual conditions. Leveraging the vast knowledge encoded within the pre-trained language model and integrating structured commonsense knowledge allows for deducing plausible event sequences that have not been previously observed and seamlessly incorporating novel words and knowledge into the generated content.
In this paper, we propose CLICK, a counterfactual generation framework based on CausaL Inference in event sequences and Commonsense Knowledge incorporation. In the first stage, we propose a skeleton extractor, which leverages the causal relationships among event sequences to detect the contents in the story ending that are affected by the original story condition. These elements are then removed, resulting in a basic skeleton and mitigating any interference with the new counterfactual outcome. Furthermore, commonsense generators formulate structural knowledge associated with the counterfactual condition, which enhances the causal coherence between the story ending and the counterfactual condition. In the second stage, the skeleton and the knowledge are provided as context to a generator that produces proper words to fill in the skeleton in a sequence-to-sequence way. We conduct experiments on the TIMETRAVEL dataset, and the experimental results illustrate that our model achieves state-of-the-art performance compared to other strong baselines. Additionally, our model exhibits superior capabilities in terms of minimal editing of the original ending and ensuring causal coherence between the counterfactual ending and the corresponding counterfactual condition. The contributions of this work can be summarized as follows:
  • Inspired by causal graph modeling of how conditions shape story endings, we propose the counterfactual generation framework CLICK based on CausaL Inference in event sequences and Commonsense Knowledge incorporation, improving the interpretability of generative reasoning.
  • We investigate the causal invariance by analyzing the causal relationship among event sequences to pinpoint the necessary modification locations. Meanwhile, CLICK enhances the causal continuity between the ending tokens and the counterfactual condition with commonsense knowledge.
  • We conduct experiments on the TIMETRAVEL dataset. The experimental results demonstrate that the CLICK framework outperforms previous state-of-the-art models under unsupervised settings. Ablation experiments further validate the effectiveness of considering causal relationships among event sequences and incorporating structural knowledge.

2. Related Work

2.1. Knowledge-Enhanced Text Generation

Text generation is a task that takes text as input, processes it into semantic representations, and generates the desired output text. However, the limited knowledge available in the input text poses challenges for neural generation models seeking to achieve the desired output quality [25,26,27]. Many research efforts have been made to enhance the control of generation with various desired properties, such as topic [28], emotion [29], keywords [30], and dialogue intent [31]. In particular, narrative generation requires models to produce fluent and logically coherent stories based on predefined conditions [32,33,34]. Nevertheless, current generative models have not yet attained the storytelling proficiency exhibited by human narrators. To bridge this gap, many studies inject structured knowledge into the generation process. According to the method of integration, these works fall into two categories: knowledge enhanced by encoding or by text.
Knowledge Enhanced by Encoding. One line of researchers [35,36,37,38,39,40] encodes structured knowledge into low-dimensional vectors and then uses them to influence the word probability distribution during generation. Wang et al. [35] encoded entities retrieved from ConceptNet and then fed them into decoders to generate stories. Chen et al. [36] leveraged implicit relationships among keywords in stories by calculating the cosine similarity of word embeddings. However, this process operates as a black box, presenting challenges in terms of interpretability. To tackle this challenge, Liu et al. [37] proposed calculating a knowledge gain to define a reward at each step of the decoding process. Taking inspiration from such methods, we use vectors trained on knowledge graphs to detect causal invariance and calculate similarity scores between them to assess correlation. The resulting scores directly indicate the basis of our correlation detection and contribute to the interpretability of our method.
Knowledge Enhanced by Text. Another line of researchers views structured knowledge as material that can be learned in the same way as stories. They feed knowledge contexts into generative models to capture explicit information. Guan et al. [41] integrated commonsense knowledge graphs into GPT-2 [16] by post-training the model on knowledge examples. Xu et al. [42] transformed triples in ConceptNet [43] into natural language sentences and utilized them to generate stories. These studies make generative models inherently knowledge-enhanced. In contrast, the authors of [44,45] proposed a two-step generation pipeline with an independent knowledge reasoner instead of fine-tuning a PLM to directly generate discourse-level stories: they first generated successive events and subsequently expanded these events into coherent discourse sentences. In our method, we incorporate both fine-tuning and multi-step generation techniques. Different from these works, we employ commonsense knowledge as a guiding mechanism to assist the model in fill-in-the-blank tasks, rather than fine-tuning a model to generate sequences in a direct left-to-right way.

2.2. Causal Inference and NLP

Causal inference [46] aims to explore the cause-and-effect relationships between different variables. With the emergence of an interdisciplinary research field at the intersection of causal inference and NLP, there is a growing interest among researchers in exploring methods for estimating causal effects from textual data and leveraging causal mechanisms to enhance the current understanding and generation of natural language. Within NLP, distinguishing between causation and correlation remains a considerable challenge, leading to potential misconceptions in the results. Moreover, the prediction process is often treated as a black box, lacking interpretability and transparency in its outputs. To tackle these challenges, incorporating a causal mechanism can be employed to model the data generation process and enhance the comprehension of the causal relationships between events and the underlying constructs within the predictor [47,48]. To enhance causal reasoning in narratives, we primarily employ two methods: causal graph analysis and counterfactual reasoning. These approaches offer valuable insights and tools for effectively capturing causal relationships within the context.
Causal Graph Analysis. One line of research focuses on leveraging causal graph analysis in the data generation process, enabling the derivation of valid causal conclusions and ultimately enhancing the performance of NLP systems. The authors of [49,50] employ causal graph analysis to qualitatively analyze the impact of item popularity as a confounder, effectively boosting recommendation system performance. Tian et al. [2] employ a structural causal model to formulate biases in natural language understanding tasks, effectively alleviating the annotation biases of the datasets. Moreover, causal graph analysis is also widely used in various fields, including text classification [51], named entity recognition [52], pretrained language models [53], fake news detection [54], and even performance bottleneck detection in programming languages [55,56,57,58]. In this work, we utilize causal graph modeling to analyze the generation process of story event sequences, enabling the identification of elements in subsequent events that are impacted by changing conditions.
Counterfactual Reasoning. Another line of research focuses on enhancing current text generation mechanisms by incorporating counterfactual reasoning capabilities. Counterfactual reasoning refers to reasoning about what could have happened if the past had been different or if certain conditions or events had been altered. It deals with hypothetical or counterfactual scenarios and explores the causal relationships between variables. These efforts generate counterfactual samples that are used to improve model robustness [59], interpretability [60], and data augmentation [7]. However, the primary idea behind these works is to utilize language models and diverse sampling strategies to generate counterfactuals, without involving more complex narrative counterfactual reasoning. In 2019, Qin et al. [12] introduced the task of counterfactual story generation. They employed a seq2seq model to reconstruct stories; however, the resulting story ending diverged significantly from the original ending. To address this issue, Hao et al. [19] and Chen et al. [20] proposed a two-step approach. Firstly, they determined the editing position, followed by modifying the content. Compared with the original method, they made improvements in terms of minimal editing and maintaining consistency with counterfactual conditions. However, the content generated by these methods still exhibits flaws in terms of logical rationality and consistency with counterfactual conditions. In our work, we address these limitations by incorporating causal relationships between story event sequences to more accurately assess the causal invariance of content. Additionally, we introduce structural commonsense knowledge to offer diverse and previously unseen planning guidance to the model, aiming to improve the overall quality and coherence of the generated output.

3. Preliminaries

3.1. Causal Graph

A causal graph is a probabilistic graphical model used to describe how variables interact with each other, expressed by a directed acyclic graph (DAG) G = {V, E}, where V denotes the set of variables and E represents the causal correlations among those variables. A DAG is a collection of nodes (variables) and edges (associations) that define the assumed causal relationships in a data-generating process. In Figure 2, we show an example of a causal graph with three variables: Treatment, Outcome, and Confounder. In the context of causal models, the Treatment plays a direct causal role in determining the value of the Outcome, as indicated by the directed edge that links the Treatment to the Outcome. The Confounder influences both the Treatment and the Outcome, creating an association between them. However, it is important to note that the association between the Treatment and Outcome resulting from their shared cause is not part of the specific causal association being analyzed. In other words, a portion of the association between the Treatment and Outcome can be attributed to the biasing path that runs from the Confounder through the Treatment to the Outcome. To accurately compute the causal effect, this biasing path must be blocked by adjusting for the influence of the Confounder. Appendix A supplements additional information regarding practical applications of causal graphs.

3.2. Causal Intervention

Causal intervention is employed to determine the true causal effect of one variable on another in the presence of confounders. In a causal graph, performing an intervening operation on a variable eliminates all edges directed towards it, thereby breaking causal relationships from its parent nodes. The backdoor adjustment [46] using do-calculus provides a method for computing the intervened distribution when there are no additional confounders. For the example in Figure 2, the adjustment formula can be derived according to Bayes' theorem as follows, where z ranges over the values of the Confounder Z:
P(Y | do(X)) = ∑_z P(Y | X, z) P(z)
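Numerically, the adjustment amounts to averaging the conditional outcome over the confounder's marginal distribution, rather than over its distribution conditioned on the treatment. A minimal sketch with invented probabilities (all numbers are illustrative, not from the paper):

```python
# Backdoor adjustment: P(Y | do(X)) = sum_z P(Y | X, z) * P(z)
# All probabilities below are invented for illustration.

p_z = {0: 0.6, 1: 0.4}              # marginal prior over the Confounder Z
p_y_given_x_z = {0: 0.30, 1: 0.80}  # P(Y=1 | X=1, Z=z)

# The do-operator cuts the Z -> X edge, so we weight by P(z),
# not by P(z | X=1) as a naive conditional estimate would.
p_y_do_x = sum(p_y_given_x_z[z] * p_z[z] for z in p_z)
print(p_y_do_x)  # 0.30*0.6 + 0.80*0.4 = 0.5
```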

4. Methodology

4.1. Task Formulation

The input of the counterfactual story rewriting task is a five-sentence story S = {s1, s2, …, sm}, where m is the number of sentences and the i-th sentence s_i = {w^i_1, w^i_2, …, w^i_n} contains n words, together with a counterfactual condition s2′ that is counterfactual to the initial condition s2. In this representation, s1 is denoted as the premise p, s2 as the original condition c, s2′ as c′, and s3:m as the ending e. The goal of this task is to revise the ending e into an edited ending e′ that minimally modifies the original one and regains narrative coherence with the counterfactual condition.
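For concreteness, one input instance of this task can be represented as a small record. This is only a sketch: the field names are our own, and the example sentences paraphrase the Figure 1 story rather than quoting the dataset.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative container for one TIMETRAVEL-style instance.
@dataclass
class CounterfactualInstance:
    premise: str        # s1 = p
    condition: str      # s2 = c (original condition)
    cf_condition: str   # s2' = c' (counterfactual condition)
    ending: List[str]   # s3..s5 = e (original ending)
    edited_ending: List[str] = field(default_factory=list)  # e' (target rewrite)

ex = CounterfactualInstance(
    premise="An old man lived near town.",
    condition="He built a zoo with unusual animals.",
    cf_condition="He built a hotel with unusual rooms.",
    ending=["The zoo had animals nobody had ever seen.",
            "A hurricane hit and many animals were set free.",
            "He had to round them all up."],
)
assert len(ex.ending) == 3 and ex.condition != ex.cf_condition
```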

4.2. Causal Graph and Causal Path Analysis

To reveal the causal relationship between the event sequence in the story ending, we construct a causal graph that represents the generation process of each individual event sequence in Figure 3. From the perspective of outcome event generation, an investigation into the sources and affected factors allows us to locate the specific modification points when past events undergo changes. This analysis enables us to pinpoint the corresponding positions in need of adjustments in order to maintain coherence and consistency within the overall narrative. Therefore, the causal invariance of words in the outcome can be detected based on the causal graph modeling.
Each causal graph for an event sequence comprises three variables: Treatment, Outcome, and Confounder. In the context of this task, an event sentence in the story ending corresponds to the Outcome variable, while the preceding event adjacent to it represents the Treatment variable. The Confounder variables consist of the earlier contextual events that occurred before the Outcome event. The interventionist account characterizes a causal relationship between two variables C and E as follows: C is a cause of E if there is at least one ideal intervention on C that changes the value of E. Referring to this definition, we define causal invariance in the context of the story rewriting task: under counterfactual interventions, any changes observed in subsequent events indicate a causal relationship with the conditions, and the immune portions represent the causal invariance. Based on the causal graph presented in Figure 3, we narrow the calculation scope of causal invariance to only the relationship between Outcome and Treatment, that is, the causal effects between adjacent events.
For the calculation of causal invariance in s3: in Figure 3a, event s2 influences event s3 through the core path s2 → s3. Specifically, the goal of rewriting the third sentence is to determine the specific components of s3 that are impacted by the intervened content in s2. This entails evaluating the causal invariance between the cause words in s2 and s3.
For the calculation of causal invariance in s4: in Figure 3b, event s3 influences event s4 through the core path s3 → s4. The story condition now plays the role of the Confounder in the causal graph, creating a spurious correlation by influencing both s3 and s4. To mitigate the influence of the Confounder, only the effect of the Treatment on the Outcome is considered. The influence on s3 in the previous step serves as the cause of what affects the subsequent event in the current step. Consequently, the affected content in s4 is calculated based on the affected content in s3, which entails evaluating the causal invariance between the cause words in s3 and s4. Likewise, the calculation of causal invariance in s5 is converted to the computation between the causal content in s4 and s5, as illustrated in Figure 3c.
A formal description of the causal graph is shown in Figure 3d. In this graph, each sentence s_i in the story is modeled as the Outcome in the causal relationship. The preceding event s_{i−1}, which is adjacent to the sentence s_i, serves as the Treatment that directly influences it. Additionally, the other events that transpired prior to sentence s_{i−1} act as Confounders that influence both the Treatment and the Outcome. To effectively rewrite the sentence s_i, it is crucial to identify the underlying cause that impacts its generation process. As indicated by the causal graph, the sentence s_i is primarily influenced by the preceding event through the path s_{i−1} → s_i. Hence, during the causal invariance detection stage, we calculate the impact on event s_i by using the causal result in the previous event s_{i−1}.

4.3. Model Overview

The framework of CLICK is shown in Figure 4. It consists of three components: (1) a skeleton extractor with narrative chain guidance, which removes the words that are causally associated with the original condition, leading to a skeleton consisting exclusively of words unrelated to the original condition; (2) a knowledge-alignment commonsense generator, which employs COMET, a transformer-based tool, to generate structured commonsense knowledge about counterfactual conditions, providing extensive and diverse information that serves as a valuable resource for the subsequent rewriting; and (3) a commonsense-constrained generative model, which leverages the previously acquired skeleton and commonsense knowledge as prompts to rewrite the story ending.

4.4. Skeleton Extractor with Narrative Chain Guidance

The counterfactual story generation task investigates how subsequent events are altered when conditions are modified. Based on the causal graph modeling and causal path analysis of the event sequence in the ending depicted in Figure 3, we can formally summarize the causal influence between events as follows: factor X within an event leads to factor Y in the subsequent event, and factor Y further causes factor Z in the event after that. In the first process, factor X is referred to as the cause word and factor Y is termed the effect word; in the second process, factor Y becomes the new cause word, and factor Z is the effect word affected by it. In the counterfactual story generation task, the counterfactual condition can be viewed as a causal intervention in the story event chain. For example, in Figure 1, the zoo scene in the original condition is intervened on, so a natural idea is to remove the events or content associated with the zoo scene from the original ending. In view of the causal relationships among sequences of events, we employ a progressive approach to identify the influence of adjacent events: we determine the effect words in the current sentence based on the cause words from the preceding sentence, and the resulting effect words are then employed as cause words to compute the influenced effect words in the subsequent sentence.
The main task of this module is to find the words in the story ending that are highly related to the original condition, eliminate them from the ending, and obtain a skeleton consisting solely of words that are irrelevant to the intervention factor in the condition. The module is divided into the following three steps: (1) condition-guided intervention selection; (2) sequence-aware correlation calculation; (3) skeleton acquisition.

4.4.1. Condition-Guided Intervention Selection

A counterfactual condition involves partial modifications to the original condition. Specifically, when investigating the influence of a specific element on subsequent events, we can modify that element and observe the corresponding changes in the subsequent events. By comparing the disparities between the original and counterfactual conditions, we can determine the intervened variables within the original conditions. This approach enables us to explore the causal effects of the modified element and gain insights into its impact on the subsequent events.
Given the original story condition c = {w1, …, wj, …, wn} and the counterfactual condition c′ = {w′1, …, w′j, …, w′m}, the transformation from the original condition to the counterfactual condition falls into two situations:
(1) Word substitution: By selectively modifying only a subset of words in the original condition, a new counterfactual condition is obtained. Thus, by comparing c and c′, the modified content in c′ is the intervention.
(2) Word deletion and addition: By selectively deleting words from, or adding words to, the original condition, the new counterfactual condition is obtained. In this case, all words in c′ are considered the intervention.
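The two situations above can be sketched as a word-level diff between the two conditions. This is only an illustrative implementation: the whitespace tokenizer and the equal-length test for distinguishing substitution from deletion/addition are our own simplifications, not specified by the paper.

```python
def find_intervention(original: str, counterfactual: str) -> list:
    """Select the intervention words by comparing c and c' (illustrative)."""
    orig = original.lower().split()
    cf = counterfactual.lower().split()
    if len(orig) == len(cf):
        # Situation (1): word substitution -- take only the changed positions.
        changed = [w for o, w in zip(orig, cf) if o != w]
        if changed:
            return changed
    # Situation (2): deletion/addition -- every word of c' is the intervention.
    return cf

print(find_intervention("he built a zoo", "he built a hotel"))  # ['hotel']
```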

4.4.2. Sequence-Aware Correlation Calculation

Building upon the previous causal pathway analysis, we utilize the relationships between adjacent event sequences to calculate the elements in the story ending that are affected by the intervention. To assess the correlation between tokens, we employ Numberbatch word embeddings [43] to calculate the similarity between them. Tokens with high similarity indicate a significant impact, whereas tokens with lower similarity suggest a lesser influence from the intervened variables. The Numberbatch embeddings are trained on diverse resources, including ConceptNet [43], Word2Vec [61], GloVe [62], and OpenSubtitles [63]. By leveraging both textual information and the structured knowledge graph of ConceptNet, these vectors capture semantic representations that surpass what can be learned directly from general language corpora, and Numberbatch achieves strong performance on tasks related to commonsense knowledge [64]. For instance, when calculating the cosine similarity between the word zoo and the sentence The zoo had unusual animals that nobody had ever seen before, the tokens zoo and animals in this sentence exhibit the highest cosine similarity scores. This observation aligns with human intuition based on common knowledge.
In the previous step, we identified the location of the perturbation applied to the original conditions, which we refer to as the intervention. We record the original story ending, comprising the third to fifth sentences, as e = {s_3, s_4, s_5}. To determine the words in the story ending that are influenced by the intervention, we use cosine similarity in the Numberbatch embedding space. Algorithm 1 outlines the procedure. First, we take the intervention as the initial cause words and calculate their correlation with all the words in s_3. Words whose cosine similarity surpasses a predefined threshold are identified as the cause words in s_3. Using the cause words in s_3, we then compute their correlation with all the words in s_4. This process is repeated iteratively to obtain the complete set of relevant words across the three sentences. The threshold value was determined through an extensive series of experiments and comparisons, as discussed in detail in the subsequent ablation experiments section.
Algorithm 1 Sequence-aware correlation computation
Input: intervention: initial intervention words; e = {s_n, s_{n+1}, …, s_m}: the three sentences of the ending; α: threshold used to determine correlation
Output: output = {causal_n, causal_{n+1}, …, causal_m}: the causal word set of each sentence
1: causal_words ← intervention ▹ Initialize the set of causal words used for the first sentence
2: for j ← n to m and s_j ∈ e do
3:   causal_j ← {} ▹ Initialize an empty set for causal words in the j-th sentence
4:   for word ∈ s_j do
5:     similarity ← avg(cosine_similarity(word, causal_words)) ▹ Average cosine similarity between word and all words in causal_words
6:     if similarity > α then
7:       causal_j ← causal_j ∪ {word} ▹ Add word to the causal word set of the j-th sentence
8:     end if
9:   end for
10:  causal_words ← causal_j ▹ Update the set of causal words used for the next sentence
11: end for
12: return {causal_n, causal_{n+1}, …, causal_m} ▹ Return the detected sets of causal words
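Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the released implementation: the toy embedding dictionary stands in for the Numberbatch vectors, and out-of-vocabulary tokens are simply skipped.

```python
import numpy as np

def sequence_aware_correlation(intervention, sentences, vectors, alpha):
    """Iteratively propagate causal words through the ending (Algorithm 1 sketch).

    intervention: initial intervention words
    sentences:    tokenized sentences [s_n, ..., s_m] of the ending
    vectors:      word -> embedding dict (toy stand-in for Numberbatch)
    alpha:        correlation threshold
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    causal_words = list(intervention)
    output = []
    for sent in sentences:
        causal_j = []
        for word in sent:
            if word not in vectors:
                continue  # skip out-of-vocabulary tokens
            # Average similarity between this word and the current causal set
            sims = [cos(vectors[word], vectors[c]) for c in causal_words if c in vectors]
            if sims and sum(sims) / len(sims) > alpha:
                causal_j.append(word)
        output.append(causal_j)
        causal_words = causal_j  # the detected set seeds the next sentence
    return output
```

For example, with toy 2-d vectors in which zoo and animals point in similar directions, the intervention zoo propagates to animals in the first sentence and onward to the next.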

4.4.3. Skeleton Acquisition

Through the preceding steps, we have identified the causal words in the story ending that are influenced by the initial condition. To create a counterfactual scenario where the ending is unaffected by the original condition, we replace these causal words with blank spaces and subsequently merge any consecutive spaces. This process yields a fundamental skeleton of the ending that remains independent of the original story condition, ensuring that the ending under the counterfactual condition remains unaltered by the initial condition.
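The blank-and-merge step described above can be sketched as follows; the `[BLANK]` placeholder token is an assumption for illustration, not necessarily the symbol used in the actual implementation.

```python
BLANK = "[BLANK]"  # hypothetical placeholder; the real blank symbol may differ

def extract_skeleton(ending_tokens, causal_words):
    """Replace causal words with blanks, then merge consecutive blanks."""
    masked = [BLANK if tok in causal_words else tok for tok in ending_tokens]
    skeleton = []
    for tok in masked:
        # Collapse runs of consecutive blanks into a single blank
        if tok == BLANK and skeleton and skeleton[-1] == BLANK:
            continue
        skeleton.append(tok)
    return " ".join(skeleton)
```

For instance, masking the adjacent causal words zoo and animals in "He fed the zoo animals daily" yields the skeleton "He fed the [BLANK] daily".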

4.5. Knowledge-Alignment Commonsense Generator

The primary objective of this module is to generate relevant commonsense knowledge based on counterfactual conditions and to incorporate this knowledge into the model through prompts. This guidance encourages the model to account for the impact of commonsense knowledge when generating story endings. When humans create a story, they often employ commonsense reasoning over the preceding text to develop a comprehension of the narrative being presented [65]. Machines, however, constrained by their training data, lack a universal understanding of commonsense knowledge; to enhance their ability in this regard, it is necessary to incorporate relevant and accurate commonsense knowledge. For this purpose, we utilize COMET [24], a generative knowledge transformer that facilitates commonsense reasoning. Given a counterfactual condition as input, the model generates natural language descriptions covering nine dimensions of commonsense knowledge; examples are illustrated in Table 1. In the classification of ATOMIC [66], these nine structured commonsense descriptions can be divided into three categories:
  • If-Event-Then-Mental-State: Defines three relations relating to the mental pre- and post-conditions of an event, including XIntent (why does X cause the event), XReact (how does X feel after the event), and OReact (how do others feel after the event). Our focus lies on the knowledge of events related to explicitly mentioned participants, specifically within the categories of XIntent and XReact.
  • If-Event-Then-Event: Defines five relations relating to events that constitute probable pre- and post-conditions of a given event, including XNeed (what does X need to do before the event), XEffect (what effects does the event have on X), XWant (what would X likely want to do after the event), OWant (what would others likely want to do after the event), and OEffect (what effects does the event have on others). Our focus is on the knowledge of events that are related to explicitly mentioned participants, encompassing the categories of XNeed, XEffect, and XWant.
  • If-Event-Then-Persona: Defines a stative relation that describes how the subject of an event is described or perceived, including XAttr (how would X be described).
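To make the focus concrete, the relations used by CLICK can be collected from COMET-style generations and flattened into a knowledge string, roughly as below. The dictionary format of `comet_outputs` is a hypothetical stand-in; the real COMET interface returns beam-searched generations per relation.

```python
# Relations the paper focuses on, grouped across the three ATOMIC categories
FOCUS_RELATIONS = ["XIntent", "XReact", "XNeed", "XEffect", "XWant", "XAttr"]

def build_knowledge_prompt(comet_outputs, relations=FOCUS_RELATIONS):
    """Flatten COMET-style generations into a single knowledge string.

    comet_outputs: dict mapping relation name -> list of generated phrases
                   (hypothetical format; real COMET returns beam candidates)
    """
    parts = []
    for rel in relations:
        phrases = comet_outputs.get(rel, [])
        if phrases:
            parts.append(f"{rel}: {phrases[0]}")  # keep only the top generation
    return " ".join(parts)
```

Keeping only the top generation per relation mirrors the single-possibility selection described later in the knowledge ablation experiments.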

4.6. Commonsense-Constrained Generative Model

Given the original story and a counterfactual condition, we first obtain a basic skeleton using the skeleton extractor module. Next, we utilize the commonsense generator module to extend the counterfactual condition with relevant structured commonsense knowledge. With these two components, we train the model to fill in the skeleton under the guidance of commonsense knowledge. This training enables the model to generate a story ending that aligns with the specified counterfactual condition. In our approach, we utilize GPT-2 [16] as the underlying language model. The input sequence for the model consists of four main components: the premise (p), the counterfactual condition (c′), the basic skeleton (s), and the commonsense knowledge (k). These components are combined and represented as {[PRE] p [CON] c′ [SKE] s [KNOW] k [END]}, where [PRE], [CON], [SKE], [KNOW], and [END] denote special tokens. The primary objective is to generate the counterfactual ending (e′) of the story. This input format allows the model to incorporate relevant information and guide the generation process based on both the given context and commonsense knowledge.
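The special-token layout can be sketched as a simple formatting helper (a minimal illustration; the tokenizer's handling of the special tokens is omitted):

```python
SPECIALS = dict(PRE="[PRE]", CON="[CON]", SKE="[SKE]", KNOW="[KNOW]", END="[END]")

def build_input(premise, condition, skeleton, knowledge):
    """Concatenate the four components with the paper's special tokens.

    At training time `condition`/`knowledge` come from the original story;
    at inference time they are the counterfactual condition and its
    extended knowledge.
    """
    return (f"{SPECIALS['PRE']} {premise} "
            f"{SPECIALS['CON']} {condition} "
            f"{SPECIALS['SKE']} {skeleton} "
            f"{SPECIALS['KNOW']} {knowledge} "
            f"{SPECIALS['END']}")
```

The [END] token closes the input and marks where decoding of the ending begins.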
During the training phase, we utilize unsupervised training data that include only the original story and counterfactual conditions, without the corresponding rewriting counterfactual endings. To construct training instances, we assemble the premise (p), condition (c), and basic skeleton (s) extracted from the original story ending and the extended commonsense knowledge specific to the original condition (k). These components are concatenated into the following sequence, which serves as the input for the GPT-2 model: {[PRE] p [CON] c [SKE] s [KNOW] k [END]}. The original ending is used as the target output. In this approach, the GPT-2 model learns to preserve certain words from the skeleton while generating the final ending. It employs the provided commonsense knowledge to guide the generation process and fill in the blanks. The [END] token serves as the starting symbol for the decoding process, and GPT-2 generates the ending word by word. The probability distribution of the output words is as follows:
p(y_t | x, y_{<t}) = GPT-2(x, y_{<t})
where y_t (e_t in the training phase and e′_t in the inference phase) is the t-th token after the [END] token, y_{<t} represents the tokens between the [END] token and the t-th token, x represents the tokens preceding the [END] token, and GPT-2(z) is the function that returns the current-step output distribution of GPT-2 given z as its input.
In the generation phase, we train the model using the following loss:
L_gen = −∑_{t=1}^{m} log p(e_t | p, c, s, k, e_{<t})
where e_t is the t-th word in the original ending, e_{<t} represents the words before the t-th word, and m is the length of the original ending.
In the inference phase, given the original story and counterfactual condition, the skeleton extractor module is employed to obtain the basic skeleton (s) representing the structure of the original ending. Next, the commonsense generator module is utilized to generate extended commonsense knowledge (k′) specific to the counterfactual condition. The ending generator leverages the basic skeleton (s) and the extended commonsense knowledge (k′), along with the premise (p) and the counterfactual condition (c′), to generate the counterfactual ending. The input sequence for the model is constructed as {[PRE] p [CON] c′ [SKE] s [KNOW] k′ [END]}. The role of the ending generator is to retain the essential words from the given skeleton and generate new words to fill in the blank spaces based on the available input information. By incorporating the provided context, counterfactual condition, and commonsense knowledge, the generator produces a coherent and contextually appropriate counterfactual ending for the story. Specifically, the GPT-2 editor predicts the counterfactual ending token by token:
ê_t = sample_{e′_t ∈ V} p(e′_t | p, c′, s, k′, e′_{<t})
where sample denotes the top-k [67] sampling method and V is the vocabulary. Decoding stops when a sentence terminator is predicted, producing a generated counterfactual ending {ê_1, ê_2, …, ê_n} of length n.
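A minimal NumPy sketch of the top-k sampling step, with temperature applied to the logits before truncation (illustrative only; actual decoding draws from the GPT-2 vocabulary distribution):

```python
import numpy as np

def top_k_sample(logits, k=40, temperature=0.7, rng=None):
    """Sample one token id from the top-k entries of a logit vector."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    top_ids = np.argsort(logits)[-k:]          # indices of the k largest logits
    probs = np.exp(logits[top_ids] - logits[top_ids].max())
    probs /= probs.sum()                       # renormalize over the top-k only
    return int(rng.choice(top_ids, p=probs))
```

Restricting the sample to the k most probable tokens prunes the low-probability tail while the temperature sharpens the remaining distribution.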

5. Experiments

5.1. Dataset

We run experiments with CLICK on TIMETRAVEL [12], a standard counterfactual story rewriting dataset built on the ROCStories [68] corpus. In TIMETRAVEL, the initial condition of each story was rewritten by humans into a counterfactual condition, followed by edited endings; only part of the training set is annotated with edited endings. We train CLICK in an unsupervised manner, i.e., without access to manually edited endings. The unsupervised dataset contains 96,867 original and counterfactual five-sentence story pairs for training. The development and test sets each contain 1871 original stories, and each original story has one counterfactual condition and three rewritten counterfactual endings.

5.2. Evaluation Metrics

In prior research conducted by Qin et al. [12], the model performance is evaluated using metrics such as BLEU [69] and BERTScore [70]. The BLEU metric calculates the number of overlapping n-grams between the generated and reference endings, and the BERTScore computes their cosine similarity using BERT encodings. However, it was found that while BLEU effectively measures the minimal edits property, its correlation with human judgments is relatively weak. The BERTScore metric faces the same problem. In more recent investigations, Chen et al. [20] introduced two novel metrics, ENTScore and HMean, which demonstrate greater alignment with human evaluation judgments. These metrics provide improved consistency when evaluating the performance of models. Specifically, the ENTScore evaluates the probability of whether an ending is entailed by the counterfactual context, and HMean calculates the harmonic mean of the ENTScore and BLEU, providing a balanced assessment of coherence and minimal edits. Therefore, we focus on the performance of the model on the HMean metric in the experiment.
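Given ENTScore and BLEU values on the same scale, HMean is simply their harmonic mean, which is high only when both components are high:

```python
def hmean(entscore, bleu):
    """Harmonic mean of ENTScore and BLEU (coherence vs. minimal edits)."""
    if entscore + bleu == 0:
        return 0.0
    return 2 * entscore * bleu / (entscore + bleu)
```

For instance, hmean(40, 60) is 48, while a model that scores 0 on either axis receives an HMean of 0, so excelling at only one requirement of the task cannot inflate the composite score.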

5.3. Implementation Details

All experiments are implemented on an NVIDIA Tesla V100 GPU with the PyTorch https://pytorch.org, (accessed on 5 October 2023) framework. In the skeleton extractor module, we use ConceptNet Numberbatch word embeddings https://github.com/commonsense/conceptnet-numberbatch, (accessed on 5 October 2023) to calculate the similarity between tokens. In the commonsense generator module, we use the COMET model https://github.com/atcbosselut/comet-commonsense, (accessed on 5 October 2023) to expand the commonsense knowledge for counterfactual conditions. In the ending generator module, we use the medium version of GPT-2 from HuggingFace's Transformers library https://huggingface.co/gpt2-medium, (accessed on 5 October 2023) as the base decoder. We use Adam optimization for both models, with the initial learning rates set to 5 × 10⁻⁵ and 1.5 × 10⁻⁴, respectively. A warm-up strategy is applied with the number of warm-up steps set to 2000. The batch size for the training phase is set to 8. We train the CLICK model for 15 epochs and select the best model on the validation set. During the inference stage, we use top-k sampling with the temperature set to 0.7 and k set to 40.
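The warm-up behavior can be sketched as a learning-rate schedule; we assume linear warm-up to the base rate followed by a constant rate, since the post-warm-up decay is not specified in the paper.

```python
def warmup_lr(step, base_lr, warmup_steps=2000):
    """Linear warm-up to base_lr, then constant (post-warm-up decay is an assumption)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

With base_lr = 5e-5 and 2000 warm-up steps, the rate ramps from near zero to 5e-5 over the first 2000 updates and stays there afterwards.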

5.4. Compared Approaches

We compare CLICK with the following baselines:
  • GPT2-M: [12] utilizes a pre-trained model GPT-2 for story ending rewriting. The method receives the story premise and counterfactual condition as input, without undergoing any training on the dataset.
  • GPT2-M + FT: [12] fine-tunes the pre-trained model GPT-2 to maximize the log-likelihood of the stories in the ROCStories corpus. The premise and the counterfactual condition are provided as input.
  • DELOREAN: [71] is an unsupervised decoding algorithm that can flexibly incorporate both the past and future contexts using only off-the-shelf, left-to-right language models and no supervision. The method receives the story premise and counterfactual condition as input, without undergoing any training on the dataset.
  • EDUCAT: [20] is an editing-based unsupervised approach for counterfactual story rewriting, which includes a target position detection strategy and a modification action.
  • Human: One of the three ground-truth counterfactual endings edited by humans. The results are from [20].
  • CLICK-α-w/o-kno: A version of CLICK that does not use commonsense knowledge in the experiment. This variant receives the story premise, counterfactual condition, and skeleton as input. The correlation threshold α in the skeleton extractor module is set to 0.2.
  • CLICK-w/o-ske: A version of CLICK that does not use the skeleton in the experiment. The variant method receives the story premise, counterfactual condition, and commonsense knowledge as input.
  • CLICK: The full version of our method.

5.5. Main Results

To verify the effectiveness and superiority of our proposed method, we conducted a comprehensive comparison with the state-of-the-art baseline models, and the experimental results are shown in Table 2.
(1) Compared to GPT-2, GPT2+FT, and DELOREAN, which are zero-shot or fine-tuned on the ROCStories dataset, our CLICK method demonstrates better performance on the comprehensive evaluation metric HMean. While these methods exhibit high ENTScore values, this can be attributed to the fluency and consistency of unconstrained free generation, where the original story ending exerts minimal control over the generated content. Furthermore, all of them exhibit lower BLEU scores. This suggests that pre-trained generation models cannot naturally adapt to counterfactual generation tasks, and that fine-tuning on similar story datasets fails to teach the model counterfactual rewriting and minimum-editing constraints. In contrast, our approach achieves a better balance between minimal editing and consistency, making it more appropriate for counterfactual generation tasks.
(2) In comparison to the editing-based approach EDUCAT, our method demonstrates improvements of 2.65 and 4.42 in terms of BLEU and ENTScore, respectively, as well as a 3.84 increase in the comprehensive metric HMean. The improvement in the BLEU metric signifies that our method achieves a higher degree of preservation for unnecessarily modified words in the original ending, thus demonstrating enhanced accuracy in detecting causal conflicts between the original ending and counterfactual condition. The improvement in the ENTScore demonstrates that the endings generated by our method adhere more effectively to the guidance provided by the counterfactual condition. This demonstrates the ability of CLICK to enhance the alignment between the counterfactual condition and the generated ending.
(3) To comprehensively assess the model's capabilities in terms of minimal editing and counterfactual consistency, we concentrate on the comprehensive evaluation metric HMean. While previous methods excel in individual aspects, this does not guarantee their ability to fulfill both requirements of the task. To illustrate this, we introduce two variants of the CLICK method. CLICK-α-w/o-kno, which maximally preserves words from the original ending in the counterfactual ending, achieves the highest BLEU and BERTScore but falls short of human-level performance in ENTScore. Conversely, CLICK-w/o-ske relies solely on commonsense knowledge to guide the generation of counterfactual endings, resulting in the highest ENTScore. However, the endings generated by CLICK-w/o-ske fail to achieve minimal editing, as they do not make adequate use of the original ending as a constraint, and they exhibit BLEU performance comparable to the GPT2 and GPT2-FT methods. While a variant of our method may achieve superior performance on a single metric, that alone is insufficient to evaluate the model's overall performance on the task, necessitating a comparison using the comprehensive evaluation metric.

5.6. Analysis and Discussion

5.6.1. Ablation Study

We conduct ablation studies to assess the effectiveness of each component and introduce the following variant models for comparison:
  • w/o skeleton means removing the skeleton extractor module;
  • w/o knowledge means removing the commonsense generator module;
  • w/o ske w/o kno means removing both the skeleton extractor module and the commonsense generator module.
The experimental results of the ablation studies are presented in Table 3. It is evident that the model exhibits inferior performance in the comprehensive evaluation metric upon removing each component. From these findings, we can draw the following conclusions:
(1) After removing the skeleton extractor module, the model experiences a significant 36.3% drop in the HMean metric and a 44.2% drop in the BLEU metric, indicating that the skeleton module has a positive impact on the minimal editing of the original ending and on preserving words in the generated ending that are consistent with the counterfactual condition. This observation further demonstrates the validity of utilizing causal relationships between event sequences to test causal invariance.
(2) After removing the commonsense generator module, the performance of the model decreases by 1.03% in HMean and 1.24% in ENTScore, suggesting that this module promotes consistency between the generated ending and the counterfactual condition. This observation further demonstrates the effectiveness of incorporating commonsense knowledge to strengthen the guidance of counterfactual conditions.
(3) Removing both the skeleton and knowledge modules leads to a 38.4% decrease in the HMean metric, highlighting the insufficiency of relying solely on a generic generative model for this task. This can be primarily attributed to the model's inadequate comprehension of counterfactual invariance within the causal narrative chain and its limited ability to perform precise minimal editing based on the original ending.

5.6.2. Effect of Skeleton

In this section, we perform experiments to investigate the effect of the similarity threshold and word embedding selection on skeleton extraction. These experiments focus solely on the skeleton extractor module, without incorporating commonsense knowledge into the counterfactual conditions; the results are summarized in Figure 5 and Figure 6. Specifically, we utilize the skeleton extractor module to obtain the fundamental skeleton and construct the input sequence {[PRE] p [CON] c [SKE] s [END]} for both the training and inference stages.
In the correlation detection stage, we employ a similarity threshold α to decide whether a similarity score indicates correlation: when the cosine similarity between two word vectors surpasses the threshold, the corresponding words are considered correlated. For this experiment, we used the Numberbatch word vectors and tested various thresholds, with the results presented in Figure 5. The observations reveal that higher thresholds lead to higher BLEU and BERTScore but a lower ENTScore. This can be attributed to the fact that a higher threshold excludes only the most highly relevant words from the generated endings. Consequently, a larger number of words are retained in the skeleton and subsequently in the generated counterfactual ending, which leads to a higher BLEU score. However, this also retains more erroneous words, which can interfere with the model as part of the input, leading to a lower ENTScore. Considering the HMean composite index, we determine that a threshold of 0.1 serves as the optimal parameter.
The skeleton extractor module evaluates the correlation between two tokens by computing the cosine similarity between their word embeddings. We compare two types of word embeddings for this task: BERT [72], derived from a pretrained language model, and Numberbatch, derived from a knowledge graph. The results in Figure 6 indicate that the method performs better with Numberbatch word vectors than with BERT across the range of cosine similarity thresholds. With BERT vectors, the model obtains a relatively high BLEU score but a low ENTScore. This suggests that BERT-based detection prioritizes preserving words from the original ending in the counterfactual ending, raising the BLEU score, but inadvertently retains certain words that contradict the counterfactual condition, lowering the ENTScore. These observations highlight the limitations of BERT vectors for effectively detecting causal invariance. Word vectors from pre-trained language models like BERT offer extensive semantic representation capabilities; however, vectors trained on structured sources such as knowledge graphs are more effective at capturing entity relationships and align better with our objectives. Therefore, we ultimately choose the Numberbatch vectors for our method.

5.6.3. Effect of Commonsense Knowledge

In this section, we conduct experiments to investigate the influence of different types of commonsense knowledge on the performance of the model. The experimental results are summarized in Table 4. In order to ensure a fair comparison, we augment the counterfactual conditions with various types of commonsense knowledge as guidance for the model, while using the same basic skeleton (NumberBatch word embedding, threshold = 0.1). Sap et al. (2019) [66] propose nine if–then relation types to differentiate causes from effects, agents from themes, voluntary events from involuntary events, and actions from mental states. Given the counterfactual conditions, we utilize the COMET model to generate nine natural language expressions, and examine the influence of various combinations of relationship types on the performance of the model. It should be emphasized that we focus on the mental states, events, and character attributes related to explicitly mentioned participants. Specifically, we concentrate on the following categories: (XEffect + XWant + XNeed) in the If-Event-Then-Event category, (XIntent + XReact) in the If-Event-Then-Mental-State category, and XAttr in the If-Event-Then-Persona category.
Firstly, as shown in Table 4, commonsense knowledge can serve as a valuable aid to the model. Augmenting the counterfactual conditions with XIntent, XNeed, XAttr, or OEffect knowledge improves the HMean metric compared to relying solely on the skeleton. This improvement can be attributed to the provision of more detailed counterfactual conditions; for example, XIntent indicates the intention of subject X to perform a specific event, and such reasoning information assists the model in predicting potential subsequent events. It is worth noting that certain types of knowledge do not have a positive impact on the model but instead introduce interference that leads to a slight decrease in performance. For instance, XReact reflects the emotional response of subject X after the event. Given the numerous possibilities generated by COMET, we select only one possibility as input to the model for testing. Consequently, the endings generated by the model may exhibit emotional states that deviate from the narrative direction of the manually rewritten endings in the test set. This discrepancy highlights the inherent conflict between fostering varied text generation and maintaining consistency with the original data distribution.
Secondly, the integration of multiple types of knowledge has a greater impact on enhancing the model than relying solely on a single type of information. In terms of the overall HMean metric, the combinations of knowledge in the If-Event-Then-Event category and the If-Event-Then-Mental-State category outperform their corresponding single knowledge types. This evidence demonstrates that comprehensive knowledge can amplify the enhancement provided by single knowledge types. By increasing the richness of input knowledge, the model can develop a deeper understanding of past events and improve its ability to plan for the future.
Furthermore, we also conduct experiments to examine the impact of incorporating commonsense knowledge about other participants involved in the event on the model. Specifically, we explore the influence of If-Event-Then-Others in the table. Surprisingly, this particular knowledge type also yields a slight improvement in story rewriting. The improvement in overall metric primarily manifests in an improved BLEU metric, indicating that the relevant knowledge about other participants assists the model in generating counterfactual endings that closely resemble the original ones. We believe that this is possible due to the following reason: The presence of auxiliary characters and their interactions with the protagonist can indeed drive the story’s development. While changes in the story conditions often result in alterations in the protagonist’s behavior, auxiliary characters tend to maintain their original personality and behavior throughout both the original and rewritten narratives. Thus, incorporating commonsense information about these auxiliary characters helps the model approach the original ending more closely when rewriting the counterfactual ending.

5.7. Case Study

Table 5 shows two examples of generating counterfactual endings using different methods. The Sketch&Customize method is a supervised approach proposed by Hao et al. [19]. In the first example, the EDUCAT method yields logically implausible content and exhibits incoherent story progression. Conversely, the Sketch&Customize method generates logically coherent content but suffers from excessive editing. In contrast, the CLICK method shows the advantages of minimal editing and counterfactual consistency. It seamlessly integrates the rainy scene, incorporating new elements like a rainbow and had lunch by the window, while maximizing the retention of the original ending’s wording.
In the second example, the EDUCAT method fails to make the essential targeted alterations to the original ending in light of the counterfactual condition. Water-related events from the original scenario persist in the new ending, and no specific adjustments are made for the ingredients mentioned in the counterfactual condition. The content generated by the Sketch&Customize method suffers from logical incoherence and semantic expression issues. While the method attempts to preserve the original ending content as much as possible, the replacement and modification of words are often misplaced, leading to semantic confusion in the corresponding sections. The content generated by our CLICK method successfully maintains a significant portion of the original ending while incorporating new words, including mold and pasta, relating to the mentioned ingredients.

6. Conclusions

In this paper, we propose a counterfactual generation framework based on causal inference in event sequences and commonsense knowledge incorporation. The primary objective is to maintain the minimal editing constraint while incorporating new vocabulary and generating plausible story endings. To eliminate content that may conflict with the counterfactual condition in the story ending, we employ causal graph analysis and utilize the correlation between adjacent events in the story ending to iteratively calculate the contents from the original ending affected by the condition. To enhance the causal consistency between the story ending and the counterfactual condition, we integrate diverse, structured commonsense knowledge, facilitating the construction of coherent causal relationships and the modification of conflicting words to ensure a cohesive narrative. In the future, we intend to explore counterfactual rewriting in longer texts. This expanded exploration will provide a more rigorous evaluation of the model's ability to balance counterfactual consistency and minimal editing of the original text. Furthermore, analyzing causal narrative chains within longer texts presents a heightened challenge, which our forthcoming work aims to tackle.

7. Limitations

Our approach primarily relies on causal relationships between adjacent events for counterfactual invariance detection. We have experimentally validated the effectiveness of this approach on short-text datasets commonly used in the current research domain. However, when dealing with the complexities and diversities of real-world texts, particularly in the context of longer documents, we acknowledge the presence of more intricate narrative structures, intricate causal relationships, and longer temporal dependencies. These intricacies necessitate methodological adjustments and adaptations to ensure applicability in longer and more complex text settings. Therefore, in the future, we intend to extend our approach to tackle the challenges posed by complex long-text scenarios.

Author Contributions

Conceptualization, D.L. and Z.G.; methodology, D.L. and L.J.; validation, D.L. and Q.L.; investigation, Z.Z.; resources, Z.Z.; data curation, K.W.; writing—original draft preparation, D.L. and Z.G.; writing—review and editing, D.L. and F.L.; visualization, K.W.; supervision, F.L.; project administration, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No.62206267), and supported by Research Funding of Satellite Information Intelligent Processing and Application Research Laboratory (2022-ZZKY-ZD-05-01).

Data Availability Statement

The data that support the findings of this study are openly available on Github at https://github.com/qkaren/Counterfactual-StoryRW, (accessed on 5 October 2023) reference number [12].

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Causal Graph

Causal graphs have emerged as a crucial tool for studying causal relationships, especially when randomizing units into treatment and control groups is infeasible or unethical. Their primary function is to help researchers identify and understand causal relationships in the data-generating process: they make explicit which variables are causally related and which are confounders, and they guide the construction of causal models. Causal graphs thus support a clearer understanding of research questions and inform appropriate statistical strategies for estimating valid causal effects.
A causal graph is a graphical representation of a causal model, typically a directed acyclic graph (DAG). Nodes represent variables, arrows denote the direction of causality, and the absence of directed cycles encodes the assumption that no variable is a cause of itself. Causal graphs provide a visual framework for depicting causal relationships among variables, including treatment variables, outcome variables, and potential covariates.
To illustrate the challenges of estimating causal effects from observational data, consider the simplified data-generating process in Figure A1, which involves three variables: Treatment, Outcome, and a Confounder covariate. Treatment exerts a direct causal influence on Outcome, as indicated by the directed edge from Treatment to Outcome, while Confounder is a common cause of both Treatment and Outcome. For instance, suppose Treatment is a binary variable indicating whether an individual receives a specific medication, and Outcome is a binary variable indicating whether a particular side effect occurs. If men are both more likely than women to receive the medication and more prone to the side effect, then gender acts as the common cause (Confounder) linking Treatment and Outcome in Figure A1.
Figure A1. An example of a causal graph.
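As a minimal sketch (not code from the paper), the graph of Figure A1 can be encoded as a plain adjacency list and checked for the acyclicity that a causal DAG requires:

```python
# The DAG of Figure A1 as an adjacency list:
# Confounder -> Treatment, Confounder -> Outcome, Treatment -> Outcome.
causal_graph = {
    "Confounder": ["Treatment", "Outcome"],
    "Treatment": ["Outcome"],
    "Outcome": [],
}

def is_acyclic(graph):
    """Return True if the directed graph has no cycle (DFS three-coloring)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def visit(v):
        color[v] = GRAY            # v is on the current DFS path
        for w in graph[v]:
            if color[w] == GRAY:   # back edge -> directed cycle
                return False
            if color[w] == WHITE and not visit(w):
                return False
        color[v] = BLACK           # v and its descendants are cycle-free
        return True

    return all(color[v] != WHITE or visit(v) for v in graph)

assert is_acyclic(causal_graph)
# Reversing causality by adding Outcome -> Treatment would create a cycle:
assert not is_acyclic({**causal_graph, "Outcome": ["Treatment"]})
```

The backdoor path Treatment ← Confounder → Outcome visible in this graph is exactly what biases the naive Treatment–Outcome association discussed below.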
After constructing the causal graph over these three variables, the next task is to quantify the causal effect of the treatment on the outcome. Confounding induces a spurious association between Treatment and Outcome because the Confounder is a causal factor for both: part of the observed association between Treatment and Outcome flows along the biasing backdoor path from Treatment through Confounder to Outcome. To compute the causal effect precisely, this biasing path must be blocked by adjusting for the Confounder.
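To make this adjustment concrete, the following minimal sketch contrasts the naive conditional estimate with the backdoor-adjusted one for the medication example; all probabilities are invented for illustration and are not taken from any study.

```python
# Z: gender (1 = male), T: medication given, Y: side effect occurs.
# All numbers below are illustrative assumptions.
p_z = {0: 0.5, 1: 0.5}                      # P(Z = z)
p_t1_given_z = {0: 0.2, 1: 0.8}             # men more likely to be treated
p_y1_given_tz = {(1, 0): 0.3, (1, 1): 0.5,  # P(Y = 1 | T = t, Z = z)
                 (0, 0): 0.2, (0, 1): 0.4}

def p_y1_given_t(t):
    """Naive P(Y = 1 | T = t), which absorbs the bias along T <- Z -> Y."""
    p_t_given_z = {z: p_t1_given_z[z] if t == 1 else 1 - p_t1_given_z[z]
                   for z in p_z}
    joint = sum(p_y1_given_tz[(t, z)] * p_t_given_z[z] * p_z[z] for z in p_z)
    return joint / sum(p_t_given_z[z] * p_z[z] for z in p_z)

# Naive contrast P(Y=1 | T=1) - P(Y=1 | T=0): biased by the confounder.
naive = p_y1_given_t(1) - p_y1_given_t(0)

# Backdoor adjustment: P(Y=1 | do(T=t)) = sum_z P(Y=1 | t, z) P(z),
# so the causal effect is sum_z [P(Y=1 | 1, z) - P(Y=1 | 0, z)] P(z).
adjusted = sum((p_y1_given_tz[(1, z)] - p_y1_given_tz[(0, z)]) * p_z[z]
               for z in p_z)

print(round(naive, 2), round(adjusted, 2))  # naive 0.22 vs. causal 0.10
```

With these numbers the naive estimate more than doubles the true effect, because men are over-represented among the treated and are also more susceptible to the side effect.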

References

  1. Cornacchia, G.; Anelli, V.W.; Biancofiore, G.M.; Narducci, F.; Pomo, C.; Ragone, A.; Sciascio, E.D. Auditing fairness under unawareness through counterfactual reasoning. Inf. Process. Manag. 2023, 60, 103224. [Google Scholar] [CrossRef]
  2. Tian, B.; Cao, Y.; Zhang, Y.; Xing, C. Debiasing NLU Models via Causal Intervention and Counterfactual Reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 22 February–1 March 2022; Volume 36, pp. 11376–11384. [Google Scholar]
  3. Jaimini, U.; Sheth, A.P. CausalKG: Causal Knowledge Graph Explainability Using Interventional and Counterfactual Reasoning. IEEE Internet Comput. 2022, 26, 43–50. [Google Scholar] [CrossRef]
  4. Huang, Z.; Kosan, M.; Medya, S.; Ranu, S.; Singh, A.K. Global Counterfactual Explainer for Graph Neural Networks. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, Singapore, 27 February 2023–3 March 2023; pp. 141–149. [Google Scholar]
  5. Stepin, I.; Alonso, J.M.; Catalá, A.; Pereira-Fariña, M. A Survey of Contrastive and Counterfactual Explanation Generation Methods for Explainable Artificial Intelligence. IEEE Access 2021, 9, 11974–12001. [Google Scholar] [CrossRef]
  6. Temraz, M.; Keane, M.T. Solving the class imbalance problem using a counterfactual method for data augmentation. Mach. Learn. Appl. 2022, 9, 100375. [Google Scholar] [CrossRef]
  7. Calderon, N.; Ben-David, E.; Feder, A.; Reichart, R. DoCoGen: Domain Counterfactual Generation for Low Resource Domain Adaptation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022. [Google Scholar]
  8. Howard, P.; Singer, G.; Lal, V.; Choi, Y.; Swayamdipta, S. NeuroCounterfactuals: Beyond Minimal-Edit Counterfactuals for Richer Data Augmentation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022. [Google Scholar]
  9. Wang, X.; Zhou, K.; Tang, X.; Zhao, W.X.; Pan, F.; Cao, Z.; Wen, J. Improving Conversational Recommendation Systems via Counterfactual Data Simulation. arXiv 2023, arXiv:2306.02842. [Google Scholar]
  10. Filighera, A.; Tschesche, J.; Steuer, T.; Tregel, T.; Wernet, L. Towards Generating Counterfactual Examples as Automatic Short Answer Feedback. In Proceedings of the Artificial Intelligence in Education—23rd International Conference, Durham, UK, 27–31 July 2022; Volume 13355, pp. 206–217. [Google Scholar]
  11. Liu, X.; Feng, Y.; Tang, J.; Hu, C.; Zhao, D. Counterfactual Recipe Generation: Exploring Compositional Generalization in a Realistic Scenario. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 7354–7370. [Google Scholar]
  12. Qin, L.; Bosselut, A.; Holtzman, A.; Bhagavatula, C.; Clark, E.; Choi, Y. Counterfactual Story Reasoning and Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019. [Google Scholar] [CrossRef]
  13. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Virtual, 6–12 December 2020. [Google Scholar]
  14. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. arXiv 2022, arXiv:2204.02311. [Google Scholar]
  15. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  16. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI blog 2019, 1, 9. [Google Scholar]
  17. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
  18. Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual Denoising Pre-training for Neural Machine Translation. Trans. Assoc. Comput. Linguistics 2020, 8, 726–742. [Google Scholar] [CrossRef]
  19. Hao, C.; Pang, L.; Lan, Y.; Wang, Y.; Guo, J.; Cheng, X. Sketch and Customize: A Counterfactual Story Generator. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 12955–12962. [Google Scholar]
  20. Chen, J.; Gan, C.; Cheng, S.; Zhou, H.; Xiao, Y.; Li, L. Unsupervised Editing for Counterfactual Stories. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 22 February–1 March 2022; Volume 36, pp. 10473–10481. [Google Scholar]
  21. Bisk, Y.; Holtzman, A.; Thomason, J.; Andreas, J.; Bengio, Y.; Chai, J.; Lapata, M.; Lazaridou, A.; May, J.; Nisnevich, A.; et al. Experience Grounds Language. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Virtually, 16–20 November 2020; pp. 8718–8735. [Google Scholar]
  22. Tan, B.; Yang, Z.; Al-Shedivat, M.; Xing, E.P.; Hu, Z. Progressive Generation of Long Text with Pretrained Language Models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 4313–4324. [Google Scholar]
  23. Dziri, N.; Madotto, A.; Zaïane, O.; Bose, A.J. Neural Path Hunter: Reducing Hallucination in Dialogue Systems via Path Grounding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 2197–2214. [Google Scholar]
  24. Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; Choi, Y. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. In Proceedings of the 57th Conference of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019. [Google Scholar]
  25. Zhang, H.; Song, H.; Li, S.; Zhou, M.; Song, D. A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models. J. ACM 2023, 37, 111:1–111:37. [Google Scholar] [CrossRef]
  26. Ling, Y.; Liang, Z.; Wang, T.; Cai, F.; Chen, H. Sequential or jumping: Context-adaptive response generation for open-domain dialogue systems. Appl. Intell. 2023, 53, 11251–11266. [Google Scholar] [CrossRef]
  27. Chen, Z.; Liu, Z. Fixed global memory for controllable long text generation. Appl. Intell. 2023, 53, 13993–14007. [Google Scholar] [CrossRef]
  28. Yang, L.; Shen, Z.; Zhou, F.; Lin, H.; Li, J. TPoet: Topic-Enhanced Chinese Poetry Generation. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–15. [Google Scholar] [CrossRef]
  29. Mao, Y.; Cai, F.; Guo, Y.; Chen, H. Incorporating emotion for response generation in multi-turn dialogues. Appl. Intell. 2022, 52, 7218–7229. [Google Scholar] [CrossRef]
  30. He, X. Parallel Refinements for Lexically Constrained Text Generation with BART. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021. [Google Scholar]
  31. Xu, F.; Xu, G.; Wang, Y.; Wang, R.; Ding, Q.; Liu, P.; Zhu, Z. Diverse dialogue generation by fusing mutual persona-aware and self-transferrer. Appl. Intell. 2022, 52, 4744–4757. [Google Scholar] [CrossRef]
  32. Mo, L.; Wei, J.; Huang, Q.; Cai, Y.; Liu, Q.; Zhang, X.; Li, Q. Incorporating sentimental trend into gated mechanism based transformer network for story ending generation. Neurocomputing 2021, 453, 453–464. [Google Scholar] [CrossRef]
  33. Spangher, A.; Hua, X.; Ming, Y.; Peng, N. Sequentially Controlled Text Generation. arXiv 2023, arXiv:2301.02299. [Google Scholar]
  34. Chung, J.J.Y.; Kim, W.; Yoo, K.M.; Lee, H.; Adar, E.; Chang, M. TaleBrush: Sketching Stories with Generative Pretrained Language Models. In Proceedings of the CHI ’22: CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 29 April 2022–5 May 2022. [Google Scholar]
  35. Wang, J.; Zou, B.; Li, Z.; Qu, J.; Zhao, P.; Liu, A.; Zhao, L. Incorporating Commonsense Knowledge into Story Ending Generation via Heterogeneous Graph Networks. In Proceedings of the Database Systems for Advanced Applications—27th International Conference, Virtual, 11–14 April 2022; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13247, pp. 85–100. [Google Scholar]
  36. Chen, J.; Chen, J.; Yu, Z. Incorporating Structured Commonsense Knowledge in Story Completion. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar]
  37. Liu, R.; Zheng, G.; Gupta, S.; Gaonkar, R.; Gao, C.; Vosoughi, S.; Shokouhi, M.; Awadallah, A.H. Knowledge Infused Decoding. In Proceedings of the Tenth International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  38. Wei, K.; Sun, X.; Zhang, Z.; Zhang, J.; Zhi, G. Trigger is not sufficient: Exploiting frame-aware knowledge for implicit event argument extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Virtual, 1–6 August 2021. [Google Scholar]
  39. Wei, K.; Sun, X.; Zhang, Z.; Jin, L.; Zhang, J.; Lv, J.; Zhi, G. Implicit Event Argument Extraction With Argument-Argument Relational Knowledge. IEEE Trans. Knowl. Data Eng. 2023, 35, 8865–8879. [Google Scholar] [CrossRef]
  40. Wei, K.; Yang, Y.; Jin, L.; Sun, X.; Zhang, Z.; Zhang, J.; Zhi, G. Guide the Many-to-One Assignment: Open Information Extraction via IoU-aware Optimal Transport. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023. [Google Scholar]
  41. Guan, J.; Huang, F.; Huang, M.; Zhao, Z.; Zhu, X. A Knowledge-Enhanced Pretraining Model for Commonsense Story Generation. Trans. Assoc. Comput. Linguistics 2020, 8, 93–108. [Google Scholar] [CrossRef]
  42. Xu, P.; Patwary, M.; Shoeybi, M.; Puri, R.; Fung, P.; Anandkumar, A.; Catanzaro, B. MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online, 16–20 November 2020; pp. 2831–2845. [Google Scholar]
  43. Speer, R.; Chin, J.; Havasi, C. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4444–4451. [Google Scholar]
  44. Martin, L.J.; Ammanabrolu, P.; Wang, X.; Hancock, W.; Singh, S.; Harrison, B.; Riedl, M.O. Event Representations for Automated Story Generation with Deep Neural Nets. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32, pp. 868–875. [Google Scholar]
  45. Martin, L.J.; Sood, S.; Riedl, M.O. Dungeons and DQNs: Toward Reinforcement Learning Agents that Play Tabletop Roleplaying Games. In Proceedings of the Joint Workshop on Intelligent Narrative Technologies and Workshop on Intelligent Cinematography and Editing Co-Located with 14th AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, INT/WICED@AIIDE 2018, Edmonton, AB, Canada, 13–14 November 2018. [Google Scholar]
  46. Glymour, M.; Pearl, J.; Jewell, N.P. Causal Inference in Statistics: A Primer; John Wiley & Sons: Chichester, UK; Hoboken, NJ, USA, 2016. [Google Scholar]
  47. Feder, A.; Keith, K.A.; Manzoor, E.; Pryzant, R.; Sridhar, D.; Wood-Doughty, Z.; Eisenstein, J.; Grimmer, J.; Reichart, R.; Roberts, M.E.; et al. Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond. Trans. Assoc. Comput. Linguistics 2022, 10, 1138–1158. [Google Scholar] [CrossRef]
  48. Yao, L.; Chu, Z.; Li, S.; Li, Y.; Gao, J.; Zhang, A. A Survey on Causal Inference. ACM Trans. Knowl. Discov. Data 2021, 15, 74:1–74:46. [Google Scholar] [CrossRef]
  49. Zhang, Y.; Feng, F.; He, X.; Wei, T.; Song, C.; Ling, G.; Zhang, Y. Causal Intervention for Leveraging Popularity Bias in Recommendation. In Proceedings of the SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 11–15 July 2021; pp. 11–20. [Google Scholar]
  50. Wang, W.; Feng, F.; He, X.; Zhang, H.; Chua, T. Clicks can be Cheating: Counterfactual Recommendation for Mitigating Clickbait Issue. In Proceedings of the SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 1288–1297. [Google Scholar]
  51. Qian, C.; Feng, F.; Wen, L.; Ma, C.; Xie, P. Counterfactual Inference for Text Classification Debiasing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Virtually, 1–6 August 2021. [Google Scholar]
  52. Zhang, W.; Lin, H.; Han, X.; Sun, L. De-biasing Distantly Supervised Named Entity Recognition via Causal Intervention. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Virtually, 1–6 August 2021. [Google Scholar]
  53. Li, S.; Li, X.; Shang, L.; Dong, Z.; Sun, C.; Liu, B.; Ji, Z.; Jiang, X.; Liu, Q. How Pre-trained Language Models Capture Factual Knowledge? A Causal-Inspired Analysis. In Proceedings of the Findings of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; pp. 1720–1732. [Google Scholar]
  54. Zhu, Y.; Sheng, Q.; Cao, J.; Li, S.; Wang, D.; Zhuang, F. Generalizing to the Future: Mitigating Entity Bias in Fake News Detection. In Proceedings of the SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022. [Google Scholar]
  55. Li, B.; Su, P.; Chabbi, M.; Jiao, S.; Liu, X. DJXPerf: Identifying Memory Inefficiencies via Object-Centric Profiling for Java. In Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization, Montreal, QC, Canada, 25 February–1 March 2023. [Google Scholar]
  56. Li, B.; Xu, H.; Zhao, Q.; Su, P.; Chabbi, M.; Jiao, S.; Liu, X. OJXPERF: Featherlight Object Replica Detection for Java Programs. In Proceedings of the 44th IEEE/ACM 44th International Conference on Software Engineering, Pittsburgh, PA, USA, 25–27 May 2022. [Google Scholar]
  57. Xu, G. Finding reusable data structures. In Proceedings of the 27th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, Tucson, AZ, USA, 21–25 October 2012. [Google Scholar]
  58. Li, B.; Zhao, Q.; Jiao, S.; Liu, X. DroidPerf: Profiling Memory Objects on Android Devices. In Proceedings of the 29th Annual International Conference on Mobile Computing and Networking, ACM MobiCom 2023, Madrid, Spain, 2–6 October 2023; pp. 1–15. [Google Scholar]
  59. Liu, C.; Gan, L.; Kuang, K.; Wu, F. Investigating the Robustness of Natural Language Generation from Logical Forms via Counterfactual Samples. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022. [Google Scholar]
  60. You, D.; Niu, S.; Dong, S.; Yan, H.; Chen, Z.; Wu, D.; Shen, L.; Wu, X. Counterfactual explanation generation with minimal feature boundary. Inf. Sci. 2023, 625, 342–366. [Google Scholar] [CrossRef]
  61. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, AZ, USA, 2–4 May 2013. [Google Scholar]
  62. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014; A Meeting of SIGDAT, a Special Interest Group of the ACL. ACL: Doha, Qatar, 2014; pp. 1532–1543. [Google Scholar]
  63. Lison, P.; Tiedemann, J. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, Portorož, Slovenia, 23–28 May 2016. [Google Scholar]
  64. Speer, R.; Lowry-Duda, J. ConceptNet at SemEval-2017 Task 2: Extending Word Embeddings with Multilingual Relational Knowledge. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, BC, Canada, 3–4 August 2017; pp. 85–89. [Google Scholar]
  65. Yu, W.; Zhu, C.; Li, Z.; Hu, Z.; Wang, Q.; Ji, H.; Jiang, M. A Survey of Knowledge-enhanced Text Generation. ACM Comput. Surv. 2022, 54, 1–38. [Google Scholar] [CrossRef]
  66. Sap, M.; Bras, R.L.; Allaway, E.; Bhagavatula, C.; Lourie, N.; Rashkin, H.; Roof, B.; Smith, N.A.; Choi, Y. ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar]
  67. Fan, A.; Lewis, M.; Dauphin, Y.N. Hierarchical Neural Story Generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018. [Google Scholar]
  68. Mostafazadeh, N.; Chambers, N.; He, X.; Parikh, D.; Batra, D.; Vanderwende, L.; Kohli, P.; Allen, J.F. A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories. arXiv 2016, arXiv:1604.01696. [Google Scholar]
  69. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  70. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  71. Qin, L.; Shwartz, V.; West, P.; Bhagavatula, C.; Hwang, J.D.; Le Bras, R.; Bosselut, A.; Choi, Y. Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 794–805. [Google Scholar]
  72. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
Figure 1. An example of an original story and counterfactual story pair. Given an original story and a counterfactual condition, the task requires changing the original story ending to a counterfactual ending. The words highlighted in red in the original condition represent the elements that are intervened with by the counterfactual condition. The words highlighted in red from the original ending indicate the content requiring modification due to conflicts with the counterfactual condition. The blue words in the counterfactual story represent the modified content. The dotted lines connecting the red contents indicate a causal connection.
Figure 2. An example of a causal graph. X: Treatment plays a direct causal role in determining the value of Outcome. Y: Outcome is the effect of the causal path. Z: Confounder is a common cause of both Treatment and Outcome.
Figure 3. Causal graphs to describe the generation process of each individual event sequence. s i denotes the i-th sentence in the story, and s 3 : 5 denotes the original ending, which is the target for rewriting in this task.
Figure 4. Overview of our proposed framework, which consists of three main components: skeleton extractor, commonsense generator, and generative model. The skeleton extractor takes the entire input context as input and aims to eliminate words that have strong associations with the original condition by detecting causal invariance among the words in the original ending. The internal structure of the skeleton extractor is depicted on the left side of the diagram, and it allows for a step-by-step acquisition of words highly correlated with the original conditions through the calculation of word vector similarity. These correlated words are then removed to obtain the fundamental skeleton. The commonsense generator takes the counterfactual condition as its input and primarily utilizes COMET to provide extensive and diverse structured commonsense information for subsequent rewriting. The outputs of the skeleton extractor and the commonsense generator, along with the premise and counterfactual condition, are combined and fed into the generative model. This generative model works collectively to produce the counterfactual story ending.
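As a rough sketch of the skeleton extractor's similarity filtering, assuming a generic word-embedding table and a hypothetical threshold of 0.6 (the actual embedding choice and threshold are the subject of the analyses in Figures 5 and 6):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def extract_skeleton(ending_tokens, condition_tokens, embed, threshold=0.6):
    """Keep ending words whose similarity to every original-condition word
    stays below `threshold`; the surviving words form the skeleton.

    `embed` maps a token to its vector (e.g., pre-trained embeddings);
    the threshold of 0.6 is a placeholder, tuned on validation data.
    """
    skeleton = []
    for tok in ending_tokens:
        if tok not in embed:
            skeleton.append(tok)  # out-of-vocabulary words are kept
            continue
        sims = [cosine(embed[tok], embed[c]) for c in condition_tokens
                if c in embed]
        if not sims or max(sims) < threshold:
            skeleton.append(tok)
    return skeleton

# Toy 2-d embeddings: "picnic" points nearly the same way as the
# condition word "park", so it is filtered out of the skeleton.
embed = {"park": [1.0, 0.0], "picnic": [0.9, 0.1], "good": [0.1, 0.9]}
print(extract_skeleton(["picnic", "good", "time"], ["park"], embed))
# -> ['good', 'time']
```

The kept words are then passed, together with the premise, the counterfactual condition, and the COMET-generated knowledge, to the generative model.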
Figure 5. Analysis experiment of similarity threshold selection in skeleton extractor module.
Figure 6. Analysis experiment of word embedding selection in skeleton extractor module.
Table 1. Examples of If-Event-Then-X commonsense knowledge generated by COMET. For inference dimensions, “x” and “o” pertain to PersonX and others, respectively.

| Event | Type of Relations | Inference Example | Inference Dim. |
|---|---|---|---|
| She and her brother watched the whole film. | If-Event-Then-Mental-State | to be entertained | xIntent |
| | | happy | xReact |
| | If-Event-Then-Event | to go to the theater | xNeed |
| | | learns something new | oEffect |
| | If-Event-Then-Persona | interested | xAttr |
| Suddenly, she woke up in pain. | If-Event-Then-Mental-State | to feel better | xIntent |
| | | hurt | xReact |
| | | worried | oReact |
| | If-Event-Then-Event | to have had a nightmare | xNeed |
| | | cries | xEffect |
| | | to go to the bathroom | xWant |
| | | to make sure she is ok | oWant |
| | If-Event-Then-Persona | hurt | xAttr |
Table 2. Automatic evaluation results in the test set of TIMETRAVEL. BLEU, BERT, and ENTScore are single metrics; HMean is the overall metric. Bold numbers denote the best results.

| Method | BLEU | BERT | ENTScore | HMean |
|---|---|---|---|---|
| Human | 64.76 | 78.82 | 80.56 | 71.8 |
| GPT2 | 1.39 | 47.13 | 54.2 | 2.7 |
| GPT2+FT | 3.9 | 53.00 | 52.77 | 7.26 |
| DELOREAN | 23.89 | 59.88 | 51.4 | 32.62 |
| EDUCAT | 44.05 | 74.06 | 32.28 | 37.26 |
| CLICK (−0.2, w/o kno) | 62 | 78.2 | 29.35 | 39.84 |
| CLICK (w/o ske) | 2.5 | 51.3 | 55.1 | 4.8 |
| CLICK (final) | 46.7 | 73.2 | 36.7 | 41.1 |
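For reference, the HMean column is consistent with the harmonic mean of BLEU and ENTScore, the overall-metric convention in this line of counterfactual-rewriting work; a quick check against the table's own rows:

```python
def hmean(bleu, ent_score):
    """Harmonic mean of BLEU and ENTScore, the overall metric in Tables 2-4."""
    return 2 * bleu * ent_score / (bleu + ent_score)

# Reproducing two rows of Table 2 from their single-metric values:
print(round(hmean(44.05, 32.28), 2))  # EDUCAT row: 37.26
print(round(hmean(46.70, 36.70), 2))  # CLICK (final) row: 41.1
```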
Table 3. Experimental results of ablation study on CLICK. Bold numbers denote the best results. The numerical values following the downward arrows indicate the extent to which the model’s performance declines after the removal of that module.

| Ablation | BLEU | BERT | ENTScore | HMean |
|---|---|---|---|---|
| Full CLICK | 46.7 | 73.2 | 36.7 | 41.1 |
| w/o skeleton | 2.5 (↓44.2) | 51.3 | 55.1 | 4.8 |
| w/o knowledge | 46.1 | 72.8 | 35.46 (↓1.24) | 40.07 |
| w/o ske, w/o kno | 1.39 | 47.13 | 54.2 | 2.7 (↓38.4) |
Table 4. Effect of various types of commonsense knowledge. Bold numbers denote the best results.

| Knowledge Type | Relation(s) | BLEU | BERT | ENTScore | HMean |
|---|---|---|---|---|---|
| w/o knowledge | — | 46.1 | 72.8 | 35.46 | 40.07 |
| If-Event-Then-Mental-State | xIntent | 46.49 | 73.20 | 35.72 | 40.40 |
| | xReact | 45.70 | 73.00 | 35.10 | 39.70 |
| | xIntent+xReact | 45.93 | 73.00 | 37.10 | 41.00 |
| If-Event-Then-Event | xEffect | 47.21 | 73.53 | 34.37 | 39.81 |
| | xWant | 44.10 | 72.20 | 35.84 | 39.51 |
| | xNeed | 45.62 | 72.96 | 36.72 | 40.69 |
| | xEffect+xWant+xNeed | 46.00 | 73.17 | 36.93 | 40.96 |
| If-Event-Then-Others | oReact | 46.70 | 73.00 | 34.90 | 40.00 |
| | oEffect | 46.70 | 73.40 | 35.90 | 40.60 |
| | oWant | 47.00 | 73.30 | 35.50 | 40.30 |
| | oReact+oEffect+oWant | 46.10 | 73.20 | 34.73 | 39.60 |
| If-Event-Then-Persona | xAttr | 46.70 | 73.20 | 36.70 | 41.10 |
| Nine kinds of knowledge | — | 46.67 | 72.90 | 36.15 | 40.74 |
Table 5. Two cases of counterfactual endings generated by CLICK and baselines. Red words represent logical incoherence and low counterfactual consistency. Green words represent the preservation of the original ending. The gray boxes represent problems with the generated ending.

Case 1
- Premise: The day was sunny and warm, a perfect day for a picnic.
- Orig Condition: Mom, James, and Renee went to the park.
- CF Condition: Rain started to fall.
- Orig Ending: First they went for a walk. Then they had a picnic by the river. They all had a good time.
- CF Ending: They found a covered seating area. Then they had a picnic there. They all had a good time.

| Method | Generated Counterfactual Ending | Problem |
|---|---|---|
| EDUCAT | So I went for a walk. Yes, I had a picnic by the river. I had a great time. | Logical Incoherence |
| Sketch&Customize | Rain was then followed by a thunderstorm. All the picnic food was soaked. Then it was a cold day. | Overediting |
| CLICK | First they found a rainbow. Then they had a great lunch by the window. They all had a good time. | — |

Case 2
- Premise: Tom was making some pasta.
- Orig Condition: He boiled some water.
- CF Condition: He took all of the ingredients out of the pantry.
- Orig Ending: He left the kitchen to answer an important phone call. When he came back there was water all over the ground. He turned off the stove and cleaned up the kitchen.
- CF Ending: He left the kitchen to answer an important phone call. When he came back the dog had knocked everything over. He picked up the food and cleaned up the kitchen.

| Method | Generated Counterfactual Ending | Problem |
|---|---|---|
| EDUCAT | He left the kitchen to take an urgent phone call. When he got home, there was water all over the ground. He turned off the water and left the kitchen. | Low Counterfactual Consistency |
| Sketch&Customize | He sat down to answer an important phone call. When he came back he was the ground. Tom turned off the oven to start it up. | Logical Incoherence |
| CLICK | He left the kitchen to answer an important phone call. When he came back he found there was mold all over the pasta. He cleaned off the mess and cleaned up the kitchen. | — |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, D.; Guo, Z.; Liu, Q.; Jin, L.; Zhang, Z.; Wei, K.; Li, F. CLICK: Integrating Causal Inference and Commonsense Knowledge Incorporation for Counterfactual Story Generation. Electronics 2023, 12, 4173. https://doi.org/10.3390/electronics12194173
