1. Introduction
The increasing integration of machine learning (ML) into high-stakes domains such as healthcare, finance, and autonomous systems has amplified the demand for models that are not only accurate but also transparent, reliable, and trustworthy [1,2]. A significant limitation of many contemporary ML models is their reliance on statistical correlations, which can be brittle and misleading. These models often fail to distinguish between mere association and true causation, making them vulnerable to spurious conclusions and poor generalization when deployed in new environments [3]. Therefore, moving beyond correlation-based learning toward causality-informed decision-making is a cornerstone for the next generation of Explainable Artificial Intelligence (XAI) [4].
The central research problem lies in effectively inferring causal relationships from observational data. While randomized controlled trials (RCTs) are the gold standard for establishing causality, they are often impractical, unethical, or prohibitively expensive to conduct in many real-world scenarios, such as studying the link between lifestyle factors and chronic diseases [5]. Consequently, there is a critical need for robust computational methods that can uncover causal structures directly from passively collected observational datasets.
Existing methods for causal discovery from observational data offer various approaches, but each comes with its own set of limitations. Classical techniques like Granger Causality (GC) and Transfer Entropy (TE) are primarily designed for time-series data and rely on Norbert Wiener’s principle of predictability, where a cause must improve the prediction of its effect [6,7,8]. However, these methods can be constrained by assumptions of linearity (in the case of GC) or require substantial data to accurately estimate probability distributions (for TE) [9]. A more recent and powerful paradigm for causal inference is rooted in algorithmic information theory, which posits that the simplest explanation for a phenomenon is likely the correct one. This principle, formalized by Kolmogorov complexity, suggests that if X causes Y, the algorithmic description of Y given X should be simpler than the description of X given Y. Methods like ORIGO, based on the Minimum Description Length (MDL) principle, and ERGO have operationalized this idea using lossless compression and complexity estimates as practical proxies for the uncomputable Kolmogorov complexity [10,11].
Lempel–Ziv (LZ) complexity, a computationally feasible upper bound on Kolmogorov complexity, has emerged as a valuable tool in this domain due to its parameter-free nature and its proven utility in diverse fields from bioinformatics to anomaly detection [12,13]. However, recent attempts to formulate an LZ-based causality measure, such as the one proposed by Pranay and Nagaraj (2021), suffer from two fundamental drawbacks [14]. Firstly, their formulation implicitly assumes a complete temporal precedence of one variable over another, making it unsuitable for systems where variables evolve and interact concurrently. Secondly, their metric can yield a non-zero “self-causation” penalty (i.e., the penalty for a variable causing itself is greater than zero), which is counter-intuitive and problematic for applications like feature selection. These unresolved issues highlight the need for a more robust and theoretically sound causality measure.
Our research is motivated by the goal of bridging this gap. We aim to develop a novel causality measure based on LZ complexity, which not only addresses the limitations of prior work but can also be seamlessly integrated into standard machine learning frameworks to build inherently interpretable, causally informed models. The primary objective is to create a decision tree classifier that builds its structure not on correlation-based metrics like Gini impurity, but on the causal influence of features on the target variable.
The primary novelty of this paper lies in the formulation of the Lempel–Ziv Penalty (LZP) causality measure, which uniquely employs a real-time, parallel parsing of sequences. This approach overcomes the key limitations of previous compression-based methods by naturally handling concurrent data and guaranteeing a zero self-causation penalty. Furthermore, the direct application of this measure as a splitting criterion in a decision tree is a novel approach to imbue a classical ML model with causal reasoning capabilities.
The main contributions of this work are explicitly listed below:
1. We introduce the Lempel–Ziv Penalty (LZP), a novel causality measure based on LZ complexity that is robust for both temporal and non-temporal observational data.
2. We propose a new distance metric derived from the symmetric difference of LZ-generated grammars.
3. We present the design and implementation of two new classifiers: an LZ-Causality-based decision tree that selects features based on their causal influence and an LZ-Distance-based decision tree.
4. We introduce a causal feature importance score derived from our causal decision tree, offering a more interpretable alternative to correlation-based importance metrics.
5. Through extensive experiments, we demonstrate that while our models perform comparably on standard benchmarks, the causal decision tree significantly outperforms traditional methods on datasets with known underlying causal structures, validating its ability to leverage causal information for improved accuracy.
2. Related Works
Compression-based causal inference methods leverage algorithmic information theory to infer causal directions by measuring how well one variable compresses another.
ORIGO applies the Minimum Description Length (MDL) principle to binary and categorical data, using decision trees as compressors to encode one variable given another [10]. The causal direction is inferred by comparing the total description lengths: if encoding Y given X is shorter than the inverse, X is considered the cause of Y. While conceptually powerful, its performance can be sensitive to the parameterization of the chosen compressor, and its scope is primarily limited to discrete data.
To address the challenge of more complex, real-world datasets, ERGO extends this compression-based framework to handle multivariate and mixed-type (continuous and categorical) data [11]. Instead of evaluating variable pairs in isolation, ERGO seeks the causal ordering of an entire set of variables that yields the most compact overall description of the data. The framework was, however, primarily designed for real-valued data, limiting its application to datasets dominated by categorical or temporal components.
Compression Complexity Causality (CCC) offers a robust method specifically for time-series analysis [15]. It estimates causality by measuring changes in the dynamical compression complexity of time series and is notably resilient to common issues like irregular sampling and noise. However, its primary weakness lies in its strict adherence to the Wiener-Granger causality framework, which assumes that causes must temporally precede their effects. This makes it highly effective for detecting lagged dependencies but potentially less sensitive to contemporaneous causal effects where an event and its cause occur in the same time step. The versatility of compression-based metrics is further demonstrated by recent efforts to integrate them with established causal frameworks, such as using compressibility to measure dependence specifically within additive noise models [16].
The work most closely related to our own is that in [14], which also leverages Lempel–Ziv complexity for causal discovery. However, as detailed in Section 3.1.3, their formulation has critical limitations, including an implicit assumption of temporal precedence and a non-zero self-causation penalty. These unresolved issues directly motivate our research and the development of a novel measure designed to overcome these specific drawbacks, thereby enabling a more robust and intuitive integration into machine learning models such as decision trees.
3. Materials and Methods
This section describes the novel causality and distance measures derived from Lempel–Ziv complexity, their integration into decision trees, and the experimental setups used for validation.
3.1. Proposed Causality and Distance Measures
3.1.1. Foundational Idea: Lempel–Ziv Complexity for Causal Inference
The foundational idea behind using Lempel–Ziv complexity for causal inference lies in the connection between predictability and compressibility [17]. Consider two variables, X and Y. If X causes Y, then the patterns in X should make Y more predictable (and hence more compressible) than in the reverse scenario.
Formally, let $K(Y \mid X)$ denote the length of the shortest program (or description) that maps X to Y; it reflects the Kolmogorov complexity of describing Y given X. Similarly, $K(X \mid Y)$ measures the complexity of describing X given Y. If $K(Y \mid X) < K(X \mid Y)$, we infer that X causes Y, because Y can be more concisely described using X than vice versa.
However, since the Kolmogorov complexity of an arbitrary string X is undecidable [18], we approximate it using practical measures such as Lempel–Ziv complexity [19], which serves as a surrogate by quantifying how well a sequence can be compressed. The direction that leads to a lower Lempel–Ziv complexity is considered the likely causal direction.
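As a toy illustration of this compressibility principle (and not the LZP measure developed below), an off-the-shelf compressor can serve as a crude stand-in for Kolmogorov complexity. The sketch below, with function names and data of our own choosing, approximates the conditional description length by the extra bytes a compressor needs for the target once the candidate cause is already available.

```python
import zlib
import random

def clen(s: bytes) -> int:
    """Compressed length in bytes -- a crude stand-in for Kolmogorov complexity."""
    return len(zlib.compress(s, 9))

def cond_clen(target: bytes, given: bytes) -> int:
    """Proxy for K(target | given): extra bytes needed once 'given' is known."""
    return clen(given + target) - clen(given)

random.seed(0)
x = bytes(random.randrange(256) for _ in range(4000))   # candidate cause: high entropy
y = bytes(b & 0xF0 for b in x)                          # candidate effect: lossy function of x

k_y_given_x = cond_clen(y, x)
k_x_given_y = cond_clen(x, y)
# Decision rule of compression-based causal inference:
# the direction with the smaller conditional description length is preferred.
print("K(Y|X) ~", k_y_given_x, " K(X|Y) ~", k_x_given_y)
```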
3.1.2. Lempel–Ziv 1976 Complexity
Lempel–Ziv Complexity (LZC) is a widely adopted measure of complexity that is rooted in dictionary encoding. LZC compresses data by identifying recurring substrings and replacing repeated instances with references, thus efficiently capturing the structural information of the sequence [12]. This mechanism sees widespread use in areas such as file compression schemes, signal analysis, and the study of system dynamics [20,21,22].
The advantages of LZC include its non-parametric nature and versatility across data types without assuming specific probabilistic models, making it useful in fields such as neuroscience, genomics, and causal inference. However, recent work highlights certain limitations, such as sensitivity to signal encoding, low efficiency due to sequential processing, and challenges when extending to multivariate or continuous data, prompting ongoing research to improve its robustness and application scope [23].
Let $X = x_1 x_2 \dots x_n$ be a binary string. The Lempel–Ziv complexity (LZ76) [12], denoted by $C_{LZ}(X)$, is defined as the number of distinct phrases obtained by parsing X from left to right such that each new phrase is the shortest substring that has not appeared previously as a phrase.
We decompose the string as $X = w_1 w_2 \dots w_c$, where each $w_k$ is the shortest substring such that $w_k \notin \{w_1, \dots, w_{k-1}\}$, and $C_{LZ}(X) = c$.
Example 1 (Parsing of X = 110011). Parse the string from left to right: the first phrase is ‘1’; starting from the second symbol, ‘1’ has already appeared, so the phrase is extended to ‘10’; the next phrase is ‘0’; finally, ‘1’ has already appeared, so the last phrase is extended to ‘11’. Thus, the parsing is $X = 1 \mid 10 \mid 0 \mid 11$ and hence $C_{LZ}(X) = 4$. The set of distinct phrases (also known as the LZ grammar of X) is $G_X = \{1, 10, 0, 11\}$.
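A minimal Python sketch of this parsing, assuming a simple set-based grammar representation (the function name is ours), reproduces Example 1:

```python
def lz76_grammar(s: str):
    """Parse s left to right; each new phrase is the shortest substring
    not already present in the grammar (LZ76-style parsing)."""
    grammar, phrases = set(), []
    i = 0
    while i < len(s):
        j = i + 1
        # Extend the candidate phrase until it is new (or the string ends).
        while s[i:j] in grammar and j < len(s):
            j += 1
        grammar.add(s[i:j])
        phrases.append(s[i:j])
        i = j
    return phrases, grammar

phrases, grammar = lz76_grammar("110011")
print(phrases)          # ['1', '10', '0', '11']  -> LZ complexity 4
print(sorted(grammar))  # grammar of X: {'0', '1', '10', '11'}
```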
3.1.3. Research Gap and Comparison to Prior Work
The work in [14] utilizes Lempel–Ziv complexity for causal inference. However, our proposed formulation and underlying assumptions differ fundamentally, leading to distinct advantages, particularly in scenarios involving the simultaneous evolution of variables.
The causal metric in [14] inherently assumes a full temporal precedence of one variable over the other, which is a critical limitation for systems where variables evolve and interact concurrently. Applying a method predicated on sequential occurrence to such data can yield spurious causal inferences. Our method directly tackles this by employing a real-time grammar construction that processes both sequences simultaneously. By incrementally building grammars and tracking the overlap, we avoid the problematic assumption of strict temporal separation.
A second, more fundamental issue with their formulation is that the penalty for a variable “causing” itself is not guaranteed to be zero; there exist sequences for which the self-causation penalty is strictly positive. This non-zero baseline means it is possible for a completely different sequence Z to have a lower causal penalty with respect to X than X has with itself. This is particularly problematic for applications like feature selection in a decision tree, where a perfectly predictive feature (i.e., one that is identical to the target) could be assigned a higher penalty and thus be deemed less important than a noisy or irrelevant feature. In contrast, our method guarantees that $LZP(X \to X) = 0$ for any sequence X, ensuring a stable and logical baseline by quantifying only the additional complexity required to explain Y using the real-time grammar of X.
3.1.4. Model Assumptions and Properties
The proposed causality measure has the following assumptions:
1. Assumption 1: A cause must precede or occur simultaneously with its effect. Retrocausal scenarios are not considered.
2. Assumption 2: There is no confounding variable that causes both X and Y, where X and Y are temporal or non-temporal data. We modify the measure to deal with possible confounders separately in Section 3.4.
Consider two univariate datasets X and Y, which may or may not have temporal structure. We want to develop a measure that can determine the direction of causality, i.e., $X \to Y$ or $Y \to X$. We want the measure to have the following properties:
If $X \to Y$, then the grammar of X constructed in real time contains patterns that better explain Y. The extra penalty incurred by explaining Y using the generated grammar of X, denoted $LZP(X \to Y)$, is less than the penalty incurred by explaining X using the generated grammar of Y, denoted $LZP(Y \to X)$. For this case, $LZP(X \to Y) < LZP(Y \to X)$. The inequality is reversed if $Y \to X$.
Explaining X using the real-time generated grammar of X, denoted $LZP(X \to X)$, should give zero penalty. This follows from Assumption 1.
3.1.5. Definition of the LZ Penalty (LZP) Measure
Given two symbolic sequences $x = x_1 x_2 \dots x_m$ and $y = y_1 y_2 \dots y_n$, where each symbol of x belongs to an alphabet $\mathcal{A}_x$ and each symbol of y belongs to an alphabet $\mathcal{A}_y$, we define $LZP(x \to y)$ as the penalty incurred by explaining y using the real-time grammar of x. The algorithm for calculating the penalty is described in Algorithm 1.
The real-time grammar of a sequence is constructed incrementally as we process the sequence from left to right. At each step, we identify the shortest substring starting from the current position that has not been encountered previously in the sequence. This new substring is then added to the grammar set of that sequence. For a set G, $|G|$ denotes its cardinality, i.e., the number of elements in the set. Also note that, for a symbolic sequence x, $x[i{:}j]$ denotes the inclusive range of symbols from index i to index j.
Example 2 (Calculation of $LZP(x \to y)$). Consider the two strings $x = 101110$ and $y = 11011$. Then $LZP(x \to y)$ is calculated according to the following steps:
Step 0: We consider the two strings x and y and construct their respective grammar sets $G_x$ and $G_y$ using a process analogous to the Lempel–Ziv algorithm. We monitor the overlap at each stage, which can be thought of as representing the extent to which the current and previous values of x influence y.
Step 1: Since we are inferring the strength of causation from string x to string y, we start with the selection of the smallest substring of x, starting from the first index, that is not already present in its grammar set. Since the grammar set is empty, the substring ‘1’ is selected and added to $G_x$.
Step 2: Similarly, for string y, the smallest substring starting from the first index that is not in $G_y$ is ‘1’. Since this substring ‘1’ is already present in $G_x$, the overlap is incremented. We can interpret this as the ‘1’ present in x inducing its occurrence in y. The substring ‘1’ is then added to $G_y$.
Step 3: The process is now repeated for x. The starting index of the subsequent substring must immediately follow the terminal index of the preceding substring to ensure no information is lost. Hence, starting from the second position, the smallest substring not in $G_x$ is ‘0’, which is selected and added to the set $G_x$.
Step 4: As in Step 3, we consider y. Starting from the second position, the smallest substring not in $G_y$ is ‘1’. Since it is already present in $G_y$, we extend the substring to ‘10’. The substring ‘10’ is not present in $G_y$ and is hence added to it. Since ‘10’ is also not present in $G_x$, the overlap is not incremented.
Step 5: For x, starting from the third position, the smallest substring is ‘1’, which is already present in $G_x$. We extend it to ‘11’, which is not in $G_x$. Thus, ‘11’ is selected and added to $G_x$.
Step 6: For y, starting from the fourth position, the smallest substring is ‘1’, which is in $G_y$. We extend it to ‘11’, which is not in $G_y$. Thus, ‘11’ is selected and added to $G_y$. Since ‘11’ is already present in $G_x$, the overlap is incremented.
Step 7: For x, starting from the fifth position, the smallest substring not in $G_x$ is ‘10’, which is added to $G_x$.
Since there are no more substrings left in either string that can be added to the respective grammar sets, the process stops here, and $LZP(x \to y)$ is found to be $|G_y| - \text{overlap}$, which is $3 - 2 = 1$. In the same manner (although not explicitly shown here), $LZP(y \to x)$ is also found to be 1. If $LZP(x \to y) < LZP(y \to x)$, we can say that x causes y. Similarly, if $LZP(y \to x) < LZP(x \to y)$, we conclude that y causes x. In this case, since the penalties are equal ($LZP(x \to y) = LZP(y \to x) = 1$), x and y may be independent or the influence may be bi-directional. In our experiments, we assumed that y and x are independent if $LZP(x \to y) = LZP(y \to x)$.
Algorithm 1 details the process of calculating the causal penalty, $LZP(x \to y)$. It takes two symbolic sequences, x and y, as input. The core of the algorithm involves building the Lempel–Ziv grammars for both sequences, $G_x$ and $G_y$, in a parallel, step-by-step manner. At each iteration, it finds the next unique phrase in x and adds it to $G_x$. It then finds the next unique phrase in y. A crucial step is checking whether this new phrase from y already exists in the grammar of x ($G_x$). If it does, an ‘overlap’ counter is incremented. This ‘overlap’ quantifies how many of the structural patterns in y had already been discovered while parsing x. The final penalty is the total number of phrases in y’s grammar minus this overlap, representing the “new” information in y that could not be explained by the information in x.
| Algorithm 1 Calculation of Penalty of a string y of length n given another string x of length m |
1: Input: String y of length n, String x of length m
2: Output: Penalty of using x to compress y, $LZP(x \to y)$
3: Initialize set $G_x \leftarrow \emptyset$ {Empty grammar set}
4: Initialize set $G_y \leftarrow \emptyset$ {Empty grammar set}
5: Initialize $i \leftarrow 1$ {current start index in x}
6: Initialize $k \leftarrow 1$ {current start index in y}
7: Initialize $overlap \leftarrow 0$
8: while $k \le n$ do
9:  if $i \le m$ then
10:   Set $j \leftarrow i$
11:   while $x[i{:}j] \in G_x$ and $j < m$ do
12:    Increase $j$ by 1
13:   end while
14:   Add substring $x[i{:}j]$ to $G_x$
15:  end if
16:  Set $l \leftarrow k$
17:  while $y[k{:}l] \in G_y$ and $l < n$ do
18:   Increase $l$ by 1
19:  end while
20:  if $y[k{:}l] \in G_x$ then
21:   Increase $overlap$ by 1
22:  end if
23:  Add $y[k{:}l]$ to $G_y$
24:  Set $i \leftarrow j + 1$
25:  Set $k \leftarrow l + 1$
26: end while
27: return $|G_y| - overlap$
|
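Below is a hedged Python sketch of Algorithm 1 as we read it from the description above and Example 2; the function and variable names and the zero-based index handling are our own. It reproduces the penalties of Example 2 and the zero self-causation penalty.

```python
def lzp(x: str, y: str) -> int:
    """Penalty of explaining y with the real-time grammar of x (Algorithm 1 sketch).

    Both sequences are parsed in parallel, LZ76-style. Each new phrase of y
    that is already present in x's grammar increments 'overlap'.
    Penalty = |G_y| - overlap; by construction lzp(x, x) == 0.
    """
    g_x, g_y = set(), set()
    i = k = 0            # current start positions in x and y
    overlap = 0
    while k < len(y):
        if i < len(x):
            # Next shortest phrase of x not yet in its grammar.
            j = i + 1
            while x[i:j] in g_x and j < len(x):
                j += 1
            g_x.add(x[i:j])
            i = j
        # Next shortest phrase of y not yet in its grammar.
        l = k + 1
        while y[k:l] in g_y and l < len(y):
            l += 1
        if y[k:l] in g_x:        # pattern of y already explained by x
            overlap += 1
        g_y.add(y[k:l])
        k = l
    return len(g_y) - overlap

# Example 2 from the text: x = "101110", y = "11011"
print(lzp("101110", "11011"), lzp("11011", "101110"))   # both 1 -> treated as independent
print(lzp("101110", "101110"))                          # self-penalty is 0
```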
3.1.6. A Distance Metric Based on Lempel–Ziv Complexity
In this section, we define a distance metric derived from the grammars of two symbolic sequences. The grammar construction is based on the Lempel–Ziv algorithm [24]. Given two symbolic sequences, x and y, their grammars $G_x$ and $G_y$ can be encoded using the Lempel–Ziv algorithm, as shown in Algorithm 2. The Lempel–Ziv distance between the two sets $G_x$ and $G_y$ is based on their symmetric difference,
$$d_{LZ}(x, y) = |G_x \,\triangle\, G_y| = |G_x \cup G_y| - |G_x \cap G_y|,$$
where $|G|$ represents the cardinality of a set G. The proof that the given measure is a distance metric is given in Appendix A.
| Algorithm 2 Calculation of Grammar of a String Using Lempel–Ziv Complexity |
1: Input: String x of length n
2: Output: Grammar of the string, $G_x$
3: Initialize set $G_x \leftarrow \emptyset$ {Empty grammar set}
4: Initialize $i \leftarrow 1$
5: while $i \le n$ do
6:  Set $j \leftarrow i$
7:  while $x[i{:}j] \in G_x$ and $j < n$ do
8:   Increase $j$ by 1
9:  end while
10:  Add substring $x[i{:}j]$ to $G_x$
11:  Set $i \leftarrow j + 1$
12: end while
13: return $G_x$
|
Algorithm 2 describes the standard Lempel–Ziv 1976 (LZ76) parsing procedure to generate the grammar of a single string x. It iterates through the string from left to right, identifying the shortest substring that has not appeared as a phrase before. Each new, unique phrase is added to the grammar set, $G_x$. This process continues until the entire string is parsed. The resulting set $G_x$ is used as the basis for the Lempel–Ziv distance metric defined in Equation (4).
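A compact Python sketch combining Algorithm 2 with the symmetric-difference distance is given below. It assumes the unnormalized form reconstructed above, so the exact normalization used in Equation (4) may differ; function names are ours.

```python
def lz_grammar(x: str) -> set:
    """Algorithm 2 sketch: LZ76 grammar of a string."""
    g, i = set(), 0
    while i < len(x):
        j = i + 1
        while x[i:j] in g and j < len(x):
            j += 1
        g.add(x[i:j])
        i = j
    return g

def lz_distance(x: str, y: str) -> int:
    """Distance from the symmetric difference of the two grammars
    (a sketch of Equation (4); any normalization is omitted here)."""
    gx, gy = lz_grammar(x), lz_grammar(y)
    return len(gx ^ gy)          # |G_x Δ G_y| = |G_x ∪ G_y| - |G_x ∩ G_y|

print(lz_distance("110011", "110011"))   # identical strings -> distance 0
print(lz_distance("110011", "000000"))   # dissimilar structure -> larger distance
```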
3.2. Applicability in Decision Trees
The decision tree was chosen as the primary framework for this research due to its interpretability. Its hierarchical structure serves as the most direct and transparent vehicle for demonstrating and validating the behavior of a novel splitting criterion.
We utilize the proposed causality and distance measures as splitting criteria in decision trees, yielding (a) an LZ Causal Measure-based decision tree; (b) an LZ distance metric-based decision tree.
3.2.1. Utilization as a Splitting Criterion
To split the tree at a given node, a feature and threshold are selected by minimizing the chosen LZ-based measure. Each feature and the target are transformed into symbolic sequences: elements in the feature column are represented as 0 if below the threshold and 1 if above. In the target symbolic sequence t, a given label l is represented as 1 and all other labels as 0.
For the causal decision tree, the feature and threshold selection are performed as per the following equation:
$$(f^*, \theta^*) = \operatorname*{arg\,min}_{f,\ \theta} \; LZP(f_\theta \to t),$$
where $f^*$ is the best feature, $\theta^*$ is the best threshold, f is a candidate feature, $f_\theta$ is its symbolic sequence under threshold $\theta$, t is the target variable converted to a symbolic sequence based on label l, and $LZP(\cdot)$ is the causal penalty.
For the distance-based decision tree, the selection is based on the following:
$$(f^*, \theta^*) = \operatorname*{arg\,min}_{f,\ \theta} \; d_{LZ}(f_\theta, t),$$
where $d_{LZ}(f_\theta, t)$ is the Lempel–Ziv-based distance between the grammar of the feature symbolic sequence and that of the target symbolic sequence t.
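A hedged sketch of this selection step is shown below; the midpoint threshold grid and the helper names are our own choices, and the `penalty` argument is expected to be a function such as the `lzp` sketch given after Algorithm 1.

```python
import numpy as np

def symbolize(values: np.ndarray, threshold: float) -> str:
    """Feature column -> symbolic sequence: 0 below the threshold, 1 otherwise."""
    return "".join("1" if v >= threshold else "0" for v in values)

def target_sequence(labels: np.ndarray, label) -> str:
    """Target -> symbolic sequence: 1 for the given label, 0 for all others."""
    return "".join("1" if yv == label else "0" for yv in labels)

def best_split(X: np.ndarray, y: np.ndarray, label, penalty):
    """Pick the (feature, threshold) minimizing penalty(feature -> target).

    'penalty' is a callable such as lzp(x, y); candidate thresholds are taken
    as midpoints of sorted unique values (an implementation choice of ours).
    """
    t = target_sequence(y, label)
    best = (None, None, float("inf"))
    for f in range(X.shape[1]):
        uniq = np.unique(X[:, f])
        for theta in (uniq[:-1] + uniq[1:]) / 2.0:
            p = penalty(symbolize(X[:, f], theta), t)
            if p < best[2]:
                best = (f, theta, p)
    return best   # (best feature index, best threshold, best penalty)
```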
3.2.2. Causal Strength of a Feature on a Target
We propose a method of ranking the strength of causation of different attributes on a particular label based on the causal decision tree. Consider a decision tree with m nodes, and let $d_i$ denote the depth of the ith node. The causal strength $CS_j$ of a feature j on the target is obtained by aggregating depth-weighted contributions over all nodes that split on feature j, so that splits closer to the root contribute more.
These scores can then be normalized to provide a relative ranking of causal influence.
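Purely as an illustration, the sketch below assumes an inverse-exponential depth weight ($2^{-d_i}$, root at depth 0) as one plausible instantiation of the depth-weighted aggregation described above; the paper's exact weighting is given by its own formulation.

```python
from collections import defaultdict

def causal_strength(nodes):
    """Illustrative causal-strength scores from a fitted causal decision tree.

    'nodes' is a list of (feature_name, depth) pairs for every internal node.
    A 2**(-depth) weight (root depth 0) is assumed for this sketch only.
    Scores are normalized to sum to 1 for relative ranking.
    """
    raw = defaultdict(float)
    for feature, depth in nodes:
        raw[feature] += 2.0 ** (-depth)
    total = sum(raw.values()) or 1.0
    return {f: s / total for f, s in raw.items()}

# Hypothetical tree: root splits on 'chol', depth-1 nodes on 'age' and 'sex'.
print(causal_strength([("chol", 0), ("age", 1), ("sex", 1), ("chol", 2)]))
```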
3.3. Experimental Setup
3.3.1. Simple Causation in AR Processes of Various Orders
Data was generated from coupled autoregressive (AR) processes of varying order. The governing equations for the two time series specify one series as the driver and the other as the coupled (driven) series. The AR coefficients and the noise intensity were held fixed, the noise terms were sampled from the standard normal distribution, and the coupling coefficient was varied from 0 to 1 in equal steps. The process was simulated for a fixed number of time steps, with the first 500 transients discarded. This procedure was repeated over multiple trials for each value of the coupling coefficient.
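For concreteness, a sketch of a first-order coupled AR simulation follows; the coefficients, noise level, seeding, and the median-based binarization are illustrative assumptions of ours rather than the paper's exact settings.

```python
import numpy as np

def coupled_ar(n=2500, burn_in=500, a=0.9, b=0.8, coupling=0.4,
               noise=0.03, seed=0):
    """Simulate a first-order coupled AR pair in which X drives Y.

    X_t = a * X_{t-1} + noise * eps_t
    Y_t = b * Y_{t-1} + coupling * X_{t-1} + noise * eta_t
    Coefficients are illustrative; the paper's exact settings may differ.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    y = np.zeros(n)
    for t in range(1, n):
        x[t] = a * x[t - 1] + noise * rng.standard_normal()
        y[t] = b * y[t - 1] + coupling * x[t - 1] + noise * rng.standard_normal()
    return x[burn_in:], y[burn_in:]   # discard the initial transients

# Binarize around the median before computing penalties in either direction.
x, y = coupled_ar()
sx = "".join("1" if v > np.median(x) else "0" for v in x)
sy = "".join("1" if v > np.median(y) else "0" for v in y)
# lzp(sx, sy) vs lzp(sy, sx) (function from the Algorithm 1 sketch) would then
# be compared; the smaller penalty indicates the inferred causal direction.
```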
3.3.2. Coupled Logistic Map
The master-slave system of 1D logistic maps is governed by a pair of update equations in which the master map drives the slave map through a coupling term. The coupling coefficient is varied over a range of values starting from 0, with the map parameters and initial conditions held fixed. For each coupling value, 1000 data instances of length 2000 were generated, after removing the initial 500 transients.
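A sketch of one standard unidirectional master-slave coupling is given below; the specific update form, parameter values, and seeding are assumptions made for illustration and may differ from the paper's exact equations.

```python
import numpy as np

def coupled_logistic(n=2000, burn_in=500, a=4.0, b=4.0, eps=0.3, seed=0):
    """Master-slave logistic maps: x drives y through coupling strength eps.

    x_{t+1} = a * x_t * (1 - x_t)
    y_{t+1} = (1 - eps) * b * y_t * (1 - y_t) + eps * x_t
    This is one common unidirectional coupling form, assumed here for
    illustration only.
    """
    rng = np.random.default_rng(seed)
    x, y = rng.uniform(0.1, 0.9, size=2)
    xs, ys = [], []
    for _ in range(n + burn_in):
        x, y = a * x * (1 - x), (1 - eps) * b * y * (1 - y) + eps * x
        xs.append(x)
        ys.append(y)
    return np.array(xs[burn_in:]), np.array(ys[burn_in:])
```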
3.3.3. Deciphering Causation in the Three Variable Case
We apply a modified LZP algorithm to three-variable causal structures. To find causation from X to Y conditioned on Z, denoted $LZP(X \to Y \mid Z)$, we first build the grammar of the conditioning variable Z. Then, we proceed with the normal LZP algorithm for X and Y, with the caveat that a new phrase from X is only added to its grammar if it is not already present in the grammar of Z. The system is modeled by a set of coupled equations whose parameters determine which causal links are active. By varying these parameters, we simulate different causal structures. Strings of length 2500 were used (first 1000 transients removed), and results were averaged over 200 trials.
3.3.4. Datasets for Classification
We used several datasets from the UCI Machine Learning Repository [25] and scikit-learn [26] for evaluating the decision tree classifiers: Iris [27], Breast Cancer Wisconsin [28], Congressional Voting Records [29], Car Evaluation [30], KRKPA7 [31], Mushroom [32], Heart Disease [33], and Thyroid [34]. Some of these datasets, including Breast Cancer Wisconsin, Congressional Voting Records, Car Evaluation, and KRKPA7, were previously used to evaluate the effectiveness of the Causal Decision Tree proposed in [35]. Motivated by this, we employ these and other related datasets in our study.
Additionally, a synthetic AR dataset was generated with a known causal structure to specifically test the causal decision tree. The selection of the other public datasets was intended to provide a diverse set of benchmark challenges that vary in sample size, number of features, number of classes, data types (numeric vs. categorical), and class balance. This allows for a robust evaluation of the general-purpose classification performance of the proposed models against established methods. Details on the datasets, including train–test splits, are provided in Table 1.
3.4. Deciphering Causation in the Three Variable Case
We apply the proposed LZP algorithm to different three-variable causal structures. We find causation from A to B with C as the conditioning variable, from B to C with A as the conditioning variable, and from C to A with B as the conditioning variable. We also find causation in the opposite direction in each case with the same conditioning variable and compute the difference in penalties. The sign indicates the direction, and the magnitude indicates the strength of causation. We decide whether the detected causation is genuine based on its relative strength.
3.4.1. Algorithm for $LZP(X \to Y \mid Z)$
We use the LZP algorithm with a minor modification. Suppose we are finding causation from X to Y by conditioning on Z. Let $G_x$, $G_y$, and $G_z$ be their respective grammar sets. Before we find the shortest possible substring in X not in $G_x$, we first find the shortest substring in Z not present in $G_z$ and add it to the grammar $G_z$. We then follow the same algorithm as before, with the caveat that a substring from X is added to $G_x$ only if it is not present in $G_z$; otherwise, it is discarded.
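A hedged Python sketch of this conditional variant, built in the same style as the earlier `lzp` sketch and reflecting our reading of the description above, is as follows (names are ours):

```python
def conditional_lzp(x: str, y: str, z: str) -> int:
    """Sketch of LZP(x -> y | z): Z's grammar is grown alongside, and a new
    phrase of x enters G_x only if Z's grammar does not already contain it."""
    g_x, g_y, g_z = set(), set(), set()
    i = k = m = 0
    overlap = 0
    while k < len(y):
        # Grow the conditioning grammar G_z by one phrase first.
        if m < len(z):
            j = m + 1
            while z[m:j] in g_z and j < len(z):
                j += 1
            g_z.add(z[m:j])
            m = j
        # Next phrase of x, admitted to G_x only if not already explained by Z.
        if i < len(x):
            j = i + 1
            while x[i:j] in g_x and j < len(x):
                j += 1
            if x[i:j] not in g_z:
                g_x.add(x[i:j])
            i = j
        # Next phrase of y; count overlap with G_x as in the unconditional case.
        l = k + 1
        while y[k:l] in g_y and l < len(y):
            l += 1
        if y[k:l] in g_x:
            overlap += 1
        g_y.add(y[k:l])
        k = l
    return len(g_y) - overlap
```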
3.4.2. Procedure
The system is modeled by a set of coupled equations whose coefficients determine which causal links are active. By changing the values of these parameters, we obtain different causal structures. The error terms are randomly sampled from a normal distribution with a noise level of 0.01. Each string is taken to be of length 2500, with the first 1000 transients filtered out. The experiment is performed for 200 trials, and the average values of causal strength are computed each time. This procedure is repeated for 20 trials to guarantee consistency.
4. Results
This section presents the results from validating the causality measure and the performance of the LZ-based decision trees.
4.1. Causality Measure Validation
The proposed LZP measure was tested on synthetic data with known causal links.
4.1.1. Coupled AR Processes
Figure 1 shows the average LZP values in both directions for coupled AR processes of different orders. Across all orders, as the coupling coefficient increases, the penalty for the true causal direction becomes significantly smaller than that for the anti-causal direction. This demonstrates that the proposed measure correctly and robustly identifies the direction of causality.
4.1.2. Coupled Logistic Map
For the coupled logistic map, the measure correctly identifies the causal direction for coupling strengths up to the onset of synchronization (Figure 2a). Beyond this point, the two time series become synchronized (Figure 2b), and the causal penalties converge, which is expected as the distinction between cause and effect blurs.
4.1.3. Three-Variable Causal Structures
Table 2 presents the average causal strengths and variances estimated by the conditional LZP algorithm across several canonical three-variable causal structures. The results validate the algorithm’s ability to distinguish between direct, indirect, and spurious influences.
In simple directed cases, the algorithm assigns strong positive strength to the correct direction (51.05) in comparison to the other cases. In confounder scenarios, both links from the confounder to its effects show high strength (9), while the spurious link between the two effects has a near-zero value (0.40), indicating correct rejection of non-causal association.
For chain structures, the algorithm captures both direct links (17.25 and 49.48) and the strong indirect influence (66.02), aligning with expectations. In the collider case, it reports strong negative values for X’s links to Y and Z, correctly suggesting the reverse direction of influence.
When variables are independent, all strengths remain near zero with higher variance, showing that no causality is detected. Finally, in complex cyclic or converging scenarios, the algorithm assigns appropriately strong strengths (e.g., −76.99), capturing the dominant influences. In the cyclic case, the algorithm does not detect any causal structure.
While these results show that conditional LZP can reliably rank the strength and direction of causal links, any categorical decision-making or structure learning would require setting appropriate thresholds on the magnitude of estimated strengths to decide whether a link is present or absent. These thresholds could depend on the application, noise level, or confidence desired, and choosing them remains an important step in translating continuous scores into discrete causal graphs.
4.2. Decision Tree Performance
The LZP Causal and LZ-Distance decision trees were compared against Gini- and Entropy-based decision trees, as well as a tree that uses the LZ-P measure of [14] as the splitting criterion. The comparative performance is summarized in Figure 3.
The most significant finding is the strong performance of the LZP-Causal Tree on the synthetic AR Dataset, where it achieves a macro F1-score of 0.716. Notably, this performance is matched by the prior LZ-P (Pranay & Nagaraj) DT (0.716), while both methods drastically outperform traditional approaches like the Gini DT (0.446). This result strongly validates the general principle that Lempel–Ziv complexity-based measures are uniquely capable of leveraging underlying causal structures for superior prediction, a task where purely associative metrics fail. The novelty of our LZP, therefore, is affirmed not by a performance gain on this specific dataset, but by its theoretical robustness, such as the guaranteed zero self-causation penalty. On standard, well-balanced datasets such as Mushroom (F1 score of 1.000) and Breast Cancer (0.934), the LZ-Causal tree performs on par with the Gini benchmark, demonstrating its viability as a general-purpose classifier. The results show that the LZ-Decision Tree is also effective in capturing structural similarities between feature and target sequences.
Table 3 shows the best and second-best performing algorithms based on Accuracy across all datasets.
However, the results also transparently highlight a key limitation: on highly imbalanced datasets like Thyroid (F1 score of 0.607 vs. Gini’s 0.951), the Gini-based tree maintains superior performance, underscoring the sensitivity of the current LZP formulation to skewed class distributions and reinforcing this as a critical area for future work.
4.3. Feature Importance and Interpretability
The causal strength metric provides a way to rank features based on their causal influence on the target.
Table 4 shows the feature ranking for the Heart Disease dataset for predicting the presence of disease. A similar ranking for the Mushroom dataset is provided in Appendix F.
The causal decision tree for the Heart Disease dataset (Figure 4) provides a clear and interpretable model. The root node splits the data based on serum cholesterol. This feature was chosen because it resulted in the lowest LZP causal penalty, indicating it is the strongest initial causal predictor of heart disease among all features. The features chosen for the top splits (‘chol’, ‘age’, and ‘sex’) directly correspond to the features with the highest causal strength scores in Table 4.
5. Discussion
Our preliminary validation on coupled AR and logistic map systems served to establish the measure’s fundamental soundness. The results also demonstrate that both the proposed LZP Causal and LZ-Distance measures can be effectively integrated into decision tree classifiers. The key finding of this research is the pronounced advantage of the LZP Causal decision tree on datasets with a known underlying causal structure. On the synthetic AR dataset, our causal tree achieved a macro F1-score of 0.716, representing a 60.5% performance improvement over the traditional Gini-based decision tree. This result provides strong evidence for our central hypothesis: by using a splitting criterion based on the LZP causal measure, the model successfully leverages causal information that is inaccessible to purely correlational metrics like Gini impurity. While performance on standard benchmarks was comparable, this specific success demonstrates that our approach is favorable in contexts where causal relationships drive the outcomes.
Extending this, the robust performance of our conditional LZP algorithm on three-variable structures (Table 2) shows its capability to perform in various scenarios. For instance, in the confounder scenario, the algorithm correctly identified the strong causal links from the confounder while assigning a near-zero strength to the spurious correlation. The low scores for independent variables and the inability to find a clear structure in the cyclic case further demonstrate that the measure does not hallucinate causality where none exists. This validation of the underlying measure reinforces the credibility of its application within the decision tree framework.
The feature importance ranking derived from the causal tree provides a transparent, causally grounded explanation for the model’s predictions. For the Heart Disease dataset, the ranking aligns with known medical risk factors, lending credibility to the approach. This moves beyond traditional feature importance measures (like Gini importance or permutation importance), which are purely correlational and can be misleading in the presence of confounders.
Limitations and Future Work
A key limitation observed is the performance degradation on highly imbalanced datasets. Both LZ-based algorithms work by minimizing a score between the feature’s and target’s symbolic sequences. In imbalanced scenarios, this minimization can be trivially achieved by collapsing the feature representation into a homogeneous sequence (e.g., all zeros), failing to find a meaningful split. This is because the target sequence for the majority class is long and complex, while the one for the minority class is short and simple. The algorithm may favor splits that do a poor job on the minority class if it leads to a large reduction in penalty/distance for the majority class. Future work should address this by incorporating class weights or using sampling techniques within the splitting criterion.
Another area for future investigation is the sensitivity of the initial data discretization (binning). The conversion of continuous data to symbolic sequences is a critical step, and the choice of binning strategy could influence the results. While simple thresholding was used here, more sophisticated, data-aware binning methods could potentially improve performance.
A fundamental limitation inherited from the standard decision tree architecture is that features are assessed individually at each split. Our LZP metric evaluates the bivariate causal influence of a single feature on the target. This approach cannot capture complex multivariate or interactive causal relationships, where a combination of features jointly causes the outcome, but neither does so individually. Future research could investigate extensions to capture such effects.
6. Conclusions
In this research, we introduced a novel causality measure and a distance metric based on Lempel–Ziv complexity. We successfully demonstrated that our LZP causality measure can identify the correct direction of causation in synthetic datasets, including coupled AR processes and logistic maps, and can handle more complex three-variable structures.
We integrated these measures into a decision tree framework, creating an LZ-Causal and an LZ-Distance classifier. The key finding of this work is the performance of the LZ-Causal decision tree on the synthetic AR dataset, which was designed with a known causal structure. On this dataset, our model achieved a 60.5% improvement over a traditional Gini-based tree, significantly outperforming traditional methods on data with an inherent causal structure and confirming its ability to exploit more than just statistical correlations. Furthermore, we proposed a method for deriving causal feature importance from the tree structure, offering a more robust form of model interpretability.
While the proposed methods show great promise for building causally aware machine learning models, future work is needed to improve their handling of imbalanced data. We also plan to explore the applicability of the proposed distance metric in bioinformatics, where the order and complexity of sequences like DNA or proteins are of fundamental importance.