Article

Causal Confirmation Measures: From Simpson’s Paradox to COVID-19

1 Intelligence Engineering and Mathematics Institute, Liaoning Technical University, Fuxin 123000, China
2 School of Computer Engineering and Applied Mathematics, Changsha University, Changsha 410022, China
Entropy 2023, 25(1), 143; https://doi.org/10.3390/e25010143
Submission received: 8 November 2022 / Revised: 4 January 2023 / Accepted: 7 January 2023 / Published: 10 January 2023
(This article belongs to the Special Issue Data Science: Measuring Uncertainties II)

Abstract: When we compare the influences of two causes on an outcome, if the conclusion from every group contradicts the conclusion from the conflated data, we say there is Simpson's Paradox. The Existing Causal Inference Theory (ECIT) can make the overall conclusion consistent with the grouping conclusion by removing the confounder's influence, thereby eliminating the paradox. The ECIT uses the relative risk difference Pd = max(0, (R − 1)/R) (where R denotes the risk ratio) as the probability of causation. In contrast, the philosopher Fitelson uses the confirmation measure D (posterior probability minus prior probability) to measure the strength of causation. Fitelson concludes that, from the perspective of Bayesian confirmation, we should directly accept the overall conclusion without considering the paradox. The author previously proposed a Bayesian confirmation measure b* similar to Pd. To resolve the contradiction between the ECIT and Bayesian confirmation, the author uses the semantic information method with the minimum cross-entropy criterion to deduce the causal confirmation measure Cc = (R − 1)/max(R, 1). Cc is like Pd but has the normalizing property (it lies between −1 and 1) and cause symmetry. It especially fits cases where a cause restrains an outcome, such as the COVID-19 vaccine controlling infection. Some examples (about kidney stone treatments and COVID-19) reveal that Pd and Cc are more reasonable than D, and that Cc is more useful than Pd.

1. Introduction

Causal confirmation is an extension of Bayesian confirmation. It is also a task of causal inference. The Existing Causal Inference Theory (ECIT), including Rubin's (or Neyman-Rubin) potential outcomes model [1,2] and Pearl's causal graph [3,4], has achieved great success. However, causal confirmation is rarely mentioned in it.
Bayesian confirmation theories, also called confirmation theories, can be divided into an incremental school and an inductive school. The incremental school holds that a confirmation measure gauges the supporting strength of evidence e for hypothesis h, as explained by Fitelson [5]. Following Carnap [6], the incremental school's researchers often use the increment of a hypothesis' probability or logical probability, P(h|e) − P(h), as a confirmation measure. Fitelson [5] discussed causal confirmation with this measure and obtained some conclusions incompatible with the ECIT. On the other hand, the inductive school [7,8] considers confirmation as the modern form of induction, whose task is to measure a major premise's credibility as supported by a sample or sampling distribution.
We use e→h to denote a major premise. Variable e takes one of two possible values: e1 and its negation e0. Variable h takes one of two possible values: h1 and its negation h0. Then a sample includes four kinds of examples, (e1, h1), (e1, h0), (e0, h1), and (e0, h0), with different proportions. The inductive school's researchers often use the positive examples' and counterexamples' proportions (P(e1|h1) and P(e1|h0)) or the likelihood ratio (P(e1|h1)/P(e1|h0)) to express confirmation measures.
A confirmation measure is often denoted by C(e, h) or C(h, e). The author of this paper agrees with the inductive school and suggests using C(e→h) to express a confirmation measure so that the task is evident [8]. In this paper, we use "x => y" to denote "cause x leads to outcome y".
Although the two schools understand confirmation differently, both use sampling distribution P(e, h) to construct confirmation measures. There have been many confirmation measures [8,9]. Most researchers agree that an ideal confirmation measure should have the following two desired properties:
  • normalizing property [9,10], which means C(e, h) should range between −1 and 1 so that the difference between a rule e→h and the best or the worst rule is clear;
  • hypothesis symmetry [11] or consequent symmetry [8], which means C(e1→h1) = −C(e1→h0). For example, C(raven→black) = −C(raven→non-black).
The author in [8] distinguished channels' confirmation and predictions' confirmation and provided the channels' confirmation measure b*(e→h) and the predictions' confirmation measure c*(e→h). Both have the above two desired properties and can be used for probability predictions of h according to e.
Bayesian confirmation confirms associated relationships, which differ from causal relationships. Association includes causality, but many associated relationships are not causal relationships. One reason is that the existence of association is symmetrical (if P(h|e) ≠ 0, then P(e|h) ≠ 0), whereas the existence of causality is asymmetrical. For example, in medical tests, P(positive|infected) reflects both association and causality. However, inversely, P(infected|positive) only indicates association. Another reason is that two associated events A and B, such as brisk sales of electric fans and brisk sales of air conditioners, may both be outcomes of a third event (hot weather). Neither P(A|B) nor P(B|A) indicates causality.
Causal inference only deals with uncertain causal relationships in nature and human society, without considering those in mathematics, such as (x + 1)(x − 1) < x² because (x + 1)(x − 1) = x² − 1. Kant distinguished analytic judgments from synthetic judgments. Although causal inference is a mathematical method, it is used for synthetic judgments to obtain uncertain rules in biology, psychology, economics, etc. In addition, causal confirmation only deals with binary causality.
Although causal confirmation was rarely mentioned in the ECIT, the researchers of causal inference and epidemiology have provided many measures (without using the term “confirmation measure”) to indicate the strength of causation. These measures include risk difference [12]:
$$\mathrm{RD} = P(y_1|x_1) - P(y_1|x_0), \tag{1}$$
the relative risk or risk ratio (like the likelihood ratio for medical tests):
$$\mathrm{RR} = P(y_1|x_1)/P(y_1|x_0), \tag{2}$$
and the probability of causation Pd (used by Rubin and Greenland [13]), which Pearl calls the probability of necessity PN [3]. We have:
$$P_d = \mathrm{PN} = \max\left(0, \frac{P(y_1|x_1) - P(y_1|x_0)}{P(y_1|x_1)}\right). \tag{3}$$
Pd is also called the Relative Risk Reduction (RRR) [12]. In the above formula, max(0, ∙) sets the lower limit to 0, which makes Pd more like a probability. Measure b*, proposed by the author in [8], is like Pd, but b* ranges between −1 and 1. The above risk measures can measure not only risk or relative risk but also the success or relative success raised by the cause.
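As a quick illustration, the following minimal Python sketch (mine, not from the paper; the input rates are hypothetical) computes the three risk measures of Equations (1)–(3):

```python
# A minimal sketch computing the risk measures in Equations (1)-(3)
# from the two conditional probabilities P(y1|x1) and P(y1|x0).

def risk_measures(p_y1_x1: float, p_y1_x0: float):
    """Return RD, RR, and Pd."""
    rd = p_y1_x1 - p_y1_x0               # risk difference, Eq. (1)
    rr = p_y1_x1 / p_y1_x0               # risk ratio, Eq. (2)
    pd = max(0.0, (rr - 1.0) / rr)       # probability of causation, Eq. (3)
    return rd, rr, pd

# Hypothetical rates: 2% of the exposed and 0.5% of the unexposed get the disease.
print(risk_measures(0.02, 0.005))        # RD = 0.015, RR = 4.0, Pd = 0.75
```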
The risk measures in Equations (1)–(3) are significant; however, they do not possess the two desired properties and hence are improper as causal confirmation measures.
We will encounter Simpson's Paradox if we use only sampling distributions in the above measures. Simpson's Paradox has accompanied the study of causal inference, just as the Raven Paradox has accompanied the study of Bayesian confirmation. Simpson proposed the paradox [14] using the following example.
Example 1 
[15]. The admission data of the graduate school of the University of California, Berkeley (UCB), for the fall of 1973 showed that 44% of male applicants were accepted, whereas only 35% of female applicants were accepted. Gender bias seemed likely. However, in most departments, female applicants' acceptance rates were higher than male applicants' rates.
Was there a gender bias? Should we accept the overall conclusion or the grouping conclusion (i.e., the conclusion from every department)? If we take the overall conclusion, we can think that the admission was biased against female applicants. On the other hand, if we accept the grouping conclusion, we can say that female applicants were preferentially accepted. Therefore, we say there exists a paradox.
Example 1 is somewhat complicated and invites objections. To simplify the problem, we use Example 2, which researchers of causal inference often mention, to explain Simpson's Paradox quantitatively.
We use x1 to denote a new cause (or treatment) and x0 to denote a default cause or no cause. If we need to compare two causes, we may use x1 and x2, or xi and xj, to represent them. In these cases, we may treat one of them as the default, like x0.
Example 2 
[16,17]. Suppose there are two treatments, x1 and x2, for patients with kidney stones. Patients are divided into two groups according to their size of stones. Group g1 includes patients with small stones, and group g2 has large ones. Outcome y1 represents the treatment’s success. Success rates shown in Figure 1 are possible. In each group, the success rate of x2 is higher than that of x1; however, the overall conclusion is the opposite.
According to Rubin's potential outcomes model [1], we should accept the grouping conclusion: x2 is better than x1. The reason is that the stones' size is a confounder, and the overall conclusion is affected by the confounder. We should eliminate this influence. The method is to imagine that the patients' numbers in each group are unchanged whether we use x1 or x2. Then we replace the weighting coefficients P(gi|x1) and P(gi|x2) with P(gi) (i = 1, 2) to obtain two new overall success rates. Rubin [1] expresses them as P(y1x1) and P(y1x2), whereas Pearl [3] expresses them as P(y1|do(x1)) and P(y1|do(x2)). Then the overall conclusion is consistent with the grouping conclusion.
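The following minimal Python sketch (my illustration; all group rates and proportions are assumed toy values, not the paper's data) shows this reweighting:

```python
# A minimal sketch of the adjustment described above: to obtain P(y1|do(x)),
# weight the group-wise success rates by P(g) instead of P(g|x).

def p_y1_do(p_y1_given_xg: dict, p_g: dict) -> float:
    """P(y1|do(x)) = sum over g of P(g) * P(y1|x, g)."""
    return sum(p_g[g] * p_y1_given_xg[g] for g in p_g)

p_g = {"g1": 0.5, "g2": 0.5}              # assumed confounder distribution P(g)
rates_x1 = {"g1": 0.80, "g2": 0.60}       # assumed P(y1|x1, g)
rates_x2 = {"g1": 0.90, "g2": 0.70}       # assumed P(y1|x2, g)

# x2 beats x1 in both groups, and also after the do-adjustment:
print(p_y1_do(rates_x1, p_g), p_y1_do(rates_x2, p_g))   # 0.7 vs 0.8
```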
Should we always accept the grouping conclusion when the two conclusions are inconsistent? Not necessarily! Example 3 is a counterexample.
Example 3 
(from [18]). Treatment x1 denotes taking a kind of antihypertensive drug, and treatment x0 means taking nothing. Outcome y1 denotes recovering health, and y0 means not recovering. Patients are divided into group g1 (with high blood pressure) and group g0 (with low blood pressure). It is very possible that in each group g, P(y1|g, x1) < P(y1|g, x0) (which means x0 is better than x1), whereas the overall result is P(y1|x1) > P(y1|x0) (which means x1 is better than x0).
The ECIT tells us that we should accept the overall conclusion that x1 is better than x0 because blood pressure is a mediator, which is also affected by x1. We expect that x1 can move a patient from g1 to g0; hence we need not change the weighting coefficients from P(g|x) to P(g). The grouping conclusion, P(y1|g, x1) < P(y1|g, x0), exists because the drug has a side effect.
There are also some examples where the grouping conclusion is acceptable from one perspective, and the overall conclusion is acceptable from another.
Example 4 
[19]. The United States statistical data about COVID-19 in June 2020 show that COVID-19 led to a higher Case Fatality Rate (CFR) for non-Hispanic whites than for others (the overall conclusion). We can find that only 35.3% of the infected people were non-Hispanic whites, whereas 49.5% of the infected people who died from COVID-19 were non-Hispanic whites. It seems that COVID-19 is more dangerous to non-Hispanic whites. However, Dana Mackenzie pointed out [19] that we obtain the opposite conclusion from every age group, because the CFR of non-Hispanic whites is lower than that of other people in every age group. So there exists Simpson's Paradox. The reason is that non-Hispanic whites live longer and include a relatively large proportion of the elderly, while COVID-19 is more dangerous to the elderly.
Kügelgen et al. [20] also pointed out the existence of Simpson’s Paradox after they compared the CFRs of COVID-19 (reported in 2020) in China and Italy. Although the overall conclusion was that the CFR in Italy was higher than in China, the CFR of every age group in China was higher than in Italy. The reason is that the proportion of the elderly in Italy is larger than in China.
According to Rubin’s potential outcomes model or Pearl’s causal graph, if we think that the reason for non-Hispanic whites’ longevity is good medical conditions or other elements instead of their race, then the lifespan is a confounder. Therefore, we should accept the grouping conclusion. On the other hand, if we believe that non-Hispanic whites are longevous because they are whites, then the lifespan is a mediator, so we should accept the overall conclusion.
Example 1 is similar to Example 4, but the former is less easy to understand. The data show that female applicants tended to choose majors with low admission rates (perhaps because lower thresholds resulted in more intense competition). This tendency plays a role like the longevity of whites in Example 4. If we regard this tendency as a confounder, UC Berkeley had no gender bias against female applicants. On the other hand, if we believe that being female is the cause of this tendency, the overall conclusion is acceptable, and gender bias should have existed. Which of the two judgments is right depends on one's perspective.
Pearl’s causal graph [3] makes it clear that for the same data, if supposed causal relationships are different, conclusions are also different. So, it is not enough to have data only. We also need the structural causal model.
However, the incremental school's philosopher Fitelson argues that, from the perspective of Bayesian confirmation, we should accept the overall conclusion according to the data without considering causation; according to his rational explanation, Simpson's Paradox does not exist. His reason is that we can use the measure [5]:
$$\text{measure } i = P(y_1|x_1) - P(y_1) \tag{4}$$
to measure causality. Fitelson proves (see Fact 3 of Appendix in [5]) that if there is:
$$P(y_1|x_1, g_i) > P(y_1|x_2, g_i), \quad i = 1, 2, \tag{5}$$
then there must be P(y1|x1) > P(y1). The result is the same when ">" is replaced with "<". Therefore, Fitelson affirms that, unlike RD and Pd, measure i does not result in the paradox.
However, Equation (5) expresses a rigorous condition, which excludes all examples with joint distributions P(y, x, g) that cause the paradox, including Fitelson’s simplified example about the admissions of the UCB.
One cannot help asking:
  • For Example 2 about kidney stones, is it reasonable to accept the overall conclusion without considering the difficulties of treatments?
  • Is it necessary to extend or apply a Bayesian confirmation measure incompatible with the ECIT and medical practices to causal confirmation?
  • Except for the incompatible confirmation measures, are there no compatible confirmation measures?
In addition to the incremental school’s confirmation measures, there are also the inductive school’s confirmation measures, such as F proposed by Kemeny and Oppenheim in 1952 and b* provided by the author in 2020.
This paper mainly aims at:
  • combining the ECIT to deduce causal confirmation measure Cc(x1 => y1) (“C” stands for confirmation and “c” for the cause), which is similar to Pd but can measure negative causal relationships, such as “vaccine => infection”;
  • explaining that measures Cc and Pd are more suitable for causal confirmation than measure i by using some examples with Simpson’s Paradox;
  • supporting the inductive school of Bayesian confirmation in turn.
When the author proposed measure b*, he also provided measure c* for eliminating the Raven Paradox [8]. To extend c* to causal confirmation, this paper presents measure Ce(x1 => y1), which indicates the outcome's inevitability or the cause's sufficiency.

2. Background

2.1. Bayesian Confirmation: Incremental School and Inductive School

A universal judgment is equivalent to a hypothetical judgment or a rule; for example, "All ravens are black" is equivalent to "For every x, if x is a raven, then x is black". Both can be used as the major premise of a syllogism. Due to the criticism of Hume and Popper, most philosophers no longer expect to obtain absolutely correct universal judgments or major premises by induction but hope to obtain their degrees of belief. A degree of belief supported by a sample or sampling distribution is a degree of confirmation.
It is worth noting that a proposition does not need confirmation. Its truth value comes from its usage or definition [8]. For example, “People over 18 are adults” does not need confirmation; whether it is correct depends on the government’s definition. Only major premises (such as “All ravens are black” and “If a person’s Nucleic Acid Test is positive, he is likely to be infected with COVID-19”) need confirmation.
A natural idea is to use the conditional probability P(h|e) to confirm a major premise or rule denoted by e→h. This measure is recommended by Fitelson [5] and called confirm f [5,6]:
$$\text{confirm } f = f(e, h) = P(h|e). \quad \text{(Carnap, 1962; Fitelson, 2017)} \tag{6}$$
However, P(h|e) depends very much on the prior probability P(h) of h. For example, where COVID-19 is prevalent, P(h) is large, and P(h|e) is also large. Therefore, P(h|e) cannot reflect the necessity of e. An extreme example: h and e are independent of each other, but if P(h) is large, P(h|e) = P(h, e)/P(e) = P(h) is also large. In this case, P(h|e) does not reflect the credibility of the causal relationship. For example, let h = "There will be no earthquake tomorrow" with P(h) = 0.999, and let e = "Grapes are ripe". Although e and h are unrelated, P(h|e) = P(h) = 0.999 is very large. However, we cannot say that ripe grapes support no earthquake happening.
For this reason, the incremental school’s researchers use posterior (or conditional) probability minus prior probability to express the degree of confirmation. These confirmation measures include [6,10,21,22]:
D(e1, h1) = P(h1|e1) − P(h1)     (Carnap, 1962),
M(e1, h1) = P(e1|h1) − P(e1)     (Mortimer, 1988),
R(e1, h1) = log[P(h1|e1)/P(h1)]     (Horwich, 1982),
C(e1, h1) = P(h1, e1) − P(e1) P(h1)     (Carnap, 1962),
$$Z(h_1, e_1) = \begin{cases} [P(h_1|e_1) - P(h_1)]/P(h_0), & \text{as } P(h_1|e_1) \ge P(h_1),\\ [P(h_1|e_1) - P(h_1)]/P(h_1), & \text{otherwise}, \end{cases} \quad \text{(Crupi et al., 2007)}$$
In the above measures, D(e1, h1) is the measure i recommended by Fitelson in [5]. R(e1, h1) is an information measure; it can be written as log P(h1|e1) − log P(h1). Since log P(h1|e1) − log P(h1) = log P(e1|h1) − log P(e1) = log P(h1, e1) − log[P(h1)P(e1)], measures D, M, and C increase with R and hence can substitute for one another. Z is the normalization of D, constructed to have the two desired properties [10]. Therefore, we can also call the incremental school the information school.
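For readers who want to experiment, here is a small Python sketch (my own; the joint distribution is a toy assumption) that computes the incremental measures listed above:

```python
import math

# Compute the incremental school's measures D, M, R, C, Z from a joint
# distribution P(e, h), stored as p[e][h] with e, h in {0, 1}.

def incremental_measures(p):
    p_e1 = p[1][0] + p[1][1]
    p_h1 = p[0][1] + p[1][1]
    p_h1_e1 = p[1][1] / p_e1              # P(h1|e1)
    p_e1_h1 = p[1][1] / p_h1              # P(e1|h1)
    d = p_h1_e1 - p_h1                    # D (Carnap)
    m = p_e1_h1 - p_e1                    # M (Mortimer)
    r = math.log(p_h1_e1 / p_h1)          # R (Horwich)
    c = p[1][1] - p_e1 * p_h1             # C (Carnap)
    z = d / (1 - p_h1) if d >= 0 else d / p_h1   # Z (Crupi et al.)
    return d, m, r, c, z

# Toy joint distribution: P(e1,h1)=0.3, P(e1,h0)=0.1, P(e0,h1)=0.2, P(e0,h0)=0.4
p = {0: {0: 0.4, 1: 0.2}, 1: {0: 0.1, 1: 0.3}}
print(incremental_measures(p))            # D=0.25, M=0.2, R=log 1.5, C=0.1, Z=0.5
```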
On the other hand, the inductive school’s researchers use the difference (or likelihood ratio) between two conditional probabilities representing the proportions of positive and negative examples to express confirmation measures. These measures include [7,8,23,24,25]:
S(e1, h1) = P(h1|e1) − P(h1|e0)     (Christensen, 1999),
N(e1, h1) = P(e1|h1) − P(e1|h0)     (Nozick, 1981),
L(e1, h1) = log[P(e1|h1)/P(e1|h0)]     (Good, 1984),
F(e1, h1) = [P(e1|h1) − P(e1|h0)]/[P(e1|h1) + P(e1|h0)]     (Kemeny and Oppenheim, 1952),
b*(e1, h1) = [P(e1|h1) − P(e1|h0)]/max(P(e1|h1), P(e1|h0)) (Lu, 2020).
They are all positively related to the likelihood ratio (LR+ = P(e1|h1)/P(e1|h0)). For example, L = log LR+ and F = (LR+ − 1)/(LR+ + 1) [7]. Therefore, these measures are compatible with risk (or reliability) measures, such as Pd, used in medical tests and disease control. Although the author has studied semantic information theory for a long time [26,27,28] and believes both schools have made important contributions to Bayesian confirmation, he is on the side of the inductive school. The reason is that information evaluation occurs before classification, whereas confirmation is needed after classification [8].
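A companion sketch (toy numbers mine) of the inductive school's measures; note that S needs the prediction probabilities P(h1|e1) and P(h1|e0) rather than the channel probabilities, so it is omitted here:

```python
import math

# The inductive school's measures, all built from the two "channel"
# probabilities P(e1|h1) and P(e1|h0).

def inductive_measures(p_e1_h1: float, p_e1_h0: float):
    lr = p_e1_h1 / p_e1_h0                 # likelihood ratio LR+
    n = p_e1_h1 - p_e1_h0                  # N (Nozick, 1981)
    l = math.log(lr)                       # L (Good, 1984)
    f = (lr - 1) / (lr + 1)                # F (Kemeny and Oppenheim, 1952)
    b = (lr - 1) / max(lr, 1)              # b* (Lu, 2020)
    return n, l, f, b

print(inductive_measures(0.9, 0.1))        # LR+ = 9: N = 0.8, F = 0.8, b* = 8/9
```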
Although the researchers understand confirmation differently, they all agree to use a sample including four types of examples, (e1, h1), (e0, h1), (e1, h0), and (e0, h0), with different proportions as the evidence to construct confirmation measures [8,10]. The main problem with the incremental school is that it does not distinguish well between the evidence for a major premise and the evidence for the consequent of the major premise. When its researchers use the four examples' proportions to construct confirmation measures, e is regarded as the major premise's antecedent, whose negation e0 is meaningful. However, when they say "to evaluate the supporting strength of e to h", e is understood as a sample, whose negation e0 is meaningless. It is still more meaningless to put a sample e or e0 into an example such as (e1, h1) or (e0, h1).
We compare D (i.e., measure i) and S to show the main difference between the two schools’ measures. Since:
$$D(e_1, h_1) = P(h_1|e_1) - P(h_1) = P(h_1|e_1) - [P(e_1)P(h_1|e_1) + P(e_0)P(h_1|e_0)]$$
$$= [1 - P(e_1)]P(h_1|e_1) - P(e_0)P(h_1|e_0) = P(e_0)S(e_1, h_1),$$
we can find that D changes with P(e0) or P(e1), but S does not. P(e) characterizes the source, and P(h|e) characterizes the channel. D is related to both the source and the channel, but S is only related to the channel. Measures F and b* are also only related to the channel P(e|h). Therefore, the author calls b* the channels' confirmation measure.

2.2. The P-T Probability Framework and the Methods of Semantic Information and Cross-Entropy for Channels’ Confirmation Measure b*(e→h)

In the P-T probability framework [28] proposed by the author, there are both statistical probability P and logical probability (or truth value) T; the truth function of a predicate is also a membership function of a fuzzy set [29]. Therefore, the truth function also changes between 0 and 1. The purpose of proposing this probability framework is to set up the bridge between statistics and fuzzy logic.
Let X be a random variable representing an instance, taking a value x ∈ A = {x0, x1, …}, and let Y be a random variable representing a label or hypothesis, taking a value y ∈ B = {y0, y1, …}. The Shannon channel is a conditional probability matrix P(yj|xi) (i = 0, 1, …; j = 0, 1, …) or a set of transition probability functions P(yj|x) (j = 0, 1, …). The semantic channel is a truth value matrix T(yj|xi) or a set of truth functions T(yj|x) (j = 0, 1, …). Let the elements in A that make yj true form a fuzzy subset θj. The membership function T(θj|x) of θj is also the truth function T(yj|x) of yj, i.e., T(θj|x) = T(yj|x).
The logical probability of yj is:
$$T(y_j) = T(\theta_j) = \sum_i P(x_i)\, T(\theta_j|x_i). \tag{7}$$
Zadeh calls it the fuzzy event’s probability [30]. When yj is true, the conditional probability of x is:
$$P(x|\theta_j) = P(x)\, T(\theta_j|x)/T(\theta_j). \tag{8}$$
Fuzzy set θj can also be understood as a model parameter; hence P(x|θj) is a likelihood function.
The differences between logical probability and statistical probability are:
  • The statistical probability is normalized (it sums to 1), whereas the logical probability is not. Generally, we have T(θ0) + T(θ1) + … > 1.
  • The maximum of T(θj|x) over x is 1, whereas P(y0|x) + P(y1|x) + … = 1 for a given x.
We can use a sample distribution to optimize the model parameters. For example, we use x to represent age, use a logistic function as the truth function of "elderly": T("elderly"|x) = 1/[1 + exp(−bx + a)], and use a sampling distribution to optimize a and b.
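A small numeric sketch (mine; the age prior and the logistic parameters a, b are toy assumptions, not fitted values) of the logistic truth function together with Equations (7) and (8):

```python
import numpy as np

# A logistic truth function T("elderly"|x) for age x, its logical
# probability T(theta) per Eq. (7), and the likelihood P(x|theta) per Eq. (8).

ages = np.arange(0, 100)                          # instance space A
p_x = np.full(len(ages), 1.0 / len(ages))         # assumed uniform age prior P(x)

a, b = 24.0, 0.4                                  # hypothetical parameters; midpoint at x = 60
T = 1.0 / (1.0 + np.exp(-b * ages + a))           # truth function T(theta|x)

T_theta = np.sum(p_x * T)                         # logical probability, Eq. (7)
p_x_theta = p_x * T / T_theta                     # likelihood P(x|theta), Eq. (8)

print(T_theta, p_x_theta.sum())                   # p_x_theta sums to 1
```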
The (amount of) semantic information about xi conveyed by yj is:
$$I(x_i; \theta_j) = \log\frac{P(x_i|\theta_j)}{P(x_i)} = \log\frac{T(\theta_j|x_i)}{T(\theta_j)}. \tag{9}$$
For different x, the average semantic information conveyed by yj is:
$$I(X; \theta_j) = \sum_i P(x_i|y_j) \log\frac{T(\theta_j|x_i)}{T(\theta_j)} = \sum_i P(x_i|y_j) \log\frac{P(x_i|\theta_j)}{P(x_i)} = -\sum_i P(x_i|y_j) \log P(x_i) - H(X|\theta_j). \tag{10}$$
In the above formula, H(X|θj) is a cross-entropy:
$$H(X|\theta_j) = -\sum_i P(x_i|y_j) \log P(x_i|\theta_j). \tag{11}$$
The cross-entropy has an important property: when we change P(x|θj) so that P(x|θj) = P(x|yj), H(X|θj) reaches its minimum. It is easy to see from Equation (10) that I(X; θj) reaches its maximum as H(X|θj) reaches its minimum. The author has proved that if P(x|θj) = P(x|yj), then T(θj|x) ∝ P(yj|x) [27]. If T(θj|x) ∝ P(yj|x) for all j, we say that the semantic channel matches the Shannon channel.
We use the medical test as an example to deduce the channels' confirmation measure b*. We define h ∈ {h1, h0} = {infected, uninfected} and e ∈ {e1, e0} = {positive, negative}. The Shannon channel is P(e|h), and the semantic channel is T(e|h). The major premise to be confirmed is e1→h1, which means "If one's test is positive, then he is infected".
We regard a fuzzy predicate e1(h) as the linear combination of a clear predicate (whose truth value is 0 or 1) and a tautology (whose truth value is always 1). Let the tautology’s proportion be b1′ and the clear predicate’s proportion be 1 − b1′. Then we have:
$$T(e_1|h_0) = b_1'; \quad T(e_1|h_1) = b_1' + b_1 = b_1' + (1 - b_1') = 1. \tag{12}$$
b1′ is also called the degree of disbelief of the rule e1→h1. The degree of disbelief optimized by a sample, denoted by b1′*, is the degree of disconfirmation. Let b1* denote the degree of confirmation; we have b1′* = 1 − |b1*|. By maximizing the average semantic information I(H; θ1) or minimizing the cross-entropy H(H|θ1), we can deduce (see Section 3.2 in [8]):
$$b_1^* = b^*(e_1 \to h_1) = \frac{P(e_1|h_1) - P(e_1|h_0)}{\max(P(e_1|h_1),\, P(e_1|h_0))} = \frac{\mathrm{LR}^+ - 1}{\max(\mathrm{LR}^+, 1)}. \tag{13}$$
Suppose that the likelihood function P(h|e1) is decomposed into the linear combination of an equiprobable (uniform) part and a 0–1 part. Then we can deduce the predictions' confirmation measure c*:
$$c_1^* = c^*(e_1 \to h_1) = \frac{P(h_1|e_1) - P(h_0|e_1)}{\max(P(h_1|e_1),\, P(h_0|e_1))} = \frac{2P(h_1|e_1) - 1}{\max(P(h_1|e_1),\, 1 - P(h_1|e_1))}. \tag{14}$$
Measure b* is compatible with the likelihood ratio and suitable for evaluating medical tests. In contrast, measure c* is appropriate for assessing the inevitability of a rule's consequent and can be used to clarify the Raven Paradox [8]. Moreover, both measures have the normalizing property and the symmetry mentioned above.
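The following sketch (the test's sensitivity, specificity, and prevalence are toy values assumed by me) computes b* and c* for a hypothetical medical test:

```python
# Compute b* (Eq. (13)) and c* (Eq. (14)) for a hypothetical test.

def b_star(p_e1_h1: float, p_e1_h0: float) -> float:
    return (p_e1_h1 - p_e1_h0) / max(p_e1_h1, p_e1_h0)

def c_star(p_h1_e1: float) -> float:
    return (2 * p_h1_e1 - 1) / max(p_h1_e1, 1 - p_h1_e1)

sens, spec, prev = 0.9, 0.95, 0.1                # assumed test characteristics
p_e1_h1, p_e1_h0 = sens, 1 - spec                # positive rates in the two classes
p_h1_e1 = prev * sens / (prev * sens + (1 - prev) * (1 - spec))   # Bayes

print(b_star(p_e1_h1, p_e1_h0))                  # ~0.944: strong channel confirmation
print(c_star(p_h1_e1))                           # 0.5: weaker prediction confirmation
```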

2.3. Causal Inference: Talking from Simpson’s Paradox

According to the ECIT, the grouping conclusion is acceptable for Example 2 (about kidney stones), whereas the overall conclusion is acceptable for Example 3 (about blood pressure). The reason is that P(y1|x1) and P(y1|x0) may not reflect causality well; in addition to the observed data or joint probability distribution P(y, x, g), we also need to suppose the causal structure behind the data [3].
Suppose there is a third variable, u. Figure 2 shows the causal relationships in Examples 2, 3, and 4. Figure 2a shows the causal structure of Example 2, where u (kidney stones' size) is a confounder that affects both x and y. Figure 2b describes the causal structure of Example 3, where u (blood pressure) is a mediator that affects y but is affected by x. In Figure 2c, u can be interpreted as either a confounder or a mediator. The causality differs from different perspectives, and P(y1|do(x)) will also differ. In all cases, we should replace P(y|x) with P(y|do(x)) (when they differ) to compute RD, RR, and Pd.
We should accept the overall conclusion for the example where u is a mediator. However, for the example where u is a confounder, how do we obtain a suitable P(y|do(x))? According to Rubin’s potential outcomes model, we use Figure 3 to explain the difference between P(y|do(x)) and P(y|x).
To find the difference in the outcomes caused by x1 and x2, we should compare the two outcomes in the same background. However, there is often no situation where other conditions remain unchanged except for the cause. For this reason, we need to replace x1 with x2 in our imagination and see the shift in y1 or its probability. If u is a confounder and not affected by x, the number of members in g1 and g2 should be unchanged with x, as shown in Figure 3. The solution is to use P(g) instead of P(g|x) for the weighting operation so that the overall conclusion is consistent with the grouping conclusion. Hence, the paradox no longer exists.
Although P(x0) + P(x1) = 1 is tenable, P(do(x1)) + P(do(x0)) = 1 is meaningless. That is why Rubin emphasizes that P(yx), i.e., P(y|do(x)), is in essence still a marginal probability instead of a conditional probability.
Rubin's reason [2] for replacing P(g|x) with P(g) is that for each group, such as g1, the two subgroups' members (patients) treated with x1 and x2 are exchangeable (i.e., Pearl's causal independence assumption mentioned in [5]). If a member is divided into the subgroup with x1, its success rate should be P(y1|g, x1); if it is divided into the subgroup with x2, the success rate should be P(y1|g, x2). P(g|x1) and P(g|x2) differ only because half of the data are missing. However, we can fill in the missing data in our imagination.
If u is a mediator, as shown in Figure 2b, a member of g1 may enter g2 because of x, and vice versa. P(g|x0) and P(g|x1) are hence different and need not be replaced with P(g). We can let P(y1|do(x)) = P(y1|x) directly and accept the overall conclusion.

2.4. Probability Measures for Causation

In Rubin and Greenland’s article [13]:
$$P(t) = [R(t) - 1]/R(t) \tag{15}$$
is explained as the probability of causation, where t is one's age at exposure to some harmful environment, and R(t) is the age-specific relative risk (the infection rate of the exposed divided by that of the unexposed). Let y1 stand for the infection, x1 for the exposure, and x0 for no exposure. Then R(t) = P(y1|do(x1), t)/P(y1|do(x0), t). Its lower limit is 0 because the probability cannot be negative. Neglecting the change of t and considering the lower limit, we can write the probability of causation as:
$$P_d = \max\left(0, \frac{R - 1}{R}\right) = \max\left(0, \frac{P(y_1|\mathrm{do}(x_1)) - P(y_1|\mathrm{do}(x_0))}{P(y_1|\mathrm{do}(x_1))}\right). \tag{16}$$
Pearl uses PN to represent Pd and explains PN as the probability of necessity [3]. Pd is very similar to confirmation measure b* [8]. The main difference is that b* changes between −1 and 1.
Robert van Rooij and Katrin Schulz [31] argue that conditionals of the form “If x, then y” are assertable only if:
$$\Delta^* P_{xy} = \frac{P(y_1|x_1) - P(y_1|x_0)}{1 - P(y_1|x_0)} \tag{17}$$
is high. This measure is similar to confirmation measure Z. The difference between Pd and Δ*Pxy is that Pd, like b*, is sensitive to counterexamples’ proportion P(y1|x0), whereas Δ*Pxy is not. Table 1 shows their differences.
David E. Over et al. [32] support the Ramsey test hypothesis, implying that the subjective probability of a natural language conditional, P(if p then q), is the conditional subjective probability, P(q|p). This measure is confirm f in [5].
The author [8] suggests that we should distinguish two types of confirmation measures for x => y or e→h. One stands for the necessity of x compared with x0; the other stands for the inevitability of y. P(y|x) may be good for the latter but not for the former. The former should be independent of P(x) and P(y). Pd is such a measure.
However, there is a problem with Pd. Since Pd is 0 when y is uncorrelated with x, Pd should be negative, rather than 0, when x inversely affects y (e.g., the vaccine restrains infection). Therefore, we need a confirmation measure between −1 and 1 instead of a probability measure between 0 and 1.

3. Methods

3.1. Defining Causal Posterior Probability

To avoid treating association as causality, we first explain what kinds of posterior probabilities indicate causality. Posterior probability and conditional probability are often regarded as the same. However, Rubin emphasizes that the probability P(yx) is not conditional; it is still marginal. To distinguish P(yx) from the marginal probability P(y), we call P(yx), i.e., P(y|do(x)), the Causal Posterior Probability (CPP). What posterior probability is a CPP? We use the following example to explain.
Consider the population age distribution: let z be age and p(z) be the population age distribution. We may define a person with z ≥ 60 as elderly; that is, P(y1|z) = 1 for z ≥ z0 = 60. The label for the elderly is y1, and the label for the non-elderly is y0. The probability of the elderly is:
$$P(y_1) = \sum_{\text{all } z} p(z)\, P(y_1|z) = \sum_{z \ge 60} p(z). \tag{18}$$
Let x1 denote the improved medical condition. After a period, p(z) becomes p(zx1) = p(z|do(x1)) and P(y1) becomes:
$$P(y_1x_1) = P(y_1|\mathrm{do}(x_1)) = \sum_{z \ge 60} p(z|\mathrm{do}(x_1)). \tag{19}$$
Let x0 be the existing medical condition. We have:
$$P(y_1|\mathrm{do}(x_0)) = \sum_{z \ge 60} p(z|\mathrm{do}(x_0)). \tag{20}$$
There are similar examples:
  • Regarding whether a drug (x1) can lower blood pressure, blood sugar, blood lipids, or uric acid (z): if z drops to a certain level z0, we say that the drug is effective (y1).
  • Regarding whether a fertilizer (x1) can increase grain yield (z): if z increases to a certain extent z0, the grain yield is regarded as a bumper harvest (y1).
  • Regarding whether a process (x1) can reduce the size deviation z of a product: if the deviation is smaller than the tolerance (z0), we consider the product qualified (y1).
From the above examples, we can see that an action x can be the cause in a causal relationship because it changes the probability distribution p(z) of the objective result z, rather than the probability distribution P(y|∙) of the outcome y. The reason is that P(y|∙) also changes with the dividing boundary z0. For example, if the dividing boundary for the elderly changes from z0 = 60 to z0′ = 65, the posterior probability P(y1|z0′) of y1 becomes smaller. This change seemingly also reflects causality. However, the author thinks this change has a mathematical cause, which does not reflect the causal relationship we want to study. Therefore, we need to define the CPP more specifically.
Definition 1. 
Random variable Z takes a value z ∈ {z1, z2, …}, and p(z) is the probability distribution of the objective result. Random variable Y takes a value y ∈ {y0, y1} and represents the outcome, i.e., the classification label of z. The cause or treatment is x ∈ {x0, x1} or {x1, x2}. If replacing x0 with x1 (or x1 with x2) can change the probability distribution p(z), we call x the cause, p(z|x) or p(zx) the CPP distribution, and P(yx) = P(y|do(x)) the CPP.
According to the above definition, given y1, the conditional probability distribution p(z|y1) is not the CPP distribution because the probability distribution of z does not change with y.
Suppose that x1 is the COVID-19 vaccine, y1 is the infection, and e1 is a positive test. Then P(y1|x1) or P(y1|do(x1)) is the CPP, whereas P(y1|e1) is not. We may regard y1 as the conclusion obtained by the best test, e1 as the result of a common test, and P(y1|e1) as the probability prediction of y1. P(y1|e1) is not a CPP because e1 changes neither p(z) nor the conclusion from the best test.

3.2. Using x2/x1 => y1 to Compare the Influences of Two Causes on an Outcome

In associated relationships, x0 is the negation of x1; they are complementary. However, in causal relationships, x1 is a substitute for x0. For example, consider taking medicine to cure a disease. Let x0 denote taking nothing, and let x1 and x2 represent taking two different medicines. Each of x1 and x2 is a possible alternative to x0 rather than the negation of x0. Furthermore, in some cases, x1 may include x0 (see Section 4.3).
When we compare the effects of x2 and x1, it is unclear to use "x2 => y1" to indicate the causal relationship. Therefore, the author suggests that we replace "x2 => y1" with "x2/x1 => y1", which means "replacing x1 with x2 will produce or increase y1".
There are two reasons for using “x2/x1”:
  • One is to express symmetry (Cc(x2/x1 => y1) = − Cc(x1/x2 => y1)) conveniently.
  • Another is to emphasize that x1 and x2 are not complementary but alternatives for eliminating Simpson’s Paradox easily.
To compare x1 with x0, we may selectively use “x1/x0 => y1” or “x1 => y1”.
For Example 2 with a confounder, if we regard the treatment as replacing one treatment with the other in our imagination, we can easily understand why the number of patients in each group should be unchanged, that is, P(g|x1) = P(g|x2) = P(g). The reason is that the replacement will not change anyone's kidney stone size.
In Example 3, u is a mediator, and the number of people in each group (with high or low blood pressure) is also affected by taking the antihypertensive drug x1. When we replace x0 with x1, it is reasonable that P(g|x1) ≠ P(g|x0) ≠ P(g), and hence the weighting coefficients need not be adjusted. In this case, we can directly let P(y1|do(x)) = P(y1|x).

3.3. Deducing Causal Confirmation Measure Cc by the Methods of Semantic Information and Cross-Entropy

We use x1 => y1 as an example to deduce the causal confirmation measure Cc. If we need to compare any two causes xi and xk, we may treat one of them as the default, like x0.
Let s1 = “x1 => y1” and s0 = “x0 => y0”. We suppose that s1 includes a believable part with proportion b1 and a disbelievable part with proportion b1′. Their relation is b1′ + |b1| = 1. First, we assume b1 > 0; hence b1 = 1 − b1′. The two truth values of s1 are T(s1|x1) and T(s1|x0), as shown in the last row of Table 2.
Figure 4 shows how truth function T(s1|x) is related to b1 and b1′ for b1 > 0. T(s1|x1) = 1 means that example (x1, y1) makes s1 fully true; T(s1|x0) = b1′ is the truth value and the degree of disbelief of s1 for given counterexample (x0, y1).
The degree of belief optimized by a sampling distribution under the maximum semantic information (or minimum cross-entropy) criterion is the degree of causal confirmation, denoted by Cc1 = Cc(x1/x0 => y1) = b1*.
The logical probability of s1 is (see Equation (7)):
$$T(s_1) = P(x_1) + P(x_0)\, b_1'. \tag{21}$$
The predicted probability of x1 by y1 and s1 is:
$$P(x_1|\theta_1) = \frac{T(s_1|x_1)\, P(x_1)}{T(s_1)} = \frac{P(x_1)}{P(x_1) + P(x_0)\, b_1'}, \tag{22}$$
where θj can be regarded as the parameter of truth function T(sj|x).
The average semantic information conveyed by y1 and s1 about x is:
$$I(X; \theta_1) = \sum_i P(x_i|y_1) \log\frac{P(x_i|\theta_1)}{P(x_i)} = -\sum_i P(x_i|y_1) \log P(x_i) - H(X|\theta_1), \tag{23}$$
where H(X|θ1) is a cross-entropy. We suppose that the sampling distribution P(x, y) has been modified so that P(y|x) = P(y|do(x)). According to the property of cross-entropy, H(X|θ1) reaches its minimum, and hence I(X; θ1) reaches its maximum, as P(x|θ1) = P(x|y1), i.e.,
$$P(x_0|\theta_1) = \frac{P(x_0)\, b_1'}{P(x_1) + P(x_0)\, b_1'} = P(x_0|y_1), \quad P(x_1|\theta_1) = \frac{P(x_1)}{P(x_1) + P(x_0)\, b_1'} = P(x_1|y_1). \tag{24}$$
From the above two equations, we obtain:
$$\frac{P(x_0)}{P(x_1)}\, b_1' = \frac{P(x_0|y_1)}{P(x_1|y_1)}. \tag{25}$$
Let:
$$m(x_i, y_j) = \frac{P(y_j|x_i)}{P(y_j)} = \frac{P(x_i, y_j)}{P(x_i)\, P(y_j)}, \quad i = 0, 1;\; j = 0, 1, \tag{26}$$
which represents the degree of correlation between xi and yj and, unlike P(xi, yj), may be independent of P(x) and P(y). From Equations (25) and (26), we obtain the optimized degree of disbelief, i.e., the degree of disconfirmation:
$$b_1'^* = m(x_0, y_1)/m(x_1, y_1). \tag{27}$$
Then we have the degree of confirmation of s1:
$$b_1^* = 1 - b_1'^* = 1 - \frac{m(x_0, y_1)}{m(x_1, y_1)} = \frac{m(x_1, y_1) - m(x_0, y_1)}{m(x_1, y_1)}. \tag{28}$$
In the above formulas, we assume b1* > 0 and hence m(x1, y1) ≥ m(x0, y1). If m(x1, y1) < m(x0, y1), b1* should be negative, and b1′* should be m(x1, y1)/m(x0, y1). Then we have:
$$b_1^* = -(1 - b_1'^*) = -\left(1 - \frac{m(x_1, y_1)}{m(x_0, y_1)}\right) = \frac{m(x_1, y_1) - m(x_0, y_1)}{m(x_0, y_1)}. \tag{29}$$
Combining the above two equations, we derive the confirmation measure:
$$C_c(x_1 \Rightarrow y_1) = b_1^* = \frac{m(x_1, y_1) - m(x_0, y_1)}{\max(m(x_1, y_1),\, m(x_0, y_1))}. \tag{30}$$
Since P(yj|xi) = m(xi, yj)P(yj), we also have:
$$b_1'^* = P(y_1|x_0)/P(y_1|x_1), \tag{31}$$
$$C_c(x_1 \Rightarrow y_1) = b_1^* = \frac{P(y_1|x_1) - P(y_1|x_0)}{\max(P(y_1|x_1),\, P(y_1|x_0))} = \frac{R - 1}{\max(R, 1)}, \tag{32}$$
where R = P(y1|x1) / P(y1|x0) is the relative risk or the likelihood ratio used for Pd.
Measure Cc has the normalizing property, since its maximum 1 is reached as m(x0, y1) = 0 and its minimum −1 as m(x1, y1) = 0. It has cause symmetry since:
$$C_c(x_0/x_1 \Rightarrow y_1) = \frac{m(x_0, y_1) - m(x_1, y_1)}{\max(m(x_0, y_1),\, m(x_1, y_1))} = -\frac{m(x_1, y_1) - m(x_0, y_1)}{\max(m(x_1, y_1),\, m(x_0, y_1))} = -C_c(x_1/x_0 \Rightarrow y_1). \tag{33}$$
Similarly, letting the probability distribution P(y|x1) be the linear combination of a uniform probability distribution and a 0–1 distribution, we can obtain another causal confirmation measure:
$$C_e(x_1 \Rightarrow y_1) = \frac{P(y_1|x_1) - P(y_0|x_1)}{\max(P(y_1|x_1),\, P(y_0|x_1))} = \frac{2P(y_1|x_1) - 1}{\max(P(y_1|x_1),\, 1 - P(y_1|x_1))}. \tag{34}$$
This measure can be regarded as the direct extension of the Bayesian confirmation measure c*(e1→h1) [8]. It increases monotonically with the Bayesian confirmation measure f(h1, e1) = P(h1|e1), which is used by Fitelson et al. [5,32]. However, Ce has the normalizing property and the outcome symmetry:
$$C_e(x_1 \Rightarrow y_1) = -C_e(x_1 \Rightarrow y_0). \tag{35}$$
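To summarize the derivation, here is a minimal Python sketch (mine) of Cc and Ce, assuming the conditional probabilities have already been adjusted so that P(y1|x) = P(y1|do(x)):

```python
# The two causal confirmation measures derived above.

def Cc(p_y1_do_x1: float, p_y1_do_x0: float) -> float:
    """Causal confirmation Cc(x1 => y1) = (R - 1)/max(R, 1), Eq. (32)."""
    r = p_y1_do_x1 / p_y1_do_x0
    return (r - 1.0) / max(r, 1.0)

def Ce(p_y1_do_x1: float) -> float:
    """Outcome-inevitability measure Ce(x1 => y1), Eq. (34)."""
    p = p_y1_do_x1
    return (2.0 * p - 1.0) / max(p, 1.0 - p)

print(Cc(0.8, 0.2))    #  0.75: x1 strongly raises y1
print(Cc(0.2, 0.8))    # -0.75: cause symmetry in action
print(Ce(0.8))         #  0.75: y1 is fairly inevitable given x1
```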

3.4. Causal Confirmation Measures Cc and Ce for Probability Predictions

From y1, b1*, and P(x), we can make the probability prediction about x:
$$P(x_1|\theta_1) = \frac{P(x_1)}{P(x_1) + b_1'^*\, P(x_0)}, \quad P(x_0|\theta_1) = \frac{P(x_0)\, b_1'^*}{P(x_1) + b_1'^*\, P(x_0)}, \tag{36}$$
where b1* > 0; θ1 represents y1 with b1′*, and θ0 means y0 with b0′*. If b1* < 0, we let T(s1|x1) = b1′ and T(s1|x0) = 1 and then use the above formula.
Following the probability prediction with the Bayesian confirmation measure c* [8], we can also make probability predictions for given x1 and Ce1. For example, when Ce1 is greater than 0, we have:
$$P(y_1|\theta_{x_1}) = 1/(2 - C_{e1}), \tag{37}$$
where θx1 denotes x1 and Ce1.
Given the semantic channel ascertained by b1 > 0 and b0 > 0, as shown in Table 2, we can obtain the corresponding Shannon channel P(y|x). According to Equation (32), we can deduce:
$$P(y_1|x_1) = \frac{1 - b_0'}{1 - b_1' b_0'}, \quad P(y_0|x_0) = \frac{1 - b_1'}{1 - b_1' b_0'}, \quad P(y_0|x_1) = 1 - P(y_1|x_1), \quad P(y_1|x_0) = 1 - P(y_0|x_0). \tag{38}$$
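A small numeric check (mine; the two degrees of disconfirmation are toy values) of Equation (38):

```python
# Recover the Shannon channel P(y|x) from the two degrees of
# disconfirmation b1' and b0' per Eq. (38).

def channel_from_disbeliefs(b1p: float, b0p: float):
    """Return P(y1|x1) and P(y1|x0) for b1', b0' in (0, 1)."""
    denom = 1.0 - b1p * b0p
    p_y1_x1 = (1.0 - b0p) / denom
    p_y0_x0 = (1.0 - b1p) / denom
    return p_y1_x1, 1.0 - p_y0_x0        # P(y1|x1), P(y1|x0)

p_y1_x1, p_y1_x0 = channel_from_disbeliefs(b1p=0.25, b0p=0.5)
print(p_y1_x1, p_y1_x0)                  # 0.571..., 0.142...
# Consistency check: the recovered channel reproduces b1' = P(y1|x0)/P(y1|x1).
print(p_y1_x0 / p_y1_x1)                 # 0.25 = b1'
```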

4. Results

4.1. A Real Example of Kidney Stone Treatments

Table 3 shows Example 2 with detailed data about kidney stone treatments [15]; the data were initially provided in [16]. In Table 3, *% means a success rate, and the number behind it is the number of patients. The stone size is a confounder. The conclusion from every group (with small or large stones) is that treatment x2 (i.e., treatment A in [15]) is better than treatment x1 (i.e., treatment B in [15]), whereas according to the average success rates, P(y1|x2) = 0.78 and P(y1|x1) = 0.83, treatment x1 is better than treatment x2. There seems to be a paradox.
We used P(g) instead of P(g|x1) or P(g|x2) as the weighting coefficient to obtain P(y1|do(x1)) and P(y1|do(x2)). After replacing P(y1|x1) with P(y1|do(x1)) and P(y1|x2) with P(y1|do(x2)), we derived Cc1 = Cc(x2/x1 => y1) = 0.06 (see Table 3), so the overall conclusion becomes that treatment x2 is better than treatment x1.
For Cc1 in Table 3, we used treatment x1 as the default; the degree of causal confirmation is Cc1 = Cc(x2/x1 => y1) = 0.06. If we used x2 as the default, we would have Cc(x1/x2 => y1) = −0.06. Using measure Cc, we need not worry about which of P(y1|do(x1)) and P(y1|do(x2)) is larger, whereas using Pd, we have to consider that before calculating Pd.
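The sketch below reproduces this calculation; the patient counts are those of the classic kidney stone data of [16], which yield the success rates shown in Table 3:

```python
# Reproduce Cc(x2/x1 => y1) = 0.06 for the kidney stone example.
# x1 = treatment B, x2 = treatment A; g1 = small stones, g2 = large stones.

counts = {  # (treatment, group): (successes, patients)
    ("x1", "g1"): (234, 270),   # 87%
    ("x1", "g2"): ( 55,  80),   # 69%
    ("x2", "g1"): ( 81,  87),   # 93%
    ("x2", "g2"): (192, 263),   # 73%
}
n = sum(c[1] for c in counts.values())                       # 700 patients
p_g = {g: sum(counts[x, g][1] for x in ("x1", "x2")) / n for g in ("g1", "g2")}

def p_y1_do(x):  # adjustment: weight group rates by P(g), not P(g|x)
    return sum(p_g[g] * counts[x, g][0] / counts[x, g][1] for g in ("g1", "g2"))

r = p_y1_do("x2") / p_y1_do("x1")
print(round((r - 1) / max(r, 1), 2))                         # 0.06
```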
We used the incremental school’s confirmation measure D(x1, y1) to compare x1 and x2. We obtained:
P(y1) = P(x1)P(y1|x1) + P(x2)P(y1|x2) = 0.805,
P(y1|x2, g1) − P(y1) = 0.93 − 0.805 > 0,
P(y1|x2, g2) − P(y1) = 0.73 − 0.805 < 0, and
D(x1, y1) = P(y1|x1) − P(y1) = 0.83 − 0.805 > 0,
D(x2, y1) = P(y1|x2) − P(y1) = 0.78 − 0.805 < 0.
The results mean that x1 is better than x2. There seems to be no paradox, but only because the paradox is avoided, rather than eliminated, when we use D(x1, y1).
We tested Equation (38) with the aforementioned example. The Shannon channel P(y|x) derived from the two degrees of disconfirmation, b1′* and b0′*, is the same as P(y|do(x)) shown in the last two rows of Table 3.

4.2. An Example of Eliminating Simpson’s Paradox with COVID-19

Table 4 shows Example 4 with detailed data about the CFRs of COVID-19. The original data were obtained from the website of the Centers for Disease Control and Prevention (CDC) in the United States, up to 2 July 2022 [33]. The data only include reported cases; otherwise, the CFRs would be lower. Here x1 represents non-Hispanic whites, and x2 represents the other races. P(y1|x1, g) and P(y1|x2, g) are the CFRs of x1 and x2 in age group g. See Appendix A for the original data and intermediate results.
Table 5 shows that the overall (average) CFRs vary before and after we change the weighting coefficient from P(g|x) to P(g).
From Table 4, we can see that in every age group, the CFR of non-Hispanic whites is lower than or close to that of the other races. However, over all age groups (see Table 5), the overall (average) CFR of non-Hispanic whites (1.04) is higher than that of the other races (0.73). After replacing P(g|x) with P(g), the overall CFR of non-Hispanic whites (0.80) becomes lower than that of the other races (1.05).
Following Fitelson, we used D(x1, y1) to assess the risk. The average CFR is 0.97 (found on the same website [33]). We obtained:
D(non-Hispanic whites, death) = P(y1|x1) − P(y1) = 1.04 − 0.97 = 0.07,
D(other people, death) = P(y1|x2) − P(y1) = 0.73 − 0.97 = −0.14,
which means that non-Hispanic whites are at higher risk.

4.3. COVID-19: Vaccine’s Negative Influences on the CFR and Mortality

The causal probability measure Pd is inconvenient for measuring the "probability" of "vaccine => infection" or "vaccine => death": Pd is regarded as a probability, whose minimum value is 0, while the vaccine's influence is negative. However, there is no problem using Cc because Cc can be negative.
Table 6 shows data obtained from the website of the US CDC [34] and the two degrees of causal confirmation. The numbers of cases and deaths are per 100,000 people (ages 5 and over) in one week (20 to 26 June 2022).
The negative degree of causal confirmation −0.63 means that the vaccine reduced the infection rate by 63%. The −0.79 means that the vaccine reduced the CFR by 79%.
To know the impact of COVID-19 on population mortality, we need to compare the regular mortality rate due to common causes (x0) with the new mortality rate due to common causes plus COVID-19 (x1) during the same period (such as one year). Since the average lifespan in the United States is 79 years, the annual mortality rate is about 1/79 = 0.013. From Table 6, we can derive that the yearly mortality rate caused by COVID-19 is 0.001 (for unvaccinated people) or 0.00018 (for vaccinated people).
People who die due to COVID-19 might also have died in the same year from common causes. Therefore, the new mortality rate should be less than the sum of the two mortality rates. Assume that the two mortality risks are independent of each other. Then the new mortality rate P(y1|x1) should be 0.013 + 0.001 − 0.013 × 0.001 ≈ 0.014 (for unvaccinated people) or 0.013 + 0.00018 − 0.013 × 0.00018 ≈ 0.01318 (for vaccinated people). Table 7 shows the degree of causal confirmation of COVID-19 leading to mortality, for which we assume P(y1|x) = P(y1|do(x)).
In the last line, Cc1 = 0.07 means that among unvaccinated people who die, 7% die due to COVID-19. Moreover, Cc1 = 0.014 means that among vaccinated people who die, 1.4% die due to COVID-19.
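The following sketch (mine) reproduces the Table 7 calculation from the rates estimated above:

```python
# Degree of causal confirmation of COVID-19 leading to mortality.
# x1 = common causes plus COVID-19, x0 = common causes only;
# we assume P(y1|x) = P(y1|do(x)), as in Table 7.

base = 0.013                                        # annual mortality, common causes
covid = {"unvaccinated": 0.001, "vaccinated": 0.00018}   # annual COVID-19 mortality

for group, extra in covid.items():
    p_y1_x1 = base + extra - base * extra           # independence assumption
    r = p_y1_x1 / base                              # risk ratio R
    cc = (r - 1) / max(r, 1)                        # Cc(x1/x0 => y1)
    print(group, round(cc, 3))                      # ~0.07 and ~0.014, up to rounding
```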
If we used x1 = COVID-19 alone instead of x1 = x0 + COVID-19, we would reach the strange conclusion that COVID-19 could reduce deaths.
We obtained the above results without considering the vaccine’s side effects, possibly resulting in chronic deaths.

5. Discussion

5.1. Why Can Pd and Cc Better Indicate the Strength of Causation Than D in Theory?

We call m(xi, yj) (i = 0, 1; j = 0, 1) the probability correlation matrix, which is not symmetrical. Although, from the perspective of calculation, P(x, y) exists first and m(x, y) is derived from it, from the perspective of existence, m(x, y) exists first and then P(x, y). That is, given P(x), m(x, y) only allows specific P(y) to happen.
We can also make probability predictions with m(x, y) (like using Bayes’ formula):
$$P(y|x_1) = P(y)\, m(x_1, y)/m(x_1), \quad m(x_1) = \sum_y P(y)\, m(x_1, y);$$
$$P(x|y_1) = P(x)\, m(x, y_1)/m(y_1), \quad m(y_1) = \sum_x P(x)\, m(x, y_1). \tag{39}$$
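A small numeric check (mine; the joint distribution is a toy assumption) that predictions via Equation (39) agree with Bayes' formula:

```python
import numpy as np

# Verify that predicting with the probability correlation matrix m(x, y)
# per Eq. (39) reproduces the ordinary conditional probabilities.

P_xy = np.array([[0.30, 0.10],      # P(x0, y0), P(x0, y1)
                 [0.15, 0.45]])     # P(x1, y0), P(x1, y1)
P_x = P_xy.sum(axis=1)
P_y = P_xy.sum(axis=0)
m = P_xy / np.outer(P_x, P_y)       # m(x, y) = P(x, y)/(P(x)P(y))

m_x1 = np.sum(P_y * m[1])           # m(x1) = sum_y P(y) m(x1, y)
print(P_y * m[1] / m_x1)            # P(y|x1) by Eq. (39): [0.25, 0.75]
print(P_xy[1] / P_x[1])             # P(y|x1) by Bayes: identical
```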
From Equations (27)–(30), we can see that Pd and Cc depend only on m(x, y) and are independent of P(x) and P(y). The two degrees of disconfirmation, b1′* and b0′*, ascertain a semantic channel and a Shannon channel. Therefore, the two degrees of causal confirmation, Cc1 = b1* and Cc0 = b0*, indicate the strength of the constraint relationship (causality) from x to y. Like Cc, measure Pd is also related only to m(x, y). D and Δ*Pxy are different; they are related to P(x), so they do not indicate the strength of causation well.
For example, considering the vaccine's effect on the CFR of COVID-19 (see Table 7), Pd and Cc are unrelated to the vaccination coverage rate P(x1), whereas measure Δ*Pxy is related to P(x1). Measure D is associated with P(y) and is also related to P(x1). Pd and Cc1 obtained from one region also fit other areas for the same variant of COVID-19. In contrast, Δ*Pxy and D are not universal, because the vaccination coverage rate P(x1) differs between areas.
According to the incremental school's view of Bayesian confirmation, P(y1) is a prior probability, and P(y1|x) − P(y1) is its increment. However, when measure D is used for causal confirmation, P(y1) is obtained from P(x) and P(y1|x) after the treatment, so P(y) is no longer a prior probability; this is a fatal problem with the incremental school.
In addition, as the result of induction, Cc and Pd can indicate the degree of belief of a fuzzy major premise and can be used for probability predictions, whereas D and Δ*Pxy cannot.

5.2. Why Are Pd and Cc Better than D in Practice?

Two calculation examples in Section 4.1 and Section 4.2 support the conclusion that measures Pd and Cc are better than D in practice. The reasons are as follows.

5.2.1. Pd and Cc Have Precise Meanings in Comparison with D

Cc1 = Cc(x1/x0 => y1) indicates what percentage of the outcome y is due to x1 instead of x0. For example, Table 7 shows that, according to the virulence of the virus, COVID-19 will increase the mortality rate of vaccinated people from 1.3% to 1.318%. Therefore, the degree of causal confirmation is Cc1 = Pd = 0.014, which means that 1.4% of the deaths will be due to COVID-19. However, the meanings of D and Δ*Pxy are not so precise.
Unlike measure RD (see Equation (1)), Pd and Cc indicate the relative risk or the relative change of the outcome. Many people think COVID-19 is very dangerous because it can kill millions in a country. However, the mortality rate it brings is much lower than that caused by common causes. Pd and Cc can reveal the relative change in the mortality rate (see Table 7). Although it is essential to reduce or delay deaths, it is also vital to decrease the economic losses due to the fierce fight against the pandemic. Therefore, Pd and Cc can help decision-makers balance reducing or delaying deaths against reducing financial losses.

5.2.2. The Confounder’s Effect Is Removed from Pd and Cc

When there is a confounder, as shown in Section 4.1, using Pd or Cc, we can eliminate Simpson's Paradox and make the overall conclusion consistent with the grouping conclusion: treatment x2 is better than treatment x1. If we instead use D to compare the success rates of the two treatments, although we can avoid Simpson's Paradox, the conclusion is unreasonable, because we neglect the difficulties of treating different sizes of kidney stones. If a hospital only accepts patients who are easy to treat, its overall success rate must be high; however, such a hospital may not be a good one.

5.2.3. Pd and Cc Allow Us to View the Third Factor, u, from Different Perspectives

For the example in Section 4.2, if we think that one's longevity is related to one's race, we can take the lifespan as a mediator and then directly accept the overall conclusion (non-Hispanic whites have a higher CFR than other people). On the other hand, if we believe that one's longevity is not due to one's race, then the lifespan is a confounder. In that case, we can make the overall conclusion consistent with the grouping conclusion and then use Pd and Cc.
It is worth noting that the conclusion that the CFR of non-Hispanic whites is lower than that of other people probably arises because medical conditions affect the CFRs. However, the existing data contain no information about the medical conditions of different races. Otherwise, the CFRs of different races might be similar if we used the medical condition as a confounder. This issue is worth researching further.

5.3. Why Is It Better to Replace Pd with Cc?

Section 4.3 provides the calculation of the two negative degrees of causal confirmation that reflect the impacts of the vaccine on infection and mortality. The negative degrees of confirmation mean that the vaccine can reduce illnesses and deaths. However, if we use Pd as the probability of causation, Pd can only take its lower limit 0. Although we can replace Pd(vaccinated => death) with Pd(unvaccinated => death) to ensure Pd > 0, taking being vaccinated as the default cause does not conform to our thinking habits. In addition, Cc has cause symmetry, whereas Pd does not.
When we used Pd to compare two causes x1 and x2, such as the two treatments for kidney stones (see Section 4.1), we had to consider which of P(y1|x2) and P(y1|x1) was larger. However, using Cc, we did not need to consider that, because there is no need to worry about whether (R − 1)/R < 0.
The correlation coefficient in mathematics lies between −1 and 1. Cc can be understood as a probability correlation coefficient. The difference is that the former has only one coefficient between x and y, whereas the latter has two coefficients: Cc1 = Cc(x1 => y1) and Cc0 = Cc(x0 => y0).

5.4. Necessity and Sufficiency in Causality

Measures Pd and Cc only indicate the necessity of cause x for outcome y; they do not reflect the sufficiency of x or the inevitability of y. On the other hand, measures f = P(y|x) and Ce can indicate the outcome's inevitability.
The medical industry uses the odds ratio to indicate both the necessity and the sufficiency of the cause for the outcome. The odds ratio [2] is:
$$\mathrm{OR} = \frac{P(y_1|x_1)}{P(y_1|x_0)} \times \frac{P(y_0|x_0)}{P(y_0|x_1)}. \tag{40}$$
It is the product of two likelihood ratios. We can use:
$$\mathrm{OR_N} = \frac{\mathrm{OR} - 1}{\max(\mathrm{OR}, 1)} \tag{41}$$
as the confirmation measure of both x0 => y0 and x1 => y1 for the same purpose. Unlike OR, ORN has the normalizing property and symmetry.
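A minimal sketch (mine, with toy probabilities) of Equations (40) and (41):

```python
# The odds ratio and its normalized form ORN.

def odds_ratio(p_y1_x1: float, p_y1_x0: float) -> float:
    return (p_y1_x1 / p_y1_x0) * ((1 - p_y1_x0) / (1 - p_y1_x1))

def or_normalized(orr: float) -> float:
    return (orr - 1) / max(orr, 1)

orr = odds_ratio(0.8, 0.2)       # OR = 16
print(or_normalized(orr))        # 0.9375, between -1 and 1
```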

5.5. The Relationship between Bayesian Confirmation Measures b* and c*, and Causal Confirmation Measures Cc and Ce

Suppose that P(y1|x) has been modified so that P(y1|x) = P(y1|do(x)). The causal confirmation measure Cc is then equal in value to the channels' confirmation measure b* [8], i.e.,
$$C_c(x_1 \Rightarrow y_1) = \frac{P(y_1|x_1) - P(y_1|x_0)}{\max(P(y_1|x_1),\, P(y_1|x_0))} = b^*(y_1 \to x_1). \tag{42}$$
However, their antecedents and consequents are inverted, which means that if x1 is the cause of y1, then y1 is evidence for x1. For example, if COVID-19 infection is the cause of a positive test, then a positive test is evidence of the infection.
The causal confirmation measure Ce, indicating the inevitability of the outcome, is equal in value to the predictions' confirmation measure c*(x1→y1), i.e.,
$$C_e(x_1 \Rightarrow y_1) = \frac{P(y_1|x_1) - P(y_0|x_1)}{\max(P(y_1|x_1),\, P(y_0|x_1))} = c^*(x_1 \to y_1). \tag{43}$$
Their antecedents and consequents are the same.
However, from the values of the right sides of the above two equations, we may not be able to obtain the values of the left sides, because an associated relationship may not be a causal relationship.

6. Conclusions

Fitelson, a representative of the incremental school of Bayesian confirmation, used D(x1, y1) = P(y1|x1) − P(y1) to denote the supporting strength of the evidence for the consequent and extended this measure to causal confirmation without considering the confounder. This paper has shown that measure D is incompatible with the ECIT and with popular risk measures, such as Pd = max(0, (R − 1)/R). Using D, one can only avoid Simpson's Paradox; one cannot eliminate it or provide a reasonable explanation, as the ECIT does.
On the other hand, Rubin et al. used Pd as the probability of causation. Pd is better than D, but it is improper to call Pd a probability and to use a probability measure to measure causation. If we use Pd as a causal confirmation measure, it lacks the normalizing property and the symmetry that an ideal confirmation measure should have.
This paper has deduced the causal confirmation measure Cc(x1 => y1) = (R − 1)/max(R, 1) by the semantic information method with the minimum cross-entropy criterion. Cc is similar to the inductive school's confirmation measure b* proposed by the author earlier. However, the positive examples' proportion P(y1|x1) and the counterexamples' proportion P(y1|x0) are replaced with P(y1|do(x1)) and P(y1|do(x0)), so that Cc is an improved Pd. Compared with Pd, Cc has the normalizing property (it changes between −1 and 1) and cause symmetry (Cc(x0/x1 => y1) = −Cc(x1/x0 => y1)). Since Cc may be negative, it is also suitable for evaluating the inhibiting relationship between cause and outcome, such as between the vaccine and the infection.
This paper has provided some examples with Simpson’s Paradox for calculating the degrees of causal confirmation. The calculation results show that Pd and Cc are more reasonable and meaningful than D, and Cc is better than Pd mainly because Cc may be less than zero. In addition, this paper has also provided a causal confirmation measure Ce(x1 => y1) that indicates the inevitability of the outcome y1.
Since measure Cc and the ECIT support each other, the inductive school of Bayesian confirmation is also supported by the ECIT and epidemiological risk theory.
However, like all Bayesian confirmation measures, causal confirmation measures Cc and Ce are computed from size-limited samples; hence, the degrees of causal confirmation are not strictly reliable. Therefore, it is necessary to replace a degree of causal confirmation with a degree interval to retain the inevitable uncertainty. This work needs further study in combination with existing theories.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The author thanks Zhilin Zhang of Fudan University and Jianyong Zhou of Changsha University; this study benefited from communication with them. The author also appreciates the two reviewers' comments, which have greatly helped improve this paper.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Data and Calculations for Comparing the CFRs of Non-Hispanic Whites and Other People in the USA

The original data were obtained from the USA CDC (Centers for Disease Control and Prevention) website [33]. The Excel file with the original data and computing results can be downloaded from http://survivor99.com/lcg/cm/CFR.zip (accessed on 8 December 2022).

References

  1. Rubin, D. Causal inference using potential outcomes. J. Amer. Statist. Assoc. 2005, 100, 322–331.
  2. Hernán, M.A.; Robins, J.M. Causal Inference: What If; Chapman & Hall/CRC: Boca Raton, FL, USA, 2020.
  3. Pearl, J. Causal inference in statistics: An overview. Stat. Surv. 2009, 3, 96–146.
  4. Geffner, H.; Dechter, R.; Halpern, J.Y. (Eds.) Probabilistic and Causal Inference: The Works of Judea Pearl; Association for Computing Machinery: New York, NY, USA, 2021.
  5. Fitelson, B. Confirmation, Causation, and Simpson’s Paradox. Episteme 2017, 14, 297–309.
  6. Carnap, R. Logical Foundations of Probability, 2nd ed.; University of Chicago Press: Chicago, IL, USA, 1962.
  7. Kemeny, J.; Oppenheim, P. Degrees of factual support. Philos. Sci. 1952, 19, 307–324.
  8. Lu, C. Channels’ Confirmation and Predictions’ Confirmation: From the Medical Test to the Raven Paradox. Entropy 2020, 22, 384.
  9. Greco, S.; Slowiński, R.; Szczęch, I. Properties of rule interestingness measures and alternative approaches to normalization of measures. Inf. Sci. 2012, 216, 1–16.
  10. Crupi, V.; Tentori, K.; Gonzalez, M. On Bayesian measures of evidential support: Theoretical and empirical issues. Philos. Sci. 2007, 74, 229–252.
  11. Eells, E.; Fitelson, B. Symmetries and asymmetries in evidential support. Philos. Stud. 2002, 107, 129–142.
  12. Relative Risk. Wikipedia, the Free Encyclopedia. Available online: https://en.wikipedia.org/wiki/Relative_risk (accessed on 15 August 2022).
  13. Robins, J.; Greenland, S. The probability of causation under a stochastic model for individual risk. Biometrics 1989, 45, 1125–1138.
  14. Simpson, E.H. The interpretation of interaction in contingency tables. J. R. Stat. Soc. Ser. B 1951, 13, 238–241.
  15. Simpson’s Paradox. Wikipedia, the Free Encyclopedia. Available online: https://en.wikipedia.org/wiki/Simpson%27s_paradox (accessed on 20 August 2022).
  16. Charig, C.R.; Webb, D.R.; Payne, S.R.; Wickham, J.E. Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy. Br. Med. J. (Clin. Res. Ed.) 1986, 292, 879–882.
  17. Julious, S.A.; Mullee, M.A. Confounding and Simpson’s paradox. BMJ 1994, 309, 1480–1481.
  18. Pedagogy, W. Simpson’s Paradox. Available online: https://weapedagogy.wordpress.com/2020/01/15/5-simpsons-paradox/ (accessed on 21 August 2022).
  19. Mackenzie, D. Race, COVID Mortality, and Simpson’s Paradox. Available online: http://causality.cs.ucla.edu/blog/index.php/category/simpsons-paradox/ (accessed on 22 August 2022).
  20. Kügelgen, J.V.; Gresele, L.; Schölkopf, B. Simpson’s Paradox in COVID-19 case fatality rates: A mediation analysis of age-related causal effects. IEEE Trans. Artif. Intell. 2021, 2, 18–27.
  21. Mortimer, H. The Logic of Induction; Prentice Hall: Paramus, NJ, USA, 1988.
  22. Horwich, P. Probability and Evidence; Cambridge University Press: Cambridge, UK, 1982.
  23. Christensen, D. Measuring confirmation. J. Philos. 1999, 96, 437–461.
  24. Nozick, R. Philosophical Explanations; Clarendon: Oxford, UK, 1981.
  25. Good, I.J. The best explicatum for weight of evidence. J. Stat. Comput. Simul. 1984, 19, 294–299.
  26. Lu, C. A generalization of Shannon’s information theory. Int. J. Gen. Syst. 1999, 28, 453–490.
  27. Lu, C. Semantic Information G Theory and Logical Bayesian Inference for Machine Learning. Information 2019, 10, 261.
  28. Lu, C. The P–T Probability Framework for Semantic Communication, Falsification, Confirmation, and Bayesian Reasoning. Philosophies 2020, 5, 25.
  29. Zadeh, L.A. Fuzzy Sets. Inf. Control 1965, 8, 338–353.
  30. Zadeh, L.A. Probability measures of fuzzy events. J. Math. Anal. Appl. 1968, 23, 421–427.
  31. Rooij, R.V.; Schulz, K. Conditionals, causality and conditional probability. J. Log. Lang. Inf. 2019, 28, 55–71.
  32. Over, D.E.; Hadjichristidis, C.; Evans, J.St.B.T.; Handley, S.J.; Sloman, S.A. The probability of causal conditionals. Cogn. Psychol. 2007, 54, 62–97.
  33. Demographic Trends of COVID-19 Cases and Deaths in the US Reported to CDC. The Website of the US CDC. Available online: https://covid.cdc.gov/covid-data-tracker/#demographics (accessed on 10 September 2022).
  34. Rates of COVID-19 Cases and Deaths by Vaccination Status. The Website of the US CDC. Available online: https://covid.cdc.gov/covid-data-tracker/#rates-by-vaccine-status (accessed on 8 September 2022).
Figure 1. Illustrating Simpson’s Paradox. In each group, the success rate of x2, P(y1|x2, g), is higher than that of x1, P(y1|x1, g); however, using the method of finding the center of gravity, we can see that the overall success rate of x2, P(y1|x2) = 0.65, is lower than that of x1, P(y1|x1) = 0.7.
Figure 2. Three causal graphs: (a) for Example 2; (b) for Example 3; (c) for Examples 4 and 1.
Figure 3. Eliminating Simpson’s Paradox when the confounder exists by modifying the weighting coefficients. After replacing P(gk|xi) with P(gk) (k = 1, 2; i = 1, 2), the overall conclusion is consistent with the grouping conclusion; the average success rate of x2, P(y1|do(x2)) = 0.7, is higher than that of x1, P(y1|do(x1)) = 0.65.
Figure 4. The truth function of s1 includes a believable part with proportion b1 and a disbelievable part with proportion b1′.
Table 1. Comparing Pd and Δ*Pxy.

                      P(y1|x1)   P(y1|x0)   Pd     Δ*Pxy   Comparison
No big difference     0.9        0.8        0.11   0.5     Pd << Δ*Pxy
No counterexample     0.2        0          1      0.2     Pd >> Δ*Pxy
Table 2. The truth values of s0 = “x0 => y0” and s1 = “x1 => y1”.

                      T(s|x0)   T(s|x1)
s0 = “x0 => y0”       1         b0
s1 = “x1 => y1”       b1        1
Table 3. Comparing two treatments’ success rates (y1 means the success).

                      Treat. x1   Treat. x2   Number   P(g) or Cc
Small stones (g1)     87%/270     93%/87 *    357      0.51
Large stones (g2)     69%/80      73%/263     343      0.49
Overall               83%/350     78%/350     700
P(y1|x)               0.83        0.78                 [P(y1|x2) − P(y1|x1)]/P(y1|x2) = −0.064
P(y1|do(x))           0.78        0.83                 Cc1 = Cc(x2/x1 => y1) = 0.06
P(y0|do(x))           0.22        0.17                 Cc0 = Cc(x1/x2 => y0) = 0.23
* “87%/270” means that the success rate is 87%, and the number in this subgroup is 270.
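The do-probabilities and Cc values in Table 3 can be reproduced with the back-door adjustment P(y1|do(x)) = Σg P(g)P(y1|x, g); the following sketch (not the author's code) uses only the subgroup success rates and group sizes listed in the table:

```python
# Reproducing Table 3's adjusted success rates and Cc values
# from the kidney stone data.

rates = {  # P(y1|x, g), from Table 3
    ("x1", "g1"): 0.87, ("x1", "g2"): 0.69,
    ("x2", "g1"): 0.93, ("x2", "g2"): 0.73,
}
P_g = {"g1": 357 / 700, "g2": 343 / 700}   # P(g): confounder's distribution

def p_do(x):
    """Back-door adjusted success rate P(y1|do(x))."""
    return sum(P_g[g] * rates[(x, g)] for g in P_g)

def Cc(p1, p0):
    """Cc = (R - 1)/max(R, 1) with R = p1/p0."""
    R = p1 / p0
    return (R - 1.0) / max(R, 1.0)

p1, p2 = p_do("x1"), p_do("x2")
print(round(p1, 2), round(p2, 2))          # 0.78 0.83 -- paradox eliminated
print(round(Cc(p2, p1), 2))                # Cc(x2/x1 => y1) = 0.06
print(round(Cc(1 - p1, 1 - p2), 2))        # Cc(x1/x2 => y0) = 0.23
```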
Table 4. The CFRs of COVID-19 of non-Hispanic whites (x1) and other people (x2) from different age groups.

Age Group (g)   P(x1|g) (%)   P(g)    P(y1|x1, g)   P(g|x1)   P(y1|x2, g)   P(g|x2)
0–4 Years       44.2          0.041   0.0002        0.0349    0.0002        0.0480
5–11 Years      44.2          0.078   0.0001        0.0659    0.0001        0.0907
12–15 Years     46.3          0.052   0.0001        0.0458    0.0001        0.0578
16–17 Years     48.7          0.029   0.0001        0.0268    0.0002        0.0307
18–29 Years     48.7          0.223   0.0004        0.2081    0.0006        0.2388
30–39 Years     49.3          0.178   0.0011        0.1681    0.0019        0.1883
40–49 Years     51.0          0.146   0.0030        0.1427    0.0048        0.1493
50–64 Years     59.1          0.163   0.0102        0.1843    0.0144        0.1389
65–74 Years     67.3          0.055   0.0333        0.0704    0.0457        0.0373
75–84 Years     72.9          0.025   0.0762        0.0356    0.0938        0.0144
85+ Years       76.3          0.012   0.1606        0.0173    0.1751        0.0059
Sum                           1                     1                       1
Table 5. Comparing the CFRs (%) of non-Hispanic whites and other people.

                  CFR of Non-Hispanic Whites (x1)   CFR of Other People (x2)   Risk Measure *
P(y1|x)           1.04                              0.73                       Pd = (R − 1)/R = 0.30
P(y1|do(x))       0.80                              1.05                       Pd = 0; Cc(x1/x2 => y1) = −0.28
* R = P(y1|x1)/P(y1|x2).
Table 6. The negative degrees of causal confirmation for assessing how the vaccine affects infections and deaths.

                  Unvaccinated (x0)   Vaccinated (x1)   Cc
Cases             512.6               189.5             Cc(x1/x0 => y1) = −0.63
Deaths            1.89                0.34              Cc(x1/x0 => y1) = −0.79
Mortality rate    0.001               0.00018
Table 7. Using Cc to measure the impact of COVID-19 on the mortality rates.

Mortality Rate P(y1|x)             Unvaccinated   Vaccinated
x0: common reasons      P(y1|x0)   0.013          0.013
x1: x0 plus COVID-19    P(y1|x1)   0.014          0.01318
Cc1 = Cc(x1/x0 => y1)              0.07           0.014
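The two Cc values in Table 7 follow directly from the listed mortality rates; for instance (a sketch, not the author's code):

```python
# Checking Table 7's Cc values from the mortality rates it lists.

def Cc(p1, p0):
    """Cc = (R - 1)/max(R, 1) with R = p1/p0."""
    R = p1 / p0
    return (R - 1.0) / max(R, 1.0)

print(round(Cc(0.014, 0.013), 2))      # unvaccinated: Cc1 = 0.07
print(round(Cc(0.01318, 0.013), 3))    # vaccinated:   Cc1 = 0.014
```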