
Bayesian Test of Significance for Conditional Independence: The Multinomial Model

Instituto de Matemática e Estatística, Universidade de São Paulo (IME-USP), Rua do Matão, 1010, Cidade Universitária, São Paulo, SP, Brasil
* Author to whom correspondence should be addressed.
Entropy 2014, 16(3), 1376-1395; https://doi.org/10.3390/e16031376
Submission received: 3 December 2013 / Revised: 21 February 2014 / Accepted: 5 March 2014 / Published: 7 March 2014

Abstract

Conditional independence tests have received special attention lately in the machine learning and computational intelligence literature as an important indicator of the relationship among the variables used by their models. In the field of probabilistic graphical models, which includes Bayesian network models, conditional independence tests are especially important for the task of learning the probabilistic graphical model structure from data. In this paper, we propose the full Bayesian significance test for tests of conditional independence for discrete datasets. The full Bayesian significance test is a powerful Bayesian test for precise hypotheses, offered as an alternative to frequentist significance tests (characterized by the calculation of the p-value).


1. Introduction

Barlow and Pereira [1] discussed a graphical approach to conditional independence. A probabilistic influence diagram is a directed acyclic graph (DAG) that helps model statistical problems. The graph is composed of a set of nodes or vertices, which represent the variables, and a set of arcs joining the nodes, which represent the dependence relationships shared by these variables.

The construction of this model helps us understand the problem and gives a good representation of the interdependence of the implicated variables. The joint probability of these variables can be written as a product of their conditional distributions, based on their independence and conditional independence.

The interdependence of the variables [2] is sometimes unknown. In this case, the model structure must be learned from data. Algorithms, such as the IC-algorithm (inferred causation) described in Pearl and Verma [3], have been designed to uncover these structures from the data. This algorithm uses a series of conditional independence tests (CI tests) to remove and direct the arcs, connecting the variables in the model and returning a DAG that minimally (with the minimum number of parameters and without loss of information) represents the variables in the problem.

The problem of constructing DAG structures from data motivates the proposal of new, more powerful statistical tests for the hypothesis of conditional independence, because the accuracy of the learned structures is directly affected by the errors committed by these tests. Recently proposed structure learning algorithms [4–6] indicate that the results of CI tests are their main source of errors.

In this paper, we propose the full Bayesian significance test (FBST) as a test of conditional independence for discrete datasets. FBST is a powerful Bayesian test for a precise hypothesis and can be used to learn the DAG structures based on the data as an alternative to the CI tests currently in use, such as Pearson’s chi-squared test.

This paper is organized as follows. In Section 2, we review the FBST. In Section 3, we review the FBST for complex (composite) hypotheses. Section 4 gives an example of testing for conditional independence that can be used to construct a simple model with three variables.

2. The Full Bayesian Significance Test

The full Bayesian significance test was presented by Pereira and Stern [7] as a coherent Bayesian significance test for sharp hypotheses. In the FBST, the evidence for a precise hypothesis is computed.

This evidence is given by the complement of the probability of a credible set, called the tangent set: the subset of the parameter space in which the posterior density of each element is greater than the maximum of the posterior density over the null hypothesis. This evidence is called the e-value, ev(H), and has many desirable properties as a measure of statistical support. For example, Borges and Stern [8] described the following properties; the e-value:

(1) provides a measure of significance for the hypothesis as a probability defined directly in the original parameter space;

(2) provides a smooth measure of the significance of the hypothesis parameters, both continuous and differentiable;

(3) has an invariant geometric definition, independent of the particular parameterization of the null hypothesis being tested or the particular coordinate system chosen for the parameter space;

(4) obeys the likelihood principle;

(5) requires no ad hoc artifice, such as an arbitrary initial belief ratio between hypotheses;

(6) is a possibilistic support function, where the support of a logical disjunction is the maximum support among the supports of the disjuncts;

(7) provides a consistent test for a given sharp hypothesis;

(8) provides compositionality operations in complex models;

(9) is an exact procedure, making no use of asymptotic approximations when computing the e-value;

(10) allows the incorporation of previous experience or expert opinions via prior distributions.

Furthermore, the FBST is an exact test, whereas tests such as the one presented in Geenens and Simar [9] are only asymptotically correct. We therefore consider a direct comparison between the FBST and such tests to be beyond the scope of this paper, leaving as future research a comparison on small samples, for which the FBST remains valid.

A more formal definition is given below.

Consider a model in a statistical space described by the triple (Ξ, Δ, Θ), where Ξ is the sample space, Δ is the family of measurable subsets of Ξ and Θ is the parameter space (Θ ⊆ ℝⁿ).

Define a subset of the parameter space, Tφ (the tangent set), in which the posterior density (denoted by fx) of each element is greater than φ:

$$T_\varphi = \{\theta \in \Theta : f_x(\theta) > \varphi\}.$$

The credibility of $T_\varphi$ is given by its posterior probability,

$$\kappa = \int_{T_\varphi} f_x(\theta)\, d\theta = \int_\Theta \mathbb{1}_{T_\varphi}(\theta)\, f_x(\theta)\, d\theta,$$

where $\mathbb{1}_{T_\varphi}(\theta)$ is the indicator function

$$\mathbb{1}_{T_\varphi}(\theta) = \begin{cases} 1, & \theta \in T_\varphi, \\ 0, & \text{otherwise.} \end{cases}$$
Defining the maximum of the posterior density over the null hypothesis as $f_x^*$, attained at the point $\theta_0^*$,

$$\theta_0^* \in \arg\max_{\theta \in \Theta_0} f_x(\theta), \quad \text{and} \quad f_x^* = f_x(\theta_0^*),$$

and defining $T^* = T_{f_x^*}$ as the tangent set of the null hypothesis, $H_0$, the credibility of $T^*$ is denoted $\kappa^*$.

The measure of evidence for the null hypothesis (the e-value), which is the complement of the probability of the set $T^*$, is defined as follows:

$$\mathrm{ev}(H_0) = 1 - \kappa^* = 1 - \int_{T^*} f_x(\theta)\, d\theta.$$

If the probability of the set, T*, is large, the null set falls within a region of low probability, and the evidence is against the null hypothesis, H0. However, if the probability of T* is small, then the null set is in a region of high probability, and the evidence supports the null hypothesis.

2.1. FBST: Example of Tangent Set

Figure 1 shows the tangent set for a null hypothesis $H_0 : \mu = 1$ for the posterior distribution, $f_x$, given below, where $\mu$ is the mean of a normal distribution and $\tau$ is the precision (the inverse of the variance, $\tau = 1/\sigma^2$):

$$f_x(\mu, \tau) \propto \tau^{1.5} e^{-\tau \mu^2 - 1.5\tau}.$$
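To make the definitions above concrete, the following minimal Python sketch (an illustration, not part of the original computation) estimates ev(H0) for this example by Monte Carlo. It uses the fact that this particular unnormalized posterior factorizes as τ ~ Gamma(shape 2, rate 1.5) with μ | τ ~ N(0, 1/(2τ)), so exact samples are available; the supremum over H0 is attained at τ = 0.6, giving f* = 0.6^1.5 · e^(−1.5) ≈ 0.1037, the value shown in Figure 1.

```python
import numpy as np

rng = np.random.default_rng(42)

def post(mu, tau):
    """Unnormalized posterior f_x(mu, tau) ∝ tau^1.5 * exp(-tau*mu^2 - 1.5*tau)."""
    return tau**1.5 * np.exp(-tau * mu**2 - 1.5 * tau)

# The posterior factorizes: tau ~ Gamma(shape=2, rate=1.5) and
# mu | tau ~ N(0, 1/(2*tau)), so we can draw exact samples.
n = 500_000
tau = rng.gamma(shape=2.0, scale=1.0 / 1.5, size=n)
mu = rng.normal(0.0, np.sqrt(1.0 / (2.0 * tau)))

# Supremum of the (unnormalized) posterior over H0 : mu = 1, attained at tau = 0.6.
f_star = 0.6**1.5 * np.exp(-1.5)          # ≈ 0.1037, as in Figure 1

# ev(H0) = 1 - Pr(theta in T*) = Pr(f_x(theta) <= f_star).
ev = np.mean(post(mu, tau) <= f_star)
print(f"f* ≈ {f_star:.4f}, ev(H0) ≈ {ev:.3f}")
```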

3. FBST: Compositionality

The relationship between the credibility of a complex hypothesis, H, and its elementary constituents, Hj, j = 1, . . . , k, under the full Bayesian significance test was analyzed by Borges and Stern [8].

For a given set of independent parameters, (θ1, . . . , θk) ∈ (Θ1 × . . . × Θk), a complex hypothesis, H, can be stated as follows:

$$H : \theta_1 \in \Theta_1^H \wedge \theta_2 \in \Theta_2^H \wedge \cdots \wedge \theta_k \in \Theta_k^H,$$

where $\Theta_j^H$ is the subset of the parameter space $\Theta_j$, for j = 1, . . . , k, constrained by the hypothesis H, which can be decomposed into its elementary components (hypotheses):

$$H_1 : \theta_1 \in \Theta_1^H, \quad H_2 : \theta_2 \in \Theta_2^H, \quad \ldots, \quad H_k : \theta_k \in \Theta_k^H.$$

The credibility of H can be evaluated based on the credibility of these components. The evidence in favor of the complex hypothesis, H (measured by its e-value), cannot be obtained directly from the evidence in favor of the elementary components; instead, it must be based on their truth functions, Wj (or cumulative surprise distributions), defined below. For a given elementary component, Hj, of the complex hypothesis, H, $\theta_j^*$ is the point of maximum density of the posterior distribution ($f_x$) constrained to the subset of the parameter space defined by hypothesis Hj:

$$\theta_j^* \in \arg\max_{\theta_j \in \Theta_j^H} f_x(\theta_j) \quad \text{and} \quad f_j^* = f_x(\theta_j^*).$$

The truth function, Wj, is the probability of the region Rj(v) of the parameter space, defined below, where the posterior density is lower than or equal to the value v:

$$R_j(v) = \{\theta_j \in \Theta_j : f_x(\theta_j) \le v\}, \quad W_j(v) = \int_{R_j(v)} f_x(\theta_j)\, d\theta_j.$$

The evidence supporting the hypothesis, Hj, is given as follows:

$$\mathrm{Ev}(H_j) = W_j(f_j^*).$$

The evidence supporting the complex hypothesis can be then described in terms of the truth function of its components as follows.

Given two independent non-negative random variables, X and Y, if Z = XY, with cumulative distribution functions FZ(z), FX(x) and FY(y), then:

$$F_Z(z) = \Pr[Z \le z] = \Pr[X \le z/Y] = \int_0^\infty \Pr[X \le z/y]\, f_Y(y)\, dy = \int_0^\infty F_X(z/y)\, f_Y(y)\, dy = \int_0^\infty F_X(z/y)\, F_Y(dy).$$

Accordingly, we define a functional product for cumulative distribution functions, namely,

$$F_Z = (F_X \otimes F_Y)(z) = \int F_X(z/y)\, F_Y(dy).$$

The same result concerning the product of non-negative random variables can be expressed by the Mellin convolution of the probability density functions, as discussed by Kilicman and Arin [10], Springer [11] and Williamson [12]:

$$f_Z(z) = (f_X \otimes f_Y)(z) = \int_0^\infty \frac{1}{y}\, f_X(z/y)\, f_Y(y)\, dy.$$

The evidence supporting the complex hypothesis can then be described as the Mellin convolution of the truth functions of its components:

$$\mathrm{Ev}(H) = (W_1 \otimes W_2 \otimes W_3 \otimes \cdots \otimes W_k)(f_1^* \cdot f_2^* \cdot f_3^* \cdot \ldots \cdot f_k^*).$$

The Mellin convolution of two truth functions, W1 ⊗ W2, is the distribution function of the product; see Borges and Stern [8]:

$$(W_1 \otimes W_2)(f_1^* \cdot f_2^*) = \int_0^\infty W_1\!\left(\frac{f_1^* \cdot f_2^*}{f}\right) W_2(df).$$

The Mellin convolution W1 ⊗ W2 gives the distribution function of the product of two independent random variables with distribution functions W1 and W2; see Kaplan and Lin [13] and Williamson [12]. Furthermore, the commutative and associative properties follow immediately for the Mellin convolution:

$$(W_1 \otimes W_2) \otimes W_3 = W_1 \otimes (W_2 \otimes W_3) = (W_1 \otimes W_3) \otimes W_2 = W_1 \otimes (W_3 \otimes W_2).$$
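As a numerical sanity check of the Mellin convolution formula for densities given above, the following sketch (with illustrative Log-normal parameters, anticipating the example of Section 3.1) evaluates (fX ⊗ fY)(z) by quadrature and compares it with the known Log-normal density of the product.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Illustrative choices: X ~ lnN(0, 0.5^2), Y ~ lnN(0, 0.3^2).
f_X = stats.lognorm(s=0.5).pdf
f_Y = stats.lognorm(s=0.3).pdf

def mellin_conv(z):
    """(f_X ⊗ f_Y)(z) = integral over (0, inf) of (1/y) f_X(z/y) f_Y(y) dy."""
    val, _ = quad(lambda y: f_X(z / y) * f_Y(y) / y, 0.0, np.inf)
    return val

# The product of two independent Log-normals is Log-normal (see Section 3.1),
# with variance parameter 0.5^2 + 0.3^2.
f_Z = stats.lognorm(s=np.hypot(0.5, 0.3)).pdf

for z in (0.5, 1.0, 2.0):
    print(f"z = {z}: convolution = {mellin_conv(z):.6f}, exact = {f_Z(z):.6f}")
```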

3.1. Mellin Convolution: Example

An example of a Mellin convolution to find the product of two random variables, Y1 and Y2, both of which have a Log-normal distribution, is given below.

Assume Y1 and Y2 to be continuous random variables, such that:

$$Y_1 \sim \ln N(\mu_1, \sigma_1^2), \quad Y_2 \sim \ln N(\mu_2, \sigma_2^2).$$

We denote the cumulative distributions of Y1 and Y2 by W1 and W2, respectively, i.e.,

$$W_1(y_1) = \int_{-\infty}^{y_1} f_{Y_1}(t)\, dt, \quad W_2(y_2) = \int_{-\infty}^{y_2} f_{Y_2}(t)\, dt,$$

where fY1 and fY2 are the density functions of Y1 and Y2, respectively. These distributions can be written as a function of two normally distributed random variables, X1 and X2:

$$\ln(Y_1) = X_1 \sim N(\mu_1, \sigma_1^2), \quad \ln(Y_2) = X_2 \sim N(\mu_2, \sigma_2^2).$$

We can confirm that the distribution of the product of these random variables (Y1 · Y2) is also Log-normal, using simple arithmetic operations:

$$Y_1 = e^{X_1} \ \text{and} \ Y_2 = e^{X_2}, \qquad Y_1 \cdot Y_2 = e^{X_1 + X_2},$$
$$\ln(Y_1 \cdot Y_2) = X_1 + X_2 \sim N(\mu_1 + \mu_2,\, \sigma_1^2 + \sigma_2^2), \qquad Y_1 \cdot Y_2 \sim \ln N(\mu_1 + \mu_2,\, \sigma_1^2 + \sigma_2^2).$$

The cumulative distribution function of Y1 · Y2, W12(y12), is defined as follows:

$$W_{12}(y_{12}) = \int_{-\infty}^{y_{12}} f_{Y_1 \cdot Y_2}(t)\, dt,$$

where fY1·Y2 is the density function of Y1 · Y2.

In the next section, we show different numerical methods for use in the convolution and condensation procedures, and we apply the results of these procedures to the example given here.

3.2. Numerical Methods for Convolution and Condensation

Williamson and Downs [14] developed the idea of probabilistic arithmetics. They investigated numerical procedures that allow for the computation of a distribution using arithmetic operations on random variables by replacing basic arithmetic operations on numbers with arithmetic operations on random variables. They demonstrated numerical methods for calculating the convolution of probability distributions for a set of random variables.

The convolution for the multiplication of two random variables, X1 and X2 (Z = X1 · X2), can be written using their respective cumulative distribution functions, FX1 and FX2:

$$F_Z(z) = \int_0^\infty F_{X_1}(z/t)\, dF_{X_2}(t).$$

The algorithm for numerically calculating the distribution of the product of two independent random variables (Y1 and Y2) from their discretized marginal probability distributions (fY1 and fY2) is shown in Algorithm 1 (an algorithm for the discretization procedure is given by Williamson and Downs [14]). The steps of Algorithm 1 are described below.

(1) The algorithm takes as inputs two discrete variables, Y1 and Y2, together with their respective probability density functions (pdfs), fY1 and fY2.

(2) It forms the products (Y1 · Y2 and fY1 · fY2), resulting in N² bins if fY1 and fY2 each have N bins.

(3) The values of Y1 · Y2 are sorted in increasing order.

(4) The values of fY1 · fY2 are sorted according to the order of Y1 · Y2.

(5) The cumulative distribution function (cdf) of the product Y1 · Y2 is computed (it has N² bins).

The numerical convolution of the two distributions with N bins, as described above, returns a distribution with N2 bins. For a sequence of operations, such a large number of bins would be a problem, because the result of each operation would be larger than the input for the operations. Therefore, the authors have proposed a simple method for reducing the size of the output to N bins without introducing further error into the result. This operation is called condensation and returns the upper and lower bounds of each of the N bins for the distribution resulting from the convolution. The algorithm for the condensation process is shown in Algorithm 2. The description of Algorithm 2 is given below.

(1) The algorithm takes as input a cdf with N² bins.

(2) For each group of N bins (there are N groups of N bins), the value of the cdf at the first bin is taken as the lower bound, and the value of the cdf at the last bin is taken as the upper bound.

(3) The algorithm returns a cdf with N bins, where each bin has a lower and an upper bound.

Algorithm 1. Find the distribution of the product of two random variables.

procedure Convolution(Y1, Y2, fY1, fY2)	▹ discrete pdfs of Y1 and Y2
    f ← array(0, size n²)	▹ f and W have n² bins
    W ← array(0, size n²)
    y1y2 ← array(0, size n²)	▹ holds the products Y1 · Y2
    for i ← 1, n do	▹ fY1 and fY2 have n bins
        for j ← 1, n do
            f[(i − 1) · n + j] ← fY1[i] · fY2[j]
            y1y2[(i − 1) · n + j] ← Y1[i] · Y2[j]
        end for
    end for
    sortedIdx ← order(y1y2)	▹ find the ordering of Y1 · Y2
    f ← f[sortedIdx]	▹ sort f according to Y1 · Y2
    W[1] ← f[1]
    for k ← 2, n² do	▹ running sum gives the cdf of Y1 · Y2
        W[k] ← f[k] + W[k − 1]
    end for
    return W	▹ discrete cdf of Y1 · Y2
end procedure
Algorithm 2. Find the upper and lower bounds of a cdf for condensation.

procedure HorizontalCondensation(W)	▹ histogram of a cdf with n² bins
    Wl ← array(0, size n)
    Wu ← array(0, size n)
    for i ← 1, n do
        Wl[i] ← W[(i − 1) · n + 1]	▹ lower bound after condensation
        Wu[i] ← W[i · n]	▹ upper bound after condensation
    end for
    return [Wl, Wu]	▹ histograms with upper/lower bounds
end procedure
Algorithm 3. Condensation with the bins uniformly distributed over the vertical axis.

procedure VerticalCondensation(W, f, x)	▹ histograms of a cdf and pdf, and breaks on the x-axis
    breaks ← [1/n, 2/n, …, 1]	▹ uniform breaks on the y-axis
    Wn ← array(0, size n)
    xn ← array(0, size n)
    lastbreak ← 1
    i ← 1
    for all b ∈ breaks do
        w ← first(W ≥ b)	▹ find the break that closes the current bin
        if W[w] ≠ b then	▹ the break falls inside an existing bin
            ratio ← (b − W[w − 1]) / (W[w] − W[w − 1])
            xn[i] ← (1 / (1/n)) · (sum(f[w − 1] · x[w − 1]) + ratio · f[w] · x[w])
            W[i − 1] ← b
            Wn[i] ← b
            f[i − 1] ← f[w − 1] + ratio · f[w]
            f[i] ← (1 − ratio) · f[w]
        else
            xn[i] ← x[w]
            Wn[i] ← W[w]
        end if
        lastbreak ← b
        i ← i + 1
    end for
    return [Wn, xn]	▹ condensed cdf and bin positions
end procedure

3.2.1. Vertical Condensation

Kaplan and Lin [13] proposed a vertical condensation procedure for discrete probability calculations, where the condensation is done using the vertical axis, instead of the horizontal axis, as used by Williamson and Downs [14].

The advantage of this approach is that it provides greater control over the representation of the distribution; instead of selecting an interval of the domain of the cumulative distribution function (values assumed by the random variable) as a bin, we select the interval from the range of the cumulative distribution in [0, 1], which should be represented by each bin.

In this case, it is also possible to focus on a specific region of the distribution. For example, if there is a greater interest in the behavior of the tail of the distribution, the size of the bins can be reduced in this region, consequently increasing the number of bins necessary to represent the tail of the distribution.

An example of such a convolution that is followed by a condensation procedure using both approaches is given in Section 3.1. For this example, we used discretization and condensation procedures, with the bins uniformly distributed over both axes. At the end of the condensation procedure, using the first approach, the bins are uniformly distributed horizontally (over the sample space of the variable). For the second approach, the bins of the cumulative probability distribution are uniformly distributed over the vertical axis on the interval [0, 1]. Algorithm 3 shows the condensation with the bins uniformly distributed over the vertical axis.

Figure 2 shows the cumulative distribution functions of Y1 and Y2 (Section 3.1) after they have been discretized with bins uniformly distributed over both the x- and y-axes (horizontal and vertical discretizations). Figure 3 shows an example of convolution followed by condensation (based on the example in Section 3.1), using both the horizontal and vertical condensation procedures and the true distribution of the product of two variables with Log-normal distributions.
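To make Algorithms 1 and 2 concrete, the following NumPy sketch runs the numerical convolution and the horizontal condensation on the Log-normal example of Section 3.1 (the parameters and grid are illustrative assumptions) and compares the condensed bounds with the exact Log-normal cdf of the product.

```python
import numpy as np
from scipy import stats

n = 100                                  # bins per marginal
s1, s2 = 0.5, 0.3                        # illustrative lnN(0, s^2) parameters

# Horizontal discretization: uniform grid on the x-axis, one probability
# mass per bin (differences of the cdf), as in Section 3.2.
edges = np.linspace(1e-3, 8.0, n + 1)
f1 = np.diff(stats.lognorm.cdf(edges, s=s1))
f2 = np.diff(stats.lognorm.cdf(edges, s=s2))
mid = 0.5 * (edges[:-1] + edges[1:])     # bin midpoints represent the bins

# Algorithm 1 (Convolution): n^2 products and masses, sorted, then a running sum.
prod = np.outer(mid, mid).ravel()
mass = np.outer(f1, f2).ravel()
order = np.argsort(prod)
prod, W = prod[order], np.cumsum(mass[order])   # discrete cdf on n^2 bins

# Algorithm 2 (HorizontalCondensation): reduce to n bins with bounds.
W_low = W[0::n]                          # cdf at the first bin of each group
W_up = W[n - 1::n]                       # cdf at the last bin of each group
z_up = prod[n - 1::n]                    # right edge of each condensed bin

# The exact product distribution is lnN(0, s1^2 + s2^2) (Section 3.1).
exact = stats.lognorm.cdf(z_up, s=np.hypot(s1, s2))
print("max width of the condensed bounds:", np.max(W_up - W_low))
print("max |upper bound - exact cdf| at the right edges:",
      np.max(np.abs(W_up - exact)))
```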

4. Test of Conditional Independence in Contingency Table Using FBST

We now apply the methods shown in the previous sections to find evidence of a complex null hypothesis of conditional independence for discrete variables.

Given the discrete random variables X, Y and Z, with X taking values in {1, . . . , k} and Y and Z categorical, the test for conditional independence Y ⊥ Z | X can be written as the complex null hypothesis, H:

$$H : [Y \perp Z \mid X = 1] \wedge [Y \perp Z \mid X = 2] \wedge \cdots \wedge [Y \perp Z \mid X = k].$$

The hypothesis, H, can be decomposed into its elementary components:

$$H_1 : Y \perp Z \mid X = 1, \quad H_2 : Y \perp Z \mid X = 2, \quad \ldots, \quad H_k : Y \perp Z \mid X = k.$$

Note that the hypotheses, H1, . . . , Hk, are independent. For each value, x, taken by X, the values taken by variables Y and Z are assumed to be random observations drawn from some distribution p(Y,Z|X = x). Each of the elementary components is a hypothesis of independence in a contingency table. Table 1 shows the contingency table for Y and Z, which take values on {1, . . . , r} and {1, . . . , c}, respectively.

The test of the hypothesis, Hx, can be set up using the multinomial distribution for the cell counts of the contingency table and its natural conjugate prior, i.e., the Dirichlet distribution for the vector of the parameters θx = [θ11x, θ12x, . . . , θrcx].

For a given array of hyperparameters, αx = [α11x, . . . , αrcx], the Dirichlet distribution is defined as:

$$f(\theta_x \mid \alpha_x) = \Gamma\!\Big(\textstyle\sum_{y,z}^{r,c} \alpha_{yzx}\Big) \prod_{y,z}^{r,c} \frac{\theta_{yzx}^{\alpha_{yzx} - 1}}{\Gamma(\alpha_{yzx})}.$$

The multinomial likelihood for the given contingency table, with the array of observations $n_x = [n_{11x}, \ldots, n_{rcx}]$ and the total count $n_{..x} = \sum_{y,z}^{r,c} n_{yzx}$, is:

$$f(n_x \mid \theta_x) = n_{..x}! \prod_{y,z}^{r,c} \frac{\theta_{yzx}^{n_{yzx}}}{n_{yzx}!}.$$

The posterior distribution is thus a Dirichlet distribution, $f_n(\theta_x)$:

$$f_n(\theta_x) \propto \prod_{y,z}^{r,c} \theta_{yzx}^{\alpha_{yzx} + n_{yzx} - 1}.$$
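This conjugacy can be checked numerically. The sketch below is an illustration with a 3 × 3 table and a uniform prior (the counts are those of Table 2c); it verifies that the log-prior plus the log-likelihood differs from the log-posterior density by the same constant at any point of the simplex.

```python
import numpy as np
from scipy import stats

alpha = np.ones(9)                                           # uniform Dirichlet prior
counts = np.array([42, 41, 323, 39, 41, 341, 15, 21, 171])   # n_x (Table 2c)

def log_norm_const(theta):
    """log prior + log likelihood - log posterior; constant in theta."""
    return (stats.dirichlet(alpha).logpdf(theta)
            + stats.multinomial(counts.sum(), theta).logpmf(counts)
            - stats.dirichlet(alpha + counts).logpdf(theta))

print(log_norm_const(np.full(9, 1 / 9)))
print(log_norm_const(np.arange(1, 10) / 45))                 # both prints agree
```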

Under hypothesis Hx, we have Y ⊥ Z | X = x. In this case, the joint distribution is equal to the product of the marginals: p(Y = y, Z = z | X = x) = p(Y = y | X = x) p(Z = z | X = x). We can express this condition in terms of the array of parameters, θx:

$$H_x : \theta_{yzx} = \theta_{.zx} \cdot \theta_{y.x}, \quad \forall y, z,$$

where $\theta_{.zx} = \sum_{y}^{r} \theta_{yzx}$ and $\theta_{y.x} = \sum_{z}^{c} \theta_{yzx}$.

The elementary components of hypothesis H are as follows:

$$H_1 : \theta_{yz1} = \theta_{.z1} \cdot \theta_{y.1}, \ \forall y, z; \quad H_2 : \theta_{yz2} = \theta_{.z2} \cdot \theta_{y.2}, \ \forall y, z; \quad \ldots; \quad H_k : \theta_{yzk} = \theta_{.zk} \cdot \theta_{y.k}, \ \forall y, z.$$

The point of maximum density of the posterior distribution constrained to the subset of the parameter space defined by hypothesis Hx can be estimated using the maximum a posteriori (MAP) estimator under Hx (the mode of the parameters, θx). The maximum density, $f_x^*$, is the posterior density evaluated at this point:

$$\theta_{yzx}^* = \frac{n_{yzx}^{H_x} + \alpha_{yzx} - 1}{n_{..x}^{H_x} + \alpha_{..x} - r \cdot c} \quad \text{and} \quad f_x^* = f_n(\theta_x^*),$$

where $\theta_x^* = [\theta_{11x}^*, \ldots, \theta_{rcx}^*]$.

The evidence supporting Hx can be written in terms of the truth function, Wx, as defined in Section 3:

$$R_x(f) = \{\theta_x \in \Theta_x : f_n(\theta_x) \le f\}, \quad W_x(f) = \int_{R_x(f)} f_n(\theta_x)\, d\theta_x \propto \int_{R_x(f)} \prod_{y,z}^{r,c} \theta_{yzx}^{\alpha_{yzx} + n_{yzx} - 1}\, d\theta_x.$$

The evidence supporting Hx is:

$$\mathrm{Ev}(H_x) = W_x(f_x^*).$$

Finally, the evidence supporting the hypothesis of conditional independence, H, is given by the convolution of the truth functions evaluated at the product of the points of maximum posterior density of the components of H:

$$\mathrm{Ev}(H) = (W_1 \otimes W_2 \otimes \cdots \otimes W_k)(f_1^* \cdot f_2^* \cdot \ldots \cdot f_k^*).$$

The e-value for hypothesis H can then be found using numerical integration methods. An example is given in the next section, using the numerical convolution followed by the condensation procedures described in Section 3.2. The horizontal condensation method yields an interval for the e-value (from the lower and upper bounds produced by the condensation process), whereas the vertical procedure yields a single value.

4.1. Example of CI Test Using FBST

In this section, we illustrate the CI test based on the full Bayesian significance test, using samples from two different models. For both models, we test whether the variable Y is conditionally independent of Z, given X.

Two probabilistic graphical models (M1 and M2) are shown in Figure 4, where the three variables, X, Y and Z, assume values in {1, 2, 3}. In the first model (Figure 4a), the hypothesis of independence H : Y ⊥ Z | X is true, but in the second model (Figure 4b), the same hypothesis is false. The synthetic conditional probability distribution tables (CPTs) used to generate the samples are given in the Appendix.

We calculate and compare the e-values for the hypothesis H of conditional independence under both models: EvM1(H) and EvM2(H). The complex hypothesis, H, can be decomposed into its elementary components:

$$H_1 : Y \perp Z \mid X = 1, \quad H_2 : Y \perp Z \mid X = 2, \quad H_3 : Y \perp Z \mid X = 3.$$

For each model, 5,000 observations were generated; the contingency table of Y and Z for each value of X is shown in Table 2. The hyperparameters of the prior distribution were all set to one, in which case the prior (the Dirichlet distribution defined in Section 4) is equivalent to a uniform distribution:

$$\alpha_1 = \alpha_2 = \alpha_3 = [1, \ldots, 1], \quad f(\theta_1 \mid \alpha_1) = f(\theta_2 \mid \alpha_2) = f(\theta_3 \mid \alpha_3) = 1.$$

The posterior distributions, obtained from the multinomial likelihood and the Dirichlet prior above, are then:

$$f_n(\theta_1) \propto \prod_{y,z=1}^{3,3} \theta_{yz1}^{n_{yz1}}, \quad f_n(\theta_2) \propto \prod_{y,z=1}^{3,3} \theta_{yz2}^{n_{yz2}}, \quad f_n(\theta_3) \propto \prod_{y,z=1}^{3,3} \theta_{yz3}^{n_{yz3}}.$$

For example, for the contingency table of Model M1 with X = 2 (Table 2c), the posterior distribution is:

$$f_n(\theta_2) \propto \theta_{112}^{42} \cdot \theta_{122}^{41} \cdot \theta_{132}^{323} \cdot \theta_{212}^{39} \cdot \theta_{222}^{41} \cdot \theta_{232}^{341} \cdot \theta_{312}^{15} \cdot \theta_{322}^{21} \cdot \theta_{332}^{171}.$$

The point of highest posterior density under the hypothesis of independence, using the MAP estimator of Section 4, was found to be:

$$\theta_2^* \approx [0.036, 0.039, 0.317, 0.038, 0.041, 0.329, 0.019, 0.020, 0.162].$$

The truth function and the evidence supporting the hypothesis of independence given X = 2 (hypothesis H2) for Model M1 are then:

$$R_2(f) = \{\theta_2 \in \Theta_2 : f_n(\theta_2) \le f\}, \quad W_2(f) = \int_{R_2(f)} f_n(\theta_2)\, d\theta_2, \quad \mathrm{Ev}_{M1}(H_2) = W_2(f_n(\theta_2^*)).$$
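Before turning to those results, note that the e-value of an elementary component can also be estimated by a plain Monte Carlo sketch: sample θ2 from its Dirichlet posterior and compute the proportion of draws whose posterior density does not exceed fn(θ2*). The code below is an illustrative alternative estimator, not the convolution/condensation procedure used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Contingency table of Y and Z for X = 2, Model M1 (Table 2c).
n2 = np.array([[42.0, 41.0, 323.0],
               [39.0, 41.0, 341.0],
               [15.0, 21.0, 171.0]])

def log_post(theta):                     # unnormalized log posterior (alpha = 1)
    return np.sum(n2 * np.log(theta), axis=(-2, -1))

# MAP under H2 (independence): the product of the marginal proportions.
p_y = n2.sum(axis=1) / n2.sum()
p_z = n2.sum(axis=0) / n2.sum()
log_f_star = log_post(np.outer(p_y, p_z))

# Ev(H2) = W_2(f*) = Pr[f_n(theta_2) <= f*] under the posterior.
theta = rng.dirichlet((n2 + 1.0).ravel(), size=400_000).reshape(-1, 3, 3)
ev = np.mean(log_post(theta) <= log_f_star)
print(f"Ev_M1(H2) ≈ {ev:.3f}")           # compare with ≈ 0.98 reported below
```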

We used numerical integration to find the e-values of the elementary components of hypothesis H (H1, H2 and H3); the results for each model are given below.

E-values found using horizontal discretization:

$$\mathrm{Ev}_{M1}(H_1) = 0.9878, \quad \mathrm{Ev}_{M1}(H_2) = 0.9806, \quad \mathrm{Ev}_{M1}(H_3) = 0.1066;$$
$$\mathrm{Ev}_{M2}(H_1) = 0.0004, \quad \mathrm{Ev}_{M2}(H_2) = 0.0006, \quad \mathrm{Ev}_{M2}(H_3) = 0.0004.$$

E-values found using vertical discretization:

$$\mathrm{Ev}_{M1}(H_1) = 0.99, \quad \mathrm{Ev}_{M1}(H_2) = 0.98, \quad \mathrm{Ev}_{M1}(H_3) = 0.11;$$
$$\mathrm{Ev}_{M2}(H_1) = 0.01, \quad \mathrm{Ev}_{M2}(H_2) = 0.01, \quad \mathrm{Ev}_{M2}(H_3) = 0.01.$$

Figure 5 shows the histograms of the truth functions, W1, W2 and W3, for Model M1 (in which Y and Z are conditionally independent, given X). In Figure 5a,c,e, 100 bins are uniformly distributed over the x-axis (using the empirical values of min fn(θx) and max fn(θx)). In Figure 5b,d,f, 100 bins are uniformly distributed over the y-axis (each bin represents an increase of 1% in cumulative probability over the previous bin). The function Wx evaluated at the maximum posterior density under the respective hypothesis, fn(θx*), marked in red, corresponds to the e-values found (e.g., W3(fn(θ3*)) ≈ 0.1066 for the horizontal discretization in Figure 5e).

The evidence supporting the hypothesis of conditional independence, H, defined at the end of Section 4, is, for each model:

$$\mathrm{Ev}(H) = (W_1 \otimes W_2 \otimes W_3)(f_n(\theta_1^*) \cdot f_n(\theta_2^*) \cdot f_n(\theta_3^*)).$$

The convolution is commutative, so the order of the convolutions is irrelevant:

$$(W_1 \otimes W_2 \otimes W_3)(f) = (W_3 \otimes W_2 \otimes W_1)(f).$$

Using the numerical convolution described in Algorithm 1, we computed the convolution of the truth functions W1 and W2, resulting in a cumulative function (W12) with 10,000 bins (100² bins). We then applied the condensation procedures described in Algorithms 2 and 3, reducing the cumulative distribution to 100 bins, with lower and upper bounds ($W_{12}^l$ and $W_{12}^u$) in the horizontal case. The results are shown in Figure 6a,b for Model M1 (horizontal and vertical condensations, respectively) and in Figure 7a,b for Model M2.

The convolution of W12 and W3 was followed by their condensation. The results are shown in Figure 6c,d (Model M1) and Figure 7c,d (Model M2).

The e-values supporting the hypothesis of conditional independence for both models are given below.

The intervals for the e-values found using horizontal discretization and condensation are:

$$\mathrm{Ev}_{M1}(H) = [0.587427,\, 0.718561], \quad \mathrm{Ev}_{M2}(H) = [8 \cdot 10^{-12},\, 6.416 \cdot 10^{-9}].$$

The e-values found using vertical discretization and condensation are:

$$\mathrm{Ev}_{M1}(H) = 0.95, \quad \mathrm{Ev}_{M2}(H) = 0.01.$$

These results show strong evidence supporting the hypothesis of conditional independence between Y and Z, given X, for Model M1 (under both discretization/condensation procedures), and no evidence supporting the same hypothesis for the second model. These results are promising and motivate further study of the FBST as a CI test for the structural learning of graphical models.
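As a cross-check on these numbers, Ev(H) itself admits a direct Monte Carlo estimate: since W1 ⊗ W2 ⊗ W3 is the distribution of the product of independent draws from W1, W2 and W3, one can sample each θx from its posterior, multiply the three posterior densities and compare the product with fn(θ1*) · fn(θ2*) · fn(θ3*). The sketch below does this for both models using the counts in Table 2; it is an illustrative alternative to the convolution/condensation pipeline, so its estimates need not coincide exactly with the binned results above.

```python
import numpy as np

rng = np.random.default_rng(7)

# Tables 2(a,c,e): Model M1; Tables 2(b,d,f): Model M2.
M1 = [np.array([[241, 187, 44], [139, 130, 30], [364, 302, 70]]),
      np.array([[42, 41, 323], [39, 41, 341], [15, 21, 171]]),
      np.array([[282, 35, 151], [131, 37, 79], [1055, 143, 546]])]
M2 = [np.array([[228, 179, 39], [25, 33, 211], [482, 75, 208]]),
      np.array([[77, 85, 248], [165, 135, 120], [188, 21, 24]]),
      np.array([[40, 87, 354], [119, 104, 27], [305, 1049, 372]])]

def ev_ci(tables, n_draws=200_000):
    """Monte Carlo Ev(H) = Pr[prod_x f_n(theta_x) <= prod_x f_n(theta_x*)]."""
    log_dens = np.zeros(n_draws)
    log_star = 0.0
    for n in tables:
        n = n.astype(float)
        theta = rng.dirichlet((n + 1.0).ravel(), size=n_draws).reshape(-1, 3, 3)
        log_dens += np.sum(n * np.log(theta), axis=(-2, -1))
        star = np.outer(n.sum(axis=1), n.sum(axis=0)) / n.sum() ** 2  # MAP under H_x
        log_star += np.sum(n * np.log(star))
    return np.mean(log_dens <= log_star)

print(f"Ev_M1(H) ≈ {ev_ci(M1):.3f}")     # M1: conditional independence holds
print(f"Ev_M2(H) ≈ {ev_ci(M2):.3f}")     # M2: it does not
```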

5. Conclusions and Future Work

This paper provides a framework for performing tests of conditional independence for discrete datasets using the full Bayesian significance test. A simple application of the test examines the structure of a directed acyclic graph under two different models. The results suggest that the FBST is a good alternative to the CI tests currently used to uncover the structures of probabilistic graphical models from data.

Future research includes using the FBST in algorithms that learn the structures of graphs with larger numbers of variables; improving the computational methods for calculating e-values (learning DAG structures from data can require an exponential number of CI tests, so each test must be performed quickly); and empirically evaluating the threshold on e-values that separates conditional independence from dependence. The last goal can be achieved by minimizing a linear combination of type I and type II errors (incorrectly rejecting a true hypothesis of conditional independence and failing to reject a false one).

Acknowledgments

The authors are grateful for the support of IME-USP and of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP).

Appendix

Table A1. Conditional probability distribution tables. (a) The distribution of X, (b) the conditional distribution of Y, given X, and (c) the conditional distribution of Z, given X.

(a) CPT of X

X    p(X)
1    0.3
2    0.2
3    0.5

(b) CPT of Y given X

Y    p(Y|X=1)    p(Y|X=2)    p(Y|X=3)
1    0.3         0.4         0.2
2    0.2         0.4         0.1
3    0.5         0.2         0.7

(c) CPT of Z given X

Z    p(Z|X=1)    p(Z|X=2)    p(Z|X=3)
1    0.5         0.1         0.6
2    0.4         0.1         0.1
3    0.1         0.8         0.3
Table A2. Conditional probability distribution table of Z, given X and Y.

Z    p(Z|X=1,Y=1)    p(Z|X=1,Y=2)    p(Z|X=1,Y=3)
1    0.5             0.1             0.6
2    0.4             0.1             0.1
3    0.1             0.8             0.3

Z    p(Z|X=2,Y=1)    p(Z|X=2,Y=2)    p(Z|X=2,Y=3)
1    0.2             0.4             0.8
2    0.2             0.3             0.1
3    0.6             0.3             0.1

Z    p(Z|X=3,Y=1)    p(Z|X=3,Y=2)    p(Z|X=3,Y=3)
1    0.1             0.5             0.2
2    0.2             0.4             0.6
3    0.7             0.1             0.2

Conflicts of Interest

The authors declare no conflict of interest regarding the material discussed in the manuscript.

Author Contributions: All authors made substantial contributions to the conception and design, acquisition of data, and analysis and interpretation of data; all authors participated in drafting the article or revising it critically for important intellectual content; and all authors gave final approval of the version to be submitted and of any revised version.

References

  1. Barlow, R.E.; Pereira, C.A.B. Conditional independence and probabilistic influence diagrams. In Topics in Statistical Dependence; Block, H.W., Sampson, A.R., Savits, T.H., Eds.; Lecture Notes-Monograph Series; Institute of Mathematical Statistics: Beachwood, OH, USA, 1990; pp. 19–33.
  2. Basu, D.; Pereira, C.A.B. Conditional independence in statistics. Sankhyā: The Indian Journal of Statistics, Series A 1983, 371–384.
  3. Pearl, J.; Verma, T.S. A Theory of Inferred Causation; Studies in Logic and the Foundations of Mathematics; Morgan Kaufmann: San Mateo, CA, USA, 1995; pp. 789–811.
  4. Cheng, J.; Bell, D.A.; Liu, W. Learning belief networks from data: An information theory based approach. In Proceedings of the Sixth International Conference on Information and Knowledge Management, Las Vegas, NV, USA, 10–14 November 1997; pp. 325–331.
  5. Tsamardinos, I.; Brown, L.E.; Aliferis, C.F. The max-min hill-climbing Bayesian network structure learning algorithm. Mach. Learn. 2006, 65, 31–78.
  6. Yehezkel, R.; Lerner, B. Bayesian network structure learning by recursive autonomy identification. J. Mach. Learn. Res. 2009, 10, 1527–1570.
  7. Pereira, C.A.B.; Stern, J.M. Evidence and credibility: Full Bayesian significance test for precise hypotheses. Entropy 1999, 1, 99–110.
  8. Borges, W.; Stern, J.M. The rules of logic composition for the Bayesian epistemic e-values. Log. J. IGPL 2007, 15, 401–420.
  9. Geenens, G.; Simar, L. Nonparametric tests for conditional independence in two-way contingency tables. J. Multivar. Anal. 2010, 101, 765–788.
  10. Kilicman, A.; Arin, M.R.K. A note on the convolution in the Mellin sense with generalized functions. Bull. Malays. Math. Sci. Soc. 2002, 25, 93–100.
  11. Springer, M.D. The Algebra of Random Variables; Wiley: New York, NY, USA, 1979.
  12. Williamson, R.C. Probabilistic Arithmetic. Ph.D. Thesis, University of Queensland, Brisbane, Australia, 1989.
  13. Kaplan, S.; Lin, J.C. An improved condensation procedure in discrete probability distribution calculations. Risk Anal. 1987, 7, 15–19.
  14. Williamson, R.C.; Downs, T. Probabilistic arithmetic. I. Numerical methods for calculating convolutions and dependency bounds. Int. J. Approx. Reason. 1990, 4, 89–158.
Figure 1. Example of a tangent set for the null hypothesis H0 : μ = 1.0. In (a) and (b), the posterior distribution, fx, is shown, with the red line representing the points in the null hypothesis (μ = 1). In (c), the contours of fx show that the point of maximum density in the null hypothesis, θ0*, has a density of 0.1037 (f* = fx(θ0*) = 0.1037). The tangent set, T*, of the null hypothesis, H0, is the set of points inside the green contour line (points with a density greater than f*), and the e-value of H0 is the complement of the integral of fx over the region bounded by the green contour line.
Figure 2. Example of different discretization methods for representing the cdfs of two random variables (Y1 and Y2) with Log-normal distributions. In (a) and (c), respectively, the cdfs of Y1 and Y2 are shown with the bins uniformly distributed over the x-axis. In (b) and (d), respectively, the cdfs of Y1 and Y2 are shown with the bins uniformly distributed over the y-axis.
Figure 3. Example of the convolution of two random variables (Y1 and Y2) with Log-normal distributions. The result of the convolution Y1 ⊗ Y2, followed by horizontal condensation (bins uniformly distributed over the x-axis), is shown in (a), and the result of vertical condensation (bins uniformly distributed over the y-axis) is shown in (b). The true distribution of the product Y1 · Y2 is shown in (c) and (d), respectively, for the horizontal and vertical discretization procedures.
Figure 4. Simple probabilistic graphical models. (a) Model M1, where Y is conditionally independent of Z given X; (b) Model M2, where Y is not conditionally independent of Z given X.
Figure 5. Histograms with 100 bins for the truth functions of Model M1 (Figure 4a) for each value of X: (a) W1, horizontal discretization; (b) W1, vertical discretization; (c) W2, horizontal discretization; (d) W2, vertical discretization; (e) W3, horizontal discretization; (f) W3, vertical discretization. In red is fn(θx*), the maximum posterior density under the respective elementary component (H1, H2 and H3) of the hypothesis of conditional independence, H, for both discretization procedures.
Figure 6. Histograms with 100 bins resulting from the convolutions for Model M1: (a) W1 ⊗ W2 with horizontal discretization; (b) W1 ⊗ W2 with vertical discretization; (c) W1 ⊗ W2 ⊗ W3 with horizontal discretization; (d) W1 ⊗ W2 ⊗ W3 with vertical discretization. In red in (c) and (d) is the bin representing the product of the maximum posterior densities under the elementary components (H1, H2 and H3) of the hypothesis of conditional independence, H, for Model M1.
Figure 7. Histograms with 100 bins resulting from the convolutions for Model M2: (a) W1 ⊗ W2 with horizontal discretization; (b) W1 ⊗ W2 with vertical discretization; (c) W1 ⊗ W2 ⊗ W3 with horizontal discretization; (d) W1 ⊗ W2 ⊗ W3 with vertical discretization. In red in (c) and (d) is the bin representing the product of the maximum posterior densities under the elementary components (H1, H2 and H3) of the hypothesis of conditional independence, H, for Model M2.
Table 1. Contingency table of Y and Z for X = x (hypothesis Hx); nyzx is the count of [Y, Z] = [y, z] when X = x.

        Z = 1    Z = 2    ⋯    Z = c
Y = 1   n11x     n12x     ⋯    n1cx
Y = 2   n21x     n22x     ⋯    n2cx
⋮       ⋮        ⋮        ⋱    ⋮
Y = r   nr1x     nr2x     ⋯    nrcx
Table 2. Contingency tables of Y and Z for a given value of X for 5,000 random samples. (a,c,e): samples from Model M1 (Figure 4a) for X = 1, 2 and 3, respectively; (b,d,f): samples from Model M2 (Figure 4b) for X = 1, 2 and 3, respectively.

(a) Model M1 (for X = 1)

        Z = 1    Z = 2    Z = 3    Total
Y = 1   241      187      44       472
Y = 2   139      130      30       299
Y = 3   364      302      70       736
Total   744      619      144      1,507

(b) Model M2 (for X = 1)

        Z = 1    Z = 2    Z = 3    Total
Y = 1   228      179      39       446
Y = 2   25       33       211      269
Y = 3   482      75       208      765
Total   735      287      458      1,480

(c) Model M1 (for X = 2)

        Z = 1    Z = 2    Z = 3    Total
Y = 1   42       41       323      406
Y = 2   39       41       341      421
Y = 3   15       21       171      207
Total   96       103      835      1,034

(d) Model M2 (for X = 2)

        Z = 1    Z = 2    Z = 3    Total
Y = 1   77       85       248      410
Y = 2   165      135      120      420
Y = 3   188      21       24       233
Total   430      241      392      1,063

(e) Model M1 (for X = 3)

        Z = 1    Z = 2    Z = 3    Total
Y = 1   282      35       151      468
Y = 2   131      37       79       247
Y = 3   1,055    143      546      1,744
Total   1,468    215      776      2,459

(f) Model M2 (for X = 3)

        Z = 1    Z = 2    Z = 3    Total
Y = 1   40       87       354      481
Y = 2   119      104      27       250
Y = 3   305      1,049    372      1,726
Total   464      1,240    753      2,457
