
Consistency of Learning Bayesian Network Structures with Continuous Variables: An Information Theoretic Approach

Joe Suzuki
Department of Mathematics, Graduate School of Science, Osaka University, Toyonaka-shi 560-0043, Japan
Entropy 2015, 17(8), 5752-5770; https://doi.org/10.3390/e17085752
Submission received: 30 April 2015 / Revised: 30 April 2015 / Accepted: 5 August 2015 / Published: 10 August 2015
(This article belongs to the Special Issue Dynamical Equations and Causal Structures from Observations)

Abstract

We consider the problem of learning a Bayesian network structure given n examples and the prior probability, based on maximizing the posterior probability. We propose an algorithm that runs in $O(n \log n)$ time and that addresses continuous variables and discrete variables without assuming any class of distribution. We prove that the decision is strongly consistent, i.e., correct with probability one as $n \to \infty$. To date, consistency has only been obtained for discrete variables for this class of problem, and many authors have attempted to prove consistency when continuous variables are present. Furthermore, we prove that the $\log n$ term that appears in the penalty term of the description length can be replaced by $2(1+\epsilon)\log\log n$ to obtain strong consistency, where $\epsilon > 0$ is arbitrary, which implies that the Hannan–Quinn proposition holds.


1. Introduction

In this paper, we address the problem of learning a Bayesian network structure from examples.
For sets A, B, C of random variables, we say that A and B are conditionally independent given C if the conditional probability of A and B given C is the product of the conditional probabilities of A given C and of B given C. A Bayesian network (BN) is a graphical model that expresses conditional independence (CI) relations among the prepared variables using a directed acyclic graph (DAG). We define a BN by a DAG with vertices $V = \{1, \ldots, N\}$ and directed edges $E = \{(j, i) \mid i \in V,\, j \in \pi(i)\}$, where edge $(j, k) \in V^2$ is directed from j to k, via minimal parent sets $\pi(i) \subseteq V$, $i \in V$, such that the distribution factorizes as:
$$P(X^{(1)}, \ldots, X^{(N)}) = \prod_{i=1}^{N} P\big(X^{(i)} \mid \{X^{(j)}\}_{j \in \pi(i)}\big).$$
First, suppose that we wish to know whether two random binary variables X and Y are independent (hereafter, we write $X \perp Y$). If we have n pairs of actually emitted examples $(X = x_1, Y = y_1), \ldots, (X = x_n, Y = y_n)$ and know the prior probability p of $X \perp Y$, then it would be reasonable to maximize the posterior probability of $X \perp Y$ given $x^n = (x_1, \ldots, x_n)$ and $y^n = (y_1, \ldots, y_n)$. If we assume that the probabilities $P(X = x)$, $P(Y = y)$ and $P(X = x, Y = y)$ are parameterized by $p(x \mid \theta_X)$, $p(y \mid \theta_Y)$ and $p(x, y \mid \theta_{XY})$, and that the prior probabilities $W_X$, $W_Y$ and $W_{XY}$ over the probabilities $\theta_X$, $\theta_Y$ and $\theta_{XY}$ of $X \in \{0,1\}$, $Y \in \{0,1\}$ and $(X, Y) \in \{0,1\}^2$ are available, respectively, then we can construct the quantities:
$$Q_X^n(x^n) := \int \prod_{i=1}^{n} p(x_i \mid \theta_X)\, W_X(d\theta_X), \quad Q_Y^n(y^n) := \int \prod_{i=1}^{n} p(y_i \mid \theta_Y)\, W_Y(d\theta_Y), \quad Q_{XY}^n(x^n, y^n) := \int \prod_{i=1}^{n} p(x_i, y_i \mid \theta_{XY})\, W_{XY}(d\theta_{XY}).$$
In this setting, maximizing the posterior probability of $X \perp Y$ given the examples $x^n, y^n$ w.r.t. the prior probability p is equivalent to deciding $X \perp Y$ if and only if:
$$p\, Q_X^n(x^n)\, Q_Y^n(y^n) \ge (1-p)\, Q_{XY}^n(x^n, y^n). \quad (1)$$
The decision based on (1) is strongly consistent, i.e., it is correct with probability one as $n \to \infty$ [1] (see Section 3.1 for the proof). We say that a model selection procedure satisfies weak consistency if the probability of choosing the correct model goes to one as n grows (convergence in probability) and that it satisfies strong consistency if probability one is assigned to the set of infinite example sequences that choose the correct model except for at most finitely many times (almost sure convergence). In general, strong consistency implies weak consistency, but the converse is not true [2]. In any model selection, in particular for large n, the correct answer is required. If continuous variables are present, learning the BN structure is not easy, and strong consistency is hard to obtain.
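To make the decision rule (1) concrete, the following is a minimal sketch (our own illustration, not the author's code) that evaluates both sides of (1) in log space for two binary sequences. It uses the closed-form Krichevsky–Trofimov (Dirichlet(1/2)) marginal likelihood that appears later in Section 2.1; the function names and the use of NumPy/SciPy are our assumptions.

```python
import numpy as np
from scipy.special import gammaln

def log_kt(counts):
    """Log Krichevsky-Trofimov (Dirichlet(1/2)) marginal likelihood of a
    sequence summarized by its symbol counts over a finite alphabet."""
    counts = np.asarray(counts, dtype=float)
    alpha, n = counts.size, counts.sum()
    return (gammaln(alpha / 2) + np.sum(gammaln(counts + 0.5))
            - alpha * gammaln(0.5) - gammaln(n + alpha / 2))

def decide_independent(x, y, p=0.5):
    """Decide X independent of Y iff p*Q_X^n*Q_Y^n >= (1-p)*Q_{XY}^n, i.e. rule (1)."""
    xs, ys = np.unique(x), np.unique(y)
    cx = [(x == a).sum() for a in xs]
    cy = [(y == b).sum() for b in ys]
    cxy = [((x == a) & (y == b)).sum() for a in xs for b in ys]
    return np.log(p) + log_kt(cx) + log_kt(cy) >= np.log(1 - p) + log_kt(cxy)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, 1000)
y = rng.integers(0, 2, 1000)                  # independent of x
y_dep = (x + (rng.random(1000) < 0.1)) % 2    # x flipped with probability 0.1
print(decide_independent(x, y), decide_independent(x, y_dep))  # typically: True False
```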
The same scenario applies to the case in which X and Y take values in finite sets A and B rather than $\{0,1\}$.
Next, suppose that we wish to know the factorization of three random binary variables X, Y, Z; the candidates are $P(X)P(Y)P(Z)$, $P(X)P(Y,Z)$, $P(Y)P(Z,X)$, $P(Z)P(X,Y)$, $\frac{P(X,Y)P(X,Z)}{P(X)}$, $\frac{P(X,Y)P(Y,Z)}{P(Y)}$, $\frac{P(X,Z)P(Y,Z)}{P(Z)}$, $\frac{P(Y)P(Z)P(X,Y,Z)}{P(Y,Z)}$, $\frac{P(Z)P(X)P(X,Y,Z)}{P(Z,X)}$, $\frac{P(X)P(Y)P(X,Y,Z)}{P(X,Y)}$ and $P(X,Y,Z)$. If we have n triples of actually emitted examples $(X = x_1, Y = y_1, Z = z_1), \ldots, (X = x_n, Y = y_n, Z = z_n)$ and know the prior probabilities $p_1, \ldots, p_{11}$ over the eleven factorizations, then it would be reasonable to choose the one that maximizes:
$$p_1 Q_X^n(x^n) Q_Y^n(y^n) Q_Z^n(z^n), \quad p_2 Q_X^n(x^n) Q_{YZ}^n(y^n,z^n), \quad p_3 Q_Y^n(y^n) Q_{XZ}^n(x^n,z^n), \quad p_4 Q_Z^n(z^n) Q_{XY}^n(x^n,y^n), \quad p_5 \frac{Q_{XY}^n(x^n,y^n)\, Q_{XZ}^n(x^n,z^n)}{Q_X^n(x^n)}, \quad p_6 \frac{Q_{XY}^n(x^n,y^n)\, Q_{YZ}^n(y^n,z^n)}{Q_Y^n(y^n)}, \quad p_7 \frac{Q_{XZ}^n(x^n,z^n)\, Q_{YZ}^n(y^n,z^n)}{Q_Z^n(z^n)}, \quad p_8 \frac{Q_Y^n(y^n)\, Q_Z^n(z^n)\, Q_{XYZ}^n(x^n,y^n,z^n)}{Q_{YZ}^n(y^n,z^n)}, \quad p_9 \frac{Q_Z^n(z^n)\, Q_X^n(x^n)\, Q_{XYZ}^n(x^n,y^n,z^n)}{Q_{XZ}^n(x^n,z^n)}, \quad p_{10} \frac{Q_X^n(x^n)\, Q_Y^n(y^n)\, Q_{XYZ}^n(x^n,y^n,z^n)}{Q_{XY}^n(x^n,y^n)}, \quad p_{11} Q_{XYZ}^n(x^n,y^n,z^n),$$
to maximize the posterior probability of the factorization given $x^n = (x_1, \ldots, x_n)$, $y^n = (y_1, \ldots, y_n)$ and $z^n = (z_1, \ldots, z_n)$. For example, between the last two factorizations, we choose the last if and only if:
$$p_{10}\, Q_X^n(x^n)\, Q_Y^n(y^n) \le p_{11}\, Q_{XY}^n(x^n, y^n).$$
In fact, for example, we can check that the factorizations:
$$P(Y)P(X \mid Y)P(Z \mid X), \quad P(X)P(Y \mid X)P(Z \mid X), \quad P(Z)P(X \mid Z)P(Y \mid X)$$
in Figure 1a–c share the same form $\frac{P(X,Y)P(X,Z)}{P(X)}$, and we say that they belong to the same Markov-equivalent class. On the other hand, the factorization
$$P(Y)P(Z)P(X \mid Y, Z) = \frac{P(X,Y,Z)\,P(Y)\,P(Z)}{P(Y,Z)}$$
in Figure 1d is Markov equivalent only to itself. In the case of three variables, there are 25 DAGs, but they reduce to the eleven Markov-equivalent classes.
Figure 1. Markov-equivalent classes (a–d).
The method that maximizes the posterior probability is strongly consistent [1] (see Section 3.1 for the proof), and the scenario with two and three variables above can be extended to N variables in a straightforward manner if the variables are discrete.
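As a small illustration of how the posterior comparison scales from two to three variables (our own sketch; the dictionary keys, the uniform prior and the function name are assumptions), one can form the eleven candidate log scores from the seven local scores $\log Q_X^n, \ldots, \log Q_{XYZ}^n$ and select the largest:

```python
import numpy as np

def best_factorization(q, log_prior=None):
    """q maps 'x','y','z','xy','xz','yz','xyz' to the local log scores log Q^n;
    returns the index (1..11) of the factorization maximizing prior * score."""
    scores = [
        q['x'] + q['y'] + q['z'],              # 1: P(X)P(Y)P(Z)
        q['x'] + q['yz'],                      # 2: P(X)P(Y,Z)
        q['y'] + q['xz'],                      # 3: P(Y)P(Z,X)
        q['z'] + q['xy'],                      # 4: P(Z)P(X,Y)
        q['xy'] + q['xz'] - q['x'],            # 5: P(X,Y)P(X,Z)/P(X)
        q['xy'] + q['yz'] - q['y'],            # 6: P(X,Y)P(Y,Z)/P(Y)
        q['xz'] + q['yz'] - q['z'],            # 7: P(X,Z)P(Y,Z)/P(Z)
        q['y'] + q['z'] + q['xyz'] - q['yz'],  # 8: P(Y)P(Z)P(X,Y,Z)/P(Y,Z)
        q['z'] + q['x'] + q['xyz'] - q['xz'],  # 9: P(Z)P(X)P(X,Y,Z)/P(Z,X)
        q['x'] + q['y'] + q['xyz'] - q['xy'],  # 10: P(X)P(Y)P(X,Y,Z)/P(X,Y)
        q['xyz'],                              # 11: P(X,Y,Z)
    ]
    if log_prior is None:
        log_prior = np.full(11, -np.log(11.0))  # uniform prior p_1 = ... = p_11
    return int(np.argmax(np.asarray(scores) + log_prior)) + 1
```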
In this paper, we consider the case in which continuous variables are present. The idea is to construct measures $g_X^n(x^n)$, $g_Y^n(y^n)$ and $g_{XY}^n(x^n,y^n)$ over $\mathcal{X}^n$, $\mathcal{Y}^n$ and $\mathcal{X}^n \times \mathcal{Y}^n$ for continuous ranges $\mathcal{X}$ and $\mathcal{Y}$, and to make the decision whether $X \perp Y$ based on:
$$p\, g_X^n(x^n)\, g_Y^n(y^n) \ge (1-p)\, g_{XY}^n(x^n, y^n). \quad (2)$$
The main problem is whether this decision is strongly consistent. Many authors have attempted to address continuous variables. For example, Friedman et al. [3] experimentally demonstrated the construction of a genetic network based on expression data using the EM algorithm. However, the variables were assumed to be linearly related with Gaussian noise, and the dataset did not fit the model well. Imoto et al. [4] improved the model such that the relations are expressed by B-spline curves rather than lines. However, none of these authors, including Friedman and Imoto, maximized the posterior probability, and thus their decisions are not consistent. This paper proves that the decision based on (2) and its extension to general $N \ge 2$ is strongly consistent.
In any Bayesian approach to BN structure learning, whether or not continuous variables are present, the procedure consists of two stages:
(1)
Compute the local scores for the nonempty subsets of $\{X^{(1)}, \ldots, X^{(N)}\}$; for example, if $N = 3$, the seven quantities $Q_X^n(x^n), \ldots, Q_{XYZ}^n(x^n, y^n, z^n)$ are obtained; and
(2)
Find a BN structure that maximizes the global score among the $M(N)$ candidate BN structures; at most $3^{\binom{N}{2}}$ DAGs exist in the case of N variables, since each of the $\binom{N}{2}$ vertex pairs is either unconnected or connected in one of two directions; for example, if $N = 3$, the eleven quantities are computed and a structure with the largest score is chosen.
Note that the second stage does not depend on whether each variable is continuous or not. In this paper, we mainly discuss the performance of the first stage. The number of local scores to be computed can be reduced, although it is generally exponential in N; we consider this problem in Section 3.3.
On the other hand, Zhang, Peters, Janzing and Scholkopf [5] proposed a BN structure learning method using conditional independence (CI) tests based on kernel statistics. However, for a CI test close to the Hilbert–Schmidt independence criterion (HSIC), it is very hard to simulate the null distribution. They only proposed to approximate it by a Gamma distribution, and no consistency is obtained because the threshold of the statistical test is not exact in practice. Furthermore, the independence-test approach often results in conflicting assertions of independence for finite samples. In particular, for small samples, the obtained DAG sometimes contains a directed loop. The Bayesian approach we consider in this paper does not suffer from this inconvenience, because we seek a structure that maximizes the global score [6].
Another contribution of this paper is identifying the border between consistency and non-consistency in learning Bayesian networks. For discrete X, maximizing $Q_X^n(x^n)$ is equivalent to minimizing the description length [1]:
$$-\log Q_X^n(x^n) \approx H^n(x^n) + \frac{\alpha - 1}{2}\log n, \quad (3)$$
where $H^n(x^n)$ is the empirical entropy of $x^n \in \mathcal{X}^n$ (we write $A \approx B$ when $|A - B|$ is bounded by a constant) and α is the cardinality of the set $\mathcal{X}$. The problem at hand is whether the $\log n$ term is the minimum function of n that ensures strong consistency. If $\log n$ is replaced by two (AIC [17]), we cannot obtain consistency. We prove that $2(1+\epsilon)\log\log n$ with $\epsilon > 0$ is the minimum for strong consistency, based on the law of the iterated logarithm. The same property is known as the Hannan–Quinn principle [7], and similar results have been obtained for autoregression, linear regression [8] and classification [9], among others. The derivation in this paper does not depend on these previous results. The Hannan–Quinn principle will also be applied to continuous variables.
This paper is organized as follows. Section 2.1 introduces the general concept of learning Bayesian network structures based on maximizing the posterior probability, and Section 2.2 discusses the concept of density functions developed by Boris Ryabko [10] and extended by Suzuki [11]. Section 3 presents our contributions: Section 3.1 proves the Hannan–Quinn property in the current problem, and Section 3.2 proves consistency when continuous variables are present. Section 4 concludes the paper by summarizing the results and stating the paper's significance in the field of model selection.

2. Preliminaries

2.1. Learning the Bayesian Structure for Discrete Variables and Its Consistency

We choose $w_X$ such that $\int w_X(\theta)\, d\theta = 1$ and $0 \le \theta(x) \le 1$ by $w_X(\theta) \propto \prod_{x \in \mathcal{X}} \theta(x)^{-1/2}$, where $\mathcal{X}$ is the set from which X takes its values. Let $\alpha = |\mathcal{X}|$, and let $c_i(x)$ be the frequency of $x \in \mathcal{X}$ in $x^i = (x_1, \ldots, x_i) \in \mathcal{X}^i$, $i = 1, \ldots, n$. It is known that the following quantity satisfies (3) [12]:
$$Q_X^n(x^n) := \prod_{i=1}^{n} \frac{c_{i-1}(x_i) + 1/2}{i - 1 + |\mathcal{X}|/2} = \frac{\Gamma(\alpha/2) \prod_{x \in \mathcal{X}} \Gamma(c_n(x) + 1/2)}{\Gamma(1/2)^{\alpha}\, \Gamma(n + \alpha/2)},$$
where Γ is the Gamma function, and Stirling's formula $\Gamma(z) = \sqrt{2\pi/z}\, (z/e)^z \{1 + O(z^{-1/3})\}$ has been applied. Thus, for $x \in \mathcal{X}$, from the law of large numbers, $c_n(x)/n$ converges to $P(X = x)$ with probability one as $n \to \infty$, such that:
$$-\frac{1}{n}\log Q_X^n(x^n) \to H(X) := \sum_{x \in \mathcal{X}} -P(X = x)\log P(X = x)$$
with probability one as $n \to \infty$.
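The product form of $Q_X^n$ above can be accumulated in a single pass over the data. The following sketch (ours; names assumed) does so and illustrates numerically that $-\frac{1}{n}\log Q_X^n(x^n)$ approaches $H(X)$ for a biased coin:

```python
import numpy as np

def log_q_sequential(x, alphabet):
    """log Q_X^n(x^n) via prod_i (c_{i-1}(x_i) + 1/2) / (i - 1 + |X|/2)."""
    counts = {a: 0 for a in alphabet}
    alpha, logq = len(alphabet), 0.0
    for i, xi in enumerate(x, start=1):
        logq += np.log((counts[xi] + 0.5) / (i - 1 + alpha / 2))
        counts[xi] += 1
    return logq

rng = np.random.default_rng(1)
x = rng.choice([0, 1], size=100_000, p=[0.8, 0.2])
h_true = -(0.8 * np.log(0.8) + 0.2 * np.log(0.2))      # H(X) in nats
print(-log_q_sequential(x, [0, 1]) / len(x), h_true)   # the two values are close
```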
Moreover, from the law of large numbers, with probability one as $n \to \infty$,
$$-\frac{1}{n}\log P(X^n = x^n) = \frac{1}{n}\sum_{i=1}^{n}\{-\log P(X = x_i)\} \to E[-\log P(X)] = H(X)$$
(Shannon–McMillan–Breiman [13]). This proves that there exists a $Q_X^n$ (a universal measure) such that, for any probability P over the finite set $\mathcal{X}$,
$$\frac{1}{n}\log\frac{P^n(x^n)}{Q^n(x^n)} \to 0 \quad (4)$$
with probability one as $n \to \infty$, where we write $P^n(x^n) := P(X^n = x^n)$. The same property holds for:
$$-\log Q_Y^n(y^n) \approx H^n(y^n) + \frac{\beta - 1}{2}\log n,$$
and:
$$-\log Q_{XY}^n(x^n, y^n) \approx H^n(x^n, y^n) + \frac{\alpha\beta - 1}{2}\log n,$$
where $\beta = |\mathcal{Y}|$, $H^n(y^n) = \sum_{y \in \mathcal{Y}} -c_n(y)\log\frac{c_n(y)}{n}$ and $H^n(x^n, y^n) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} -c_n(x,y)\log\frac{c_n(x,y)}{n}$ are the empirical entropies of $y^n \in \mathcal{Y}^n$ and $(x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n$, and $c_n(y)$ and $c_n(x,y)$ are the numbers of occurrences of $y \in \mathcal{Y}$ and $(x, y) \in \mathcal{X} \times \mathcal{Y}$ in $y^n = (y_1, \ldots, y_n) \in \mathcal{Y}^n$ and $(x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n$, respectively.
Thus, we have:
$$J_n(x^n, y^n) := \frac{1}{n}\log\frac{Q_{XY}^n(x^n, y^n)}{Q_X^n(x^n)\, Q_Y^n(y^n)} \to I(X, Y) := E\left[\log\frac{P(X, Y)}{P(X)P(Y)}\right]$$
with probability one as $n \to \infty$. Now, $X \perp Y$ if and only if $I(X, Y) = 0$. Hence, if $X \not\perp Y$, the value of $J_n(x^n, y^n)$ is positive with probability one as $n \to \infty$. But how can we detect $X \perp Y$ when it actually holds? $J_n(x^n, y^n)$ cannot be exactly zero with probability one as $n \to \infty$.
Nevertheless, when X and Y are discrete, the estimation based on $J_n(x^n, y^n)$ is consistent: if $X \perp Y$, the value of $J_n(x^n, y^n)$ is not greater than zero with probability one as $n \to \infty$. For example, the decision based on (1) is strongly consistent because the values of $\frac{1}{n}\log p$ and $\frac{1}{n}\log(1-p)$ are negligible for large n, and asymptotically, (1) is equivalent to $J_n(x^n, y^n) \le 0$.
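Combining the description-length expansion of $-\log Q_X^n$ with the analogous ones for $Q_Y^n$ and $Q_{XY}^n$, the condition $J_n(x^n, y^n) \le 0$ amounts, up to bounded terms, to the (unnormalized) empirical mutual information not exceeding the penalty $\frac{(\alpha-1)(\beta-1)}{2}\log n$. The following sketch (ours; names assumed) makes that comparison directly:

```python
import numpy as np

def empirical_mi_times_n(x, y):
    """n * I_n(x^n, y^n) = H^n(x^n) + H^n(y^n) - H^n(x^n, y^n), in nats."""
    n = len(x)
    def h(counts):
        c = np.asarray(counts, dtype=float)
        c = c[c > 0]
        return -np.sum(c * np.log(c / n))
    xs, ys = np.unique(x), np.unique(y)
    cx = [(x == a).sum() for a in xs]
    cy = [(y == b).sum() for b in ys]
    cxy = [((x == a) & (y == b)).sum() for a in xs for b in ys]
    return h(cx) + h(cy) - h(cxy), len(xs), len(ys)

def mdl_independent(x, y):
    """Decide X independent of Y iff n*I_n <= ((alpha-1)(beta-1)/2) log n."""
    nI, alpha, beta = empirical_mi_times_n(x, y)
    return nI <= 0.5 * (alpha - 1) * (beta - 1) * np.log(len(x))
```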
In Section 3.1, we provide a stronger result of consistency and a more intuitive and elegant proof.
In general, if N variables exist ($N \ge 2$), we must consider two cases: $D(P^* \| P) > 0$ and $D(P^* \| P) = 0$, where $P^*$ and P are the probabilities based on the correct and estimated factorizations and $D(P^* \| P)$ denotes the Kullback–Leibler divergence between $P^*$ and P. If $N = 2$, then:
$$D(P^* \| P) := \sum_x\sum_y P^*(x, y)\log\frac{P^*(x, y)}{P(x, y)} > 0$$
if and only if $X \not\perp Y$ in $P^*$ and $X \perp Y$ in P.
The same property holds for three variables X, Y, Z ($N = 3$):
$$J_n(x^n, y^n, z^n) := \frac{1}{n}\log\frac{Q_{XYZ}^n(x^n, y^n, z^n)\, Q_Z^n(z^n)}{Q_{XZ}^n(x^n, z^n)\, Q_{YZ}^n(y^n, z^n)} \to I(X, Y, Z) := E\left[\log\frac{P(X, Y, Z)\, P(Z)}{P(X, Z)\, P(Y, Z)}\right]$$
with probability one as $n \to \infty$, and $X \perp Y \mid Z$ if and only if $I(X, Y, Z) = 0$. Then, we can show that $J_n(x^n, y^n, z^n) \le 0$ if and only if $I(X, Y, Z) = 0$, with probability one as $n \to \infty$ (see Section 3.1). For example, between the seventh and eleventh factorizations, if $J_n(x^n, y^n, z^n) \le 0$ and $J_n(x^n, y^n, z^n) > 0$, then we choose the seventh and eleventh, respectively. In fact,
$$p_7\, \frac{Q_{XZ}^n(x^n, z^n)\, Q_{YZ}^n(y^n, z^n)}{Q_Z^n(z^n)} \ge p_{11}\, Q_{XYZ}^n(x^n, y^n, z^n) \iff J_n(x^n, y^n, z^n) \le 0$$
for large n, because $\frac{1}{n}\log\frac{p_7}{p_{11}}$ diminishes.
Then, the decision is correct with probability one as $n \to \infty$. Similarly, we calculate:
$$-\log Q_Z^n(z^n) \approx H^n(z^n) + \frac{\gamma - 1}{2}\log n,$$
$$-\log Q_{YZ}^n(y^n, z^n) \approx H^n(y^n, z^n) + \frac{\beta\gamma - 1}{2}\log n,$$
$$-\log Q_{ZX}^n(z^n, x^n) \approx H^n(z^n, x^n) + \frac{\gamma\alpha - 1}{2}\log n,$$
and:
$$-\log Q_{XYZ}^n(x^n, y^n, z^n) \approx H^n(x^n, y^n, z^n) + \frac{\alpha\beta\gamma - 1}{2}\log n,$$
where $\gamma = |\mathcal{Z}|$. In general, for N variables, given P and $P^*$, we have all of the CI statements for each of them, and $D(P^* \| P) = 0$ if and only if the CI statements in P imply those in $P^*$; in other words, P induces an I-map, which is not necessarily minimal.
Note that for any subsets a, b, c of $\{1, \ldots, N\}$, we can construct the estimator $J_n(x^n, y^n, z^n)$ with $X = \{X^{(i)}\}_{i \in a}$, $Y = \{X^{(j)}\}_{j \in b}$, $Z = \{X^{(k)}\}_{k \in c}$, and obtain consistency, i.e., we will have the correct CI statements, where c may be empty.
Table 1 depicts whether $D(P^* \| P) > 0$ or $D(P^* \| P) = 0$ for each $P^*$ and P. For example, if the factorizations of $P^*$ and P are the fourth and sixth, then $D(P^* \| P) = 0$ from the table. In general, $D(P^* \| P) = 0$ if and only if $P^*$ is realized using the factorization of P and an appropriate parameter set.
Table 1. Three-variable case: "+" and "0" denote $D(P^* \| P) > 0$ and $D(P^* \| P) = 0$, respectively (rows: true $P^*$; columns: estimated P).

True P* \ Estimated P :  1  2  3  4  5  6  7  8  9 10 11
 1                       *  0  0  0  0  0  0  0  0  0  0
 2                       +  *  +  +  +  0  0  +  +  +  0
 3                       +  +  *  +  0  +  0  +  +  +  0
 4                       +  +  +  *  0  0  +  +  +  +  0
 5                       +  +  +  +  *  +  +  +  +  +  0
 6                       +  +  +  +  +  *  +  +  +  +  0
 7                       +  +  +  +  +  +  *  +  +  +  0
 8                       +  +  +  +  +  +  +  *  +  +  0
 9                       +  +  +  +  +  +  +  +  *  +  0
10                       +  +  +  +  +  +  +  +  +  *  0
11                       +  +  +  +  +  +  +  +  +  +  *

2.2. Universal Measures for Continuous Variables

In this section, we primarily address continuous variables.
Let $\{A_j\}$ be such that $A_0 = \{\mathcal{X}\}$, and let $A_{j+1}$ be a refinement of $A_j$. For example, suppose that the random variable X takes values in $\mathcal{X} = [0, 1]$, and we generate a sequence as follows:
$$A_1 = \Big\{\big[0, \tfrac{1}{2}\big), \big[\tfrac{1}{2}, 1\big)\Big\}, \quad A_2 = \Big\{\big[0, \tfrac{1}{4}\big), \big[\tfrac{1}{4}, \tfrac{1}{2}\big), \big[\tfrac{1}{2}, \tfrac{3}{4}\big), \big[\tfrac{3}{4}, 1\big)\Big\}, \quad \ldots, \quad A_j = \Big\{\big[0, 2^{-j}\big), \big[2^{-j}, 2\cdot 2^{-j}\big), \ldots, \big[(2^{j}-1)\,2^{-j}, 1\big)\Big\}, \ \ldots$$
For each j, we quantize each $x \in [0, 1]$ into the $a \in A_j$ such that $x \in a$. For example, for $j = 2$, $x = 0.4$ is quantized into $a = [\tfrac{1}{4}, \tfrac{1}{2}) \in A_2$. Let λ be the Lebesgue measure (the width of the interval). For example, $\lambda([\tfrac{1}{4}, \tfrac{1}{2})) = \tfrac{1}{4}$ and $\lambda(\{\tfrac{1}{2}\}) = 0$.
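A minimal sketch of the dyadic quantization of $[0, 1]$ (ours; the function name is an assumption): at level j, each x is mapped to the interval $a \in A_j$ containing it, whose Lebesgue measure is $2^{-j}$.

```python
def quantize(x, j):
    """Return the level-j dyadic interval [l, r) of width 2**(-j) containing x in [0, 1]."""
    m = 1 << j                      # |A_j| = 2**j bins at level j
    idx = min(int(x * m), m - 1)    # x = 1.0 falls into the last bin
    return idx / m, (idx + 1) / m   # (l, r); the Lebesgue measure is r - l = 2**(-j)

print(quantize(0.4, 2))  # (0.25, 0.5): the example a = [1/4, 1/2) in A_2
```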
Note that each $A_j$ is a finite set. Therefore, we can construct a universal measure $Q_j^n$ w.r.t. the finite set $A_j$ for each j. Given $x^n = (x_1, \ldots, x_n) \in [0, 1]^n$, we obtain a quantized sequence $(a_1^{(j)}, \ldots, a_n^{(j)}) \in A_j^n$ for each j and use it to compute the quantity:
$$g_j^n(x^n) := \frac{Q_j^n(a_1^{(j)}, \ldots, a_n^{(j)})}{\lambda(a_1^{(j)})\cdots\lambda(a_n^{(j)})}$$
for each j. If we prepare a sequence of positive reals $w_1, w_2, \ldots$ such that $\sum_j w_j = 1$ and $w_j > 0$, we can compute the quantity:
$$g_X^n(x^n) := \sum_{j=1}^{\infty} w_j\, g_j^n(x^n).$$
Moreover, let $f_X$ be the true density function and $f_j(x) := P(X \in a)/\lambda(a)$ for $a \in A_j$, $j = 1, 2, \ldots$, if $x \in a$. We may consider $f_j$ to be an approximate density function under the quantization sequence $\{A_j\}$ (Figure 2). For the given $x^n$, we define $f_X^n(x^n) := f_X(x_1)\cdots f_X(x_n)$ and $f_j^n(x^n) := f_j(x_1)\cdots f_j(x_n)$.
Figure 2. Quantization at level j: $x^n = (x_1, \ldots, x_n) \mapsto (a_1^{(j)}, \ldots, a_n^{(j)})$.
Thus, we have the following proposition, which is a continuous version of the universality (4) proven in Section 2.1.
Proposition 1 ([10]). For any density function $f_X$ such that $D(f_X \| f_j) \to 0$ as $j \to \infty$,
$$\frac{1}{n}\log\frac{f_X^n(x^n)}{g_X^n(x^n)} \to 0$$
as $n \to \infty$ with probability one, where $D(f_X \| f_j)$ is the Kullback–Leibler divergence between $f_X$ and $f_j$.
The same concept applies to the case in which no density function exists in the usual sense (w.r.t. the Lebesgue measure λ) [11]. For example, suppose that we wish to estimate a distribution over the positive integers $\mathbb{N}$. Clearly, $\mathbb{N}$ is not a finite set and has no density function. We consider the quantization sequence $\{B_k\}$: $B_0 = \{\mathbb{N}\}$, $B_1 := \{\{1\}, \{2, 3, \ldots\}\}$, $B_2 := \{\{1\}, \{2\}, \{3, 4, \ldots\}\}$, ..., $B_k := \{\{1\}, \{2\}, \ldots, \{k\}, \{k+1, k+2, \ldots\}\}$, ....
For each k, we quantize each $y \in \mathbb{N}$ into the $b \in B_k$ such that $y \in b$. For example, for $k = 2$, $y = 4$ is quantized into $b = \{3, 4, \ldots\} \in B_2$. Let η be a measure such that:
$$\eta(\{k\}) = \frac{1}{k} - \frac{1}{k+1}, \quad k \in \mathbb{N}.$$
The measure η(a) of an interval a of integers is:
$$\eta(a) = \sum_{k \in a}\eta(\{k\}) = \sum_{k \in a}\Big(\frac{1}{k} - \frac{1}{k+1}\Big) = \frac{1}{k_{\min}} - \frac{1}{k_{\max}+1}$$
if $k_{\min}$ and $k_{\max}$ are the minimum and maximum integers in a; this evaluates each bin width in a nonstandard way. For example, $\eta(\{2\}) = \tfrac{1}{6}$ and $\eta(\{3, 4\}) = \tfrac{2}{15}$. For multiple variables, we compute the measure by:
$$\eta(\{j\}, \{k\}) = \Big(\frac{1}{j} - \frac{1}{j+1}\Big)\Big(\frac{1}{k} - \frac{1}{k+1}\Big).$$
Note that each $B_k$ is a finite set, and we construct a universal measure $Q_k^n$ w.r.t. the finite set $B_k$ for each k. Given $y^n = (y_1, \ldots, y_n) \in \mathbb{N}^n$, we obtain a quantized sequence $(b_1^{(k)}, \ldots, b_n^{(k)}) \in B_k^n$ for each k, such that we can compute the quantity:
$$g_k^n(y^n) := \frac{Q_k^n(b_1^{(k)}, \ldots, b_n^{(k)})}{\eta(b_1^{(k)})\cdots\eta(b_n^{(k)})}$$
for each k. If we prepare a sequence of positive reals $w_1, w_2, \ldots$ such that $\sum_k w_k = 1$ and $w_k > 0$, we can compute the quantity $g_Y^n(y^n) := \sum_{k=1}^{\infty} w_k\, g_k^n(y^n)$. In this case, $f_Y(y) = \frac{P(Y = y)}{\eta(\{y\})}$ for $y \in \mathbb{N}$ ($f_Y(y)$ with $y \notin \mathbb{N}$ may take any arbitrary value) is considered to be a generalized density function (w.r.t. the measure η).
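For the integer-valued case, the measure η and the level-k quantization can be coded directly. This sketch (ours; names assumed) computes η of a bin and the generalized density $f_Y(y) = P(Y = y)/\eta(\{y\})$ for a geometric distribution:

```python
def eta_point(k):
    """eta({k}) = 1/k - 1/(k+1)."""
    return 1.0 / k - 1.0 / (k + 1)

def eta_bin(k_min, k_max):
    """eta({k_min, ..., k_max}) telescopes to 1/k_min - 1/(k_max + 1)."""
    return 1.0 / k_min - 1.0 / (k_max + 1)

def quantize_level_k(y, k):
    """Level-k bin of B_k containing y: {y} if y <= k, else the tail {k+1, k+2, ...}."""
    return (y, y) if y <= k else (k + 1, float('inf'))

p = lambda y: 0.5 ** y                 # P(Y = y) for Y ~ Geometric(1/2)
f = lambda y: p(y) / eta_point(y)      # generalized density w.r.t. eta
print(eta_bin(3, 4), quantize_level_k(4, 2), f(2))  # 2/15, the bin {3,4,...}, 1.5
```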
In general, if $\eta(D) = 0$ implies $P(Y \in D) = 0$ for the Borel sets D (the Borel sets w.r.t. $\mathbb{R}$ being the sets generated via a countable number of unions, intersections and set differences from the closed intervals of $\mathbb{R}$ [2]), we say that P is absolutely continuous w.r.t. η and that there exists a density function w.r.t. η (Radon–Nikodym [2]).
The following proposition addresses generalized densities and eliminates the condition $D(f_Y \| f_j) \to 0$ as $j \to \infty$ of Proposition 1.
Proposition 2 ([11]). For any generalized density function $f_Y$,
$$\frac{1}{n}\log\frac{f_Y^n(y^n)}{g_Y^n(y^n)} \to 0$$
as $n \to \infty$ with probability one.
Proposition 1 assumes a specific quantization sequence, such as $\{A_j\}$. The universality holds for the densities that satisfy $D(f_X \| f_k) \to 0$ as $k \to \infty$ [10]. However, in the proof of Proposition 2, a universal quantization such that $D(f_X \| f_k) \to 0$ as $k \to \infty$ for any density $f_X$ was constructed [11].

3. Contributions

3.1. The Hannan and Quinn Principle

We know that $H^n(x^n) + H^n(y^n) - H^n(x^n, y^n)$ is at most $\frac{(\alpha-1)(\beta-1)}{2}\log n$ with probability one as $n \to \infty$ when $X \perp Y$, because the decision based on (1) is strongly consistent.
In this section, we prove a stronger result: let:
$$I_n(x^n, y^n, z^n) := H^n(x^n, z^n) + H^n(y^n, z^n) - H^n(x^n, y^n, z^n) - H^n(z^n).$$
We show that the quantity $I_n(x^n, y^n, z^n)$ is at most $(\alpha-1)(\beta-1)\gamma\log\log n$ (up to the factor $1+\epsilon$) rather than $\frac{1}{2}(\alpha-1)(\beta-1)\gamma\log n$ when $X \perp Y \mid Z$:
Theorem 1. If $X \perp Y \mid Z$, then:
$$I_n(x^n, y^n, z^n) \le (1+\epsilon)(\alpha-1)(\beta-1)\gamma\log\log n$$
with probability one as $n \to \infty$, for any $\epsilon > 0$.
In order to show the claim, we approximate $I_n(x^n, y^n, z^n)$ by $\sum_{z \in \mathcal{Z}} I(z)$ with $I(z) = \frac{1}{2}\sum_{i=1}^{\alpha-1}\sum_{j=1}^{\beta-1} r_{i,j}^2$, where $r_{i,j}$, $i = 1, \ldots, \alpha-1$, $j = 1, \ldots, \beta-1$, are mutually independent random variables with mean zero and variance $\sigma_{i,j}^2$, such that:
$$\sum_{i=1}^{\alpha-1}\sum_{j=1}^{\beta-1}\sigma_{i,j}^2 = (\alpha-1)(\beta-1).$$
Then, from the law of the iterated logarithm below (Lemma 1) [2], it will be proven that $r_{i,j}^2$ is almost surely upper-bounded by $2(1+\epsilon)\sigma_{i,j}^2\log\log n$ for any $\epsilon > 0$ and each $z \in \mathcal{Z}$, which implies Theorem 1 because:
$$I_n(x^n, y^n, z^n) \approx \sum_z I(z) = \gamma\cdot\frac{1}{2}\sum_i\sum_j r_{i,j}^2 \le \gamma\cdot\frac{1}{2}\sum_i\sum_j 2(1+\epsilon)\sigma_{i,j}^2\log\log n = (1+\epsilon)(\alpha-1)(\beta-1)\gamma\log\log n$$
(see the Appendix for the details of the derivation).
Lemma 1 ([2]). Let $U_1, U_2, \ldots$ be independent random variables obeying an identical distribution with zero mean and unit variance, and let $S_n := \sum_{k=1}^{n} U_k$. Then, with probability one,
$$\limsup_{n \to \infty} \frac{S_n}{\sqrt{2n\log\log n}} = 1.$$
Theorem 1 implies the strong consistency of the decision based on (1). However, a stronger statement can be obtained:
Theorem 2. We define $R_Z^n(z^n)$, $R_{XZ}^n(x^n, z^n)$, $R_{YZ}^n(y^n, z^n)$ and $R_{XYZ}^n(x^n, y^n, z^n)$ by:
$$-\log R_Z^n(z^n) = H^n(z^n) + (1+\epsilon)(\gamma-1)\log\log n,$$
$$-\log R_{XZ}^n(x^n, z^n) = H^n(x^n, z^n) + (1+\epsilon)(\alpha\gamma-1)\log\log n,$$
$$-\log R_{YZ}^n(y^n, z^n) = H^n(y^n, z^n) + (1+\epsilon)(\beta\gamma-1)\log\log n,$$
and:
$$-\log R_{XYZ}^n(x^n, y^n, z^n) = H^n(x^n, y^n, z^n) + (1+\epsilon)(\alpha\beta\gamma-1)\log\log n.$$
Then, the decision based on:
$$R_{XZ}^n(x^n, z^n)\, R_{YZ}^n(y^n, z^n) \ge R_{XYZ}^n(x^n, y^n, z^n)\, R_Z^n(z^n) \;\Longrightarrow\; X \perp Y \mid Z$$
is strongly consistent.
Proof. We note two properties:
  • $R_{XZ}^n(x^n, z^n)\, R_{YZ}^n(y^n, z^n) \ge R_{XYZ}^n(x^n, y^n, z^n)\, R_Z^n(z^n)$ is equivalent to the inequality in Theorem 1; and
  • $\lim_{n\to\infty}\frac{1}{n}\log\frac{R_{XYZ}^n(x^n, y^n, z^n)\, R_Z^n(z^n)}{R_{XZ}^n(x^n, z^n)\, R_{YZ}^n(y^n, z^n)} = \lim_{n\to\infty}\frac{1}{n}\log\frac{Q_{XYZ}^n(x^n, y^n, z^n)\, Q_Z^n(z^n)}{Q_{XZ}^n(x^n, z^n)\, Q_{YZ}^n(y^n, z^n)} = I(X, Y, Z)$.
If $X \perp Y \mid Z$, then from Theorem 1 and the first property, we have $R_{XZ}^n(x^n, z^n)\, R_{YZ}^n(y^n, z^n) \ge R_{XYZ}^n(x^n, y^n, z^n)\, R_Z^n(z^n)$ almost surely. Conversely, if $R_{XZ}^n(x^n, z^n)\, R_{YZ}^n(y^n, z^n) \ge R_{XYZ}^n(x^n, y^n, z^n)\, R_Z^n(z^n)$ holds almost surely, then the limit in the second property must be no greater than zero, which means that $X \perp Y \mid Z$. This completes the proof. ☐
Theorem 2 is related to the Hannan and Quinn theorem [7] for model selection: to obtain strong consistency, they proved that $\log\log n$ rather than $\frac{1}{2}\log n$ is sufficient for the penalty term of autoregressive order selection. Recently, several authors have proven analogous results in other settings, such as classification [9] and linear regression [8].
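A sketch of the decision of Theorem 2 (ours; function names assumed): replace the $\frac{\cdot-1}{2}\log n$ penalties by $(1+\epsilon)(\cdot-1)\log\log n$ and compare the penalized description lengths, which is equivalent to thresholding $I_n$ as in Theorem 1.

```python
import numpy as np

def empirical_entropy(counts, n):
    """Unnormalized empirical entropy H^n = -sum_a c(a) log(c(a)/n)."""
    c = np.asarray(counts, dtype=float)
    c = c[c > 0]
    return -np.sum(c * np.log(c / n))

def hq_decision(Hz, Hxz, Hyz, Hxyz, alpha, beta, gamma, n, eps=0.1):
    """Decide X independent of Y given Z iff R_XZ R_YZ >= R_XYZ R_Z, i.e.
    I_n <= (1+eps)(alpha-1)(beta-1)gamma log log n.  The H arguments are the
    unnormalized empirical entropies H^n(.) of Section 2.1; n must exceed e."""
    pen = lambda card: (1 + eps) * (card - 1) * np.log(np.log(n))
    lhs = -(Hxz + pen(alpha * gamma)) - (Hyz + pen(beta * gamma))    # log R_XZ + log R_YZ
    rhs = -(Hxyz + pen(alpha * beta * gamma)) - (Hz + pen(gamma))    # log R_XYZ + log R_Z
    return lhs >= rhs
```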

3.2. Consistency for Continuous Variables

Suppose that we wish to estimate the distribution over $[0, 1] \times \mathbb{N}$ in Section 2.2. The set $[0, 1] \times \mathbb{N}$ is not a finite set and has no density function.
Because $A_j \times B_k$ is a finite set, we can construct a universal measure $Q_{j,k}^n$ for $A_j \times B_k$:
$$g_{jk}^n(x^n, y^n) := \frac{Q_{j,k}^n(a_1^{(j)}, \ldots, a_n^{(j)}, b_1^{(k)}, \ldots, b_n^{(k)})}{\lambda(a_1^{(j)})\cdots\lambda(a_n^{(j)})\,\eta(b_1^{(k)})\cdots\eta(b_n^{(k)})}.$$
If we prepare weights such that $\sum_j\sum_k w_{jk} = 1$ and $w_{jk} > 0$, we obtain the quantity:
$$g_{XY}^n(x^n, y^n) := \sum_{j=1}^{\infty}\sum_{k=1}^{\infty} w_{j,k}\, g_{jk}^n(x^n, y^n).$$
In this case, the (generalized) density function is obtained via:
$$f_{XY}(x, y) = \frac{dF_X(x \mid y)}{dx}\cdot\frac{P(Y = y)}{\eta(\{y\})}$$
for $y \in \mathbb{N}$ ($f_{XY}$ takes arbitrary values for $x \in [0, 1]$ and $y \notin \mathbb{N}$), where $F_X(\cdot \mid y)$ is the conditional distribution function of X given $Y = y$.
In general, we have the following result:
Proposition 3. For any generalized density function $f_{XY}$:
$$\frac{1}{n}\log\frac{f_{XY}^n(x^n, y^n)}{g_{XY}^n(x^n, y^n)} \to 0$$
as $n \to \infty$ with probability one.
The measures $g_X^n(x^n)$ and $g_{XY}^n(x^n, y^n)$ are computed using (A) and (B) of Algorithm 1, where K is the number of quantization levels, and $\hat g_X^n(x^n)$ and $\hat g_{XY}^n(x^n, y^n)$ denote the approximated scores based on finite quantization up to level K.
Algorithm 1: Calculating $g^n$.
(A) Input: $x^n \in \mathcal{X}^n$. Output: $\hat g_X^n(x^n)$.
1. For each $k = 1, \ldots, K$: $\log g_k^n(x^n) := 0$.
2. For each $k = 1, \ldots, K$ and each $a \in A_k$: $c_k(a) := 0$.
3. For each $i = 1, \ldots, n$:
 (a) $A_0 = \mathcal{X}$, $a_i^{(0)} = x_i$;
 (b) for each $k = 1, \ldots, K$:
   i. find $a_i^{(k)} \in A_k$ from $a_i^{(k-1)} \in A_{k-1}$;
   ii. $\log g_k^n(x^n) := \log g_k^n(x^n) + \log\dfrac{c_k(a_i^{(k)}) + 1/2}{i - 1 + |A_k|/2} - \log\eta_X(a_i^{(k)})$;
   iii. $c_k(a_i^{(k)}) := c_k(a_i^{(k)}) + 1$.
4. $\hat g_X^n(x^n) := \sum_{k=1}^{K}\frac{1}{K}\, g_k^n(x^n)$.
(B) Input: $x^n \in \mathcal{X}^n$ and $y^n \in \mathcal{Y}^n$. Output: $\hat g_{XY}^n(x^n, y^n)$.
1. For each $j, k = 1, \ldots, K$: $\log g_{j,k}^n(x^n, y^n) := 0$.
2. For each $j, k = 1, \ldots, K$ and each $a \in A_j$ and $b \in B_k$: $c_{j,k}(a, b) := 0$.
3. For each $i = 1, \ldots, n$:
 (a) $A_0 = \mathcal{X}$, $B_0 = \mathcal{Y}$, $a_i^{(0)} = x_i$, $b_i^{(0)} = y_i$;
 (b) for each $j, k = 1, \ldots, K$:
   i. find $a_i^{(j)} \in A_j$ and $b_i^{(k)} \in B_k$ from $a_i^{(j-1)} \in A_{j-1}$ and $b_i^{(k-1)} \in B_{k-1}$;
   ii. $\log g_{j,k}^n(x^n, y^n) := \log g_{j,k}^n(x^n, y^n) + \log\dfrac{c_{j,k}(a_i^{(j)}, b_i^{(k)}) + 1/2}{i - 1 + |A_j||B_k|/2} - \log\big(\eta_X(a_i^{(j)})\,\eta_Y(b_i^{(k)})\big)$;
   iii. $c_{j,k}(a_i^{(j)}, b_i^{(k)}) := c_{j,k}(a_i^{(j)}, b_i^{(k)}) + 1$.
4. $\hat g_{XY}^n(x^n, y^n) := \sum_{j=1}^{K}\sum_{k=1}^{K}\frac{1}{K^2}\, g_{j,k}^n(x^n, y^n)$.
Propositions 1–3 are obtained in the limit of large K. However, we can prepare only a finite number of quantization levels. Furthermore, if n is small, then the number of examples that each bin contains is small, and we cannot estimate the histogram well. Therefore, given n, K must be moderately sized, and we recommend setting $K = \frac{1}{m}\log n$, because the number of examples contained in a bin decreases exponentially with increasing depth, where m is the number of variables in the local score. For example, $m = 1$ and $m = 2$ for (A) and (B), respectively. For finite K, Algorithm 1 (A) and (B) do not guarantee the theoretical properties assured in Proposition 3 and Theorems 3–5; however, as K grows, consistency holds.
In Step 3 of Algorithm 1 (A) and (B), we calculate $a_i^{(k)}$ from $a_i^{(k-1)}$ and not from $x_i$, which means that the computational time required to obtain $(a_i^{(1)}, \ldots, a_i^{(K)})$ from $x_i$ is $O(K)$. Thus, the total computation time of Algorithm 1 (A) and (B) is at most $O(nK)$.
In Step 3(b) of Algorithm 1 (A), we compute, for $i = 1, \ldots, n$ and $k = 1, \ldots, K$:
$$\log\frac{g_k^i(x^i)}{g_k^{i-1}(x^{i-1})} = \log\frac{Q_k^i(a_1^{(k)}, \ldots, a_i^{(k)})}{Q_k^{i-1}(a_1^{(k)}, \ldots, a_{i-1}^{(k)})} - \log\eta_X(a_i^{(k)})$$
if $x_i$ is quantized into $a_i^{(k)} \in A_k$, $i = 1, \ldots, n$.
Regarding memory, an exponential order of K is required. However, because we set $K = \frac{1}{m}\log n$, the computational time and memory requirements are at most $O(n\log n)$ and $O(n)$, respectively, for Algorithm 1 (A) and (B).
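For concreteness, here is a small Python rendering of Algorithm 1 (A) for $X \in [0, 1]$ with dyadic bins, the Lebesgue reference measure and the uniform weights $w_k = 1/K$ of Step 4 (our own sketch; for clarity it recomputes each bin index directly from $x_i$ instead of refining $a_i^{(k-1)}$, which still costs $O(K)$ per example here).

```python
import numpy as np
from collections import defaultdict

def log_g_hat(x, K):
    """Approximate log g_X^n(x^n) with K dyadic quantization levels on [0, 1]
    (Algorithm 1 (A)); the reference measure of a level-k bin is 2**(-k)."""
    log_gk = np.zeros(K)                         # log g_k^n for k = 1..K
    counts = [defaultdict(int) for _ in range(K)]
    for i, xi in enumerate(x, start=1):
        for k in range(1, K + 1):
            m = 1 << k                           # |A_k| = 2**k
            a = min(int(xi * m), m - 1)          # level-k bin containing x_i
            c = counts[k - 1][a]
            log_gk[k - 1] += np.log((c + 0.5) / (i - 1 + m / 2)) - np.log(1.0 / m)
            counts[k - 1][a] = c + 1
    return np.logaddexp.reduce(log_gk) - np.log(K)   # hat g = (1/K) sum_k g_k

rng = np.random.default_rng(0)
x = rng.beta(2, 5, size=2000)            # samples from a density on [0, 1]
K = max(1, int(np.log(len(x))))          # the recommended K = (1/m) log n with m = 1
print(log_g_hat(x, K))
```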
Based on the same idea, we can construct $g_Z^n(z^n)$, $g_{XZ}^n(x^n, z^n)$, $g_{YZ}^n(y^n, z^n)$ and $g_{XYZ}^n(x^n, y^n, z^n)$ from examples $x^n \in \mathcal{X}^n$, $y^n \in \mathcal{Y}^n$ and $z^n \in \mathcal{Z}^n$, and Propositions 2 and 3 hold for three variables.
Theorem 3. With probability one as $n \to \infty$:
$$\frac{1}{n}\log\frac{g_{XYZ}^n(x^n, y^n, z^n)\, g_Z^n(z^n)}{g_{XZ}^n(x^n, z^n)\, g_{YZ}^n(y^n, z^n)} \to I(X, Y, Z).$$
Proof. From Propositions 2 and 3 for two and three variables and the law of large numbers, we have:
$$\lim_{n\to\infty}\frac{1}{n}\log\frac{g_{XYZ}^n(x^n, y^n, z^n)\, g_Z^n(z^n)}{g_{XZ}^n(x^n, z^n)\, g_{YZ}^n(y^n, z^n)} = \lim_{n\to\infty}\frac{1}{n}\log\frac{f_{XYZ}^n(x^n, y^n, z^n)\, f_Z^n(z^n)}{f_{XZ}^n(x^n, z^n)\, f_{YZ}^n(y^n, z^n)} = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\log\frac{f_{XYZ}(x_i, y_i, z_i)\, f_Z(z_i)}{f_{XZ}(x_i, z_i)\, f_{YZ}(y_i, z_i)} = E\left[\log\frac{f_{XYZ}(X, Y, Z)\, f_Z(Z)}{f_{XZ}(X, Z)\, f_{YZ}(Y, Z)}\right] = I(X, Y, Z)$$
with probability one, which completes the proof. ☐
From the discussion in Section 2.1, even when more than two variables are present, if $D(P^* \| P) > 0$, we can choose $P^*$ rather than P with probability one as $n \to \infty$.
Now, we prove that the continuous counterpart of the decision based on (1) is strongly consistent:
Theorem 4. With probability one as $n \to \infty$:
$$X \perp Y \mid Z \iff p\, g_{XZ}^n(x^n, z^n)\, g_{YZ}^n(y^n, z^n) \ge (1-p)\, g_{XYZ}^n(x^n, y^n, z^n)\, g_Z^n(z^n),$$
where p is the prior probability of $X \perp Y \mid Z$.
Proof. Suppose that $X \not\perp Y \mid Z$. Then, the conditional mutual information between X and Y given Z is positive, and from Theorem 3, the estimator converges to a positive value with probability one as $n \to \infty$; thus, $p\, g_{XZ}^n(x^n, z^n)\, g_{YZ}^n(y^n, z^n) < (1-p)\, g_{XYZ}^n(x^n, y^n, z^n)\, g_Z^n(z^n)$ holds almost surely. Now suppose that $X \perp Y \mid Z$. Discrete variables X and Y are conditionally independent given Z if and only if:
$$c\, Q_{XZ}^n(x^n, z^n)\, Q_{YZ}^n(y^n, z^n) \ge (1-c)\, Q_{XYZ}^n(x^n, y^n, z^n)\, Q_Z^n(z^n)$$
with probability one as $n \to \infty$ for any constant $0 < c < 1$, even if c does not coincide with the prior probability p. If X, Y and Z are continuous, we quantize $x^n$, $y^n$ and $z^n$ into $(a_1^{(j)}, \ldots, a_n^{(j)})$, $(b_1^{(k)}, \ldots, b_n^{(k)})$ and $(c_1^{(l)}, \ldots, c_n^{(l)})$. Thus, for each j, k and l, we have:
$$p\, w_{jl}\, w_{kl}\, Q_{jl}^n(a_1^{(j)}, \ldots, a_n^{(j)}, c_1^{(l)}, \ldots, c_n^{(l)})\, Q_{kl}^n(b_1^{(k)}, \ldots, b_n^{(k)}, c_1^{(l)}, \ldots, c_n^{(l)}) \ge (1-p)\, w_{jkl}\, w_l\, Q_{jkl}^n(a_1^{(j)}, \ldots, a_n^{(j)}, b_1^{(k)}, \ldots, b_n^{(k)}, c_1^{(l)}, \ldots, c_n^{(l)})\, Q_l^n(c_1^{(l)}, \ldots, c_n^{(l)})$$
with probability one as $n \to \infty$. Thus, if we divide both sides by:
$$\eta_X(a_1^{(j)})\cdots\eta_X(a_n^{(j)})\;\eta_Y(b_1^{(k)})\cdots\eta_Y(b_n^{(k)})\;\big(\eta_Z(c_1^{(l)})\cdots\eta_Z(c_n^{(l)})\big)^2$$
and take summations of both sides over $j, k, l = 1, 2, \ldots$, we have:
$$p\, g_{XZ}^n(x^n, z^n)\, g_{YZ}^n(y^n, z^n) \ge (1-p)\, g_{XYZ}^n(x^n, y^n, z^n)\, g_Z^n(z^n)$$
with probability one, where we have assumed $w_{j,k,l} > 0$ and $w_{jl}, w_{kl} > 0$ because of $K = \frac{1}{m}\log n$, which completes the proof. ☐
Note that even if either X or Y is discrete, the same conclusion will be obtained. The generalized density functions cover the discrete distributions as a special case.
From the discussion in Section 2.1, even when more than two variables are present, if $D(P^* \| P) = 0$, we can choose $P^*$ rather than P with probability one as $n \to \infty$.
Let $h_Z^n(z^n)$, $h_{XZ}^n(x^n, z^n)$, $h_{YZ}^n(y^n, z^n)$ and $h_{XYZ}^n(x^n, y^n, z^n)$ take the same values as $g_Z^n(z^n)$, $g_{XZ}^n(x^n, z^n)$, $g_{YZ}^n(y^n, z^n)$ and $g_{XYZ}^n(x^n, y^n, z^n)$, except that the $\log n$ terms in $-\log Q_Z^n(z^n)$, $-\log Q_{XZ}^n(x^n, z^n)$, $-\log Q_{YZ}^n(y^n, z^n)$ and $-\log Q_{XYZ}^n(x^n, y^n, z^n)$ are replaced by $2(1+\epsilon)\log\log n$, respectively, where $\epsilon > 0$ is arbitrary. Then, we obtain the final result:
Theorem 5. With probability one as $n \to \infty$:
$$p\, h_{XZ}^n(x^n, z^n)\, h_{YZ}^n(y^n, z^n) \ge (1-p)\, h_{XYZ}^n(x^n, y^n, z^n)\, h_Z^n(z^n) \iff X \perp Y \mid Z.$$
This paper focuses on the theoretical aspects of BN structure learning, in particular on consistency when continuous variables are present. For details of the practical matters touched on in this section, see the conference paper [14].

3.3. The Number of Local Scores to be Computed

We refer to the conditional independence (CI) score w.r.t. X and Y given Z as the left-hand side of (8). Suppose that we follow the fastest Bayesian network structure learning procedure, due to [6]: let $Pa(X, V)$ be the optimal parent set of $X \in V$ contained in $V - \{X\}$ for $V \subseteq U := \{1, \ldots, N\}$, and let $S(X, V)$ be its local score. Then, we can obtain:
$$T(V) := \max_{X \in V}\,\{S(X, V) + T(V - \{X\})\}$$
for each $V \subseteq U$, the sinks:
$$X_N = \arg\max_{X \in U}\{S(X, U) + T(U - \{X\})\}, \quad X_{N-1} = \arg\max_{X \in U - \{X_N\}}\{S(X, U - \{X_N\}) + T(U - \{X_N\} - \{X\})\}, \quad \ldots,$$
and the parent sets:
$$Pa(X_N, U),\ Pa(X_{N-1}, U - \{X_N\}),\ \ldots,\ \{\}.$$
For each fixed pair (X, V), maximizing the local score $\frac{1}{n}\log\frac{g_{W+\{X\}}}{g_W}$ and maximizing the CI score $\frac{1}{n}\log\frac{g_{V-\{X\}}\,g_{W+\{X\}}}{g_V\,g_W}$ w.r.t. $W \subseteq V - \{X\}$ are equivalent. In other words,
$$\frac{1}{n}\log\frac{g_{W+\{X\}}}{g_W} \ge \frac{1}{n}\log\frac{g_{W'+\{X\}}}{g_{W'}} \iff \frac{1}{n}\log\frac{g_{V-\{X\}}\,g_{W+\{X\}}}{g_V\,g_W} \ge \frac{1}{n}\log\frac{g_{V-\{X\}}\,g_{W'+\{X\}}}{g_V\,g_{W'}}$$
for $W, W' \subseteq V - \{X\}$.
On the other hand, from [15,16], we know that the relationship between the complexity term and the likelihood term gives tight bounds on the maximum number of parents in the optimal BN for any given dataset. In particular, the number of elements in each parent set $Pa(X, V)$ is at most $O(\log n)$ for $X \in V$ and $V \subseteq U$. Hence, the number of CI scores to be computed is much less than exponential in N.
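A sketch of the second stage (our own rendering of the dynamic programming of [6]; the data structures and function names are assumptions): given the local scores S(X, V), compute T(V) over all subsets and read off the sinks.

```python
from itertools import combinations

def best_network(variables, S):
    """Dynamic programming over subsets: T(V) = max_{X in V} { S(X, V) + T(V \\ {X}) },
    where S(X, V) is the best local score of X with parents chosen inside V - {X}.
    Returns T(U) and a topological order of the variables (sinks last)."""
    T = {frozenset(): 0.0}
    best_sink = {}
    for size in range(1, len(variables) + 1):
        for V in map(frozenset, combinations(variables, size)):
            T[V], best_sink[V] = max((S(X, V) + T[V - {X}], X) for X in V)
    order, V = [], frozenset(variables)
    while V:                      # peel off the best sink X_N, X_{N-1}, ...
        X = best_sink[V]
        order.append(X)
        V = V - {X}
    return T[frozenset(variables)], order[::-1]
```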

4. Concluding Remarks

In this paper, we considered the problem of learning a Bayesian network structure from examples and provided two contributions.
First, we found that the $\log n$ terms in the penalty terms of the description length can be replaced by $2(1+\epsilon)\log\log n$ to obtain strong consistency, where the derivation is based on the law of the iterated logarithm. We claim that the Hannan and Quinn principle [7] is applicable to this problem.
Second, we constructed an extended version of the score function for finding a Bayesian network structure with the maximum posterior probability and proved that the decision is strongly consistent even when continuous variables are present. Thus far, consistency has been obtained only for discrete variables, and many authors have been seeking consistency when continuous variables are present.
Consistency has been proven in many model selection methods that maximize the posterior probability or, equivalently, minimize the description length [1]. However, almost all such methods assume that the variables are either discrete or that the variables obey Gaussian distributions. This paper proposed an extended version of the MDL/Bayesian principle without assuming such constraints and proved its strong consistency in a precise manner, which we believe provides a substantial contribution to the statistics and machine learning communities.

Appendix: Proof of Theorem 1

Hereafter, we write P(X = x|Z = z) and P(Y = y|Z = z) simply as P(x|z) and P(y|z) respectively, for x ∈ X, y ∈ Y and z ∈ Z. We find that the empirical mutual information:
Entropy 17 05752 i001
Entropy 17 05752 i002
Entropy 17 05752 i003
is approximated by Entropy 17 05752 i004 with:
Entropy 17 05752 i005
where the difference between them is zero with probability one as $n \to \infty$, and $(1+t)\log(1+t) = t + t^2/2 - t^3/\{6[1+\delta(t)t]^2\}$ with $0 < \delta(t) < 1$ and:
Entropy 17 05752 i006
has been applied for (11), (12) and (13), respectively. Furthermore, we derive:
Entropy 17 05752 i007
where $V = (V_{xy})_{x \in \mathcal{X},\, y \in \mathcal{Y}}$ with Entropy 17 05752 i022, and u and v are the column vectors Entropy 17 05752 i008 and Entropy 17 05752 i009, respectively. Hereafter, we arbitrarily fix $z \in \mathcal{Z}$. Let $U = (u[0], u[1], \ldots, u[\alpha-1])$, with $u[0] = u$, and $W = (w[0], w[1], \ldots, w[\beta-1])$, with $w[0] = w$, being eigenvectors of Entropy 17 05752 i010 and Entropy 17 05752 i011, where $E_m$ is the identity matrix of dimension m.
Then, $^t u\, V\, w = 0$, and for $\tilde U = (u[1], \ldots, u[\alpha-1])$ and $\tilde W = (w[1], \ldots, w[\beta-1])$, we have:
Entropy 17 05752 i012
and:
Entropy 17 05752 i013
If we note that $U\,{}^tU = {}^tU\,U = E_\alpha$ and $W\,{}^tW = {}^tW\,W = E_\beta$, we obtain:
Entropy 17 05752 i014
and find that (14) becomes:
Entropy 17 05752 i015
with $r_{ij} := {}^t u[i]\, V\, w[j]$. Then, we can see:
Entropy 17 05752 i016
and that the $(\alpha-1)\times(\beta-1)$ matrix ${}^t\tilde U\, V\, \tilde W$ consists of mutually independent elements $r_{ij}$ with $i = 1, \ldots, \alpha-1$ and $j = 1, \ldots, \beta-1$: $E[r_{ij}] = 0$, and:
Entropy 17 05752 i017
where σ i j 2 is the variance of rij and the expectation of σ i j 2 , so that (15) implies:
Entropy 17 05752 i018
If we define, for each $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ and for $i = 1, \ldots, n$:
Entropy 17 05752 i019
where $u[i] = (u[i, x])_{x \in \mathcal{X}}$ and $w[j] = (w[y, j])_{y \in \mathcal{Y}}$, then we can check that $E[Z_{i,j,k}] = 0$ and $V[Z_{i,j,k}] = 1$, where the expectation E and variance V are with respect to the examples $X^n = x^n$ and $Y^n = y^n$, and $I(A)$ takes one if the event A is true and zero otherwise. We can easily check:
Entropy 17 05752 i020
We consider applying the obtained derivation to Lemma 1. From (17), we obtain:
Entropy 17 05752 i021
which means that (14) is upper bounded by $(1+\epsilon)(\alpha-1)(\beta-1)\log\log n$ with probability one as $n \to \infty$ for any $\epsilon > 0$, from (16). This completes the proof of Theorem 1.

References

  1. Rissanen, J. Modeling by shortest data description. Automatica 1978, 14, 465–471.
  2. Billingsley, P. Probability and Measure, 3rd ed.; Wiley: New York, NY, USA, 1995.
  3. Friedman, N.; Linial, M.; Nachman, I.; Pe'er, D. Using Bayesian networks to analyze expression data. J. Comput. Biol. 2000, 7, 601–620.
  4. Imoto, S.; Kim, S.; Goto, T.; Aburatani, S.; Tashiro, K.; Kuhara, S.; Miyano, S. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. J. Bioinform. Comput. Biol. 2003, 1, 231–252.
  5. Zhang, K.; Peters, J.; Janzing, D.; Scholkopf, B. Kernel-based Conditional Independence Test and Application in Causal Discovery. In Proceedings of the 2011 Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, 14–17 July 2011; pp. 804–813.
  6. Silander, T.; Myllymaki, P. A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, Arlington, VA, USA, 13–16 July 2006; pp. 445–452.
  7. Hannan, E.J.; Quinn, B.G. The Determination of the Order of an Autoregression. J. R. Stat. Soc. B 1979, 41, 190–195.
  8. Suzuki, J. The Hannan–Quinn Proposition for Linear Regression. Int. J. Stat. Probab. 2012, 1, 2.
  9. Suzuki, J. On Strong Consistency of Model Selection in Classification. IEEE Trans. Inf. Theory 2006, 52, 4767–4774.
  10. Ryabko, B. Compression-based Methods for Nonparametric Prediction and Estimation of Some Characteristics of Time Series. IEEE Trans. Inf. Theory 2009, 55, 4309–4315.
  11. Suzuki, J. Universal Bayesian Measures. In Proceedings of the 2013 IEEE International Symposium on Information Theory, Istanbul, Turkey, 7–12 July 2013; pp. 644–648.
  12. Krichevsky, R.E.; Trofimov, V.K. The Performance of Universal Encoding. IEEE Trans. Inf. Theory 1981, 27, 199–207.
  13. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley: New York, NY, USA, 1995.
  14. Suzuki, J. Learning Bayesian Network Structures When Discrete and Continuous Variables Are Present. In Proceedings of the 2014 Workshop on Probabilistic Graphical Models, 17–19 September 2014; Springer Lecture Notes in Artificial Intelligence, Volume 8754; pp. 471–486.
  15. Suzuki, J. Learning Bayesian belief networks based on the minimum description length principle: An efficient algorithm using the B&B technique. In Proceedings of the 13th International Conference on Machine Learning (ICML'96), Bari, Italy, 3–6 July 1996; pp. 462–470.
  16. De Campos, C.P.; Ji, Q. Efficient Structure Learning of Bayesian Networks using Constraints. J. Mach. Learn. Res. 2011, 12, 663–689.
  17. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
  18. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann: San Mateo, CA, USA, 1988.
