
Computation of Kullback–Leibler Divergence in Bayesian Networks

Computer Science and Artificial Intelligent Department, University of Granada, 18071 Granada, Spain
* Author to whom correspondence should be addressed.
Entropy 2021, 23(9), 1122; https://doi.org/10.3390/e23091122
Submission received: 29 July 2021 / Revised: 19 August 2021 / Accepted: 25 August 2021 / Published: 28 August 2021
(This article belongs to the Special Issue Bayesian Inference in Probabilistic Graphical Models)

Abstract

The Kullback–Leibler divergence $KL(p, q)$ is the standard measure of error when we have a true probability distribution p which is approximated by a probability distribution q. Its efficient computation is essential in many tasks, for example in approximate computation or as a measure of error when learning a probability distribution. For high-dimensional probability distributions, such as the ones associated with Bayesian networks, a direct computation can be unfeasible. This paper considers the case of efficiently computing the Kullback–Leibler divergence of two probability distributions, each of them coming from a different Bayesian network, which might have different structures. The paper is based on an auxiliary deletion algorithm to compute the necessary marginal distributions, but uses a cache of operations with potentials in order to reuse past computations whenever they are necessary. The algorithms are tested with Bayesian networks from the bnlearn repository. Computer code in Python is provided, taking as its basis pgmpy, a library for working with probabilistic graphical models.

1. Introduction

When experimentally testing Bayesian network learning algorithms, in most cases performance is evaluated by looking at the structural differences between the graph of the original Bayesian network and that of the learned one [1], as when using the structural Hamming distance. This measure is used in recent contributions such as [2,3,4]. A study and comparison of the different metrics used to measure the structural differences between two Bayesian networks can be found in [1].
However, in most cases the aim of learning a Bayesian network is to estimate a joint probability distribution for the variables in the problem. In that situation, the error of a learning procedure should be computed by measuring the difference between the probability distribution associated with the learned network and the original joint probability distribution. Indeed, it can be useful to estimate a network that is less dense than the original one but in which the parameters can be estimated more accurately. This is the case of the naive Bayes classifier, which obtains very good results in classification problems despite the fact that its structure is not the correct one. So, in this situation, structural graphical differences are not a good measure of performance.
The basic measure to determine the divergence between an estimated distribution and a true one is the so-called Kullback–Leibler divergence [5]. Some papers use this way of assessing the quality of a learning procedure, as in [6,7,8]. A direct computation of the divergence is unfeasible if the number of variables is high. However, some basic decomposition properties [9] (Theorem 8.5) can be applied to reduce the cost of computing the divergence. This is the basis of the procedure implemented in the Elvira system [10], which is the one used in [6]. The methods in [7,8] are also based on the same basic decomposition. The Kullback–Leibler divergence is not only meaningful for measuring the divergence between a learned network and a true one, but also for other tasks, such as the approximation of a Bayesian network by a simpler one [11,12,13] obtained by removing some of the existing links.
The aim of this work is to improve existing methods for computing the Kullback–Leibler divergence in Bayesian networks and to provide a basic algorithm for this task written in Python and integrated into the pgmpy [14] environment. The algorithm implemented in the Elvira system [10] is based on carrying out a number of propagation computations in the original true network. The hypothesis underlying our approach is that many computations are repeated across these propagations, so our method determines which operations with potentials are repeated and stores their results in a cache of operations so that they can be reused. The experimental work will show that this is an effective way to improve the efficiency of the algorithms, especially in large networks.
The paper is organized as follows: Section 2 sets out the basic framework and presents the fundamental results for the computation of the Kullback–Leibler divergence; Section 3 describes the method implemented in the Elvira system for computing the Kullback–Leibler divergence; Section 4 describes our proposal based on the cache of operations with potentials; Section 5 contains the experimental setting and the obtained results; finally, conclusions are given in Section 6.

2. Kullback–Leibler Divergence

Let N be a Bayesian network defined on a set of variables $\mathbf{X} = \{X_1, \ldots, X_n\}$. The family of a variable $X_i$ in $\mathbf{X}$ is termed $f(X_i) = \{X_i\} \cup pa(X_i)$, where $pa(X_i)$ is the set of parents of $X_i$ in the directed acyclic graph (DAG) defined by N. $F = \{f(X_1), \ldots, f(X_n)\}$ denotes the complete set of families, one for each of the variables. Sometimes simplified notations for families and parent sets will be used: $f_i$ (for $f(X_i)$) and $pa_i$ (for $pa(X_i)$), respectively. As a running example, assume a network with three variables, $X_1, X_2, X_3$, and the following structure: $X_1 \rightarrow X_2 \rightarrow X_3$ (see right part of Figure 1). Then the set of families for this network is $\{f_1, f_2, f_3\}$, where $f_1 = \{X_1\}$, $f_2 = \{X_2, X_1\}$, $f_3 = \{X_3, X_2\}$.
A configuration or assignment of values to a set of variables $\mathbf{X}$, $\{X_1 = x_1, \ldots, X_n = x_n\}$, can be abbreviated as $(x_1, \ldots, x_n)$ and is denoted by $\mathbf{x}$. If the set of possible values for each variable in the previous example is $\{0, 1\}$, then a configuration can be $\mathbf{x} = (0, 0, 1)$, representing the assignment $\{X_1 = 0, X_2 = 0, X_3 = 1\}$.
A partial configuration involving a subset of variables $\mathbf{Y} \subseteq \mathbf{X}$ is denoted as $\mathbf{y}$. If the set of variables is $f_i$ or $pa_i$, then the partial configuration will be denoted by $\mathbf{x}_{f_i}$ or $\mathbf{x}_{pa_i}$, respectively. In our example, with $f_2 = \{X_2, X_1\}$, an example of a partial configuration for these variables is $\mathbf{x}_{f_2} = (0, 0)$.
The set of configurations for variables $\mathbf{Y}$ is denoted by $\Omega_{\mathbf{Y}}$. If $\mathbf{x}$ is an assignment and $\mathbf{Y} \subseteq \mathbf{X}$, then the configuration $\mathbf{y}$ obtained by deleting the values of the variables in $\mathbf{X} \setminus \mathbf{Y}$ is denoted by $\mathbf{x}^{\downarrow \mathbf{Y}}$. If $\mathbf{x}_{f_2} = (0, 0)$ is a partial configuration for variables $\{X_2, X_1\}$ and we consider $\mathbf{Y} = \{X_2\}$, then $\mathbf{x}_{f_2}^{\downarrow \mathbf{Y}}$ is the configuration obtained by removing the value of $X_1$, i.e., $(0)$.
If $\mathbf{w}$ and $\mathbf{z}$ are configurations for $\mathbf{W} \subseteq \mathbf{X}$ and $\mathbf{Z} \subseteq \mathbf{X}$ respectively, and $\mathbf{W} \cap \mathbf{Z} = \emptyset$, then $(\mathbf{w}, \mathbf{z})$ is a configuration for $\mathbf{W} \cup \mathbf{Z}$, and it will be called the composition of $\mathbf{w}$ and $\mathbf{z}$. For example, if $\mathbf{w}$ is the configuration $(0)$ for variable $X_1$ and $\mathbf{z}$ is the configuration $(0, 1)$ defined on $\{X_2, X_3\}$, then their composition is the configuration $(0, 0, 1)$ for variables $\{X_1, X_2, X_3\}$.
The conditional probability distribution for $X_i$ given its parents will be denoted as $\phi_i$, which is a potential defined on the set of variables $f(X_i)$. In general, a potential $\phi$ for variables $\mathbf{Y} \subseteq \mathbf{X}$ is a mapping from $\Omega_{\mathbf{Y}}$ into the set of real numbers: $\phi : \Omega_{\mathbf{Y}} \rightarrow \mathbb{R}$. The set of variables of potential $\phi$ will be denoted as $v(\phi)$. If $\Phi$ is a set of potentials, $v(\Phi)$ will denote $\bigcup_{\phi \in \Phi} v(\phi)$.
In our example there are three potentials and $\Phi = \{\phi_1(X_1), \phi_2(X_2, X_1), \phi_3(X_3, X_2)\}$ (that is, a probability distribution for $X_1$ and two conditional probability distributions: one for $X_2$ given $X_1$ and the other for $X_3$ given $X_2$).
There are three basic operations that can be performed on potentials (a small Python sketch of the three operations is given after this list):
  • Multiplication. If $\phi, \phi'$ are potentials, then their multiplication is the potential $\phi \cdot \phi'$, with set of variables $v(\phi \cdot \phi') = v(\phi) \cup v(\phi')$, obtained by pointwise multiplication:
    $\phi \cdot \phi'(\mathbf{y}) = \phi(\mathbf{y}^{\downarrow v(\phi)}) \cdot \phi'(\mathbf{y}^{\downarrow v(\phi')})$.
    In our example, the combination of $\phi_2$ and $\phi_3$ is the potential $\phi_2 \cdot \phi_3$ defined on $\{X_1, X_2, X_3\}$ and given by $\phi_2 \cdot \phi_3(x_1, x_2, x_3) = \phi_2(x_2, x_1) \cdot \phi_3(x_3, x_2)$.
  • Marginalization. If $\phi$ is a potential defined for variables $\mathbf{Y}$ and $\mathbf{Z} \subseteq \mathbf{Y}$, then the marginalization of $\phi$ on $\mathbf{Z}$ is denoted by $\phi^{\downarrow \mathbf{Z}}$ and it is obtained by summing over the variables in $\mathbf{Y} \setminus \mathbf{Z}$:
    $\phi^{\downarrow \mathbf{Z}}(\mathbf{z}) = \sum_{\mathbf{y}^{\downarrow \mathbf{Z}} = \mathbf{z}} \phi(\mathbf{y})$.
    When $\mathbf{Z}$ is equal to $\mathbf{Y}$ minus a variable W, then $\phi^{\downarrow \mathbf{Z}}$ will be called the result of removing W in $\phi$ and also denoted as $\phi^{-W}$. In the example, a marginalization of $\phi_3$ is obtained by removing $X_3$, producing $\phi_3^{-X_3}$ defined on $X_2$ and given by $\phi_3^{-X_3}(x_2) = \phi_3(0, x_2) + \phi_3(1, x_2)$. If $\phi_3(x_3, x_2)$ represents the conditional probability of $X_3 = x_3$ given $X_2 = x_2$, then $\phi_3^{-X_3}(x_2)$ is always equal to 1 (for every value $x_2$ of $X_2$).
  • Selection. If $\phi$ is a potential defined for variables $\mathbf{Y}$ and $\mathbf{z}$ is a configuration for variables $\mathbf{Z}$, then the selection of $\phi$ for this configuration $\mathbf{z}$ is the potential $\phi^{\mathbf{Z} = \mathbf{z}}$ defined on variables $\mathbf{W} = \mathbf{Y} \setminus \mathbf{Z}$ and given by
    $\phi^{\mathbf{Z} = \mathbf{z}}(\mathbf{w}) = \phi(\mathbf{w}, \mathbf{z}^{\downarrow \mathbf{Y}})$.
    In this expression $(\mathbf{w}, \mathbf{z}^{\downarrow \mathbf{Y}})$ is the composition of configurations $\mathbf{w}$ and $\mathbf{z}^{\downarrow \mathbf{Y}}$, which is a configuration for variables $v(\phi)$. Going back to the example, assume that we want to perform the selection of $\phi_3$ for the configuration $\mathbf{z} = (0, 1)$ of variables $\{X_1, X_2\}$; then $\phi_3^{\mathbf{Z} = \mathbf{z}}$ will be a potential defined for variables $\{X_2, X_3\} \setminus \{X_1, X_2\} = \{X_3\}$ given by $\phi_3^{\mathbf{Z} = \mathbf{z}}(x_3) = \phi_3(x_3, 1)$, as we are reducing $\phi_3(X_3, X_2)$ to a configuration $\mathbf{z}$ in which $X_2 = 1$.
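To make these three operations concrete, the following minimal Python sketch stores a potential as a tuple of variable names plus a dictionary from configurations to values. The class name Potential and its method names are illustrative choices for this paper's notions, not part of any library.

```python
import itertools

class Potential:
    """A discrete potential: a table mapping configurations of its variables to real numbers."""

    def __init__(self, variables, table):
        self.variables = tuple(variables)   # the scope v(phi)
        self.table = dict(table)            # {tuple of values (same order as variables): value}

    def _domains(self):
        doms = {v: set() for v in self.variables}
        for conf in self.table:
            for var, val in zip(self.variables, conf):
                doms[var].add(val)
        return doms

    def multiply(self, other):
        """Pointwise multiplication; the result is defined on v(phi) union v(phi')."""
        joint = self.variables + tuple(v for v in other.variables if v not in self.variables)
        doms = {**self._domains(), **other._domains()}
        table = {}
        for conf in itertools.product(*(sorted(doms[v]) for v in joint)):
            assign = dict(zip(joint, conf))
            a = self.table[tuple(assign[v] for v in self.variables)]
            b = other.table[tuple(assign[v] for v in other.variables)]
            table[conf] = a * b
        return Potential(joint, table)

    def marginalize(self, keep):
        """Sum out every variable not in `keep` (the downarrow operation)."""
        kept = tuple(v for v in self.variables if v in keep)
        idx = [self.variables.index(v) for v in kept]
        table = {}
        for conf, value in self.table.items():
            reduced = tuple(conf[i] for i in idx)
            table[reduced] = table.get(reduced, 0.0) + value
        return Potential(kept, table)

    def select(self, evidence):
        """Evidence restriction: keep only rows consistent with `evidence` and drop those variables."""
        kept = tuple(v for v in self.variables if v not in evidence)
        idx = [self.variables.index(v) for v in kept]
        table = {}
        for conf, value in self.table.items():
            assign = dict(zip(self.variables, conf))
            if all(assign[v] == val for v, val in evidence.items() if v in assign):
                table[tuple(conf[i] for i in idx)] = value
        return Potential(kept, table)
```

For the running example, phi2.multiply(phi3) would be defined on {X1, X2, X3}, phi3.marginalize({'X2'}) removes X3, and phi3.select({'X2': 1}) is the selection discussed in the text.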
The family of all the conditional distributions is denoted as $\Phi = \{\phi_1, \ldots, \phi_n\}$. It is well known that, given N, the joint probability distribution p of the variables in N is a potential that decomposes as the product of the potentials included in $\Phi$:
$p = \prod_{\phi_i \in \Phi} \phi_i \qquad (1)$
Considering the example, $\Phi = \{\phi_1, \phi_2, \phi_3\}$ and $p = \phi_1 \cdot \phi_2 \cdot \phi_3$. The marginal distribution of p for a set of variables $\mathbf{Y} \subseteq \mathbf{X}$ is equal to $p^{\downarrow \mathbf{Y}}$. When $\mathbf{Y}$ contains only one variable $X_i$, then, to simplify the notation, $p^{\downarrow \mathbf{Y}}$ will be denoted as $p_i$. Sometimes it will be necessary to make reference to the Bayesian network containing a potential or family. In these cases we will use a superscript: for example, $f^A(X_i)$ and $pa^A(X_i)$ refer to the family and parent set of $X_i$ in a Bayesian network $N^A$, respectively.
The aim of this paper is to compute the Kullback–Leibler divergence (termed $KL$) between the joint probability distributions $p^A$ and $p^B$ of two different Bayesian networks $N^A$ and $N^B$ defined on the same set of variables $\mathbf{X}$ but possibly having different structures. This divergence, denoted as $KL(N^A, N^B)$, can be computed considering the probabilities of each configuration $\mathbf{x}$ in both distributions as follows:
$KL(N^A, N^B) = \sum_{\mathbf{x}} p^A(\mathbf{x}) \log \left( \frac{p^A(\mathbf{x})}{p^B(\mathbf{x})} \right) \qquad (2)$
However, the computation with the joint probability distribution may be unfeasible for complex models, as the number of configurations $\mathbf{x}$ is exponential in the number of variables. If $p, q$ are probability distributions on $\mathbf{X}$, then the expected log-likelihood ($LL$) of q with respect to p is:
$LL(p, q) = \sum_{\mathbf{x}} p(\mathbf{x}) \log(q(\mathbf{x})),$
then, from Equation (2), it is immediate that:
$KL(N^A, N^B) = \sum_{\mathbf{x}} p^A(\mathbf{x}) \log(p^A(\mathbf{x})) - \sum_{\mathbf{x}} p^A(\mathbf{x}) \log(p^B(\mathbf{x})) = LL(p^A, p^A) - LL(p^A, p^B) = LL(N^A, N^A) - LL(N^A, N^B) \qquad (3)$
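When the number of variables is small, Equations (2) and (3) can be evaluated directly by enumerating every configuration, which also provides a useful baseline for testing. The sketch below does this with plain Python callables; the function names and the toy factorisation are illustrative only, and the approach is of course infeasible for large networks.

```python
import itertools
from math import log

def kl_divergence(p, q, domains):
    """KL(p, q) = LL(p, p) - LL(p, q), computed by enumerating all configurations.

    p, q    : callables mapping a configuration dict {variable: value} to its probability
    domains : dict {variable: list of possible values}
    """
    variables = sorted(domains)
    ll_pp = ll_pq = 0.0
    for values in itertools.product(*(domains[v] for v in variables)):
        x = dict(zip(variables, values))
        px = p(x)
        if px > 0.0:                      # configurations with p(x) = 0 contribute nothing
            ll_pp += px * log(px)
            ll_pq += px * log(q(x))
    return ll_pp - ll_pq

# Running example: p factorises along the chain X1 -> X2 -> X3 (illustrative numbers)
phi1 = {0: 0.6, 1: 0.4}                                            # phi1(x1)
phi2 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}        # phi2(x2, x1)
phi3 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}        # phi3(x3, x2)
p = lambda x: phi1[x['X1']] * phi2[(x['X2'], x['X1'])] * phi3[(x['X3'], x['X2'])]
domains = {'X1': [0, 1], 'X2': [0, 1], 'X3': [0, 1]}
print(kl_divergence(p, p, domains))       # approximately 0.0: divergence of a distribution with itself
```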
The probability distribution $p^B$ decomposes as well, as considered in Equation (1). Therefore, the term $LL(N^A, N^B)$ in Equation (3) can be obtained as follows, considering the families of the variables in $N^B$ and their corresponding configurations $\mathbf{x}_{f_i^B}$:
$LL(N^A, N^B) = \sum_{\mathbf{x}} p^A(\mathbf{x}) \log(p^B(\mathbf{x})) = \sum_{\mathbf{x}} p^A(\mathbf{x}) \log \Big( \prod_{X_i \in \mathbf{X}} \phi_i^B(\mathbf{x}_{f_i^B}) \Big) = \sum_{\mathbf{x}} p^A(\mathbf{x}) \sum_{X_i \in \mathbf{X}} \log \big( \phi_i^B(\mathbf{x}_{f_i^B}) \big) \qquad (4)$
Interchanging the summations and reorganizing the terms in Equation (4):
$LL(N^A, N^B) = \sum_{X_i \in \mathbf{X}} \sum_{\mathbf{x}} p^A(\mathbf{x}) \log \big( \phi_i^B(\mathbf{x}_{f_i^B}) \big) = \sum_{X_i \in \mathbf{X}} \sum_{\mathbf{x}_{f_i^B}} \log \big( \phi_i^B(\mathbf{x}_{f_i^B}) \big) \Big( \sum_{\mathbf{x} : \mathbf{x}^{\downarrow f_i^B} = \mathbf{x}_{f_i^B}} p^A(\mathbf{x}) \Big) = \sum_{X_i \in \mathbf{X}} \sum_{\mathbf{x}_{f_i^B}} \log \big( \phi_i^B(\mathbf{x}_{f_i^B}) \big) \, (p^A)^{\downarrow f_i^B}(\mathbf{x}_{f_i^B}) \qquad (5)$
Equation (5) implies a decomposition of the term $LL(N^A, N^B)$ and, as a consequence, of the computation of $KL(N^A, N^B)$ as well. Observe that $\phi_i^B(\mathbf{x}_{f_i^B})$ is the value of the potential $\phi_i^B$ for a configuration $\mathbf{x}_{f_i^B}$ and can be obtained directly from the potential $\phi_i^B$ of the Bayesian network $N^B$. The main difficulty in Equation (5) consists of the computation of the $(p^A)^{\downarrow f_i^B}(\mathbf{x}_{f_i^B})$ values, as it is necessary to compute the marginal probability distribution for the variables in $f_i^B$, the family of $X_i$ in the Bayesian network $N^B$, but using the joint probability distribution $p^A$ associated with the Bayesian network $N^A$.

3. Computation with Propagation Algorithms

In this section we introduce the category of inference algorithms based on deletion of variables and then we show how these algorithms can be applied to compute the Kullback–Leibler divergence using Equation (5).

3.1. Variable Elimination Algorithms

To compute $(p^A)^{\downarrow f_i^B}$ we consider $\Phi^A$, the set of potentials associated with network $N^A$: the multiplication of all the potentials in $\Phi^A$ is equal to $p^A$. Deletion algorithms [15,16] can be applied to $\Phi^A$ to determine the required marginalizations. The basic step of these algorithms is the deletion of a variable from a set of potentials $\Phi$:
  • Variable Deletion. If $\Phi$ is a set of potentials, the deletion of $X_i$ consists of the following operations:
    Compute $\Phi_i = \{\phi \in \Phi : X_i \in v(\phi)\}$, i.e., the set of potentials containing variable $X_i$.
    Compute $\phi^{-i} = \big( \prod_{\phi \in \Phi_i} \phi \big)^{-X_i}$, i.e., combine all the potentials in $\Phi_i$ and remove variable $X_i$ by marginalization.
    Update $\Phi \leftarrow (\Phi \setminus \Phi_i) \cup \{\phi^{-i}\}$, i.e., remove from $\Phi$ the potentials containing $X_i$ and add the new potential $\phi^{-i}$, which does not contain $X_i$.
The main property of the deletion step is the following: if initially $\prod_{\phi \in \Phi} \phi = q$, then after the deletion of $X_i$ from $\Phi$ we have $\prod_{\phi \in \Phi} \phi = q^{-X_i}$. It is easy to see that the deletion of a variable $X_i$ can be computed by operating only with the elements of $\Phi$ defined on $X_i$.
If $\Phi$ is the initial set of potentials of a network N, then $p = \prod_{\phi \in \Phi} \phi$. In order to compute the marginalization of p on a set of variables $\mathbf{Y} \subseteq \mathbf{X}$, the deletion procedure should be repeated for each variable $X_i$ in $\mathbf{X} \setminus \mathbf{Y}$. If the marginal probability distribution for a single variable $X_k$ is to be calculated, every variable in $\mathbf{X}$ different from $X_k$ should be deleted. The order of variable deletion does not affect the final result, but the efficiency may depend on it.
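Reusing the Potential sketch from Section 2, a deletion step and the resulting marginalization routine can be written as follows. This is again a simplified sketch under the same assumptions; a real implementation would also choose a good deletion ordering rather than an arbitrary one.

```python
def delete_variable(potentials, variable):
    """One deletion step: combine every potential whose scope contains `variable` and sum it out."""
    with_var = [phi for phi in potentials if variable in phi.variables]
    rest = [phi for phi in potentials if variable not in phi.variables]
    if not with_var:
        return rest
    product = with_var[0]
    for phi in with_var[1:]:
        product = product.multiply(phi)
    keep = [v for v in product.variables if v != variable]
    return rest + [product.marginalize(keep)]

def marginal(potentials, target):
    """Delete every variable outside `target`; the product of what remains is p marginalized on target."""
    result = list(potentials)
    all_vars = {v for phi in potentials for v in phi.variables}
    for variable in all_vars - set(target):        # deletion order chosen arbitrarily in this sketch
        result = delete_variable(result, variable)
    return result
```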
When there are observed variables, $\mathbf{Z} = \mathbf{z}$, then a previous step of selection should be carried out: every potential $\phi \in \Phi$ is transformed into $\phi^{\mathbf{Z} = \mathbf{z}}$. This step will be called evidence restriction. After it, q, the product of the potentials in $\Phi$, is defined for the variables in $\mathbf{Y} = \mathbf{X} \setminus \mathbf{Z}$ and its value is $q(\mathbf{y}) = p(\mathbf{y}, \mathbf{z})$, i.e., the joint probability of obtaining this value and the observations. If a deletion of the variables in $\mathbf{W}$ is carried out, then the product of the potentials in $\Phi$ is a potential defined for the variables $\mathbf{Y} = (\mathbf{X} \setminus \mathbf{Z}) \setminus \mathbf{W}$, and it satisfies $q(\mathbf{y}) = p^{\downarrow \mathbf{Y} \cup \mathbf{Z}}(\mathbf{y}, \mathbf{z})$.
When we have observations and we want to compute the marginal on a variable X i , it is well known that not all the initial potentials in Φ are relevant. A previous pruning step can be done using the Bayes-ball algorithm [17] in order to remove the irrelevant potentials from Φ before restricting to the observations and carrying out the deletion of variables.

3.2. Computation of Kullback–Leibler Divergence Using Deletion Algorithms

Our first alternative for computing $LL(N^A, N^B)$ is based on using a simple deletion algorithm to compute the values $(p^A)^{\downarrow f_i^B}(\mathbf{x}_{f_i^B})$ in Equation (5). The basic steps are:
  • Given a specific variable $X_i$, we have that $f_i^B = \{X_i\} \cup pa_i^B$. Then, for each possible configuration $\mathbf{x}_{pa_i^B}$ of the parent variables, we include the observation $pa_i^B = \mathbf{x}_{pa_i^B}$ and apply a selection operation to the list of potentials associated with the Bayesian network $N^A$ by means of evidence restriction. We also apply a pruning of irrelevant variables using the Bayes-ball algorithm.
  • Then all the variables are deleted except the target variable $X_i$. The potentials in $\Phi$ will all be defined for variable $X_i$ and their product will be a potential q defined for variable $X_i$ such that $q(x_i) = (p^A)^{\downarrow f_i^B}(x_i, \mathbf{x}_{pa_i^B})$.
  • The deletion algorithm is repeated for each variable $X_i$ and each configuration of the parent variables $\mathbf{x}_{pa_i^B}$ in the Bayesian network $N^B$. Therefore, the number of executions of the propagation algorithm in the Bayesian network $N^A$ is equal to $\sum_{i=1}^{n} \prod_{X_j \in pa^B(X_i)} n_j$, where $n_j$ is the number of possible values of $X_j$. This is immediate taking into account that $\prod_{X_j \in pa^B(X_i)} n_j$ is the number of possible configurations $\mathbf{x}_{pa^B(X_i)}$ of the variables in $pa^B(X_i)$ (the short sketch after this list computes this count).
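As an illustration of this count, a hypothetical helper on top of a pgmpy BayesianNetwork (with its CPDs already attached, so that cardinalities are known) could compute the number of propagations as follows:

```python
import numpy as np

def number_of_propagations(model_b):
    """Sum, over the variables of N^B, of the number of configurations of their parent sets."""
    total = 0
    for node in model_b.nodes():
        cardinalities = [model_b.get_cardinality(parent) for parent in model_b.get_parents(node)]
        total += int(np.prod(cardinalities)) if cardinalities else 1   # empty parent set: one run
    return total
```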
Though this method can take advantage of propagation algorithms to compute marginals in a Bayesian network, and it avoids a brute force computation associated with the use of Equation (2), it is quite time consuming when the structure of the involved Bayesian network is complex.
Algorithm 1 Computation of LL using an evidence propagation algorithm
 1: function LL(N^A, N^B)
 2:     sum ← 0.0                                              ▹ sets initial value to sum
 3:     for each X_i in N^B do
 4:         for each x_{pa_i^B} do                             ▹ configuration of X_i parents
 5:             Let Φ' be the set of relevant potentials from Φ          ▹ applying Bayes-ball
 6:             Restrict the potentials in Φ' to the evidence pa_i^B = x_{pa_i^B}
 7:             Delete in Φ' all the variables in v(Φ') \ {X_i}
 8:             Let q be the product of all the potentials in Φ'
 9:             for each x_i in Ω_{X_i} do
10:                 sum ← sum + q(x_i) · log(φ_i^B(x_i, x_{pa_i^B}))
11:             end for
12:         end for
13:     end for
14:     return sum
15: end function
Algorithm 1 details the basic steps of this initial proposal for computing the Kullback–Leibler divergence. This is the algorithm used in [8,10]. Observe that it computes $LL(N^A, N^B)$; the $KL$ divergence is then obtained by means of Equation (3).
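For readers who want to experiment before looking at the released code, the following sketch reproduces the idea of Algorithm 1 on top of pgmpy, assuming a recent pgmpy version and two discrete BayesianNetwork objects defined over the same variables and state names. Instead of entering each parent configuration as evidence, it asks VariableElimination for the whole family marginal at once, which is equivalent for the purposes of Equation (5); it is an illustrative sketch, not the implementation released with the paper.

```python
import itertools
from math import log

from pgmpy.inference import VariableElimination

def expected_loglike(model_a, model_b):
    """LL(N^A, N^B): combine, for every family of N^B, its marginal under N^A (Equation (5))."""
    infer_a = VariableElimination(model_a)
    total = 0.0
    for node in model_b.nodes():
        cpd_b = model_b.get_cpds(node)                      # phi_i^B
        family = [node] + list(model_b.get_parents(node))   # f_i^B
        marginal_a = infer_a.query(variables=family, show_progress=False)  # marginal of p^A on f_i^B
        factor_b = cpd_b.to_factor()
        for states in itertools.product(*(cpd_b.state_names[v] for v in family)):
            assignment = dict(zip(family, states))
            pa = marginal_a.get_value(**assignment)
            if pa > 0.0:
                total += pa * log(factor_b.get_value(**assignment))
    return total

def kl_divergence(model_a, model_b):
    """KL(N^A, N^B) = LL(N^A, N^A) - LL(N^A, N^B), as in Equation (3)."""
    return expected_loglike(model_a, model_a) - expected_loglike(model_a, model_b)
```

The caching strategy of Section 4 targets precisely the repeated intermediate factors that these independent query calls would otherwise recompute.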
As an example, let us suppose we wish to compute the $KL$ divergence between two Bayesian networks $N^A$ and $N^B$ defined on $\mathbf{X} = \{X_1, X_2, X_3\}$ (see Figure 1), with $N^A$ as the reference model. The families of the variables in both models are presented in Figure 1. We have to compute $LL(N^A, N^A)$ and $LL(N^A, N^B)$. To compute $LL(N^A, N^B)$, Algorithm 1 is applied. Initially, $\Phi = \{\phi_1^A, \phi_2^A, \phi_3^A\}$, where $\phi_i^A$ is defined for the variables $f^A(X_i)$. The algorithm works as follows:
  • The parent set of $X_1$ is empty. The set of relevant potentials in network $N^A$ for computing the marginal of $X_1$ is $\Phi' = \{\phi_1^A\}$, which is already the desired marginal q.
  • The parent set of $X_2$ in $N^B$ is $\{X_1\}$. So, for each value $X_1 = x_1$ we have to introduce this evidence in network $N^A$ and compute the marginal on $X_2$. The set of relevant potentials is $\Phi' = \{\phi_1^A, \phi_2^A\}$. These potentials are reduced by selection on the configuration $X_1 = x_1$. If we call $\phi_4, \phi_5$ the results of reducing $\phi_1^A$ and $\phi_2^A$, respectively, then $\phi_4$ is a potential defined for the empty set of variables and determined by its value $\phi_4()$ for the empty configuration, and $\phi_5$ is a potential defined for variable $X_2$. The desired marginal q is the product of these potentials: $q(x_2) = \phi_4() \cdot \phi_5(x_2)$.
  • The parent set of $X_3$ in $N^B$ is $\{X_2\}$. So, for each value $X_2 = x_2$ we have to introduce this evidence in network $N^A$ and compute the marginal on $X_3$. In this case all the potentials are relevant: $\Phi' = \{\phi_1^A, \phi_2^A, \phi_3^A\}$. The first step introduces the evidence $X_2 = x_2$ in all the potentials containing this variable. Only $\phi_2^A$ contains $X_2$; therefore the selection $(\phi_2^A)^{X_2 = x_2}$ is a potential defined on variable $X_1$, which we will denote as $\phi_6$. After that, $\Phi' = \{\phi_1^A, \phi_3^A, \phi_6\}$. To compute the marginal on $X_3$ we have to delete variable $X_1$. As all the potentials in $\Phi'$ contain this variable, they must be combined and $X_1$ removed afterwards, i.e., computing $(\phi_1^A \cdot \phi_3^A \cdot \phi_6)^{-X_1}$. After this operation, this will be the only potential in $\Phi'$ and it is the desired marginal q.

4. Inference with Operations Cache

The approach proposed in this paper is based on the following fact: the computation of the $KL$ divergence using Equations (3) and (5) requires us to obtain the following families of marginal distributions:
  • $(p^A)^{\downarrow f_i^B}$, for each $X_i$ in $N^B$, for computing $LL(N^A, N^B)$;
  • $(p^A)^{\downarrow f_i^A}$, for each $X_i$ in $N^A$, for obtaining $LL(N^A, N^A)$.
We have designed a procedure that computes each of the required marginals $(p^A)^{\downarrow \mathbf{Y}}$ for $\mathbf{Y} \in \{f_i^A : X_i \in N^A\} \cup \{f_i^B : X_i \in N^B\}$. Marginals are computed by deleting the variables not in $\mathbf{Y}$. The procedure uses a cache of computations which can be reused across the different marginalizations in order to avoid repeated computations.
We have implemented a general procedure to calculate the marginals of a Bayesian network N for a family $\mathcal{Y}$ of subsets $\mathbf{Y}$ of $\mathbf{X}$. In our case the family $\mathcal{Y}$ is $\{f_i^A : X_i \in N^A\} \cup \{f_i^B : X_i \in N^B\}$ and the Bayesian network is $N^A$. A previous step consists of determining the relevant potentials for computing the marginal on a subset $\mathbf{Y}$, as not all the initial potentials are necessary. If $\Phi$ is the list of potentials, then a conditional probability potential $\phi_i$ for variable $X_i$ is relevant for $\mathbf{Y}$ when $X_i$ is an ancestor of some of the variables in $\mathbf{Y}$ (or belongs to $\mathbf{Y}$ itself). This is a consequence of known relevance properties in Bayesian networks [17]. Let us call $\Phi_{\mathbf{Y}}$ the family of relevant potentials for the subset $\mathbf{Y}$.
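This relevance test only needs the graph of $N^A$. A minimal sketch, with the DAG represented simply as a dictionary of parent lists (an assumption made for illustration, not a pgmpy structure), could be:

```python
def relevant_nodes(parents, targets):
    """The variables of Y together with all of their ancestors in the DAG given by `parents`."""
    relevant, pending = set(), list(targets)
    while pending:
        node = pending.pop()
        if node not in relevant:
            relevant.add(node)
            pending.extend(parents.get(node, []))   # the conditional tables of these nodes are relevant
    return relevant

# Network N^A of the running example (Figure 1): X1 is a parent of both X2 and X3
parents_a = {'X1': [], 'X2': ['X1'], 'X3': ['X1']}
print(relevant_nodes(parents_a, {'X2', 'X3'}))   # {'X1', 'X2', 'X3'}: every table is relevant
print(relevant_nodes(parents_a, {'X1', 'X2'}))   # {'X1', 'X2'}: the table of X3 can be ignored
```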
Our algorithm assumes that the subsets in $\mathcal{Y}$ are numbered from 1 to K: $\{\mathbf{Y}_1, \ldots, \mathbf{Y}_K\}$. The algorithm first carries out the deletion algorithm symbolically, without actually doing numerical computations, in order to determine which operations can be reused. A symbolic combination of $\phi$ and $\phi'$ consists of determining a potential $\phi \cdot \phi'$ defined for the variables $v(\phi) \cup v(\phi')$ but without computing its numerical values (only the scope of the resulting potential is actually computed). Symbolic marginalization is defined analogously.
In fact, two repositories are employed: one for potentials ($R_\Phi$) and another for operations ($R_O$). The entry for each potential in $R_\Phi$ contains a value acting as its identifier ($id$), the potential itself, and the identifier of the last operation for which this potential was required (this is denoted as the potential $time$). Initially, $R_\Phi$ contains the potentials in $\Phi$, assigning $time = 0$ to all of them. When a potential is no longer required, it is removed from $R_\Phi$ in order to alleviate memory space requirements. The potentials representing the required marginals (the results of the queries) are set with $time = -1$ in order to avoid their deletion.
The repository $R_O$ contains an entry for each operation (combination or marginalization) with potentials performed during the running of the algorithm in order to compute the required marginals. Thus, if an operation is needed again in the future, its result can be retrieved from $R_O$, preventing repeated computations. Initially $R_O$ is empty. At the end of the analysis, it will include the description of the elementary operations carried out throughout the evaluation of all the queries. Two kinds of operations are stored in $R_O$:
  • combination of two potentials $\phi_1$ and $\phi_2$, producing a new one, $\phi_r$, as result;
  • marginalization of a potential $\phi_1$, in order to sum out a variable, obtaining $\phi_r$ as result.
The operation description is stored as a register $(id, type, \phi_1, \phi_2, \phi_r)$ with the following information (a Python sketch of both repositories is given after this list):
  • A unique identifier for the operation ($id$; an integer).
  • The type of operation ($type$): marginalization or combination.
  • Identifiers of the potentials involved as operands and result (identifiers allow potentials to be retrieved from $R_\Phi$). If the operation is a marginalization, then $\phi_2$ will identify the index of the variable to remove.
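A minimal Python rendering of the two repositories and of the cache lookup used by the conditional combination could look like the sketch below. Field names and the identifier policy are illustrative choices; the released implementation may organize them differently.

```python
from dataclasses import dataclass

@dataclass
class PotentialEntry:                 # one row of R_Phi
    pid: int                          # identifier of the potential
    variables: tuple                  # scope v(phi); numerical values are only filled in the second phase
    time: int                         # last operation needing this potential (-1: requested marginal)

@dataclass
class OperationEntry:                 # one row of R_O
    oid: int                          # identifier of the operation
    kind: str                         # 'comb' or 'marg'
    arg1: int                         # pid of the first operand
    arg2: object                      # pid of the second operand, or the variable removed by 'marg'
    result: int                       # pid of the resulting potential

class Repositories:
    def __init__(self):
        self.potentials = {}          # R_Phi: pid -> PotentialEntry
        self.operations = {}          # R_O, indexed by (kind, arg1, arg2) for the cache lookup
        self.next_pid = 0

    def add_potential(self, variables, time):
        entry = PotentialEntry(self.next_pid, tuple(variables), time)
        self.potentials[entry.pid] = entry
        self.next_pid += 1
        return entry

    def update_time(self, pid, t):
        if self.potentials[pid].time != -1:        # requested marginals are never released
            self.potentials[pid].time = t

    def cond_combine(self, pid1, pid2, t):
        """Symbolic combination with cache lookup (the role of CondSCombine in Algorithm 3)."""
        key = ('comb', pid1, pid2)
        if key not in self.operations:             # new operation: record only the resulting scope
            scope = dict.fromkeys(self.potentials[pid1].variables
                                  + self.potentials[pid2].variables)
            result = self.add_potential(tuple(scope), t)
            self.operations[key] = OperationEntry(len(self.operations) + 1, 'comb',
                                                  pid1, pid2, result.pid)
        op = self.operations[key]                  # repeated operation: simply reuse the register
        for pid in (pid1, pid2, op.result):
            self.update_time(pid, t)
        return op.result
```

The conditional marginalization is analogous, with the removed variable playing the role of the second argument.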
The computation of a marginal for a set $\mathbf{Y}$ will also require, in some cases, a deletion order for the variables. This order is always obtained with a fixed triangulation heuristic, min-weight [18]. However, the procedure described here does not depend on this heuristic and any other could be used.
Algorithm 2 depicts the basic structure of the procedure. The result is $L_R$, an ordered list of K potentials containing the required marginals for $\mathbf{Y}_1, \ldots, \mathbf{Y}_K$. The algorithm is divided into two main parts.
In the first part (lines 2–26), the operations are planned (using symbolic propagation), detecting repeated operations. It is assumed that there are two basic functions, SCombine($\phi_1, \phi_2$) and SMarginalize($\phi, i$), representing the symbolic operations: SCombine($\phi_1, \phi_2$) creates a new potential $\phi_r$ with $v(\phi_r) = v(\phi_1) \cup v(\phi_2)$, and SMarginalize($\phi, i$) produces another potential $\phi_r$ with $v(\phi_r) = v(\phi) \setminus \{X_i\}$.
We will also consider conditional versions of these two operations: if the operation already exists, only the $time$ is updated; if it does not exist, it is symbolically carried out and added to the repository of operations. The conditional combination is CondSCombine($\phi_1, \phi_2, t$) and the conditional marginalization is CondSMarginalize($\phi, i, t$); they are depicted in Algorithms 3 and 4, respectively. It is assumed that both repositories are global variables for all the procedures. The potentials representing the required marginals are never deleted; for that, a time equal to $-1$ is assigned: if $time = -1$, then the potential should not be removed, and this time is never updated. We will assume a function UpdateTime($\phi, t$) which does nothing if the time of $\phi$ is equal to $-1$, and updates the time of $\phi$ to t in the repository $R_\Phi$ otherwise.
Observe that the first part of Algorithm 2 (lines 2–26) just determines the operations required by the deletion algorithm for all the marginals, while the second part (lines 27–32) carries out the numerical computations in the order established in the first part. After each operation, the potentials that are no longer necessary are removed from $R_\Phi$ and their memory is deallocated. We will assume a function DeleteIf($\phi, t$) doing this (remove $\phi$ from $R_\Phi$ if the $time$ of $\phi$ is equal to t).
As mentioned above, the analysis of the operation sequence is carried out using symbolic operations, taking into account only the scopes of the potentials and performing no numerical computations. This allows an efficient analysis. The result of the analysis is used as an operation plan for the subsequent numerical computation.
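The numerical phase can then be a plain loop over the recorded operations. The sketch below is a simplified analogue of lines 27–32 of Algorithm 2: instead of reading the release times from $R_\Phi$, it recomputes the last use of each potential from the plan itself, which has the same effect of freeing intermediate potentials as soon as possible.

```python
def execute_plan(operations, values, combine, marginalize, keep):
    """Replay the planned operations numerically and release intermediate potentials.

    operations  : list of OperationEntry objects in planning order
    values      : dict pid -> numerical potential, initially holding the tables of the network
    combine     : function (potential, potential) -> potential
    marginalize : function (potential, variable) -> potential
    keep        : set of pids holding the requested marginals (never released)
    """
    last_use = {}
    for step, op in enumerate(operations):          # last step at which each potential is touched
        pids = (op.arg1, op.result) + ((op.arg2,) if op.kind == 'comb' else ())
        for pid in pids:
            last_use[pid] = step
    for step, op in enumerate(operations):
        if op.kind == 'comb':
            values[op.result] = combine(values[op.arg1], values[op.arg2])
            touched = (op.arg1, op.arg2, op.result)
        else:                                       # 'marg': arg2 is the variable to sum out
            values[op.result] = marginalize(values[op.arg1], op.arg2)
            touched = (op.arg1, op.result)
        for pid in touched:                         # the role of DeleteIf in Algorithm 2
            if pid not in keep and last_use.get(pid) == step:
                values.pop(pid, None)
    return values
```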
Assume the same example considered in the previous sections for the networks in Figure 1. The marginals to compute on $N^A$ (as reference model) correspond to the families $f^A(X_1) = \{X_1\}$, $f^A(X_2) = \{X_1, X_2\}$, $f^A(X_3) = \{X_1, X_3\}$ and $f^B(X_3) = \{X_2, X_3\}$ (observe that $f^A(X_1) = f^B(X_1)$ and $f^A(X_2) = f^B(X_2)$). Therefore, in this case $\mathcal{Y} = \{\{X_1\}, \{X_1, X_2\}, \{X_1, X_3\}, \{X_2, X_3\}\}$.
Initially, the potential repository $R_\Phi$ contains the potentials of $N^A$ (a conditional probability for each variable given its parents), $\phi_1^A(X_1)$, $\phi_2^A(X_2, X_1)$ and $\phi_3^A(X_3, X_1)$, all with time 0; the variables involved in each potential are indicated in parentheses. The operation repository $R_O$ is empty. Table 1 contains the initial repositories; the superscript A is omitted there in order to simplify the notation.
Algorithm 2 Computation of marginals of p for subsets Y ∈ 𝒴
 1: function Marginal(N, 𝒴)
 2:     t ← 1
 3:     for each k = 1, …, K do
 4:         Let Y be the subset Y_k in 𝒴
 5:         Let Φ_Y be the family of potentials from Φ relevant to the subset Y
 6:         for X_i ∈ v(Φ_Y) \ Y do                           ▹ determine operations for the query
 7:             Let Φ_i = {φ ∈ Φ_Y : X_i ∈ v(φ)}
 8:             Assume Φ_i = {φ_1, …, φ_L}
 9:             ψ ← φ_1
10:             for l = 2, …, L do
11:                 ψ ← CondSCombine(ψ, φ_l, t)
12:                 t ← t + 1
13:             end for
14:             ψ ← CondSMarginalize(ψ, i, t)
15:             t ← t + 1
16:             Φ_Y ← (Φ_Y \ Φ_i) ∪ {ψ}
17:         end for
18:         Assume Φ_Y = {φ_1, …, φ_J}                        ▹ compute joint distribution
19:         ψ_k ← φ_1
20:         for j = 2, …, J do
21:             ψ_k ← CondSCombine(ψ_k, φ_j, t)
22:             t ← t + 1
23:         end for
24:         Append ψ_k to L_R
25:         Set time of ψ_k to −1
26:     end for
27:     T ← t − 1
28:     for each t = 1, …, T do       ▹ start numerical computation using the operation plan
29:         Select the register with time t from R_O: (t, type, φ_1, φ_2, φ_r)
30:         Compute the numerical values of φ_r               ▹ actual computation
31:         DeleteIf(φ_1, t), DeleteIf(φ_2, t), DeleteIf(φ_r, t)
32:     end for
33:     return L_R
34: end function
Algorithm 3 Conditional symbolic combination
 1: function CondSCombine(φ_1, φ_2, t)
 2:     if a register (id, comb, φ_1, φ_2, φ_r) is in R_O then
 3:         UpdateTime(φ_1, t), UpdateTime(φ_2, t), UpdateTime(φ_r, t)
 4:     else
 5:         φ_r ← SCombine(φ_1, φ_2)
 6:         Add the register (id, comb, φ_1, φ_2, φ_r) to R_O with id as identifier
 7:         UpdateTime(φ_1, t), UpdateTime(φ_2, t)
 8:         Add φ_r to R_Φ with time = t
 9:     end if
10:     return φ_r
11: end function
Algorithm 4 Conditional symbolic marginalization
 1: function CondSMarginalize(φ, i, t)
 2:     if a register (id, marg, φ, i, φ_r) is in R_O then
 3:         UpdateTime(φ, t), UpdateTime(φ_r, t)
 4:     else
 5:         φ_r ← SMarginalize(φ, i)
 6:         Add the register (id, marg, φ, i, φ_r) to R_O with id as identifier
 7:         UpdateTime(φ, t)
 8:         Add φ_r to R_Φ with t as time
 9:     end if
10:     return φ_r
11: end function
The first marginal to compute is for $\mathbf{Y} = \{X_1\}$. In this case the set of relevant potentials is $\Phi_{\mathbf{Y}} = \{\phi_1\}$ and there are no operations to carry out. Therefore the first marginal is $\psi_1 = \phi_1$, which is appended to $L_R$.
The second marginal to be computed is for $\mathbf{Y} = \{X_1, X_2\}$. In this case the relevant potentials are $\Phi_{\mathbf{Y}} = \{\phi_1, \phi_2\}$ and there are no variables to remove, but it is necessary to carry out the symbolic combination of $\phi_1$ and $\phi_2$ in order to compute $\psi_2$ (lines 19–23 of Algorithm 2). If we call $\phi_4$ the result, then the repositories after this operation are as shown in Table 2.
The third marginal to compute is for the set $\mathbf{Y} = \{X_1, X_3\}$. Now the relevant potentials are $\Phi_{\mathbf{Y}} = \{\phi_1, \phi_3\}$. The situation is analogous to the computation of the previous marginal, with the difference that now the symbolic combination to carry out is $\phi_1 \cdot \phi_3$. The state of the repositories after $k = 3$ is shown in Table 3. The third desired marginal is $\psi_3 = \phi_5$.
Finally, for $k = 4$ we have to compute the marginal for $\mathbf{Y} = \{X_2, X_3\}$. The relevant potentials are now $\Phi_{\mathbf{Y}} = \{\phi_1, \phi_2, \phi_3\}$. Variable $X_1$ has to be deleted from this set of potentials. As all the potentials contain this variable, as a first step it is necessary to combine all of them, and afterwards to remove $X_1$ by marginalizing on $\{X_2, X_3\}$. Assume that the order of the symbolic operations is: first combine $\phi_1$ and $\phi_2$, then combine the result with $\phi_3$, and finally marginalize this result by removing $X_1$. The repositories after $k = 4$ are presented in Table 4. The combination of $\phi_1$ and $\phi_2$ was previously carried out for $k = 2$, and therefore its result can be retrieved without new computations.
After that, the numerical part of the operations in Table 4 is carried out in the same order in which they are described in that table. In this process, after performing an operation with identifier ($id$) equal to t, the potentials with $time$ equal to t are removed from the $R_\Phi$ table. For example, in this case, potentials $\phi_2$ and $\phi_3$ can be removed from $R_\Phi$ after the operation with $id = 4$, and potential $\phi_6$ can be removed after the operation with $id = 5$, leaving in $R_\Phi$ only the potentials containing the desired marginals needed to compute the $KL$ divergence between both networks (potentials with $time = -1$).

5. Experiments

In order to compare the two computation approaches presented in the paper, the experimentation uses a set of Bayesian networks available in the bnlearn [19] repository (https://www.bnlearn.com/bnrepository/, accessed on 24 August 2021). This library provides all the functions required for the process described below. Given a certain Bayesian network as defined in the repository:
  • A dataset is generated using the rbn function. As explained in the library documentation, this function simulates random samples from a Bayesian network, using forward/backward sampling.
  • The dataset is used for learning a Bayesian network. For this step, the tabu function is employed with its default settings (the dataset as unique argument). It is one of the structural learning methods available in bnlearn. Since the learned model could have unoriented links, the cextend function is applied, which produces a Bayesian network consistent with the model passed as argument. Any other learning algorithm could have been used, since the goal is simply to have an alternative Bayesian network that will later be used to calculate the Kullback–Leibler divergence with the methods described in Algorithms 1 and 2.
For each network, the Kullback–Leibler divergence is computed with the two procedures presented: the one using evidence propagation (see Algorithm 1) and the one using the operations cache (described in Algorithm 2). The main purpose of the experiment is to obtain an estimate of the computation times required by both approaches. The obtained results are included in Table 5, which contains the following information:
  • Network name.
  • Number of nodes.
  • Number of arcs.
  • Number of parameters required for quantifying the uncertainty of the network.
  • time1: Runtime using the algorithm without cache (Algorithm 1).
  • time2: Runtime using the algorithm with cache (Algorithm 2).
  • ops: Number of elementary operations stored in the operations repository $R_O$ to compute all the necessary distributions for the calculation using Algorithm 2.
  • rep: Number of operations that are repeated and that, thanks to the use of $R_O$ and $R_\Phi$, will be executed only once.
  • del: Number of factors that were removed from $R_\Phi$, with the consequent release of memory space for future calculations.
The experiments have been run on a desktop computer with an Intel(R) Xeon(R) Gold 6230 CPU working at 3.60 GHz (80 cores) and 312 GB of RAM. The operating system is Linux Fedora Core 34.
It is observed that the calculation with the second method always offers shorter runtimes than the first one; the shortest runtime for each network in Table 5 is always time2. It is also noteworthy that for three networks (water, mildew, and barley) the method based on the use of evidence could not be completed because the available memory capacity was exceeded. Moreover, the computational overhead required to manage the operation and potential repositories pays off, as it avoids the repetition of a significant number of operations and enables unnecessary potentials to be released, especially in the most complex networks.

6. Conclusions

Computing the KL divergence between the joint probability distributions associated with two Bayesian networks is an important task that is relevant for many problems, for example assessing the accuracy of Bayesian network learning algorithms. However, this function is in general not available in software packages for probabilistic graphical models. In this paper, we provide an algorithm that uses local computation to calculate the KL divergence between two Bayesian networks. The algorithm is based on a procedure with two stages: the first one plans the operations, determining the repeated operations and the times at which potentials are no longer necessary, while the second one carries out the numerical operations, reusing the results of repeated operations instead of recomputing them and deallocating the memory associated with useless potentials. The experiments show that this strategy saves time and space, especially in complex networks.
The functions have been implemented in Python, taking the pgmpy software package as basis, and are available in the GitHub repository https://github.com/mgomez-olmedo/KL-pgmpy (accessed on 24 August 2021). The README file of the project offers details about the implementation and about the methods available for reproducing the experiments.
In the future, we plan to further improve the efficiency of the algorithms. The main line will be to invest more time in the planning stage, looking for deletion orderings that minimize the total number of operations or optimizing the order of combinations when several potentials have to be multiplied.

Author Contributions

Conceptualization, S.M., A.C. and M.G.-O.; methodology, S.M., A.C. and M.G.-O.; software, S.M., A.C. and M.G.-O.; validation, S.M., A.C. and M.G.-O.; formal analysis, S.M., A.C. and M.G.-O.; investigation, S.M., A.C. and M.G.-O.; writing–original draft preparation, S.M., A.C. and M.G.-O.; visualization, S.M., A.C. and M.G.-O.; supervision, S.M., A.C. and M.G.-O.; funding acquisition, S.M., A.C. and M.G.-O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was jointly supported by the Spanish Ministry of Education and Science under project PID2019-106758GB-C31 and the European Regional Development Fund (FEDER).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We are very grateful to the anonymous reviewers for their valuable comments and suggestions that have contributed to the improvement of the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. de Jongh, M.; Druzdzel, M.J. A comparison of structural distance measures for causal Bayesian network models. In Recent Advances in Intelligent Information Systems, Challenging Problems of Science, Computer Science Series; Springer: Basel, Switzerland, 2009; pp. 443–456.
  2. Scutari, M.; Vitolo, C.; Tucker, A. Learning Bayesian networks from big data with greedy search: Computational complexity and efficient implementation. Stat. Comput. 2019, 29, 1095–1108.
  3. Talvitie, T.; Eggeling, R.; Koivisto, M. Learning Bayesian networks with local structure, mixed variables, and exact algorithms. Int. J. Approx. Reason. 2019, 115, 69–95.
  4. Natori, K.; Uto, M.; Ueno, M. Consistent learning Bayesian networks with thousands of variables. In Advanced Methodologies for Bayesian Networks; PMLR, 2017; pp. 57–68. Available online: https://proceedings.mlr.press/v73/natori17a (accessed on 29 July 2021).
  5. Kullback, S. Information Theory and Statistics; Dover Publications: Mineola, NY, USA, 1968.
  6. de Campos, L.M. A scoring function for learning Bayesian networks based on mutual information and conditional independence tests. J. Mach. Learn. Res. 2006, 7, 2149–2187.
  7. Liu, H.; Zhou, S.; Lam, W.; Guan, J. A new hybrid method for learning Bayesian networks: Separation and reunion. Knowl.-Based Syst. 2017, 121, 185–197.
  8. Cano, A.; Gómez-Olmedo, M.; Moral, S. Learning Sets of Bayesian Networks. In Proceedings of the International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Lisbon, Portugal, 15–19 June 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 151–164.
  9. Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009.
  10. Elvira Consortium. Elvira: An Environment for Probabilistic Graphical Models. In Proceedings of the 1st European Workshop on Probabilistic Graphical Models, Cuenca, Spain, 6–8 November 2002; pp. 222–230.
  11. Kjærulff, U. Approximation of Bayesian Networks through Edge Removals; Technical Report IR-93-2007; Department of Mathematics and Computer Science, Aalborg University: Aalborg, Denmark, 1993.
  12. Choi, A.; Chan, H.; Darwiche, A. On Bayesian Network Approximation by Edge Deletion. arXiv 2012, arXiv:1207.1370.
  13. Kjærulff, U. Reduction of computational complexity in Bayesian networks through removal of weak dependences. In Uncertainty Proceedings 1994; Elsevier: Amsterdam, The Netherlands, 1994; pp. 374–382.
  14. Ankan, A.; Panda, A. pgmpy: Probabilistic graphical models using Python. In Proceedings of the 14th Python in Science Conference (SCIPY 2015), Austin, TX, USA, 6–12 June 2015.
  15. Shafer, G.R.; Shenoy, P.P. Probability propagation. Ann. Math. Artif. Intell. 1990, 2, 327–351.
  16. Dechter, R. Bucket elimination: A unifying framework for reasoning. Artif. Intell. 1999, 113, 41–85.
  17. Shachter, R. Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams). In Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-98), Madison, WI, USA, 24–26 July 1998; Morgan Kaufmann Publishers: San Francisco, CA, USA, 1998; pp. 480–487.
  18. Cano, A.; Moral, S. Heuristic algorithms for the triangulation of graphs. In Proceedings of the International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Paris, France, 4–8 July 1994; Springer: Berlin/Heidelberg, Germany, 1994; pp. 98–107.
  19. Scutari, M. Learning Bayesian networks with the bnlearn R package. J. Stat. Softw. 2010, 35, 1–22.
Figure 1. Bayesian networks to compare.
Table 1. Initial state for R_Φ (left part) and R_O (right part).

Repository of potentials (R_Φ):
  Potential        Time
  φ_1(X_1)          0
  φ_2(X_1, X_2)     0
  φ_3(X_1, X_3)     0

Repository of operations (R_O): empty (columns Id, Type, Arg. 1, Arg. 2, Result).
Table 2. Repositories after k = 2.

Repository of potentials (R_Φ):
  Potential        Time
  φ_1(X_1)         −1
  φ_2(X_1, X_2)     1
  φ_3(X_1, X_3)     0
  φ_4(X_1, X_2)    −1

Repository of operations (R_O):
  Id   Type   Arg. 1   Arg. 2   Result
  1    Comb   φ_1      φ_2      φ_4
Table 3. Repositories after k = 3.

Repository of potentials (R_Φ):
  Potential        Time
  φ_1(X_1)         −1
  φ_2(X_1, X_2)     1
  φ_3(X_1, X_3)     2
  φ_4(X_1, X_2)    −1
  φ_5(X_1, X_3)    −1

Repository of operations (R_O):
  Id   Type   Arg. 1   Arg. 2   Result
  1    Comb   φ_1      φ_2      φ_4
  2    Comb   φ_1      φ_3      φ_5
Table 4. Repositories after k = 4.

Repository of potentials (R_Φ):
  Potential            Time
  φ_1(X_1)             −1
  φ_2(X_1, X_2)         4
  φ_3(X_1, X_3)         4
  φ_4(X_1, X_2)        −1
  φ_5(X_1, X_3)        −1
  φ_6(X_1, X_2, X_3)    5
  φ_7(X_2, X_3)        −1

Repository of operations (R_O):
  Id   Type   Arg. 1   Arg. 2   Result
  1    Comb   φ_1      φ_2      φ_4
  2    Comb   φ_1      φ_3      φ_5
  3    Comb   φ_1      φ_3      φ_5
  4    Comb   φ_4      φ_3      φ_6
  5    Marg   φ_6      X_1      φ_7
Table 5. Runtimes for KL computation without cache (time1) and with cache (time2). An n/a entry in the time1 column indicates that the computation could not be completed because the available memory was exceeded.

Network      Nodes   Arcs   Parameters      Time1      Time2     Ops     Rep    Del
cancer           5      4           18     0.0313     0.0124      82       1     13
earthquake       5      4           10     0.0316     0.0091      31      15      5
survey           5      6           21     0.0504     0.0143      49      23     12
asia             8      8           18     0.0787     0.0173      62      28     13
sachs           11     17          178     0.3484     0.0409      81      30     25
child           20     25          230     0.8796     0.0726     142      58     56
insurance       27     52          984    16.9990     0.3788     631     313    223
water           32     66       10,083        n/a     8.6822     640     299    170
mildew          35     46      540,150        n/a    15.9393    1278     955    150
alarm           37     46          509     4.0949     0.3354     638     415    132
barley          48     84      114,005        n/a   205.6597    2273    1695    328
hailfinder      56     66         2656    362.5262    0.8498    1197     921    127
hepar2          70    123         1453     23.1047    1.4088    1864    1459    242
win95pts        76    112         2656    404.4568    1.0080     960     458    314
