
Computation of Kullback–Leibler Divergence in Bayesian Networks

Computer Science and Artificial Intelligent Department, University of Granada, 18071 Granada, Spain
* Author to whom correspondence should be addressed.
Entropy 2021, 23(9), 1122; https://doi.org/10.3390/e23091122
Submission received: 29 July 2021 / Revised: 19 August 2021 / Accepted: 25 August 2021 / Published: 28 August 2021
(This article belongs to the Special Issue Bayesian Inference in Probabilistic Graphical Models)

Abstract

The Kullback–Leibler divergence $KL(p, q)$ is the standard measure of error when we have a true probability distribution p which is approximated by a probability distribution q. Its efficient computation is essential in many tasks, for example in approximate computation or as a measure of error when learning a probability distribution. For high-dimensional probability distributions, such as the ones associated with Bayesian networks, a direct computation can be unfeasible. This paper considers the case of efficiently computing the Kullback–Leibler divergence of two probability distributions, each of them coming from a different Bayesian network, which might have different structures. The paper is based on an auxiliary deletion algorithm to compute the necessary marginal distributions, but uses a cache of operations with potentials in order to reuse past computations whenever they are necessary. The algorithms are tested with Bayesian networks from the bnlearn repository. Computer code in Python is provided, taking as its basis pgmpy, a library for working with probabilistic graphical models.

1. Introduction

When experimentally testing Bayesian network learning algorithms, in most cases performance is evaluated by looking at the structural differences between the graph of the original Bayesian network and that of the learned one [1], as when using the structural Hamming distance. This measure is used in recent contributions such as [2,3,4]. A study and comparison of the different metrics used to measure the structural differences between two Bayesian networks can be found in [1].
However, in most cases the aim of learning a Bayesian network is to estimate a joint probability distribution for the variables in the problem. In that situation, the error of a learning procedure should be computed by measuring the difference between the probability distribution associated with the learned network and the original joint probability distribution. Indeed, it can be useful to estimate a network that is less dense than the original one but in which the parameters can be estimated more accurately. This is the case of the naive Bayes classifier, which obtains very good results in classification problems despite the fact that its structure is not the correct one. So, in this situation, structural graphical differences are not a good measure of performance.
The basic measure to determine the divergence between an estimated distribution and a true one is the so-called Kullback–Leibler divergence [5]. Some papers use this way of assessing the quality of a learning procedure, as in [6,7,8]. A direct computation of the divergence is unfeasible if the number of variables is high. However, some basic decomposition properties [9] (Theorem 8.5) can be applied to reduce the cost of computing the divergence. This is the basis of the procedure implemented in the Elvira system [10], which is the one used in [6]. The methods in [7,8] are also based on the same basic decomposition. The Kullback–Leibler divergence is not only meaningful for measuring the divergence between a learned network and a true one, but also for other tasks, such as the approximation of a Bayesian network by a simpler one [11,12,13] obtained by removing some of the existing links.
The aim of this work is to improve existing methods for computing the Kullback–Leibler divergence in Bayesian networks and to provide a basic algorithm for this task written in Python and integrated into the pgmpy [14] environment. The algorithm implemented in the Elvira system [10] is based on carrying out a number of propagation computations in the original true network. The hypothesis underlying our approach is that many computations are repeated across these propagations, so our method determines which operations with potentials are repeated and stores their results in a cache of operations so that they can be reused. The experimental work will show that this is an effective way to improve the efficiency of the algorithms, especially in large networks.
The paper is organized as follows: Section 2 sets out the basic framework and presents the fundamental results for the computation of the Kullback–Leibler divergence; Section 3 describes the method implemented in the Elvira system for computing the Kullback–Leibler divergence; Section 4 describes our proposal based on the cache of operations with potentials; Section 5 contains the experimental setting and the obtained results; finally, conclusions are given in Section 6.

2. Kullback–Leibler Divergence

Let N be a Bayesian network defined on a set of variables $\mathbf{X} = \{X_1, \ldots, X_n\}$. The family of a variable $X_i$ in $\mathbf{X}$ is termed $f(X_i) = \{X_i\} \cup pa(X_i)$, where $pa(X_i)$ is the set of parents of $X_i$ in the directed acyclic graph (DAG) defined by N. $F = \{f(X_1), \ldots, f(X_n)\}$ denotes the complete set of families, one for each of the variables. Sometimes simplified notations for families and parent sets will be used: $f_i$ (for $f(X_i)$) and $pa_i$ (for $pa(X_i)$), respectively. As a running example, assume a network with three variables, $X_1, X_2, X_3$, and the following structure: $X_1 \rightarrow X_2 \rightarrow X_3$ (see right part of Figure 1). Then the set of families for this network is $\{f_1, f_2, f_3\}$, where $f_1 = \{X_1\}$, $f_2 = \{X_2, X_1\}$, $f_3 = \{X_3, X_2\}$.
A configuration or assignment of values to a set of variables $\mathbf{X}$, $\{X_1 = x_1, \ldots, X_n = x_n\}$, can be abbreviated as $(x_1, \ldots, x_n)$ and is denoted by $\mathbf{x}$. If the set of possible values for each variable in the previous example is $\{0, 1\}$, then a configuration can be $\mathbf{x} = (0, 0, 1)$, representing the assignment $\{X_1 = 0, X_2 = 0, X_3 = 1\}$.
A partial configuration involving a subset of variables $\mathbf{Y} \subseteq \mathbf{X}$ is denoted as $\mathbf{y}$. If the set of variables is $f_i$ or $pa_i$, then the partial configuration will be denoted by $\mathbf{x}_{f_i}$ or $\mathbf{x}_{pa_i}$, respectively. In our example, with $f_2 = \{X_2, X_1\}$, an example of a partial configuration for these variables is $\mathbf{x}_{f_2} = (0, 0)$.
The set of configurations for variables $\mathbf{Y}$ is denoted by $\Omega_{\mathbf{Y}}$. If $\mathbf{x}$ is an assignment and $\mathbf{Y} \subseteq \mathbf{X}$, then the configuration $\mathbf{y}$ obtained by deleting the values of the variables in $\mathbf{X} \setminus \mathbf{Y}$ is denoted by $\mathbf{x}^{\downarrow \mathbf{Y}}$. If $\mathbf{x}_{f_2} = (0, 0)$ is a partial configuration for variables $\{X_2, X_1\}$ and we consider $\mathbf{Y} = \{X_2\}$, then $\mathbf{x}_{f_2}^{\downarrow \mathbf{Y}}$ is the configuration obtained by removing the value of $X_1$, i.e., $(0)$.
If $\mathbf{w}$ and $\mathbf{z}$ are configurations for $\mathbf{W} \subseteq \mathbf{X}$ and $\mathbf{Z} \subseteq \mathbf{X}$ respectively, and $\mathbf{W} \cap \mathbf{Z} = \emptyset$, then $(\mathbf{w}, \mathbf{z})$ is a configuration for $\mathbf{W} \cup \mathbf{Z}$, and it will be called the composition of $\mathbf{w}$ and $\mathbf{z}$. For example, if $\mathbf{w}$ is the configuration $(0)$ for variable $X_1$ and $\mathbf{z}$ is the configuration $(0, 1)$ defined on $\{X_2, X_3\}$, then their composition is the configuration $(0, 0, 1)$ for variables $\{X_1, X_2, X_3\}$.
The conditional probability distribution for $X_i$ given its parents will be denoted as $\phi_i$, which is a potential defined on the set of variables $f(X_i)$. In general, a potential $\phi$ for variables $\mathbf{Y} \subseteq \mathbf{X}$ is a mapping from $\Omega_{\mathbf{Y}}$ into the set of real numbers: $\phi : \Omega_{\mathbf{Y}} \rightarrow \mathbb{R}$. The set of variables of potential $\phi$ will be denoted as $v(\phi)$. If $\Phi$ is a set of potentials, $v(\Phi)$ will denote $\bigcup_{\phi \in \Phi} v(\phi)$.
In our example there are three potentials and $\Phi = \{\phi_1(X_1), \phi_2(X_2, X_1), \phi_3(X_3, X_2)\}$ (that is, a probability distribution for $X_1$ and two conditional probability distributions: one for $X_2$ given $X_1$ and the other for $X_3$ given $X_2$).
There are three basic operations that can be performed on potentials (a small Python sketch of the three operations is given after this list):
  • Multiplication. If $\phi, \phi'$ are potentials, then their multiplication is the potential $\phi \cdot \phi'$, with set of variables $v(\phi \cdot \phi') = v(\phi) \cup v(\phi')$, obtained by pointwise multiplication:
    $\phi \cdot \phi'(\mathbf{y}) = \phi(\mathbf{y}^{\downarrow v(\phi)}) \cdot \phi'(\mathbf{y}^{\downarrow v(\phi')})$.
    In our example, the combination of $\phi_2$ and $\phi_3$ is the potential $\phi_2 \cdot \phi_3$ defined on $\{X_1, X_2, X_3\}$ and given by $\phi_2 \cdot \phi_3(x_1, x_2, x_3) = \phi_2(x_2, x_1) \cdot \phi_3(x_3, x_2)$.
  • Marginalization. If $\phi$ is a potential defined for variables $\mathbf{Y}$ and $\mathbf{Z} \subseteq \mathbf{Y}$, then the marginalization of $\phi$ on $\mathbf{Z}$ is denoted by $\phi^{\downarrow \mathbf{Z}}$ and it is obtained by summing over the variables in $\mathbf{Y} \setminus \mathbf{Z}$:
    $\phi^{\downarrow \mathbf{Z}}(\mathbf{z}) = \sum_{\mathbf{y}^{\downarrow \mathbf{Z}} = \mathbf{z}} \phi(\mathbf{y})$.
    When $\mathbf{Z}$ is equal to $\mathbf{Y}$ minus a variable W, then $\phi^{\downarrow \mathbf{Z}}$ will be called the result of removing W in $\phi$ and also denoted as $\phi^{-W}$. In the example, a marginalization of $\phi_3$ is obtained by removing $X_3$, producing $\phi_3^{-X_3}$ defined on $X_2$ and given by $\phi_3^{-X_3}(x_2) = \phi_3(0, x_2) + \phi_3(1, x_2)$. If $\phi_3(x_3, x_2)$ represents the conditional probability of $X_3 = x_3$ given $X_2 = x_2$, then $\phi_3^{-X_3}(x_2)$ is always equal to 1 (for every value $x_2$ of $X_2$).
  • Selection. If $\phi$ is a potential defined for variables $\mathbf{Y}$ and $\mathbf{z}$ is a configuration for variables $\mathbf{Z}$, then the selection of $\phi$ for this configuration $\mathbf{z}$ is the potential $\phi^{\mathbf{Z} = \mathbf{z}}$ defined on variables $\mathbf{W} = \mathbf{Y} \setminus \mathbf{Z}$ and given by
    $\phi^{\mathbf{Z} = \mathbf{z}}(\mathbf{w}) = \phi(\mathbf{w}, \mathbf{z}^{\downarrow \mathbf{Y}})$.
    In this expression $(\mathbf{w}, \mathbf{z}^{\downarrow \mathbf{Y}})$ is the composition of configurations $\mathbf{w}$ and $\mathbf{z}^{\downarrow \mathbf{Y}}$, which is a configuration for variables $v(\phi)$. Going back to the example, assume that we want to perform the selection of $\phi_3$ for the configuration $\mathbf{z} = (0, 1)$ of variables $\{X_1, X_2\}$; then $\phi_3^{\mathbf{Z} = \mathbf{z}}$ will be a potential defined for variables $\{X_2, X_3\} \setminus \{X_1, X_2\} = \{X_3\}$ given by $\phi_3^{\mathbf{Z} = \mathbf{z}}(x_3) = \phi_3(x_3, 1)$, as we are reducing $\phi_3(X_3, X_2)$ to a configuration $\mathbf{z}$ in which $X_2 = 1$.
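To make these three operations concrete, the following minimal Python sketch stores a potential as a tuple of variable names plus a dictionary from configurations to values. The class name Potential and its method names are illustrative choices for this paper's notions, not part of any library.

```python
import itertools

class Potential:
    """A discrete potential: a table mapping configurations of its variables to real numbers."""

    def __init__(self, variables, table):
        self.variables = tuple(variables)   # the scope v(phi)
        self.table = dict(table)            # {tuple of values (same order as variables): value}

    def _domains(self):
        doms = {v: set() for v in self.variables}
        for conf in self.table:
            for var, val in zip(self.variables, conf):
                doms[var].add(val)
        return doms

    def multiply(self, other):
        """Pointwise multiplication; the result is defined on v(phi) union v(phi')."""
        joint = self.variables + tuple(v for v in other.variables if v not in self.variables)
        doms = {**self._domains(), **other._domains()}
        table = {}
        for conf in itertools.product(*(sorted(doms[v]) for v in joint)):
            assign = dict(zip(joint, conf))
            a = self.table[tuple(assign[v] for v in self.variables)]
            b = other.table[tuple(assign[v] for v in other.variables)]
            table[conf] = a * b
        return Potential(joint, table)

    def marginalize(self, keep):
        """Sum out every variable not in `keep` (the downarrow operation)."""
        kept = tuple(v for v in self.variables if v in keep)
        idx = [self.variables.index(v) for v in kept]
        table = {}
        for conf, value in self.table.items():
            reduced = tuple(conf[i] for i in idx)
            table[reduced] = table.get(reduced, 0.0) + value
        return Potential(kept, table)

    def select(self, evidence):
        """Evidence restriction: keep only rows consistent with `evidence` and drop those variables."""
        kept = tuple(v for v in self.variables if v not in evidence)
        idx = [self.variables.index(v) for v in kept]
        table = {}
        for conf, value in self.table.items():
            assign = dict(zip(self.variables, conf))
            if all(assign[v] == val for v, val in evidence.items() if v in assign):
                table[tuple(conf[i] for i in idx)] = value
        return Potential(kept, table)
```

For the running example, phi2.multiply(phi3) would be defined on {X1, X2, X3}, phi3.marginalize({'X2'}) removes X3, and phi3.select({'X2': 1}) is the selection discussed in the text.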
The family of all the conditional distributions is denoted as $\Phi = \{\phi_1, \ldots, \phi_n\}$. It is well known that, given N, the joint probability distribution p of the variables in N is a potential that decomposes as the product of the potentials included in $\Phi$:
$p = \prod_{\phi_i \in \Phi} \phi_i \qquad (1)$
Considering the example, $\Phi = \{\phi_1, \phi_2, \phi_3\}$ and $p = \phi_1 \cdot \phi_2 \cdot \phi_3$. The marginal distribution of p for a set of variables $\mathbf{Y} \subseteq \mathbf{X}$ is equal to $p^{\downarrow \mathbf{Y}}$. When $\mathbf{Y}$ contains only one variable $X_i$, then, to simplify the notation, $p^{\downarrow \mathbf{Y}}$ will be denoted as $p_i$. Sometimes it will be necessary to make reference to the Bayesian network containing a potential or family. In these cases we will use a superscript: for example, $f^A(X_i)$ and $pa^A(X_i)$ refer to the family and parent set of $X_i$ in a Bayesian network $N^A$, respectively.
The aim of this paper is to compute the Kullback–Leibler divergence (termed $KL$) between the joint probability distributions $p^A$ and $p^B$ of two different Bayesian networks $N^A$ and $N^B$ defined on the same set of variables $\mathbf{X}$ but possibly having different structures. This divergence, denoted as $KL(N^A, N^B)$, can be computed considering the probabilities of each configuration $\mathbf{x}$ in both distributions as follows:
$KL(N^A, N^B) = \sum_{\mathbf{x}} p^A(\mathbf{x}) \log \left( \frac{p^A(\mathbf{x})}{p^B(\mathbf{x})} \right) \qquad (2)$
However, the computation with the joint probability distribution may be unfeasible for complex models, as the number of configurations $\mathbf{x}$ is exponential in the number of variables. If $p, q$ are probability distributions on $\mathbf{X}$, then the expected log-likelihood ($LL$) of q with respect to p is:
$LL(p, q) = \sum_{\mathbf{x}} p(\mathbf{x}) \log(q(\mathbf{x})),$
then, from Equation (2), it is immediate that:
$KL(N^A, N^B) = \sum_{\mathbf{x}} p^A(\mathbf{x}) \log(p^A(\mathbf{x})) - \sum_{\mathbf{x}} p^A(\mathbf{x}) \log(p^B(\mathbf{x})) = LL(p^A, p^A) - LL(p^A, p^B) = LL(N^A, N^A) - LL(N^A, N^B) \qquad (3)$
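When the number of variables is small, Equations (2) and (3) can be evaluated directly by enumerating every configuration, which also provides a useful baseline for testing. The sketch below does this with plain Python callables; the function names and the toy factorisation are illustrative only, and the approach is of course infeasible for large networks.

```python
import itertools
from math import log

def kl_divergence(p, q, domains):
    """KL(p, q) = LL(p, p) - LL(p, q), computed by enumerating all configurations.

    p, q    : callables mapping a configuration dict {variable: value} to its probability
    domains : dict {variable: list of possible values}
    """
    variables = sorted(domains)
    ll_pp = ll_pq = 0.0
    for values in itertools.product(*(domains[v] for v in variables)):
        x = dict(zip(variables, values))
        px = p(x)
        if px > 0.0:                      # configurations with p(x) = 0 contribute nothing
            ll_pp += px * log(px)
            ll_pq += px * log(q(x))
    return ll_pp - ll_pq

# Running example: p factorises along the chain X1 -> X2 -> X3 (illustrative numbers)
phi1 = {0: 0.6, 1: 0.4}                                            # phi1(x1)
phi2 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}        # phi2(x2, x1)
phi3 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}        # phi3(x3, x2)
p = lambda x: phi1[x['X1']] * phi2[(x['X2'], x['X1'])] * phi3[(x['X3'], x['X2'])]
domains = {'X1': [0, 1], 'X2': [0, 1], 'X3': [0, 1]}
print(kl_divergence(p, p, domains))       # approximately 0.0: divergence of a distribution with itself
```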
The probability distribution $p^B$ decomposes as well, as considered in Equation (1). Therefore, the term $LL(N^A, N^B)$ in Equation (3) can be obtained as follows, considering the families of the variables in $N^B$ and their corresponding configurations $\mathbf{x}_{f_i^B}$:
$LL(N^A, N^B) = \sum_{\mathbf{x}} p^A(\mathbf{x}) \log(p^B(\mathbf{x})) = \sum_{\mathbf{x}} p^A(\mathbf{x}) \log \Big( \prod_{X_i \in \mathbf{X}} \phi_i^B(\mathbf{x}_{f_i^B}) \Big) = \sum_{\mathbf{x}} p^A(\mathbf{x}) \sum_{X_i \in \mathbf{X}} \log \big( \phi_i^B(\mathbf{x}_{f_i^B}) \big) \qquad (4)$
Interchanging the summations and reorganizing the terms in Equation (4):
$LL(N^A, N^B) = \sum_{X_i \in \mathbf{X}} \sum_{\mathbf{x}} p^A(\mathbf{x}) \log \big( \phi_i^B(\mathbf{x}_{f_i^B}) \big) = \sum_{X_i \in \mathbf{X}} \sum_{\mathbf{x}_{f_i^B}} \log \big( \phi_i^B(\mathbf{x}_{f_i^B}) \big) \Big( \sum_{\mathbf{x} : \mathbf{x}^{\downarrow f_i^B} = \mathbf{x}_{f_i^B}} p^A(\mathbf{x}) \Big) = \sum_{X_i \in \mathbf{X}} \sum_{\mathbf{x}_{f_i^B}} \log \big( \phi_i^B(\mathbf{x}_{f_i^B}) \big) \, (p^A)^{\downarrow f_i^B}(\mathbf{x}_{f_i^B}) \qquad (5)$
Equation (5) implies a decomposition of the term $LL(N^A, N^B)$ and, as a consequence, of the computation of $KL(N^A, N^B)$ as well. Observe that $\phi_i^B(\mathbf{x}_{f_i^B})$ is the value of the potential $\phi_i^B$ for a configuration $\mathbf{x}_{f_i^B}$ and can be obtained directly from the potential $\phi_i^B$ of the Bayesian network $N^B$. The main difficulty in Equation (5) consists of the computation of the $(p^A)^{\downarrow f_i^B}(\mathbf{x}_{f_i^B})$ values, as it is necessary to compute the marginal probability distribution for the variables in $f_i^B$, the family of $X_i$ in the Bayesian network $N^B$, but using the joint probability distribution $p^A$ associated with the Bayesian network $N^A$.

3. Computation with Propagation Algorithms

In this section we introduce the category of inference algorithms based on deletion of variables and then we show how these algorithms can be applied to compute the Kullback–Leibler divergence using Equation (5).

3.1. Variable Elimination Algorithms

To compute $(p^A)^{\downarrow f_i^B}$ we consider $\Phi^A$, the set of potentials associated with network $N^A$: the multiplication of all the potentials in $\Phi^A$ is equal to $p^A$. Deletion algorithms [15,16] can be applied to $\Phi^A$ to determine the required marginalizations. The basic step of these algorithms is the deletion of a variable from a set of potentials $\Phi$:
  • Variable Deletion. If $\Phi$ is a set of potentials, the deletion of $X_i$ consists of the following operations:
    Compute $\Phi_i = \{\phi \in \Phi : X_i \in v(\phi)\}$, i.e., the set of potentials containing variable $X_i$.
    Compute $\phi^{-i} = \big( \prod_{\phi \in \Phi_i} \phi \big)^{-X_i}$, i.e., combine all the potentials in $\Phi_i$ and remove variable $X_i$ by marginalization.
    Update $\Phi \leftarrow (\Phi \setminus \Phi_i) \cup \{\phi^{-i}\}$, i.e., remove from $\Phi$ the potentials containing $X_i$ and add the new potential $\phi^{-i}$, which does not contain $X_i$.
The main property of the deletion step is the following: if initially $\prod_{\phi \in \Phi} \phi = q$, then after the deletion of $X_i$ from $\Phi$ we have $\prod_{\phi \in \Phi} \phi = q^{-X_i}$. It is easy to see that the deletion of a variable $X_i$ can be computed by operating only with the elements of $\Phi$ defined on $X_i$.
If $\Phi$ is the initial set of potentials of a network N, then $p = \prod_{\phi \in \Phi} \phi$. In order to compute the marginalization of p on a set of variables $\mathbf{Y} \subseteq \mathbf{X}$, the deletion procedure should be repeated for each variable $X_i$ in $\mathbf{X} \setminus \mathbf{Y}$. If the marginal probability distribution for a single variable $X_k$ is to be calculated, every variable in $\mathbf{X}$ different from $X_k$ should be deleted. The order of variable deletion does not affect the final result, but the efficiency may depend on it.
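Reusing the Potential sketch from Section 2, a deletion step and the resulting marginalization routine can be written as follows. This is again a simplified sketch under the same assumptions; a real implementation would also choose a good deletion ordering rather than an arbitrary one.

```python
def delete_variable(potentials, variable):
    """One deletion step: combine every potential whose scope contains `variable` and sum it out."""
    with_var = [phi for phi in potentials if variable in phi.variables]
    rest = [phi for phi in potentials if variable not in phi.variables]
    if not with_var:
        return rest
    product = with_var[0]
    for phi in with_var[1:]:
        product = product.multiply(phi)
    keep = [v for v in product.variables if v != variable]
    return rest + [product.marginalize(keep)]

def marginal(potentials, target):
    """Delete every variable outside `target`; the product of what remains is p marginalized on target."""
    result = list(potentials)
    all_vars = {v for phi in potentials for v in phi.variables}
    for variable in all_vars - set(target):        # deletion order chosen arbitrarily in this sketch
        result = delete_variable(result, variable)
    return result
```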
When there are observed variables, $\mathbf{Z} = \mathbf{z}$, then a previous step of selection should be carried out: every potential $\phi \in \Phi$ is transformed into $\phi^{\mathbf{Z} = \mathbf{z}}$. This step will be called evidence restriction. After it, q, the product of the potentials in $\Phi$, is defined for the variables in $\mathbf{Y} = \mathbf{X} \setminus \mathbf{Z}$ and its value is $q(\mathbf{y}) = p(\mathbf{y}, \mathbf{z})$, i.e., the joint probability of obtaining this value and the observations. If a deletion of the variables in $\mathbf{W}$ is carried out, then the product of the potentials in $\Phi$ is a potential defined for the variables $\mathbf{Y} = (\mathbf{X} \setminus \mathbf{Z}) \setminus \mathbf{W}$, and it satisfies $q(\mathbf{y}) = p^{\downarrow \mathbf{Y} \cup \mathbf{Z}}(\mathbf{y}, \mathbf{z})$.
When we have observations and we want to compute the marginal on a variable X i , it is well known that not all the initial potentials in Φ are relevant. A previous pruning step can be done using the Bayes-ball algorithm [17] in order to remove the irrelevant potentials from Φ before restricting to the observations and carrying out the deletion of variables.

3.2. Computation of Kullback–Leibler Divergence Using Deletion Algorithms

Our first alternative for computing $LL(N^A, N^B)$ is based on using a simple deletion algorithm to compute the values $(p^A)^{\downarrow f_i^B}(\mathbf{x}_{f_i^B})$ in Equation (5). The basic steps are:
  • Given a specific variable $X_i$, we have that $f_i^B = \{X_i\} \cup pa_i^B$. Then, for each possible configuration $\mathbf{x}_{pa_i^B}$ of the parent variables, we include the observation $pa_i^B = \mathbf{x}_{pa_i^B}$ and apply a selection operation to the list of potentials associated with the Bayesian network $N^A$ by means of evidence restriction. We also apply a pruning of irrelevant variables using the Bayes-ball algorithm.
  • Then all the variables are deleted except the target variable $X_i$. The potentials in $\Phi$ will all be defined for variable $X_i$ and their product will be a potential q defined for variable $X_i$ such that $q(x_i) = (p^A)^{\downarrow f_i^B}(x_i, \mathbf{x}_{pa_i^B})$.
  • The deletion algorithm is repeated for each variable $X_i$ and each configuration of the parent variables $\mathbf{x}_{pa_i^B}$ in the Bayesian network $N^B$. Therefore, the number of executions of the propagation algorithm in the Bayesian network $N^A$ is equal to $\sum_{i=1}^{n} \prod_{X_j \in pa^B(X_i)} n_j$, where $n_j$ is the number of possible values of $X_j$. This is immediate taking into account that $\prod_{X_j \in pa^B(X_i)} n_j$ is the number of possible configurations $\mathbf{x}_{pa^B(X_i)}$ of the variables in $pa^B(X_i)$ (the short sketch after this list computes this count).
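As an illustration of this count, a hypothetical helper on top of a pgmpy BayesianNetwork (with its CPDs already attached, so that cardinalities are known) could compute the number of propagations as follows:

```python
import numpy as np

def number_of_propagations(model_b):
    """Sum, over the variables of N^B, of the number of configurations of their parent sets."""
    total = 0
    for node in model_b.nodes():
        cardinalities = [model_b.get_cardinality(parent) for parent in model_b.get_parents(node)]
        total += int(np.prod(cardinalities)) if cardinalities else 1   # empty parent set: one run
    return total
```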
Though this method can take advantage of propagation algorithms to compute marginals in a Bayesian network, and it avoids a brute force computation associated with the use of Equation (2), it is quite time consuming when the structure of the involved Bayesian network is complex.
Algorithm 1 Computation of LL using an evidence propagation algorithm
 1: function LL(N^A, N^B)
 2:     sum ← 0.0                                              ▹ sets initial value to sum
 3:     for each X_i in N^B do
 4:         for each x_{pa_i^B} do                             ▹ configuration of X_i parents
 5:             Let Φ' be the set of relevant potentials from Φ          ▹ applying Bayes-ball
 6:             Restrict the potentials in Φ' to the evidence pa_i^B = x_{pa_i^B}
 7:             Delete in Φ' all the variables in v(Φ') \ {X_i}
 8:             Let q be the product of all the potentials in Φ'
 9:             for each x_i in Ω_{X_i} do
10:                 sum ← sum + q(x_i) · log(φ_i^B(x_i, x_{pa_i^B}))
11:             end for
12:         end for
13:     end for
14:     return sum
15: end function
Algorithm 1 details the basic steps of this initial proposal for computing the Kullback–Leibler divergence. This is the algorithm used in [8,10]. Observe that it computes $LL(N^A, N^B)$; the $KL$ divergence is then obtained by means of Equation (3).
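For readers who want to experiment before looking at the released code, the following sketch reproduces the idea of Algorithm 1 on top of pgmpy, assuming a recent pgmpy version and two discrete BayesianNetwork objects defined over the same variables and state names. Instead of entering each parent configuration as evidence, it asks VariableElimination for the whole family marginal at once, which is equivalent for the purposes of Equation (5); it is an illustrative sketch, not the implementation released with the paper.

```python
import itertools
from math import log

from pgmpy.inference import VariableElimination

def expected_loglike(model_a, model_b):
    """LL(N^A, N^B): combine, for every family of N^B, its marginal under N^A (Equation (5))."""
    infer_a = VariableElimination(model_a)
    total = 0.0
    for node in model_b.nodes():
        cpd_b = model_b.get_cpds(node)                      # phi_i^B
        family = [node] + list(model_b.get_parents(node))   # f_i^B
        marginal_a = infer_a.query(variables=family, show_progress=False)  # marginal of p^A on f_i^B
        factor_b = cpd_b.to_factor()
        for states in itertools.product(*(cpd_b.state_names[v] for v in family)):
            assignment = dict(zip(family, states))
            pa = marginal_a.get_value(**assignment)
            if pa > 0.0:
                total += pa * log(factor_b.get_value(**assignment))
    return total

def kl_divergence(model_a, model_b):
    """KL(N^A, N^B) = LL(N^A, N^A) - LL(N^A, N^B), as in Equation (3)."""
    return expected_loglike(model_a, model_a) - expected_loglike(model_a, model_b)
```

The caching strategy of Section 4 targets precisely the repeated intermediate factors that these independent query calls would otherwise recompute.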
As an example, let us suppose we wish to compute the $KL$ divergence between two Bayesian networks $N^A$ and $N^B$ defined on $\mathbf{X} = \{X_1, X_2, X_3\}$ (see Figure 1), with $N^A$ as the reference model. The families of the variables in both models are presented in Figure 1. We have to compute $LL(N^A, N^A)$ and $LL(N^A, N^B)$. To compute $LL(N^A, N^B)$, Algorithm 1 is applied. Initially, $\Phi = \{\phi_1^A, \phi_2^A, \phi_3^A\}$, where $\phi_i^A$ is defined for the variables $f^A(X_i)$. The algorithm works as follows:
  • The parent set of $X_1$ is empty. The set of relevant potentials in network $N^A$ for computing the marginal of $X_1$ is $\Phi' = \{\phi_1^A\}$, which is already the desired marginal q.
  • The parent set of $X_2$ in $N^B$ is $\{X_1\}$. So, for each value $X_1 = x_1$ we have to introduce this evidence in network $N^A$ and compute the marginal on $X_2$. The set of relevant potentials is $\Phi' = \{\phi_1^A, \phi_2^A\}$. These potentials are reduced by selection on the configuration $X_1 = x_1$. If we call $\phi_4, \phi_5$ the results of reducing $\phi_1^A$ and $\phi_2^A$, respectively, then $\phi_4$ is a potential defined for the empty set of variables and determined by its value $\phi_4()$ for the empty configuration, and $\phi_5$ is a potential defined for variable $X_2$. The desired marginal q is the product of these potentials: $q(x_2) = \phi_4() \cdot \phi_5(x_2)$.
  • The parent set of $X_3$ in $N^B$ is $\{X_2\}$. So, for each value $X_2 = x_2$ we have to introduce this evidence in network $N^A$ and compute the marginal on $X_3$. In this case all the potentials are relevant: $\Phi' = \{\phi_1^A, \phi_2^A, \phi_3^A\}$. The first step introduces the evidence $X_2 = x_2$ in all the potentials containing this variable. Only $\phi_2^A$ contains $X_2$; therefore the selection $(\phi_2^A)^{X_2 = x_2}$ is a potential defined on variable $X_1$, which we will denote as $\phi_6$. After that, $\Phi' = \{\phi_1^A, \phi_3^A, \phi_6\}$. To compute the marginal on $X_3$ we have to delete variable $X_1$. As all the potentials in $\Phi'$ contain this variable, they must be combined and $X_1$ removed afterwards, i.e., computing $(\phi_1^A \cdot \phi_3^A \cdot \phi_6)^{-X_1}$. After this operation, this will be the only potential in $\Phi'$ and it is the desired marginal q.

4. Inference with Operations Cache

The approach proposed in this paper is based on the following fact: the computation of the $KL$ divergence using Equations (3) and (5) requires us to obtain the following families of marginal distributions:
  • $(p^A)^{\downarrow f_i^B}$, for each $X_i$ in $N^B$, for computing $LL(N^A, N^B)$;
  • $(p^A)^{\downarrow f_i^A}$, for each $X_i$ in $N^A$, for obtaining $LL(N^A, N^A)$.
We have designed a procedure that computes each of the required marginals $(p^A)^{\downarrow \mathbf{Y}}$ for $\mathbf{Y} \in \{f_i^A : X_i \in N^A\} \cup \{f_i^B : X_i \in N^B\}$. Marginals are computed by deleting the variables not in $\mathbf{Y}$. The procedure uses a cache of computations which can be reused across the different marginalizations in order to avoid repeated computations.
We have implemented a general procedure to calculate the marginals of a Bayesian network N for a family $\mathcal{Y}$ of subsets $\mathbf{Y}$ of $\mathbf{X}$. In our case the family $\mathcal{Y}$ is $\{f_i^A : X_i \in N^A\} \cup \{f_i^B : X_i \in N^B\}$ and the Bayesian network is $N^A$. A previous step consists of determining the relevant potentials for computing the marginal on a subset $\mathbf{Y}$, as not all the initial potentials are necessary. If $\Phi$ is the list of potentials, then a conditional probability potential $\phi_i$ for variable $X_i$ is relevant for $\mathbf{Y}$ when $X_i$ is an ancestor of some of the variables in $\mathbf{Y}$ (or belongs to $\mathbf{Y}$ itself). This is a consequence of known relevance properties in Bayesian networks [17]. Let us call $\Phi_{\mathbf{Y}}$ the family of relevant potentials for the subset $\mathbf{Y}$.
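This relevance test only needs the graph of $N^A$. A minimal sketch, with the DAG represented simply as a dictionary of parent lists (an assumption made for illustration, not a pgmpy structure), could be:

```python
def relevant_nodes(parents, targets):
    """The variables of Y together with all of their ancestors in the DAG given by `parents`."""
    relevant, pending = set(), list(targets)
    while pending:
        node = pending.pop()
        if node not in relevant:
            relevant.add(node)
            pending.extend(parents.get(node, []))   # the conditional tables of these nodes are relevant
    return relevant

# Network N^A of the running example (Figure 1): X1 is a parent of both X2 and X3
parents_a = {'X1': [], 'X2': ['X1'], 'X3': ['X1']}
print(relevant_nodes(parents_a, {'X2', 'X3'}))   # {'X1', 'X2', 'X3'}: every table is relevant
print(relevant_nodes(parents_a, {'X1', 'X2'}))   # {'X1', 'X2'}: the table of X3 can be ignored
```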
Our algorithm assumes that the subsets in $\mathcal{Y}$ are numbered from 1 to K: $\{\mathbf{Y}_1, \ldots, \mathbf{Y}_K\}$. The algorithm first carries out the deletion algorithm symbolically, without actually doing numerical computations, in order to determine which operations can be reused. A symbolic combination of $\phi$ and $\phi'$ consists of determining a potential $\phi \cdot \phi'$ defined for the variables $v(\phi) \cup v(\phi')$ but without computing its numerical values (only the scope of the resulting potential is actually computed). Symbolic marginalization is defined analogously.
In fact, two repositories are employed: one for potentials ($R_\Phi$) and another for operations ($R_O$). The entry for each potential in $R_\Phi$ contains a value acting as its identifier ($id$), the potential itself, and the identifier of the last operation for which this potential was required (this is denoted as the potential $time$). Initially, $R_\Phi$ contains the potentials in $\Phi$, assigning $time = 0$ to all of them. When a potential is no longer required, it is removed from $R_\Phi$ in order to alleviate memory space requirements. The potentials representing the required marginals (the results of the queries) are set with $time = -1$ in order to avoid their deletion.
The repository $R_O$ contains an entry for each operation (combination or marginalization) with potentials performed during the running of the algorithm in order to compute the required marginals. Thus, if an operation is needed again in the future, its result can be retrieved from $R_O$, preventing repeated computations. Initially $R_O$ is empty. At the end of the analysis, it will include the description of the elementary operations carried out throughout the evaluation of all the queries. Two kinds of operations are stored in $R_O$:
  • combination of two potentials $\phi_1$ and $\phi_2$, producing a new one, $\phi_r$, as result;
  • marginalization of a potential $\phi_1$, in order to sum out a variable, obtaining $\phi_r$ as result.
The operation description is stored as a register $(id, type, \phi_1, \phi_2, \phi_r)$ with the following information (a Python sketch of both repositories is given after this list):
  • A unique identifier for the operation ($id$; an integer).
  • The type of operation ($type$): marginalization or combination.
  • Identifiers of the potentials involved as operands and result (identifiers allow potentials to be retrieved from $R_\Phi$). If the operation is a marginalization, then $\phi_2$ will identify the index of the variable to remove.
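A minimal Python rendering of the two repositories and of the cache lookup used by the conditional combination could look like the sketch below. Field names and the identifier policy are illustrative choices; the released implementation may organize them differently.

```python
from dataclasses import dataclass

@dataclass
class PotentialEntry:                 # one row of R_Phi
    pid: int                          # identifier of the potential
    variables: tuple                  # scope v(phi); numerical values are only filled in the second phase
    time: int                         # last operation needing this potential (-1: requested marginal)

@dataclass
class OperationEntry:                 # one row of R_O
    oid: int                          # identifier of the operation
    kind: str                         # 'comb' or 'marg'
    arg1: int                         # pid of the first operand
    arg2: object                      # pid of the second operand, or the variable removed by 'marg'
    result: int                       # pid of the resulting potential

class Repositories:
    def __init__(self):
        self.potentials = {}          # R_Phi: pid -> PotentialEntry
        self.operations = {}          # R_O, indexed by (kind, arg1, arg2) for the cache lookup
        self.next_pid = 0

    def add_potential(self, variables, time):
        entry = PotentialEntry(self.next_pid, tuple(variables), time)
        self.potentials[entry.pid] = entry
        self.next_pid += 1
        return entry

    def update_time(self, pid, t):
        if self.potentials[pid].time != -1:        # requested marginals are never released
            self.potentials[pid].time = t

    def cond_combine(self, pid1, pid2, t):
        """Symbolic combination with cache lookup (the role of CondSCombine in Algorithm 3)."""
        key = ('comb', pid1, pid2)
        if key not in self.operations:             # new operation: record only the resulting scope
            scope = dict.fromkeys(self.potentials[pid1].variables
                                  + self.potentials[pid2].variables)
            result = self.add_potential(tuple(scope), t)
            self.operations[key] = OperationEntry(len(self.operations) + 1, 'comb',
                                                  pid1, pid2, result.pid)
        op = self.operations[key]                  # repeated operation: simply reuse the register
        for pid in (pid1, pid2, op.result):
            self.update_time(pid, t)
        return op.result
```

The conditional marginalization is analogous, with the removed variable playing the role of the second argument.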
The computation of a marginal for a set $\mathbf{Y}$ will also require, in some cases, a deletion order for the variables. This order is always obtained with a fixed triangulation heuristic, min-weight [18]. However, the procedure described here does not depend on this heuristic and any other could be used.
Algorithm 2 depicts the basic structure of the procedure. The result is $L_R$, an ordered list of K potentials containing the required marginals for $\mathbf{Y}_1, \ldots, \mathbf{Y}_K$. The algorithm is divided into two main parts.
In the first part (lines 2–26), the operations are planned (using symbolic propagation), detecting repeated operations. It is assumed that there are two basic functions, SCombine($\phi_1, \phi_2$) and SMarginalize($\phi, i$), representing the symbolic operations: SCombine($\phi_1, \phi_2$) creates a new potential $\phi_r$ with $v(\phi_r) = v(\phi_1) \cup v(\phi_2)$, and SMarginalize($\phi, i$) produces another potential $\phi_r$ with $v(\phi_r) = v(\phi) \setminus \{X_i\}$.
We will also consider conditional versions of these two operations: if the operation already exists, only the $time$ is updated; if it does not exist, it is symbolically carried out and added to the repository of operations. The conditional combination is CondSCombine($\phi_1, \phi_2, t$) and the conditional marginalization is CondSMarginalize($\phi, i, t$); they are depicted in Algorithms 3 and 4, respectively. It is assumed that both repositories are global variables for all the procedures. The potentials representing the required marginals are never deleted; for that, a time equal to $-1$ is assigned: if $time = -1$, then the potential should not be removed, and this time is never updated. We will assume a function UpdateTime($\phi, t$) which does nothing if the time of $\phi$ is equal to $-1$, and updates the time of $\phi$ to t in the repository $R_\Phi$ otherwise.
Observe that the first part of Algorithm 2 (lines 2–26) just determines the operations required by the deletion algorithm for all the marginals, while the second part (lines 27–32) carries out the numerical computations in the order established in the first part. After each operation, the potentials that are no longer necessary are removed from $R_\Phi$ and their memory is deallocated. We will assume a function DeleteIf($\phi, t$) doing this (remove $\phi$ from $R_\Phi$ if the $time$ of $\phi$ is equal to t).
As mentioned above, the analysis of the operation sequence is carried out using symbolic operations, taking into account only the scopes of the potentials and performing no numerical computations. This allows an efficient analysis. The result of the analysis is used as an operation plan for the subsequent numerical computation.
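The numerical phase can then be a plain loop over the recorded operations. The sketch below is a simplified analogue of lines 27–32 of Algorithm 2: instead of reading the release times from $R_\Phi$, it recomputes the last use of each potential from the plan itself, which has the same effect of freeing intermediate potentials as soon as possible.

```python
def execute_plan(operations, values, combine, marginalize, keep):
    """Replay the planned operations numerically and release intermediate potentials.

    operations  : list of OperationEntry objects in planning order
    values      : dict pid -> numerical potential, initially holding the tables of the network
    combine     : function (potential, potential) -> potential
    marginalize : function (potential, variable) -> potential
    keep        : set of pids holding the requested marginals (never released)
    """
    last_use = {}
    for step, op in enumerate(operations):          # last step at which each potential is touched
        pids = (op.arg1, op.result) + ((op.arg2,) if op.kind == 'comb' else ())
        for pid in pids:
            last_use[pid] = step
    for step, op in enumerate(operations):
        if op.kind == 'comb':
            values[op.result] = combine(values[op.arg1], values[op.arg2])
            touched = (op.arg1, op.arg2, op.result)
        else:                                       # 'marg': arg2 is the variable to sum out
            values[op.result] = marginalize(values[op.arg1], op.arg2)
            touched = (op.arg1, op.result)
        for pid in touched:                         # the role of DeleteIf in Algorithm 2
            if pid not in keep and last_use.get(pid) == step:
                values.pop(pid, None)
    return values
```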
Assume the same example considered in the previous sections for the networks in Figure 1. The marginals to compute on $N^A$ (as reference model) correspond to the families $f^A(X_1) = \{X_1\}$, $f^A(X_2) = \{X_1, X_2\}$, $f^A(X_3) = \{X_1, X_3\}$ and $f^B(X_3) = \{X_2, X_3\}$ (observe that $f^A(X_1) = f^B(X_1)$ and $f^A(X_2) = f^B(X_2)$). Therefore, in this case $\mathcal{Y} = \{\{X_1\}, \{X_1, X_2\}, \{X_1, X_3\}, \{X_2, X_3\}\}$.
Initially, the potential repository $R_\Phi$ contains the potentials of $N^A$ (a conditional probability for each variable given its parents), $\phi_1^A(X_1)$, $\phi_2^A(X_2, X_1)$ and $\phi_3^A(X_3, X_1)$, all with time 0; the variables involved in each potential are indicated in parentheses. The operation repository $R_O$ is empty. Table 1 contains the initial repositories; the superscript A is omitted there in order to simplify the notation.
Algorithm 2 Computation of marginals of p for subsets Y ∈ 𝒴
 1: function Marginal(N, 𝒴)
 2:     t ← 1
 3:     for each k = 1, …, K do
 4:         Let Y be the subset Y_k in 𝒴
 5:         Let Φ_Y be the family of potentials from Φ relevant to the subset Y
 6:         for X_i ∈ v(Φ_Y) \ Y do                           ▹ determine operations for the query
 7:             Let Φ_i = {φ ∈ Φ_Y : X_i ∈ v(φ)}
 8:             Assume Φ_i = {φ_1, …, φ_L}
 9:             ψ ← φ_1
10:             for l = 2, …, L do
11:                 ψ ← CondSCombine(ψ, φ_l, t)
12:                 t ← t + 1
13:             end for
14:             ψ ← CondSMarginalize(ψ, i, t)
15:             t ← t + 1
16:             Φ_Y ← (Φ_Y \ Φ_i) ∪ {ψ}
17:         end for
18:         Assume Φ_Y = {φ_1, …, φ_J}                        ▹ compute joint distribution
19:         ψ_k ← φ_1
20:         for j = 2, …, J do
21:             ψ_k ← CondSCombine(ψ_k, φ_j, t)
22:             t ← t + 1
23:         end for
24:         Append ψ_k to L_R
25:         Set time of ψ_k to −1
26:     end for
27:     T ← t − 1
28:     for each t = 1, …, T do       ▹ start numerical computation using the operation plan
29:         Select the register with time t from R_O: (t, type, φ_1, φ_2, φ_r)
30:         Compute the numerical values of φ_r               ▹ actual computation
31:         DeleteIf(φ_1, t), DeleteIf(φ_2, t), DeleteIf(φ_r, t)
32:     end for
33:     return L_R
34: end function
Algorithm 3 Conditional symbolic combination
 1: function CondSCombine(φ_1, φ_2, t)
 2:     if a register (id, comb, φ_1, φ_2, φ_r) is in R_O then
 3:         UpdateTime(φ_1, t), UpdateTime(φ_2, t), UpdateTime(φ_r, t)
 4:     else
 5:         φ_r ← SCombine(φ_1, φ_2)
 6:         Add the register (id, comb, φ_1, φ_2, φ_r) to R_O with id as identifier
 7:         UpdateTime(φ_1, t), UpdateTime(φ_2, t)
 8:         Add φ_r to R_Φ with time = t
 9:     end if
10:     return φ_r
11: end function
Algorithm 4 Conditional symbolic marginalization
 1: function CondSMarginalize(φ, i, t)
 2:     if a register (id, marg, φ, i, φ_r) is in R_O then
 3:         UpdateTime(φ, t), UpdateTime(φ_r, t)
 4:     else
 5:         φ_r ← SMarginalize(φ, i)
 6:         Add the register (id, marg, φ, i, φ_r) to R_O with id as identifier
 7:         UpdateTime(φ, t)
 8:         Add φ_r to R_Φ with t as time
 9:     end if
10:     return φ_r
11: end function
The first marginal to compute is for $\mathbf{Y} = \{X_1\}$. In this case the set of relevant potentials is $\Phi_{\mathbf{Y}} = \{\phi_1\}$ and there are no operations to carry out. Therefore the first marginal is $\psi_1 = \phi_1$, which is appended to $L_R$.
The second marginal to be computed is for $\mathbf{Y} = \{X_1, X_2\}$. In this case the relevant potentials are $\Phi_{\mathbf{Y}} = \{\phi_1, \phi_2\}$ and there are no variables to remove, but it is necessary to carry out the symbolic combination of $\phi_1$ and $\phi_2$ in order to compute $\psi_2$ (lines 19–23 of Algorithm 2). If we call $\phi_4$ the result, then the repositories after this operation are as shown in Table 2.
The third marginal to compute is for the set $\mathbf{Y} = \{X_1, X_3\}$. Now the relevant potentials are $\Phi_{\mathbf{Y}} = \{\phi_1, \phi_3\}$. The situation is analogous to the computation of the previous marginal, with the difference that now the symbolic combination to carry out is $\phi_1 \cdot \phi_3$. The state of the repositories after $k = 3$ is shown in Table 3. The third desired marginal is $\psi_3 = \phi_5$.
Finally, for $k = 4$ we have to compute the marginal for $\mathbf{Y} = \{X_2, X_3\}$. The relevant potentials are now $\Phi_{\mathbf{Y}} = \{\phi_1, \phi_2, \phi_3\}$. Variable $X_1$ has to be deleted from this set of potentials. As all the potentials contain this variable, as a first step it is necessary to combine all of them, and afterwards to remove $X_1$ by marginalizing on $\{X_2, X_3\}$. Assume that the order of the symbolic operations is: first combine $\phi_1$ and $\phi_2$, then combine the result with $\phi_3$, and finally marginalize this result by removing $X_1$. The repositories after $k = 4$ are presented in Table 4. The combination of $\phi_1$ and $\phi_2$ was previously carried out for $k = 2$, and therefore its result can be retrieved without new computations.
After that, the numerical part of the operations in Table 4 is carried out in the same order in which they are described in that table. In this process, after performing an operation with identifier ($id$) equal to t, the potentials with $time$ equal to t are removed from the $R_\Phi$ table. For example, in this case, potentials $\phi_2$ and $\phi_3$ can be removed from $R_\Phi$ after the operation with $id = 4$, and potential $\phi_6$ can be removed after the operation with $id = 5$, leaving in $R_\Phi$ only the potentials containing the desired marginals needed to compute the $KL$ divergence between both networks (potentials with $time = -1$).

5. Experiments

In order to compare the two computation approaches presented in the paper, the experimentation uses a set of Bayesian networks available in the bnlearn [19] repository (https://www.bnlearn.com/bnrepository/, accessed on 24 August 2021). This library provides all the functions required for the process described below. Given a certain Bayesian network as defined in the repository:
  • A dataset is generated using the rbn function. As explained in the library documentation, this function simulates random samples from a Bayesian network, using forward/backward sampling.
  • The dataset is used for learning a Bayesian network. For this step, the tabu function is employed with its default settings (the dataset as unique argument). It is one of the structural learning methods available in bnlearn. Since the learned model could have unoriented links, the cextend function is applied, which produces a Bayesian network consistent with the model passed as argument. Any other learning algorithm could have been used, since the goal is simply to have an alternative Bayesian network that will later be used to calculate the Kullback–Leibler divergence with the methods described in Algorithms 1 and 2.
For each network, the Kullback–Leibler divergence is computed with the two procedures presented: the one using evidence propagation (see Algorithm 1) and the one using the operations cache (described in Algorithm 2). The main purpose of the experiment is to obtain an estimate of the computation times required by both approaches. The obtained results are included in Table 5, which contains the following information:
  • Network name.
  • Number of nodes.
  • Number of arcs.
  • Number of parameters required for quantifying the uncertainty of the network.
  • time1: Runtime using the algorithm without cache (Algorithm 1).
  • time2: Runtime using the algorithm with cache (Algorithm 2).
  • ops: Number of elementary operations stored in the operations repository $R_O$ to compute all the necessary distributions for the calculation using Algorithm 2.
  • rep: Number of operations that are repeated and that, thanks to the use of $R_O$ and $R_\Phi$, will be executed only once.
  • del: Number of factors that were removed from $R_\Phi$, with the consequent release of memory space for future calculations.
The experiments have been run on a desktop computer with an Intel(R) Xeon(R) Gold 6230 CPU working at 3.60 GHz (80 cores) and 312 GB of RAM. The operating system is Linux Fedora Core 34.
It is observed that the calculation with the second method always offers shorter runtimes than the first one; the shortest runtime for each network in Table 5 is always time2. It is also noteworthy that for three networks (water, mildew, and barley) the method based on the use of evidence could not be completed because the available memory capacity was exceeded. Moreover, the computational overhead required to manage the operation and potential repositories pays off, as it avoids the repetition of a significant number of operations and enables unnecessary potentials to be released, especially in the most complex networks.

6. Conclusions

Computing the KL divergence between the joint probability distributions associated with two Bayesian networks is an important task that is relevant for many problems, for example assessing the accuracy of Bayesian network learning algorithms. However, this function is in general not available in software packages for probabilistic graphical models. In this paper, we provide an algorithm that uses local computation to calculate the KL divergence between two Bayesian networks. The algorithm is based on a procedure with two stages: the first one plans the operations, determining the repeated operations and the times at which potentials are no longer necessary, while the second one carries out the numerical operations, reusing the results of repeated operations instead of recomputing them and deallocating the memory associated with useless potentials. The experiments show that this strategy saves time and space, especially in complex networks.
The functions have been implemented in Python, taking the pgmpy software package as basis, and are available in the GitHub repository https://github.com/mgomez-olmedo/KL-pgmpy (accessed on 24 August 2021). The README file of the project offers details about the implementation and about the methods available for reproducing the experiments.
In the future, we plan to further improve the efficiency of the algorithms. The main line will be to invest more time in the planning stage, looking for deletion orderings that minimize the total number of operations or optimizing the order of combinations when several potentials have to be multiplied.

Author Contributions

Conceptualization, S.M., A.C. and M.G.-O.; methodology, S.M., A.C. and M.G.-O.; software, S.M., A.C. and M.G.-O.; validation, S.M., A.C. and M.G.-O.; formal analysis, S.M., A.C. and M.G.-O.; investigation, S.M., A.C. and M.G.-O.; writing–original draft preparation, S.M., A.C. and M.G.-O.; visualization, S.M., A.C. and M.G.-O.; supervision, S.M., A.C. and M.G.-O.; funding acquisition, S.M., A.C. and M.G.-O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was jointly supported by the Spanish Ministry of Education and Science under project PID2019-106758GB-C31 and the European Regional Development Fund (FEDER).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We are very grateful to the anonymous reviewers for their valuable comments and suggestions that have contributed to the improvement of the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. de Jongh, M.; Druzdzel, M.J. A comparison of structural distance measures for causal Bayesian network models. In Recent Advances in Intelligent Information Systems, Challenging Problems of Science, Computer Science Series; Springer: Basel, Switzerland, 2009; pp. 443–456.
  2. Scutari, M.; Vitolo, C.; Tucker, A. Learning Bayesian networks from big data with greedy search: Computational complexity and efficient implementation. Stat. Comput. 2019, 29, 1095–1108.
  3. Talvitie, T.; Eggeling, R.; Koivisto, M. Learning Bayesian networks with local structure, mixed variables, and exact algorithms. Int. J. Approx. Reason. 2019, 115, 69–95.
  4. Natori, K.; Uto, M.; Ueno, M. Consistent learning Bayesian networks with thousands of variables. In Advanced Methodologies for Bayesian Networks; PMLR, 2017; pp. 57–68. Available online: https://proceedings.mlr.press/v73/natori17a (accessed on 29 July 2021).
  5. Kullback, S. Information Theory and Statistics; Dover Publications: Mineola, NY, USA, 1968.
  6. de Campos, L.M. A scoring function for learning Bayesian networks based on mutual information and conditional independence tests. J. Mach. Learn. Res. 2006, 7, 2149–2187.
  7. Liu, H.; Zhou, S.; Lam, W.; Guan, J. A new hybrid method for learning Bayesian networks: Separation and reunion. Knowl.-Based Syst. 2017, 121, 185–197.
  8. Cano, A.; Gómez-Olmedo, M.; Moral, S. Learning Sets of Bayesian Networks. In Proceedings of the International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Lisbon, Portugal, 15–19 June 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 151–164.
  9. Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009.
  10. Elvira Consortium. Elvira: An Environment for Probabilistic Graphical Models. In Proceedings of the 1st European Workshop on Probabilistic Graphical Models, Cuenca, Spain, 6–8 November 2002; pp. 222–230.
  11. Kjærulff, U. Approximation of Bayesian Networks through Edge Removals; Technical Report IR-93-2007; Department of Mathematics and Computer Science, Aalborg University: Aalborg, Denmark, 1993.
  12. Choi, A.; Chan, H.; Darwiche, A. On Bayesian Network Approximation by Edge Deletion. arXiv 2012, arXiv:1207.1370.
  13. Kjærulff, U. Reduction of computational complexity in Bayesian networks through removal of weak dependences. In Uncertainty Proceedings 1994; Elsevier: Amsterdam, The Netherlands, 1994; pp. 374–382.
  14. Ankan, A.; Panda, A. pgmpy: Probabilistic graphical models using Python. In Proceedings of the 14th Python in Science Conference (SCIPY 2015), Austin, TX, USA, 6–12 June 2015.
  15. Shafer, G.R.; Shenoy, P.P. Probability propagation. Ann. Math. Artif. Intell. 1990, 2, 327–351.
  16. Dechter, R. Bucket elimination: A unifying framework for reasoning. Artif. Intell. 1999, 113, 41–85.
  17. Shachter, R. Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams). In Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-98), Madison, WI, USA, 24–26 July 1998; Morgan Kaufmann Publishers: San Francisco, CA, USA, 1998; pp. 480–487.
  18. Cano, A.; Moral, S. Heuristic algorithms for the triangulation of graphs. In Proceedings of the International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Paris, France, 4–8 July 1994; Springer: Berlin/Heidelberg, Germany, 1994; pp. 98–107.
  19. Scutari, M. Learning Bayesian networks with the bnlearn R package. J. Stat. Softw. 2010, 35, 1–22.
Figure 1. Bayesian networks to compare.
Table 1. Initial state for R_Φ (left part) and R_O (right part).

Repository of potentials (R_Φ):
  Potential        Time
  φ_1(X_1)          0
  φ_2(X_1, X_2)     0
  φ_3(X_1, X_3)     0

Repository of operations (R_O): empty (columns Id, Type, Arg. 1, Arg. 2, Result).
Table 2. Repositories after k = 2.

Repository of potentials (R_Φ):
  Potential        Time
  φ_1(X_1)         −1
  φ_2(X_1, X_2)     1
  φ_3(X_1, X_3)     0
  φ_4(X_1, X_2)    −1

Repository of operations (R_O):
  Id   Type   Arg. 1   Arg. 2   Result
  1    Comb   φ_1      φ_2      φ_4
Table 3. Repositories after k = 3.

Repository of potentials (R_Φ):
  Potential        Time
  φ_1(X_1)         −1
  φ_2(X_1, X_2)     1
  φ_3(X_1, X_3)     2
  φ_4(X_1, X_2)    −1
  φ_5(X_1, X_3)    −1

Repository of operations (R_O):
  Id   Type   Arg. 1   Arg. 2   Result
  1    Comb   φ_1      φ_2      φ_4
  2    Comb   φ_1      φ_3      φ_5
Table 4. Repositories after k = 4.

Repository of potentials (R_Φ):
  Potential            Time
  φ_1(X_1)             −1
  φ_2(X_1, X_2)         4
  φ_3(X_1, X_3)         4
  φ_4(X_1, X_2)        −1
  φ_5(X_1, X_3)        −1
  φ_6(X_1, X_2, X_3)    5
  φ_7(X_2, X_3)        −1

Repository of operations (R_O):
  Id   Type   Arg. 1   Arg. 2   Result
  1    Comb   φ_1      φ_2      φ_4
  2    Comb   φ_1      φ_3      φ_5
  3    Comb   φ_1      φ_3      φ_5
  4    Comb   φ_4      φ_3      φ_6
  5    Marg   φ_6      X_1      φ_7
Table 5. Runtimes for KL computation without cache (time1) and with cache (time2). An n/a entry in the time1 column indicates that the computation could not be completed because the available memory was exceeded.

Network      Nodes   Arcs   Parameters      Time1      Time2     Ops     Rep    Del
cancer           5      4           18     0.0313     0.0124      82       1     13
earthquake       5      4           10     0.0316     0.0091      31      15      5
survey           5      6           21     0.0504     0.0143      49      23     12
asia             8      8           18     0.0787     0.0173      62      28     13
sachs           11     17          178     0.3484     0.0409      81      30     25
child           20     25          230     0.8796     0.0726     142      58     56
insurance       27     52          984    16.9990     0.3788     631     313    223
water           32     66       10,083        n/a     8.6822     640     299    170
mildew          35     46      540,150        n/a    15.9393    1278     955    150
alarm           37     46          509     4.0949     0.3354     638     415    132
barley          48     84      114,005        n/a   205.6597    2273    1695    328
hailfinder      56     66         2656    362.5262    0.8498    1197     921    127
hepar2          70    123         1453     23.1047    1.4088    1864    1459    242
win95pts        76    112         2656    404.4568    1.0080     960     458    314
