Article

Bayesian Network Structure Learning Using Improved A* with Constraints from Potential Optimal Parent Sets

1 School of Electronics and Information Engineering, Xi’an Technological University, Xi’an 710021, China
2 School of Electronic Information, Northwestern Polytechnical University, Xi’an 710192, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(15), 3344; https://doi.org/10.3390/math11153344
Submission received: 4 July 2023 / Revised: 27 July 2023 / Accepted: 29 July 2023 / Published: 30 July 2023

Abstract

Learning the structure of a Bayesian network while balancing the efficiency and accuracy of learning has long been an active research topic. This paper proposes two constraints to address the problem that the A* algorithm, an exact learning algorithm, is not efficient enough to search larger networks. On the one hand, parent–child set constraints reduce the number of potential optimal parent sets. On the other hand, path constraints are obtained from the potential optimal parent sets to constrain the search process of the A* algorithm. Both constraints are based on the potential optimal parent sets. Experiments show that the two constraints can significantly improve the time efficiency of the A* algorithm and extend the size of Bayesian networks that the A* algorithm can search. In addition, compared with the state-of-the-art globally optimal Bayesian network learning using integer linear programming (GOBNILP) algorithm and the max–min hill-climbing (MMHC) algorithm, the A* algorithm enhanced by the constraints still performs well in most cases.

1. Introduction

After decades of development, artificial intelligence has been widely applied in practice. However, capturing and understanding causal relationships from data, known as causal discovery, remains a challenging task in artificial intelligence. Robust causal analysis is widely recognized as a key driver of techniques such as learning, prediction, diagnosis, and counterfactual reasoning, which can have significant implications for almost all fields of science.
A Bayesian network (BN) is a probabilistic graphical model that combines probability theory and graph theory, and it supports the representation and analysis of causal structure in the field of artificial intelligence. The structure of a BN is a directed acyclic graph (DAG), which represents the dependence relationships between nodes. These relationships are further quantified by a set of conditional probability distributions. In general, a Bayesian network represents the joint probability distribution of a set of random variables.
The number of possible structures increases exponentially with the number of nodes n . Determining the BN structure from data is an NP-hard problem [1,2], and it has been a hot topic in the research field of BN in recent decades. Existing BN structure learning algorithms can be divided into three categories: constraint-based, score-based, and hybrid search approaches [3].
Constraint-based approaches use statistical tests or information theory techniques to test conditional independence (CI), determine the relationships between variables, and obtain the corresponding DAG. Widely adopted constraint-based approaches mainly include the Peter–Clark (PC) algorithm [4], the inductive causality (IC) algorithm [5], and the grow–shrink (GS) algorithm [6]. These algorithms can return equivalence classes if the CI tests are correct. However, this assumption is difficult to satisfy in practice: CI tests depend on statistical testing and have a certain probability of error when samples are insufficient or noisy. Perhaps even worse, because a series of steps in these algorithms relies on CI tests, erroneous CI test results can further amplify errors in the subsequent learning process.
Score-based approaches are the most common BN structure learning approaches. They use a scoring function to measure the fitness between a BN structure and the data and return the BN structure with the optimal score. Therefore, score-based approaches treat BN structure learning as a combinatorial optimization problem. From this perspective, greedy search (GS) [7,8,9], simulated annealing (SA) [10], ordering-based search (OBS) [11], and other algorithms developed rapidly in the early stage. In addition, the genetic algorithm (GA) [12], particle swarm optimization (PSO) [13], ant colony optimization (ACO) [14], and other swarm intelligence algorithms have also been widely used. However, these algorithms often return only locally optimal network structures; in particular, the guarantee that swarm intelligence algorithms converge to the optimal solution in infinite time has no practical significance. Therefore, exact learning algorithms have attracted increasing attention from researchers. Research on exact learning algorithms began with a series of dynamic programming (DP) algorithms [15,16,17], which require exponential time and space and can only be used for small-scale network structure learning. In recent years, other exact learning algorithms that are more competitive than DP algorithms have also been proposed, such as the A* [18,19], anytime window A* (AWA*) [20], bidirectional heuristic search (BiHS) [21], CPBayes [22], and globally optimal Bayesian network learning using integer linear programming (GOBNILP) [23,24] algorithms. The A*, AWA*, and BiHS algorithms regard the structure learning problem as a shortest path search problem and use different search strategies, with A* being the most stable among them. CPBayes and GOBNILP use constraint programming and integer programming, respectively, to solve the structure learning problem. Later, CPBayes was enhanced with linear programming techniques to provide more efficient acyclicity checking [25]. Compared with DP algorithms, these algorithms improve the scalability and efficiency of learning BNs, but their efficiency is still relatively low, and there is still room for improvement. Furthermore, algorithms based on continuous optimization have been developed in recent years; for example, the non-combinatorial optimization via trace exponential and augmented Lagrangian for structure learning (NOTEARS) algorithm [26] uses the augmented Lagrangian method to solve a continuous optimization problem.
Hybrid search approaches try to combine the advantages of score-based and constraint-based algorithms. The most famous hybrid search approach is the max–min hill-climbing (MMHC) algorithm [27]. This algorithm is divided into two stages: in the first stage, the max–min parent and children (MMPC) algorithm [28] is used to learn the parent–child set of each node; in the second stage, a hill-climbing search is conducted within the range limited by the parent–child sets. The constrained optimal search (COS) algorithm [29] performs an optimal search with the DP algorithm under superstructure constraints. The edge-constrained optimal search (ECOS) algorithm [30] extends COS by clustering the superstructure, learning within each cluster, and finally merging the results to obtain the complete BN structure. The constrained hill-climbing (CHC) algorithm [31] improves efficiency and accuracy by dynamically restricting the hill-climbing process. The separation and reunion (SAR) algorithm [32] decomposes the learning of a large BN into the learning of several relatively small BNs through CI tests and builds the actual network structure by re-merging these smaller structures. Kuipers et al. [33] proposed a hybrid algorithm that creates a restricted search space using the PC algorithm followed by Markov chain Monte Carlo (MCMC) sampling.
In this paper, we continue to study the structure learning problem from the perspective of the shortest path search problem. We propose an improved A* algorithm based on constraints from the potential optimal parent sets (POPS), with improvements in two respects. On the one hand, the number of POPS is reduced through parent–child set constraints; on the other hand, path constraints obtained from the POPS restrict the search process of the A* algorithm. Both kinds of POPS-based constraints improve the search efficiency of A*.
The remainder of this paper is organized as follows. In Section 2, the relevant theoretical basis of the problem is introduced and formulated. The details of the proposed algorithms are designed in Section 3. The proposed algorithms are demonstrated with experiments in Section 4, followed by conclusions in Section 5.

2. Preliminaries for Bayesian Networks

This section will introduce BN and the structure learning problem based on the shortest path search perspective, providing the theoretical basis for the new algorithm.

2.1. Bayesian Networks

The Bayesian network BN = (G, P) consists of two parts, the DAG G and the probability distribution P. G is called the structure of the BN; each of its nodes corresponds one-to-one to a variable in the variable set V = {X_1, …, X_n}, and the directed edges between nodes reflect the dependencies between nodes. The probability distribution P is called the parameter of the BN, specifically P(X_i | PA_i), where PA_i denotes the parent set of X_i. The joint probability of all variables can be decomposed into the product of these conditional probability distributions. Figure 1 shows an example of a DAG, and PA_2 is formed by {X_1} and {}.
Given the dataset D = {D_1, …, D_N} (N is the sample size), a scoring function measures the fitness between a network structure G and the dataset D. The goal of BN structure learning is to find a network structure G* for which the scoring function attains its optimal value. Many scoring functions can be used for the BN structure learning problem. Since this paper takes the shortest path search perspective, the minimum description length (MDL) score [34] is selected. The smaller the MDL score is, the better the corresponding BN structure. Many scoring functions, including the MDL score, are decomposable, namely, the following:
MDL(G) = \sum_{i=1}^{n} MDL(X_i \mid PA_i)
where MDL(X_i | PA_i) is called the local score.
Each local score MDL(X_i | PA_i) is calculated as follows:
MDL(X_i \mid PA_i) = H(X_i \mid PA_i) + \frac{\log N}{2} K(X_i \mid PA_i)
H(X_i \mid PA_i) = -\sum_{x_i, pa_i} N_{x_i, pa_i} \log \frac{N_{x_i, pa_i}}{N_{pa_i}}
K(X_i \mid PA_i) = (r_i - 1) \prod_{X_l \in PA_i} r_l
where N_{x_i, pa_i} is the number of data points that satisfy X_i = x_i and PA_i = pa_i in the dataset D, N_{pa_i} is the number of data points that satisfy PA_i = pa_i in the dataset D, and r_i is the number of states of X_i.
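For concreteness, the following is a minimal sketch of how such a local score could be computed from a discrete dataset. It is an illustrative reconstruction rather than the implementation used in this paper; the function name, the use of pandas for counting, and the data layout are assumptions.

```python
import numpy as np
import pandas as pd

def mdl_local_score(data: pd.DataFrame, child: str, parents: tuple) -> float:
    """MDL(X_i | PA_i) = H(X_i | PA_i) + (log N / 2) * K(X_i | PA_i) for discrete data."""
    N = len(data)
    # Penalty term K(X_i | PA_i) = (r_i - 1) * product of the parents' state counts.
    K = data[child].nunique() - 1
    for p in parents:
        K *= data[p].nunique()
    # Empirical term H(X_i | PA_i) = -sum_{x_i, pa_i} N_{x_i,pa_i} * log(N_{x_i,pa_i} / N_{pa_i}).
    if parents:
        joint = data.groupby(list(parents) + [child]).size()    # N_{x_i, pa_i}
        parent_counts = data.groupby(list(parents)).size()      # N_{pa_i}
        pa_of_joint = parent_counts.reindex(joint.index.droplevel(-1)).to_numpy()
        H = -np.sum(joint.to_numpy() * np.log(joint.to_numpy() / pa_of_joint))
    else:
        counts = data[child].value_counts().to_numpy()
        H = -np.sum(counts * np.log(counts / N))
    return float(H + 0.5 * np.log(N) * K)
```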
For any variable X_i, the other n − 1 variables can be its parent nodes, and thus each node can have 2^(n−1) candidate parent sets. There are n · 2^(n−1) possible parent sets in total, so in theory the corresponding n · 2^(n−1) local scores must be calculated. Obviously, it is impossible to calculate all local scores, and this number can be further reduced by some pruning rules. The following theorems, which have been proven in [19,35], provide a basis for ignoring some parent sets when searching for an optimal parent set of a variable with the MDL function.
Theorem 1 ([19,35]). In an optimal Bayesian network based on the MDL scoring function, each variable has at most log(2N / log N) parents.
Theorem 2 ([19]). Let U and S be two candidate parent sets for X_i such that U ⊂ S and K(X_i | S) − MDL(X_i, U) > 0. Then, S and all supersets of S cannot possibly be optimal parent sets for X_i.
Theorem 3 ([19]). Let U and S be two candidate parent sets for X_i such that U ⊂ S and MDL(X_i, U) ≤ MDL(X_i, S). Then, S is not the optimal parent set of X_i for any candidate set.
The pruning rules are lossless and ensure that the optimal parent set of each node is still contained in the remaining parent sets. The remaining parent sets are called the potential optimal parent sets, and the potential optimal parent sets of X_i are denoted as POPS_i.
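As an illustration of how such pruning might be implemented, the sketch below applies Theorems 1 and 3 to a table of precomputed local scores for a single variable (Theorem 2 is omitted for brevity). The function name, the dictionary layout, and the use of the natural logarithm in the Theorem 1 bound are assumptions of this sketch, not the authors' code.

```python
from math import log

def potential_optimal_parent_sets(scores: dict, N: int) -> dict:
    """Prune the candidate parent sets of one variable using Theorems 1 and 3.

    `scores` maps candidate parent sets (frozensets of variable names) to their
    local MDL scores.  Theorem 1: discard sets with more than log(2N / log N)
    parents.  Theorem 3: discard any set that scores no better than a subset.
    """
    max_parents = int(log(2 * N / log(N)))   # Theorem 1 bound as stated in the text
    pops = {}
    # Process smaller sets first so every kept subset is available for the check.
    for pa, score in sorted(scores.items(), key=lambda kv: len(kv[0])):
        if len(pa) > max_parents:
            continue
        dominated = any(sub < pa and pops[sub] <= score for sub in pops)
        if not dominated:
            pops[pa] = score
    return pops
```

Checking only the kept subsets suffices: if a discarded subset dominated the current set, then by transitivity some kept subset dominates it as well.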
For exact learning algorithms, the local scores that correspond to the POPS are usually calculated in advance. The local scores are then used as the input, and the optimal BN score is produced as the output. Therefore, for exact learning algorithms, BN structure learning is regarded as a combinatorial optimization problem, as follows:
Input: a variable set V = {X_1, …, X_n} and a set of POPS_i for each X_i.
Output: a DAG G* such that G* ∈ arg min_G Σ_{i=1}^{n} MDL(X_i | PA_i), where PA_i ∈ POPS_i.

2.2. Structure Learning in Order Graph

A series of DP algorithms are proposed to solve the abovementioned combinatorial optimization problem. DP algorithms are mainly based on the following recursive formula:
MDL(V) = \min_{X_i \in V} \left[ MDL(V \setminus \{X_i\}) + BestMDL(X_i, V \setminus \{X_i\}) \right]
BestMDL(X_i, V \setminus \{X_i\}) = \min_{PA_i \subseteq V \setminus \{X_i\},\, PA_i \in \mathrm{POPS}_i} MDL(X_i \mid PA_i)
According to this recursive relationship, the basic principle of DP algorithms is as follows. First, the optimal network structure is found for a single variable, starting from the empty set. Then, nodes are gradually added to build the optimal subnetwork for an increasingly large set of variables until the optimal network corresponding to V = {X_1, …, X_n} is found. DP algorithms can find the optimal BN with time and space complexity O(n 2^n), and the whole process can be represented graphically by the order graph. Figure 2 shows the order graph of a four-node BN.
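As an illustration of the recursion, a minimal memoized sketch is given below. It returns only the optimal score, is exponential in the number of variables like all DP variants, and assumes a hypothetical `pops` layout (each variable mapped to a dictionary from candidate parent sets to local scores, with the empty set always present).

```python
from functools import lru_cache

def dp_optimal_score(variables: frozenset, pops: dict) -> float:
    """MDL(V) via the recursion MDL(U) = min_i [MDL(U - {X_i}) + BestMDL(X_i, U - {X_i})]."""
    def best_mdl(x, candidates):
        # BestMDL(X, U): assumes the empty parent set is always present in pops[x].
        return min(s for pa, s in pops[x].items() if pa <= candidates)

    @lru_cache(maxsize=None)
    def mdl(subset: frozenset) -> float:
        if not subset:
            return 0.0
        return min(mdl(subset - {x}) + best_mdl(x, subset - {x}) for x in subset)

    return mdl(frozenset(variables))
```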
Yuan and Malone [18,19] further transformed the combinatorial optimization problem into a shortest path search problem. They regarded the top empty set O = {} as the start state and the bottom set V as the goal state. Therefore, a path from the start state to the goal state corresponds to an ordering of the nodes, which is how the order graph obtains its name. In the order graph, moving from a state U to the next state S = U ∪ {X_i} (X_i ∈ V \ U) is equivalent to adding node X_i on top of the subnetwork over U, and the path cost from state U to the next state S is
cost(U, S) = BestMDL(X_i, U) = \min_{PA_i \subseteq U,\, PA_i \in \mathrm{POPS}_i} MDL(X_i \mid PA_i)
where BestMDL(X_i, U) is obtained by replacing V \ {X_i} in (7) with U. The cost of the path is equal to the score of selecting an optimal parent set for X_i out of U, i.e., BestMDL(X_i, U). For example, the path {X_2, X_3} → {X_1, X_2, X_3} has a cost equal to BestMDL(X_1, {X_2, X_3}).
Then, the corresponding optimal ordering can be obtained in the order graph by finding the shortest path from the start state O to the goal state V . In the process of finding the shortest path, the optimal parent sets corresponding to the path are recorded. The optimal BN can be built by combining the optimal node ordering with the optimal parent set of each node.
Yuan and Malone searched for the optimal BN structure using the classical A* algorithm based on the shortest path search. In the A* algorithm, for each current state U in the order graph, the path cost g(U) from the start state O to U is calculated, and the heuristic function h is used to estimate the cost h(U) from the current state U to the goal state V. During the search, f(U) = g(U) + h(U) is used to estimate the optimal cost of a path through state U, the Open list is used to store the states that will be expanded, and the Closed list is used to store the states that have already been expanded. At each step, the state in the Open list with the lowest f value is expanded and moved to the Closed list, while its successor states are put into the Open list. Once the goal state V is expanded, the shortest path from O to V has been found, and so has the optimal node ordering; thus, the corresponding optimal BN can be built.
AWA* and BiHS also search for the shortest path in the order graph; only their search strategies differ from that of A*. Although AWA* and BiHS can return more upper and lower bounds on the optimal score than A*, on most datasets A* expands fewer states in the order graph and is more stable than AWA* and BiHS.
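To make the search procedure concrete, the following is a minimal sketch of A* over the order graph. It is an illustrative reconstruction, not the authors' implementation: the `pops` layout is the same assumption as before, and the heuristic used here (every remaining variable takes its overall best local score) is a simple admissible choice rather than the stronger heuristics used in the cited work.

```python
import heapq
from itertools import count

def astar_order_graph(variables: frozenset, pops: dict):
    """A* search for the shortest path from {} to V in the order graph.

    pops[X] maps candidate parent sets (frozensets) of X to local MDL scores;
    the empty parent set is assumed to be present for every variable.
    Returns (optimal total MDL score, optimal node ordering).
    """
    def best_mdl(x, candidates):
        # BestMDL(X, U): best local score of X using only parents inside `candidates`.
        return min(s for pa, s in pops[x].items() if pa <= candidates)

    # Simple admissible heuristic: every remaining variable takes its overall best score.
    full_best = {x: min(pops[x].values()) for x in variables}

    def h(state):
        return sum(full_best[x] for x in variables - state)

    start = frozenset()
    g = {start: 0.0}
    order = {start: []}
    tie = count()                      # tie-breaker so the heap never compares states
    open_list = [(h(start), next(tie), start)]
    closed = set()
    while open_list:
        _, _, state = heapq.heappop(open_list)
        if state == variables:
            return g[state], order[state]
        if state in closed:
            continue
        closed.add(state)
        for x in variables - state:
            succ = state | {x}
            new_g = g[state] + best_mdl(x, state)   # edge cost = BestMDL(X_i, U)
            if new_g < g.get(succ, float("inf")):
                g[succ] = new_g
                order[succ] = order[state] + [x]
                heapq.heappush(open_list, (new_g + h(succ), next(tie), succ))
    raise ValueError("goal state was not reached")
```

Because this heuristic never overestimates the remaining cost, the first time the goal state is popped, its g value is the optimal MDL score.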

3. Two Constraints for Improved A* Algorithm

According to the introduction of the BN structure learning theories and the A* algorithm based on the order graph in Section 2, we can identify two key factors that restrict the efficiency of the A* algorithm:
(1) Potential optimal parent sets. According to the BN structure learning problem description and Formulas (3) and (4), it is necessary to find an optimal parent set from the POPS_i of each node X_i. Obviously, the number of POPS limits the efficiency of the search. If the number of remaining POPS is very small, it is easy to find the parent sets that meet the requirements.
(2) Order graph. The total number of states in the order graph is 2^n, and its scale increases exponentially with the number of nodes in the BN. The size of the search space also limits the efficiency of the search.
Based on the above two points, this paper will propose solutions to improve the efficiency of the A* algorithm.

3.1. Pruned Potential Optimal Parent Sets with MMPC

Although the pruning rules can reduce the number of POPS, the remaining number is still considerable. In the BN structure learning problem, we ultimately need only the optimal parent sets, so the other sets are, strictly speaking, unnecessary. Given a target variable T, the MMPC algorithm can quickly return the parent–child set CPC(T) of T using CI tests. For two variables X and T and a set Z, the conditional independence of X and T given Z can be assessed through the p-value PIT(X, T | Z), calculated from the G^2 statistic under the null hypothesis of conditional independence. Let N_abc denote the number of occurrences of X = a, T = b, and Z = c in the dataset D (where a, b, and c denote specific values taken by X, T, and Z; a and b are single values, and c is a configuration of the variables in Z); then, the statistic G^2 is defined as
G^2 = 2 \sum_{a,b,c} N_{abc} \ln \frac{N_{abc} N_c}{N_{ac} N_{bc}}
Under the null hypothesis, the G^2 statistic asymptotically follows a χ^2 distribution. Therefore, given the significance level α, if the p-value calculated by the test, namely PIT(X, T | Z), is less than α, the null hypothesis is rejected, and the variables X and T are considered conditionally dependent given Z. Otherwise, X and T are considered conditionally independent given Z. The pseudocode of the MMPC algorithm is shown in Algorithm 1, where CPC(T) denotes the parent–child set of the target variable T.
Algorithm 1: MMPC
Input: target variable T, variable set V, and significance level α
Output: parent–child set CPC(T) of the target variable T
1. Let CPC(T) = {} and R = V \ {T};
2. while R ≠ {} do
3.   for each X ∈ R do
4.     if max_{Z ⊆ CPC(T)} PIT(X, T | Z) > α then R = R \ {X}; end if
5.   end for
6.   Y = arg min_{X ∈ R} max_{Z ⊆ CPC(T)} PIT(X, T | Z) and CPC(T) = CPC(T) ∪ {Y};
7.   for each X ∈ CPC(T) \ {Y} do
8.     if max_{Z ⊆ CPC(T) \ {X}} PIT(X, T | Z) > α then CPC(T) = CPC(T) \ {X}; end if
9.   end for
10. end while
Given a condition set Z, the MMPC algorithm does not only consider PIT(X, T | Z) to determine whether X and T are independent; it also considers max_{Z′ ⊆ Z} PIT(X, T | Z′), which is more robust. Certainly, this approach requires more χ^2 test calculations. Finally, we can use the MMPC algorithm to compute the parent–child set CPC(X_i) for each variable X_i.
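A hedged sketch of the underlying G^2 test is shown below. It relies on pandas for counting and SciPy's χ^2 survival function; the degrees-of-freedom convention (r_X − 1)(r_T − 1)·∏ r_Z is the usual one for this test but is stated here as an assumption of the sketch rather than as the authors' implementation.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def g2_pvalue(data: pd.DataFrame, x: str, t: str, z: list) -> float:
    """p-value of the G^2 conditional-independence test of x and t given the set z."""
    groups = data.groupby(z) if z else [(None, data)]
    g2 = 0.0
    for _, block in groups:
        # Contingency table N_abc of x versus t within this configuration c of z.
        table = pd.crosstab(block[x], block[t]).to_numpy().astype(float)
        n_c = table.sum()
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n_c  # N_ac * N_bc / N_c
        mask = table > 0
        g2 += 2.0 * np.sum(table[mask] * np.log(table[mask] / expected[mask]))
    # Usual degrees-of-freedom convention: (r_x - 1)(r_t - 1) * product of |Z| cardinalities.
    dof = (data[x].nunique() - 1) * (data[t].nunique() - 1)
    for v in z:
        dof *= data[v].nunique()
    return float(chi2.sf(g2, max(dof, 1)))
```

MMPC-style code would then compare the maximum of this p-value over subsets Z ⊆ CPC(T) against the significance level α.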
Through the constraints of the parent–child sets CPC(T), we can further prune unnecessary sets, and their corresponding MDL score calculations, from the POPS_i. Taking a four-node BN as an example, for node X_4, the set {X_1, X_2, X_3} and all of its subsets could be the parent set of X_4. If the traditional pruning rules (Theorems 1–3) are not applied, its POPS are still {X_1, X_2, X_3} and all of its subsets, which are represented as the parent graph of X_4 shown in Figure 3. If the parent–child set of X_4 obtained by the MMPC algorithm is CPC(X_4) = {X_2, X_3}, then the parent graph of X_4 shown in Figure 3 can be pruned to the parent graph shown in Figure 4. For larger BNs, this pruning saves far more score calculations. The constraints of the parent–child sets calculated by the MMPC algorithm can greatly reduce unnecessary score calculations and storage, and limiting the number of corresponding POPS improves the search efficiency of the A* algorithm.
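The corresponding pruning step can be sketched as follows; in a real implementation the scores of the discarded sets would simply never be computed, so this post hoc filter, with assumed names and data layout, is only for illustration.

```python
def restrict_pops_to_cpc(pops: dict, cpc: dict) -> dict:
    """Keep only the candidate parent sets contained in the MMPC parent-child sets.

    pops[X] maps candidate parent sets (frozensets) of X to local scores;
    cpc[X] is the parent-child set CPC(X) returned by MMPC for X.
    """
    return {
        x: {pa: score for pa, score in sets.items() if pa <= frozenset(cpc[x])}
        for x, sets in pops.items()
    }
```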

3.2. Pruned Order Graph with Path Constraints

According to Section 2.2, the optimal BN structure learning actually searches for the shortest path from the order graph. Therefore, if the path constraints can be found in the order graph, it will greatly improve the efficiency of the A* algorithm in searching for the shortest path in the order graph.
Before illustrating such path constraints, consider a simple example. Table 1 shows the POPS of each variable in a six-node BN; we assume that the POPS of each variable have already been obtained by the score pruning rules or by the MMPC algorithm of Section 3.1. It can be seen from Table 1 that, because of the parent–child set constraints obtained in Section 3.1, not every node can choose all other nodes as its parents. For example, X_1 can only choose X_2 as its parent node or have no parent at all (the empty set).
A directed graph can be obtained by connecting each node X_i with its potential optimal parent sets POPS_i. Doing so for the POPS in Table 1 yields the directed graph shown in Figure 5. In such a directed graph, if X_j is a potential parent node of X_i, then the graph contains a directed edge from X_j to X_i.
An interesting phenomenon can be observed in Figure 5: there are only directed edges from nodes in {X_1, X_2} to nodes in {X_3, X_4, X_5, X_6}, but no directed edges from nodes in {X_3, X_4, X_5, X_6} to nodes in {X_1, X_2}. In other words, no node in {X_3, X_4, X_5, X_6} can be a parent of a node in {X_1, X_2}, and the variables can therefore be split into two parts: {X_1, X_2} and {X_3, X_4, X_5, X_6}. Thus, based on Figure 5, by contracting {X_1, X_2} to one node and {X_3, X_4, X_5, X_6} to another node, we obtain the acyclic component graph shown in Figure 6. Based on this splitting, the order graph for learning the six-node BN can be split into two subgraphs, as shown in Figure 7.
Obviously, the complete order graph of the six-variable BN contains 2^6 states. However, based on the constraints from Figure 5, the order graph can be split into two subgraphs corresponding to {X_1, X_2} and {X_3, X_4, X_5, X_6}. We refer to this splitting method as path constraints.
As the structure of the order graph changes, the entire process of searching the order graph also changes. First, we find the shortest path from O to {X_1, X_2} in the first subgraph of Figure 7, and then we find the shortest path from {X_1, X_2} to V = {X_1, X_2, X_3, X_4, X_5, X_6} in the second subgraph of Figure 7. The shortest path from O to V is obtained by concatenating the shortest paths in the two subgraphs. {X_1, X_2} becomes a necessary state in the shortest-path search of the order graph. Therefore, the number of states in the order graph search space of Figure 7 is reduced to 2^2 + 2^4 − 1 = 19.
Compared with the 2^6 = 64 states in the complete order graph of the six-variable BN, path constraints reduce the number of states in the order graph. As the number of nodes n increases, the path constraints reduce the number of states in the order graph even more significantly. We give Theorem 4 to generalize and quantify this reduction.
Theorem 4. In a Bayesian network with node set V = {X_1, …, X_n}, given path constraints, V can be split into subsets P_1, P_2, …, P_m (with ∪_{i=1}^{m} P_i = V). Then, the number of states in the order graph is reduced from 2^n to Σ_{i=1}^{m} 2^{|P_i|} − m + 1.
Proof of Theorem 4.
Obviously, the complete order graph of a Bayesian network with n nodes contains 2^n states in total. For the order graph under path constraints, the subgraph corresponding to each P_i contains 2^{|P_i|} states, and there are m such subgraphs, so the total number of states is Σ_{i=1}^{m} 2^{|P_i|}. However, this double counts the m − 1 connecting states, which are therefore subtracted. Finally, the total number of states is Σ_{i=1}^{m} 2^{|P_i|} − m + 1. □
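As a quick numerical check of Theorem 4 on the running example (a minimal sketch; the function name is illustrative), components of sizes 2 and 4 give 2^2 + 2^4 − 2 + 1 = 19 states instead of 2^6 = 64:

```python
def pruned_state_count(component_sizes):
    """Order-graph state count under path constraints (Theorem 4)."""
    m = len(component_sizes)
    return sum(2 ** s for s in component_sizes) - m + 1

print(pruned_state_count([2, 4]))   # 19, versus 2**6 = 64 for the full order graph
```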
This simple example shows that the directed graph built from each node X_i and its potential optimal parent sets POPS_i implies path constraints, which can be used to prune the order graph.
The underlying principle is briefly described below, with the help of Theorem 5 (proven in [19]).
Theorem 5. Let U and S be two candidate parent sets for X_i such that U ⊆ S. We must have BestMDL(X_i, S) ≤ BestMDL(X_i, U).
In general, if there is a directed path from X_j to X_i in the directed graph but no directed path from X_i to X_j, then the order graph does not need to generate states that contain X_i but exclude X_j. One way to see this is the following. Consider a current state U in the order graph that includes neither X_i nor X_j. If we expand X_j first and then X_i, the path cost from state U to state U ∪ {X_i, X_j} is BestMDL(X_j, U) + BestMDL(X_i, U ∪ {X_j}). If, instead, we expand X_i first and then X_j, the path cost from U to U ∪ {X_i, X_j} is BestMDL(X_i, U) + BestMDL(X_j, U ∪ {X_i}). However, since only a directed path from X_j to X_i can exist, X_i cannot be a potential parent of X_j, and thus BestMDL(X_j, U ∪ {X_i}) = BestMDL(X_j, U). It remains to compare BestMDL(X_i, U ∪ {X_j}) and BestMDL(X_i, U). According to Theorem 5 and U ⊆ U ∪ {X_j}, a better (or equal) value can be obtained in the larger set, so BestMDL(X_i, U ∪ {X_j}) ≤ BestMDL(X_i, U). Therefore, the plan that expands X_j first and then X_i is never more costly and is more likely to achieve the shortest path from U to U ∪ {X_i, X_j}. Thus, there is no need to generate states that contain X_i but exclude X_j.
Based on the previous simple example, we discuss the general method of obtaining path constraints in order to prune the order graph.
The process of splitting the directed graph built from each node X_i and its potential optimal parent sets POPS_i actually involves a new concept, the strongly connected component (SCC). In a directed graph, two nodes V_i and V_j are said to be strongly connected if there is a directed path from V_i to V_j and a directed path from V_j to V_i. A directed graph is a strongly connected graph if any two of its nodes are strongly connected. A maximal strongly connected subgraph of a directed graph is called a strongly connected component. The strongly connected components of a directed graph form an acyclic component graph, which is also a DAG. Each node C_i in the acyclic component graph corresponds to a strongly connected component SCC_i and thus to a subset of the node set V = {X_1, …, X_n} of the BN. The acyclic component graph gives more intuitive path constraints: if there is a directed path from C_i to C_j in the acyclic component graph, then no variable in SCC_j can be a parent of a variable in SCC_i.
Based on the above concept, we try to obtain path constraints by extracting SCCs to prune the order graph. At present, there are mature algorithms for SCC extraction, among which the Kosaraju algorithm is the most commonly used. The pseudocode of the algorithm that obtains path constraints by extracting SCCs is shown in Algorithm 2.
In this algorithm, the potential optimal parent sets POPS_i of each node X_i are used to build the directed graph G_0, and the SCCs SCC_1, …, SCC_m of G_0 are extracted by the Kosaraju algorithm. It is worth noting that if an SCC is too large, it is still not conducive to improving the efficiency of the algorithm or to searching a larger network. For example, the original A* algorithm itself cannot search a network of over 50 nodes; if building the directed graph G_0 from the POPS and extracting SCCs yields two SCCs with |SCC_1| = 1 and |SCC_2| = 49, the resulting path constraints are still meaningless, because the A* algorithm still cannot search a network of 49 nodes. Thus, we limit the size of the SCCs with a parameter t. If the size of the maximum SCC exceeds t, only part of each set of potential optimal parent sets POPS_i is selected to rebuild the directed graph G_0; we prefer to select the sets that correspond to the top-k local scores in POPS_i. The parameter k gradually decreases from the maximum number of POPS until SCCs that satisfy the size limit can be extracted from the rebuilt directed graph; Algorithm 2 then breaks out of the loop and returns the extracted SCCs under the constraint of the parameter t. However, this method is greedy, because only part of the POPS is used to build the directed graph, and the extracted SCCs lose some information. The pruned order graph formed according to the path constraints of these SCCs therefore has certain problems that affect the shortest path search and the accuracy of the final BN. This effect is analyzed in detail in the experimental section.
Algorithm 2: Obtain path constraints
Input: potential optimal parent sets POPS_i of each node X_i, maximum SCC size t
Output: path constraints P_1, …, P_m
1. p ← max_i |POPS_i|;
2. build the directed graph G_0 according to all POPS_i of each X_i;
3. SCC_1, …, SCC_m ← Kosaraju(G_0);
4. q ← max_i |SCC_i|;
5. if q > t then
6.   for k = p − 1 down to 1 do
7.     build the directed graph G_0 according to the best k sets in POPS_i of each X_i;
8.     SCC_1, …, SCC_m ← Kosaraju(G_0);
9.     q ← max_i |SCC_i|;
10.    if q ≤ t then break; end if
11.  end for
12. end if
13. P_1, …, P_m ← SCC_1, …, SCC_m
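A compact sketch of this procedure is shown below. It uses networkx for the SCC computation (instead of a hand-written Kosaraju routine) and, unlike Algorithm 2, does not implement the greedy top-k rebuilding loop; the function name and the `pops` layout are assumptions of this sketch.

```python
import networkx as nx

def path_constraints(pops: dict, max_scc_size: int) -> list:
    """Order-graph path constraints P_1, ..., P_m from the POPS of each variable.

    pops[X] maps candidate parent sets (frozensets) of X to local scores.
    An edge X_j -> X_i is added whenever X_j appears in a candidate parent set
    of X_i; the SCCs of this graph, listed in a topological order of the
    condensation, are the blocks P_1, ..., P_m.
    """
    g = nx.DiGraph()
    g.add_nodes_from(pops)
    for x, parent_sets in pops.items():
        for pa in parent_sets:
            for parent in pa:
                g.add_edge(parent, x)
    condensation = nx.condensation(g)              # DAG whose nodes are the SCCs
    blocks = [set(condensation.nodes[c]["members"])
              for c in nx.topological_sort(condensation)]
    if max(len(b) for b in blocks) > max_scc_size:
        # Algorithm 2 would now rebuild G_0 from the best k POPS of each variable;
        # that greedy rebuilding loop is not reproduced in this sketch.
        raise ValueError("largest SCC exceeds the size limit t")
    return blocks
```

For the six-node example of Table 1, this returns {X_1, X_2} followed by {X_3, X_4, X_5, X_6}, matching Figure 6.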
Finally, we discuss the search complexity in the pruned order graph. The SCCs SCC_1, …, SCC_m obtained by Algorithm 2 split the original order graph into m subgraphs, where the connecting state between subgraphs is F_i = ∪_{k=1}^{i} SCC_k, with F_m = ∪_{k=1}^{m} SCC_k = V and F_0 = {} = O. Thus, for each subgraph, the start state is F_{i−1} and the goal state is F_i. For the entire order graph, this is equivalent to searching for the shortest paths from F_0 to F_1, then from F_1 to F_2, and so on up to F_{m−1} to F_m. Taking Figure 7 as an example again, since SCC_1 = {X_1, X_2} and SCC_2 = {X_3, X_4, X_5, X_6}, we have F_1 = {X_1, X_2} and F_2 = {X_1, X_2, X_3, X_4, X_5, X_6}. Therefore, we search for the shortest path from F_0 = {} to F_1 = {X_1, X_2} and then for the shortest path from F_1 = {X_1, X_2} to F_2 = {X_1, X_2, X_3, X_4, X_5, X_6}. Finally, it only remains to concatenate the shortest paths to obtain the entire shortest path in the order graph. For each subgraph, the maximum complexity of the A* search is O(2^{|SCC_i|}); this worst case is almost never reached, because A* uses heuristic functions. Therefore, in the pruned order graph, which is split into m subgraphs using SCC_1, …, SCC_m, the maximum complexity of the A* search is
O(2^{|SCC_1|} + \cdots + 2^{|SCC_i|} + \cdots + 2^{|SCC_m|}) = O(m \cdot \max_i 2^{|SCC_i|})
This conclusion also corroborates Theorem 4. This finding shows that the maximum complexity depends on the size of the maximum SCC. Therefore, it is necessary for Algorithm 2 to use the parameter t to limit the size of the maximum SCC, which effectively limits the maximum complexity of the A* search of the pruned order graph.

4. Experiments

To evaluate the effect of A* under MMPC constraints (Section 3.1) and path constraints (Section 3.2), experiments will be performed on some common benchmark BNs and UCI datasets. The A* algorithm using only MMPC constraints is named A*-MMPC, the A* algorithm using only path constraints is named A*-PC, and the algorithm using both constraints is named A*-MM2PC. Experiments are mainly divided into two parts. First, the improvement effect of the two constraints on A* is tested; in other words, the various indices between the A*, A*-MMPC, A*-PC, and A*-MM2PC algorithms are tested. Then, the A*-MM2PC with two constraints is compared with the typical GOBNILP and MMHC algorithms.
First, the comparison experiment will compare A*, A*-MMPC, A*-PC, and A*-MM2PC from the following three aspects:
  • Time: Time recorded the running time of the algorithm (OT means out of time);
  • States: The number of expanded states in the order graph;
  • Error: The percentage error (MDL(G_obtained) / MDL(G_exact) − 1) × 100%, where MDL(G_obtained) is the MDL score of the BN obtained by the algorithm being evaluated, and MDL(G_exact) is the exact MDL score of the BN obtained by the original A* or GOBNILP algorithms.
  • Both the original A* and GOBNILP algorithms are exact learning algorithms. If scoring values can be obtained, they are both optimal and equal. Therefore, these two algorithms do not need to calculate the percentage error, and they are both 0%. On the one hand, the MMPC algorithm uses a CI test, and there will be a certain probability of error in the case of insufficient or noisy samples. On the other hand, path constraints adopt a greedy strategy when extracted SCCs do not meet parameter t . Therefore, A*-MMPC, A*-PC, and A*-MM2PC should all consider the influence of the accuracy, namely, the percentage error. The smaller the percentage error is, the higher the accuracy of the corresponding algorithm. Two decimal places are retained for each index, and scientific notation is used for numbers that are too large or too small.
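For reference, the percentage error defined above can be computed with a one-line helper (the name is illustrative):

```python
def percentage_error(mdl_obtained: float, mdl_exact: float) -> float:
    """Percentage error of a learned BN's MDL score relative to the exact optimum."""
    return (mdl_obtained / mdl_exact - 1.0) * 100.0
```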
Benchmark BNs are selected for the comparison experiment, and sampled data are generated with sample sizes of 1000, 3000, 5000, 7000, and 10,000. BNs are then learned from the sampled data. Table 2 records the performances of the A*, A*-MMPC, A*-PC, and A*-MM2PC algorithms on the benchmark BNs. By adding path constraints, the scale of BNs that the A* algorithm can search is expanded: the Water and Alarm networks, which could not be searched before, can now be searched.
In terms of the number of expanded states in the order graph, using either MMPC constraints or path constraints can significantly reduce the number of states, and A*-MM2PC, which uses both constraints, has the lowest number of expanded states. In general, the trend of the corresponding time consumption follows the trend of the number of expanded states, except for the Sachs network. In the Sachs network, although A*-PC and A*-MM2PC expand fewer states than A* and A*-MMPC, their time consumption is higher. The time consumption for structure learning in the Sachs network is already low, and a certain amount of time is needed to generate the heuristic function every time the A* program is executed. In the Sachs network, A*-PC and A*-MM2PC divide the complete order graph into several subgraphs using path constraints, and each subgraph adds to the generation time of the heuristic function. In other words, for a small network, the search time saved by expanding fewer states is far less than the additional time spent generating heuristic functions, resulting in higher total time consumption. However, in larger networks, the reduction in the number of states is the dominant factor, and thus, the time consumption of A*-PC and A*-MM2PC is significantly reduced. In the Alarm network, the size of the maximum SCC obtained by Algorithm 2 is 18, while the sizes of the other SCCs are mostly 2 and 3; therefore, the experimental results regarding time consumption and the number of expanded states are similar to those of the Child network with 20 nodes. In Hailfinder and Win95pts, the size of the maximum SCC obtained by Algorithm 2 is relatively small, so their total time consumption and number of expanded states are even smaller than the corresponding results for smaller networks. This phenomenon shows that the maximum complexity of the algorithm depends on the size of the maximum SCC, which is consistent with the conclusion in Section 3.2.
In terms of the accuracy of these algorithms, accuracy is more easily lost when using MMPC constraints. The accuracy loss is more significant when the sample size is small, and the accuracy is higher when the sample size is large. Since MMPC uses CI tests, it is sensitive to the sample size and can only obtain good results when samples are sufficient; this drawback is therefore also inherited by the new algorithm. In addition, path constraints lead to accuracy loss in the Alarm network. When calculating path constraints in a large-scale network, Algorithm 2 will attempt to generate a directed graph using only the sets that correspond to the top-k local scores in POPS_i. This approach is greedy, which reduces the accuracy of the final BN. Fortunately, path constraints using the greedy approach lead to a lower accuracy loss than MMPC constraints. Furthermore, it can be concluded from the experimental results that the accuracy loss of the A*-MM2PC algorithm, which uses both MMPC constraints and path constraints, comes mostly from the MMPC constraints.
Table 3 records the performances of the A*, A*-MMPC, A*-PC, and A*-MM2PC algorithms on 13 common UCI datasets. By adding MMPC constraints and path constraints, the size of BNs that can be searched by the A* algorithm is expanded, and the performance of adding path constraints is more significant than that of adding MMPC constraints. In terms of the number of expanded states in the order graph, MMPC or path constraints can significantly reduce the number of expanded states, and adding both can reduce the number of expanded states even further. The variation trend of time consumption is similar to that of the number of expanded states. In terms of the accuracy of these algorithms, since most UCI datasets have small sample sizes, and the CI test used in MMPC constraints requires a sufficient sample size, it is easier to lose accuracy by using MMPC constraints. Comparatively, the accuracy loss of path constraints appears only in Flag, Soybean, Bands, and Spectf. The accuracy loss of the A*-MM2PC algorithm using both MMPC constraints and path constraints mainly comes from MMPC constraints.
On the whole, MMPC and path constraints can effectively and significantly improve the overall efficiency of the A* algorithm and reduce time consumption and the number of expanded states in the order graph, at the cost of only a slight loss in accuracy.
A*-MM2PC with MMPC constraints and path constraints is compared with other classical algorithms. The GOBNILP algorithm is considered to be the state-of-the-art algorithm among the exact learning algorithms, while the MMHC algorithm is the most well-known algorithm among the hybrid algorithms, which also uses MMPC constraints. In addition, the Insert Neighborhood Ordering-Based Search (INOBS) [36] algorithm is a state-of-the-art improved variant of OBS. The comparison experiment will compare GOBNILP, MMHC, INOBS, and A*-MM2PC from the following two aspects:
  • Time: time recorded the running time of the algorithm (OT means out of time);
  • Error: the percentage error (MDL(G_obtained) / MDL(G_exact) − 1) × 100%.
The percentage error is calculated in the same way as before, and since the GOBNILP algorithm is an exact learning algorithm, it is always 0% and is not recorded. In addition, since GOBNILP, MMHC, and INOBS have different search spaces and do not use the order graph as the search space, the number of expanded states in the order graph is also not recorded.
Table 4 records the performances of the GOBNILP, MMHC, INOBS, and A*-MM2PC algorithms on benchmark BNs. Compared with the GOBNILP algorithm, A*-MM2PC consumes less time on the Sachs, Child, Alarm, Hailfinder, and Win95pts networks. Although the accuracy of A*-MM2PC is not as good as that of GOBNILP, its overall accuracy loss is less than 3% and mainly concentrated within 0.5%. Compared with the MMHC algorithm, A*-MM2PC consumes less time on the Sachs, Child, Alarm, Hailfinder, and Win95pts networks and always has less accuracy loss. Although both MMHC and A*-MM2PC adopt MMPC constraints, MMHC uses a greedy search in its second stage, while A*-MM2PC performs its further search with the A* algorithm, an exact learning algorithm; thus, A*-MM2PC achieves higher accuracy than MMHC. The accuracy of the MMHC algorithm also roughly follows the trend of low accuracy when the sample size is small and high accuracy when the sample size is large, because CI tests rely on sufficient samples. Compared with the INOBS algorithm, A*-MM2PC mostly consumes less time on the Sachs, Child, Alarm, Hailfinder, and Win95pts networks and, in most cases, has less accuracy loss.
Table 5 records the performances of the GOBNILP, MMHC, INOBS, and A*-MM2PC algorithms on UCI datasets. Compared with the GOBNILP algorithm, the A*-MM2PC algorithm consumes less time on most datasets. Compared with the MMHC algorithm, it consumes less time on most datasets and has less accuracy loss. Compared with the INOBS algorithm, A*-MM2PC consumes less time and has less accuracy loss on more datasets. Because of the small sample sizes of the UCI datasets, the accuracy of the A*-MM2PC and MMHC algorithms on the UCI datasets is lower than on the benchmark BNs, and the overall accuracy of A*-MM2PC is higher than that of MMHC; A*-MM2PC thus has advantages in both time and accuracy over MMHC. For the Mushroom dataset, because each variable has a large number of states and there are many POPS, the time consumption of GOBNILP and MMHC increases significantly when learning its structure, and the accuracy of MMHC decreases significantly. However, the search space of A*-MM2PC is the order graph, which does not grow as the number of variable states and POPS increases; on the contrary, it shrinks under appropriate path constraints. Therefore, A*-MM2PC has higher time efficiency than GOBNILP and MMHC on Mushroom.

5. Conclusions

Based on the A* algorithm and POPS, this paper proposes an improved A* algorithm with two kinds of constraints. The first, called MMPC constraints, reduces the POPS through parent–child set constraints calculated by the MMPC algorithm. The second, called path constraints, builds a directed graph from the POPS and extracts SCCs to obtain path constraints on the order graph. The two constraints, both based on POPS, can improve the search efficiency of A*. A large number of experiments were conducted to test the performance of the constraints. The A*-MMPC algorithm with MMPC constraints and the A*-PC algorithm with path constraints consume less time, expand fewer states in the order graph, and can search larger networks than the original A* algorithm, and the A*-MM2PC algorithm with both constraints performs even better. The only drawback of the two constraints is a slight loss of accuracy, which in most cases is no more than 0.5%. Compared with the MMPC constraints, the path constraints bring a lower loss of accuracy and allow larger Bayesian networks to be searched. Compared with the state-of-the-art GOBNILP algorithm, A*-MM2PC has higher time efficiency in some experiments. Compared with the well-known MMHC hybrid algorithm, A*-MM2PC has advantages in both time efficiency and accuracy in most cases. Compared with INOBS, a state-of-the-art improved variant of OBS, A*-MM2PC consumes less time and has less accuracy loss on more datasets.
Of course, the constraints we propose are not limited to the A* algorithm. DP, AWA*, and BiHS have the same search space as A*; only their specific search strategies differ. Thus, our proposed constraints can also be applied to these algorithms, in the same way as they are added to A*: first, the number of potential optimal parent sets for DP, AWA*, or BiHS is reduced by the parent–child set constraints from the MMPC algorithm; then, path constraints are obtained from the pruned potential optimal parent sets to limit the search process of these algorithms.
However, further research questions remain. In Algorithm 2, if the size of the maximum SCC exceeds the parameter t, we prefer to select the sets that correspond to the top-k local scores in POPS_i, which is a greedy strategy and leads to a slight loss in accuracy. Perhaps this can be mitigated by more reasonable rules that filter out better POPS.
In practice, we try to obtain complete and sufficient data. On this basis, the local scores and POPS are calculated, the POPS are pruned by the MMPC algorithm, and the path constraint files are then obtained from the POPS by Algorithm 2. The A* algorithm is modified so that it can search after reading the path constraint files. Finally, the A* algorithm is invoked through a script so that it searches for subnetworks under the different path constraint files and merges the subnetworks.
In future work, we seek to obtain more useful constraints from POPS to further restrict the learning process of the BN structure and improve the performance of exact BN structure learning algorithms on larger networks. In addition, determining whether further constraints can be obtained directly from the data is another promising direction for future research.

Author Contributions

Conceptualization, Methodology, C.H.; Software, R.D.; Writing—original draft, X.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (61573285, 62171360), and the Natural Science Foundation of Shaanxi Province of China (2022JQ-590).

Data Availability Statement

The benchmark Bayesian networks are known, and they are publicly available (http://www.bnlearn.com/bnrepository (accessed on 20 November 2022)). The UCI datasets are publicly available (https://archive.ics.uci.edu/ml/index.php (accessed on 20 November 2022)).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Chickering, D.M. Learning Bayesian Networks is NP-Complete. In Learning from Data: Artificial Intelligence and Statistics V; Springer: Berlin/Heidelberg, Germany, 1996; Volume 112, pp. 121–130.
2. Chickering, D.M.; Heckerman, D.; Meek, C. Large-sample learning of Bayesian networks is NP-hard. J. Mach. Learn. Res. 2004, 5, 1287–1330.
3. Kitson, N.K.; Constantinou, A.C.; Guo, Z.; Liu, Y.; Chobtham, K. A survey of Bayesian Network structure learning. Artif. Intell. Rev. 2023, 1, 1–94.
4. Spirtes, P.; Glymour, C. An algorithm for fast recovery of sparse causal graphs. Soc. Sci. Comput. Rev. 1991, 9, 62–72.
5. Pearl, J.; Verma, T.S. A theory of inferred causation. In Studies in Logic and the Foundations of Mathematics; Elsevier: Amsterdam, The Netherlands, 1995; Volume 134, pp. 789–811.
6. Margaritis, D.; Thrun, S. Bayesian network induction via local neighborhoods. In Proceedings of the 12th International Conference on Neural Information Processing Systems, Denver, CO, USA, 29 November–4 December 1999; pp. 505–511.
7. Cooper, G.F.; Herskovits, E. A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. 1992, 9, 309–347.
8. Heckerman, D.; Geiger, D.; Chickering, D.M. Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learn. 1995, 20, 197–243.
9. Behjati, S.; Beigy, H. Improved K2 algorithm for Bayesian network structure learning. Eng. Appl. Artif. Intel. 2020, 91, 103617.
10. De Campos, L.M.; Huete, J.F. Approximating causal orderings for Bayesian networks using genetic algorithms and simulated annealing. In Proceedings of the Eighth International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Kansas City, MO, USA, 2–6 November 1999; pp. 333–340.
11. Teyssier, M.; Koller, D. Ordering-Based Search: A Simple and Effective Algorithm for Learning Bayesian Networks. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, Edinburgh, Scotland, 26–29 July 2005; pp. 584–590.
12. Larranaga, P.; Kuijpers, C.M.; Murga, R.H.; Yurramendi, Y. Learning Bayesian network structures by searching for the best ordering with genetic algorithms. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 1996, 26, 487–493.
13. Gheisari, S.; Meybodi, M.R. BNC-PSO: Structure learning of Bayesian networks by particle swarm optimization. Inform. Sci. 2016, 348, 272–289.
14. Daly, R.; Shen, Q. Learning Bayesian network equivalence classes with ant colony optimization. J. Artif. Intell. Res. 2009, 35, 391–447.
15. Koivisto, M.; Sood, K. Exact Bayesian structure discovery in Bayesian networks. J. Mach. Learn. Res. 2004, 5, 549–573.
16. Silander, T.; Myllymaki, P. A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, Cambridge, MA, USA, 13–16 July 2006; pp. 445–452.
17. Malone, B.; Yuan, C. Memory-efficient dynamic programming for learning optimal Bayesian networks. In Proceedings of the 25th AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 7–11 August 2011; pp. 1057–1062.
18. Yuan, C.; Malone, B.; Wu, X. Learning optimal Bayesian networks using A* search. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Catalonia, Spain, 16–22 July 2011; pp. 2186–2191.
19. Yuan, C.; Malone, B. Learning optimal Bayesian networks: A shortest path perspective. J. Artif. Intell. Res. 2013, 48, 23–65.
20. Malone, B.; Yuan, C. Evaluating anytime algorithms for learning optimal Bayesian networks. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, Bellevue, WA, USA, 11–15 August 2013; pp. 381–390.
21. Tan, X.; Gao, X.; Wang, Z.; He, C. Bidirectional heuristic search to find the optimal Bayesian network structure. Neurocomputing 2021, 426, 35–46.
22. Van Beek, P.; Hoffmann, H.F. Machine learning of Bayesian networks using constraint programming. In Proceedings of the International Conference on Principles and Practice of Constraint Programming, Cork, Ireland, 31 August–4 September 2015; pp. 429–445.
23. Bartlett, M.; Cussens, J. Advances in Bayesian network learning using integer programming. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, Bellevue, WA, USA, 11–15 August 2013; pp. 182–191.
24. Bartlett, M.; Cussens, J. Integer linear programming for the Bayesian network structure learning problem. Artif. Intell. 2017, 244, 258–271.
25. Trösser, F.; De Givry, S.; Katsirelos, G. Improved Acyclicity Reasoning for Bayesian Network Structure Learning with Constraint Programming. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–27 August 2021; pp. 4250–4257.
26. Zheng, X.; Aragam, B.; Ravikumar, P.; Xing, E.P. DAGs with NO TEARS: Continuous optimization for structure learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 9492–9503.
27. Tsamardinos, I.; Brown, L.E.; Aliferis, C.F. The max-min hill-climbing Bayesian network structure learning algorithm. Mach. Learn. 2006, 65, 31–78.
28. Tsamardinos, I.; Aliferis, C.F.; Statnikov, A. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 24–27 August 2003; pp. 673–678.
29. Perrier, E.; Imoto, S.; Miyano, S. Finding Optimal Bayesian Network Given a Super-Structure. J. Mach. Learn. Res. 2008, 9, 2251–2286.
30. Kojima, K.; Perrier, E.; Imoto, S.; Miyano, S. Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure. J. Mach. Learn. Res. 2010, 11, 285–310.
31. Gámez, J.A.; Mateo, J.L.; Puerta, J.M. Learning Bayesian networks by hill climbing: Efficient methods based on progressive restriction of the neighborhood. Data Min. Knowl. Discov. 2011, 22, 106–148.
32. Liu, H.; Zhou, S.; Lam, W.; Guan, J. A new hybrid method for learning Bayesian networks: Separation and reunion. Knowl.-Based Syst. 2017, 121, 185–197.
33. Kuipers, J.; Suter, P.; Moffa, G. Efficient sampling and structure learning of Bayesian networks. J. Comput. Graph. Stat. 2022, 31, 639–650.
34. Rissanen, J. Modeling by shortest data description. Automatica 1978, 14, 465–471.
35. Tian, J. A branch-and-bound algorithm for MDL learning Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, USA, 30 June–3 July 2000; pp. 580–588.
36. Lee, C.; Beek, P.V. Metaheuristics for score-and-search Bayesian network structure learning. In Proceedings of the Canadian Conference on Artificial Intelligence, Edmonton, AB, Canada, 16–19 May 2017; pp. 129–141.
Figure 1. An example of a DAG.
Figure 2. The order graph of a four-node BN.
Figure 3. Parent graph of X_4 in a 4-node BN.
Figure 4. Parent graph of X_4 after pruning in a 4-node BN.
Figure 5. A directed graph based on each variable in Table 1 and its POPS.
Figure 6. Acyclic component graph based on Figure 5.
Figure 7. Pruned order graph based on the constraints from Figure 5.
Table 1. The POPS of each variable in a 6-node BN.

Variable | POPS
X_1 | {X_2}, {}
X_2 | {X_1}, {}
X_3 | {X_1, X_2}, {X_1, X_4}, {X_2, X_4}, {X_1}, {}
X_4 | {X_1, X_5}, {X_1}, {X_5}, {}
X_5 | {X_1, X_2, X_6}, {X_1, X_2}, {}
X_6 | {X_2, X_3}, {X_3}, {}
Table 2. Performances of the A*, A*-MMPC, A*-PC, and A*-MM2PC algorithms on benchmark BNs. The best-performing results (including minimum time, minimum number of expanded states and minimum percentage error) are highlighted in bold in the Table 2 and the following tables.
Table 2. Performances of the A*, A*-MMPC, A*-PC, and A*-MM2PC algorithms on benchmark BNs. The best-performing results (including minimum time, minimum number of expanded states and minimum percentage error) are highlighted in bold in the Table 2 and the following tables.
| Name | n | N | A* Time(s) | A* States | A*-MMPC Time(s) | A*-MMPC States | A*-MMPC Error | A*-PC Time(s) | A*-PC States | A*-PC Error | A*-MM2PC Time(s) | A*-MM2PC States | A*-MM2PC Error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sachs | 11 | 1000 | 2.61 × 10^-3 | 215 | 2.02 × 10^-3 | 199 | 0% | 3.55 × 10^-3 | 58 | 0% | 3.05 × 10^-3 | 49 | 0% |
| Sachs | 11 | 3000 | 2.64 × 10^-3 | 416 | 2.27 × 10^-3 | 368 | 0% | 4.52 × 10^-3 | 84 | 0% | 3.37 × 10^-3 | 76 | 0% |
| Sachs | 11 | 5000 | 3.04 × 10^-3 | 492 | 2.33 × 10^-3 | 396 | 0% | 5.01 × 10^-3 | 89 | 0% | 3.71 × 10^-3 | 79 | 0% |
| Sachs | 11 | 7000 | 3.92 × 10^-3 | 599 | 2.82 × 10^-3 | 487 | 0% | 5.64 × 10^-3 | 100 | 0% | 4.04 × 10^-3 | 86 | 0% |
| Sachs | 11 | 10,000 | 3.65 × 10^-3 | 629 | 2.86 × 10^-3 | 501 | 0% | 5.83 × 10^-3 | 113 | 0% | 4.16 × 10^-3 | 98 | 0% |
| Child | 20 | 1000 | 1.94 × 10^-1 | 50,887 | 1.28 × 10^-1 | 36,678 | 0.21% | 7.89 × 10^-2 | 18,714 | 0% | 6.30 × 10^-2 | 16,226 | 0.21% |
| Child | 20 | 3000 | 5.69 × 10^-1 | 158,853 | 2.48 × 10^-1 | 63,946 | 0.27% | 2.08 × 10^-1 | 58,677 | 0% | 7.44 × 10^-2 | 29,427 | 0.27% |
| Child | 20 | 5000 | 4.07 × 10^-1 | 115,639 | 3.13 × 10^-1 | 65,100 | 0.19% | 1.73 × 10^-1 | 52,226 | 0% | 7.53 × 10^-2 | 30,242 | 0.19% |
| Child | 20 | 7000 | 4.19 × 10^-1 | 119,577 | 3.05 × 10^-1 | 59,259 | 0% | 1.81 × 10^-1 | 54,496 | 0% | 7.39 × 10^-2 | 28,040 | 0% |
| Child | 20 | 10,000 | 4.64 × 10^-1 | 137,545 | 3.36 × 10^-1 | 67,065 | 0% | 2.02 × 10^-1 | 63,799 | 0% | 8.87 × 10^-2 | 31,196 | 0% |
| Insurance | 27 | 1000 | 142.45 | 2.05 × 10^7 | 36.47 | 4.71 × 10^6 | 0.72% | 44.98 | 7.00 × 10^6 | 0% | 12.36 | 1.95 × 10^6 | 0.72% |
| Insurance | 27 | 3000 | 175.68 | 2.44 × 10^7 | 27.87 | 3.42 × 10^6 | 0.60% | 48.42 | 8.14 × 10^6 | 0% | 10.44 | 1.67 × 10^6 | 0.60% |
| Insurance | 27 | 5000 | 192.73 | 2.56 × 10^7 | 21.22 | 2.63 × 10^6 | 0.49% | 50.76 | 8.08 × 10^6 | 0% | 8.84 | 1.35 × 10^6 | 0.49% |
| Insurance | 27 | 7000 | 218.43 | 2.66 × 10^7 | 23.67 | 2.91 × 10^6 | 0.45% | 52.54 | 8.48 × 10^6 | 0% | 9.13 | 1.38 × 10^6 | 0.45% |
| Insurance | 27 | 10,000 | 215.52 | 2.55 × 10^7 | 26.95 | 3.16 × 10^6 | 0.42% | 50.95 | 8.15 × 10^6 | 0% | 9.53 | 1.46 × 10^6 | 0.42% |
| Water | 32 | 1000 | OT | | OT | | | 23.65 | 5.67 × 10^6 | 0% | 18.98 | 4.98 × 10^6 | 0.19% |
| Water | 32 | 3000 | OT | | OT | | | 26.31 | 6.20 × 10^6 | 0% | 18.55 | 4.77 × 10^6 | 0.18% |
| Water | 32 | 5000 | OT | | OT | | | 27.07 | 6.39 × 10^6 | 0% | 17.11 | 4.20 × 10^6 | 0.11% |
| Water | 32 | 7000 | OT | | OT | | | 27.66 | 6.41 × 10^6 | 0% | 19.60 | 4.46 × 10^6 | 0.06% |
| Water | 32 | 10,000 | OT | | OT | | | 27.86 | 6.45 × 10^6 | 0% | 21.08 | 4.85 × 10^6 | 0.07% |
| Alarm | 37 | 1000 | OT | | OT | | | 3.17 × 10^-1 | 1.01 × 10^6 | 0% | 1.40 × 10^-1 | 29,265 | 0.03% |
| Alarm | 37 | 3000 | OT | | OT | | | 1.37 × 10^-1 | 4992 | 0.22% | 4.96 × 10^-2 | 2731 | 1.99% |
| Alarm | 37 | 5000 | OT | | OT | | | 2.59 × 10^-1 | 78,817 | 0.02% | 7.14 × 10^-2 | 29,283 | 0.44% |
| Alarm | 37 | 7000 | OT | | OT | | | 3.33 × 10^-1 | 10,382 | 10% | 8.42 × 10^-2 | 30,415 | 0.02% |
| Alarm | 37 | 10,000 | OT | | OT | | | 2.48 × 10^-1 | 86,143 | 0.02% | 8.10 × 10^-2 | 29,695 | 0.21% |
| Hailfinder | 56 | 1000 | OT | | OT | | | 6.67 × 10^-2 | 1670 | 0.013% | 2.52 × 10^-2 | 573 | 0.11% |
| Hailfinder | 56 | 3000 | OT | | OT | | | 6.41 × 10^-2 | 1931 | 0.023% | 3.42 × 10^-2 | 922 | 0.42% |
| Hailfinder | 56 | 5000 | OT | | OT | | | 1.56 | 3.62 × 10^5 | 0.025% | 8.26 × 10^-1 | 1.51 × 10^5 | 1.35% |
| Hailfinder | 56 | 7000 | OT | | OT | | | 2.91 × 10^-2 | 491 | 0.067% | 1.21 × 10^-2 | 202 | 2.21% |
| Hailfinder | 56 | 10,000 | OT | | OT | | | 1.66 | 3.66 × 10^5 | 0.043% | 8.61 × 10^-1 | 1.53 × 10^5 | 1.21% |
| Win95pts | 76 | 1000 | OT | | OT | | | 1.47 | 3.23 × 10^5 | 0.49% | 6.48 × 10^-1 | 1.32 × 10^5 | 1.61% |
| Win95pts | 76 | 3000 | OT | | OT | | | 2.07 | 4.66 × 10^5 | 0.67% | 1.08 | 2.37 × 10^5 | 2.13% |
| Win95pts | 76 | 5000 | OT | | OT | | | 3.36 | 7.63 × 10^5 | 0.50% | 2.29 | 4.40 × 10^5 | 2.99% |
| Win95pts | 76 | 7000 | OT | | OT | | | 3.98 | 8.91 × 10^5 | 0.59% | 2.72 | 5.44 × 10^5 | 2.15% |
| Win95pts | 76 | 10,000 | OT | | OT | | | 3.70 | 8.36 × 10^5 | 0.56% | 3.28 | 6.56 × 10^5 | 1.94% |
Table 3. Performances of the A*, A*-MMPC, A*-PC, and A*-MM2PC algorithms on UCI datasets.
| Name | n | N | A* Time(s) | A* States | A*-MMPC Time(s) | A*-MMPC States | A*-MMPC Error | A*-PC Time(s) | A*-PC States | A*-PC Error | A*-MM2PC Time(s) | A*-MM2PC States | A*-MM2PC Error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lympho | 19 | 148 | 5.37 × 10^-3 | 17,414 | 2.94 × 10^-2 | 95 | 0% | 5.33 × 10^-2 | 8757 | 0% | 2.44 × 10^-2 | 93 | 0% |
| Hepatitis | 20 | 126 | 6.05 × 10^-3 | 8515 | 2.80 × 10^-2 | 2824 | 0.26% | 3.85 × 10^-2 | 4809 | 0% | 1.86 × 10^-3 | 1533 | 0.26% |
| Segment | 20 | 2310 | 1.72 | 428,083 | 7.99 × 10^-1 | 218,456 | 0.76% | 4.01 × 10^-1 | 107,902 | 0% | 1.71 × 10^-1 | 54,588 | 0.76% |
| Mushroom | 23 | 8124 | 5.93 × 10^-1 | 49,593 | 3.55 × 10^-1 | 38,835 | 2.07% | 4.89 × 10^-1 | 33,167 | 0% | 2.55 × 10^-1 | 25,669 | 2.07% |
| Autos | 26 | 159 | 36.55 | 4.76 × 10^6 | 12.24 | 1.64 × 10^6 | 2.60% | 18.38 | 2.54 × 10^6 | 0% | 6.74 | 906,247 | 2.60% |
| Steel | 28 | 1941 | OT | | 50.48 | 7.61 × 10^6 | 2.70% | 12.34 | 2.55 × 10^6 | 0% | 1.46 | 310,744 | 2.70% |
| Flag | 29 | 194 | OT | | 2.99 | 319,340 | 0.69% | 3.47 | 418,554 | 0.01% | 3.53 × 10^-1 | 46,659 | 0.69% |
| Soybean | 36 | 266 | OT | | OT | | | 1.23 | 234,185 | 0.13% | 1.16 | 164,698 | 3.93% |
| Bands | 39 | 277 | OT | | OT | | | 11.33 | 1.54 × 10^6 | 0% | 0.23 | 15,354 | 1.06% |
| Spectf | 45 | 267 | OT | | OT | | | 8.45 × 10^-2 | 76 | 0.10% | 7.76 × 10^-2 | 74 | 0.27% |
| Sponge | 45 | 76 | OT | | OT | | | 3.01 × 10^-1 | 13,106 | 0.659% | 1.73 × 10^-1 | 6753 | 1.304% |
| LungCancer | 57 | 32 | OT | | OT | | | 4.42 × 10^-1 | 19,223 | 2.159% | 3.66 × 10^-1 | 26,019 | 5.54% |
| Splice | 61 | 3190 | OT | | OT | | | 1.62 × 10^-1 | 216 | 0.123% | 1.62 × 10^-1 | 193 | 0.323% |
Table 4. Performances of the GOBNILP, MMHC, INOBS, and A*-MM2PC algorithms on benchmark BNs.
| Name | n | N | GOBNILP Time(s) | MMHC Time(s) | MMHC Error | INOBS Time(s) | INOBS Error | A*-MM2PC Time(s) | A*-MM2PC Error |
|---|---|---|---|---|---|---|---|---|---|
| Sachs | 11 | 1000 | 0.22 | 0.39 | 0.67% | 1.5 × 10^-2 | 0% | 3.05 × 10^-3 | 0% |
| Sachs | 11 | 3000 | 0.21 | 0.60 | 0% | 1.5 × 10^-2 | 0.098% | 3.37 × 10^-3 | 0% |
| Sachs | 11 | 5000 | 0.31 | 0.95 | 0% | 1.5 × 10^-2 | 0.065% | 3.71 × 10^-3 | 0% |
| Sachs | 11 | 7000 | 0.38 | 1.11 | 0% | 1.5 × 10^-2 | 0.130% | 4.04 × 10^-3 | 0% |
| Sachs | 11 | 10,000 | 0.35 | 1.41 | 0% | 1.6 × 10^-2 | 0.072% | 4.16 × 10^-3 | 0% |
| Child | 20 | 1000 | 0.48 | 0.77 | 4.25% | 1.5 × 10^-2 | 0.33% | 6.30 × 10^-2 | 0.21% |
| Child | 20 | 3000 | 0.87 | 1.63 | 0.34% | 1.6 × 10^-2 | 0.13% | 7.44 × 10^-2 | 0.27% |
| Child | 20 | 5000 | 1.08 | 2.54 | 0.32% | 9.24 × 10^-2 | 0.38% | 7.53 × 10^-2 | 0.19% |
| Child | 20 | 7000 | 1.47 | 3.62 | 0.33% | 1.73 × 10^-1 | 0% | 7.39 × 10^-2 | 0% |
| Child | 20 | 10,000 | 2.03 | 5.56 | 0.32% | 2.61 × 10^-1 | 0% | 8.87 × 10^-2 | 0% |
| Insurance | 27 | 1000 | 2.87 | 0.76 | 2.08% | 0.32 | 1.21% | 12.36 | 0.72% |
| Insurance | 27 | 3000 | 6.26 | 2.37 | 1.20% | 0.33 | 0.71% | 10.44 | 0.60% |
| Insurance | 27 | 5000 | 6.82 | 4.56 | 1.09% | 0.57 | 0.63% | 8.84 | 0.49% |
| Insurance | 27 | 7000 | 9.07 | 6.87 | 0.83% | 0.89 | 0.37% | 9.13 | 0.45% |
| Insurance | 27 | 10,000 | 10.16 | 11.05 | 0.67% | 0.98 | 0% | 9.53 | 0.42% |
| Water | 32 | 1000 | 2.13 | 0.43 | 1.94% | 0.31 | 0.21% | 18.98 | 0.19% |
| Water | 32 | 3000 | 2.98 | 0.59 | 1.58% | 0.42 | 0.28% | 18.55 | 0.18% |
| Water | 32 | 5000 | 5.25 | 0.88 | 1.15% | 0.48 | 0.33% | 17.11 | 0.11% |
| Water | 32 | 7000 | 5.71 | 1.34 | 1.16% | 0.63 | 0.07% | 19.60 | 0.06% |
| Water | 32 | 10,000 | 7.51 | 1.61 | 0.77% | 0.97 | 0.15% | 21.08 | 0.07% |
| Alarm | 37 | 1000 | 2.58 | 0.80 | 6.69% | 6.33 × 10^-1 | 0.69% | 1.40 × 10^-1 | 0.03% |
| Alarm | 37 | 3000 | 8.61 | 1.32 | 6.21% | 6.82 × 10^-1 | 2.17% | 4.96 × 10^-2 | 1.99% |
| Alarm | 37 | 5000 | 10.77 | 2.24 | 3.20% | 1.26 | 0.92% | 7.14 × 10^-2 | 0.44% |
| Alarm | 37 | 7000 | 14.15 | 2.78 | 2.72% | 1.41 | 0.25% | 8.42 × 10^-2 | 0.02% |
| Alarm | 37 | 10,000 | 22.34 | 4.27 | 2.37% | 1.25 | 0.23% | 8.10 × 10^-2 | 0.21% |
| Hailfinder | 56 | 1000 | 8.11 × 10^-1 | 7.93 | 2.75% | 5.61 × 10^-1 | 0.14% | 2.52 × 10^-2 | 0.11% |
| Hailfinder | 56 | 3000 | 1.92 | 2.74 | 14.49% | 6.72 × 10^-1 | 0.36% | 3.42 × 10^-1 | 0.42% |
| Hailfinder | 56 | 5000 | 6.56 | 42.78 | 5.59% | 7.46 × 10^-1 | 1.48% | 8.26 × 10^-1 | 1.35% |
| Hailfinder | 56 | 7000 | 34.39 | 138.75 | 3.81% | 1.85 | 2.31% | 1.21 × 10^-1 | 2.21% |
| Hailfinder | 56 | 10,000 | 68.88 | 307.99 | 2.25% | 1.56 | 1.60% | 8.61 × 10^-2 | 1.21% |
| Win95pts | 76 | 1000 | 256.87 | 1.86 | 28.56% | 1.68 | 0.83% | 6.48 × 10^-1 | 1.61% |
| Win95pts | 76 | 3000 | 529.38 | 5.73 | 43.86% | 2.24 | 2.56% | 1.08 | 2.13% |
| Win95pts | 76 | 5000 | 3201.13 | 55.92 | 84.91% | 4.84 | 4.21% | 2.29 | 2.99% |
| Win95pts | 76 | 7000 | 2883.96 | 509.71 | 93.01% | 6.36 | 2.89% | 2.72 | 2.15% |
| Win95pts | 76 | 10,000 | 4798.52 | 659.80 | 75.23% | 8.22 | 3.57% | 3.28 | 1.94% |
Table 5. Performances of the GOBNILP, MMHC, INOBS, and A*-MM2PC algorithms on UCI datasets.
| Name | n | N | GOBNILP Time(s) | MMHC Time(s) | MMHC Error | INOBS Time(s) | INOBS Error | A*-MM2PC Time(s) | A*-MM2PC Error |
|---|---|---|---|---|---|---|---|---|---|
| Lympho | 19 | 148 | 1.67 × 10^-1 | 0.49 | 3.86% | 1.11 × 10^-1 | 0% | 2.44 × 10^-2 | 0% |
| Hepatitis | 20 | 126 | 4.51 × 10^-1 | 0.32 | 1.47% | 3.13 × 10^-1 | 0% | 1.86 × 10^-3 | 0.26% |
| Segment | 20 | 2310 | 13.01 | 2.57 | 2.67% | 2.36 | 0.93% | 1.71 × 10^-1 | 0.76% |
| Mushroom | 23 | 8124 | OT | 258.33 | 85.33% | 6.75 × 10^-1 | 3.43% | 2.55 × 10^-1 | 2.07% |
| Autos | 26 | 159 | 13.47 | 0.85 | 5.19% | 1.62 | 2.67% | 6.74 | 2.60% |
| Steel | 28 | 1941 | 75.32 | 20.08 | 0.63% | 8.21 × 10^-1 | 1.52% | 1.46 | 2.70% |
| Flag | 29 | 194 | 1.21 | 0.84 | 1.83% | 4.71 × 10^-1 | 0.86% | 3.53 × 10^-1 | 0.69% |
| Soybean | 36 | 266 | 31.50 | 1.32 | 6.44% | 2.25 | 1.13% | 1.16 | 3.93% |
| Bands | 39 | 277 | 6.42 | 0.68 | 2.14% | 6.23 × 10^-1 | 1.72% | 2.31 × 10^-1 | 1.06% |
| Spectf | 45 | 267 | 4.85 × 10^-1 | 0.83 | 0.29% | 3.52 × 10^-1 | 0.56% | 7.76 × 10^-2 | 0.27% |
| Sponge | 45 | 76 | 7.61 × 10^-1 | 9.12 × 10^-1 | 10.263% | 1.78 × 10^-1 | 1.296% | 1.73 × 10^-1 | 1.304% |
| LungCancer | 57 | 32 | 4.68 | 6.32 × 10^-1 | 4.758% | 2.23 × 10^-1 | 2.806% | 3.66 × 10^-1 | 5.54% |
| Splice | 61 | 3190 | 303.88 | 2116.56 | 0.894% | 1.02 × 10^-1 | 0.603% | 1.62 × 10^-1 | 0.323% |