Article

A Novel Ensemble Method of Divide-and-Conquer Markov Boundary Discovery for Causal Feature Selection

1 Hunan Institute of Advanced Technology, Changsha 410073, China
2 College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
3 Cainiao Network, Hangzhou 311100, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2024, 12(18), 2927; https://doi.org/10.3390/math12182927
Submission received: 7 August 2024 / Revised: 4 September 2024 / Accepted: 16 September 2024 / Published: 20 September 2024
(This article belongs to the Special Issue Computational Methods and Machine Learning for Causal Inference)

Abstract

The discovery of Markov boundaries is highly effective at identifying features that are causally related to the target variable, providing strong interpretability and robustness. While there are numerous methods for discovering Markov boundaries in real-world applications, no single method is universally applicable to all datasets. Therefore, in order to balance precision and recall, we propose an ensemble framework of divide-and-conquer Markov boundary discovery algorithms based on a U-I selection strategy. We integrate three divide-and-conquer Markov boundary methods into the framework to obtain an ensemble algorithm, focusing on judging controversial parent–child variables to further balance precision and recall. By combining multiple algorithms, the ensemble algorithm can leverage their respective strengths and analyze the cause-and-effect relationships of target variables more thoroughly from various perspectives. Furthermore, it can enhance the robustness of the algorithm and reduce dependence on a single algorithm. In the experiments, we select four advanced Markov boundary discovery algorithms as comparison algorithms and compare them on nine benchmark Bayesian networks and three real-world datasets. The results show that EDMB ranks first overall, which illustrates the superiority of the ensemble algorithm and the effectiveness of the adopted U-I selection strategy. The main contribution of this paper lies in proposing an ensemble framework for divide-and-conquer Markov boundary discovery algorithms, balancing precision and recall through the U-I selection strategy, and judging controversial parent–child variables to enhance algorithm performance and robustness. The advantage of the U-I selection strategy, and its difference from existing methods, is the ability to independently obtain the maximum precision and recall of multiple algorithms within the ensemble framework. By assessing controversial parent–child variables, it further balances precision and recall, leading to results that are closer to the true Markov boundary.

1. Introduction

Data with a large number of features can capture more details, provide more comprehensive feature information, and help reveal complex patterns and relationships. However, it also brings challenges such as heavy consumption of computing and storage resources, data overfitting, and algorithm performance degradation [1]. The most discriminative information in the real world is usually carried by a few relevant features. In order to overcome these challenges and avoid the curse of dimensionality, researchers have extensively studied feature selection techniques. The Markov boundary (MB) reflects the causal relationships among variables and can be applied to feature selection [2]. For a target feature, its MB consists of the target’s parents, children, and spouses [3]. In Figure 1, we show the Markov boundary of the target node T in a Bayesian network. G and H are the parents, C and D are the children, and E and F are the spouses. The set of these six nodes is the Markov boundary of the target node T. The MB is the smallest feature set that renders all other features conditionally independent of the target.
MB discovery algorithms for conventional data are mainly divided into single MB discovery algorithms and multiple MB discovery algorithms. The application scenarios of these two types of algorithms depend on whether the training data satisfy the faithfulness assumption. Under faithfulness, the MB of the target variable is unique. When real data do not satisfy faithfulness, the target may have multiple equivalent MBs. This paper studies MB discovery under the faithfulness assumption. Researchers have proposed many MB discovery algorithms and their variants.
The single MB discovery algorithm assumes that the target variable has a unique MB, which is also the feature selection set of the target node. The single MB discovery algorithm is mainly divided into two categories: direct MB discovery algorithm and divide-and-conquer MB discovery algorithm. The direct method learns MB variables directly according to the nature of MB, while the divide-and-conquer method learns parent–child variables and spouse variables, respectively, and then obtains their union.
In the development of direct MB discovery methods, Koller and Sahami proposed the KS algorithm, which finds the Markov boundary by minimizing the cross-entropy loss but offers no theoretical reliability guarantee [4]. Margaritis put forward the grow–shrink (GS) method, whose growth stage first improves recall and whose shrink stage then improves precision [5]. Tsamardinos et al. proposed the incremental association Markov blanket (IAMB) algorithm based on GS, which improves accuracy by reordering the variables in each iteration [6]. To improve accuracy, data efficiency, time efficiency, and scalability to high-dimensional data, researchers have derived many IAMB variants, such as fast-IAMB [7], KIAMB [8], and fit-IAMB [9]. Borboudakis et al. proposed forward–backward selection with early dropping to enhance IAMB, enabling features incorrectly removed during the growth phase to be recovered over k iterations, thereby improving search accuracy and time efficiency [10]. For high-dimensional feature data, Tsamardinos et al. proposed the parallel forward–backward with pruning (PFBP) algorithm, which achieves large-scale parallel computation through meta-analysis calculations and conditional independence tests [11]. Although the direct methods have high efficiency and accuracy, they are strongly data dependent and perform poorly on small-sample datasets.
Researchers proposed divide-and-conquer MB discovery algorithms to further improve accuracy and reduce data dependence. Tsamardinos et al. proposed the earliest divide-and-conquer algorithm, the max–min Markov blanket (MMMB) algorithm, which obtains parent–child nodes by judging the independence relationship between each variable and the target, and obtains the spouse variables through V-structures [12]. Aliferis et al. proposed the HITON Markov blanket algorithm (HITON-MB), which interleaves the growth and contraction phases and can identify incorrect parent–child variables early to avoid error accumulation and improve accuracy [13]. Semi-HITON-MB improves the way variables in the candidate feature set are processed, building on HITON-MB [14].
Since the divide-and-conquer approach can discover the parent–child relationships between variables, symmetry checks on these relationships can be applied to improve the accuracy of MB discovery. Peña et al. proposed the parent–child-based Markov blanket algorithm (PCMB), the first algorithm to apply symmetry testing and the AND rule, which prevents descendants of child variables from being mistaken for child variables and thereby improves the accuracy of identifying parent–child variables [15]. Iterative PCMB (IPCMB) implements the PCMB identification process in reverse, avoiding overly large conditioning sets and reducing data noise by removing erroneous MB variables from the complete variable set as early as possible [16]. The simultaneous Markov blanket algorithm (STMB) applies spouse variables to help delete erroneous parent–child variables and improve the efficiency of symmetry testing; when data samples are insufficient, the distributions of some related features deviate and fail the symmetry test [17]. Ling et al. introduced a balanced Markov blanket discovery algorithm (BAMB), which alternately identifies parent–child and spouse variables to achieve a trade-off between computational efficiency and learning accuracy [18]. Furthermore, Wang et al. developed an efficient and effective Markov blanket discovery algorithm (EEMB), which distinguishes parent–child relationships from spouses during Markov blanket discovery to prevent incorrect child variables from producing erroneous spouses [19]. Rodrigues de Morais et al. proposed MB search by the OR condition (MBOR), which uses the OR rule to process approximately determined PC nodes and increase the recall of correct nodes [20]. The cross-check and complement Markov blanket algorithm (CCMB) theoretically analyzed the reasons for misjudging parent–child variables that do not conform to the OR rule and fixed this error, further improving the recall of correct nodes [21]. Wu et al. proposed a two-stage Markov boundary discovery method called separation and recovery Markov boundary discovery (SRMB), which addresses the issue of missing true positive nodes in data with distributional shift [22]. The error-aware Markov blanket algorithm (EAMB) proposed an R-AND rule for the symmetry tests, which can recover MB variables lost due to inaccurate conditional independence tests, and designed a contraction strategy to reduce unreliable conditional independence tests [23].
In addition to conditional independence tests, there are also methods for MB discovery based on Bayesian network structure learning and kernel function optimization. Acid et al. proposed the DMB and RPDMB methods based on Bayesian network scoring, using a hill-climbing search to find the graph structure that maximizes the score for Markov blanket discovery [24]. Gao et al. introduced score-based synchronous Markov blanket discovery (S2TMB) and its more efficient variant, S2TMB+, which exploit the co-occurrence property between the target node’s spouses and descendants, eliminating the need to enforce the usual symmetry constraints and thereby improving efficiency [25]. Because conditional independence tests are limited in capturing multivariate relationships, Wu et al. proposed a tolerant MB discovery algorithm (TLMB) [26]. This method maps the feature space and the target space to a reproducing kernel Hilbert space and uses scoring and optimization techniques to obtain causal features with multivariate dependencies.
The existing methods for MB discovery encompass a wide range of approaches aimed at improving efficiency and accuracy and at reducing data dependence. These methods include direct MB discovery algorithms and their variants, each designed to enhance performance in particular aspects. Researchers have also proposed divide-and-conquer algorithms to further refine accuracy and reduce data dependence by leveraging parent–child relationships and symmetry testing. The divide-and-conquer approach has shown promise in enhancing MB discovery by allowing more accurate identification of the relationships between variables. Furthermore, beyond conditional independence tests, researchers have explored MB discovery through Bayesian network structure learning and kernel function optimization.
While existing methods for Markov blanket discovery have made progress in improving efficiency and accuracy and in reducing data dependence, they still have limitations. Direct MB discovery algorithms, although efficient and accurate, often exhibit strong data dependence, leading to suboptimal performance on small-sample datasets. Some existing MB discovery algorithms can be complex, requiring multiple iterations, extensive parameter tuning, or significant computational resources, resulting in long processing times, especially on large-scale datasets. In the pursuit of accuracy, some algorithms sacrifice efficiency or compromise a certain degree of accuracy. Additionally, certain MB discovery algorithms make specific assumptions about data distributions, limiting their applicability across diverse data distributions. These assumptions can constrain the algorithms’ effectiveness when dealing with complex and varied real-world datasets.
The above algorithms focus on data quality, sample size, conditional independence test confidence, time efficiency, and other factors to improve the recall and precision of Markov boundary discovery from different angles. However, according to the no free lunch theorem [27], no algorithm can achieve good results on all datasets. We propose an ensemble framework of divide-and-conquer Markov boundary discovery algorithms to further balance precision and recall. Based on the differences in the parent–child nodes obtained by different algorithms, we define controversial parent–child nodes and propose a new union-and-intersection-based selection strategy (U-I select) to improve the effectiveness of MB discovery. Our research question focuses on ensemble algorithms in the field of Markov boundary discovery. Specifically, we explore how to improve the accuracy and robustness of causal relationship identification by combining multiple causal discovery methods. Traditional causal discovery algorithms usually perform well but may face obstacles in complex data environments. In the existing literature, although some studies have focused on improving causal discovery algorithms from different directions, few systematic studies have explored how to effectively integrate different methods to improve overall performance. In view of this, this study proposes an ensemble causal discovery framework that integrates multiple causal discovery methods to improve the accuracy and reliability of causal identification.
The proposed ensemble method is important for this work in the following aspects:
(1)
Enhanced performance: By combining multiple divide-and-conquer Markov boundary discovery algorithms into an ensemble framework, the proposed method can leverage the strengths of each individual algorithm. This ensemble approach often leads to improved performance compared to using a single algorithm, as it can capture a more diverse range of patterns and relationships in the data.
(2)
Balanced precision and recall: The ensemble framework based on the U-I selection strategy allows for a better balance between precision and recall. This balance is essential in ensuring that the discovered Markov boundaries are both accurate and comprehensive, which is crucial for understanding causal relationships in complex datasets.
(3)
Addressing dataset variability: Different datasets may exhibit varying characteristics and complexities. The ensemble method can adapt to these variations by integrating multiple algorithms, thereby improving the adaptability and generalizability of the Markov boundary discovery process across diverse datasets.
The proposed method can overcome the limitations of previous works in several ways:
(1)
Enhanced robustness: By integrating multiple divide-and-conquer Markov boundary discovery algorithms into an ensemble framework, the proposed method can enhance the robustness of the overall approach. This robustness comes from the ability to leverage the strengths of different algorithms and mitigate the weaknesses of individual methods present in previous works.
(2)
Handling controversial parent–child variables: The proposed method focuses on judging controversial parent–child variables to further refine the Markov boundary discovery process. By addressing contentious relationships between variables, the method can provide more accurate and reliable results compared to approaches that might overlook or misinterpret such relationships.
(3)
Comparative analysis: By comparing the results of the ensemble method with those of individual algorithms, researchers can gain insights into the strengths and weaknesses of each algorithm. This comparative analysis can further refine the Markov boundary discovery process and enhance the overall quality of the findings.
The objectives of this work include developing an ensemble framework for Markov boundary discovery to balance precision and recall, improving algorithm performance while enhancing robustness. The main contribution of the paper lies in proposing an ensemble framework for divide-and-conquer Markov boundary discovery algorithms based on the U-I selection strategy, integrating three divide-and-conquer methods into the framework to construct an ensemble algorithm. It involves judging controversial parent–child variables to further balance precision and recall, enhancing algorithm robustness by combining multiple algorithms within the ensemble framework to reduce reliance on a single method.
The rest of this paper is organized as follows. Section 2 introduces the basic symbols and definitions. In Section 3, we describe the ensemble of divide-and-conquer Markov boundary discovery algorithms. Section 4 introduces the experiment results and analysis. Finally, Section 5 summarizes this paper and describes future research objectives.

2. Definitions and Theorems

MB-related basic definitions and theorems are the basis of MB discovery algorithms. A Bayesian network can express the causal relationships between multiple variables. Faithfulness is the basic assumption of Bayesian networks and the conceptual source of the single MB. When identifying MB nodes based on conditional independence, parent and child nodes cannot be directly distinguished, while spouse variables and the corresponding child nodes can be identified through V-structures.
Definition 1.
Bayesian network [28]. For a Bayesian network ⟨U, G, P⟩, U denotes the set of variables, G denotes the directed acyclic graph on U, and P denotes the probability distribution on U. Figure 2 shows a Bayesian network where the solid arrows represent real connections and the dashed arrows represent nonexistent connections; the nonexistent connections are used only to illustrate some concepts and examples. If the arrow M → B existed, the graph would contain the directed cycle B → D → E → M → B and would no longer be acyclic. In Figure 2, U = {A, B, C, D, E, M, N}, G = {A → B, A → C, B → D, C → E, D → E, E → M, N → C}, and P is the discrete joint probability distribution of the set of random nodes U factorized according to the directed acyclic graph G.
Definition 2.
Faithfulness [29]. For a Bayesian network ⟨U, G, P⟩, G is faithful to P if and only if every conditional independence relation in P is entailed by G under the Markov condition. In Figure 2, if there were a hidden dependence A → N, then the data distribution of N would be affected by the data distribution of A, and the faithfulness condition would not be satisfied.
Definition 3.
Markov boundary [28]. In a faithful Bayesian network, the Markov boundary of a node consists of its parents, children, and spouses (the other parents of its children). In Figure 2, if A is the focus of the study, its Markov boundary includes the child nodes {B, C} and the spouse node {N}, with no parent nodes. If D is the focus of the study, its Markov boundary comprises the parent node {B}, the child node {E}, and the spouse node {C}. If M is the focus of the study, its Markov boundary is {E}, including only the parent node.
Definition 4.
V-structure [28]. If node Y has two incoming edges from X and Z, forming X → Y ← Z, and X is not adjacent to Z, then the three nodes X, Y, and Z form a V-structure. In Figure 2, D → E ← C and A → C ← N are both V-structures.
The MB discovery algorithm can find the MB of a variable without learning a complete Bayesian network. The MB has a special statistical property, as shown in Theorem 1. If the MB set is used as the conditioning set, the target variable becomes conditionally independent of all other features, so the MB can serve as the optimal solution to the feature selection problem, as shown in Theorem 2. Theorems 3 and 4 give the criteria for identifying parent–child variables and spouse variables, respectively. Based on these theorems, the divide-and-conquer MB discovery algorithm can search for parent–child and spouse variables using conditional independence tests.
Theorem 1.
For a variable T ∈ U, the Markov boundary MB ⊆ U of T satisfies the following: ∀ Y ∈ U \ MB \ {T}, T ⊥ Y | MB, and MB is the minimal set of variables satisfying this statistical property.
Theorem 2.
In data that satisfy the faithfulness assumption, the MB of the target variable is unique and is the optimal solution for feature selection [30,31].
Theorem 3.
In a Bayesian network on U, if nodes X and Y satisfy X ⊥̸ Y | Z for every variable subset Z ⊆ U \ {X, Y}, then X and Y are a pair of parent–child variables.
Theorem 4.
In a Bayesian network on U, if the non-adjacent nodes X and Y are both connected to T, and there exists a variable subset Z ⊆ U \ {X, Y, T} such that X ⊥ Y | Z holds but X ⊥ Y | Z ∪ {T} does not, then X and Y are a pair of spouse variables with common child T [32].
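Theorems 3 and 4 are operationalized through conditional independence tests on data. The following is a minimal sketch of such a test for discrete data, assuming a G² (log-likelihood ratio) statistic accumulated over the strata of the conditioning set Z; the function name, the significance level, and the use of pandas/SciPy are illustrative choices rather than the implementation used in this paper.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def g2_ci_test(data: pd.DataFrame, x: str, y: str, z: list, alpha: float = 0.05) -> bool:
    """Return True if X is judged conditionally independent of Y given Z.

    The G^2 statistic and degrees of freedom are accumulated over every
    configuration of the conditioning set Z (no conditioning if Z is empty).
    """
    g2, dof = 0.0, 0
    groups = [data] if not z else [g for _, g in data.groupby(z)]
    for g in groups:
        table = pd.crosstab(g[x], g[y]).to_numpy().astype(float)
        if table.size == 0:
            continue
        row = table.sum(axis=1, keepdims=True)
        col = table.sum(axis=0, keepdims=True)
        expected = row @ col / table.sum()
        mask = table > 0
        g2 += 2.0 * np.sum(table[mask] * np.log(table[mask] / expected[mask]))
        dof += (table.shape[0] - 1) * (table.shape[1] - 1)
    if dof == 0:
        return True  # not enough data to reject independence
    p_value = 1.0 - chi2.cdf(g2, dof)
    return p_value > alpha  # large p-value: fail to reject independence
```

A divide-and-conquer method would call such a test both when growing the parent–child set (Theorem 3) and when checking the V-structure condition for spouses (Theorem 4).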

3. Methods

In Section 3.1, we analyze the theoretical basis and general process of direct and divide-and-conquer methods in MB discovery and introduce several typical divide-and-conquer MB algorithms. In Section 3.2, we propose a framework for integrated divide-and-conquer MB discovery and integrate several divide-and-conquer MB algorithms into the framework to obtain an ensemble algorithm.

3.1. Several Typical Divide-and-Conquer Markov Boundary Discovery Algorithms

For a typical direct Markov boundary discovery algorithm (DiMB), the common algorithm flow is to obtain the MB directly according to the statistical property in Theorem 1, as shown in Figure 3a. For a typical divide-and-conquer Markov boundary discovery algorithm (DcMB), the common algorithm flow is as follows: first, find the parent–child nodes according to Theorem 3, and take the parent–child nodes of those parent–child nodes as the superset of the spouse nodes; then, find the spouse nodes according to Theorem 4; finally, obtain the MB according to Definition 3, as shown in Figure 3b.
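To make the DcMB flow of Figure 3b concrete, the following Python sketch wires the two stages together. Here `get_pc_with_sep` and `ci_test` are assumed helpers (a PC search that also records separating sets, and a conditional independence test such as the one sketched in Section 2), so this is a schematic outline under those assumptions rather than any specific published algorithm.

```python
def divide_and_conquer_mb(target, data, get_pc_with_sep, ci_test):
    """Schematic DcMB flow: parent-child nodes first (Theorem 3), then spouses (Theorem 4).

    get_pc_with_sep(t, data) -> (pc, sep): the PC set of t plus a dict sep[X] holding a
    separating set found for each variable X excluded from PC during the search.
    ci_test(data, x, y, z) -> True if x is judged independent of y given z.
    """
    pc, sep = get_pc_with_sep(target, data)
    spouses = set()
    for y in pc:                                    # candidate spouses live in PC(PC(T))
        pc_of_y, _ = get_pc_with_sep(y, data)
        for x in pc_of_y:
            if x == target or x in pc:
                continue
            z = set(sep.get(x, set()))              # X was separated from T by Z ...
            if not ci_test(data, x, target, sorted(z | {y})):
                spouses.add(x)                      # ... but depends on T once the common child Y is added
    return set(pc) | spouses                        # MB = PC ∪ SP (Definition 3)
```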

3.1.1. PCMB

The PCMB algorithm is the first divide-and-conquer MB discovery algorithm that uses the AND rule as shown in Algorithm 1. In the GetPC stage, it sets the candidate PC set to retain possible parent–child nodes and delete the separated nodes, and then repeats this process until the candidate PC set no longer changes. In this way, it can obtain the final candidate PC set and the separation set of deleted nodes. Then, the same operation is performed on each node in the candidate PC set to obtain the candidate PC set of each node. If the target node is in the candidate PC set of this node, that is, it meets the AND rule, then the node will be considered the PC node of the target node. If the target node is not in the candidate PC set of the node, the node will not be considered the PC node of the target node. After this judgment process, the PC node set of the target node can be obtained. In the GetSP stage, for each node in PC set, its parent and child nodes are found, and then the qualified spouse node set is found according to Theorem 4 and the separated set obtained in the GetPC stage. Finally, the set of spouse nodes is merged with the set of parent–child nodes to get the MB set.
Algorithm 1 PCMB (T, D)
Input: Data D, target node T
Initialize PCD = ∅, CanPCD = U \ {T}, PC = ∅, SP = ∅
{In the GetPC stage}
repeat
  for each X in CanPCD do
    Sep[X] = argmin_{Z ⊆ PCD} dep(X, T | Z)
    if X ⊥ T | Sep[X] then
      CanPCD = CanPCD \ {X}
  Y = argmax_{X ∈ CanPCD} dep(X, T | Sep[X])
  PCD = PCD ∪ {Y}
  CanPCD = CanPCD \ {Y}
  for each X in PCD do
    Sep[X] = argmin_{Z ⊆ PCD \ {X}} dep(X, T | Z)
    if X ⊥ T | Sep[X] then
      PCD = PCD \ {X}
until PCD remains unchanged
for each X in PCD do   {AND-rule symmetry check}
  if T ∈ PCD(X, D) then
    PC = PC ∪ {X}
{In the GetSP stage}
for each Y in PC do
  for each X in PC(Y, D) do
    if X ∉ PC then
      find Z such that X ⊥ T | Z and X, T ∉ Z
      if X ⊥̸ T | Z ∪ {Y} then
        SP = SP ∪ {X}
MB = PC ∪ SP
return MB
Output: Markov boundary MB of T

3.1.2. MBOR

The MBOR algorithm is the first divide-and-conquer MB discovery algorithm that uses the OR rule, as shown in Algorithm 2. In the GetPC stage, it extracts the superset of PC nodes and the superset of spouse nodes from the dataset, reducing the number of variables. Then, a candidate MB is obtained through a direct method of the IAMB family, and the spouses of the target are removed from the candidate MB set to obtain the candidate PC set. A judgment is made for each node in the difference between the PC superset and the candidate PC set: if the target node is in the candidate PC set of that node, it is added to the candidate PC set of the target node. In this way, the PC set of the target node is obtained. In the GetSP stage, the processing is the same as that of PCMB.
Algorithm 2 MBOR (T, D)
Input: Data D, target node T
Initialize PC = ∅, SP = ∅
{In the GetPC stage}
[PCS, Sep] = PCSuperset(T, D)
SPS = SPSuperset(T, D, PCS, Sep)
MBS = PCS ∪ SPS
D = MBS ∪ {T}   {restrict the data to these variables}
PC = MBtoPC(T, D)
for each X in PCS with X ∉ PC do
  if T ∈ MBtoPC(X, D) then
    PC = PC ∪ {X}
{In the GetSP stage}
for each X in PC do
  for each Y in MBtoPC(X, D) with Y ∉ PC ∪ {T} do
    find Z ⊆ MBS \ {T, Y} such that T ⊥ Y | Z
    if T ⊥̸ Y | Z ∪ {X} then
      SP = SP ∪ {Y}
MB = PC ∪ SP
Output: Markov boundary MB of T

3.1.3. STMB

Since the AND rule performs a large number of parent–child node judgments in the symmetry check step, which is inefficient, STMB uses spouse variables to help delete incorrect parent–child variables to improve precision and time efficiency as shown in Algorithm 3. In the GetPC stage, it obtains the candidate PC set through the same steps as PCMB. In the GetSP stage, STMB finds spouses and deletes incorrect children from the candidate PC set. The coexistence characteristics of spouse and false PC nodes can be applied to any other MB discovery method based on Bayesian networks.
Algorithm 3 STMB (T, D)
Input: Data D, target node T
Initialize PC = ∅, SP = ∅, Del = ∅
{In the GetPC stage}
[PC, Sep] = PCD(T, D)
{In the GetSP and remove false nodes stage}
for each Y in PC do
  for each X in U \ {T} \ PC do
    if X ⊥̸ T | Sep[X] ∪ {Y} then
      if ∃ Z ⊆ PC ∪ {X} \ {Y} such that Y ⊥ T | Z then
        Del = Del ∪ {Y}
        break
      else
        SP{Y} = SP{Y} ∪ {X}
PC = PC \ Del
for each Y in SP do
  for each X in SP{Y} do
    TS = PC ∪ SP{Y} \ {X}
    if X ⊥ T | TS then
      SP{Y} = SP{Y} \ {X}
PCM = PC
for each X in PCM do
  if X ⊥ T | PC ∪ SP \ {X} then
    PC = PC \ {X}
MB = PC ∪ SP
return MB
Output: Markov boundary MB of T
Under the faithfulness assumption, existing topology-based methods identify the parent–child set PC_Get as the union of the true parent–child set PC_True and some false positives PC_False. These false positives are always caused by collider structures among the descendants of the target T. The resulting V-structures imply that T must have at least one spouse, indicating a co-occurrence relationship between the false positive nodes in PC_False and T’s spouses. Given the PC_Get set, the remaining task is to eliminate the false positives and then reintroduce the spouses into the Markov blanket. STMB can handle both of these tasks simultaneously, thus reducing algorithm complexity. By exploiting the coexistence property between the spouses and descendants of the target node, STMB eliminates the need to enforce the symmetry-constrained AND rule. In Figure 4a, for node D, the PC_Get initially obtained by the PCMB algorithm is {B, E}, which matches its PC_True. In Figure 4b, the PC_Get initially obtained by the PCMB algorithm is {B, E, G}, while its PC_True is {B, E} and the PC_False is {G}. The presence of {G} is due to the spouse node {C} of D, forming the descendant collider structure E → G ← H, which makes it difficult to find a conditioning set that renders D and G independent.

3.2. Ensemble Method of Divide-and-Conquer Markov Boundary Discovery Algorithms

In Section 3.2.1, we define the controversial parent–child nodes to further refine the differences between the PC nodes obtained by different algorithms and propose a U-I selection strategy to deal with them. Then, we propose an ensemble of divide-and-conquer Markov boundary discovery algorithms to integrate the three algorithms of PCMB, MBOR, and STMB in Section 3.2.2.

3.2.1. Controversial Parent–Child Nodes and U-I Select Strategy

Although every divide-and-conquer MB discovery algorithm judges parent–child nodes through Theorem 3, the ways of adding correct parent–child nodes and deleting wrong ones differ, so the resulting GetPC sets differ. Suppose the set of divide-and-conquer MB discovery algorithms is DcMB = {DcMB_1, DcMB_2, …, DcMB_n, …, DcMB_N} and the corresponding set of obtained PC sets is PC_DcMB = {PC_DcMB_1, PC_DcMB_2, …, PC_DcMB_n, …, PC_DcMB_N}. For a specific algorithm DcMB_n, the procedure used to obtain its PC set is GetPC_n. We combine these PC sets to obtain PC_union from Equation (1), obtain PC_inter as the intersection of these PC sets from Equation (2), and finally obtain the difference PC_diff between PC_union and PC_inter as in Equation (3). Different MB algorithms disagree on whether the nodes in the PC_diff set are parent–child nodes of the target, which makes them controversial. Therefore, we define the parent–child nodes in the PC_diff set as the controversial parent–child nodes.
PC_union = PC_DcMB_1 ∪ PC_DcMB_2 ∪ … ∪ PC_DcMB_n ∪ … ∪ PC_DcMB_N    (1)
PC_inter = PC_DcMB_1 ∩ PC_DcMB_2 ∩ … ∩ PC_DcMB_n ∩ … ∩ PC_DcMB_N    (2)
PC_diff = PC_union − PC_inter    (3)
There are some wrong PC nodes in the obtained PC_diff set. It is necessary to adopt an appropriate selection strategy to remove the wrong PC nodes and retain the correct ones, further reducing the size of PC_diff, and then obtain PC_select by taking the union with PC_inter. This selection process is called the U-I select strategy. Then, the SP set is obtained through the GetSP step. Finally, the MB is obtained by taking the union of PC_select and SP.
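The set manipulations behind Equations (1)–(3) and the U-I select step can be expressed compactly. The sketch below assumes each base algorithm returns its PC set as a Python set and that `remove_false_pc` is a placeholder for the filtering step (e.g., an STMB-style deletion of false positives); the names are illustrative.

```python
def ui_select(pc_sets, remove_false_pc):
    """U-I selection over the PC sets returned by N divide-and-conquer algorithms."""
    pc_union = set().union(*pc_sets)                          # Equation (1)
    pc_inter = set(pc_sets[0]).intersection(*pc_sets[1:])     # Equation (2)
    pc_diff = pc_union - pc_inter                             # Equation (3): controversial nodes
    kept = remove_false_pc(pc_diff)                           # keep only nodes judged to be true PC
    return pc_inter | kept                                    # PC_select
```

For example, `ui_select([{"A", "B"}, {"A", "C"}], lambda s: s & {"C"})` treats A as agreed, treats B and C as controversial, retains only C after filtering, and returns {"A", "C"}.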
The method of obtaining MB through the ensemble framework is shown in Figure 5. The ensemble framework for Markov boundary discovery integrates multiple divide-and-conquer methods to enhance the robustness and accuracy of the algorithm. The framework is structured as follows:
Divide-and-Conquer Methods: The framework begins with several divide-and-conquer Markov boundary discovery algorithms, labeled DcMB1, DcMB2,…, DcMBn. Each of these methods independently identifies potential candidate sets for the Markov boundary.
Candidate Set Extraction: Each divide-and-conquer method generates a set of candidate parents and children for the Markov boundary, resulting in multiple candidate sets.
Combination of Candidate Set: The candidate sets from all divide-and-conquer methods are combined into a single union set. This step ensures that all potential candidates identified by any method are considered.
Conflict Resolution: The union set may contain conflicting or redundant variables. To address this, a difference set is derived, and a filtering process is applied to refine the set by resolving conflicts and eliminating redundancies.
Selection of parent–child nodes: From the filtered set, a selection process is employed to determine the most relevant parent–child nodes. This step is crucial for balancing precision and recall, ensuring that the final Markov boundary is both accurate and comprehensive.
Extract the target node’s spouses: The spouses of the target node are obtained based on the spouse node judgment conditions and the screened PC nodes.
Final Markov Boundary: The selected parent–child nodes and the spouses set are combined to form the final Markov boundary.
The ensemble framework’s rationale lies in its ability to enhance the robustness, coverage, conflict resolution, and consideration of controversial variables in the Markov boundary discovery process. By integrating multiple divide-and-conquer methods, the framework diminishes reliance on individual algorithms, bolstering the overall robustness of the process. Through the union of candidate sets, all potential variables identified by diverse methods are thoroughly examined, ensuring a comprehensive identification of the Markov boundary. Conflict resolution mechanisms, including filtering and selection processes, mitigate conflicts and redundancies among candidate sets, promoting a balance between precision and recall. Moreover, the framework’s special consideration of controversial variables via a dedicated set guarantees the inclusion of crucial variables, thereby elevating the accuracy and reliability of the final Markov boundary.
In our research, we use the U-I selection strategy to optimize the selection of parent–child nodes, aiming to select the most reasonable parent–child nodes from multiple groups of candidates while considering the utility and importance of each node. In the absence of any other prior knowledge, we set the importance of each algorithm to be the same, so the parent–child nodes obtained by each algorithm carry the same weight. If precision is used to measure utility, then the intersection of the parent–child sets obtained by the algorithms has the highest utility. If utility is measured by recall, then the highest utility is obtained by taking the union of the parent–child sets obtained by the algorithms.
The computational complexity and scalability of the ensemble framework are crucial aspects. Given that it combines multiple algorithms, it is essential to discuss how the framework adapts to larger datasets and higher-dimensional data. Figure 5 illustrates how Markov boundaries are obtained through the ensemble framework, which combines several divide-and-conquer methods to enhance the robustness and accuracy of the algorithm. Since each divide-and-conquer method runs independently and each candidate set must undergo conflict resolution and filtering, the computational complexity of the framework depends on the number and complexity of the divide-and-conquer methods, the size of the candidate sets and the merging process, and the conflict resolution and selection steps; as these factors grow, the computational cost and resource requirements increase.
To accommodate larger datasets and higher-dimensional data, the framework needs to exhibit good scalability. This includes running divide-and-conquer methods in parallel in a distributed computing environment, employing incremental update strategies to handle large-scale data, utilizing efficient data structures (such as hash tables, tree structures) to improve processing speed, and enhancing the efficiency of the selection process through optimized selection strategies (such as the U-I selection strategy) that consider the utility and importance of nodes. These measures collectively enhance the scalability and efficiency of the framework in handling larger and higher-dimensional datasets.
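Because the base methods do not share state, the parallel execution mentioned above can be sketched with Python's standard library. The worker pool and the `methods` mapping are illustrative assumptions, and the callables passed in must be picklable for a process pool.

```python
from concurrent.futures import ProcessPoolExecutor

def run_dcmb_methods_in_parallel(target, data, methods):
    """Run each divide-and-conquer method independently and collect its PC set.

    methods: mapping from a method name to a callable get_pc(target, data) -> set.
    The methods are dispatched to separate worker processes, which does not change
    the ensemble's result because each method receives the same inputs.
    """
    with ProcessPoolExecutor() as pool:
        futures = {name: pool.submit(fn, target, data) for name, fn in methods.items()}
        return {name: fut.result() for name, fut in futures.items()}
```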

3.2.2. Ensemble of Divide-and-Conquer Markov Boundary Algorithms

Since PCMB and MBOR obtain PC nodes in different ways, they produce different PC sets. STMB has a strategy for deleting wrong PC nodes, so we introduce it into the selection process for deleting wrong nodes from PC_diff. By integrating these three divide-and-conquer MB algorithms, we propose the ensemble divide-and-conquer Markov boundary (EDMB) algorithm to find the MB of the target node. The EDMB algorithm has five main steps, as shown in Algorithm 4:
  • Obtain the candidate PC set PC_PCMB^T by the PCMB algorithm and the candidate PC set PC_MBOR^T by the MBOR algorithm.
  • Obtain the intersection PC set PC_inter, the union PC set PC_union, and the difference PC set PC_diff from the candidate PC sets.
  • Remove the non-PC nodes from the PC_diff set using the same steps as STMB, and take the union with PC_inter to obtain PC_select.
  • Find the set of spouse nodes SP through the same steps as PCMB.
  • Obtain the MB set by taking the union of PC_select and SP.
Algorithm 4 EDMB (T)
Input: Target T, data D over variable set U
Initialize PC_select = ∅, SP = ∅
{step 1: find the candidate PC sets}
[PC_PCMB^T, Sep^T] = GetPC_PCMB(T, D)
PC_MBOR^T = GetPC_MBOR(T, D)
{step 2: obtain the intersection, union, and difference of the PC sets}
PC_inter = PC_PCMB^T ∩ PC_MBOR^T
PC_union = PC_PCMB^T ∪ PC_MBOR^T
PC_diff = PC_union − PC_inter
{step 3: remove wrong PC nodes from PC_diff}
for each Y ∈ PC_union do
  for each X ∈ U \ PC_union \ {T} do
    if X ⊥̸ T | Sep^T[X] ∪ {Y} then
      if ∃ Z ⊆ PC_union ∪ {X} \ {Y} such that Y ⊥ T | Z, and Y ∈ PC_diff then
        PC_diff = PC_diff \ {Y}
PC_select = PC_inter ∪ PC_diff
{step 4: find the spouse nodes}
for each Y ∈ PC_select do
  for each X ∈ PC(Y) do
    if X ∉ PC_select then
      find Z ⊆ Sep^T[X] such that X ⊥ T | Z and X, T ∉ Z
      if X ⊥̸ T | Z ∪ {Y} then
        SP = SP ∪ {X}
{step 5: obtain the MB set}
MB = PC_select ∪ SP
Output: Markov boundary MB of T

4. Results

We choose four state-of-the-art Markov boundary discovery algorithms for comparison: PCMB, MBOR, IAMB, and STMB. In Section 4.1, we introduce the standard Bayesian network datasets and apply the five algorithms to discover Markov boundaries on nine datasets. In Section 4.2, we apply these algorithms to classification tasks on real-world datasets and analyze the results. In the experiments, we used MATLAB 2020a on Windows 10 with an Intel Core i7-11800H processor to run the algorithms. For the validation of EDMB and the other algorithms on the standard datasets, we compared the output of each algorithm against the true Markov boundaries of each dataset. Each algorithm was independently run 10 times on different samples from the dataset for result analysis. On the real-world datasets, we used ten-fold cross-validation to evaluate the performance of EDMB and the other algorithms.

4.1. Standard MB Discovery Datasets

We select nine typical network datasets from the standard BN data, including Alarm, Child, Insurance, and their extended datasets [33]. Table 1 provides the statistics of these datasets, including the number of variables and edges. We run all MB algorithms for each variable in each dataset, repeating 10 times on the same dataset with different samples. The size of the samples S increases from 500 to 5000 with an interval of 500. The effects of the algorithms are compared under multiple sample sets, and the changes in algorithm performance as the size of the training sample increases are analyzed.
Precision and recall are commonly used to measure the search ability of MB algorithms. Precision is the ratio of retrieved true positives to the total number of detected MB variables, while recall is the ratio of retrieved true positives to the total number of true MB variables. The distance metric combines precision and recall, measuring the distance between the MB obtained by the algorithm and the true MB; the smaller the distance, the closer the obtained MB is to the true MB. Since the algorithms were tested on multiple datasets, we used the Friedman test to rank them and measure their performance [34,35].
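For reference, the sketch below computes the three metrics and the average Friedman ranks. The distance is taken as the Euclidean distance from the ideal point (precision, recall) = (1, 1), a common convention in the MB literature; this is an assumption, since the exact formula is not restated here.

```python
import numpy as np
from scipy.stats import rankdata

def mb_metrics(found_mb, true_mb):
    """Precision, recall, and distance of a discovered MB against the true MB."""
    found_mb, true_mb = set(found_mb), set(true_mb)
    tp = len(found_mb & true_mb)
    precision = tp / len(found_mb) if found_mb else 0.0
    recall = tp / len(true_mb) if true_mb else 1.0
    distance = float(np.sqrt((1.0 - precision) ** 2 + (1.0 - recall) ** 2))
    return precision, recall, distance

def friedman_ranks(scores, lower_is_better=True):
    """Average Friedman ranks of algorithms across datasets.

    scores: array of shape (n_datasets, n_algorithms); rank 1 is best on a dataset.
    """
    scores = np.asarray(scores, dtype=float)
    signed = scores if lower_is_better else -scores
    ranks = np.vstack([rankdata(row) for row in signed])  # ties receive averaged ranks
    return ranks.mean(axis=0)
```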
Figure 6 shows the network graphs and distances of MB discovery by multiple algorithms on Alarm and its extended datasets. On the Alarm dataset, when S is between 500 and 3000, the distance metric of EDMB is the best among these algorithms, and when S is between 3000 and 5000, the performance of EDMB is close to that of PCMB, and it performs well. On the Alarm3 dataset, when S is 500, the distance metric of EDMB is between PCMB and MBOR, ranking second. When S is between 1000 and 2000, the distance metric of EDMB is the best among these algorithms. When S is between 2000 and 5000, the performance of EDMB is close to that of MBOR, and it performs well. On the Alarm5 dataset, when S is 500, the distance metric of EDMB is between PCMB and MBOR, ranking second. When S is between 1000 and 1500, the distance metric of EDMB is the best among these algorithms. When S is between 1500 and 5000, the performance of EDMB is close to that of MBOR, and it performs well.
Figure 7 shows the network graphs and distances of MB discovery by multiple algorithms on Child and its extended datasets. On the Child dataset, when S is between 500 and 3000, the distance metric of EDMB is between PCMB and MBOR. When S is between 3000 and 3500, the performance of EDMB is close to that of MBOR, and it performs well. When S is between 3500 and 5000, the performance of EDMB is close to that of PCMB, and it performs well. On the Child3 dataset, when S is between 500 and 2000, the distance metric of EDMB is between PCMB and MBOR. When S is between 2000 and 5000, the distance metric of EDMB is the best among these algorithms. On the Child5 dataset, when S is between 500 and 2000, the distance metric of EDMB is between PCMB and MBOR. When S is between 2000 and 5000, the distance metric of EDMB is the best among these algorithms.
Figure 8 shows the network graphs and distances of MB discovery by multiple algorithms on Insurance and its extended datasets. On the Insurance dataset, when S is between 500 and 5000, the distance metric of EDMB is between PCMB and MBOR. On the Insurance3 dataset, when S is between 500 and 1000, the distance metric of EDMB is between PCMB and MBOR. When it is between 1500 and 5000, the distance metric of EDMB is the best among these algorithms. On Insurance5 dataset, when S is between 500 and 1500, the distance metric of EDMB is between PCMB and MBOR. When S is between 1500 and 3000, the performance of EDMB is close to that of MBOR, and it performs well. When S is between 3000 and 5000, the distance metric of EDMB is the best among these algorithms.
When the number of samples increases, the distance metric of all algorithms shrinks, which shows that the amount of data has a great impact on algorithm performance. In particular, we analyze the algorithms’ performance when the number of samples is 5000; the results are shown in Table 2. We can see that no algorithm performs best on every dataset. When ranking by the distance metric, the order is EDMB, MBOR, PCMB, IAMB, and STMB, demonstrating the excellent performance of EDMB. When ranking by the recall metric, the order is STMB, EDMB, MBOR, PCMB, and IAMB, among which EDMB and MBOR have the same ranking score, tied for second. When ranking by the precision metric, the order is PCMB, EDMB, MBOR, IAMB, and STMB. It can be seen that EDMB achieves a balance between precision and recall and has the best overall performance in obtaining the Markov boundary. In Table 2, comparing the three algorithms MBOR, PCMB, and STMB, according to the Friedman ranking of the distance metric, the scores of MBOR and PCMB are the same, and the ranking score of STMB is worse. According to the ranking of the recall metric, STMB has the highest rank score, MBOR a middle ranking score, and PCMB the lowest. According to the Friedman ranking of the precision metric, PCMB has the highest rank score, MBOR a middle ranking score, and STMB the lowest. Therefore, the most appropriate algorithm should be selected based on specific needs and the trade-offs that matter most: MBOR has the most balanced overall performance, STMB is the ideal choice when recall is the priority, and PCMB is the ideal choice when precision is the priority.
EDMB demonstrates a balanced high precision and recall across multiple datasets. On some datasets, EDMB can maintain high precision and recall simultaneously, indicating superior performance in terms of accuracy and comprehensiveness, potentially better capturing the true Markov boundaries compared to other algorithms. By integrating the strengths of multiple algorithms, EDMB can mitigate the weaknesses of individual algorithms, thereby enhancing its performance on complex datasets. By synthesizing the results of different algorithms, EDMB can comprehensively consider the characteristics of the dataset, leading to good performance across various datasets. EDMB may possess better adaptability, capable of handling datasets with different structures and features. In certain datasets, EDMB may better adapt to the characteristics of the data, thereby improving its performance on these datasets.
Table 3 displays the true Markov boundaries of selected nodes in the Alarm dataset, the Markov boundaries obtained by five different algorithms, and the precision and recall percentages for each algorithm. For node 2, apart from the MBOR algorithm erroneously identifying an extra node, both PCMB and STMB achieved 100% precision and recall, while the integrated EDMB showed 100% precision and recall. In the case of node 14, PCMB missed one node, resulting in an 80% recall rate, while MBOR achieved 100% precision and recall, STMB had a precision of 63% and recall of 100%, and EDMB achieved 100% precision and recall. Regarding the node 15, MBOR, PCMB, and STMB all achieved 100% precision, with recall rates of 71%, 29%, and 43%, respectively, while EDMB had a final precision of 100% and a recall rate of 71%. From the Markov boundary acquisition results for these three different structured nodes, we can observe that EDMB combines the strengths of three algorithms, enabling it to mitigate the weaknesses of individual algorithms. This is a key reason why EDMB performs well on certain datasets.
For node 21, both MBOR and PCMB achieved 100% precision, while STMB had 80% precision. PCMB achieved 100% recall, and MBOR and STMB had recall rates of 83% and 80%, respectively. EDMB demonstrated 83% precision and 100% recall, indicating that STMB did not perform well in the node selection process, likely due to its weaker ability to judge this type of structure or data distribution. This could impact the performance of EDMB on certain datasets. For node 31, the precision for MBOR, PCMB, and STMB was 75.00%, 100.00%, and 25.00%, respectively, with all having a 75% recall rate, while EDMB showed 100% precision and 75% recall. While EDMB performed reasonably well compared to MBOR, PCMB, and STMB, it was less effective than IAMB, which achieved 100% precision and recall. This indicates that the direct method is slightly more effective than the divide-and-conquer approach for certain specific structured nodes or data distributions, which also contributes to why EDMB may perform slightly less effectively on some datasets.

4.2. Real-World Datasets

We choose three real-world datasets from [36] to demonstrate the effectiveness of EDMB in feature selection applications. Table 4 describes the number of nodes and samples of each real-world dataset. We use three classifiers, naïve Bayes, linear SVM, and decision tree, to calculate the classification accuracy.
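This evaluation protocol can be sketched as follows, assuming a NumPy feature matrix, integer column indices for the selected MB variables, and scikit-learn classifiers; since the original experiments were run in MATLAB, this mirrors the protocol rather than reproducing the authors' implementation, and the classifier settings are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def evaluate_selected_features(X, y, mb_features):
    """Mean 10-fold cross-validation accuracy of three classifiers on the MB features."""
    X_sel = X[:, list(mb_features)]                # keep only the discovered MB variables
    classifiers = {
        "naive Bayes": GaussianNB(),
        "linear SVM": LinearSVC(max_iter=5000),
        "decision tree": DecisionTreeClassifier(random_state=0),
    }
    return {name: cross_val_score(clf, X_sel, y, cv=10, scoring="accuracy").mean()
            for name, clf in classifiers.items()}
```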
As shown in Table 5, we find that the classification accuracy of the same dataset is different under different classifiers, and the classifiers of the best classification results corresponding to different datasets are also different. According to Friedman’s ranks, EDMB ranks first; MBOR ranks second; STMB ranks third; IAMB ranks fourth; and PCMB ranks fifth. EDMB achieves the best comprehensive effect on feature selection in real datasets compared with other MB algorithms.

5. Conclusions

The Markov boundary discovery ensemble framework combines several state-of-the-art Markov boundary discovery algorithms to enhance the identification of parent–child relationships. Based on the experimental results, we have demonstrated the efficacy of the proposed algorithm, which integrates causal feature selection and leverages Markov boundary discovery; these results were obtained through extensive evaluations on multiple datasets. Furthermore, we have made use of the co-occurrence property between spouse nodes and false child nodes to enhance the accuracy of identifying Markov boundaries. In future research, we will explore more accurate parent–child node selection strategies so that the obtained parent–child nodes are closer to the real ones. This ensemble framework has a wide range of potential applications in practical scenarios. In the medical field, ensemble causal feature selection methods can help physicians extract causal links from large amounts of biological and clinical data to improve disease diagnosis and treatment. In intelligent manufacturing, equipment data and environmental factors can be monitored to find causal features that affect product quality and production efficiency. A challenge for future work is that integrating multiple causal feature selection methods may require more computational resources and time, especially on large-scale datasets, so further optimization is needed to improve efficiency. When integrating multiple feature selection methods, it is also necessary to consider how to combine the results of different methods effectively, which involves differences and potential conflicts among the methods.

Author Contributions

Conceptualization, H.L., J.Z. and H.W.; methodology, H.L., J.Z. and Z.Z.; software, H.L.; validation, H.L., J.Z. and Z.Z.; formal analysis, H.L. and H.W.; investigation, J.Z.; resources, Z.Z.; data curation, H.L.; writing—original draft preparation, H.L. and J.Z.; writing—review and editing, J.Z. and Z.Z.; visualization, H.W.; supervision, J.Z. and Z.Z.; project administration, H.W. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data and analytical data can be obtained from the author Hao Li ([email protected]).

Conflicts of Interest

Author Jianjun Zhan was employed by the company Cainiao Network. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Nssibi, M.; Manita, G.; Korbaa, O. Advances in Nature-Inspired Metaheuristic Optimization for Feature Selection Problem: A Comprehensive Survey. Comput. Sci. Rev. 2023, 49, 100559. [Google Scholar] [CrossRef]
  2. Fu, S.; Desmarais, M.C. Markov Blanket Based Feature Selection: A Review of Past Decade. In Proceedings of the World Congress on Engineering, London, UK, 30 June–2 July 2010; Newswood Ltd.: Hong Kong, China, 2010; Volume 1, pp. 321–328. [Google Scholar]
  3. Raffa, M. Markov Blankets for Sustainability. In Proceedings of the International Conference on Software Engineering and Formal Methods, Berlin, Germany, 26–30 September 2022; Springer: Cham, Switzerland, 2022; pp. 313–323. [Google Scholar]
  4. Koller, D.; Sahami, M. Toward Optimal Feature Selection. In Proceedings of the ICML’96, Bari, Italy, 3–6 July 1996; Volume 96, p. 292. [Google Scholar]
  5. Margaritis, D. Toward Provably Correct Feature Selection in Arbitrary Domains. In Proceedings of the NIPS’09: Proceedings of the 22nd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 7–10 December 2009. [Google Scholar]
  6. Tsamardinos, I.; Aliferis, C.F.; Statnikov, A.R.; Statnikov, E. Algorithms for Large Scale Markov Blanket Discovery. In Proceedings of the FLAIRS, St. Augustine, FL, USA, 11–15 May 2003; Volume 2, pp. 376–381. [Google Scholar]
  7. Yaramakala, S.; Margaritis, D. Speculative Markov Blanket Discovery for Optimal Feature Selection. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), Houston, TX, USA, 27–30 November 2005. [Google Scholar]
  8. Pena, J.M.; Nilsson, R.; Björkegren, J. Towards Scalable and Data Efficient Learning of Markov Boundaries. Int. J. Approx. Reason. 2007, 45, 211–232. [Google Scholar] [CrossRef]
9. Yang, X.; Wang, Y.; Ou, Y.; Tong, Y. Three-Fast-Inter Incremental Association Markov Blanket Learning Algorithm. Pattern Recognit. Lett. 2019, 122, 73–78.
10. Borboudakis, G.; Tsamardinos, I. Forward-Backward Selection with Early Dropping. J. Mach. Learn. Res. 2019, 20, 1–39.
11. Tsamardinos, I.; Borboudakis, G.; Katsogridakis, P.; Pratikakis, P.; Christophides, V. A Greedy Feature Selection Algorithm for Big Data of High Dimensionality. Mach. Learn. 2019, 108, 149–202.
12. Tsamardinos, I.; Aliferis, C.F.; Statnikov, A. Time and Sample Efficient Discovery of Markov Blankets and Direct Causal Relations. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 24–27 August 2003; pp. 673–678.
13. Aliferis, C.F.; Tsamardinos, I.; Statnikov, A. HITON: A Novel Markov Blanket Algorithm for Optimal Variable Selection. In Proceedings of the AMIA—American Medical Informatics Association Annual Symposium, Washington, DC, USA, 8–12 November 2003; pp. 21–25.
14. Aliferis, C.F.; Statnikov, A.; Tsamardinos, I. Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation. J. Mach. Learn. Res. 2010, 11, 235–284.
15. Peña, J.M.; Björkegren, J.; Tegnér, J. Scalable, Efficient and Correct Learning of Markov Boundaries under the Faithfulness Assumption. In Symbolic and Quantitative Approaches to Reasoning with Uncertainty; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; pp. 136–147.
16. Fu, S.; Desmarais, M.C. Fast Markov Blanket Discovery Algorithm via Local Learning within Single Pass. In Advances in Artificial Intelligence; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2008; pp. 96–107.
17. Gao, T.; Ji, Q. Efficient Markov Blanket Discovery and Its Application. IEEE Trans. Cybern. 2017, 47, 1169–1179.
18. Ling, Z.; Yu, K.; Wang, H.; Liu, L.; Ding, W.; Wu, X. BAMB: A Balanced Markov Blanket Discovery Approach to Feature Selection. ACM Trans. Intell. Syst. Technol. (TIST) 2019, 10, 52.
19. Wang, H.; Ling, Z.; Yu, K.; Wu, X. Towards Efficient and Effective Discovery of Markov Blankets for Feature Selection. Inf. Sci. 2020, 509, 227–242.
20. Rodrigues de Morais, S.; Aussem, A. A Novel Scalable and Data Efficient Feature Subset Selection Algorithm. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Antwerp, Belgium, 15–19 September 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 298–312.
21. Wu, X.; Jiang, B.; Yu, K.; Miao, C.; Chen, H. Accurate Markov Boundary Discovery for Causal Feature Selection. IEEE Trans. Cybern. 2019, 50, 4983–4996.
22. Wu, X.; Jiang, B.; Yu, K.; Chen, H. Separation and Recovery Markov Boundary Discovery and Its Application in EEG-Based Emotion Recognition. Inf. Sci. 2021, 571, 262–278.
23. Guo, X.; Yu, K.; Cao, F.; Li, P.; Wang, H. Error-Aware Markov Blanket Learning for Causal Feature Selection. Inf. Sci. 2022, 589, 849–877.
24. Acid, S.; de Campos, L.M.; Fernández, M. Score-Based Methods for Learning Markov Boundaries by Searching in Constrained Spaces. Data Min. Knowl. Discov. 2013, 26, 174–212.
25. Gao, T.; Ji, Q. Efficient Score-Based Markov Blanket Discovery. Int. J. Approx. Reason. 2017, 80, 277–293.
26. Wu, X.; Jiang, B.; Zhong, Y.; Chen, H. Tolerant Markov Boundary Discovery for Feature Selection. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual, 19–23 October 2020; pp. 2261–2264.
27. Adam, S.P.; Alexandropoulos, S.-A.N.; Pardalos, P.M.; Vrahatis, M.N. No Free Lunch Theorem: A Review. In Approximation and Optimization: Algorithms, Complexity and Applications; Springer: Cham, Switzerland, 2019; pp. 57–82.
28. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann: Burlington, MA, USA, 1988.
29. Chickering, D.M.; Meek, C. On the Incompatibility of Faithfulness and Monotone DAG Faithfulness. Artif. Intell. 2006, 170, 653–666.
30. Tsamardinos, I.; Aliferis, C.F. Towards Principled Feature Selection: Relevancy, Filters and Wrappers. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, PMLR, Key West, FL, USA, 3–6 January 2003; pp. 300–307.
31. Pellet, J.-P.; Elisseeff, A. Using Markov Blankets for Causal Structure Learning. J. Mach. Learn. Res. 2008, 9, 1295–1342.
32. Spirtes, P.; Glymour, C.; Scheines, R. Causation, Prediction, and Search; MIT Press: Cambridge, MA, USA, 2001.
33. Tsamardinos, I.; Brown, L.E.; Aliferis, C.F. The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Mach. Learn. 2006, 65, 31–78.
34. Friedman, M. The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. J. Am. Stat. Assoc. 1937, 32, 675–701.
35. Friedman, M. A Comparison of Alternative Tests of Significance for the Problem of m Rankings. Ann. Math. Stat. 1940, 11, 86–92.
36. Brown, G.; Pocock, A.; Zhao, M.-J.; Luján, M. Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection. J. Mach. Learn. Res. 2012, 13, 27–66.
Figure 1. The Markov boundary of the target node T in a Bayesian network.
Figure 2. A Bayesian network with real and nonexistent connections.
Figure 3. The general process of a single Markov boundary discovery algorithm: (a) the direct Markov boundary discovery process; (b) the divide-and-conquer Markov boundary discovery process.
Figure 4. Coexistence of spouse and false PC nodes: (a) a Bayesian network without a descendant collider structure; (b) a Bayesian network with a descendant collider structure.
Figure 5. The ensemble framework of the divide-and-conquer Markov boundary methods.
Figure 6. The network graphs and distances of MB discovery by multiple algorithms on Alarm and its extended datasets.
Figure 7. The network graphs and distances of MB discovery by multiple algorithms on Child and its extended datasets.
Figure 8. The network graphs and distances of MB discovery by multiple algorithms on Insurance and its extended datasets.
Table 1. The statistics of standard datasets [33].

Dataset   | Alarm | Alarm3 | Alarm5 | Child | Child3 | Child5 | Insurance | Insurance3 | Insurance5
Variables | 37    | 111    | 185    | 20    | 60     | 100    | 27        | 81         | 135
Edges     | 46    | 149    | 265    | 25    | 79     | 126    | 52        | 163        | 281
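For readers who want comparable inputs, the sketch below shows one possible way (not the authors' pipeline) to obtain the base benchmark networks and draw 5000 samples from each. The use of pgmpy, its get_example_model helper, and BayesianModelSampling is an assumption about tooling, and the tiled variants (Alarm3, Child5, and so on) would have to be constructed separately.

```python
# Minimal sketch (assumed tooling, not the authors' code): fetch the three base
# benchmark Bayesian networks and sample 5000 instances from each of them.
from pgmpy.utils import get_example_model
from pgmpy.sampling import BayesianModelSampling

for name in ["alarm", "child", "insurance"]:
    model = get_example_model(name)                       # load the benchmark BN
    print(name, len(model.nodes()), len(model.edges()))   # e.g. alarm: 37 nodes, 46 edges (Table 1)
    data = BayesianModelSampling(model).forward_sample(size=5000)
    data.to_csv(f"{name}_5000.csv", index=False)          # one 5000-sample dataset per network
```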
Table 2. Comparison of five MB algorithms on nine benchmark BN datasets (size = 5000).

Dataset    | Metric    | MBOR           | PCMB           | STMB           | IAMB           | EDMB
Alarm      | Distance  | 6.25% ± 1.68%  | 5.22% ± 2.62%  | 29.53% ± 2.87% | 15.63% ± 2.88% | 5.21% ± 2.13%
           | Recall    | 95.79% ± 1.06% | 95.64% ± 2.27% | 96.11% ± 1.02% | 89.70% ± 0.87% | 96.39% ± 1.79%
           | Precision | 97.20% ± 1.24% | 98.65% ± 0.97% | 73.50% ± 2.68% | 92.44% ± 2.79% | 97.87% ± 1.14%
Alarm3     | Distance  | 15.92% ± 1.60% | 17.64% ± 1.14% | 47.95% ± 2.01% | 31.75% ± 1.30% | 15.60% ± 1.47%
           | Recall    | 87.28% ± 0.70% | 84.57% ± 0.88% | 90.14% ± 0.88% | 83.22% ± 0.77% | 86.79% ± 0.87%
           | Precision | 94.45% ± 1.98% | 95.31% ± 0.83% | 57.42% ± 2.19% | 80.49% ± 1.40% | 95.59% ± 1.48%
Alarm5     | Distance  | 18.14% ± 0.84% | 21.23% ± 1.13% | 55.84% ± 1.30% | 41.59% ± 1.58% | 18.66% ± 0.96%
           | Recall    | 85.21% ± 0.71% | 80.75% ± 1.19% | 87.51% ± 0.81% | 79.20% ± 0.63% | 83.90% ± 0.94%
           | Precision | 94.87% ± 0.73% | 96.03% ± 0.53% | 50.53% ± 1.27% | 72.16% ± 1.43% | 95.80% ± 0.83%
Child      | Distance  | 4.66% ± 3.09%  | 1.51% ± 1.51%  | 16.84% ± 3.33% | 20.62% ± 4.00% | 2.11% ± 2.17%
           | Recall    | 98.20% ± 0.93% | 99.29% ± 0.94% | 98.12% ± 1.65% | 90.11% ± 1.38% | 99.19% ± 1.09%
           | Precision | 96.75% ± 2.98% | 99.21% ± 1.02% | 84.39% ± 3.34% | 86.77% ± 3.97% | 98.67% ± 1.80%
Child3     | Distance  | 6.27% ± 1.15%  | 6.14% ± 0.82%  | 35.83% ± 2.58% | 34.73% ± 3.09% | 5.58% ± 1.02%
           | Recall    | 96.89% ± 0.41% | 96.14% ± 0.77% | 97.09% ± 0.48% | 89.61% ± 0.61% | 96.73% ± 0.41%
           | Precision | 96.67% ± 1.24% | 97.47% ± 0.99% | 66.34% ± 2.31% | 73.00% ± 3.20% | 97.45% ± 1.17%
Child5     | Distance  | 4.11% ± 1.50%  | 3.14% ± 1.82%  | 41.31% ± 2.17% | 39.33% ± 2.16% | 3.38% ± 1.74%
           | Recall    | 99.75% ± 0.29% | 99.34% ± 0.95% | 99.74% ± 0.33% | 92.21% ± 0.44% | 99.62% ± 0.49%
           | Precision | 96.12% ± 1.35% | 97.48% ± 1.22% | 58.81% ± 2.13% | 66.37% ± 2.14% | 96.97% ± 1.54%
Insurance  | Distance  | 28.11% ± 3.42% | 36.79% ± 2.26% | 42.22% ± 1.57% | 35.08% ± 2.45% | 30.07% ± 2.88%
           | Recall    | 75.10% ± 2.26% | 68.04% ± 1.60% | 80.39% ± 1.09% | 68.57% ± 1.61% | 73.74% ± 2.03%
           | Precision | 93.74% ± 3.27% | 89.03% ± 3.11% | 70.16% ± 2.19% | 92.16% ± 2.82% | 92.62% ± 2.92%
Insurance3 | Distance  | 28.68% ± 1.68% | 27.27% ± 1.52% | 59.62% ± 1.57% | 40.61% ± 1.15% | 26.24% ± 1.53%
           | Recall    | 77.11% ± 0.72% | 78.23% ± 1.54% | 83.80% ± 0.87% | 68.74% ± 0.68% | 79.42% ± 1.23%
           | Precision | 89.96% ± 1.65% | 90.34% ± 0.91% | 47.57% ± 2.05% | 85.38% ± 1.53% | 90.62% ± 1.20%
Insurance5 | Distance  | 28.64% ± 0.91% | 28.81% ± 0.84% | 64.52% ± 1.95% | 46.30% ± 1.50% | 27.14% ± 0.77%
           | Recall    | 76.62% ± 0.55% | 75.22% ± 0.75% | 82.35% ± 0.61% | 66.62% ± 0.62% | 77.01% ± 0.61%
           | Precision | 90.83% ± 1.14% | 91.89% ± 1.01% | 42.90% ± 2.22% | 81.80% ± 1.49% | 92.31% ± 1.06%
Friedman's ranks | Distance rank score  | 2.3333 | 2.3333 | 4.8889 | 4.0000 | 1.4444
                 | Rank                 | 2      | 2      | 5      | 4      | 1
                 | Recall rank score    | 2.4444 | 3.6667 | 1.5556 | 4.8889 | 2.4444
                 | Rank                 | 2      | 4      | 1      | 5      | 2
                 | Precision rank score | 2.7778 | 1.6667 | 5.0000 | 3.8889 | 1.6667
                 | Rank                 | 3      | 1      | 5      | 4      | 1
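The Friedman rank scores in Table 2 are the per-dataset ranks of the five algorithms (1 = best) averaged over the nine networks [34,35]. The sketch below reproduces the Distance rank-score row from the mean Distance values above; the use of SciPy's rankdata and friedmanchisquare here is an illustrative assumption rather than the authors' code.

```python
# Reproducing the "Distance rank score" row of Table 2: rank the five algorithms
# on each dataset (1 = smallest Distance) and average the ranks over the datasets.
import numpy as np
from scipy.stats import rankdata, friedmanchisquare

algorithms = ["MBOR", "PCMB", "STMB", "IAMB", "EDMB"]
# Mean Distance (%) per dataset, copied from Table 2 (one row per network).
distance = np.array([
    [ 6.25,  5.22, 29.53, 15.63,  5.21],   # Alarm
    [15.92, 17.64, 47.95, 31.75, 15.60],   # Alarm3
    [18.14, 21.23, 55.84, 41.59, 18.66],   # Alarm5
    [ 4.66,  1.51, 16.84, 20.62,  2.11],   # Child
    [ 6.27,  6.14, 35.83, 34.73,  5.58],   # Child3
    [ 4.11,  3.14, 41.31, 39.33,  3.38],   # Child5
    [28.11, 36.79, 42.22, 35.08, 30.07],   # Insurance
    [28.68, 27.27, 59.62, 40.61, 26.24],   # Insurance3
    [28.64, 28.81, 64.52, 46.30, 27.14],   # Insurance5
])

ranks = np.apply_along_axis(rankdata, 1, distance)      # rank within each dataset
print(dict(zip(algorithms, ranks.mean(axis=0).round(4))))
# -> MBOR 2.3333, PCMB 2.3333, STMB 4.8889, IAMB 4.0, EDMB 1.4444 (as in Table 2)

stat, p = friedmanchisquare(*distance.T)                # Friedman test across the algorithms
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```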
Table 3. Comparison of the Markov boundaries of some nodes in the Alarm dataset (size = 5000).

Target | Metric    | Real MB               | MBOR               | PCMB             | STMB                                           | IAMB            | EDMB
2      | MB        | [23,27,29]            | [22,23,27,29]      | [23,27,29]       | [23,27,29]                                     | [23,27,29]      | [23,27,29]
       | Precision |                       | 75.00%             | 100.00%          | 100.00%                                        | 100.00%         | 100.00%
       | Recall    |                       | 100.00%            | 100.00%          | 100.00%                                        | 100.00%         | 100.00%
14     | MB        | [13,15,16,18,31]      | [13,15,16,18,31]   | [13,16,18,31]    | [7,13,15,16,18,21,29,31]                       | [4,13,16,18,31] | [13,15,16,18,31]
       | Precision |                       | 100.00%            | 100.00%          | 63.00%                                         | 80.00%          | 100.00%
       | Recall    |                       | 100.00%            | 80.00%           | 100.00%                                        | 80.00%          | 100.00%
15     | MB        | [4,14,16,18,21,22,31] | [4,14,16,21,31]    | [4,21]           | [4,14,21]                                      | [4,14,21]       | [4,14,16,21,31]
       | Precision |                       | 100.00%            | 100.00%          | 100.00%                                        | 100.00%         | 100.00%
       | Recall    |                       | 71.00%             | 29.00%           | 43.00%                                         | 43.00%          | 71.00%
21     | MB        | [15,19,20,22,29]      | [4,15,19,20,22,29] | [15,19,20,22,29] | [15,18,19,22,29]                               | [4,19,22,29]    | [4,15,19,20,22,29]
       | Precision |                       | 83.00%             | 100.00%          | 80.00%                                         | 75.00%          | 83.00%
       | Recall    |                       | 100.00%            | 100.00%          | 80.00%                                         | 60.00%          | 100.00%
31     | MB        | [14,15,16,18]         | [4,14,16,18]       | [14,16,18]       | [1,2,4,14,15,16,18,19,21,22,23,24,27,28,29,30] | [14,15,16,18]   | [14,16,18]
       | Precision |                       | 75.00%             | 100.00%          | 25.00%                                         | 100.00%         | 100.00%
       | Recall    |                       | 75.00%             | 75.00%           | 75.00%                                         | 100.00%         | 75.00%
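The Precision and Recall entries in Table 3 follow from comparing each discovered set with the real MB: for target 2, for example, MBOR returns one extra variable (22), giving precision 3/4 = 75% and recall 3/3 = 100%. A minimal sketch of this comparison (not the authors' implementation) is shown below.

```python
# Precision/recall of a discovered Markov boundary against the true one,
# checked on the target-2 (MBOR) and target-15 (EDMB) rows of Table 3.
def mb_precision_recall(found, real):
    found, real = set(found), set(real)
    tp = len(found & real)                           # correctly recovered MB members
    precision = tp / len(found) if found else 0.0
    recall = tp / len(real) if real else 0.0
    return precision, recall

real_mb_2 = [23, 27, 29]
print(mb_precision_recall([22, 23, 27, 29], real_mb_2))      # MBOR on target 2: (0.75, 1.0)

real_mb_15 = [4, 14, 16, 18, 21, 22, 31]
print(mb_precision_recall([4, 14, 16, 21, 31], real_mb_15))  # EDMB on target 15: (1.0, ~0.71)
```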
Table 4. Statistics for real-world datasets.

Dataset   | Wine | Diabetes | Statlog
Instances | 178  | 768      | 1000
Features  | 13   | 8        | 20
Table 5. Classification accuracy of different classifiers on the real-world datasets.

Classifier    | Dataset  | MBOR   | PCMB   | STMB   | IAMB   | EDMB
Naive Bayes   | Wine     | 0.9608 | 0.9546 | 0.9775 | 0.9775 | 0.9490
              | Diabetes | 0.7618 | 0.7552 | 0.7592 | 0.7618 | 0.7618
              | Statlog  | 0.7330 | 0.7450 | 0.7220 | 0.7230 | 0.7480
Linear SVM    | Wine     | 0.9608 | 0.9608 | 0.9386 | 0.9386 | 0.9663
              | Diabetes | 0.7632 | 0.7632 | 0.7632 | 0.7606 | 0.7632
              | Statlog  | 0.7320 | 0.7350 | 0.7410 | 0.7420 | 0.7360
Decision Tree | Wine     | 0.9101 | 0.9216 | 0.9271 | 0.9216 | 0.9271
              | Diabetes | 0.6993 | 0.6811 | 0.6928 | 0.6902 | 0.6993
              | Statlog  | 0.7200 | 0.7120 | 0.7150 | 0.7190 | 0.7140
Rank score    |          | 2.8333 | 3.7222 | 3.0000 | 3.0556 | 2.3889
Rank          |          | 2      | 5      | 3      | 4      | 1
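As a rough guide to how accuracies like those in Table 5 can be obtained, the sketch below trains Naive Bayes, a linear SVM, and a decision tree on the features selected by an MB algorithm and reports cross-validated accuracy on the Wine data. The scikit-learn loader, the 10-fold protocol, the classifier settings, and the selected_features placeholder are assumptions for illustration, not the authors' exact setup.

```python
# Hedged sketch of the evaluation summarized in Table 5: compare three classifiers
# on the subset of features an MB algorithm selected for the class variable.
from sklearn.datasets import load_wine
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)      # 178 instances, 13 features (cf. Table 4)
selected_features = [0, 6, 9, 12]      # hypothetical MB of the class variable

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Linear SVM": SVC(kernel="linear"),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X[:, selected_features], y, cv=10).mean()
    print(f"{name}: {acc:.4f}")
```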