Article

Online Streaming Features Selection via Markov Blanket

1 School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China
2 School of Computer Science and Technology, Shandong University of Finance and Economics, Jinan 250014, China
3 Department of Computer Science, Shaheed Benazir Bhutto University, Peshawar 25000, Pakistan
* Author to whom correspondence should be addressed.
Symmetry 2022, 14(1), 149; https://doi.org/10.3390/sym14010149
Submission received: 15 November 2021 / Revised: 19 December 2021 / Accepted: 25 December 2021 / Published: 13 January 2022
(This article belongs to the Special Issue Symmetry and Approximation Methods)

Abstract:
Streaming feature selection has always been an effective method for selecting a relevant subset of features from high-dimensional data and overcoming learning complexity. However, little attention has been paid to online feature selection through the Markov Blanket (MB). Several studies based on traditional MB learning presented low prediction accuracy and used fewer datasets because the number of conditional independence tests is high and consumes more time. This paper presents a novel algorithm called Online Feature Selection Via Markov Blanket (OFSVMB), based on a statistical conditional independence test, offering high accuracy and low computation time. It reduces the number of conditional independence tests and incorporates online relevance and redundancy analysis to check the relevancy between the incoming feature and the target variable T, discard redundant features from the Parents-Child (PC) and Spouses (SP) sets online, and find PC and SP simultaneously. The performance of OFSVMB is compared with traditional MB learning algorithms, including IAMB, STMB, HITON-MB, BAMB, and EEMB, and with streaming feature selection algorithms, including OSFS, Alpha-investing, and SAOLA, on 9 benchmark Bayesian Network (BN) datasets and 14 real-world datasets. For the performance evaluation, F1, precision, and recall are measured at significance levels of 0.01 and 0.05 on the benchmark BN datasets, and 12 classifiers are used on the real-world datasets at a significance level of 0.01. On the benchmark BN datasets with 500 and 5000 sample sizes, OFSVMB achieved higher accuracy than IAMB, STMB, HITON-MB, BAMB, and EEMB in terms of F1, precision, and recall, and ran faster. It finds a more accurate MB regardless of the size of the feature set. On the real-world datasets, OFSVMB offers substantial improvements in mean prediction accuracy across the 12 classifiers for small and large sample sizes compared with OSFS, Alpha-investing, and SAOLA, but it is slower than OSFS, Alpha-investing, and SAOLA because these algorithms only find the PC set and not the SP set. Furthermore, the sensitivity analysis shows that OFSVMB is accurate in selecting the optimal features.

1. Introduction

In machine learning, several feature selection algorithms are essential for processing high-dimensional data. The optimal feature set (the terms feature, variable, and node are used interchangeably) for a target variable T is its Markov blanket (MB) [1], composed of the Parents-Child (PC) and Spouses (SP), i.e., the direct causes and effects of T and the other direct causes of those effects, as shown in Figure 1.
There are two major approaches to feature selection, i.e., traditional and streaming-based. Traditional MB discovery (discovery and learning are interchangeable in this article) assumes that all features are readily available from the beginning. This assumption, however, is often violated in real-world applications. For instance, for the problem of detecting Mars craters from high-resolution planetary images [2], it is impracticable to attain the complete feature set, which would require near-global coverage of the Martian surface. Several MB discovery algorithms based on the traditional setting require all features and instances to be available before learning, including Incremental Association-Based Markov Blanket (IAMB) [3], HITON-MB [4], Simultaneous Markov Blanket (STMB) [5], Balanced Markov Blanket (BAMB) [6], and Efficient and Effective Markov Blanket (EEMB) [7]. Traditional MB discovery algorithms do not apply to real-world scenarios such as users of the microblogging website Twitter, who produce over 250 million tweets every single day including new words and abbreviations [8], personalized recommendation [9], malware scanning [10], and ecological inspection and analysis [11], where features are frequently updated and the feature space changes dynamically [12]. A fascinating question is whether we should wait for all features to become accessible before starting the learning process. Building such features upfront requires considerable computational effort. This raises an intriguing research challenge: constructing an effective feature selection process without knowing the entire feature space. However, the existing algorithms for Markov blanket discovery commonly assume that all features are present in advance. Therefore, to tackle such issues, streaming feature selection was introduced and applied to many applications such as biology, weather forecasting, transportation, stock markets, and clinical research [8]. Several algorithms based on streaming features (SF) were proposed for real scenarios, including Grafting [8], Alpha-investing (α-investing) [13], Scalable and Accurate Online Feature Selection (SAOLA) [14], and Online Streaming Feature Selection (OSFS) [15]. However, these algorithms only focus on obtaining the PC set and do not consider the Spouses, which causes them to lose interpretability by ignoring causal MB discovery.
Motivated by these observations and issues, this paper presents the Online Streaming Features Selection via Markov Blanket (OFSVMB) algorithm, based on a statistical conditional independence test. The features are no longer static but flow in one by one and are analyzed as they arrive. OFSVMB uses a null-conditional independence test to handle the streaming features, a feature relevance analysis to find the true positive PC and Spouses, and a feature redundancy analysis to remove false positive/irrelevant features.
The main contributions of this study are as follows:
  • OFSVMB obtains Parent-Child and Spouse feature sets simultaneously and separates them in the MB set;
  • OFSVMB reduces the impact of conditional independence test errors, since it uses fewer conditional independence tests to learn the MB;
  • Sensitivity analysis of OFSVMB using three different values of the parameter α with respect to the Rate of Instance | R I | , to analyze the performance with small and large sample sizes;
  • Performance evaluation of OFSVMB algorithm on benchmark BN and real-world datasets.
The remainder of the paper is organized as follows: the related work is presented in Section 2, the preliminaries are discussed in Section 3, the proposed algorithm is presented in Section 4, the experimental findings are discussed in Section 5, and the paper is concluded in Section 6.

2. Related Work

In traditional MB discovery techniques, features must be present before learning begins. Different algorithms have been developed for MB learning based on traditional concepts, such as Incremental Association-Based Markov Blanket (IAMB) [16], Max-Min Markov Blanket (MMMB) [17], HITON-MB [18], Simultaneous Markov Blanket (STMB) [19], Iterative Parent-Child-based MB (IPCMB) [19], Balanced Markov Blanket (BAMB) [6], and Efficient and Effective Markov Blanket (EEMB) [7]. The IAMB [16] algorithm cannot differentiate the Parents-Child and the Spouses of the target feature T while learning the MB. Additionally, when the sample size of the dataset is not quite large, IAMB cannot faithfully discover the Markov blanket of the target feature T. The Max-Min Markov Blanket (MMMB) uses a divide-and-conquer strategy to reduce the required size of data samples [17]. It splits the problem of finding the MB into two sub-problems: finding the Parents-Child and finding the Spouses. MMMB was modified to build HITON-MB, which quickly excludes false-positive features from the Parents-Child set by interleaving the growing and shrinking phases [18]. Under the Markov blanket assumption, MMMB [17] and HITON-MB [18] were conceptually flawed and needed a new mechanism for accurate MB discovery. The Iterative Parent-Child-based search of Markov Blanket (IPCMB) algorithm uses the same PC discovery as PCMB [19] to identify the PC set and increases efficiency without losing accuracy [19]. However, the symmetry constraint check makes this algorithm computationally slow. STMB [19] follows the same technique as IPCMB [19] for Parents-Child exploration and avoids the symmetry check, but it still consumes more time.
Moreover, online feature selection methods receive features one by one dynamically. These methods include Grafting [8], Alpha-investing (α-investing) [8], Scalable and Accurate Online Feature Selection (SAOLA) [20], and Online Streaming Feature Selection (OSFS) [21]. Grafting, designed by Perkins and Theiler, was the first streaming feature selection algorithm; it is a stage-wise gradient descent technique. Alpha-investing uses a p-value and a threshold to control the feature selection process and decide whether to select relevant features and remove redundant ones [8]. The benefit of Alpha-investing is that it handles feature sets of unknown size, even up to infinity, but it fails to re-investigate redundant features, causing unpredictable and low prediction accuracy. SAOLA [20] examines two features simultaneously and analyzes redundancy under a single scenario; it fails to find an optimal relevance threshold value to remove all redundant features. In contrast, OSFS [21] removes unnecessary features that are not relevant/associated with the target feature T using conditional independence. It uses two steps to obtain the Parents-Child relevant to the target feature T: the first step analyzes online relevance, and the second step analyzes online redundancy. It identifies an approximate MB without Spouses. Online feature selection methods such as Alpha-investing, SAOLA, and OSFS only identify the Parents-Child (PC) features and ignore the Spouses (SP), so causality is neglected by these algorithms. Based on online feature selection and causality, Causal Discovery From Streaming Features (CDFSF) was proposed [22]. The symmetric Causal Discovery From Streaming Features (S-CDFSF) [22] was then developed, which uses a conditional independence test to identify the features relevant to the target feature T in a streaming fashion, belonging to both the Parents-Child and the Spouses.

3. Preliminaries

In this section, MB discovery through streaming features is defined, and the specific aspects are discussed in detail. Table 1 summarizes the notations used in this paper together with the definitions below.
Definition 1 (Streaming Features [15]). A feature vector $\{X_1, X_2, X_3, \ldots, X_i\}$ whose features stream in one by one over time, so that at time $t_i$ the features $X_{i+1}, X_{i+2}, \ldots, X_{i+n}$ have not yet arrived, while the number of training samples remains constant.
Definition 2 (Conditional Independence [6]). A feature (variable) $X_i$ is conditionally independent of a feature (variable) $Y_i$ given S, if and only if $P(X_i \mid Y_i, S) = P(X_i \mid S)$.
Definition 3 (Strongly relevant [23]). A feature $X_i \in R$ is strongly relevant to the target feature T if and only if, for every $S \subseteq R \setminus \{X_i\}$, $P(X_i \mid S) \neq P(X_i \mid S, T)$.
Definition 4 (Irrelevant [23]). A feature $X_i \in R$ is irrelevant to the target feature T if and only if, for every $S \subseteq R \setminus \{X_i\}$, $P(X_i \mid S, T) = P(X_i \mid S)$.
Definition 5 (Redundant [24]). A feature $X_i$ is redundant to the target variable T if and only if it is weakly relevant to the target variable T and has a Markov blanket $MB(X_i)$ that is a subset of the Markov blanket $MB_T$.
Definition 6 (Faithfulness Condition [25]). Let G denote a Bayesian network and P a joint probability distribution over the feature set R. G is faithful to P if and only if all and only the conditional independencies that hold in P are entailed by G.
Definition 7 (V-structure [26]). If there is no arrow between the feature (variable) $X_i$ and the feature (variable) $Y_i$, and the feature (variable) $Z_i$ has two incoming arrows from $X_i$ and $Y_i$, respectively, then $X_i$, $Z_i$, and $Y_i$ form a V-structure $X_i \rightarrow Z_i \leftarrow Y_i$.
Definition 8 (D-Separation [26]). A path D between a feature (variable) $X_i$ and a feature (variable) $Y_i$ is blocked (D-separated) by a set of features (variables) S, if and only if:
  • D includes a chain $X_i \rightarrow Z_i \rightarrow Y_i$ such that the middle feature $Z_i$ is in S; or
  • D includes a collider $X_i \rightarrow Z_i \leftarrow Y_i$ such that the middle feature $Z_i$ is not in S and none of the descendants of $Z_i$ are in S.
A feature set S is said to D-separate $X_i$ and $Y_i$ if and only if S blocks every path D from the feature $X_i$ to the feature $Y_i$.
Theorem 1.
In a faithful Bayesian network, the MB of the target feature T, $MB_T$, in a feature set R is the optimal feature set, composed of the Parents, Children, and Spouses of T. All other features are conditionally independent of the target feature T given $MB_T$: $\forall X_i \in R \setminus (MB_T \cup \{T\})$, $X_i \perp T \mid MB_T$.

4. Online Feature Selection via Markov Blanket

This section presents the proposed algorithm implementing the framework for feature selection with streaming features, called Online Feature Selection via Markov Blanket (OFSVMB).

4.1. Framework of OFSVMB

The framework of the Online Feature Selection via Markov Blanket is shown in Table 2. Two conditional independence tests are used to check the association between features: the statistical $G^2$ test (for discrete data) and the statistical Fisher's z-test (for continuous data). $Ind(X_i, T \mid S)$, equivalently written $X_i \perp T \mid S$ (the equals sign means the two notations are the same in this article), denotes a conditional independence test between a feature $X_i$ and the target feature (variable) T given a subset S, while $Dep(X_i, T \mid S)$, equivalently $X_i \not\perp T \mid S$, represents a conditional dependence test between a feature $X_i$ and the target feature T given a subset S.
For a new incoming feature, the OFSVMB algorithm performs null conditional independence analysis [26], relevance analysis, and redundancy analysis. The pseudocode of OFSVMB is shown in Algorithm 1. In Algorithm 1 (RecogNC), line 5 performs the null conditional independence test using Proposition 1 to check the dependency between a feature $X_i$ and the target feature T. If $X_i$ is dependent on the target feature T, then $X_i$ is added to $CPC_T$; otherwise, it is added to non_pc_T.
Proposition 1.
Using the null conditional independence test, check whether the feature $X_i$ is relevant or irrelevant to the target feature T.
Proof of Proposition 1. 
Assuming that $X_i \in R$ and $Y_i \in R$, the following holds:
$X_i \perp Y_i \mid \varnothing \iff P(X_i \mid Y_i, \varnothing) = P(X_i \mid \varnothing) \iff P(X_i \mid Y_i) = P(X_i) \iff \frac{P(X_i, Y_i)}{P(Y_i)} = P(X_i) \iff P(X_i, Y_i) = P(X_i)\,P(Y_i) \iff X_i \perp Y_i$
Therefore, $X_i \perp Y_i \mid \varnothing$ represents that $X_i$ and $Y_i$ are not relevant to each other.
Through the relevancy analysis based on Proposition 2, OFSVMB analyzes the incoming features and adds them to the candidate Parents-Child set $CPC_T$ and the candidate Spouse set $CSP_T$. If a feature $X_i$ is relevant to the target feature T given $S \subseteq CPC_T$, it is added to $CPC_T$; otherwise, it is removed from $CPC_T$ and added to non_pc_T.
Furthermore, OFSVMB also analyzes whether a feature $X_i$ from the non_pc_T set is a candidate Spouse in $CSP_T\{X_i\}$. For example, if a feature $X_i \in$ non_pc_T, the conditioning set that renders $X_i$ and the target variable T independent is $sep_T\{Y_i\}$. If there exists a feature $N \in CPC_T$ such that $X_i$ is dependent on the target variable T conditioned on $sep_T\{Y_i\} \cup N$, then $X_i$ is added to $CSP_T\{N\}$, as described in Algorithm 1.
Proposition 2.
A current feature $X_i$ arrives at time t, and T is the target feature. If there exists $S \subseteq CPC_T$ such that $X_i \not\perp T \mid S$, then $X_i \in CPC_T$.
Proof. 
The proof is as follows: if there exists $S \subseteq CPC_T$ such that $X_i \not\perp T \mid S$, then the feature $X_i$ is relevant to the target feature T in R by Theorem 1, and $X_i$ is added to $CPC_T$. □
Based on Theorem 2 and Proposition 3, using the redundancy analysis, OFSVMB removes false-positive features separately from the candidate Parents-Child set $CPC_T$ and the candidate Spouse set $CSP_T$. It also looks for non-MB descendant features in $CPC_T$ and discards them via Theorem 2, which distinguishes OFSVMB from the other existing algorithms. False-positive Parents-Child features are removed through the redundancy analysis: if $Y_i \in CPC_T$ and $Y_i$ is not dependent on the target feature (variable) T given a subset S, then $Y_i$ is removed from $CPC_T$. Irrelevant Spouses are removed as follows: for $N \in CPC_T$ and $E \in CSP_T\{N\}$, if E is not relevant to the target feature T conditioned on $S \cup N$, then E is removed from $CSP_T\{N\}$.
Theorem 2.
(PC false-positive recognition): Only descendants of the target T, denoted $Des_T$, can make up the false positives $f \in CPC_T$.
Proof. 
$CPC_T$ is composed of the whole PC set and a few false positives. We show that $f \in Des_T$. $CPC_T$ is a candidate super-set of all true positive PC features because it must contain them. After an exhaustive search for the PC set, the entire parent set of the target T is represented by $Pa_T$. According to Definition 5, all non-descendant nodes (features) are independent of the target T given $Pa_T$; that is, if f is any non-descendant node (feature), then $f \perp T \mid Pa_T$. As a result, such an f will be omitted from $CPC_T$ due to the Markov condition, so any remaining false positive satisfies $f \in Des_T$. □
Proposition 3.
(Removing false positives from the Spouse set). In a BN with feature set R, assume that $X_i$ is adjacent to $Y_i$, $Y_i$ is adjacent to T, and $X_i$ is not adjacent to T (e.g., $X_i \rightarrow Y_i \leftarrow T$). Once a feature M enters $CSP_T\{Y_i\}$, then for any existing feature $X_i$ in $CSP_T\{Y_i\}$, if there exists $S \subseteq CPC_T \cup CSP_T\{Y_i\} \setminus \{X_i\}$ such that $X_i \perp T \mid S \cup \{Y_i\}$, then $X_i$ is a false positive and is removed from $CSP_T\{Y_i\}$.
Proof of Proposition 3. 
The V-structure $X_i \rightarrow Y_i \leftarrow T$ illustrates that, if $X_i$ is the target T's Spouse and $Y_i$ is their mutual child, there exists a subset $S \subseteq R \setminus \{X_i, T, Y_i\}$ such that $X_i$ and T are independent given S but dependent given $S \cup \{Y_i\}$. If another feature M exists that blocks the path between $X_i$ and $Y_i$, then M has a direct effect on $Y_i$, the V-structure $X_i \rightarrow Y_i \leftarrow T$ no longer holds for $X_i$, and in this condition $X_i$ is removed from $CSP_T\{Y_i\}$ and is considered a false-positive Spouse. □

4.2. The Proposed OFSVMB Algorithm and Analysis

This section explains the proposed Online Feature Selection via Markov Blanket algorithm, given in Algorithm 1. The proposed algorithm derives $MB_T$ by deleting all redundant features. The OFSVMB algorithm is based on the null-conditional independence test, the relevance analysis, and the redundancy analysis. First, the null-conditional independence test [26] (line 5) of Algorithm 1 is used to handle the streaming features. It analyzes the new feature $X_i$ that arrives at time $t_i$. If $X_i$ is dependent on the target T given the empty set $\varnothing$, then the feature $X_i$ is included in $CPC_T$; otherwise, $X_i$ is added to non_pc_T. After the null conditional independence analysis, OFSVMB checks whether the feature $X_i$ is a true positive PC (lines 6–9) given a subset S. If it is not a true positive, it is removed from $CPC_T$ (line 10); otherwise, OFSVMB analyzes whether it is a candidate Spouse in $CSP_T\{X_i\}$ (lines 11–19).
Algorithm 1 OFSVMB.
OFSVMB then checks for non-MB successors $X_i$ in $CPC_T$ that may have several pathways to the target variable T. If $X_i$ and T are independent, $X_i$ is removed from $CPC_T$ (line 24). Algorithm 1 finds the Parents-Child (lines 5–10) and computes the Spouses both from the non_pc_T set discarded during the null-conditional independence step (line 5) and from the false-positive PC features (lines 6–10).
Moreover, after the conditional independence test at line 6, if the feature $X_i$ is not dependent on the target feature T, the algorithm proceeds to line 29 to identify Spouses from the non_pc_T set. If $X_i \not\perp T \mid sep_T\{X_i\} \cup N$, $X_i$ is included in $CSP_T\{N\}$ (lines 31–34). Through the redundancy analysis (lines 35–45), the algorithm checks whether the selected Spouse is redundant; if so, the false positive/redundant feature is discarded from $CSP_T\{N\}$.
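The full pseudocode of Algorithm 1 appears as a figure in the article; to make the control flow described above concrete, the following Python sketch reproduces the main streaming loop under stated assumptions. It is a simplified illustration, not the authors' Matlab implementation: the callable ci_test, the fixed target label, and the bound max_cond on conditioning-set size are hypothetical placeholders, and the Spouse handling is condensed relative to lines 11–45 of Algorithm 1.

```python
from itertools import combinations

def ofsvmb_stream(feature_stream, ci_test, target="T", alpha=0.01, max_cond=3):
    """Simplified sketch of the OFSVMB flow: null-CI gate, online relevance
    analysis, online redundancy analysis, and Spouse discovery.

    feature_stream : iterable of feature names arriving one by one.
    ci_test(x, t, cond) : user-supplied test returning a p-value for X ⊥ T | cond.
    """
    cpc, non_pc, csp = [], [], {}          # CPC_T, non_pc_T, CSP_T keyed by child

    def independent(x, cond):
        return ci_test(x, target, list(cond)) > alpha

    for x in feature_stream:
        # Step 2 (Table 2): null conditional independence test (empty conditioning set).
        if independent(x, []):
            non_pc.append(x)
        else:
            cpc.append(x)
            # Step 3: redundancy analysis inside CPC_T over subsets up to max_cond.
            for y in list(cpc):
                rest = [f for f in cpc if f != y]
                sizes = range(1, min(max_cond, len(rest)) + 1)
                if any(independent(y, s) for k in sizes for s in combinations(rest, k)):
                    cpc.remove(y)          # false-positive PC
                    non_pc.append(y)
        # Step 4: Spouse discovery from the non-PC pool per retained child,
        # followed by removal of redundant Spouses (Proposition 3, condensed).
        for child in cpc:
            cands = csp.setdefault(child, [])
            for z in non_pc:
                if z not in cands and not independent(z, [child]):
                    cands.append(z)        # dependent given the common child
            for z in list(cands):
                if independent(z, cpc):    # independent given the current PC set
                    cands.remove(z)
    spouses = sorted({z for zs in csp.values() for z in zs})
    return cpc, spouses                    # MB_T = PC_T ∪ SP_T
```

In practice, ci_test would be the $G^2$ test for discrete data or Fisher's z-test for continuous data from Section 4.3, and the exhaustive subset search above would be replaced by the separating-set bookkeeping of the full algorithm.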

4.3. Statistical Conditional Independence Terminology in OFSVMB

The conditional independence test is used to identify irrelevant and redundant features [15], denoted by the notations $Ind(T, X_i \mid S)$ and $Dep(T, X_i \mid S)$ in Algorithm 1. The $G^2$ test, equivalent to the $\chi^2$ test (for discrete data), and Fisher's z-test (for continuous data) are used.

4.3.1. Statistical G 2 Test for Discrete Data

For three features (variables) $X_i$, $X_j$, and $X_k$, let $V_{ijk}^{xyz}$ be the number of records in the dataset satisfying $X_i = x$, $X_j = y$, and $X_k = z$; $V_{ij}^{xy}$, $V_{jk}^{yz}$, and $V_{k}^{z}$ are defined analogously. To test whether $X_i$ and $X_j$ are conditionally independent given $X_k$, the $G^2$ statistic in Equation (1) below is used:
$G^2 = 2 \sum_{x,y,z} V_{ijk}^{xyz} \ln \dfrac{V_{ijk}^{xyz}\, V_{k}^{z}}{V_{ik}^{xz}\, V_{jk}^{yz}}$   (1)
With sufficient degrees of freedom, $G^2$ is asymptotically distributed as $\chi^2$. In general, when testing the conditional independence of $X_i$ and $X_j$ given S, the number of degrees of freedom $df$ used in the test is calculated as:
$df = (c_i - 1)(c_j - 1) \prod_{X_k \in S} c_k$   (2)
where, in Equation (2), $c_i$ represents the number of distinct values of $X_i$.
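As a concrete illustration of Equations (1) and (2), the following Python sketch computes the $G^2$ statistic and its degrees of freedom from raw counts; this NumPy/SciPy-based helper is an illustrative assumption, not the authors' implementation.

```python
import numpy as np
from collections import Counter
from scipy.stats import chi2

def g2_test(data, i, j, cond):
    """G^2 conditional independence test for discrete data.

    data : 2-D integer array (rows = samples, columns = variables).
    i, j : column indices of X_i and X_j.
    cond : list of column indices forming the conditioning set S.
    Returns (G2 statistic, degrees of freedom, p-value).
    """
    keys = [tuple(row) for row in data[:, [i, j] + list(cond)]]
    n_xyz = Counter(keys)                          # V_ijk^{xyz}
    n_xz = Counter((k[0],) + k[2:] for k in keys)  # V_ik^{xz}
    n_yz = Counter((k[1],) + k[2:] for k in keys)  # V_jk^{yz}
    n_z = Counter(k[2:] for k in keys)             # V_k^{z}

    g2 = 0.0
    for key, v_xyz in n_xyz.items():               # Equation (1)
        x, y, z = key[0], key[1], key[2:]
        g2 += 2.0 * v_xyz * np.log(v_xyz * n_z[z] / (n_xz[(x,) + z] * n_yz[(y,) + z]))

    levels = lambda c: len(np.unique(data[:, c]))
    df = (levels(i) - 1) * (levels(j) - 1) * int(np.prod([levels(c) for c in cond]))
    return g2, df, chi2.sf(g2, df)                 # Equation (2) and the chi^2 tail
```

Comparing the returned p-value against the significance level α (0.01 or 0.05) then yields the $Ind$/$Dep$ decision used by OFSVMB.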

4.3.2. Statistical Fisher’s z-Test for Continuous Data

Fisher's z-test, on the other hand, calculates the degree of correlation between features, as given in Equation (3). Given the feature subset S, the partial correlation coefficient $r_{X_i, T \mid S}$ between the feature $X_i$ and the target variable T under a Gaussian distribution $N(\mu, \Sigma)$ is expressed as [14]:
$r_{X_i, T \mid S} = \dfrac{r_{X_i, T} - r_{X_i, S}\, r_{S, T}}{\sqrt{(1 - r_{X_i, S}^2)(1 - r_{T, S}^2)}}$   (3)
Under the null hypothesis of conditional independence between the feature $X_i$ and the target variable T given the current feature subset S, $r_{X_i, T \mid S} = 0$ according to Fisher's z-test. Assume that $\alpha \in \{0.01, 0.05\}$ is a given significance level and ρ is the p-value obtained by Fisher's z-test.
If ρ > α, then $X_i$ and the target variable T are not related to each other given the subset S, according to the null hypothesis of conditional independence of $X_i$ and T. If ρ ≤ α, then $X_i$ and the target variable T are relevant to each other.
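The sketch below illustrates the same test in Python under stated assumptions: it obtains the partial correlation $r_{X_i, T \mid S}$ by inverting the sample correlation matrix (an equivalent route to the recursion in Equation (3)) and converts it to a p-value through the Fisher z transform. The helper is illustrative only and is not the authors' Matlab code.

```python
import numpy as np
from scipy.stats import norm

def fisher_z_test(data, i, t, cond):
    """Fisher's z conditional independence test for continuous data.

    data : 2-D float array (rows = samples, columns = variables).
    i, t : column indices of the feature X_i and the target T.
    cond : list of column indices forming the conditioning set S.
    Returns (partial correlation, p-value of H0: r_{X_i,T|S} = 0).
    """
    n = data.shape[0]
    corr = np.corrcoef(data[:, [i, t] + list(cond)], rowvar=False)
    prec = np.linalg.pinv(corr)                          # precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])   # partial correlation
    r = float(np.clip(r, -0.999999, 0.999999))           # guard the log below
    z = 0.5 * np.log((1.0 + r) / (1.0 - r))              # Fisher z transform
    stat = np.sqrt(n - len(cond) - 3) * abs(z)
    return r, 2.0 * (1.0 - norm.cdf(stat))               # two-sided p-value
```

If the returned p-value ρ exceeds α, the feature $X_i$ and the target T are declared conditionally independent given S, exactly as in the decision rule above.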

4.4. Correctness of OFSVMB

Under the faithfulness assumption, OFSVMB produces the MB of the target feature T correctly. According to Theorem 1, OFSVMB first finds the true positive (relevant) features using the null-conditional independence test and adds them to the candidate Parents-Child set $CPC_T$ of the target feature T, while the false positive (redundant) features are added to the non-Parents-Child set (line 5). Then, within the candidate Parents-Child set $CPC_T$, OFSVMB searches for false positive (redundant) features and discards them from $CPC_T$ (lines 9–10). The resulting $CPC_T$ contains all relevant features that are dependent on the target feature T given any subset $S \subseteq R$.
Additionally, OFSVMB finds the Spouses of the target T (lines 13–19) at the same time as it looks for redundant features in the $CPC_T$ set. Non-MB successors in $CPC_T$ are discarded by conditioning on the true positive Spouses in $CSP_T\{X_i\}$ combined with the true positive PC in $CPC_T$, i.e., the joint set of $CPC_T$ and $CSP_T\{X_i\}$ (lines 20–27). If the feature $X_i$ (line 6) is independent of the target feature T, OFSVMB searches for Spouses in the non_pc_T set (lines 29–34) and simultaneously removes redundant features from $CSP_T$ through the redundancy analysis (lines 35–38). Finally, when no features are left, OFSVMB returns the Markov blanket (MB) of the target feature T (line 48), which is the union of the true positive $PC_T = CPC_T$ and the true positive $SP_T = CSP_T$.

4.5. Time Complexity Analysis

The time complexity of the state-of-the-art MB discovery algorithms, as presented in Table 3, depends on how many CI tests each algorithm uses. OFSVMB identifies the MB of the target feature T through online relevance and redundancy analysis. It is assumed that R represents the total set of features that have appeared up to time t, $R'$ is the set of features in R that are relevant to the target feature T, and $R''$ is the set of the remaining features in R that are not relevant to the target T. $CPC_T$ represents the candidate Parents-Child set, and $CSP_T\{X_i\}$ represents the subset of the Spouses of the target T with respect to T's child $X_i$. When the feature $X_i$ appears at time t, the OFSVMB time complexity is as follows. The null-conditional independence test takes $O(1)$. The Parents-Child identification takes $O(|R'|\,|CPC_T|^{k})$ tests, and the discovery of candidate Spouses takes $O(|R''|\,|CSP_T\{X_i\}|^{k}\,|CPC_T|)$ tests, where k is the maximum size of a conditioning set, which may grow. Because OFSVMB discards redundant features in the streaming scenario, both $CPC_T$ and $CSP_T$ remain small. The approximate time complexity of the proposed OFSVMB algorithm is $O(|R| \cdot 2^{|C|})$, where C equals $CPC_T \cup CSP_T$. OFSVMB is somewhat more efficient than the state-of-the-art because it handles features in a real-time scenario.
The accuracies of the proposed OFSVMB and EEMB are comparable. However, OFSVMB performs streaming feature selection, where features arrive one by one, so jointly removing the redundant features from $CPC_T$ and $CSP_T$ takes only a few CI tests. STMB uses a backward strategy during PC learning and identifies the PC set by separating it from all other subsets of R at each iteration, which makes STMB slower than the proposed OFSVMB algorithm, as presented in Table 3.

5. Results and Discussion

In this section, the results of the proposed OFSVMB are discussed in detail. Extensive experiments were conducted, comparing OFSVMB with traditional MB discovery algorithms, namely Incremental Association-Based Markov Blanket (IAMB), Simultaneous Markov Blanket (STMB), HITON-MB, Balanced Markov Blanket (BAMB), and Efficient and Effective Markov Blanket (EEMB), and with streaming-based algorithms, namely Alpha-investing (α-investing), Scalable and Accurate Online Feature Selection (SAOLA), and Online Streaming Feature Selection (OSFS).

5.1. Datasets and Experiment Setup

The experimental results are computed on 9 benchmark BN datasets (https://pages.mtu.edu/lebrown/supplements/mmhc_paper/mmhc_index.html, accessed on 1 July 2021) with small and large sample sizes, as shown in Table 4, and on 14 real-world feature selection datasets, as shown in Table 5. The real-world datasets are selected from different domains: some come from the UCI machine learning repository [27]; wdbc is a frequently studied public microarray dataset [28]; ionosphere, colon, arcene, leukemia, and madelon are from the NIPS 2003 feature selection competition [29]; lung and medical belong to the biomedical domain [30]; lymphoma, reged1, and marti1 are from [31,32]; and prostate-GE and sido0 are from [33,34].
The OFSVMB algorithm is implemented in Matlab R2017b. All the experimental work is conducted on Windows 10 with an Intel Core i5-6500U CPU and 8 GB RAM. Two conditional independence (CI) tests, the $G^2$ test (for discrete data) and Fisher's z-test (for continuous data), with significance levels of 0.01 and 0.05, are used.

5.2. Evaluation Metrics

The performance of the proposed OFSVMB algorithm is evaluated using three evaluation metrics on the benchmark BN datasets. The evaluation metrics are as follows:
  • Precision: the number of true positives in an algorithm's output (i.e., returned features that belong to the true MB of a target feature in the DAG) divided by the total number of features in the algorithm's output.
  • Recall: the number of true positives in the algorithm's output divided by the total number of features in the true MB of the target feature in the DAG.
  • F1 = 2 × Precision × Recall / (Precision + Recall): the harmonic mean of precision and recall. In the best case, F1 = 1 when precision and recall are both perfect; in the worst case, F1 = 0.
Each benchmark BN is evaluated with two sample sizes, i.e., 500 and 5000.
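As a small worked illustration of these three metrics for MB discovery (the feature names below are hypothetical), the following Python snippet scores a recovered MB against the true MB of a target node:

```python
def mb_scores(found_mb, true_mb):
    """Precision, recall, and F1 of a discovered Markov blanket."""
    found, true = set(found_mb), set(true_mb)
    tp = len(found & true)                                  # true positives
    precision = tp / len(found) if found else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Hypothetical example: the algorithm returns {X1, X3, X7}; the true MB is {X1, X2, X3}.
print(mb_scores(["X1", "X3", "X7"], ["X1", "X2", "X3"]))    # ≈ (0.67, 0.67, 0.67)
```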

Results and Discussion on Benchmark BN

The efficiency and efficacy of the proposed OFSVMB algorithm are computed and compared with other state-of-the-art traditional MB discovery algorithms, namely IAMB, STMB, HITON-MB, BAMB, and EEMB, on 9 benchmark BN datasets. The F1, precision, recall, and running time in seconds of OFSVMB and the other algorithms are shown in Table 6 for the 500-sample datasets and in Table 7 for the 5000-sample datasets. The sign "/" separates the significance levels, e.g., the left side corresponds to α = 0.01 and the right side to α = 0.05, where α is the significance level. The F1, precision, and recall for sample sizes 500 and 5000 are shown in Figure 2 and Figure 3, respectively, where (a) corresponds to a significance level of 0.01 and (b) to a significance level of 0.05. Moreover, Figure 4 shows the running time of OFSVMB and the five other state-of-the-art algorithms with different sample sizes. In these figures, the x-axis denotes the dataset number (see Table 4), and the y-axis represents F1, precision, and recall, or the running time, respectively.
According to Table 6, OFSVMB is the most accurate and fastest algorithm among the five state-of-the-art algorithms because it is based on streaming features, which are computed in an online manner. On the Child, Child3, and Insurance datasets with a sample size of 500 and significance levels of 0.01 and 0.05, OFSVMB is more accurate in terms of F1, precision, and recall than IAMB, STMB, HITON-MB, BAMB, and EEMB, and faster than STMB, HITON-MB, BAMB, and EEMB. STMB and BAMB do not perform the symmetry check, but they must conduct an exhaustive subset analysis for PC learning, making these two algorithms relatively slow and inefficient. OFSVMB is slower than IAMB on the Child and Child3 datasets because, in each iteration, IAMB uses the entire set of currently selected features as the conditioning set to decide whether to add or remove a feature. IAMB is computationally efficient on datasets with small sample sizes, as presented in bold in Table 6 and shown in Figure 2a,b and Figure 4. On the Child10 dataset, OFSVMB is faster than the other state-of-the-art algorithms, and IAMB is the second fastest, as presented in bold in Table 6 and shown in Figure 2a,b, with the running time in Figure 4. Moreover, on the Child10 dataset, HITON-MB's accuracy is comparable with that of OFSVMB in terms of precision at a significance level of 0.05, as shown in bold in Table 6.
On the Alarm10 dataset with a sample size of 500, EEMB is more accurate in terms of F1 and recall than OFSVMB, IAMB, STMB, HITON-MB, and BAMB. At the same time, OFSVMB is more accurate than EEMB in terms of precision and runs faster as well, as presented in bold in Table 6 and shown in Figure 2a,b and Figure 4. On extensive datasets such as Pig and Gene with a sample size of 500 and significance levels of 0.01 and 0.05, OFSVMB is more accurate in terms of F1, precision, and recall than its rivals, as shown in Figure 2a,b, and faster than the state-of-the-art, as presented in bold in Table 6 and in Figure 4. In a dense dataset such as Barley, HITON-MB is more accurate because it locates the target's Spouses by finding the PC of each feature in the target's discovered PC set. This method drastically reduces the number of data samples required and increases MB discovery performance, especially when dealing with high-dimensional and small data samples; however, when the PC sets of the features within the target's PC set are large, this type of technique is computationally costly. HITON-MB's accuracy is higher in terms of F1 and precision than that of IAMB, STMB, BAMB, EEMB, and OFSVMB, but at a significance level of 0.01, OFSVMB has comparable accuracy with HITON-MB in terms of F1. Moreover, in terms of recall, EEMB has higher accuracy than IAMB, STMB, HITON-MB, BAMB, and OFSVMB, while OFSVMB has a lower running time than the others, as shown in bold in Table 6 and in Figure 2a,b and Figure 4. On Mildew, OFSVMB shows higher accuracy than the other algorithms; however, HITON-MB is more accurate than OFSVMB and the others, such as IAMB, STMB, BAMB, and EEMB, at a significance level of 0.05 in terms of precision, whereas at a significance level of 0.01, OFSVMB is more accurate than HITON-MB, as shown in Table 6. On the Mildew dataset, OFSVMB runs faster than the state-of-the-art, as presented in bold in Table 6 and shown in Figure 2a,b and Figure 4.
On the small Child and Child3 datasets with a sample size of 5000 and significance levels of 0.01 and 0.05, OFSVMB is more accurate in terms of F1, precision, and recall than the others, as presented in bold in Table 7 and shown in Figure 3a,b. However, on the Child dataset, OFSVMB runs faster than HITON-MB, STMB, BAMB, and EEMB but slower than IAMB, while on the Child3 dataset, IAMB runs faster than OFSVMB and the other four algorithms, as shown in Table 7, with the running time shown in Figure 4. On the Child10 dataset with a sample size of 5000, HITON-MB outperforms OFSVMB, IAMB, STMB, BAMB, and EEMB in terms of F1, precision, and recall, while OFSVMB runs faster than the other five algorithms, as shown in bold in Table 7. Figure 3a,b shows the accuracy of OFSVMB and the other algorithms, and the running time is given in Figure 4. On the Alarm10 and Insurance datasets with a sample size of 5000, OFSVMB is more accurate than IAMB, STMB, HITON-MB, BAMB, and EEMB in terms of F1, precision, and recall, as shown in bold in Table 7 and in Figure 3a,b; meanwhile, OFSVMB shows a better running time than the others, as shown in Figure 4. On a large dataset with a sample size of 5000, such as Pig, BAMB and EEMB are more accurate in terms of F1 and precision at a significance level of 0.01 and in terms of recall at significance levels of 0.01 and 0.05, as shown in bold in Table 7 and in Figure 3a,b; OFSVMB is more accurate in terms of F1 and precision at a significance level of 0.05 and runs faster than its rivals, as shown in Figure 4. Moreover, on the extensive Gene dataset, OFSVMB is still more accurate and runs faster than the other state-of-the-art algorithms, as shown in bold in Table 7 and in Figure 3a,b, with the running time shown in Figure 4.
In a dense dataset with a sample size of 5000, such as Barley, EEMB is more accurate in terms of F1 and recall at significance levels of 0.01 and 0.05, as shown in bold in Table 7 and in Figure 3a,b. Still, OFSVMB is more accurate than EEMB in terms of precision at significance levels of 0.01 and 0.05 and runs faster than the others, as shown in Figure 4. On the Mildew dataset, OFSVMB is more accurate than IAMB, STMB, BAMB, and EEMB; HITON-MB is more accurate than OFSVMB in terms of precision at a significance level of 0.05, with comparable precision at a significance level of 0.01, as shown in bold in Table 7 and in Figure 3a,b, while OFSVMB runs faster than the other algorithms, as shown in Figure 4.

5.3. Evaluation Classifiers

The number of selected features and the prediction accuracy of OFSVMB on 14 real-world datasets of low to high dimensionality are evaluated using 12 classifiers: C1 = Coarse Gaussian SVM, C2 = Coarse KNN, C3 = Coarse Tree, C4 = Cosine KNN, C5 = Fine Gaussian SVM, C6 = Fine Tree, C7 = Linear Discriminant, C8 = Linear SVM, C9 = Medium KNN, C10 = Medium Tree, C11 = Subspace Discriminant, and C12 = Subspace KNN, where C stands for classifier and, to save space, the numbered abbreviations are used in Table 8, Table 9, Table 10 and Table 11. For all the datasets, 10-fold cross-validation is used to prevent bias in error estimation. Moreover, the number of selected features and the running time in seconds are also reported to show the efficiency of the algorithms.
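The published experiments were run in Matlab; as a hedged illustration of the same evaluation protocol, the scikit-learn sketch below estimates the 10-fold cross-validated accuracy of two of the listed classifier families on a selected feature subset. The toy data, the selected indices, and the classifier settings are placeholders, not the paper's configuration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def prediction_accuracy(X, y, selected, classifier):
    """Mean 10-fold cross-validated accuracy restricted to the selected columns."""
    return cross_val_score(classifier, X[:, selected], y, cv=10).mean()

# Hypothetical usage with features chosen by a streaming selector such as OFSVMB.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = (X[:, 3] + X[:, 7] > 0).astype(int)            # toy binary target
selected = [3, 7, 12]                               # placeholder selected indices
for clf in (SVC(kernel="linear"), KNeighborsClassifier(n_neighbors=10)):
    print(type(clf).__name__, prediction_accuracy(X, y, selected, clf))
```

The mean prediction accuracy reported in Table 10 and Table 11 corresponds to averaging such scores over all 12 classifiers.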

Results and Discussion on Real-World Dataset

This segment compares the selected features, prediction accuracy, and efficiency of the OFSVMB algorithm with other state-of-the-art streaming-based algorithms, namely Alpha-investing, SAOLA, and OSFS, on 14 real-world datasets at a significance level of 0.01. Table 8, Table 9, Table 10, Table 11 and Table 12 report the prediction accuracy, the mean prediction accuracy over the 12 classifiers, the number of selected features, and the running time in seconds.
Figure 5 shows the prediction accuracy of the algorithms under the 12 classifiers, where the x-axis denotes the dataset and the y-axis the prediction accuracy, while Figure 6 shows the running time in seconds of the algorithms. The sign "−" denotes that the algorithm fails to select any features on the corresponding dataset and takes too long. The better outcomes are highlighted in bold in Table 8, Table 9, Table 10, Table 11 and Table 12.
Table 8 and Table 9 report the prediction accuracy of OFSVMB against OSFS, Alpha-investing, and SAOLA using the 12 classifiers. As described in the tables, OFSVMB performs better than the other algorithms on most datasets, and its mean prediction accuracy over the 12 classifiers is higher than that of the others, as shown in Table 10 and Table 11. OFSVMB searches for both the PC and the Spouses of the target feature T, whereas OSFS, Alpha-investing, and SAOLA only search for the PC and ignore the Spouse set of the target feature T, which causes them to lose interpretability.
On the arcene dataset, SAOLA fails to obtain any features, as shown in Table 8 and Table 9. OSFS and Alpha-investing have accuracy comparable to OFSVMB on very few datasets. OFSVMB may include some false positives, but it is not worse than its rivals on the real-world feature selection datasets. As Figure 5 shows, OFSVMB has better prediction accuracy than the other algorithms on many datasets under the 12 classifiers. Table 12 reports the number of selected features and the running time of the algorithms. Alpha-investing selects more features on many datasets because it does not re-evaluate redundant features, making it ineffective but time-efficient. OSFS selects fewer features than Alpha-investing, SAOLA, and OFSVMB; when a new feature arrives, OSFS first checks the relevancy and then the redundancy of the arrived feature, which improves its accuracy over Alpha-investing and SAOLA while selecting fewer features.
The running time of OSFS is longer than that of Alpha-investing and SAOLA on many datasets because it considers features repeatedly to add or remove them from the feature set. OFSVMB selects many features on many datasets, which causes its running time to be longer than that of the other algorithms, as shown in Figure 6, since OFSVMB searches for both the PC and the Spouses of the target feature T, while OSFS, Alpha-investing, and SAOLA only search for the PC set of the target feature T.

5.4. Sensitivity Analysis

The OFSVMB algorithm is governed by α (Alpha, α, and significance level are used interchangeably) and | R I | . We conducted a sensitivity analysis to investigate the effect of the parameter values on the model's accuracy. This section provides the details of the sensitivity experiments. The analysis is performed on six real-world feature selection datasets with respect to different parameters, including the α values and different Rates of Instance | R I | , where the Rate of Instance | R I | is the fraction of instances sampled from the dataset. We choose three values of α, namely 0.1, 0.01, and 0.05, against three different values of | R I | , namely 0.1, 0.8, and 0.9.
The parameter α determines how strictly the statistical conditional independence decisions are made, and α = 0.1, 0.01, 0.05 is used to observe its effect on the prediction accuracy of OFSVMB. In contrast, varying | R I | = 0.1, 0.8, 0.9 checks whether the important features can be kept along with the instances and how this affects the prediction accuracy of OFSVMB. In Figure 7, the x-axis represents the Rate of Instance | R I | , the y-axis represents the parameter α, and the z-axis represents the prediction accuracy (%) on six real-world datasets: (a) spect, (b) sylva, (c) madelon, (d) marti1, (e) ionosphere, and (f) reged1. These six datasets contain small and large sample sizes and sparse data.
From the results given in Figure 7c, we observe that when α is fixed and | R I | varies, the prediction accuracy of the OFSVMB algorithm increases for | R I | = 0.8 and 0.9. For some datasets, shown in Figure 7b,f, even | R I | = 0.1 keeps the important features along with the instances and increases the prediction accuracy. We observe that the most suitable values of the parameter α are 0.01 and 0.05, and under these values, with different | R I | , the prediction accuracy of OFSVMB is higher. Still, the parameter α = 0.1 with different | R I | also contributes near-optimal prediction accuracy on the six given datasets. Thus, such empirical values can be adopted in practice for future methods based on the statistical conditional independence test. These observations indicate that OFSVMB is accurate in selecting the optimal features under different α and | R I | values.

6. Conclusions

This paper proposes the Online Feature Selection via Markov Blanket (OFSVMB) algorithm, which uses the $G^2$ and Fisher's z conditional independence tests to find the MB from streaming features. Once a feature is included in the PC and SP sets using the online relevance analysis, it is examined and checked as a true or false positive using the online redundancy analysis. OFSVMB tries to keep the candidate feature sets of both PC and SP as small as possible, reducing the number of conditional independence tests. The proposed OFSVMB jointly identifies the Parents-Child and Spouses and separates them while streaming. The evaluation metrics F1, precision, and recall are used for evaluating the proposed algorithm on the benchmark BN datasets, and 12 classifiers are used on the real-world datasets; OFSVMB also obtains the MB set with the highest accuracy. The results demonstrate that OFSVMB is better than the traditional MB discovery algorithms IAMB, STMB, HITON-MB, BAMB, and EEMB on most benchmark BN datasets, such as Child, Child3, Insurance, Pig, and Gene with a sample size of 500 and Child3, Alarm10, Insurance, Gene, and Mildew with a sample size of 5000, at significance levels of 0.01 and 0.05. OFSVMB also achieves better mean prediction accuracy on the real-world datasets than the other streaming-based algorithms OSFS, Alpha-investing, and SAOLA using 12 classifiers, including Fine Tree, Medium Tree, Coarse Tree, Linear Discriminant, Linear SVM, Fine Gaussian SVM, Coarse Gaussian SVM, Medium KNN, Coarse KNN, Cosine KNN, Subspace Discriminant, and Subspace KNN. Furthermore, the feature search strategy of OFSVMB, which includes both the PC and the Spouses, makes it somewhat more time-consuming than OSFS, Alpha-investing, and SAOLA, because these algorithms only search for the PC feature set and ignore the Spouses; OFSVMB is based on MB discovery, so it considers both the PC and the Spouses of the target feature T. In addition, the sensitivity analysis using the two parameters α and Rate of Instance | R I | also shows that OFSVMB performs better.
On a large and dense network with many features, statistical hypothesis-based tests for conditional independence cause performance inconsistency in OFSVMB and reduce its accuracy. Using the V-structure causes the OFSVMB running time to be longer than that of OSFS, Alpha-investing, and SAOLA; although Alpha-investing and SAOLA select many features, they are still faster than our proposed algorithm.
In future work, we plan to overcome the limitations of the proposed algorithm and extend it to address the direct causes and effects for local causal discovery of the target variable T over streaming features, using mutual information or neighborhood mutual information in combination with conditional independence tests and a structure other than the V-structure. This will help to compute more true positive PC features and Spouses. In addition, the focus will be on improving accuracy and on consistently examining the impact of causal faithfulness violations in streaming feature selection.

Author Contributions

Conceptualization, W.K., L.K. and B.B; methodology, software, formal analysis, validation, data curation, writing—original draft preparation, W.K. and L.K.; investigation, resources, supervision, project administration, funding acquisition, L.K.; writing—review and editing, visualization, W.K., B.B., L.W. and H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MB | Markov Blanket
BN | Bayesian Network
PC | Parents-Child
SP | Spouse
SF | Streaming Feature

References

  1. Wu, D.; He, Y.; Luo, X.; Zhou, M. A Latent Factor Analysis-Based Approach to Online Sparse Streaming Feature Selection. IEEE Trans. Syst. Man Cybern. Syst. 2021. [Google Scholar] [CrossRef]
  2. DeLatte, D.; Crites, S.T.; Guttenberg, N.; Yairi, T. Automated crater detection algorithms from a machine learning perspective in the convolutional neural network era. Adv. Space Res. 2019, 64, 1615–1628. [Google Scholar] [CrossRef]
  3. Tsamardinos, I.; Aliferis, C. Towards Principled Feature Selection: Relevancy, Filters and Wrappers. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, Key West, FL, USA, 3–6 January 2003. [Google Scholar]
  4. Aliferis, C.; Tsamardinos, I.; Statnikov, A. HITON: A Novel Markov Blanket Algorithm for Optimal Variable Selection. In Annual Symposium Proceedings. AMIA Symposium; AMIA: Bethesda, MD, USA, 2003; pp. 21–25. [Google Scholar]
  5. Gao, T.; Ji, Q. Efficient Markov Blanket Discovery and Its Application. IEEE Trans. Cybern. 2017, 47, 1169–1179. [Google Scholar] [CrossRef] [PubMed]
  6. Ling, Z.; Yu, K.; Wang, H.; Liu, L.; Ding, W.; Wu, X. BAMB: A Balanced Markov Blanket Discovery Approach to Feature Selection. ACM Trans. Intell. Syst. Technol. 2019, 10, 52:1–52:25. [Google Scholar] [CrossRef]
  7. Wang, H.; Ling, Z.; Yu, K.; Wu, X. Towards efficient and effective discovery of Markov blankets for feature selection. Inf. Sci. 2020, 509, 227–242. [Google Scholar] [CrossRef]
  8. Alnuaimi, N.; Masud, M.; Serhani, M.A.; Zaki, N. Streaming feature selection algorithms for big data: A survey. Appl. Comput. Inform. 2020. [Google Scholar] [CrossRef]
  9. Pan, W.; Chen, L.; Ming, Z. Personalized recommendation with implicit feedback via learning pairwise preferences over item-sets. Knowl. Inf. Syst. 2019, 58, 295–318. [Google Scholar] [CrossRef]
  10. Yang, S.; Wang, H.; Hu, X. Efficient Local Causal Discovery Based on Markov Blanket. arXiv 2019, arXiv:abs/1910.01288. [Google Scholar]
  11. Sowmya, R.; Suneetha, K. Data mining with big data. In Proceedings of the 2017 11th International Conference on Intelligent Systems and Control (ISCO), Coimbatore, India, 5–6 January 2017; pp. 246–250. [Google Scholar]
  12. Boulesnane, A.; Meshoul, S. Effective Streaming Evolutionary Feature Selection Using Dynamic Optimization. In IFIP International Conference on Computational Intelligence and Its Applications; Springer: Berlin, Germany, 2018; pp. 329–340. [Google Scholar]
  13. Zhou, J.; Foster, D.; Stine, R.; Ungar, L. Streaming feature selection using alpha-investing. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in data Mining, Chicago, IL USA, 21–24 August 2005; pp. 384–393. [Google Scholar]
  14. Yu, K.; Wu, X.; Ding, W.; Pei, J. Scalable and accurate online feature selection for big data. ACM Trans. Knowl. Discov. Data (TKDD) 2016, 11, 1–39. [Google Scholar] [CrossRef]
  15. Wu, X.; Yu, K.; Ding, W.; Wang, H.; Zhu, X. Online feature selection with streaming features. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 1178–1192. [Google Scholar]
  16. Yu, K.; Guo, X.; Liu, L.; Li, J.; Wang, H.; Ling, Z.; Wu, X. Causality-based feature selection: Methods and evaluations. ACM Comput. Surv. (CSUR) 2020, 53, 1–36. [Google Scholar] [CrossRef]
  17. Wu, X.; Jiang, B.; Yu, K.; Chen, H.; Miao, C. Multi-label causal feature selection. Proc. Aaai Conf. Artif. Intell. 2020, 34, 6430–6437. [Google Scholar] [CrossRef]
  18. Liu, C.; Yang, S.; Yu, K. Markov Boundary Learning With Streaming Data for Supervised Classification. IEEE Access 2020, 8, 102222–102234. [Google Scholar] [CrossRef]
  19. Ling, Z.; Yu, K.; Wang, H.; Li, L.; Wu, X. Using feature selection for local causal structure learning. IEEE Trans. Emerg. Top. Comput. Intell. 2020, 5, 530–540. [Google Scholar] [CrossRef]
  20. Zhou, P.; Wang, N.; Zhao, S. Online group streaming feature selection considering feature interaction. Knowl.-Based Syst. 2021, 226, 107157. [Google Scholar] [CrossRef]
  21. You, D.; Wang, Y.; Xiao, J.; Lin, Y.; Pan, M.; Chen, Z.; Shen, L.; Wu, X. Online Multi-label Streaming Feature Selection with Label Correlation. IEEE Trans. Knowl. Data Eng. 2021. [Google Scholar] [CrossRef]
  22. Li, L.; Lin, Y.; Zhao, H.; Chen, J.; Li, S. Causality-based online streaming feature selection. In Concurrency and Computation: Practice and Experience; Wiley Online Library: Hoboken, NJ, USA, 2021; p. e6347. [Google Scholar]
  23. Wang, H.; You, D. Online Streaming Feature Selection via Multi-Conditional Independence and Mutual Information Entropy†. Int. J. Comput. Intell. Syst. 2020, 13, 479–487. [Google Scholar] [CrossRef]
  24. You, D.; Wu, X.; Shen, L.; He, Y.; Yuan, X.; Chen, Z.; Deng, S.; Ma, C. Online Streaming Feature Selection via Conditional Independence. Appl. Sci. 2018, 8, 2548. [Google Scholar] [CrossRef] [Green Version]
  25. Spirtes, P.; Glymour, C.; Scheines, R. Causation, prediction, and search. In Causation, Prediction, and Search; Springer: New York, NY, USA, 1993; pp. 238–258. [Google Scholar]
  26. You, D.; Li, R.; Sun, M.; Ou, X.; Liang, S.; Yuan, F. Online Markov Blanket Discovery With Streaming Features. In Proceedings of the 2020 IEEE International Conference on Knowledge Graph (ICKG), Nanjing, China, 9–11 August 2020; pp. 92–99. [Google Scholar]
  27. Singh, A.; Kumar, R. Heart Disease Prediction Using Machine Learning Algorithms. In Proceedings of the 2020 International Conference on Electrical and Electronics Engineering (ICE3), Gorakhpur, India, 14–15 February 2020; pp. 452–457. [Google Scholar]
  28. Shen, Z.; Chen, X.; Garibaldi, J. Performance Optimization of a Fuzzy Entropy Based Feature Selection and Classification Framework. In Proceedings of the 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Miyazaki, Japan, 7–10 October 2021; pp. 1361–1367. [Google Scholar]
  29. Wu, D.; Luo, X.; Shang, M.; He, Y.; Wang, G.; Wu, X. A data-characteristic-aware latent factor model for web services QoS prediction. IEEE Trans. Knowl. Data Eng. 2020. [Google Scholar] [CrossRef]
  30. Ucar, M.K.; Nour, M.; Sindi, H.F.; Polat, K. The Effect of Training and Testing Process on Machine Learning in Biomedical Datasets. Math. Probl. Eng. 2020, 2020, 2836236. [Google Scholar] [CrossRef]
  31. Hu, W.; Fey, M.; Zitnik, M.; Dong, Y.; Ren, H.; Liu, B.; Catasta, M.; Leskovec, J. Open Graph Benchmark: Datasets for Machine Learning on Graphs. arXiv 2020, arXiv:abs/2005.00687. [Google Scholar]
  32. He, Y.; Wu, B.; Wu, D.; Beyazit, E.; Chen, S.; Wu, X. Toward Mining Capricious Data Streams: A Generative Approach. IEEE Trans. Neural Networks Learn. Syst. 2021, 32, 1228–1240. [Google Scholar] [CrossRef] [PubMed]
  33. Ge, C.; Zhang, L.; Liao, B. Abstract 5327: KPG-121, a novel CRBN modulator, potently inhibits growth of metastatic castration resistant prostate cancer as a single agent or in combination with androgen receptor signaling inhibitors both in vitro and in vivo. Cancer Res. 2020, 80, 5327. [Google Scholar]
  34. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A Multimodal Dataset for Autonomous Driving. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11618–11628. [Google Scholar]
Figure 1. The Markov blanket of node T.
Figure 2. F1, Precision, Recall using 9 benchmark BN datasets: (a) significance level = 0.01; (b) significance level = 0.05, with sample size of 500.
Figure 3. Precision, Recall, F1 using 9 benchmark BN datasets: (a) significance level = 0.01; (b) significance level = 0.05, with sample size of 5000.
Figure 4. Running time on 9 benchmark BN datasets, sample sizes of 500 and 5000.
Figure 5. Prediction accuracy in the 14 real-world datasets under 12 classifiers.
Figure 6. Running time in the 14 real-world datasets.
Figure 7. Prediction accuracy of OFSVMB with respect to different α and | R I | on 6 real-world datasets: (a) spect, (b) sylva, (c) madelon, (d) marti1, (e) ionosphere, (f) reged1.
Table 1. Summary of notation.

Notation | Mathematical Meaning
R | A feature set under streaming features
S | A conditional feature set within R
T | The target attribute
G | A DAG (Directed Acyclic Graph) over R
P | A JPD (Joint Probability Distribution) over R
f | A false positive
X, Y, Z, N, E or $X_i$, $Y_i$, $Z_i$ | A feature/variable/node belonging to R
$t_i$ | The time point at which the feature $X_i$ arrives
$PC_T$ | The set of Parents-Children of T
$CPC_T$ | A candidate feature set of $PC_T$
non_pc_T | The non-Parents-Child set of the target T
$X_i \perp Y_i$ | Given S, $X_i$ and $Y_i$ are independent
$X_i \not\perp Y_i$ | Given S, $X_i$ and $Y_i$ are dependent
$SP_T$ | The set of Spouses of T
$SP_T\{X\}$ | T's Spouses in relation to T's child X
$CSP_T\{X\}$ | A candidate feature set of the Spouses
$Des_T$ | The descendants of the target T
$Pa_T$ | The parents of the target T
α | Significance level (0.01 or 0.05)
ρ | p-value returned by the statistical $G^2$ test or Fisher's z-test
$MB_T$ | The Markov blanket (MB) of node T
Table 2. The OFSVMB framework.
1. Initialization: the target T; candidate Parents-Child set CPC_T = ∅; non-Parents-Child feature set non_pc_T = ∅; candidate Spouse set CSP_T = ∅; MB = ∅.
2. Check the relevance between the feature X_i arriving at time t_i and the target T through a null (unconditional) independence test; if Dep(X_i, T | ∅), add X_i to CPC_T; if not, discard the non-relevant feature through Proposition 1 and add it to non_pc_T, then enter step 3.
3. After the null-conditional test, check for relevant and redundant features, e.g., Dep(X_i, T | S), through Proposition 2. If X_i is relevant, add it to CPC_T and jointly search for spouses, adding them to CSP_T; otherwise, add X_i to non_pc_T.
  • Search CPC_T for non-MB successors through Theorem 2.
  • Discard a feature from CPC_T if it is not dependent on the target T.
4. After step 3, search non_pc_T (the features found redundant in steps 2 and 3) for spouses, and simultaneously remove false-positive spouses from the spouse set through Proposition 3.
5. Repeat steps 2–4 until no features remain.
6. Output: the selected MB of target T.
(A simplified code sketch of this loop is given below.)
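The framework in Table 2 can be read as a single streaming loop over arriving features. The sketch below is an illustrative rendering of that loop in Python, not the authors' code: it does not reproduce Propositions 1–3 or Theorem 2 exactly, the helper ci_test stands in for the G²/Fisher's z test of Table 1, and the conditioning-set cap max_cond is our simplifying assumption.

```python
from itertools import combinations

def ofsvmb_sketch(stream, target, ci_test, max_cond=3):
    """Simplified, illustrative rendering of the OFSVMB loop of Table 2.

    stream   : iterable of feature identifiers arriving one at a time
    target   : the target variable T
    ci_test  : ci_test(x, y, S) -> (independent, p_value), e.g. a G^2 or
               Fisher's z test evaluated on the data seen so far
    max_cond : cap on the conditioning-set size (our assumption, not the paper's)
    """
    cpc = []      # candidate Parents-Children, CPC_T
    csp = {}      # candidate spouses per child, CSP_T{X}
    non_pc = []   # discarded (non-PC) features, non_pc_T

    def redundant(x, within):
        # x is redundant if some small subset of `within` separates it from T.
        others = [f for f in within if f != x]
        for k in range(1, min(max_cond, len(others)) + 1):
            for S in combinations(others, k):
                if ci_test(x, target, list(S))[0]:
                    return True
        return False

    for x in stream:
        # Step 2: unconditional relevance check, Dep(X_i, T | {}).
        if ci_test(x, target, [])[0]:      # independent -> not relevant
            non_pc.append(x)
            continue
        # Step 3: online redundancy check inside the current CPC_T.
        cpc.append(x)
        for y in list(cpc):
            if redundant(y, cpc):
                cpc.remove(y)
                csp.pop(y, None)           # its spouse entries go with it
                non_pc.append(y)
        # Step 4: recover spouses from non_pc_T via the collider pattern
        # T -> child <- z (z independent of T marginally, dependent given child).
        for child in cpc:
            for z in non_pc:
                if ci_test(z, target, [])[0] and not ci_test(z, target, [child])[0]:
                    csp.setdefault(child, set()).add(z)
        # Prune false-positive spouses that are independent of T given CPC_T.
        for child in list(csp):
            csp[child] = {z for z in csp[child]
                          if not ci_test(z, target, cpc)[0]}

    spouses = set().union(*csp.values()) if csp else set()
    return set(cpc) | spouses
```

In this reading, the collider check (a discarded feature that is marginally independent of T but becomes dependent given one of T's children) is what recovers spouses from the non_pc_T set without re-scanning all previously seen features.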
Table 3. Time complexity of Markov blanket (MB) discovery algorithms.
Algorithm: Time complexity
IAMB: O(|R|^2)
STMB: O(|R| · 2^|R|)
HITON-MB: O(2^|CPC_T| · |R| · |CPC_T|)
BAMB: O(|R| · 2^|CPC_T|)
EEMB: O(|R| · 2^|CPC_T|)
OFSVMB: O(|R| · 2^C)
Table 4. Outline of the benchmark BN datasets.
Dataset | #Features | #Edges | Max In/Out Degree | Min/Max PC Set Size
Child | 20 | 25 | 2/7 | 1/8
Child3 | 60 | 79 | 3/7 | 1/8
Child10 | 200 | 126 | 2/7 | 1/9
Alarm10 | 370 | 570 | 4/7 | 1/9
Insurance | 27 | 52 | 3/7 | 1/9
Pig | 441 | 592 | 2/39 | 1/41
Gene | 801 | 972 | 4/10 | 0/11
Barley | 48 | 84 | 4/5 | 1/8
Mildew | 35 | 46 | 3/3 | 1/5
Table 5. Outline of the real-world datasets.
Dataset | #Features | #Instances
arcene | 10,001 | 100
spect | 22 | 267
wdbc | 30 | 569
ionosphere | 35 | 135
colon | 2000 | 62
lung | 3312 | 203
prostate-GE | 5967 | 102
reged1 | 999 | 500
madelon | 500 | 2600
marti1 | 1024 | 500
medical | 978 | 1434
lymphoma | 6240 | 27
leukemia | 7270 | 71
sido0 | 4932 | 12,678
Table 6. F1, Precision, Recall, and Running time (s) on small and large-size benchmark BN datasets with sample size = 500.
Dataset | Algorithm | F1 | Precision | Recall | Time (s)
Child | IAMB | 0.78/0.67 | 0.86/0.67 | 0.77/0.77 | 0.13
Child | STMB | 0.73/0.70 | 0.91/0.85 | 0.67/0.67 | 0.20
Child | HITON-MB | 0.82/0.81 | 0.94/0.86 | 0.77/0.77 | 0.24
Child | BAMB | 0.86/0.81 | 0.93/0.82 | 0.81/0.84 | 0.18
Child | EEMB | 0.86/0.81 | 0.92/0.81 | 0.82/0.83 | 0.15
Child | OFSVMB | 0.89/0.88 | 0.97/0.94 | 0.83/0.85 | 0.14
Child3 | IAMB | 0.65/0.70 | 0.70/0.70 | 0.61/0.71 | 0.42
Child3 | STMB | 0.65/0.58 | 0.82/0.70 | 0.62/0.60 | 0.86
Child3 | HITON-MB | 0.76/0.71 | 0.85/0.83 | 0.69/0.71 | 1.27
Child3 | BAMB | 0.77/0.78 | 0.89/0.86 | 0.70/0.73 | 0.74
Child3 | EEMB | 0.57/0.57 | 0.67/0.67 | 0.50/0.50 | 0.74
Child3 | OFSVMB | 0.89/0.87 | 0.95/0.93 | 0.83/0.85 | 0.48
Child10 | IAMB | 0.62/0.62 | 0.60/0.61 | 0.65/0.64 | 2.11
Child10 | STMB | 0.58/0.50 | 0.70/0.56 | 0.60/0.55 | 2.25
Child10 | HITON-MB | 0.65/0.67 | 0.70/0.75 | 0.62/0.71 | 3.19
Child10 | BAMB | 0.70/0.73 | 0.70/0.74 | 0.71/0.73 | 2.32
Child10 | EEMB | 0.61/0.60 | 0.65/0.69 | 0.64/0.61 | 2.23
Child10 | OFSVMB | 0.75/0.80 | 0.75/0.75 | 0.75/0.87 | 1.32
Alarm10 | IAMB | 0.60/0.62 | 0.75/0.75 | 0.51/0.53 | 0.31
Alarm10 | STMB | 0.54/0.56 | 0.75/0.76 | 0.47/0.49 | 0.30
Alarm10 | HITON-MB | 0.71/0.72 | 0.90/0.84 | 0.62/0.65 | 3.08
Alarm10 | BAMB | 0.71/0.70 | 0.78/0.81 | 0.66/0.63 | 0.28
Alarm10 | EEMB | 0.80/0.79 | 0.90/0.85 | 0.75/0.87 | 0.35
Alarm10 | OFSVMB | 0.65/0.66 | 0.91/0.88 | 0.56/0.58 | 0.25
Insurance | IAMB | 0.55/0.56 | 0.79/0.85 | 0.43/0.42 | 0.17
Insurance | STMB | 0.46/0.46 | 0.81/0.82 | 0.34/0.35 | 0.24
Insurance | HITON-MB | 0.56/0.57 | 0.89/0.89 | 0.44/0.45 | 0.33
Insurance | BAMB | 0.55/0.56 | 0.73/0.71 | 0.43/0.45 | 0.43
Insurance | EEMB | 0.62/0.67 | 0.75/0.76 | 0.58/0.59 | 0.56
Insurance | OFSVMB | 0.78/0.79 | 0.90/0.90 | 0.70/0.71 | 0.15
Pig | IAMB | 0.74/0.75 | 0.78/0.80 | 0.72/0.72 | 11.07
Pig | STMB | 0.79/0.74 | 0.91/0.86 | 0.74/0.72 | 11.94
Pig | HITON-MB | 0.91/0.76 | 0.88/0.64 | 0.90/0.90 | 14.78
Pig | BAMB | 0.86/0.84 | 0.88/0.84 | 0.86/0.86 | 27.3
Pig | EEMB | 0.92/0.80 | 0.93/0.72 | 0.90/0.89 | 18.30
Pig | OFSVMB | 0.93/0.91 | 0.96/0.91 | 0.92/0.92 | 9.25
Gene | IAMB | 0.63/0.68 | 0.60/0.62 | 0.68/0.68 | 20.09
Gene | STMB | 0.56/0.18 | 0.44/0.12 | 0.73/0.73 | 30.24
Gene | HITON-MB | 0.80/0.66 | 0.79/0.56 | 0.90/0.92 | 23.34
Gene | BAMB | 0.82/0.68 | 0.79/0.59 | 0.90/0.91 | 23.18
Gene | EEMB | 0.82/0.67 | 0.78/0.57 | 0.90/0.92 | 28.50
Gene | OFSVMB | 0.87/0.89 | 0.85/0.86 | 0.91/0.93 | 14.88
Barley | IAMB | 0.34/0.35 | 0.72/0.75 | 0.24/0.23 | 0.18
Barley | STMB | 0.12/0.11 | 0.30/0.24 | 0.08/0.08 | 0.20
Barley | HITON-MB | 0.34/0.39 | 0.74/0.77 | 0.24/0.27 | 0.19
Barley | BAMB | 0.28/0.31 | 0.30/0.35 | 0.28/0.28 | 0.21
Barley | EEMB | 0.33/0.35 | 0.40/0.27 | 0.50/0.50 | 0.21
Barley | OFSVMB | 0.34/0.34 | 0.73/0.73 | 0.24/0.25 | 0.17
Mildew | IAMB | 0.30/0.32 | 0.79/0.75 | 0.19/0.21 | 0.11
Mildew | STMB | 0.13/0.09 | 0.29/0.19 | 0.09/0.06 | 0.14
Mildew | HITON-MB | 0.34/0.34 | 0.81/0.83 | 0.21/0.22 | 0.20
Mildew | BAMB | 0.33/0.34 | 0.81/0.81 | 0.21/0.22 | 0.16
Mildew | EEMB | 0.29/0.25 | 0.50/0.33 | 0.20/0.20 | 0.17
Mildew | OFSVMB | 0.35/0.35 | 0.83/0.82 | 0.23/0.23 | 0.9
Table 7. F1, Precision, Recall, and Running time (s) on small and large-size benchmark BN datasets with sample size = 5000.
Dataset | Algorithm | F1 | Precision | Recall | Time (s)
Child | IAMB | 0.75/0.67 | 0.83/0.64 | 0.70/0.71 | 11.05
Child | STMB | 0.82/0.73 | 0.85/0.73 | 0.80/0.77 | 69.21
Child | HITON-MB | 0.89/0.83 | 0.94/0.84 | 0.87/0.87 | 41.33
Child | BAMB | 0.93/0.86 | 0.96/0.83 | 0.91/0.91 | 23.25
Child | EEMB | 0.93/0.88 | 0.95/0.82 | 0.93/0.95 | 17.15
Child | OFSVMB | 0.95/0.90 | 0.97/0.89 | 0.98/0.99 | 11.13
Child3 | IAMB | 0.79/0.71 | 0.75/0.71 | 0.79/0.72 | 5.29
Child3 | STMB | 0.87/0.75 | 0.90/0.79 | 0.88/0.87 | 9.13
Child3 | HITON-MB | 0.89/0.83 | 0.94/0.84 | 0.87/0.87 | 38.54
Child3 | BAMB | 0.60/0.59 | 0.61/0.65 | 0.63/0.64 | 14.2
Child3 | EEMB | 0.70/0.70 | 0.67/0.67 | 0.74/0.74 | 12.06
Child3 | OFSVMB | 0.95/0.91 | 0.97/0.89 | 0.95/0.95 | 8.50
Child10 | IAMB | 0.64/0.66 | 0.64/0.67 | 0.65/0.65 | 12.80
Child10 | STMB | 0.52/0.53 | 0.40/0.42 | 0.49/0.52 | 18.90
Child10 | HITON-MB | 0.93/0.92 | 0.95/0.96 | 0.93/0.93 | 21.30
Child10 | BAMB | 0.64/0.63 | 0.71/0.68 | 0.59/0.59 | 15
Child10 | EEMB | 0.61/0.62 | 0.65/0.65 | 0.64/0.60 | 13.80
Child10 | OFSVMB | 0.63/0.69 | 0.62/0.75 | 0.65/0.65 | 12.60
Alarm10 | IAMB | 0.76/0.75 | 0.78/0.75 | 0.75/0.75 | 8.95
Alarm10 | STMB | 0.77/0.80 | 0.81/0.90 | 0.71/0.75 | 47.33
Alarm10 | HITON-MB | 0.87/0.86 | 0.88/0.85 | 0.85/0.87 | 9.85
Alarm10 | BAMB | 0.75/0.79 | 0.83/0.85 | 0.70/0.74 | 11.28
Alarm10 | EEMB | 0.79/0.78 | 0.89/0.85 | 0.77/0.77 | 9.02
Alarm10 | OFSVMB | 0.91/0.89 | 0.97/0.91 | 0.87/0.88 | 8.86
Insurance | IAMB | 0.60/0.61 | 0.61/0.61 | 0.60/0.61 | 8.05
Insurance | STMB | 0.65/0.65 | 0.92/0.90 | 0.54/0.56 | 45.62
Insurance | HITON-MB | 0.73/0.74 | 0.93/0.93 | 0.63/0.64 | 12.64
Insurance | BAMB | 0.62/0.63 | 0.87/0.88 | 0.50/0.52 | 25.46
Insurance | EEMB | 0.79/0.38 | 0.89/0.27 | 0.75/0.70 | 11.86
Insurance | OFSVMB | 0.84/0.83 | 0.94/0.94 | 0.77/0.75 | 7.90
Pig | IAMB | 0.63/0.61 | 0.62/0.62 | 0.65/0.70 | 15.22
Pig | STMB | 0.38/0.15 | 0.28/0.38 | 0.32/0.38 | 15.94
Pig | HITON-MB | 0.64/0.72 | 0.75/0.82 | 0.76/0.85 | 17.78
Pig | BAMB | 0.96/0.84 | 0.93/0.74 | 1.00/1.00 | 22.2
Pig | EEMB | 0.96/0.87 | 0.94/0.79 | 1.00/1.00 | 18.27
Pig | OFSVMB | 0.90/0.90 | 0.90/0.91 | 0.88/0.89 | 15.07
Gene | IAMB | 0.60/0.60 | 0.53/0.53 | 0.89/0.89 | 24.01
Gene | STMB | 0.30/0.11 | 0.20/0.20 | 0.75/0.78 | 35.24
Gene | HITON-MB | 0.83/0.68 | 0.77/0.57 | 0.85/0.85 | 30.34
Gene | BAMB | 0.82/0.69 | 0.76/0.59 | 0.94/0.94 | 26.03
Gene | EEMB | 0.82/0.71 | 0.77/0.61 | 0.94/0.94 | 24.50
Gene | OFSVMB | 0.93/0.91 | 0.87/0.88 | 0.95/0.95 | 23.09
Barley | IAMB | 0.35/0.36 | 0.63/0.65 | 0.25/0.25 | 9.70
Barley | STMB | 0.35/0.35 | 0.70/0.69 | 0.27/0.28 | 10.15
Barley | HITON-MB | 0.54/0.54 | 0.75/0.75 | 0.42/0.42 | 9.89
Barley | BAMB | 0.26/0.24 | 0.23/0.21 | 0.29/0.28 | 17.21
Barley | EEMB | 0.67/0.63 | 0.50/0.22 | 1.00/1.00 | 15.63
Barley | OFSVMB | 0.48/0.49 | 0.78/0.77 | 0.34/0.35 | 9.19
Mildew | IAMB | 0.48/0.49 | 0.60/0.63 | 0.41/0.41 | 9.87
Mildew | STMB | 0.21/0.20 | 0.39/0.38 | 0.15/0.14 | 11.45
Mildew | HITON-MB | 0.42/0.40 | 0.77/0.76 | 0.41/0.41 | 12.09
Mildew | BAMB | 0.16/0.17 | 0.41/0.41 | 0.11/0.11 | 14.72
Mildew | EEMB | 0.29/0.27 | 0.19/0.18 | 0.60/0.60 | 9.89
Mildew | OFSVMB | 0.53/0.53 | 0.77/0.75 | 0.42/0.42 | 9.66
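Tables 6 and 7 score the Markov blanket learned for each target against the true MB read off the benchmark network, so the precision, recall, and F1 values are set-based. A minimal sketch of that computation is shown below; it assumes the standard set-based definitions and is not the authors' evaluation script.

```python
def mb_scores(found, true_mb):
    """Set-based precision, recall, and F1 of a learned Markov blanket."""
    found, true_mb = set(found), set(true_mb)
    tp = len(found & true_mb)                       # correctly recovered MB members
    precision = tp / len(found) if found else 0.0
    recall = tp / len(true_mb) if true_mb else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: true MB of T is {A, B, C, D}; the algorithm returned {A, B, E}.
print(mb_scores({"A", "B", "E"}, {"A", "B", "C", "D"}))  # (0.667, 0.5, 0.571)
```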
Table 8. Comparison of prediction accuracy under C1–C6 classifiers at significance level = 0.01.
Prediction Accuracy (%)
Dataset | Algorithm | C-1 | C-2 | C-3 | C-4 | C-5 | C-6
arcene | OSFS | 80 | 56 | 76 | 82 | 71 | 74
arcene | Alpha-investing | 57 | 55 | 70 | 70 | 62 | 65
arcene | SAOLA | - | - | - | - | - | -
arcene | OFSVMB | 80 | 65 | 76 | 82 | 71 | 74
spect | OSFS | 70 | 70 | 70 | 46.4 | 96 | 70
spect | Alpha-investing | 70 | 65.5 | 69.3 | 69.3 | 96 | 69.7
spect | SAOLA | 70 | 70 | 70 | 46.4 | 92 | 70
spect | OFSVMB | 70 | 82.6 | 72.7 | 91.3 | 97 | 71.9
wdbc | OSFS | 96.3 | 93.3 | 92.1 | 96 | 87 | 93.1
wdbc | Alpha-investing | 90 | 94 | 93 | 96.3 | 88 | 93.1
wdbc | SAOLA | 91 | 91 | 91.6 | 90.5 | 88 | 90.5
wdbc | OFSVMB | 97.3 | 95.8 | 95.5 | 96.1 | 94 | 93.3
ionosphere | OSFS | 88.6 | 87.2 | 90 | 90.6 | 88 | 89.2
ionosphere | Alpha-investing | 88 | 65.8 | 90.3 | 86.9 | 86 | 91.2
ionosphere | SAOLA | 88.9 | 73.2 | 90.3 | 88.3 | 88 | 89.7
ionosphere | OFSVMB | 94 | 68.7 | 92 | 96.9 | 89 | 91.5
colon | OSFS | 80.6 | 64.5 | 83.9 | 80.6 | 88 | 82.3
colon | Alpha-investing | 66.1 | 64.5 | 77.4 | 74.2 | 79 | 71
colon | SAOLA | 87.1 | 64.5 | 80.6 | 90.3 | 85 | 82.3
colon | OFSVMB | 88.7 | 80 | 88.7 | 98.3 | 88 | 88.7
lung | OSFS | 83.7 | 68.5 | 77.3 | 87.2 | 82 | 81.8
lung | Alpha-investing | 78.3 | 68.5 | 81.8 | 93.6 | 79 | 82.8
lung | SAOLA | 84.2 | 68.5 | 80.3 | 89.2 | - | 84.2
lung | OFSVMB | 83.7 | 75.5 | 75.4 | 97.7 | 85 | 82.8
prostate-GE | OSFS | 94.1 | 51 | 92.2 | 94.1 | 99 | 92.2
prostate-GE | Alpha-investing | 94.1 | 51 | 90.2 | 93.1 | 88 | 90.2
prostate-GE | SAOLA | 96.1 | 51 | 86.3 | 96.1 | 95 | 86.3
prostate-GE | OFSVMB | 94.1 | 80 | 93.1 | 98.1 | 99 | 91.2
reged1 | OSFS | 98.2 | 88.4 | 97.4 | 98.4 | 90 | 96.8
reged1 | Alpha-investing | 88.2 | 88.2 | 88 | 88.2 | 94 | 86.4
reged1 | SAOLA | 99.4 | 88.2 | 96.8 | 98 | 93 | 96.6
reged1 | OFSVMB | 99.2 | 82.2 | 94.6 | 98.4 | 91 | 95.8
madelon | OSFS | 62 | 61.9 | 61.7 | 56 | 94 | 93
madelon | Alpha-investing | 49.8 | 50.5 | 49.4 | 49.2 | 96 | 50
madelon | SAOLA | 62.4 | 61.7 | 60.9 | 57.7 | 96 | 58.3
madelon | OFSVMB | 62.4 | 91 | 61.5 | 98 | 96 | 95.5
lymphoma | OSFS | 98.4 | 67.7 | 95.2 | 98.4 | 83 | 95.2
lymphoma | Alpha-investing | 93.5 | 67.7 | 96.8 | 93.5 | 88 | 96.8
lymphoma | SAOLA | 98.4 | 67.7 | 87.1 | 85.5 | 80 | 87.1
lymphoma | OFSVMB | 100 | 92.9 | 96.5 | 96.6 | 70 | 97
marti1 | OSFS | 88.2 | 88.2 | 87.8 | 88.2 | 87 | 79.2
marti1 | Alpha-investing | 88.2 | 88.2 | 86.4 | 88.8 | 66 | 82.4
marti1 | SAOLA | 88.2 | 88.2 | 87.8 | 88.2 | 81 | 79.2
marti1 | OFSVMB | 95.2 | 95.2 | 89.6 | 96.8 | 89 | 79.4
medical | OSFS | 99.4 | 95.6 | 99.4 | 99.4 | 62 | 99.4
medical | Alpha-investing | 99.1 | 95.6 | 99.2 | 98.7 | 49 | 99.4
medical | SAOLA | 98.7 | 95.6 | 99.4 | 98.2 | 61 | 99.4
medical | OFSVMB | 99.4 | 97.6 | 99.4 | 99 | 62 | 99.7
leukemia | OSFS | 83.7 | 68.5 | 78.3 | 88.2 | 92 | 82.8
leukemia | Alpha-investing | 78.3 | 68.5 | 77.8 | 94.1 | 50 | 81.8
leukemia | SAOLA | 84.2 | 68.5 | 79.3 | 89.7 | 81 | 85.2
leukemia | OFSVMB | 90.8 | 80 | 81.8 | 88.2 | 94 | 85.9
sido0 | OSFS | 83.7 | 68.5 | 74.3 | 88.2 | 88 | 86.9
sido0 | Alpha-investing | 78.3 | 68.5 | 66.8 | 94.1 | 78 | 81.8
sido0 | SAOLA | 80.2 | 68.5 | 72.3 | 89.7 | 85 | 85.2
sido0 | OFSVMB | 81.8 | 79.5 | 79.8 | 98.2 | 90 | 89.3
Table 9. Comparison of prediction accuracy under C7–C12 classifiers at significance level = 0.01.
Prediction Accuracy (%)
Dataset | Algorithm | C-7 | C-8 | C-9 | C-10 | C-11 | C-12
arcene | OSFS | 84 | 85 | 73 | 74 | 85 | 74
arcene | Alpha-investing | 76 | 74 | 67 | 65 | 72 | 73
arcene | SAOLA | - | - | - | - | - | -
arcene | OFSVMB | 84 | 85 | 75 | 74 | 85 | 74
spect | OSFS | 70 | 70 | 95 | 70 | 69.7 | 89.6
spect | Alpha-investing | 70.8 | 67.8 | 96 | 69.7 | 70 | 48.3
spect | SAOLA | 70 | 70 | 91 | 70 | 70 | 70
spect | OFSVMB | 71.2 | 80 | 96 | 71.9 | 91 | 89.6
wdbc | OSFS | 95.6 | 96.5 | 95 | 93.1 | 95.1 | 93
wdbc | Alpha-investing | 95.1 | 97.4 | 66 | 93.1 | 95.3 | 93.1
wdbc | SAOLA | 92.1 | 91.4 | 74 | 91.2 | 92.6 | 75.2
wdbc | OFSVMB | 95.6 | 96.8 | 87 | 93.3 | 95.4 | 94.9
ionosphere | OSFS | 87.2 | 87.7 | 77 | 89.2 | 86.3 | 88.9
ionosphere | Alpha-investing | 86.6 | 87.7 | 85 | 71 | 85.8 | 93.2
ionosphere | SAOLA | 87.2 | 87.7 | 88 | 90 | 86.6 | 90.3
ionosphere | OFSVMB | 85.5 | 87.9 | 89 | 91.5 | 89.7 | 93.3
colon | OSFS | 83.9 | 83.9 | 83 | 82.3 | 80.6 | 77.4
colon | Alpha-investing | 83.9 | 82.3 | 65 | 82.8 | 80.6 | 77.5
colon | SAOLA | 87.1 | 85.5 | 65 | 82.3 | 87.1 | 79
colon | OFSVMB | 88.7 | 88.7 | 88 | 88.7 | 88.7 | 78.2
lung | OSFS | 89.2 | 92.5 | 56 | 81.8 | 88.2 | 88.2
lung | Alpha-investing | 97 | 89.7 | 56 | 90.2 | 97 | 93.1
lung | SAOLA | 94.6 | 91 | - | 84.2 | 93.1 | 89.7
lung | OFSVMB | 91.4 | 94.1 | 82 | 82.8 | 91.6 | 90.6
prostate-GE | OSFS | 95.1 | 95.1 | 99 | 92.2 | 94.1 | 81.4
prostate-GE | Alpha-investing | 95.1 | 96.1 | 88 | 86.8 | 95.1 | 91.2
prostate-GE | SAOLA | 95.1 | 95.1 | 98 | 86.3 | 60.8 | 95.1
prostate-GE | OFSVMB | 94.1 | 95.1 | 99 | 91.2 | 94.1 | 94.2
reged1 | OSFS | 98.6 | 98.9 | 67 | 96.8 | 98.8 | 98.8
reged1 | Alpha-investing | 88.2 | 88.2 | 68 | 50.7 | 88.2 | 77.2
reged1 | SAOLA | 99 | 99.2 | 68 | 96.6 | 99 | 99.2
reged1 | OFSVMB | 97.8 | 98.8 | 98 | 98.5 | 99.2 | 98.2
madelon | OSFS | 61.3 | 61.6 | 51 | 61.5 | 61.2 | 55.4
madelon | Alpha-investing | 50 | 50.4 | 51 | 96.8 | 50 | 49.3
madelon | SAOLA | 61.8 | 62.1 | 51 | 60.6 | 61.8 | 57.2
madelon | OFSVMB | 61.5 | 75.2 | 95 | 75 | 61.4 | 57.8
lymphoma | OSFS | 98.4 | 96.8 | 83 | 95.2 | 96.8 | 98.4
lymphoma | Alpha-investing | 98.4 | 96.8 | 72 | 83.8 | 96.8 | 93.5
lymphoma | SAOLA | 98.4 | 98.4 | 81 | 87.1 | 100 | 98.4
lymphoma | OFSVMB | 98.4 | 100 | 79 | 97.5 | 100 | 98.7
marti1 | OSFS | 88.2 | 88.2 | 87 | 84.6 | 88.2 | 77.2
marti1 | Alpha-investing | 94 | 88.2 | 50 | 99.4 | 90.4 | 85
marti1 | SAOLA | 88.2 | 88.2 | 75 | 84.6 | 88.2 | 77.2
marti1 | OFSVMB | 87.6 | 88.2 | 88 | 92 | 94.5 | 85.2
medical | OSFS | - | 99.4 | 62 | 99.4 | 99.4 | 96
medical | Alpha-investing | - | 99.5 | 50 | 81.8 | 98.6 | 95.7
medical | SAOLA | - | 99.4 | 62 | 99.4 | 98.1 | 95.8
medical | OFSVMB | 91.3 | 99.4 | 62 | 99.4 | 98.1 | 95.6
leukemia | OSFS | 89.2 | 91.6 | 92 | 82.8 | 91.1 | 89.7
leukemia | Alpha-investing | 81.2 | 95.1 | 63 | 81.8 | 96.6 | 92.6
leukemia | SAOLA | 85.5 | 93.1 | 80 | 85.2 | 85.2 | 91.6
leukemia | OFSVMB | 89.2 | 94.6 | 91 | 84 | 91.1 | 88.7
sido0 | OSFS | 89.2 | 91.6 | 90 | 82.8 | 91.1 | 89.7
sido0 | Alpha-investing | 85.5 | 95.1 | 88 | 82 | 82 | 87.6
sido0 | SAOLA | 87.1 | 93.1 | 87 | 85.2 | 85.2 | 90.6
sido0 | OFSVMB | 89.2 | 95.5 | 89 | 87 | 90.6 | 92.6
Table 10. Comparison of mean prediction accuracy under C1–C6 classifiers at significance level = 0.01.
Prediction Accuracy (%)
Dataset | Algorithm | C-1 | C-2 | C-3 | C-4 | C-5 | C-6
Mean | OSFS | 86.2 | 73.5 | 82.5 | 85.7 | 86.4 | 86.8
Mean | Alpha-investing | 80 | 70.8 | 76.2 | 85 | 78.5 | 81.5
Mean | SAOLA | - | - | - | - | - | -
Mean | OFSVMB | 88.4 | 83.2 | 85.4 | 95.1 | 86.7 | 88.2
Table 11. Comparison of mean prediction accuracy under C7–C12 classifiers at significance level = 0.01.
Prediction Accuracy (%)
Dataset | Algorithm | C-7 | C-8 | C-9 | C-10 | C-11 | C-12
Mean | OSFS | - | 88.5 | 79.3 | 85.2 | 87.6 | 85.5
Mean | Alpha-investing | - | 85.8 | 69.6 | 81.1 | 85.6 | 82.2
Mean | SAOLA | - | - | - | - | - | -
Mean | OFSVMB | 87.5 | 91.3 | 87 | 87.6 | 90.7 | 87.9
Table 12. #F (number of selected features) and running time (s) at significance level = 0.01.
Dataset | OSFS #F / Time (s) | Alpha-investing #F / Time (s) | SAOLA #F / Time (s) | OFSVMB #F / Time (s)
arcene | 5 / 3.02 | 7 / 0.25 | - / - | 6 / 9.05
spect | 3 / 0.40 | 18 / 2.02 | 5 / 0.30 | 3 / 0.16
wdbc | 3 / 0.27 | 20 / 8.65 | 2 / 0.50 | 9 / 7.01
ionosphere | 5 / 0.2 | 10 / 0.6 | 5 / 0.91 | 2 / 19.02
colon | 2 / 2.93 | 4 / 0.17 | 4 / 1.78 | 13 / 84.2
lung | 8 / 51.82 | 45 / 0.71 | 30 / 1.24 | 10 / 52.88
prostate-GE | 2 / 2.10 | 12 / 0.59 | 12 / 0.60 | 5 / 3.75
reged1 | 11 / 34.28 | 1 / 0.06 | 15 / 0.43 | 10 / 38.05
madelon | 6 / 1.05 | 1 / 0.12 | 7 / 0.08 | 9 / 0.25
lymphoma | 3 / 5.13 | 5 / 0.30 | 35 / 1.99 | 33 / 70.55
marti1 | 3 / 4.99 | 28 / 5.85 | 1 / 0.50 | 2 / 2.23
medical | 17 / 190.52 | 8 / 1.15 | 16 / 5 | 62 / 12.8
leukemia | 5 / 1.5 | 2 / 0.01 | 17 / 1.02 | 13 / 10.55
sido0 | 2 / 12.50 | 39 / 5.83 | 40 / 3 | 9 / 6.80
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
