Article

K-Dependence Bayesian Classifier Ensemble

Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
* Author to whom correspondence should be addressed.
Entropy 2017, 19(12), 651; https://doi.org/10.3390/e19120651
Submission received: 6 September 2017 / Revised: 23 November 2017 / Accepted: 27 November 2017 / Published: 30 November 2017
(This article belongs to the Special Issue Symbolic Entropy Analysis and Its Applications)

Abstract

To maximize the benefit that can be derived from the information implicit in big data, ensemble methods generate multiple models with sufficient diversity through randomization or perturbation. A k-dependence Bayesian classifier (KDB) is a highly scalable learning algorithm with excellent time and space complexity, along with high expressivity. This paper introduces a new ensemble approach of KDBs, a k-dependence forest (KDF), which induces a specific attribute order and conditional dependencies between attributes for each subclassifier. We demonstrate that these subclassifiers are diverse and complementary. Our extensive experimental evaluation on 40 datasets reveals that this ensemble method achieves better classification performance than state-of-the-art out-of-core ensemble learners such as the AODE (averaged one-dependence estimator) and averaged tree-augmented naive Bayes (ATAN).

1. Introduction

Classification is a basic task in data analysis and pattern recognition that requires the learning of a classifier, which assigns labels or categories to instances described by a set of predictive variables or attributes. The induction of classifiers from datasets of preclassified instances is a central problem in machine learning. Given class label $C$ and predictive attributes $\mathbf{X} = \{X_1, \ldots, X_n\}$ (capital letters, such as $X$, $Y$ and $Z$, denote attribute names, and lowercase letters, such as $x$, $y$ and $z$, denote the specific values taken by those attributes; sets of attributes are denoted by boldface capital letters, such as $\mathbf{X}$, $\mathbf{Y}$ and $\mathbf{Z}$, and assignments of values to the attributes in these sets are denoted by boldface lowercase letters, such as $\mathbf{x}$, $\mathbf{y}$ and $\mathbf{z}$), discriminative learning [1,2,3,4] directly models the conditional probability $P(c \mid \mathbf{x})$. Unfortunately, $P(c \mid \mathbf{x})$ cannot be decomposed into a separate term for each attribute, and there is no known closed-form solution for the optimal parameter estimates. Generative learning [5,6,7,8] approximates the joint probability $P(c, \mathbf{x})$ with different factorizations corresponding to Bayesian network classifiers, which are powerful tools for knowledge representation and inference under conditions of uncertainty. Naive Bayes (NB) [9], the simplest Bayesian network classifier, assumes that the attributes are independent given the class label and is surprisingly effective. Following NB, many state-of-the-art algorithms, for example, tree-augmented naive Bayes (TAN) [10] and the k-dependence Bayesian classifier (KDB) [11], have been proposed to relax the independence assumption by allowing conditional dependence between attributes $X_i$ and $X_j$, which is measured by the conditional mutual information $I(X_i;X_j \mid C)$. In order to improve predictive accuracy relative to a single model, ensemble methods [12,13], for example, the averaged one-dependence estimator (AODE) [14] and averaged tree-augmented naive Bayes (ATAN) [15], generate multiple global models from a single learning algorithm through randomization (or perturbation).
An ideal Bayesian network classifier should provide the maximum value of the mutual information $I(C;\mathbf{X})$ for classification; that is, $I(C;\mathbf{X})$ should represent strong mutual dependence between $C$ and $\mathbf{X}$. However,

$$\max I(X_i;X_j \mid C) \nRightarrow \max I(C;X_i,X_j), \quad \forall i,j,\ i \neq j$$

that is, strong conditional dependence between attributes $X_i$ and $X_j$ may not help to improve classification performance. As shown in Figure 1, the proportional distribution of $I(X_i;X_j \mid C)$ differs greatly from that of $I(C;X_i,X_j)$.
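Both measures are simple functionals of empirical frequency counts over a discretized dataset. As a point of reference, the following Python helpers (our own illustration, not part of the paper's implementation) estimate $I(X;C)$ and $I(X;Y \mid C)$ from paired lists of discrete values:

```python
from collections import Counter
from math import log2

def mutual_information(xs, cs):
    """Empirical I(X; C) in bits for paired lists of discrete values."""
    n = len(xs)
    px, pc, pxc = Counter(xs), Counter(cs), Counter(zip(xs, cs))
    return sum((nxc / n) * log2((nxc / n) / ((px[x] / n) * (pc[c] / n)))
               for (x, c), nxc in pxc.items())

def conditional_mutual_information(xs, ys, cs):
    """Empirical I(X; Y | C) = sum over c of P(c) * I(X; Y | C = c)."""
    n = len(cs)
    total = 0.0
    for c, nc in Counter(cs).items():
        pairs = [(x, y) for x, y, cc in zip(xs, ys, cs) if cc == c]
        xs_c, ys_c = zip(*pairs)
        total += (nc / n) * mutual_information(list(xs_c), list(ys_c))
    return total
```

Estimates of this kind underlie the attribute ordering and parent selection discussed throughout the paper.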
The KDB is a restricted Bayesian network classifier with numerous desirable properties in the context of learning from large quantities of data. It achieves a good trade-off between classification performance and structure complexity with a single parameter, k. The KDB uses mutual information $I(C;X_i)$ to predetermine the order of the predictive attributes and conditional mutual information $I(X_i;X_j \mid C)$ to measure the conditional dependence between predictive attributes.
In this paper, we extend the KDB. The contributions of this paper are as follows:
  • We propose a new sorting method to predetermine the order of predictive attributes. This sorting method considers not only the dependencies between predictive attributes and the class variable, but also the dependencies between predictive attributes.
  • We extend the KDB from one single k-dependence tree to a k-dependence forest (KDF). A KDF reflects more dependencies between predictive attributes than the KDB. We show that our algorithm achieves comparable or lower error on University of California at Irvine (UCI) datasets than a range of popular classification learning algorithms.
The rest of this paper is organized as follows. Section 2 introduces some state-of-the-art Bayesian network classifiers. Section 3 explains the basic idea of the KDF and introduces the learning procedure in detail. Section 4 compares experimental results on datasets from the UCI Machine Learning Repository. Section 5 draws conclusions.

2. Bayesian Network Classifiers

A Bayesian network [16], $BN = \langle G, \Theta \rangle$, is a directed acyclic graph $G$ with a conditional probability distribution for each node, collectively represented by $\Theta$, which quantifies how much a node depends on its parents. Nodes and arcs in $G$ represent random variables and the probabilistic dependence between variables, respectively. The full Bayesian network classifier [17] fully reflects the dependencies between predictive attributes and can be regarded as the optimal Bayesian network classifier. The corresponding joint probability is
$$P(c,\mathbf{x}) = P(c)\,P(x_1 \mid c)\prod_{i=2}^{n} P(x_i \mid c, x_1, \ldots, x_{i-1}) \quad (1)$$
From Equation (1), we can see that the true complexity of such an unrestricted model (i.e., one with no independencies) comes from the large number of attribute dependence arcs that are present in the model. As the numbers of attributes and arcs increase, the computational complexity of estimating the joint probability grows exponentially, and exact inference becomes NP-hard [18]. In order to address this issue, researchers have proposed some state-of-the-art classifiers that simplify the network structure [9,19,20,21]. The functional domain of a single classifier may be limited as a result of ignoring the dependencies between some attributes. Classifiers that use a forest or ensemble method are commonly applied to fill the gap [12,14,15]. In the following subsections, we first introduce NB and its corresponding ensemble classifier, AODE. Then, we introduce TAN and its corresponding ensemble classifier, ATAN. Lastly, we introduce the KDB in detail.

2.1. NB and AODE

NB, which is the simplest Bayesian network classifier, supposes that all the predictive attributes are independent of each other given the class variable $C$, transforming Equation (1) into

$$P(c,\mathbf{x}) \approx P(c)\prod_{i=1}^{n} P(x_i \mid c) \quad (2)$$
NB has exhibited a high level of predictive competence compared with other learning algorithms, such as decision trees [22]. However, in real-world learning tasks the attributes are often correlated with each other, so the conditional independence assumption rarely holds and may degrade classification performance. How to relax the conditional independence assumption while retaining NB's efficiency has attracted much attention, and many approaches have already been proposed [11,14,19].
AODE is an ensemble augmentation of NB that utilizes a restricted class of one-dependence estimators (ODEs) and aggregates the predictions of all qualified estimators within this class. In each ODE, a single attribute $X_i$, called the superparent, is selected as the parent of all the other attributes. For each ODE, AODE assumes that the attributes are independent given the class variable and the superparent $X_i$, estimating Equation (1) by

$$P(c,\mathbf{x}) \propto \sum_{i=1}^{n} P(c,x_i)\prod_{j=1, j\neq i}^{n} P(x_j \mid x_i, c) \quad (3)$$
AODE achieves lower classification error than NB, because it involves a weaker attribute independence assumption and the ensemble mechanism. Figure 2 shows graphically the structural differences between NB and AODE.
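To make Equation (3) concrete, the sketch below scores a test instance with AODE. It relies on assumptions the text leaves open: attributes discretized to integers 0..n_vals−1, class labels 0..n_classes−1, and Laplace smoothing with parameter alpha; the function names are ours.

```python
import numpy as np

def fit_joint_counts(X, y, n_vals, n_classes):
    """N[c, i, v, j, w] = number of training instances with class c, X_i = v and X_j = w."""
    n, d = X.shape
    N = np.zeros((n_classes, d, n_vals, d, n_vals))
    for x, c in zip(X, y):
        for i in range(d):
            for j in range(d):
                N[c, i, x[i], j, x[j]] += 1
    return N

def aode_scores(x, N, alpha=1.0):
    """Class scores for one instance x, following Equation (3) with Laplace smoothing."""
    n_classes, d, n_vals = N.shape[0], N.shape[1], N.shape[2]
    m = N[:, 0, :, 0, :].sum()                      # total number of training instances
    scores = np.zeros(n_classes)
    for c in range(n_classes):
        for i in range(d):                          # X_i acts as the superparent
            term = (N[c, i, x[i], i, x[i]] + alpha) / (m + alpha * n_classes * n_vals)
            for j in range(d):
                if j != i:                          # P(x_j | x_i, c)
                    term *= ((N[c, i, x[i], j, x[j]] + alpha)
                             / (N[c, i, x[i], i, x[i]] + alpha * n_vals))
            scores[c] += term
    return scores / scores.sum()                    # normalized posterior estimates
```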

2.2. TAN and ATAN

TAN is a structural augmentation of NB in which every attribute has the class and at most one other attribute as parents. The structure is determined by using an extension of the Chow–Liu tree [23], which utilizes conditional mutual information to find a maximum spanning tree. By learning from the maximum weighted spanning tree (MWST), TAN can represent all significant one-dependence relationships and is commonly regarded as the optimal one-dependence classifier [24]. Rather than obtaining a spanning tree, Ruz and Pham [25] suggest that Kruskal’s algorithm be stopped whenever a Bayesian criterion controlling the likelihood of the data and the complexity of the TAN structure holds.
ATAN is an ensemble augmentation of TAN. Rather than choosing a single root node, it takes each predictive attribute in turn as the root node and builds the corresponding MWST conditioned on that selection. Finally, the posterior probabilities of ATAN are given by the average of the n TAN classifiers' posterior probabilities. Figure 3 shows graphically the structural differences between TAN and ATAN.

2.3. KDB

The KDB allows each attribute to have at most k attribute parents in addition to the class variable. The attribute order is predetermined by comparing the mutual information $I(X_i;C)$ between each predictive attribute and the class variable, starting with the highest. Once $X_i$ enters the model, its parents are selected by choosing the k variables $X_j$ already in the model with the highest values of the conditional mutual information $I(X_i;X_j \mid C)$. We note that the first k variables added to the model will have fewer than k parents. Suppose that the attribute order is $\{X_1, X_2, \ldots, X_n\}$; then $X_i$ has $i-1$ attribute parents when $i \leq k$, and the remaining $n-k$ variables have exactly k parents. Equation (1) then becomes
$$P(c,\mathbf{x}) \approx P(c)\prod_{i=1}^{n} P(x_i \mid c, x_{i_1}, \ldots, x_{i_p}) \quad (4)$$
where $X_{i_1}, \ldots, X_{i_p}$ are the parent attributes of $X_i$ and $p = \min(i-1, k)$. Figure 4 shows graphically an example of a KDB.
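The parent-selection rule just described can be written compactly. The sketch below is our own illustration, not taken from the paper; it assumes that `order` lists the attributes sorted by decreasing $I(C;X_i)$ and that `cmi[(i, j)]` holds precomputed $I(X_i;X_j \mid C)$ values.

```python
def kdb_structure(order, cmi, k=2):
    """Sketch of KDB parent assignment (Section 2.3).

    order : attribute indices sorted by decreasing mutual information I(C; X_i)
    cmi   : dict assumed to hold cmi[(i, j)] = I(X_i; X_j | C) for i != j
    Returns a dict mapping each attribute to its attribute parents; the class C
    is implicitly an additional parent of every attribute.
    """
    parents, in_model = {}, []
    for xi in order:
        # choose up to k parents among the attributes already in the model,
        # taking those with the highest conditional mutual information
        ranked = sorted(in_model, key=lambda xj: cmi[(xi, xj)], reverse=True)
        parents[xi] = ranked[:k]
        in_model.append(xi)
    return parents
```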

3. The k-Dependence Forest Algorithm

The KDB is supplied with both a database of preclassified instances, a DB, and the k value for the maximum allowable degree of attribute dependence. The structure learning procedure of a KDB can be partitioned into two parts: attribute sorting and dependence analysis. During the sorting procedure, the KDB uses mutual information I ( C ; X i ) to predetermine the order of predictive attributes. The KDB ensures that the predictive attributes that are most dependent on the class variable should be considered first and added to the structure. However, mutual information can only measure the dependencies between predictive attributes and the class variable, while it ignores the dependencies between predictive attributes. The sorting process of the KDB only embodies the dependency between each single attribute and class variable, which may result in a suboptimal order. The proposed algorithm, the KDF, uses a new sorting method to address this issue.
According to the chain rule of information theory, mutual information I ( C ; X ) can be expanded as follows:
$$I(C;\mathbf{X}) = I(C;X_1) + I(C;X_2 \mid X_1) + \cdots + I(C;X_i \mid X_1,\ldots,X_{i-1}) + \cdots + I(C;X_n \mid X_1,\ldots,X_{n-1}) \quad (5)$$
In the ideal case, for classification we would like to obtain the maximum value of $I(C;\mathbf{X})$. From Equation (5), we can see that the computational cost of estimating $I(C;X_i \mid X_1,\ldots,X_{i-1})$ grows exponentially as the number of attributes increases, and so does the space needed to store the corresponding conditional probability distribution; approximating these probability estimates is therefore challenging. In order to address this issue, we replace $I(C;X_i \mid X_1,\ldots,X_{i-1})$ with the following:
$$Sum\_CMI_i = I(C;X_i \mid X_1) + I(C;X_i \mid X_2) + \cdots + I(C;X_i \mid X_{i-1}) \quad (6)$$
Equation (6) considers both the mutual dependence and the conditional dependence for classification. On this basis, we propose a new approach to predetermine the sequence of predictive attributes by comparing the values of $Sum\_CMI_i$. From Equation (6), we can see that the first attribute of a sequence does not reflect any conditional dependence. Thus we use each attribute as the root node $X_{root}$ in turn. The next attribute added to the sequence is the attribute that is most informative about $C$ conditioned on the first attribute (measured by $I(X_i;C \mid X_{root})$). Subsequent attributes are chosen to be the most informative about $C$ conditioned on the previously chosen attributes (measured by $Sum\_CMI_i$). Because there are n different root nodes, we can obtain n sequences $\{S_1,\ldots,S_n\}$. On the basis of these n sequences, n subclassifiers can be generated. The sorting algorithm (Algorithm 1) is depicted below.
Algorithm 1: KDF: Sorting
Input: Preclassified dataset DB with n predictive attributes $\{X_1,\ldots,X_n\}$.
Output: Sequences $\{S_1,\ldots,S_n\}$.
For each sequence $S_i$, $i \in \{1,\ldots,n\}$:
  • Let $S_i$ be empty.
  • Let predictive attribute $X_i$ be the root node.
  • Add the root node to $S_i$.
  • Repeat until $S_i$ includes all domain attributes:
    (a) Compute $Sum\_CMI_j$ for each predictive attribute $X_j$ ($j \neq i$) that is not yet in $S_i$.
    (b) Select $X_{max}$, the attribute with the maximum value of $Sum\_CMI_j$.
    (c) Add $X_{max}$ to $S_i$.
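A compact Python rendering of Algorithm 1 follows. It is a sketch under the assumption that the conditional mutual information values $I(C;X_i \mid X_j)$ have been precomputed and stored in `cmi_c_given[(i, j)]`; the function and parameter names are ours.

```python
def kdf_sorting(n_attrs, cmi_c_given):
    """Sketch of Algorithm 1: build one attribute sequence per choice of root.

    cmi_c_given[(i, j)] is assumed to hold the precomputed value I(C; X_i | X_j).
    Returns n_attrs sequences S_1, ..., S_n as lists of attribute indices.
    """
    sequences = []
    for root in range(n_attrs):
        seq = [root]                                   # the root node opens the sequence
        remaining = [j for j in range(n_attrs) if j != root]

        def sum_cmi(j):
            # Equation (6): sum of I(C; X_j | X_s) over attributes already in the sequence
            return sum(cmi_c_given[(j, s)] for s in seq)

        while remaining:
            best = max(remaining, key=sum_cmi)         # most informative about C given seq
            seq.append(best)
            remaining.remove(best)
        sequences.append(seq)
    return sequences
```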
In order to identify the graphical structure of the resulting classifier, the KDB adopts a greedy search strategy. The weight of the conditional dependence between $X_i$ and its parent $X_j$ is measured by the conditional mutual information $I(X_i;X_j \mid C)$. However, the dependency relationships between $X_j$ and the other parents of $X_i$ are neglected, whether they are independent or strongly correlated. From Equation (1), we can see that, for the full Bayesian network classifier, the parent of $X_2$ is $X_1$, the parents of $X_3$ are $\{X_1,X_2\}$, the parents of $X_4$ are $\{X_1,X_2,X_3\}$, and so forth. We can then derive an implicit chain rule: $X_1$ is the parent of $X_2$, $X_2$ is the parent of $X_3$ (or $X_1$ is the grandparent of $X_3$), $X_3$ is the parent of $X_4$ (or $X_1$ is the great-grandparent of $X_4$), and so forth. Thus, as shown in Figure 5, there should exist hierarchical dependency relationships among the parents. If $X_4$ is one parent of attribute $X_i$, we should follow the dotted line shown in Figure 5 to find the other parents. To make our idea clear, we first introduce the definition of an ancestor node.
Definition 1.
Suppose that X j is the parent of X i . The ancestor attributes of X i include X j ’s parents, grandparents, great grandparents, and so forth.
During the procedure of dependence analysis, $X_i$ first selects the attribute $X_j$ that corresponds to the largest value of $I(X_i;X_j \mid C)$ as its parent. The other $k-1$ parents of $X_i$ are then selected from among its ancestor attributes. Figure 6a shows an example of a KDF subclassifier, KDF$_i$. Suppose that $X_4 = \arg\max I(X_i;X_5 \mid C)$. When $X_5$ is added to KDF$_i$, $X_4$ is selected as the first parent of $X_5$. The corresponding parent–child relationships are shown in Figure 6b, from which we can see that the ancestor attributes of $X_5$ are $\{X_2,X_3\} \cup \{X_1,X_2\} \cup \{X_1\}$, that is, $\{X_1,X_2,X_3\}$. The other parents of $X_5$ are then selected from $\{X_1,X_2,X_3\}$ by comparing $I(X_i;X_5 \mid C)$ ($1 \leq i \leq 3$). This strategy helps to reduce the search space of attribute dependencies. The detailed procedure of dependence analysis (Algorithm 2) is depicted below.
Algorithm 2: KDF: Dependence Analysis
Input: Sequences $\{S_1,\ldots,S_n\}$.
Output: Subclassifiers {KDF$_1$, ..., KDF$_n$}.
  • Compute $I(X_i;X_j \mid C)$ for each pair of attributes $X_i$ and $X_j$, where $i \neq j$.
  • For each sequence $S_i$, $i \in \{1,\ldots,n\}$:
    (1) Let the KDF$_i$ being constructed begin with a single class node, $C$.
    (2) Repeat until KDF$_i$ includes all attributes:
      (a) Select the attribute $X_{first}$, which is the first attribute in $S_i$ that is not yet in KDF$_i$.
      (b) Add a node to KDF$_i$ representing $X_{first}$.
      (c) Add an arc from $C$ to $X_{first}$ in KDF$_i$.
      (d) Select $X_j$, the attribute in KDF$_i$ with the largest value of $I(X_{first};X_j \mid C)$, as the first parent of $X_{first}$.
      (e) Select the other $b-1$ parents from the ancestor attributes of $X_j$ by comparing the values of $I(X_{first};X_p \mid C)$, where $X_p$ is one of the ancestor attributes of $X_j$, $b = \min(d,k)$ and $d$ is the number of ancestor attributes of $X_j$.
  • Compute the conditional probability tables inferred by the structure of KDF$_i$ by using counts from DB, and output KDF$_i$.
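The following sketch mirrors Algorithm 2 for a single sequence, again assuming a precomputed table `cmi[(i, j)]` of $I(X_i;X_j \mid C)$ values; it returns only the structure (parent sets), with the conditional probability tables left to be estimated from DB counts as in the final step above.

```python
def kdf_dependence_analysis(seq, cmi, k=2):
    """Sketch of Algorithm 2 for one sequence S_i: assign attribute parents for KDF_i.

    cmi[(i, j)] is assumed to hold the precomputed value I(X_i; X_j | C).
    The class C is implicitly a parent of every attribute and is not listed here.
    """
    parents = {seq[0]: []}          # the first attribute has only C as its parent
    ancestors = {seq[0]: set()}     # parents, grandparents, ... of each attribute
    for x in seq[1:]:
        # step (d): the first parent is the attribute already in the model with
        # the largest conditional mutual information I(x; X_j | C)
        first = max(parents, key=lambda xj: cmi[(x, xj)])
        # step (e): the remaining parents are drawn from the ancestor attributes
        # of the first parent, again ranked by conditional mutual information
        pool = sorted(ancestors[first], key=lambda xp: cmi[(x, xp)], reverse=True)
        parents[x] = [first] + pool[:max(k - 1, 0)]
        ancestors[x] = set(parents[x]).union(*(ancestors[p] for p in parents[x]))
    return parents
```

Running this over the n sequences produced by Algorithm 1 yields the subclassifiers KDF$_1$, ..., KDF$_n$.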
After training multiple learning subclassifiers, ensemble learning treats these as a “committee” of decision makers and combines individual predictions appropriately. The decision of the committee should have better overall accuracy, on average, than any individual committee member. There exist numerous methods for model combination, for example, the linear combiner, the product combiner and the voting combiner. For the subclassifier KDF i , an estimate of the probability of class c given input x is P i ( c | x ) . The linear combiner is used for models that output real-valued numbers; thus it is applicable for the KDF. The ensemble probability estimate is
$$\hat{P}(c \mid \mathbf{x}) = \sum_{i=1}^{n} w_i P_i(c \mid \mathbf{x}) \quad (7)$$
If the weights are $w_i = 1/n$ for all i, this is a simple uniform averaging of the probability estimates. The notation clearly allows for the possibility of a nonuniformly weighted average. If the classifiers have different accuracies on the data, a nonuniform combination could in theory give a lower error than a uniform combination. In practice, however, the difficulty of estimating the $w_i$ parameters without overfitting outweighs the relatively small gain that is available. Thus, in practice, we use a uniformly rather than nonuniformly weighted average.
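With uniform weights, the combiner of Equation (7) reduces to a one-line averaging step; the sketch below (our own helper) assumes the subclassifiers' posteriors for one test instance are stacked row-wise.

```python
import numpy as np

def kdf_combine(posteriors):
    """Uniformly weighted combiner of Equation (7) for one test instance.

    posteriors : array of shape (n_subclassifiers, n_classes), row i holding P_i(c | x).
    """
    avg = np.asarray(posteriors).mean(axis=0)   # w_i = 1/n for every subclassifier
    return avg, int(np.argmax(avg))             # averaged posterior and predicted class
```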
The KDF collects the statistics needed to calculate the conditional mutual information of each pair of attributes given the class for structure learning. As an entry must be updated for every training instance and every combination of two attribute values for that instance, the time complexity of forming the three-dimensional probability table is $O(n^2 m c v^2)$, where m is the number of training instances, n is the number of attributes, c is the number of classes, and v is the maximum number of discrete values that any attribute may take. To calculate the conditional mutual information, the KDF must consider every pairwise combination of attribute values in conjunction with each class value, which is $O(c(nv)^2)$. For each subclassifier KDF$_i$, attribute ordering and parent assignment are $O(n \log n)$ and $O(n^2 \log n)$, respectively. KDF$_i$ requires n conditional probability tables of $k+2$ dimensions, with space complexity $O(c n v^{k+1})$. Because the KDF needs to average the results of n subclassifiers, classifying a single test instance takes $O(n^2 c k)$ time.
The parameter k is closely related to the classification performance of a high-dependence classifier. A higher value of k may result in higher variance and lower bias. Unfortunately, there is no a priori means of preselecting the value of k that achieves the lowest error for a given training set, as this depends on a complex interplay between the data quantity and the complexity and strength of the interactions between the attributes, as proved by Martinez et al. [8]. From the discussion above, we can see that, for each KDF$_i$, the space complexity of the probability table increases exponentially as k increases; to achieve a trade-off between classification performance and efficiency, we restrict the structure complexity to be two-dependence (k = 2), which is also adopted by Webb et al. [26].

4. Experiments and Results

In order to verify the efficiency and effectiveness of the proposed KDF algorithm, experiments were conducted on 40 benchmark datasets from the UCI Machine Learning Repository [27]. Table 1 summarizes the characteristics of each dataset, including the number of instances, attributes and classes. All the datasets are ordered by dataset scale. Missing values for qualitative attributes were replaced with modes, and those for quantitative attributes were replaced with means from the training data. For each original dataset, we discretized numeric attributes using minimum description length (MDL) discretization [28]. All experiments were conducted on a desktop computer with an Intel(R) Core(TM) i3-6100 CPU @ 3.70 GHz, 64 bits and 4096 MB of memory. All the experiments for the Bayesian algorithms used C++ software specifically designed for classification methods. The running efficiency of the KDF was good; for example, on the Poker hand dataset it took the KDF 281 s to obtain classification results. The following algorithms were compared:
  • NB, standard naive Bayes.
  • TAN, tree-augmented naive Bayes.
  • AODE, averaged one-dependence estimator.
  • KDB, k-dependence Bayesian classifier.
  • KDB S , the KDB that only performs the sorting method proposed above.
  • ATAN, averaged tree-augmented naive Bayes.
  • RF100, random forest containing 100 trees.
  • RFn, random forest containing n trees, where n is the number of predictive attributes.
  • KDF, k-dependence forest.
Kohavi and Wolpert presented a bias–variance decomposition of the expected misclassification rate [29], which is a powerful tool from sampling theory statistics for analyzing supervised learning scenarios. Supposing that c and $\hat{c}$ are the true class label and the label generated by a learning algorithm, respectively, the zero-one loss function is defined as

$$\xi(c,\hat{c}) = 1 - \sum_{\hat{c},\, c \in C} P(c = \hat{c}) \quad (8)$$
The bias term measures the squared difference between the average outputs of the target and the algorithm. This term is defined as follows:

$$bias = \frac{1}{2}\sum_{\hat{c},\, c \in C}\left[P(\hat{c} \mid \mathbf{x}) - P(c \mid \mathbf{x})\right]^2 \quad (9)$$

where $\mathbf{x}$ is any combination of attribute values. The variance term is a real-valued non-negative quantity that equals zero for an algorithm that always makes the same guess regardless of the training set. The variance increases as the algorithm becomes more sensitive to changes in the training set. It is defined as follows:

$$variance = \frac{1}{2}\left[1 - \sum_{\hat{c} \in C} P(\hat{c} \mid \mathbf{x})^2\right] \quad (10)$$
Given a fixed Bayesian network structure, $P(c,\mathbf{x})$ can be calculated as follows:

$$P(c,\mathbf{x}) = P(c)\prod_{i=1}^{n} P(x_i \mid c, Pa(x_i)) \quad (11)$$

The conditional probability $P(c \mid \mathbf{x})$ in the bias term can then be rewritten as

$$P(c \mid \mathbf{x}) = \frac{P(c,\mathbf{x})}{P(\mathbf{x})} = \frac{P(c,\mathbf{x})}{\sum_{c \in C} P(c,\mathbf{x})} \quad (12)$$
Given a dataset containing e test instances, the zero-one loss, bias and variance for the dataset are obtained by averaging the corresponding values over all test instances.
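The per-instance terms of Equations (9) and (10) can be computed directly from the target distribution and a classifier's posterior estimate. The sketch below is our own helper; the dataset-level averaging is done exactly as described above.

```python
import numpy as np

def zero_one_bias_variance(p_true, p_pred):
    """Per-instance bias and variance terms of Equations (9) and (10).

    p_true : length-|C| array with the target distribution P(c | x)
    p_pred : length-|C| array with the classifier's estimate P(c | x)
    """
    p_true, p_pred = np.asarray(p_true), np.asarray(p_pred)
    bias = 0.5 * np.sum((p_pred - p_true) ** 2)
    variance = 0.5 * (1.0 - np.sum(p_pred ** 2))
    return bias, variance

# dataset-level values: average over the e test instances, e.g.
# np.mean([zero_one_bias_variance(t, p) for t, p in zip(targets, predictions)], axis=0)
```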
In order to clarify the performance of the KDF over datasets of different scales, we propose a new scoring criterion, called goal difference (GD).
Definition 2.
Goal difference (GD) is a scoring criterion to compare the performance of two classifiers. Given two classifiers A and B, GD is defined as
$$GD(A;B \mid T) = |win| - |loss| \quad (13)$$
where T is the collection of datasets for the experimental study, and $|win|$ and $|loss|$ represent the numbers of datasets on which A outperforms or underperforms B, respectively, according to the evaluation function (e.g., zero-one loss, bias, or variance).
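A sketch of the GD computation over a collection of datasets is given below; note that it counts every difference as a win or loss, whereas the W/D/L records reported later additionally apply a two-tailed binomial sign test to decide significance.

```python
def goal_difference(results_a, results_b):
    """GD(A; B | T) = |win| - |loss| for two classifiers over the same datasets.

    results_a, results_b : per-dataset values of an evaluation function
    (e.g. zero-one loss), where lower is better.
    """
    win = sum(a < b for a, b in zip(results_a, results_b))
    loss = sum(a > b for a, b in zip(results_a, results_b))
    return win - loss
```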
Diversity has been recognized as a very important characteristic in classifier combination. However, there is no strict definition of what is intuitively perceived as diversity of classifiers. Many measures of the connection between two classifier outputs can be derived from the statistical literature; there is less clarity on the subject when three or more classifiers are concerned. Supposing that each subclassifier votes for a particular class label, given a test instance $T_k$ and assuming equal weights, the proportion of the n subclassifiers that agree on class label $c_j$ is
$$Pr_k(j) = \frac{1}{n}\sum_{i=1}^{n} f_{ij} \quad (14)$$
where $f_{ij} = 1$ if KDF$_i$ votes for label $c_j$, and $f_{ij} = 0$ otherwise.
Entropy is a good measure of dispersion in bootstrap estimation during classification. Given a test set containing M instances, an appropriate measure to evaluate diversity among ensemble members is
$$Div = -\sum_{k=1}^{M}\sum_{j=1}^{|C|} Pr_k(j)\log Pr_k(j) \quad (15)$$
Clearly, when all subclassifiers always vote for the same label, Div will have a minimum value of 0.
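The diversity measure of Equations (14) and (15) can be computed from the subclassifiers' votes as follows. The sketch assumes integer-coded class labels and returns the average over the M test instances, which is how we read the "average entropy diversity" reported in the experiments; the logarithm base only rescales the measure.

```python
import numpy as np

def average_entropy_diversity(votes, n_classes):
    """Entropy diversity of Equations (14)-(15), averaged over the M test instances.

    votes : array of shape (M, n_subclassifiers) with the integer class label
            voted for by each subclassifier on each test instance.
    """
    votes = np.asarray(votes)
    M, n = votes.shape
    total = 0.0
    for k in range(M):
        pr = np.bincount(votes[k], minlength=n_classes) / n   # Pr_k(j), Equation (14)
        nz = pr[pr > 0]
        total += -np.sum(nz * np.log(nz))                     # entropy term of Equation (15)
    return total / M
```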
We argue that the KDF benefits from the sorting method, the dependence analysis and the ensemble mechanism. In the following, we present experiments for these three aspects.

4.1. Impact of Sorting Method

To illustrate the impact of the sorting method on classification performance, we consider another version of the KDB, KDB$_S$. KDB$_S$ replaces the sorting method of the KDB with the sorting method proposed above. We note that the root node of KDB$_S$ is kept consistent with that of the KDB to make sure the comparison is fair. Table A1 in Appendix A presents the zero-one loss for each dataset, estimated by 10-fold cross-validation to give an accurate estimate of the average performance of an algorithm. The best result is emphasized with bold font. Runs with the various algorithms are carried out on the same training sets and evaluated on the same test sets. In particular, the cross-validation folds are the same for all of the experiments on each dataset. By comparing via a two-tailed binomial sign test with a 95% confidence level, we present summaries of win/draw/loss (W/D/L) records in Table 2. A win indicates that the algorithm has significantly lower error than the comparator; a draw indicates that the differences in error are not significant. We can easily see that KDB$_S$ achieves lower error than the KDB on 13 datasets. Since the two algorithms differ only in the sorting method, this better performance can be attributed to the sorting method.
In order to further demonstrate the superiority of this sorting method, Figure 7 shows the scatter plot of KDB S and KDB in terms of zero-one loss. The X-axis represents the zero-one loss results of KDB and the Y-axis represents the zero-one loss results of KDB S . We can see that there are a lot of datasets under the diagonal line, such as Chess, Hepatitis, Lymphography and Echocardiogram, which means that KDB S has a clear advantage over the KDB. Simultaneously, aside from Nursery, Kr vs. kp and Poker hand, the other datasets fall close to the diagonal line. That means that KDB S has much higher classification error than KDB on only these three datasets. For some datasets, this sorting method did not affect the classification error. However, for many datasets, it substantially reduced the classification error, for example, the reduction from 0.1871 to 0.1290 for the Hepatitis dataset.

4.2. Impact of Dependence Analysis

To show the superior performance of the dependence analysis (i.e., the selection of ancestor attributes) of the KDF, we analyze it from the viewpoint of the conditional mutual information $I(X_i;X_j \mid C)$, which can be used to quantitatively evaluate the conditional dependence between $X_i$ and $X_j$ given $C$. We propose the definition of average conditional mutual information, $Avg\_CMI$, to measure the intensity of conditional dependence between predictive attributes for a classifier. $Avg\_CMI$ is defined as follows:

$$Avg\_CMI = \frac{\sum_{i=1}^{n}\sum_{X_j \in Pa(X_i)} I(X_i;X_j \mid C)}{Sum\_arc} \quad (16)$$
where $Pa(X_i)$ is the set of parents of $X_i$, and $Sum\_arc$ is the total number of arcs between predictive attributes. The comparison of $Avg\_CMI$ between the KDF and the KDB is shown in Figure 8. We can see that the KDF has a significant advantage over the KDB on almost all the datasets. According to Figure 8, the W/D/L of the KDF against the KDB is 35/1/4; that is to say, the KDB has a higher value of $Avg\_CMI$ than the KDF on only four datasets. The experimental results show that the selection of ancestor attributes in the KDF can fully reflect the conditional dependence between predictive attributes; for example, the value of $Avg\_CMI$ increases from 0.2947 to 0.4991 for the Vowel dataset.
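Given the parent sets of a learned structure (e.g., those returned by the Algorithm 2 sketch above) and a precomputed conditional mutual information table, $Avg\_CMI$ of Equation (16) reduces to a simple average over the attribute arcs; a minimal sketch:

```python
def avg_cmi(parents, cmi):
    """Avg_CMI of Equation (16) for a learned structure.

    parents : dict mapping each attribute to its list of attribute parents
    cmi     : dict assumed to hold cmi[(i, j)] = I(X_i; X_j | C)
    """
    arcs = [(x, p) for x, ps in parents.items() for p in ps]
    return sum(cmi[arc] for arc in arcs) / len(arcs) if arcs else 0.0
```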

4.3. Further Experimental Analysis

This part of the experiments compared the KDF with the out-of-core classifiers described in Section 4 in terms of zero-one loss. According to the zero-one loss results in Table A1 in Appendix A, we present summaries of W/D/L records in Table 3. When the dependence complexity increases, the performance of TAN and the KDB becomes better than that of NB. The two-dependence relationship helps the KDB to achieve a slightly better performance than TAN (16 wins and 13 losses). It is clear that AODE performs far better than NB (27 wins and 4 losses). However, the ensemble mechanism does not help ATAN to achieve superior performance to TAN (2 wins and 1 loss). The KDF performs the best. For example, when compared with the KDB, the KDF wins on 23 datasets and loses on 5 datasets. This advantage is more apparent when comparing the KDF with ATAN (26 wins and 2 losses). The KDF also provides better classification performance than AODE (26 wins and 5 losses).
To clarify from the viewpoints of the ensemble mechanism and structure complexity, we compare the KDF with only three classifiers, namely the KDB, ATAN and AODE. We present the fitting curves of GD in terms of zero-one loss in Figure 9. Given datasets $\{D_1,\ldots,D_m\}$, the X-axis in Figure 9 represents the index number of the datasets, and the Y-axis represents the value of $GD(A;B \mid S_i)$, where $S_i$ is the collection of datasets $\{D_1,\ldots,D_{\hat{i}} \mid \hat{i} < m\}$ used for the experimental study. In the following discussion, we first compare the KDF with the other two ensemble classifiers, ATAN and AODE; then, the KDF is compared with the KDB for the same value of k. As shown in Figure 9, the KDF performs only a little worse than ATAN when dealing with small datasets with no more than 131 instances, for example, Echocardiogram. This indicates that so few instances are not enough to support discovering significant dependencies for the KDF. However, as more instances are utilized for training the classifier, the sorting method of the KDF and the higher value of k help to ensure that more dependencies appear and are expressed in the joint probability distribution. This makes the KDF perform much better than ATAN (the maximum value of $GD(KDF;ATAN \mid S_i)$ is 24). For the same reason, the fitting curve of $GD(KDF;AODE \mid S_i)$ has a trend similar to that of $GD(KDF;ATAN \mid S_i)$. When we compare the KDF with the KDB, the fitting curve shows a different trend. It is clear from Figure 9 that the KDF always performs much better than the KDB on datasets of different scales. This superior performance is due to the ensemble mechanism of the KDF. The KDF has n subclassifiers, where n is the number of predictive attributes, and each subclassifier of the KDF reflects almost the same quantities of mutual dependencies and conditional dependencies as the KDB. Moreover, diversity among the subclassifiers of the KDF is also a key factor in its superior performance. In order to prove this point, we show the results of average entropy diversity in the following discussion.
For the purpose of calculating the average entropy diversity of the KDF over datasets of different scales while ensuring the consistency of the data distribution, we take the Poker hand dataset as an example. Before segmentation, 200 instances were selected as a test set and the remaining instances were used for training. The training set is divided into 17 parts of different sizes, growing exponentially in powers of 2 (from $2^3$ to $2^{19}$ instances). Figure 10a shows the fitting curve of the average entropy diversity of the KDF on the Poker hand dataset. As can be seen, there is strong diversity among the subclassifiers of the KDF, and the maximum value is close to 0.48 when the dataset contains fewer than $2^{12}$ instances (4096 instances). The reason for this result is that fewer training instances make each subclassifier learn diverse mutual dependencies and conditional dependencies. As the number of instances increases, each subclassifier can be trained well and tends to vote for the same label. Therefore, the fitting curve of the average entropy diversity has a downward trend. However, the slight decrease in diversity does not produce a bad classification performance. Figure 10b shows the corresponding fitting curve of the zero-one loss of the KDF. We can see that as more instances are utilized for training, the KDF still achieves better classification performance in terms of zero-one loss.

4.3.1. Comparison with In-Core Random Forest

A random forest (RF) is a powerful, state-of-the-art in-core learning algorithm. To further illustrate the performance of the KDF, here we first compare the KDF with an RF containing 100 trees (RF100) with respect to zero-one loss. From Table A1 in Appendix A, we can see that RF100 seems to perform better than the KDF on several datasets. In order to see by how much RF100 wins, we present the scatter plot in Figure 11a, where the X-axis represents the zero-one loss results of RF100 and the Y-axis represents the zero-one loss results of the KDF. We note that we could not obtain results for RF100 on two datasets, Covtype and Poker hand, because of limited memory; thus we remove these two points from the plot. We can see that the dataset Anneal is under the diagonal line, which means the KDF beats RF100 on the Anneal dataset. Except for Vowel, Tic-tac-toe, Promoters and Sign, the other datasets fall close to the diagonal line. This means the performance of the KDF is close to that of RF100 on most datasets. It is worthwhile to keep in mind that the number of subclassifiers of the KDF (the maximum number is 64, on the Optdigits dataset) is much smaller than that of RF100.
It is unfair to compare the KDF and the RF when they have different numbers of subclassifiers. Thus we present another experiment that limits the RF to n trees (RFn), just as for the KDF. Table A1 in Appendix A presents the zero-one loss in detail. We also present the scatter plot in Figure 11b, where the X-axis represents the zero-one loss results of RFn and the Y-axis represents the zero-one loss results of the KDF. From Figure 11b, we can easily see that most datasets are under the diagonal line, for example, Anneal, Car, Chess, Hungarian, Promoters, and so on, which means the KDF performs much better than RFn on these datasets. Except for Vowel and Sign, the other datasets fall close to the diagonal line, which means the performance of the KDF is close to that of RFn on the remaining datasets. The superior performance of the RF can be partially attributed to its large number of decision trees. The experimental results show that the KDF is competitive with the RF when they contain the same number of subclassifiers.

4.3.2. Bias Results

Bias can be used to evaluate the extent to which the final model learned from the training data fits the entire dataset. To further illustrate the performance of the proposed KDF, the experimental results of average bias are shown in Table A2 in Appendix A. Only the 18 large datasets (size ≥ 2310) are selected for comparison, for reasons of statistical significance. Table 4 shows the corresponding W/D/L records. From Table 4, we can see that the fit of NB is the poorest, because its structure is fixed regardless of the true data distribution. Although the structure of AODE is also fixed, it shows a great advantage over NB (17 wins). The main reason may be that it averages all models from a restricted class of one-dependence classifiers and thus reflects more dependencies between predictive attributes. ATAN and TAN have almost the same bias results (18 draws). The KDF still performs the best, although the advantage is not significant. By sorting attributes and training n subclassifiers, the ensemble mechanism helps the KDF make full use of the information supplied by the training data. The complicated relationships among attributes are measured and depicted from the viewpoint of information theory; thus, robust performance can be achieved. The W/D/L records of the KDF compared with AODE show that the advantage is obvious for bias (11 wins and 4 losses). We can also see that, more often than not, the KDF obtains lower bias than ATAN (8 wins and 4 losses) and the KDB (7 wins and 5 losses).
Figure 12 shows the fitting curve of GD in terms of bias. The results indicate that the KDF is competitive with AODE (the minimum value of $GD(KDF;AODE \mid S_i)$ is −1 and the maximum value is 7). We believe the reason the KDF performs better is that its sorting method allows it to reflect more dependencies than AODE. The KDF performs much better than ATAN (the maximum value of $GD(KDF;ATAN \mid S_i)$ is 6) when dealing with relatively small datasets containing no more than 67,557 instances, for example, the Connect-4 dataset. As the number of instances increases, the dependencies between predictive attributes are represented more completely, and the final structures of both the KDF and ATAN fit the entire dataset well. Thus, the KDF wins on two out of the last four datasets. The comparison between the KDF and the KDB in terms of $GD(KDF;KDB \mid S_i)$ shows another trend. From the fitting curve, we can see that the KDB is competitive with the KDF on the first four large datasets, which contain no more than 4601 instances; the minimum value of $GD(KDF;KDB \mid S_i)$ is as low as −4. The reason for this result is that the KDF cannot discover enough dependencies when a dataset contains few instances. As the number of instances increases, the KDF achieves a greater advantage in terms of bias.

4.3.3. Variance Results

Table A3 in Appendix A shows the experimental results of average variance on the 18 large datasets, and Table 5 shows the corresponding W/D/L records. A higher degree of attribute dependence means more parameters, which increases the risk of overfitting; an overfitted model does not perform well on data outside the training data. It is clear from Table 5 that NB performs the best among these algorithms, because its network structure is fixed and is therefore insensitive to changes in the training set. For the same reason, AODE also has a competitive performance. ATAN has almost the same performance as TAN (17 draws). By contrast, the KDB performs the worst: when the value of k increases, the resulting network tends to have a complex structure. The KDF wins on 13 out of 18 datasets compared with the KDB. AODE wins over the KDF, although the advantage is not significant (the KDF records 7 wins and 9 losses against AODE).
Figure 13 shows the fitting curve of GD in terms of variance. NB and AODE are not considered, because they are insensitive to changes in the training set; TAN is not considered, because its performance is almost the same as that of ATAN. The KDF obtains a significant advantage over the KDB, but performs similarly to ATAN. ATAN can only represent the most significant one-dependence relationships between attributes and thus performs similarly to TAN. The ensemble mechanism helps the KDF fully represent many non-significant dependencies. This may be the main reason why ATAN and the KDF are not sensitive to changes in the data distribution. In contrast, although the KDB can also represent significant dependencies, some non-significant dependencies will be affected by the training data, particularly when the dataset size is relatively large.

5. Conclusions

The KDB delivers fast and effective classification with a clear theoretical foundation. The current work was motivated by the desire to obtain the accuracy improvements offered by the sorting method and the ensemble mechanism. Our new classification technique averages all models from a restricted class of k-dependence classifiers, namely all such classifiers whose network structures differ according to a different attribute order. Our experiments have shown its superiority in terms of zero-one loss, bias, variance and diversity. However, the subclassifiers of the KDF are trained using the same training set, which may lead to overfitting. Moreover, the number of subclassifiers of the KDF is determined by the number of predictive attributes and is not as large as for the RF. In summary, we believe that we have been successful in our goal of developing a classification learning technique that retains the direct theoretical foundation of the KDB while more fully representing conditional dependencies among attributes.

Acknowledgments

This work was supported by the National Science Foundation of China (Grant No. 61272209) and the Agreement of Science & Technology Development Project, Jilin Province (No. 20150101014JC).

Author Contributions

All authors have contributed to the study and to the preparation of the article. They have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Tables of the Experimental Section

Table A1. Experimental results of average zero-one loss.

Dataset | NB | TAN | AODE | KDB | ATAN | RF100 | RFn | KDB_S | KDF
Post-operative | 0.3444 | 0.3667 | 0.3333 | 0.3778 | 0.3667 | 0.3667 | 0.3778 | 0.3667 | 0.3667
Zoo | 0.0297 | 0.0099 | 0.0297 | 0.0495 | 0.0198 | 0.0297 | 0.0396 | 0.0396 | 0.0297
Promoters | 0.0755 | 0.1321 | 0.1321 | 0.2547 | 0.1226 | 0.0943 | 0.3208 | 0.2453 | 0.2075
Echocardiogram | 0.3359 | 0.3282 | 0.3206 | 0.3435 | 0.3282 | 0.3359 | 0.3359 | 0.3130 | 0.3130
Lymphography | 0.1486 | 0.1757 | 0.1689 | 0.2365 | 0.1689 | 0.1554 | 0.2432 | 0.1757 | 0.1554
Hepatitis | 0.1935 | 0.1677 | 0.1806 | 0.1871 | 0.1742 | 0.1613 | 0.2387 | 0.1290 | 0.1806
Autos | 0.3122 | 0.2146 | 0.2049 | 0.2049 | 0.2146 | 0.1610 | 0.1805 | 0.2049 | 0.1854
Glass-id | 0.2617 | 0.2196 | 0.2523 | 0.2196 | 0.2196 | 0.1449 | 0.1682 | 0.1963 | 0.2056
Heart | 0.1778 | 0.1926 | 0.1704 | 0.2111 | 0.1926 | 0.1815 | 0.2593 | 0.2000 | 0.1741
Hungarian | 0.1599 | 0.1701 | 0.1667 | 0.1803 | 0.1769 | 0.1837 | 0.2653 | 0.1667 | 0.1497
Heart disease-c | 0.1815 | 0.2079 | 0.2013 | 0.2244 | 0.2046 | 0.1782 | 0.2607 | 0.2046 | 0.1980
Ionosphere | 0.1054 | 0.0684 | 0.0741 | 0.0741 | 0.0684 | 0.0741 | 0.0813 | 0.0769 | 0.0712
House votes-84 | 0.0943 | 0.0552 | 0.0529 | 0.0506 | 0.0529 | 0.0391 | 0.0736 | 0.0483 | 0.0483
Chess | 0.1125 | 0.0926 | 0.0998 | 0.0998 | 0.0926 | 0.0799 | 0.1815 | 0.0780 | 0.0744
Breast cancer-w | 0.0258 | 0.0415 | 0.0358 | 0.0744 | 0.0415 | 0.0386 | 0.0601 | 0.0815 | 0.0386
Pima-Ind diabetes | 0.2448 | 0.2383 | 0.2383 | 0.2448 | 0.2370 | 0.2422 | 0.2734 | 0.2500 | 0.2422
Vehicle | 0.3924 | 0.2943 | 0.2896 | 0.2943 | 0.2943 | 0.2470 | 0.2754 | 0.2872 | 0.2861
Anneal | 0.0379 | 0.0111 | 0.0089 | 0.0089 | 0.0111 | 0.0479 | 0.0568 | 0.0089 | 0.0078
Tic-tac-toe | 0.3069 | 0.2286 | 0.2651 | 0.2035 | 0.2276 | 0.0261 | 0.2046 | 0.2004 | 0.1983
Vowel | 0.4242 | 0.1303 | 0.1495 | 0.1818 | 0.1263 | 0.0172 | 0.0515 | 0.1616 | 0.1273
Contraceptive-mc | 0.5037 | 0.4888 | 0.4942 | 0.5003 | 0.4895 | 0.4854 | 0.5132 | 0.5003 | 0.4929
Car | 0.1400 | 0.0567 | 0.0816 | 0.0382 | 0.0567 | 0.0550 | 0.0856 | 0.0492 | 0.0376
Segment | 0.0788 | 0.0390 | 0.0342 | 0.0472 | 0.0398 | 0.0216 | 0.0316 | 0.0476 | 0.0372
Kr vs. kp | 0.1214 | 0.0776 | 0.0842 | 0.0416 | 0.0776 | 0.0075 | 0.0532 | 0.0795 | 0.0463
Sick | 0.0308 | 0.0257 | 0.0273 | 0.0223 | 0.0255 | 0.0159 | 0.0358 | 0.0276 | 0.0236
Spambase | 0.1015 | 0.0669 | 0.0672 | 0.0635 | 0.0669 | 0.0450 | 0.0932 | 0.0700 | 0.0552
Optdigits | 0.0767 | 0.0407 | 0.0311 | 0.0372 | 0.0404 | 0.0185 | 0.0326 | 0.0372 | 0.0262
Satellite | 0.1806 | 0.1214 | 0.1148 | 0.1080 | 0.1209 | 0.0825 | 0.1046 | 0.1103 | 0.1051
Mushrooms | 0.0196 | 0.0001 | 0.0001 | 0.0000 | 0.0001 | 0.0000 | 0.0002 | 0.0000 | 0.0000
Thyroid | 0.1111 | 0.0720 | 0.0701 | 0.0706 | 0.0719 | 0.0477 | 0.0525 | 0.0794 | 0.0654
Sign | 0.3586 | 0.2755 | 0.2821 | 0.2539 | 0.2755 | 0.1187 | 0.1485 | 0.2473 | 0.2365
Nursery | 0.0973 | 0.0654 | 0.0730 | 0.0289 | 0.0655 | 0.0093 | 0.0281 | 0.0586 | 0.0416
Magic | 0.2239 | 0.1675 | 0.1752 | 0.1637 | 0.1674 | 0.1200 | 0.1655 | 0.1656 | 0.1526
Adult | 0.1592 | 0.1380 | 0.1493 | 0.1383 | 0.1380 | 0.1487 | 0.1931 | 0.1350 | 0.1317
Shuttle | 0.0039 | 0.0015 | 0.0008 | 0.0009 | 0.0013 | 0.0001 | 0.0002 | 0.0007 | 0.0007
Connect-4 | 0.2783 | 0.2354 | 0.2420 | 0.2283 | 0.2354 | 0.1751 | 0.2515 | 0.2380 | 0.2145
Waveform | 0.0220 | 0.0202 | 0.0180 | 0.0256 | 0.0202 | 0.0136 | 0.0168 | 0.0197 | 0.0195
Census income | 0.2363 | 0.0628 | 0.1004 | 0.0508 | 0.0628 | 0.0470 | 0.0587 | 0.0526 | 0.0626
Covtype | 0.3158 | 0.2517 | 0.2389 | 0.1421 | 0.2516 | - | - | 0.1311 | 0.1291
Poker hand | 0.4988 | 0.3295 | 0.4812 | 0.1961 | 0.3295 | - | - | 0.2254 | 0.2204
NB, standard naive Bayes. TAN, tree-augmented naive Bayes. AODE, averaged one-dependence estimator. KDB, k-dependence Bayesian classifier. KDB S , the KDB that only performs the sorting method proposed above. ATAN, averaged tree-augmented naive Bayes. RF100, random forest containing 100 trees. RFn, random forest containing n trees, where n is the number of predictive attributes. KDF, k-dependence forest.
Table A2. Experimental results of bias on large datasets.

Dataset | NB | TAN | AODE | KDB | ATAN | KDF
Segment | 0.0857 | 0.0454 | 0.0393 | 0.0392 | 0.0450 | 0.0434
Kr vs. kp | 0.1105 | 0.0668 | 0.0699 | 0.0390 | 0.0669 | 0.0450
Sick | 0.0232 | 0.0228 | 0.0242 | 0.0208 | 0.0223 | 0.0291
Spambase | 0.0965 | 0.0656 | 0.0669 | 0.0504 | 0.0658 | 0.0558
Optdigits | 0.0655 | 0.0308 | 0.0295 | 0.0285 | 0.0306 | 0.0241
Satellite | 0.1661 | 0.0941 | 0.0801 | 0.0810 | 0.0941 | 0.0849
Mushrooms | 0.0399 | 0.0002 | 0.0010 | 0.0002 | 0.0002 | 0.0002
Thyroid | 0.1014 | 0.0749 | 0.0664 | 0.0751 | 0.0734 | 0.0593
Sign | 0.3109 | 0.2464 | 0.2500 | 0.2154 | 0.2455 | 0.2316
Nursery | 0.0729 | 0.0507 | 0.0519 | 0.0418 | 0.0512 | 0.0367
Magic | 0.1987 | 0.1357 | 0.1613 | 0.1321 | 0.1356 | 0.1370
Adult | 0.1485 | 0.1125 | 0.1182 | 0.1135 | 0.1124 | 0.1099
Shuttle | 0.0066 | 0.0023 | 0.0023 | 0.0028 | 0.0023 | 0.0023
Connect-4 | 0.2327 | 0.1829 | 0.1921 | 0.1788 | 0.1830 | 0.1798
Waveform | 0.0314 | 0.0138 | 0.0151 | 0.0180 | 0.0138 | 0.0146
Census income | 0.1271 | 0.0513 | 0.0499 | 0.0541 | 0.0500 | 0.0532
Covtype | 0.2288 | 0.2257 | 0.2148 | 0.2238 | 0.2259 | 0.1309
Poker hand | 0.3266 | 0.2265 | 0.2627 | 0.3306 | 0.2267 | 0.2684
Table A3. Experimental results of variance on large datasets.

Dataset | NB | TAN | AODE | KDB | ATAN | KDF
Segment | 0.0173 | 0.0212 | 0.0179 | 0.0261 | 0.0211 | 0.0172
Kr vs. kp | 0.0249 | 0.0185 | 0.0233 | 0.0101 | 0.0184 | 0.0128
Sick | 0.0058 | 0.0068 | 0.0045 | 0.0056 | 0.0065 | 0.0087
Spambase | 0.0104 | 0.0171 | 0.0139 | 0.0238 | 0.0169 | 0.0171
Optdigits | 0.0247 | 0.0280 | 0.0236 | 0.0322 | 0.0279 | 0.0243
Satellite | 0.0207 | 0.0413 | 0.0491 | 0.0529 | 0.0413 | 0.0400
Mushrooms | 0.0081 | 0.0006 | 0.0003 | 0.0005 | 0.0006 | 0.0007
Thyroid | 0.0357 | 0.0388 | 0.0411 | 0.0368 | 0.0389 | 0.0340
Sign | 0.0786 | 0.0779 | 0.0974 | 0.0966 | 0.0784 | 0.0789
Nursery | 0.0270 | 0.0380 | 0.0295 | 0.0447 | 0.0377 | 0.0376
Magic | 0.0409 | 0.0792 | 0.0613 | 0.0818 | 0.0794 | 0.0805
Adult | 0.0355 | 0.0640 | 0.0446 | 0.0717 | 0.0634 | 0.0555
Shuttle | 0.0038 | 0.0008 | 0.0021 | 0.0021 | 0.0009 | 0.0010
Connect-4 | 0.0953 | 0.0883 | 0.1103 | 0.1044 | 0.0881 | 0.0873
Waveform | 0.0044 | 0.0119 | 0.0068 | 0.0102 | 0.0118 | 0.0086
Census income | 0.0468 | 0.0223 | 0.0477 | 0.0226 | 0.0224 | 0.0200
Covtype | 0.1232 | 0.1665 | 0.1329 | 0.1623 | 0.1752 | 0.1887
Poker hand | 0.2091 | 0.2311 | 0.2141 | 0.3284 | 0.2305 | 0.2693

References

  1. Zhou, L.P.; Wang, L.; Liu, L.Q.; Ogunbona, P.; Shen, D.G. Learning discriminative Bayesian networks from high-dimensional continuous neuroimaging data. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 2269–2283.
  2. Pernkopf, F.; Wohlmayr, M.; Tschiatschek, S. Maximum margin Bayesian network classifiers. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 521–532.
  3. Pernkopf, F.; Bilmes, J. Efficient heuristics for discriminative structure learning of Bayesian network classifiers. J. Mach. Learn. Res. 2010, 11, 2323–2360.
  4. Carvalho, A.; Adão, P.; Mateus, P. Efficient approximation of the conditional relative entropy with applications to discriminative learning of Bayesian network classifiers. Entropy 2013, 15, 2716–2735.
  5. Wang, S.C.; Gao, R.; Wang, L.M. Bayesian network classifiers based on Gaussian kernel density. Expert Syst. Appl. 2016, 51, 207–217.
  6. He, Y.L.; Wang, R.; Kwong, S.; Wang, X.Z. Bayesian classifiers based on probability density estimation and their applications to simultaneous fault diagnosis. Inf. Sci. 2014, 259, 252–268.
  7. Varando, G.; Bielza, C.; Larrañaga, P. Decision boundary for discrete Bayesian network classifiers. J. Mach. Learn. Res. 2015, 16, 2725–2749.
  8. Martinez, A.M.; Webb, G.I.; Chen, S.L.; Nayyar, A.Z. Scalable learning of Bayesian network classifiers. J. Mach. Learn. Res. 2013, 1, 1–30.
  9. Bielza, C.; Larranaga, P. Discrete Bayesian network classifiers: A survey. ACM Comput. Surv. 2014, 47, 1–43.
  10. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian network classifiers. Mach. Learn. 1997, 29, 131–163.
  11. Sahami, M. Learning limited dependence Bayesian classifiers. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 335–338.
  12. Chen, S.L.; Martinez, A.M.; Webb, G.I.; Wang, L.M. Selective AnDE for large data learning: A low-bias memory constrained approach. Knowl. Inf. Syst. 2017, 50, 1–29.
  13. Chen, S.L.; Martinez, A.M.; Webb, G.I.; Wang, L.M. Sample-based attribute selective AnDE for large data. IEEE Trans. Knowl. Data Eng. 2016, 29, 1–14.
  14. Webb, G.I.; Boughton, J.R.; Wang, Z. Not so naive Bayes: Aggregating one-dependence estimators. Mach. Learn. 2005, 58, 5–24.
  15. Jiang, L.X.; Cai, Z.H.; Wang, D.H.; Zhang, H. Improving tree augmented naive Bayes for class probability estimation. Knowl.-Based Syst. 2012, 26, 239–245.
  16. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann: Burlington, MA, USA, 1988.
  17. Jiang, S.; Zhang, H. Full Bayesian network classifiers. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 897–904.
  18. Dagum, P.; Luby, M. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artif. Intell. 1993, 60, 141–153.
  19. Zhao, Y.P.; Chen, Y.T.; Tu, K.W.; Tian, J. Learning Bayesian network structures under incremental construction curricula. Neurocomputing 2017, 258, 30–40.
  20. Wu, J.; Cai, Z. A naive Bayes probability estimation model based on self-adaptive differential evolution. J. Intell. Inf. Syst. 2014, 42, 671–694.
  21. Francisco, L.; Anderson, A. Bagging k-dependence probabilistic networks: An alternative powerful fraud detection tool. Expert Syst. Appl. 2012, 39, 11583–11592.
  22. Domingos, P.; Pazzani, M. On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 1997, 29, 103–130.
  23. Chow, C.K.; Liu, C.N. Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 1968, 14, 462–467.
  24. Madden, M.G. On the classification performance of TAN and general Bayesian networks. Knowl.-Based Syst. 2009, 22, 489–495.
  25. Ruz, G.A.; Pham, D.T. Building Bayesian network classifiers through a Bayesian complexity monitoring system. Proc. Inst. Mech. Eng. Part C J. Mech. Eng. Sci. 2009, 223, 743–755.
  26. Webb, G.I.; Boughton, J.; Zheng, F.; Ting, K.M.; Salem, H. Learning by extrapolation from marginal to full-multivariate probability distributions: Decreasingly naive Bayesian classification. Mach. Learn. 2012, 86, 233–272.
  27. Murphy, P.M.; Aha, D.W. UCI Repository of Machine Learning Databases. Available online: http://www.ics.uci.edu/~mlearn/MLRepository.html (accessed on 28 November 2017).
  28. Fayyad, U.M.; Irani, K.B. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambery, France, 28 August–3 September 1993; pp. 1022–1029.
  29. Kohavi, R.; Wolpert, D. Bias plus variance decomposition for zero-one loss functions. In Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, 3–6 July 1996; pp. 275–283.
Figure 1. (a) The proportional distribution of conditional mutual information I ( X i ; X j | C ) ; (b) the proportional distribution of mutual information I ( C ; X i , X j ) . The predictive attributes { b u y i n g , m a i n t , d o o r s , p e r s o n s , l u g _ b o o t , s a f e t y } of Car are represented by { X 1 , X 2 , , X 6 } .
Figure 2. (a) An example of naive Bayes (NB); (b) an example of averaged one-dependence estimator (AODE).
Figure 3. (a) An example of tree-augmented naive Bayes (TAN), which takes X 1 as the root node; (b) a subclassifier of averaged TAN (ATAN), which takes X 2 as the root node.
Figure 4. An example of k-dependence Bayesian classifier (KDB; k = 2).
Figure 5. An example of the hierarchical dependency relationship.
Figure 6. (a) An example of a subclassifier of k-dependence forest (KDF), supposing that the predetermined attribute sequence $S_i$ is $\{X_1, X_2, X_3, X_4, X_5\}$ and $k = 2$. (b) The corresponding parent–child relationships.
Figure 7. The scatter plot of KDB S and k-dependence Bayesian classifier (KDB) in terms of zero-one loss.
Figure 8. The comparison results of A v g _ C M I between k-dependence forest (KDF) and k-dependence Bayesian classifier (KDB).
Figure 9. The fitting curves of goal difference (GD) in terms of zero-one loss.
Figure 10. The fitting curve of (a) average entropy diversity, and (b) zero-one loss of k-dependence forest (KDF) on Poker hand dataset.
Figure 11. The scatter plot of (a) k-dependence forest (KDF) and random forest 100 (RF100), and (b) KDF and random forest n (RFn) in terms of zero-one loss.
Figure 12. The fitting curve of goal difference (GD) in terms of bias.
Figure 13. The fitting curve of goal difference (GD) in terms of variance.
Table 1. Datasets.

Index | Dataset | # Instance | Attribute | Class | Index | Dataset | # Instance | Attribute | Class
1 | Post-operative | 90 | 8 | 3 | 21 | Contraceptive-mc | 1473 | 9 | 3
2 | Zoo | 101 | 16 | 7 | 22 | Car | 1728 | 6 | 4
3 | Promoters | 106 | 57 | 2 | 23 | Segment | 2310 | 19 | 7
4 | Echocardiogram | 131 | 6 | 2 | 24 | Kr vs. kp | 3196 | 36 | 2
5 | Lymphography | 148 | 18 | 4 | 25 | Sick | 3772 | 29 | 2
6 | Hepatitis | 155 | 19 | 2 | 26 | Spambase | 4601 | 57 | 2
7 | Autos | 205 | 25 | 7 | 27 | Optdigits | 5620 | 64 | 10
8 | Glass-id | 214 | 9 | 3 | 28 | Satellite | 6435 | 36 | 6
9 | Heart | 270 | 12 | 2 | 29 | Mushrooms | 8124 | 22 | 2
10 | Hungarian | 294 | 13 | 2 | 30 | Thyroid | 9169 | 29 | 20
11 | Heart disease-c | 303 | 13 | 2 | 31 | Sign | 12,546 | 8 | 3
12 | Ionosphere | 351 | 34 | 2 | 32 | Nursery | 12,960 | 8 | 5
13 | House votes-84 | 435 | 16 | 2 | 33 | Magic | 19,020 | 10 | 2
14 | Chess | 551 | 39 | 2 | 34 | Adult | 48,842 | 14 | 2
15 | Breast cancer-w | 699 | 9 | 2 | 35 | Shuttle | 58,000 | 9 | 7
16 | Pima-Ind diabetes | 768 | 8 | 2 | 36 | Connect-4 | 67,557 | 42 | 3
17 | Vehicle | 846 | 18 | 4 | 37 | Waveform | 100,000 | 21 | 3
18 | Anneal | 898 | 39 | 6 | 38 | Census income | 299,285 | 41 | 2
19 | Tic-tac-toe | 958 | 9 | 2 | 39 | Covtype | 581,012 | 54 | 7
20 | Vowel | 990 | 13 | 11 | 40 | Poker hand | 1,025,010 | 10 | 10
Table 2. Win/draw/loss comparison results between KDB_S and the k-dependence Bayesian classifier (KDB) in terms of zero-one loss.

W/D/L | KDB
KDB_S | 13/19/8
Table 3. Win/draw/loss comparison results of zero-one loss on all datasets.

W/D/L | NB | TAN | AODE | KDB | ATAN
TAN | 30/3/7 | | | |
AODE | 27/9/4 | 10/16/14 | | |
KDB | 27/4/9 | 16/11/13 | 16/10/14 | |
ATAN | 30/3/7 | 2/37/1 | 14/15/11 | 12/13/15 | -
KDF | 31/5/4 | 25/12/3 | 26/9/5 | 23/12/5 | 26/12/2
Table 4. Win/draw/loss comparison results of bias on large datasets.

W/D/L | NB | TAN | AODE | KDB | ATAN
TAN | 16/2/0 | | | |
AODE | 17/1/0 | 3/10/5 | | |
KDB | 16/2/0 | 8/6/4 | 8/5/5 | |
ATAN | 16/2/0 | 0/18/0 | 5/10/3 | 4/6/8 |
KDF | 17/0/1 | 8/7/3 | 11/3/4 | 7/6/5 | 8/6/4
Table 5. Win/draw/loss comparison results of variance on large datasets.

W/D/L | NB | TAN | AODE | KDB | ATAN
TAN | 5/1/12 | | | |
AODE | 4/4/10 | 11/0/7 | | |
KDB | 4/2/12 | 5/3/10 | 4/2/12 | |
ATAN | 5/1/12 | 0/17/1 | 7/0/11 | 10/2/6 |
KDF | 5/4/9 | 7/6/5 | 7/2/9 | 13/1/4 | 7/6/5
