Article

Learning a Flexible K-Dependence Bayesian Classifier from the Chain Rule of Joint Probability Distribution

Limin Wang and Haoyu Zhao

1 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
2 School of Software, Jilin University, Changchun 130012, China
* Author to whom correspondence should be addressed.
Entropy 2015, 17(6), 3766-3786; https://doi.org/10.3390/e17063766
Submission received: 30 November 2014 / Accepted: 3 June 2015 / Published: 8 June 2015
(This article belongs to the Section Statistical Physics)

Abstract:
As one of the most common types of graphical models, the Bayesian classifier has become an extremely popular approach to dealing with uncertainty and complexity. The scoring functions once proposed and widely used for a Bayesian network are not appropriate for a Bayesian classifier, in which class variable C is considered as a distinguished one. In this paper, we aim to clarify the working mechanism of Bayesian classifiers from the perspective of the chain rule of joint probability distribution. By establishing the mapping relationship between conditional probability distribution and mutual information, a new scoring function, Sum_MI, is derived and applied to evaluate the rationality of the Bayesian classifiers. To achieve global optimization and high dependence representation, the proposed learning algorithm, the flexible K-dependence Bayesian (FKDB) classifier, applies greedy search to extract more information from the K-dependence network structure. Meanwhile, during the learning procedure, the optimal attribute order is determined dynamically, rather than rigidly. In the experimental study, functional dependency analysis is used to improve model interpretability when the structure complexity is restricted.

1. Introduction

Graphical models [1,2] provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering: uncertainty and complexity. The two most common types of graphical models are directed graphical models (also called Bayesian networks) [3,4] and undirected graphical models (also called Markov networks) [5]. A Bayesian network (BN) is a statistical model consisting of a set of conditional probability distributions and a directed acyclic graph (DAG), in which the nodes denote a set of random variables and the arcs describe the conditional (in)dependence relationships between them. Therefore, BNs can be used to predict the consequences of intervention. The conditional dependencies in the graph are often estimated using known statistical and computational methods.
Supervised classification is an important task in data analysis and pattern recognition. It requires the construction of a classifier, that is, a function that assigns a class label to instances described by a set of attributes. There are numerous classifier paradigms, among which Bayesian classifiers [6–11], based on probabilistic graphical models (PGMs) [2], are well known and very effective in domains with uncertainty. Given class variable C and a set of attributes X = {X1, X2, ⋯, Xn}, the aim of supervised learning is to predict from a training set the class of a testing instance x = {x1, ⋯, xn}, where xi is the value of the i-th attribute. We wish to estimate the conditional probability P(c|x) precisely and predict argmaxC P(c|x), where P(·) is a probability distribution function and c ∈ {c1, ⋯, ck} ranges over the k classes. By applying Bayes' theorem, classification with a BN can be carried out as follows:
$$\arg\max_{C} P(c \mid x_1, \ldots, x_n) = \arg\max_{C} \frac{P(x_1, \ldots, x_n, c)}{P(x_1, \ldots, x_n)} \propto \arg\max_{C} P(x_1, \ldots, x_n, c) \tag{1}$$
This kind of classifier is known as generative, and it is the most common approach to classification in the BN literature [6–11].
Many scoring functions, e.g., maximum likelihood (ML) [12], the Bayesian information criterion (BIC) [13], minimum description length (MDL) [14] and the Akaike information criterion (AIC) [15], have been proposed to evaluate how well a learned BN fits the dataset. For a BN, all attributes (including the class variable) are treated equally, while for Bayesian classifiers, the class variable is treated as a distinguished one; consequently, these scoring functions do not work well for Bayesian classifiers [9]. In this paper, we limit our attention to a class of network structures, restricted Bayesian classifiers, which require that the class variable C be a parent of every attribute and that no attribute be a parent of C. P(c, x) can be rewritten as the product of a set of conditional distributions, which is also known as the chain rule of joint probability distribution:
$$P(x_1, \ldots, x_n, c) = P(c) P(x_1 \mid c) P(x_2 \mid x_1, c) \cdots P(x_n \mid x_1, \ldots, x_{n-1}, c) = P(c) \prod_{i=1}^{n} P(x_i \mid Pa_i, c) \tag{2}$$
where Pai denotes a set of parent attributes of the node Xi, except the class variable, i.e., Pai = {X1, ⋯, Xi−1}. Each node Xi has a conditional probability distribution (CPD) representing P(xi|Pai, c). If the Bayesian classifier can be constructed based on Equation (2), the corresponding model is “optimal”, since all conditional dependencies implicated in the joint probability distribution are fully described, and the main term determining the classification will take every attribute into account.
From Equation (2), the order of attributes {X1, ⋯, Xn} is fixed in such a way that an arc between two attributes {Xl, Xh} always goes from the lower ordered attribute Xl to the higher ordered attribute Xh. That is, the network can only contain arcs Xl → Xh where l < h. The first few lower ordered attributes are more important than the higher ordered ones, because Xl may be a possible parent of Xh, while Xh cannot be a parent of Xl. One attribute may depend on several other attributes, and this dependence relationship propagates through the whole attribute set, so a small change in one part of the order may affect the entire structure. Finding an optimal order requires searching the space of all possible network structures for one that best describes the data, and without restrictive assumptions, learning Bayesian networks from data is NP-hard [16]. Because of the limitations on time and space complexity, only a limited number of conditional probabilities can be encoded in the network. Additionally, precise estimation of P(xi|Pai, c) is non-trivial when too many parent attributes are given. One of the most important features of BNs is that they provide an elegant mathematical structure for modeling complicated relationships while keeping a relatively simple visualization of these relationships. If the network can capture all, or at least the most important, dependencies that exist in a database, we would expect the classifier to achieve optimal prediction accuracy. If the structure complexity is restricted to some extent, higher dependencies cannot be represented. The restricted Bayesian classifier family offers different tradeoffs between structure complexity and prediction performance. The simplest model is naive Bayes [6,7], where C is the parent of all predictive attributes and there are no dependence relationships among them. On this basis, we can progressively increase the level of dependence, giving rise to an extended family of naive Bayes models, e.g., tree-augmented naive Bayes (TAN) [8] or the K-dependence Bayesian network (KDB) [10,11].
Different Bayesian classifiers correspond to different factorizations of P(x|c). However, few studies have proposed to learn Bayesian classifiers from the perspective of the chain rule. This paper first establishes the mapping relationship between conditional probability distribution and mutual information and then proposes to evaluate the rationality of a Bayesian classifier from the perspective of information quantity. To build an optimal Bayesian classifier, the key point is to achieve the largest sum of mutual information, which corresponds to the largest a posteriori probability. The working mechanisms of three classical restricted Bayesian classifiers, i.e., NB, TAN and KDB, are analyzed and evaluated from the perspectives of the chain rule and the information quantity implicated in the graphical structure. On this basis, the proposed learning algorithm, the flexible K-dependence Bayesian (FKDB) classifier, applies greedy search of the mutual information space to represent high-dependence relationships. The optimal attribute order is determined dynamically during the learning procedure. The experimental results on the UCI machine learning repository [17] validate the rationality of the FKDB classifier from the viewpoints of zero-one loss and information quantity.

2. The Mapping Relationship between Probability Distribution and Mutual Information

Information theory is the theoretical foundation of modern digital communication and was invented in the 1940s by Claude E. Shannon. Though Shannon was principally concerned with the problem of electronic communications, the theory has much broader applicability. Many commonly-used measures are based on the entropy of information theory and used in a variety of classification algorithms [18].
Definition 1 [19]. The entropy of an attribute (or random variable) is a function that characterizes its unpredictability. Given a discrete random variable X with possible values x and probability distribution function P(·), the entropy is defined as follows:
$$H(X) = -\sum_{x \in X} P(x) \log_2 P(x) \tag{3}$$
Definition 2 [19]. Conditional entropy measures the amount of information needed to describe attribute X when another attribute Y is observed. Given discrete random variables X and Y with possible values x and y, the conditional entropy is defined as follows:
$$H(X \mid Y) = -\sum_{x \in X} \sum_{y \in Y} P(x, y) \log_2 P(x \mid y) \tag{4}$$
Definition 3 [19]. The mutual information I(X; Y) of two random variables is a measure of the variables' mutual dependence and is defined as:
$$I(X; Y) = H(X) - H(X \mid Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) \log_2 \frac{P(x, y)}{P(x) P(y)} \tag{5}$$
Definition 4 [19]. Conditional mutual information I(X; Y|Z) is defined as:
$$I(X; Y \mid Z) = \sum_{x \in X} \sum_{y \in Y} \sum_{z \in Z} P(x, y, z) \log_2 \frac{P(x, y \mid z)}{P(x \mid z) P(y \mid z)} \tag{6}$$
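As a concrete illustration of Definitions 1–4, the following Python sketch estimates entropy, mutual information and conditional mutual information from aligned samples of discrete variables. The plug-in (frequency-count) estimators and the function names are our own choices for illustration; they are not part of the original paper, whose experiments were coded in MATLAB.

```python
import numpy as np
from collections import Counter

def entropy(x):
    """Empirical entropy H(X) in bits (Definition 1), from a sequence of discrete values."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def mutual_information(x, y):
    """Empirical I(X;Y) (Definition 3), from paired samples of two discrete variables."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * np.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def conditional_mutual_information(x, y, z):
    """Empirical I(X;Y|Z) (Definition 4), from aligned samples of three discrete variables."""
    n = len(x)
    pxyz = Counter(zip(x, y, z))
    pxz, pyz, pz = Counter(zip(x, z)), Counter(zip(y, z)), Counter(z)
    return sum((c / n) * np.log2((c / n) * (pz[cz] / n) /
                                 ((pxz[(a, cz)] / n) * (pyz[(b, cz)] / n)))
               for (a, b, cz), c in pxyz.items())
```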
Each factor on the right-hand side of Equation (2), i.e., P(xi|Pai, c), corresponds to a local structure of the restricted Bayesian classifier, and there should exist a strong relationship between Xi and {Pai, C}, which can be measured by I(Xi; Pai, C).
For example, let us consider the simplest situation in which the attribute set is composed of just two attributes {X1, X2}. The joint probability distribution is:
$$P(x_1, x_2, c) = P(c) P(x_1 \mid c) P(x_2 \mid x_1, c) \tag{7}$$
Figure 1a shows the corresponding “optimal” network structure, which is a triangle and also the basic local structure of a restricted Bayesian classifier. Similar to the learning procedure of TAN and KDB, we use I(Xi; Xj|C) to measure the weight of the arc between attributes Xi and Xj, and I(Xi; C) to measure the weight of the arc between the class variable C and attribute Xi. The arcs in Figure 1a are divided into two groups by their final targets, i.e., the arc pointing to X1 (as Figure 1b shows) and the arcs pointing to X2 (as Figure 1c shows). Suppose information flows through the network; then the information quantity provided to X1 and X2 will be I(X1; C) and I(X2; C) + I(X1; X2|C) = I(X2; X1, C), respectively.
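To make the grouping argument above concrete, the short script below numerically checks the identity I(X2; X1, C) = I(X2; C) + I(X1; X2|C) on synthetic discrete data, reusing the estimators sketched earlier; the data-generating process is invented purely for illustration.

```python
import numpy as np
# uses mutual_information / conditional_mutual_information from the earlier sketch

rng = np.random.default_rng(0)
c  = rng.integers(0, 2, size=5000)             # class variable
x1 = (c + rng.integers(0, 2, size=5000)) % 3   # attribute depending on C
x2 = (x1 + rng.integers(0, 2, size=5000)) % 3  # attribute depending on X1 (and hence C)

# I(X2; X1, C): treat the pair (X1, C) as a single discrete variable
lhs = mutual_information(list(x2), list(zip(x1, c)))
rhs = mutual_information(list(x2), list(c)) + \
      conditional_mutual_information(list(x2), list(x1), list(c))
print(round(lhs, 6), round(rhs, 6))            # both estimates coincide up to rounding
```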
Thus, the mapping relationships between conditional probability distribution and mutual information are:
$$P(x_i \mid c) \Leftrightarrow I(X_i; C) \tag{8}$$
and
$$P(x_i \mid Pa_i, c) \Leftrightarrow I(X_i; Pa_i, C) = I(X_i; C) + I(X_i; Pa_i \mid C) \tag{9}$$
To ensure the robustness of the entire Bayesian structure, the sum of mutual information ∑ I(Xi; Pai, C) should be maximized. The scoring function Sum_MI is proposed to measure the information quantity implicated in the Bayesian classifier and is defined as follows:
$$\mathrm{Sum\_MI} = \sum_{X_i \in \mathbf{X}} \Big( I(X_i; C) + \sum_{X_j \in Pa_i} I(X_i; X_j \mid C) \Big) \tag{10}$$
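Under the same conventions as the earlier snippet, the Sum_MI score can be computed as follows; `X_cols` is a list of per-attribute sample sequences, `c_col` holds the class samples, and `parents` maps each attribute index to its parent indices with the class variable excluded. These names and the data layout are assumptions made for this sketch.

```python
# uses mutual_information / conditional_mutual_information from the earlier sketch
def sum_mi(X_cols, c_col, parents):
    """Sum_MI: for every attribute X_i, add I(X_i;C) plus I(X_i;X_j|C) for each parent X_j in Pa_i."""
    total = 0.0
    for i, xi in enumerate(X_cols):
        total += mutual_information(xi, c_col)
        for j in parents.get(i, []):
            total += conditional_mutual_information(xi, X_cols[j], c_col)
    return total
```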

3. Restricted Bayesian Classifier Analysis

In the following discussion, we will analyze and summarize the working mechanisms of some popular Bayesian classifiers to clarify their rationality from the viewpoints of information theory and probability theory.
NB: NB simplifies the estimation of P(x|c) through the conditional independence assumption:
$$P(\mathbf{x} \mid c) = \prod_{i=1}^{n} P(x_i \mid c) \tag{11}$$
Then, the following equation is often calculated in practice, rather than Equation (2).
$$P(c \mid \mathbf{x}) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c) \tag{12}$$
As Figure 2 shows, the NB classifier can be considered a BN with a fixed network structure, where every attribute Xi has the class variable as its only parent, i.e., Pai is restricted to being empty. NB can only represent zero-dependence relationships between predictive attributes: the only information flow is between the predictive attributes and the class variable.
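A minimal sketch of this zero-dependence decision rule is given below; it assumes the probability tables `prior[c]` and `cond[i][c][v]` have already been estimated (for instance with the Laplace smoothing described in Section 5), and that table layout is hypothetical rather than taken from the paper.

```python
def naive_bayes_predict(x, classes, prior, cond):
    """Naive Bayes decision rule: argmax_c P(c) * prod_i P(x_i | c)."""
    def score(c):
        s = prior[c]
        for i, v in enumerate(x):
            s *= cond[i][c].get(v, 1e-9)   # small floor for values unseen with class c
        return s
    return max(classes, key=score)
```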
TAN: The disadvantage of the NB classifier is that it assumes all attributes to be conditionally independent given the class, which is often not a realistic assumption. As Figure 3 shows, TAN introduces more dependencies by allowing each attribute to have one extra parent from among the other attributes, i.e., Pai can contain at most one attribute. TAN is based on the Chow–Liu algorithm [20] and achieves global optimization by building a maximal spanning tree (MST). This algorithm is quadratic in the number of attributes.
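The sketch below illustrates the Chow–Liu step as used by TAN, under the same data conventions as the earlier snippets: it builds a maximum spanning tree over the conditional mutual information weights with Prim's algorithm and directs arcs away from a chosen root. This is an illustrative reconstruction, not the authors' implementation.

```python
# uses conditional_mutual_information from the earlier sketch
def chow_liu_tan_structure(X_cols, c_col, root=0):
    """TAN structure: maximum spanning tree over attributes with edge weight I(X_i;X_j|C),
    arcs directed away from `root`; every attribute also has C as a parent (not shown here)."""
    n = len(X_cols)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = conditional_mutual_information(X_cols[i], X_cols[j], c_col)
    in_tree, parent = {root}, {root: None}
    while len(in_tree) < n:                     # Prim's algorithm, maximizing edge weight
        i, j = max(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: w[e[0]][e[1]])
        parent[j] = i                           # arc X_i -> X_j
        in_tree.add(j)
    return parent
```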
As a one-dependence Bayesian classifier, TAN is optimal. Different attribute orders produce the same undirected network, which is the basis of TAN; when a different attribute is selected as the root node, the direction of some arcs may reverse. For example, Figure 3a,b represents the same dependence relationship when X1 and X4, respectively, are selected as the root nodes. The corresponding chain rules are:
$$P(x_1, \ldots, x_5, c) = P(c) P(x_1 \mid c) P(x_2 \mid x_1, c) P(x_3 \mid x_2, c) P(x_4 \mid x_3, c) P(x_5 \mid x_3, c) \tag{13}$$
and:
$$P(x_1, \ldots, x_5, c) = P(c) P(x_4 \mid c) P(x_3 \mid x_4, c) P(x_2 \mid x_3, c) P(x_1 \mid x_2, c) P(x_5 \mid x_3, c) \tag{14}$$
Sum_MI is the same for Figure 3a,b. That is the main reason why TAN performs almost identically even though the causal relationships implicated in the two network structures differ. To achieve diversity, Ma and Shi [21] proposed the RTAN algorithm, whose output is an ensemble of TANs. Each sub-classifier is trained with a different training subset sampled from the original instances, and the final decision is made by majority vote.
KDB: In KDB, the probability of each attribute value is conditioned on the class variable and, at most, K predictive attributes. The KDB algorithm adopts a greedy strategy to identify the graphical structure of the resulting classifier: it fixes the order of the attributes by comparing mutual information and weighs the relationships between attributes by conditional mutual information. For example, given five predictive attributes {X1, X2, X3, X4, X5} and supposing that I(X1; C) > I(X2; C) > I(X3; C) > I(X4; C) > I(X5; C), the attribute order is {X1, X2, X3, X4, X5}.
From the chain rule of joint probability distribution, there will be:
$$P(c, \mathbf{x}) = P(c) P(x_1 \mid c) P(x_2 \mid c, x_1) P(x_3 \mid c, x_1, x_2) P(x_4 \mid c, x_1, x_2, x_3) P(x_5 \mid c, x_1, x_2, x_3, x_4) \tag{15}$$
Obviously, with more attributes considered as possible parents, more causal relationships can be represented, and Sum_MI will correspondingly be larger. However, because of the time and space complexity overhead, only a limited number of attributes can be considered: in KDB, each predictive attribute can select at most K attributes as parents. Figure 4 shows the corresponding KDB models for different values of K.
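For later comparison with FKDB, here is a sketch of the standard KDB structure step just described: the attribute order is fixed in advance by I(Xi; C), and each attribute then takes its K highest-CMI predecessors as parents (besides the class variable). Again, this is an illustrative reconstruction under the conventions of the earlier snippets.

```python
# uses mutual_information / conditional_mutual_information from the earlier sketch
def kdb_structure(X_cols, c_col, K=2):
    """Standard KDB: order attributes by I(X_i;C); each attribute takes as parents
    the K predecessors in that order with the largest I(X_i;X_j|C)."""
    n = len(X_cols)
    order = sorted(range(n),
                   key=lambda i: mutual_information(X_cols[i], c_col), reverse=True)
    parents = {}
    for pos, i in enumerate(order):
        preceding = sorted(order[:pos],
                           key=lambda j: conditional_mutual_information(X_cols[i], X_cols[j], c_col),
                           reverse=True)
        parents[i] = preceding[:K]
    return order, parents
```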
In summary, from the viewpoint of probability theory, all of these algorithms can be regarded as different variations of the chain rule. Different algorithms offer different tradeoffs between computational complexity and classification accuracy. One advantage of NB is that it avoids model selection, because selecting between alternative models can be expected to increase variance and allow a learning system to overfit the training data. However, the conditional independence assumption makes NB neglect the conditional mutual information between predictive attributes; thus, NB is zero-dependence based and performs the worst among the three algorithms. TAN achieves global optimization by building an MST to weigh the one-dependence causal relationships, i.e., each attribute in TAN can have at most one parent apart from the class variable. Thus, only a limited number of dependencies, or a limited information quantity, can be represented in TAN. KDB allows higher dependence to represent much more complicated relationships between attributes: each attribute can have at most K parent attributes. However, KDB is guided by a rigid ordering obtained from the mutual information between each predictive attribute and the class variable. Mutual information does not consider the interaction between predictive attributes, and this marginal knowledge may result in a sub-optimal order. Suppose K = 2 and I(C; X1) > I(C; X2) > I(C; X3) > I(C; X4) > I(C; X5); then X3 will use X2 as a parent attribute even if they are independent of each other. When K = 1, KDB performs worse than TAN, because it can only achieve a locally optimal network structure. Besides, as Equation (9) shows, I(Xi; Xj|C) can only partially measure the dependence between Xi and {Xj, C}.

4. The Flexible K-Dependence Bayesian Classifier

To retain the advantages of TAN and KDB, i.e., global optimization and higher dependence representation, we now present an algorithm, FKDB, which also constructs K-dependence classifiers along the attribute dependence spectrum. To achieve the optimal attribute order, FKDB considers not only the dependence between each predictive attribute and the class variable, but also the dependencies among predictive attributes. As the learning procedure proceeds, the attributes are put into order one by one; thus, the order is determined dynamically.
Let S represent the set of used attributes; predictive attributes are added to S in sequential order. A newly-added attribute Xj must select its parent attributes from S. To achieve global optimization, Xj should have the strongest relationship with its parent attributes on average, i.e., the mutual information between Xj and {Paj, C} should be the largest. Once selected, Xj is added to S as a possible parent of subsequent attributes. FKDB applies greedy search of the mutual information space to find an optimal ordering of all of the attributes, which may help to fully describe the interaction between attributes.
Algorithm 1 is described as follows:
Algorithm 1. Algorithm FKDB.
Input: a database of pre-classified instances, DB, and the K value for the maximum allowable degree of attribute dependence.
Output: a K-dependence Bayesian classifier with conditional probability tables determined from the input data.
  • Let the used attribute list, S, be empty.
  • Select attribute Xroot that corresponds to the largest value I(Xi; C), and add it to S.
  • Add an arc from C to Xroot.
  • Repeat until S includes all domain attributes
  • • Select attribute Xi, which is not in S and corresponds to the largest sum value:
    $$I(X_i; C) + \sum_{j=1}^{q} I(X_i; X_j \mid C),$$
    where Xj ∈ S and q = min(|S|, K).
  • • Add a node to BN representing Xi.
  • • Add an arc from C to Xi in BN.
  • • Add q arcs from q distinct attributes Xj in S to Xi.
  • • Add Xi to S.
  • Compute the conditional probability tables inferred by the structure of BN using counts from DB, and output BN.
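A compact sketch of Algorithm 1 under the same conventions as the earlier snippets is shown below. We read the sum in the selection step as being taken over the q parents in S with the largest conditional mutual information, which is also how the worked example that follows proceeds; this reading is an assumption of the sketch.

```python
# uses mutual_information / conditional_mutual_information from the earlier sketch
def fkdb_structure(X_cols, c_col, K=2):
    """FKDB (Algorithm 1): attributes enter the used list S one at a time; the next
    attribute maximizes I(X_i;C) plus the sum of its q = min(|S|, K) largest
    I(X_i;X_j|C) over X_j in S, and those q attributes become its parents."""
    n = len(X_cols)
    mi = [mutual_information(X_cols[i], c_col) for i in range(n)]
    root = max(range(n), key=lambda i: mi[i])
    S, parents = [root], {root: []}
    while len(S) < n:
        best, best_score, best_pa = None, float('-inf'), []
        for i in range(n):
            if i in S:
                continue
            cmis = sorted((conditional_mutual_information(X_cols[i], X_cols[j], c_col), j)
                          for j in S)
            q = min(len(S), K)
            top = cmis[-q:]                        # q strongest candidate parents
            score = mi[i] + sum(v for v, _ in top)
            if score > best_score:
                best, best_score, best_pa = i, score, [j for _, j in top]
        parents[best] = best_pa
        S.append(best)
    return S, parents                              # S is the learned attribute order
```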
FKDB requires that at most K parent attributes be selected for each new attribute. To make the working mechanism of FKDB clear, we set K = 2 in the following discussion. Because I(Xi; Xj|C) = I(Xj; Xi|C), we describe the relationships between attributes using an upper triangular matrix of conditional mutual information. The format and one example with five predictive attributes {X0, X1, X2, X3, X4} are shown in Figure 5a,b, respectively. Suppose that I(X0; C) > I(X3; C) > I(X2; C) > I(X4; C) > I(X1; C); then X0 is added to S as the root node. X3 = argmax(I(Xi; C) + I(X0; Xi|C)) over Xi ∉ S; thus, X3 is added to S, and S = {X0, X3}. X2 = argmax(I(Xi; C) + I(X0; Xi|C) + I(X3; Xi|C)) over Xi ∉ S; thus, X2 is added to S, and S = {X0, X2, X3}. Similarly, X4 = argmax(I(Xi; C) + I(Xj; Xi|C) + I(Xk; Xi|C)) over Xi ∉ S with Xj, Xk ∈ S; thus, X4 is added to S, and X1 is the last one in the order. In this way, the whole attribute order and the causal relationships are obtained simultaneously. The final network structure is illustrated in Figure 6.
Optimal attribute order and high-dependence representation are two key points for learning KDB. Note that KDB achieves these two goals in separate steps: it first computes and compares mutual information to obtain an attribute order before structure learning; then, during structure learning, each predictive attribute Xi selects at most K parent attributes by comparing conditional mutual information (CMI). Because these two steps are separate, the attribute order cannot ensure that the K strongest dependencies between Xi and the other attributes are represented. On the other hand, to achieve the optimal attribute order, FKDB considers not only the dependence between each predictive attribute and the class variable, but also the dependencies among predictive attributes. As the learning procedure proceeds, the attributes are put into order one by one; thus, the order is determined dynamically. That is why the classifier is named “flexible”.
We further compare KDB and FKDB with an example. Suppose that for KDB the attribute order is {X1, X2, X3, X4}. Figure 7a shows the corresponding network structure of KDB for K = 2 and the CMI matrix shown in Figure 7b, with the learning steps annotated. The weights of the dependencies between attributes are depicted in Figure 7b: although the dependence relationship between X2 and X1 is the weakest, X1 is selected as the parent attribute of X2, whereas the strong dependence between X4 and X1 is neglected. Suppose that for FKDB the mutual information I(Xi; C) is the same for all predictive attributes. Figure 8a shows the network structure of FKDB corresponding to the CMI matrix shown in Figure 8b, and the learning steps are also annotated. The weights of the causal relationships are depicted in Figure 8b, from which we can see that all strong causal relationships are implicated in the final network structure.

5. Experimental Study

In order to verify the efficiency and effectiveness of the proposed FKDB (K = 2), we conduct experiments on 45 datasets from the UCI machine learning repository. Table 1 summarizes the characteristics of each dataset, including the numbers of instances, attributes and classes. Missing values for qualitative attributes are replaced with modes, and those for quantitative attributes are replaced with means from the training data. For each benchmark dataset, numeric attributes are discretized using MDL discretization [22]. The following algorithms are compared:
  • NB, standard naive Bayes.
  • TAN [23], tree-augmented naive Bayes applying incremental learning.
  • RTAN [21], tree-augmented naive Bayes ensembles.
  • KDB (K = 2), standard K-dependence Bayesian classifier.
All algorithms were coded in MATLAB 7.0 (MathWorks, Natick, MA, USA) on a Pentium 2.93 GHz/1 G RAM computer. Base probability estimates P(c), P(c, xi) and P(c, xi, xj) were smoothed using the Laplace estimate, which can be described as follows:
$$\hat{P}(c) = \frac{F(c) + 1}{M + m}, \qquad \hat{P}(c, x_i) = \frac{F(c, x_i) + 1}{M_i + m_i}, \qquad \hat{P}(c, x_i, x_j) = \frac{F(c, x_i, x_j) + 1}{M_{ij} + m_{ij}} \tag{16}$$
where F(·) is the frequency with which a combination of terms appears in the training data, M is the number of training instances for which the class value is known, Mi is the number of training instances for which both the class and attribute Xi are known, and Mij is the number of training instances for which the class and attributes Xi and Xj are all known. m is the number of values of class C; mi is the number of value combinations of C and Xi; and mij is the number of value combinations of C, Xi and Xj.
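A sketch of these smoothed base estimates is given below; `data` is assumed to be a list of (attribute-tuple, class) pairs with missing values already imputed, and m, mi and mij are taken as the sizes of the corresponding joint value domains observed in training, which is one reading of the definitions above.

```python
from collections import Counter

def laplace_estimates(data, i, j=None):
    """Laplace-smoothed estimates of P(c) and of P(c, x_i) or P(c, x_i, x_j)."""
    M = len(data)
    c_vals = {c for _, c in data}
    xi_vals = {x[i] for x, _ in data}
    c_counts = Counter(c for _, c in data)
    p_c = {c: (c_counts[c] + 1) / (M + len(c_vals)) for c in c_vals}
    if j is None:
        counts = Counter((c, x[i]) for x, c in data)
        m_joint = len(c_vals) * len(xi_vals)                  # value combinations of C and X_i
    else:
        xj_vals = {x[j] for x, _ in data}
        counts = Counter((c, x[i], x[j]) for x, c in data)
        m_joint = len(c_vals) * len(xi_vals) * len(xj_vals)   # combinations of C, X_i and X_j
    p_joint = {k: (f + 1) / (M + m_joint) for k, f in counts.items()}
    return p_c, p_joint
```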
In the following experimental study, functional dependencies (FDs) [24] are used to detect redundant attribute values and to improve model interpretability. To maintain the K-dependence restriction, P(xi|x1, ⋯, xK, c) is used as an approximate estimate of P(xi|x1, ⋯, xi−1, c) when i > K. Obviously, P(xi|x1, ⋯, xK+1, c) would be more accurate than P(xi|x1, ⋯, xK, c). If there exists an FD x2 → x1, then x2 functionally determines x1, and x1 is extraneous for classification. According to the augmentation rule of probability [24],
$$P(x_i \mid x_1, \ldots, x_{K+1}, c) = P(x_i \mid x_2, \ldots, x_{K+1}, c) \tag{17}$$
Correspondingly, in practice, FKDB uses P(xi|x2, ⋯, xK+1, c) instead, which still maintains the K-dependence restriction while representing more causal relationships.
FDs use the following criterion:
$$Count(x_i) = Count(x_i, x_j) \geq l \tag{18}$$
to infer that xi → xj, where Count(xi) is the number of training cases with value xi, Count(xi, xj) is the number of training cases with both values, and l is a user-specified minimum frequency. A large number of deterministic attributes (those on the left side of an FD) increases the risk of incorrect inference and, at the same time, requires more memory to store credible FDs. Consequently, only one-to-one FDs are selected in our current work. Besides, since no formal method is available to select an appropriate value for l, we use l = 100, a setting obtained from empirical studies.
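A sketch of this value-level detection criterion follows; the function name and the returned ((attribute, value), (attribute, value)) representation are our own conventions.

```python
from collections import Counter

def one_to_one_fds(data, l=100):
    """Detect value-level FDs x_i -> x_j using Count(x_i) = Count(x_i, x_j) >= l;
    `data` is a list of attribute tuples."""
    n = len(data[0])
    fds = []
    for i in range(n):
        count_i = Counter(x[i] for x in data)
        for j in range(n):
            if i == j:
                continue
            count_ij = Counter((x[i], x[j]) for x in data)
            for (vi, vj), cnt in count_ij.items():
                if cnt >= l and cnt == count_i[vi]:
                    fds.append(((i, vi), (j, vj)))
    return fds
```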
Kohavi and Wolpert [25] presented a powerful tool from sampling theory statistics for analyzing supervised learning scenarios. Suppose c and ĉ are the true class label and the label generated by classifier A, respectively, for the i-th testing sample; the zero-one loss is defined as:
$$\xi_i(A) = 1 - \delta(c, \hat{c}) \tag{19}$$
where δ(c, ĉ) = 1 if ĉ = c and 0 otherwise. Table 2 presents, for each dataset, the zero-one loss and standard deviation, estimated by 10-fold cross-validation to give an accurate estimate of the average performance of each algorithm. Statistically, a win/draw/loss record (W/D/L) is calculated for each pair of competitors A and B with regard to a performance measure M; the record gives the number of datasets on which A respectively beats, loses to or ties with B on M. Small improvements may be attributable to chance. Runs with the various algorithms are carried out on the same training sets and evaluated on the same test sets; in particular, the cross-validation folds are the same for all of the experiments on each dataset. Finally, related algorithms are compared via a one-tailed binomial sign test with a 95 percent confidence level. Table 3 shows the W/D/L records corresponding to zero-one loss. As the dependence complexity increases, TAN performs better than NB. RTAN investigates the diversity of TAN by the K statistic, and the bagging mechanism helps RTAN achieve superior performance to TAN. FKDB undoubtedly performs the best. Surprisingly, however, as a 2-dependence Bayesian classifier, KDB shows no obvious advantage over the 1-dependence classifiers and in general even performs worse than RTAN. However, when the data size increases beyond a certain point, e.g., 4177 (the size of dataset “Abalone”), as Table 4 shows, the prediction performance of the restricted classifiers can be ranked by dependence level: 2-dependence Bayesian classifiers, e.g., FKDB and KDB, perform the best; the 1-dependence classifier TAN comes next; and the 0-dependence classifier NB performs the worst.
Friedman proposed a non-parametric measure [28], the Friedman test, which compares the ranks of the algorithms on each dataset separately. The null hypothesis is that all of the algorithms are equivalent and there is no difference in their average ranks. We can compute the Friedman statistic:
$$F_r = \frac{12N}{t(t+1)} \sum_{j=1}^{t} R_j^2 - 3N(t+1) \tag{20}$$
which follows a chi-square distribution with t − 1 degrees of freedom, where $R_j = \frac{1}{N}\sum_{i=1}^{N} r_i^j$ and $r_i^j$ is the rank of the j-th of t algorithms on the i-th of N datasets. Thus, for any selected level of significance α, we reject the null hypothesis if the computed value of Fr is greater than $\chi_\alpha^2$, the upper-tail critical value of the chi-square distribution with t − 1 degrees of freedom. The critical value of $\chi_\alpha^2$ for α = 0.05 is 1.8039. The Friedman statistics for the 45 datasets and for the 17 large (size > 4177) datasets are 12 and 28.9, respectively, with p < 0.001 in both cases. Hence, we reject the null hypothesis.
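For reference, the statistic can be computed from a loss matrix as in the sketch below, which uses SciPy's ranking routine and chi-square survival function; this is an illustrative re-implementation, not the evaluation code used in the paper.

```python
import numpy as np
from scipy.stats import rankdata, chi2

def friedman_statistic(losses):
    """Friedman statistic F_r for an N x t matrix of zero-one losses
    (N datasets, t algorithms; lower loss gets rank 1, ties get average ranks)."""
    losses = np.asarray(losses, dtype=float)
    N, t = losses.shape
    ranks = np.apply_along_axis(rankdata, 1, losses)
    R = ranks.mean(axis=0)                         # average rank of each algorithm
    Fr = 12.0 * N / (t * (t + 1)) * np.sum(R ** 2) - 3.0 * N * (t + 1)
    return Fr, chi2.sf(Fr, df=t - 1)               # statistic and chi-square p-value
```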
The average ranks of zero-one loss of different classifiers on all and large datasets are {NB(3.978), TAN(2.778), RTAN(2.467), KDB(3.078), FKDB(2.811)} and {NB(4.853), TAN(3.118), RTAN(3), KDB(2.176) and FKDB(2)}, respectively. Correspondingly, the order of these algorithms is {RTAN, TAN, FKDB, KDB, NB} when comparing the experimental results on all datasets. The performance of FKDB is not obviously superior to other algorithms. However, when comparing the experimental results on large datasets, the order changes greatly and turns out to be {FKDB, KDB, RTAN, TAN, NB}.
When the class distribution is imbalanced, traditional classifiers are easily overwhelmed by instances from the majority classes, while minority class instances are usually ignored [26]. A classification system should, in general, work well for all possible class distributions and misclassification costs. This issue has been successfully addressed in binary problems using ROC analysis and the area under the ROC curve (AUC) metric [27]. Research on related topics, such as imbalanced learning problems, is highly focused on the binary class problem, while progress on multiclass problems is limited [26]. Therefore, we select 16 datasets with binary class labels for comparison of the AUC. The AUC values are shown in Table 5. With 5 algorithms and 16 datasets, the Friedman statistic is Fr = 2.973 and p < 0.004; hence, we reject the null hypothesis again. The average ranks of the different classifiers are {NB(3.6), TAN(3.0), RTAN(2.833), KDB(2.867) and FKDB(2.7)}; hence, the order of these algorithms is {FKDB, RTAN, KDB, TAN, NB}. The effectiveness of FKDB is thus also confirmed from the perspective of AUC.
To compare the relative performance of classifiers A and B, the zero-one loss ratio (ZLR) is proposed in this paper and defined as ZLR(A/B) = ∑ξi(A)/∑ξi(B). Figures 9–12 compare FKDB with NB, TAN, RTAN and KDB, respectively. Each figure is divided into four parts by data size and ZLR: data size greater than or smaller than 4177, combined with ZLR ≥ 1 or ZLR < 1. In the different parts, different symbols are used to represent the different situations. When dealing with small datasets (data size < 4177), the performance superiority of FKDB is not obvious when compared to the 0-dependence (NB) or 1-dependence (TAN) Bayesian classifiers. For some datasets, e.g., “Lung Cancer” and “Hungarian”, NB even performs the best. Because precise estimation of conditional mutual information depends on probability estimation, which is greatly affected by data size, the robustness of the network structure is negatively affected by imprecise probability estimates. For example, for the dataset “Lung Cancer”, with 32 instances and 56 attributes, it is almost impossible to ensure that the basic causal relationships learned are of a high confidence level; that is why a simple structure can perform better than a complicated one. Since each submodel of RTAN can represent only a small proportion of all dependencies, the complementarity of the bagging mechanism works and helps to improve the performance of TAN. KDB shows performance equivalent to FKDB.
As data size increases, high-dependence Bayesian classifiers gradually show their superiority, and the advantage of FKDB is almost overwhelming when compared to NB and TAN. Because almost all strong dependencies can be detected and illustrated in each submodel of RTAN, the high degree of uniformity in the basic structure cannot help to improve the prediction performance of TAN. Thus, RTAN shows equivalent performance to TAN. The prediction superiority of FKDB over KDB becomes much more obvious. Because they both are 2-dependence Bayesian classifiers, a minor difference in local structure may be the main cause of the performance difference. To further clarify this idea, we propose a new criterion, Info_ratio(A/B), to compare the information quantity implicated in Bayesian classifiers A and B.
$$\mathrm{Info\_ratio}(A/B) = \mathrm{Sum\_MI}(A) / \mathrm{Sum\_MI}(B) \tag{21}$$
The comparison results for Info_ratio(FKDB/KDB) are shown in Figure 13, from which the superiority of FKDB in extracting information is much more obvious when dealing with large datasets. The increased information quantity does help to decrease zero-one loss. However, note that the growth rate of information quantity is not proportional to the descent rate of zero-one loss. For some datasets, e.g., “Localization” and “Poker-hand”, KDB and FKDB achieve the same Sum_MI, while their zero-one losses differ. The same Sum_MI corresponds to the same causal relationships: the network structures learned by KDB and FKDB are similar, because the major dependencies are all implicated, except that the directions of some arcs differ. The dependence between X3 and X4 can be represented by the conditional probability distribution P(x3|x4, c) or P(x4|x3, c). Just as we clarified in Section 3, although the basic structures described in Figure 3a,b are the same, the corresponding joint probability distributions represented by Equations (13) and (14) are different. Since ZLR ≈ 1 for these two datasets, the difference in zero-one loss can be explained from the perspective of probability distribution.
To examine the relevance of information quantity to zero-one loss, Figure 14 is divided into four zones. As with the comparison of Equations (13) and (14), the same information quantity does not necessarily correspond to the same Bayesian network and, hence, the same zero-one loss. Zone A contains 27 datasets and describes the situation where ZLR < 1 and Info_ratio ≥ 1; the performance superiority of FKDB over KDB can be attributed to mining more information or to correct conditional dependence representation. Zone D contains 6 datasets and describes the situation where ZLR > 1 and Info_ratio ≤ 1; the performance inferiority of FKDB relative to KDB can be attributed to mining less information. Thus, information quantity is strongly correlated with zero-one loss on 73.3% ((27 + 6)/45) of the datasets. On the other hand, although FKDB has proven its effectiveness in terms of the W/D/L results and the Friedman test, the information quantity is an important criterion, but not the only one.

6. Conclusions

BNs can graphically describe conditional dependence between attributes and have previously been demonstrated to be computationally efficient approaches to further reducing zero-one loss. Conditional mutual information is commonly applied to weigh the dependencies between attributes, but it cannot measure the information quantity provided to the predictive attributes. On the basis of analyzing and summarizing the working mechanisms of three popular Bayesian classifiers from the viewpoints of information theory and probability theory, this paper proposed to mine reliable dependencies by maximizing the sum of mutual information. The experimental results validate the mapping relationship between conditional probability distribution and mutual information.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61272209), the Postdoctoral Science Foundation of China (No. 2013M530980) and the Agreement of Science & Technology Development Project, Jilin Province (No. 20150101014JC).

Author Contributions

All authors have contributed to the study and the preparation of the article. The first author conceived the idea and wrote the paper. The second author advised on the paper and did the programming. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sloin, A.; Wiesel, A. Proper Quaternion Gaussian Graphical Models. IEEE Trans. Signal Process. 2014, 62, 5487–5496.
  2. Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009.
  3. Bielza, C.; Larranaga, P. Discrete Bayesian Network Classifiers: A Survey. ACM Comput. Surv. 2014, 47, 1–43.
  4. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1988.
  5. Jouneghani, F.G.; Babazadeh, M.; Bayramzadeh, R. Investigation of Commuting Hamiltonian in Quantum Markov Network. Int. J. Theor. Phys. 2014, 53, 2521–2530.
  6. Wu, J.; Cai, Z. A naive Bayes probability estimation model based on self-adaptive differential evolution. J. Intell. Inf. Syst. 2014, 42, 671–694.
  7. Minsky, M. Steps toward Artificial Intelligence. Proc. IRE 1961, 49, 8–30.
  8. Jiang, L.X.; Cai, Z.H.; Wang, D.H.; Zhang, H. Improving tree augmented naive Bayes for class probability estimation. Knowl.-Based Syst. 2012, 26, 239–245.
  9. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian network classifiers. Mach. Learn. 1997, 29, 131–163.
  10. Sahami, M. Learning limited dependence Bayesian classifiers. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, 2–4 August 1996; pp. 335–338.
  11. Francisco, L.; Anderson, A. Bagging k-dependence probabilistic networks: An alternative powerful fraud detection tool. Expert Syst. Appl. 2012, 39, 11583–11592.
  12. Andres-Ferrer, J.; Juan, A. Constrained domain maximum likelihood estimation for naive Bayes text classification. Pattern Anal. Appl. 2010, 13, 189–196.
  13. Watanabe, S. A Widely Applicable Bayesian Information Criterion. J. Mach. Learn. Res. 2013, 14, 867–897.
  14. Chaitankar, V.; Ghosh, P.; Perkins, E. A novel gene network inference algorithm using predictive minimum description length approach. BMC Syst. Biol. 2010, 4, 107–126.
  15. Posada, D.; Buckley, T.R. Model selection and model averaging in phylogenetics: Advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Syst. Biol. 2004, 53, 793–808.
  16. Chickering, D.M.; Heckerman, D.; Meek, C. Large-Sample Learning of Bayesian Networks is NP-Hard. J. Mach. Learn. Res. 2004, 5, 1287–1330.
  17. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ml/datasets.html (accessed on 5 June 2015).
  18. Cheng, J.; Greiner, R.; Kelly, J.; Bell, D.; Liu, W. Learning Bayesian networks from data: An information-theory based approach. Artif. Intell. 2002, 137, 43–90.
  19. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
  20. Chow, C.K.; Liu, C.N. Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 1968, 14, 462–467.
  21. Ma, S.H.; Shi, H.B. Tree-augmented naive Bayes ensembles. In Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, Shanghai, China, 26–29 August 2004; pp. 1497–1502.
  22. Fayyad, U.M.; Irani, K.B. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambéry, France, 28 August–3 September 1993; pp. 1022–1029.
  23. Josep, R.A. Incremental Learning of Tree Augmented Naive Bayes Classifiers. In Proceedings of the 8th Ibero-American Conference on AI, Seville, Spain, 12–15 November 2002; pp. 32–41.
  24. Wang, L.M.; Yao, G.F. Extracting Logical Rules and Attribute Subset from Confidence Domain. Information 2012, 15, 173–180.
  25. Kohavi, R.; Wolpert, D. Bias plus variance decomposition for zero-one loss functions. In Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, 3–6 July 1996; pp. 275–283.
  26. He, H.B.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
  27. Fawcett, T. An introduction to ROC analysis. Pattern Recogn. Lett. 2006, 27, 861–874.
  28. Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937, 32, 675–701.
Figure 1. Arcs grouped according to their final targets.
Figure 2. The zero-dependence relationship between the attributes of the NB model.
Figure 3. The one-dependence relationship between the attributes of the tree-augmented naive Bayes (TAN) model.
Figure 4. The K-dependence relationship between attributes inferred from the K-dependence Bayesian (KDB) classifier.
Figure 5. The upper triangular matrix of conditional mutual information between attributes and one example.
Figure 6. The final network structure of flexible K-dependence Bayesian (FKDB). The order number of each predictive attribute is also annotated.
Figure 7. The K-dependence relationships among attributes inferred from the KDB learning algorithm are shown in (a), and the learning steps are annotated. The unused causal relationship in (b) is annotated in pink.
Figure 8. The K-dependence relationships among attributes inferred from the FKDB learning algorithm are shown in (a), and the learning steps are annotated. The unused causal relationship in (b) is annotated in pink.
Figure 9. The experimental results of the zero-one loss ratio ZLR(FKDB/NB).
Figure 10. The experimental results of the zero-one loss ratio ZLR(FKDB/TAN).
Figure 11. The experimental results of the zero-one loss ratio ZLR(FKDB/RTAN).
Figure 12. The experimental results of the zero-one loss ratio ZLR(FKDB/KDB).
Figure 13. The experimental results of Info_ratio(FKDB/KDB).
Figure 14. The relationship between ZLR and Info_ratio.
Table 1. Datasets.

No. | Dataset | Instances | Attributes | Classes
1 | Lung Cancer | 32 | 56 | 3
2 | Zoo | 101 | 16 | 7
3 | Echocardiogram | 131 | 6 | 2
4 | Hepatitis | 155 | 19 | 2
5 | Glass Identification | 214 | 9 | 3
6 | Audio | 226 | 69 | 24
7 | Hungarian | 294 | 13 | 2
8 | Heart Disease | 303 | 13 | 2
9 | Haberman's Survival | 306 | 3 | 2
10 | Primary Tumor | 339 | 17 | 22
11 | Live Disorder (Bupa) | 345 | 6 | 2
12 | Chess | 551 | 39 | 2
13 | Syncon | 600 | 60 | 6
14 | Balance Scale (Wisconsin) | 625 | 4 | 3
15 | Soybean | 683 | 35 | 19
16 | Credit Screening | 690 | 15 | 2
17 | Breast-cancer-w | 699 | 9 | 2
18 | Pima-ind-diabetes | 768 | 8 | 2
19 | Vehicle | 846 | 18 | 4
20 | Anneal | 898 | 38 | 6
21 | Vowel | 990 | 13 | 11
22 | German | 1000 | 20 | 2
23 | LED | 1000 | 7 | 10
24 | Contraceptive Method Choice | 1473 | 9 | 3
25 | Yeast | 1484 | 8 | 10
26 | Volcanoes | 1520 | 3 | 4
27 | Car | 1728 | 6 | 4
28 | Hypothyroid | 3163 | 25 | 2
29 | Abalone | 4177 | 8 | 3
30 | Spambase | 4601 | 57 | 2
31 | Optdigits | 5620 | 64 | 10
32 | Satellite | 6435 | 36 | 6
33 | Mushroom | 8124 | 22 | 2
34 | Thyroid | 9169 | 29 | 20
35 | Sign | 12,546 | 8 | 3
36 | Nursery | 12,960 | 8 | 5
37 | Magic | 19,020 | 10 | 2
38 | Letter-recog | 20,000 | 16 | 26
39 | Adult | 48,842 | 14 | 2
40 | Shuttle | 58,000 | 9 | 7
41 | Connect-4 Opening | 67,557 | 42 | 3
42 | Waveform | 100,000 | 21 | 3
43 | Localization | 164,860 | 5 | 11
44 | Census-income | 299,285 | 41 | 2
45 | Poker-hand | 1,025,010 | 10 | 10
Table 2. Experimental results of zero-one loss.

Dataset | NB | TAN | RTAN | KDB | FKDB
Lung Cancer | 0.438 ± 0.268 | 0.594 ± 0.226 | 0.480 ± 0.319 | 0.594 ± 0.328 | 0.688 ± 0.238
Zoo | 0.029 ± 0.047 | 0.010 ± 0.053 | 0.029 ± 0.050 | 0.050 ± 0.052 | 0.028 ± 0.047
Echocardiogram | 0.336 ± 0.121 | 0.328 ± 0.107 | 0.308 ± 0.101 | 0.344 ± 0.067 | 0.320 ± 0.072
Hepatitis | 0.194 ± 0.100 | 0.168 ± 0.087 | 0.173 ± 0.090 | 0.187 ± 0.092 | 0.170 ± 0.089
Glass Identification | 0.262 ± 0.079 | 0.220 ± 0.083 | 0.242 ± 0.087 | 0.220 ± 0.086 | 0.201 ± 0.079
Audio | 0.239 ± 0.055 | 0.292 ± 0.093 | 0.195 ± 0.091 | 0.323 ± 0.088 | 0.358 ± 0.073
Hungarian | 0.160 ± 0.069 | 0.170 ± 0.063 | 0.160 ± 0.079 | 0.180 ± 0.088 | 0.177 ± 0.081
Heart Disease | 0.178 ± 0.069 | 0.193 ± 0.092 | 0.164 ± 0.073 | 0.211 ± 0.083 | 0.164 ± 0.079
Haberman's Survival | 0.281 ± 0.101 | 0.281 ± 0.100 | 0.270 ± 0.097 | 0.281 ± 0.103 | 0.281 ± 0.092
Primary Tumor | 0.546 ± 0.091 | 0.543 ± 0.100 | 0.552 ± 0.094 | 0.572 ± 0.091 | 0.590 ± 0.089
Live Disorder (Bupa) | 0.444 ± 0.078 | 0.444 ± 0.017 | 0.426 ± 0.037 | 0.444 ± 0.046 | 0.443 ± 0.067
Chess | 0.113 ± 0.055 | 0.093 ± 0.049 | 0.096 ± 0.045 | 0.100 ± 0.054 | 0.076 ± 0.048
Syncon | 0.028 ± 0.033 | 0.008 ± 0.015 | 0.010 ± 0.025 | 0.013 ± 0.022 | 0.011 ± 0.019
Balance Scale | 0.285 ± 0.025 | 0.280 ± 0.022 | 0.286 ± 0.026 | 0.278 ± 0.028 | 0.280 ± 0.021
Soybean | 0.089 ± 0.024 | 0.047 ± 0.014 | 0.045 ± 0.014 | 0.056 ± 0.013 | 0.051 ± 0.021
Credit Screening | 0.141 ± 0.033 | 0.151 ± 0.048 | 0.134 ± 0.037 | 0.146 ± 0.051 | 0.149 ± 0.042
Breast-cancer-w | 0.026 ± 0.022 | 0.042 ± 0.048 | 0.034 ± 0.032 | 0.074 ± 0.025 | 0.080 ± 0.039
Pima-ind-diabetes | 0.245 ± 0.075 | 0.238 ± 0.062 | 0.229 ± 0.065 | 0.245 ± 0.113 | 0.247 ± 0.089
Vehicle | 0.392 ± 0.059 | 0.294 ± 0.056 | 0.278 ± 0.060 | 0.294 ± 0.061 | 0.299 ± 0.056
Anneal | 0.038 ± 0.343 | 0.009 ± 0.376 | 0.009 ± 0.350 | 0.009 ± 0.281 | 0.008 ± 0.296
Vowel | 0.424 ± 0.056 | 0.130 ± 0.046 | 0.144 ± 0.036 | 0.182 ± 0.026 | 0.150 ± 0.041
German | 0.253 ± 0.034 | 0.273 ± 0.062 | 0.238 ± 0.044 | 0.289 ± 0.068 | 0.284 ± 0.052
LED | 0.267 ± 0.062 | 0.266 ± 0.057 | 0.258 ± 0.052 | 0.262 ± 0.052 | 0.272 ± 0.060
Contraceptive Method | 0.504 ± 0.038 | 0.489 ± 0.023 | 0.474 ± 0.028 | 0.500 ± 0.038 | 0.488 ± 0.030
Yeast | 0.424 ± 0.031 | 0.417 ± 0.037 | 0.407 ± 0.032 | 0.439 ± 0.031 | 0.438 ± 0.034
Volcanoes | 0.332 ± 0.029 | 0.332 ± 0.030 | 0.318 ± 0.024 | 0.332 ± 0.024 | 0.338 ± 0.027
Car | 0.140 ± 0.026 | 0.057 ± 0.018 | 0.078 ± 0.022 | 0.038 ± 0.012 | 0.046 ± 0.018
Hypothyroid | 0.015 ± 0.004 | 0.010 ± 0.005 | 0.013 ± 0.004 | 0.011 ± 0.012 | 0.010 ± 0.008
Abalone | 0.472 ± 0.024 | 0.459 ± 0.025 | 0.450 ± 0.024 | 0.467 ± 0.028 | 0.467 ± 0.024
Spambase | 0.102 ± 0.013 | 0.067 ± 0.010 | 0.066 ± 0.010 | 0.064 ± 0.014 | 0.065 ± 0.011
Optdigits | 0.077 ± 0.009 | 0.041 ± 0.008 | 0.040 ± 0.007 | 0.037 ± 0.010 | 0.031 ± 0.009
Satellite | 0.181 ± 0.016 | 0.121 ± 0.011 | 0.119 ± 0.015 | 0.108 ± 0.014 | 0.115 ± 0.012
Mushroom | 0.020 ± 0.004 | 0.000 ± 0.008 | 0.000 ± 0.004 | 0.000 ± 0.000 | 0.000 ± 0.001
Thyroid | 0.111 ± 0.010 | 0.072 ± 0.005 | 0.071 ± 0.007 | 0.071 ± 0.006 | 0.069 ± 0.008
Sign | 0.359 ± 0.007 | 0.276 ± 0.010 | 0.270 ± 0.008 | 0.254 ± 0.006 | 0.223 ± 0.007
Nursery | 0.097 ± 0.006 | 0.065 ± 0.008 | 0.064 ± 0.006 | 0.029 ± 0.006 | 0.028 ± 0.006
Magic | 0.224 ± 0.006 | 0.168 ± 0.004 | 0.165 ± 0.009 | 0.157 ± 0.011 | 0.160 ± 0.006
Letter-recog | 0.253 ± 0.008 | 0.130 ± 0.007 | 0.127 ± 0.008 | 0.099 ± 0.007 | 0.081 ± 0.005
Adult | 0.158 ± 0.004 | 0.138 ± 0.003 | 0.135 ± 0.004 | 0.138 ± 0.004 | 0.132 ± 0.003
Shuttle | 0.004 ± 0.001 | 0.002 ± 0.001 | 0.001 ± 0.001 | 0.001 ± 0.001 | 0.001 ± 0.001
Connect-4 Opening | 0.278 ± 0.006 | 0.235 ± 0.005 | 0.231 ± 0.004 | 0.228 ± 0.004 | 0.218 ± 0.005
Waveform | 0.022 ± 0.002 | 0.020 ± 0.001 | 0.020 ± 0.002 | 0.026 ± 0.002 | 0.018 ± 0.010
Localization | 0.496 ± 0.003 | 0.358 ± 0.002 | 0.350 ± 0.003 | 0.296 ± 0.003 | 0.280 ± 0.001
Census-income | 0.237 ± 0.002 | 0.064 ± 0.002 | 0.063 ± 0.002 | 0.051 ± 0.002 | 0.051 ± 0.002
Poker-hand | 0.499 ± 0.002 | 0.330 ± 0.002 | 0.333 ± 0.002 | 0.196 ± 0.002 | 0.192 ± 0.002
Table 3. Win/draw/loss record (W/D/L) comparison results of zero-one loss on all datasets.

W/D/L | NB | TAN | RTAN | KDB
TAN | 27/11/7 | – | – | –
RTAN | 29/13/3 | 10/27/8 | – | –
KDB | 24/13/8 | 12/20/13 | 15/12/18 | –
FKDB | 26/11/8 | 16/20/9 | 15/15/15 | 12/28/5
Table 4. Win/draw/loss record (W/D/L) comparison results of zero-one loss when the data size > 4177.

W/D/L | NB | TAN | RTAN | KDB
TAN | 16/1/0 | – | – | –
RTAN | 16/1/0 | 0/17/0 | – | –
KDB | 15/1/1 | 11/5/1 | 10/6/1 | –
FKDB | 16/1/0 | 11/6/0 | 10/7/0 | 4/12/1
Table 5. Experimental results of the average AUCs for datasets with binary class labels.

Dataset | NB | TAN | RTAN | KDB | FKDB
Adult | 0.920 | 0.928 | 0.931 | 0.941 | 0.935
Breast-cancer-w | 0.992 | 1.000 | 1.000 | 1.000 | 1.000
Census-income | 0.960 | 0.989 | 0.991 | 0.992 | 0.993
Chess | 0.957 | 0.986 | 0.992 | 0.988 | 0.993
Credit Screening | 0.932 | 0.963 | 0.956 | 0.978 | 0.967
Echocardiogram | 0.737 | 0.771 | 0.775 | 0.771 | 0.776
German | 0.814 | 0.877 | 0.893 | 0.941 | 0.929
Haberman's Survival | 0.659 | 0.658 | 0.687 | 0.657 | 0.692
Heart Disease | 0.922 | 0.936 | 0.946 | 0.956 | 0.951
Hepatitis | 0.929 | 0.968 | 0.983 | 0.985 | 0.977
Hungarian | 0.931 | 0.957 | 0.961 | 0.964 | 0.962
Live Disorder (Bupa) | 0.620 | 0.620 | 0.620 | 0.620 | 0.620
Magic | 0.866 | 0.905 | 0.902 | 0.916 | 0.911
Mushroom | 0.999 | 1.000 | 1.000 | 1.000 | 1.000
Pima-ind-diabetes | 0.851 | 0.865 | 0.866 | 0.876 | 0.877
Spambase | 0.966 | 0.980 | 0.987 | 0.989 | 0.985
