Article

General and Local: Averaged k-Dependence Bayesian Classifiers

1 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
2 School of Software, Jilin University, Changchun 130012, China
* Author to whom correspondence should be addressed.
Entropy 2015, 17(6), 4134-4154; https://doi.org/10.3390/e17064134
Submission received: 4 May 2015 / Revised: 2 June 2015 / Accepted: 9 June 2015 / Published: 16 June 2015
(This article belongs to the Special Issue Inductive Statistical Methods)

Abstract:
The inference of a general Bayesian network has been shown to be an NP-hard problem, even for approximate solutions. Although the k-dependence Bayesian (KDB) classifier can construct classifiers at arbitrary points (values of k) along the attribute dependence spectrum, it cannot identify changes in the interdependencies when attributes take different values. Local KDB, which learns in the framework of KDB, is proposed in this study to describe the local dependencies implicated in each test instance. Based on the analysis of functional dependencies, substitution-elimination resolution, a new type of semi-naive Bayesian operation, is proposed to substitute or eliminate generalizations in order to achieve an accurate estimation of the conditional probability distribution while reducing computational complexity. The final classifier, averaged k-dependence Bayesian classifiers (AKDB), averages the outputs of KDB and local KDB. Experimental results on the repository of machine learning databases from the University of California Irvine (UCI) show that AKDB has significant advantages in zero-one loss and bias relative to naive Bayes (NB), tree augmented naive Bayes (TAN), averaged one-dependence estimators (AODE), and KDB. Moreover, KDB and local KDB show mutually complementary characteristics with respect to variance.

1. Introduction

Bayesian networks (BNs), which were introduced by Pearl [1], can encode dependencies among all variables. Their success has led to a recent flurry of algorithms for learning BNs from data [2–5]. A BN = ⟨N, A, Θ⟩ is a directed acyclic graph with a conditional probability distribution for each node, collectively represented by Θ, which quantifies how much a node depends on its parents. Each node n ∈ N represents a domain variable, and each arc a ∈ A between nodes represents a probabilistic dependency. A BN can be used as a classifier that characterizes the joint distribution P(x, y) of the class variable Y and a set of attributes X = {X1, X2, ⋯, Xn}, and predicts the class label with the highest conditional probability. (In the following discussion, lower-case letters denote specific values taken by the corresponding attributes; for instance, xi represents the event that Xi = xi.) Denoting the parent nodes of xi by Pa(xi), the joint distribution PB(x, y) can be represented by factors over the network structure B, as follows:
$P_B(\mathbf{x}, y) = \prod_{i=1}^{n} P(x_i \mid Pa(x_i)).$
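To make the classification rule concrete, the following minimal Python sketch scores each class label by the factored joint probability and returns the label with the highest score. The container layouts (a class prior table, per-attribute parent lists and conditional probability tables) are illustrative assumptions, not the authors' implementation.

```python
import math

def bn_predict(x, classes, parents, cpt, prior):
    """Return the class label y maximizing P_B(x, y) under a fixed structure.

    prior[y]                      -- P(y)
    parents[i]                    -- indices of Pa(X_i) among the attributes
    cpt[i][(y, parent_vals, x_i)] -- P(x_i | Pa(x_i), y), assumed strictly positive
    All names and layouts are illustrative assumptions."""
    best_y, best_log = None, float("-inf")
    for y in classes:
        log_p = math.log(prior[y])  # the class node has no attribute parents
        for i, xi in enumerate(x):
            pa_vals = tuple(x[j] for j in parents[i])
            log_p += math.log(cpt[i][(y, pa_vals, xi)])
        if log_p > best_log:
            best_y, best_log = y, log_p
    return best_y
```

Working in log space avoids numeric underflow when many factors are multiplied; the Laplace-corrected estimates described in Section 4 keep every factor positive.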
The inference of a general BN has been shown to be an NP-hard problem [6], even for approximate solutions [7]. Moreover, learning unrestricted BNs does not necessarily lead to a classifier with good performance. Naive Bayes (NB) [8] is the simplest BN, considering only the dependence between each attribute Xi and the class variable Y; nevertheless, Friedman et al. [9] observed that unrestricted BN classifiers do not outperform NB on a large sample of benchmark data sets. Many BN classifiers have been proposed to overcome the limitation of NB. One practical approach for structure learning is to impose some restrictions on the structures of BNs, for example, learning tree-like structures. Sahami [10] proposed to describe the limited dependence among variables within a general framework, called the k-dependence Bayesian (KDB) classifier. Friedman et al. [9] proposed tree augmented naive Bayes (TAN), a structure-learning algorithm that learns a maximum spanning tree over the attributes. Conditional mutual information is applied in both algorithms to measure the weight of arcs between predictive attributes. When the data size becomes larger, the superiority in high-dependence representation helps KDB obtain better classification performance than TAN.
The key differences between Bayesian classifiers lie in their structure-learning algorithms. Many criteria, such as the Bayesian scoring function [11], minimal description length (MDL) [12] and the Akaike information criterion (AIC) [13], have been proposed to find a single global graph structure BG that best characterizes the true distribution of the given data. Considering the time and space complexity overhead, only a limited number of conditional probabilities can be encoded in a BN, yet all credible dependencies must be represented to obtain a more accurate estimation of the true joint distribution. However, these criteria can only approximately measure the overall interdependencies between attributes; they cannot identify changes in the interdependencies when attributes take different values. Thus the candidate graph structures may have very close scores and be non-negligible in the posterior sense [14]. To extend the limited representation of BG, some researchers have proposed aggregating several candidate BNs. Averaged one-dependence estimators (AODE), proposed by Webb et al. [15], aggregate the predictions of all qualified members of a restricted class of one-dependence estimators. Zheng et al. [16] proposed subsumption resolution (SR) to efficiently identify occurrences of the specialization-generalization relationship and eliminate generalizations at classification time. By introducing functional dependency (FD) analysis into the learning procedure, the model interpretability and robustness of different Bayesian classifiers can be improved greatly. After eliminating highly dependent attribute values by applying FD analysis, the maximal spanning tree (MST) of TAN is rebuilt with the remaining attribute values for each test instance; correspondingly, the extraneous effect caused by logical relationships between attribute values is mitigated [17]. To evaluate the feasibility of integrating probabilistic reasoning and logical reasoning into the framework of AODE, we first select the branch nodes of the MST as super parents and then refine AODE by applying FD analysis to delete redundant children attributes [18].
In this paper, local mutual information and conditional local mutual information, which are derived from classical information theory, are applied to build the local graph structure BL. BL can be considered a complementary part of BG that describes local causal relationships. To construct classifiers at arbitrary points (values of k) along the attribute dependence spectrum, both BL and BG are built in the framework of the KDB model. Substitution-elimination resolution (SER), a new type of semi-naive Bayesian operation, is proposed to substitute or eliminate generalizations in order to achieve an accurate estimation of the conditional probability distribution while reducing computational complexity. SER deals only with specific values, and only in the context of other specific values. We prove that this adjustment is theoretically correct and demonstrate experimentally that it can considerably improve zero-one loss, bias and variance.
The remainder of this paper is organized as follows: Section 2 first presents the background theory (information theory and the functional dependency rules of probability) and then clarifies the rationale of SER. Section 3 introduces the basic ideas of KDB, local KDB and the proposed algorithm, averaged k-dependence Bayesian classifiers (AKDB), which averages the outputs of KDB and local KDB. Section 4 compares the various approaches on data sets from the UCI Machine Learning Repository. Finally, Section 5 concludes and presents possible future work.

2. Background Theory and Related Research Work

2.1. Information Theory

In the 1940s, Claude E. Shannon introduced information theory [19], the theoretical basis of modern digital communication. Although Shannon was principally concerned with the problem of electronic communications, the theory has a broader applicability. Many commonly used measures are based on the entropy of information theory and used in a variety of classification algorithms.
Definition 1. [19] Entropy of an attribute (or random variable) is a function that attempts to characterize its unpredictability. When given a discrete random variable X with any possible value x and probability distribution function P(·), entropy is defined as follows,
$H(X) = -\sum_{x \in X} P(x) \log_2 P(x)$
Entropy measures the amount of uncertainty about the value an attribute takes, so deterministic attributes have zero entropy. Similar to the concept of conditional probability, conditional entropy H(X|Y) may be understood as the amount of randomness remaining in the random variable X when the value of Y is known.
Definition 2. [19] Given discrete random variables X and Y and their possible values x and y, conditional entropy is defined as follows:
$H(X \mid Y) = -\sum_{x \in X} \sum_{y \in Y} P(x, y) \log_2 P(x \mid y)$
Using the definition of entropy and conditional entropy, we can calculate the amount of information shared between two attributes. The stronger the correlation, the higher the value of mutual information will be.
Definition 3. [19] The mutual information (MI) I(X; Y) of two random variables is a measure of the mutual dependence of the variables and is defined as follows:
$I(X; Y) = H(X) - H(X \mid Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) \log_2 \frac{P(x, y)}{P(x)P(y)}$
Mutual information I(X; Y) between two attributes X and Y measures the expected reduction in entropy and is nonnegative, i.e., I(X; Y) ≥ 0. I(X; Y) = 0 if X and Y are independent, and it is maximized if H(X|Y) = 0. Similar to the definition of conditional entropy, conditional mutual information I(X; Y|Z) indicates the amount of information shared between two attributes X and Y when all the values of attribute Z are known.
Definition 4. [19] Conditional mutual information (CMI) I(X; Y|Z) is defined as follows:
$I(X; Y \mid Z) = \sum_{x \in X} \sum_{y \in Y} \sum_{z \in Z} P(x, y, z) \log_2 \frac{P(x, y \mid z)}{P(x \mid z)P(y \mid z)}$
Definition 5. Local mutual information (LMI) I(X; y) is defined to measure the reduction of entropy about variable X after observing that Y = y, as follows:
$I(X; y) = \sum_{x \in X} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}$
Definition 6. Conditional local mutual information (CLMI) I(x; y|Z) is defined to measure the amount of information shared between two attribute values x and y when all the values of attribute Z are known, as follows:
$I(x; y \mid Z) = \sum_{z \in Z} P(x, y, z) \log \frac{P(x, y \mid z)}{P(x \mid z)P(y \mid z)}$
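As a concrete illustration of Definitions 5 and 6, the following Python sketch estimates LMI and CLMI from empirical frequencies in a discrete data matrix. The base-2 logarithm, the column-index arguments and the small smoothing constant are our own assumptions for the illustration, not part of the original definitions.

```python
import numpy as np

def lmi(data, i, y_col, y_val, eps=1e-12):
    """Local mutual information I(X_i; y) of Definition 5, estimated from
    empirical frequencies in `data`, a 2D array of discrete values."""
    total = 0.0
    p_y = np.mean(data[:, y_col] == y_val)
    for x_val in np.unique(data[:, i]):
        p_xy = np.mean((data[:, i] == x_val) & (data[:, y_col] == y_val))
        p_x = np.mean(data[:, i] == x_val)
        if p_xy > 0:
            total += p_xy * np.log2(p_xy / (p_x * p_y + eps))
    return total

def clmi(data, xi_val, xj_val, i, j, z_col, eps=1e-12):
    """Conditional local mutual information I(x_i; x_j | Z) of Definition 6."""
    total = 0.0
    for z_val in np.unique(data[:, z_col]):
        p_z = np.mean(data[:, z_col] == z_val)
        p_xyz = np.mean((data[:, i] == xi_val) &
                        (data[:, j] == xj_val) &
                        (data[:, z_col] == z_val))
        if p_xyz == 0 or p_z == 0:
            continue
        p_xy_z = p_xyz / p_z
        p_x_z = np.mean((data[:, i] == xi_val) & (data[:, z_col] == z_val)) / p_z
        p_y_z = np.mean((data[:, j] == xj_val) & (data[:, z_col] == z_val)) / p_z
        total += p_xyz * np.log2(p_xy_z / (p_x_z * p_y_z + eps))
    return total
```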

2.2. Functional Dependency Analysis and Substitution-Elimination Resolution

Given a data set D, attribute value y is functionally dependent on attribute value x, and x functionally determines y (in symbols, x → y), if whenever x appears in an instance, y appears as well. We demonstrated the functional dependency rules of probability in [17,18] to build a linkage between probabilistic inference and logical inference; the main rules are the following:
  • Representation equivalence of probability: Given two attribute values {x, y}, if y can be inferred from x, i.e., the FD x → y holds, then the following joint probability distribution holds:
    P(x) = P(x, y)
  • Augmentation rule of probability: If the FD x → y holds and z is another attribute value, then the following joint probability distribution holds:
    P(x, z) = P(x, y, z)
  • Transitivity rule of probability: If the FDs x → y and y → z hold, then the following joint probability distribution holds:
    P(x) = P(x, z)
  • Pseudo-transitivity rule of probability: If yz → δ and x → y hold, then the following joint probability distribution holds:
    P(x, z) = P(x, z, δ)
Definition 7. A k-dependence Bayesian classifier (k-DBC) is a BN that contains the structure of the naive Bayesian classifier and allows each attribute Xi to have a maximum of k attribute nodes as parents. Given the attribute order {X1, ⋯, Xn}, Pa(Xi) = {Y, X_{d_i}}, where X_{d_i} is a subset of {X1, ⋯, Xi−1}, |d_i| = min{i − 1, k}, and Pa(Y) = ∅.
Learning the structure of a k-DBC essentially means learning an order of the variables and then adding arcs from a variable to the variables ranked after it. In fact, given the order of variables, learning a k-DBC is relatively easy.
We consider a hypothetical example with four predictive attributes {Pregnant, Gender, Familial Inheritance, and Breast Cancer} and class variable {Normal}. When given different k values, the corresponding k-DBC models are shown in Figure 1, where X = {X1, X2, X3, X4} and Y denote {Pregnant, Gender, Familial Inheritance, Breast Cancer} and class Normal, respectively.
Subsumption is a central concept in Inductive Logic Programming [20], where it is used to identify generalization-specialization relationships between clauses and to support the process of unifying clauses.
Definition 8. (Generalization and specialization) For two attribute values xi and xj, if P(xj|xi) = 1.0 then xj is a generalization of xi and xi is a specialization of xj.
Suppose that Gender has two values, female and male, and Pregnant has two values, yes and no. If Pregnant = yes, it follows that Gender = female. Therefore, Gender = female is a generalization of Pregnant = yes, i.e., the FD {Pregnant = yes} → {Gender = female} holds.
Theorem 1. Substitution resolution: Suppose that for a k-DBC the attribute order is {X1, X2, ⋯, Xn} and Xi (i > k) should select k attributes as parents. If xp is a generalization of xq (xp, xq ∈ Pai), then for any xt ∉ Pai (t < i), taking xt as a substitute for xp will help achieve a more accurate approximate estimation of the probability distribution.
Proof. For a k-DBC, the conditional probability P(xi|Pai, y) can be considered an approximate estimation of P(xi|x1, ⋯, xi−1, y). Evidently, P(xi|Pai, xt, y) will be more accurate than P(xi|Pai, y), where xt ∉ Pai and t < i. If xp is a generalization of xq (xp, xq ∈ Pai), then by applying the augmentation rule of probability we have P(xi|Pai, y) = P(xi|Pai − xp, y), where “−” denotes set difference. To retain the k-dependence restriction, we select xt as a substitute for xp. □
The example presented in Figure 1 illustrates this relationship. The joint probability distribution of the full Bayesian classifier, as shown in Figure 1c, is expressed as follows:
P(y, x) = P(y) P(x1|y) P(x2|y, x1) P(x3|y, x1, x2) P(x4|y, x1, x2, x3) (12)
In addition, the joint probability distribution of 2-DBC, as Figure 1b shows, is as follows,
P(y, x) = P(y) P(x1|y) P(x2|y, x1) P(x3|y, x1, x2) P(x4|y, x1, x2) (13)
By comparing Equations (12) and (13), we can observe that Equation (13) uses P(x4|y, x1, x2) to obtain an approximate estimation of P(x4|y, x1, x2, x3). Gender = female is a generalization of Pregnant = yes; thus, P(Gender = female|Pregnant = yes) = 1, or P(x2|x1) = 1. By applying the augmentation rule of probability, we derive the following equations:
P(x4|y, x1, x2, x3) = P(x4, y, x1, x2, x3)/P(y, x1, x2, x3) = P(x4, y, x1, x3)/P(y, x1, x3) = P(x4|y, x1, x3)
and
P(x3|y, x1, x2) = P(x3, y, x1, x2)/P(y, x1, x2) = P(x3, y, x1)/P(y, x1) = P(x3|y, x1)
Equation (12) then becomes:
P(y, x) = P(y) P(x1|y) P(x2|y, x1) P(x3|y, x1) P(x4|y, x1, x3) (16)
Thus, for Equation (13), if we use x3 as a substitute for x2 in Pa4 and ∅ as a substitute for x2 in Pa3, the corresponding 2-DBC, shown in Figure 2a, is the same as the full Bayesian classifier for instances in which Pregnant = yes holds. We therefore obtain a more accurate estimation of P(y, x) while the 2-dependence restriction is still retained.
Theorem 2. Elimination resolution: For xp ∈ Pai, if xi is a generalization of xp, then P(xi|Pai) = 1.0 and the factor P(xi|Pai) can be eliminated from the joint probability distribution.
For example, Equations (12) and (13) both use the factor P(x2|x1, y). If x2 is a generalization of x1, then, by applying the augmentation rule of probability, we derive the following equation:
P(x2|x1, y) = P(x2, y, x1)/P(y, x1) = P(y, x1)/P(y, x1) = 1
Thus, Equation (16) can be rewritten as follows:
P(y, x) = P(y) P(x1|y) P(x3|y, x1) P(x4|y, x1, x3)
The corresponding Bayesian structure is shown in Figure 2b. When two attributes are strongly related, the classifier may overweight the inference drawn from them, which results in prediction bias. FDs help avoid this situation, and the high-dimensional representation, or even the entire classification model, is simplified and improved.
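To make SER operational, the sketch below applies Theorems 1 and 2 to the parent sets of one test instance: a factor is dropped when the child value is a generalization of one of its parent values, and a parent value that is a generalization of another parent value is swapped for the next-best candidate. The container layouts and the ranked candidate lists are assumptions made for illustration, not the authors' code.

```python
def apply_ser(parents, generalizes, cand_parents):
    """Substitution-elimination resolution for one test instance (a sketch).

    parents[i]      -- list of parent attribute indices of attribute i (class excluded)
    generalizes     -- set of pairs (p, q): the instance's value on attribute p is a
                       generalization of its value on attribute q
    cand_parents[i] -- attributes ranked by CLMI with attribute i, used for substitution
    Returns the adjusted parent lists and one flag per factor (False = eliminated)."""
    keep_factor = [True] * len(parents)
    for i, pa in enumerate(parents):
        # Elimination (Theorem 2): if x_i is a generalization of some parent value,
        # then P(x_i | Pa_i, y) = 1 and the factor can be dropped.
        if any((i, q) in generalizes for q in pa):
            keep_factor[i] = False
            continue
        # Substitution (Theorem 1): a parent that is a generalization of another
        # parent carries no extra information; replace it with the next candidate.
        for p in list(pa):
            if any((p, q) in generalizes for q in pa if q != p):
                pa.remove(p)
                substitutes = [t for t in cand_parents[i] if t not in pa and t != i]
                if substitutes:
                    pa.append(substitutes[0])
    return parents, keep_factor
```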

3. KDB, Local KDB and AKDB

KDB allows us to construct classifiers at arbitrary points (values of k) along the feature dependence spectrum, while also capturing most of the computational efficiency of the naive Bayesian model. Thus KDB presents an alternative to the general trend in BN learning algorithms that conducts an expensive search through the space of network structures.
KDB is supplied with both a database of pre-classified instances, DB, and the k value for the maximum allowable degree of feature dependence. The KDB outputs a k-dependence Bayesian classifier with conditional probability tables determined from the input data. The algorithm is as follows:
Algorithm 1. KDB.
1. For each attribute Xi, compute the MI I(Xi; Y), where Y is the class.
2. Compute the CMI I(Xi; Xj|Y) for each pair of attributes Xi and Xj, where i ≠ j.
3. Let the used variable list, S, be empty.
4. Let the Bayesian network being constructed, BN, begin with a single class node, Y.
5. Repeat until S includes all domain attributes:
   5.1. Select the attribute Xmax which is not in S and has the highest value I(Xmax; Y).
   5.2. Add a node to BN representing Xmax.
   5.3. Add an arc from Y to Xmax in BN.
   5.4. Add m = min(|S|, k) arcs from the m distinct attributes Xj in S with the highest values of I(Xmax; Xj|Y).
   5.5. Add Xmax to S.
6. Compute the conditional probability tables inferred by the structure of BN by using counts from DB, and output BN.
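The structure-learning loop of Algorithm 1 can be sketched in a few lines of Python once the MI and CMI values have been computed; the dictionary layout and tie handling below are assumptions for illustration.

```python
def kdb_structure(mi, cmi, k):
    """Sketch of Algorithm 1: return the attribute parents of each attribute
    (the class Y is implicitly a parent of every attribute node).

    mi[i]     -- I(X_i; Y)
    cmi[i][j] -- I(X_i; X_j | Y)
    """
    order = sorted(mi, key=mi.get, reverse=True)  # repeatedly picks the unused X_max (step 5.1)
    s, parents = [], {}
    for x_max in order:
        ranked = sorted(s, key=lambda j: cmi[x_max][j], reverse=True)
        parents[x_max] = ranked[:min(len(s), k)]  # step 5.4: m = min(|S|, k) strongest arcs
        s.append(x_max)                           # step 5.5
    return parents
```

Sorting the attributes once by MI is equivalent to repeatedly selecting the unused attribute with the highest I(Xmax; Y), which is how step 5.1 is phrased in the listing.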
From Definitions 3–6, we can obtain the following results:
$I(X_i; Y) = \sum_{x_i \in X_i} I(x_i; Y), \qquad I(X_i; X_j \mid Y) = \sum_{x_i \in X_i} \sum_{x_j \in X_j} I(x_i; x_j \mid Y)$
MI and CMI are commonly applied to roughly measure the direct or conditional relationships between the predictive attributes and the class variable Y. However, in the real world, the relationships between attributes may differ significantly as the situation changes. For some instances, attributes A and B are highly related; for other instances, A is independent of B but highly related to Y. Consider the relationships among the attributes Gender, Pregnant and Breast Cancer: Gender = female and Breast Cancer = yes are highly related. By contrast, if Gender = female, then we cannot draw any definite conclusion about the value of Pregnant, nor about the value of Gender if Breast Cancer = no. Traditional Bayesian classifiers, e.g., KDB, which are learned on the basis of classical information theory, cannot describe such interdependencies. However, LMI and CLMI can be used to identify these dynamic changes, thus making the final model much more flexible.
As shown in Figure 3, for the first instance, the attribute value x2 is independent of other attribute values and the local relationship between {x1, x3} and class variable Y is just like a triangle. For the ith instance, {x2, x3} are independent of Y, and the local relationship between x1 and Y is just like an oval. For the last instance, x3 is independent of other attribute values and the local relationship between {x1, x2} and Y is just like a broken line. If all situations are considered together, then the overall relationship between attributes {X1, X2, X3} and class variable Y is just like a rectangle.
KDB learns the basic relationships of the full BN. In the first two learning steps of KDB, if I(Xi; Y) and I(Xi; Xj|Y) are replaced by I(xi; Y) and I(xi; xj|Y), respectively, then the local KDB that describes the local relationships of each test instance can be inferred. On this basis, FD analysis is introduced into the learning procedure to improve model robustness.
The learning procedure of the local KDB is described as follows:
Algorithm 2. Local KDB.
For each test instance x = {x1, x2, ⋯, xn}:
1. For each attribute value xi ∈ x, compute the LMI I(xi; Y), where Y is the class.
2. Compute the CLMI I(xi; xj|Y) for each pair of attribute values xi and xj, where i ≠ j and xi, xj ∈ x.
3. Let the used variable list, S, be empty.
4. Let the Bayesian network being constructed, BN, begin with a single class node, Y.
5. Repeat until S includes all attribute values:
   5.1. Select the attribute value xmax which is not in S and has the highest value I(xmax; Y).
   5.2. Add a node to BN representing xmax.
   5.3. Add an arc from Y to xmax in BN.
   5.4. Add m = min(|S|, k) arcs from the m distinct attribute values xj in S with the highest values of I(xmax; xj|Y).
   5.5. Add xmax to S.
6. Apply SER to substitute generalizations or eliminate redundant conditional probabilities.
7. Compute the conditional probability tables inferred by the structure of BN by using counts from DB, and output BN.
The final classifier, AKDB, estimates the class membership probabilities by averaging the KDB and local KDB classifiers. The basic idea of AKDB can be explained from the perspective of medical diagnosis: KDB describes the basic relationships between different symptoms that can be explained by domain knowledge learned from books or in school, whereas the local KDB describes the possible relationships between the symptoms of a specific patient. To make a definite diagnosis, rich experience (which corresponds to the robust KDB model) and a flexible mind (which corresponds to the dynamic local KDB model) are both necessary and important.
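A minimal sketch of the averaging step, assuming the class-membership probabilities of KDB and local KDB have already been computed for the test instance; the equal weights reflect our reading of "averaging the output" and are not stated explicitly in the text.

```python
def akdb_predict(post_kdb, post_local, classes):
    """Average the class-membership estimates of KDB and the local KDB (a sketch)."""
    averaged = {y: 0.5 * (post_kdb[y] + post_local[y]) for y in classes}
    return max(averaged, key=averaged.get)
```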
FDs require a method for inferring from the training data whether one attribute value is a generalization of another. We use the following criterion:
$|T_{x_i}| = |T_{x_i, x_j}| \geq l$
to infer that xj is a generalization of xi, where $|T_{x_i}|$ is the number of training cases with value xi, $|T_{x_i, x_j}|$ is the number of training cases with both values, and l is a user-specified minimum frequency. A large number of deterministic attributes on the left side of an FD would increase the risk of incorrect inference and, at the same time, require more computer memory to store the credible FDs. Consequently, only one-one FDs are selected in our current work. Besides, as no formal method has been used to select an appropriate value for l, we use the same setting as that proposed by Webb [15], i.e., l = 100, which was obtained from empirical studies.
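The criterion above can be implemented in a single pass over the training data; the sketch below counts single values and ordered value pairs and returns the one-one generalization pairs. The data layout (a list of discrete rows) and the l = 100 default mirror the setting just described, while the function name is an illustrative assumption.

```python
from collections import Counter
from itertools import permutations

def detect_generalizations(data, l=100):
    """Detect one-one specialization-generalization pairs with the criterion
    |T_{x_i}| = |T_{x_i, x_j}| >= l (a sketch; `data` is a list of discrete rows).

    Returns pairs (a, b) of (attribute index, value) tuples, meaning that
    value b is a generalization of value a."""
    single = Counter()
    pair = Counter()
    for row in data:
        vals = list(enumerate(row))         # (attribute index, value) for each attribute
        single.update(vals)
        pair.update(permutations(vals, 2))  # ordered pairs over distinct attributes
    return [(a, b) for (a, b), n_ab in pair.items()
            if single[a] == n_ab and n_ab >= l]
```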
The learning framework of the local KDB is as follows. During training time, FD analysis is applied to detect all possible specialization-generalization relationships. During classification time, the local KDB first builds the basic network structure for each test instance t, then selects the specialization-generalization relationships that hold in t and applies SER to refine the network structure. From the definitions of local mutual information and FD, we can see that both deal with attribute values rather than attributes. In the real world, the interdependencies may vary when attributes take different values: for some test instances, attributes Xi and Xj are independent, whereas for other test instances Xi may depend on Xj. Classical Bayesian classifiers, e.g., TAN and KDB, which build the network structure by computing mutual information and conditional mutual information, cannot resolve such situations; the local KDB helps to remedy this limitation.
Another feature of our algorithm that makes it very suitable for data mining domains is its relatively small computational complexity. Computing the network structure of KDB requires O(n²mcv²) time (dominated by Step 2), whereas that of the local KDB requires only O(n²mc) time, where n is the number of attributes, m is the number of training instances, c is the number of classes, and v is the average number of discrete values that an attribute may take. Moreover, to classify an instance, both KDB and the local KDB require O(nck) time. Forming the additional two-dimensional probability estimate table for SER requires O(mn²v²) time, and classifying a single instance requires considering each pair of attributes to detect dependencies, which takes O(cn) time.

4. Experiments

4.1. Bias and Variance

The classification of each case in the test set is done by choosing, as the class label, the value of the class variable with the highest posterior probability. Classification accuracy is measured by the percentage of correct predictions on the test sets (i.e., using a zero-one loss function). Kohavi and Wolpert presented a bias-variance decomposition of the expected misclassification rate [21], which is a powerful tool from sampling-theory statistics for analyzing supervised learning scenarios. Suppose y and ŷ are the true class label and the label generated by a learning algorithm, respectively; the zero-one loss function is defined as
$\xi(y, \hat{y}) = 1 - \delta(y, \hat{y}),$
where δ(y, ŷ) = 1 if ŷ = y and zero otherwise. The bias term measures the squared difference between the average output of the target and that of the algorithm. It is defined as follows:
$bias = \frac{1}{2} \sum_{\hat{y}, y \in Y} \left[ P(\hat{y} \mid x) - P(y \mid x) \right]^2,$
where x is a combination of attribute values. The variance term is a real-valued, non-negative quantity that equals zero for an algorithm that always makes the same guess regardless of the training set. The variance increases as the algorithm becomes more sensitive to changes in the training set. It is defined as follows:
$variance = \frac{1}{2} \left[ 1 - \sum_{\hat{y} \in Y} P(\hat{y} \mid x)^2 \right].$
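For a single test point, the two terms can be estimated from the labels predicted by models trained on different resampled training sets, taking P(y|x) as a point mass on the true label (a common simplification). The sketch below follows the formulas above under that assumption; all names are illustrative.

```python
import numpy as np

def bias_variance_at_x(predictions, true_label, classes):
    """Kohavi-Wolpert style bias and variance at a single test point (a sketch).

    predictions -- labels produced by classifiers trained on different resampled
                   training sets; P(y_hat | x) is estimated by their frequencies.
    true_label  -- the observed class, used as a point-mass estimate of P(y | x)."""
    p_hat = {y: float(np.mean([p == y for p in predictions])) for y in classes}
    p_true = {y: 1.0 if y == true_label else 0.0 for y in classes}
    bias = 0.5 * sum((p_hat[y] - p_true[y]) ** 2 for y in classes)
    variance = 0.5 * (1.0 - sum(p_hat[y] ** 2 for y in classes))
    return bias, variance
```

Averaging these per-instance quantities over the test set gives the bias and variance figures reported in Tables 3 and 4.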

4.2. Statistical Results on UCI Data Sets

In order to verify the efficiency and effectiveness of the proposed AKDB, we conduct experiments on 41 data sets from the UCI machine learning repository. Table 1 summarizes the characteristics of each data set, including the number of instances, attributes and classes. Large data sets with an instance number greater than 3000 are annotated with the symbol “*”. Missing values for qualitative attributes are replaced with modes, and those for quantitative attributes are replaced with means from the training data. For each benchmark data set, numeric attributes are discretized using Minimum Description Length discretization [22]. The following techniques are compared:
  • NB, standard naive Bayes.
  • TAN, tree-augmented naive Bayes.
  • AODE, averaged one-dependence estimators.
  • KDB, standard k-dependence Bayesian classifier.
  • TAN-FDA [17], a variation of TAN that rebuilds MST for each testing instance.
  • AODE-SR [18], a variation of AODE that selects super parents and deletes extraneous children attributes.
  • LKDB (Local KDB), a variation of KDB that describes the local dependencies among attributes.
  • AKDB, a combination of KDB and local KDB.
All algorithms were coded in MATLAB 7.0 on a Pentium 2.93 GHz/2GB RAM computer. Base probability estimates P(y) and P(xj|y) with Laplace correction are defined as follows,
$\hat{P}(y) = \frac{\sum_{i=1}^{N} \delta(y_i, y) + 1}{N + t}, \qquad \hat{P}(x_j \mid y) = \frac{\sum_{i=1}^{N} \delta(x_{ij}, x_j)\,\delta(y_i, y) + 1}{\sum_{i=1}^{N} \delta(y_i, y) + t_j}$
where N is the number of training instances, t is the number of classes, yi is the class label of the ith training instance, tj is the number of values of the jth attribute, xij is the jth attribute value of the ith training instance, xj is the jth attribute value of the test instance, and δ(·) is a binary function that equals one if its two arguments are identical and zero otherwise. Thus, $\sum_{i=1}^{N} \delta(y_i, y)$ is the frequency with which the class label y occurs in the training data, and $\sum_{i=1}^{N} \delta(x_{ij}, x_j)\delta(y_i, y)$ is the frequency with which the class label y and the attribute value xj occur simultaneously in the training data.
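A direct transcription of these Laplace-corrected estimates into Python, assuming a NumPy array of discretized training attributes; the function and argument names are illustrative.

```python
import numpy as np

def laplace_estimates(X, y_vec, t, t_j):
    """Laplace-corrected base estimates (a sketch of the equations above).

    X     -- 2D NumPy array of discrete attribute values, one row per training instance
    y_vec -- 1D NumPy array of class labels
    t     -- number of classes
    t_j   -- sequence giving the number of values of each attribute j"""
    N = len(y_vec)

    def p_y(y):
        return (np.sum(y_vec == y) + 1) / (N + t)

    def p_xj_given_y(j, x_j, y):
        in_class = (y_vec == y)
        return (np.sum((X[:, j] == x_j) & in_class) + 1) / (np.sum(in_class) + t_j[j])

    return p_y, p_xj_given_y
```

Because every count is incremented by one, the returned probabilities are strictly positive, which keeps the log-space scoring of the classifiers well defined.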
Table 2 presents, for each data set, the average zero-one loss estimated by 10-fold cross-validation, which gives an accurate estimation of the average performance of an algorithm. The average bias and variance results are shown in Tables 3 and 4, respectively, in which only the 15 large data sets are included to ensure statistical significance. The average zero-one loss, bias or variance across multiple data sets provides a gross measure of relative performance. Statistically, a win/draw/loss (W/D/L) record is calculated for each pair of competitors A and B with respect to a performance measure M. The record represents the number of data sets in which A, respectively, wins, draws or loses against B on M. We assess a difference as significant if the outcome of a one-tailed binomial sign test is less than 0.05. Tables 5, 6 and 7 show the W/D/L records that correspond to zero-one loss, bias and variance, respectively.
Allowing more dependencies in KDB reduces zero-one loss significantly more often than it increases it. As more attributes are utilized for classification, increasing the value of k helps ensure that more causal relationships appear and are expressed in the joint probability distribution. By contrast, the AODE family, e.g., AODE and AODE-SR, which utilizes a restricted class of one-dependence estimators (ODEs), aggregates the predictions of all qualified estimators within this class. The superiority of the AODE family over single-structure classifiers, e.g., TAN, TAN-FDA and KDB, which are learned on the basis of classical information theory, can be attributed to the superiority of the aggregating mechanism. From Table 5, we observe that AKDB, the combination of KDB and the local KDB, enjoys a significant zero-one loss advantage relative to the other algorithms. Moreover, we applied the Friedman test (FT) [23,24], a non-parametric measure, to rank and compare the algorithms. The FT helps to compare and evaluate the overall prediction performance of different learning algorithms on numerous data sets. The best-performing algorithm receives rank 1, the second best rank 2, and so on; in case of ties, average ranks are assigned. Let $r_i^j$ be the rank of the j-th of k algorithms on the i-th of N data sets. The FT compares the average ranks of the algorithms, $R_j = \frac{1}{N} \sum_i r_i^j$. The experimental results of the FT are shown in Table 8, from which we can see that the order of the algorithms is {AKDB, AODE-SR, AODE, TAN-FDA, LKDB, KDB, TAN, NB} when comparing the results on all data sets. Thus, the effectiveness of AKDB is also confirmed from the perspective of the FT.
We need to further evaluate whether the local KDB works as an effective complementary part of AKDB. The relative zero-one loss/bias/variance ratio ϱ(·) is proposed to measure the extent to which the local KDB helps to improve the performance of KDB. A higher value of ϱ(·) corresponds to a smaller ratio between AKDB and KDB, which indicates better performance of AKDB.
$\varrho(Z) = 1 - \frac{\text{zero-one loss of AKDB}}{\text{zero-one loss of KDB}}, \qquad \varrho(B) = 1 - \frac{\text{bias of AKDB}}{\text{bias of KDB}}, \qquad \varrho(V) = 1 - \frac{\text{variance of AKDB}}{\text{variance of KDB}}$
The data sets in which AODE performs better than KDB are selected for comparison. Figures 4 and 5 show the experimental results for ϱ(·) with respect to zero-one loss, bias and variance. The index numbers of the data sets in Figures 4 and 5 correspond to those in Table 1. From Figure 4, we can clearly see that the local KDB works for all of these data sets, regardless of whether the data size is small or large.
Bias can be used to evaluate the extent to which the final model learned from the training data fits the entire data set. From Table 6, we can see that the fit of NB is the poorest because its structure is fixed regardless of the true data distribution. AKDB still performs the best, although the advantage is not significant. By calculating CMI from the global viewpoint and CLMI from the local viewpoint, the aggregating mechanism helps AKDB make full use of the information supplied by the training data and the test instance. The complicated relationships among attributes are measured and depicted from the viewpoint of information theory; thus, performance robustness can be achieved. In two data sets, KDB performs worse than AODE. From Figure 5, we observe that AKDB works in one data set.
With respect to variance, NB performs the best among these algorithms because its network structure is fixed and therefore insensitive to changes in the training set, as shown in Table 7. By contrast, KDB performs the worst. When k increases, the resulting network tends to have a complex structure; the network then has high variance because of the inaccurate probability estimation caused by the limited amount of training data. Meanwhile, for the local KDB, only the attribute values in the test instance are needed to compute the CLMI, so the negative effect caused by inaccurate probability estimation is mitigated significantly. Moreover, the FDs are extracted from the entire data set and are thus unrelated to the particular training set. From Figure 5, we observe that the local KDB exhibits significant complementary characteristics; moreover, in 9 of 11 data sets, AKDB performs better than KDB.

5. Conclusions

We propose building a local KDB as a complementary part of KDB to describe instance-specific situations, thereby retaining both the high-dependence representation characteristic of KDB and the aggregating mechanism of AODE. The final model, AKDB, has shown its superiority in the comparisons of zero-one loss, bias and variance. The local KDB is trained in the framework of KDB; similarly, applying the basic idea of the current work to other high-dependence Bayesian classifiers is possible.

Acknowledgments

This work was supported by the National Science Foundation of China (Grant Nos. 61272209 and 61300145), the Postdoctoral Science Foundation of China (Grant No. 2013M530980), and the Science & Technology Development Project of Jilin Province (No. 20150101014JC).

Author Contributions

All authors contributed to the study and the preparation of the article. The first author conceived the idea, derived the equations and wrote the paper. The second and third authors performed the analysis. The fourth author completed the programming work. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann: Burlington, MA, USA, 1988.
  2. Cheng, J.; Greiner, R.; Kelly, J.; Bell, D.; Liu, W. Learning Bayesian Networks from Data: An Information-Theory Based Approach. Artif. Intell. 2002, 137, 43–90.
  3. Jiang, L.X.; Cai, Z.H.; Wang, D.H. Improving Tree Augmented Naive Bayes for Class Probability Estimation. Knowl. Based Syst. 2012, 26, 239–245.
  4. Francisco, L.; Anderson, A. Bagging k-Dependence Probabilistic Networks: An Alternative Powerful Fraud Detection Tool. Expert Syst. Appl. 2012, 11583–11592.
  5. Bielza, C.; Larranaga, P. Discrete Bayesian Network Classifiers: A Survey. ACM Comput. Surv. 2014, 47, 1–43.
  6. Cooper, G.F. The Computational Complexity of Probabilistic Inference Using Bayesian Belief Networks. Artif. Intell. 1990, 42, 393–405.
  7. Dagum, P.; Luby, M. Approximating Probabilistic Inference in Bayesian Belief Networks is NP-Hard. Artif. Intell. 1993, 60, 141–153.
  8. Langley, P.; Iba, W.; Thompson, K. An Analysis of Bayesian Classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, CA, USA, 12–16 July 1992; pp. 223–228.
  9. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian Network Classifiers. Mach. Learn. 1997, 29, 131–163.
  10. Sahami, M. Learning Limited Dependence Bayesian Classifiers. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining; AAAI Press: Palo Alto, CA, USA, 1996; pp. 335–338.
  11. Watanabe, S. A Widely Applicable Bayesian Information Criterion. J. Mach. Learn. Res. 2013, 14, 867–897.
  12. Chaitankar, V.; Ghosh, P.; Perkins, E. A Novel Gene Network Inference Algorithm Using Predictive Minimum Description Length Approach. BMC Syst. Biol. 2010, 4, 107–126.
  13. Posada, D.; Buckley, T.R. Model Selection and Model Averaging in Phylogenetics: Advantages of Akaike Information Criterion and Bayesian Approaches over Likelihood Ratio Tests. Syst. Biol. 2004, 53, 793–808.
  14. Friedman, N.; Koller, D. Being Bayesian about Bayesian Network Structure: A Bayesian Approach to Structure Discovery in Bayesian Networks. Mach. Learn. 2003, 50, 95–125.
  15. Webb, G.I.; Boughton, J.; Wang, Z. Not So Naive Bayes: Aggregating One-Dependence Estimators. Mach. Learn. 2005, 58, 5–24.
  16. Zheng, F.; Webb, G.I. Subsumption Resolution: An Efficient and Effective Technique for Semi-Naive Bayesian Learning. Mach. Learn. 2012, 87, 1947–1988.
  17. Wang, L.M. Extraction of Belief Knowledge from a Relational Database for Quantitative Bayesian Network Inference. Math. Probl. Eng. 2013.
  18. Wang, L.M.; Wang, S.C.; Li, X.F.; Chi, B.R. Extracting Credible Dependencies for Averaged One-Dependence Estimator Analysis. Math. Probl. Eng. 2014.
  19. Shannon, C.E.; Weaver, W. The Mathematical Theory of Communication; University of Illinois Press: Champaign, IL, USA, 1949.
  20. De Raedt, L. Logic of Generality. In Encyclopedia of Machine Learning; Sammut, C., Webb, G.I., Eds.; Springer: New York, NY, USA, 2010; pp. 624–631.
  21. Kohavi, R.; Wolpert, D. Bias Plus Variance Decomposition for Zero-One Loss Functions. In Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, 3–6 July 1996; pp. 275–283.
  22. Fayyad, U.M.; Irani, K.B. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambery, France, 28 August–3 September 1993; pp. 1022–1029.
  23. Garcia, S.; Herrera, F. An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for All Pairwise Comparisons. J. Mach. Learn. Res. 2008, 9, 2677–2694.
  24. Friedman, M. The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. J. Am. Stat. Assoc. 1937, 32, 675–701.
Figure 1. The k-dependence relationships between attributes inferred from KDB.
Figure 2. The 2-dependence relationships between attributes after substitution-elimination resolution.
Figure 3. The basic and local relationships among {X1, X2, X3} and Y.
Figure 4. The comparison results of relative zero-one loss ratio.
Figure 5. The comparison results of relative bias/variance ratio.
Table 1. Data sets.
No. | Data Set | # Instance | Attribute | Class
1 | Abalone * | 4,177 | 9 | 3
2 | Adult * | 48,842 | 15 | 2
3 | Anneal | 898 | 39 | 6
4 | Audio | 226 | 70 | 24
5 | Balance Scale (Wisconsin) | 625 | 5 | 3
6 | Breast-cancer-w | 699 | 10 | 2
7 | Car | 1,728 | 8 | 4
8 | Chess | 551 | 40 | 2
9 | Connect-4 * | 67,557 | 43 | 3
10 | Contraceptive-mc | 1,473 | 10 | 3
11 | Credit | 690 | 16 | 2
12 | Cylinder-bands | 540 | 40 | 2
13 | Dermatology | 366 | 35 | 6
14 | Glass Identification | 214 | 10 | 3
15 | Heart Disease | 303 | 14 | 2
16 | Hepatitis | 155 | 20 | 2
17 | Hungarian | 294 | 14 | 2
18 | Iris | 150 | 5 | 3
19 | Kr-vs-kp * | 3,196 | 36 | 2
20 | Labor | 57 | 17 | 2
21 | LED | 1,000 | 8 | 10
22 | Letter-recog * | 20,000 | 16 | 26
23 | Localization * | 164,860 | 5 | 11
24 | Lung Cancer | 32 | 57 | 3
25 | Lymphography | 148 | 19 | 4
26 | Magic * | 19,020 | 11 | 2
27 | Mushroom * | 8,124 | 22 | 2
28 | Nursery * | 12,960 | 8 | 5
29 | Optdigits * | 5,620 | 64 | 10
30 | Poker-hand * | 1,025,010 | 11 | 10
31 | Primary Tumor | 339 | 18 | 22
32 | Satellite * | 6,435 | 37 | 6
33 | Segment | 2,310 | 20 | 7
34 | Shuttle * | 58,000 | 9 | 7
35 | Sick * | 3,772 | 30 | 2
36 | Spambase * | 4,601 | 58 | 2
37 | Teaching-ae | 151 | 6 | 3
38 | Vehicle | 846 | 19 | 4
39 | Vowel | 990 | 13 | 11
40 | Wine Recognition | 178 | 14 | 3
41 | Zoo | 101 | 17 | 7
Table 2. Experimental results of 0–1 loss.
Dataset | NB | TAN | AODE | TAN-FDA | AODE-SR | KDB | LKDB | AKDB
Abalone | 0.472 | 0.459 | 0.448 | 0.448 | 0.449 | 0.467 | 0.456 | 0.462
Adult | 0.158 | 0.138 | 0.149 | 0.133 | 0.130 | 0.138 | 0.132 | 0.129
Anneal | 0.038 | 0.009 | 0.009 | 0.009 | 0.008 | 0.009 | 0.013 | 0.009
Audio | 0.239 | 0.292 | 0.204 | 0.325 | 0.284 | 0.323 | 0.331 | 0.302
Balance Scale | 0.285 | 0.280 | 0.298 | 0.289 | 0.243 | 0.293 | 0.289 | 0.292
Breast-cancer-w | 0.026 | 0.042 | 0.036 | 0.038 | 0.046 | 0.074 | 0.036 | 0.041
Car | 0.140 | 0.057 | 0.082 | 0.037 | 0.037 | 0.038 | 0.061 | 0.028
Chess | 0.113 | 0.093 | 0.100 | 0.082 | 0.075 | 0.100 | 0.111 | 0.079
Connect-4 | 0.278 | 0.235 | 0.242 | 0.215 | 0.209 | 0.228 | 0.241 | 0.227
Contraceptive-mc | 0.504 | 0.489 | 0.494 | 0.485 | 0.484 | 0.500 | 0.482 | 0.481
Credit-a | 0.141 | 0.151 | 0.139 | 0.161 | 0.149 | 0.146 | 0.149 | 0.143
Cylinder-bands | 0.215 | 0.283 | 0.189 | 0.261 | 0.243 | 0.226 | 0.181 | 0.190
Dermatology | 0.019 | 0.033 | 0.016 | 0.045 | 0.048 | 0.066 | 0.044 | 0.029
Glass Identification | 0.262 | 0.220 | 0.252 | 0.215 | 0.160 | 0.220 | 0.199 | 0.201
Heart | 0.178 | 0.193 | 0.170 | 0.220 | 0.192 | 0.211 | 0.196 | 0.190
Hepatitis | 0.194 | 0.168 | 0.181 | 0.177 | 0.172 | 0.187 | 0.148 | 0.169
Hungarian | 0.160 | 0.170 | 0.167 | 0.160 | 0.160 | 0.180 | 0.184 | 0.166
Iris | 0.087 | 0.080 | 0.087 | 0.085 | 0.081 | 0.087 | 0.080 | 0.080
Kr-vs-kp | 0.121 | 0.078 | 0.084 | 0.045 | 0.078 | 0.042 | 0.047 | 0.037
Labor | 0.035 | 0.053 | 0.053 | 0.069 | 0.052 | 0.035 | 0.051 | 0.052
Led | 0.267 | 0.266 | 0.268 | 0.257 | 0.263 | 0.262 | 0.264 | 0.265
Letter-recog | 0.253 | 0.130 | 0.088 | 0.086 | 0.091 | 0.099 | 0.118 | 0.077
Localization | 0.496 | 0.358 | 0.360 | 0.291 | 0.297 | 0.296 | 0.319 | 0.277
Lung-cancer | 0.438 | 0.594 | 0.500 | 0.735 | 0.666 | 0.594 | 0.621 | 0.560
Lymphography | 0.149 | 0.176 | 0.169 | 0.212 | 0.209 | 0.237 | 0.169 | 0.172
Magic | 0.224 | 0.168 | 0.175 | 0.158 | 0.155 | 0.164 | 0.171 | 0.154
Mushrooms | 0.020 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
Nursery | 0.097 | 0.065 | 0.073 | 0.028 | 0.041 | 0.029 | 0.065 | 0.039
Optdigits | 0.077 | 0.041 | 0.031 | 0.034 | 0.029 | 0.037 | 0.049 | 0.031
Poker-hand | 0.499 | 0.330 | 0.481 | 0.311 | 0.207 | 0.196 | 0.050 | 0.053
Primary-tumor | 0.546 | 0.543 | 0.575 | 0.555 | 0.557 | 0.572 | 0.579 | 0.561
Satellite | 0.181 | 0.121 | 0.115 | 0.112 | 0.109 | 0.108 | 0.139 | 0.102
Segment | 0.079 | 0.039 | 0.034 | 0.040 | 0.043 | 0.047 | 0.033 | 0.033
Shuttle | 0.004 | 0.002 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001
Sick | 0.031 | 0.026 | 0.027 | 0.024 | 0.023 | 0.022 | 0.031 | 0.022
Spambase | 0.102 | 0.067 | 0.067 | 0.065 | 0.058 | 0.064 | 0.069 | 0.057
Teaching-ae | 0.497 | 0.550 | 0.490 | 0.519 | 0.507 | 0.536 | 0.467 | 0.460
Vehicle | 0.392 | 0.294 | 0.290 | 0.248 | 0.279 | 0.294 | 0.307 | 0.291
Vowel | 0.424 | 0.130 | 0.150 | 0.212 | 0.126 | 0.182 | 0.247 | 0.132
Wine | 0.497 | 0.034 | 0.023 | 0.033 | 0.028 | 0.023 | 0.048 | 0.022
Zoo | 0.030 | 0.010 | 0.030 | 0.031 | 0.030 | 0.050 | 0.027 | 0.009
Table 3. Experimental results of bias.
Dataset | NB | TAN | AODE | TAN-FDA | AODE-SR | KDB | LKDB | AKDB
Abalone | 0.403 | 0.320 | 0.330 | 0.321 | 0.321 | 0.328 | 0.311 | 0.312
Adult | 0.140 | 0.107 | 0.113 | 0.113 | 0.104 | 0.107 | 0.108 | 0.101
Connect-4 | 0.232 | 0.182 | 0.192 | 0.170 | 0.250 | 0.178 | 0.177 | 0.181
Kr-vs-kp | 0.111 | 0.066 | 0.069 | 0.035 | 0.071 | 0.038 | 0.040 | 0.032
Letter-recog | 0.229 | 0.175 | 0.182 | 0.164 | 0.167 | 0.165 | 0.168 | 0.159
Localization | 0.382 | 0.326 | 0.321 | 0.308 | 0.302 | 0.314 | 0.311 | 0.302
Magic | 0.198 | 0.135 | 0.161 | 0.133 | 0.139 | 0.132 | 0.131 | 0.133
Mushrooms | 0.039 | 0.000 | 0.000 | 0.001 | 0.002 | 0.000 | 0.000 | 0.000
Nursery | 0.073 | 0.051 | 0.051 | 0.046 | 0.061 | 0.041 | 0.041 | 0.040
Optdigits | 0.065 | 0.031 | 0.029 | 0.027 | 0.035 | 0.028 | 0.029 | 0.026
Poker-hand | 0.326 | 0.226 | 0.263 | 0.276 | 0.254 | 0.331 | 0.290 | 0.244
Satellite | 0.166 | 0.094 | 0.080 | 0.086 | 0.081 | 0.081 | 0.078 | 0.090
Shuttle | 0.085 | 0.045 | 0.039 | 0.032 | 0.032 | 0.039 | 0.031 | 0.031
Sick | 0.006 | 0.002 | 0.002 | 0.002 | 0.001 | 0.003 | 0.002 | 0.002
Spambase | 0.096 | 0.065 | 0.067 | 0.058 | 0.068 | 0.051 | 0.045 | 0.051
Table 4. Experimental results of variance.
Dataset | NB | TAN | AODE | TAN-FDA | AODE-SR | KDB | LKDB | AKDB
Abalone | 0.093 | 0.161 | 0.137 | 0.158 | 0.158 | 0.153 | 0.171 | 0.160
Adult | 0.027 | 0.066 | 0.044 | 0.071 | 0.055 | 0.073 | 0.081 | 0.051
Connect-4 | 0.095 | 0.088 | 0.110 | 0.097 | 0.040 | 0.104 | 0.109 | 0.086
Kr-vs-kp | 0.025 | 0.019 | 0.023 | 0.012 | 0.010 | 0.010 | 0.008 | 0.005
Letter-recog | 0.142 | 0.165 | 0.142 | 0.167 | 0.173 | 0.174 | 0.177 | 0.160
Localization | 0.191 | 0.256 | 0.219 | 0.254 | 0.258 | 0.271 | 0.281 | 0.249
Magic | 0.041 | 0.079 | 0.061 | 0.076 | 0.031 | 0.082 | 0.080 | 0.072
Mushrooms | 0.008 | 0.001 | 0.000 | 0.001 | 0.001 | 0.001 | 0.000 | 0.001
Nursery | 0.027 | 0.038 | 0.030 | 0.039 | 0.043 | 0.045 | 0.052 | 0.034
Optdigits | 0.025 | 0.028 | 0.024 | 0.030 | 0.025 | 0.032 | 0.032 | 0.029
Poker-hand | 0.209 | 0.231 | 0.214 | 0.279 | 0.216 | 0.328 | 0.287 | 0.262
Satellite | 0.021 | 0.041 | 0.049 | 0.046 | 0.051 | 0.053 | 0.049 | 0.040
Shuttle | 0.004 | 0.001 | 0.002 | 0.002 | 0.000 | 0.002 | 0.001 | 0.001
Sick | 0.006 | 0.007 | 0.005 | 0.007 | 0.005 | 0.006 | 0.006 | 0.007
Spambase | 0.010 | 0.017 | 0.014 | 0.020 | 0.007 | 0.024 | 0.025 | 0.016
Table 5. W/D/L comparison results of 0–1 loss on all data sets.
W/D/L | NB | TAN | AODE | TAN-FDA | AODE-SR | KDB | LKDB
TAN | 25/5/11
AODE | 24/12/5 | 14/13/14
TAN-FDA | 24/8/9 | 18/12/11 | 12/15/14
AODE-SR | 25/7/9 | 22/13/6 | 18/13/10 | 15/19/7
KDB | 22/10/9 | 15/13/13 | 14/12/15 | 9/20/12 | 7/11/23
LKDB | 24/6/11 | 12/14/15 | 10/15/16 | 14/11/16 | 11/11/19 | 13/12/16
AKDB | 27/7/7 | 21/18/2 | 20/14/7 | 23/14/4 | 15/19/7 | 23/15/3 | 22/17/2
Table 6. W/D/L comparison results of bias on large data sets.
W/D/L | NB | TAN | AODE | TAN-FDA | AODE-SR | KDB | LKDB
TAN | 14/1/0
AODE | 14/1/0 | 2/10/3
TAN-FDA | 14/1/0 | 9/3/3 | 8/5/2
AODE-SR | 14/0/1 | 4/5/6 | 6/5/4 | 4/5/6
KDB | 13/2/0 | 8/5/2 | 8/5/2 | 5/6/4 | 7/5/3
LKDB | 14/1/0 | 7/7/1 | 8/6/1 | 4/9/2 | 7/6/2 | 3/11/1
AKDB | 14/1/0 | 8/6/1 | 10/4/1 | 6/8/1 | 6/7/2 | 4/8/3 | 4/8/3
Table 7. W/D/L comparison results of variance on large data sets.
W/D/L | NB | TAN | AODE | TAN-FDA | AODE-SR | KDB | LKDB
TAN | 4/1/10
AODE | 4/4/7 | 10/1/4
TAN-FDA | 3/1/11 | 2/6/7 | 3/1/11
AODE-SR | 7/2/6 | 10/3/2 | 5/3/7 | 9/4/2
KDB | 3/2/10 | 2/3/10 | 2/2/11 | 2/4/9 | 0/5/10
LKDB | 3/2/10 | 3/2/10 | 2/3/10 | 4/1/10 | 2/2/11 | 4/10/1
AKDB | 4/1/10 | 3/9/3 | 4/1/10 | 8/6/1 | 5/2/8 | 12/1/2 | 12/1/2
Table 8. Ranks of different classifiers.
Dataset | NB | TAN | AODE | TAN-FDA | AODE-SR | KDB | LKDB | AKDB
Abalone | 5.5 | 5.5 | 1.5 | 1.5 | 5.5 | 5.5 | 5.5 | 5.5
Adult | 8.0 | 4.5 | 7.0 | 4.5 | 1.5 | 4.5 | 4.5 | 1.5
Anneal | 8.0 | 4.0 | 4.0 | 4.0 | 1.0 | 4.0 | 7.0 | 4.0
Audio | 2.0 | 4.5 | 1.0 | 7.0 | 3.0 | 7.0 | 7.0 | 4.5
Balance Scale | 5.5 | 2.0 | 5.5 | 5.5 | 1.0 | 5.5 | 5.5 | 5.5
Breast-cancer-w | 1.0 | 5.5 | 2.5 | 4.0 | 7.0 | 8.0 | 2.5 | 5.5
Car | 8.0 | 5.0 | 7.0 | 3.0 | 3.0 | 3.0 | 6.0 | 1.0
Chess | 7.5 | 4.0 | 5.5 | 2.5 | 1.0 | 5.5 | 7.5 | 2.5
Connect-4 | 8.0 | 6.0 | 6.0 | 1.5 | 1.5 | 3.5 | 6.0 | 3.5
Contraceptive-mc | 4.5 | 4.5 | 4.5 | 4.5 | 4.5 | 4.5 | 4.5 | 4.5
Credit-a | 2.0 | 5.5 | 2.0 | 8.0 | 5.5 | 5.5 | 5.5 | 2.0
Cylinder-bands | 4.5 | 8.0 | 2.0 | 7.0 | 6.0 | 4.5 | 2.0 | 2.0
Dermatology | 2.0 | 4.0 | 1.0 | 5.5 | 7.0 | 8.0 | 5.5 | 3.0
Glass Identification | 7.5 | 5.0 | 7.5 | 5.0 | 1.0 | 5.0 | 2.5 | 2.5
Heart | 1.5 | 4.5 | 1.5 | 7.5 | 4.5 | 7.5 | 4.5 | 4.5
Hepatitis | 7.5 | 2.5 | 5.0 | 5.0 | 5.0 | 7.5 | 1.0 | 2.5
Hungarian | 2.0 | 5.0 | 5.0 | 2.0 | 2.0 | 7.5 | 7.5 | 5.0
Iris | 6.5 | 2.5 | 6.5 | 6.5 | 2.5 | 6.5 | 2.5 | 2.5
Kr-vs-kp | 8.0 | 5.5 | 7.0 | 3.5 | 5.5 | 2.0 | 3.5 | 1.0
Labor | 1.5 | 5.0 | 5.0 | 8.0 | 5.0 | 1.5 | 5.0 | 5.0
Led | 4.5 | 4.5 | 4.5 | 4.5 | 4.5 | 4.5 | 4.5 | 4.5
Letter-recog | 8.0 | 7.0 | 3.5 | 2.0 | 3.5 | 5.0 | 6.0 | 1.0
Localization | 8.0 | 6.5 | 6.5 | 3.0 | 3.0 | 3.0 | 5.0 | 1.0
Lung-cancer | 1.0 | 5.0 | 2.0 | 8.0 | 7.0 | 5.0 | 5.0 | 3.0
Lymphography | 1.0 | 3.5 | 3.5 | 6.5 | 6.5 | 8.0 | 3.5 | 3.5
Magic | 8.0 | 6.0 | 6.0 | 3.5 | 1.5 | 3.5 | 6.0 | 1.5
Mushrooms | 8.0 | 7.0 | 6.0 | 5.0 | 4.0 | 3.0 | 2.0 | 1.0
Nursery | 8.0 | 5.5 | 7.0 | 1.5 | 3.5 | 1.5 | 5.5 | 3.5
Optdigits | 8.0 | 6.0 | 2.5 | 4.0 | 1.0 | 5.0 | 7.0 | 2.5
Poker-hand | 7.5 | 6.0 | 7.5 | 5.0 | 4.0 | 3.0 | 1.0 | 2.0
Primary-tumor | 1.5 | 1.5 | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 | 5.5
Satellite | 8.0 | 5.5 | 5.5 | 3.0 | 3.0 | 3.0 | 7.0 | 1.0
Segment | 8.0 | 4.5 | 2.0 | 4.5 | 6.0 | 7.0 | 2.0 | 2.0
Shuttle | 8.0 | 7.0 | 3.5 | 3.5 | 3.5 | 3.5 | 3.5 | 3.5
Sick | 7.5 | 5.5 | 5.5 | 3.5 | 3.5 | 1.5 | 7.5 | 1.5
Spambase | 8.0 | 6.0 | 6.0 | 3.5 | 1.5 | 3.5 | 6.0 | 1.5
Teaching-ae | 5.0 | 7.5 | 2.5 | 5.0 | 5.0 | 7.5 | 2.5 | 1.0
Vehicle | 8.0 | 6.0 | 3.0 | 1.0 | 3.0 | 6.0 | 6.0 | 3.0
Vowel | 8.0 | 2.0 | 4.0 | 6.0 | 2.0 | 5.0 | 7.0 | 2.0
Wine | 8.0 | 5.5 | 2.0 | 5.5 | 4.0 | 2.0 | 7.0 | 2.0
Zoo | 5.5 | 2.0 | 5.5 | 5.5 | 5.5 | 8.0 | 3.0 | 1.0
Avg | 5.8 | 5.0 | 4.4 | 4.5 | 3.8 | 4.9 | 4.8 | 2.8
