Article

Quadratic Mutual Information Feature Selection

Davor Sluga and Uroš Lotrič
University of Ljubljana, Faculty of Computer and Information Science, Ljubljana 1000, Slovenia
*
Author to whom correspondence should be addressed.
Entropy 2017, 19(4), 157; https://doi.org/10.3390/e19040157
Submission received: 13 December 2016 / Revised: 27 March 2017 / Accepted: 30 March 2017 / Published: 1 April 2017
(This article belongs to the Collection Advances in Applied Statistical Mechanics)

Abstract

We propose a novel feature selection method based on quadratic mutual information, which has its roots in the Cauchy–Schwarz divergence and Renyi entropy. The method estimates quadratic mutual information directly from the data samples using Gaussian kernel functions and can detect second-order non-linear relations. Its main advantages are (i) the unified analysis of discrete and continuous data, without any discretization; and (ii) its parameter-free design. The effectiveness of the proposed method is demonstrated through an extensive comparison with mutual information feature selection (MIFS), minimum redundancy maximum relevance (MRMR), and joint mutual information (JMI) on classification and regression problem domains. The experiments show that the proposed method performs comparably to the other methods on classification problems while being considerably faster. On regression problems it compares favourably to the others, but is slower.

1. Introduction

Modelling data using machine learning approaches usually involves taking some kind of learning machine (e.g., a decision tree, neural network, or support vector machine) and training a model on already known input and output data. For example, based on features collected about patients (gender, blood pressure, presence or absence of certain symptoms, etc.) and given the patients’ diagnoses (the outputs), we can build a model and use it afterwards as a diagnostic tool for new patients. The input features and the output can be discrete (e.g., gender) or continuous (e.g., body temperature). When the output is discrete we are dealing with a classification problem, and when it is continuous with a regression problem.
Many classification or regression problems involve high-dimensional input data. For example, gene expression data can easily reach into tens of thousands of features [1]. The majority of these features are either irrelevant or redundant for the given classification or regression task. A large number of features can lead to poor inference performance, possible over-fitting of the model, and increased training time [2].
To tackle these problems, feature selection algorithms try to select a smaller feature subset which is highly relevant to the output. A great number of approaches to feature selection exist; they can be divided into three main groups: wrapper, embedded, and filter. The wrapper approach [3] uses the performance of a learning machine to evaluate the relevance of feature subsets. Wrappers usually achieve good performance, but can be computationally costly and infeasible for use on large data sets. Moreover, their performance depends on the learning machine used in the evaluation. The embedded approach integrates feature selection into the learning machine itself and performs selection implicitly during the training phase. Embedded methods are faster [1], but still dependent on the learning machine. Filters are faster than both of the previous approaches and use a simple relevance criterion based on some measure, such as the correlation coefficient [4] or mutual information [5], to assess the goodness of a feature subset. The evaluation is independent of the learning machine and is less prone to over-fitting, but may fail to find the optimal feature subset for a given learning machine.
In addition to the relevance criterion, feature selection must also employ a search process that drives the selection. Optimally, an exhaustive search evaluates all possible feature subsets and selects the best one. This is usually computationally prohibitive, so greedy approaches like sequential search or random search [6] are used in practice.
In this work, we focus on the filter approach to feature selection and present a novel method based on quadratic mutual information. The method’s criterion has its roots in Cauchy–Schwarz divergence and quadratic Renyi entropy. Our motivation is the straightforward estimation of Cauchy–Schwarz divergence [7] for discrete features, continuous features, or their combination, which makes it suitable to use without any preprocessing dependent on expert knowledge about the data. Moreover, it avoids the use of parameters, which are inconvenient for non-experts in the field. It is possible to use it as a precursor to classification and regression problems in order to avoid over-fitting and to improve the learning machine performance.
The paper is organized as follows. Section 2 briefly reviews previous work on feature selection using information-theoretic measures and their generalizations. Section 3 presents the proposed measure and search organization for the task of finding relevant features. Section 4 presents the experimental setting and the results obtained on classification and regression problems. Lastly, Section 5 gives conclusions and possible future research directions.

2. Related Work

Many information-theoretic feature selection methods have been proposed in the last two decades. Brown et al. [8] and Vergara et al. [2] unified most of them in the mutual information feature selection framework. There are also a few cases of using mutual information derived from the Renyi [9] and Tsallis entropies [10] that show promising results. Chow and Huang [11] proposed the use of a data compression algorithm along with quadratic mutual information to perform feature selection, but their method is prone to over-fitting due to the estimation of the criterion in a high-dimensional space. Here we mention a few of the most well-known information-theoretic measures and criteria reviewed in [8], since they are all based on similar ideas.

2.1. Information-Theoretic Measures

Information-theoretic measures offer a means to rank feature subsets according to the information they provide about the output. A finite set of features $X = \{X_1, \ldots, X_N\}$, which can acquire values $x_1, \ldots, x_{m_1}$ with probabilities $p_1(x_1), \ldots, p_1(x_{m_1})$, has the Shannon entropy

$$ H(X) = -\sum_{i=1}^{m_1} p_1(x_i) \log p_1(x_i). \qquad (1) $$

Similarly, we can calculate the entropy of the output $H(Y)$, given the possible values $Y = \{y_1, \ldots, y_{m_2}\}$ with probabilities $p_2(y_1), \ldots, p_2(y_{m_2})$, and the joint entropy $H(X, Y)$, given the joint probabilities $p_{12}(x_i, y_j)$.
Another important information-theoretic measure is the Kullback–Leibler (KL) divergence, which measures the discrepancy between two probability distributions $p$ and $p'$,

$$ D_{\mathrm{KL}}(p, p') = \sum_{i=1}^{m_1} p(x_i) \log \frac{p(x_i)}{p'(x_i)}. \qquad (2) $$

The Kullback–Leibler divergence between the joint probability distribution $p_{12}(x, y)$ and the product distribution $p_1(x)\, p_2(y)$ is the mutual information (MI):

$$ I(X;Y) = D_{\mathrm{KL}}\big(p_{12}(x, y),\, p_1(x)\, p_2(y)\big). \qquad (3) $$
We can usually estimate the probability distributions using one of the histogram-based methods. When the features are continuous, one option is to apply a discretization step beforehand (equal width binning, equal frequency binning) [12]. The manual selection of the number of bins can affect the estimation of MI and can lead to spurious results by shrouding some properties of the probability distribution. A better approach is to perform the discretization using an adaptive technique like minimum description length (MDL) [13], but this does not work for continuous output and thus cannot be used in a regression problem.
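As a rough illustration of this histogram-based route (our sketch, not code from the paper), the snippet below estimates the MI between a continuous feature and a discrete output after equal-frequency binning; the number of bins is exactly the free parameter whose influence is discussed above.

```python
import numpy as np

def equal_frequency_bins(x, n_bins=5):
    """Discretize a continuous variable into bins holding roughly equal numbers of samples."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.digitize(x, edges)

def mutual_information(a, b):
    """Shannon MI between two discrete variables from their empirical joint histogram."""
    joint = np.histogram2d(a, b, bins=(np.unique(a).size, np.unique(b).size))[0]
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    nz = p_ab > 0
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = (x + 0.5 * rng.normal(size=1000) > 0).astype(int)    # output depends on x
print(mutual_information(equal_frequency_bins(x, 5), y))  # value changes with the bin count
```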
To avoid discretization, we can compute the differential mutual information directly from continuous data using the differential Kullback–Leibler divergence,

$$ D_{\mathrm{KL}}^{d}(p, p') = \int p(x) \log \frac{p(x)}{p'(x)}\, dx, \qquad (4) $$

$$ I^{d}(X;Y) = D_{\mathrm{KL}}^{d}\big(p_{12}(x, y),\, p_1(x)\, p_2(y)\big), \qquad (5) $$

but we must estimate the probability density functions $p_1(x)$, $p_2(y)$, and $p_{12}(x, y)$ beforehand.
The non-parametric Parzen-window method [14] is the most straightforward approach to density function estimation. The estimate is obtained by spanning kernel functions around the data samples, $\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} G(x - x_i, h)$. The most commonly used kernel function is the product of Gaussians, $G(\mathbf{x}, \mathbf{h}) = \prod_{d=1}^{D} G(x_d, h_d)$, where $D$ is the size of the feature set. The estimate depends on the choice of kernel widths $h_d$, for which there are several recipes in the literature [15]. However, the numerical computation of differential MI for a set of features is computationally expensive and prone to error. Another approach to differential MI estimation is the k-nearest neighbours (kNN) estimator [16] of MI, which in certain situations provides better results than the Parzen window, but is still computationally expensive and not suitable for direct use on data sets comprised of both discrete and continuous data [17]. A more recent approach is to estimate the density ratio in (4) directly. However, due to the logarithm in (4), this approach becomes computationally expensive and susceptible to outliers [18]. To alleviate this problem, the authors of [18] propose a squared-loss mutual information measure which makes the computation more robust.
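For concreteness, a minimal NumPy sketch of the Parzen-window estimate with a product-of-Gaussians kernel is shown below; it is our illustration rather than the authors' code, and the kernel widths are assumed constants.

```python
import numpy as np

def gauss(u, h):
    """One-dimensional Gaussian kernel G(u, h) with standard deviation h."""
    return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def parzen_density(query, samples, h):
    """Parzen-window estimate p_hat(query) from (n, D) samples with per-feature widths h."""
    kernels = gauss(query[None, :] - samples, h)   # shape (n, D): kernel value per sample, per feature
    return kernels.prod(axis=1).mean()             # product over features, average over samples

rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2))               # D = 2 continuous features
h = np.array([0.3, 0.3])                           # assumed kernel widths
print(parzen_density(np.zeros(2), samples, h))     # close to the N(0, I) density at the origin, 1/(2*pi)
```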
Besides the classical Shannon entropy, there exists a range of information entropy generalizations [19]. One of the more widely known is the Renyi entropy [20]

$$ H_{R_q}(X) = \frac{1}{1-q} \log \sum_{i=1}^{m_1} p_1(x_i)^q, \qquad (6) $$

which extends the original concept by introducing an additional parameter $q$. It should be noted that the Renyi entropy converges to the Shannon entropy as $q$ approaches 1 in the limit. Renyi also defined the differential Renyi entropy, in which the integral $\int p_1(x)^q\, dx$ substitutes for the sum $\sum_{i=1}^{m_1} p_1(x_i)^q$ in (6).
Usually, the estimation of differential entropy includes a probability density function (PDF) estimation from the data, followed by an integral estimation from the PDF, which is challenging in high-dimensional problems. Erdogmus et al. [21] showed that the quadratic Renyi entropy ($q = 2$) can be estimated directly from the data, bypassing the explicit need to estimate the PDF. Namely, the information potential $V(X) = \int p_1(x)^2\, dx$ can be estimated as

$$ V(x) = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{j=1}^{n} G(x_k - x_j, \sqrt{2}\,h), \qquad (7) $$

replacing the numerical integration of the PDF with sums over the data samples. The estimator of the differential quadratic Renyi entropy thus becomes

$$ \hat{H}_{R_2}(X) = -\log V(x). \qquad (8) $$
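The information potential and the quadratic Renyi entropy can be estimated in a few lines of NumPy. The sketch below follows (7) and (8) for one-dimensional data; it is our illustration, and the rule-of-thumb kernel width is an assumption.

```python
import numpy as np

def gauss(u, h):
    """Gaussian kernel G(u, h) with standard deviation h."""
    return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def information_potential(x, h):
    """V(x) = (1/n^2) sum_k sum_j G(x_k - x_j, sqrt(2) h), Eq. (7)."""
    diffs = x[:, None] - x[None, :]
    return gauss(diffs, np.sqrt(2.0) * h).mean()   # mean over n^2 pairs equals the double sum / n^2

def renyi_quadratic_entropy(x, h):
    """H_R2(X) = -log V(x), Eq. (8)."""
    return -np.log(information_potential(x, h))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
h = 1.06 * x.std() * x.size ** (-1 / 5)            # Silverman-style width for 1-D data (an assumption)
print(renyi_quadratic_entropy(x, h))
```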
There exist many proposals on how to define mutual information with regard to the Renyi entropy, but each lacks some of the properties that the Shannon mutual information exhibits [22]. One of the proposed measures, the Cauchy–Schwarz divergence

$$ D_{\mathrm{CS}}(p_1, p_2) = -\log \frac{\left(\int p_1(x)\, p_2(x)\, dx\right)^2}{\int p_1^2(x)\, dx \int p_2^2(x)\, dx} \qquad (9) $$

by Principe et al. [7], is especially suitable as a substitute for the Kullback–Leibler divergence, as it enables assessment of the dependence between variables directly from the data samples. By rearranging the above equation, we obtain

$$ D_{\mathrm{CS}}(p_1, p_2) = 2 H_{R_2}(X;Y) - H_{R_2}(X) - H_{R_2}(Y), \qquad (10) $$

where the first term $H_{R_2}(X;Y)$ is the quadratic Renyi cross-entropy [7], which can be estimated directly from the data using an approach similar to that for $\hat{H}_{R_2}(X)$:

$$ \hat{H}_{R_2}(X;Y) = -\log \frac{1}{n^{D+2}} \sum_{i=1}^{n} \left[ \sum_{k=1}^{n} G(y_i - y_k, \sqrt{2}\,h) \times \prod_{d=1}^{D} \sum_{j=1}^{n} G(x_{d,i} - x_{d,j}, \sqrt{2}\,h_d) \right]. \qquad (11) $$
On the basis of the Cauchy–Schwarz divergence (10), Principe et al. [7] proposed the quadratic mutual information (QMI)

$$ I_{\mathrm{CS}}(X;Y) = D_{\mathrm{CS}}\big(p_{12}(x, y),\, p_1(x)\, p_2(y)\big) \qquad (12) $$

as a candidate for measuring dependence. They prove that $I_{\mathrm{CS}}(X;Y) = 0$ if and only if $X$ and $Y$ are independent, and that it is positive otherwise, similarly to the Kullback–Leibler divergence.
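To make the construction concrete, the following sketch (ours, with assumed kernel widths) estimates $I_{\mathrm{CS}}(X;Y)$ of (12) for two one-dimensional variables by plugging Parzen estimates into the three integrals of the Cauchy–Schwarz divergence (9); dependent samples yield a clearly larger value than independent ones.

```python
import numpy as np

def gauss(u, h):
    return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def qmi_cs(x, y, hx, hy):
    """Quadratic mutual information I_CS(X;Y), Eq. (12), for 1-D x and y.

    All three information potentials are plug-in Parzen estimates with Gaussian
    kernels; pairwise kernel widths are sqrt(2)*h as in Eq. (7).
    """
    n = x.size
    Kx = gauss(x[:, None] - x[None, :], np.sqrt(2.0) * hx)    # n x n kernel matrix on x
    Ky = gauss(y[:, None] - y[None, :], np.sqrt(2.0) * hy)    # n x n kernel matrix on y
    v_joint = (Kx * Ky).sum() / n**2                          # integral of p12^2
    v_marg = (Kx.sum() / n**2) * (Ky.sum() / n**2)            # integral of (p1 * p2)^2
    v_cross = (Kx.sum(axis=1) * Ky.sum(axis=1)).sum() / n**3  # integral of p12 * p1 * p2
    return -np.log(v_cross**2 / (v_joint * v_marg))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y_dep = x + 0.3 * rng.normal(size=1000)          # dependent on x
y_ind = rng.normal(size=1000)                    # independent of x
h = 1.06 * 1000 ** (-1 / 5)                      # same Silverman-style width for both variables, for simplicity
print(qmi_cs(x, y_dep, h, h), qmi_cs(x, y_ind, h, h))   # first value noticeably larger
```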

2.2. Information-Theoretic Feature Selection Methods

Given a set of already-selected features $X_S = \{X_1, \ldots, X_M\}$ and a set of candidate features $X_C = \{X_{M+1}, \ldots, X_N\}$, Battiti [23] proposed to compute the mutual information feature selection (MIFS) criterion for each candidate feature $X_c$,

$$ S_{\mathrm{MIFS}}(X_c) = I(X_c;Y) - \beta \sum_{s=1}^{M} I(X_c;X_s), \qquad (13) $$

and add the feature with the maximum value to the set of already selected features. The criterion is a heuristic which takes into account first-order relevance $I(X_c;Y)$ and first-order redundancy $I(X_c;X_s)$. It includes the parameter $\beta$, which greatly affects performance [24].
Peng et al. [25] improved on the MIFS idea and proposed the minimum redundancy maximum relevance (MRMR) criterion, which uses MIFS with an automatic setting of the parameter $\beta$:

$$ S_{\mathrm{MRMR}}(X_c) = I(X_c;Y) - \frac{1}{M} \sum_{s=1}^{M} I(X_c;X_s). \qquad (14) $$
MRMR avoids using parameters, but still considers only first-order interactions.
Yang and Moody [26] used the joint mutual information (JMI) as a criterion for feature selection:

$$ S_{\mathrm{JMI}}(X_c) = \sum_{s=1}^{M} I(X_c, X_s; Y). \qquad (15) $$
This criterion considers second-order interactions between features and the output, thus increasing computational costs on one hand, but on the other hand also allowing detection of features which, when taken in pairs, provide more information about the output than the sum of both features’ individual contributions.
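For reference, the three criteria translate into a few lines of code. The sketch below is ours and assumes user-supplied estimators mi(a, b) and mi_joint(a, b, y) of the (joint) mutual information, for example histogram-based ones; X is an (n_samples, n_features) array and selected holds the indices of the already selected features.

```python
def mifs_score(c, selected, X, y, mi, beta=1.0):
    """S_MIFS = I(Xc;Y) - beta * sum_s I(Xc;Xs), Eq. (13)."""
    return mi(X[:, c], y) - beta * sum(mi(X[:, c], X[:, s]) for s in selected)

def mrmr_score(c, selected, X, y, mi):
    """S_MRMR = I(Xc;Y) - (1/M) * sum_s I(Xc;Xs), Eq. (14)."""
    penalty = sum(mi(X[:, c], X[:, s]) for s in selected) / max(len(selected), 1)
    return mi(X[:, c], y) - penalty

def jmi_score(c, selected, X, y, mi_joint):
    """S_JMI = sum_s I(Xc,Xs;Y), Eq. (15)."""
    return sum(mi_joint(X[:, c], X[:, s], y) for s in selected)
```

In each case the greedy search simply evaluates the score for every remaining candidate and moves the maximizer into the selected set.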
Several methods have been developed which go beyond second-order interactions [27,28,29]. The joint search for multiple features is difficult, as multidimensional probability distributions are hard to estimate, and becomes especially problematic when the number of samples is small. However, this is a favourable approach when seeking a small number of features, as some subtle interactions can be revealed. When using filter methods as a pre-processing stage for a machine learning task, it is usually better to select more features and give the learning machine more options to choose from, possibly finding higher-order interactions during the learning phase [30].
These methods are usually used on discrete or discretized data for classification problems. Frenay et al. [31] examined the adequacy of MI for feature selection in regression tasks and argue that in most cases it is a suitable criterion. However, regardless of whether feature selection is a precursor to a classification or a regression task, most problems arise from the difficulty of estimating the MI.

3. The Proposed Method

The quadratic mutual information (12) works as the basis for our feature selection method, because it can be computed directly from the data samples and works for both discrete and continuous features. Optimally, the method should assess every possible subset of feature candidates and select the subset with maximum QMI. However, evaluating all possible subsets of features is prohibitively time-consuming. Another problem is that the estimation of I CS is prone to over-fitting, especially if the number of samples is not much larger than the number of features in the subset. This is a common problem in machine learning when dealing with high-dimensional data. To cope with it, feature selection methods usually rank or select features iteratively one by one. Even if the features are added to the relevant set one by one, it is still important to consider possible interactions between them to prevent adding redundant features, or to include those that are not informative about the output on their own, but are useful when taken with other features.
The proposed method (Algorithm 1) selects features iteratively until it reaches an ending criterion, namely the number of features we want to have. At each step, the algorithm considers all possible candidates from the set of candidate features $X_C$. It checks each candidate feature $X_c$ against the already selected features $X_s \in X_S$ from the previous steps,

$$ S_{\mathrm{QMIFS}}(X_c) = S_{\mathrm{QMIFS}}(X_c, X_S, Y) = \begin{cases} I_{\mathrm{CS}}(X_c;Y) & \text{if } M = 0, \\ \sum_{s=1}^{M} \big[ I_{\mathrm{CS}}(X_c, X_s; Y) - I_{\mathrm{CS}}(X_c; X_s) \big] & \text{if } M > 0. \end{cases} \qquad (16) $$
It adds the candidate feature $X_c$ with the maximum $S_{\mathrm{QMIFS}}$ to the set of already selected features. In the beginning, $X_S$ is empty, so the algorithm considers only the quadratic mutual information between the candidates and the output. In later steps, the criterion function (16) is composed of sums of pairs of terms. The first term rewards candidate features that are the most informative about the output when taken together with an already selected feature. The second term penalizes features that have a strong correlation with already selected features. On one hand, this ensures the detection of features which work better in pairs: they provide more information about the output when taken together than the sum of both features’ individual contributions. On the other hand, it avoids selecting redundant features: the information they provide about the output is already present in one of the selected features. Extension of the criterion to include higher-order interactions between features is possible, but considerably increases the computational time and is more prone to over-fitting.
Algorithm 1: Quadratic mutual information feature selection (QMIFS)
Data: Set of candidate features $X_C$ and output $Y$
Result: Set of selected feature indices $S$
Standardize $X_C$ and $Y$
$X_S \leftarrow \emptyset$
$S \leftarrow \emptyset$
while ending condition not met do
  $S_{\max} \leftarrow 0$
  for $X_c \in X_C$ do
    $S_c \leftarrow S_{\mathrm{QMIFS}}(X_c, X_S, Y)$
    if $S_c > S_{\max}$ then
      $S_{\max} \leftarrow S_c$
      $X_{\max} \leftarrow X_c$
      $c_{\max} \leftarrow c$
    end
  end
  $X_C \leftarrow X_C \setminus \{X_{\max}\}$
  $X_S \leftarrow X_S \cup \{X_{\max}\}$
  $S \leftarrow S \cup \{c_{\max}\}$
end
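A compact Python rendering of Algorithm 1 is given below. It is a sketch under our own conventions: qmi(A, b) is an assumed estimator of $I_{\mathrm{CS}}$ between the feature block A and the vector b (for instance built on the kernel estimators sketched in Section 2.1), and the best score is initialised to minus infinity rather than zero so that a feature is always added.

```python
import numpy as np

def qmifs(X, y, n_select, qmi):
    """Greedy QMIFS selection (Algorithm 1, sketch).

    X : (n_samples, n_features) array, y : (n_samples,) array.
    qmi(A, b) is an assumed estimator of I_CS between the columns of A and the vector b.
    Returns the indices of the selected features, in selection order.
    """
    X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize the features ...
    y = (y - y.mean()) / y.std()                      # ... and the output
    candidates = list(range(X.shape[1]))
    selected = []
    while len(selected) < n_select:                   # ending condition: subset size reached
        best_score, best_c = -np.inf, None
        for c in candidates:
            if not selected:                          # M = 0: relevance term only
                score = qmi(X[:, [c]], y)
            else:                                     # M > 0: sum of pairwise terms, Eq. (16)
                score = sum(qmi(X[:, [c, s]], y) - qmi(X[:, [c]], X[:, s])
                            for s in selected)
            if score > best_score:
                best_score, best_c = score, c
        selected.append(best_c)
        candidates.remove(best_c)
    return selected
```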
There are a few considerations we must take into account before using this method to select features. Firstly, the estimation of $I_{\mathrm{CS}}$ depends heavily on the kernel width $h$ [7]. The Silverman rule [32] is a common way to estimate it, but the width $h_d$ must be the same across all features. If this is neglected, the value of the criterion function will vary even if all candidate features are equally relevant to the output [7], and the method will fail to choose the correct ones. We take care of this problem by standardizing the data, which in turn causes the Silverman rule to produce the same $h_d$ for every feature. Secondly, the magnitude of $I_{\mathrm{CS}}$ has no meaning [7], due to its dependence on the choice of window width. However, correct identification of the most relevant features requires only the correct relative ordering among them. That is, given two features $X_a$, $X_b$, and the output, and knowing that feature $X_a$ is more informative about the output than $X_b$, the $S_{\mathrm{QMIFS}}$ estimate is acceptable whenever $S_{\mathrm{QMIFS}}(X_a) > S_{\mathrm{QMIFS}}(X_b)$.
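The effect of standardization on the kernel widths can be checked directly. The sketch below uses one common form of Silverman's multivariate rule of thumb, $h_d = \sigma_d\,(4/((D+2)n))^{1/(D+4)}$; the exact constant used in the paper is not stated, so this is an assumption, but the point that standardization equalizes the widths holds for any rule proportional to $\sigma_d$.

```python
import numpy as np

def silverman_widths(X):
    """Silverman rule-of-thumb kernel widths, one per feature column (assumed form).

    h_d = sigma_d * (4 / ((D + 2) * n)) ** (1 / (D + 4)); after standardization
    every sigma_d equals 1, so all widths coincide, as the criterion requires.
    """
    n, D = X.shape
    return X.std(axis=0) * (4.0 / ((D + 2) * n)) ** (1.0 / (D + 4))

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3)) * np.array([1.0, 5.0, 0.2])   # features on very different scales
print(silverman_widths(X))                                    # widths differ per feature
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(silverman_widths(Xs))                                   # identical widths after standardization
```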
The following small-scale experiment nicely presents some of the important properties of the proposed criterion. We generate correlated data composed of two features $X_s$, $X_c$, and an output $Y$. All three are continuous, with 2000 samples drawn from a normal distribution with zero mean and unit variance. We assume that feature $X_s$ is already in the set of selected features and treat $X_c$ as the current candidate. Figure 1a shows how $S_{\mathrm{QMIFS}}(X_c)$ changes while keeping the correlation corr($X_c$, $X_s$) fixed at 0.1 and corr($X_s$, $Y$) at 0.6, and varying the correlation between $X_c$ and the output $Y$ from 0 to 1. As the correlation increases, $S_{\mathrm{QMIFS}}(X_c)$ also increases, but non-linearly. This behaviour is expected, since correlation is not directly comparable to quadratic mutual information. Figure 1b shows the opposite, namely how increasing the correlation between the features affects the criterion value. We fix the correlations corr($X_c$, $Y$) and corr($X_s$, $Y$) to 0.6 and vary corr($X_c$, $X_s$) from 0 to 1. The result shows that $S_{\mathrm{QMIFS}}$ penalizes redundant features: the higher the redundancy (represented here as inter-feature correlation), the lower the criterion value. These findings demonstrate that $S_{\mathrm{QMIFS}}$ follows the aforementioned property of guaranteeing the correct ordering of features.
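A toy version of this experiment can be set up by sampling jointly Gaussian triples with prescribed pairwise correlations. The helper below is hypothetical and covers only the data generation, with the criterion evaluation left as a comment; the correlation matrix must remain positive definite, which limits how far the swept correlation can go for the fixed values used here.

```python
import numpy as np

def correlated_triple(r_cs, r_cy, r_sy, n=2000, seed=0):
    """Draw (X_c, X_s, Y) from a zero-mean, unit-variance Gaussian with the given pairwise correlations."""
    C = np.array([[1.0,  r_cs, r_cy],
                  [r_cs, 1.0,  r_sy],
                  [r_cy, r_sy, 1.0]])
    rng = np.random.default_rng(seed)
    xc, xs, y = rng.multivariate_normal(np.zeros(3), C, size=n).T
    return xc, xs, y

# Sweep corr(X_c, Y) as in Figure 1a, with corr(X_c, X_s) = 0.1 and corr(X_s, Y) = 0.6 fixed.
for r in np.linspace(0.0, 0.8, 5):                    # kept <= 0.8 so the matrix stays positive definite
    xc, xs, y = correlated_triple(r_cs=0.1, r_cy=r, r_sy=0.6)
    print(r, np.corrcoef(xc, y)[0, 1])                # S_QMIFS(X_c) would be evaluated here at each point
```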
The authors of [33] present an efficient approach to speeding up the computation of $I_{\mathrm{CS}}$ with an insignificant loss of precision. The basic algorithm for computing $I_{\mathrm{CS}}$ has a time complexity of $O(n^2)$. They use a greedy incomplete Cholesky decomposition in order to achieve a computational complexity of $O(nd^2)$, where $d$ depends on the data. This approach is useful only when $d^2 < n$. In their work, they achieve substantial time savings on common data sets, so we adopted their approach in the computation of $I_{\mathrm{CS}}$.
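A generic pivoted incomplete Cholesky factorization of a kernel (Gram) matrix, of the kind used in [33], can be sketched as follows; the function name and interface are ours. The method touches only the pivoted rows of the matrix, which is where the $O(nd^2)$ cost comes from.

```python
import numpy as np

def incomplete_cholesky(kernel_row, diag, tol=1e-6, max_rank=None):
    """Greedy (pivoted) incomplete Cholesky factorization K ~= G @ G.T.

    kernel_row(i) must return row i of the kernel matrix and diag its diagonal,
    so only the pivoted rows of K are ever formed; with d retained columns the
    overall cost is O(n d^2) instead of the O(n^2) needed for the full matrix.
    """
    n = diag.size
    d_max = n if max_rank is None else max_rank
    G = np.zeros((n, d_max))
    residual = diag.astype(float).copy()
    for j in range(d_max):
        i = int(np.argmax(residual))                 # pivot on the largest residual diagonal entry
        if residual[i] < tol:                        # remaining approximation error is negligible
            return G[:, :j]
        piv = np.sqrt(residual[i])
        G[:, j] = (kernel_row(i) - G[:, :j] @ G[i, :j]) / piv
        residual -= G[:, j] ** 2
        np.clip(residual, 0.0, None, out=residual)   # guard against round-off
    return G

# Toy usage on a Gaussian Gram matrix (its diagonal entries are 1 for this kernel).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
h = 0.5
kernel_row = lambda i: np.exp(-0.5 * ((x - x[i]) / h) ** 2)
G = incomplete_cholesky(kernel_row, np.ones(x.size), tol=1e-4)
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
print(G.shape[1], np.abs(K - G @ G.T).max())         # retained rank d << n, small reconstruction error
```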

4. Results and Discussion

For our experiments, we use ten data sets; nine are from the UCI machine learning repository [34], and one is from a company which deals with web advertisement placement. To compare the methods over a wide variety of scenarios, we chose the data sets so that some include only discrete data, some only continuous data, and some mixed data. The experiments cover two problem domains: one deals with classification, the other with regression. Table 1 briefly summarizes the information about the data sets. For each data set it lists the number of instances, the number of features and their type, the type of the output, and the problem domain.

4.1. Experimental Methodology

We compare our method QMIFS to three other common and comparable methods which use an information-theoretic approach to feature selection: MIFS with β = 1 , MRMR, and JMI. These three methods all need discretization of the continuous features before using them. The results are obtained using Matlab R2016a running on an Intel i7-6820HQ processor (Intel, Santa Clara, CA, USA) with 16 GB of main memory.
The classification tree from the Matlab Statistics and Machine Learning Toolbox serves as the indirect performance evaluation tool in the classification problem domain. The MDL discretization procedure from WEKA [35] (which promises better results than the usual approach of equal-frequency or equal-width binning [12]) acts as the preprocessing step where needed. We evaluate the performance of the methods using the classification accuracy (CA), the area under the curve (AUC), the Youden index (Y-index, the difference between the true positive rate (TPR) and the false positive rate (FPR)) calculated at the optimal receiver operating characteristic (ROC) point, and the execution time.
In the regression problem domain, we assess the performance using the regression tree from the Matlab Statistics and Machine Learning Toolbox and measure the root-mean-square error (RMSE) along with the execution time. As the output is continuous, MDL discretization cannot be applied. Instead, equal-frequency binning is used, with five bins for every feature and the output. Equal-frequency binning usually works better than equal-width binning [12], and the empirical evidence from experimenting with MDL discretization shows that the number of bins per feature is often between three and seven.
In both problem domains, one thousand hold-out validations are performed on each data set. Each time, two thirds of randomly sampled instances act as the training set to build the model and the rest as the validation set to measure the performance. For each method, we vary the number of selected features: 3, 5, 7, or 10, and compare the results against the baseline performance where all features are used to train the model.
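For reproducibility, the evaluation protocol amounts to a repeated hold-out loop such as the one below; this is our sketch, with scikit-learn's decision tree standing in for Matlab's classification tree and the feature subset given by its column indices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def holdout_accuracy(X, y, feature_idx, n_repeats=1000, seed=0):
    """Repeated hold-out estimate of classification accuracy for a feature subset.

    Mirrors the protocol in the text: two thirds of the instances, sampled at
    random, train the model and the remaining third validates it.
    Returns the mean CA and its standard error over the repetitions.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    scores = np.empty(n_repeats)
    for r in range(n_repeats):
        perm = rng.permutation(n)
        split = 2 * n // 3
        train, test = perm[:split], perm[split:]
        model = DecisionTreeClassifier(random_state=0)
        model.fit(X[train][:, feature_idx], y[train])
        scores[r] = accuracy_score(y[test], model.predict(X[test][:, feature_idx]))
    return scores.mean(), scores.std(ddof=1) / np.sqrt(n_repeats)
```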
To get a clearer representation of the results in both problem domains, we rank the methods according to the measures CA, AUC, Y-index, and RMSE. Each method obtains a rank from 1 (best) to 4 (worst). Ranked values get the same (average) rank if their 95% confidence intervals overlap.

4.2. Classification Performance

Table 2 shows which features are selected by each method, and Table 3 summarizes the ranking of the methods for each test scenario for measures CA, AUC, and Y-index and their average ranks. The ranks imply that all three measures behave similarly, which is expected since the data sets are well balanced with respect to the number of class values. Table 4 shows a more detailed insight into the performance of the methods for seven selected features. It includes only the maximum standard error of the performance indexes, as standard errors across different methods are practically the same. Additionally, Figure 2 reveals how different numbers of selected features affect the performance in terms of CA.
Chess data set: The baseline performs better in this case; the learning machine can evidently handle all 36 features. CA drops by about 0.03 after reducing the number of features to seven, and all the methods show similar behaviour in prioritizing features. According to Table 3 and Table 4, our method is better than the others when selecting five, seven, or ten features, with regard to all three performance indexes. The time measurements in Table 4 show that it is also at least seven times faster at selecting seven features.
Breast Cancer data set: In this case, the classification tree benefits from feature selection, even with only three features selected. Table 3 shows that all methods perform similarly; the largest discrepancy among them is at five selected features, where JMI outperforms the others in all three performance indexes. Again, QMIFS is the fastest method, with a three-to-four times lower running time.
Ionosphere data set: All methods improve the performance compared to the baseline. Our method does not perform very well in terms of CA, AUC, or Y-index, even though Table 2 shows that five out of seven selected features are the same as in the best performing method—JMI. It ranks second at three selected features, but then falls behind when selecting more of them. However, in terms of execution time, it is again three-to-four times faster.
Sonar data set: Only a few features are common to all the methods, so the performance varies substantially between them. At three selected features, JMI and QMIFS work the best and offer similar performance, having the same AUC ranks, with CA and Y-index being worse for QMIFS. At five features, CA improves for all methods but is overall still worse than the baseline. Using seven features selected by MIFS, JMI, or QMIFS offers a considerable improvement in comparison to the baseline (3% better CA). The methods achieve the same ranks since the differences between them are small, causing the confidence intervals to overlap. At 10 selected features, all the methods offer improvement over the baseline, with JMI and MRMR having 2% better CA than QMIFS and MIFS. Our method again has the lowest execution time when selecting seven features.
Wine data set: Feature selection improves performance in comparison to the baseline, even though there are only 13 features in the data set. All methods select similar features, which manifests in similar performance. This can be seen in the rankings and in Table 4. QMIFS achieves the best ranks when selecting five, seven, or ten features, and is also at least 1.5 times faster than the other three.
Table 4 reveals that CA, AUC, and Y-index behave similarly because the data sets used are well balanced in terms of class values. In all cases except the Chess data set, the classification tree benefits from feature selection, with a 0.01–0.03 increase in CA. The differences between the methods in terms of CA, AUC, and Y-index are small; the relative difference is mostly below 1%. The execution times clearly show that our method is the fastest. Due to the similarity of the first-order methods MRMR and MIFS, their execution times are equal and lower than that of the second-order method (JMI). Even though the time measurements are given only for seven selected features, the behaviour is similar in all test cases. What makes the other methods considerably slower is the MDL discretization done beforehand, which produces a large amount of computational overhead.
Overall, QMIFS offers performance similar to the other methods in terms of CA, AUC, and Y-index. Its average ranks shown in Table 3 across all data sets and number of selected features are 2.4/2.3/2.4, placing it somewhere in the middle—better than MIFS and MRMR, but lagging behind JMI. The subtle differences in the rankings can be attributed to the fact that both QMIFS and JMI are second-order methods and can detect some more peculiar relations between features. The difference between QMIFS and JMI could be attributed to the superiority of MDL discretization compared to the direct estimation in the case of QMIFS.

4.3. Regression Performance

Table 5 shows which features are selected by each method, and Table 6 summarizes the ranking of the methods for each test scenario and the average ranks. Table 7 and Figure 3 show the RMSE for each method and data set. Additionally, execution times are presented in Table 7.
Communities data set: Table 7 shows that all methods improve the RMSE in all cases, even if we use only three features to train the model. This is expected, since the number of features in the data set is quite large (100) and difficult for the learning machine to tackle. Our method ranks last when selecting three or five features, but improves afterwards, with an RMSE comparable to the other three methods (second and third best at seven and ten selected features, respectively). The selected features vary much more across the methods on this data set, owing to the fact that there are many input features to begin with. Our method is slower than the other three by a factor of 1.5–3.
Parkinson Telemonitoring data set: There is only a small gain in the performance by using at least seven features chosen by MRMR. The top three ranking features across all the methods are very similar, with only JMI offering 6–8% lower RMSE in comparison to others. However, our method performs equally well as JMI for five and more features. The execution times are comparable, with the first-order methods being faster, which is expected.
Wine Quality data set: In some cases, feature selection offers an improvement in the regression performance even though the total number of features in the data set is only 11. Overall, our method and MRMR are superior to MIFS and JMI, selecting similar features and offering improvement over the baseline. The execution times behave similarly as in the previous case.
Housing data set: The baseline performs better here for the most part, but there are only 13 features in the data set, so the learning machine does not have a difficult task in training the model. Only our method shows a small performance (2%) benefit compared to baseline when using the top ten features, and it achieves the best overall performance among the four methods, with an average rank of 1.5. Execution times are roughly 30% higher for the second order methods.
Web Advertisement data set: Our method improves the model’s performance dramatically compared to the baseline and other feature selection methods, which all exhibit similar behaviour. The number of input features in the data set is large enough to pose a difficult task to the learning machine, so it benefits considerably from feature selection, at least when QMIFS is used. However, our method is much slower than the other three methods—by a factor of 3–6.
In terms of average RMSE ranks, our method outperforms the other three, achieving a value of 2.1 across all test cases. JMI and MRMR are tied for the second place, with average ranks of 2.4 and 2.5; MIFS is lagging behind, with an average rank of 3.1. These results suggest that without the possibility of using MDL to discretize the data, the other methods lag behind our approach. There are probably not many higher-order relations in the data, since JMI is comparable to MRMR in terms of overall performance. Obviously, the way in which underlying probability densities are estimated has a higher impact on the performance than the order of the method. We believe that QMIFS better distinguishes relations in the data than ad-hoc binning used in the other three methods.
Due to the higher variability of the dependent variable values in the regression problem domain, the incomplete Cholesky decomposition is not as effective, leading to longer execution times for our method. This is especially obvious on the Communities and Web Advertisement data sets. Additionally, equal-frequency binning causes much less computational overhead for MIFS, MRMR, and JMI than MDL discretization does, so they consequently outperform QMIFS in terms of execution time.

5. Conclusions

In this paper, we propose a quadratic mutual information feature selection method (QMIFS). Our goal was to detect second-order non-linear relations between the features and the output, similarly to joint mutual information. Additionally, we focused on the analysis of both discrete and continuous features and outputs, avoiding the intermediate step of estimating the underlying probability density functions using histograms or kernel density estimation. To achieve these goals, we employed the quadratic mutual information measure, as it enables direct estimation from the data samples. The measure itself does not exhibit all the properties intrinsic to the mutual information measure, and our method was therefore designed to compensate for these deficiencies.
We compare our method to three other methods based on information-theoretic measures: mutual information feature selection (MIFS), minimum redundancy maximum relevance (MRMR), and joint mutual information (JMI). The methods are compared indirectly, on the classification problem domain using models built by the classification tree learning machine, and on the regression problem domain using the regression tree learning machine. The results show that our method offers similar performance on the classification problem domain in terms of classification accuracy, area under the curve, and Youden index as the other three, but is considerably faster. When dealing with regression, it compares favourably to the others regarding root-mean-squared-error, but is slower.
We conclude that our method is universal, capable of feature selection in both the classification and the regression problem domain. QMIFS does not need an additional preprocessing step to estimate the probability density function, as is the case for the other three methods. This, and the fact that it avoids using parameters, makes it simple to use for non-experts in the field. The experiments show that the straightforward estimation of QMI from data samples using the quadratic Renyi entropy and Gaussian kernels does a good job of identifying the important information in the data. Additionally, it offers considerable execution time savings compared to other feature selection methods coupled with advanced discretization techniques like MDL.
Future research should go towards finding better estimators for the width of the kernel, which importantly affects estimation of QMI. Potential other measures could also be investigated for compatibility with our approach. Moreover, the computational cost of the QMI and other potential measures can be further reduced by using the fast Gauss transform, as proposed in [7].

Acknowledgments

This research was supported by the Slovenian Research Agency under grant P2-0241 (National research program Synergetics of complex systems and processes).

Author Contributions

Davor Sluga and Uroš Lotrič conceived and designed the experiments, analyzed the data, and wrote the paper; Davor Sluga performed the experiments. Both authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
  2. Vergara, J.R.; Estévez, P.A. A review of feature selection methods based on mutual information. Neural Comput. Appl. 2014, 24, 175–186. [Google Scholar] [CrossRef]
  3. Kohavi, R.; John, G.H. Wrappers for feature subset selection. Artif. Intell. 1997, 97, 273–324. [Google Scholar] [CrossRef]
  4. Hall, M.A. Correlation-based feature selection of discrete and numeric class machine learning. In Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, CA, USA, 29 June–2 July 2000; pp. 359–366. [Google Scholar]
  5. Fleuret, F. Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. 2004, 5, 1531–1555. [Google Scholar]
  6. Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
  7. Principe, J.C. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives; Springer Science & Business Media: New York, NY, USA, 2010. [Google Scholar]
  8. Brown, G. A new perspective for information theoretic feature selection. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS-09), Clearwater Beach, FL, USA, 16–18 April 2009; pp. 49–56. [Google Scholar]
  9. Gonçalves, L.B.; Macrini, J.L.R. Rényi entropy and Cauchy-Schwarz mutual information applied to mifs-u variable selection algorithm: A comparative study. Pesqui. Oper. 2011, 31, 499–519. [Google Scholar] [CrossRef]
  10. Sluga, D.; Lotric, U. Generalized information-theoretic measures for feature selection. In Proceedings of the International Conference on Adaptive and Natural Computing Algorithms, Lausanne, Switzerland, 4–6 April 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 189–197. [Google Scholar]
  11. Chow, T.W.; Huang, D. Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information. IEEE Trans. Neural Netw. 2005, 16, 213–224. [Google Scholar] [CrossRef] [PubMed]
  12. Garcia, S.; Luengo, J.; Sáez, J.A.; Lopez, V.; Herrera, F. A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 2013, 25, 734–750. [Google Scholar] [CrossRef]
  13. Irani, K.B. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambery, France, 28 August–3 September 1993; pp. 1022–1029. [Google Scholar]
  14. Parzen, E. On estimation of a probability density function and mode. Ann. Math. Stat. 1962, 33, 1065–1076. [Google Scholar] [CrossRef]
  15. Katkovnik, V.; Shmulevich, I. Kernel density estimation with adaptive varying window size. Pattern Recognit. Lett. 2002, 23, 1641–1648. [Google Scholar] [CrossRef]
  16. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138. [Google Scholar] [CrossRef] [PubMed]
  17. Walters-Williams, J.; Li, Y. Estimation of mutual information: A survey. In Proceedings of the International Conference on Rough Sets and Knowledge Technology, Gold Coast, QLD, Australia, 14–16 July 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 389–396. [Google Scholar]
  18. Sugiyama, M. Machine learning with squared-loss mutual information. Entropy 2012, 15, 80–112. [Google Scholar] [CrossRef]
  19. Beck, C. Generalised information and entropy measures in physics. Contemp. Phys. 2009, 50, 495–510. [Google Scholar] [CrossRef]
  20. Renyi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; pp. 547–561. [Google Scholar]
  21. Erdogmus, D.; Principe, J.C. Generalized information potential criterion for adaptive system training. IEEE Trans. Neural Netw. 2002, 13, 1035–1044. [Google Scholar] [CrossRef] [PubMed]
  22. Renyi, A. Some Fundamental Questions About Information Theory; Akademia Kiado: Budapest, Hungary, 1976; Volume 2. [Google Scholar]
  23. Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 1994, 5, 537–550. [Google Scholar] [CrossRef] [PubMed]
  24. Kwak, N.; Choi, C.H. Input feature selection for classification problems. IEEE Trans. Neural Netw. 2002, 13, 143–159. [Google Scholar] [CrossRef] [PubMed]
  25. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238. [Google Scholar] [CrossRef] [PubMed]
  26. Yang, H.; Moody, J. Feature selection based on joint mutual information. In Proceedings of the International ICSC Symposium on Advances in Intelligent Data Analysis, Rochester, NY, USA, 22–25 June 1999; pp. 22–25. [Google Scholar]
  27. Rajan, K.; Bialek, W. Maximally informative “stimulus energies” in the analysis of neural responses to natural signals. PLoS ONE 2013, 8, e71959. [Google Scholar] [CrossRef] [PubMed]
  28. Fitzgerald, J.D.; Rowekamp, R.J.; Sincich, L.C.; Sharpee, T.O. Second order dimensionality reduction using minimum and maximum mutual information models. PLoS Comput. Biol. 2011, 7, e1002249. [Google Scholar] [CrossRef] [PubMed]
  29. Rowekamp, R.J.; Sharpee, T.O. Analyzing multicomponent receptive fields from neural responses to natural stimuli. Netw. Comput. Neural Syst. 2011, 22, 45–73. [Google Scholar]
  30. Sánchez-Maroño, N.; Alonso-Betanzos, A.; Tombilla-Sanromán, M. Filter methods for feature selection—A comparative study. In Proceedings of the International Conference on Intelligent Data Engineering and Automated Learning, Birmingham, UK, 16–19 December 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 178–187. [Google Scholar]
  31. Frénay, B.; Doquire, G.; Verleysen, M. Is mutual information adequate for feature selection in regression? Neural Netw. 2013, 48, 1–7. [Google Scholar] [CrossRef] [PubMed]
  32. Silverman, B.W. Density Estimation for Statistics and Data Analysis; CRC Press: Boca Raton, FL, USA, 1986; Volume 26. [Google Scholar]
  33. Seth, S.; Príncipe, J.C. On speeding up computation in information theoretic learning. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2009), Atlanta, GA, USA, 14–19 June 2009; pp. 2883–2887. [Google Scholar]
  34. Lichman, M. UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml (accessed on 1 December 2016).
  35. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA data mining software: An update. ACM SIGKDD Explor. Newsl. 2009, 11, 10–18. [Google Scholar] [CrossRef]
Figure 1. Properties of the quadratic mutual information feature selection (QMIFS) criterion: (a) relevance of feature $X_c$ as its correlation with the output $Y$ increases; and (b) redundancy of feature $X_c$ as its inter-feature correlation with the already selected feature $X_s$ increases.
Figure 2. Performance of the feature selection methods in the classification problem domain, given in terms of the classification accuracy of a classification tree. Higher values mean better performance. JMI: joint mutual information; MIFS: mutual information feature selection; MRMR: minimum redundancy maximum relevance; QMIFS: quadratic MIFS.
Figure 3. Performance of the feature selection methods in the regression problem domain, given in terms of the RMSE achieved by a regression tree. Lower values indicate better performance.
Table 1. Properties of the used data sets. All but the Web Advertisement data set are from the UCI collection.

Data Set | Instances | Discrete Features | Continuous Features | Output | Problem Domain
Chess | 1000 | 36 | 0 | binary | Classification
Breast Cancer | 569 | 0 | 30 | binary | Classification
Ionosphere | 351 | 0 | 34 | binary | Classification
Sonar | 208 | 0 | 60 | binary | Classification
Wine | 178 | 0 | 13 | ternary | Classification
Communities | 1993 | 0 | 100 | continuous | Regression
Parkinson Telemonitoring | 1000 | 0 | 16 | continuous | Regression
Wine Quality | 1599 | 0 | 11 | continuous | Regression
Housing | 506 | 12 | 1 | continuous | Regression
Web Advertisement | 950 | 3 | 88 | continuous | Regression
Table 2. Classification problem domain: selected features. Each column lists the features added as the subset grows to 3, 5, 7, and 10 features.

Data Set | Method | 3 | 5 | 7 | 10
Chess | MIFS | 21, 10, 33 | 32, 9 | 1, 2 | 3, 4, 5
Chess | MRMR | 21, 10, 33 | 32, 15 | 8, 16 | 18, 6, 27
Chess | JMI | 21, 10, 33 | 32, 15 | 8, 16 | 18, 6, 7
Chess | QMIFS | 33, 10, 21 | 35, 6 | 8, 18 | 7, 15, 13
Breast Cancer | MIFS | 23, 22, 5 | 19, 10 | 12, 15 | 30, 18, 29
Breast Cancer | MRMR | 23, 22, 28 | 14, 27 | 21, 8 | 29, 25, 24
Breast Cancer | JMI | 23, 28, 24 | 8, 22 | 21, 4 | 7, 27, 14
Breast Cancer | QMIFS | 28, 2, 23 | 8, 27 | 21, 22 | 3, 7, 1
Ionosphere | MIFS | 5, 16, 2 | 18, 1 | 10, 30 | 32, 7, 24
Ionosphere | MRMR | 5, 16, 18 | 27, 3 | 6, 7 | 4, 34, 31
Ionosphere | JMI | 5, 6, 3 | 33, 8 | 17, 21 | 4, 13, 29
Ionosphere | QMIFS | 5, 6, 33 | 15, 8 | 7, 21 | 28, 31, 24
Sonar | MIFS | 11, 51, 36 | 44, 4 | 1, 2 | 3, 6, 7
Sonar | MRMR | 11, 51, 36 | 48, 12 | 9, 54 | 45, 4, 21
Sonar | JMI | 11, 4, 12 | 48, 9 | 21, 45 | 10, 36, 49
Sonar | QMIFS | 12, 27, 11 | 48, 10 | 16, 9 | 13, 49, 28
Wine | MIFS | 7, 1, 11 | 5, 3 | 4, 9 | 8, 10, 2
Wine | MRMR | 7, 1, 13 | 11, 10 | 12, 6 | 5, 2, 4
Wine | JMI | 7, 1, 13 | 11, 10 | 12, 6 | 2, 5, 4
Wine | QMIFS | 7, 1, 13 | 12, 10 | 11, 6 | 5, 9, 4
Table 3. Classification problem domain: ranking of the feature selection methods. Ranks calculated from the measures classification accuracy (CA), area under the curve (AUC), and Y-index (the difference between true positive rate, TPR, and false positive rate, FPR) are presented as triplets CA/AUC/Y-index. The Average column is the mean rank over the four subset sizes; the final rows give each method's average rank over all data sets.

Data Set | Method | 3 | 5 | 7 | 10 | Average
Chess | MIFS | 2.5/2.5/2.5 | 3/3/3 | 2/4/3 | 4/4/4 | 2.9/3.4/3.1
Chess | MRMR | 2.5/2.5/2.5 | 3/3/3 | 4/2.5/3 | 2.5/1.5/2.5 | 3.0/2.4/2.8
Chess | JMI | 2.5/2.5/2.5 | 3/3/3 | 3/2.5/3 | 2.5/3/2.5 | 2.8/2.8/2.8
Chess | QMIFS | 2.5/2.5/2.5 | 1/1/1 | 1/1/1 | 1/1.5/1 | 1.4/1.5/1.4
Breast Cancer | MIFS | 3.5/4/3.5 | 2.5/3/3 | 4/2.5/4 | 2/4/2.5 | 3.0/3.4/3.3
Breast Cancer | MRMR | 1.5/2/1.5 | 2.5/3/2 | 2/2.5/1.5 | 4/2/2.5 | 2.5/2.4/1.9
Breast Cancer | JMI | 3.5/2/3.5 | 1/1/1 | 2/2.5/1.5 | 2/2/2.5 | 2.1/1.9/2.1
Breast Cancer | QMIFS | 1.5/2/1.5 | 4/3/4 | 2/2.5/3 | 2/2/2.5 | 2.4/2.4/2.8
Ionosphere | MIFS | 3.5/3.5/3.5 | 1.5/3.5/2.5 | 1.5/2.5/1.5 | 2/2/2 | 2.1/2.9/2.4
Ionosphere | MRMR | 3.5/3.5/3.5 | 3.5/2/2.5 | 3.5/2.5/3 | 3/3.5/3 | 3.4/2.9/3.0
Ionosphere | JMI | 1/1/1 | 1.5/1/1 | 1.5/1/1.5 | 1/1/1 | 1.3/1.0/1.1
Ionosphere | QMIFS | 2/2/2 | 3.5/3.5/4 | 3.5/4/4 | 4/3.5/4 | 3.3/3.3/3.5
Sonar | MIFS | 3/3/3 | 1/1/1 | 2/2/2 | 3.5/3.5/3.5 | 2.4/2.4/2.4
Sonar | MRMR | 4/4/3 | 3.5/4/4 | 4/4/4 | 1.5/1.5/1.5 | 3.3/3.4/3.1
Sonar | JMI | 1/1.5/1 | 2/2.5/2 | 2/2/2 | 1.5/1.5/1.5 | 1.6/1.9/1.6
Sonar | QMIFS | 2/1.5/3 | 3.5/2.5/3 | 2/2/2 | 3.5/3.5/3.5 | 2.8/2.4/2.9
Wine | MIFS | 1/2.5/1 | 2.5/3/3 | 2.5/2.5/3 | 1.5/1.5/2.5 | 1.9/2.4/2.4
Wine | MRMR | 3/2.5/3 | 2.5/3/3 | 2.5/2.5/3 | 3.5/3/4 | 2.9/2.8/3.3
Wine | JMI | 3/2.5/3 | 2.5/3/3 | 2.5/2.5/3 | 3.5/3/2.5 | 2.9/2.8/2.9
Wine | QMIFS | 3/2.5/3 | 2.5/1/1 | 2.5/2.5/1 | 1.5/1.5/1 | 2.4/1.9/1.5
All data sets | MIFS | | | | | 2.5/2.9/2.7
All data sets | MRMR | | | | | 3.0/2.8/2.8
All data sets | JMI | | | | | 2.1/2.1/2.1
All data sets | QMIFS | | | | | 2.4/2.3/2.4
Table 4. Classification problem domain: values of the measures and the execution times for seven selected features. The values obtained with all features are shown for comparison. The maximum standard error for a given measure and data set is included in parentheses.

Data Set | Method | CA | AUC | Y-Index | FPR, TPR | Time (s)
Chess | MIFS | 0.934 | 0.956 | 0.865 | 0.106, 0.970 | 2.48
Chess | MRMR | 0.932 | 0.964 | 0.866 | 0.099, 0.965 | 2.51
Chess | JMI | 0.933 | 0.965 | 0.866 | 0.098, 0.965 | 2.68
Chess | QMIFS | 0.938 | 0.976 | 0.886 | 0.032, 0.918 | 0.32
Chess | All features | 0.969 (±4 × 10⁻⁴) | 0.985 (±3 × 10⁻⁴) | 0.943 (±7 × 10⁻⁴) | 0.032, 0.975 | (±0.04)
Breast Cancer | MIFS | 0.929 | 0.939 | 0.858 | 0.095, 0.953 | 1.53
Breast Cancer | MRMR | 0.932 | 0.942 | 0.868 | 0.080, 0.948 | 1.52
Breast Cancer | JMI | 0.935 | 0.944 | 0.873 | 0.074, 0.947 | 1.56
Breast Cancer | QMIFS | 0.932 | 0.941 | 0.866 | 0.080, 0.947 | 0.42
Breast Cancer | All features | 0.923 (±6 × 10⁻⁴) | 0.927 (±8 × 10⁻⁴) | 0.849 (±1 × 10⁻³) | 0.093, 0.941 | (±0.02)
Ionosphere | MIFS | 0.897 | 0.904 | 0.794 | 0.150, 0.944 | 1.15
Ionosphere | MRMR | 0.888 | 0.903 | 0.780 | 0.152, 0.931 | 1.17
Ionosphere | JMI | 0.897 | 0.915 | 0.796 | 0.138, 0.933 | 1.21
Ionosphere | QMIFS | 0.881 | 0.897 | 0.764 | 0.160, 0.924 | 0.39
Ionosphere | All features | 0.877 (±1 × 10⁻³) | 0.878 (±1 × 10⁻³) | 0.755 (±2 × 10⁻³) | 0.164, 0.919 | (±0.02)
Sonar | MIFS | 0.728 | 0.755 | 0.504 | 0.267, 0.770 | 1.18
Sonar | MRMR | 0.681 | 0.711 | 0.424 | 0.301, 0.725 | 1.20
Sonar | JMI | 0.731 | 0.759 | 0.510 | 0.280, 0.790 | 1.23
Sonar | QMIFS | 0.734 | 0.759 | 0.506 | 0.270, 0.777 | 0.38
Sonar | All features | 0.708 (±2 × 10⁻³) | 0.724 (±2 × 10⁻³) | 0.452 (±3 × 10⁻³) | 0.302, 0.754 | (±0.02)
Wine | MIFS | 0.911 | 0.919 | 0.826 | 0.050, 0.876 | 0.51
Wine | MRMR | 0.910 | 0.917 | 0.828 | 0.070, 0.898 | 0.49
Wine | JMI | 0.911 | 0.916 | 0.829 | 0.069, 0.898 | 0.54
Wine | QMIFS | 0.915 | 0.923 | 0.839 | 0.068, 0.907 | 0.29
Wine | All features | 0.906 (±1 × 10⁻³) | 0.912 (±1 × 10⁻³) | 0.813 (±3 × 10⁻³) | 0.066, 0.878 | (±0.02)
Table 5. Regression problem domain: selected features. Each column lists the features added as the subset grows to 3, 5, 7, and 10 features.

Data Set | Method | 3 | 5 | 7 | 10
Communities | MIFS | 45, 52, 67 | 95, 97 | 48, 36 | 24, 15, 89
Communities | MRMR | 45, 52, 4 | 41, 51 | 72, 69 | 18, 3, 50
Communities | JMI | 45, 4, 44 | 50, 69 | 51, 46 | 41, 3, 16
Communities | QMIFS | 45, 41, 78 | 42, 4 | 44, 68 | 29, 39, 16
Parkinson Telemonitoring | MIFS | 15, 12, 14 | 16, 8 | 5, 10 | 2, 13, 4
Parkinson Telemonitoring | MRMR | 15, 12, 14 | 8, 16 | 2, 10 | 5, 13, 7
Parkinson Telemonitoring | JMI | 15, 14, 2 | 6, 13 | 9, 4 | 10, 7, 11
Parkinson Telemonitoring | QMIFS | 15, 14, 13 | 2, 10 | 8, 4 | 11, 9, 6
Wine Quality | MIFS | 11, 10, 6 | 4, 9 | 2, 5 | 8, 7, 3
Wine Quality | MRMR | 11, 10, 2 | 5, 7 | 8, 4 | 9, 3, 6
Wine Quality | JMI | 11, 10, 8 | 3, 2 | 5, 7 | 1, 4, 6
Wine Quality | QMIFS | 11, 10, 2 | 8, 3 | 7, 1 | 9, 6, 5
Housing | MIFS | 13, 11, 4 | 6, 12 | 7, 9 | 8, 2, 10
Housing | MRMR | 13, 11, 6 | 12, 7 | 10, 4 | 3, 1, 5
Housing | JMI | 13, 6, 11 | 3, 1 | 10, 5 | 2, 7, 9
Housing | QMIFS | 13, 8, 3 | 6, 7 | 11, 5 | 10, 9, 4
Web Advertisement | MIFS | 29, 9, 21 | 44, 35 | 15, 13 | 12, 22, 6
Web Advertisement | MRMR | 29, 9, 21 | 44, 35 | 39, 13 | 12, 15, 22
Web Advertisement | JMI | 29, 4, 41 | 30, 28 | 10, 39 | 31, 3, 37
Web Advertisement | QMIFS | 3, 8, 40 | 2, 44 | 6, 16 | 36, 10, 45
Table 6. Regression problem domain: ranking of the feature selection methods. The Average column is the mean rank over the four subset sizes; the final rows give each method's average rank over all data sets.

Data Set | Method | 3 | 5 | 7 | 10 | Average
Communities | MIFS | 1.5 | 2 | 4 | 4 | 2.9
Communities | MRMR | 1.5 | 2 | 1 | 2 | 1.6
Communities | JMI | 3 | 2 | 2.5 | 1 | 2.1
Communities | QMIFS | 4 | 4 | 2.5 | 3 | 3.4
Parkinson Telemonitoring | MIFS | 3.5 | 3.5 | 4 | 1.5 | 3.1
Parkinson Telemonitoring | MRMR | 3.5 | 3.5 | 1 | 1.5 | 2.4
Parkinson Telemonitoring | JMI | 1 | 1.5 | 2.5 | 3.5 | 2.1
Parkinson Telemonitoring | QMIFS | 2 | 1.5 | 2.5 | 3.5 | 2.4
Wine Quality | MIFS | 3 | 4 | 2.5 | 2.5 | 3.0
Wine Quality | MRMR | 1.5 | 1 | 2.5 | 2.5 | 1.9
Wine Quality | JMI | 4 | 2.5 | 2.5 | 2.5 | 2.9
Wine Quality | QMIFS | 1.5 | 2.5 | 2.5 | 2.5 | 2.3
Housing | MIFS | 4 | 3.5 | 3.5 | 2 | 3.3
Housing | MRMR | 1.5 | 3.5 | 3.5 | 4 | 3.1
Housing | JMI | 1.5 | 2 | 2 | 3 | 2.1
Housing | QMIFS | 3 | 1 | 1 | 1 | 1.5
Web Advertisement | MIFS | 3.5 | 3.5 | 3 | 2 | 3.0
Web Advertisement | MRMR | 3.5 | 3.5 | 4 | 3 | 3.5
Web Advertisement | JMI | 2 | 2 | 2 | 4 | 2.5
Web Advertisement | QMIFS | 1 | 1 | 1 | 1 | 1.0
All data sets | MIFS | | | | | 3.1
All data sets | MRMR | | | | | 2.5
All data sets | JMI | | | | | 2.4
All data sets | QMIFS | | | | | 2.1
Table 7. Regression problem domain: values of the root-mean-square error (RMSE) and the execution times (in seconds) for different numbers of selected features. The RMSE (all) column holds the RMSE obtained with all features; the maximum standard errors are given in parentheses in the QMIFS rows.

Data Set | Method | RMSE (3) | RMSE (5) | RMSE (7) | RMSE (10) | RMSE (all) | Time (3) | Time (5) | Time (7) | Time (10)
Communities | MIFS | 0.176 | 0.180 | 0.186 | 0.188 | 0.190 | 0.97 | 1.43 | 1.87 | 2.49
Communities | MRMR | 0.176 | 0.180 | 0.180 | 0.181 | | 0.97 | 1.42 | 1.86 | 2.50
Communities | JMI | 0.182 | 0.181 | 0.183 | 0.183 | | 1.53 | 2.34 | 3.22 | 4.49
Communities | QMIFS | 0.190 | 0.190 | 0.183 | 0.185 | (±3 × 10⁻⁴) | 2.74 | 4.06 | 5.46 | 7.51 (±0.02)
Parkinson Telemonitoring | MIFS | 9.12 | 8.89 | 8.93 | 8.21 | 8.30 | 0.32 | 0.35 | 0.37 | 0.40
Parkinson Telemonitoring | MRMR | 9.11 | 8.92 | 8.23 | 8.23 | | 0.32 | 0.35 | 0.37 | 0.40
Parkinson Telemonitoring | JMI | 8.40 | 8.32 | 8.43 | 8.48 | | 0.38 | 0.43 | 0.48 | 0.54
Parkinson Telemonitoring | QMIFS | 8.83 | 8.34 | 8.42 | 8.45 | (±2 × 10⁻²) | 0.49 | 0.59 | 0.66 | 0.77 (±0.01)
Wine Quality | MIFS | 0.772 | 0.789 | 0.772 | 0.773 | 0.771 | 0.32 | 0.35 | 0.37 | 0.39
Wine Quality | MRMR | 0.760 | 0.758 | 0.773 | 0.773 | | 0.32 | 0.35 | 0.37 | 0.39
Wine Quality | JMI | 0.798 | 0.775 | 0.770 | 0.772 | | 0.36 | 0.42 | 0.45 | 0.48
Wine Quality | QMIFS | 0.760 | 0.774 | 0.771 | 0.770 | (±1 × 10⁻³) | 0.42 | 0.43 | 0.46 | 0.48 (±0.01)
Housing | MIFS | 5.49 | 5.12 | 5.12 | 4.74 | 4.63 | 0.29 | 0.30 | 0.30 | 0.32
Housing | MRMR | 5.08 | 5.10 | 5.06 | 4.93 | | 0.29 | 0.30 | 0.31 | 0.32
Housing | JMI | 5.08 | 5.03 | 4.85 | 4.90 | | 0.31 | 0.33 | 0.35 | 0.37
Housing | QMIFS | 5.25 | 4.99 | 4.61 | 4.54 | (±3 × 10⁻²) | 0.34 | 0.38 | 0.41 | 0.43 (±0.01)
Web Advertisement | MIFS | 3.746 | 3.757 | 3.668 | 3.466 | 3.59 | 0.51 | 0.60 | 0.69 | 0.82
Web Advertisement | MRMR | 3.746 | 3.755 | 3.703 | 3.526 | | 0.51 | 0.60 | 0.72 | 0.84
Web Advertisement | JMI | 3.562 | 3.699 | 3.567 | 3.582 | | 0.63 | 0.88 | 1.12 | 1.43
Web Advertisement | QMIFS | 2.953 | 2.956 | 3.174 | 3.175 | (±6 × 10⁻³) | 3.19 | 3.49 | 3.76 | 4.17 (±0.02)
