Article

Redundancy Is Not Necessarily Detrimental in Classification Problems

by Sebastián Alberto Grillo 1, José Luis Vázquez Noguera 1,*, Julio César Mello Román 1,2,3, Miguel García-Torres 1,4, Jacques Facon 5, Diego P. Pinto-Roa 1,2,3, Luis Salgueiro Romero 6, Francisco Gómez-Vela 4, Laura Raquel Bareiro Paniagua 1 and Deysi Natalia Leguizamon Correa 1

1 Computer Engineer Department, Universidad Americana, Asunción 1206, Paraguay
2 Facultad Politécnica, Universidad Nacional de Asunción, San Lorenzo 111421, Paraguay
3 Facultad de Ciencias Exactas y Tecnológicas, Universidad Nacional de Concepción, Concepción 010123, Paraguay
4 Data Science and Big Data Lab, Universidad Pablo de Olavide, 41013 Seville, Spain
5 Department of Computer and Electronics, Universidade Federal do Espírito Santo, São Mateus 29932-540, Brazil
6 Signal Theory and Communications Department, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain
* Author to whom correspondence should be addressed.
Mathematics 2021, 9(22), 2899; https://doi.org/10.3390/math9222899
Submission received: 24 September 2021 / Revised: 5 November 2021 / Accepted: 5 November 2021 / Published: 15 November 2021

Abstract: In feature selection, redundancy is one of the major concerns, since the removal of redundancy in data is connected with dimensionality reduction. Despite the evidence of such a connection, few works present theoretical studies regarding redundancy. In this work, we analyze the effect of redundant features on the performance of classification models. The contribution of this work can be summarized as follows: (i) we develop a theoretical framework to analyze feature construction and selection, (ii) we show that certain properly defined features are redundant but make the data linearly separable, and (iii) we propose a formal criterion to validate feature construction methods. The experimental results suggest that a large number of redundant features can reduce the classification error. They also imply that it is not enough to analyze features solely through criteria that measure the amount of information they provide.

1. Introduction

In classification, the quality of the information in the features is essential to building a high-quality predictive model. Furthermore, the rapid advances in data acquisition and storage technologies have led to high-dimensional data. However, noise, non-informative features, and redundancy, among other issues, make the classification task challenging [1]. Therefore, selecting suitable features is an important preliminary step for building highly predictive classifiers [2].
To reduce dimensionality, there are two main approaches—feature selection and feature construction. Feature selection selects a subset of features from the input to reduce the effects of noise or irrelevant features, while still providing good prediction results [2]. In contrast, feature construction refers to the task of transforming a given set of input features to generate a new set of more predictive features [3].
According to [2], feature selection can be divided into three major categories depending on the evaluation criteria—filter, wrapper, and embedded. Filter methods use intrinsic properties of the data to select a subset of features and are applied as a preprocessing task [4]. Wrappers, in contrast, use learning to guide the search. The learning bias is included in the search and, therefore, they achieve better results. However, they are computationally expensive [5,6] and cause overfitting [7]. Finally, embedded methods perform the search at the same time the model is learned.
However, we can also classify feature selection methods according to the strategy used to search for subsets of features: exponential search, sequential search, and random search [8]. Exponential search consists of the exhaustive evaluation of all possible subsets, which makes it impractical most of the time. Sequential search consists of the application of a local search method with a hill descent strategy [9,10]. The use of such strategies means that the search can get stuck in a local optimum. Finally, random search strategies consist of the application of metaheuristic optimization algorithms [11,12,13,14,15,16,17,18].
Despite the success of feature selection techniques, a good feature space is a prerequisite for achieving high performance in the classification. In this sense, feature construction aims to engineer new features to detect the hidden relations of the original features [19,20]. New features are constructed based on the relations of the original ones pursuing a more meaningful feature space capable of achieving a more accurate classifier [3]. As in the case of feature selection, in feature construction, we can find three approaches: filter, wrapper, and embedded methods [21]. Among the main approaches for constructing features, we have (i) methods based on decision trees [22,23], (ii) evolutionary meta-heuristics [24,25], (iii) the application of inductive logic programming [26,27], (iv) methods that use annotations with the training set [28], and (v) unsupervised methods such as clustering [29], PCA [30], or SVD [31].
In this work, we study the relationship between feature construction and the assumptions applied in selecting features. We denote as redundant a subset of features that does not provide more information than what already exists in the remaining features. We are particularly interested in analyzing the assumption that minimizing the number of redundant features is best for classification problems, and especially in how the defined features affect the capacity that a model requires to perform the classification. We first present a mathematical framework for modeling feature construction and selection in classification problems with discrete features. Second, we show that there are datasets where small feature subsets can be much more complex than large feature subsets. We measure complexity in terms of the capacity that the model requires to classify the problem, and we regard linearly separable problems as the least complex. This construction violates the assumption that fewer features with equal or more information are better than many features. Third, we extend the analysis of feature construction using monomials of degree k [32] and conclude that this method tends to produce linearly separable binary classification problems as k grows. Therefore, we propose that one way to validate feature construction methods is to analyze whether the classification problems tend to become linearly separable under the iterative application of the method. Finally, we apply the construction of features with monomials of degree k to real and artificial datasets, where we apply the following classification algorithms: naive Bayes [33], logistic regression [34], KNN [35], PART [36], JRIP [37], J48 [38], and random forest [39]. Experiments show that even though the number of redundant features grows considerably, the classification score increases or decreases only slightly. Therefore, both the theoretical and the experimental evidence agree that the criterion of choosing minimum feature subsets is not always correct, because this criterion considers only the information in the features and not the complexity of the classification problem.
The contributions of this work can be synthesized in the following items: (a) showing that the redundancy of features can reduce data complexity, (b) developing a theoretical framework to model the construction and selection of features, and (c) proposing a mathematical criterion to validate feature construction methods. The experiments performed suggest that the presence of redundant features does not necessarily harm classification tasks.
This work is organized into the following sections. Section 2 presents the mathematical formulation used to describe the theoretical results. Section 3 introduces basic ideas with simple examples, while Section 4 formalizes those ideas to more general results. Section 5 shows the experimental results, and finally, Section 6 presents a discussion of all results obtained.

2. A Mathematical Model for Feature Selection and Construction

In this section, we present a formal framework for the mathematical analysis of feature selection and construction. Let $\{A_i\}$ be a finite sequence of finite sets in $\mathbb{R}$ and let $C$ be another finite set, where each $A_i$ is denoted as feature $i$ and $C$ is the set of possible classes. Taking $A = A_1 \times A_2 \times \cdots \times A_n$, we consider a probability distribution $P$ over $A \times C$, and we denote by $P(\cdot)$ and $P(\cdot \mid \cdot)$ the probability and conditional probability determined by $P$, respectively. Notice that we may generate a dataset using the distribution $P$, where each record is an element from $A \times C$; we therefore call $P$ a dataset distribution. Denote by $\{\hat{A}_i\}$ the sequence such that $\hat{A}_i = A_i$ for $i \leq n$ and $\hat{A}_{n+1} = C$. Let $\{S_i\}$ be a subsequence of $\{\hat{A}_i\}$; we denote (i) $S = S_1 \times S_2 \times \cdots \times S_m$, (ii) if $s \in S$ then $s$ is denoted as a pattern of $S$, and (iii) $E_S^P(x)$ is denoted as the event where we sample an instance such that $s = x$ for a pattern $s$ of $S$ according to distribution $P$. We say that $s$ is a not-null pattern of $S$ if $P(E_S^P(s)) > 0$.
Notice that our definition of the dataset distribution is general enough to cover either a dataset or its real distribution. For example, given the dataset distribution $P$ in Table 1, we can take $A_1 = \{1, 2, 3\}$, $A_2 = \{1, 2\}$, $A_3 = \{0, 1, 2, 3\}$, and $C = \{0, 1\}$. As $S = S_1 \times S_3$ represents all possible values taken by the first and third features, if $s = (1, 1)$ is a pattern of $S$, then $E_S^P(s) = \{(1, 1, 1, 0), (1, 2, 1, 0), (1, 1, 1, 1), (1, 2, 1, 1)\}$ is the event where the first and third features have value one. Notice that $s$ is a not-null pattern because $P(E_S^P(s)) = 2/5$.
The following definition formalizes the notion of patterns that do not contradict each other.
Definition 1.
Let $\{B_i\}$ and $\{D_i\}$ be sub-sequences of $\{A_i\}$; we denote $B = B_1 \times B_2 \times \cdots \times B_p$ and $D = D_1 \times D_2 \times \cdots \times D_q$. Taking $b = (b_1, b_2, \ldots, b_p) \in B$ and $d = (d_1, d_2, \ldots, d_q) \in D$, we say that $b$ and $d$ are congruent patterns if $b$ and $d$ are not distinct in the features of $\{A_i\}$ preserved by both $\{B_i\}$ and $\{D_i\}$.
For example, take the dataset distribution $P$ of Table 1, $\{B_i\} = (A_1, A_2)$ and $\{D_i\} = (A_2, A_3)$. We have that $b = (1, 2) \in B$ and $d = (2, 1) \in D$ are congruent patterns, because they have the same value in their single shared feature. However, if $\hat{d} = (1, 2) \in D$, then $b$ and $\hat{d}$ are not congruent patterns, because they have different values for the second feature of the dataset.
Since a dataset distribution $P$ may not be consistent, we define a function $f_P : A \to C$ such that $P(E_C^P(f_P(a)) \mid E_A^P(a)) = \max_i P(E_C^P(i) \mid E_A^P(a))$ for all not-null patterns $a \in A$. Notice that an inconsistent dataset distribution always has a classification error because the classifier does not have enough features; $f_P$ then gives the category that minimizes the error for any configuration of features. If we consider the dataset distribution of Table 1, we must define an $f_P$ such that $f_P(1, 1, 0) = 0$, $f_P(2, 1, 2) = 1$, and $f_P(3, 1, 3) = 0$; for any other pattern $a \in A$ we can take 0 or 1 for $f_P(a)$.
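To make these objects concrete, the following sketch (a hypothetical toy distribution, not the one of Table 1) represents a dataset distribution $P$ as a Python dictionary of probabilities over $A \times C$ and computes one possible $f_P$ by picking, for each not-null pattern, a class of maximum conditional probability.

```python
from collections import defaultdict

# Hypothetical dataset distribution P over A x C, given as probabilities of each
# (pattern, class) pair. The pattern (1, 1) is inconsistent: it appears with
# class 0 (prob. 0.2) and with class 1 (prob. 0.1).
P = {
    ((1, 1), 0): 0.2,
    ((1, 1), 1): 0.1,
    ((1, 2), 1): 0.3,
    ((2, 1), 0): 0.25,
    ((2, 2), 1): 0.15,
}

def f_P(P):
    """Return a map a -> argmax_c P(c | a) over the not-null patterns of A."""
    joint = defaultdict(dict)                  # joint[a][c] = P(a, c)
    for (a, c), p in P.items():
        joint[a][c] = joint[a].get(c, 0.0) + p
    # For each not-null pattern a, pick a class of maximum conditional
    # probability; the normalizing factor P(a) does not change the argmax.
    return {a: max(classes, key=classes.get) for a, classes in joint.items()}

print(f_P(P))   # {(1, 1): 0, (1, 2): 1, (2, 1): 0, (2, 2): 1}
```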
Definition 2.
Let $P$ be a dataset distribution, $\{B_i\}$ a sub-sequence of the sequence $\{A_i\}$, and $B = B_1 \times B_2 \times \cdots \times B_p$. The subsequence $\{B_i\}$ of features is complete for $P$ if, for every class $c$ and all congruent not-null patterns $a$, $b$ of $A$, $B$, respectively, we have:
$$P(E_C^P(c) \mid E_A^P(a)) = P(E_C^P(c) \mid E_B^P(b)).$$
Definition 2 formalizes the notion of a subset of features with the same amount of information as all features as a whole. This notion of information considers that the subset of features is sufficient to estimate the class with the same probability as the original set of features.
Definition 3.
Maintaining the same terms of Definition 2, let $\hat{B}^k = \{\hat{B}_i\}$ be the sub-sequence of the sequence $\{B_i\}$ without the term $A_k$, and $\hat{B}^k = \hat{B}_1 \times \hat{B}_2 \times \cdots \times \hat{B}_q$. The subsequence $\{B_i\}$ of features is non-redundant for $P$ if, for every $k$, there is some class $c$ and some not-null congruent patterns $b$, $\hat{b}$ of $B$, $\hat{B}^k$, respectively, such that:
$$P(E_C^P(c) \mid E_B^P(b)) \neq P(E_C^P(c) \mid E_{\hat{B}^k}^P(\hat{b})).$$
Definition 3 formalizes the notion of a subset of features where each feature provides information that does not exist in other features of the subset. This notion of information considers that if we eliminate a feature from the subset, we will not obtain the same probability of obtaining a class. Under this definition of a non-redundant subset of features, we can say that the other features of the dataset are redundant because they can be eliminated without losing information in the dataset. We formulate Definition 4 for redundant features.
Definition 4.
Maintaining the same terms of Definition 2, let $\hat{A} = \{\hat{A}_i\}$ be the subsequence of $A$ obtained by eliminating the features of a subsequence $\{B_i\}$ from $A$. The subsequence $\{B_i\}$ of features is redundant for $P$ if, for every class $c$ and all not-null congruent patterns $a$, $\hat{a}$ of $A$, $\hat{A}$, respectively, we have:
$$P(E_C^P(c) \mid E_A^P(a)) = P(E_C^P(c) \mid E_{\hat{A}}^P(\hat{a})).$$
Taking the dataset distribution $P$ of Table 1 again, we can see that $(A_1, A_2)$ and $(A_3)$ are complete and non-redundant for $P$. Sub-sequences composed of individual non-constant features, such as $(A_1)$ and $(A_2)$, are non-redundant but not complete for $P$. Finally, sub-sequences such as $(A_1, A_2, A_3)$, $(A_2, A_3)$, and $(A_1, A_3)$ are complete but not non-redundant for $P$.
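Definitions 2-4 can also be checked by brute force on a small distribution. The sketch below uses a hypothetical three-feature distribution (not Table 1) in which feature 2 duplicates feature 0 and the class is the XOR of features 0 and 1; the helper functions and the toy distribution are ours, introduced only for illustration.

```python
# Brute-force check of completeness (Definition 2) and non-redundancy
# (Definition 3) on a hypothetical distribution: feature 2 copies feature 0
# and the class is XOR of features 0 and 1.
P = {
    ((0, 0, 0), 0): 0.25,
    ((0, 1, 0), 1): 0.25,
    ((1, 0, 1), 1): 0.25,
    ((1, 1, 1), 0): 0.25,
}

def cond_class_probs(P, idx):
    """P(class | pattern restricted to the feature indices idx), over not-null patterns."""
    joint, marg = {}, {}
    for (a, c), p in P.items():
        key = tuple(a[i] for i in idx)
        joint[(key, c)] = joint.get((key, c), 0.0) + p
        marg[key] = marg.get(key, 0.0) + p
    return {(key, c): q / marg[key] for (key, c), q in joint.items()}

def same_class_info(P, big, small, tol=1e-9):
    """True iff conditioning on `small` gives the same class probabilities as
    conditioning on `big`, for every class and congruent not-null pattern."""
    pb, ps = cond_class_probs(P, big), cond_class_probs(P, small)
    classes = {c for (_, c) in P}
    patterns = {tuple(a[i] for i in big) for (a, _) in P}
    pos = {i: j for j, i in enumerate(big)}   # position of each feature index in `big`
    return all(abs(pb.get((b, c), 0.0)
                   - ps.get((tuple(b[pos[i]] for i in small), c), 0.0)) < tol
               for b in patterns for c in classes)

def is_complete(P, idx):
    n = len(next(iter(P))[0])                 # total number of features
    return same_class_info(P, list(range(n)), list(idx))

def is_non_redundant(P, idx):
    return all(not same_class_info(P, list(idx), [i for i in idx if i != k]) for k in idx)

print(is_complete(P, [0, 1]), is_non_redundant(P, [0, 1]))        # True True
print(is_complete(P, [0, 1, 2]), is_non_redundant(P, [0, 1, 2]))  # True False
print(is_complete(P, [0, 2]), is_non_redundant(P, [0, 2]))        # False False
```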
Definition 5.
Let $P$ be a dataset distribution over $A \times C$. Let $\{B_i\}$ be a sequence of finite sets, $B = B_1 \times B_2 \times \cdots \times B_p$ and $\hat{A} = A \times B$. We say that a dataset distribution $Q$ over $\hat{A} \times C$ is an extension of $P$ if (i) $P(E_S^Q(s)) = P(E_S^P(s))$ for all $S = S_1 \times S_2 \times \cdots \times S_m$ and $s \in S$, where $\{S_i\}$ is a subsequence of $\{A_i\} \cup \{C\}$, and (ii) for every not-null pattern $a \in A$ there is some pattern $b \in B$ such that $P(E_B^Q(b) \mid E_A^Q(a)) = 1$.
Definition 5 formalizes the notion of feature construction. It consists of a new dataset distribution whose set of features contains the set of features of the original dataset with the same distribution according to the first property. However, according to the second property, the new distribution also contains new features whose values are entirely determined by the shared features.
Following the dataset distribution $P$ of Table 1, we denote by $\hat{P}$ the dataset distribution obtained from Table 1 by eliminating the feature $A_3$. Notice that $P$ is an extension of $\hat{P}$, because (i) as $\hat{P}$ is $P$ without a feature, they have the same probabilities for the common features, and (ii) if we know the values of features 1 and 2, then we know the value of feature 3 with probability 1 for any not-null pattern.

3. Features: Selection vs. Construction

In this section, we use the mathematical notions defined above to compare selection with feature construction. In this sense, feature selection is denoted as an elimination of features, while feature construction is denoted as incorporating new features.
Feature selection methods that do not involve the classifier in the selection are called filter methods. These methods apply some measure that seeks to obtain a subset of features containing the same amount of information as the original set but without any redundancy. The literature reports several such methods; however, they assume that a non-redundant set of features should be as small as possible. This condition can be described mathematically as obtaining a complete and non-redundant sub-sequence of features $\{B_i\}$ for $P$ that minimizes the number of selected features.
Mathematically we can define the construction of features from a dataset distribution P as any extension of P . Feature construction consists of computing new features from the original features. If the result ends up with more features than the original, we come across a method contrary to the minimization criterion of the feature selection by filtering methods.
One of the principles of feature selection by filter methods is that redundancy in features is detrimental. We call a feature redundant in the sense that all the information existing in the feature can be obtained from a subset of features that does not contain the feature itself. In that sense, the construction of features without a subsequent selection of features only produces redundant features. Formally, we are saying that if $Q$ is an extension of $P$ as constructed in Definition 5, then $P(E_C^P(c) \mid E_A^P(a)) = P(E_C^Q(c) \mid E_{\hat{A}}^Q(\hat{a}))$ for all $c \in C$ and all not-null congruent patterns $a$, $\hat{a}$ of $A$, $\hat{A}$, respectively. In other words, although the pattern $\hat{a}$ has extra features compared with $a$, those extra features do not modify the probability of obtaining any class $c$; therefore, they do not provide information.
The notion of feature construction introduced by Definition 5 does not add more information because the original features define the new features entirely. Therefore, we are interested in knowing what else can be provided by new features in case these features do not have more information than what already exists.
We analyze a simple example of a classification algorithm interacting with a constructed feature before presenting theorems with more general results. First, we consider the distribution of Table 2, where we assume that the original features are 1 and 2. For each pattern $a \in A$, feature 3 is defined as $a_3 = a_1^2$. Second, we consider a classifier based on the logistic model. If we denote $L(x) = \frac{1}{1 + e^{-x}}$ and the internal parameters or weights $v_0, v_1, v_2 \in \mathbb{R}$, the logistic model applied to the original features of a pattern $a \in A$ outputs 1 if $L(v_0 + a_1 v_1 + a_2 v_2) > \frac{1}{2}$ and 0 otherwise. Denoting another parameter $v_3 \in \mathbb{R}$, the logistic model applied to all features of a pattern $a \in A$ outputs 1 if $L(v_0 + a_1 v_1 + a_2 v_2 + a_3 v_3) > \frac{1}{2}$ and 0 otherwise. Notice that in Figure 1, if we apply the logistic model to the original features, we obtain a linear classifier on the plane of features 1-2 that cannot give the correct class to all instances. Therefore, this first model has under-fitting problems. However, if we take the second logistic model with the parameters $v_0 = 17/4$, $v_1 = 4$, $v_2 = 0$, and $v_3 = 1$, we obtain a non-linear model over the plane of features 1-2, with the region between Att. 1 = 1.5 and Att. 1 = 2.5 assigned to class 0 and the rest of the plane assigned to class 1. This second logistic model is equivalent to a third logistic model applied to the original features of a pattern $a \in A$ that outputs 1 if $L(v_0 + a_1 v_1 + a_2 v_2 + a_1^2 v_3) > \frac{1}{2}$ and 0 otherwise. We say that both logistic models are equivalent because they partition the plane of features 1-2 exactly as Figure 1 shows. In both the second and third models there is an extra parameter $v_3$ that modulates the non-linearity in the plane of features 1-2. The second model is a linear model over the space produced by features 1-3 and behaves non-linearly in the plane of features 1-2 due to feature 3. Instead, the third model is an inherently non-linear model for features 1-2 whenever $v_3$ is distinct from 0. Therefore, the construction of features can increase the representation capacity of the model and solve under-fitting problems like the one we observed with the first model.
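A minimal sketch of this effect, assuming scikit-learn is available and using hypothetical data shaped like the example above (class 0 only for the middle value of feature 1) rather than the actual Table 2: a logistic model is fitted on the original features and then on the features extended with $a_1^2$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data in the spirit of Table 2: feature 1 takes the values 1, 2, 3,
# feature 2 is uninformative, and the class is 0 only when feature 1 equals 2.
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [3, 1], [3, 2]], dtype=float)
y = np.array([1, 1, 0, 0, 1, 1])

# Logistic model on the original features: a single line in the plane of
# features 1-2 cannot isolate the middle band, so the model under-fits.
linear = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
print("original features:", linear.score(X, y))          # below 1.0

# Constructed (redundant) feature a_3 = a_1 ** 2: it adds no information,
# but the same linear model now carves a parabolic band in the original plane.
X_ext = np.hstack([X, X[:, [0]] ** 2])
extended = LogisticRegression(C=1e6, max_iter=1000).fit(X_ext, y)
print("with a_1^2 appended:", extended.score(X_ext, y))   # 1.0, now separable
```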

4. A Theoretical Analysis of Feature Construction

In this section, we present results that generalize what was stated in Section 3. The following theorem refutes the idea that the fewer features we use without losing information, the better for the classification problem.
Theorem 1.
Let $P$ be a non-constant dataset distribution over $A \times \{0, 1\}$ whose set of features $\{A_i\}$ is non-redundant in $P$. Take $p$ as the total number of not-null patterns in $P$ whose value under $f_P$ is the minority class between zero and one. For every integer $m$ between $p$ and $n + 1$, there is a set of $m$ features $\{B_i\}$ and an extension $Q$ of $P$ such that (i) $Q$ is a distribution over $\hat{A} \times \{0, 1\}$ where $\hat{A} = A \times B$, (ii) $\{B_i\}$ is a non-redundant set of features in $Q$, and (iii) there is a linear classifier that computes $f_P(a)$ from $b$ whenever $\hat{b} = (a, b)$ is a not-null pattern of $\hat{A}$.
Proof. 
Let $N$ be the set of not-null patterns of $A$ according to $P$. We denote by $\{N_i\}$ a partition of $N$ of size $m$, where each $N_i$ contains a pattern with value one and a pattern with value zero according to $f_P$. We also take $B_i = \{0, 1\}$ for all $i$. Then we construct $Q$: for each $a \in N$ we have $(a, b) \in \hat{A}$ such that, if $a \in N_k$, then $b_k = f_P(a)$ and $b_i = 0$ for all $i \neq k$. As $f_P(a)$ is fully determined by $a$, and $b$ is fully determined by $a$ together with $f_P(a)$, then $b$ is fully determined by $a$ and $Q$ is an extension of $P$.
For the second property, we denote $\hat{B}^i = B_1 \times \cdots \times B_{i-1} \times B_{i+1} \times \cdots \times B_m$ and three patterns $b, \tilde{b} \in B$, $\hat{b} \in \hat{B}^i$. We take $b$, $\hat{b}$ with all terms zero and $\tilde{b}$ with all terms zero except $\tilde{b}_i = 1$. Notice that $\hat{b}$ is congruent to the other patterns and all are not-null patterns, which implies that $E_B^Q(b)$, $E_B^Q(\tilde{b})$, and $E_{\hat{B}^i}^Q(\hat{b})$ are events with non-zero probability. Then we have:
$$P(E_C^Q(1) \mid E_B^Q(b)) = \frac{P(E_C^Q(1) \cap E_B^Q(b))}{P(E_B^Q(b))}$$
and:
$$P(E_C^Q(1) \mid E_{\hat{B}^i}^Q(\hat{b})) = \frac{P(E_C^Q(1) \cap E_{\hat{B}^i}^Q(\hat{b}))}{P(E_{\hat{B}^i}^Q(\hat{b}))} = \frac{P(E_C^Q(1) \cap E_B^Q(b)) + P(E_C^Q(1) \cap E_B^Q(\tilde{b}))}{P(E_B^Q(b)) + P(E_B^Q(\tilde{b}))}.$$
As $\frac{x}{y} < \frac{z}{w}$ implies that $\frac{x}{y} < \frac{x + z}{y + w}$ for positive real numbers $x, y, z, w$, and:
$$\frac{P(E_C^Q(1) \cap E_B^Q(b))}{P(E_B^Q(b))} < \frac{P(E_C^Q(1) \cap E_B^Q(\tilde{b}))}{P(E_B^Q(\tilde{b}))}.$$
Thus, we have:
$$P(E_C^Q(1) \mid E_B^Q(b)) < P(E_C^Q(1) \mid E_{\hat{B}^i}^Q(\hat{b})).$$
For the last property, we need to construct a logistic model that outputs one if $L\left(\sum_i b_i\right) > \frac{1}{2}$ and zero otherwise. Notice that this linear classifier computes $f_P(a)$ for every not-null pattern $\hat{b} = (a, b)$ of $Q$. □
Notice that $\{B_i\}$ can be much bigger than $\{A_i\}$. However, inferring the category labels from $\{A_i\}$ can be as complex as we want, while selecting the bigger set $\{B_i\}$ instead yields a problem that is solved by a linear classifier. Therefore, a feature selection method would choose $\{A_i\}$ over $\{B_i\}$ under the criterion of minimizing the number of features.
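The construction in the proof can be sketched on a hypothetical toy labelling as follows: the not-null patterns are split into $m$ groups, each pattern $a$ receives an indicator feature equal to $f_P(a)$ in the position of its group, and the sum of the constructed features then acts as the linear separator $L(\sum_i b_i)$ of the proof. The data and the grouping scheme below are ours, chosen only for illustration.

```python
# Sketch of the construction used in Theorem 1 (hypothetical toy data).
# labels[a] plays the role of f_P(a) for the not-null patterns of A.
labels = {
    (0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0,   # XOR-like labelling
    (2, 0): 0, (2, 1): 1,
}

ones  = [a for a, y in labels.items() if y == 1]
zeros = [a for a, y in labels.items() if y == 0]
m = min(len(ones), len(zeros))                    # number of constructed features

# Partition the patterns into m groups, each containing at least one pattern
# of each class (here: simple round-robin assignment by index).
group = {}
for i, a in enumerate(ones):
    group[a] = i % m
for i, a in enumerate(zeros):
    group[a] = i % m

# Constructed features: b_k(a) = f_P(a) if a belongs to group k, else 0.
def constructed(a):
    b = [0] * m
    b[group[a]] = labels[a]
    return b

# A linear classifier on the new features: predict 1 iff sum(b) > 0,
# i.e. the logistic rule L(sum_i b_i) > 1/2 from the proof.
for a, y in labels.items():
    assert (sum(constructed(a)) > 0) == (y == 1)
print("the constructed features make the toy problem linearly separable")
```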
Although we refer to complexity, there is no single measure of complexity for classification problems [40]. However, observe that classifiers that use a single variable, artificial neural networks with a single neuron, and the simplest SVM models are linear classifiers. Additionally, linear classifiers have a low VC dimension, equal to the number of features plus one [41]. Therefore, for our purposes, we consider linearly separable sets to be those with the least complexity.
Theorem 1 shows an extreme case where feature construction breaks a standard criterion for feature selection methods. However, the proof of the theorem does not provide a practical method for feature construction, because we can only build the features on the training set. Note that to construct the features $\{B_i\}$, we must know in advance the most probable class for each pattern in $A$; that is, we would first have to solve the classification problem using only the features of $\{A_i\}$, which defeats the purpose. Therefore, we now study a standard method for constructing features.
The following definition generalizes the construction of features using monomials, which was used as an example in Section 3. The idea is that there is a feature equivalent to each monomial of degree less than or equal to k from the original features.
Definition 6.
Taking the same terms from the previous definitions, we denote $P^k$ as a k-monomial extension of $P$ and $A^k$ as the product of features of $P^k$ if (i) for each $i$, there is a monomial function $f : A \to \mathbb{R}$ of degree equal to or less than $k$ such that $\hat{a}_i = f(\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_n)$ for each not-null pattern $\hat{a} \in A^k$, and (ii) for each monomial function $f : A \to \mathbb{R}$ of degree equal to or less than $k$ there is some $i$ such that $\hat{a}_i = f(\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_n)$ for each not-null pattern $\hat{a} \in A^k$.
For example, suppose that the dataset distribution $P$ has three features and denote by $(a_1, a_2, a_3)$ a pattern for those features. Then, a pattern from $P^2$ could be of the form $(a_1, a_2, a_3, a_1^2, a_2^2, a_3^2, a_1 a_2, a_1 a_3, a_2 a_3)$ and a pattern from $P^3$ could be of the form $(a_1, a_2, a_3, a_1^2, a_2^2, a_3^2, a_1 a_2, a_1 a_3, a_2 a_3, a_1^3, a_2^3, a_3^3, a_1^2 a_2, a_1^2 a_3, a_2^2 a_3, a_1 a_2^2, a_1 a_3^2, a_2 a_3^2, a_1 a_2 a_3)$.
Notice that Definition 6 does not give an explicit order for the new features; Definition 5 only guarantees that the first $n$ features of $P^k$ are the original features of $P$. The features $i$ of $P^k$ with $i > n$ are then functions of the first $n$ features of $P^k$ (which are also the features of $P$).
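A sketch of how a k-monomial extension can be computed for a numeric dataset; the function name and the NumPy-based implementation are ours, not part of the paper. scikit-learn's PolynomialFeatures(degree=k, include_bias=False) should produce the same set of columns, up to ordering.

```python
import itertools
import numpy as np

def k_monomial_extension(X, k):
    """Append, to the columns of X, one column per monomial of degree 2..k
    in the original features (the degree-1 monomials are X itself)."""
    n = X.shape[1]
    cols = [X]
    for degree in range(2, k + 1):
        # combinations_with_replacement enumerates each monomial of this degree
        # once, e.g. (0, 0, 2) stands for a_1^2 * a_3 when n = 3 and degree = 3.
        for idx in itertools.combinations_with_replacement(range(n), degree):
            cols.append(np.prod(X[:, list(idx)], axis=1, keepdims=True))
    return np.hstack(cols)

# Example with 3 original features, as in the text: P^2 has 9 and P^3 has 19 features.
X = np.arange(12, dtype=float).reshape(4, 3)
print(k_monomial_extension(X, 2).shape)   # (4, 9)
print(k_monomial_extension(X, 3).shape)   # (4, 19)
```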
The following theorem describes how the feature construction method described in Definition 6 can reduce the complexity of the classification problem.
Theorem 2.
For every dataset distribution $P$ over $A \times \{0, 1\}$, there is some $k$ such that some linear classifier computes $f_{P^k}$ from the not-null patterns of $P^k$.
Proof. 
Let $P$ be a dataset distribution over $A \times C$ whose features $A_i$ have more than one possible value, without loss of generality. We denote (i) the minimum absolute difference between values of the feature $A_i$ as $\beta_i$, (ii) the difference between the maximum and minimum values of the feature $A_i$ as $\delta_i$, and (iii) the maximum of $\delta_i / \beta_i$ as $D$. Then, from $a \in A$ we define the function $g(a) = \sum_i a_i (3D)^{i-1}$, which is a polynomial of degree 1 in the terms of $a$. Notice that $g$ is an injective function if we take $A$ as the domain. We denote by $\mathcal{P}$ the Lagrange polynomial such that $\mathcal{P}(g(a)) = f_P(a)$. Let $k$ be the maximum degree of a monomial from $\mathcal{P}(g(a))$ when expanded in the variables $a_i$. Then $\mathcal{P}(g(a)) = \hat{\mathcal{P}}(\hat{a})$ for some polynomial $\hat{\mathcal{P}}$ of degree 1 and $\hat{a} \in A^k$. Although $\hat{\mathcal{P}}$ is a regression model, it takes only the values zero or one on the patterns $\hat{a} \in A^k$ and can therefore be taken as a linear classification model. □
We present an example in Table 3: the first two columns, together with the class, correspond to a dataset distribution $P$ for an XOR function, which is not linearly separable. However, the 2-monomial extension $P^2$ is a linearly separable dataset distribution.
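The following sketch, assuming scikit-learn, reproduces the spirit of Table 3: a weakly regularized logistic model fails on the raw XOR features but fits perfectly once the degree-2 monomials are appended.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# XOR over two binary features: not linearly separable in the original space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# 2-monomial extension: append a1^2, a2^2 and a1*a2. For binary features
# a1^2 = a1 and a2^2 = a2, so the genuinely new column is the product a1*a2.
X2 = np.hstack([X, X ** 2, X[:, [0]] * X[:, [1]]])

clf = LogisticRegression(C=1e6, max_iter=1000)   # weak regularization
print(clf.fit(X, y).score(X, y))    # 0.5: no linear boundary fits XOR
print(clf.fit(X2, y).score(X2, y))  # 1.0: the 2-monomial extension is separable
```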
Definition 7.
Let $\{P_i\}$ be a sequence of dataset distributions. If $P_{i+1}$ is an extension of $P_i$ for all $i$, then $\{P_i\}$ is a progressive sequence of dataset distributions.
This definition seeks to formalize the notion of a feature construction method that is applied iteratively, producing an unbounded quantity of new features. For example, if we construct each k-monomial extension of $P$ such that the features of $P^k$ have the same indices in $P^{k+1}$, then $\{P^k\}$ is a progressive sequence of dataset distributions.
Definition 8.
We say that a feature construction method is linearly asymptotic if, for every dataset distribution $P$ over $A \times \{0, 1\}$, the method produces a progressive sequence of dataset distributions $\{P_i\}$ such that there is some $k$ and a linear classifier that can compute $f_{P_k}$ from $P_k$.
Definition 8 expresses a desirable property for any feature construction method. This property is equivalent to the feature construction method never getting stuck on patterns that are not linearly separable. Proving that a feature construction method is linearly asymptotic therefore represents a formal validation of the method. For example, by Theorem 2, we conclude that the k-monomial construction method is linearly asymptotic.
Note that this desired property is similar to the kernel trick exploited by SVM models, where the data are mapped to a higher-dimensional space such that a low-capacity classifier can separate the classes [42].

5. Experimental Results

In this section, we present the experimental results. We analyze the accuracy under the application of classification algorithms on pre-processed real and artificial datasets with their k-monomial extensions. The classification algorithms used are Naive Bayes, logistic regression, KNN, PART, JRIP, J48, and random forest. The classifiers mentioned were executed using the Waikato Environment for Knowledge Analysis (Weka) software [43].
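The experiments themselves were run in Weka; the sketch below is only a rough Python analogue of the protocol (10-fold cross-validation per classifier) on a synthetic stand-in dataset. PART and JRip have no direct scikit-learn counterparts, so a single decision tree stands in for the rule and tree learners.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Stand-in binary classification problem (the paper uses the datasets below).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

models = {
    "naive Bayes": GaussianNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "decision tree (J48-like)": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```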

5.1. Datasets from Real Classification Problems

The real data correspond to the Speaker Accent Recognition dataset [44], Algerian Forest Fires dataset [45], Banknote Authentication dataset [46], User Knowledge Modeling dataset [47], Glass Identification dataset [48], Wine Quality dataset [49], Somerville Happiness Survey dataset [50], Melanoma dataset, and Pima Indians Diabetes dataset [51]. As the experimental analysis is limited to binary classification problems, we took only the instances that belong to one of the two majority classes in the case of the Speaker Accent Recognition dataset, User Knowledge Modeling dataset, Glass Identification dataset, and Wine Quality dataset.
Before the analysis, we applied the k-monomial extension for k = 2 and 3 to the datasets, obtaining two new datasets per original dataset. Finally, we applied the normalization
$$f(a_i) = \frac{a_i - \inf A_i}{\sup A_i - \inf A_i}$$
to all datasets and features $A_i$, where $a_i \in A_i$. Table A1 shows more details about the datasets and their k-monomial extensions.
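A sketch of this normalization applied column-wise to a numeric matrix; the guard for constant columns is our addition, and scikit-learn's MinMaxScaler computes essentially the same transformation.

```python
import numpy as np

def min_max_normalize(X):
    """Column-wise min-max normalization f(a_i) = (a_i - inf A_i) / (sup A_i - inf A_i).
    Constant columns are mapped to zero to avoid division by zero."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

X = np.array([[1.0, 10.0], [2.0, 10.0], [4.0, 40.0]])
print(min_max_normalize(X))   # every column rescaled into [0, 1]
```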

5.2. Datasets from Artificial Classification Problems

The synthetic datasets are generated according to five rules that organize the datasets into five corresponding families. We first generate n features with r possible values for each dataset. The value of each feature in an instance is obtained by applying the ceiling function to a value x with uniform distribution in the interval $(0, r)$. For each rule, four datasets are generated with the following characteristics: (1) 2 features and 50 possible values; (2) 3 features and 30 possible values; (3) 4 features and 10 possible values; (4) 5 features and 5 possible values. The five binary rules for assigning classes to each instance are described below, and a sketch of the generation procedure follows the list:
  • The first rule assigns the category TRUE if the function $\Upsilon_n^r : \{1, 2, \ldots, r\}^n \to \mathbb{R}$ is greater than zero and otherwise assigns the category FALSE. The function $\Upsilon_n^r$ is defined as:
    $$\Upsilon_n^r(a) = \cos\left(\sum_{i=1}^{n} \frac{a_i \pi}{(r - 1)\, n}\right).$$
  • The second rule assigns the category TRUE if the function $\Phi_n^r : \{1, 2, \ldots, r\}^n \to \mathbb{R}$ is greater than zero, and otherwise assigns the category FALSE. The function $\Phi_n^r$ is defined as:
    $$\Phi_n^r(a) = \prod_{i=1}^{n} \cos\left(\frac{a_i \pi}{r - 1}\right).$$
  • The third rule assigns the category TRUE if the function $\Psi_n^r : \{1, 2, \ldots, r\}^n \to \mathbb{R}$ is greater than zero, and otherwise assigns the category FALSE. The function $\Psi_n^r$ is defined as:
    $$\Psi_n^r(a) = \prod_{i=1}^{n} (a_i + 1) - \left(\frac{r}{2}\right)^{n}.$$
  • The fourth rule assigns the category TRUE if the function $\Omega_n^r : \{1, 2, \ldots, r\}^n \to \mathbb{R}$ is greater than zero, and otherwise assigns the category FALSE. The function $\Omega_n^r$ is defined as:
    $$\Omega_n^r(a) = \sum_{i=1}^{n} \left(a_i - \frac{r - 1}{2}\right)^{2} - n \left(\frac{r - 1}{3}\right)^{2}.$$
  • The fifth rule assigns the category TRUE if the function $\Gamma_n^r : \{1, 2, \ldots, r\}^n \to \mathbb{R}$ is greater than zero, and otherwise assigns the category FALSE. The function $\Gamma_n^r$ is defined as:
    $$\Gamma_n^r(a) = \sum_{i=1}^{n} a_i - \frac{n r}{2}.$$
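A sketch of the generation procedure, using the fifth rule as the labelling function; the reading of the formulas above is reconstructed from the text, so the exact constants should be treated as assumptions, and the dataset size shown is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(n_features, r, n_instances, rule):
    """Each feature value is ceil(x) with x uniform in (0, r), i.e. an integer
    in {1, ..., r}; the class is TRUE iff rule(pattern) is greater than zero."""
    X = np.ceil(rng.uniform(0.0, r, size=(n_instances, n_features)))
    y = np.array([rule(a) > 0 for a in X])
    return X, y

# Fifth rule, as reconstructed above: Gamma_n^r(a) = sum_i a_i - n * r / 2.
def gamma(a, r):
    return a.sum() - len(a) * r / 2.0

X, y = make_dataset(n_features=2, r=50, n_instances=200, rule=lambda a: gamma(a, 50))
print(X.shape, y.mean())   # roughly balanced TRUE/FALSE classes
```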
Before the analysis, we applied the k-monomial extension for k = 2, 3, 4, and 5 to the datasets, obtaining four new datasets per original dataset. Finally, we applied the normalization
$$f(a_i) = \frac{a_i - \inf A_i}{\sup A_i - \inf A_i}$$
to all datasets and features $A_i$, where $a_i \in A_i$. Table A6 shows more details about the datasets and their k-monomial extensions.

5.3. Analysis from the Real Datasets

In this subsection, we present the results corresponding to the real datasets. For the real datasets we have graphics like Figure 2 for the Speaker Accent Recognition dataset, which show the true positives, true negatives, false positives, and false negatives of the classification algorithms on each dataset and its k-monomial extensions (Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, Figure A7 and Figure A8, corresponding to the rest of the datasets, are in Appendix A). The values are calculated using 10-fold cross-validation. For each algorithm, three joined bars are presented, showing the configuration of the confusion matrix. From left to right, the first bar corresponds to the original dataset, the second corresponds to the 2-monomial extension, and the last one corresponds to the 3-monomial extension. We represent the confusion matrix to show that the criteria for evaluating improvements in classification are adequate for these examples. We can see that most of the time there is little difference between the values of the original dataset and those of the k-monomial extensions. There are a few cases where the original dataset presents a significantly better accuracy, such as the naive Bayes classifier in Figure A1 and the J48 classifier in Figure A2. On the other hand, there are some cases where a k-monomial extension presents an accuracy slightly higher than that of the original dataset.

5.4. Analysis from the Artificial Datasets

In this subsection, we present the results corresponding to the artificial datasets. For the synthetic datasets we present results like Table 4 (for the first family of datasets), which shows the accuracy of the classification algorithms on each dataset and its k-monomial extensions (Table A2, Table A3, Table A4 and Table A5, corresponding to the rest of the families of datasets, are in Appendix A). The values are calculated using 10-fold cross-validation. Each dataset has a column indexed by "n-r", where n is the number of features and r is the cardinality of the features. For each dataset and algorithm, the original accuracy corresponds to the accuracy on the original dataset, the best accuracy corresponds to the highest accuracy among the k-monomial extensions, and the grade corresponds to the k for which the k-monomial extensions reach that highest accuracy. In all families of datasets, we can see that the k-monomial extensions tend to have better accuracy than the original datasets. There are cases where the original dataset has higher accuracy, but the difference does not exceed 5%. We can also observe that the 5-monomial extension is commonly the case with the greatest accuracy. Notice that the 5-monomial extension is the dataset with the largest subset of redundant features.

6. Discussion

This is not the first work that relates features to data complexity. The quotient between the number of instances and the number of features (known as the T2 measure) has been studied as a measure of data complexity [40]. However, T2 is independent of the notion of complexity used in this work, since we can define linearly separable datasets in all ranges of T2. There are also applications of complexity measures to the feature selection problem, but they apply a mainly experimental analysis [52,53,54,55].
The concept of a redundant set of features is based on the relevant feature definition of John et al. [56]. There are several other definitions for redundancy or redundant features. However, these definitions are more oriented to applications than a theoretical analysis of redundancy and its effects [57,58,59,60,61,62,63,64].
Our theoretical results show that many redundant features can reduce the complexity of the data. This result can be interpreted as meaning that a feature can provide representativeness without providing extra information, as seen in the example in Section 3. It can also be interpreted as meaning that redundant features are capable of increasing the capacity of the model.
Our experimental results reinforce the evidence that redundancy in itself is not necessarily detrimental. The real and synthetic datasets showed that extended datasets with many redundant features constructed as monomials could achieve higher accuracy, although the improvement was more pronounced in the synthetic datasets. The synthetic datasets had no noise and few dimensions, which are the main differences from the real datasets studied.
Usually, the redundant features present before preprocessing entail a greater complexity than the classifier can handle. The reason is that the classifier cannot find the optimal (global) rule, because the search space increases exponentially; therefore, it returns a local optimum. Due to this increased search space, as we increase the number of features, the problem becomes more difficult and the classifiers tend to show poorer performance. However, this occurs because those initial features do not add enough expressiveness. Therefore, features obtained from suitable construction methods should not be treated in the same way as the initial features.
Finally, an increase or decrease in the number of features implies, respectively, an increase or decrease in the number of parameters of the model. Therefore, the choice of features can induce overfitting or underfitting. However, these learning problems are not commonly studied in the development of feature selection methods. Therefore, the criteria for selecting features should consider both the information provided by each feature and the representativeness provided by the features. Furthermore, in the same way that there are regularization methods to avoid overfitting caused by the internal parameters of the model, regularization methods could be developed against an excess of features.

7. Conclusions

The main finding of this work is that attributes that are redundant from an information viewpoint can indeed reduce under-fitting. Theoretical and experimental evidence is provided for this finding. However, these results are limited to binary classification problems with numerical attributes. Therefore, this work can be continued by extending the analysis along the following lines:
  • Extension of the analysis to multi-class classification problems and to problems with a significant proportion of categorical attributes.
  • Extension of the analysis to regression problems.
  • Extension of the analysis to models with a large number of parameters, where the phenomenon of under-fitting is unlikely, such as deep learning models.

Author Contributions

Conceptualization, S.A.G.; Formal analysis, S.A.G.; Investigation, S.A.G. and M.G.-T.; Project administration, J.L.V.N.; Software, L.R.B.P. and D.N.L.C.; Validation, J.L.V.N.; Visualization, J.C.M.R.; Writing—original draft, S.A.G.; Writing—review and editing, J.L.V.N., M.G.-T., J.F., D.P.P.-R., L.S.R. and F.G.-V. All authors contributed equally to this paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by CONACYT-Paraguay grant number PINV18-1199.

Data Availability Statement

The Algerian Forest Fires, Banknote Authentication, User Knowledge Modeling, Glass Identification, Somerville Happiness Survey and Wine quality data sets are available at https://archive.ics.uci.edu/ml/index.php, accessed on 24 September 2021. The Pima Indians Diabetes data set is available at https://www.kaggle.com/uciml/pima-indians-diabetes-database, accessed on 24 September 2021. The artificial data-sets are available at https://drive.google.com/drive/folders/1RW4EAR4ZxP8ZHCW1ErOMCAg24EXEgHFB?usp=sharing, accessed on 4 November 2021.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Figure A1. Graph corresponding to the Algerian Forest Fires dataset. In blue are true positives, in orange are true negatives, in green are false positives, in red are false negatives, and in purple are unclassified instances.
Figure A2. Graph corresponding to the Banknote Authentication dataset. In blue are true positives, in orange are true negatives, in green are false positives, in red are false negatives, and in purple are unclassified instances.
Figure A3. Graph corresponding to the User Knowledge Modeling dataset. In blue are true positives, in orange are true negatives, in green are false positives, in red are false negatives, and in purple are unclassified instances.
Figure A4. Graph corresponding to the Glass Identification dataset. In blue are true positives, in orange are true negatives, in green are false positives, in red are false negatives, and in purple are unclassified instances.
Figure A5. Graph corresponding to the Somerville Happiness Survey dataset. In blue are true positives, in orange are true negatives, in green are false positives, in red are false negatives, and in purple are unclassified instances.
Figure A6. Graph corresponding to the Melanoma dataset. In blue are true positives, in orange are true negatives, in green are false positives, in red are false negatives, and in purple are unclassified instances.
Figure A7. Graph corresponding to the Pima Indians Diabetes dataset. In blue are true positives, in orange are true negatives, in green are false positives, in red are false negatives, and in purple are unclassified instances.
Figure A8. Graph corresponding to the Wine Quality dataset. In blue are true positives, in orange are true negatives, in green are false positives, in red are false negatives, and in purple are unclassified instances.
Table A1. Basic information about the real datasets. The column “Instances” denotes the number of entries in the dataset. The column “Original” denotes the number of features in the original dataset. The columns “2-Mon. Ext.” and “3-Mon. Ext.” denote the number of features in the 2-monomial extension and 3-monomial extension, respectively.
Data-Set | Instances | Original | 2-Mon. Ext. | 3-Mon. Ext.
Speaker Accent Recognition | 210 | 12 | 90 | 454
Algerian Forest Fires | 122 | 13 | 104 | 559
Banknote Authentication | 1371 | 4 | 14 | 34
User Knowledge Modeling | 251 | 5 | 20 | 55
Glass Identification | 146 | 9 | 54 | 219
Melanoma | 104 | 19 | 209 | 1539
Pima Indians Diabetes | 767 | 8 | 44 | 164
Somerville Happiness Survey | 143 | 6 | 27 | 83
Wine Quality | 3655 | 11 | 77 | 363
Table A2. Table corresponding to results of artificial data from Family 2.
Family 2
Classifier | Metric | 2-50 | 3-30 | 4-10 | 5-5
Naive Bayes | Original Accuracy | 49.67 | 52.30 | 48.00 | 56.00
Naive Bayes | Best Accuracy | 71.33 | 57.10 | 49.00 | 53.33
Naive Bayes | Grade | 5 | 5 | 4 | 3-4
Logistic Regression | Original Accuracy | 45.33 | 51.50 | 48.80 | 51.33
Logistic Regression | Best Accuracy | 99.00 | 99.20 | 91.80 | 56.0
Logistic Regression | Grade | 3 | 3 | 4 | 5
KNN | Original Accuracy | 98.00 | 91.60 | 85.40 | 59.33
KNN | Best Accuracy | 97.33 | 91.10 | 80.20 | 62.67
KNN | Grade | 2-3-4 | 2 | 2 | 4
Rules PART | Original Accuracy | 53.67 | 51.70 | 49.20 | 52.67
Rules PART | Best Accuracy | 98.33 | 95.60 | 69.40 | 53.33
Rules PART | Grade | 5 | 3 | 4 | 2
Rules JRip | Original Accuracy | 99.33 | 99.70 | 61.20 | 52.00
Rules JRip | Best Accuracy | 97.33 | 95.80 | 62.00 | 53.33
Rules JRip | Grade | 5 | 2 | 3 | 2-3-4
Trees J48 | Original Accuracy | 53.67 | 53.30 | 49.20 | 53.33
Trees J48 | Best Accuracy | 98.67 | 98.40 | 71.20 | 56.00
Trees J48 | Grade | 2-3-5 | 5 | 5 | 2
Trees RF | Original Accuracy | 99.33 | 99.80 | 80.80 | 49.33
Trees RF | Best Accuracy | 99.00 | 99.80 | 79.60 | 47.33
Trees RF | Grade | 3 | 2 | 3 | 2
SVM | Original Accuracy | 53.67 | 52.10 | 45.60 | 54.67
SVM | Best Accuracy | 92.00 | 78.90 | 55.40 | 56.00
SVM | Grade | 5 | 5 | 5 | 3
ANN | Original Accuracy | 81.67 | 65.90 | 65.60 | 53.33
ANN | Best Accuracy | 96.00 | 93.10 | 77.20 | 66.00
ANN | Grade | 3-4-5 | 4 | 3 | 3
Table A3. Table corresponding to results of artificial data from Family 3.
Family 3
Classifier | Metric | 2-50 | 3-30 | 4-10 | 5-5
Naive Bayes | Original Accuracy | 93.00 | 91.33 | 90.17 | 89.33
Naive Bayes | Best Accuracy | 98.00 | 91.67 | 88.33 | 89.00
Naive Bayes | Grade | 2 | 5 | 5 | 4
Logistic Regression | Original Accuracy | 93.00 | 91.67 | 88.67 | 94.67
Logistic Regression | Best Accuracy | 97.00 | 98.83 | 96.50 | 96.33
Logistic Regression | Grade | 2-5 | 3-4 | 3 | 2
KNN | Original Accuracy | 95.00 | 96.00 | 91.17 | 89.67
KNN | Best Accuracy | 97.00 | 96.33 | 89.83 | 87.33
KNN | Grade | 2-3-4-5 | 2 | 2 | 3
Rules PART | Original Accuracy | 91.00 | 95.00 | 90.17 | 85.67
Rules PART | Best Accuracy | 99.00 | 99.50 | 97.67 | 92.00
Rules PART | Grade | 2-3-4-5 | 3-4-5 | 4-5 | 5
Rules JRip | Original Accuracy | 94.00 | 93.33 | 87.83 | 78.33
Rules JRip | Best Accuracy | 99.00 | 99.17 | 97.33 | 91.00
Rules JRip | Grade | 2-3-4-5 | 3-4-5 | 5 | 5
Trees J48 | Original Accuracy | 90.00 | 94.17 | 90.33 | 83.00
Trees J48 | Best Accuracy | 99.00 | 99.50 | 97.17 | 92.33
Trees J48 | Grade | 2-3-4-5 | 3-4-5 | 5 | 4
Trees RF | Original Accuracy | 94.00 | 96.67 | 93.33 | 88.33
Trees RF | Best Accuracy | 100.00 | 99.83 | 98.50 | 95.67
Trees RF | Grade | 2-3-4-5 | 5 | 5 | 5
SVM | Original Accuracy | 93.00 | 91.33 | 89.17 | 92.67
SVM | Best Accuracy | 96.00 | 97.17 | 95.67 | 94.67
SVM | Grade | 5 | 4-5 | 5 | 5
ANN | Original Accuracy | 96.00 | 93.00 | 94.33 | 93.00
ANN | Best Accuracy | 98.00 | 98.67 | 97.00 | 97.00
ANN | Grade | 2 | 5 | 4 | 2
Table A4. Table corresponding to results of artificial data from Family 4.
Family 4
Classifier | Metric | 2-50 | 3-30 | 4-10 | 5-5
Naive Bayes | Original Accuracy | 82.00 | 77.60 | 82.00 | 77.33
Naive Bayes | Best Accuracy | 76.00 | 80.10 | 75.00 | 77.67
Naive Bayes | Grade | 2-3-4-5 | 2 | 2 | 2
Logistic Regression | Original Accuracy | 68.00 | 69.80 | 55.00 | 62.67
Logistic Regression | Best Accuracy | 98.00 | 99.40 | 95.00 | 96.67
Logistic Regression | Grade | 2-3 | 2 | 2 | 2
KNN | Original Accuracy | 82.00 | 91.00 | 70.00 | 79.33
KNN | Best Accuracy | 82.00 | 91.40 | 71.00 | 75.67
KNN | Grade | 2-3-4-5 | 2 | 2 | 3
Rules PART | Original Accuracy | 80.00 | 89.30 | 75.00 | 88.33
Rules PART | Best Accuracy | 82.00 | 91.50 | 77.00 | 90.00
Rules PART | Grade | 4 | 3 | 2-4 | 2
Rules JRip | Original Accuracy | 72.00 | 88.40 | 62.00 | 83.67
Rules JRip | Best Accuracy | 72.00 | 90.30 | 66.00 | 80.67
Rules JRip | Grade | 2-3-5 | 3 | 2-3 | 5
Trees J48 | Original Accuracy | 78.00 | 91.00 | 62.00 | 85.33
Trees J48 | Best Accuracy | 80.00 | 90.90 | 75.00 | 87.33
Trees J48 | Grade | 3-4 | 2-5 | 3 | 2
Trees RF | Original Accuracy | 76.00 | 93.60 | 73.00 | 88.33
Trees RF | Best Accuracy | 78.00 | 94.10 | 77.00 | 89.67
Trees RF | Grade | 2 | 4 | 3-4 | 3
SVM | Original Accuracy | 66.00 | 69.80 | 61.00 | 64.67
SVM | Best Accuracy | 82.00 | 94.40 | 73.00 | 94.33
SVM | Grade | 4 | 5 | 4-5 | 4
ANN | Original Accuracy | 84.00 | 75.90 | 66.00 | 69.67
ANN | Best Accuracy | 96.00 | 97.70 | 90.00 | 93.33
ANN | Grade | 3 | 2 | 2 | 2
Table A5. Table corresponding to results of artificial data from Family 5.
Family 5
Classifier | Metric | 2-50 | 3-30 | 4-10 | 5-5
Naive Bayes | Original Accuracy | 95.00 | 94.00 | 92.00 | 90.00
Naive Bayes | Best Accuracy | 98.00 | 98.00 | 94.25 | 92.00
Naive Bayes | Grade | 2-3-4 | 5 | 2 | 2
Logistic Regression | Original Accuracy | 99.50 | 100.00 | 100.00 | 100.00
Logistic Regression | Best Accuracy | 99.50 | 100.00 | 99.75 | 93.00
Logistic Regression | Grade | 3 | 2 | 3 | 2
KNN | Original Accuracy | 97.00 | 94.50 | 96.25 | 84.00
KNN | Best Accuracy | 97.00 | 94.50 | 96.50 | 85.00
KNN | Grade | 2-3-4-5 | 2 | 2 | 4-5
Rules PART | Original Accuracy | 91.00 | 84.50 | 90.25 | 80.00
Rules PART | Best Accuracy | 97.00 | 95.00 | 94.50 | 88.00
Rules PART | Grade | 3 | 4 | 5 | 2
Rules JRip | Original Accuracy | 92.50 | 82.50 | 90.25 | 78.00
Rules JRip | Best Accuracy | 96.50 | 93.50 | 93.50 | 90.00
Rules JRip | Grade | 2 | 3 | 4 | 4
Trees J48 | Original Accuracy | 90.50 | 87.00 | 89.25 | 79.00
Trees J48 | Best Accuracy | 95.50 | 94.00 | 93.50 | 88.00
Trees J48 | Grade | 3 | 2 | 4 | 3
Trees RF | Original Accuracy | 94.50 | 90.00 | 93.75 | 85.00
Trees RF | Best Accuracy | 98.00 | 95.50 | 96.25 | 93.00
Trees RF | Grade | 3-4-5 | 4-5 | 3 | 5
SVM | Original Accuracy | 96.00 | 96.50 | 96.50 | 92.00
SVM | Best Accuracy | 98.00 | 96.50 | 99.25 | 100.00
SVM | Grade | 2-5 | 2 | 2 | 4-5
ANN | Original Accuracy | 100.00 | 100.00 | 100.00 | 100.00
ANN | Best Accuracy | 100.00 | 98.00 | 100.00 | 100.00
ANN | Grade | 3-4-5 | 2 | 2 | 2
Table A6. Basic information about the artificial datasets. The column “Family” denotes the corresponding family function. The column “Indices” denotes the number of features and their cardinality. The column “Instances” denotes the number of entries in the dataset. The column “Original” denotes the number of features in the original dataset. The columns “2-Mon. Ext.”, “3-Mon. Ext.”, “4-Mon. Ext.”, and “5-Mon. Ext.”, denote the number of features in the 2-monomial extension, 3-monomial extension, 4-monomial extension, and 5-monomial extension, respectively.
Family | Indices | Instances | Original | 2-Mon. Ext. | 3-Mon. Ext. | 4-Mon. Ext. | 5-Mon. Ext.
1 | 2-50 | 500 | 2 | 5 | 9 | 14 | 20
1 | 3-30 | 500 | 3 | 9 | 19 | 34 | 55
1 | 4-10 | 300 | 4 | 14 | 34 | 69 | 125
1 | 5-5 | 200 | 5 | 20 | 55 | 125 | 251
2 | 2-50 | 300 | 2 | 5 | 9 | 14 | 20
2 | 3-30 | 1000 | 3 | 9 | 19 | 34 | 55
2 | 4-10 | 500 | 4 | 14 | 34 | 69 | 125
2 | 5-5 | 150 | 5 | 20 | 55 | 125 | 251
3 | 2-50 | 100 | 2 | 5 | 9 | 14 | 20
3 | 3-30 | 600 | 3 | 9 | 19 | 34 | 55
3 | 4-10 | 600 | 4 | 14 | 34 | 69 | 125
3 | 5-5 | 300 | 5 | 20 | 55 | 125 | 251
4 | 2-50 | 50 | 2 | 5 | 9 | 14 | 20
4 | 3-30 | 1000 | 3 | 9 | 19 | 34 | 55
4 | 4-10 | 100 | 4 | 14 | 34 | 69 | 125
4 | 5-5 | 300 | 5 | 20 | 55 | 125 | 251
5 | 2-50 | 200 | 2 | 5 | 9 | 14 | 20
5 | 3-30 | 200 | 3 | 9 | 19 | 34 | 55
5 | 4-10 | 400 | 4 | 14 | 34 | 69 | 125
5 | 5-5 | 100 | 5 | 20 | 55 | 125 | 251

References

  1. Vergara, J.R.; Estévez, P.A. A review of feature selection methods based on mutual information. Neural Comput. Appl. 2013, 24, 175–186. [Google Scholar] [CrossRef]
  2. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
  3. Sondhi, P. Feature construction methods: A survey. Sifaka. Cs. Uiuc. Edu. 2009, 69, 70–71. [Google Scholar]
  4. Tang, J.; Alelyani, S.; Liu, H. Feature selection for classification: A review. Data Classif. Algorithms Appl. 2014, 37. [Google Scholar] [CrossRef]
  5. Yang, C.H.; Chuang, L.Y.; Yang, C.H. IG-GA: A Hybrid Filter/Wrapper Method for Feature Selection of Microarray Data. J. Med. Biol. Eng. 2010, 30, 23–28. [Google Scholar]
  6. Hsu, H.H.; Hsieh, C.W.; Lu, M.D. Hybrid feature selection by combining filters and wrappers. Expert Syst. Appl. 2011, 38, 8144–8150. [Google Scholar] [CrossRef]
  7. Chandrashekar, G.; Sahin, F. A Survey on Feature Selection Methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
  8. Venkatesh, B.; Anuradha, J. A review of feature selection and its methods. Cybern. Inf. Technol. 2019, 19, 3–26. [Google Scholar] [CrossRef] [Green Version]
  9. Pudil, P.; Novovičová, J.; Kittler, J. Floating search methods in feature selection. Pattern Recognit. Lett. 1994, 15, 1119–1125. [Google Scholar] [CrossRef]
  10. Ferri, F.J.; Pudil, P.; Hatef, M.; Kittler, J. Comparative study of techniques for large-scale feature selection. In Machine Intelligence and Pattern Recognition; Elsevier: Amsterdam, The Netherlands, 1994; Volume 16, pp. 403–413. [Google Scholar]
  11. Ghosh, K.K.; Ahmed, S.; Singh, P.K.; Geem, Z.W.; Sarkar, R. Improved binary sailfish optimizer based on adaptive β-hill climbing for feature selection. IEEE Access 2020, 8, 83548–83560. [Google Scholar] [CrossRef]
  12. Yan, C.; Ma, J.; Luo, H.; Wang, J. A hybrid algorithm based on binary chemical reaction optimization and tabu search for feature selection of high-dimensional biomedical data. Tsinghua Sci. Technol. 2018, 23, 733–743. [Google Scholar] [CrossRef] [Green Version]
  13. Sayed, S.; Nassef, M.; Badr, A.; Farag, I. A nested genetic algorithm for feature selection in high-dimensional cancer microarray datasets. Expert Syst. Appl. 2019, 121, 233–243. [Google Scholar] [CrossRef]
  14. Jia, H.; Li, J.; Song, W.; Peng, X.; Lang, C.; Li, Y. Spotted hyena optimization algorithm with simulated annealing for feature selection. IEEE Access 2019, 7, 71943–71962. [Google Scholar] [CrossRef]
  15. Mafarja, M.M.; Mirjalili, S. Hybrid whale optimization algorithm with simulated annealing for feature selection. Neurocomputing 2017, 260, 302–312. [Google Scholar] [CrossRef]
  16. Paniri, M.; Dowlatshahi, M.B.; Nezamabadi-pour, H. MLACO: A multi-label feature selection algorithm based on ant colony optimization. Knowl.-Based Syst. 2020, 192, 105285. [Google Scholar] [CrossRef]
  17. Gharehchopogh, F.S.; Maleki, I.; Dizaji, Z.A. Chaotic vortex search algorithm: Metaheuristic algorithm for feature selection. Evol. Intell. 2021, 1–32. [Google Scholar] [CrossRef]
  18. Sakri, S.B.; Rashid, N.B.A.; Zain, Z.M. Particle swarm optimization feature selection for breast cancer recurrence prediction. IEEE Access 2018, 6, 29637–29647. [Google Scholar] [CrossRef]
  19. Liu, H.; Motoda, H. Feature Extraction, Construction and Selection: A Data Mining Perspective; Springer Science & Business Media: New York, USA, 2012; Volume 453. [Google Scholar]
  20. Mahanipour, A.; Nezamabadi-Pour, H.; Nikpour, B. Using fuzzy-rough set feature selection for feature construction based on genetic programming. In Proceedings of the 2018 3rd Conference on Swarm Intelligence and Evolutionary Computation (CSIEC), Bam, Iran, 6–8 March 2018; pp. 1–6. [Google Scholar]
  21. Neshatian, K.; Zhang, M.; Andreae, P. A filter approach to multiple feature construction for symbolic learning classifiers using genetic programming. IEEE Trans. Evol. Comput. 2012, 16, 645–661. [Google Scholar] [CrossRef]
  22. Markovitch, S.; Rosenstein, D. Feature generation using general constructor functions. Mach. Learn. 2002, 49, 59–98. [Google Scholar] [CrossRef] [Green Version]
  23. Fan, W.; Zhong, E.; Peng, J.; Verscheure, O.; Zhang, K.; Ren, J.; Yan, R.; Yang, Q. Generalized and heuristic-free feature construction for improved accuracy. In Proceedings of the 2010 SIAM International Conference on Data Mining, Columbus, OH, USA, 29 April–1 May 2010; pp. 629–640. [Google Scholar]
  24. Ma, J.; Gao, X. A filter-based feature construction and feature selection approach for classification using Genetic Programming. Knowl.-Based Syst. 2020, 196, 105806. [Google Scholar] [CrossRef]
  25. Tran, B.; Xue, B.; Zhang, M. Genetic programming for multiple-feature construction on high-dimensional classification. Pattern Recognit. 2019, 93, 404–417. [Google Scholar] [CrossRef]
  26. Specia, L.; Srinivasan, A.; Ramakrishnan, G.; Nunes, M.d.G.V. Word sense disambiguation using inductive logic programming. In Proceedings of the 16th International Conference, ILP 2006, Santiago de Compostela, Spain, 24–27 August 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 409–423. [Google Scholar]
  27. Specia, L.; Srinivasan, A.; Joshi, S.; Ramakrishnan, G.; Nunes, M.d.G.V. An investigation into feature construction to assist word sense disambiguation. Mach. Learn. 2009, 76, 109–136. [Google Scholar] [CrossRef] [Green Version]
  28. Roth, D.; Small, K. Interactive feature space construction using semantic information. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), Boulder, CO, USA, 4–5 June 2009; pp. 66–74. [Google Scholar]
  29. Derczynski, L.; Chester, S. Generalised Brown clustering and roll-up feature generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  30. Siwek, K.; Osowski, S. Comparison of Methods of Feature Generation for Face Recognition; University of West Bohemia: Pilsen, Czechia, 2013. [Google Scholar]
  31. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification. A Wiley-Interscience Publication, 2nd ed.; John Wiley & Sons, Inc.: New York, NY, USA, 2001. [Google Scholar]
  32. Sutton, R.S.; Matheus, C.J. Learning polynomial functions by feature construction. In Machine Learning Proceedings 1991; Elsevier: San Mateo, CA, USA, 1991; pp. 208–212. [Google Scholar]
  33. Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Seattle, USA, 4–6 August 2001; Volume 3, pp. 41–46. [Google Scholar]
34. Wright, R.E. Logistic regression. In Reading and Understanding Multivariate Statistics; Grimm, L.G., Yarnold, P.R., Eds.; American Psychological Association: Washington, DC, USA, 1995; pp. 217–244.
35. Fix, E.; Hodges, J.L. Discriminatory analysis. Nonparametric discrimination: Consistency properties. Int. Stat. Rev./Revue Internationale de Statistique 1989, 57, 238–247.
36. Frank, E.; Witten, I.H. Generating Accurate Rule Sets Without Global Optimization. In Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI, USA, 24–27 July 1998; Shavlik, J., Ed.; Morgan Kaufmann: San Francisco, CA, USA, 1998; pp. 144–151, ISBN 978-1-55860-556-5.
37. Cohen, W.W. Fast Effective Rule Induction. In Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA, USA, 9–12 July 1995; Morgan Kaufmann: San Francisco, CA, USA, 1995; pp. 115–123, ISBN 978-1-55860-377-6.
38. Quinlan, R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers: San Mateo, CA, USA, 1993.
39. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
40. Ho, T.K.; Basu, M. Complexity measures of supervised classification problems. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 289–300.
41. Blumer, A.; Ehrenfeucht, A.; Haussler, D.; Warmuth, M.K. Learnability and the Vapnik-Chervonenkis dimension. J. ACM 1989, 36, 929–965.
42. Noble, W.S. What is a support vector machine? Nat. Biotechnol. 2006, 24, 1565–1567.
43. Garner, S.R. Weka: The Waikato environment for knowledge analysis. In Proceedings of the New Zealand Computer Science Research Students Conference, Hamilton, New Zealand, 14–18 April 1995; Volume 1995, pp. 57–64.
44. Fokoue, E. Speaker Accent Recognition Data Set. UCI Machine Learning Repository, 2020. Available online: https://archive.ics.uci.edu/ml/datasets/Speaker+Accent+Recognition (accessed on 24 September 2021).
45. Abid, F.; Izeboudjen, N. Predicting Forest Fire in Algeria Using Data Mining Techniques: Case Study of the Decision Tree Algorithm. In Proceedings of the International Conference on Advanced Intelligent Systems for Sustainable Development, Tangier, Morocco, 12–14 July 2019; Springer: Cham, Switzerland, 2019; pp. 363–370.
46. Lohweg, V. Banknote Authentication Data Set. UCI Machine Learning Repository, 2012. Available online: https://archive.ics.uci.edu/ml/datasets/banknote+authentication (accessed on 24 September 2021).
47. Kahraman, H.T.; Sagiroglu, S.; Colak, I. The development of intuitive knowledge classifier and the modeling of domain dependent data. Knowl.-Based Syst. 2013, 37, 283–295.
48. German, B. Glass Identification Data Set. UCI Machine Learning Repository, 1987. Available online: https://archive.ics.uci.edu/ml/datasets/glass+identification (accessed on 24 September 2021).
49. Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; Reis, J. Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst. 2009, 47, 547–553.
50. Koczkodaj, W.W. Somerville Happiness Survey Data Set. UCI Machine Learning Repository, 2018. Available online: https://archive.ics.uci.edu/ml/datasets/Somerville+Happiness+Survey (accessed on 24 September 2021).
51. Rossi, R.A.; Ahmed, N.K. The Network Data Repository with Interactive Graph Analytics and Visualization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), Austin, TX, USA, 25–30 January 2015.
52. Seijo-Pardo, B.; Bolón-Canedo, V.; Alonso-Betanzos, A. Using data complexity measures for thresholding in feature selection rankers. In Proceedings of the Conference of the Spanish Association for Artificial Intelligence, Salamanca, Spain, 14–16 September 2016; Springer: Cham, Switzerland, 2016; pp. 121–131.
53. Dom, B.; Niblack, W.; Sheinvald, J. Feature selection with stochastic complexity. In Proceedings of the 1989 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 4–8 June 1989; pp. 241–242.
54. Bolón-Canedo, V.; Sánchez-Maroño, N.; Alonso-Betanzos, A. A distributed feature selection approach based on a complexity measure. In Proceedings of the International Work-Conference on Artificial Neural Networks, Palma de Mallorca, Spain, 10–12 June 2015; Springer: Cham, Switzerland, 2015; pp. 15–28.
55. Okimoto, L.C.; Lorena, A.C. Data complexity measures in feature selection. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8.
56. John, G.H.; Kohavi, R.; Pfleger, K. Irrelevant features and the subset selection problem. In Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ, USA, 10–13 July 1994; pp. 121–129.
57. Ding, C.; Peng, H. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 2005, 3, 185–205.
58. Yu, L.; Liu, H. Efficient feature selection via analysis of relevance and redundancy. J. Mach. Learn. Res. 2004, 5, 1205–1224.
59. Gao, W.; Hu, L.; Zhang, P. Feature redundancy term variation for mutual information-based feature selection. Appl. Intell. 2020, 50, 1272–1288.
60. Zhou, T.; Zhang, C.; Gong, C.; Bhaskar, H.; Yang, J. Multiview latent space learning with feature redundancy minimization. IEEE Trans. Cybern. 2018, 50, 1655–1668.
61. Cheng, G.; Qin, Z.; Feng, C.; Wang, Y.; Li, F. Conditional Mutual Information-Based Feature Selection Analyzing for Synergy and Redundancy. ETRI J. 2011, 33, 210–218.
62. Zhao, Z.; Wang, L.; Liu, H. Efficient spectral feature selection with minimum redundancy. In Proceedings of the AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010; Volume 24.
63. Tabakhi, S.; Moradi, P. Relevance–redundancy feature selection based on ant colony optimization. Pattern Recognit. 2015, 48, 2798–2811.
64. Wang, M.; Tao, X.; Han, F. A New Method for Redundancy Analysis in Feature Selection. In Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, Sanya, China, 24–26 December 2020; pp. 1–5.
Figure 1. Graph corresponding to Table 2 without feature 3.
Figure 2. Graph corresponding to the Speaker Accent Recognition dataset. True positives are shown in blue, true negatives in orange, false positives in green, false negatives in red, and unclassified instances in purple.
Table 1. Simple example of dataset distribution.
Att. 1   Att. 2   Att. 3   Class
1        1        0        0
1        2        1        0
1        2        1        1
2        1        2        1
3        1        3        0
Table 2. Example of linear separability thanks to a new redundant feature.
Att. 1   Att. 2   Att. 3   Class
1        0        1        1
2        0        4        0
2        1        4        0
3        0        9        1
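To make the construction behind Table 2 concrete, the following minimal sketch (ours, for illustration only, not part of the original experiments) assumes that Att. 3 is built as the square of Att. 1 and checks with a perceptron that the four instances become linearly separable only once this redundant column is appended.

import numpy as np
from sklearn.linear_model import Perceptron

# Att. 1 and Att. 2 from Table 2, with their class labels.
X = np.array([[1, 0], [2, 0], [2, 1], [3, 0]], dtype=float)
y = np.array([1, 0, 0, 1])

# Without Att. 3 the positive instances (Att. 1 = 1 and Att. 1 = 3) surround
# the negative ones at Att. 1 = 2, so no hyperplane separates the classes.
clf = Perceptron(max_iter=1000, tol=None, random_state=0).fit(X, y)
print(clf.score(X, y))  # expected below 1.0 (0.75 at best)

# Appending Att. 3 = (Att. 1)^2 adds no new information about the class,
# yet the hyperplane -4*x1 + x3 + 3.5 separates the augmented data, so the
# perceptron can now reach perfect training accuracy.
X_aug = np.hstack([X, X[:, [0]] ** 2])
clf_aug = Perceptron(max_iter=1000, tol=None, random_state=0).fit(X_aug, y)
print(clf_aug.score(X_aug, y))  # expected 1.0

The appended column is redundant in the sense that it is a deterministic function of Att. 1; the point of the example is that it changes the geometry of the problem without adding information.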
Table 3. XOR function under 2-monomial expansion.
a1   a2   a1²   a1a2   a2²   Class
1    1    1     1      1     0
1    0    1     0      0     1
0    1    0     0      1     1
0    0    0     0      0     0
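Table 3 can be reproduced in a few lines; the sketch below (an illustration, with scikit-learn's PolynomialFeatures standing in for the 2-monomial expansion) shows that the weight vector (1, 1, 0, -2, 0) with threshold 0.5 separates XOR on the expanded features, although no linear classifier can do so on a1 and a2 alone.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# XOR truth table: the class is a1 XOR a2.
A = np.array([[1, 1], [1, 0], [0, 1], [0, 0]], dtype=float)
y = np.array([0, 1, 1, 0])

# 2-monomial expansion as in Table 3: columns a1, a2, a1^2, a1*a2, a2^2.
Z = PolynomialFeatures(degree=2, include_bias=False).fit_transform(A)

# On the expanded features XOR is linear: a1 + a2 - 2*a1*a2 equals the class,
# so thresholding this linear form at 0.5 reproduces the Class column.
w = np.array([1.0, 1.0, 0.0, -2.0, 0.0])
pred = (Z @ w > 0.5).astype(int)
print(pred)  # expected [0 1 1 0], matching y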
Table 4. Results of artificial data from Family 1, where we only show the accuracy for the best values of k.
Family 1
Classifier            Metric               2-50     3-30     4-10     5-5
Naive Bayes           Original Accuracy    46.80    50.00    54.33    62.50
                      Best Accuracy        56.00    51.60    53.67    65.00
                      Grade                3        2        2        2
Logistic Regression   Original Accuracy    47.80    52.00    53.33    64.00
                      Best Accuracy        99.20    56.20    54.67    65.00
                      Grade                5        5        3        2
KNN                   Original Accuracy    94.80    82.20    53.00    61.50
                      Best Accuracy        94.80    84.40    53.67    64.50
                      Grade                3        3-4-5    3        3-4
Rules PART            Original Accuracy    54.60    52.40    52.00    64.50
                      Best Accuracy        96.20    63.20    52.67    64.50
                      Grade                5        5        4-5      3-4
Rules JRip            Original Accuracy    86.20    53.00    54.33    59.50
                      Best Accuracy        95.20    63.60    56.33    62.50
                      Grade                3        5        4        3
Trees J48             Original Accuracy    54.60    52.00    51.33    68.00
                      Best Accuracy        97.20    66.80    54.00    65.50
                      Grade                3-4      5        5        4
Trees RF              Original Accuracy    93.40    68.60    51.67    65.50
                      Best Accuracy        97.40    81.00    55.67    63.00
                      Grade                4        3        2        5
SVM                   Original Accuracy    50.80    49.60    55.67    64.00
                      Best Accuracy        70.40    53.80    58.67    65.50
                      Grade                4        2        3        3
ANN                   Original Accuracy    90.00    51.20    50.67    59.50
                      Best Accuracy        98.20    87.00    53.67    65.00
                      Grade                4        3        4        2
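For readers who wish to reproduce the spirit of Table 4 outside Weka, the outline below compares, for each classifier, the accuracy on the original features with the best accuracy over monomial expansions of grade 2 to 5, reporting the grade at which the best value is reached. It is only a sketch under stated assumptions: scikit-learn models stand in for the Weka implementations used in the experiments, and make_classification is a placeholder because the Family 1 generator is not reproduced here.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Placeholder data: the Family 1 generator is not described in this section,
# so a generic two-feature synthetic problem is used instead.
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Tree (C4.5-like)": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(kernel="linear"),
    "ANN": MLPClassifier(max_iter=2000, random_state=0),
}

for name, clf in classifiers.items():
    original = cross_val_score(clf, X, y, cv=5).mean()
    best_acc, best_grade = original, 1
    for grade in range(2, 6):  # monomial expansions of grade 2..5
        Z = PolynomialFeatures(degree=grade, include_bias=False).fit_transform(X)
        acc = cross_val_score(clf, Z, y, cv=5).mean()
        if acc > best_acc:
            best_acc, best_grade = acc, grade
    print(f"{name}: original={original:.3f}  best={best_acc:.3f} (grade {best_grade})")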
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
