1. Introduction
In classification, the quality of the information in the features is essential to building a high-quality predictive model. Furthermore, rapid advances in data acquisition and storage technologies have created high-dimensional data. However, noise, non-informative features, and redundancy, among other issues, make the classification task challenging [1]. Therefore, selecting suitable features is an important preliminary step for building highly predictive classifiers [2].
To reduce dimensionality, there are two main approaches: feature selection and feature construction. Feature selection selects a subset of features from the input to reduce the effects of noise or irrelevant features while still providing good prediction results [2]. In contrast, feature construction refers to the task of transforming a given set of input features to generate a new set of more predictive features [3].
According to [2], feature selection methods can be divided into three major categories depending on the evaluation criteria: filter, wrapper, and embedded. Filter methods use intrinsic properties of the data to select a subset of features and are applied as a preprocessing task [4]. Wrappers, in contrast, use a learning algorithm to guide the search. The learning bias is included in the search and, therefore, they achieve better results; however, they are computationally expensive [5,6] and can cause overfitting [7]. Finally, embedded methods perform the search at the same time the model is learned.
Feature selection methods can also be classified according to the strategy used to search for subsets of features: exponential search, sequential search, and random search [8]. Exponential search consists of the exhaustive evaluation of all possible subsets, which makes it impractical most of the time. Sequential search consists of the application of a local search method with a hill-descent strategy [9,10]; the use of such strategies means that the search can get stuck in a local optimum. Finally, random search strategies consist of the application of metaheuristic optimization algorithms [11,12,13,14,15,16,17,18].
Despite the success of feature selection techniques, a good feature space is a prerequisite for achieving high classification performance. In this sense, feature construction aims to engineer new features that reveal the hidden relations among the original features [19,20]. New features are constructed from relations among the original ones, pursuing a more meaningful feature space capable of yielding a more accurate classifier [3]. As in feature selection, feature construction comprises three approaches: filter, wrapper, and embedded methods [21]. Among the main approaches for constructing features, we have (i) methods based on decision trees [22,23], (ii) evolutionary metaheuristics [24,25], (iii) the application of inductive logic programming [26,27], (iv) methods that use annotations on the training set [28], and (v) unsupervised methods such as clustering [29], PCA [30], or SVD [31].
In this work, we study the relationship between feature construction and the assumptions applied in selecting those features. We denote as redundant a subset of features that does not provide more information than what already exists in the remaining features. We are particularly interested in analyzing the assumption that minimizing the number of redundant features is best for classification problems, especially how the chosen features can affect the capacity that a model requires to perform the classification. First, we present a mathematical framework for modeling feature construction and selection in classification problems with discrete features. Second, we show that there are datasets where small feature subsets can be much more complex than large feature subsets; here, complexity refers to the capacity that the model requires to classify the problem, and we regard linearly separable problems as the least complex. This construction violates the assumption that fewer features with equal or more information are better than many features. Third, we extend the analysis of feature construction using monomials of degree k [32] and conclude that this method tends to produce linearly separable binary classification problems as k grows. Therefore, we propose that one way to validate feature construction methods is to analyze whether the classification problems tend to become linearly separable under the iterative application of the method. Finally, we apply the construction of features with monomials of degree k to real and artificial datasets, using the following classification algorithms: naive Bayes [33], logistic regression [34], KNN [35], PART [36], JRIP [37], J48 [38], and random forest [39]. Experiments show that even though the number of redundant features grows extensively, the score increases or does not decrease much. Therefore, both theoretical and experimental evidence agree that the criterion of choosing minimum feature subsets is not always correct, because the assumption considers only the information in the features and not the complexity of the classification problem.
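To make this pipeline concrete, the following sketch illustrates the idea on a synthetic dataset, assuming scikit-learn stand-ins: PolynomialFeatures for the degree-k monomial construction and two of the cited classifier families. The paper's actual datasets and implementations (e.g., the Weka versions of PART, JRIP, and J48) are not reproduced here.

```python
# A minimal sketch of the experimental idea, not the paper's exact setup.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)  # synthetic stand-in dataset

for k in (1, 2, 3):  # degree-k monomial extensions; k = 1 keeps the original features
    for clf in (GaussianNB(), LogisticRegression(max_iter=1000)):
        model = make_pipeline(PolynomialFeatures(degree=k, include_bias=False), clf)
        score = cross_val_score(model, X, y, cv=5).mean()
        print(f"k={k} {clf.__class__.__name__:>18}: {score:.3f}")
```

The constructed monomial features are fully determined by the original ones, so they add redundancy but, as the scores suggest, not necessarily harm.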
The contributions of this work can be synthesized in the following items: (a) showing that the redundancy of features can reduce data complexity, (b) developing a theoretical framework to model the construction and selection of features, and (c) proposing a mathematical criterion to validate feature construction methods. The experiments performed suggest that the presence of redundant features does not necessarily harm classification tasks.
This work is organized into the following sections. Section 2 presents the mathematical formulation used to describe the theoretical results. Section 3 introduces basic ideas with simple examples, while Section 4 formalizes those ideas into more general results. Section 5 shows the experimental results, and finally, Section 6 presents a discussion of all the results obtained.
2. A Mathematical Model for Feature Selection and Construction
In this section, we present a formal framework for the mathematical analysis of feature selection and construction. Let $A = (A_1, \ldots, A_n)$ be a finite sequence of finite sets in $\mathbb{R}$ and $C$ another finite set, where each $A_i$ is denoted as feature $i$ and $C$ is the set of possible classes. Taking $\Omega = A_1 \times \cdots \times A_n \times C$, we consider a probability distribution $D$ over $\Omega$; we denote by $P_D(\cdot)$ and $P_D(\cdot \mid \cdot)$ the probability and conditional probability determined by $D$, respectively. Notice that we may generate a dataset using distribution $D$, where each record is an element from $\Omega$, and we denote $D$ as a dataset distribution. Denote by $a = (a_1, \ldots, a_n, c)$ a generic record, such that $a_i \in A_i$ for $1 \leq i \leq n$ and $c \in C$. Let $B = (A_{i_1}, \ldots, A_{i_m})$ be a subsequence of $A$; we denote (i) $\Omega_B = A_{i_1} \times \cdots \times A_{i_m}$, (ii) if $s \in \Omega_B$, then $s$ is denoted as a pattern of $B$, and (iii) $E_B(s)$ is denoted as the event where we sample an instance such that $a_{i_j} = s_j$ for each $j$, for a pattern $s$ of $B$, according to distribution $D$. We say that $s$ is a not-null pattern of $B$ if $P_D(E_B(s)) > 0$.
Notice that our definition of the dataset distribution is general enough to describe a dataset or its real distribution. For example, given the dataset distribution $D$ in Table 1, we can take $A_1 = \{0, 1\}$, $A_2 = \{0, 1\}$, $A_3 = \{0, 1\}$, and $C = \{0, 1\}$. As $B = (A_1, A_3)$ represents all possible values taken by the first and third features, if $s = (1, 1)$ is a pattern of $B$, then $E_B(s)$ is the event where the first and third features have value one. Notice that $s$ is a not-null pattern because $P_D(E_B(s)) > 0$.
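As an illustration of these objects, the following sketch encodes a hypothetical stand-in for Table 1 (the table itself is not reproduced here) as an explicit dataset distribution and computes the probability of the event $E_B(s)$. The class rule and the probabilities are assumptions chosen only to be consistent with the surrounding text, where the third feature is determined by the first two.

```python
# A small sketch of the formal objects, with a hypothetical stand-in for Table 1.
from itertools import product

D = {}
for a1, a2 in product((0, 1), repeat=2):
    a3 = a1 * a2                      # assumption: feature 3 is determined by features 1 and 2
    c = max(a1, a2)                   # assumption: a hypothetical class rule
    D[(a1, a2, a3, c)] = 0.25         # records are (a1, a2, a3, c), uniformly weighted

def event_prob(dist, feats, pattern):
    """P_D(E_B(s)): total probability of records matching `pattern` on the indices `feats`."""
    return sum(p for rec, p in dist.items()
               if all(rec[i] == v for i, v in zip(feats, pattern)))

B = (0, 2)                            # sub-sequence B = (A_1, A_3); features are 0-indexed in code
s = (1, 1)
print(event_prob(D, B, s))            # 0.25 > 0, so s is a not-null pattern of B
```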
The following definition formalizes the notion of patterns that do not contradict each other.
Definition 1. Let $B$ and $F$ be sub-sequences of $A$; we denote $\Omega_B$ and $\Omega_F$ as above. Taking $b \in \Omega_B$ and $d \in \Omega_F$, we say that $b$ and $d$ are congruent patterns if $b$ and $d$ are not distinct in the features of $A$ preserved by both $B$ and $F$.
For example, take the dataset distribution $D$ of Table 1, $B = (A_1, A_2)$ and $F = (A_2, A_3)$. We have that $b = (1, 0)$ and $d = (0, 1)$ are congruent patterns, because they have the same value in their single shared feature. However, if $d = (1, 1)$, then $b$ and $d$ are not congruent patterns, because they have different values for the second feature of the dataset.
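A small sketch of Definition 1 follows, under the same 0-indexed feature convention; the pattern values are the ones assumed in the example above.

```python
# A sketch of Definition 1: patterns are congruent when they agree on shared features.
def congruent(feats_b, b, feats_f, d):
    """True when patterns b and d agree on every feature index preserved by both."""
    vals_b = dict(zip(feats_b, b))
    vals_f = dict(zip(feats_f, d))
    shared = set(vals_b) & set(vals_f)
    return all(vals_b[i] == vals_f[i] for i in shared)

print(congruent((0, 1), (1, 0), (1, 2), (0, 1)))  # True: shared feature 2 is 0 in both
print(congruent((0, 1), (1, 0), (1, 2), (1, 1)))  # False: they disagree on feature 2
```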
As a dataset distribution $D$ may be inconsistent, we define a function $f_D : \Omega_A \to C$, where $f_D(s) = \arg\max_{c \in C} P_D(c \mid E_A(s))$ for all not-null patterns $s$. Notice that an inconsistent dataset distribution always has classification error, because a classifier does not have enough features; thus, $f_D$ gives the category that minimizes the error for any configuration of features. If we consider the dataset distribution of Table 1, we must define $f_D$ so that it returns the most probable class of every not-null pattern of $A$; however, for any other pattern $s$, we can take 0 or 1 for $f_D(s)$.
Definition 2. Let $D$ be a dataset distribution and $B$ a sub-sequence of the sequence $A$. The subsequence $B$ of features is complete for $D$ if, for every class $c$ and all congruent not-null patterns $b, a$ of $B, A$, respectively, we have:
$$P_D(c \mid E_B(b)) = P_D(c \mid E_A(a)).$$
Definition 2 formalizes the notion of a subset of features with the same amount of information as all the features as a whole. This notion of information considers that the subset of features is sufficient to estimate the class with the same probability as the original set of features.
Definition 3. Maintaining the same terms of Definition 2, let $B_{-k}$ be the sub-sequence of $B$ without the term $A_k$. The subsequence $B$ of features is non-redundant for $D$ if, for every $k$, there are some class $c$ and some not-null congruent patterns $b, d$ of $B, B_{-k}$, respectively, such that:
$$P_D(c \mid E_B(b)) \neq P_D(c \mid E_{B_{-k}}(d)).$$
Definition 3 formalizes the notion of a subset of features where each feature provides information that does not exist in the other features of the subset. This notion of information considers that if we eliminate a feature from the subset, we will not obtain the same probability of obtaining a class. Under this definition of a non-redundant subset of features, we can say that the other features of the dataset are redundant, because they can be eliminated without losing information in the dataset. We formulate Definition 4 for redundant features.
Definition 4. Maintaining the same terms of Definition 2, let $A_{-B}$ be the subsequence of $A$ obtained by eliminating the features of a subsequence $B$ from $A$. The subsequence $B$ of features is redundant for $D$ if, for every class $c$ and all not-null congruent patterns $a, d$ of $A, A_{-B}$, respectively, we have:
$$P_D(c \mid E_A(a)) = P_D(c \mid E_{A_{-B}}(d)).$$
Taking the dataset distribution $D$ of Table 1 again, we can see that sub-sequences such as $(A_1, A_2)$ are complete and non-redundant for $D$. Sub-sequences composed of individual non-constant features, like $(A_1)$ and $(A_2)$, are non-redundant but not complete for $D$. Finally, sub-sequences like $(A_1, A_2, A_3)$ are complete but not non-redundant for $D$.
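These properties can be checked by brute force on the toy distribution `D` from the earlier snippet, as the sketch below does; it relies on the fact that restricting a record to a sub-sequence yields congruent not-null patterns.

```python
# A brute-force sketch of Definitions 2 and 3 on the toy distribution D defined above.
def cond(dist, feats, pattern, c):
    """P_D(c | E_B(pattern)); the class sits at record index 3 in this toy encoding."""
    pe = event_prob(dist, feats, pattern)
    return event_prob(dist, feats + (3,), pattern + (c,)) / pe if pe > 0 else None

def same_information(dist, small, big, classes=(0, 1)):
    """True if the sub-sequence `small` predicts every class exactly as `big` does."""
    for rec in dist:                                    # the records give the not-null patterns
        b = tuple(rec[i] for i in big)
        s = tuple(rec[i] for i in small)                # the congruent restriction of b
        if any(cond(dist, small, s, c) != cond(dist, big, b, c) for c in classes):
            return False
    return True

A = (0, 1, 2)
for B in ((0, 1), (0,), (0, 1, 2)):
    comp = same_information(D, B, A)                    # Definition 2: complete
    nred = all(not same_information(D, tuple(f for f in B if f != k), B)
               for k in B)                              # Definition 3: non-redundant
    print(B, "complete:", comp, "non-redundant:", nred)
```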
Definition 5. Let $D$ be a dataset distribution over $A_1 \times \cdots \times A_n \times C$. Let $A' = (A_1, \ldots, A_n, A_{n+1}, \ldots, A_m)$ be a sequence of finite sets, with $m > n$. We say that a dataset distribution $D'$ over $A_1 \times \cdots \times A_m \times C$ is an extension of $D$ if (i) $P_{D'}(c, E_F(a)) = P_D(c, E_F(a))$ for all $c \in C$ and all patterns $a$ of $F$, where $F = (A_1, \ldots, A_n)$, and (ii) for every not-null pattern $a$ of $F$ there is some pattern $b$ of $G = (A_{n+1}, \ldots, A_m)$ such that $P_{D'}(E_G(b) \mid E_F(a)) = 1$.
Definition 5 formalizes the notion of feature construction. It consists of a new dataset distribution whose set of features contains the set of features of the original dataset with the same distribution according to the first property. However, according to the second property, the new distribution also contains new features whose values are entirely determined by the shared features.
Following the dataset distribution $D$ of Table 1, we denote by $D'$ the dataset distribution obtained from Table 1 by eliminating the feature $A_3$. Notice that $D$ is an extension of $D'$, because (i) as $D'$ is $D$ without a feature, they have the same probabilities for the common features, and (ii) if we know the values of features 1 and 2, then we know the value of feature 3 with probability 1 for any not-null pattern.
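Continuing with the same toy objects, the following sketch checks the two conditions of Definition 5 for this pair, building `D_prime` by marginalizing the third feature out of `D`.

```python
# A sketch of Definition 5 on the toy pair: D' drops feature 3, and D extends D'.
from collections import defaultdict

D_prime = defaultdict(float)
for (a1, a2, a3, c), p in D.items():                # marginalize the constructed feature out
    D_prime[(a1, a2, c)] += p

# (i) shared features and the class keep the same probabilities under both distributions.
assert all(abs(event_prob(D, (0, 1, 3), k) - p) < 1e-12 for k, p in D_prime.items())

# (ii) per not-null pattern of features 1-2, some value of feature 3 has conditional probability 1.
determined = all(
    abs(max(event_prob(D, (0, 1, 2), (a1, a2, v)) for v in (0, 1))
        - event_prob(D, (0, 1), (a1, a2))) < 1e-12
    for (a1, a2, _c) in D_prime
)
print("D is an extension of D':", determined)        # True
```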
3. Features: Selection vs. Construction
In this section, we use the mathematical notions defined above to compare feature selection with feature construction. In this sense, feature selection is denoted as the elimination of features, while feature construction is denoted as the incorporation of new features.
Feature selection methods that do not involve the classifier in the selection are called filter methods. These methods apply some measure that seeks a subset of features containing the same amount of information as the original set but without any redundancy. The literature reports several such methods; however, they assume that a non-redundant set of features should be as small as possible. This condition can be described mathematically as obtaining a complete and non-redundant sub-sequence of features $B$ for $D$ that minimizes $|B|$.
Mathematically, we can define the construction of features from a dataset distribution $D$ as any extension $D'$ of $D$. Feature construction consists of computing new features from the original features. If the result has more features than the original, we arrive at a method contrary to the minimization criterion of feature selection by filter methods.
One of the principles of feature selection by filter methods is that redundancy in features is detrimental. We refer to a redundant feature in the sense that all the information existing in the feature can be obtained from a subset of features that does not contain the feature itself. In that sense, the construction of features without a subsequent feature selection only produces redundant features. Formally, we are saying that if $D'$ is an extension of $D$ as constructed in Definition 5, then $P_{D'}(c \mid E_{A'}(a)) = P_D(c \mid E_A(a'))$ for every class $c$ and all not-null congruent patterns $a, a'$ of $A', A$, respectively. In other words, although pattern $a$ has extra features relative to $a'$, that does not modify the probabilities of obtaining any class $c$; therefore, the extra features do not provide information.
The notion of feature construction introduced by Definition 5 does not add more information, because the original features entirely define the new features. Therefore, we are interested in knowing what else new features can provide, given that they do not have more information than what already exists.
We analyze a simple example of a classification algorithm interacting with a constructed feature before presenting theorems with more general results. First, we consider the distribution of Table 2, where we assume that the original features are 1 and 2; for each pattern $(a_1, a_2)$, feature 3 is defined as $a_3 = a_1 a_2$. Second, we consider a classifier based on the logistic model. If we denote $a = (a_1, a_2)$ and the internal parameters or weights $w = (w_0, w_1, w_2)$, the logistic model applied to the original features of pattern $a$ outputs 1 if $w_0 + w_1 a_1 + w_2 a_2 \geq 0$ and 0 otherwise. Denoting another parameter $w_3$, the logistic model applied to all features of pattern $a = (a_1, a_2, a_3)$ outputs 1 if $w_0 + w_1 a_1 + w_2 a_2 + w_3 a_3 \geq 0$ and 0 otherwise. Notice that in Figure 1, if we apply the logistic model to the original features, we obtain a linear classifier on the plane of features 1–2 that cannot give the correct class to all instances; therefore, this first model has under-fitting problems. However, if we take the second logistic model with suitable parameters $w_0$, $w_1$, $w_2$, and $w_3$, we obtain a non-linear model over the plane of features 1–2, with the region between Att. 1 = 1.5 and Att. 2 = 2.5 for class 0 and the rest of the plane for class 1. This second logistic model is equivalent to a third logistic model applied to the original features of pattern $a$ that outputs 1 if $w_0 + w_1 a_1 + w_2 a_2 + w_3 a_1 a_2 \geq 0$ and 0 otherwise. We say that both logistic models are equivalent because they partition the plane of features 1–2 exactly as Figure 1 shows. In both the second and third models, there is an extra parameter $w_3$ that modulates the non-linearity in the plane of features 1–2. The second model is a linear model over the space produced by features 1–3 and behaves non-linearly in the plane of features 1–2 due to feature 3. Instead, the third model is an inherently non-linear model for features 1–2 for $w_3$ distinct from 0. Therefore, the construction of features can increase the representation capacity of the model and solve under-fitting problems like the one we observed with the first model.
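The following numeric sketch reproduces this effect with an XOR-style stand-in (the exact Table 2 values and the Figure 1 weights are assumptions, not the paper's): one particular model that is linear in features 1–2 alone fails on a point, while a model that is linear in features 1–3 classifies all four points correctly.

```python
# A numeric sketch of the Section 3 example using an XOR-style stand-in.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])                       # not linearly separable in features 1-2

def logistic_output(a, w):
    """The thresholded logistic model: 1 if w0 + w1*a1 + ... >= 0, else 0."""
    return int(w[0] + np.dot(w[1:], a) >= 0)

w = np.array([-0.5, 1.0, 1.0])                   # a linear model on features 1-2 ...
print([logistic_output(a, w) for a in X])        # ... [0, 1, 1, 1]: misclassifies (1, 1)

X3 = np.column_stack([X, X[:, 0] * X[:, 1]])     # constructed feature 3 = a1 * a2
w3 = np.array([-0.5, 1.0, 1.0, -2.0])            # linear in features 1-3 ...
print([logistic_output(a, w3) for a in X3])      # ... [0, 1, 1, 0]: all four correct
```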
4. A Theoretical Analysis of Feature Construction
In this section, we present results that generalize what was stated in Section 3. The following theorem refutes the idea that using fewer features without losing information is always better for the classification problem.
Theorem 1. Let $D$ be a non-constant dataset distribution over $A_1 \times \cdots \times A_n \times \{0, 1\}$, whose sequence of features $A$ is non-redundant in $D$. Take $p$ as the total number of not-null patterns of $A$ whose value by $f_D$ is the minority class between zero and one. For every integer $m$ in $[1, p]$, there are a sequence of $m$ new features $B = (B_1, \ldots, B_m)$ and an extension $D'$ of $D$, such that (i) $D'$ is a distribution over $A_1 \times \cdots \times A_n \times B_1 \times \cdots \times B_m \times \{0, 1\}$, (ii) $B$ is a non-redundant sequence of features in $D'$, and (iii) there is a linear classifier that computes $f_{D'}$ from $b$, if $b$ is a not-null pattern of $B$.
Proof. Let $S$ be the set of not-null patterns of $A$ according to $D$. We denote by $\{S_1, \ldots, S_m\}$ a partition of $S$ of size $m$, where each $S_i$ contains a pattern with value one and a pattern with value zero according to $f_D$ (breaking ties in $f_D$ in favor of class zero). We also take $B_i = \{0, 1\}$ for all $i$. Then we construct $D'$: for each $a \in S$, we define the pattern $b$ of $B$ such that, if $a \in S_i$, then $b_i = f_D(a)$ and $b_j = 0$ for all $j \neq i$. As $f_D(a)$ is fully determined by $a$, and $b$ is fully determined by $f_D(a)$ and the partition, then $b$ is fully determined by $a$ and $D'$ is an extension of $D$.

For the second property, we denote by $B_{-i}$ the sequence $B$ without feature $i$ and take three patterns: $b \in \Omega_B$ with all terms zero, $b' \in \Omega_B$ with all terms zero except $b'_i = 1$, and $d \in \Omega_{B_{-i}}$ with all terms zero. Notice that $d$ is congruent to the other two patterns and all three are not-null patterns; this implies that $E_B(b)$, $E_B(b')$, and $E_{B_{-i}}(d)$ are events with non-zero probability. Then we have:
$$P_{D'}(1 \mid E_B(b')) > \tfrac{1}{2}$$
and:
$$P_{D'}(1 \mid E_B(b)) \leq \tfrac{1}{2},$$
because every instance in $E_B(b')$ comes from a pattern with $f_D(a) = 1$ and every instance in $E_B(b)$ comes from a pattern with $f_D(a) = 0$. As $E_{B_{-i}}(d)$ is the disjoint union of $E_B(b)$ and $E_B(b')$, this implies that:
$$P_{D'}(1 \mid E_{B_{-i}}(d)) = \frac{x_1 + x_2}{y_1 + y_2}$$
for the real numbers $x_1 = P_{D'}(1 \cap E_B(b))$ and $x_2 = P_{D'}(1 \cap E_B(b'))$ and the real positive numbers $y_1 = P_{D'}(E_B(b))$ and $y_2 = P_{D'}(E_B(b'))$. This quantity is a strict convex combination of $P_{D'}(1 \mid E_B(b))$ and $P_{D'}(1 \mid E_B(b'))$, and therefore:
$$P_{D'}(1 \mid E_{B_{-i}}(d)) < P_{D'}(1 \mid E_B(b')).$$
Hence, for every $i$, the class $c = 1$ and the congruent not-null patterns $b'$ and $d$ satisfy Definition 3, and $B$ is non-redundant in $D'$.

For the last property, we take the linear classifier that outputs one if $\sum_{i=1}^{m} b_i \geq \tfrac{1}{2}$ and zero otherwise. Notice that this linear classifier computes $f_{D'}$ for every not-null pattern $b$ from $B$. □
Notice that $m$ can be much bigger than $n$. However, inferring the category labels from the features of $A$ can be as complex as we want, while, selecting the bigger sequence $B$ instead, we have a problem that is solved by a linear classifier. Therefore, a feature selection method would choose $A$ over $B$ under the criterion of minimizing the number of features.
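A sketch of the construction in the proof of Theorem 1 follows, with synthetic stand-ins for the not-null patterns and for $f_D$ (all names and parameters here are illustrative): the patterns are partitioned into $m$ cells, feature $i$ of $b$ copies $f_D(a)$ when $a$ falls in cell $i$, and the linear rule $\sum_i b_i \geq 1/2$ recovers $f_D$.

```python
# A sketch of the Theorem 1 construction on synthetic stand-in data.
import numpy as np

rng = np.random.default_rng(0)
S = [tuple(p) for p in rng.integers(0, 2, size=(40, 6))]   # hypothetical not-null patterns of A
f = {a: int(rng.random() < 0.4) for a in set(S)}           # a hypothetical f_D

m = 5
ones = [a for a in f if f[a] == 1]
zeros = [a for a in f if f[a] == 0]
cells = [set(ones[i::m]) | set(zeros[i::m]) for i in range(m)]  # each cell mixes both classes

def construct_b(a):
    """The m constructed features: b_i = f(a) if a is in cell i, else 0."""
    return [f[a] if a in cell else 0 for cell in cells]

linear = lambda b: int(sum(b) >= 0.5)                      # property (iii): a linear classifier
print(all(linear(construct_b(a)) == f[a] for a in f))      # True
```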
Although we refer to complexity, there is no single measure of complexity for classification problems [40]. However, we observe that classifiers that use a single variable, artificial neural networks of a single neuron, and the simplest SVM models are all linear classifiers. Additionally, linear classifiers have a small VC dimension; in the single-variable case, it is only two [41]. Therefore, for our purpose, we consider linearly separable sets as those with the least complexity.
Theorem 1 shows an extreme case where feature construction breaks a standard criterion for feature selection methods. However, the proof of the theorem does not provide a practical method for feature construction, because we can only build the features on the training set. We note that to construct the features of $B$, we must know in advance the most probable class for each pattern of $A$; that is to say, we must first solve the classification problem with the features of $A$ alone, which does not make sense. Therefore, we will now study a standard method for constructing features.
The following definition generalizes the construction of features using monomials, which was used as an example in Section 3. The idea is that there is a feature equivalent to each monomial of degree less than or equal to k over the original features.
Definition 6. Taking the same terms from Definition 5, and denoting a monomial function as a product of features of $A$, we denote $D'$ as a k-monomial extension of $D$ if (i) for each $i > n$, there is a monomial function $g_i$ of grade equal to or less than $k$, such that $a_i = g_i(a_1, \ldots, a_n)$ for each not-null pattern $a$ of $D'$, and (ii) for each monomial function $g$ of grade equal to or less than $k$, there is some $i$, such that $a_i = g(a_1, \ldots, a_n)$ for each not-null pattern $a$ of $D'$.
For example, suppose that the dataset distribution $D$ has three features, and denote $(a_1, a_2, a_3)$ as a pattern for those features. Then, a pattern from a 2-monomial extension of $D$ could be of the form $(a_1, a_2, a_3, a_1^2, a_1 a_2, a_1 a_3, a_2^2, a_2 a_3, a_3^2)$, and a pattern from a 3-monomial extension of $D$ could be of the form $(a_1, a_2, a_3, a_1^2, a_1 a_2, \ldots, a_1^3, a_1^2 a_2, \ldots, a_3^3)$.
Notice that Definition 6 does not give an explicit order for the new features; however, Definition 5 only guarantees that the first $n$ features of $D'$ are the original features of $D$. Then, the features $i$ of $D'$ for $i > n$ are functions of the first $n$ features of $D'$ (which are also the features of $D$).
The following theorem describes how the feature construction method described in Definition 6 can reduce the complexity of the classification problem.
Theorem 2. For every dataset distribution $D$ over $A_1 \times \cdots \times A_n \times \{0, 1\}$, there is some $k$ such that some linear classifier computes $f_{D'}$ from the not-null patterns of a k-monomial extension $D'$ of $D$.
Proof. Without loss of generality, let $D$ be a dataset distribution over $A_1 \times \cdots \times A_n \times \{0, 1\}$ whose features have more than one possible value. We denote (i) the minimum absolute difference between values in the features as $d$, (ii) the difference between the maximum and minimum values in feature $i$ as $r_i$, and (iii) the maximum of the $r_i$ as $M$. Then, from these quantities, we define the function $g(a) = \sum_{i=1}^{n} (M/d + 1)^{i-1} a_i$, which is a polynomial of grade 1 on the terms of $a$. Notice that $g$ is an injective function if we take the patterns of $A$ as the domain, because the weight of each feature exceeds the maximum total variation contributed by all previous features. We denote $P$ as the Lagrange polynomial such that $P(g(a)) = f_D(a)$ for every not-null pattern $a$. Let $k$ be the maximum grade of a monomial from $P(g(a))$ when we take the variables $a_1, \ldots, a_n$. Then $P(g(a)) = h(b)$ for some polynomial $h$ of grade 1 over the patterns $b$ of a k-monomial extension of $D$. Although $h$ is a regression model, it takes only zero or one values in the patterns, and therefore it can be taken as a linear classification model. □
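The following sketch traces the mechanics of this proof on the XOR distribution: an injective grade-1 map $g$ (with the weight choice assumed in the reconstruction above) and a Lagrange interpolation $P$ with $P(g(a)) = f_D(a)$.

```python
# A sketch of the Theorem 2 mechanics on the four XOR patterns.
from scipy.interpolate import lagrange

pts = [(0, 0), (0, 1), (1, 0), (1, 1)]
f = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}        # f_D for the XOR distribution

# Here d = 1 and M = 1, so the weights (M/d + 1)^(i-1) = 2^(i-1) make g injective.
g = lambda a: a[0] + 2 * a[1]
assert len({g(a) for a in pts}) == len(pts)              # g is injective on the patterns

P = lagrange([g(a) for a in pts], [f[a] for a in pts])   # P(g(a)) = f_D(a) at every pattern
print([int(round(float(P(g(a))))) for a in pts])         # [0, 1, 1, 0]
# Expanding P(g(a)) in a1, a2 gives a polynomial of grade at most 3, so k = 3 suffices here.
```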
We present an example with Table 3, whose first two columns together with the class correspond to a dataset distribution $D$ for an XOR function, which is not linearly separable. However, its 2-monomial expansion, with patterns of the form $(a_1, a_2, a_1^2, a_1 a_2, a_2^2)$, is a linearly separable dataset distribution.
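A direct check of this example: one grade-1 model over the 2-monomial features that reproduces XOR exactly is $h(b) = b_1 + b_2 - 2 b_4$, where $b_4 = a_1 a_2$ (these particular coefficients are a straightforward derivation, not taken from Table 3).

```python
# The 2-monomial extension of XOR admits an exact grade-1 model: a1 + a2 - 2*a1*a2.
for a1, a2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    b = (a1, a2, a1 * a1, a1 * a2, a2 * a2)      # the features of grade <= 2
    print((a1, a2), (a1 ^ a2) == b[0] + b[1] - 2 * b[3])   # True for all four patterns
```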
Definition 7. Let $(D_1, D_2, \ldots)$ be a sequence of dataset distributions; if $D_{i+1}$ is an extension of $D_i$ for all $i$, then $(D_1, D_2, \ldots)$ is a progressive sequence of dataset distributions.
This definition seeks to formalize the notion of a feature construction method that is applied iteratively, producing an unbounded quantity of new features. For example, if we construct each k-monomial extension $D_k$ of $D_1$, such that the features of $D_k$ keep the same indices in $D_{k+1}$, then $(D_1, D_2, \ldots)$ is a progressive sequence of dataset distributions.
Definition 8. We say that a feature construction method is linearly asymptotic if, from every dataset distribution $D_1$ over $A_1 \times \cdots \times A_n \times \{0, 1\}$, the feature construction method produces a progressive sequence of dataset distributions $(D_1, D_2, \ldots)$ such that there is some $k$ and a linear classifier that can compute $f_{D_k}$ from the not-null patterns of $D_k$.
Definition 8 presents a desirable property for any feature construction method. This property is equivalent to a feature construction method never getting stuck in patterns that are not linearly separable. Proving that a feature construction method is linearly asymptotic therefore represents a formal validation of the method. For example, by Theorem 2, we conclude that the monomial construction method is linearly asymptotic; the sketch below illustrates this validation in practice.
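The proposed validation criterion can be prototyped as follows, assuming scikit-learn utilities (PolynomialFeatures and LinearSVC): iteratively build the k-monomial extension and test whether a linear classifier attains perfect training separation. Four-bit parity is used as a stand-in problem that requires grade-4 monomials.

```python
# A sketch of the validation criterion: does the k-monomial sequence become linearly separable?
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 4)).astype(float)
y = (X.sum(axis=1) % 2).astype(int)                    # 4-bit parity: far from linear

for k in range(1, 6):
    Xk = PolynomialFeatures(degree=k, include_bias=False).fit_transform(X)
    acc = LinearSVC(C=1e6, max_iter=100000).fit(Xk, y).score(Xk, y)
    print(f"k={k}: training accuracy {acc:.2f}")       # should reach 1.00 at k = 4
```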
Note that this desired property is similar to the kernel trick exploited by SVM models, where the data are mapped to a higher-dimensional space such that a low-capacity classifier can separate the classes [42].
6. Discussion
This is not the first work that relates features to data complexity. The quotient between the number of instances and the number of features (known as the T2 measure) has been studied as a measure of data complexity [40]. However, T2 is independent of the notion of complexity used in this work, since we can define linearly separable datasets in all ranges of T2. There are also applications of complexity measures to the feature selection problem, but they apply a mainly experimental analysis [52,53,54,55].
The concept of a redundant set of features is based on the relevant-feature definition of John et al. [56]. There are several other definitions for redundancy or redundant features; however, these definitions are more oriented to applications than to a theoretical analysis of redundancy and its effects [57,58,59,60,61,62,63,64].
Our theoretical results show that many redundant features can reduce the complexity of the data. This result can be interpreted as a feature providing representativeness without providing extra information, as seen in the example in Section 3. It can also be interpreted as redundant features being capable of increasing the capacity of the model.
Our experimental results reinforce the evidence that redundancy itself is not necessarily detrimental. The real and synthetic datasets showed that extended datasets with many redundant features constructed as monomials can achieve higher accuracy. However, the gain in accuracy was more pronounced in the synthetic datasets. The synthetic datasets applied did not have noise and had few dimensions, which are the main differences from the real datasets studied.
Usually, redundant features that are present before preprocessing entail greater complexity for the learning algorithm than the classifier can handle. The reason is that the classifier cannot find the globally optimal rule, because the search space increases exponentially; therefore, it returns a local optimum. Due to this increased search space, as we add features, the problem becomes more difficult and classifiers tend to perform worse. However, this occurs because those initial features do not add enough expressiveness. Therefore, features obtained from suitable construction methods should not be treated in the same way as the initial features.
Finally, an increase or decrease in the number of features implies an increase or decrease in the number of parameters of the model, respectively. Therefore, the choice of features can induce overfitting or underfitting. However, these learning problems are not commonly studied in the development of feature selection methods. Therefore, the criteria for selecting features should consider both the information provided by each feature and the representativeness provided by the features. Furthermore, in the same way that there are regularization methods to avoid overfitting caused by the internal parameters of the model, regularization methods could be developed against an excess of features.