Applied Sciences
  • Article
  • Open Access

Published: 15 November 2020

An Effective Multi-Label Feature Selection Model Towards Eliminating Noisy Features

1. College of Mathematics and Statistics Science, Ludong University, Yantai 264025, China
2. College of Computer Science, Nankai University, Tianjin 300071, China
3. Tianjin Key Laboratory of Wireless Mobile Communications and Power Transmission, College of Electronic and Communication Engineering, Tianjin Normal University, Tianjin 300387, China
4. RIKEN National Science Institute, Wako, Saitama 351-0198, Japan
This article belongs to the Special Issue Machine Learning Methods with Noisy, Incomplete or Small Datasets

Abstract

Considerable effort has been devoted to feature selection as a means of dimension reduction for various machine learning tasks. Existing feature selection models focus on selecting the most discriminative features for the learning targets. However, this strategy is weak in handling two kinds of features, namely irrelevant and redundant ones, which are collectively referred to as noisy features. These features may hamper the construction of optimal low-dimensional subspaces and compromise the learning performance of downstream tasks. In this study, we propose a novel multi-label feature selection approach by embedding label correlations (dubbed ELC) to address these issues. In particular, we extract label correlations to obtain reliable label space structures and employ them to steer feature selection. In this way, the label and feature spaces can be expected to be consistent and noisy features can be effectively eliminated. An extensive experimental evaluation on public benchmarks validated the superiority of ELC.

1. Introduction

In pattern recognition, feature selection is important because of its effectiveness in reducing dimensionality. Feature selection methods are divided into supervised, semi-supervised, and unsupervised ones, according to whether the instances are labeled, partially labeled, or unlabeled [1,2,3,4]. For supervised cases, class labels are employed for measuring features’ discriminative abilities. Many popular and efficient feature selection methods belong to this group [5,6,7,8,9,10]. Supervised methods are further categorized into three well-known models: filter, wrapper, and embedded [11]. In recent years, some hybrid methods have emerged that combine filter and wrapper processes to enhance performance and reduce computational cost [12,13].
From another categorization viewpoint, existing feature selection approaches can also be grouped into single-label and multi-label ones, which differ in the number of labels associated with each instance [14]. In single-label feature selection, instances and labels hold many-to-one connections, and target separability is emphasized in the learning task. With the great potential and success of multi-label learning in many machine learning fields, such as text categorization [15], content annotation [16], and protein location prediction [17], multi-label feature selection has received considerable attention in recent years. We address supervised multi-label feature selection in this study.
In multi-label learning, label correlations are the key to capturing the complicated relationships among instances, which are typically annotated with multiple labels [18,19]. The mainstream multi-label feature selection strategy is to extract label correlations (via statistical or information-based measurements) and employ them to help find the most remarkable features. A critical issue, however, is that this strategy can be trapped by two kinds of features, that is, irrelevant and redundant ones. Irrelevant features are those with low discriminative ability; they are loosely correlated with the learning targets and may even provide misleading information. Compared with irrelevant features, redundant features are more deceptive. They may exhibit excellent (or comparably superior) performance and mix with remarkable features. Nevertheless, redundant features contribute little to enhancing the discriminative ability of the constructed low-dimensional subspace, because the learning information they provide overlaps with the information that has already been distilled. In general, we regard both irrelevant and redundant features as noisy ones, which may confuse selection processes and compromise the learning performance of downstream tasks.
In this paper, we present an effective multi-label feature selection model, named ELC, which embeds label correlations to eliminate noisy features. Our major strategy is to keep the feature and label spaces consistent and to exploit reliable label structures to drive feature selection. Concretely, we quantitatively assess label correlations in the label space and embed them in feature selection. In this way, the label structure information can be maximally preserved in the constructed low-dimensional subspace, and eventually the consistency between feature and label spaces can be achieved. Furthermore, we devise an efficient framework based on sparse multi-task learning to optimize ELC, which helps ELC find globally optimal solutions and converge efficiently.
The major contributions of this paper are as follows:
  • We present a novel multi-label feature selection model to address the issue of noisy features. The model quantitatively measures label correlations and employs feature-label space consistency to steer feature selection.
  • We devise a compact framework to optimize the proposed model. The framework resorts to the multi-task learning strategy and ensures globally optimal solutions and efficient convergence.
  • Comprehensive experiments on openly available benchmarks validate the performance of the proposed model in feature selection and noise elimination.
The remaining parts of this paper are arranged as follows: related works are reviewed in Section 2; the proposed model ELC and its optimization framework are respectively introduced in Section 3 and Section 4; the experimental comparisons of ELC with several popular feature selection approaches are presented in Section 5; finally, conclusions are drawn in Section 6.

3. The Methodology: ELC

3.1. Model Description

In this paper, we use $\{x_i, y_i\}_{i=1}^n$ to denote the data set, where $X = [x_1; \ldots; x_n] \in \mathbb{R}^{n \times d}$ represents the instance matrix and instances are characterized by $d$ features in the feature set $F = \{f_1, \ldots, f_d\}$. $Y = [y_1, \ldots, y_l] \in \{0,1\}^{n \times l}$ denotes the target label matrix, where $y_{ij} = 1$ represents a positive label and $y_{ij} = 0$ corresponds to a negative one.
Then, we formulate the multi-label feature selection by embedding label correlation (ELC) as follows:
$$\min_W \frac{1}{2}\left\|\hat{Y}^T\hat{Y} - S\right\|_F^2, \quad \text{s.t.} \quad \hat{Y} = \frac{1}{n}(XW)^TY,\ W \in \{0,1\}^{d \times l},\ \|W\|_{2,0} = k, \tag{1}$$
where $S \in \mathbb{R}^{l \times l}$ represents the label correlation matrix calculated over the initial label matrix, and $k$ is the number of selected features. $W \in \mathbb{R}^{d \times l}$ is the feature selection matrix, where $w_{ij}$ indicates the importance (also known as the weight) of the $i$-th feature to the $j$-th label.
Equation (1) is the feature evaluation function of ELC, which is essentially a Frobenius-norm quadratic model. The matrix $S$ represents the label correlations extracted from the label space, and each of its elements describes the relation between two target labels. These correlations can be easily obtained by quantitative measurements, such as the RBF kernel function or the Pearson correlation coefficient. $\hat{Y}^T\hat{Y}$ represents the label correlations extracted from the reduced feature space. $\hat{Y}^T\hat{Y}$ differs from $S$ on account of the disturbance of noisy features. As described in Section 1, noisy features may distort the structure of the feature space and provide negative learning information. Considering this, ELC evaluates features based on their ability to preserve label correlations in the feature space, that is, to keep feature-label space consistency. The features that minimize the discrepancy between $\hat{Y}^T\hat{Y}$ and $S$ are highly scored by ELC. In this way, ELC can be expected to construct an optimal feature subspace while eliminating different kinds of noisy features.
Under the $\ell_{2,0}$-norm constraint in Equation (1), only $k$ rows in $W$ are nonzero. These correspond to the $k$ features selected for the $l$ target labels, where 1 represents selected and 0 represents not selected. Note that $k$ is most likely unequal to $l$. That is, more than one feature may be selected as responsible for discriminating the same label, or a single feature may be discriminative for more than one label. In the former case, multiple features are unified to recognize one target, while in the latter case one feature deals with multiple recognition sub-tasks.
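To make the evaluation function concrete, the following minimal NumPy sketch computes the objective of Equation (1) for a given binary selection matrix; it is an illustration with hypothetical toy data, not the authors' released implementation, and the function name elc_objective is our own.

import numpy as np

def elc_objective(X, Y, S, W):
    # Value of Eq. (1): 0.5 * ||Yhat^T Yhat - S||_F^2 with Yhat = (1/n) (X W)^T Y.
    n = X.shape[0]
    Y_hat = (X @ W).T @ Y / n
    return 0.5 * np.linalg.norm(Y_hat.T @ Y_hat - S, 'fro') ** 2

# Toy usage: 6 instances, 4 features, 2 labels; W selects features 0 and 2.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
Y = rng.integers(0, 2, size=(6, 2)).astype(float)
S = Y.T @ Y / 6.0                      # a simple stand-in label correlation matrix
W = np.zeros((4, 2)); W[0, 0] = 1; W[2, 1] = 1
print(elc_objective(X, Y, S, W))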

3.2. Property Analysis

The feature subset $\hat{F} = \{\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_k\}$ selected by ELC can be considered as maximally maintaining feature-label space consistency. $\hat{F}$ is expected to consist of the remarkable features and exclude the noisy ones. In this subsection, we further analyze the properties of ELC and reveal its underlying characteristics.
Suppose that each feature in $F$ has been standardized to have zero mean and unit length. Then, the following holds for Equation (1):
$$\left\|\hat{Y}^T\hat{Y} - S\right\|_F^2 = \left\|\frac{1}{n^2}Y^T(XW)(XW)^TY - S\right\|_F^2.$$
This is the objective of ELC. To illustrate its properties more clearly, let $\hat{S} = n^2S$ and $H = Y^T(XW)(XW)^TY$. Then,
$$\left\|\hat{Y}^T\hat{Y} - S\right\|_F^2 = \frac{1}{n^4}\left[\mathrm{tr}(H^TH) + \mathrm{tr}(\hat{S}^T\hat{S}) - 2\,\mathrm{tr}(\hat{S}^TH)\right].$$
Three terms are involved in this equation. Clearly, $\mathrm{tr}(\hat{S}^T\hat{S})$ represents the label correlation information extracted from the label space and is constant in the selection process. Thus, it is easy to conclude that $\min_W\|\hat{Y}^T\hat{Y} - S\|_F^2$ is equivalent to $\min_W \mathrm{tr}(H^TH)$ and $\max_W \mathrm{tr}(\hat{S}^TH)$. Then, two properties of ELC are given as follows:
Property 1.
Label correlation information can be maximally embedded in feature selection by ELC.
Proof. 
$\mathrm{tr}(\hat{S}^TH) = \mathrm{tr}\big((XW)^TY\hat{S}Y^T(XW)\big) = \sum_{i=1}^k \hat{f}_i^T\big(Y\hat{S}Y^T\big)\hat{f}_i = \sum_{i=1}^k \hat{f}_i^T\Big(\sum_{c_1=1}^l\sum_{c_2=1}^l y_{c_1}s_{c_1,c_2}y_{c_2}^T\Big)\hat{f}_i$, where $s_{c_1,c_2}$ is the correlation degree of the labels $y_{c_1}$ and $y_{c_2}$, and $XW$ indicates the selected features. Then, the following holds: $\min_W\|\hat{Y}^T\hat{Y} - S\|_F^2 \Rightarrow \max_W \sum_{i=1}^k \hat{f}_i^T\Big(\sum_{c_1=1}^l\sum_{c_2=1}^l y_{c_1}s_{c_1,c_2}y_{c_2}^T\Big)\hat{f}_i$. The term $\sum_{c_1=1}^l\sum_{c_2=1}^l y_{c_1}s_{c_1,c_2}y_{c_2}^T$ can be regarded as the correlation information of pairwise labels. Therefore, ELC can maximally embed label correlations in its feature selection process. □
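The trace identity used in this proof can be checked numerically. The sketch below is illustrative only (random toy data, a symmetric stand-in for $\hat{S}$); it compares $\mathrm{tr}(\hat{S}^TH)$ with the sum over the selected feature columns.

import numpy as np

rng = np.random.default_rng(1)
n, k, l = 8, 3, 2
F_hat = rng.standard_normal((n, k))            # columns of XW, i.e., the selected features
Y = rng.integers(0, 2, size=(n, l)).astype(float)
S_hat = (n ** 2) * (Y.T @ Y / n)               # any symmetric l x l matrix serves as S_hat here

H = Y.T @ F_hat @ F_hat.T @ Y
lhs = np.trace(S_hat.T @ H)
rhs = sum(F_hat[:, i] @ (Y @ S_hat @ Y.T) @ F_hat[:, i] for i in range(k))
print(np.isclose(lhs, rhs))                    # True: tr(S_hat^T H) = sum_i f_i^T (Y S_hat Y^T) f_i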
Label correlation information is important for multi-label learning. For example, images of the sea may share some common labels, such as ship, fish, and seagull, and the close correlations of these labels may help distinguish the image category and find the shared features. Existing multi-label learning methods are categorized on the basis of the label correlation orders they consider [39], and their correlation modeling capabilities directly affect their discriminative performance. As demonstrated in Property 1, ELC measures pairwise label correlations. Furthermore, it preserves this correlation information in the constructed feature subspace, which is crucial for ELC to eliminate noisy features. In other words, features that maximally preserve label correlation information are preferred by ELC. This strategy facilitates building a low-dimensional feature space that is consistent with the label space and also suitable for multi-label learning.
In addition to the above property with respect to maximally embedding label correlations, another important property of ELC is illustrated as follows:
Property 2.
Feature redundancy can be minimized by ELC.
Proof. 
$\mathrm{tr}(H^TH) = \sum_{i,j=1}^k \big((\hat{f}_i^TY)(\hat{f}_j^TY)^T\big)^2 = \sum_{i,j=1}^k\Big(\sum_{c=1}^l \langle\hat{f}_i, y_c\rangle\langle\hat{f}_j, y_c\rangle\Big)^2 = \sum_{i,j=1}^k\sum_{c=1}^l n^4\sigma_{y_c}^4\,\rho_{\hat{f}_i,y_c}^2\rho_{\hat{f}_j,y_c}^2$, where $\sigma_{y_c}$ is the standard deviation of the label $y_c$, and $\rho_{\hat{f}_i,y_c}$ and $\rho_{\hat{f}_j,y_c}$ are the Pearson correlation coefficients of $y_c$ with the features $\hat{f}_i$ and $\hat{f}_j$, respectively. Then, we have $\min_W\|\hat{Y}^T\hat{Y} - S\|_F^2 \Rightarrow \min_W \sum_{i,j=1}^k\sum_{c=1}^l n^4\sigma_{y_c}^4\,\rho_{\hat{f}_i,y_c}^2\rho_{\hat{f}_j,y_c}^2$.
Clearly, $n$ and $\sigma_{y_c}$ are constant in the feature selection process. $\sum_{c=1}^l \rho_{\hat{f}_i,y_c}\rho_{\hat{f}_j,y_c}$ can be regarded as the shared label dependency of the features $\hat{f}_i$ and $\hat{f}_j$, that is, the feature redundancy for recognizing the target $y_c$. Therefore, ELC can minimize feature redundancy in its feature selection process. □
Note that the term $\sum_{c=1}^l \rho_{\hat{f}_i,y_c}\rho_{\hat{f}_j,y_c}$ in Property 2 is obtained by introducing the label correlation information. This is a completely novel estimation of label-specific feature redundancy. The vast majority of existing feature selection approaches (including single-label and multi-label ones) adopt a univariate measurement criterion, in which only the top-$k$ individually scored features have the opportunity to prevail. This strategy largely increases the redundant recognition information shared between features. For example, if we select genes that are all discriminative for type 1 diabetes, we probably cannot give an accurate diagnosis, since these features may be less aware of other types of diabetes. This is why we have to reduce recognition redundancy and enrich recognition information. Some approaches are able to reduce feature redundancy, but their focus is not label-specific redundancy. For example, $\sum_{i,j=1}^k \rho_{\hat{f}_i,\hat{f}_j}$ is actually reduced in SPFS [20]. This term includes additional information that is irrelevant to recognition and is therefore inappropriate. In contrast, ELC removes label-specific feature redundancy and is more suitable for multi-label learning while eliminating noisy features.
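The contrast drawn here can be illustrated with a short sketch (hypothetical toy data; the helper pearson and the two constructed features are ours): two features that both track the same label have a high label-specific redundancy $\sum_{c=1}^l\rho_{\hat{f}_i,y_c}\rho_{\hat{f}_j,y_c}$, the quantity ELC penalizes, whereas an SPFS-style criterion would instead look at the plain feature-feature correlation $\rho_{\hat{f}_i,\hat{f}_j}$.

import numpy as np

def pearson(a, b):
    # Pearson correlation coefficient of two 1-D arrays.
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

rng = np.random.default_rng(2)
n, l = 50, 3
Y = rng.integers(0, 2, size=(n, l)).astype(float)
f_i = Y[:, 0] + 0.1 * rng.standard_normal(n)   # both features mainly track label 0
f_j = Y[:, 0] + 0.1 * rng.standard_normal(n)

label_specific = sum(pearson(f_i, Y[:, c]) * pearson(f_j, Y[:, c]) for c in range(l))
feature_feature = pearson(f_i, f_j)
print(label_specific, feature_feature)         # both are large for this redundant pair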
As discussed above, ELC possesses two properties, i.e., maximally preserving label correlation information and minimizing label-specific feature redundancy. These characteristics account for the superior ability of ELC in eliminating noisy features and picking out remarkable ones.

4. Multi-Task Optimization for ELC

Equation (1) describes an integer programming problem, which is NP-hard and complicated to solve. Moreover, the $\ell_{2,0}$-norm constraint in Equation (1) is non-smooth, which leads to a slow convergence rate. In this section, we devise an efficient framework to address this problem by using sparse multi-task learning within the proximal alternating direction method (PADM) framework [40].
Suppose the spectral decomposition of the correlation matrix S can be denoted as
$$S = \Phi\Sigma\Phi^T = \Phi\,\mathrm{diag}(\sigma_1, \ldots, \sigma_l)\,\Phi^T, \quad \sigma_1 \geq \cdots \geq \sigma_l,$$
where Φ and Σ are respectively the eigenvector and eigenvalue matrices of S . Then, Equation (1) can be reformulated as
$$\min_{W,p} \frac{1}{2}\left\|Y^TX\,\mathrm{diag}(p)\,W - \Gamma^*\right\|_F^2, \quad \text{s.t.} \quad W \in \mathbb{R}^{d \times l},\ \|W\|_{2,1} \leq t,\ p \in \{0,1\}^d,\ p^T\mathbf{1} = k, \tag{2}$$
where $\Gamma^* = n\Phi\Sigma^{1/2}$, $t$ is a hyperparameter that constrains $\|W\|_{2,1}$ so that the problem admits a convex solution, $p$ is a feature indicator vector reflecting whether the corresponding features are selected (1 for selected and 0 otherwise), and $\mathbf{1}$ is the vector of all ones.
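A minimal sketch of this reformulation step, assuming $S$ is symmetric positive semi-definite so that $\Sigma^{1/2}$ is real; the function name gamma_star and the clipping of tiny negative eigenvalues are our own choices, not part of the paper.

import numpy as np

def gamma_star(S, n):
    # Gamma* = n * Phi * Sigma^{1/2}, from the spectral decomposition S = Phi Sigma Phi^T.
    eigvals, Phi = np.linalg.eigh(S)            # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # reorder so that sigma_1 >= ... >= sigma_l
    eigvals, Phi = eigvals[order], Phi[:, order]
    eigvals = np.clip(eigvals, 0.0, None)       # guard against small negative values
    return n * Phi @ np.diag(np.sqrt(eigvals))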
On the basis of Equation (2), ELC is reformulated as a multivariate regression problem, which enables the use of multi-task learning [41]. This technique aims to learn a common set of features to tackle multiple relevant tasks and excels at various sparse learning formulations, including the optimization problem in Equation (1). Based on multi-task learning, we then obtain the equivalent form of ELC as follows:
$$\min_{W,p} \frac{1}{2}\left\|\hat{A}\,\mathrm{diag}(p)\,W - \Gamma^*\right\|_F^2 + \lambda\|W\|_{2,1}, \quad \text{s.t.} \quad p \in \{0,1\}^d,\ p^T\mathbf{1} = k, \tag{3}$$
where $\hat{A} = Y^TX$, and $\lambda > 0$ is the regularization parameter. Clearly, we can apply the augmented Lagrangian method to solve this problem. Then, Equation (3) is further reformulated as
$$\min_{U,W,p} \frac{1}{2}\left\|\hat{A}\,\mathrm{diag}(p)\,W - \Gamma^*\right\|_F^2 + \lambda\|U\|_{2,1}, \quad \text{s.t.} \quad U = W,\ p \in \{0,1\}^d,\ p^T\mathbf{1} = k. \tag{4}$$
The Lagrangian function can be defined as
$$L(U, W, p, V) = \frac{1}{2}\left\|\hat{A}\,\mathrm{diag}(p)\,W - \Gamma^*\right\|_F^2 + \frac{\beta}{2}\|W - U\|^2 + \lambda\|U\|_{2,1} - \mathrm{tr}\big(V^T(W - U)\big), \tag{5}$$
where $V = [v_1^T, \ldots, v_d^T]^T \in \mathbb{R}^{d \times l}$ is the Lagrangian multiplier, and $\beta > 0$ is the penalty parameter.
Equation (5) involves four variables, that is, the auxiliary variable $U$, the feature weight matrix $W$, the feature indicator vector $p$, and the Lagrangian multiplier $V$. Clearly, simultaneously optimizing four variables is impractical. Accordingly, $V$ is temporarily fixed for simplification in the following analysis. Then, minimizing $L(U, W, p, V)$ is equivalent to the following two subproblems, i.e.,
  • $\min_U L_1(U) = \min_U \frac{\beta}{2}\|W - U\|^2 + \lambda\|U\|_{2,1} + \mathrm{tr}(V^TU)$;
  • $\min_{W,p} L_2(W, p) = \min_{W,p} \frac{1}{\beta}\left\|\hat{A}\,\mathrm{diag}(p)\,W - \Gamma^*\right\|_F^2 + \|W - U\|^2 - \frac{2}{\beta}\mathrm{tr}(V^TW)$.
As to $L_1(U)$, the following holds:
$$L_1(U) = \sum_{i=1}^d \left(\frac{\beta}{2}\|w_i - u_i\|^2 + \lambda\|u_i\| + \mathrm{tr}(v_i^Tu_i)\right), \tag{6}$$
where $w_i$ and $u_i$ are the $i$-th row vectors of $W$ and $U$, respectively. Then, we reformulate $\min_U L_1(U)$ into its closed form [41] as
$$\min_{u_i} \sum_{i=1}^d \frac{\beta}{2}\left\|w_i - u_i + \frac{1}{\beta}v_i\right\|^2 + \lambda\|u_i\|. \tag{7}$$
Minimizing Equation (7) yields the following closed-form optimal solution:
$$u_i = \max\left(\left\|w_i + \frac{1}{\beta}v_i\right\| - \frac{\lambda}{\beta},\ 0\right)\frac{w_i + \frac{1}{\beta}v_i}{\left\|w_i + \frac{1}{\beta}v_i\right\|}. \tag{8}$$
Then, the optimal $U$ in iteration $[t+1]$ can be denoted as
$$U^{[t+1]} = \max\left(\left\|W^{[t]} + \frac{1}{\beta}V^{[t]}\right\| - \frac{\lambda}{\beta},\ 0\right)\frac{W^{[t]} + \frac{1}{\beta}V^{[t]}}{\left\|W^{[t]} + \frac{1}{\beta}V^{[t]}\right\|}. \tag{9}$$
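Equation (9) is a shrinkage (soft-thresholding) operator applied to $W^{[t]} + \frac{1}{\beta}V^{[t]}$. A minimal sketch, taking the max and the norms per row of the matrix (our interpretation of the matrix notation above; illustrative code only):

import numpy as np

def update_U(W, V, beta, lam):
    # Row-wise shrinkage of Eq. (9): scale each row of W + V/beta toward zero by lam/beta.
    Z = W + V / beta
    norms = np.linalg.norm(Z, axis=1, keepdims=True)          # ||w_i + v_i / beta|| per row
    scale = np.maximum(norms - lam / beta, 0.0) / np.maximum(norms, 1e-12)
    return scale * Z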
In terms of $\min_{W,p} L_2(W, p)$, we let $P = \{p \mid p \in \{0,1\}^d,\ p^T\mathbf{1} = k\}$. The dual problem of $\min_{W,p} L_2(W, p)$ is
$$\min_{p \in P} \max_W L_2(W, p). \tag{10}$$
Since simultaneously solving both variables $p$ and $W$ is still difficult, we first fix $p$ to optimize $W$. Then, the solution of $W$ can be obtained from
$$\left(\mathrm{diag}(p)\,\hat{A}^T\hat{A}\,\mathrm{diag}(p) + \beta I\right)W = \mathrm{diag}(p)\,\hat{A}^T\Gamma^* + \beta U + V, \tag{11}$$
where $I$ is the identity matrix. The structure of $\hat{A}^T\hat{A}$ is generally not circulant, and therefore the computation of Equation (11) is computationally involved [42]. Considering this, an approximate term is added to $L_2(W, p)$ as follows:
$$\tilde{L}_2(W, p) = \frac{1}{\beta\tau}\left\|W - W^{[t]} + \tau\Omega^{[t]}\right\|^2 - \frac{2}{\beta}\mathrm{tr}(V^TW) + \|W - U\|^2, \quad \Omega^{[t]} = \mathrm{diag}(p^{[t]})\,\hat{A}^T\left(\hat{A}\,\mathrm{diag}(p^{[t]})\,W^{[t]} - \Gamma^*\right), \tag{12}$$
where $\tau > 0$, and $W^{[t]}$ is the optimal value of $W$ in iteration $[t]$. Then, the solution for $W^{[t+1]}$ is
$$W^{[t+1]} = \frac{\tau}{\beta\tau + 1}\left(\beta U^{[t+1]} + V^{[t]} + \frac{1}{\tau}\left(W^{[t]} - \tau\Omega^{[t]}\right)\right). \tag{13}$$
The detailed inference can be found in the Appendix A.
Similarly, we can easily obtain the optimal $p$ by fixing $W$. Equation (10) is then equivalent to the following minimization problem:
$$\min_{p \in P}\left\|\hat{A}\,\mathrm{diag}(p)\,W - \Gamma^*\right\|_F^2 = \min_{p \in P}\left\|Y^T\sum_{i=1}^d p_i f_i w_i - \Gamma^*\right\|_F^2. \tag{14}$$
Apparently, the top-$k$ features that minimize $\left\|Y^Tf_iw_i - \Gamma^*\right\|_F^2$ can be regarded as the remarkable ones. Their corresponding entries in $p$ are set to 1.
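Since Equation (14) decomposes over features, the p-update amounts to scoring each feature independently and keeping the k best. A minimal sketch (illustrative; the function name update_p is ours):

import numpy as np

def update_p(X, Y, W, Gamma_star, k):
    # Set p_i = 1 for the k features minimizing ||Y^T f_i w_i - Gamma*||_F^2, cf. Eq. (14).
    d = X.shape[1]
    scores = np.array([
        np.linalg.norm(Y.T @ np.outer(X[:, i], W[i, :]) - Gamma_star, 'fro') ** 2
        for i in range(d)
    ])
    p = np.zeros(d)
    p[np.argsort(scores)[:k]] = 1.0
    return p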
Note that the Lagrangian multiplier $V$ was fixed throughout the above analysis, mainly to simplify the solution process. We further tackle this problem in the popular PADM framework, as illustrated in Algorithm 1. In this framework, $V$ can be updated as
$$V^{[t+1]} = V^{[t]} - \beta\left(W^{[t+1]} - U^{[t+1]}\right). \tag{15}$$
Algorithm 1 ELC
Input: $F = \{f_1, \ldots, f_d\}$, $Y$, $S$, $k$, $\beta$, $\tau$, $\lambda$
Output: $p^{[t]}$
1: begin
2: $t = 0$, $W^{[0]} = \mathbf{0}_{d \times l}$, $U^{[0]} = \mathbf{0}_{d \times l}$, $V^{[0]} = \frac{1}{d}\mathbf{1}_{d \times l}$;
3: find the top-$k$ features $\hat{f}_1^{[0]}, \ldots, \hat{f}_k^{[0]}$ that minimize Equation (1), and set $p_i^{[0]} = 1$ if $f_i \in \{\hat{f}_1^{[0]}, \ldots, \hat{f}_k^{[0]}\}$, and $p_i^{[0]} = 0$ otherwise;
4: while not converged do
5:    optimize $U^{[t+1]}$ according to Equation (9);
6:    optimize $W^{[t+1]}$ according to Equation (13);
7:    find the top-$k$ features $\hat{f}_1^{[t+1]}, \ldots, \hat{f}_k^{[t+1]}$ that minimize Equation (14), and set $p_i^{[t+1]} = 1$ if $f_i \in \{\hat{f}_1^{[t+1]}, \ldots, \hat{f}_k^{[t+1]}\}$, and $p_i^{[t+1]} = 0$ otherwise;
8:    update $V^{[t+1]}$ according to Equation (15);
9:    $t = t + 1$;
10: end while
11: return $p^{[t]}$;
12: end
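Putting the pieces together, the following condensed sketch mirrors the structure of Algorithm 1. It relies on the helper functions gamma_star, update_U, and update_p from the earlier sketches, uses a fixed step size τ and generic default values for β and λ (our assumptions; see Appendix B for the settings used in the paper), and initializes p with a rough screening instead of the exact Equation (1) step of line 3. It is an illustration, not the released implementation.

import numpy as np

def elc(X, Y, S, k, beta=1e-3, tau=None, lam=1e-2, max_iter=1000, tol=1e-4):
    n, d = X.shape
    l = Y.shape[1]
    A_hat = Y.T @ X                                        # l x d
    Gamma_star = gamma_star(S, n)                          # l x l
    if tau is None:
        tau = 1.0 / np.linalg.norm(A_hat.T @ A_hat, 2)     # conservative step size (assumption)
    W = np.zeros((d, l)); U = np.zeros((d, l)); V = np.full((d, l), 1.0 / d)
    p = update_p(X, Y, np.ones((d, l)), Gamma_star, k)     # rough stand-in for line 3 screening
    for _ in range(max_iter):
        U = update_U(W, V, beta, lam)                      # Eq. (9)
        Omega = np.diag(p) @ A_hat.T @ (A_hat @ np.diag(p) @ W - Gamma_star)
        W_new = (tau / (beta * tau + 1.0)) * (beta * U + V + (W - tau * Omega) / tau)  # Eq. (13)
        p = update_p(X, Y, W_new, Gamma_star, k)           # Eq. (14)
        V = V - beta * (W_new - U)                         # Eq. (15)
        if np.linalg.norm(W_new - W) <= tol:               # stopping rule as in Appendix B
            W = W_new
            break
        W = W_new
    return p, W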
ELC in Algorithm 1 is implemented in the regression framework PADM, which is a fast alternating approach within the well-known alternating direction method (ADM) framework. PADM is effective and efficient in solving the minimization problem of the augmented Lagrangian function, and is able to converge to a solution $(W^*, U^*)$ from any starting point $(W^{[0]}, U^{[0]})$ for any $\beta > 0$ [40].
In terms of complexity, ELC takes only $O(k\log d)$ time to find the $k$ remarkable features among the $d$ candidates. Thus, the time consumption of line 3 is $O(ndl^2 + k\log d)$. The cost of the while loop in Algorithm 1 mainly lies in lines 6 and 7, which is $O(d^2l^2 + ndl^2 + k\log d)$ per iteration. As this process is repeated $t$ times, the total cost of the loop is $O(t(d^2l^2 + ndl^2 + k\log d))$. Assuming $t \geq 1$, the total complexity of ELC is approximately $O(t(d^2l^2 + ndl^2 + k\log d))$, where $d$, $n$, $l$, $k$, and $t$ are the numbers of features, instances, labels, selected features, and iterations until convergence, respectively.

5. Experimental Evaluation

Fourteen groups of multi-label data sets fetched from the Mulan library (http://mulan.sourceforge.net/datasets-mlc.html) are taken as the benchmarks in this section, which are shown in Table 1. We compare ELC (the source code is available at https://github.com/wangjuncs/ELC) with the following state-of-the-art multi-label feature selection methods:
Table 1. Benchmarks for multi-label feature selection.
  • MIFS (multi-label informed feature selection) [33]: a label correlation-based multi-label feature selection approach, which maps label information into a low-dimensional subspace and captures the correlations among multiple labels;
  • CMFS (correlated and multi-label feature selection) [35]: a feature selection approach based on non-negative matrix factorization, which exploits the label correlation information in features, labels, and instances to select the relevant features and remove the noisy ones;
  • LLSF (learning label-specific features) [36]: a unified multi-label learning framework for both feature selection and classification, which models high-order label correlations to select label-specific features.
More detailed experimental configurations can be found in the Appendix B.

5.1. Example 1: Classification Performance

The average classification performance of each feature selection approach is recorded in Table 2, and pairwise t-tests at the 5% significance level were conducted to validate statistical significance. In addition to the traditional precision and AUC metrics, Hamming loss penalizes incorrect recognition of instances with respect to each target label, ranking loss penalizes misordered label pairs, and one-error penalizes instances whose top-ranked predicted labels are not in the ground-truth label set. These five metrics evaluate multi-label classification performance from different aspects.
Table 2. Average multi-label classification performance (mean ± std.): the best results and those not significantly worse than it are highlighted in bold (pairwise t-test at 5% significance level).
A single metric is insufficient to characterize the overall classification performance on a dataset. For example, the ML-KNN classifier [43] performs worse on birds than on enron under the precision metric, while it performs better on birds than on enron under the AUC metric. Therefore, we used five metrics to compare the performance of the approaches. As shown in Table 2, ELC outperforms MIFS, CMFS, and LLSF under various metrics. This superiority is attributed to two factors: ELC can effectively eliminate noisy features from the candidate feature subsets, and it can maximally embed label correlation information into its selection process. The first factor rules out selection disturbance in the feature space, and the second supplies proper guiding information extracted from the label space. By seamlessly fusing these two factors, ELC is able to find discriminative features for the downstream learning tasks. This point is further validated in Section 5.2 and Section 5.3.

5.2. Example 2: Eliminating Noisy Features

In this section, we evaluate the performances of the compared approaches in eliminating noisy features. We take emotions, birds, and enron as the benchmarks, and measure the residual feature redundancy in the selected feature subset F ^ as follows:
$$R(\hat{F}) = \frac{1}{k(k-1)l}\sum_{\hat{f}_i, \hat{f}_j \in \hat{F}}\sum_{c=1}^l \rho_{\hat{f}_i,y_c}^2\rho_{\hat{f}_j,y_c}^2, \tag{16}$$
where $\rho_{\hat{f}_i,y_c}$ and $\rho_{\hat{f}_j,y_c}$ are the Pearson correlation coefficients of the features $\hat{f}_i$ and $\hat{f}_j$ with the target label $y_c$, and $k$ and $l$ are the numbers of selected features and labels, respectively. A larger $R(\hat{F})$ indicates that more redundant information remains in $\hat{F}$, which reflects an inferior ability of the selection approach to remove noisy features.
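A minimal sketch of Equation (16), assuming the sum runs over ordered pairs with $i \neq j$ (which matches the $k(k-1)$ normalization); the function name is ours and the code is illustrative only.

import numpy as np

def residual_redundancy(X_sel, Y):
    # R(F_hat) of Eq. (16) for selected features X_sel (n x k) and labels Y (n x l).
    n, k = X_sel.shape
    l = Y.shape[1]
    rho = np.corrcoef(X_sel.T, Y.T)[:k, k:]        # rho[i, c]: correlation of feature i with label c
    total = 0.0
    for i in range(k):
        for j in range(k):
            if i != j:
                total += np.sum(rho[i] ** 2 * rho[j] ** 2)
    return total / (k * (k - 1) * l)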
The feature redundancy of the $k$ selected features for each approach is shown in Figure 1, where $k \in \{d/10, 2d/10, \ldots, 9d/10\}$ and $d$ is the total number of original features. The figure illustrates that ELC is superior in reducing feature redundancy; in other words, ELC can effectively remove redundant features in its multi-label feature selection process. This is one of the crucial factors behind the excellent discriminative ability of ELC. It should be pointed out that, in contrast to the single-label case, eliminating noisy features has not received sufficient attention from existing multi-label feature selection approaches. Since the issue of noisy features is an obstacle to high selection performance not only for single-label learning but also for multi-label cases, we devised ELC to tackle this problem comprehensively. Moreover, the feature redundancy reduced by the majority of redundancy elimination-based approaches is not directly relevant to the target labels. In contrast, ELC quantitatively reduces target-relevant redundancy without any prior probability knowledge, which contributes to its superiority in multi-label feature selection.
Figure 1. Classification redundancy: (a–c) show the classification redundancies produced by the feature selection approaches on the emotions, birds, and enron datasets, respectively; lower redundancy is better.

5.3. Example 3: Embedding Label Correlations

Label correlation information is important for multi-label learning. In the following experiments, we estimate the preserved label correlation information of the selected feature subset $\hat{F}$ as follows:
$$C(\hat{F}) = \frac{1}{k(k-1)}\left\|\frac{1}{n^2}Y^TX_{\hat{F}}X_{\hat{F}}^TY - S\right\|_F^2, \tag{17}$$
where $X_{\hat{F}}$ denotes the instances characterized by $\hat{F}$ and $S$ is the label correlation matrix of the original data. Intuitively, Equation (17) measures the residual discrepancy between the label correlation information in the original label space and that preserved in the reduced feature space. A lower value indicates that more information is preserved; in other words, more label correlation information has been embedded in the feature selection process.
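Equation (17) can be computed directly from the selected columns of the data matrix; a minimal, illustrative sketch (the function name is ours):

import numpy as np

def residual_label_correlation(X_sel, Y, S):
    # C(F_hat) of Eq. (17): discrepancy between S and the label correlations
    # reconstructed from the reduced feature space X_sel (n x k).
    n, k = X_sel.shape
    recon = Y.T @ X_sel @ X_sel.T @ Y / (n ** 2)
    return np.linalg.norm(recon - S, 'fro') ** 2 / (k * (k - 1))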
Similar to the configuration in Section 5.2, we take emotions, birds, and enron as the benchmarks and record $C(\hat{F})$ for the $k$ features selected by each approach, where $k \in \{d/10, 2d/10, \ldots, 9d/10\}$. As shown in Figure 2, ELC is better at preserving label correlation information than the other multi-label feature selection approaches. Most existing multi-label feature selection approaches take label correlation information into consideration to some extent. In contrast to these approaches, ELC quantitatively measures this correlation information and maximally embeds it into the feature selection process. This characteristic, already established in Property 1, is further confirmed by the experimental results in this section.
Figure 2. Residual label correlation information: (a–c) show the residual scales of the label correlation information not embedded by the feature selection approaches on the emotions, birds, and enron datasets, respectively; a lower residual scale is better.

5.4. Example 4: Time Consumption

In this section, we compare the approaches in terms of their feature selection efficiency. The time consumption reported here records only the feature selection time, excluding the classification cost. All tests were implemented in Matlab on an Intel Core i7-4790 CPU (@3.6 GHz) with 32 GB memory (Intel Corp., Santa Clara, CA, USA). We selected $k$ ($k \in \{100, 300, 500, 700, 900\}$) features on the enron dataset and recorded the time consumption of each compared approach. As illustrated in Figure 3, ELC and CMFS converge with comparable efficiency, while MIFS is the most time-consuming, which may be mainly attributed to its label clustering process.
Figure 3. Time consumption of each multi-label feature selection approach on the enron dataset.

6. Conclusions

A novel multi-label feature selection method called ELC is proposed in this paper. ELC embeds label correlation information in the reduced feature subspace to eliminate noisy features. In this way, irrelevant and redundant features can be expected to be removed and a discriminative feature subset can be constructed for the downstream learning tasks. These advantages help ELC yield good feature selection performance on a broad range of multi-label data sets under various evaluation metrics.
In terms of optimizing ELC, we could feed it to gradient descent frameworks, such as Adam with a self-adaptive learning rate [44], to efficiently obtain its optimal values. Another interesting possible exploration is the consideration of noisy labels, which would induce negative effects on the estimation of label correlations. According to our pilot study, noisy labels may distort the label space and provide inaccurate guiding information for feature selection. How to eliminate noisy labels may inspire our future work.

Author Contributions

Each author greatly contributed to the preparation of this manuscript. J.W. (Jun Wang) and J.W. (Jinmao Wei) wrote the paper; Y.X. and H.X. designed and performed the experiments; Z.S. and Z.Y. devised the optimization algorithms. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (number 61772288), the Natural Science Foundation of Tianjin City (number 18JCZDJC30900), the Ministry of Education of Humanities and Social Science Project (number 16YJC790123), the Natural Science Foundation of Shandong Province (number ZR2019MA049), and the Cooperative Education Project of the Ministry of Education of China (number 201902199006).

Acknowledgments

The authors are very grateful to the anonymous reviewers and editor for their helpful and constructive comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

After adding an approximate term to $L_2(W, p)$ and reformulating it as $\tilde{L}_2(W, p)$, we take the derivative of $\tilde{L}_2(W, p)$ with respect to $W$ as follows:
$$\frac{\partial\tilde{L}_2}{\partial W} = \beta(W - U) - V + \frac{1}{\tau}\left(W - W^{[t]} + \tau\Omega^{[t]}\right), \quad \Omega^{[t]} = \mathrm{diag}(p^{[t]})\,\hat{A}^T\left(\hat{A}\,\mathrm{diag}(p^{[t]})\,W^{[t]} - \Gamma^*\right).$$
To derive the optimal solution of $W$, we set $\frac{\partial\tilde{L}_2}{\partial W} = 0$ and obtain:
$$\left(\beta + \frac{1}{\tau}\right)W = \beta U + V + \frac{1}{\tau}\left(W^{[t]} - \tau\Omega^{[t]}\right).$$
Then, the optimal solution of $W$ in iteration $[t+1]$ can be represented as
$$W^{[t+1]} = \frac{\tau}{\beta\tau + 1}\left(\beta U^{[t+1]} + V^{[t]} + \frac{1}{\tau}\left(W^{[t]} - \tau\Omega^{[t]}\right)\right).$$

Appendix B. Experimental Configuration

The correlation (or similarity) matrices involved in the experiments are all calculated with the RBF kernel function. Specifically, the label correlation matrix $S$ in ELC is defined as $S_{ij} = \exp\left(-\frac{\|y_i - y_j\|^2}{2\delta^2}\right)$ if $y_i, y_j \neq \mathbf{0}$, and $S_{ij} = 0$ otherwise, where $\delta^2 = \mathrm{mean}(\|y_i - y_j\|^2)$, $i, j = 1, \ldots, l$. The instance similarity matrix in SPFS and CMFS is calculated as $K_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\delta^2}\right)$ if $y_i = y_j$, and $K_{ij} = 0$ otherwise, where $\delta^2 = \mathrm{mean}(\|x_i - x_j\|^2)$. The affinity graph in MIFS is constructed as $K_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\delta^2}\right)$ if $x_i \in N_p(x_j)$ or $x_j \in N_p(x_i)$, and $K_{ij} = 0$ otherwise, where $N_p(x_i)$ denotes the $p$-nearest neighbors of instance $x_i$.
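For reference, a minimal sketch of the RBF-based label correlation matrix $S$ described above; the zero branch is applied when a label column is all-zero, and the function name is ours.

import numpy as np

def label_correlation_matrix(Y):
    # S_ij = exp(-||y_i - y_j||^2 / (2 delta^2)) for non-empty label columns, 0 otherwise.
    l = Y.shape[1]
    cols = [Y[:, c] for c in range(l)]
    sq_dists = np.array([[np.sum((a - b) ** 2) for b in cols] for a in cols])
    delta2 = np.mean(sq_dists)                     # delta^2 = mean(||y_i - y_j||^2)
    S = np.exp(-sq_dists / (2.0 * delta2))
    nonzero = np.array([c.any() for c in cols])
    S = S * np.outer(nonzero, nonzero)             # zero out rows/columns of empty labels
    return S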
SPFS is implemented via the sequential forward selection (SFS) strategy. For a fair comparison, the regularization parameter of every approach is tuned via a grid search over $\{10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$. For ELC, the parameter $\beta$ is fixed to $\beta = 10^{-8}$, and $\tau$ is set to the spectral radius of $\hat{A}^T\hat{A}$ in the initial state and updated as $\tau^{[t]} = \frac{1}{\max_i\|\psi_i\|}$ in the $t$-th iteration, where $\psi_i$ is the $i$-th row vector of $\Psi$ and $\Psi = \hat{A}^T\hat{A}V^{[t]}$. The convergence state is reached when either of the following two conditions is satisfied: (1) $t_{\max} = 10^3$ iterations have been performed; or (2) $\|W^{[t+1]} - W^{[t]}\| \leq 10^{-4}$.
A multi-label k-nearest neighbor (ML-kNN) classifier [43] is built on the $k$ features selected by each compared approach, where $k \in \{d/10, 2d/10, \ldots, 9d/10\}$ and $d$ is the total number of features. All numerical features are normalized to zero mean and unit variance, and the features selected by the compared approaches are employed to construct the ML-kNN classifiers, whose classification performances are then compared. Five-fold cross-validation is conducted, and we report the average performance of the ML-kNN classification under five metrics, i.e., precision, AUC, Hamming loss, ranking loss, and one-error [39].
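A condensed sketch of this evaluation protocol with scikit-learn utilities; it is an approximation in which KNeighborsClassifier with multi-label indicator targets stands in for the ML-kNN classifier of [43], and only the Hamming loss is shown.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import hamming_loss

def evaluate_selection(X, Y, selected, n_splits=5, n_neighbors=10):
    # 5-fold cross-validated Hamming loss of a kNN multi-label classifier on the selected features.
    losses = []
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        scaler = StandardScaler().fit(X[train][:, selected])
        clf = KNeighborsClassifier(n_neighbors=n_neighbors)
        clf.fit(scaler.transform(X[train][:, selected]), Y[train])
        pred = clf.predict(scaler.transform(X[test][:, selected]))
        losses.append(hamming_loss(Y[test], pred))
    return float(np.mean(losses))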

References

  1. Tang, J.; Alelyani, S.; Liu, H. Feature selection for classification: A review. In Data Classification: Algorithms and Applications; CRC Press: Chapman, CA, USA, 2014. [Google Scholar]
  2. Wang, J.; Wei, J.; Yang, Z. Supervised feature selection by preserving class correlation. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, Indianapolis, IN, USA, 24–28 October 2016; pp. 1613–1622. [Google Scholar]
  3. Cai, D.; Zhang, C.; He, X. Efficient and robust feature selection via joint l2,1-norms minimization. In Proceedings of the KDD ’10: The 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 333–342. [Google Scholar]
  4. Xu, Y.; Wang, J.; An, S.; Wei, J.; Ruan, J. Semi-supervised multi-label feature selection by preserving feature-label space consistency. In Proceedings of the CIKM ’18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, 22–26 October 2018; pp. 783–792. [Google Scholar]
  5. Brown, G.; Pocock, A.; Zhao, M.; Luján, M. Conditional Likelihood Maximisation: A unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 2012, 12, 27–66. [Google Scholar]
  6. Gu, Q.; Li, Z.; Han, J. Generalized fisher score for feature selection. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, 14–17 July 2011; pp. 266–273. [Google Scholar]
  7. He, X.; Cai, D.; Niyogi, P. Laplacian score for feature selection. In Proceedings of the 18th International Conference on Neural Information Processing Systems, Shanghai, China, 13–17 November 2011; pp. 507–514. [Google Scholar]
  8. Lin, D.; Tang, X. Conditional Infomax Learning: An Integrated Framework for Feature Extraction and Fusion. In Proceedings of the Computer Vision—ECCV 2006, 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 68–82. [Google Scholar]
  9. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238. [Google Scholar] [CrossRef] [PubMed]
  10. Robnik-Šikonja, M.; Kononenko, I. Theoretical and empirical analysis of ReliefF and RReliefF. Mach. Learn. 2003, 53, 23–69. [Google Scholar] [CrossRef]
  11. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
  12. Bermejo, P.; Gámez, J.A.; Puerta, J.M. Speeding up incremental wrapper feature subset selection with Naive Bayes classifier. Knowl.-Based Syst. 2014, 55, 140–147. [Google Scholar] [CrossRef]
  13. Gütlein, M.; Frank, E.; Hall, M.; Karwath, A. Large-scale attribute selection using wrappers. In Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2009, Nashville, TN, USA, 30 March–2 April 2009; pp. 332–339. [Google Scholar]
  14. Xu, Y.; Wang, J.; Wei, J. To avoid the pitfall of missing labels in feature selection: A generative model gives the answer. In Proceedings of the AAAI Conference on Artificial Intelligence 2020, New York, NY, USA, 7–12 February 2020; pp. 6534–6541. [Google Scholar]
  15. Chen, W.; Yan, J.; Zhang, B.; Chen, Z.; Yang, Q. Document transformation for multi-label feature selection in text categorization. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, Washington, DC, USA, 28–31 October 2007; pp. 451–456. [Google Scholar]
  16. Ma, Z.; Nie, F.; Yang, Y.; Uijlings, J.R.; Sebe, N. Web image annotation via subspace-sparsity collaborated feature selection. IEEE Trans. Multimedia 2012, 14, 1021–1030. [Google Scholar] [CrossRef]
  17. Wang, X.; Li, G.Z. Multilabel learning via random label selection for protein subcellular multilocations prediction. IEEE/ACM Trans. Comput. Biol Bioinform. 2013, 10, 436–446. [Google Scholar] [CrossRef] [PubMed]
  18. Zhang, M.L.; Zhou, Z.H. A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 2014, 26, 1819–1837. [Google Scholar] [CrossRef]
  19. Rivolli, A.; J, J.R.; Soares, C.; Pfahringer, B.; de Carvalho, A.C. An empirical analysis of binary transformation strategies and base algorithms for multi-label learning. Mach. Learn. 2020, 9, 1–55. [Google Scholar]
  20. Zhao, Z.; Wang, L.; Liu, H.; Ye, J. On similarity preserving feature selection. IEEE Trans. Knowl. Data Eng. 2013, 25, 619–632. [Google Scholar] [CrossRef]
  21. Zhao, J.; Lu, K.; He, X. Locality sensitive semi-supervised feature selection. Neurocomputing 2008, 71, 1842–1849. [Google Scholar] [CrossRef]
  22. Zhang, Y.; Zhou, Z.H. Multi-label dimensionality reduction via dependence maximization. ACM Trans. Knowl. Discovery Data 2010, 4, 1503–1505. [Google Scholar]
  23. Nie, F.; Xiang, S.; Jia, Y. Trace ratio criterion for feature selection. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, IL, USA, 13–17 July 2008; pp. 671–676. [Google Scholar]
  24. Zhao, Z.; Liu, H. Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th International Conference on Machine Learning, ICML 2007, Corvallis, OR, USA, 20–24 June 2007; pp. 1151–1157. [Google Scholar]
  25. Zhao, Z.; Wang, L.; Liu, H. Efficient spectral feature selection with minimum redundancy. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010; pp. 673–678. [Google Scholar]
  26. Verikas, A.; Bacauskiene, M. Feature selection with neural networks. Pattern Recog. Lett. 2002, 23, 1323–1335. [Google Scholar] [CrossRef]
  27. Arefnezhad, S.; Samiee, S.; Eichberger, A.; Nahvi, A. Driver drowsiness detection based on steering wheel data applying adaptive neuro-fuzzy feature selection. Sensors 2019, 14, 943. [Google Scholar] [CrossRef]
  28. Cateni, S.; Colla, V.; Vannucci, M. A fuzzy system for combining filter features selection methods. Int. J. Fuzzy Syst. 2017, 19, 1168–1180. [Google Scholar] [CrossRef]
  29. Wang, J.; Wei, J.M.; Yang, Z.; Wang, S.Q. Feature selection by maximizing independent classification information. IEEE Trans. Knowl. Data Eng. 2017, 29, 828–841. [Google Scholar] [CrossRef]
  30. Kong, D.; Ding, C.; Huang, H.; Zhao, H. Multi-label ReliefF and F-statistic feature selections for image annotation. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 2352–2359. [Google Scholar]
  31. Ji, S.; Ye, J. Linear dimensionality reduction for multi-label classification. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, CA, USA, 11–17 July 2009; pp. 1077–1082. [Google Scholar]
  32. Wang, H.; Ding, C.; Huang, H. Multi-label linear discriminant analysis. In Proceedings of the 11th European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; pp. 126–139. [Google Scholar]
  33. Jian, L.; Li, J.; Shu, K.; Liu, H. Multi-label informed feature selection. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016; pp. 1627–1633. [Google Scholar]
  34. Huang, J.; Li, G.; Huang, Q.; Wu, X. Joint feature selection and classification for multilabel learning. IEEE Trans. Cybern. 2018, 48, 876–889. [Google Scholar] [CrossRef]
  35. Braytee, A.; Liu, W.; Catchpoole, D.R.; Kennedy, P.J. Multi-label feature selection using correlation information. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; pp. 1649–1656. [Google Scholar]
  36. Huang, J.; Li, G.; Huang, Q.; Wu, X. Learning label-specific features and class-dependent labels for multi-label classification. IEEE Trans. Knowl. Data Eng. 2016, 28, 3309–3323. [Google Scholar] [CrossRef]
  37. Ji, S.; Tang, L.; Yu, S.; Ye, J. Extracting shared subspace for multi-label classification. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24–27 August 2008; pp. 381–389. [Google Scholar]
  38. Nie, F.; Huang, H.; Cai, X.; Ding, C.H. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems 2010, Vancouver, BC, Canada, 6–9 December 2010; pp. 1813–1821. [Google Scholar]
  39. Zhang, M.L.; Wu, L. LIFT: Multi-label learning with label-specific features. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 107–119. [Google Scholar] [CrossRef]
  40. Xiao, Y.H.; Song, H.N. An inexact alternating directions algorithm for constrained total variation regularized compressive sensing problems. J. Math Imaging Vision 2012, 44, 114–127. [Google Scholar] [CrossRef]
  41. Gong, P.; Zhou, J.; Fan, W.; Ye, J. Efficient multi-task feature learning with calibration. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 10–13 August 2014; pp. 761–770. [Google Scholar]
  42. Horn, R.A.; Johnson, C.R. Matrix Analysis, 2nd ed.; Cambridge University: Cambridge, UK, 2012. [Google Scholar]
  43. Zhang, M.L.; Zhou, Z.H. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recog. 2007, 40, 2038–2048. [Google Scholar] [CrossRef]
  44. Kingma, D.K.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
