**An Effective Multi-Label Feature Selection Model Towards Eliminating Noisy Features**

**Jun Wang <sup>1</sup>, Yuanyuan Xu <sup>2</sup>, Hengpeng Xu <sup>3</sup>, Zhe Sun <sup>4</sup>, Zhenglu Yang <sup>2</sup> and Jinmao Wei <sup>2,∗</sup>**


Received: 21 October 2020; Accepted: 12 November 2020; Published: 15 November 2020

**Abstract:** Feature selection has long served as an effective means of dimension reduction for various machine learning tasks. Existing feature selection models focus on selecting the most discriminative features for the learning targets. However, this strategy is weak in handling two kinds of features, namely irrelevant and redundant ones, which are collectively referred to as noisy features. These features may hamper the construction of optimal low-dimensional subspaces and compromise the learning performance of downstream tasks. In this study, we propose a novel multi-label feature selection approach by embedding label correlations (dubbed ELC) to address these issues. Particularly, we extract label correlations for reliable label space structures and employ them to steer feature selection. In this way, the label and feature spaces can be expected to be consistent, and noisy features can be effectively eliminated. An extensive experimental evaluation on public benchmarks validated the superiority of ELC.

**Keywords:** feature selection; noise elimination; space consistency; label correlations

### **1. Introduction**

For pattern recognition, feature selection is important for its effectiveness in reducing dimensionality. Feature selection methods are divided into supervised, semi-supervised, and unsupervised ones, according to whether the instances are labeled, partially labeled, or not [1–4]. For supervised cases, class labels are employed for measuring features' discriminative abilities. Many popular and efficient feature selection methods belong to this group [5–10]. Supervised methods are further categorized into three well-known models: filter, wrapper, and embedded [11]. In recent years, some hybrid methods have emerged that combine filter and wrapper processes for enhancing performance and reducing computational cost [12,13].

In another categorization view, existing feature selection approaches can also be grouped into single-label and multi-label ones, which differ in the number of labels associated with each instance [14]. In single-label feature selection, instances and labels hold many-to-one connections, and target separability is emphasized in this learning task. With the great potential and success of multi-label learning in many machine learning fields, such as text categorization [15], content annotation [16], and protein location prediction [17], multi-label feature selection has received considerable attention in recent years. We address supervised multi-label feature selection in this study.

In multi-label learning, label correlations are the key to capturing the complicated relationships among instances, which are typically annotated with multiple labels [18,19]. The mainstream multi-label feature selection strategy is to extract label correlations (via statistical or information-based measurements) and employ them to help find the most remarkable features. A critical issue, however, is that this strategy can be trapped by two kinds of features, that is, irrelevant and redundant ones. Irrelevant features are those with low discriminative ability. Features of this kind are loosely correlated with the learning targets and may even provide misleading information. Compared with irrelevant features, redundant features are more deceptive. They may exhibit excellent (or comparably superior) performance and mix with remarkable features. Nevertheless, redundant features also contribute little to enhancing the discriminative ability of the constructed low-dimensional subspace, because the learning information they provide overlaps with the already distilled information. In general, we regard both irrelevant and redundant features as noisy ones, which may confuse selection processes and compromise the learning performance of downstream tasks.

In this paper, we present an effective multi-label feature selection model by embedding label correlations to eliminate noisy features, named ELC. Our major strategy is to keep the feature and label spaces consistent and explore reliable label structures to drive feature selection. Concretely, we quantitatively assess label correlations in the label space and embed them in feature selection. In this way, the label structure information can be maximally preserved in the constructed low-dimensional subspace, and eventually the consistency between the feature and label spaces can be achieved. Furthermore, we devise an efficient framework based on sparse multi-task learning to optimize ELC, which helps ELC find globally optimal solutions and converge efficiently.

The major contributions of this paper are as follows:


The remaining parts of this paper are arranged as follows: related works are reviewed in Section 2; the proposed model ELC and its optimization framework are respectively introduced in Section 3 and Section 4; the experimental comparisons of ELC with several popular feature selection approaches are presented in Section 5; finally, conclusions are drawn in Section 6.

### **2. Related Work**

Feature selection approaches are commonly specified to a certain recognition scenario, i.e., single-label learning or multi-label learning, because of the different concerns of the two recognition tasks. The issue of noisy feature elimination was first raised in single-label feature selection, focusing on removing irrelevant features and picking out discriminative ones. For example, the popular single-label feature selection family based on preserving instance similarity [20] directly assigns high scores to the most discriminative features under various statistical metrics, such as the Laplacian score [7,21], the Fisher score [6], the Hilbert–Schmidt independence criterion [22], and the trace ratio [23], to name a few. In addition to the above similarity preservation approaches, some traditional distance- or instance-difference-based ones can also be deemed as simply pursuing "target-specific features," such as ReliefF [10], SPEC [24,25], and SPFS [20]. This denotation arises from the fact that target-specific features are picked based only on whether they are strongly correlated with the learning targets. In other words, those features that have excellent discriminative abilities for the targets will prevail. The aforementioned approaches have generally achieved excellent performance in eliminating irrelevant features, but may experience difficulties in improving learning performance due to their limited attention to removing redundant features.

Recently, some remarkable neural network-based and fuzzy logic-based feature selection works have been presented, which have received extensive attention due to their excellent feature selection performance [26–28]. For example, Verikas and Bacauskiene [26] proposed a feedforward neural network-based approach to find the salient features and remove those yielding the least accurate classifications. Arefnezhad et al. [27] assigned high scores to the features most related to the drowsiness level via an adaptive neuro-fuzzy inference system, which was devised by combining filter and wrapper feature selection approaches. Cateni et al. [28] selected the most relevant features for better binary classification by combining several filter approaches through a fuzzy inference system. Generally speaking, the above studies serve as excellent examples of picking out target-specific features, while still leaving aside the underlying negative effects of noisy features.

A salient but redundant feature provides little valuable learning information if selected. Although this issue is ignored by a majority of feature selection approaches, it has gained attention from some information-based ones. Among them, the family based on mutual information is regarded as the mainstream redundancy-removing approach. The classical mutual information [9] and its variants (e.g., conditional mutual information) [5,29] can effectively locate the redundant features and remove them via a greedy search. Nevertheless, an inevitable problem is that the performance of these approaches depends heavily on their probability estimation accuracy. This problem is even more complicated in high-dimensional spaces.

In terms of multi-label feature selection approaches, they can be roughly categorized into two families. The first family directly divides the multi-label learning into multiple subproblems and utilizes single-label feature evaluation metrics to tackle them [4]. For instance, ReliefF is tailored to multi-label learning by dividing its estimations of nearest misses and hits into eight subproblems [30]. In addition, some single-label feature evaluation strategies are also reformulated to multi-label ones by enforcing them on each subgroup, such as class separability and linear discriminant analysis [31,32]. A major drawback of the above subproblem division strategy is that it ignores label correlations, which encode the underlying label structures for recognition and play critical roles in multi-label learning.

On the other hand, the second family of multi-label feature selection can better fix this issue, since it incorporates label correlations into model construction. A common strategy of this family is to evaluate instance-label pairs via specific label ranking metrics and select the features by minimizing loss functions [33–36]. Since real-world label relations can go beyond pairwise situations, some high-order correlation approaches have been proposed to model complicated label structures. A feasible solution is to build a common space shared among various labels [16,33,37], which typically suffers from high costs and complex computation. It is noteworthy that, in contrast to single-label feature selection approaches, the multi-label ones rarely address the issue of noisy feature elimination. A few approaches specific to ruling out irrelevant features are based on sparse regularization [38]. These approaches neglect the negative effects of redundant features and are not competent in completely removing noisy features.

To comprehensively address the above issues, we will introduce a novel multi-label feature selection model in Section 3, which can effectively filter out both kinds of noisy features (i.e., irrelevant and redundant ones) and select the remarkable ones. The proposed model adopts a statistical metric to measure target-related feature redundancy and dispenses with any probability estimation. Furthermore, this model extracts label correlations and keeps feature-label space consistency to guide feature selection, which facilitates irrelevant feature exclusion and remarkable feature domination.

### **3. The Methodology: ELC**

### *3.1. Model Description*

In this paper, we use $\{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^{n}$ to denote the data set, where $\mathbf{X} = [\mathbf{x}_1; \ldots; \mathbf{x}_n] \in \mathbb{R}^{n \times d}$ represents the instance matrix and instances are characterized by $d$ features in the feature set $\mathbf{F} = \{\mathbf{f}_1, \ldots, \mathbf{f}_d\}$. $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_l] \in \{0, 1\}^{n \times l}$ denotes the target label matrix, where $y_{ij} = 1$ represents a positive label and $y_{ij} = 0$ corresponds to a negative one.

Then, we formulate the multi-label feature selection by embedding label correlation (ELC) as follows:

$$\min_{\mathbf{W}} \frac{1}{2} \left\| \hat{\mathbf{Y}}^T \hat{\mathbf{Y}} - \mathbf{S} \right\|_F^2 \ \text{s.t.}\ \hat{\mathbf{Y}} = \frac{1}{n} (\mathbf{XW})^T \mathbf{Y},\ \mathbf{W} \in \{0, 1\}^{d \times l},\ \left\| \mathbf{W} \right\|_{2,0} = k,\tag{1}$$

where $\mathbf{S} \in \mathbb{R}^{l \times l}$ represents the label correlation matrix calculated over the initial label matrix, and $k$ is the number of selected features. $\mathbf{W} \in \mathbb{R}^{d \times l}$ is the feature selection matrix, where $w_{ij}$ indicates the importance (also known as weight) of the $i$-th feature to the $j$-th label.

Equation (1) is actually the feature evaluation function of ELC, which is essentially a Frobenius-norm quadratic model. The matrix $\mathbf{S}$ represents the label correlations extracted from the label space, and each of its elements describes a relation between two target labels. These correlations can be easily obtained by some quantitative measurements, including the RBF kernel function, the Pearson correlation coefficient, etc. $\hat{\mathbf{Y}}^T \hat{\mathbf{Y}}$ represents the label correlations extracted from the reduced feature space. $\hat{\mathbf{Y}}^T \hat{\mathbf{Y}}$ differs from $\mathbf{S}$ on account of the disturbance of noisy features. As described in Section 1, noisy features may distort the structure of the feature space and provide negative learning information. Considering this, ELC evaluates features based on their abilities to preserve label correlations in the feature space, that is, to keep feature-label space consistency. The features that minimize the discrepancy between $\hat{\mathbf{Y}}^T \hat{\mathbf{Y}}$ and $\mathbf{S}$ will be highly scored by ELC. In this way, ELC can be expected to construct an optimal feature subspace while eliminating different kinds of noisy features.
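As a concrete illustration, the evaluation in Equation (1) can be computed in a few lines of NumPy. This is a minimal sketch of our own (not the authors' implementation); the helper names and the Pearson-based choice for $\mathbf{S}$ are assumptions:

```python
import numpy as np

def label_correlations(Y):
    """Label correlation matrix S: Pearson correlations between label columns."""
    return np.corrcoef(Y.T)  # np.corrcoef treats rows as variables, hence Y.T

def elc_objective(X, Y, W, S):
    """Eq. (1) score: 0.5 * || Yhat^T Yhat - S ||_F^2 with Yhat = (XW)^T Y / n."""
    n = X.shape[0]
    Y_hat = (X @ W).T @ Y / n  # l x l summary of the reduced feature space
    return 0.5 * np.linalg.norm(Y_hat.T @ Y_hat - S, "fro") ** 2

# Toy example: 6 instances, 4 features, 2 labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]], dtype=float)
S = label_correlations(Y)
W = np.zeros((4, 2))
W[[0, 2], :] = 1.0  # binary W with k = 2 nonzero rows (features 0 and 2 selected)
score = elc_objective(X, Y, W, S)
```

A smaller `score` means the selected features preserve the label correlation structure more faithfully.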

Under the constraint of the $\ell_{2,0}$-norm in Equation (1), only $k$ rows in $\mathbf{W}$ are nonzero. These correspond to the $k$ features selected for the $l$ target labels, where 1 represents selected and 0 represents not selected. Note that $k$ is generally unequal to $l$. That is, more than one feature may be selected to discriminate the same label, or a single feature may be discriminative for more than one label. In the former case, multiple features are united to recognize one target, while in the latter case one feature serves multiple recognition sub-tasks.

### *3.2. Property Analysis*

The feature subset $\hat{\mathbf{F}} = \{\hat{\mathbf{f}}_1, \hat{\mathbf{f}}_2, \ldots, \hat{\mathbf{f}}_k\}$ selected by ELC can be considered as maximally maintaining feature-label space consistency. $\hat{\mathbf{F}}$ is expected to consist of the remarkable features and exclude the noisy ones. In this subsection, we further analyze the properties of ELC and reveal its underlying characteristics.

Suppose that each feature in **F** has been standardized to have mean zero and unit length. Then, the following things hold for Equation (1):

$$\left\| \hat{\mathbf{Y}}^T \hat{\mathbf{Y}} - \mathbf{S} \right\|_F^2 = \left\| \frac{1}{n^2} \left( \mathbf{Y}^T (\mathbf{XW}) (\mathbf{XW})^T \mathbf{Y} \right) - \mathbf{S} \right\|_F^2.$$

This is the objective of ELC. To illustrate its properties more clearly, let $\hat{\mathcal{S}} = n^2 \mathbf{S}$ and $\mathcal{H} = \mathbf{Y}^T (\mathbf{XW}) (\mathbf{XW})^T \mathbf{Y}$. Then,

$$\left\| \hat{\mathbf{Y}}^T \hat{\mathbf{Y}} - \mathbf{S} \right\|_F^2 = \frac{1}{n^4} \left( tr(\mathcal{H}^T \mathcal{H}) + tr(\hat{\mathcal{S}}^T \hat{\mathcal{S}}) - 2\, tr(\hat{\mathcal{S}}^T \mathcal{H}) \right).$$

Three terms are involved in this equation. Clearly, $tr(\hat{\mathcal{S}}^T \hat{\mathcal{S}})$ represents the label correlation information extracted from the label space and is constant during the selection process. Thus, it is easy to conclude that $\min_{\mathbf{W}} \| \hat{\mathbf{Y}}^T \hat{\mathbf{Y}} - \mathbf{S} \|_F^2$ is equivalent to jointly $\min_{\mathbf{W}} tr(\mathcal{H}^T \mathcal{H})$ and $\max_{\mathbf{W}} tr(\hat{\mathcal{S}}^T \mathcal{H})$. Two properties of ELC are then given as follows:

**Property 1.** *Label correlation information can be maximally embedded in feature selection by ELC.*

**Proof.** $tr(\hat{\mathcal{S}}^T \mathcal{H}) = tr\left( (\mathbf{XW})^T \mathbf{Y} \hat{\mathcal{S}} \mathbf{Y}^T (\mathbf{XW}) \right) = \sum_{i=1}^{k} \hat{\mathbf{f}}_i^T (\mathbf{Y} \hat{\mathcal{S}} \mathbf{Y}^T) \hat{\mathbf{f}}_i = \sum_{i=1}^{k} \hat{\mathbf{f}}_i^T \left[ \sum_{c_1=1}^{l} \sum_{c_2=1}^{l} \mathbf{y}_{c_1} s_{c_1,c_2} \mathbf{y}_{c_2}^T \right] \hat{\mathbf{f}}_i$, where $s_{c_1,c_2}$ is the correlation degree of the labels $\mathbf{y}_{c_1}$ and $\mathbf{y}_{c_2}$, and $\mathbf{XW}$ indicates the selected features. Then, the following holds: $\min_{\mathbf{W}} \| \hat{\mathbf{Y}}^T \hat{\mathbf{Y}} - \mathbf{S} \|_F^2 \propto \max_{\mathbf{W}} \sum_{i=1}^{k} \hat{\mathbf{f}}_i^T \left[ \sum_{c_1=1}^{l} \sum_{c_2=1}^{l} \mathbf{y}_{c_1} s_{c_1,c_2} \mathbf{y}_{c_2}^T \right] \hat{\mathbf{f}}_i$. The term $\sum_{c_1=1}^{l} \sum_{c_2=1}^{l} \mathbf{y}_{c_1} s_{c_1,c_2} \mathbf{y}_{c_2}^T$ can be regarded as the correlation information of pairwise labels. Therefore, ELC can maximally embed label correlations in its feature selection process.

Label correlation information is important for multi-label learning. For example, the images about seas may share some common labels for recognition, such as ship, fish, and seagull, and their close correlations may help us distinguish the image category and find their shared features. The existing multi-label learning methods are categorized on the basis of the label correlation orders they consider [39]. Their correlation modeling capabilities directly affect their discriminative performance. As demonstrated in Property 1, ELC can measure the pairwise label correlations. Furthermore, it can also preserve this correlation information in its constructed feature subspace, which is crucial for ELC to eliminate noisy features. In other words, the features that can maximally preserve label correlation information are preferred by ELC. This strategy facilitates ELC building a low-dimensional feature space that is consistent with the label space and also suitable for multi-label learning.

In addition to the above property with respect to maximally embedding label correlations, another important property of ELC is illustrated as follows:

**Property 2.** *Feature redundancy can be minimized by ELC.*

**Proof.** $tr(\mathcal{H}^T \mathcal{H}) = \sum_{i,j=1}^{k} \left( (\hat{\mathbf{f}}_i^T \mathbf{Y}) (\hat{\mathbf{f}}_j^T \mathbf{Y})^T \right)^2 = \sum_{i,j=1}^{k} \sum_{c=1}^{l} \left( \langle \hat{\mathbf{f}}_i, \mathbf{y}_c \rangle \langle \hat{\mathbf{f}}_j, \mathbf{y}_c \rangle \right)^2 = \sum_{i,j=1}^{k} \sum_{c=1}^{l} n^4 \sigma_{\mathbf{y}_c}^4\, \rho_{\hat{\mathbf{f}}_i, \mathbf{y}_c}^2\, \rho_{\hat{\mathbf{f}}_j, \mathbf{y}_c}^2,$

where $\sigma_{\mathbf{y}_c}$ is the standard deviation of the label $\mathbf{y}_c$, and $\rho_{\hat{\mathbf{f}}_i, \mathbf{y}_c}$ and $\rho_{\hat{\mathbf{f}}_j, \mathbf{y}_c}$ are the Pearson correlation coefficients of $\mathbf{y}_c$ with the features $\hat{\mathbf{f}}_i$ and $\hat{\mathbf{f}}_j$, respectively. Then, we have $\min_{\mathbf{W}} \| \hat{\mathbf{Y}}^T \hat{\mathbf{Y}} - \mathbf{S} \|_F^2 \propto \min_{\mathbf{W}} \sum_{i,j=1}^{k} \sum_{c=1}^{l} n^4 \sigma_{\mathbf{y}_c}^4\, \rho_{\hat{\mathbf{f}}_i, \mathbf{y}_c}^2\, \rho_{\hat{\mathbf{f}}_j, \mathbf{y}_c}^2$.

Clearly, $n$ and $\sigma_{\mathbf{y}_c}$ are constant in the feature selection process. $\sum_{c=1}^{l} \rho_{\hat{\mathbf{f}}_i, \mathbf{y}_c}\, \rho_{\hat{\mathbf{f}}_j, \mathbf{y}_c}$ can be regarded as the shared label dependency of the features $\hat{\mathbf{f}}_i$ and $\hat{\mathbf{f}}_j$, that is, the feature redundancy for recognizing the target $\mathbf{y}_c$. Therefore, ELC can minimize feature redundancy in its feature selection process.
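The shared label dependency above can be computed directly. The sketch below is our own illustrative code (the synthetic feature vectors are assumptions); it shows that two features tracking the same label score a higher redundancy than an unrelated pair:

```python
import numpy as np

def label_specific_redundancy(f_i, f_j, Y):
    """Shared label dependency: sum_c rho(f_i, y_c) * rho(f_j, y_c)."""
    total = 0.0
    for c in range(Y.shape[1]):
        y_c = Y[:, c]
        rho_i = np.corrcoef(f_i, y_c)[0, 1]  # Pearson correlation with label c
        rho_j = np.corrcoef(f_j, y_c)[0, 1]
        total += rho_i * rho_j
    return total

Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]], dtype=float)
f_a = Y[:, 0] + 0.01 * np.arange(6)  # nearly duplicates label 0
f_b = Y[:, 0] - 0.01 * np.arange(6)  # redundant with f_a w.r.t. label 0
f_c = np.array([0.5, -1.0, 0.3, 0.8, -0.2, -0.4])  # weakly related feature

high = label_specific_redundancy(f_a, f_b, Y)  # redundant pair
low = label_specific_redundancy(f_c, f_a, Y)   # less redundant pair
```

A selection that minimizes this quantity over pairs of selected features penalizes exactly the kind of overlap Property 2 describes.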

Note that the term $\sum_{c=1}^{l} \rho_{\hat{\mathbf{f}}_i, \mathbf{y}_c}\, \rho_{\hat{\mathbf{f}}_j, \mathbf{y}_c}$ in Property 2 is obtained by introducing the label correlation information. This is a completely novel estimation of the label-specific feature redundancy. The vast majority of existing feature selection approaches (both single-label and multi-label ones) adopt a univariate measurement criterion, under which merely the top-$k$ features have the opportunity to prevail. This strategy largely increases the redundant recognition information shared between features. For example, if we select genes that are all discriminative for type 1 diabetes, we probably cannot give an accurate diagnosis, since these features may be less aware of the other types of diabetes. This is why we have to reduce recognition redundancy and enrich recognition information. Some approaches are able to reduce feature redundancy, but their focus is not the label-specific redundancy. For example, $\sum_{i,j=1}^{k} \rho_{\hat{\mathbf{f}}_i, \hat{\mathbf{f}}_j}$ is actually reduced in SPFS [20]. This term includes additional information irrelevant to recognition and is correspondingly inappropriate. In contrast, ELC removes label-specific feature redundancy and is more suitable for multi-label learning while eliminating noisy features.

As discussed above, ELC possesses two properties, i.e., maximally preserving label correlation information and minimizing label-specific feature redundancy. These characteristics account for the superior ability of ELC to eliminate noisy features and pick out remarkable ones.

### **4. Multi-Task Optimization for ELC**

Equation (1) describes an integer programming problem, which is NP-hard and complicated to solve. Moreover, the $\ell_{2,0}$-norm constraint in Equation (1) is non-smooth, which leads to a slow convergence rate. In this section, we devise an efficient framework to address this problem by using the sparse multi-task learning technique in the proximal alternating direction method (PADM) framework [40].

Suppose the spectral decomposition of the correlation matrix **S** can be denoted as

$$\mathbf{S} = \mathbf{\Phi} \mathbf{\Sigma} \mathbf{\Phi}^T = \mathbf{\Phi}\, \text{diag}(\sigma_1, \ldots, \sigma_l)\, \mathbf{\Phi}^T, \quad \sigma_1 \ge \cdots \ge \sigma_l,$$

where **Φ** and **Σ** are respectively the eigenvector and eigenvalue matrices of **S**. Then, Equation (1) can be reformulated as

$$\min_{\mathbf{W}, \mathbf{p}} \frac{1}{2} \left\| \mathbf{Y}^T \mathbf{X}\, \text{diag}(\mathbf{p})\, \mathbf{W} - \mathbf{\Gamma}^* \right\|_F^2, \ \text{s.t.}\ \mathbf{W} \in \mathbb{R}^{d \times l},\ \left\| \mathbf{W} \right\|_{2,1} \le t,\ \mathbf{p} \in \{0, 1\}^d,\ \mathbf{p}^T \mathbf{1} = k,\tag{2}$$

where $\mathbf{\Gamma}^* = n \mathbf{\Phi} \mathbf{\Sigma}^{1/2}$, $t$ is a hyperparameter that constrains $\| \mathbf{W} \|_{2,1}$ so that the problem admits a convex solution, $\mathbf{p}$ is a feature indicator vector that reflects whether the corresponding features are selected (1 for selected and 0 otherwise), and $\mathbf{1}$ is the all-ones vector.
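Forming $\mathbf{\Gamma}^*$ requires the eigenvalues of $\mathbf{S}$ to be nonnegative; a correlation matrix is positive semidefinite in theory, but numerical error can produce tiny negative eigenvalues, which we clip in this hedged NumPy sketch (our own illustrative code, not the authors' implementation):

```python
import numpy as np

def gamma_star(S, n):
    """Gamma* = n * Phi * Sigma^{1/2} from the decomposition S = Phi Sigma Phi^T."""
    eigvals, Phi = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # reorder so sigma_1 >= ... >= sigma_l
    eigvals, Phi = eigvals[order], Phi[:, order]
    eigvals = np.clip(eigvals, 0.0, None)  # clip tiny negative values (S should be PSD)
    return n * Phi * np.sqrt(eigvals)      # scales each eigenvector column by sqrt(sigma_j)

S = np.array([[1.0, 0.4], [0.4, 1.0]])     # toy 2-label correlation matrix
G = gamma_star(S, n=6)
```

A useful sanity check is that $\mathbf{\Gamma}^* (\mathbf{\Gamma}^*)^T = n^2 \mathbf{\Phi} \mathbf{\Sigma} \mathbf{\Phi}^T = n^2 \mathbf{S}$, which is exactly why Equation (2) reproduces the objective of Equation (1).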

On the basis of Equation (2), ELC is actually reformulated as a multivariate regression problem, which enables the use of multi-task learning [41]. This technique aims to learn a common set of features for multiple related tasks and excels at various sparse learning formulations, including the optimization problem in Equation (1). Based on multi-task learning, we then obtain the equivalent form of ELC as follows:

$$\min_{\mathbf{W}, \mathbf{p}} \frac{1}{2} \left\| \hat{\mathbf{A}}\, \text{diag}(\mathbf{p})\, \mathbf{W} - \mathbf{\Gamma}^* \right\|_F^2 + \lambda \left\| \mathbf{W} \right\|_{2,1}, \ \text{s.t.}\ \mathbf{p} \in \{0, 1\}^d,\ \mathbf{p}^T \mathbf{1} = k,\tag{3}$$

where $\hat{\mathbf{A}} = \mathbf{Y}^T \mathbf{X}$, and $\lambda > 0$ is the regularization parameter. Clearly, we can apply the augmented Lagrangian method to solve this problem. Then, Equation (3) is further reformulated as

$$\min\_{\mathbf{U}, \mathbf{W}, \mathbf{p}} \frac{1}{2} \left\| \hat{\mathbf{A}} \text{diag}(\mathbf{p}) \mathbf{W} - \Gamma^\* \right\|\_F^2 + \lambda \left\| \mathbf{U} \right\|\_{2, 1}, \text{ s.t. } \mathbf{U} = \mathbf{W}, \mathbf{p} \in \{0, 1\}^d, \mathbf{p}^T \mathbf{1} = k. \tag{4}$$

The Lagrangian function can be defined as

$$\mathcal{L}(\mathbf{U}, \mathbf{W}, \mathbf{p}, \mathbf{V}) = \frac{1}{2} \left\| \hat{\mathbf{A}}\, \text{diag}(\mathbf{p})\, \mathbf{W} - \mathbf{\Gamma}^* \right\|_F^2 + \frac{\beta}{2} \left\| \mathbf{W} - \mathbf{U} \right\|^2 + \lambda \left\| \mathbf{U} \right\|_{2,1} - tr\left( \mathbf{V}^T (\mathbf{W} - \mathbf{U}) \right), \tag{5}$$

where $\mathbf{V} = [\mathbf{v}_1^T, \ldots, \mathbf{v}_d^T]^T \in \mathbb{R}^{d \times l}$ is the Lagrangian multiplier, and $\beta > 0$ is the penalty parameter.

Equation (5) involves four variables, that is, the auxiliary variable **U**, the feature weight matrix **W**, the feature indicator vector **p**, and the Lagrangian multiplier **V**. Clearly, simultaneously optimizing four variables is impractical. Accordingly, **V** is temporarily fixed for simplification in the following analysis. Then, minimizing L(**U**, **W**, **p**, **V**) is equivalent to the following two subproblems; i.e.,

• $\min_{\mathbf{U}} \mathcal{L}_1(\mathbf{U}) = \min_{\mathbf{U}} \frac{\beta}{2} \left\| \mathbf{W} - \mathbf{U} \right\|^2 + \lambda \left\| \mathbf{U} \right\|_{2,1} + tr(\mathbf{V}^T \mathbf{U})$;

• $\min_{\mathbf{W}, \mathbf{p}} \mathcal{L}_2(\mathbf{W}, \mathbf{p}) = \min_{\mathbf{W}, \mathbf{p}} \frac{1}{\beta} \left\| \hat{\mathbf{A}}\, \text{diag}(\mathbf{p})\, \mathbf{W} - \mathbf{\Gamma}^* \right\|_F^2 + \left\| \mathbf{W} - \mathbf{U} \right\|^2 - \frac{2}{\beta} tr(\mathbf{V}^T \mathbf{W})$.

As to $\mathcal{L}_1(\mathbf{U})$, the following holds:

$$\mathcal{L}\_1(\mathbf{U}) = \sum\_{i=1}^d \left(\frac{\beta}{2} \left\| \mathbf{w}^i - \mathbf{u}^i \right\|^2 + \lambda \left\| \mathbf{u}^i \right\| + tr(\mathbf{v}\_i^T \mathbf{u}^i) \right), \tag{6}$$

where $\mathbf{w}^i$ and $\mathbf{u}^i$ are the $i$-th row vectors of $\mathbf{W}$ and $\mathbf{U}$, respectively. Then, we reformulate $\min_{\mathbf{U}} \mathcal{L}_1(\mathbf{U})$ into its closed form [41] as

$$\min\_{\mathbf{u}^{i}} \sum\_{i=1}^{d} \left( \frac{\beta}{2} \left\| \mathbf{w}^{i} - \mathbf{u}^{i} + \frac{1}{\beta} \mathbf{v}\_{i} \right\|^{2} + \lambda \left\| \mathbf{u}^{i} \right\| \right). \tag{7}$$

Minimizing Equation (7) yields the following closed-form optimal solution:

$$\mathbf{u}^{i} = \max \left\{ \left\| \mathbf{w}^{i} + \frac{1}{\beta} \mathbf{v}\_{i} \right\| - \frac{\lambda}{\beta}, 0 \right\} \frac{\mathbf{w}^{i} + \frac{1}{\beta} \mathbf{v}\_{i}}{\left\| \mathbf{w}^{i} + \frac{1}{\beta} \mathbf{v}\_{i} \right\|}. \tag{8}$$

Then, the optimal **U** in iteration [*t* + 1] can be denoted as

$$\mathbf{U}^{[t+1]} = \max\left\{ \left\| \mathbf{W}^{[t]} + \frac{1}{\beta} \mathbf{V}^{[t]} \right\| - \frac{\lambda}{\beta}, 0 \right\} \frac{\mathbf{W}^{[t]} + \frac{1}{\beta} \mathbf{V}^{[t]}}{\left\| \mathbf{W}^{[t]} + \frac{1}{\beta} \mathbf{V}^{[t]} \right\|}. \tag{9}$$
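Equation (9) applies the shrinkage of Equation (8) to every row of $\mathbf{W} + \frac{1}{\beta}\mathbf{V}$. A minimal NumPy sketch of this update (our own illustrative code; the function name is an assumption):

```python
import numpy as np

def update_U(W, V, beta, lam):
    """Row-wise soft-thresholding update of Eq. (8)/(9)."""
    Z = W + V / beta
    U = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        norm = np.linalg.norm(Z[i])
        if norm > 0:
            # shrink the row's norm by lam/beta; rows below the threshold vanish
            U[i] = max(norm - lam / beta, 0.0) * Z[i] / norm
    return U

# Toy example: the small second row is zeroed out, the large first row shrinks.
W = np.array([[3.0, 4.0], [0.1, 0.0], [0.0, 0.0]])
V = np.zeros_like(W)
U = update_U(W, V, beta=1.0, lam=1.0)
```

This row-wise shrinkage is what produces row sparsity in $\mathbf{U}$, matching the $\ell_{2,1}$ penalty.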

In terms of $\min_{\mathbf{W}, \mathbf{p}} \mathcal{L}_2(\mathbf{W}, \mathbf{p})$, we let $\mathcal{P} = \{\mathbf{p} \mid \mathbf{p} \in \{0, 1\}^d, \mathbf{p}^T \mathbf{1} = k\}$. The dual problem of $\min_{\mathbf{W}, \mathbf{p}} \mathcal{L}_2(\mathbf{W}, \mathbf{p})$ is

$$\min\_{\mathbf{p}\in\mathcal{P}} \max\_{\mathbf{W}} \mathcal{L}\_2(\mathbf{W}, \mathbf{p}). \tag{10}$$

Since simultaneously solving for both variables $\mathbf{p}$ and $\mathbf{W}$ is still difficult, we first fix $\mathbf{p}$ to optimize $\mathbf{W}$. Then, the solution of $\mathbf{W}$ can be obtained from

$$\left( \operatorname{diag}(\mathbf{p}) \hat{\mathbf{A}}^T \hat{\mathbf{A}}\, \operatorname{diag}(\mathbf{p}) + \beta \mathbf{I} \right) \mathbf{W} = \operatorname{diag}(\mathbf{p}) \hat{\mathbf{A}}^T \mathbf{\Gamma}^* + \beta \mathbf{U} + \mathbf{V},\tag{11}$$

where $\mathbf{I}$ is the identity matrix. The structure of $\hat{\mathbf{A}}^T \hat{\mathbf{A}}$ is generally not circulant, and therefore the computation of Equation (11) is expensive [42]. Considering this, a proximal approximation term is added to $\mathcal{L}_2(\mathbf{W}, \mathbf{p})$ as follows:

$$\begin{split} \tilde{\mathcal{L}}_2(\mathbf{W}, \mathbf{p}) &= \frac{1}{\beta \tau} \left\| \mathbf{W} - \mathbf{W}^{[t]} + \tau \mathbf{O}^{[t]} \right\|^2 - \frac{2}{\beta} tr(\mathbf{V}^T \mathbf{W}) + \left\| \mathbf{W} - \mathbf{U} \right\|^2, \\ \mathbf{O}^{[t]} &= \text{diag}(\mathbf{p}^{[t]})\, \hat{\mathbf{A}}^T \left( \hat{\mathbf{A}}\, \text{diag}(\mathbf{p}^{[t]})\, \mathbf{W}^{[t]} - \mathbf{\Gamma}^* \right), \end{split} \tag{12}$$

where *τ* > 0, and **W**[*t*] is the optimal value of **W** in iteration [*t*]. Then, the solution of **W**[*t*+1] is

$$\mathbf{W}^{[t+1]} = \left(\frac{\tau}{\beta \tau + 1}\right) \left(\beta \mathbf{U}^{[t+1]} + \mathbf{V}^{[t]} + \frac{1}{\tau} (\mathbf{W}^{[t]} - \tau \mathbf{O}^{[t]})\right). \tag{13}$$

The detailed inference can be found in the Appendix A.
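Under the same assumptions as the earlier sketches (our own illustrative code, not the authors' implementation), the update of Equation (13) with the gradient term $\mathbf{O}^{[t]}$ of Equation (12) can be written as:

```python
import numpy as np

def update_W(W_t, U_next, V_t, p_t, A_hat, Gamma, beta, tau):
    """Proximal update of Eq. (13) using the gradient term O^[t] of Eq. (12)."""
    D = np.diag(p_t)
    O_t = D @ A_hat.T @ (A_hat @ D @ W_t - Gamma)  # O^[t] (l x d pieces -> d x l)
    return (tau / (beta * tau + 1.0)) * (
        beta * U_next + V_t + (W_t - tau * O_t) / tau
    )

# Toy shapes: d = 3 features, l = 2 labels (A_hat = Y^T X is l x d).
W_t = np.ones((3, 2)); U_next = np.ones((3, 2)); V_t = np.zeros((3, 2))
p_t = np.array([1.0, 1.0, 0.0])
A_hat = np.zeros((2, 3)); Gamma = np.zeros((2, 2))
W_next = update_W(W_t, U_next, V_t, p_t, A_hat, Gamma, beta=1.0, tau=1.0)
```

With a zero data term the update reduces to a weighted average of $\mathbf{U}^{[t+1]}$, $\mathbf{V}^{[t]}$, and $\mathbf{W}^{[t]}$, which makes the fixed-point behavior of Equation (13) easy to check.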

Similarly, we can easily obtain the optimal $\mathbf{p}$ by fixing $\mathbf{W}$. Equation (10) is then equivalent to the following minimization problem:

$$\min_{\mathbf{p} \in \mathcal{P}} \left\| \hat{\mathbf{A}}\, \text{diag}(\mathbf{p})\, \mathbf{W} - \mathbf{\Gamma}^* \right\|_F^2 = \min_{\mathbf{p} \in \mathcal{P}} \left\| \mathbf{Y}^T \sum_{i=1}^{d} p_i \mathbf{f}_i \mathbf{w}^i - \mathbf{\Gamma}^* \right\|_F^2. \tag{14}$$

Apparently, the top-$k$ features that minimize $\left\| \mathbf{Y}^T \mathbf{f}_i \mathbf{w}^i - \mathbf{\Gamma}^* \right\|_F^2$ can be regarded as the remarkable ones. Their corresponding values in $\mathbf{p}$ are assigned as 1.
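The per-feature scoring just described can be sketched as follows (our own illustrative code; the per-feature residual ranking mirrors the sentence above, and the function name is an assumption):

```python
import numpy as np

def update_p(X, Y, W, Gamma, k):
    """Set p_i = 1 for the k features whose terms Y^T f_i w^i best match Gamma*."""
    d = X.shape[1]
    scores = np.empty(d)
    for i in range(d):
        term = Y.T @ np.outer(X[:, i], W[i])  # Y^T f_i w^i, an l x l matrix
        scores[i] = np.linalg.norm(term - Gamma, "fro") ** 2
    p = np.zeros(d)
    p[np.argsort(scores)[:k]] = 1.0           # smallest residuals win
    return p

# Toy case: feature 0 reproduces Gamma exactly, so it must be selected.
X = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
Y = np.eye(2)
W = np.ones((3, 2))
Gamma = np.ones((2, 2))
p = update_p(X, Y, W, Gamma, k=1)
```

Ranking features by individual residuals keeps this step at one pass over the $d$ candidates, consistent with the $O(k \log d)$ selection cost claimed later.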

Note that the Lagrangian multiplier **V** is fixed through the above analysis, mainly for simplifying the solution process. We further tackle this problem in the popular PADM framework as illustrated in Algorithm 1. In this framework, **V** can be updated as

$$\mathbf{V}^{[t+1]} = \mathbf{V}^{[t]} - \beta \left(\mathbf{W}^{[t+1]} - \mathbf{U}^{[t+1]}\right). \tag{15}$$

**Algorithm 1** ELC.

**Input:** $\mathbf{F} = \{\mathbf{f}_1, \ldots, \mathbf{f}_d\}$, $\mathbf{Y}$, $\mathbf{S}$, $k$, $\beta$, $\tau$, $\lambda$
**Output:** $\mathbf{p}^{[t]}$

1: **begin**
2: $t = 0$, $\mathbf{W}^{[0]} = \mathbf{0}_{d \times l}$, $\mathbf{U}^{[0]} = \mathbf{0}_{d \times l}$, $\mathbf{V}^{[0]} = \frac{1}{d} \mathbf{1}_{d \times l}$;
3: find the top-$k$ features $\hat{\mathbf{f}}_1^{[0]}, \ldots, \hat{\mathbf{f}}_k^{[0]}$ that minimize Equation (1), and set $p_i^{[0]} = 1$ if $\mathbf{f}_i \in \{\hat{\mathbf{f}}_1^{[0]}, \ldots, \hat{\mathbf{f}}_k^{[0]}\}$ and $p_i^{[0]} = 0$ otherwise;
4: **while** "not converged" **do**
5: optimize $\mathbf{U}^{[t+1]}$ according to Equation (9);
6: optimize $\mathbf{W}^{[t+1]}$ according to Equation (13);
7: find the top-$k$ features $\hat{\mathbf{f}}_1^{[t+1]}, \ldots, \hat{\mathbf{f}}_k^{[t+1]}$ that minimize Equation (14), and set $p_i^{[t+1]} = 1$ if $\mathbf{f}_i \in \{\hat{\mathbf{f}}_1^{[t+1]}, \ldots, \hat{\mathbf{f}}_k^{[t+1]}\}$ and $p_i^{[t+1]} = 0$ otherwise;
8: update $\mathbf{V}^{[t+1]}$ according to Equation (15);
9: $t = t + 1$;
10: **end while**;
11: **return** $\mathbf{p}^{[t]}$;
12: **end**

ELC in Algorithm 1 is implemented in the regression framework PADM, which is a fast alternating approach for the well-known alternating direction method (ADM) framework. PADM is effective and efficient in solving the minimization problem of the augmented Lagrangian function, and is able to converge to a certain solution $\{\mathbf{W}^*, \mathbf{U}^*\}$ from any starting point $\{\mathbf{W}^{[0]}, \mathbf{U}^{[0]}\}$ for any $\beta > 0$ [40].

In terms of the complexity of ELC, it only takes $O(k \log d)$ time to find the $k$ remarkable features among the $d$ candidates. Thus, the time consumption of line 3 is $O(ndl^2 + k \log d)$. The cost of the while loop in Algorithm 1 mainly lies in lines 6 and 7, which is $O(d^2 l^2 + ndl^2 + k \log d)$ per iteration. As this process is repeated $t$ times, its total cost is $O(t(d^2 l^2 + ndl^2 + k \log d))$. Suppose $t \gg 1$. Then, the total complexity of ELC is approximately $O(t(d^2 l^2 + ndl^2 + k \log d))$, where $d$, $n$, $l$, $k$, $t$ are the numbers of features, instances, labels, selected features, and iterations for convergence, respectively.

### **5. Experimental Evaluation**

Fourteen groups of multi-label data sets fetched from the Mulan library (http://mulan.sourceforge.net/datasets-mlc.html) are taken as the benchmarks in this section; they are summarized in Table 1. We compare ELC (the source code is available at https://github.com/wangjuncs/ELC) with the following state-of-the-art multi-label feature selection methods:

• MIFS (multi-label informed feature selection) [33]: a label correlation-based multi-label feature selection approach, which maps label information into a low-dimensional subspace and captures the correlations among multiple labels;



**Table 1.** Benchmarks for multi-label feature selection.

More detailed experimental configurations can be found in Appendix B.

### *5.1. Example 1: Classification Performance*

The average classification performance of each feature selection approach is recorded in Table 2, and pairwise *t*-tests at the 5% significance level were conducted to validate statistical significance. In addition to the traditional precision and AUC metrics, Hamming loss penalizes incorrect assignments of instances to each target label, ranking loss penalizes misordered label pairs, and one-error penalizes instances whose top-ranked predicted labels are not in the ground-truth label set. These five metrics evaluate multi-label classification performance from different aspects.

A single metric is insufficient to characterize the overall classification performance on a dataset. For example, the ML-kNN classifier [43] performs worse on birds than on enron under the precision metric, yet better on birds than on enron under the AUC metric. We therefore used all five metrics to compare the approaches. As shown in Table 2, ELC outperforms MIFS, CMFS, and LLSF under various metrics. This superiority is attributed to two factors: ELC effectively eliminates noisy features from the candidate feature subsets, and it maximally embeds label correlation information into its selection process. The first factor rules out selection disturbance in the feature space, and the second guarantees proper guiding information extracted from the label space. By seamlessly fusing these two factors, ELC is able to find discriminative features for the downstream learning tasks. This point is further validated in Sections 5.2 and 5.3.

**Table 2.** Average multi-label classification performance (mean ± std.): the best results and those not significantly worse than them are highlighted in bold (pairwise *t*-test at the 5% significance level).


**(a)** Precision (the higher the better).

**(b)** AUC (the higher the better).



**(c)** Hamming loss (the lower the better).



**(d)** Ranking loss (the lower the better).



**(e)** One-error (the lower the better).



### *5.2. Example 2: Eliminating Noisy Features*

In this section, we evaluate the performances of the compared approaches in eliminating noisy features. We take emotions, birds, and enron as the benchmarks, and measure the residual feature redundancy in the selected feature subset **Fˆ** as follows:

$$R(\hat{\mathbf{F}}) = \frac{1}{k(k-1)\,l} \sum_{\hat{\mathbf{f}}_i, \hat{\mathbf{f}}_j \in \hat{\mathbf{F}},\, i \neq j} \sum_{c=1}^{l} \rho_{\hat{\mathbf{f}}_i, \mathbf{y}_c}^{2} \, \rho_{\hat{\mathbf{f}}_j, \mathbf{y}_c}^{2} \tag{16}$$

where $\rho_{\hat{\mathbf{f}}_i, \mathbf{y}_c}$ and $\rho_{\hat{\mathbf{f}}_j, \mathbf{y}_c}$ are the Pearson correlation coefficients of the features $\hat{\mathbf{f}}_i$ and $\hat{\mathbf{f}}_j$ with the target label $\mathbf{y}_c$, and *k* and *l* are the numbers of selected features and labels, respectively. A larger $R(\hat{\mathbf{F}})$ indicates more redundant information remaining in $\hat{\mathbf{F}}$, which reflects an inferior ability of the selection approach to remove noisy features.
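The redundancy measure of Equation (16) can be computed directly from the data matrices. The following Python sketch is our own illustration (the function name is hypothetical, and we assume no selected feature or label is constant, so the Pearson correlations are well defined):

```python
import numpy as np

def residual_redundancy(X_sel: np.ndarray, Y: np.ndarray) -> float:
    """Residual feature redundancy of Equation (16): the average, over
    ordered feature pairs (i != j) and labels c, of the product of squared
    Pearson correlations between each feature and each label.
    X_sel: (n, k) selected-feature matrix; Y: (n, l) label matrix."""
    n, k = X_sel.shape
    l = Y.shape[1]
    # Standardize columns so that (Xz.T @ Yz) / n gives Pearson correlations.
    Xz = (X_sel - X_sel.mean(0)) / X_sel.std(0)
    Yz = (Y - Y.mean(0)) / Y.std(0)
    rho2 = ((Xz.T @ Yz) / n) ** 2                 # squared correlations, (k, l)
    # sum_{i != j} rho2[i,c] * rho2[j,c] = (sum_i rho2[i,c])^2 - sum_i rho2[i,c]^2
    pair_sum = (rho2.sum(0) ** 2 - (rho2 ** 2).sum(0)).sum()
    return pair_sum / (k * (k - 1) * l)
```

The pairwise sum is expanded algebraically, so the cost is *O*(*kl*) after the correlation matrix is formed, instead of looping over all feature pairs.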

The feature redundancy of the *k* selected features for each approach is shown in Figure 1, where *k* ∈ {*d*/10, 2*d*/10, ... , 9*d*/10} and *d* is the total number of original features. The figure illustrates that ELC is superior in reducing feature redundancy; in other words, ELC can effectively remove redundant features during its multi-label feature selection process. This is one of the crucial factors behind the excellent discriminative ability of ELC. It should be pointed out that, in contrast to the single-label case, eliminating noisy features has not received sufficient attention in existing multi-label feature selection approaches. Since noisy features are an obstacle to high selection performance not only in single-label learning but also in the multi-label case, we devised ELC to comprehensively tackle this problem. Moreover, the feature redundancy reduced by the majority of redundancy elimination-based approaches is not directly relevant to the target labels. In contrast, ELC quantitatively reduces target-relevant redundancy without any prior probability knowledge, which is conducive to its superiority in multi-label feature selection.

**Figure 1.** Classification redundancy: (**a**–**c**) are the classification redundancies produced by the feature selection approaches on the emotions, birds, and enron datasets; lower redundancy is better.

### *5.3. Example 3: Embedding Label Correlations*

Label correlation information is important for multi-label learning. In the following experiments, we estimate the preserved label correlation information of the selected feature subset **Fˆ** as follows:

$$\mathcal{C}(\hat{\mathbf{F}}) = \frac{1}{k(k-1)} \left\| \frac{1}{n^2} \mathbf{Y}^T \mathbf{X}_{\hat{\mathbf{F}}} \mathbf{X}_{\hat{\mathbf{F}}}^T \mathbf{Y} - \mathbf{S} \right\|_F^2, \tag{17}$$

where $\mathbf{X}_{\hat{\mathbf{F}}}$ denotes the instances characterized by $\hat{\mathbf{F}}$ and **S** is the label correlation matrix of the original data. Intuitively, Equation (17) measures the residual scale of label correlation information between the original and reduced feature spaces. A lower value indicates that more information is preserved; in other words, more label correlation information has been embedded in the feature selection process.
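The residue in Equation (17) is a Frobenius-norm distance and is straightforward to evaluate. The following Python sketch is our own (the function name is hypothetical; **X_sel** plays the role of $\mathbf{X}_{\hat{\mathbf{F}}}$):

```python
import numpy as np

def residual_label_correlation(X_sel: np.ndarray, Y: np.ndarray,
                               S: np.ndarray) -> float:
    """Residual label-correlation scale of Equation (17): squared Frobenius
    distance between the label correlation induced by the selected features
    and the original label correlation matrix S, normalized by the number
    of selected-feature pairs.  X_sel: (n, k); Y: (n, l); S: (l, l)."""
    n, k = X_sel.shape
    induced = (Y.T @ X_sel @ X_sel.T @ Y) / n**2   # (l, l) induced correlation
    return np.linalg.norm(induced - S, 'fro')**2 / (k * (k - 1))
```

When the induced correlation matches **S** exactly, the residue is zero, i.e., all label correlation information has been preserved.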

Similar to the configuration in Section 5.2, we take emotions, birds, and enron as the benchmarks and record $\mathcal{C}(\hat{\mathbf{F}})$ of the *k* features selected by each approach, where *k* ∈ {*d*/10, 2*d*/10, ... , 9*d*/10}. As shown in Figure 2, ELC preserves the label correlation information better than the other multi-label feature selection approaches. In fact, the majority of existing multi-label feature selection approaches take label correlation information into consideration to some extent. In contrast to these approaches, ELC quantitatively measures this correlation information and maximally embeds it into the feature selection process. This characteristic, which has already been proved in Property 2, is further revealed by the experimental results in this section.

**Figure 2.** Residual label correlation information: (**a**–**c**) are the residual scales of the label correlation information not embedded by the feature selection approaches on the emotions, birds, and enron datasets; a lower residual scale is better.

### *5.4. Example 4: Time Consumption*

In this section, we compare the approaches in terms of their feature selection efficiency. The time consumption reported here covers only feature selection, excluding the classification cost. All of the tests were implemented in Matlab on an Intel Core i7-4790 CPU (@3.6 GHz) with 32 GB memory (Intel Corp., Santa Clara, CA, USA). We selected *k* (*k* ∈ {100, 300, 500, 700, 900}) features on the enron dataset and recorded the time consumption of each compared approach. As illustrated in Figure 3, ELC and CMFS converge with comparable efficiency, while MIFS is the most time-consuming, which may mainly be attributed to its label clustering process.

**Figure 3.** Time consumption of each multi-label feature selection approach on the enron dataset.

### **6. Conclusions**

A novel multi-label feature selection method called ELC is proposed in this paper. ELC embeds label correlation information into the reduced feature subspace to eliminate noisy features. In this way, irrelevant and redundant features can be expected to be removed and a discriminative feature subset is constructed for the downstream learning tasks. These advantages help ELC yield good feature selection performance on a broad range of multi-label data sets under various evaluation metrics.

In terms of optimizing ELC, we can feed it to gradient descent frameworks, such as Adam with a self-adaptive learning rate [44], to efficiently reach its optimal values. Another interesting direction is the consideration of noisy labels, which induce negative effects on estimating label correlations. According to our pilot study, noisy labels may distort the label space and provide inaccurate guiding information for feature selection. Eliminating noisy labels may therefore inspire our future work.

**Author Contributions:** Each author greatly contributed to the preparation of this manuscript. J.W. (Jun Wang) and J.W. (Jinmao Wei) wrote the paper; Y.X. and H.X. designed and performed the experiments; Z.S. and Z.Y. devised the optimization algorithms. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (number 61772288), the Natural Science Foundation of Tianjin City (number 18JCZDJC30900), the Ministry of Education of Humanities and Social Science Project (number 16YJC790123), the National Natural Science Foundation of Shandong Province (number ZR2019MA049), and the Cooperative Education Project of the Ministry of Education of China (number 201902199006).

**Acknowledgments:** The authors are very grateful to the anonymous reviewers and editor for their helpful and constructive comments and suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **Appendix A**

After adding an approximate term to $\mathcal{L}_2(\mathbf{W}, \mathbf{p})$ and reformulating it as $\tilde{\mathcal{L}}_2(\mathbf{W}, \mathbf{p})$, we take the derivative of $\tilde{\mathcal{L}}_2(\mathbf{W}, \mathbf{p})$ with respect to **W** as follows:

$$\frac{\partial \tilde{\mathcal{L}}_2}{\partial \mathbf{W}} = \beta(\mathbf{W} - \mathbf{U}) - \mathbf{V} + \frac{1}{\tau}\left(\mathbf{W} - \mathbf{W}^{[t]} + \tau \mathbf{\Omega}^{[t]}\right), \qquad \mathbf{\Omega}^{[t]} = \mathrm{diag}(\mathbf{p}^{[t]})\, \hat{\mathbf{A}}^T \left( \hat{\mathbf{A}}\, \mathrm{diag}(\mathbf{p}^{[t]})\, \mathbf{W}^{[t]} - \mathbf{T}^{*} \right).$$

To obtain the optimal solution of **W**, we set $\frac{\partial \tilde{\mathcal{L}}_2}{\partial \mathbf{W}} = 0$ and obtain:

$$\left(\beta + \frac{1}{\tau}\right)\mathbf{W} = \beta\mathbf{U} + \mathbf{V} + \frac{1}{\tau}\left(\mathbf{W}^{[t]} - \tau\mathbf{\Omega}^{[t]}\right).$$

Then, the optimal solution of **W** at iteration [*t* + 1] can be represented as

$$\mathbf{W}^{[t+1]} = \frac{\tau}{\beta\tau + 1} \left(\beta \mathbf{U}^{[t+1]} + \mathbf{V}^{[t]} + \frac{1}{\tau}\left(\mathbf{W}^{[t]} - \tau \mathbf{\Omega}^{[t]}\right)\right).$$

### **Appendix B. Experimental Configuration**

The correlation (or similarity) matrices involved in the experiments are all calculated with the RBF kernel function. Specifically, the label correlation matrix **S** in ELC is defined as

$$\mathbf{S}_{ij} = \begin{cases} \exp\left(-\dfrac{\|\mathbf{y}_i - \mathbf{y}_j\|^2}{2\delta^2}\right), & \mathbf{y}_i, \mathbf{y}_j \neq \mathbf{0} \\ 0, & \text{otherwise} \end{cases}$$

where $\delta^2 = \mathrm{mean}(\|\mathbf{y}_i - \mathbf{y}_j\|^2)$, $i, j = 1, \ldots, l$. The instance similarity matrix in SPFS and CMFS is calculated as

$$\mathbf{K}_{ij} = \begin{cases} \exp\left(-\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\delta^2}\right), & \mathbf{y}_i = \mathbf{y}_j \\ 0, & \text{otherwise} \end{cases}$$

where $\delta^2 = \mathrm{mean}(\|\mathbf{x}_i - \mathbf{x}_j\|^2)$. The affinity graph in MIFS is constructed as

$$\mathbf{K}_{ij} = \begin{cases} \exp\left(-\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\delta^2}\right), & \mathbf{x}_i \in \mathcal{N}_p(\mathbf{x}_j) \ \text{or} \ \mathbf{x}_j \in \mathcal{N}_p(\mathbf{x}_i) \\ 0, & \text{otherwise} \end{cases}$$

where $\mathcal{N}_p(\mathbf{x}_i)$ is the set of *p*-nearest neighbors of instance $\mathbf{x}_i$.
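As a sketch of how the RBF-kernel label correlation matrix **S** can be computed, the following Python snippet is our own illustration (the function name is hypothetical; we assume **Y** is an *n* × *l* label matrix whose columns are the label vectors):

```python
import numpy as np

def rbf_label_correlation(Y: np.ndarray) -> np.ndarray:
    """RBF-kernel label correlation matrix S over the columns of Y:
    S_ij = exp(-||y_i - y_j||^2 / (2 * delta^2)), with delta^2 set to the
    mean squared distance between label vectors; entries involving an
    all-zero label vector are set to 0."""
    L = Y.T                                                 # (l, n): one row per label
    d2 = ((L[:, None, :] - L[None, :, :]) ** 2).sum(-1)     # pairwise ||.||^2, (l, l)
    delta2 = d2.mean()
    S = np.exp(-d2 / (2 * delta2))
    nonzero = (L != 0).any(1)                               # mask all-zero labels
    S[~nonzero, :] = 0.0
    S[:, ~nonzero] = 0.0
    return S
```

The same pattern applies to the instance similarity matrices, with label vectors replaced by instance vectors and the corresponding masking condition.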

SPFS is implemented via the sequential forward selection (SFS) strategy. For a fair comparison, we tune the regularization parameter for all approaches via a grid search over {10<sup>−3</sup>, 10<sup>−2</sup>, 10<sup>−1</sup>, 1, 10}. For ELC, the parameter *β* is fixed at *β* = 10<sup>8</sup>, and *τ* is set to the spectral radius of $\hat{\mathbf{A}}^T \hat{\mathbf{A}}$ in the initial state and updated as $\tau^{[t]} = 1/\max(\psi_i)$ in the *t*-th iteration, where $\psi_i$ is the *i*-th row vector of $\mathbf{\Psi}$ and $\mathbf{\Psi} = \hat{\mathbf{A}}^T \hat{\mathbf{A}} \mathbf{V}^{[t]}$. The convergence state is reached when either of the following two conditions is satisfied: (1) the maximum number of iterations *t*<sub>max</sub> = 10<sup>3</sup> is reached; or (2) $\|\mathbf{W}^{[t+1]} - \mathbf{W}^{[t]}\| \leq 10^{-4}$.

The multi-label k-nearest neighbor (ML-kNN) classifier [43] is built on the *k* features selected by each compared approach, where *k* ∈ {*d*/10, 2*d*/10, ... , 9*d*/10} and *d* is the total number of features. All numerical features are normalized to zero mean and unit variance before the ML-kNN classifiers are constructed on the selected features and their classification performances are compared. Five-fold cross-validation is conducted, and we report the average ML-kNN classification performance under five metrics, i.e., precision, AUC, Hamming loss, ranking loss, and one-error [39].
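The zero-mean, unit-variance normalization used before classifier construction can be sketched as follows; this is our own illustration (the function name is hypothetical), and we estimate the statistics on the training fold only so that cross-validation does not leak test information:

```python
import numpy as np

def zscore_fit_apply(X_train: np.ndarray, X_test: np.ndarray):
    """Normalize features to zero mean and unit variance.  The mean and
    standard deviation are estimated on the training fold and then applied
    to the held-out fold, avoiding test-set leakage in cross-validation."""
    mu, sigma = X_train.mean(0), X_train.std(0)
    sigma[sigma == 0] = 1.0                # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```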

### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
