Article

An Instance- and Label-Based Feature Selection Method in Classification Tasks

1 College of Information Engineering, Northwest A&F University, 3 Taicheng Road, Yangling, Xianyang 712100, China
2 Research Center of Information Technology, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Information 2023, 14(10), 532; https://doi.org/10.3390/info14100532
Submission received: 8 August 2023 / Revised: 21 September 2023 / Accepted: 23 September 2023 / Published: 28 September 2023
(This article belongs to the Special Issue Information Systems and Technologies)

Abstract

Feature selection is crucial in classification tasks as it helps to extract relevant information while reducing redundancy. This paper presents a novel method that considers both instance and label correlation. By employing the least squares method, we calculate the linear relationship between each feature and the target variable, resulting in correlation coefficients. Features with high correlation coefficients are selected. Compared to traditional methods, our approach offers two advantages. Firstly, it effectively selects features highly correlated with the target variable from a large feature set, reducing data dimensionality and improving analysis and modeling efficiency. Secondly, our method considers label correlation between features, enhancing the accuracy of selected features and subsequent model performance. Experimental results on three datasets demonstrate the effectiveness of our method in selecting features with high correlation coefficients, leading to superior model performance. Notably, our approach achieves a minimum accuracy improvement of 3.2% for the advanced classifier, LightGBM, surpassing other feature selection methods. In summary, our proposed method, based on instance and label correlation, presents a suitable solution for classification problems.

1. Introduction

Classification is a fundamental task in machine learning, with diverse applications across various fields [1,2]. In this task, the inputs are typically represented as vectors. However, not all elements in a vector contain relevant or beneficial information for classification; some may even have a detrimental effect on the classification task. Therefore, the selection of informative elements within a sample is a topic of great interest among researchers [3,4,5,6].
By reducing feature dimensionality, feature selection significantly enhances model performance, reduces computational complexity, and improves the efficiency of data analysis and modeling processes [7]. On the other hand, manifold learning focuses on dimensionality reduction through manifolds. By assuming that data points are distributed along low-dimensional manifolds, manifold learning aims to decrease the dimensionality of the data while preserving local relationships [8]. This approach has proven particularly valuable in addressing high-dimensional nonlinear problems and attaining a clear representation of the data. Leveraging nonlinear mapping, manifold learning algorithms effectively map high-dimensional data to low-dimensional spaces, facilitating a more profound analysis of the underlying data structures [9].
Feature selection and manifold learning both reduce data redundancy while retaining important features. Feature selection reduces dimensionality and removes irrelevant information before manifold learning is applied, and manifold learning can then analyze local structures and correlations in the lower-dimensional feature space [10]. Combining the two improves data processing and modeling and benefits subsequent procedures.
In this paper, we propose a novel feature selection method that takes into account both instance correlation and label correlation. Our method uses the least squares method [11] to calculate the linear relationship between each feature and the target variable. By fitting a linear model, we obtain weights for each feature, and features with higher weights, i.e., stronger correlations, can be selected as the final subset. Regularization further constrains feature selection and model complexity, helping to find an optimal subset with good generalization performance. When sparsity is desired, the $L_{2,1}$ norm is preferred for controlling non-zero coefficients because it is smoother and less sensitive to outliers. To explore the deep structural information of the labels, we introduce a separation variable that captures label correlations and transforms the problem into a separable convex optimization problem, which preserves global and local structural information and improves performance. Consistency of label correlations is ensured by aligning the predicted label matrix with the ground truth label matrix and by considering the similarity between adjacent instances. The objective function is optimized with the alternating direction method of multipliers [12].
The main contributions are as follows:
  • We introduce a novel unsupervised feature selection method that takes into account both instance correlation and label correlation, effectively mitigating the impact of noise in data.
  • We improve performance by introducing a low-dimensional embedding vector to learn potential structural spatial information.
  • Our method outperforms existing approaches like PMU, MDFS, MCLS, and FIMF, showing significant improvements in a series of experiments.

2. Related Work

Feature selection methods have a rich research history and can be broadly categorized into two approaches [3,4,7]. The first approach involves generating a new lower-dimensional vector by utilizing vector elements from existing samples. The second approach focuses on selecting specific elements from existing vectors to construct a new vector. These methods play a crucial role in reducing dimensionality and improving the efficiency of feature representation.
Principal component analysis [3,13] is a commonly used dimensionality reduction method that projects the original data onto a new coordinate system, retaining a minimal number of principal components to reduce the dimensionality while preserving important information from the original data. However, it also has some limitations. For example, it cannot consider class information, performs poorly on non-linearly distributed data, and is sensitive to outliers.
The new vectors generated by the methods above may not be consistent with the original vectors; the chi-squared test can help to compensate for this deficiency. The chi-squared test [4,14] is used to select discrete features with respect to a discrete target variable. Using the chi-squared statistic, it measures the correlation between a feature and the target variable. However, the test assumes independence between the feature and the target variable and may give incorrect results if this assumption is violated. Additionally, small sample sizes may lead to unstable results.
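As a concrete illustration of chi-squared-based selection (not part of the original study), the following sketch ranks synthetic discrete features with scikit-learn's SelectKBest and chi2; the data and the choice of k = 10 are assumptions made purely for demonstration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic non-negative (count-valued) features and class labels
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 50))   # 200 samples, 50 discrete features
y = rng.integers(0, 3, size=200)          # 3 classes

# Rank features by the chi-squared statistic and keep the top 10
selector = SelectKBest(score_func=chi2, k=10)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)                     # (200, 10)
print(selector.get_support(indices=True))  # indices of the selected features
```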
In fact, the elements of a sample's feature vector are not completely independent; their correlations can be exploited to improve the effectiveness of feature selection. PMU [15] is a feature selection approach that leverages the mutual information between the selected features and the label set. It can be applied to diverse classification problems, and the experimental results validate its effectiveness in improving classification performance, making PMU a valuable and efficient tool for feature selection. The feature $f^+$ to be added to the selected subset is obtained by the following equation:
$$J = \tilde{I}(\{S, f^+\}; L) - \tilde{I}(S; L),$$
where S denotes an input feature set, L denotes a label set, and $\tilde{I}(S; L)$ denotes the multivariate mutual information. FIMF [16] proposed a fast feature selection method based on information-theoretic feature ranking to address the research gap in computationally efficient feature selection methods. The results indicate that the method significantly reduces the time required to generate feature subsets, particularly for large datasets, but its accuracy is limited. The objective function of FIMF is as follows:
$$J(f) = \tilde{V}_2(f \times L^1) + \sum_{k=3}^{b+1} (-1)^k \tilde{V}_k(f \times Q^{k-1}),$$
where f denotes a feature set; L denotes a label set; Q is a set of labels consisting of the highest-entropy labels.
MCLS [17] uses manifold learning to transform the logical label space into a Euclidean label space and constrains the similarity between samples through the corresponding numerical labels. The final selection criterion integrates both supervised information and the local properties of the data. The score of the i-th feature is computed as follows:
$$MCLS_i = \frac{\tilde{f}_i^T \hat{L} \tilde{f}_i}{\tilde{f}_i^T \hat{D} \tilde{f}_i} \cdot \frac{f_i^T \hat{L}^{M} f_i}{f_i^T \hat{L}^{C} f_i},$$
where $f_i$ denotes the i-th feature, $\hat{L}$ is a Laplacian matrix, and $\hat{D}$ is a diagonal matrix. In order to reduce dimensionality and select relevant features, MDFS [18] proposed a manifold regularized embedded multi-label feature selection method. This method constructs a low-dimensional embedding that adapts to the label distribution, considers label correlations, and utilizes $L_{2,1}$ norm regularization for feature selection. The objective function of MDFS is as follows:
$$\min_{F, W, b} \; Tr(F^T L F) + \|XW + \mathbf{1}_n b^T - F\|_F^2 + \alpha \|F - Y\|_F^2 + \beta \, Tr(F L_0 F^T) + \gamma \|W\|_{2,1},$$
where α, β, and γ are hyperparameters. However, this method does not take into account the potential structural information contained in the labels. It nevertheless provided inspiration for our research.

3. Materials and Methods

In this section, we propose a novel unsupervised feature selection method. Subsequently, detailed explanations are provided on three components: notations, problem formulation, and optimization.

3.1. Notations

Given the extensive formula derivations and symbol usage in this paper, Table 1 lists the symbols that appear. In the table, A denotes an arbitrary matrix.

3.2. Problem Formulation

In the feature selection process, we use the least squares method to calculate the linear relationship between each feature and the target variable. By fitting a linear model, the weight of each feature can be obtained, and features with high weight or strong correlation can be selected as the final feature subset. In addition, the least squares method is extended with regularization to constrain feature selection and model complexity. This helps us to find an optimal subset of features with good generalization performance, improving the predictive ability and interpretability of the model. The basic formula is as follows:
$$\min_{W, b} \; \|XW + \mathbf{1}_n b^T - Y\|_F^2 + \alpha \|W\|_p^2,$$
where p is the norm representation and W is a feature selection matrix, which is the key to solving the problem.
In feature selection, the $L_1$ norm is usually used to count the number of non-zero coefficients and thereby encourage the model to select sparse features. However, the $L_1$ norm is not ideal for selecting non-zero coefficients, as it tends to keep some coefficients and compress others to zero without explicitly controlling how many remain non-zero. In contrast, the $L_{2,1}$ norm handles feature sparsity better: it first computes the $L_2$ norm of each feature (row) vector and then takes the $L_1$ norm of these results. The $L_{2,1}$ norm can better constrain the number of non-zero rows because it is smoother and less sensitive to outliers. Since Y is the ground-truth label matrix and therefore constant, and $\mathbf{1}_n b^T$ is also constant, they can be merged. The formula is thus simplified as follows:
$$\min_{W} \; \|XW - Y\|_F^2 + \alpha \|W\|_{2,1}^2.$$
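To make the criterion concrete, the sketch below (our own illustration, not the paper's code) computes the $L_{2,1}$ norm and ranks features by the row norms of a regularized least-squares weight matrix; a plain ridge penalty stands in for the $L_{2,1}$ term so that a closed form exists, and the data and function names are hypothetical.

```python
import numpy as np

def l21_norm(W):
    """||W||_{2,1}: the sum of the L2 norms of the rows of W."""
    return np.linalg.norm(W, axis=1).sum()

def rank_features_least_squares(X, Y, alpha=0.1):
    """Rank features by the row norms of a regularized least-squares fit of Y on X.
    A ridge (L2) penalty is used here instead of the L_{2,1} penalty, purely so that
    W has a closed form; the full method handles the L_{2,1} term iteratively."""
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)
    return np.argsort(-np.linalg.norm(W, axis=1)), W

# Tiny synthetic example: 100 samples, 20 features, 3 one-hot labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Y = np.eye(3)[rng.integers(0, 3, size=100)]
order, W = rank_features_least_squares(X, Y)
print(order[:5], l21_norm(W))   # top-5 feature indices and the L_{2,1} norm of W
```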
To perform feature selection, Y is approximated by $XW$, so $W^T x_i - W^T x_j$ represents the distance between two instances in the projected space, and $S_{ij}$ describes the similarity between the two instances. This reduces the negative impact of noise in the original data and lowers the spatial dimension, reducing the computational cost. Instance correlation is thereby incorporated and expressed by the following formula:
$$\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \|W^T x_i - W^T x_j\|_2^2 \, S_{ij} = Tr(W^T X^T (D - S) X W) = Tr(W^T X^T L X W),$$
where L is the graph Laplacian matrix, obtained by subtracting the similarity matrix S from the diagonal degree matrix D, with $d_{ii} = \sum_{j=1}^{n} S_{ij}$. The details are as follows:
$$S_{ij} = \begin{cases} e^{-\frac{\|x_i - x_j\|_2^2}{2\sigma^2}}, & \text{if } i \neq j; \\ 1, & \text{if } i = j. \end{cases}$$
$$D = \begin{pmatrix} d_{11} & 0 & \cdots & 0 \\ 0 & d_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_{nn} \end{pmatrix}_{n \times n}$$
$$L = D - S$$
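The following numpy sketch (ours, on arbitrary toy data) builds S, D, and L exactly as defined above and checks numerically that the weighted pairwise-distance sum equals the trace form $Tr(W^T X^T L X W)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c, sigma = 6, 4, 2, 1.0
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, c))

# Gaussian similarity matrix S (diagonal entries equal 1 automatically)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
S = np.exp(-sq_dists / (2 * sigma ** 2))

D = np.diag(S.sum(axis=1))   # degree matrix with d_ii = sum_j S_ij
L = D - S                    # graph Laplacian

# Left-hand side: weighted pairwise distances in the projected space X W
XW = X @ W
lhs = 0.5 * sum(S[i, j] * np.sum((XW[i] - XW[j]) ** 2)
                for i in range(n) for j in range(n))

# Right-hand side: the trace form Tr(W^T X^T L X W)
rhs = np.trace(W.T @ X.T @ L @ X @ W)
print(np.isclose(lhs, rhs))  # True
```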
In order to explore the underlying structural information of the labels, we introduce a separate variable V for capturing label correlations. By incorporating the low-dimensional embedding V, the aforementioned problem is transformed into a separable convex optimization problem, which preserves both global and local structured information, thereby enhancing performance. To ensure consistent correlation of labels, the predicted label matrix V should match the ground truth label matrix Y, and the instances within matrix V should exhibit similarity when they are adjacent. Based on the aforementioned analysis, we have formulated an equation to describe the label correlation.
$$\min_{V} \; \sum_{i=1}^{n} E_{ii} \sum_{j=1}^{l} (V_{ij} - Y_{ij})^2 + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} S_{ij} \|v_i - v_j\|_2^2.$$
After conversion, it can be obtained that
$$\min_{V} \; Tr[(V - Y)^T E (V - Y)] + Tr(V^T L V),$$
where E is a diagonal matrix with large numerical values for its elements. The two terms in this formula preserve global and local information, respectively. Given the correlation between matrix V and Y, it is crucial to enforce non-negativity of V in classification tasks to ensure its validity.
In summary, the ultimate objective function is formulated as:
$$\min_{W, V} \; \|XW - V\|_F^2 + \alpha \|W\|_{2,1}^2 + \beta \, Tr(W^T X^T L X W) + \gamma \big( Tr[(V - Y)^T E (V - Y)] + Tr(V^T L V) \big) \quad \mathrm{s.t.} \; V \geq 0,$$
where α, β, and γ are hyperparameters. We then optimize this objective function to obtain the solution to the problem.

3.3. Optimization

We design an optimization algorithm for the proposed objective function, which can be transformed into:
$$\Gamma = \min_{W, V} \; Tr[(XW - V)^T (XW - V)] + \alpha \, Tr(W^T A W) + \beta \, Tr(W^T X^T L X W) + \gamma \big( Tr[(V - Y)^T E (V - Y)] + Tr(V^T L V) \big) \quad \mathrm{s.t.} \; V \geq 0,$$
where A is a diagonal matrix whose elements are defined as $\frac{1}{2\|w^i\|_2}$. Subsequently, we employ the alternating direction method of multipliers to solve the aforementioned problem.

3.3.1. Solve for W Given V

By setting the partial derivative $\frac{\partial \Gamma}{\partial W} = 0$, we can obtain
$$\frac{\partial}{\partial W} \Big( Tr[(XW - V)^T (XW - V)] + \alpha \, Tr(W^T \Lambda W) + \beta \, Tr(W^T X^T L X W) \Big) = 0$$
$$\Rightarrow \; X^T X W - X^T V + \alpha \Lambda W + \beta X^T L X W = 0$$
$$\Rightarrow \; W = (X^T X + \alpha \Lambda + \beta X^T L X)^{-1} X^T V,$$
$$\Lambda = \begin{pmatrix} \frac{1}{2\|w^1\|_2 + \varepsilon} & & & \\ & \frac{1}{2\|w^2\|_2 + \varepsilon} & & \\ & & \ddots & \\ & & & \frac{1}{2\|w^d\|_2 + \varepsilon} \end{pmatrix}_{d \times d},$$
where ε is a small non-zero constant that safeguards the algorithm against a denominator equal to zero.
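A minimal numpy sketch of this W-update, assuming L, V, and the previous W are available; the function name and default ε are our own choices.

```python
import numpy as np

def update_W(X, V, Lap, W_prev, alpha, beta, eps=0.01):
    """W-update of Equation (15): W = (X^T X + alpha*Lambda + beta*X^T L X)^{-1} X^T V,
    where Lambda (Equation (16)) is built from the row norms of the previous W."""
    Lam = np.diag(1.0 / (2.0 * np.linalg.norm(W_prev, axis=1) + eps))
    A = X.T @ X + alpha * Lam + beta * X.T @ Lap @ X
    return np.linalg.solve(A, X.T @ V)
```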

3.3.2. Solve for V Given W

By setting the partial derivative $\frac{\partial \Gamma}{\partial V} = 0$, we can obtain
$$\frac{\partial}{\partial V} \Big( Tr[(XW - V)^T (XW - V)] + \gamma \big( Tr[(V - Y)^T E (V - Y)] + Tr(V^T L V) \big) \Big) = 0$$
$$\Rightarrow \; -XW + V + \gamma (LV + EV - EY) = 0$$
$$\Rightarrow \; V = [\gamma (L + E) + I]^{-1} (\gamma E Y + XW).$$
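A corresponding sketch of the V-update; clipping negative entries to zero is our own simple way of honoring the constraint V ≥ 0, which the closed form alone does not guarantee.

```python
import numpy as np

def update_V(X, W, Lap, E, Y, gamma):
    """V-update of Equation (17): V = [gamma*(L + E) + I]^{-1} (gamma*E*Y + X*W).
    Clipping negative entries afterwards enforces V >= 0 (an assumption on our part)."""
    n = X.shape[0]
    V = np.linalg.solve(gamma * (Lap + E) + np.eye(n), gamma * E @ Y + X @ W)
    return np.maximum(V, 0.0)
```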
Therefore, the solution to this issue is illustrated by Equations (15) and (17). To attain the ultimate solution, it is recommended to iterate through the aforementioned procedure until convergence. The iterative algorithm is outlined in Algorithm 1.
Algorithm 1: The iterative optimization.
Input: The feature matrix $X \in \mathbb{R}^{n \times d}$, label matrix $Y \in \mathbb{R}^{n \times c}$, and hyperparameters α, β, γ.
Output: The first K features obtained by sorting the row norms $\|w^i\|_2$ of W.
1 Initialization: Randomly initialize $W_0 \in \mathbb{R}^{d \times c}$, $V_0 \in \mathbb{R}^{n \times c}$, and E; set ε = 0.01.
2 Calculate L and Λ using Equations (10) and (16), respectively.
3 while not converged do
4     W ← iteration_W(X, Λ, L, V, α, β): update W by Equation (15).
5     V ← iteration_V(X, L, E, Y, W, γ): update V by Equation (17).
6     Λ ← iteration_lambda(W, ε): update Λ by Equation (16).
7 end
8 Sort $\|w^i\|_2$ in descending order and select the top K features from the sorted list.
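Putting the updates together, the following self-contained sketch mirrors Algorithm 1 under stated assumptions (Gaussian similarity with σ = 1, E taken as a large-valued diagonal matrix, clipping for V ≥ 0, and the convergence rule from Section 4.5); it is an illustrative reimplementation, not the authors' code.

```python
import numpy as np

def ilfs_feature_selection(X, Y, alpha, beta, gamma, K,
                           sigma=1.0, eps=0.01, tol=0.01, max_iter=10):
    """Sketch of Algorithm 1; E, sigma, and the toy defaults are illustrative assumptions."""
    n, d = X.shape
    c = Y.shape[1]
    rng = np.random.default_rng(0)

    # Graph Laplacian from the Gaussian similarity matrix (S, D, L as defined above)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    S = np.exp(-sq / (2 * sigma ** 2))
    Lap = np.diag(S.sum(axis=1)) - S

    E = 1e3 * np.eye(n)                  # diagonal matrix with large entries (assumption)
    W = rng.normal(size=(d, c))
    V = rng.normal(size=(n, c))
    prev_obj = np.inf

    for _ in range(max_iter):
        # W-update (Equation (15)), with Lambda from Equation (16)
        Lam = np.diag(1.0 / (2.0 * np.linalg.norm(W, axis=1) + eps))
        W = np.linalg.solve(X.T @ X + alpha * Lam + beta * X.T @ Lap @ X, X.T @ V)

        # V-update (Equation (17)); clipping enforces V >= 0 (our assumption)
        V = np.linalg.solve(gamma * (Lap + E) + np.eye(n), gamma * E @ Y + X @ W)
        V = np.maximum(V, 0.0)

        # Objective value of the final objective and the convergence test from Section 4.5
        obj = (np.linalg.norm(X @ W - V, "fro") ** 2
               + alpha * np.linalg.norm(W, axis=1).sum() ** 2
               + beta * np.trace(W.T @ X.T @ Lap @ X @ W)
               + gamma * (np.trace((V - Y).T @ E @ (V - Y)) + np.trace(V.T @ Lap @ V)))
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj

    # Rank features by the row norms of W and keep the top K
    return np.argsort(-np.linalg.norm(W, axis=1))[:K]

# Toy usage: 60 samples, 30 features, 3 one-hot labels
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(60, 30))
Y_toy = np.eye(3)[rng.integers(0, 3, size=60)]
print(ilfs_feature_selection(X_toy, Y_toy, alpha=0.5, beta=0.5, gamma=0.5, K=10))
```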

4. Results and Discussion

4.1. Datasets

To evaluate and validate the proposed method, three publicly available datasets were selected, including the well-known image dataset Yale and the biological dataset Lung. In addition, the Cattle dataset (https://www.kaggle.com/datasets/twisdu/dairy-cow, accessed on 1 August 2023) is used, which is specifically curated for the task of cattle pose estimation, with pose classification serving as its downstream objective. Through thorough analysis of the relative positions of key points, the dataset enables an accurate determination of the cattle's pose, thereby achieving the ultimate goal of precise pose classification. The detailed information of these datasets is presented in Table 2. All datasets are divided into training and testing sets in a 7:3 ratio.

4.2. Evaluation Metrics

In this paper, to thoroughly evaluate the performance of all methods, we employed the following five evaluation metrics: accuracy, macro-F1, micro-F1, kappa [19], and Hamming loss [20].
  • Accuracy
    $$\mathrm{accuracy}(y, \tilde{y}) = \frac{1}{n} \sum_{i=0}^{n-1} F(y_i, \tilde{y}_i), \quad F(y_i, \tilde{y}_i) = \begin{cases} 0, & \text{if } y_i \neq \tilde{y}_i; \\ 1, & \text{if } y_i = \tilde{y}_i. \end{cases}$$
  • Macro-F1 and Micro-F1
    $$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}, \quad Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN},$$
    where TP is the number of correctly identified positive samples, FP is the number of false positives, and FN is the number of missed positive samples. To obtain the macro-F1 score, first calculate the F1 value for each individual category and then average these F1 values across all categories. To obtain the micro-F1 score, compute the overall precision and recall across all categories and then calculate the F1 value.
  • Kappa
    $$K = \frac{p_o - p_e}{1 - p_e},$$
    where $p_o$ refers to the empirical probability of agreement on the label assigned to any sample, also known as the observed agreement ratio. On the other hand, $p_e$ represents the expected agreement between annotators when labels are assigned randomly. To estimate $p_e$, a per-annotator empirical prior over the class labels is employed.
  • Hamming loss
    $$L_{Hamming}(y, \tilde{y}) = \frac{1}{nc} \sum_{i=0}^{n-1} \sum_{j=0}^{c-1} P(y_{i,j}, \tilde{y}_{i,j}), \quad P(y_{i,j}, \tilde{y}_{i,j}) = \begin{cases} 1, & \text{if } y_{i,j} \neq \tilde{y}_{i,j}; \\ 0, & \text{if } y_{i,j} = \tilde{y}_{i,j}, \end{cases}$$
where y is a ground truth label and $\tilde{y}$ is a predicted label.
Accuracy is a widely used performance evaluation metric for classification tasks. It represents the proportion of correctly classified samples among all the classified samples. The micro-F1 score considers the overall count of true positives, false negatives, and false positives across all categories, while the macro-F1 score considers the F1 score for each individual category and calculates their unweighted average. It is important to note that the macro-F1 score does not consider the potential imbalance between different categories. The kappa coefficient is a measure used to assess consistency in testing. In the context of classification problems, consistency refers to the alignment between the model's predicted results and the actual classification outcomes. It is calculated based on the confusion matrix and ranges between −1 and 1, with values typically greater than 0. Hamming loss is a metric utilized to analyze misclassification on individual labels. A lower value of this metric indicates better performance, as it signifies fewer misclassifications. The aforementioned set of five metrics is employed for evaluating the performance of the method.
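For the single-label predictions evaluated here, all five metrics are available as standard scikit-learn utilities, as the sketch below shows on toy labels (the label values are invented for illustration).

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             cohen_kappa_score, hamming_loss)

y_true = [0, 2, 1, 1, 0, 2, 2, 1]   # toy ground-truth labels
y_pred = [0, 2, 1, 0, 0, 2, 1, 1]   # toy predicted labels

print("accuracy    ", accuracy_score(y_true, y_pred))
print("macro-F1    ", f1_score(y_true, y_pred, average="macro"))
print("micro-F1    ", f1_score(y_true, y_pred, average="micro"))
print("kappa       ", cohen_kappa_score(y_true, y_pred))
print("Hamming loss", hamming_loss(y_true, y_pred))
```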

4.3. Experimental Setup

To ensure a fair and unbiased evaluation of all methods, a traversal search strategy is adopted to select the hyperparameters for each method within the range [0.1, 0.2, …, 1.0]. Additionally, the number of selected features is determined by the size of each dataset's feature set. Specifically, owing to the large number of features in the Lung and Yale datasets, a percentage-based filtering approach is employed for them, ranging from 1% to 20% in steps of 1%. The Cattle dataset, with only 32 features, instead undergoes feature selection based on the number of key points, ranging from 2 to 32 in steps of 2, taking into account their actual significance.
For the experiments, the LightGBM classification algorithm [21] is employed. LightGBM is a high-performance gradient-boosted decision tree framework that has been optimized and improved relative to the GBDT library XGBoost [22]. It offers faster training and lower memory consumption than traditional GBDT algorithms [23,24], utilizes a histogram-based algorithm, and introduces exclusive feature bundling to enhance model performance. It also incorporates the GOSS training strategy, which selectively retains samples with larger gradients to expedite the training process. These characteristics lead to good performance.
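A minimal example of this setup is sketched below: LightGBM is trained on a selected feature subset with a 7:3 split. The synthetic data, the placeholder selected_idx array, and the parameter values are assumptions for illustration only.

```python
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in data; in the paper, X and y come from the Lung/Yale/Cattle datasets
X, y = make_classification(n_samples=500, n_features=100, n_informative=20,
                           n_classes=3, random_state=0)

selected_idx = np.arange(20)        # placeholder: indices returned by a feature selector
X_sel = X[:, selected_idx]

X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y, test_size=0.3, random_state=0)   # 7:3 split, as in the paper

clf = lgb.LGBMClassifier(n_estimators=100)      # LightGBM classifier, default settings
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```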
Upon completing the above steps, each method is executed 20 times. The mean and standard deviation of each metric over the 20 runs are calculated; a standard deviation of zero indicates that the results were identical across runs. The mean values of the evaluation metrics are recorded for further analysis.

4.4. Experiment Results

We conducted comparative experiments using the PMU, MDFS, MCLS, and FIMF methods, and we adopted distinct filtering strategies based on the number of features present in each dataset sample. The performance of each method on each dataset was evaluated using five metrics. As shown in Table 3, our method outperforms all others on every metric, ranking first.
To further illustrate the influence of retaining different feature ratios on the final classification results, we conducted multiple sets of experiments based on the settings outlined in Section 4.3. The results corresponding to varying numbers of features are displayed in Figure 1. This allows for a comprehensive understanding of how different feature ratios affect the classification outcomes.
The horizontal axis of the figures shows the number of features selected in the task, and the vertical axis denotes the value of the evaluation metric. As the number of features increases, classification accuracy generally increases, although the curves are not monotonic and fluctuate. The Hamming loss decreases as the number of selected features increases.
According to the results, our method outperforms the other methods on all three datasets. On the Lung dataset, our method achieved an accuracy of 0.984, a macro-F1 score of 0.979, a micro-F1 score of 0.984, a kappa coefficient of 0.968, and a Hamming loss of 0.016. These results indicate that our method is highly accurate and effective in classifying lung data. Similarly, on the Yale dataset, our method achieved an accuracy of 0.760, a macro-F1 score of 0.715, a micro-F1 score of 0.760, a kappa coefficient of 0.739, and a Hamming loss of 0.240. This demonstrates the superior performance of our method compared to other approaches. Lastly, on the Cattle dataset, our method achieved an accuracy of 0.909, a macro-F1 score of 0.909, a micro-F1 score of 0.909, a kappa coefficient of 0.863, and a Hamming loss of 0.091. These results further validate the effectiveness of our method for accurately classifying cattle data.
The above experimental results indicate that our method can better capture global and local potential information between features, thereby filtering out features that are more important to the classification task and improving the accuracy of the final classification. In contrast, LightGBM, which does not involve feature selection, performed relatively well on the Lung and Cattle datasets but showed lower performance on the Yale dataset. The other methods also showed varying degrees of performance across the different datasets. Overall, our proposed method demonstrates superior performance in terms of accuracy, F1 scores, kappa coefficient, and Hamming loss on all three datasets. These results highlight the effectiveness of our method for classification tasks and its potential for practical applications.

4.5. Analysis

During the iteration process, denote the objective function value at the t-th iteration by $\Gamma_t$ and at the (t−1)-th iteration by $\Gamma_{t-1}$. The iteration is terminated if either the absolute difference between $\Gamma_{t-1}$ and $\Gamma_t$ is less than 0.01 or the iteration count exceeds 10.
Figure 2 illustrates the variation in the objective function value during the iteration of our method on each dataset. It is evident that the objective function value has converged significantly by the second epoch. This observation demonstrates the efficiency and fast convergence of our method.

4.6. Ablation Study

In order to better distinguish the effect of each part of the objective function, we divided it into three parts, named the basic formula (BASE), instance correlation (IC), and label correlation (LC), and conducted ablation experiments on each dataset. The specific formulas for the three parts are as follows:
$$BASE = \min_{W} \; \|XW - Y\|_F^2 + \alpha \|W\|_{2,1}^2,$$
$$IC = \min_{W, V} \; \|XW - V\|_F^2 + \alpha \|W\|_{2,1}^2 + \beta \, Tr(W^T X^T L X W),$$
$$LC = \min_{W, V} \; Tr[(XW - V)^T (XW - V)] + \alpha \, Tr(W^T A W) + \beta \, Tr(W^T X^T L X W) + \gamma \big( Tr[(V - Y)^T E (V - Y)] + Tr(V^T L V) \big).$$
We start with Equation (22) as the baseline and then incorporate the IC module into the objective function, which accounts for noise elimination in the original data. As shown in Table 4, this leads to an improvement in classification accuracy; all evaluation indicators improved to varying degrees on the three datasets, and classification accuracy on the Yale dataset improved by 6%. Following the IC module, the LC module is added to the objective function, which takes into account the distance between the low-dimensional embedded label matrix V and the predicted labels XW, as well as the ground-truth labels and the relationships among the elements of V. Incorporating both global and local information in this way enhances the accuracy of the method. The LC module improved all of the indicators; notably, the Hamming loss on the Lung dataset dropped to less than half of its previous value. The experiments show that the proposed improvements are effective and benefit the classification task.

5. Conclusions

In this paper, we have introduced a novel unsupervised feature selection method that incorporates both instance correlation and label correlation. By considering instance correlation, we can mitigate the negative impact of noise in the datasets. Additionally, our approach utilizes a low-dimensional embedded vector to capture global and local information through label correlation. Experimental results have demonstrated that our method outperforms existing approaches such as PMU, MDFS, MCLS, and FIMF, thus validating the effectiveness of our improvements. Moving forward, our future work will explore the underlying relationship between global and local label information.

Author Contributions

Conceptualization, S.L. (Sicong Liu) and Q.F.; methodology, S.L. (Sicong Liu) and Q.F.; software, Q.F. and S.L. (Sicong Liu); validation, Q.F. and S.L. (Sicong Liu); data curation, Q.F.; writing—original draft preparation, Q.F. and S.L. (Sicong Liu); writing—review and editing, Q.F. and S.L. (Sicong Liu); visualization, Q.F.; supervision, S.L. (Shuqin Li) and C.Z.; project administration, S.L. (Shuqin Li); All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank Caili Gong of Inner Mongolia Agricultural University for making the Cattle dataset openly available, which provided great convenience for our research.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LightGBM: A highly efficient gradient boosting decision tree
PMU: Feature selection for multi-label classification using multivariate mutual information
MDFS: Manifold regularized discriminative feature selection for multi-label learning
MCLS: Manifold-based constraint Laplacian score for multi-label feature selection
FIMF: Fast multi-label feature selection based on information-theoretic feature ranking
GBDT: Gradient boosting decision tree
xgboost: Extreme gradient boosting

References

  1. Sidey-Gibbons, J.A.M.; Sidey-Gibbons, C.J. Machine learning in medicine: A practical introduction. BMC Med. Res. Methodol. 2019, 19, 64. [Google Scholar] [CrossRef] [PubMed]
  2. Castillo-Botón, C.; Casillas-Pérez, D.; Casanova-Mateo, C.; Ghimire, S.; Cerro-Prada, E.; Gutierrez, P.; Deo, R.; Salcedo-Sanz, S. Machine learning regression and classification methods for fog events prediction. Atmos. Res. 2022, 272, 106157. [Google Scholar] [CrossRef]
  3. Shlens, J. A Tutorial on Principal Component Analysis. arXiv 2014, arXiv:1404.1100. [Google Scholar]
  4. Meesad, P.; Boonrawd, P.; Nuipian, V. A Chi-Square-Test for Word Importance Differentiation in Text Classification. In Proceedings of the International Conference on Information and Electronics Engineering, Bangkok, Thailand, 28–29 May 2011. [Google Scholar]
  5. Spencer, R.; Thabtah, F.; Abdelhamid, N.; Thompson, M. Exploring feature selection and classification methods for predicting heart disease. Digit. Health 2020, 6, 205520762091477. [Google Scholar] [CrossRef] [PubMed]
  6. Wang, Y.; Xu, G.; Liang, L.; Jiang, K. Detection of weak transient signals based on wavelet packet transform and manifold learning for rolling element bearing fault diagnosis. Mech. Syst. Signal Process. 2015, 54–55, 259–276. [Google Scholar] [CrossRef]
  7. Kileel, J.; Moscovich, A.; Zelesko, N.; Singer, A. Manifold Learning with Arbitrary Norms. J. Fourier Anal. Appl. 2021, 27, 82. [Google Scholar] [CrossRef]
  8. Ni, Y.; Koniusz, P.; Hartley, R.; Nock, R. Manifold Learning Benefits GANs. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11255–11264. [Google Scholar] [CrossRef]
  9. Tang, B.; Song, T.; Li, F.; Deng, L. Fault diagnosis for a wind turbine transmission system based on manifold learning and Shannon wavelet support vector machine. Renew. Energy 2014, 62, 1–9. [Google Scholar] [CrossRef]
  10. Tan, C.; Chen, S.; Geng, X.; Ji, G. A label distribution manifold learning algorithm. Pattern Recognit. 2023, 135, 109112. [Google Scholar] [CrossRef]
  11. Jiang, B.N. On the least-squares method. Comput. Methods Appl. Mech. Eng. 1998, 152, 239–257. [Google Scholar] [CrossRef]
  12. Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; Eckstein, J. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. In Foundations and Trends® in Machine Learning; Now Publishers: Hanover, MA, USA, 2011; Volume 3, pp. 1–122. [Google Scholar]
  13. Lever, J.; Krzywinski, M.; Altman, N. Principal component analysis. Nat. Methods 2017, 14, 641–642. [Google Scholar] [CrossRef]
  14. Sumaiya Thaseen, I.; Aswani Kumar, C. Intrusion detection model using fusion of chi-square feature selection and multi class SVM. J. King Saud Univ.—Comput. Inf. Sci. 2017, 29, 462–472. [Google Scholar] [CrossRef]
  15. Lee, J.; Kim, D.W. Feature selection for multi-label classification using multivariate mutual information. Pattern Recognit. Lett. 2013, 34, 349–357. [Google Scholar] [CrossRef]
  16. Lee, J.; Kim, D.W. Fast multi-label feature selection based on information-theoretic feature ranking. Pattern Recognit. 2015, 48, 2761–2771. [Google Scholar] [CrossRef]
  17. Huang, R.; Jiang, W.; Sun, G. Manifold-based constraint Laplacian score for multi-label feature selection. Pattern Recognit. Lett. 2018, 112, 346–352. [Google Scholar] [CrossRef]
  18. Zhang, J.; Luo, Z.; Li, C.; Zhou, C.; Li, S. Manifold regularized discriminative feature selection for multi-label learning. Pattern Recognit. 2019, 95, 136–150. [Google Scholar] [CrossRef]
  19. McHugh, M.L. Interrater reliability: The kappa statistic. Biochem. Medica 2012, 22, 276–282. [Google Scholar] [CrossRef]
  20. Tsoumakas, G.; Vlahavas, I. Random k-Labelsets: An Ensemble Method for Multilabel Classification. In Machine Learning: ECML 2007; Kok, J.N., Koronacki, J., Mantaras, R.L.D., Matwin, S., Mladenič, D., Skowron, A., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2007; Volume 4701, pp. 406–417. ISSN 0302-9743. [Google Scholar] [CrossRef]
  21. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  22. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
  23. Appel, R.; Fuchs, T.; Dollár, P.; Perona, P. Quickly Boosting Decision Trees—Pruning Underachieving Features Early. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 17–19 June 2013. [Google Scholar]
  24. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
Figure 1. A set of curves depicting the variations in each evaluation metric for all methods across different datasets as the proportion of feature selection changes.
Figure 2. A series of figures depicting the change in objective function values in our method across different datasets as the number of iterations increases.
Table 1. Explanation of professional terms.
No.   Definition        Description
1     $Tr(A)$           Trace of A
2     $\|A\|_F$         F-norm of A
3     $\|A\|_{2,1}$     2,1-norm of A
4     $A^T$             The transposition of A
5     $I_d$             d-dimensional identity matrix
6     n                 Number of samples
7     d                 Number of features per sample
8     c                 Number of categories
9     X                 Feature matrix
10    W                 Weight matrix
11    V                 Predicted label matrix
12    L                 Laplacian matrix
13    Y                 Ground truth label of dataset
Table 2. The detailed information of these datasets.
Datasets   Number of Samples   Number of Features   Number of Categories
Lung       203                 3312                 5
Yale       165                 1024                 15
Cattle     1800                32                   3
Table 3. The comparative results of our method with PMU, MDFS, MCLS, and FIMF on various evaluation metrics, including accuracy, macro-F1, micro-F1, kappa, and Hamming loss. LightGBM denotes the classification algorithm applied directly, without any feature selection.
Lung
Methods    accuracy         macro-F1         micro-F1         kappa            Hamming loss
LightGBM   0.918            0.716            0.918            0.836            0.082
PMU        0.957 ± 0.021    0.909 ± 0.072    0.957 ± 0.021    0.912 ± 0.042    0.043 ± 0.021
MDFS       0.952 ± 0.021    0.901 ± 0.067    0.952 ± 0.021    0.902 ± 0.044    0.048 ± 0.021
MCLS       0.959 ± 0.030    0.892 ± 0.112    0.959 ± 0.030    0.915 ± 0.056    0.041 ± 0.030
FIMF       0.957 ± 0.025    0.919 ± 0.071    0.957 ± 0.025    0.919 ± 0.040    0.043 ± 0.025
Ours       0.984            0.979            0.984            0.968            0.016
Yale
Methods    accuracy         macro-F1         micro-F1         kappa            Hamming loss
LightGBM   0.660            0.607            0.660            0.636            0.340
PMU        0.662 ± 0.035    0.632 ± 0.026    0.662 ± 0.035    0.636 ± 0.037    0.338 ± 0.035
MDFS       0.716 ± 0.034    0.705 ± 0.035    0.716 ± 0.034    0.694 ± 0.035    0.284 ± 0.034
MCLS       0.586 ± 0.048    0.551 ± 0.047    0.586 ± 0.048    0.555 ± 0.052    0.414 ± 0.048
FIMF       0.672 ± 0.091    0.669 ± 0.070    0.672 ± 0.091    0.647 ± 0.097    0.328 ± 0.091
Ours       0.760            0.715            0.760            0.739            0.240
Cattle
Methods    accuracy         macro-F1         micro-F1         kappa            Hamming loss
LightGBM   0.877            0.877            0.877            0.816            0.123
PMU        0.899            0.899            0.899            0.849            0.101
MDFS       0.892            0.892            0.892            0.838            0.108
MCLS       0.894            0.894            0.894            0.841            0.106
FIMF       0.903            0.903            0.903            0.855            0.097
Ours       0.909            0.909            0.909            0.863            0.091
Table 4. Ablation experiments of our method on various datasets, where BASE, IC, and LC correspond to Equations (22), (23), and (24), respectively.
Lung
Modules    accuracy   macro-F1   micro-F1   kappa   Hamming loss
BASE       0.934      0.764      0.934      0.871   0.066
IC         0.967      0.955      0.967      0.936   0.033
LC         0.984      0.979      0.984      0.968   0.016
Yale
Modules    accuracy   macro-F1   micro-F1   kappa   Hamming loss
BASE       0.640      0.616      0.640      0.614   0.360
IC         0.700      0.670      0.700      0.676   0.300
LC         0.760      0.715      0.760      0.739   0.240
Cattle
Modules    accuracy   macro-F1   micro-F1   kappa   Hamming loss
BASE       0.885      0.885      0.885      0.827   0.115
IC         0.905      0.905      0.905      0.858   0.095
LC         0.909      0.909      0.909      0.863   0.091