1. Introduction
The development of data acquisition techniques and the diversification of data-processing methods have made multi-view data a typical form of big data. Concretely, multi-view data is a general term for data with multiple feature representations of the same object. For example, cancers can be studied in terms of RNA gene expression, mitochondrial DNA, microRNA expression, and reverse phase protein array expression [1], where each kind of expression corresponds to a view. As another example, different visual descriptors, such as LBP [2], HOG [3], and SIFT [4], can be extracted from images, and each descriptor forms a view. Since the views describe the same object from different aspects, each view contains information that is partly consistent with, and partly complementary to, the other views. How to make full use of this consistent and complementary information to improve data analysis performance has given rise to the flourishing of multi-view learning.
Studies on multi-view learning have been conducted from various aspects, such as subspace learning [5], feature selection [6], classification [7], clustering [8], and multi-label learning [9]; more details can be found in the surveys [10,11]. Among these studies, multi-view feature selection, as an important preprocessing step for handling multi-view high-dimensional data, has attracted extensive attention. Multi-view feature selection aims to select the features or variables most relevant to the subsequent learning tasks. Compared with multi-view subspace learning, another typical way to alleviate the curse of dimensionality, multi-view feature selection keeps the original semantics of the variables and thus has an advantage in interpretability. In this paper, we focus on multi-view feature selection.
Conventionally, according to the labelling condition of the training data, multi-view feature selection methods can be grouped into three categories: if the training samples are all labelled, partly labelled, or all unlabelled, the corresponding methods are categorized as supervised [12], semi-supervised [13], or unsupervised [14], respectively. Since labelling samples is considerably expensive, unsupervised multi-view feature selection is preferable in real applications. Existing unsupervised multi-view feature selection works can be divided into two groups, i.e., serial models and parallel models [15]. In the early period, serial models first concatenated all views into a single one and then applied single-view feature selection methods to the concatenated features, such as spectral feature selection (SPEC) [16], minimum redundancy spectral feature selection (MRSF) [17], and URAFS [18]. However, approaches of this kind cannot make full use of the consistency and complementarity between views. To overcome this deficiency, adaptive multi-view feature selection (AMFS) combines the graph Laplacians from all views [19], and Feng et al. added a multi-view manifold regularization term into the subspace learning scheme with sparsity induced by the $\ell_{2,1}$-norm [20]. Along this line of research, Dong et al. proposed to learn a consensus similarity from initial view-specific graphs, together with a sparse projection [21]. By contrast, parallel models conduct sparse projection learning on each view and merge the selected features from all views. Representative methods include MVUFS [22], RMVFS [23], ASVW [24], and JMVFG [25]. MVUFS learns pseudo-labels with local manifold regularization and simultaneously selects discriminative features from each view via sparse projection onto the pseudo-label-indicating matrix [22]. RMVFS realizes view-specific sparse feature selection based on the multi-view k-means model [23]. ASVW encodes sparse-norm-regularized projection learning on each view into the consensus similarity learning scheme [24]. In view of the effectiveness of these methods, JMVFG gathers $\ell_{2,1}$-norm-regularized multi-view k-means, cross-view manifold regularization, and consensus graph learning into a unified model [25].
Though previous works have promoted the development of multi-view feature selection, they still retain certain limitations. For serial models, the complementary information between multiple views might not be well handled, especially when the views are concatenated into a single one. For parallel models, it is not easy to decide how many features are required from each view, which poses an obstacle to their practical application. Moreover, both serial and parallel models employ $\ell_1$- or $\ell_{2,1}$-norm regularization to impose feature-wise sparsity. This might affect the accuracy of feature selection, since the $\ell_1$- and $\ell_{2,1}$-norms merely fulfill approximate sparsity. More importantly, these methods are all designed in a strictly unsupervised setting and overlook available background prior information. Background prior information, such as label correlation, label proportion, and pairwise constraints, usually plays an integral role in boosting learning performance.
In this paper, we focus on utilizing pairwise constraints to guide the multi-view feature selection process. Specifically, pairwise constraints consist of must-link (ML) and cannot-link (CL) constraints. A must-link between two samples indicates that they belong to the same cluster; conversely, a cannot-link between two samples indicates that they belong to different clusters. Pairwise constraints can be obtained naturally in many applications [26,27,28]. For example, in image segmentation, must-links can be added between groups of pixels from the same object, while cannot-links can be inferred from two different image segments [27].
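As a minimal illustration (not part of the proposed model), pairwise constraints can be stored as index pairs and folded into an initial similarity matrix, e.g., by forcing must-linked pairs to share a positive edge weight and removing the edges of cannot-linked pairs; the function name and the weight `ml_weight` below are hypothetical choices:

```python
import numpy as np

def apply_pairwise_constraints(S, must_links, cannot_links, ml_weight=1.0):
    """Return a copy of the similarity matrix S with pairwise constraints imposed.

    S            : (n, n) nonnegative similarity matrix
    must_links   : iterable of (i, j) pairs known to be in the same cluster
    cannot_links : iterable of (i, j) pairs known to be in different clusters
    ml_weight    : edge weight assigned to must-linked pairs (hypothetical choice)
    """
    S = S.copy()
    for i, j in must_links:          # same cluster: enforce a strong edge
        S[i, j] = S[j, i] = ml_weight
    for i, j in cannot_links:        # different clusters: remove the edge
        S[i, j] = S[j, i] = 0.0
    return S
```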
To overcome the abovementioned issues, we propose a new weakly supervised multi-view feature selection method that utilizes pairwise constraints. Note that pairwise constraints are easy to incorporate into graph-based clustering methods. Thus, we formulate the multi-view feature selection model by simultaneously performing pairwise-constrained similarity learning and view-specific sparse projection. In this manner, both the pairwise constraint information and the local structure information can be fully explored to facilitate discriminative projection learning. To select an exact number of features, an $\ell_{2,0}$-norm-based row sparsity constraint is adopted instead of the $\ell_{2,1}$-norm used in most existing works. That is, the $\ell_{2,0}$-norm-based row sparsity constraint is imposed on the concatenated projection of all views, and the features corresponding to the non-zero rows of the concatenated projection matrix are selected. The resultant model is referred to as the pairwise constrained multi-view feature selection method, named PCFS for short hereinafter. The workflow of PCFS is shown in Figure 1.
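The selection rule implied by an $\ell_{2,0}$ row sparsity constraint reduces, after optimization, to keeping the features whose rows in the concatenated projection matrix are non-zero. A minimal sketch of this final step is given below, assuming the learned projection `W` stacks the view-specific projections by rows, one row per original feature; `W` and `n_selected` are illustrative names:

```python
import numpy as np

def select_features(W, n_selected):
    """Pick the features whose projection rows carry the most energy.

    W          : (d, c) concatenated projection matrix, where d is the total
                 number of features across all views; under an exact l2,0
                 constraint, only n_selected rows are non-zero after optimization
    n_selected : number of features to keep
    """
    row_norms = np.linalg.norm(W, axis=1)            # l2-norm of each row
    return np.argsort(row_norms)[::-1][:n_selected]  # indices of top rows

# Usage: idx = select_features(W, 100); X_reduced = X[:, idx]
```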
In summary, the major contribution of this paper is a new weakly supervised multi-view feature selection method. The proposed PCFS method makes full use of pairwise constraints, which are overlooked by existing multi-view feature selection methods, to boost the feature selection performance. More specifically, the contributions of this paper are as follows:
A common pairwise constrained similarity matrix for all views is adaptively learned, which exploits both the pairwise constraint information and local structure information to help select discriminative features.
View-specific projections are optimized under the $\ell_{2,0}$-norm-based cross-view sparsity constraint on the concatenated projection. In this way, both the consistent and complementary information of multiple views can be well captured.
An iterative algorithm was designed to optimize the objective function of the proposed PCFS method. Systematic experiments were conducted to verify its effectiveness.
The remainder of this paper is organized as follows. Related works and preliminary knowledge are reviewed in Section 2. The PCFS model, including its formulation and optimization, is described in Section 3. Section 4 presents the experimental studies conducted to evaluate the effectiveness of PCFS. In Section 5, PCFS is applied to two cancer datasets. Section 6 concludes this paper, and Section 7 puts forward future work.
4. Experiments
4.1. Experimental Datasets
In this part, we describe the experiments performed on real-world datasets. Specifically, six real datasets related to images or videos were selected to evaluate PCFS:
Caltech101-7 comes from the Caltech 101 image database and consists of 441 images from seven classes, associated with six views, namely, Gabor, WM, CENT, HOG, GIST, and LBP.
Caltech101-20 comprises 2386 images belonging to 20 classes. Gabor, WM, CENT, HOG, GIST, and LBP features were extracted as six views.
AR10P includes 130 images divided into 10 categories, each containing 13 instances. SIFT, HOG, LBP, wavelet texture, and GIST features were extracted as five views for the experiments.
Yale contains a total of 165 images from 15 volunteers, each with 11 images. Each person corresponds to one class, and SIFT, HOG, LBP, wavelet texture, and GIST features were used as five views.
YaleB is a dataset containing 2414 images from 38 classes. SIFT, HOG, LBP, wavelet texture, and GIST formed five views for the experiments.
Kodak is a consumer-domain video dataset whose samples all come from actual users. It contains 2172 samples divided into 21 categories. The edge orientation histogram, Gabor texture, and grid color moment features served as three views.
More details of the datasets are given in Table 2.
4.2. Experimental Setting
Baselines: We compared the clustering performance of PCFS with that of several representative multi-view feature selection methods. When determining the baseline methods, we mainly considered their representativeness, diversity, and novelty. First, to show whether feature selection boosted the clustering performance, k-means with all features (AllFea) was compared. Then, considering that the proposed PCFS method is a graph-based method, we compared it with the representative and prevalent graph-based methods ACSL [21], ASVW [24], and AMFS [19]. Moreover, to increase the type diversity of the baselines, RMVFS [23] was selected because it is a classical multi-view feature selection method based on pseudo-label learning in a k-means-like style. Finally, considering novelty and timeliness, we compared PCFS with JMVFG [25], a recent multi-view feature selection method that combines pseudo-label learning and graph learning to facilitate feature selection. Overall, AllFea, ACSL, ASVW, AMFS, RMVFS, and JMVFG were selected as baselines.
Evaluation metrics: We employed the standard metrics clustering accuracy (ACC) and normalized mutual information (NMI) for performance comparison. ACC uses a permutation mapping function to measure the proportion of correctly clustered samples among all samples and is widely used to evaluate clustering performance [38]. NMI is a normalized form of mutual information (MI) using a geometric mean; it provides a credible indication of the information shared between two clustering partitions without being influenced by the arrangement of cluster labels [39]. Another related criterion is the variation of information (VI), which, like MI, is derived from information-theoretic principles and equals the sum of the conditional entropies between a pair of clusterings [40]. The value of NMI ranges from 0 to 1, whereas VI is bounded by an upper bound that depends on the number of clusters; therefore, NMI is more suitable than VI as a comparison metric in our setting. ACC and NMI are the most commonly used metrics for evaluating clustering results; for example, in [41,42,43,44], only NMI and ACC were used as evaluation metrics. For both metrics, larger values indicate better performance.
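For reference, both metrics can be computed as follows; this is a standard sketch (using SciPy's Hungarian algorithm for the permutation mapping and scikit-learn's geometric-mean NMI), not code from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of samples correctly clustered under the best
    one-to-one mapping between predicted and true cluster labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1   # labels assumed 0-indexed ints
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)  # maximize total agreement
    return cost[rows, cols].sum() / y_true.size

def nmi(y_true, y_pred):
    """NMI with geometric-mean normalization, as described above."""
    return normalized_mutual_info_score(y_true, y_pred,
                                        average_method="geometric")
```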
Parameter setting: In our work, the number of neighbors was set to five whenever a kNN graph was involved in a method. For PCFS, the regularization parameters were chosen from candidate grids. For each experiment, the same set of key pairwise constraints, consisting of 80 cannot-link and 240 must-link constraints, was used in each outer loop. In PCFS, the total reduced dimensionality was selected empirically. For a fair comparison, all the vital parameters of the baselines were set as suggested in their original papers. The number of selected features was set to $\lfloor r \cdot d \rfloor$, where d is the total number of features, r is the percentage of selected features, traversed from 0.1 to 0.3 with 0.02 as the interval, and $\lfloor x \rfloor$ is the floor function, which returns the largest integer not greater than x.
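As a small worked example of this setting (the dimensionality d below is illustrative, not taken from Table 2):

```python
import math

d = 3766                                                  # hypothetical total number of features
ratios = [round(0.1 + 0.02 * i, 2) for i in range(11)]    # 0.10, 0.12, ..., 0.30
n_selected = [math.floor(r * d) for r in ratios]          # floor(r * d) per ratio
# e.g., r = 0.10 -> 376 features; r = 0.30 -> 1129 features
```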
Key pairwise constraint selection: Since labelling excessive constraints is labor-intensive, many constraint selection methods have been proposed [45,46]. We instead adopted an efficient scheme [31] designed for semi-supervised graph clustering. Considering that S is sparse and each sample is only connected to its k nearest neighbors in the feature space, it is easy to obtain intuitive but rough "neighbors". On this basis, we (1) obtained must-link constraints by querying the "non-neighbors" of each sample and (2) obtained cannot-link constraints by querying the "neighbors" of each sample. In this way, we could find the key pairwise samples in the original feature space that are easily mislinked or left unlinked, as sketched below.
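The following is a minimal sketch of this querying scheme. It assumes an oracle (here, ground-truth labels held out for supervision) answers each query; the function name, the query budget, and the use of scikit-learn's NearestNeighbors are our own illustrative choices, not the exact procedure of [31]:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_key_constraints(X, labels, k=5, n_ml=240, n_cl=80, seed=0):
    """Query pairs that the kNN graph in feature space is likely to get wrong.

    Must-links are drawn from same-class "non-neighbors" (pairs easily left
    unlinked); cannot-links from different-class "neighbors" (pairs easily
    mislinked). `labels` stands in for the constraint oracle.
    """
    rng = np.random.default_rng(seed)
    knn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    nbrs = knn.kneighbors(X, return_distance=False)[:, 1:]  # drop self-match

    must_links, cannot_links = [], []
    n = X.shape[0]
    for _ in range(200_000):  # query budget guard
        if len(must_links) >= n_ml and len(cannot_links) >= n_cl:
            break
        i, j = rng.integers(n), rng.integers(n)
        if i == j:
            continue
        is_neighbor = j in nbrs[i] or i in nbrs[j]
        same_class = labels[i] == labels[j]
        if same_class and not is_neighbor and len(must_links) < n_ml:
            must_links.append((i, j))        # easily unlinked pair
        elif not same_class and is_neighbor and len(cannot_links) < n_cl:
            cannot_links.append((i, j))      # easily mislinked pair
    return must_links, cannot_links
```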
4.3. Evaluations on Real Datasets
To alleviate the effect of the random initialization of k-means centers, we predefined the cluster centers by collecting the best features selected each time, and each experiment was repeated 20 times. The mean ACC and NMI scores on the real datasets are reported in Figures 2 and 3. It can be seen that the ACC and NMI scores grew, with fluctuations, as the number of selected features increased; after reaching their peak, they decreased slowly as the number of selected features continued to increase. This is consistent with intuition. Comparing the scores of the different methods shows that PCFS achieved a better performance than the baseline methods in most cases.
Furthermore, Table 3 lists the evaluation scores of the different multi-view feature selection methods on the six real datasets at a fixed percentage of selected features. The proposed method obtained the highest ACC and NMI scores in most cases. Moreover, we utilized Student's t-test to verify the statistical significance of the results: the symbols "*" and "†" indicate a significant improvement and reduction, respectively, of PCFS over the corresponding baseline method at the 5% significance level. As shown in Table 3, on Kodak, Caltech101-20, and Caltech101-7, PCFS had slightly lower evaluation scores than ACSL at this fixed percentage. However, as Figures 2 and 3 show, with the increase in r, the evaluation scores of PCFS exceeded those of ACSL and were more robust on the whole. The results of the Student's t-test also show that the improvement of PCFS over ACSL was significant as r varied from 0.1 to 0.3. Overall, the robust and high performance of PCFS is fully demonstrated in Table 3 and Figures 2 and 3.
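The significance test can be reproduced along these lines; this sketch assumes the 20 repeated-run scores of two methods are paired by run, and the 0.05 threshold mirrors the common convention (the paper's exact test configuration, e.g., paired vs. unpaired, is not specified here):

```python
from scipy import stats

def significance_marker(scores_pcfs, scores_baseline, alpha=0.05):
    """Return '*' if PCFS is significantly better than the baseline,
    '†' if significantly worse, and '' otherwise (paired Student's t-test)."""
    t, p = stats.ttest_rel(scores_pcfs, scores_baseline)
    if p < alpha:
        return "*" if t > 0 else "†"
    return ""
```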
4.4. Convergence Analysis
In this section, we describe an empirical convergence analysis on the real datasets. According to the theoretical analysis in the Methodology section, the monotonicity of each step of our optimization algorithm is guaranteed. The convergence curves of PCFS are illustrated in Figure 4. The objective value decreased rapidly and monotonically as the number of iterations grew, and the convergence curves became stable within about 10 iterations. This fast convergence ensures the optimization efficiency of PCFS.
4.5. Parameter Analysis
First, let us discuss the effects of the parameters theoretically. The must-link weight parameter is a constant that sets the weights of the edges between instances that must be linked. As the connectivity of a graph is independent of its edge weights [31], the value of this parameter has no direct impact on our model's performance. With regard to the graph-rank parameter, it has been established that a sufficiently large value guarantees that S is an ideal similarity matrix; how large a value is large enough needs to be examined experimentally. As for the cannot-link parameter, as it tends to infinity, the pairwise constraint regularization term approaches zero and all cannot-link constraints will be satisfied. However, overly strict cannot-link regularization leads to very meticulous spectral clustering results, making it more difficult to select the important and vital features, which in turn affects the accuracy of the subsequent k-means clustering with the selected features. Therefore, the cannot-link parameter should be appropriate, i.e., neither too small nor too large.
Then, as emphasized above, the experiments designed to analyze the influence of the graph-rank and cannot-link parameters show that their variability has an important impact on the performance of our model. Specifically, the performance of our method when varying these two parameters with the must-link weight fixed is illustrated in Figure 5. As shown there, the performance of the proposed method fluctuated with some hidden regularity, which suggests that the cannot-link parameter needs to be selected carefully and delicately and that the graph-rank parameter should not be too large. Small changes in the cannot-link parameter cause the pairwise constraints to have a drastic impact on the results. Within a certain range, increasing the graph-rank parameter raises the evaluation scores, but doing so becomes counterproductive beyond this range. The selection requirements of these two parameters actually reflect the requirements of two different constraints on the graph, and our method needs to find a balance point between them to obtain high-quality performance on most of the datasets.
4.6. Ablation Study
This section describes the ablation analysis of the proposed PCFS method. There are two regularization terms in the objective function of PCFS: the cannot-link regularization, which involves the cannot-link indicator matrix, and the Laplacian rank regularization, which involves the graph Laplacian of the learned similarity matrix. We compared PCFS with its variants obtained by removing the cannot-link regularization or the Laplacian rank regularization. The results are shown in Table 4. The complete PCFS method, using all regularization terms, achieved average ACC and NMI scores (over the six datasets) of 62.86% and 64.63%, respectively.

On average, when the cannot-link regularization term was removed, the corresponding ACC and NMI scores were the lowest, which demonstrates from the opposite side that pairwise constraints are beneficial for multi-view feature selection. From the per-dataset results, it can be seen that using only the Laplacian rank regularization yielded lower scores than the model without any regularization terms. A possible reason is that excessive attention to the graph Laplacian rank led to inaccurate similarity learning. The utilization of pairwise constraints helped to prevent such deterioration, and thus PCFS achieved a better performance.
4.7. Discussion
We analyzed the performance of PCFS from two aspects. First, we compared PCFS with six baseline methods on six widely used multi-view datasets; the comparison results demonstrate the effectiveness of the proposed PCFS method. Second, we conducted an ablation experiment to analyze the function of each part of the objective of PCFS; the results show that the proper integration of pairwise constraints and graph-rank regularization helps to improve the multi-view feature selection performance.
In other published papers, Refs. [31,47] integrated pairwise constraints into single-view clustering methods and concluded that including pairwise constraint information considerably improves the clustering performance. Ref. [48] also introduced pairwise constraints into multi-view data and obtained consistent results. In our work, pairwise constraints were introduced into similarity graph learning and combined with $\ell_{2,0}$-norm-based row-sparse projections to fulfill the multi-view feature selection task. Our work is thus a fusion and extension of existing work and, theoretically, has the ability to improve the clustering performance. The final experimental results demonstrate that our findings are in line with those of previous related works: pairwise constraints helped the multi-view feature selection and improved its clustering performance.
5. Application on Cancer Datasets
Two multi-omics cancer datasets obtained from The Cancer Genome Atlas (TCGA) were used for the evaluation.
BRCA (breast invasive carcinoma) [49] consists of 398 samples belonging to four categories. It has four different views, namely, gene expression (RNA), mitochondrial DNA (mDNA), microRNA expression (miRNA), and reverse phase protein array expression (protein), which are measured on different platforms and contain different biological information.
CESC (cervical carcinoma) [50] consists of 128 samples from three different categories. As with BRCA, gene expression (RNA), mitochondrial DNA (mDNA), microRNA expression (miRNA), and reverse phase protein array expression (protein) are measured on different platforms and contain different information, naturally forming four views for the experiments.
The experimental setting was kept consistent with Section 4. Using the selected features to predefine the cluster centers, we obtained the performance of each method on the multi-omics cancer datasets. According to the average ACC and NMI scores shown in Figures 6 and 7, our proposed method achieved excellent performance on these two datasets. It should be noted that the BRCA and CESC datasets were preprocessed with different norm-based normalizations; with either preprocessing, the proposed method showed robust and excellent performance on both datasets. More details are given in Table 5.