
Enhanced Similarity Matrix Learning for Multi-View Clustering

1 Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
2 School of Computer Science and Software Engineering, Shenzhen Institute of Information Technology, Shenzhen 518172, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(14), 2845; https://doi.org/10.3390/electronics14142845
Submission received: 4 March 2025 / Revised: 13 June 2025 / Accepted: 13 June 2025 / Published: 16 July 2025

Abstract

Graph-based multi-view clustering is a fundamental analysis method that learns the similarity matrix of multi-view data. Despite its success, it has two main limitations: (1) complementary information is not fully utilized when graphs from different views are directly combined; (2) existing multi-view clustering methods do not adequately address redundancy and noise in the data, which significantly affects performance. To address these issues, we propose Enhanced Similarity Matrix Learning for multi-view clustering (ES-MVC), which dynamically integrates a global graph built from all views with local graphs built from each view to create an improved similarity matrix. Specifically, the global graph captures cross-view consistency, while the local graphs preserve view-specific geometric patterns. The balance between the global and local graphs is controlled through an adaptive weighting strategy, in which a hyperparameter adjusts the relative importance of each graph, effectively capturing complementary information. In this way, our method can learn a clustering structure that fully exploits the complementary information carried by both global and local graphs. Meanwhile, we employ a robust similarity matrix initialization to reduce the negative effects caused by noisy data. For model optimization, we derive an effective optimization algorithm that converges quickly, typically requiring fewer than five iterations on most datasets. Extensive experimental results on diverse real-world datasets demonstrate the superiority of our method over state-of-the-art multi-view clustering methods. In our experiments on datasets such as MSRC-v1, Caltech101, and HW, our proposed method achieves superior clustering performance with average accuracy (ACC) values of 0.7643, 0.6097, and 0.9745, respectively, outperforming the most advanced multi-view clustering methods such as OMVFC-LICAG, which yields ACC values of 0.7284, 0.4512, and 0.8372 on the same datasets.

1. Introduction

Multi-view data obtained from multiple sources [1] allow us to represent the same instance from different perspectives. For example, an image can be described by its SIFT, HOG, and LBP features, and even by accompanying text features. Typically, different views of the same data provide consistent and complementary information. By exploiting the consistent structure among multi-view data, multi-view learning obtains better performance than traditional methods that learn from a single view [2,3,4]. Consequently, multi-view learning has found extensive applications in various real-world scenarios, including image recognition, data mining, and multimedia [5,6,7], for example, generating pseudo-labels for data before large-scale model training to improve training efficiency, or clustering users in social networks to enhance the accuracy of recommendation systems.
Multi-view clustering [8,9,10] is a basic tool of multi-view learning. Its goal is to explore consistent information from multiple features and partition data into different groups. Many works have been proposed in recent years and have achieved promising results. Generally, they can be roughly divided into four classes. (1) Co-training/co-regularized-based methods [11,12,13] are precursor techniques that focus on two views within a clustering framework. These methods work toward reducing the disparity between the two views to enhance clustering performance. Traditional co-training and co-regularized algorithms lack robustness and can be significantly hampered by just a few inaccurate outliers. (2) Subspace-based methods [14,15,16] learn a common low-dimensional representation for multiple views. They leverage the self-expression property of the data to identify the underlying subspaces and subsequently classify the data points accordingly. For example, Yin et al. [17] learned a shared coefficient matrix by ensuring that the coefficient matrices from each pair of views were as similar as possible. (3) Non-negative matrix factorization (NMF)-based methods [18,19,20] aim to learn shared indicator matrices across multiple perspectives. For example, Akata et al. [21] enforce a shared indicator matrix among different views to perform multi-view clustering in the NMF framework. (4) Graph-based methods [22,23,24,25] learn a common similarity matrix across the multiple views, which can further reveal data cluster structures. They utilize the properties of the Laplacian graph, whose edges denote the similarities between pairs of data points, and solve a relaxation of the normalized min-cut problem on the graph. Compared with other methods, graph-based methods can be applied to arbitrarily shaped clustering scenarios and achieve high performance.
Among the diverse multi-view clustering methods, graph-based methods achieve extraordinary performance since they can fuse heterogeneous features and learn a more distinctive clustering structure. Currently, numerous graph-based multi-view clustering methods have been proposed in order to effectively capture the manifold information from each view.
However, most of these methods directly employ original data as input to learn the common similarity matrix for multi-view clustering [22,23,26,27,28,29]. They construct the initial similarity matrix by applying manifold learning to the original data. For example, MV-KM [30] can automatically determine the number of clusters without initialization; MV-FCM [31] extends single-view fuzzy c-means clustering to the multi-view setting and automatically identifies the optimal number of clusters; and MVKM-L2 [32] combines feature weights and view weights to achieve superior performance across multiple datasets. Despite their impressive performance, their effectiveness relies on the assumption that the original data are clean and trusted. To eliminate this assumption, low-rank-based methods [24,33] have been proposed, focusing on capturing the global structure from the entire set of samples in each view. These methods remove the clean-data assumption but ignore the local manifold structure across neighboring samples.
Furthermore, these methods have not adequately explored the structural information across different views. Cross-view structural information is crucial for enhancing the consistency of the shared similarity matrix and achieving a better clustering structure. Nevertheless, existing methods have not effectively utilized this key information, resulting in suboptimal clustering structures. In other words, during data collection steps, even though different views describe the same object, there inevitably exists view-specific information. If such information cannot be properly utilized in cross-view learning, it may become undesirable noise.
Therefore, inspired by the ease of obtaining a robust similarity matrix rather than removing noisy data, and recognizing that the elements of a similarity matrix can reveal data correlations, we propose an Enhanced Similarity Matrix Learning method for multi-view clustering (ES-MVC). As illustrated in Figure 1, our method operates on the premise that the correlation information within each view aligns with the correlation information across all views. ES-MVC integrates both global (global similarity matrix) and local (local similarity matrix) graphs to harness complementary information, enhancing clustering performance. More specifically, we first employ robust initial similarity matrices as input to alleviate the effect of redundancy and noise. Next, we construct a global graph and multiple local graphs from the training data under the same rank constraint. Therefore, both noise information and complementary information can be well respected. For model optimization, we employ an alternating direction strategy to solve our ES-MVC method. Finally, we conduct several experiments on real-world datasets to justify the effectiveness of ES-MVC. The key contributions of our work are outlined as follows:
  • A novel multi-view clustering method is proposed to obtain more complementary information from all the multi-view data, named Enhanced Similarity Matrix Learning for multi-view clustering (ES-MVC).
  • Our method leverages both local and global structures across multiple views to construct a consistent graph for enhanced clustering performance.
  • Furthermore, we apply a rank constraint to the similarity matrix in order to dynamically optimize neighbor assignment for improved clustering results.
  • Additionally, we develop a robust optimization algorithm to efficiently solve our objective function.
Experimental results demonstrate that our method significantly outperforms traditional multi-view clustering methods. In our experiments on datasets such as MSRC-v1, Caltech101, and HW, our proposed method achieves superior clustering performance with average accuracy (ACC) values of 0.7643, 0.6097, and 0.9745, respectively, outperforming the most advanced multi-view clustering methods such as OMVFC-LICAG [34], which yield ACC values of 0.7284, 0.4512, and 0.8372 on the same datasets. These findings suggest that our method, which combines both global and local structures, leads to more consistent and accurate clustering.
The remaining sections of this paper are organized as follows: Section 2 reviews related work; Section 3 explains the proposed ES-MVC method in detail, including its optimization and convergence analysis; Section 4 presents experimental results comparing its performance with state-of-the-art techniques; and Section 5 concludes with future research directions.

2. Related Works

The information explosion is making more and more multi-view data available, spurring numerous studies on multi-view learning, of which multi-view clustering is an important part. This paper primarily emphasizes graph-based multi-view clustering. In this section, we introduce graph-based multi-view clustering methods from the following two aspects.

2.1. Similarity Matrix Learning for Multi-View Clustering

The key issue in graph-based multi-view clustering methods is how to construct an ideal similarity matrix for clustering. Many studies have investigated this problem and achieved promising results. For example, Yu et al. [22] leveraged a co-training technique to learn a Bayesian undirected graph for multi-view data. Kumar et al. [23] proposed co-regularized spectral multi-view clustering by minimizing the disagreement across two views and adopted the eigenvectors of the Laplacian graph for clustering. With the same technique, Cai et al. [26] integrated multi-view features by developing a multi-modal spectral clustering method. However, these methods are based on the pairwise concept, which is neither efficient nor optimal for multi-view data. To tackle this problem, Tang et al. [27] proposed LMF, a way to fuse information from multiple graph sources. Wang et al. [35] enforced a common eigenvector matrix and formulated a multi-objective problem for clustering. Although these methods improve clustering performance, their effectiveness hinges on the assumption that the original data are clean and trustworthy. All of them directly use the original data to learn a graph for clustering, so their performance is sensitive to noisy data.

2.2. Robust Multi-View Clustering

To enhance the robustness of multi-view clustering, researchers have sought solutions to mitigate the impact of noise. For instance, Xia et al. [33] developed robust multi-view spectral clustering (RMSC), which uses low-rank and sparse decomposition to construct a transition probability matrix, effectively isolating noise from each graph. In order to automatically learn a set of weights for all graphs and assign small weights to data containing noise, Nie et al. [28,29] introduced an auto-weighted multiple graph learning model (AMGL) and an adaptive neighbor method that learns a Laplacian rank-constrained graph (MLAN). Li et al. [36] presented a Multi-view-based Parameter-Free (MPF) method for detecting groups in crowd scenes using an adaptive neighbors strategy. Motivated by the robust graphs learned by the above methods, some works use a robust similarity matrix as input to learn an optimal similarity matrix, which can reduce the influence of noisy data to some extent. For example, Nie et al. [37] introduced a multi-view clustering approach that takes graphs as input. Although these methods improve robustness, they primarily use a low-rank or $\ell_1$-norm penalty to separate noise from the original data, so the resulting robust similarity matrix mainly captures the global structure over all samples in each view. Low-rank methods such as those in [24,33] thus ignore the local manifold structure across neighboring samples, and they also fail to adequately consider and integrate the overall global structure among all views.
Compared to existing methods, ES-MVC introduces a novel design by using robust similarity matrices to reduce the impact of noise, jointly capturing both global and local structural information to enhance representation, and automatically fusing multiple graphs without manual parameter tuning. Table 1 summarizes the methods discussed above.

3. Our Method

3.1. Motivation

From the above analysis, it is clear that most multi-view clustering methods that rely on original features are vulnerable to noise and outliers. Additionally, methods that use graphs as input fail to fully exploit the local structure within neighboring samples and the global structure across all views. Therefore, the similarity matrices generated by these methods are not optimal for clustering. To this end, we propose a new multi-view clustering method called Enhanced Similarity Matrix Learning for multi-view clustering (ES-MVC). It adaptively learns a shared similarity matrix by extracting the local as well as the global structure information from several robust initial graphs pre-defined from all views. A comparison of the general graph-based framework and our ES-MVC method is shown in Figure 2.

3.2. ES-MVC Method

Consider multi-view data $\{X^{(1)}, \ldots, X^{(m)}\}$, where $X^{(v)} = [x_1^v, x_2^v, \ldots, x_n^v] \in \mathbb{R}^{n \times d^{(v)}}$ ($v = 1, \ldots, m$) is the $v$-th view matrix, $n$ is the number of data points, $m$ is the number of views, and $d^{(v)}$ is the feature dimension of the $v$-th view.

3.2.1. Local Structure in a Single View

Let $A^v = [a_1^v, a_2^v, \ldots, a_n^v] \in \mathbb{R}^{n \times n}$ denote the $v$-th initial local similarity matrix constructed from the $v$-th view $X^{(v)}$. We can obtain the local structure based on the initial local similarity matrices by solving the following problem:

$$\min_{W} \sum_{v=1}^{m} \sum_{i,j} \left\| a_i^v - a_j^v \right\|_2^2 w_{ij}^2, \quad \text{s.t.} \;\; \forall i,\; \mathbf{w}_i^\top \mathbf{1} = 1,\; 0 \le w_{ij} \le 1, \tag{1}$$

where $W = \{w_{ij}\} \in \mathbb{R}^{n \times n}$, $i, j \in \{1, \ldots, n\}$, is the novel similarity matrix that we need to learn.

3.2.2. Global Structure Across Each Different View

Let $Y = [X^{(1)}, \ldots, X^{(m)}] = [y_1, y_2, \ldots, y_n] \in \mathbb{R}^{n \times \sum_{v=1}^{m} d^{(v)}}$ denote the global data matrix obtained by concatenating all the views, where $y_i \in \mathbb{R}^{\sum_{v=1}^{m} d^{(v)}}$, and let $G = [g_1, g_2, \ldots, g_n] \in \mathbb{R}^{n \times n}$ be the initial global similarity matrix constructed from the data matrix $Y$.
Then, we can learn the global structure, which lies among all views, based on the initial global similarity matrix through the following function:

$$\min_{W} \sum_{i,j} \left\| g_i - g_j \right\|_2^2 w_{ij}^2, \quad \text{s.t.} \;\; \forall i,\; \mathbf{w}_i^\top \mathbf{1} = 1,\; 0 \le w_{ij} \le 1. \tag{2}$$
As shown in Figure 2, we adopt the graph construction method from [38] to create the robust initial graphs, including the local similarity matrices A and the global similarity matrix G , as inputs for the ES-MVC.

3.2.3. ES-MVC Objective Function

We integrate both local and global structures into the multi-view clustering objective function and utilize an irrational (square-root) form to automatically determine the optimal weight for each graph, eliminating the need to set a parameter manually for each view. However, most existing graph-based methods require an additional clustering step (such as k-means) to obtain the final clustering result. This two-step strategy inevitably loses some information, reducing the clustering performance. Therefore, inspired by [39], we constrain the structure of the graph and directly obtain clustering partitions from the connected components of the structured graph.
Consequently, we derive our objective function as follows.
$$J(W) = \arg\min_{W} \underbrace{\sum_{v=1}^{m} \sum_{i,j} \left\| a_i^v - a_j^v \right\|_2^2 w_{ij}^2}_{\text{Local structure}} \; + \; \alpha \underbrace{\sum_{i,j} \left\| g_i - g_j \right\|_2^2 w_{ij}^2}_{\text{Global structure}}, \quad \text{s.t.} \;\; \forall i,\; \mathbf{w}_i^\top \mathbf{1} = 1,\; 0 \le w_{ij} \le 1,\; \underbrace{\mathrm{rank}(L_W) = n - c}_{\text{Rank constraint}}, \tag{3}$$

where $\alpha$ is a parameter that balances the impact of the local and global structures. Let $D$ be the degree matrix whose $i$-th diagonal element is $D_{ii} = \sum_{j=1}^{n} w_{ij}$, and let $L_W$ denote the graph Laplacian matrix, defined as $L_W = D - (W + W^\top)/2$. To achieve the ideal neighbor assignment in the learned similarity matrix, we add a rank constraint to the Laplacian matrix.
The rank constraint ensures that the number of connected components in the graph matches the number of clusters c, with each connected component corresponding to one cluster. This allows us to directly derive the clustering partitions from the graph, bypassing the need for an additional clustering step like k-means, which could lead to information loss.
The primary advantage of the rank constraint is that it enforces a structured graph where the connected components represent meaningful cluster assignments. This is particularly effective in multi-view clustering tasks, where maintaining consistency across multiple views is essential. By directly constraining the graph’s rank, the clustering process becomes more efficient, and the method avoids the potential loss of information inherent in two-step clustering strategies.
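To make the role of the rank constraint concrete, the following minimal sketch (hypothetical Python/NumPy code, not the authors' implementation) counts the connected components of a learned graph $W$ via the multiplicity of the zero eigenvalue of its Laplacian, which is exactly the quantity that the constraint $\mathrm{rank}(L_W) = n - c$ controls:

```python
import numpy as np

def count_components(W, tol=1e-10):
    """Number of connected components of the graph W, obtained as the
    multiplicity of the (near-)zero eigenvalues of L_W = D - (W + W.T)/2."""
    S = (W + W.T) / 2.0                      # symmetrize the learned graph
    L = np.diag(S.sum(axis=1)) - S           # graph Laplacian
    eigvals = np.linalg.eigvalsh(L)          # eigenvalues in ascending order
    return int(np.sum(eigvals < tol))        # zero eigenvalues = components
```

A block-diagonal $W$ with exactly $c$ blocks yields `count_components(W) == c`, so each connected component can be read off directly as one cluster.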

3.2.4. Robust Initial Graph Construction

While we have thoroughly examined the process of deriving the final optimal graph matrix $W$ from global and local graphs, it is important to recognize that the quality of these initial graphs significantly influences the final clustering performance. Accordingly, this section focuses specifically on the initialization methodology for each view's graph structure. Motivated by [38,39], the initial local graph for the $v$-th view can be obtained by solving the following problem:

$$\min_{A^v \ge 0,\; A^v \mathbf{1} = \mathbf{1}} \; \sum_{i=1}^{n} \sum_{j=1}^{m} \left( \left\| x_i^v - x_j^v \right\|_2^2 a_{ij}^v + \gamma \left( a_{ij}^v \right)^2 \right). \tag{4}$$
The first term in Equation (4) enables the affinity matrix $A^v$ to preserve local geometric relationships between samples and anchors, where the affinity values increase inversely with distance. The second term acts as a regularizer that promotes dense connections between anchors and samples. Through the hyperparameter $\gamma$, we control the sparsity of these connections by limiting the number of anchors assigned to each sample. Formally, let $d_{i,[j]}$ denote the distance between the $i$-th sample and its $j$-th nearest anchor, with distances ordered such that $d_{i,[1]} \le d_{i,[2]} \le \cdots \le d_{i,[m]}$. The closed-form solution of Equation (4) is as follows:
$$a_{ij}^v = \max\!\left( \frac{d_{i,[k+1]} - d_{i,[j]}}{k\, d_{i,[k+1]} - \sum_{l=1}^{k} d_{i,[l]}},\; 0 \right), \tag{5}$$

where $k$ is the number of neighbors each sample connects to, and $\gamma$ can be computed as $\gamma = \frac{k}{2} d_{i,[k+1]} - \frac{1}{2} \sum_{l=1}^{k} d_{i,[l]}$.
The initial graph construction process described here effectively leverages both global and local data characteristics to enhance clustering performance while mitigating noise. Specifically, the first term in the optimization function preserves local geometric relationships by maintaining affinities between samples that are close in space, helping to reduce noise due to local inconsistencies. Meanwhile, the regularization term incorporates global features, ensuring that the affinity matrix is sparsified and capturing global patterns in the data. This balance of local and global features plays a crucial role in addressing view-specific noise, which can arise due to variations in view angles, lighting, or partial occlusions during the data collection phase. Even though different views describe the same object, they often contain view-specific information that, if not properly handled, can introduce undesirable noise. By optimizing the graph structure to balance local and global features, we reduce the impact of such noise, resulting in a clearer and more robust affinity matrix A v , which accurately reflects the underlying clusters. More details on this process can be found in [39].
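As a concrete illustration of Equations (4) and (5), the following sketch (hypothetical NumPy code, under the assumption that plain Euclidean distances between raw samples are used, with the other samples playing the role of the anchors; `k` is the user-chosen number of neighbors) builds an initial graph row by row:

```python
import numpy as np

def initial_graph(X, k):
    """Initial similarity matrix via the closed-form adaptive-neighbor rule
    of Equation (5): row i connects to its k nearest neighbors, with weights
    decreasing linearly in distance, and each row summing to one."""
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * X @ X.T   # ||x_i - x_j||_2^2
    np.fill_diagonal(D, np.inf)                   # exclude self-loops
    A = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(D[i])                    # neighbors by distance
        d = D[i, idx]
        denom = k * d[k] - d[:k].sum()            # k*d_{i,[k+1]} - sum_l d_{i,[l]}
        A[i, idx[:k]] = np.maximum((d[k] - d[:k]) / max(denom, 1e-12), 0.0)
    return A
```

By construction, each row of `A` sums to one, matching the constraint $A^v \mathbf{1} = \mathbf{1}$ in Equation (4).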

3.3. Optimization

For the convenience of optimization, we first modify the formulation of the objective function. Denote

$$\tilde{\lambda}_v = \frac{1}{2\sqrt{\sum_{i,j} \left\| a_i^v - a_j^v \right\|_2^2 w_{ij}^2}} \;\; (v = 1, \ldots, m), \qquad \varphi = \frac{1}{2\sqrt{\sum_{i,j} \left\| g_i - g_j \right\|_2^2 w_{ij}^2}}. \tag{6}$$
Through simple algebra and by substituting $\tilde{\lambda}_v$ and $\varphi$ into the objective function (3), we can get

$$J(W) = \arg\min_{W} \sum_{v=1}^{m} \tilde{\lambda}_v \sum_{i,j} \left\| a_i^v - a_j^v \right\|_2^2 w_{ij}^2 + \varphi \sum_{i,j} \left\| g_i - g_j \right\|_2^2 w_{ij}^2, \quad \text{s.t.} \;\; \forall i,\; \mathbf{w}_i^\top \mathbf{1} = 1,\; 0 \le w_{ij} \le 1,\; \mathrm{rank}(L_W) = n - c. \tag{7}$$
We regard the data matrix $Y$ as the $(m+1)$-th view matrix and the initial global similarity matrix as the $(m+1)$-th similarity matrix. Let
$$\lambda_v = \begin{cases} \tilde{\lambda}_v, & v \le m, \\ \varphi, & v = m + 1. \end{cases}$$
We can then combine the two terms in Equation (7) and use the total initial similarity matrix $\mathcal{A}$ to replace the combination of all initial similarity matrices:

$$J(W) = \arg\min_{W} \sum_{v=1}^{m+1} \lambda_v \sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 w_{ij}^2, \quad \text{s.t.} \;\; \forall i,\; \mathbf{w}_i^\top \mathbf{1} = 1,\; 0 \le w_{ij} \le 1,\; \mathrm{rank}(L_W) = n - c, \tag{8}$$

where $\mathcal{A} = \{A^v\} = [A^1, A^2, \ldots, A^m, G] \in \mathbb{R}^{n \times n \times (m+1)}$ is the total initial similarity matrix, consisting of all the initial local similarity matrices and the initial global similarity matrix, with $A^v = [b_1^v, b_2^v, \ldots, b_n^v] \in \mathbb{R}^{n \times n}$, $v = 1, \ldots, m+1$.
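The self-tuned weights in Equation (6) can be computed directly from the current $W$ and the stacked initial graphs. Here is a minimal sketch (hypothetical code; `B_list` holds the $m$ local graphs with the global graph $G$ appended last):

```python
import numpy as np

def pairwise_sq_dists(M):
    """Squared Euclidean distances between the rows of M."""
    sq = np.sum(M**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * M @ M.T
    return np.maximum(D, 0.0)                # clip tiny negative round-off

def view_weights(B_list, W, eps=1e-12):
    """lambda_v = 1 / (2 * sqrt(sum_ij ||b_i^v - b_j^v||^2 * w_ij^2)),
    one weight per initial graph, following Equation (6)."""
    return np.array([
        1.0 / (2.0 * np.sqrt(np.sum(pairwise_sq_dists(B) * W**2)) + eps)
        for B in B_list
    ])
```

Note how a graph that already agrees well with $W$ produces a small residual and hence a large weight, which is the intended self-weighting behavior.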
Assume that the class indicator matrix $H = [h_1, h_2, \ldots, h_n]^\top \in \mathbb{R}^{n \times c}$ is unified across all the views, setting aside the normalization of the different graphs. Each row of $H$ represents a data point and each column represents a cluster, so $h_i$ indicates the cluster that data point $i$ belongs to.
Let $\sigma_i$ ($i = 1, 2, \ldots, n$) be the $i$-th smallest eigenvalue of $L_W$; then $0 \le \sigma_i \le \sigma_{i+1}$. As a result, the constraint $\mathrm{rank}(L_W) = n - c$ is satisfied if $\sum_{i=1}^{c} \sigma_i = 0$. According to the theorem in [40], we transform the problem of summing the smallest $c$ eigenvalues of $L_W$ into minimizing the trace of $H^\top L_W H$, which gives the following equivalent formulation:

$$\sum_{i=1}^{c} \sigma_i = \min_{H^\top H = I} \mathrm{tr}\left( H^\top L_W H \right). \tag{9}$$
Therefore, Equation (8) can be converted to

$$J(W, H, \lambda_v) = \arg\min_{W,\; H^\top H = I} \; \mu\, \mathrm{tr}\left( H^\top L_W H \right) + \sum_{v=1}^{m+1} \lambda_v \sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 w_{ij}^2, \quad \text{s.t.} \;\; \forall i,\; \mathbf{w}_i^\top \mathbf{1} = 1,\; 0 \le w_{ij} \le 1, \tag{10}$$
where $\mu$ is a parameter that must be sufficiently large so that the optimal solution of the objective function (10) satisfies $\sum_{i=1}^{c} \sigma_i = 0$. In our method, $\mu$ enforces that the graph has exactly $c$ connected components. Instead of tuning $\mu$ manually, we update it adaptively during optimization: $\mu$ is initialized to a small value, and after each iteration it is updated according to the number of zero eigenvalues of the Laplacian matrix. If the number of zero eigenvalues is less than $c$, $\mu$ is multiplied by 2; if it is greater than $c + 1$, $\mu$ is divided by 2; otherwise, the iteration ends. This adaptive updating ensures that the final graph has exactly $c$ connected components, and its adjustment is naturally linked to the structure of the graph: the number of zero eigenvalues of the Laplacian directly indicates how well the graph captures the underlying clusters. The dynamic adjustment makes the method robust to different dataset sizes, since $\mu$ adapts to the data's complexity; moreover, $\mu$ need not be large from the outset, because the adaptive strategy lets it grow or shrink as needed to guide the graph toward the correct number of clusters.
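The adaptive rule for $\mu$ described above translates directly into code (a hypothetical fragment; `count_components` is the helper from the earlier sketch):

```python
def update_mu(mu, W, c, tol=1e-10):
    """Double or halve mu according to the number of zero Laplacian
    eigenvalues (i.e., connected components) of the current graph W."""
    z = count_components(W, tol)
    if z < c:                   # too few components: strengthen the penalty
        return 2.0 * mu, False
    if z > c + 1:               # too many components: relax the penalty
        return mu / 2.0, False
    return mu, True             # desired structure reached: stop iterating
```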
In the objective function (10), we have three unknown variables, $W$, $\lambda_v$, and $H$, where $\lambda_v$ depends on $W$. Thus, it is difficult to solve the objective function (10) directly. We therefore develop a more efficient algorithm that alternately updates $H$ and $\lambda_v$ (while fixing $W$) and then updates $W$ (while fixing $H$ and $\lambda_v$). The detailed process is described as follows:
  • Update $H$ and $\lambda_v$, while fixing $W$
    To be specific, $W$ is known in the $(t+1)$-th iteration, so we can easily calculate $\lambda_v$ by Equation (6) and update $H$ by minimizing the objective function (10). In this case, the second term of (10) becomes constant, so the problem reduces to

    $$H^* = \arg\min_{H^\top H = I} \mathrm{tr}\left( H^\top L_W H \right). \tag{11}$$

    According to matrix theory, the columns of the optimal indicator matrix $H$ in Equation (11) are the $c$ eigenvectors of $L_W$ corresponding to its $c$ smallest eigenvalues (a sketch of this eigen-decomposition step is given after this list).
  • Update $W$ while fixing $H$ and $\lambda_v$
    After that, we can update $W$ while fixing $H$ and $\lambda_v$. Denote $z_{ij} = \sum_{v=1}^{m+1} \lambda_v \left\| b_i^v - b_j^v \right\|_2^2$ and $y_{ij} = \left\| h_i - h_j \right\|_2^2$.
    According to the identity $2\,\mathrm{tr}(H^\top L_W H) = \sum_{i,j} \| h_i - h_j \|_2^2 w_{ij} = \sum_{i,j} y_{ij} w_{ij}$, the objective function (10) can be converted to Equation (12):

    $$W^* = \arg\min_{W} \sum_{i,j} \left( z_{ij} w_{ij}^2 + \mu\, y_{ij} w_{ij} \right), \quad \text{s.t.} \;\; \forall i,\; \mathbf{w}_i^\top \mathbf{1} = 1,\; 0 \le w_{ij} \le 1. \tag{12}$$

    Through simple algebra, for each $i$ we obtain Equation (13):

    $$\mathbf{w}_i^* = \arg\min_{\mathbf{w}_i} \left\| \mathbf{w}_i + \mathbf{u}_i \right\|_2^2, \quad \text{s.t.} \;\; \mathbf{w}_i^\top \mathbf{1} = 1,\; 0 \le w_{ij} \le 1, \tag{13}$$

    where $\mathbf{u}_i \in \mathbb{R}^{n \times 1}$ is a vector whose $j$-th element is $u_{ij} = \frac{\mu\, y_{ij}}{2 z_{ij}}$.
The problem in Equation (13) has a closed-form solution that can be computed easily with the method presented in [39] (a sketch of this simplex-projection step is also given below). As a result, the objective function updates the similarity matrix until it achieves the ideal neighbor assignment. In addition, $W$ can be used for clustering directly, which greatly improves the stability of the clustering performance. The parameter $\mu$ is initialized and adaptively doubled or halved exactly as described after Equation (10), and the iterative procedure repeats until the algorithm converges. The convergence proof is given in the following subsection. Finally, we summarize the pseudocode for solving the objective function (10) in Algorithm 1.
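The two update steps referenced above can be sketched as follows (hypothetical NumPy code, not the authors' implementation). First, the $H$-update of Equation (11), which simply takes the $c$ smallest eigenvectors of $L_W$:

```python
import numpy as np

def update_H(L, c):
    """Solve Eq. (11): H is formed by the c eigenvectors of L_W associated
    with its c smallest eigenvalues (eigh returns ascending eigenvalues)."""
    _, vecs = np.linalg.eigh(L)
    return vecs[:, :c]
```

Second, the row-wise $W$-update of Equation (13), which amounts to the standard Euclidean projection of $-\mathbf{u}_i$ onto the probability simplex:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                                   # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def update_row(u_i):
    """Solve Eq. (13): min_w ||w + u_i||^2 s.t. w >= 0, w^T 1 = 1."""
    return project_simplex(-u_i)
```

The upper bound $w_{ij} \le 1$ is satisfied automatically, since the projected row is non-negative and sums to one.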
Algorithm 1 ES-MVC
Input: $X^{(1)}, \ldots, X^{(m)}$, $X^{(v)} \in \mathbb{R}^{n \times d^{(v)}}$ ($v = 1, \ldots, m$); class number $c$.
Initialize: $t = 1$; $W^0 \in \mathbb{R}^{n \times n}$, the initial graph.
while not converged do
   1. Calculate the matrix $A^v$ and $\lambda_v$ for each view.
   2. Calculate $L_W$.
   3. Update the cluster indicator matrix $H = \arg\min_{H^\top H = I} \mathrm{tr}(H^\top L_W H)$.
   4. For each $i$, update $\mathbf{w}_i$.
   5. Update $t$: $t \leftarrow t + 1$.
end while
Compute the clustering labels by running k-means on $H$ or spectral clustering on $W$.
Output: The final clustering result.
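Putting the earlier sketches together, a hypothetical end-to-end driver for Algorithm 1 might look as follows (glue code only, reusing `initial_graph`, `pairwise_sq_dists`, `view_weights`, `update_H`, `update_row`, and `update_mu` defined above; parameter defaults are illustrative, not the paper's settings):

```python
import numpy as np

def es_mvc(X_views, c, k=10, mu=1.0, max_iter=30):
    """Sketch of Algorithm 1: alternate the H-update and the row-wise
    W-update while adapting mu until the graph has c components."""
    B_list = [initial_graph(X, k) for X in X_views]       # local graphs A^v
    B_list.append(initial_graph(np.hstack(X_views), k))   # global graph G
    W = sum(B_list) / len(B_list)                         # initial graph W^0
    for _ in range(max_iter):
        lam = view_weights(B_list, W)                     # Eq. (6)
        S = (W + W.T) / 2.0
        L = np.diag(S.sum(axis=1)) - S                    # Laplacian L_W
        H = update_H(L, c)                                # Eq. (11)
        Z = sum(l * pairwise_sq_dists(B) for l, B in zip(lam, B_list))
        Y = pairwise_sq_dists(H)                          # y_ij
        U = mu * Y / (2.0 * np.maximum(Z, 1e-12))         # u_ij
        W = np.vstack([update_row(U[i]) for i in range(len(U))])
        mu, done = update_mu(mu, W, c)
        if done:
            break
    return W, H
```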

3.4. Convergence Analysis

In this subsection, we provide the convergence analysis of our proposed algorithm. We adopt an alternating iteration approach for solving each subproblem. Because the problem is non-convex, the algorithm may in general reach a local optimum; the penalty parameter is gradually adjusted with each iteration, and this adaptive strategy keeps the iteration running until the objective converges, which is a common practice in optimization. The details of the convergence analysis are as follows.
Theorem 1.
In each iteration of Algorithm 1, we have

$$\sum_{v=1}^{m+1} \frac{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t+1}}{2 \sqrt{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t+1}}} \;\le\; \sum_{v=1}^{m+1} \frac{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t}}{2 \sqrt{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t}}}, \tag{14}$$

i.e., $\displaystyle \sum_{v=1}^{m+1} \sqrt{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t+1}} \le \sum_{v=1}^{m+1} \sqrt{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t}}$.
Proof. 
For each iteration $t$, since $W^{t+1}$ minimizes the weighted objective in step 4 of Algorithm 1, we have the following inequality:

$$\sum_{v=1}^{m+1} \frac{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t+1}}{2 \sqrt{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t}}} \;\le\; \sum_{v=1}^{m+1} \frac{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t}}{2 \sqrt{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t}}}. \tag{15}$$
According to the inequality $a + b \ge 2\sqrt{ab}$, we can easily get $\sqrt{a} - \frac{a}{2\sqrt{b}} \le \sqrt{b} - \frac{b}{2\sqrt{b}}$. Let $a = \sum_{i,j} \| b_i^v - b_j^v \|_2^2 ( w_{ij}^2 )^{t+1}$ and $b = \sum_{i,j} \| b_i^v - b_j^v \|_2^2 ( w_{ij}^2 )^{t}$; then we have

$$\sum_{v=1}^{m+1} \left( \sqrt{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t+1}} - \frac{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t+1}}{2 \sqrt{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t}}} \right) \le \sum_{v=1}^{m+1} \left( \sqrt{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t}} - \frac{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t}}{2 \sqrt{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t}}} \right). \tag{16}$$
Adding inequalities (15) and (16) side by side, we obtain

$$\sum_{v=1}^{m+1} \sqrt{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t+1}} \;\le\; \sum_{v=1}^{m+1} \sqrt{\sum_{i,j} \left\| b_i^v - b_j^v \right\|_2^2 \left( w_{ij}^2 \right)^{t}}. \tag{17}$$
Thus, Algorithm 1 will converge to a minimum of the objective function Equation (10). □
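As an empirical companion to Theorem 1, one can track the square-root objective across iterations and verify its monotone decrease (a hypothetical snippet reusing `pairwise_sq_dists` from the earlier sketch):

```python
import numpy as np

def sqrt_objective(B_list, W):
    """Quantity bounded by Theorem 1:
    sum_v sqrt(sum_ij ||b_i^v - b_j^v||^2 * w_ij^2)."""
    return sum(np.sqrt(np.sum(pairwise_sq_dists(B) * W**2)) for B in B_list)

# Inside the main loop one may assert, up to numerical tolerance:
#   obj = sqrt_objective(B_list, W); assert obj <= prev_obj + 1e-9
```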

4. Experiments

In this section, we validate our method using six multi-view datasets: five widely used ones (Handwritten numerals, MSRC-v1, Caltech101, BDGP, and BBCSport) and one created by us (Volunteer). We compare it with spectral clustering (SC) [3] and six other multi-view clustering methods: Feature Concatenation Spectral Clustering (ConSC) [23], Auto-weighted Multiple Graph Learning (AMGL) [28], Multi-view Learning with Adaptive Neighbors (MLAN) [29], Robust Multi-view Spectral Clustering (RMSC) [33], Exclusivity-Consistency Regularized Multi-view Subspace Clustering (ECMSC) [41], and latent-information-guided one-step multi-view fuzzy clustering based on cross-view anchor graph (OMVFC-LICAG) [34]. Following [29], we set the parameter $\mu$ for the Laplacian matrix rank constraint.
To preliminarily evaluate the effectiveness of the proposed method in manifold-structured multi-view scenarios, we first apply a synthetic three-ring toy dataset. It consists of 600 samples from 3 non-linearly separable clusters, each represented by a 2D coordinate; the second view is obtained by performing an FFT on the first view. The preliminary results (ACC, NMI, PUR) are 1.0000 ± 0.00, 1.0000 ± 0.00, and 1.0000 ± 0.00, and the cluster visualization is shown in Figure 3, which preliminarily demonstrates that our algorithm performs well in such multi-view scenarios.
We then evaluate clustering performance using three standard metrics: accuracy (ACC) [26], normalized mutual information (NMI) [42], and purity [43]. All three metrics range from 0 to 1, with higher values indicating better clustering performance. Since all methods involve a k-means procedure that is sensitive to the initial centroids, we repeat each experiment 40 times and report the average performance and standard deviation (Std) for the ACC, NMI, and purity metrics.
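Since ACC requires matching predicted cluster labels to ground-truth classes, a common implementation (sketched here as an assumption, since the paper does not spell out its exact computation) finds the best one-to-one mapping with the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of samples correctly labeled under the best one-to-one
    matching between cluster labels and classes (Hungarian algorithm)."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    C = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, cl in enumerate(clusters):
        for j, gt in enumerate(classes):
            C[i, j] = np.sum((y_pred == cl) & (y_true == gt))
    row, col = linear_sum_assignment(-C)   # maximize matched counts
    return C[row, col].sum() / len(y_true)
```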
We extracted various features from the datasets to perform multi-view clustering. Brief introductions are as follows:
Color Moment (CMT) [44]: describes the global color distribution of an image.
HOG [45]: describes local or global shape information by capturing the edge or gradient structure of an image.
GIST [46]: obtains global non-spatial features of an image and their statistics.
LBP [47]: describes the texture of an image.
SIFT [48]: constructs features from keypoints of an image and tolerates local shape distortion and illumination differences.
Doc2Vec [49]: embeds documents into a fixed-dimensional vector space, capturing semantic information and contextual relationships of the document.
TF-IDF [50]: evaluates the importance of words in a document as the product of term frequency and inverse document frequency, reducing the impact of common words on the text representation.
LSA [51]: extracts latent semantic structures and concepts by applying singular value decomposition to reduce text data to a lower-dimensional space.
Finally, it should be noted that we assume all views are complete and well aligned across view-specific samples.
The experiments were conducted using MATLAB 2023 on a machine equipped with an Intel Xeon Platinum 8352V CPU (2 processors, 2.10 GHz base frequency, 3.50 GHz turbo frequency) and 128 GB RAM. No GPU was used in the experiments, demonstrating the efficiency of our method even in a CPU-only environment.

4.1. Experiments on the MSRC-v1 Dataset

The MSRC-v1 dataset [52] includes 240 images from 8 classes, with 30 images per class. For our experiment, we selected seven classes and extracted five features: 24-dim CM, 576-dim HOG, 512-dim GIST, 256-dim LBP, and 254-dim CENTRIST.
We evaluated our method on this dataset, as shown in Table 2, where SC1 to SC5 represent the five features. Our method consistently outperformed the others.

4.2. Experiments on the Caltech101 Dataset

The Caltech101 dataset [53] contains images from 101 classes. Figure 4 shows samples from three classes. We used a subset of 441 images from 7 selected classes, with 3 features: 620-dim HOG, 2560-dim SIFT, and 1160-dim LBP. Table 3 presents the clustering performance, with our method consistently outperforming the others.

4.3. Experiments on the HW Dataset

The HW [54] dataset contains 2000 images across 10 digit classes. We extracted six features: 76-dim FOU, 216-dim FAC, 64-dim KAR, 240-dim PIX, 47-dim ZER, and 6-dim MOR. Figure 5 shows sample images from the HW dataset.
Table 4 lists the clustering performance of each method on the HW dataset. The results demonstrate the high accuracy of our method.

4.4. Experiments on a Visual and Text Dataset

BDGP [55] is a multi-modal dataset with visual and textual features. It contains 2500 images of drosophila embryos in 5 categories. Each image is represented by a 1750-dim visual vector and a 79-dim textual vector. We evaluate our method using both views. Table 5 shows the clustering performance for each method. The results demonstrate the high accuracy of our method.

4.5. Experiments on Two Text Datasets

The BBCSport [56] dataset consists of 737 sports news articles across five categories. The Volunteer dataset consists of 2038 volunteers, both male and female, aged 22 to 60, categorized into 5 classes: alleviation, emergency, environment, neighborhood, and training. Each volunteer's information is described by two distinct text files: one contains the volunteer's subjective intentions, while the other provides objective details about the volunteer. In our experiment, we extracted 3 kinds of features from these datasets, i.e., 100-dim Doc2Vec, 100-dim TF-IDF, and 100-dim LSA. Therefore, we have 3 views for the BBCSport dataset and 6 views for the Volunteer dataset. Table 6 and Table 7 list the clustering performance of each method. The results indicate that our method outperforms the other methods.

4.6. Experimental Analysis

We further analyze our model’s results in comparison with the previously mentioned experimental outcomes. Our observations are as follows:
  • From Table 2, Table 3, Table 4, Table 5, Table 6 and Table 7, it is evident that nearly all multi-view clustering methods outperform the best single-view method based on SC [3]. This demonstrates the superior effectiveness of multi-view methods in these scenarios. The results further suggest that multi-view data offers richer information, and the different views complement each other.
  • Table 2, Table 3, Table 4, Table 5, Table 6 and Table 7 show that our method outperforms state-of-the-art techniques. It captures both local and global structures for a consistent similarity matrix and uses a robust graph, improving resilience to noise and outliers.
  • The results in Table 5 also prove that our method can be applied to multi-model datasets and achieve a promising performance.
  • The results in Table 6 and Table 7 also prove that our method can be applied to text datasets, where it likewise outperforms other state-of-the-art methods.
  • Figure 6 visualizes the similarity matrices for the MSRC-v1 dataset. The first three matrices are the initial local similarity matrices for views 1, 2, and 3. The last three show matrices learned by our method using the global structure, the local structure, and both. Each matrix has seven diagonal blocks representing the seven classes, with values indicating affinities within clusters. A clearer diagonal block with more nonzero points signifies better clustering performance. Our method shows a clear block-diagonal structure, with more affinity points on the diagonal, resulting in superior clustering. The sixth matrix, highlighted in red, demonstrates the effectiveness of our method compared with those using only the global or local structure.
  • Figure 7 shows the value of parameter α vs. ACC results on the MSRC-v1 dataset. From Figure 7, we observe that when α is zero or infinite, the clustering performance is not the best. Figure 6 also proves this. We notice from Figure 6 that the diagonal block of the learned similarity matrix from our method is more complete than others. This illustrates that using global structure or local structure separately cannot obtain the desired similarity structure. It further shows that the local manifold structure within each view alone is not enough for clustering tasks.
  • Figure 8 shows the convergence curve of our proposed method vs. the number of iterations on four datasets. It demonstrates that the solution of our method is efficient. In particular, we can see that our method converges in fewer than five iterations.
In summary, our proposed method demonstrates superior performance by adaptively fusing cross-view information through adaptive weighting techniques, effectively uncovering intrinsic data correlations to achieve optimal results. From an algorithmic perspective, it enables information sharing across views by constructing both single-view and cross-view graphs. Furthermore, by imposing constraints on the graph structure, it directly obtains clustering results, eliminating performance degradation caused by additional steps to acquire cluster labels. Finally, the algorithm’s convergence is rigorously validated through both theoretical derivation and experimental verification. Notably, the algorithm converges with remarkably few iterations, highlighting its computational efficiency.

4.7. Advantages and Limitations

In this section, we primarily discuss the advantages and disadvantages of the proposed method. Based on the experimental results, our method consistently outperforms other comparison methods, which directly demonstrates its effectiveness. The underlying reason for this performance lies in its ability to enable information sharing across views by constructing both single-view and cross-view graphs. Moreover, by imposing structural constraints on the graphs, the method directly produces clustering results, avoiding the performance degradation typically caused by additional steps required to obtain cluster labels.
Despite its strong performance, the biggest limitation of the proposed method is its high time complexity. This is because it requires the construction of a pairwise graph. Meanwhile, solving the rank-constraint problem on this graph is also computationally expensive. A potential solution to address this issue is to introduce the anchor strategy. By constructing a bipartite graph between the data points and anchors, the computational complexity of the method could be significantly reduced.

5. Conclusions and Future Works

In this paper, we propose a novel graph-based multi-view clustering method named ES-MVC. Unlike traditional multi-view clustering methods, ES-MVC leverages robust initial similarity matrices to mitigate the adverse effects of noisy data. By uniquely integrating both global and local manifold structures, it captures more complementary information. Furthermore, we introduce a dynamic adjustment of the graph structure through a rank constraint, which best reflects the underlying data distribution and facilitates adaptive and optimal neighbor assignments. To efficiently solve the objective function, we have developed an iterative algorithm and proven its convergence. Notably, this algorithm demonstrates exceptionally quick convergence, typically settling within fewer than five iterations on various datasets, which is a clear advantage for handling large-scale datasets. The ultimate significance of our method lies in enabling scalable unsupervised learning for real-world data fusion scenarios, where multi-view data are pervasive but often suffer from noise, incomplete structures, and high computational costs. By effectively combining both local and global information, ES-MVC not only improves clustering accuracy but also offers a practical and robust solution for multi-view data analysis. Extensive experimental results further demonstrate that ES-MVC outperforms existing state-of-the-art multi-view clustering methods.
In future practical applications, data in real-world scenarios are often multi-source and heterogeneous, where the same object can have multi-view or multi-modal representations. Our proposed method is well-suited for such settings, enabling robust data fusion and improved clustering performance. For example, it could be applied in generating pseudo-labels for large-scale model training, enhancing training efficiency by leveraging complementary information from different data sources. Additionally, it could be useful for clustering users in social networks, improving recommendation system accuracy by better capturing the underlying structure of user preferences across various views. We also plan to extend the proposed method to achieve new research breakthroughs on non-specific datasets, particularly in the context of ubiquitous computing and real-time processing. By addressing challenges such as scalability and real-time performance, we aim to enhance the practicality and applicability of our method in real-world settings. Furthermore, we will explore potential optimizations, such as leveraging anchor-based techniques or distributed computing frameworks, to reduce computational complexity and facilitate its deployment in large-scale applications.
While ES-MVC exhibits promising performance, several avenues remain open for future exploration.
First, the current framework assumes linear relationships within each view. Extending it to capture non-linear manifold structures—such as through deep graph learning or kernel-based methods—could further improve clustering accuracy on complex datasets.
Second, we plan to broaden the applicability of our approach to multi-modal and heterogeneous data, adapting it according to diverse data collection methods. Specifically, in real-world applications, ensuring that image data collected from different views meets the clustering requirements, particularly in terms of viewing angles and sizes, is crucial. To address this, we will explore methods such as aligning data from different views at either the feature or sample level to mitigate inter-view heterogeneity. Additionally, we will investigate data augmentation techniques (e.g., rotation, scaling, cropping) to overcome issues related to insufficient data and improve clustering performance. These techniques will be systematically evaluated in future experiments to determine their effectiveness in enhancing the robustness and accuracy of our clustering model.
Third, since our model is designed for clustering based on batch data by analyzing structural characteristics, it is currently unsuitable for real-time processing of individual samples. Batch processing inevitably incurs certain time costs, limiting its use in time-sensitive applications. To overcome this limitation, future work will focus on optimizing the model’s time complexity to better support real-time analysis scenarios.

Author Contributions

Conceptualization, P.W.; methodology, D.Z. and Q.L.; writing—original draft preparation, D.Z.; writing—review and editing, Q.L.; supervision, P.W.; funding acquisition, Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Natural Science Foundation of Guangdong Province under Grant 2023A1515011845.

Data Availability Statement

Publicly available datasets analyzed in this study can be found here: Caltech101, https://tensorflow.google.cn/datasets/catalog/caltech101; MSRC-v1, https://mldta.com/dataset/msrc-v1/; BDGP, https://github.com/IMKBLE/iCmSC; HW, https://archive.ics.uci.edu/dataset/72/multiple+features; BBCSport, http://mlg.ucd.ie/datasets/bbc.html; Volunteer, https://zenodo.org/records/11218434, all links were accessed on 12 June 2025.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Zhao, J.; Xie, X.; Xu, X.; Sun, S. Multi-view learning overview: Recent progress and new challenges. Inf. Fusion 2017, 38, 43–54. [Google Scholar] [CrossRef]
  2. Sun, S. A survey of multi-view machine learning. Neural Comput. Appl. 2013, 23, 2031–2038. [Google Scholar] [CrossRef]
  3. Ng, A.Y.; Jordan, M.I.; Weiss, Y. On spectral clustering: Analysis and an algorithm. In Proceedings of the NIPS, Vancouver, BC, Canada, 9–14 December 2002; pp. 849–856. [Google Scholar]
  4. Duan, Y.; Nie, F.; Wang, R.; Li, X. Harmonic cut: An efficient and directly solved balanced graph clustering. Neurocomputing 2024, 578, 127381. [Google Scholar] [CrossRef]
  5. Sun, J.; Bi, J.; Kranzler, H.R. Multi-view biclustering for genotype-phenotype association studies of complex diseases. In Proceedings of the IEEE BIBM, Shanghai, China, 18–21 December 2013; pp. 316–321. [Google Scholar]
  6. Jing, X.Y.; Wu, F.; Dong, X.; Shan, S.; Chen, S. Semi-Supervised Multi-View Correlation Feature Learning with Application to Webpage Classification. In Proceedings of the AAAI, San Francisco, CA, USA, 4–9 February 2017; pp. 1374–1381. [Google Scholar]
  7. Tsivtsivadze, E.; Borgdorff, H.; van de Wijgert, J.; Schuren, F.; Verhelst, R.; Heskes, T. Neighborhood co-regularized multi-view spectral clustering of microbiome data. In Proceedings of the IAPR, Kyoto, Japan, 20–23 May 2013; pp. 80–90. [Google Scholar]
  8. Chao, G.; Sun, S.; Bi, J. A survey on multi-view clustering. arXiv 2017, arXiv:1712.06246. [Google Scholar]
  9. Xu, Y.M.; Wang, C.D.; Lai, J.H. Weighted multi-view clustering with feature selection. Pattern Recognit. 2016, 53, 25–35. [Google Scholar] [CrossRef]
  10. Lu, C.; Yan, S.; Lin, Z. Convex sparse spectral clustering: Single-view to multi-view. IEEE Trans. Image Process. 2016, 25, 2833–2843. [Google Scholar] [CrossRef]
  11. Zhao, X.; Evans, N.; Dugelay, J.L. A subspace co-training framework for multi-view clustering. Pattern Recognit. Lett. 2014, 41, 73–82. [Google Scholar] [CrossRef]
  12. Ye, Y.; Liu, X.; Yin, J.; Zhu, E. Co-regularized kernel k-means for multi-view clustering. In Proceedings of the IEEE ICPR, Cancun, Mexico, 4–8 December 2016; pp. 1583–1588. [Google Scholar]
  13. Tan, J.; Yang, Z.; Cheng, Y.; Ye, J.; Dai, Q. SRAGL-AWCL: A Two-step Multi-view Clustering via Sparse Representation and Adaptive Weighted Cooperative Learning. Pattern Recognit. 2021, 117, 107987. [Google Scholar] [CrossRef]
  14. Li, J.; Zhao, H.; Tao, Z.; Fu, Y. Large-scale Subspace Clustering by Fast Regression Coding. In Proceedings of the IJCAI, Melbourne, Australia, 19–25 August 2017; pp. 2138–2144. [Google Scholar]
  15. Cao, X.; Zhang, C.; Fu, H.; Liu, S.; Zhang, H. Diversity-induced multi-view subspace clustering. In Proceedings of the IEEE CVPR, Boston, MA, USA, 7–12 June 2015; pp. 586–594. [Google Scholar]
  16. Zhao, J.; Lu, G. Clean affinity matrix learning with rank equality constraint for multi-view subspace clustering. Pattern Recognit. 2023, 134, 109118. [Google Scholar] [CrossRef]
  17. Yin, Q.; Wu, S.; He, R.; Wang, L. Multi-view clustering via pairwise sparse subspace representation. Neurocomputing 2015, 156, 12–21. [Google Scholar] [CrossRef]
  18. Liu, J.; Wang, C.; Gao, J.; Han, J. Multi-view clustering via joint nonnegative matrix factorization. In Proceedings of the IEEE ICDM, Dallas, TX, USA, 7–10 December 2013; pp. 252–260. [Google Scholar]
  19. Zong, L.; Zhang, X.; Zhao, L.; Yu, H.; Zhao, Q. Multi-view clustering via multi-manifold regularized non-negative matrix factorization. Neural Netw. 2017, 88, 74–89. [Google Scholar] [CrossRef]
  20. Yang, W.; Wang, Y.; Tang, C.; Tong, H.; Wei, A.; Wu, X. One step multi-view spectral clustering via joint adaptive graph learning and matrix factorization. Neurocomputing 2023, 524, 95–105. [Google Scholar] [CrossRef]
  21. Akata, Z.; Thurau, C.; Bauckhage, C. Non-negative matrix factorization in multimodality data for segmentation and label prediction. In Proceedings of the Computer Vision Winter Workshop, Colorado Springs, CO, USA, 20–25 June 2011. [Google Scholar]
  22. Yu, S.; Falck, T.; Daemen, A.; Tranchevent, L.C.; Suykens, J.A.; De Moor, B.; Moreau, Y. L2-norm multiple kernel learning and its application to biomedical data fusion. BMC Bioinform. 2010, 11, 309. [Google Scholar] [CrossRef]
  23. Kumar, A.; Rai, P.; Daume, H. Co-regularized multi-view spectral clustering. In Proceedings of the NIPS, Granada, Spain, 12–14 December 2011; pp. 1413–1421. [Google Scholar]
  24. Li, X.; Ren, Z.; Sun, Q.; Xu, Z. Auto-weighted Tensor Schatten p-Norm for Robust Multi-view Graph Clustering. Pattern Recognit. 2023, 134, 109083. [Google Scholar] [CrossRef]
  25. Duan, Y.; Wu, D.; Wang, R.; Li, X.; Nie, F. Scalable and parameter-free fusion graph learning for multi-view clustering. Neurocomputing 2024, 597, 128037. [Google Scholar] [CrossRef]
  26. Cai, D.; He, X.; Han, J. Document clustering using locality preserving indexing. IEEE Trans. Knowl. Data Eng. 2005, 17, 1624–1637. [Google Scholar] [CrossRef]
  27. Tang, W.; Lu, Z.; Dhillon, I.S. Clustering with multiple graphs. In Proceedings of the IEEE ICDM, Miami, FL, USA, 6–9 December 2009; pp. 1016–1021. [Google Scholar]
  28. Nie, F.; Li, J.; Li, X. Parameter-Free Auto-Weighted Multiple Graph Learning: A Framework for Multiview Clustering and Semi-Supervised Classification. In Proceedings of the IJCAI, New York, NY, USA, 9–15 July 2016; pp. 1881–1887. [Google Scholar]
  29. Nie, F.; Cai, G.; Li, X. Multi-View Clustering and Semi-Supervised Classification with Adaptive Neighbours. In Proceedings of the AAAI, San Francisco, CA, USA, 4–9 February 2017; pp. 2408–2414. [Google Scholar]
  30. Yang, M.S.; Hussain, I. Unsupervised multi-view K-means clustering algorithm. IEEE Access 2023, 11, 13574–13593. [Google Scholar] [CrossRef]
  31. Hussain, I.; Sinaga, K.P.; Yang, M.S. Unsupervised multiview fuzzy c-means clustering algorithm. Electronics 2023, 12, 4467. [Google Scholar] [CrossRef]
  32. Hussain, I.; Nataliani, Y.; Ali, M.; Hussain, A.; Mujlid, H.M.; Almaliki, F.A.; Rahimi, N.M. Weighted Multiview K-Means Clustering with L2 Regularization. Symmetry 2024, 16, 1646. [Google Scholar] [CrossRef]
  33. Xia, R.; Pan, Y.; Du, L.; Yin, J. Robust Multi-View Spectral Clustering via Low-Rank and Sparse Decomposition. In Proceedings of the AAAI, Québec City, QC, Canada, 27–31 July 2014; pp. 2149–2155. [Google Scholar]
  34. Zhang, C.; Chen, L.; Shi, Z.; Ding, W. Latent information-guided one-step multi-view fuzzy clustering based on cross-view anchor graph. Inf. Fusion 2024, 102, 102025. [Google Scholar] [CrossRef]
  35. Wang, X.; Qian, B.; Ye, J.; Davidson, I. Multi-objective multi-view spectral clustering via pareto optimization. In Proceedings of the IEEE ICDM, Dallas, TX, USA, 7–10 December 2013; pp. 234–242. [Google Scholar]
  36. Li, X.; Chen, M.; Nie, F.; Wang, Q. A Multiview-Based Parameter Free Framework for Group Detection. In Proceedings of the AAAI, San Francisco, CA, USA, 4–9 February 2017; pp. 4147–4153. [Google Scholar]
  37. Nie, F.; Li, J.; Li, X. Self-weighted Multiview Clustering with Multiple Graphs. In Proceedings of the IJCAI, Melbourne, Australia, 19–25 August 2017; pp. 2564–2570. [Google Scholar]
  38. Nie, F.; Wang, X.; Jordan, M.I.; Huang, H. The Constrained Laplacian Rank Algorithm for Graph-Based Clustering. In Proceedings of the AAAI, Phoenix, AZ, USA, 12–17 February 2016; pp. 1969–1976. [Google Scholar]
  39. Nie, F.; Wang, X.; Huang, H. Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 977–986. [Google Scholar]
  40. Fan, K. On a theorem of Weyl concerning eigenvalues of linear transformations II. Proc. Natl. Acad. Sci. USA 1950, 36, 31–35. [Google Scholar] [CrossRef] [PubMed]
  41. Wang, X.; Guo, X.; Lei, Z.; Zhang, C.; Li, S.Z. Exclusivity-consistency regularized multi-view subspace clustering. In Proceedings of the IEEE CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 923–931. [Google Scholar]
  42. Estévez, P.A.; Tesmer, M.; Perez, C.A.; Zurada, J.M. Normalized mutual information feature selection. IEEE Trans. Neural Netw. 2009, 20, 189–201. [Google Scholar] [CrossRef]
  43. Varshavsky, R.; Linial, M.; Horn, D. Compact: A comparative package for clustering assessment. In Proceedings of the ISPA, Nanjing, China, 2–5 November 2005; pp. 159–167. [Google Scholar]
  44. Wu, J.; Rehg, J.M. Where am I: Place instance and category recognition using spatial PACT. In Proceedings of the IEEE CVPR, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  45. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE CVPR, San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  46. Oliva, A.; Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 2001, 42, 145–175. [Google Scholar] [CrossRef]
  47. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [Google Scholar] [CrossRef]
  48. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  49. Le, Q.; Mikolov, T. Distributed Representations of Sentences and Documents. arXiv 2014, arXiv:1405.4053v2. [Google Scholar] [CrossRef]
  50. Jones, K.S. A statistical interpretation of term specificity and its application in retrieval. J. Doc. 2004, 60, 493–502. [Google Scholar] [CrossRef]
  51. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K. Indexing by Latent Semantic Analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407. [Google Scholar] [CrossRef]
  52. Winn, J.; Jojic, N. Locus: Learning object classes with unsupervised segmentation. In Proceedings of the IEEE ICCV, Beijing, China, 17–20 October 2005; Volume 1, pp. 756–763. [Google Scholar]
  53. Fei-Fei, L.; Fergus, R.; Perona, P. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Comput. Vis. Image Underst. 2007, 106, 59–70. [Google Scholar] [CrossRef]
  54. van Breukelen, M.; Duin, R.P.; Tax, D.M.; Den Hartog, J. Handwritten digit recognition by combined classifiers. Kybernetika 1998, 34, 381–386. [Google Scholar]
  55. Cai, X.; Wang, H.; Huang, H.; Ding, C. Joint stage recognition and anatomical annotation of drosophila gene expression patterns. Bioinformatics 2012, 28, i16–i24. [Google Scholar] [CrossRef] [PubMed]
  56. Greene, D.; Cunningham, P. Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006. [Google Scholar]
Figure 1. The general idea of our method. Six samples of the cup class from the Caltech101 dataset are shown, with noise added to the first image (labeled “1”). Four views of these samples are then used to perform the clustering task. The common graph is learned directly from the original data features; the optimal graph is learned from the similarity matrix of each view.
Figure 2. Frameworks of the general graph-based multi-view clustering strategy and the proposed method ES-MVC.
Figure 3. Cluster visualization on the synthetic dataset.
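For readers who want to reproduce this kind of cluster visualization, a minimal sketch follows. It uses scikit-learn’s two-moons generator as a stand-in synthetic dataset (the paper’s own synthetic data are not reproduced here) and colors each point by its spectral-clustering assignment:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Stand-in synthetic data with two entangled clusters.
X, _ = make_moons(n_samples=400, noise=0.06, random_state=0)

# Spectral clustering on a k-NN affinity graph, as is typical for
# graph-based methods; the learned labels color the scatter plot.
labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                            n_neighbors=10, random_state=0).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, s=12, cmap="coolwarm")
plt.title("Cluster assignment on synthetic data")
plt.show()
```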
Figure 4. Image samples of airplanes, dolphins, and umbrellas in the Caltech101 dataset.
Figure 5. Image samples of digits in the HW dataset.
Figure 6. Depiction of the similarity matrices for the MSRC-v1 dataset. (a) Local similarity matrix of view 1; (b) local similarity matrix of view 2; (c) local similarity matrix of view 3; (d) global similarity matrix of all the views; (e) local similarity matrix of all the views; (f) joint global and local similarity matrix of all the views. If an element of the matrix is nonzero, the corresponding point in the picture is blue; otherwise, it is white.
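Plots of this kind can be produced with matplotlib’s spy, which draws a marker at every nonzero entry of a matrix. The sketch below assumes an already-learned similarity matrix S and substitutes a random sparse stand-in for it:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for a learned n x n similarity matrix; in Figure 6 each panel
# would instead use the local, global, or fused graph of the method.
rng = np.random.default_rng(0)
S = rng.random((210, 210))
S[S < 0.95] = 0.0  # sparsify: keep only the strongest ~5% of entries

# spy() marks every nonzero entry (blue here, white elsewhere), matching
# the convention described in the caption.
plt.spy(S, markersize=1, color="blue")
plt.title("Sparsity pattern of a similarity matrix")
plt.show()
```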
Figure 7. The value of parameter α vs. ACC results on the MSRC-v1 dataset.
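The sensitivity study behind Figure 7 amounts to sweeping α over a grid and recording ACC at each value. A minimal sketch of that loop follows; es_mvc(views, alpha) and acc(y_true, y_pred) are hypothetical callables standing in for the paper’s solver and the accuracy metric, not functions from any public library:

```python
import numpy as np

def sweep_alpha(views, y_true, es_mvc, acc, alphas=np.logspace(-3, 3, 7)):
    """Run the (hypothetical) solver once per candidate alpha and collect
    ACC; the resulting alpha -> ACC map is what Figure 7 plots."""
    return {a: acc(y_true, es_mvc(views, alpha=a)) for a in alphas}
```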
Figure 8. The convergence curve of our method vs. the number of iterations.
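Curves such as Figure 8 come from recording the objective value after each round of alternating updates and stopping once the relative change is negligible. A generic sketch, with step() standing in for one full update of the method’s variables (an assumption, not the paper’s code):

```python
def run_until_converged(step, tol=1e-6, max_iter=50):
    """step() performs one alternating-optimization round and returns the
    current objective value; the returned history is the convergence curve."""
    history = []
    for _ in range(max_iter):
        history.append(step())
        # Stop when the relative decrease of the objective falls below tol.
        if len(history) > 1 and abs(history[-2] - history[-1]) <= tol * max(1.0, abs(history[-2])):
            break
    return history
```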
Table 1. The differences between existing methods.

Methods | Robust Similarity Matrix | Global Structure | Local Structure | Weighted Fusion
Co-training [22] | No | No | Yes | No
Co-regularized [26] | No | No | Yes | No
LMF [27] | No | Yes | No | No
RMSC [33] | Yes | Yes | No | No
AMGL [28] | No | Yes | No | Yes
MLAN [29] | No | Yes | No | Yes
MPF [36] | No | Yes | Yes | Yes
PwMC [37] | Yes | Yes | No | No
ES-MVC | Yes | Yes | Yes | Yes
Table 2. Clustering performance on the MSRC-v1 dataset.

Methods | ACC | NMI | Purity
SC1 | 0.3244 ± 0.04 | 0.2740 ± 0.03 | 0.3878 ± 0.05
SC2 | 0.5608 ± 0.05 | 0.5077 ± 0.04 | 0.5995 ± 0.06
SC3 | 0.6311 ± 0.08 | 0.5954 ± 0.07 | 0.6949 ± 0.05
SC4 | 0.4699 ± 0.07 | 0.4225 ± 0.04 | 0.5196 ± 0.03
SC5 | 0.5611 ± 0.03 | 0.4747 ± 0.02 | 0.5914 ± 0.05
ConcatSC | 0.6769 ± 0.05 | 0.6539 ± 0.05 | 0.7219 ± 0.08
ESMSC | 0.6953 ± 0.02 | 0.6867 ± 0.01 | 0.7575 ± 0.03
RMSC | 0.6751 ± 0.06 | 0.6187 ± 0.05 | 0.7147 ± 0.07
AMGL | 0.6875 ± 0.06 | 0.6587 ± 0.08 | 0.7212 ± 0.05
MLAN | 0.6811 ± 0.02 | 0.6297 ± 0.02 | 0.7331 ± 0.02
OMVFC-LICAG | 0.7284 ± 0.03 | 0.5907 ± 0.02 | 0.7384 ± 0.03
Ours | 0.7643 ± 0.08 | 0.7275 ± 0.06 | 0.8125 ± 0.04
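The three metrics reported in Tables 2–7 (ACC, NMI, purity) are standard clustering measures. For reference, one conventional way to compute them in Python is sketched below: ACC matches cluster ids to class ids with the Hungarian algorithm before scoring, purity credits each cluster with its majority class, and NMI (as defined in [42]) is available directly from scikit-learn. This is a common implementation, not code from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    """ACC: find the best one-to-one mapping between cluster ids and class
    ids (Hungarian algorithm), then score as ordinary accuracy.
    Labels are assumed to be integer-coded starting at 0."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    contingency = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        contingency[t, p] += 1
    rows, cols = linear_sum_assignment(-contingency)  # maximize matched counts
    return contingency[rows, cols].sum() / y_true.size

def purity(y_true, y_pred):
    """Purity: each predicted cluster is credited with its majority class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return sum(np.bincount(y_true[y_pred == c]).max()
               for c in np.unique(y_pred)) / y_true.size

# NMI is available directly:
# nmi = normalized_mutual_info_score(y_true, y_pred)
```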
Table 3. Clustering performance on the Caltech101 dataset.

Methods | ACC | NMI | Purity
SC1 | 0.4821 ± 0.04 | 0.3660 ± 0.04 | 0.5577 ± 0.04
SC2 | 0.5601 ± 0.04 | 0.4378 ± 0.05 | 0.6347 ± 0.04
SC3 | 0.4611 ± 0.03 | 0.2755 ± 0.03 | 0.4915 ± 0.03
ConcatSC | 0.5357 ± 0.03 | 0.4525 ± 0.04 | 0.6178 ± 0.05
ESMSC | 0.5551 ± 0.03 | 0.4687 ± 0.03 | 0.6377 ± 0.05
RMSC | 0.5005 ± 0.04 | 0.3416 ± 0.02 | 0.5544 ± 0.05
AMGL | 0.4615 ± 0.05 | 0.3274 ± 0.06 | 0.5255 ± 0.02
MLAN | 0.5531 ± 0.02 | 0.4787 ± 0.02 | 0.6487 ± 0.01
OMVFC-LICAG | 0.4512 ± 0.02 | 0.3612 ± 0.04 | 0.5283 ± 0.06
Ours | 0.6097 ± 0.04 | 0.5132 ± 0.03 | 0.6902 ± 0.03
Table 4. Clustering performance on the HW dataset.

Methods | ACC | NMI | Purity
SC1 | 0.6315 ± 0.07 | 0.6626 ± 0.04 | 0.6815 ± 0.06
SC2 | 0.6152 ± 0.10 | 0.6847 ± 0.05 | 0.6824 ± 0.08
SC3 | 0.7505 ± 0.08 | 0.8180 ± 0.06 | 0.8131 ± 0.09
SC4 | 0.8495 ± 0.05 | 0.8567 ± 0.08 | 0.8725 ± 0.08
SC5 | 0.6109 ± 0.07 | 0.6068 ± 0.05 | 0.6601 ± 0.05
SC6 | 0.4049 ± 0.04 | 0.4771 ± 0.05 | 0.4541 ± 0.08
ConcatSC | 0.8270 ± 0.15 | 0.8690 ± 0.07 | 0.8649 ± 0.09
ESMSC | 0.7399 ± 0.01 | 0.7968 ± 0.02 | 0.7769 ± 0.02
RMSC | 0.8625 ± 0.02 | 0.8105 ± 0.04 | 0.8715 ± 0.04
AMGL | 0.8336 ± 0.03 | 0.8568 ± 0.03 | 0.8405 ± 0.05
MLAN | 0.9731 ± 0.01 | 0.9387 ± 0.01 | 0.9731 ± 0.02
OMVFC-LICAG | 0.8372 ± 0.03 | 0.7981 ± 0.02 | 0.8381 ± 0.04
Ours | 0.9745 ± 0.04 | 0.9407 ± 0.04 | 0.9739 ± 0.02
Table 5. Clustering performance on the BDGP dataset.

Methods | ACC | NMI | Purity
SC1 | 0.4694 ± 0.02 | 0.2757 ± 0.02 | 0.4740 ± 0.02
SC2 | 0.3893 ± 0.02 | 0.2057 ± 0.09 | 0.3907 ± 0.02
ConcatSC | 0.4747 ± 0.08 | 0.2962 ± 0.03 | 0.4871 ± 0.02
ESMSC | 0.9191 ± 0.02 | 0.8535 ± 0.05 | 0.9192 ± 0.02
RMSC | 0.4585 ± 0.01 | 0.2672 ± 0.03 | 0.4584 ± 0.04
AMGL | 0.6032 ± 0.02 | 0.6102 ± 0.05 | 0.6855 ± 0.06
MLAN | 0.9501 ± 0.02 | 0.8874 ± 0.01 | 0.9505 ± 0.02
OMVFC-LICAG | 0.7754 ± 0.05 | 0.6941 ± 0.04 | 0.7752 ± 0.03
Ours | 0.9688 ± 0.02 | 0.9132 ± 0.03 | 0.9687 ± 0.03
Table 6. The average ACC, NMI, and purity (± std) of each method on the BBCSport dataset.

Methods | ACC | NMI | Purity
SC1 | 0.4325 ± 0.02 | 0.1485 ± 0.04 | 0.4642 ± 0.03
SC2 | 0.5573 ± 0.01 | 0.3444 ± 0.05 | 0.6230 ± 0.09
SC3 | 0.7094 ± 0.03 | 0.6204 ± 0.08 | 0.7299 ± 0.05
ConcatSC | 0.7132 ± 0.07 | 0.6215 ± 0.03 | 0.7228 ± 0.05
ESMSC | 0.7375 ± 0.05 | 0.6420 ± 0.08 | 0.7387 ± 0.08
RMSC | 0.6788 ± 0.09 | 0.5672 ± 0.05 | 0.7275 ± 0.05
AMGL | 0.7165 ± 0.07 | 0.5825 ± 0.06 | 0.7460 ± 0.03
MLAN | 0.7365 ± 0.04 | 0.6277 ± 0.02 | 0.7555 ± 0.05
OMVFC-LICAG | 0.5172 ± 0.07 | 0.5590 ± 0.02 | 0.3101 ± 0.06
Ours | 0.7855 ± 0.09 | 0.6686 ± 0.07 | 0.7997 ± 0.08
Table 7. Clustering performance on the Volunteer dataset.

Methods | ACC | NMI | Purity
SC1 | 0.3695 ± 0.02 | 0.0123 ± 0.02 | 0.3864 ± 0.05
SC2 | 0.3652 ± 0.02 | 0.0080 ± 0.03 | 0.3768 ± 0.04
SC3 | 0.3745 ± 0.05 | 0.0167 ± 0.06 | 0.3909 ± 0.06
SC4 | 0.3751 ± 0.06 | 0.0208 ± 0.03 | 0.3785 ± 0.08
SC5 | 0.3731 ± 0.02 | 0.0254 ± 0.02 | 0.3925 ± 0.05
SC6 | 0.3774 ± 0.03 | 0.0121 ± 0.03 | 0.3774 ± 0.05
ConcatSC | 0.3757 ± 0.06 | 0.0382 ± 0.04 | 0.3908 ± 0.07
ESMSC | 0.3973 ± 0.02 | 0.0381 ± 0.02 | 0.4013 ± 0.05
RMSC | 0.3755 ± 0.02 | 0.0373 ± 0.02 | 0.3961 ± 0.08
AMGL | 0.3843 ± 0.08 | 0.0199 ± 0.04 | 0.3852 ± 0.04
MLAN | 0.3852 ± 0.02 | 0.0211 ± 0.01 | 0.3877 ± 0.02
OMVFC-LICAG | 0.3011 ± 0.03 | 0.2552 ± 0.02 | 0.2984 ± 0.03
Ours | 0.4533 ± 0.02 | 0.0456 ± 0.05 | 0.4767 ± 0.04
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

