Article

Deep Multi-View Clustering Based on Reconstructed Self-Expressive Matrix

Automation Department, School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(15), 8791; https://doi.org/10.3390/app13158791
Submission received: 15 June 2023 / Revised: 21 July 2023 / Accepted: 27 July 2023 / Published: 29 July 2023
(This article belongs to the Special Issue Machine Intelligence and Networked Systems)

Abstract

Deep multi-view subspace clustering is a powerful unsupervised learning technique for clustering multi-view data and has attracted significant attention in recent years. However, most current multi-view clustering methods rely on learning self-expressive layers to obtain the final clustering results, where the size of the self-expressive matrix grows quadratically with the number of input data points, making it difficult to handle large-scale datasets. Moreover, since multiple views are rich in information, both the consistency and the specificity of the input views need to be considered. To solve these problems, we propose a novel deep multi-view clustering approach based on the reconstructed self-expressive matrix (DCRSM). We use a reconstruction module to approximate the self-expressive coefficients using only a small number of training samples, whereas the conventional self-expressive model must be trained on the entire dataset. We also use shared layers and specific layers to integrate the consistent and specific information of the features and to fuse information between views. The proposed DCRSM is extensively evaluated on multiple datasets, including Fashion-MNIST, COIL-20, COIL-100, and YTF. The experimental results demonstrate its superiority over several existing multi-view clustering methods, achieving an improvement between 1.94% and 4.2% in accuracy and a maximum improvement of 4.5% in NMI across different datasets. DCRSM also yields competitive results even when trained on only 50% of the samples in each dataset.

1. Introduction

In recent years, with the widespread availability of advanced technologies and the accumulation of enormous amounts of data, multi-view data analysis has garnered great attention. Multi-view data refers to datasets in which multiple sets of features or modalities are available for each sample, providing information from different perspectives. Generally, each view of these datasets can be represented by a distinct descriptor, such as color, edges, or texture. Multi-view data analysis is crucial in various fields, including recommender systems, medical diagnosis, image segmentation [1,2,3], image search result organization (ISRO) [4] and the clustering of skew-distributed datasets [5]. However, when a dataset is high-dimensional and contains a large number of samples, many clustering methods become limited when applied to these applications.
The core idea behind multi-view subspace clustering is to capture consistent underlying representations of diverse views and group these data into distinct clusters. Recently, many multi-view clustering methods have been developed. Existing work on multi-view subspace clustering can be divided into several categories, i.e., non-negative matrix factorization (NMF)-based methods [6,7,8,9], collaborative clustering methods [10,11,12], co-training methods [13,14,15], self-expressive-based methods [16,17] and deep-learning-based methods [18,19]. NMF-based methods aim to obtain a partitioning of the data through a low-rank decomposition of the data matrix and have proved effective when the subspaces of the data points are independent of each other. Co-training is a semi-supervised learning framework in which two or more classifiers are trained on different subsets of features or views and collaboratively learn from each other by iteratively updating and refining their predictions. Since the Deep Subspace Clustering Network (DSC-Net) [20] was proposed in 2017, several methods combining the self-expressive methodology with deep learning have been developed. DSC-Net comprises a convolutional autoencoder architecture [21], in which the encoder learns compact and meaningful representations of the input images and the decoder learns to reconstruct the input images from these latent features. Meanwhile, by incorporating the self-expressive layer, DSC-Net goes beyond traditional clustering approaches that rely on predefined distance metrics or assumptions about the data distribution. Instead, it learns the intrinsic relationships directly from the data and acquires the affinity matrix used for spectral clustering, enabling more accurate clustering results.
While the multi-view model based on self-expression has demonstrated impressive performance and offers theoretical guarantees of correctness, it is hindered by two significant drawbacks: (1) the computational complexity associated with solving a self-expressive matrix of size $N \times N$ makes it difficult to deal with large-scale datasets; (2) such models rarely consider the problem of fusing local and global representations, which is important for mining the inherent relations between different views.
To overcome these challenges, a new approach to deep multi-view subspace clustering based on reconstructing the self-expressive matrix (DCRSM) is proposed in this paper. We aim to reconstruct a set of self-expressive matrices from the latent features extracted by deep autoencoders. Crucially, the number of network parameters does not need to match the number of samples in the dataset, which allows the network to handle large-scale data readily and effectively. At the same time, we design a fusion network that learns global and local information from the extracted features and reconstructs the self-expressive matrices, integrating the common and specific information of different views simultaneously. Our research is of practical importance: since cluster-based methods for image retrieval may struggle to trade off clustering quality against efficiency in large-scale scenarios, our approach may inspire further research in the field of ISRO on integrating features of high-dimensional data and training with small batches of data to improve clustering efficiency.
The main contributions of this paper are summarized as follows:
  • We present a new deep multi-view subspace clustering approach that goes beyond the conventional self-expressive model and removes the limitation of using the whole dataset as a single batch during training. By reconstructing the self-expressive coefficients from a small amount of data, our method can scale to arbitrarily large datasets.
  • Given the intrinsic relationships between the samples, our approach uses shared layers and specific layers to integrate the local and global representations of each sample, which leverages both common and specific features simultaneously and explicitly and enables fuller use of the information between multiple views.
  • We conduct extensive experiments on Fashion-MNIST, COIL-20, COIL-100, and YTF datasets. The results show that our proposed approach has the best performance compared with several multi-view subspace clustering methods.
The remaining sections of this paper are structured as follows: Section 2 provides an overview of the related work in the field of subspace clustering. Section 3 introduces our proposed network, loss function and algorithm training procedure. Section 4 describes the experimental datasets, analyses the quantitative results of different methods and visualizes the clustering performance. Section 5 concludes the paper with discussion and summary.

2. Related Work

2.1. Subspace Clustering

The k-means [22] and GMM [23] algorithms have been widely used to solve clustering problems because of their computational efficiency. However, it is challenging to achieve satisfactory results when applying these methods to high-dimensional data. The concept of self-expression is that any data point in a dataset can be represented as a linear combination of the other data points in that dataset [24]. Conventional subspace clustering methods based on the idea of self-expression, such as SSC [24], LRR [25], SSSC [26] and KSSC [27], are designed to build an affinity matrix for spectral clustering, while DSC-Net [20] incorporates a self-expressive layer into an autoencoder network, which improves the clustering results significantly. Zhou et al. [28] designed a network called DASC, which incorporates the idea of generative adversarial networks [29] to guide the learning process and improves the accuracy of the clustering results. To improve the feature-representation ability, Valanarasu et al. [30] proposed a fusion of an overcomplete convolutional encoder and an undercomplete convolutional encoder to extract more meaningful representations. Some methods add priors [31,32] or redesign constraints and regularizations [33,34,35] for the network. On the other hand, a few works aim to optimize clustering representation learning [36,37,38,39,40] to jointly optimize feature learning and clustering.

2.2. Scalable Subspace Clustering

With the exponential increase in data volume, large-scale subspace clustering has become a research hotspot. A few researchers have proposed clustering methods that extend to large-scale single-view data. Several methods are based on dictionary learning [41,42,43,44], which constructs a dictionary and uses its atoms to represent each data point. Inspired by the idea of learning a subspace dictionary to express the data points, Zhang et al. [45] presented a framework that jointly trains a neural network and approximates the self-expressive coefficients in order to scale to large datasets. Cai et al. [46] proposed a network that learns the latent representations and subspace bases simultaneously, so that the clustering outcomes can be inferred from the learned subspace bases. Zhai et al. [47] extracted hierarchical features from single-view images and designed an attentive module for each feature map to enhance the feature representations. However, these approaches mainly target single-view subspace clustering and find it difficult to balance accuracy and efficiency. In multi-view clustering, we can process features from different views and design modules to enhance extensibility.

2.3. Deep Multi-View Clustering

Deep multi-view clustering methods aim to leverage data obtained from multiple modalities to uncover the inherent clustering structure. Some approaches are based on traditional clustering methods. Liu et al. [48] proposed a centroids-guided clustering method that uses view-specific cluster centroids as guiding signals for deep representation learning. This approach incorporates information from different perspectives but does not take into account the common information between views, and it also struggles to cluster large-scale datasets. Other approaches can be divided into four categories: CCA-based methods [49,50,51], spectral-based methods [52], graph-based methods [53,54] and self-expressive-based methods [55,56,57,58,59,60]. The self-expressive-based methods have gained significant attention due to their ability to capture the intrinsic relationships within data and their good performance. Because of the richness of information contained in multi-view data, many approaches focus on the consistency and specificity of the data in each view. For instance, Zhu et al. [18] proposed a framework called MvDSCN, which consists of a diversity module and a universality module, and the final results are obtained by applying spectral clustering to the common affinity matrix. Wang et al. [55] designed a network called DMSC-UDL that learns local and global information through different self-expressive matrices, and the shared self-expressive matrix is used to obtain the final results. Wang et al. [59] proposed a network called SIB-MSC, which captures view-specific and view-common information using different autoencoders; they also introduced mutual-information regularization terms so that the view-specific spaces are complementary to the shared view spaces, and the view-common affinity matrices are used for spectral clustering. However, most of these methods mainly focus on how to obtain the commonness and specificity of each view and often face limitations when handling large-scale datasets.

3. Our Method

3.1. Self-Expressive Clustering

Given a dataset $Q = [q_1, q_2, \ldots, q_n] \in \mathbb{R}^{d \times n}$ consisting of $n$ samples drawn from different linear subspaces $S_i$, each of dimension $d_i$, every sample in $S_i$ can be represented as a linear combination of the other samples in that subspace. This is the self-expressive property of the data. The learning model for self-expressive subspace clustering generally takes the following form:
$$\min_{C} \ \frac{1}{2} \left\| Q - QC \right\|_F^2 + \lambda \left\| C \right\|_p \quad \text{s.t.} \quad \mathrm{diag}(C) = 0 \tag{1}$$
where $C \in \mathbb{R}^{n \times n}$ represents the self-expressive coefficient matrix. The constraint $\mathrm{diag}(C) = 0$ prevents the trivial solution $C = I$, where $I$ is the identity matrix. $\|C\|_p$ is a regularization term, and there are multiple ways to define the norm on $C$, such as the $\ell_1$-norm, the $\ell_2$-norm or the nuclear norm. By solving the above optimization problem, we obtain the affinity matrix $W = \frac{|C + C^{T}|}{2}$. Subsequently, the final clustering results are obtained by applying spectral clustering to the affinity matrix.
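To make this recipe concrete, the following is a minimal sketch (not the code used in this paper) of self-expressive clustering with an $\ell_2$-regularized surrogate of Equation (1): solve for $C$ in closed form, zero its diagonal as a post-hoc approximation of the constraint, symmetrize it into an affinity matrix $W$ and run spectral clustering. The function names are illustrative only.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def self_expressive_affinity(Q, lam=1e-2):
    """Solve min_C 0.5*||Q - QC||_F^2 + (lam/2)*||C||_F^2, a ridge surrogate of Eq. (1),
    in closed form, zero the diagonal as a post-hoc approximation of diag(C) = 0,
    and build the affinity W = |C + C^T| / 2."""
    n = Q.shape[1]
    G = Q.T @ Q                                   # Gram matrix of the n samples
    C = np.linalg.solve(G + lam * np.eye(n), G)   # ridge-regularized self-expression
    np.fill_diagonal(C, 0.0)                      # approximate the diag(C) = 0 constraint
    return np.abs(C + C.T) / 2.0                  # symmetric, non-negative affinity

def self_expressive_clustering(Q, n_clusters, lam=1e-2):
    W = self_expressive_affinity(Q, lam)
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed", random_state=0)
    return sc.fit_predict(W)

# Toy example: 200 points of dimension 64 grouped into 5 clusters.
labels = self_expressive_clustering(np.random.randn(64, 200), n_clusters=5)
```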

3.2. Network Architecture

For the multi-view subspace clustering task, we are given a dataset with $N$ data samples and $V$ views, with multi-view features $\{ q_i^{(v)} \}_{i=1}^{N} = Q^{(v)} \in \mathbb{R}^{D \times N}$, where $Q^{(v)}$ represents the $v$-th view of the original input data, and we aim to cluster these data points into $K$ clusters. As illustrated in Figure 1, the proposed network mainly consists of three parts: multi-view convolutional encoders, multi-view convolutional decoders and the reconstructing self-expressive matrix module.

3.2.1. Multi-View Convolutional Encoders

In the proposed method, each view of the original images has its own encoder to extract low-dimensional features, which helps to obtain different features for the various views. Let $\Theta_e^{(v)}$ denote the parameters of the multi-view encoders; the encoder transforms the input images $Q^{(v)}$ into the latent representations $Z^{(v)}$ through a non-linear mapping $E_v(Q^{(v)}; \Theta_e^{(v)})$, where $E_v$ denotes the encoding network parameterized by $\Theta_e^{(v)}$. Given the low-resolution datasets used in subspace clustering, employing larger convolutional kernels or deeper networks often results in the loss of fine-grained details, while smaller kernels may hinder the extraction of deep-level information. Therefore, we use three-layer encoders with [10, 20, 30] channels and kernel sizes of $[4 \times 4, 3 \times 3, 4 \times 4]$ to extract the representations of each view. We also employ the Mish activation function to ensure smoother gradients during training. This implementation is the same as in [55], which ensures the fairness of the comparison experiments.

3.2.2. Multi-View Convolutional Decoders

The multi-view convolutional decoders are the counterparts of the encoders for each view. The latent representations $Z^{(v)}$ are restored by the transposed convolutional decoders to the same size as the input space. Specifically, $\hat{Q}^{(v)} = D_v(Z^{(v)}; \Theta_d^{(v)})$, where $D_v$ represents the decoding network parameterized by $\Theta_d^{(v)}$. We use three-layer decoders with [30, 20, 10] channels and $[4 \times 4, 3 \times 3, 4 \times 4]$ kernels to reconstruct the inputs from the latent representations of each view.
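A minimal PyTorch sketch of one per-view autoencoder is given below. The channel counts, kernel sizes and Mish activation follow the description above; the strides and paddings are assumptions chosen so that the latent feature map is $\frac{M}{4} \times \frac{M}{4}$, as stated in Section 4.5.3, and the class name ViewAutoencoder is illustrative.

```python
import torch
import torch.nn as nn

class ViewAutoencoder(nn.Module):
    """One per-view convolutional autoencoder (a sketch; stride and padding choices
    are assumptions made so that the latent map is M/4 x M/4)."""
    def __init__(self, in_channels=1):
        super().__init__()
        act = nn.Mish()
        self.encoder = nn.Sequential(                    # channels [10, 20, 30]
            nn.Conv2d(in_channels, 10, kernel_size=4, stride=2, padding=1), act,
            nn.Conv2d(10, 20, kernel_size=3, stride=1, padding=1), act,
            nn.Conv2d(20, 30, kernel_size=4, stride=2, padding=1), act,
        )
        self.decoder = nn.Sequential(                    # channels [30, 20, 10]
            nn.ConvTranspose2d(30, 20, kernel_size=4, stride=2, padding=1), act,
            nn.ConvTranspose2d(20, 10, kernel_size=3, stride=1, padding=1), act,
            nn.ConvTranspose2d(10, in_channels, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, q):
        z = self.encoder(q)            # latent representation Z^(v)
        return self.decoder(z), z      # reconstruction Q_hat^(v) and Z^(v)

# One autoencoder per view, e.g. two views of 32x32 grayscale images (COIL-20).
autoencoders = nn.ModuleList([ViewAutoencoder(in_channels=1) for _ in range(2)])
```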

3.2.3. Reconstructing Self-Expressive Matrices

The idea of the traditional multi-view self-expressive model is to learn several self-expressive layers inserted between the encoders and decoders to simulate the self-expressive property and generate consistent subspace descriptions represented by a matrix $C \in \mathbb{R}^{N \times N}$. However, the number of parameters in the conventional network grows quadratically with the number of data points $N$, which limits its applicability to large-scale datasets, as the memory required to store an $N \times N$ matrix may become prohibitive.
As shown in Figure 2, we propose a module that reconstructs the self-expressive matrix from each representation of the input data and integrates the view-commonness and view-inconsistency using a shared layer and specific layers. In this way, the number of parameters is independent of the number of data points and can be adjusted flexibly based on the available memory resources. This allows our network to train the representation model of the self-expressive coefficients with a small batch of data, providing scalability to datasets of arbitrary size.
For the latent features of each view in the dataset, we use a Shared layer and a Specific layer, each a 3-layer MLP, to approximate the solution obtained by an embedded self-expressive layer, where the hidden units of the MLPs are $[1024, 1024, 1024]$. The module can be represented as $SE(Z^{(v)}; \Theta_s^{(v)})$, where $\Theta_s^{(v)}$ denotes the parameters of this module, including the Shared MLP layers and the Specific MLP layers. We reconstruct the self-expressive coefficients with the following formulation:
$$C^{(v)} = \alpha \, T_b\!\left( Z_C^{(v)\,T} Z_S^{(v)} \right) \tag{2}$$
where $Z_C^{(v)} \in \mathbb{R}^{d \times N}$ represents the output of the shared layer, $Z_S^{(v)} \in \mathbb{R}^{d \times N}$ represents the output of the specific layer, and $\alpha$ is a constant. Through Equation (2), we can calculate the self-expressive matrix $C \in \mathbb{R}^{N \times N}$ with a constant number of parameters rather than learning it directly. $T_b(\cdot)$ is a learnable soft-thresholding operator, defined as:
$$T_b(w) = \mathrm{sgn}(w)\,\max(|w| - t, 0) \tag{3}$$
where t is a learnable parameter. The soft thresholding function has the property of shrinking smaller values to zero, effectively inducing sparsity by encouraging the elimination of less significant components. It helps retain important information while reducing the impact of noise.
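The following sketch illustrates how the reconstruction module of Equations (2) and (3) could be implemented: a Shared MLP reused by every view, one Specific MLP per view, and a learnable soft-thresholding operator. The MLP activation, the value of $\alpha$ and the initialization of $t$ are assumptions not fixed in the text.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden=1024):
    """A 3-layer MLP with [1024, 1024, 1024] units (the activation is an assumption)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Mish(),
        nn.Linear(hidden, hidden), nn.Mish(),
        nn.Linear(hidden, hidden),
    )

class ReconstructionModule(nn.Module):
    """Sketch of the module that reconstructs C^(v) from the latent features of each view:
    the Shared MLP is reused by every view, each view owns a Specific MLP, and the
    output is alpha * T_b(Z_C Z_S^T) as in Equation (2)."""
    def __init__(self, feat_dim, n_views, alpha=1.0):
        super().__init__()
        self.shared = mlp(feat_dim)                                          # -> Z_C^(v)
        self.specific = nn.ModuleList([mlp(feat_dim) for _ in range(n_views)])  # -> Z_S^(v)
        self.alpha = alpha
        self.t = nn.Parameter(torch.tensor(0.1))      # learnable threshold t of T_b

    def soft_threshold(self, w):
        # T_b(w) = sgn(w) * max(|w| - t, 0), Equation (3)
        return torch.sign(w) * torch.clamp(torch.abs(w) - self.t, min=0.0)

    def forward(self, z_list):
        # z_list[v]: (batch, feat_dim) flattened latent features of view v
        c_list = []
        for v, z in enumerate(z_list):
            z_c = self.shared(z)                      # common representation
            z_s = self.specific[v](z)                 # view-specific representation
            c_list.append(self.alpha * self.soft_threshold(z_c @ z_s.T))  # Equation (2)
        return c_list                                 # one (batch x batch) C^(v) per view
```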

3.3. Loss Function

In the proposed framework, the loss function mainly contains three parts: the reconstruction loss, the self-expressive loss, and regularization terms. The reconstruction loss of the autoencoder encourages the network to learn meaningful representations of the input data. Suppose $Q^{(v)}$ represents the input images and $\hat{Q}^{(v)}$ denotes the samples reconstructed from the latent features. Then the reconstruction loss is defined as follows:
$$L_{rec} = \left\| Q^{(v)} - \hat{Q}^{(v)} \right\|_F^2 \tag{4}$$
The self-expressive loss and the regularization terms are applied in the reconstruction module. The self-expressive loss is used to promote the preservation of the underlying subspace structures in the data, which encourages samples from the same subspace to have similar coefficient patterns. The loss is defined as:
$$L_{se} = \left\| Z^{(v)} - SE(Z^{(v)}; \Theta_s^{(v)})\, Z^{(v)} \right\|_2^2 \tag{5}$$
We introduce a combination of $\ell_1$ regularization and $\ell_2$ regularization with a balancing weight parameter $\lambda$ as the regularization term of our proposed network. This term is defined as:
$$L_{reg} = \lambda \left\| SE(Z^{(v)}; \Theta_s^{(v)}) \right\|_1 + \frac{1 - \lambda}{2} \left\| SE(Z^{(v)}; \Theta_s^{(v)}) \right\|_2^2 \tag{6}$$
The overall loss function can be expressed as:
$$L = L_{rec} + L_{se} + L_{reg} \tag{7}$$
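A per-view sketch of this loss is given below, assuming row-major latent features $Z$ of shape (batch, d) and the reconstructed matrix $C$ of shape (batch, batch); the function name dcrsm_loss is illustrative.

```python
import torch

def dcrsm_loss(q, q_hat, z, c, lam=0.9):
    """Per-view sketch of Equations (4)-(7). q, q_hat: input and reconstructed images;
    z: latent features of shape (batch, d); c: reconstructed self-expressive matrix of
    shape (batch, batch); lam: balancing weight of the regularization term."""
    l_rec = torch.sum((q - q_hat) ** 2)                  # Equation (4): ||Q - Q_hat||_F^2
    l_se = torch.sum((z - c @ z) ** 2)                   # Equation (5), row-major convention
    l_reg = lam * torch.sum(torch.abs(c)) \
            + (1.0 - lam) / 2.0 * torch.sum(c ** 2)      # Equation (6)
    return l_rec + l_se + l_reg                          # Equation (7)
```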

3.4. Training Procedure

The experimental procedure consists of three steps. The first step is to train the autoencoders to extract meaningful latent representations. The second step is to freeze the parameters of the autoencoders and train the parameters of the reconstruction module. The last step is to integrate the self-expressive matrices of all views and obtain the final clustering results. Because the loss function of the reconstruction module depends on the entire dataset, the memory utilization increases linearly with the number of data points, which hinders its capability to handle large-scale datasets. We therefore use the two-pass algorithm with constant memory complexity introduced in [45] to train this module. Once training is finished, we acquire the latent features of the full dataset with the autoencoders and use the reconstruction module to approximate the self-expressive matrix. The training process is shown in Algorithm 1. We perform spectral clustering on the average of the $C^{(v)}$ of each branch to obtain the final clustering results.
Algorithm 1 Training process of DCRSM.
Input: Dataset $X^{(v)} \in \mathbb{R}^{D \times N}$, model parameters $\Theta^{(v)} = (\Theta_e^{(v)}, \Theta_s^{(v)}, \Theta_d^{(v)})$, pre-trained epochs $n$, number of iterations $T$, learning rate $\eta$, hyper-parameter $\lambda$.
1: Initialization: Initialize the network parameters $\Theta^{(v)}$;
2: while the algorithm has not converged do
3:  Train and update $\Theta_e^{(v)}$ and $\Theta_d^{(v)}$;
4:  Learn the latent representations $Z^{(v)}$ using the autoencoder by minimizing Equation (4);
5: end while
6: for $t = 1$ to $T$ do
7:  Compute $Z_C^{(v)}$ using the Shared MLP layers;
8:  Compute $Z_S^{(v)}$ using the Specific MLP layers;
9:  Compute $C^{(v)}$ by Equation (2);
10: Update $\Theta_s^{(v)}$ using the two-pass algorithm with the loss function of Equations (5) and (6);
11: end for
12: Perform spectral clustering using the affinity matrix $C = \frac{1}{V} \sum_{v=1}^{V} C^{(v)}$;
Output: The final cluster labels $Y$.
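The inference stage described above can be sketched as follows, reusing the ViewAutoencoder and ReconstructionModule sketches from the earlier subsections: the whole dataset is pushed through the frozen encoders in mini-batches, each $C^{(v)}$ is assembled, the matrices are averaged over views and spectral clustering is applied. The loader and module names are assumptions carried over from those sketches.

```python
import numpy as np
import torch
from sklearn.cluster import SpectralClustering

@torch.no_grad()
def cluster_full_dataset(autoencoders, recon, loaders, n_clusters):
    """Assemble C^(v) for every view of the full dataset with the frozen networks,
    average over views and run spectral clustering; loaders[v] iterates view v in order."""
    c_sum = None
    for v, (ae, loader) in enumerate(zip(autoencoders, loaders)):
        zc, zs = [], []
        for q in loader:                               # mini-batches keep memory bounded
            z = ae.encoder(q).flatten(1)               # latent features Z^(v) of this batch
            zc.append(recon.shared(z))
            zs.append(recon.specific[v](z))
        zc, zs = torch.cat(zc), torch.cat(zs)
        c_v = recon.alpha * recon.soft_threshold(zc @ zs.T)   # full N x N matrix of view v
        c_sum = c_v if c_sum is None else c_sum + c_v
    c = (c_sum / len(autoencoders)).cpu().numpy()      # C = (1/V) * sum_v C^(v)
    w = np.abs(c + c.T) / 2.0                          # symmetric affinity for clustering
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed", random_state=0)
    return sc.fit_predict(w)
```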

4. Experiments

4.1. Datasets

To evaluate the performance of the proposed method, experimental evaluations are conducted on four datasets: Fashion-MNIST, COIL-20, COIL-100 and YouTubeFace (YTF). Figure 3 displays visualization samples of these datasets. The detailed information of the experimental datasets is shown in Table 1.
  • Fashion-MNIST dataset: The Fashion-MNIST dataset can be divided into 10 clusters, with each sample labeled with one of ten fashion categories. In our experiments, we utilize the original gray images as the first view and the extracted edge features as the second view. Following [55], we select 200 samples from each category for training.
  • COIL-20 dataset: The COIL-20 dataset consists of 1440 images of objects, such as a duck and a car model, belonging to 20 categories and viewed from varying angles. We adopt the original grayscale images as the first view, and the second view is constructed by extracting edge features from the original images.
  • COIL-100 dataset: The COIL-100 dataset comprises a total of 7200 images distributed over 100 clusters. Each object is positioned on a turntable against a black background. We down-sample the images to $32 \times 32$ and use the original gray images as the first view and the edge images as the second view.
  • YouTubeFace (YTF) dataset: The YouTubeFaces dataset consists of videos collected from YouTube, featuring a diverse set of individuals in various scenarios and lighting conditions. It includes a total of over 3400 different individuals. Following [55], we employ the first 41 classes as the experimental dataset and crop the images to 128 × 128 . In this paper, the three views are the original images, the gray images and the edge images extracted from the original pictures.

4.2. Evaluation Metrics

To evaluate the clustering performance of various methods, we employ three evaluation metrics: Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). ACC measures the ability of a model to correctly assign instances to their respective classes, giving an overall assessment of its performance. It can be calculated as follows:
$$ACC = \max_{map} \frac{\sum_{i=1}^{n} \mathbf{1}\{\, y_i = map(p_i) \,\}}{n} \tag{8}$$
where $y_i$ represents the ground-truth label, $p_i$ denotes the predicted cluster and $map(\cdot)$ is a mapping function. The optimal one-to-one mapping is obtained using the Hungarian algorithm. NMI is a measure of the similarity between two clusterings of a dataset. Specifically, NMI measures the amount of information shared between the two clusterings, normalized by the total amount of information present in them, and is defined as:
$$NMI = \frac{I(y; p)}{\max\{ H(y), H(p) \}} \tag{9}$$
where $I(y; p)$ denotes the mutual information between the ground truth $y$ and the predicted labels $p$, and $H$ represents the entropy. NMI takes values between 0 and 1, where 0 indicates no similarity between the two clusterings and 1 indicates perfect similarity. ARI is also a measure of the similarity between two clusterings of a dataset. It is a modified version of the Rand Index (RI) that takes into account the agreement expected between two random clusterings. It can be defined as follows:
$$ARI = \frac{RI - E(RI)}{\max(RI) - E(RI)} \tag{10}$$
The Adjusted Rand Index takes values between −1 and 1, where −1 indicates complete disagreement between the two clusterings, 0 indicates the amount of agreement expected by chance and 1 indicates perfect agreement between the two clusterings.
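For reference, a common implementation of these three metrics (not necessarily the exact code used in this paper) is sketched below: ACC uses the Hungarian algorithm via SciPy, while NMI (with max normalization, as in Equation (9)) and ARI come from scikit-learn. Labels are assumed to be integer-coded starting from 0.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """ACC of Equation (8): find the best one-to-one mapping between predicted clusters
    and ground-truth labels with the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                              # co-occurrence counts
    rows, cols = linear_sum_assignment(-count)        # maximize correctly mapped samples
    return count[rows, cols].sum() / y_true.size

# NMI with max normalization (Equation (9)) and ARI (Equation (10)):
# nmi = normalized_mutual_info_score(y_true, y_pred, average_method="max")
# ari = adjusted_rand_score(y_true, y_pred)
```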

4.3. Comparison Algorithms

We compare the proposed network with two single-view methods and several multi-view subspace clustering approaches by the criteria of ACC and NMI.
DEC [36] utilizes a fully connected stacked autoencoder network to learn latent representations by minimizing the reconstruction loss. Moreover, this network uses the Kullback–Leibler (KL) divergence to compare the original distribution with an ideal target distribution.
DCCA [49] is an approach designed to learn intricate nonlinear transformations of two views, aiming to produce highly linearly correlated representations. The whole correlation is maximized by simultaneously learning parameters of both transformations.
BMVC [61] employs binary encoding techniques to jointly optimize binary encoding and clustering from multiple views. Compact collaborative discrete representation learning and binary clustering structure learning are two pivotal components in this framework.
GMC [62] automatically weights each view and jointly learns the graph structures of each view and the fused graph to obtain the clustering results.
DGCCA [51] introduces deep neural networks as the transformation functions, allowing for the learning of complex and nonlinear relationships between the views, which is the extension of GCCA.
DMJC [63] designs a soft assignment mechanism and introduces a novel auxiliary target distribution for multi-view fusion. The fusion process is optimized using KL divergence.
CMSC-DCCA [64] combines a correlation constraint and a self-expressive layer to leverage the information present in both inter-modal and intra-modal data.
EDESC [46] utilizes deep neural networks to learn a set of subspace bases and a mapping from samples to the latent space for each subspace. For the new data, the final clustering results can be obtained by computing the latent representations with the learned subspace bases for each subspace.
DMSC-UDL [55] combines local and global features with self-expression layers and designs a discriminative constraint to increase the discrimination between different views.

4.4. Implementation Details

We use PyTorch on the Ubuntu Linux 20.04 platform with an NVIDIA RTX 3060 to implement our proposed method and the comparison methods. In the training procedure of the autoencoder, we train for 3000 epochs using the Adam optimizer with a learning rate of $1 \times 10^{-3}$; we then freeze the autoencoder network and train the reconstruction module, again using the Adam optimizer with a learning rate of $1 \times 10^{-3}$, cosine-annealing learning-rate decay and gradient clipping. The balancing weight parameter $\lambda$ in $L_{reg}$ is set to 0.9 and the batch size is fixed to 100. To verify that our method can scale to large-scale data, we uniformly and randomly sample $N$ points from each dataset for the different experiments. For the Fashion-MNIST dataset, we set $N \in \{200, 500, 1000, 1500, 2000\}$. For the COIL-20 dataset, we set $N \in \{200, 500, 1000, 1440\}$. For the COIL-100 dataset, $N \in \{200, 500, 1000, 2000, 5000, 7200\}$. For the YTF dataset, $N \in \{200, 500, 1000, 2000, 5000, 10000\}$.
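A sketch of this optimizer configuration for the reconstruction module is shown below; the number of epochs, the cosine-annealing horizon and the clipping norm are assumptions, since only the optimizer, learning rate, schedule type, clipping and batch size are specified above, and the argument names are illustrative.

```python
import torch

def train_reconstruction_module(recon, loader, loss_fn, epochs=2500, lr=1e-3, clip=1.0):
    """Optimizer setup sketch: Adam at 1e-3 with cosine-annealing decay and gradient
    clipping, as described above (epochs, T_max and the clipping norm are assumptions)."""
    optimizer = torch.optim.Adam(recon.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for batch in loader:                    # batches of latent features, batch size 100
            loss = loss_fn(recon, batch)        # L_se + L_reg of Equations (5) and (6)
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(recon.parameters(), max_norm=clip)
            optimizer.step()
        scheduler.step()                        # anneal the learning rate once per epoch
    return recon
```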

4.5. Experimental Results

4.5.1. Comparisons with Other Methods

Table 2 shows the performance of our proposed method compared with the other algorithms. For the single-view methods, we report the results of their best view. As the results show, our method significantly outperforms the other methods in terms of ACC on all experimental datasets and improves NMI on three of the four datasets. For the Fashion-MNIST dataset, we raise performance by 4.2% in ACC and show competitive NMI compared with the other methods. For the COIL-20 dataset, we improve ACC by 1.94% and NMI by 0.15%. For the COIL-100 dataset, we improve ACC by around 3.4% and NMI by 4.5%. For the YTF dataset, we improve ACC by 2.4% and NMI by 3.4%. Overall, the multi-view approaches achieve better results than the single-view approaches, and deep learning methods are more accurate than traditional methods. Our proposed method uses deep neural networks to extract latent features and integrates the information of different views to cluster the data points, which improves the clustering performance.

4.5.2. Clustering Ability to Large-Scale Dataset

We randomly select N samples from each dataset to train DCRSM, compute the self-expressive matrix for all test samples and report ACC, NMI and ARI to evaluate the generalization ability. Table 3, Table 4, Table 5 and Table 6 present the ACC, NMI and ARI results on the different datasets with various training numbers. The best results are marked in red and the second best results in blue. The results show that our DCRSM achieves relatively good performance even when trained with a small amount of data. For the Fashion-MNIST dataset, compared with DMSC-UDL, we only need half the samples to achieve the same accuracy. For COIL-100, when the training number is 2000, which is less than a third of the whole dataset, DCRSM achieves reasonably good performance. These results show that our approach has good scalability and adaptability when dealing with large-scale datasets.

4.5.3. Analysis on Time Complexity

To further demonstrate the effectiveness and efficiency of our approach, we show the clustering accuracy and running time of our proposed method with various numbers of training samples N in Figure 4. We highlight the best results in red and the second best results in blue. We also compare the running time with that of DMSC-UDL, and all running times are measured after pre-training. Compared with DMSC-UDL, our method effectively balances training time and clustering accuracy, delivering promising performance within a reasonable timeframe on all experimental datasets. For the Fashion-MNIST, COIL-100 and YTF datasets, our approach achieves comparable accuracy while reducing the running time by around 50%. For the COIL-20 dataset, we obtain superior accuracy of 70.69% within a shorter timeframe. The results clearly display the superior performance and computational efficiency of our approach.
For the time complexity analysis, assume that the resolution of the input images is $M \times M$ (all images in the experimental datasets are square), the number of training samples is $N$ and the training batch size is $B$; then the size of the input tensor is $B \times 1 \times M \times M$. We denote the numbers of channels of the encoder as $[C_1, C_2, C_3]$. The size of the output tensor after the encoder can be represented as $B \times C_3 \times \frac{M}{4} \times \frac{M}{4}$. To simplify the notation, we use $B \times D$ to denote the shape of the latent representations. Since our autoencoder is the same as that of DMSC-UDL, we only compare the time complexity of the self-expressive modules of the two methods. For the reconstructing self-expressive matrices module in our proposed DCRSM, the number of channels in the hidden layers is $L$, so the time complexity for a single view is $O(L(D+L)B^2)$, while for the self-expressive layer in DMSC-UDL, all training samples are needed in each iteration and the time complexity for a single view is $O(D^2 N^2)$. For large-scale datasets ($B \ll N$), our method therefore reduces the complexity dramatically. Moreover, our method freezes the parameters of the encoder while training the reconstruction module, which further reduces the complexity of the training process.
We also calculate the FLOPs and the number of parameters of our proposed model and the DMSC-UDL network, as shown in Figure 5. When dealing with small-scale datasets, our method and the comparative method have similar FLOPs. However, our method exhibits a significant advantage when extending to large datasets. For instance, on the COIL-100 dataset, our method has 34.23 G FLOPs, while DMSC-UDL has 121.47 G FLOPs. On the YTF dataset, our method has 45.4 G FLOPs and the comparative approach reaches 170.04 G FLOPs. Moreover, our method has a fixed number of parameters, while the parameters of traditional self-expressive methods such as DMSC-UDL grow quadratically with the number of samples in the dataset. The fixed number of parameters and lower FLOPs of our method give it a significant advantage when scaling up to large datasets.

4.5.4. Analysis of the Affinity Matrix

As illustrated in Figure 6, we show the self-expressive matrices obtained after training for 2500 epochs when applying our model to the YTF dataset under the following four settings: (1) using a single view of our model on the color images, (2) using a single view on the gray images, (3) using a single view on the edge images, and (4) using the multi-view version of our model on all three views. It can be observed that when using only the first view, the self-expressive matrix exhibits significant non-zero weights outside the block-diagonal structure, which can negatively impact the clustering results. When using only the second view or only the third view, the self-expressive weights within the block-diagonal structure tend to have relatively small magnitudes. In contrast, when employing the multi-view fusion approach, the self-expressive matrix tends to be subspace-preserving. This demonstrates that our method can leverage the information from multiple views and combine them to obtain a highly effective self-expressive matrix for clustering. By integrating multiple views, our approach captures a richer representation of the data, leading to improved clustering results.

4.5.5. Visualization of Clustering Results

To provide a more intuitive representation of our clustering results, we employ t-SNE (t-distributed Stochastic Neighbor Embedding) to visualize the spatial distribution of the clustering results on all experimental datasets, as shown in Figure 7. Our method is remarkably effective at learning highly discriminative features that enable accurate and reliable clustering outcomes. On the four experimental datasets, our method demonstrates excellent ability to discriminate the data. Even in scenarios involving highly imbalanced data, such as the YTF dataset, our method demonstrates remarkable discrimination capability and achieves excellent clustering performance.

5. Discussion

In this paper, we introduce a novel deep multi-view subspace clustering approach based on reconstructing the self-expressive matrix (DCRSM) that goes beyond the conventional self-expressive model. In the proposed DCRSM, we build a reconstruction module to approximate the self-expressive matrix for each view of the dataset. By leveraging this capability, our approach offers improved flexibility and efficiency in the training process, making it suitable for real-world applications. Meanwhile, our approach takes the intrinsic relationships between samples into account and utilizes shared and specific layers to integrate the common and specific representations of each sample. By combining common and specific features explicitly, our approach facilitates the comprehensive utilization of information across multiple views. Experimental results on publicly available datasets show the superiority of our method. In this study, we mainly concentrate on the scalability and accuracy of multi-view clustering. The main limitation of our proposed network is its two-stage training, which adds complexity to the training process. In future work, we will consider extending our model to an end-to-end model to simplify the training process.

Author Contributions

Conceptualization, Z.S. and H.Z.; methodology, Z.S. and H.Z.; software, Z.S.; validation, Z.S.; formal analysis, Z.S. and H.Z.; investigation, Z.S. and H.Z.; resources, Z.S. and H.Z.; data curation, Z.S.; writing—original draft preparation, Z.S.; writing—review and editing, Z.S. and H.Z.; visualization, Z.S.; supervision, H.Z.; project administration, H.Z.; funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC) under Grant 62173143 and Grant 61973122.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://github.com/zalandoresearch/fashion-mnist Fashion-MNIST accessed on 15 June 2023, https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php COIL-20 accessed on 15 June 2023, https://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php COIL-100 accessed on 15 June 2023, and http://www.cs.tau.ac.il/~wolf/ytfaces/index.html YTF accessed on 15 June 2023.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xu, C.; Tao, D.; Xu, C. A survey on multi-view learning. arXiv 2013, arXiv:1304.5634. [Google Scholar]
  2. Yan, X.; Hu, S.; Mao, Y.; Ye, Y.; Yu, H. Deep multi-view learning methods: A review. Neurocomputing 2021, 448, 106–129. [Google Scholar] [CrossRef]
  3. Li, Y.; Yang, M.; Zhang, Z. A survey of multi-view representation learning. IEEE Trans. Knowl. Data Eng. 2018, 31, 1863–1883. [Google Scholar] [CrossRef] [Green Version]
  4. Tekli, J. An overview of cluster-based image search result organization: Background, techniques, and ongoing challenges. Knowl. Inf. Syst. 2022, 64, 589–642. [Google Scholar] [CrossRef]
  5. Lee, S.X.; McLachlan, G.J. An overview of skew distributions in model-based clustering. J. Multivar. Anal. 2022, 188, 104853. [Google Scholar]
  6. Luo, P.; Peng, J.; Guan, Z.; Fan, J. Dual regularized multi-view non-negative matrix factorization for clustering. Neurocomputing 2018, 294, 1–11. [Google Scholar]
  7. Huang, S.; Kang, Z.; Xu, Z. Auto-weighted multi-view clustering via deep matrix decomposition. Pattern Recognit. 2020, 97, 107015. [Google Scholar]
  8. Cui, G.; Li, X.; Dong, Y. Subspace clustering guided convex nonnegative matrix factorization. Neurocomputing 2018, 292, 38–48. [Google Scholar] [CrossRef]
  9. Khalafaoui, Y.; Grozavu, N.; Matei, B.; Goix, L.W. Multi-modal Multi-view Clustering based on Non-negative Matrix Factorization. In Proceedings of the 2022 IEEE Symposium Series on Computational Intelligence (SSCI), Singapore, 4–7 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1386–1391. [Google Scholar]
  10. Sublime, J.; Maurel, D.; Grozavu, N.; Matei, B.; Bennani, Y. Optimizing exchange confidence during collaborative clustering. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–8. [Google Scholar]
  11. Sublime, J.; Matei, B.; Cabanes, G.; Grozavu, N.; Bennani, Y.; Cornuéjols, A. Entropy based probabilistic collaborative clustering. Pattern Recognit. 2017, 72, 144–157. [Google Scholar] [CrossRef] [Green Version]
  12. Ben-Bouazza, F.E.; Bennani, Y.; El Hamri, M. An Optimal Transport Framework for Collaborative Multi-view Clustering. In Recent Advancements in Multi-View Data Analytics; Springer: Berlin/Heidelberg, Germany, 2022; pp. 131–157. [Google Scholar]
  13. Kumar, A.; Daumé, H. A co-training approach for multi-view spectral clustering. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), Bellevue, MA, USA, 28 June–2 July 2011; pp. 393–400. [Google Scholar]
  14. Zhao, X.; Evans, N.; Dugelay, J.L. A subspace co-training framework for multi-view clustering. Pattern Recognit. Lett. 2014, 41, 73–82. [Google Scholar] [CrossRef]
  15. Cai, W.; Zhou, H.; Xu, L. A multi-view co-training clustering algorithm based on global and local structure preserving. IEEE Access 2021, 9, 29293–29302. [Google Scholar]
  16. Luo, S.; Zhang, C.; Zhang, W.; Cao, X. Consistent and specific multi-view subspace clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  17. Cai, X.; Huang, D.; Zhang, G.Y.; Wang, C.D. Seeking commonness and inconsistencies: A jointly smoothed approach to multi-view subspace clustering. Inf. Fusion 2023, 91, 364–375. [Google Scholar] [CrossRef]
  18. Zhu, P.; Hui, B.; Zhang, C.; Du, D.; Wen, L.; Hu, Q. Multi-view deep subspace clustering networks. arXiv 2019, arXiv:1908.01978. [Google Scholar]
  19. Xie, Y.; Lin, B.; Qu, Y.; Li, C.; Zhang, W.; Ma, L.; Wen, Y.; Tao, D. Joint deep multi-view learning for image clustering. IEEE Trans. Knowl. Data Eng. 2020, 33, 3594–3606. [Google Scholar] [CrossRef]
  20. Ji, P.; Zhang, T.; Li, H.; Salzmann, M.; Reid, I. Deep subspace clustering networks. Adv. Neural Inf. Process. Syst. 2017, 30, 23–32. [Google Scholar]
  21. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar]
  22. MacQueen, J. Classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Los Angeles, LA, USA, 21 June–18 July 1965, 27 December 1965–7 January 1966; pp. 281–297. [Google Scholar]
  23. Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4. [Google Scholar]
  24. Elhamifar, E.; Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2765–2781. [Google Scholar] [CrossRef] [Green Version]
  25. Liu, G.; Lin, Z.; Yan, S.; Sun, J.; Yu, Y.; Ma, Y. Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 171–184. [Google Scholar] [CrossRef] [Green Version]
  26. Peng, X.; Zhang, L.; Yi, Z. Scalable sparse subspace clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 430–437. [Google Scholar]
  27. Patel, V.M.; Vidal, R. Kernel sparse subspace clustering. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 2849–2853. [Google Scholar]
  28. Zhou, P.; Hou, Y.; Feng, J. Deep adversarial subspace clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1596–1604. [Google Scholar]
  29. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  30. Valanarasu, J.M.J.; Patel, V.M. Overcomplete deep subspace clustering networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 746–755. [Google Scholar]
  31. Peng, X.; Zhu, H.; Feng, J.; Shen, C.; Zhang, H.; Zhou, J.T. Deep clustering with sample-assignment invariance prior. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 4857–4868. [Google Scholar] [CrossRef]
  32. Peng, X.; Xiao, S.; Feng, J.; Yau, W.Y.; Yi, Z. Deep subspace clustering with sparsity prior. In Proceedings of the IJCAI, New York, NY, USA, 9–15 July 2016; pp. 1925–1931. [Google Scholar]
  33. Cai, J.; Guo, W.; Fan, J. Unsupervised Deep Discriminant Analysis Based Clustering. arXiv 2022, arXiv:2206.04686. [Google Scholar]
  34. Zhou, L.; Bai, X.; Wang, D.; Liu, X.; Zhou, J.; Hancock, E. Deep subspace clustering via latent distribution preserving. In Proceedings of the International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 4440–4446. [Google Scholar]
  35. Peng, Z.; Liu, H.; Jia, Y.; Hou, J. Adaptive attribute and structure subspace clustering network. IEEE Trans. Image Process. 2022, 31, 3430–3439. [Google Scholar] [CrossRef]
  36. Xie, J.; Girshick, R.; Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 478–487. [Google Scholar]
  37. Guo, X.; Gao, L.; Liu, X.; Yin, J. Improved deep embedded clustering with local structure preservation. In Proceedings of the IJCAI, Melbourne, Australia, 19–25 August 2017; pp. 1753–1759. [Google Scholar]
  38. Cai, J.; Wang, S.; Guo, W. Unsupervised embedded feature learning for deep clustering with stacked sparse auto-encoder. Expert Syst. Appl. 2021, 186, 115729. [Google Scholar] [CrossRef]
  39. Zhang, D.; Sun, Y.; Eriksson, B.; Balzano, L. Deep unsupervised clustering using mixture of autoencoders. arXiv 2017, arXiv:1712.07788. [Google Scholar]
  40. Jiang, Z.; Zheng, Y.; Tan, H.; Tang, B.; Zhou, H. Variational deep embedding: An unsupervised and generative approach to clustering. arXiv 2016, arXiv:1611.05148. [Google Scholar]
  41. Shen, J.; Li, P.; Xu, H. Online low-rank subspace clustering by basis dictionary pursuit. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 622–631. [Google Scholar]
  42. You, C.; Li, C.; Robinson, D.P.; Vidal, R. Scalable exemplar-based subspace clustering on class-imbalanced data. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 67–83. [Google Scholar]
  43. Matsushima, S.; Brbic, M. Selective sampling-based scalable sparse subspace clustering. Adv. Neural Inf. Process. Syst. 2019, 32, 12425–12434. [Google Scholar]
  44. Li, J.; Liu, H.; Tao, Z.; Zhao, H.; Fu, Y. Learnable subspace clustering. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 1119–1133. [Google Scholar] [CrossRef]
  45. Zhang, S.; You, C.; Vidal, R.; Li, C.G. Learning a self-expressive network for subspace clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12393–12403. [Google Scholar]
  46. Cai, J.; Fan, J.; Guo, W.; Wang, S.; Zhang, Y.; Zhang, Z. Efficient deep embedded subspace clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1–10. [Google Scholar]
  47. Zhai, W.; Gao, M.; Souri, A.; Li, Q.; Guo, X.; Shang, J.; Zou, G. An attentive hierarchy ConvNet for crowd counting in smart city. Clust. Comput. 2023, 26, 1099–1111. [Google Scholar] [CrossRef]
  48. Liu, J.; Cao, F.; Liang, J. Centroids-guided deep multi-view k-means clustering. Inf. Sci. 2022, 609, 876–896. [Google Scholar] [CrossRef]
  49. Andrew, G.; Arora, R.; Bilmes, J.; Livescu, K. Deep canonical correlation analysis. In Proceedings of the International Conference on Machine Learning, PMLR, Atlanta, GA, USA, 17–19 June 2013; pp. 1247–1255. [Google Scholar]
  50. Wang, W.; Arora, R.; Livescu, K.; Bilmes, J. On deep multi-view representation learning. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 1083–1092. [Google Scholar]
  51. Benton, A.; Khayrallah, H.; Gujral, B.; Reisinger, D.A.; Zhang, S.; Arora, R. Deep generalized canonical correlation analysis. arXiv 2017, arXiv:1702.02519. [Google Scholar]
  52. Huang, Z.; Zhou, J.T.; Peng, X.; Zhang, C.; Zhu, H.; Lv, J. Multi-view Spectral Clustering Network. In Proceedings of the IJCAI, Macao, China, 10–16 August 2019; pp. 2563–2569. [Google Scholar]
  53. Fan, S.; Wang, X.; Shi, C.; Lu, E.; Lin, K.; Wang, B. One2multi graph autoencoder for multi-view graph clustering. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 3070–3076. [Google Scholar]
  54. Du, G.; Zhou, L.; Li, Z.; Wang, L.; Lü, K. Neighbor-aware deep multi-view clustering via graph convolutional network. Inf. Fusion 2023, 93, 330–343. [Google Scholar] [CrossRef]
  55. Wang, Q.; Cheng, J.; Gao, Q.; Zhao, G.; Jiao, L. Deep multi-view subspace clustering with unified and discriminative learning. IEEE Trans. Multimed. 2020, 23, 3483–3493. [Google Scholar] [CrossRef]
  56. Wang, Q.; Tao, Z.; Gao, Q.; Jiao, L. Multi-View Subspace Clustering via Structured Multi-Pathway Network. IEEE Trans. Neural Netw. Learn. Syst. 2022. [CrossRef]
  57. Sun, X.; Cheng, M.; Min, C.; Jing, L. Self-supervised deep multi-view subspace clustering. In Proceedings of the Asian Conference on Machine Learning, PMLR, Vancouver, BC, Canada, 13 December 2019; pp. 1001–1016. [Google Scholar]
  58. Tang, X.; Tang, X.; Wang, W.; Fang, L.; Wei, X. Deep multi-view sparse subspace clustering. In Proceedings of the 2018 VII International Conference on Network, Communication and Computing, Taiwan, Taiwan, 14–16 December 2018; pp. 115–119. [Google Scholar]
  59. Wang, S.; Li, C.; Li, Y.; Yuan, Y.; Wang, G. Self-supervised information bottleneck for deep multi-view subspace clustering. IEEE Trans. Image Process. 2023, 32, 1555–1567. [Google Scholar] [CrossRef]
  60. Yin, J.; Jiang, J. Incomplete Multi-view Clustering Based on Self-representation. Neural Process. Lett. 2023, 1–15. [Google Scholar] [CrossRef]
  61. Zhang, Z.; Liu, L.; Shen, F.; Shen, H.T.; Shao, L. Binary multi-view clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1774–1782. [Google Scholar] [CrossRef]
  62. Wang, H.; Yang, Y.; Liu, B. GMC: Graph-based multi-view clustering. IEEE Trans. Knowl. Data Eng. 2019, 32, 1116–1129. [Google Scholar] [CrossRef]
  63. Lin, B.; Xie, Y.; Qu, Y.; Li, C.; Liang, X. Jointly deep multi-view learning for clustering analysis. arXiv 2018, arXiv:1808.06220. [Google Scholar]
  64. Gao, Q.; Lian, H.; Wang, Q.; Sun, G. Cross-modal subspace clustering via deep canonical correlation analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 3938–3945. [Google Scholar]
Figure 1. The network architecture of our proposed network DCRSM. This model contains three components: multi-view convolutional encoders, multi-view convolutional decoders and reconstruction model. The input of the network is dataset Q ( v ) with V views and the autoencoder networks are used to learn the latent representations Z ( v ) for each view of the input data. The reconstruction module is applied to simulate the self-expressive matrix C ( v ) , which is used as input to the spectral clustering approach to obtain the final clustering results.
Figure 2. The framework of reconstruction module. The input of this module is the latent representations Z ( v ) extracted from each view of the dataset. The specific MLP layers and shared MLP layers are applied to aggregate the commonness and inconsistency of each features and reconstruct self-expressive matrix for every view of the data points. The final clustering result is yielded by the average of C ( v ) through spectral clustering algorithm.
Figure 3. The visualization of COIL-20 and YTF dataset. The COIL-20 dataset has two views, including the original color and edge feature images. The YTF dataset has three views, including color images, gray images and edge images.
Figure 4. Clustering Accuracy vs. Running Time on different datasets with various training number N. DCRSM-N represents DCRSM trained with N data samples.
Figure 5. Comparison between our proposed method and DMSC-UDL on FLOPs and the number of parameters.
Figure 6. Visualization of the affinity matrix. Subfigure (a) represents the self-expressive matrix when training only with the first view. Subfigure (b) represents the self-expressive matrix when training only with the second view. Subfigure (c) represents the self-expressive matrix when training only with the third view. Subfigure (d) represents the self-expressive matrix when training with all views of the dataset.
Figure 7. The t-SNE visualization of our proposed method on all experimental datasets.
Table 1. Detailed information of the experimental datasets.
Dataset Name   | Sample Number | Class Number | View Number | Image Size
Fashion-MNIST  | 2000          | 10           | 2           | 28 × 28 × 1
COIL-20        | 1440          | 20           | 2           | 32 × 32 × 1
COIL-100       | 7200          | 100          | 2           | 32 × 32 × 1
YTF            | 10,000        | 41           | 3           | 128 × 128 × 3
Table 2. The clustering results on Fashion-MNIST, COIL-20, COIL-100, YTF of different methods.
Method          | Fashion-MNIST      | COIL-20            | COIL-100           | YTF
                | ACC (%) | NMI (%)  | ACC (%) | NMI (%)  | ACC (%) | NMI (%)  | ACC (%) | NMI (%)
DEC [36]        | 51.80   | 54.60    | 68.00   | 80.25    | 55.30   | 60.25    | 37.10   | 44.60
DCCA [49]       | 52.74   | 53.82    | 55.76   | 64.91    | 57.88   | 63.43    | 45.19   | 60.35
BMVC [61]       | 45.36   | 38.05    | 34.31   | 40.33    | -       | -        | 28.13   | 38.28
GMC [62]        | 56.70   | 62.90    | 74.17   | 82.50    | -       | -        | 55.40   | 74.22
DGCCA [51]      | 56.28   | 57.04    | 54.01   | 62.40    | -       | -        | 47.26   | 61.38
DMJC [63]       | 61.41   | 63.41    | 72.99   | 81.58    | 68.53   | 74.96    | 61.15   | 77.40
CMSC-DCCA [64]  | 62.95   | 68.33    | 77.84   | 86.53    | -       | -        | 66.15   | 82.67
EDESC [46]      | 63.10   | 67.00    | 75.45   | 87.05    | 71.45   | 85.75    | 66.32   | 80.98
DMSC-UDL [55]   | 65.45   | 68.34    | 79.24   | 89.00    | 73.58   | 89.32    | 68.25   | 83.14
Our method      | 69.65   | 66.85    | 81.18   | 89.15    | 76.95   | 93.80    | 70.69   | 86.55
Table 3. The clustering performance of Fashion-MNIST dataset with different training number.
Methods   | Training Number N | ACC (%) | NMI (%) | ARI (%)
DMSC-UDL  | ALL               | 65.45   | 68.34   | -
DCRSM     | 200               | 67.05   | 59.90   | 47.28
DCRSM     | 500               | 68.50   | 64.88   | 52.22
DCRSM     | 1000              | 65.30   | 64.23   | 48.21
DCRSM     | 1500              | 66.35   | 64.26   | 49.21
DCRSM     | 2000              | 69.65   | 66.85   | 53.23
Table 4. The clustering performance of COIL-20 dataset with different training number.
Methods   | Training Number N | ACC (%) | NMI (%) | ARI (%)
DMSC-UDL  | ALL               | 79.24   | 89.00   | -
DCRSM     | 200               | 61.04   | 73.05   | 51.45
DCRSM     | 500               | 68.20   | 80.72   | 62.62
DCRSM     | 1000              | 75.42   | 86.01   | 71.02
DCRSM     | 1440              | 81.18   | 89.15   | 77.35
Table 5. The clustering performance of COIL-100 dataset with different training number.
Methods   | Training Number N | ACC (%) | NMI (%) | ARI (%)
DMSC-UDL  | ALL               | 73.58   | 89.32   | -
DCRSM     | 200               | 48.40   | 75.47   | 28.67
DCRSM     | 500               | 60.35   | 82.29   | 46.94
DCRSM     | 1000              | 65.10   | 87.79   | 51.01
DCRSM     | 2000              | 72.31   | 91.57   | 59.64
DCRSM     | 5000              | 73.89   | 92.00   | 68.46
DCRSM     | 7200              | 76.95   | 93.80   | 69.71
Table 6. The clustering performance of YTF dataset with different training number.
Methods   | Training Number N | ACC (%) | NMI (%) | ARI (%)
DMSC-UDL  | ALL               | 68.25   | 83.14   | -
DCRSM     | 200               | 61.61   | 77.54   | 56.54
DCRSM     | 500               | 60.28   | 77.76   | 60.19
DCRSM     | 1000              | 63.67   | 82.91   | 65.91
DCRSM     | 2000              | 67.24   | 83.33   | 66.93
DCRSM     | 5000              | 68.28   | 84.32   | 66.04
DCRSM     | 10,000            | 70.69   | 86.55   | 70.22
