Article

Multi-CC: A New Baseline for Faster and Better Deep Clustering

Yulin Yao, Yu Yang, Linna Zhou, Xinsheng Guo and Gang Wang
1 School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 Intelligent Policing Key Laboratory of Sichuan Province, Sichuan Police College, Luzhou 646000, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(20), 4204; https://doi.org/10.3390/electronics12204204
Submission received: 24 September 2023 / Revised: 5 October 2023 / Accepted: 7 October 2023 / Published: 10 October 2023
(This article belongs to the Section Computer Science & Engineering)

Abstract

The aim of this paper is to introduce a new deep clustering model called Multi-head Cross-Attention Contrastive Clustering (Multi-CC), which seeks to enhance the performance of the existing deep clustering model CC. Our approach first augments the data to form image pairs and then uses the same backbone to extract the feature representations of these image pairs. We then perform contrastive learning, separately in the row space and column space of the feature matrix, to jointly learn the instance and cluster representations. Our approach offers several key improvements over the existing model. First, we use a mixed strategy of strong and weak augmentation to construct image pairs. Second, we remove the pooling layer of the backbone to prevent loss of information. Finally, we introduce a multi-head cross-attention module to improve the model's performance. These improvements allow us to reduce the model training time by 80%. As a baseline, Multi-CC achieves the best results on CIFAR-10, ImageNet-10, and ImageNet-dogs. It can easily replace CC, enabling models based on CC to achieve better performance.


1. Introduction

Clustering is a fundamental problem in unsupervised learning, aiming to divide samples in a dataset into several categories so that samples within the same category have high similarity and those in different categories have low similarity. The goal of traditional clustering methods is to minimize the clustering objective function under a given distance metric, but the computational cost is often enormous for high-dimensional data [1]. With the development of deep learning, deep clustering has become one of the mainstream methods for unsupervised classification. Early deep clustering methods used deep neural networks for feature representation and then applied traditional clustering methods (such as K-Means [2]) for clustering. However, this approach cannot jointly optimize feature representation and clustering, leading to suboptimal solutions [3]. In order to address the above issues, [4] proposed a method to jointly train and optimize feature representation and clustering, thus learning a feature space that clusters data into different groups. In recent years, some new methods, such as in [5], can also jointly optimize feature representation and clustering, achieving good performance on high-dimensional datasets.
Contrastive learning is one of the research hotspots in the field of deep learning [6]. By comparing the similarity or difference between samples, it learns sample representations, offering clear advantages over traditional supervised and unsupervised learning methods. Representative contrastive learning methods include SimCLR [6], MoCo [7], and SimSiam [8]; they typically adopt a Siamese network structure, obtain positive and negative instance pairs of images through data augmentation, and perform a large number of positive and negative sample pair comparisons to better extract data feature representations.
In the field of clustering, Contrastive Clustering (CC) proposed by [9] is a widely used method that combines contrastive learning and deep clustering to improve clustering performance. CC constructs positive and negative instance pairs using data augmentation and maps them to the feature space using two independent Multi-Layer Perceptrons (MLPs). Instance-level and cluster-level contrastive learning are then performed in the row space and column space of the feature matrix, respectively, aiming to maximize the similarity between positive pairs and minimize the similarity between negative pairs. However, mapping features to the row and column spaces in this way may not fully capture the complex structure and semantic information in the data, which can lead to suboptimal solutions. To address this issue, we propose a multi-head cross-attention module that helps the model capture information from multiple perspectives of the input data, resulting in richer feature representations. Our contributions are as follows:
  • Our proposed baseline model for deep clustering outperforms almost all existing deep clustering models on metrics such as accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI). Moreover, our model can be used in place of CC and can be easily applied to improved methods such as C3 and SACC, resulting in even better performance.
  • Our model also shows a marked improvement in time performance, allowing it to converge quickly. As shown in Section 4.4, it reduces training time by 80% relative to CC.
  • Our model is designed to work in a single-stage, end-to-end fashion, which allows for batch optimization and application to large-scale scenarios.

2. Related Work

In this section, we briefly review work related to unsupervised representation learning and deep clustering.

2.1. Unsupervised Representation Learning

Unsupervised representation learning is a crucial aspect of deep learning that does not require labeled data. It learns more discriminative feature representations by exploiting the intrinsic structure of the data or by means such as data augmentation and contrastive learning, and it is applied in various fields such as computer vision and natural language processing. MoCo [7] introduced a momentum-based contrastive learning method for unsupervised representation learning, which enhances the stability of contrastive learning by introducing a momentum network. SimCLR [6] is another contrastive-learning-based unsupervised representation learning method that learns feature representations by modeling the similarity between image pairs. BYOL [10] proposed a bootstrap-based unsupervised representation learning method in which an online network learns to predict the representation produced by a target network. SimSiam [8] is unique in that it does not require negative samples, instead learning feature representations through the interdependencies between positive samples.

2.2. Deep Clustering

Traditional clustering methods rely on similarity or distance measures to group data, but they struggle with high-dimensional or non-linear data structures. Deep learning-based clustering methods, such as DSEC [11,12], have been developed to address these limitations. These methods transform the clustering problem into a binary discrimination problem, which enables automatic feature extraction and clustering. Contrastive learning methods, such as MoCo, SimCLR, and BYOL, have been introduced into deep clustering, leading to the development of CC [9], C3 [13], TCL [14], SPICE [5], and SACC [15]. CC combines contrastive learning and deep clustering using an instance contrastive head and a cluster contrastive head. C3 and TCL propose new loss functions to augment positive samples and mitigate the impact of false negative pairs, respectively. SPICE designs semantic-pseudo-label-based image clustering building on the idea of CC, while SACC expands the backbone network of CC.

3. Method

When we apply deep clustering techniques to analyze data, we utilize a deep clustering model denoted as $f(x; \theta)$, where $x \in \mathbb{R}^n$ represents the raw data, $f(x; \theta) \in \mathbb{R}^d$ represents the latent space representation of the data, and $\theta$ represents the model parameters. The deep clustering model comprises two main components: a feature extraction unit and a clustering unit. These can be expressed as $f(x; \theta) = g(h(x; \theta_1); \theta_2)$, where $h(x; \theta_1)$ is the feature extraction unit with parameters $\theta_1$ and $g(\,\cdot\,; \theta_2)$ is the clustering unit with parameters $\theta_2$. Various deep learning techniques, such as Convolutional Neural Networks, Autoencoders, and Variational Autoencoders, can be used to implement the feature extraction and clustering units. By extracting features from the data and mapping them onto the latent space, we obtain the latent representation $f(x; \theta)$; this latent representation is then used as input to the clustering unit to obtain the clustering results.
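As a minimal illustration of this decomposition, the following PyTorch sketch composes a feature-extraction module $h$ and a clustering module $g$ into a single model; the module names and interfaces are illustrative only, and the concrete backbone and heads we actually use are described in Sections 3.1, 3.2, and 3.3.

```python
import torch
import torch.nn as nn

class DeepClusteringModel(nn.Module):
    """Sketch of f(x; θ) = g(h(x; θ1); θ2): a feature extractor followed by a clustering unit."""

    def __init__(self, feature_extractor: nn.Module, cluster_head: nn.Module):
        super().__init__()
        self.h = feature_extractor   # h(x; θ1): raw data -> latent representation
        self.g = cluster_head        # g(·; θ2): latent representation -> cluster assignments

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.h(x)                # latent representation f(x; θ)
        return self.g(z)             # soft cluster assignment probabilities
```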
We propose a novel method that builds on the baseline CC model by improving three modules: Pair Construction Backbone (PCB), Instance-level Contrastive Head (ICH), and Cluster-level Contrastive Head (CCH), as shown in Figure 1. Data augmentation is performed in PCB to create data pairs, and the same backbone is used to obtain the feature representations of the data pairs. Contrastive learning is then performed on the matrix row space and column space in ICH and CCH, respectively. Our proposed method has three primary enhancements compared to the CC model:
  • Based on the findings in Section 4.2.4, we decided not to use the data augmentation technique proposed by CC in PCB. Instead, we opted for a mixed strategy of strong and weak augmentation, since previous experiments have demonstrated that applying strong augmentation alone can lead to suboptimal results [16].
  • After obtaining the image feature representation in PCB, we skipped the pooling operations and returned the calculated feature representation directly. According to [17], applying pooling operations directly may result in some information loss. Hence, we aimed to make the most of this information, as explained in the following point.
  • In ICH, since we eliminated the pooling layer, the MLP was no longer suitable. Instead, we used a multi-head cross-attention module. Each attention head contained a spatial attention unit and a channel attention unit. We then combined the features extracted from these heads.
In the following, we will introduce the three component modules of the model.

3.1. Pair Construction Backbone

We utilized a mixed strategy of strong and weak augmentation, inspired by recent contrastive learning works [14,15], which had a positive impact on the experimental results (Section 4.2.4). Our weak augmentation method is similar to SimCLR [6], while our strong augmentation method is based on RandAugment [18]. We first apply random flip and random mask operations and then randomly select n transformations from a set of image operations, including autocontrast, brightness, color, contrast, equalize, identity, posterize, rotate, sharpness, shearX/Y, solarize, and translateX/Y, to achieve efficient data expansion.
To explain the process, for any data instance $x_i$, where $i \in \{1, 2, \ldots, N\}$ and $N$ is the batch size, we perform two types of augmentation: strong ($T_s$) and weak ($T_w$). Using these, we create the data pair $x_i^a = T_s(x_i)$ and $x_i^b = T_w(x_i)$.
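A minimal sketch of this pair construction, assuming torchvision, is shown below; the specific crop size, jitter strengths, and RandAugment settings are illustrative choices rather than the exact values used in our experiments.

```python
import torchvision.transforms as T

# Weak augmentation T_w: SimCLR-style crop, flip, color jitter, and grayscale.
weak_aug = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

# Strong augmentation T_s: random flip and random mask (erasing), plus n RandAugment ops
# drawn from the transformation pool listed above.
strong_aug = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandAugment(num_ops=4, magnitude=9),
    T.ToTensor(),
    T.RandomErasing(p=0.5),
])

def make_pair(x):
    """Construct the augmented pair (x_a, x_b) = (T_s(x), T_w(x)) for one PIL image."""
    return strong_aug(x), weak_aug(x)
```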
We use a residual network $R_r$ [19] as the shared backbone network, similar to CC. This network generates the feature representations of the sample pairs, namely $h_i^a = R_r(x_i^a; w_r)$ and $h_i^b = R_r(x_i^b; w_r)$, where $w_r$ represents the parameters of the network $R_r$.

3.2. Instance-Level Contrastive Head

Contrastive learning aims to create an embedding space that clusters similar samples together and separates dissimilar ones, which results in effective clustering. To achieve this, CC uses an MLP (consisting of a linear layer, a ReLU activation function, and another linear layer) in the ICH module to map the data to a low-dimensional space, similar to SimCLR. According to [20], high-dimensional data contain a lot of noise and redundancy, which is why SimCLR uses an MLP. However, we believe that better methods can be used to map these features to another space. Therefore, instead of the original MLP, we adopt a multi-head cross-attention module, motivated by the limitations of the pooling layer discussed in [17].

3.2.1. Multi-Head Cross Attention

The proposed multi-head cross-attention module comprises multiple cross-attention heads that work in parallel. As depicted in Figure 2, each cross-attention head includes a spatial attention unit and a channel attention unit. The spatial attention unit extracts spatial features from the input features provided by the PCB, whereas the channel attention unit extracts channel features from the output features of the spatial attention unit. These two sets of features are then merged into a single feature representation. The left part of Figure 2 shows the spatial attention unit, which is made up of four convolutions and an activation function; the convolutions use kernels of size $3 \times 3$, $1 \times 3$, and $3 \times 1$ to capture multi-scale local features. The right part of Figure 2 shows the channel attention unit, which consists of a pooling layer, two linear layers, and an activation function; the two linear layers are used to implement an autoencoder.
Formally, let $H = \{H_1, H_2, \ldots, H_k\}$ be the set of spatial attention heads and $S = \{s_1, s_2, \ldots, s_k\}$ be the output spatial attention feature representations, where $k$ is the number of cross-attention heads. The j-th output feature can be represented as
$$ s_i^t = h_i^t \times H_j(h_i^t; w_s), \quad t \in \{a, b\} \tag{1} $$
where $i \in \{1, 2, \ldots, N\}$ and $w_s$ denotes the parameters of the network $H_j$.
Similarly, let $\tilde{H} = \{\tilde{H}_1, \tilde{H}_2, \ldots, \tilde{H}_k\}$ be the set of channel attention heads and $C = \{c_1, c_2, \ldots, c_k\}$ be the final output attention feature representations. The j-th output feature can be represented as
$$ c_i^t = s_i^t \times \tilde{H}_j(s_i^t; w_c), \quad t \in \{a, b\} \tag{2} $$
where $i \in \{1, 2, \ldots, N\}$ and $w_c$ denotes the parameters of the network $\tilde{H}_j$.
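The following PyTorch sketch shows one possible implementation of a single cross-attention head realizing Equations (1) and (2); the reduction ratio and the exact arrangement of the four convolutions are assumptions, since Figure 2 only fixes the kernel shapes ($3 \times 3$, $1 \times 3$, $3 \times 1$) and the pooling-plus-two-linear-layer structure of the channel unit.

```python
import torch
import torch.nn as nn

class CrossAttentionHead(nn.Module):
    """Sketch of one cross-attention head: a spatial attention unit (Eq. (1))
    followed by a channel attention unit (Eq. (2)), applied to a (B, C, H, W) feature map."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Spatial attention unit: four convolutions and an activation, producing a spatial map.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=3, padding=1),
            nn.Conv2d(channels // reduction, channels // reduction, kernel_size=(1, 3), padding=(0, 1)),
            nn.Conv2d(channels // reduction, channels // reduction, kernel_size=(3, 1), padding=(1, 0)),
            nn.Conv2d(channels // reduction, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        # Channel attention unit: pooling, two linear layers (a bottleneck), and an activation.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.channel = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        s = h * self.spatial(h)                       # Eq. (1): spatial attention output
        w = self.channel(self.pool(s).flatten(1))     # per-channel attention weights
        c = s * w.unsqueeze(-1).unsqueeze(-1)         # Eq. (2): channel attention output
        return c
```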

3.2.2. Attention Fusion

At this point, the cross-attention heads are not coordinated, so we need to direct them to focus on different areas. First, we amplify the attention maps using a log-softmax function to highlight the most important regions. Then, we introduce a partition loss to guide the cross-attention heads to concentrate on different relevant regions and avoid overlapping attention. Finally, we merge these heads into one. If $c_k$ represents the output of the k-th cross-attention head, the log-softmax output can be expressed as $c_k' = \log\!\left(\frac{\exp(c_k)}{\sum_{i=1}^{k} \exp(c_i)}\right)$. The partition loss is used to maximize the variance among the feature representations produced by the parallel cross-attention heads:
$$ L_{multi} = \frac{1}{NC} \sum_{i=1}^{N} \sum_{j=1}^{C} \log\!\left(1 + \frac{k}{\sigma_{ij}^2}\right) \tag{3} $$
where $C$ is the number of cross-attention head channels and $\sigma_{ij}^2$ represents the variance of the j-th channel of the i-th sample across the $k$ heads.
For convenience, we use $g_I(\cdot)$ to summarize the multi-head cross-attention operation described above. The process can then be written compactly as $z^t = g_I(h^t)$, $t \in \{a, b\}$.
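A sketch of this fusion step is given below, assuming the $k$ head outputs have already been flattened and stacked into a tensor of shape (B, k, C); the mean-based merge at the end is an assumption, since the text only states that the heads are merged into one.

```python
import torch
import torch.nn.functional as F

def fuse_attention_heads(head_features: torch.Tensor):
    """Attention fusion (Section 3.2.2): amplify with log-softmax across heads,
    compute the partition loss of Eq. (3), and merge the heads."""
    k = head_features.size(1)
    amplified = F.log_softmax(head_features, dim=1)              # c'_k: amplified attention maps
    var = amplified.var(dim=1)                                   # σ_ij²: variance across the k heads, (B, C)
    partition_loss = torch.log(1.0 + k / (var + 1e-8)).mean()    # Eq. (3): L_multi
    fused = amplified.mean(dim=1)                                # merge the k heads into one representation
    return fused, partition_loss
```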

3.2.3. Instance Contrast Learning

After obtaining the multi-head cross-attention feature representations $z_i^a$ and $z_i^b$, we use the cosine distance as a similarity measure:
$$ s(z_i^{k_1}, z_j^{k_2}) = \frac{z_i^{k_1} \cdot z_j^{k_2}}{\|z_i^{k_1}\|_2 \, \|z_j^{k_2}\|_2} \tag{4} $$
where $i, j \in \{1, 2, \ldots, N\}$ and $k_1, k_2 \in \{a, b\}$. To define pairwise similarity, given the sample $z_i^a$, we treat $\{z_i^a, z_i^b\}$ as a positive pair and all other pairs as negative. The loss for the positive pair $\{z_i^a, z_i^b\}$ is
$$ \ell_i^a = -\log \frac{\exp\left(s(z_i^a, z_i^b)/T_I\right)}{\sum_{j=1}^{N} \left[ \exp\left(s(z_i^a, z_j^b)/T_I\right) + \exp\left(s(z_i^a, z_j^a)/T_I\right) \right]} \tag{5} $$
where $T_I$ is the instance-level temperature parameter. The loss over all samples can then be represented as
$$ L_{ins} = \frac{1}{2N} \sum_{i=1}^{N} \left( \ell_i^a + \ell_i^b \right) \tag{6} $$
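A compact NT-Xent-style sketch of $L_{ins}$ is shown below, assuming $z^a$ and $z^b$ are the (N, d) multi-head cross-attention features of the two views; it folds Equations (4)-(6) into a single cross-entropy over the 2N samples.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor, t_i: float = 0.5) -> torch.Tensor:
    """Instance-level contrastive loss L_ins (Eqs. (4)-(6)) with temperature T_I."""
    n = z_a.size(0)
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)    # 2N normalized features
    sim = z @ z.t() / t_i                                   # cosine similarity / T_I (Eq. (4))
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))              # exclude self-similarity
    # The positive of sample i in view a is sample i in view b, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                    # averages ℓ_i^a and ℓ_i^b over 2N samples
```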

3.3. Cluster-Level Contrastive Head

In line with the concept of contrastive learning, we observe that during clustering, data samples need to be mapped to a matrix whose number of columns equals the number of clusters. In this matrix, each column entry of a data sample represents the probability of it belonging to the corresponding cluster. We use $z_i^a$ and $z_i^b$ as input data and pass them through a linear layer, a ReLU activation function, and a Softmax layer, which transforms them into a matrix whose dimensions are determined by the number of clusters. Cluster contrastive learning is performed in this embedding space.
To formalize the above stacked network, we denote it by $g_C(\cdot)$ and assume the number of clusters is $M$. We map the feature matrix to the cluster contrastive subspace via $c^a = g_C(z^a)$ and $c^b = g_C(z^b)$, and again use the cosine distance of Equation (4) as the similarity measure. Given the sample $c_i^a$, we consider $\{c_i^a, c_i^b\}$ as a positive pair and the others as negative pairs. The loss for the positive pair $\{c_i^a, c_i^b\}$ is
$$ \hat{\ell}_i^a = -\log \frac{\exp\left(s(c_i^a, c_i^b)/T_C\right)}{\sum_{j=1}^{N} \left[ \exp\left(s(c_i^a, c_j^b)/T_C\right) + \exp\left(s(c_i^a, c_j^a)/T_C\right) \right]} \tag{7} $$
where $T_C$ is the cluster-level temperature parameter. The loss over all samples can then be represented as
$$ L_{clu} = \frac{1}{2N} \sum_{i=1}^{N} \left( \hat{\ell}_i^a + \hat{\ell}_i^b \right) \tag{8} $$
To address the issue of uneven clustering results, we introduce a loss term to the training process that incorporates the negative entropy of the cluster assignment probability. This term helps to regulate the distribution of the cluster assignment probabilities, ensuring that each cluster has a uniform representation of data samples. By doing so, the resulting cluster assignments are more evenly distributed and the effectiveness of contrastive learning is improved. This is illustrated below:
$$ H(C) = \sum_{i=1}^{M} \left[ p(c_i^a) \log p(c_i^a) + p(c_i^b) \log p(c_i^b) \right] \tag{9} $$
where $p(\cdot)$ represents the assignment probability of the i-th cluster, $p(c_i^k) = \frac{\sum_{j=1}^{M} c_{ji}^k}{\|c^k\|_1}$, $k \in \{a, b\}$. The network's final loss is the sum of the above loss terms:
$$ L = L_{multi} + L_{ins} + L_{clu} + H(C) \tag{10} $$
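The cluster head and the entropy term can be sketched as follows; whether a second linear layer precedes the softmax is an assumption carried over from CC, as the text above only specifies a linear layer, a ReLU, and a softmax.

```python
import torch
import torch.nn as nn

class ClusterHead(nn.Module):
    """Sketch of g_C(·): maps instance features to soft assignments over M clusters."""

    def __init__(self, in_dim: int, n_clusters: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, in_dim),
            nn.ReLU(inplace=True),
            nn.Linear(in_dim, n_clusters),
            nn.Softmax(dim=1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)                          # (B, M) cluster assignment probabilities

def cluster_entropy(c_a: torch.Tensor, c_b: torch.Tensor) -> torch.Tensor:
    """Negative-entropy regularizer H(C) of Eq. (9); p(c_i) is the normalized column mass."""
    def neg_entropy(c: torch.Tensor) -> torch.Tensor:
        p = c.sum(dim=0) / c.sum()                  # assignment probability per cluster
        return (p * torch.log(p + 1e-8)).sum()
    return neg_entropy(c_a) + neg_entropy(c_b)      # added to the total loss L in Eq. (10)
```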

4. Experiments

In this section, we present a series of experiments to showcase the effectiveness of the proposed method.

4.1. Experimental Configuration

In order to evaluate the proposed method, we conducted experiments on five different image datasets; a brief description of these datasets can be found in Table 1. Previous unsupervised classification methods have used both the training and test sets during training, but this approach has been criticized as not optimal. Following the recommendations of [21], we trained only on the training set of CIFAR-10, CIFAR-100, and STL-10, instead of using both sets simultaneously as in previous works; the joint strategy can inflate metric values by 0.02 or more. For ImageNet-10 and ImageNet-dogs, we used the entire dataset because there is no widely accepted data splitting scheme. Finally, we used the 20 superclasses instead of the 100 subclasses for CIFAR-100.
To ensure a fair comparison with CC and other works, we adopted ResNet34 as the backbone and resized all images to 224 × 224 . Our instance-level temperature parameter T I was set to 0.5, while the cluster-level temperature parameter T C was set to 1.0. The number of cross-attention heads was set to 4, and we conducted an ablation study in Section 4.2. To optimize the entire network, we used an Adam optimizer with an initial learning rate of 0.003. We trained the network for 100 epochs, after which we adjusted the learning rate to 0.0003 and continued training for another 100 epochs. The experiments were conducted on an NVIDIA GeForce RTX 2080 Ti 11G GPU, and we set the batch size to 128.
To evaluate performance, we use three common metrics in clustering: accuracy (ACC), normalized mutual information (NMI), and adjusted rand index (ARI) [22].
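These three metrics can be computed as follows; this is a standard implementation (with the Hungarian algorithm for the label matching in ACC) rather than the exact script used in our experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """ACC: best one-to-one mapping between predicted clusters and ground-truth labels."""
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(cost.max() - cost)     # Hungarian matching
    return cost[row, col].sum() / y_true.size

def evaluate(y_true, y_pred) -> dict:
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {
        "ACC": clustering_accuracy(y_true, y_pred),
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "ARI": adjusted_rand_score(y_true, y_pred),
    }
```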

4.2. Ablation Studies

4.2.1. Number of the Cross-Attention Heads

We conducted an experiment on the ImageNet-dogs dataset to study the impact of cross-attention heads on the model’s performance. The experimental settings are mentioned in Section 4.1. According to the results shown in Figure 3, our proposed multi-head attention module outperforms the single attention module. Moreover, we found that the multi-head attention module equipped with 4 cross-attention heads showed the most significant performance gain.

4.2.2. Effect of Contrastive Learning Head

To validate the contribution of both contrastive heads, we conducted an experiment in which we removed the ICH and the CCH individually and optimized only one of them. When the CCH is removed, the cluster assignments are not directly available from the ICH; we therefore applied the K-Means method to the ICH outputs. The experimental results are presented in Table 2. These results indicate that jointly optimizing both contrastive heads is more effective than using only the ICH or the CCH.

4.2.3. Effect of Multi-Crossed Attention

In order to demonstrate the effectiveness of multi-head cross-attention in ICH when compared to the original MLP, we replaced the MLP in the ICH structure of CC with multi-head cross-attention (CC(Multi)). We conducted tests on the CIFAR-10 and ImageNet-dogs datasets, using experimental settings described in Section 4.1. Table 3 displays the experimental results. The performance of the model improved when the MLP in the ICH structure of CC was replaced with our proposed multi-head cross-attention. For the CIFAR-10 dataset, the NMI, ACC, and ARI increased by 3.3%, 2.9%, and 5.0% respectively. Likewise, for the ImageNet-dogs dataset, the NMI, ACC, and ARI increased by 25.6%, 38.0%, and 55.1%, respectively.
In order to demonstrate the effectiveness of multi-head cross-attention in clustering tasks in a more intuitive way, we used the T-SNE visualization method to display the clustering effect of CC and CC (Multi) on the CIFAR-10 dataset. The results are shown in Figure 4. We used the multi-head cross-attention features in the ICH as input data for T-SNE and used the cluster assignment predictions of the CCH as labels, represented by different colors. The results indicate that, compared with CC, which only uses an MLP, the intra-class distance is smaller, the inter-class distance is larger, and the cluster assignment is more balanced after using multi-head cross-attention. This shows that our proposed method is more effective in improving the quality of clustering.
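The visualization in Figure 4 can be reproduced with a sketch along the following lines, assuming `features` holds the ICH outputs and `cluster_ids` the CCH assignments; the t-SNE settings are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, cluster_ids, out_path="tsne.png"):
    """Project ICH features to 2-D with t-SNE and color points by CCH cluster assignment."""
    emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)
    plt.figure(figsize=(6, 6))
    plt.scatter(emb[:, 0], emb[:, 1], c=cluster_ids, s=2, cmap="tab10")
    plt.axis("off")
    plt.savefig(out_path, dpi=300)
```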
Multi-head attention is more effective than traditional MLP in sharing information among different attention heads, which enhances the feature representation ability. By using multi-head cross-attention, the model can learn more precise and comprehensive feature representations, leading to improved model performance.

4.2.4. Effect of Data Augmentation

Section 3.1 presented the data augmentation methods used by the model. To evaluate their effectiveness, we performed data augmentation ablation studies on the CIFAR-10 and ImageNet-dogs datasets. We compared the model's performance using a mixed strategy of strong and weak augmentation, weak augmentation only, and no data augmentation. Table 4 displays the performance improvements due to data augmentation. When we applied the same weak augmentation strategy as CC, the model's performance increased by 3.9% on CIFAR-10 and by 39.3% on ImageNet-dogs. Using a mixed strategy of strong and weak augmentation further improved the performance.

4.3. Comparisons with State of the Arts

We evaluated our method on five public datasets and compared it with 15 representative clustering methods, including K-means [23], SC [24], DAE [25], DCGAN [26], VAE [27], DEC [28], DAC [29], DCCM [30], PICA [31], CC [9], IDFD [32], C3 [13], SPICE [5], TCL [14], SACC [15]. Table 5 shows the results of our proposed method on the baseline datasets.

4.3.1. Metrics Performance Analysis

Since CC introduced contrastive learning into deep clustering, most state-of-the-art deep clustering models have used CC as a baseline. These models, including CC itself and derivatives such as C3, SPICE, TCL, and SACC, utilize both the training and testing sets during the training process. As explained in Section 4.1, this practice leads to better results. Even though this puts us at a disadvantage, we still split the CIFAR-10, CIFAR-100, and STL-10 datasets, which have recognized splitting schemes.
Although Multi-CC is positioned as a baseline, it produced the best outcomes on the CIFAR-10, ImageNet-10, and ImageNet-dogs datasets. It ranked second on CIFAR-100, behind TCL, and behind CC, SPICE, and TCL on STL-10. The improvement of Multi-CC over CC is significant, particularly on the ImageNet-dogs dataset, where NMI, ACC, and ARI increased by 43.8%, 57.6%, and 95.3%, respectively. We also saw performance increases of 17.9%, 21.4%, and 10.0% on the CIFAR-10, CIFAR-100, and ImageNet-10 datasets, respectively.
During our study, we noticed that the STL-10 results of Multi-CC were lower than those of CC. We attributed this to CC's joint use of the training and test sets, as well as its use of additional unlabeled data. To confirm this hypothesis, we retested the CC model with the same STL-10 data split as Multi-CC, without any additional unlabeled data: a training set of 5000 images and a test set of 8000 images. The results are presented in Table 6. Our conjecture was validated: trained without the joint training-and-test-set strategy and without additional unlabeled data, CC's performance drops significantly. This also confirms that the joint training strategy mentioned in SCAN yields higher performance.

4.3.2. Metrics Performance Analysis with Other Options

To demonstrate the effectiveness and ease of use of our scheme, we applied the C3 and SACC improvements on CC to Multi-CC. Summaries of C3 and SACC can be found in Appendix A and Appendix B. The results are shown in Table 5, where the improved models are marked as Multi-CC(C3) and Multi-CC(SACC), respectively. These methods performed better than Multi-CC, indicating the superiority of our approach.
Multi-CC(C3) showed a significant improvement in NMI, ACC, and ARI by 11.4%, 8.3%, and 16.8%, respectively, on the ImageNet-dogs dataset compared to Multi-CC. Furthermore, compared to C3 implemented based on CC, Multi-CC(C3) increased NMI, ACC, and ARI by 59.2%, 68.7%, and 123.2%, respectively, on the ImageNet-dogs dataset. It also showed an average of 10% improvement on other datasets.
Multi-CC(SACC) also showed improvements compared to SACC implemented based on CC. For instance, on the ImageNet-dogs dataset, NMI, ACC, and ARI increased by 41.1%, 56.5%, and 91.9%, respectively. Similarly, there were also improvements on other datasets. These results not only confirm the effectiveness of the Multi-CC method but also show its flexibility and usability under different improvement schemes.

4.4. Time Performance Analysis

It is worth noting that Multi-CC outperforms CC in terms of time efficiency. According to the CC paper, training takes 1000 epochs on all datasets, which consumes substantial GPU resources. In contrast, Multi-CC is 80% faster, requiring only 200 epochs to achieve a nearly optimal result that already surpasses CC. We compared Multi-CC and CC on the ImageNet-dogs dataset in terms of NMI, ACC, and ARI, using only the first 200 epochs of results for CC. The results demonstrate that Multi-CC quickly reaches a better result and that further fine-tuning of the model leads to even better results. The comparison results are presented in Figure 5a–c.

4.5. Contributions to the Field of Unsupervised Learning

For datasets such as CIFAR-10 and ImageNet-10, SimCLR requires training in stages if classification is needed. However, classification is not the focus of this paper; we therefore run the K-Means algorithm directly on the features extracted by SimCLR to simplify the process. We also adapted our proposed scheme to SimCLR, which we call Multi-SimCLR. To maintain fairness, both schemes use ResNet34 as the backbone and apply K-Means to compute the clustering accuracy after 200 epochs of training on the CIFAR-10 and STL-10 datasets.
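A sketch of this evaluation protocol is given below, assuming `features` is the (N, d) array of representations extracted by the trained SimCLR (or Multi-SimCLR) encoder; the K-Means settings are assumptions, and clustering accuracy is then computed with the Hungarian matching shown in Section 4.1.

```python
from sklearn.cluster import KMeans

def kmeans_on_features(features, n_clusters: int = 10):
    """Run K-Means directly on frozen encoder features and return predicted cluster labels."""
    return KMeans(n_clusters=n_clusters, n_init=20, random_state=0).fit_predict(features)
```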
Table 7 displays the results that demonstrate how our method can efficiently incorporate SimCLR and achieve better performance. Our method has shown a significant improvement in accuracy on the CIFAR-10 and STL-10 datasets, with an increase of 52.2% and 29.2%, respectively, compared to SimCLR. These results demonstrate the potential of our method to be easily generalized to the whole comparison learning framework.

5. Conclusions

We have made significant advancements in the field of deep clustering by improving the baseline CC and introducing a new technique called Multi-head Cross-Attention Contrastive Clustering (Multi-CC). Although we position it as a baseline, Multi-CC has achieved the best results on datasets such as CIFAR-10, ImageNet-10, and ImageNet-dogs. Notably, we did not jointly train on the training and test sets, as was done in previous works for these benchmarks. Our model can effectively replace CC in CC-based models such as C3, SACC, TCL, and SPICE; we have also applied the improvement methods from these models to Multi-CC, resulting in even better performance. Multi-CC reduces the time required for deep clustering by 80% compared with CC. Hence, we hope our model can serve as a foundation for further improvements to deep clustering models. However, for datasets with many classes, such as Tiny-ImageNet (200 classes) and CIFAR-100, there is still no suitable method in the field of deep clustering, leaving this as an open problem.
Furthermore, we made a tentative attempt to incorporate our approach into SimCLR, which also had a measurable effect on SimCLR's clustering performance. This attempt opens up a new research direction for applying similar multi-head cross-attention mechanisms in unsupervised learning, thereby improving model performance and efficiency. In the next phase, we will shift our focus towards exploring the feasibility of our approach in a broader scope of unsupervised learning. Additionally, we will also explore super clustering.

Author Contributions

Conceptualization, Y.Y. (Yu Yang); Formal analysis, Y.Y. (Yu Yang); Funding acquisition, G.W.; Investigation, L.Z.; Methodology, Y.Y. (Yulin Yao); Project administration, Y.Y. (Yu Yang); Resources, L.Z.; Software, Y.Y. (Yulin Yao) and X.G.; Supervision, Y.Y. (Yu Yang); Validation, Y.Y. (Yulin Yao) and X.G.; Visualization, X.G.; Writing—original draft, Y.Y. (Yulin Yao); Writing—review and editing, Y.Y. (Yulin Yao) and G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (2022YFC3300803, 2021YFC3340602), the Opening Project of Intelligent Policing Key Laboratory of Sichuan Province (No. ZNJW2023KFMS008), the Natural Science Foundation of China (Grant No. 62172053), the 111 Project (Grant No. B21049), and Open Foundation of Guizhou Provincial Key Laboratory of Public Big Data (2018BDKFJJ019).

Data Availability Statement

CIFAR dataset is available at http://www.cs.toronto.edu/~kriz/cifar.html; STL dataset is available at https://cs.stanford.edu/~acoates/stl10/; ImageNet dataset is available at https://www.image-net.org/.

Acknowledgments

All authors thank the School of Cyberspace Security, Beijing University of Posts and Telecommunications.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Cross-Instance Guided Contrastive Clustering (C3)

Cross-instance-guided Contrastive Clustering (C3) points out that CC ignores cross-instance patterns, which provide important information for improving clustering performance. Based on this observation, C3 proposes a new loss function on top of CC that identifies similar instances using the instance-level representation and encourages them to cluster together. Specifically, C3 first trains the CC network to obtain a reliable z-space, which remains suboptimal since no cross-sample relationships are considered. Then, since data labels are not accessible, C3 uses the cosine similarity of the sample representations in the z-space to measure data similarity. If the similarity of a pair of instances is greater than or equal to a threshold (quantified by the hyperparameter $\zeta$), the samples are considered similar and are pulled together by minimizing the loss function $L_{C3}$ defined below:
$$ L_{C3} = \frac{1}{2N} \sum_{i=1}^{N} \left( \hat{\ell}_i^a + \hat{\ell}_i^b \right) \tag{A1} $$
$$ \hat{\ell}_i^a = -\log \frac{\sum_{k \in \{a, b\}} \sum_{j=1}^{N} F\!\left(z_i^a \cdot z_j^k \ge \zeta\right) \exp\left(z_i^a \cdot z_j^k\right)}{\sum_{k \in \{a, b\}} \sum_{j=1}^{N} \exp\left(z_i^a \cdot z_j^k\right)} \tag{A2} $$
where $F(\cdot)$ denotes the indicator function: two features are considered to belong to the same class if their similarity is at least $\zeta$. We introduce the above method to Multi-CC: after Multi-CC obtains a reliable z-space, we simply replace the loss function and train 20 additional epochs using $L_{C3}$; the experimental results are reported in Section 4.3.2.
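A sketch of the C3 objective is given below, assuming (N, d) instance features from the two views; the threshold $\zeta$ is a hyperparameter, and the handling of the diagonal self-similarity terms is a simplification of Equation (A2).

```python
import torch
import torch.nn.functional as F

def c3_loss(z_a: torch.Tensor, z_b: torch.Tensor, zeta: float = 0.8) -> torch.Tensor:
    """Cross-instance guided loss L_C3 (Eqs. (A1)-(A2)) over both augmented views."""
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)    # 2N normalized features
    sim = z @ z.t()                                         # cosine similarities z_i · z_j
    pos = (sim >= zeta).float()                             # F(z_i · z_j >= ζ): pseudo-positive pairs
    exp_sim = torch.exp(sim)
    loss = -torch.log((pos * exp_sim).sum(dim=1) / exp_sim.sum(dim=1) + 1e-8)
    return loss.mean()
```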

Appendix B. Strongly Augmented Contrastive Clustering (SACC)

Strongly Augmented Contrastive Clustering (SACC) extends the traditional dual-augmented-view paradigm to multiple views and jointly utilizes strong and weak augmentation to enhance deep clustering. SACC utilizes a backbone with shared weights across three views: one strongly augmented view and two weakly augmented views. Based on the features generated by the backbone, weak-weak view pairs and strong-weak view pairs are used simultaneously for instance-level and cluster-level contrastive learning, which, together with the backbone, can be jointly optimized in a purely unsupervised manner.
Specifically, suppose the feature representations obtained from the backbone are denoted $z_i^1$, $z_i^2$, and $z_i^3$. A two-layer nonlinear multilayer perceptron (MLP) $g(\cdot)$ in the ICH maps these features to another space, $y_i^j = g(z_i^j)$, where $j \in \{1, 2, 3\}$. The same cosine similarity is used to represent the distance between positive and negative instances. SACC computes the instance-level contrastive loss on two view pairs:
$$ L_{ins} = L_{ins}^{(1,2)} + L_{ins}^{(2,3)} \tag{A3} $$
$$ L_{ins}^{(a,b)} = \frac{1}{2N} \sum_{i=1}^{N} \left( \ell_i^a + \ell_i^b \right) \tag{A4} $$
A two-layer nonlinear MLP $h(\cdot)$ is likewise used in the CCH to map these features to another space, $c_i^j = h(z_i^j)$, where $j \in \{1, 2, 3\}$. The same cosine similarity is used to represent the distance between positive and negative instances. SACC computes the cluster-level contrastive loss on three view pairs:
$$ L_{clu} = L_{clu}^{(1,2)} + L_{clu}^{(2,3)} + L_{clu}^{(1,3)} \tag{A5} $$
$$ L_{clu}^{(a,b)} = \frac{1}{2N} \sum_{i=1}^{N} \left( \hat{\ell}_i^a + \hat{\ell}_i^b \right) + H(Y) \tag{A6} $$

References

  1. Min, E.; Guo, X.; Liu, Q.; Zhang, G.; Cui, J.; Long, J. A Survey of Clustering with Deep Learning: From the Perspective of Network Architecture. IEEE Access 2018, 6, 39501–39514. [Google Scholar] [CrossRef]
  2. Hartigan, J.A.; Wong, M.A. A k-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1979, 28, 100–108. [Google Scholar]
  3. Yang, J.; Parikh, D.; Batra, D. Joint Unsupervised Learning of Deep Representations and Image Clusters. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5147–5156. [Google Scholar]
  4. Huang, P.; Huang, Y.; Wang, W.; Wang, L. Deep Embedding Network for Clustering. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 1532–1537. [Google Scholar]
  5. Niu, C.; Shan, H.; Wang, G. SPICE: Semantic Pseudo-Labeling for Image Clustering. IEEE Trans. Image Process. 2021, 31, 7264–7278. [Google Scholar] [CrossRef] [PubMed]
  6. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G.E. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709. [Google Scholar]
  7. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R.B. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9726–9735. [Google Scholar]
  8. Chen, X.; He, K. Exploring Simple Siamese Representation Learning. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15745–15753. [Google Scholar]
  9. Li, Y.; Hu, P.; Liu, Z.; Peng, D.; Zhou, J.T.; Peng, X. Contrastive Clustering. arXiv 2020, arXiv:2009.09687. [Google Scholar] [CrossRef]
  10. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.Á.; Guo, Z.D.; Azar, M.G.; et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. arXiv 2020, arXiv:2006.07733. [Google Scholar]
  11. Chang, J.; Meng, G.; Wang, L.; Xiang, S.; Pan, C. Deep Self-Evolution Clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 809–823. [Google Scholar] [CrossRef] [PubMed]
  12. Chang, J.; Guo, Y.; Wang, L.; Meng, G.; Xiang, S.; Pan, C. Deep Discriminative Clustering Analysis. arXiv 2019, arXiv:1905.01681. [Google Scholar]
  13. Sadeghi, M.; Hojjati, H.; Armanfard, N. C3: Cross-instance guided Contrastive Clustering. arXiv 2022, arXiv:2211.07136. [Google Scholar]
  14. Li, Y.; Yang, M.; Peng, D.; Li, T.; Huang, J.; Peng, X. Twin Contrastive Learning for Online Clustering. Int. J. Comput. Vis. 2022, 130, 2205–2221. [Google Scholar] [CrossRef]
  15. Deng, X.; Huang, D.; Chen, D.H.; Wang, C.D.; Lai, J.H. Strongly augmented contrastive clustering. Pattern Recognit. 2023, 139, 109470. [Google Scholar] [CrossRef]
  16. Wang, X.; Qi, G.J. Contrastive Learning with Stronger Augmentations. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5549–5560. [Google Scholar] [CrossRef] [PubMed]
  17. Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding Convolution for Semantic Segmentation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1451–1460. [Google Scholar]
  18. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 3008–3017. [Google Scholar]
  19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  20. Cohen, M.B.; Elder, S.; Musco, C.; Musco, C.; Persu, M. Dimensionality Reduction for k-Means Clustering and Low Rank Approximation. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, Portland, OR, USA, 14–17 June 2015. [Google Scholar]
  21. Gansbeke, W.V.; Vandenhende, S.; Georgoulis, S.; Proesmans, M.; Gool, L.V. SCAN: Learning to Classify Images without Labels. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020. [Google Scholar]
  22. Zhou, S.; Xu, H.; Zheng, Z.; Chen, J.; Li, Z.; Bu, J.; Wu, J.; Wang, X.; Zhu, W.; Ester, M. A Comprehensive Survey on Deep Clustering: Taxonomy, Challenges, and Future Directions. arXiv 2022, arXiv:2206.07579. [Google Scholar]
  23. Lloyd, S.P. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–136. [Google Scholar] [CrossRef]
  24. Zelnik-Manor, L.; Perona, P. Self-Tuning Spectral Clustering. In Proceedings of the NIPS, Vancouver, BC, Canada, 13–18 December 2004. [Google Scholar]
  25. Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.A. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408. [Google Scholar]
  26. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  27. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  28. Xie, J.; Girshick, R.B.; Farhadi, A. Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015. [Google Scholar]
  29. Chang, J.; Wang, L.; Meng, G.; Xiang, S.; Pan, C. Deep Adaptive Image Clustering. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5880–5888. [Google Scholar]
  30. Wu, J.; Long, K.; Wang, F.; Qian, C.; Li, C.; Lin, Z.; Zha, H. Deep Comprehensive Correlation Mining for Image Clustering. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October 2019–2 November 2019; pp. 8149–8158. [Google Scholar]
  31. Huang, J.; Gong, S.; Zhu, X. Deep Semantic Clustering by Partition Confidence Maximisation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8846–8855. [Google Scholar]
  32. Tao, Y.; Takagi, K.; Nakata, K. Clustering-friendly Representation Learning via Instance Discrimination and Feature Decorrelation. arXiv 2021, arXiv:2106.00131. [Google Scholar]
Figure 1. The framework of Multi-head Cross-Attention Contrastive Clustering.
Figure 2. Cross-Attention module architecture with spatial and channel attention units.
Figure 3. Comparison of the performance of different numbers of cross-attention heads. (a) NMI comparison; (b) ACC comparison; (c) ARI comparison.
Figure 4. Comparison of CC and CC(Multi) performance using T-SNE visualization. (a) T-SNE visualization with CC; (b) T-SNE visualization with CC(Multi).
Figure 5. Comparison of Multi-CC and CC performance. (a) NMI comparison; (b) ACC comparison; (c) ARI comparison.
Table 1. Summary of the datasets used for evaluation.
Dataset       | Split | Samples | Classes
CIFAR-10      | ✓     | 60,000  | 10
CIFAR-100     | ✓     | 60,000  | 20
STL-10        | ✓     | 13,000  | 10
ImageNet-10   | ×     | 13,000  | 10
ImageNet-dogs | ×     | 19,500  | 15
Table 2. Ablation study of the two contrastive learning heads.
Dataset       | Contrastive Head | NMI   | ACC   | ARI
CIFAR-10      | ICH + CCH        | 0.811 | 0.889 | 0.785
CIFAR-10      | ICH              | 0.431 | 0.514 | 0.323
CIFAR-10      | CCH              | 0.788 | 0.845 | 0.735
ImageNet-dogs | ICH + CCH        | 0.640 | 0.676 | 0.535
ImageNet-dogs | ICH              | 0.582 | 0.475 | 0.409
ImageNet-dogs | CCH              | 0.561 | 0.531 | 0.406
Table 3. Multi-head cross-attention module validity.
Dataset       | Model      | NMI   | ACC   | ARI
CIFAR-10      | CC (Multi) | 0.728 | 0.813 | 0.669
CIFAR-10      | CC         | 0.705 | 0.790 | 0.637
ImageNet-dogs | CC (Multi) | 0.559 | 0.592 | 0.425
ImageNet-dogs | CC         | 0.445 | 0.429 | 0.274
Table 4. Ablation studies with strong and weak augmentation methods.
Dataset       | Augmentation         | NMI   | ACC   | ARI
CIFAR-10      | T_s(x) + T_w(x)      | 0.811 | 0.889 | 0.785
CIFAR-10      | T_w(x) + T_w(x)      | 0.728 | 0.813 | 0.669
CIFAR-10      | x + x                | 0.064 | 0.197 | 0.040
CIFAR-10      | T_w(x) + T_w(x) (CC) | 0.705 | 0.790 | 0.637
ImageNet-dogs | T_s(x) + T_w(x)      | 0.640 | 0.676 | 0.535
ImageNet-dogs | T_w(x) + T_w(x)      | 0.559 | 0.592 | 0.425
ImageNet-dogs | x + x                | 0.117 | 0.193 | 0.046
ImageNet-dogs | T_w(x) + T_w(x) (CC) | 0.445 | 0.429 | 0.274
Table 5. The clustering performance on the five baseline datasets. The first and second best results are indicated using bold and underlining, respectively.
Dataset        | CIFAR-10            | CIFAR-100           | STL-10              | ImageNet-10         | ImageNet-dogs
Metrics        | NMI   ACC   ARI     | NMI   ACC   ARI     | NMI   ACC   ARI     | NMI   ACC   ARI     | NMI   ACC   ARI
K-means        | 0.087 0.229 0.049   | 0.084 0.130 0.028   | 0.125 0.192 0.061   | 0.119 0.241 0.057   | 0.055 0.105 0.020
SC             | 0.103 0.247 0.085   | 0.090 0.136 0.022   | 0.098 0.159 0.048   | 0.151 0.274 0.076   | 0.038 0.111 0.013
DAE            | 0.251 0.297 0.163   | 0.111 0.151 0.046   | 0.224 0.302 0.152   | 0.206 0.304 0.138   | 0.104 0.190 0.078
DCGAN          | 0.265 0.315 0.176   | 0.120 0.151 0.045   | 0.210 0.298 0.139   | 0.225 0.346 0.157   | 0.121 0.174 0.078
VAE            | 0.245 0.291 0.167   | 0.108 0.152 0.040   | 0.200 0.282 0.146   | 0.193 0.334 0.168   | 0.107 0.179 0.079
DEC            | 0.257 0.301 0.161   | 0.136 0.185 0.050   | 0.276 0.359 0.186   | 0.282 0.381 0.203   | 0.122 0.195 0.079
DAC            | 0.396 0.522 0.306   | 0.185 0.238 0.088   | 0.366 0.470 0.257   | 0.394 0.527 0.302   | 0.219 0.275 0.111
DCCM           | 0.496 0.623 0.408   | 0.285 0.327 0.173   | 0.376 0.482 0.262   | 0.608 0.710 0.555   | 0.321 0.383 0.182
PICA           | 0.591 0.696 0.512   | 0.310 0.337 0.171   | 0.611 0.713 0.531   | 0.802 0.870 0.761   | 0.352 0.352 0.201
CC             | 0.705 0.790 0.637   | 0.431 0.429 0.266   | 0.764 0.850 0.726   | 0.859 0.893 0.822   | 0.445 0.429 0.274
IDFD           | 0.711 0.815 0.663   | 0.426 0.425 0.264   | 0.643 0.756 0.575   | -     -     -       | -     -     -
C3             | 0.748 0.838 0.707   | 0.434 0.451 0.275   | -     -     -       | 0.905 0.942 0.861   | 0.448 0.434 0.280
SPICE          | 0.734 0.838 0.705   | 0.448 0.468 0.294   | 0.817 0.908 0.812   | 0.828 0.921 0.836   | 0.572 0.646 0.479
TCL            | 0.819 0.887 0.780   | 0.529 0.531 0.357   | 0.799 0.868 0.757   | 0.875 0.895 0.837   | 0.623 0.644 0.516
SACC           | 0.765 0.851 0.724   | 0.448 0.443 0.282   | 0.691 0.759 0.626   | 0.877 0.905 0.843   | 0.455 0.437 0.285
Multi-CC       | 0.811 0.889 0.785   | 0.527 0.498 0.335   | 0.731 0.818 0.682   | 0.924 0.964 0.930   | 0.640 0.676 0.535
Multi-CC(SACC) | 0.818 0.898 0.799   | 0.542 0.518 0.354   | 0.739 0.822 0.686   | 0.931 0.972 0.939   | 0.652 0.684 0.547
Multi-CC(C3)   | 0.810 0.884 0.776   | 0.538 0.506 0.350   | 0.735 0.837 0.695   | 0.940 0.976 0.948   | 0.713 0.732 0.625
Table 6. Retesting CC with STL-10.
Dataset | Model    | NMI   | ACC   | ARI
STL-10  | CC       | 0.401 | 0.499 | 0.255
STL-10  | Multi-CC | 0.731 | 0.818 | 0.682
Table 7. Effectiveness of the Multi-Cross Attention Module on SimCLR.
Dataset  | Model        | ACC
CIFAR-10 | SimCLR       | 0.418
CIFAR-10 | Multi-SimCLR | 0.636
STL-10   | SimCLR       | 0.431
STL-10   | Multi-SimCLR | 0.557