1. Introduction
Clustering is a fundamental problem in unsupervised learning: it divides the samples in a dataset into several categories so that samples within the same category are highly similar and samples in different categories are dissimilar. Traditional clustering methods minimize a clustering objective under a given distance metric, but their computational cost is often prohibitive for high-dimensional data [1]. With the development of deep learning, deep clustering has become one of the mainstream approaches to unsupervised classification. Early deep clustering methods used deep neural networks for feature representation and then applied traditional clustering methods (such as K-Means [2]). However, this two-stage approach cannot jointly optimize feature representation and clustering, leading to suboptimal solutions [3]. To address this issue, [4] proposed jointly training and optimizing feature representation and clustering, thus learning a feature space in which the data separate into distinct groups. In recent years, newer methods such as [5] also jointly optimize feature representation and clustering, achieving good performance on high-dimensional datasets.
Contrastive learning is one of the research hotspots in deep learning [6]. By contrasting the similarities and differences between samples, it learns sample representations, offering clear advantages over traditional supervised and unsupervised learning methods. Representative contrastive learning methods include SimCLR [6], MoCo [7], and SimSiam [8]. These methods typically adopt a Siamese network structure, obtain positive and negative instance pairs of images through data augmentation, and perform a large number of positive and negative pair comparisons to extract better data feature representations.
In the field of clustering, Contrastive Clustering (CC) [9] is a widely used method that combines contrastive learning and deep clustering to improve clustering performance. CC constructs positive and negative instance pairs through data augmentation and maps them to the feature space using two independent multi-layer perceptrons (MLPs); instance-level and cluster-level contrastive learning are then performed in the row space and column space of the feature matrix, maximizing the similarity of positive pairs while minimizing that of negative pairs. However, mapping features only to the row and column spaces may not fully capture the complex structure and semantic information in the data, which can lead to suboptimal solutions. To address this issue, we propose a multi-head cross-attention module that helps the model capture information from multiple perspectives of the input data, yielding richer feature representations. Our contributions are as follows:
- Our proposed model outperforms almost all existing deep clustering models on standard metrics such as accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI). Moreover, it can be used as a drop-in replacement for CC and is easily applied to improved methods such as C3 and SACC, yielding even better performance.
- Our model also improves markedly in time performance and converges quickly: as shown in Section 4.4, it reduces training time by 80% relative to CC.
- Our model works in a single-stage, end-to-end fashion, which allows batch optimization and application to large-scale scenarios.
2. Related Work
In this section, we briefly review work related to unsupervised representation learning and deep clustering.
2.1. Unsupervised Representation Learning
Unsupervised representation learning is a crucial aspect of deep learning that requires no labeled data. It learns discriminative feature representations by exploiting the intrinsic structure of the data or by means such as data augmentation and contrastive learning, and is applied in fields such as computer vision and natural language processing. MoCo [7] introduced a momentum-based contrastive learning method that enhances training stability through a momentum network. SimCLR [6] is another contrastive learning method that learns feature representations from the similarity between augmented image pairs. BYOL [10] proposed a bootstrap-based method that learns data representations by predicting the target network's representation. SimSiam [8] is unique in that it requires no negative samples, learning feature representations from the interdependencies between positive samples alone.
2.2. Deep Clustering
Traditional clustering methods rely on similarity or distance measures to group data, but they struggle with high-dimensional or non-linear data structures. Deep learning-based clustering methods, such as DSEC [11,12], have been developed to address these limitations by transforming the clustering problem into a binary discrimination problem, enabling automatic feature extraction and clustering. Contrastive learning methods such as MoCo, SimCLR, and BYOL have since been introduced into deep clustering, leading to CC [9], C3 [13], TCL [14], SPICE [5], and SACC [15]. CC combines contrastive learning and deep clustering through an instance contrastive head and a cluster contrastive head. C3 and TCL propose new loss functions to augment positive samples and to mitigate the impact of false negative pairs, respectively. SPICE designs semantic pseudo-label-based image clustering following the idea of CC, while SACC extends the backbone network of CC.
3. Method
When we apply deep clustering techniques to analyze data, we use a deep clustering model denoted as z = f(x; θ), where x represents the raw data, z the latent-space representation of the data, and θ the model parameters. The model comprises two main components, a feature extraction unit and a clustering unit, which can be expressed as f = g ∘ h. Here h(·; θ_h) is the feature extraction unit with parameters θ_h, and g(·; θ_g) is the clustering unit with parameters θ_g. Various deep learning techniques, such as convolutional neural networks, autoencoders, and variational autoencoders, can be used to implement the two units. By extracting features from the data and mapping them onto the latent space, we obtain the latent representation z = h(x; θ_h); this representation is then used as input to the clustering unit to obtain the clustering result g(z; θ_g).
We propose a novel method that builds on the baseline CC model by improving three modules: the Pair Construction Backbone (PCB), the Instance-level Contrastive Head (ICH), and the Cluster-level Contrastive Head (CCH), as shown in Figure 1. Data augmentation is performed in the PCB to create data pairs, and a shared backbone produces the feature representations of the pairs. Contrastive learning is then performed on the row space and column space of the feature matrix in the ICH and CCH, respectively. Our method has three primary enhancements over CC:
Based on the findings in Section 4.2.4, we do not use the data augmentation scheme of CC in the PCB. Instead, we adopt a combination of strong and weak augmentation, since previous experiments have shown that applying strong augmentation alone can lead to suboptimal results [16].
After obtaining the image feature representation in the PCB, we skip the pooling operation and return the computed feature representation directly. According to [17], applying pooling directly may cause information loss; we therefore aim to make full use of this information, as explained in the next point.
In ICH, since we eliminated the pooling layer, the MLP was no longer suitable. Instead, we used a multi-head cross-attention module. Each attention head contained a spatial attention unit and a channel attention unit. We then combined the features extracted from these heads.
In the following, we will introduce the three component modules of the model.
3.1. Pair Construction Backbone
We utilize a mixed strategy of strong and weak augmentation, inspired by recent contrastive learning works [14,15], which has a positive impact on experimental results (Section 4.2.4). Our weak augmentation is similar to SimCLR [6], while our strong augmentation is based on RandAugment [18]: we first apply random flip and random mask operations and then randomly select n transforms from a pool including autocontrast, brightness, color, contrast, equalize, identity, posterize, rotate, sharpness, shearX/Y, solarize, and translateX/Y, achieving efficient data expansion.
To explain the process: for any data instance x_i, where i ∈ {1, …, N} and N is the batch size, we apply two types of augmentation, strong (T^s) and weak (T^w), to create the data pair x_i^s = T^s(x_i) and x_i^w = T^w(x_i).
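As a minimal illustration of the pair-construction step, the sketch below builds the plan of operations for one strong augmentation. The function name `strong_augment_plan` and the `random_flip`/`random_mask` placeholders are hypothetical; real code would map each name to an actual image operation (e.g., via a RandAugment implementation).

```python
import random

# Transform pool mirroring the list above (names only, for illustration).
TRANSFORMS = ["autocontrast", "brightness", "color", "contrast", "equalize",
              "identity", "posterize", "rotate", "sharpness", "shearX", "shearY",
              "solarize", "translateX", "translateY"]

def strong_augment_plan(n, seed=None):
    """Return the ordered list of operations for one strong augmentation:
    random flip and random mask first, then n transforms sampled
    without replacement from the pool."""
    rng = random.Random(seed)
    return ["random_flip", "random_mask"] + rng.sample(TRANSFORMS, n)
```

Calling the planner twice with different seeds yields different transform subsets, which is how the two augmented views of the same image diverge.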
We use a residual network [19] as the shared backbone network, similar to CC. This network generates the feature representations of the sample pairs, namely h_i^s = φ(x_i^s; θ_φ) and h_i^w = φ(x_i^w; θ_φ), where θ_φ denotes the parameters of the backbone φ.
3.2. Instance-Level Contrastive Head
Contrastive learning aims to create an embedding space that clusters similar samples together and separates dissimilar ones, which leads to effective clustering. To achieve this, CC uses an MLP (a linear layer, a ReLU activation, and another linear layer) in the ICH to map data to a low-dimensional space, similar to SimCLR. According to [20], high-dimensional data contain substantial noise and redundancy, which is why SimCLR uses an MLP. We believe, however, that these features can be mapped to another space by better means; guided by the analysis in [20] of the limitations of the pooling layer, we therefore replace the original MLP with a multi-head cross-attention module.
3.2.1. Multi-Head Cross Attention
The proposed multi-head cross-attention module comprises multiple cross-attention heads working in parallel. As depicted in Figure 2, each cross-attention head includes a spatial attention unit and a channel attention unit. The spatial attention unit extracts spatial features from the input features provided by the PCB, while the channel attention unit extracts channel features from the output of the spatial attention unit; the two sets of features are then merged into a single feature representation. The left part of Figure 2 shows the spatial attention unit, which consists of four convolutions and an activation function; the convolutions use multiple kernel sizes to capture multi-scale local features. The right part shows the channel attention unit, which consists of a pooling layer, two linear layers, and an activation function; the two linear layers implement an autoencoder-style bottleneck.
Formally, let {S_1, …, S_k} be the set of spatial attention heads and {u_1, …, u_k} the output spatial attention feature representations, where k is the number of cross-attention heads. The j-th output feature can be represented as

u_j = S_j(h; θ_{S_j}),

where j ∈ {1, …, k} and θ_{S_j} denotes the parameters of the network S_j.

Similarly, let {C_1, …, C_k} be the set of channel attention heads and {v_1, …, v_k} the final output attention feature representations. The j-th output feature can be represented as

v_j = C_j(u_j; θ_{C_j}),

where j ∈ {1, …, k} and θ_{C_j} denotes the parameters of the network C_j.
3.2.2. Attention Fusion
The cross-attention heads obtained so far are not coordinated, so we need to direct them to focus on different areas. First, we amplify the attention maps with a log-softmax function to highlight the most important regions. Then, we introduce a partition loss to guide the cross-attention heads to concentrate on different relevant regions and avoid overlapping attention. Finally, we merge the heads into one. If v_j denotes the output of the j-th cross-attention head, the log-softmax output can be expressed as v_j' = log_softmax(v_j). The partition loss maximizes the variance between the feature representations produced by the parallel cross-attention heads and is represented as:

L_P = (1/N) Σ_{i=1}^{N} (1/C) Σ_{j=1}^{C} log(1 + k / σ_{ij}²),

where C is the number of cross-attention head channels and σ_{ij}² is the variance of the j-th channel of the i-th sample across the k heads.
For convenience, we use A(·; θ_A) to summarize the multi-head cross-attention operation; the above process then simplifies to z_i = A(h_i; θ_A).
3.2.3. Instance Contrast Learning
After obtaining the multi-head cross-attention feature representations z_i^s and z_i^w, we use the cosine distance as the similarity measure:

s(z_i, z_j) = (z_i · z_j) / (‖z_i‖ ‖z_j‖),

where i, j ∈ {1, …, N}. To form pairwise similarities, given a sample x_i we treat (z_i^s, z_i^w) as a positive pair and all other pairs as negative pairs. The loss for the positive pair of sample x_i is

ℓ_i^s = −log [ exp(s(z_i^s, z_i^w)/τ_I) / Σ_{j=1}^{N} ( exp(s(z_i^s, z_j^w)/τ_I) + 1_{[j≠i]} exp(s(z_i^s, z_j^s)/τ_I) ) ],

where τ_I is the instance-level temperature parameter. The loss over all samples can therefore be represented as

L_ins = (1/2N) Σ_{i=1}^{N} (ℓ_i^s + ℓ_i^w).
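The instance-level contrastive loss can be sketched in plain Python as a simplified NT-Xent over the 2N augmented views, pooling all non-self pairs into the denominator. This is a minimal illustration under that simplification, not the paper's exact implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def instance_loss(za, zb, tau=0.5):
    """za, zb: lists of N feature vectors from the two augmented views.
    Returns the NT-Xent loss averaged over all 2N instances."""
    z = za + zb
    n = len(za)
    total = 0.0
    for i in range(2 * n):
        j = i + n if i < n else i - n          # index of the positive partner
        pos = math.exp(cosine(z[i], z[j]) / tau)
        denom = sum(math.exp(cosine(z[i], z[k]) / tau)
                    for k in range(2 * n) if k != i)
        total += -math.log(pos / denom)
    return total / (2 * n)
```

With a single perfectly aligned pair and no negatives the loss is zero; adding more samples introduces negative terms in the denominator and the loss becomes positive.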
3.3. Cluster-Level Contrastive Head
In line with the concept of contrastive learning, we observe that during clustering, data samples must be mapped to a matrix whose column dimension equals the number of clusters; each column entry of a sample represents the probability of it belonging to the respective cluster. We use z_i^s and z_i^w as input data and pass them through a linear layer, a ReLU activation, and a softmax layer, transforming them into a matrix whose width is the number of clusters. Cluster-level contrastive learning is performed in this embedding space.
To formalize the stacked network above, we denote it by G(·; θ_G) and assume the number of clusters is M. We map the feature matrices to the cluster contrastive subspace via y^s = G(z^s; θ_G) and y^w = G(z^w; θ_G), and use the cosine distance as the similarity measure, as in Equation (4). Given the j-th cluster (the j-th column of the assignment matrix), we treat (y_j^s, y_j^w) as a positive pair and the others as negative pairs. The loss for the positive pair y_j is

ℓ_j^s = −log [ exp(s(y_j^s, y_j^w)/τ_C) / Σ_{m=1}^{M} ( exp(s(y_j^s, y_m^w)/τ_C) + 1_{[m≠j]} exp(s(y_j^s, y_m^s)/τ_C) ) ],

where τ_C is the cluster-level temperature parameter. The loss over all clusters can therefore be represented as

L_clu = (1/2M) Σ_{j=1}^{M} (ℓ_j^s + ℓ_j^w).
To address the issue of uneven clustering results, we add a loss term incorporating the negative entropy of the cluster assignment probabilities. This term regularizes the distribution of cluster assignments so that samples spread more evenly across clusters, improving the effectiveness of contrastive learning:

H(Y) = Σ_{j=1}^{M} [ P(y_j^s) log P(y_j^s) + P(y_j^w) log P(y_j^w) ],

where P(y_j) = (1/N) Σ_{i=1}^{N} Y_{ij} represents the assignment probability of the j-th cluster. The network's final loss is the sum of the three loss functions.
4. Experiments
In this section, we present a series of experiments to showcase the effectiveness of the proposed method.
4.1. Experimental Configuration
To evaluate the proposed method, we conducted experiments on five image datasets, briefly described in Table 1. Previous unsupervised classification methods have used both the training and test sets during training, a practice that has been criticized as suboptimal. Following the recommendations of [21], we trained only on the training sets of CIFAR-10, CIFAR-100, and STL-10, rather than using both sets simultaneously as in previous works; using both sets typically inflates metric values by 0.02 or more. For ImageNet-10 and ImageNet-dogs we used the entire dataset, as there is no widely accepted splitting scheme. Finally, for CIFAR-100 we used the 20 superclasses instead of the 100 subclasses.
To ensure a fair comparison with CC and other works, we adopted ResNet34 as the backbone and resized all images to the same resolution used by CC. The instance-level temperature parameter τ_I was set to 0.5, and the cluster-level temperature parameter τ_C to 1.0. The number of cross-attention heads was set to 4 (see the ablation study in Section 4.2). To optimize the entire network, we used the Adam optimizer with an initial learning rate of 0.003; we trained for 100 epochs, then lowered the learning rate to 0.0003 and trained for another 100 epochs. The experiments were conducted on an NVIDIA GeForce RTX 2080 Ti (11 GB) GPU with a batch size of 128.
To evaluate performance, we use three common clustering metrics: accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI) [22].
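For small cluster counts, the best label matching behind the ACC metric can be computed by brute force over label permutations, as sketched below; real implementations use the Hungarian algorithm instead, which scales to many clusters.

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred, n_clusters):
    """Best-match clustering accuracy: try every mapping from predicted
    cluster ids to true labels and keep the one with the most matches.
    Feasible only for small n_clusters (n_clusters! permutations)."""
    best = 0
    for perm in permutations(range(n_clusters)):
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == perm[p])
        best = max(best, correct)
    return best / len(y_true)
```

For example, predictions that are a pure relabeling of the ground truth score a perfect 1.0, since some permutation aligns them exactly.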
4.2. Ablation Studies
4.2.1. Number of the Cross-Attention Heads
We conducted an experiment on the ImageNet-dogs dataset to study the impact of the number of cross-attention heads on performance, using the settings of Section 4.1. As shown in Figure 3, our multi-head attention module outperforms a single attention module, and the configuration with 4 cross-attention heads gives the largest performance gain.
4.2.2. Effect of Contrastive Learning Head
To validate the contribution of each contrastive head, we removed the ICH and CCH individually and optimized only the remaining head. When the CCH is removed, the cluster assignments are not directly available, so we applied K-Means to the ICH outputs. The results, presented in Table 2, indicate that jointly optimizing both heads is more effective than using the ICH or CCH alone.
4.2.3. Effect of Multi-Crossed Attention
To demonstrate the effectiveness of multi-head cross-attention in the ICH compared with the original MLP, we replaced the MLP in the ICH of CC with multi-head cross-attention (denoted CC (Multi)) and tested on the CIFAR-10 and ImageNet-dogs datasets using the settings of Section 4.1. Table 3 shows the results: performance improved consistently. On CIFAR-10, NMI, ACC, and ARI increased by 3.3%, 2.9%, and 5.0%, respectively; on ImageNet-dogs, by 25.6%, 38.0%, and 55.1%.
To demonstrate the effectiveness of multi-head cross-attention in clustering more intuitively, we used t-SNE to visualize the clustering of CC and CC (Multi) on the CIFAR-10 dataset, shown in Figure 4. We used the multi-head cross-attention features in the ICH as input to t-SNE and the cluster assignment predictions of the CCH as labels, represented by different colors. Compared with CC, which uses only an MLP, the intra-class distance is smaller, the inter-class distance is larger, and the cluster assignment is more balanced after using multi-head cross-attention. This shows that our method is more effective at improving clustering quality.
Multi-head attention is more effective than traditional MLP in sharing information among different attention heads, which enhances the feature representation ability. By using multi-head cross-attention, the model can learn more precise and comprehensive feature representations, leading to improved model performance.
4.2.4. Effect of Data Augmentation
As described in Section 3, we presented the data augmentation methods used by the model. To evaluate their effectiveness, we performed data augmentation ablation studies on the CIFAR-10 and ImageNet-dogs datasets, comparing the mixed strong-and-weak strategy, weak augmentation only, and no data augmentation. Table 4 shows the resulting improvements: with the same weak augmentation strategy as CC, performance increased by 3.9% on CIFAR-10 and 39.3% on ImageNet-dogs, and the mixed strong-and-weak strategy improved performance further.
4.3. Comparisons with State of the Arts
We evaluated our method on five public datasets and compared it with 15 representative clustering methods: K-Means [23], SC [24], DAE [25], DCGAN [26], VAE [27], DEC [28], DAC [29], DCCM [30], PICA [31], CC [9], IDFD [32], C3 [13], SPICE [5], TCL [14], and SACC [15]. Table 5 shows the results of our proposed method on the benchmark datasets.
4.3.1. Metrics Performance Analysis
Since CC introduced its learning framework to deep clustering, most state-of-the-art deep clustering models have used CC as a baseline. These models, including CC itself as well as C3, SPICE, TCL, and SACC, utilize both the training and test sets during training. As explained in Section 4.1, this practice inflates results. Even though this puts us at a disadvantage, we still split the CIFAR-10, CIFAR-100, and STL-10 datasets, which have recognized splitting schemes.

Although Multi-CC is intended as a baseline model, it produced the best results on the CIFAR-10, ImageNet-10, and ImageNet-dogs datasets; it ranked second on CIFAR-100 after TCL, and behind CC, SPICE, and TCL on STL-10. The improvement of Multi-CC over CC is significant, particularly on ImageNet-dogs, where NMI, ACC, and ARI increased by 43.8%, 57.6%, and 95.3%, respectively. We also observed performance increases of 17.9%, 21.4%, and 10.0% on CIFAR-10, CIFAR-100, and ImageNet-10, respectively.
During our study, we noticed that Multi-CC's STL-10 results were lower than those of CC. We attributed this to CC's joint training-and-test-set strategy and its use of additional unlabeled data. To confirm this, we retrained the CC model with the same data split as Multi-CC on STL-10, without any additional unlabeled data: 5000 training images and 8000 test images. The results are presented in Table 6. As conjectured, when CC is trained without the joint training-and-test-set strategy and without additional unlabeled data, its performance drops significantly. This also confirms that the joint training strategy mentioned in SCAN yields higher reported performance.
4.3.2. Metrics Performance Analysis with Other Options
To demonstrate the effectiveness and ease of use of our scheme, we applied the C3 and SACC improvements on CC to Multi-CC. Summaries of C3 and SACC can be found in Appendix A and Appendix B. The results are shown in Table 4, where the improved models are denoted Multi-CC (C3) and Multi-CC (SACC). Both outperform Multi-CC, indicating the superiority of our approach.
Multi-CC(C3) showed a significant improvement in NMI, ACC, and ARI by 11.4%, 8.3%, and 16.8%, respectively, on the ImageNet-dogs dataset compared to Multi-CC. Furthermore, compared to C3 implemented based on CC, Multi-CC(C3) increased NMI, ACC, and ARI by 59.2%, 68.7%, and 123.2%, respectively, on the ImageNet-dogs dataset. It also showed an average of 10% improvement on other datasets.
Multi-CC(SACC) also showed improvements compared to SACC implemented based on CC. For instance, on the ImageNet-dogs dataset, NMI, ACC, and ARI increased by 41.1%, 56.5%, and 91.9%, respectively. Similarly, there were also improvements on other datasets. These results not only confirm the effectiveness of the Multi-CC method but also show its flexibility and usability under different improvement schemes.
4.4. Time Performance Analysis
It is worth noting that Multi-CC also outperforms CC in time efficiency. CC reports 1000 training epochs per dataset, consuming substantial GPU resources, whereas Multi-CC is 80% faster, requiring only 200 epochs to reach a near-optimal result that already exceeds CC. We compared Multi-CC and CC on the ImageNet-dogs dataset in terms of NMI, ACC, and ARI, using only the first 200 epochs of CC for fairness. The results, presented in Figure 5a–c, show that Multi-CC reaches a good result quickly and that further fine-tuning improves it.
4.5. Contributions to the Field of Unsupervised Learning
For datasets such as CIFAR-10 and ImageNet-10, SimCLR requires stage-wise training if classification is needed. Since classification is not the focus of this paper, we run K-Means directly on the features extracted by SimCLR to simplify the process. We also adapted our proposed scheme to SimCLR, which we call Multi-SimCLR. For fairness, both schemes use ResNet34 as the backbone and compute clustering accuracy with K-Means after 200 epochs of training on the CIFAR-10 and STL-10 datasets. Table 7 shows that our method can efficiently be incorporated into SimCLR and achieves better performance: accuracy on CIFAR-10 and STL-10 increased by 52.2% and 29.2%, respectively, compared with SimCLR. These results demonstrate that our method generalizes readily to the broader contrastive learning framework.
5. Conclusions
We have made significant advancements in deep clustering by improving the baseline CC and introducing a new method, Multi-Head Cross-Attention Contrastive Clustering (Multi-CC). Although intended as a baseline, Multi-CC achieves the best results on datasets such as CIFAR-10, ImageNet-10, and ImageNet-dogs. Notably, we did not jointly train on the training and test sets, as was done in previous works on these benchmarks. Our model can effectively replace CC in CC-based models such as C3, SACC, TCL, and SPICE; applying the improvements of these models to Multi-CC yields even better performance. Multi-CC also reduces the training time of deep clustering by 80% compared with CC. We therefore hope our model can serve as a foundation for further improvements to deep clustering models. However, for datasets with many classes, such as Tiny-ImageNet (200 classes) and CIFAR-100 (100 classes), there is still no suitable deep clustering method, leaving this as an open problem.
Furthermore, we made a tentative attempt to incorporate our approach into SimCLR, which also produced a measurable improvement. This opens a new research direction for applying multi-head cross-attention mechanisms in unsupervised learning to improve model performance and efficiency. In the next phase, we will explore the feasibility of our approach across a broader scope of unsupervised learning, and we will also investigate super clustering.
Author Contributions
Conceptualization, Y.Y. (Yu Yang); Formal analysis, Y.Y. (Yu Yang); Funding acquisition, G.W.; Investigation, L.Z.; Methodology, Y.Y. (Yulin Yao); Project administration, Y.Y. (Yu Yang); Resources, L.Z.; Software, Y.Y. (Yulin Yao) and X.G.; Supervision, Y.Y. (Yu Yang); Validation, Y.Y. (Yulin Yao) and X.G.; Visualization, X.G.; Writing—original draft, Y.Y. (Yulin Yao); Writing—review and editing, Y.Y. (Yulin Yao) and G.W. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Key R&D Program of China (2022YFC3300803, 2021YFC3340602), the Opening Project of Intelligent Policing Key Laboratory of Sichuan Province (No. ZNJW2023KFMS008), the Natural Science Foundation of China (Grant No. 62172053), the 111 Project (Grant No. B21049), and Open Foundation of Guizhou Provincial Key Laboratory of Public Big Data (2018BDKFJJ019).
Data Availability Statement
Acknowledgments
All authors thank School of Cyberspace Security, Beijing University of Posts and Telecommunications.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Cross-Instance Guided Contrastive Clustering (C3)
Cross-instance guided Contrastive Clustering (C3) points out that CC ignores cross-instance patterns, which carry important information for improving clustering performance. Based on this observation, C3 proposes a new loss function on top of CC that identifies similar instances from their instance-level representations and encourages them to cluster together. Specifically, C3 first trains the CC network to obtain a reliable z-space, which remains suboptimal since no cross-sample relationships are considered. Then, since data labels are not accessible, C3 uses the cosine similarity of the sample representations in the z-space to measure data similarity: if the similarity of a pair of instances is greater than or equal to a given threshold hyperparameter, the samples are considered similar, and they are drawn together by minimizing the C3 loss function, in which an indicator function selects exactly those pairs whose similarity exceeds the threshold, treating them as members of the same class. We introduce this method to Multi-CC: after Multi-CC obtains a reliable z-space, we simply replace the loss function with the C3 loss and train 20 additional epochs; the experimental results are shown in the paper.
Appendix B. Strongly Augmented Contrastive Clustering (SACC)
Strongly Augmented Contrastive Clustering (SACC) extends the traditional dual augmented view paradigm to multiple views and jointly utilizes strong and weak augmentation to enhance deep clustering. SACC utilizes a backbone with three levels of shared weights containing one strongly augmented view and two weakly augmented views. Based on the features generated by the backbone, weak-weak view pairs and strong-weak view pairs are utilized simultaneously for instance-level comparison learning and cluster-level comparison learning, which, together with the backbone, can be jointly optimized in a purely unsupervised manner.
Specifically, suppose the feature representations obtained from the backbone are denoted h^s, h^{w1}, and h^{w2}, for the strongly augmented view and the two weakly augmented views, respectively. A two-layer nonlinear multilayer perceptron (MLP) in the ICH maps these features to another space, giving z^s, z^{w1}, and z^{w2}. The same cosine similarity is used to measure the distance between positive and negative instances, and SACC computes the instance-level contrastive loss on the two augmented pairs (weak-weak and strong-weak). A two-layer nonlinear MLP is likewise used in the CCH to map the features to the cluster space, giving y^s, y^{w1}, and y^{w2}, and SACC computes the cluster-level contrastive loss on the same two pairs.
References
- Min, E.; Guo, X.; Liu, Q.; Zhang, G.; Cui, J.; Long, J. A Survey of Clustering with Deep Learning: From the Perspective of Network Architecture. IEEE Access 2018, 6, 39501–39514.
- Hartigan, J.A.; Wong, M.A. A k-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1979, 28, 100–108.
- Yang, J.; Parikh, D.; Batra, D. Joint Unsupervised Learning of Deep Representations and Image Clusters. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5147–5156.
- Huang, P.; Huang, Y.; Wang, W.; Wang, L. Deep Embedding Network for Clustering. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 1532–1537.
- Niu, C.; Shan, H.; Wang, G. SPICE: Semantic Pseudo-Labeling for Image Clustering. IEEE Trans. Image Process. 2021, 31, 7264–7278.
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G.E. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709.
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R.B. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9726–9735.
- Chen, X.; He, K. Exploring Simple Siamese Representation Learning. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15745–15753.
- Li, Y.; Hu, P.; Liu, Z.; Peng, D.; Zhou, J.T.; Peng, X. Contrastive Clustering. arXiv 2020, arXiv:2009.09687.
- Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.Á.; Guo, Z.D.; Azar, M.G.; et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. arXiv 2020, arXiv:2006.07733.
- Chang, J.; Meng, G.; Wang, L.; Xiang, S.; Pan, C. Deep Self-Evolution Clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 809–823.
- Chang, J.; Guo, Y.; Wang, L.; Meng, G.; Xiang, S.; Pan, C. Deep Discriminative Clustering Analysis. arXiv 2019, arXiv:1905.01681.
- Sadeghi, M.; Hojjati, H.; Armanfard, N. C3: Cross-instance guided Contrastive Clustering. arXiv 2022, arXiv:2211.07136.
- Li, Y.; Yang, M.; Peng, D.; Li, T.; Huang, J.; Peng, X. Twin Contrastive Learning for Online Clustering. Int. J. Comput. Vis. 2022, 130, 2205–2221.
- Deng, X.; Huang, D.; Chen, D.H.; Wang, C.D.; Lai, J.H. Strongly augmented contrastive clustering. Pattern Recognit. 2023, 139, 109470.
- Wang, X.; Qi, G.J. Contrastive Learning with Stronger Augmentations. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5549–5560.
- Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding Convolution for Semantic Segmentation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1451–1460.
- Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 3008–3017.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Cohen, M.B.; Elder, S.; Musco, C.; Musco, C.; Persu, M. Dimensionality Reduction for k-Means Clustering and Low Rank Approximation. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, Portland, OR, USA, 14–17 June 2015.
- Gansbeke, W.V.; Vandenhende, S.; Georgoulis, S.; Proesmans, M.; Gool, L.V. SCAN: Learning to Classify Images without Labels. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020.
- Zhou, S.; Xu, H.; Zheng, Z.; Chen, J.; Li, Z.; Bu, J.; Wu, J.; Wang, X.; Zhu, W.; Ester, M. A Comprehensive Survey on Deep Clustering: Taxonomy, Challenges, and Future Directions. arXiv 2022, arXiv:2206.07579.
- Lloyd, S.P. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–136.
- Zelnik-Manor, L.; Perona, P. Self-Tuning Spectral Clustering. In Proceedings of the NIPS, Vancouver, BC, Canada, 13–18 December 2004.
- Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.A. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408.
- Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv 2015, arXiv:1511.06434.
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114.
- Xie, J.; Girshick, R.B.; Farhadi, A. Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015.
- Chang, J.; Wang, L.; Meng, G.; Xiang, S.; Pan, C. Deep Adaptive Image Clustering. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5880–5888.
- Wu, J.; Long, K.; Wang, F.; Qian, C.; Li, C.; Lin, Z.; Zha, H. Deep Comprehensive Correlation Mining for Image Clustering. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8149–8158.
- Huang, J.; Gong, S.; Zhu, X. Deep Semantic Clustering by Partition Confidence Maximisation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8846–8855.
- Tao, Y.; Takagi, K.; Nakata, K. Clustering-friendly Representation Learning via Instance Discrimination and Feature Decorrelation. arXiv 2021, arXiv:2106.00131.