In this section, we detail our proposed methodology. We begin with an overview of the noisy-label classification problem and notation in Section 3.1, followed by an explanation of twin contrastive clustering (TCC) [12] in Section 3.2; these two subsections introduce the foundation on which TPCR is built. Section 3.3 presents modifications to TCC informed by label information, and Section 3.4 presents the novel regularization terms based on the clustering outcomes of the adjusted TCC.
3.1. Problem Formulation
Consider a classification problem with $C$ classes. Denote the input space as $\mathcal{X}$ and the label space as $\mathcal{Y} = \{1, \dots, C\}$. Generally, models are trained on a clean dataset denoted as $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, with $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$ and $N$ representing the dataset's sample size. When learning with noisy labels, we only have access to the noisy dataset $\tilde{\mathcal{D}} = \{(x_i, \tilde{y}_i)\}_{i=1}^{N}$, where $\tilde{y}_i$ may be noisy; that is, some of the $\tilde{y}_i$ do not correctly reflect the visual content of the corresponding input $x_i$. During training, only noisy labels are available, and it remains unknown whether $\tilde{y}_i$ is noisy ($\tilde{y}_i \neq y_i$) or clean ($\tilde{y}_i = y_i$). The objective is to train a model that achieves high accuracy on the true labels despite the presence of an unspecified number of noisy labels in the training set.
The neural network model for this classification task is denoted as $f(\cdot; \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ denotes the trainable parameters of the network. This model captures the conditional probability distribution of the label given the input. Specifically, the model first maps the input $x_i$ to a logits vector $z_i \in \mathbb{R}^{C}$. Subsequently, a softmax operation is applied to transform $z_i$ into $p_i = \mathrm{softmax}(z_i)$, where the $c$-th element $p_{i,c}$ can be viewed as the probability of $x_i$ belonging to the $c$-th category. When learning with noisy labels, this model employs a noisy classification loss function:

$$\mathcal{L}_{\mathrm{cls}} = \frac{1}{N} \sum_{i=1}^{N} \ell(p_i, \tilde{\boldsymbol{y}}_i), \qquad (1)$$

where $\ell(p, t) = -\sum_{c=1}^{C} t_c \log p_c$ is the cross-entropy function and $\tilde{\boldsymbol{y}}_i$ is the one-hot vector corresponding to $\tilde{y}_i$. Notably, $p_i$ and $\tilde{\boldsymbol{y}}_i$ can also be viewed as probability mass functions of categorical distributions. For brevity, we will use the term 'probability vector' to refer to the probability mass function of a categorical distribution in the subsequent sections. With label noise, optimizing Equation (1) leads to overfitting the label noise, which reduces prediction accuracy on clean labels.
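As a concrete illustration, the following minimal PyTorch sketch implements Equation (1); the function name and tensor shapes are our own choices for exposition, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def noisy_classification_loss(logits: torch.Tensor, noisy_labels: torch.Tensor) -> torch.Tensor:
    """Equation (1): mean cross-entropy against the observed (possibly noisy) labels.

    logits:       (N, C) raw outputs z_i of the network f
    noisy_labels: (N,)   integer labels tilde{y}_i in {0, ..., C-1}
    """
    # F.cross_entropy fuses the softmax (z_i -> p_i) with the one-hot
    # inner product, so p_i never has to be materialized explicitly.
    return F.cross_entropy(logits, noisy_labels)
```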
3.2. Twin Contrastive Clustering
In order to identify similar samples, we need to obtain representations of the samples and cluster them. This study adopts twin contrastive clustering (TCC) [12] as the contrastive learning framework. Prior to introducing TCC, we first describe the contrastive-instance method that underpins TCC's methodology.
Contrastive learning leverages the unlabeled dataset $\mathcal{D}_u = \{x_i\}_{i=1}^{N}$, obtained by ignoring the label information in $\mathcal{D}$ or $\tilde{\mathcal{D}}$. Contrastive learning relies on pretext tasks for supervision [38], which are broadly categorized into contrastive-instance and clustering-based approaches. The contrastive-instance approach identifies two augmented versions of the same input as belonging to the same category, effectively performing single-sample recognition. Specifically, after random augmentations, $x_i$ yields two variants, $x_i^a$ and $x_i^b$, which are then transformed by a neural network model $g(\cdot)$ into $d$-dimensional instance-level representations $v_i^a = g(x_i^a)$ and $v_i^b = g(x_i^b)$. The probability of $x_i^a$ being identified as itself (i.e., as $x_i^b$) is expressed as:

$$p(x_i^b \mid x_i^a) = \frac{\exp(v_i^a \cdot v_i^b / \tau)}{\sum_{j=1}^{N} \exp(v_i^a \cdot v_j^b / \tau)}. \qquad (2)$$

Here, $\tau$ represents the temperature hyperparameter, which controls the concentration level [12]. Contrastive-instance methods construct their loss function from Equation (2) and thereby learn valuable representations.
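A compact sketch of Equation (2), under the common implementation convention (an assumption on our part) that the candidate set is the current mini-batch rather than all $N$ samples:

```python
import torch
import torch.nn.functional as F

def instance_identification_probs(va: torch.Tensor, vb: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Equation (2) over a batch: row i holds the distribution of x_i^a being
    identified as each candidate x_j^b; the diagonal entries are p(x_i^b | x_i^a).

    va, vb: (B, d) L2-normalized instance-level representations.
    """
    sim = va @ vb.t() / tau        # (B, B) pairwise similarities v_i^a . v_j^b
    return F.softmax(sim, dim=1)

# The contrastive-instance loss is then the negative log of the diagonal:
# loss = -instance_identification_probs(va, vb).diagonal().log().mean()
```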
Moving to TCC: after generating instance-level representations, it clusters the samples and then formulates a loss function centered around the clustering outcomes. This loss function combines cluster-level and instance-level parts. We first introduce the clustering process of TCC. To allocate the $N$ samples in $\mathcal{D}_u$ into $K$ clusters, TCC employs learnable clustering parameters $U = [u_1, \dots, u_K]$, where $u_k \in \mathbb{R}^{d}$, $\|u_k\|_2 = 1$, and $\|\cdot\|_2$ refers to the $\ell_2$-norm. Using the dot product to measure the similarity between $v_i^a$ and $u_k$, the membership probability of $x_i$ in cluster $k$ is calculated as:

$$q_{i,k} = \frac{\exp(v_i^a \cdot u_k)}{\sum_{k'=1}^{K} \exp(v_i^a \cdot u_{k'})}. \qquad (3)$$

For convenience, we use $q_i = (q_{i,1}, \dots, q_{i,K})$ to indicate the cluster assignment probabilities of $x_i$ over all clusters. Note that $q_{i,k}$ also reflects the degree of relevance of $x_i$ to the $k$-th cluster. With it serving as the aggregation weight, the representation $w_k^a$ of the $k$-th cluster can be expressed as follows:

$$w_k^a = \frac{\sum_{i=1}^{N} q_{i,k} v_i^a}{\left\| \sum_{i=1}^{N} q_{i,k} v_i^a \right\|_2}. \qquad (4)$$

Here, $\ell_2$-normalization is adopted because normalized representations benefit contrastive learning [27].
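Equations (3) and (4) reduce to two matrix products; a batch-level sketch with our own variable names:

```python
import torch
import torch.nn.functional as F

def soft_clustering(v: torch.Tensor, u: torch.Tensor):
    """Equations (3)-(4): soft cluster assignments and aggregated,
    L2-normalized cluster representations.

    v: (B, d) instance-level representations
    u: (K, d) learnable cluster parameters, each row L2-normalized
    """
    q = F.softmax(v @ u.t(), dim=1)    # (B, K) membership probabilities q_{i,k}
    w = F.normalize(q.t() @ v, dim=1)  # (K, d) cluster representations w_k
    return q, w
```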
Analogous to Equation (2), TCC employs the representations $v_i^b$ to generate an additional set of cluster-level representations, denoted as $w_k^b$. Utilizing both $w_k^a$ and $w_k^b$, TCC's cluster-level contrastive objective is formulated as:

$$\mathcal{L}_{\mathrm{clu}} = -\frac{1}{K} \sum_{k=1}^{K} \log \frac{\exp(w_k^a \cdot w_k^b / \tau)}{\sum_{k'=1}^{K} \exp(w_k^a \cdot w_{k'}^b / \tau)}. \qquad (5)$$

Minimizing this objective enhances the similarity between representations of the same cluster ($w_k^a$ and $w_k^b$) while reducing the similarity across different clusters ($w_k^a$ and $w_{k'}^b$ for $k' \neq k$), thereby fostering meaningful representations and clustering outcomes.
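Equation (5) is an InfoNCE-style loss over the $K$ cluster representations; a sketch:

```python
import torch
import torch.nn.functional as F

def cluster_contrastive_loss(wa: torch.Tensor, wb: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Equation (5): pull w_k^a toward w_k^b and away from the other
    clusters' representations w_{k'}^b.

    wa, wb: (K, d) cluster-level representations from the two views.
    """
    logits = wa @ wb.t() / tau                            # (K, K) similarities
    targets = torch.arange(wa.size(0), device=wa.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)               # mean -log softmax diagonal
```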
In addition to the cluster-level contrastive loss $\mathcal{L}_{\mathrm{clu}}$, TCC also contains an instance-level contrastive loss, the evidence lower bound (ELBO) loss, which is derived from the lower bound of $\log p(x_i^b \mid x_i^a)$. Denote by $p(x_i^b \mid x_i^a, k)$ the instance identification probability within the context of the $k$-th cluster, with $p(k)$ denoting the prior over clusters, which follows the uniform distribution. The relationship between the instance identification probability $p(x_i^b \mid x_i^a)$ and its lower bound is captured by the following inequality:

$$\log p(x_i^b \mid x_i^a) \geq \mathbb{E}_{q_i}\!\left[\log p(x_i^b \mid x_i^a, k)\right] - D_{\mathrm{KL}}\!\left(q_i \,\|\, p(k)\right), \qquad (6)$$

where $D_{\mathrm{KL}}$ represents the Kullback-Leibler divergence. The detailed derivation of this inequality can be found in Appendix A. The right-hand side of the inequality, the ELBO, incorporates the clustering probability $q_i$ and enhances the clustering performance of TCC. Based on Equation (6), the ELBO loss $\mathcal{L}_{\mathrm{ELBO}}$ for TCC is formulated as:

$$\mathcal{L}_{\mathrm{ELBO}} = -\frac{1}{N} \sum_{i=1}^{N} \left( \mathbb{E}_{q_i}\!\left[\log p(x_i^b \mid x_i^a, k)\right] - D_{\mathrm{KL}}\!\left(q_i \,\|\, p(k)\right) \right). \qquad (7)$$

By minimizing $\mathcal{L}_{\mathrm{ELBO}}$, TCC maximizes the lower bound of $\log p(x_i^b \mid x_i^a)$, thereby elevating the instance identification probability. Based on $\mathcal{L}_{\mathrm{clu}}$ and $\mathcal{L}_{\mathrm{ELBO}}$, the loss function of TCC is represented as $\mathcal{L}_{\mathrm{TCC}} = \mathcal{L}_{\mathrm{clu}} + \mathcal{L}_{\mathrm{ELBO}}$.
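The reconstruction term in Equation (7) depends on how $p(x_i^b \mid x_i^a, k)$ is parameterized (cf. Section 3.3 and Appendix B), but the KL term against the uniform prior is fully determined; a sketch:

```python
import math
import torch

def kl_to_uniform_prior(q: torch.Tensor) -> torch.Tensor:
    """D_KL(q_i || p(k)) with p(k) = 1/K, the penalty term of Eqs. (6)-(7).

    q: (B, K) cluster assignment probabilities; returns a (B,) vector.
    """
    K = q.size(1)
    log_q = q.clamp_min(1e-12).log()          # guard against log(0)
    # sum_k q_{i,k} (log q_{i,k} - log(1/K)) = -H(q_i) + log K
    return (q * log_q).sum(dim=1) + math.log(K)
```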
3.3. Injecting Label Information into TCC
The ELBO loss is crucial for TCC to generate effective instance-level representations and meaningful clustering results. To align the clustering results more closely with category information, this subsection introduces modifications to $\mathcal{L}_{\mathrm{ELBO}}$.
Note that the KL divergence term $D_{\mathrm{KL}}(q_i \,\|\, p(k))$ in Equation (7) involves the clustering prior distribution $p(k)$, which is simply set to the discrete uniform distribution for lack of meaningful prior information. To enhance the consistency between clustering and classification, a feasible approach is to replace this non-informative prior distribution with a meaningful clustering distribution derived from labels. Implementing this replacement strategy requires constructing a new clustering prior distribution related to the label information, which motivates us to reflect on the correspondence between classes and clusters.
Utilizing the established notation, the total numbers of classes and clusters are denoted as $C$ and $K$, respectively. A one-to-one correspondence between classes and clusters is feasible when $K = C$, resulting in clustering outcomes that mirror the classification task, whereby each cluster corresponds to a single class. If $K < C$, a single cluster may encompass multiple categories, diminishing the utility of clustering in identifying similar samples; such configurations are thus excluded from consideration. When $K > C$, a one-to-one correspondence between clusters and classes cannot be achieved. To extend the concept of correspondence, one class can instead correspond to multiple clusters, which is equivalent to splitting each class into several sub-classes and then associating each sub-class with a cluster. Moreover, a small $K$ would pose challenges to TCC training, so $K$ usually takes a larger value. Hence, we assume $K \geq C$ in the following.
To delineate the one-to-many relationships between classes and clusters, we introduce an alignment matrix $A \in \mathbb{R}^{K \times C}$. Ideally, $A$ realizes the transition from the classification probabilities $p_i$ to the clustering assignment probabilities $q_i$; specifically, $q_i = A p_i$. For the $k$-th element of the clustering assignment probabilities, the relationship $q_{i,k} = a_k^{\top} p_i$ should hold, where $a_{k,c} \geq 0$ and $a_k^{\top}$ represents the $k$-th row of $A$. For each class $c$, the contribution of $p_{i,c}$ to $q_{i,k}$ is determined by the $c$-th element of $a_k$, denoted as $a_{k,c}$. Specifically, if cluster $k$ is associated with class $c$, then $p_{i,c}$ should influence $q_{i,k}$, signifying that $a_{k,c} > 0$; otherwise, $a_{k,c} = 0$.
To construct the alignment matrix $A$, we need to clarify the class correspondence for each cluster. Intuitively, the class corresponding to a cluster should be the majority class label among the samples within that cluster; we refer to the label shared by the majority of samples as the main class of the cluster. In label-noise classification tasks, there is no access to the true class labels $y_i$ and the corresponding one-hot vectors for individual samples. Thus, we resort to using the classification probabilities $p_i$ to deduce the class label for each sample, thereby determining the main class of each cluster. Specifically, for the samples within the $k$-th cluster, we estimate the class index of each sample as $\arg\max_{c} p_{i,c}$. By aggregating these estimations, we identify the most frequent class, denoted as $c_k$, which is considered the main class of the $k$-th cluster. Upon estimating the main class for all clusters, the alignment matrix $A$ is formulated as:

$$a_{k,c} = \begin{cases} 1, & c = c_k, \\ 0, & \text{otherwise.} \end{cases} \qquad (8)$$

Here, $a_{k,c}$ indicates the relevance of the $k$-th cluster to the $c$-th class, and $\bar{A}$ is the result of column-wise normalization of $A$, which ensures that $\hat{q}_i = \bar{A} p_i$ still satisfies the conditions of a probability distribution.
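A sketch of the construction in Equation (8); the majority vote over $\arg\max_c p_{i,c}$ and the column normalization follow the text, while the handling of empty clusters is our own guard:

```python
import torch

def build_alignment_matrix(q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Equation (8) plus column-wise normalization: map each cluster to its
    main class and return the normalized alignment matrix A-bar.

    q: (N, K) cluster assignment probabilities
    p: (N, C) classification probabilities used to estimate class labels
    """
    K, C = q.size(1), p.size(1)
    cluster_idx = q.argmax(dim=1)   # hard cluster index per sample
    class_idx = p.argmax(dim=1)     # estimated class label per sample
    A = torch.zeros(K, C)
    for k in range(K):
        members = class_idx[cluster_idx == k]
        if members.numel() > 0:                 # skip empty clusters (our guard)
            A[k, members.mode().values] = 1.0   # main class c_k of cluster k
    # column-wise normalization so that A-bar @ p_i remains a probability vector
    return A / A.sum(dim=0, keepdim=True).clamp_min(1e-12)
```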
The KL divergence term in $\mathcal{L}_{\mathrm{ELBO}}$ can be transformed as follows:

$$D_{\mathrm{KL}}\!\left(q_i \,\|\, p(k)\right) = -H(q_i) + H\!\left(q_i, p(k)\right), \qquad (9)$$

where $H(q_i)$ denotes the entropy and $H(q_i, p(k)) = -\sum_{k=1}^{K} q_{i,k} \log p(k)$ is the cross-entropy. In the KL divergence term, only the cross-entropy term involves the prior $p(k)$. With $\hat{q}_i = \bar{A} p_i$ as the new prior, we replace the cross-entropy term with $H(q_i, \hat{q}_i)$. However, simply replacing the prior distribution introduces new pitfalls, since $\hat{q}_i$ may be misled by noise. To mitigate the impact of noisy labels, a confidence threshold $\delta$ is introduced to filter out significantly erroneous label information. Specifically, we introduce an indicator function $m_i = \mathbb{1}(\max_{c} p_{i,c} \geq \delta)$, so that only samples whose predictions satisfy $\max_{c} p_{i,c} \geq \delta$ are used to guide the clustering. Replacing the cross-entropy term in Equation (9), we obtain:

$$\tilde{D}_i = -H(q_i) + m_i \, H(q_i, \hat{q}_i) + (1 - m_i) \, H\!\left(q_i, p(k)\right). \qquad (10)$$

Compared to $D_{\mathrm{KL}}(q_i \,\|\, p(k))$, Equation (10) introduces category information as a prior into the clustering process, facilitating category-consistent clustering outcomes. It is crucial to note that, during optimization, $\hat{q}_i$ is treated as fixed and only $q_i$ is updated. The modified ELBO loss is expressed as:

$$\mathcal{L}_{\mathrm{ELBO}}' = -\frac{1}{N} \sum_{i=1}^{N} \left( \mathbb{E}_{q_i}\!\left[\log p(x_i^b \mid x_i^a, k)\right] - \tilde{D}_i \right). \qquad (11)$$
Another key element of $\mathcal{L}_{\mathrm{ELBO}}'$ is the expectation term, which hinges on the construction of $p(x_i^b \mid x_i^a, k)$. In the original TCC, this distribution is parameterized with a small neural network. However, introducing an additional network adds extra parameters, potentially destabilizing model training. To enhance the stability of the training process, we instead use a concatenation operation to generate joint representations, which are subsequently employed to parameterize $p(x_i^b \mid x_i^a, k)$. Additionally, the expectation computation involves the reparameterization trick [39,40]. Specific details can be found in Appendix B. Finally, the modified TCC loss takes the following form:

$$\mathcal{L}_{\mathrm{TCC}}' = \mathcal{L}_{\mathrm{clu}} + \mathcal{L}_{\mathrm{ELBO}}'. \qquad (12)$$
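To make the prior replacement concrete, here is a sketch of the modified KL-style term $\tilde{D}_i$ of Equation (10); falling back to the uniform prior for low-confidence samples is our reading of the thresholding rule:

```python
import math
import torch

def label_informed_kl(q: torch.Tensor, p: torch.Tensor,
                      A_bar: torch.Tensor, delta: float = 0.9) -> torch.Tensor:
    """Equation (10): negative entropy of q_i plus a cross-entropy against the
    label-informed prior hat{q}_i = A-bar p_i for confident samples, and
    against the uniform prior otherwise.

    q:     (N, K) clustering assignments (the only quantity being updated)
    p:     (N, C) classification probabilities (treated as fixed)
    A_bar: (K, C) column-normalized alignment matrix
    """
    K = q.size(1)
    log_q = q.clamp_min(1e-12).log()
    neg_entropy = (q * log_q).sum(dim=1)                  # -H(q_i)
    prior = (p.detach() @ A_bar.t()).clamp_min(1e-12)     # hat{q}_i, kept fixed
    m = (p.max(dim=1).values >= delta).float()            # indicator m_i
    ce_informed = -(q * prior.log()).sum(dim=1)           # H(q_i, hat{q}_i)
    ce_uniform = math.log(K)                              # H(q_i, 1/K) = log K
    return neg_entropy + m * ce_informed + (1.0 - m) * ce_uniform
```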
3.4. Prediction Consistency Regularization Based on Clustering
In the previous section, we adjusted the ELBO loss of TCC to incorporate classification information into the clustering process. In this section, we present a novel regularization term based on clustering results.
The purpose of the regularization term is to eliminate class prediction discrepancies among similar samples. In the clustering process of TCC, samples are aggregated into the $k$-th cluster by evaluating the similarity between the representations $v_i^a$ and $u_k$; consequently, from the perspective of representations, samples belonging to the same cluster can be regarded as similar samples. The regularization term should therefore ensure that all samples within a cluster have similar class predictions. The most intuitive approach is to constrain the differences in class predictions between all pairs of samples, but this incurs a high computational cost, whereas a prototype-based approach is more efficient. To develop the prototype-based regularization term, we first generate a prediction center for each cluster and then encourage all class predictions within a cluster to stay close to the corresponding center.
To generate the prediction center, the prediction $p_i$ is utilized as a substitute for the clean label. Note that $p_i$ may contain errors, and not all clustering results are equally reliable, so we adopt a weighted-averaging approach to overcome potentially misleading information. Specifically, for $x_i$, we denote its cluster index as $k_i = \arg\max_{k} q_{i,k}$ and the corresponding cluster confidence as $q_{i,k_i}$. Let $S_k$ be the set of indices of samples belonging to the $k$-th cluster; the prediction center of the $k$-th cluster is then defined as:

$$\bar{p}_k = \frac{\sum_{i \in S_k} q_{i,k_i} \, p_i}{\sum_{i \in S_k} q_{i,k_i}}. \qquad (13)$$

Note that $\bar{p}_{k,c} \geq 0$ and $\sum_{c=1}^{C} \bar{p}_{k,c} = 1$, where $\bar{p}_{k,c}$ is the $c$-th element of $\bar{p}_k$. This indicates that $\bar{p}_k$ remains a probability mass function. Therefore, $\bar{p}_k$ can also be understood as an aggregated classification distribution in which the clustering confidence $q_{i,k_i}$ serves as the aggregation weight. Based on the cluster prediction centers, we construct the regularization term as follows:
$$\mathcal{L}_{\mathrm{reg}} = \frac{\sum_{i=1}^{N} q_{i,k_i} \, \ell(p_i, \bar{p}_{k_i})}{\sum_{i=1}^{N} q_{i,k_i}}. \qquad (14)$$

Here, $\ell$ represents the cross-entropy function, $k_i$ is the clustering assignment of $x_i$, and $\bar{p}_{k_i}$ is the prediction center of the $k_i$-th cluster. Alternative metrics, such as the inner product [18], could also be utilized to quantify the disparity between $p_i$ and its associated prediction center. Equation (14) is likewise formulated in a weighted-averaging manner, which allows samples with higher clustering confidence to exert greater influence and helps mitigate the potential impact of clustering errors. Finally, we obtain the following overall loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{TCC}}' + \lambda \mathcal{L}_{\mathrm{reg}}, \qquad (15)$$

where $\mathcal{L}_{\mathrm{cls}}$ is the classification loss based on noisy labels, $\mathcal{L}_{\mathrm{TCC}}'$ is the adjusted TCC loss, $\mathcal{L}_{\mathrm{reg}}$ is the regularization term, and $\lambda$ is the regularization strength parameter.
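A batch-level sketch of Equations (13) and (14); detaching the centers so that gradients flow only through $p_i$ is an assumption on our part:

```python
import torch
import torch.nn.functional as F

def prediction_consistency_reg(q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Equations (13)-(14): confidence-weighted cluster prediction centers
    and the weighted cross-entropy pulling each p_i toward its center.

    q: (N, K) cluster assignment probabilities
    p: (N, C) class predictions p_i
    """
    conf, k_idx = q.max(dim=1)                # confidence q_{i,k_i} and index k_i
    K = q.size(1)
    # Eq. (13): centers as confidence-weighted averages within each cluster
    weights = F.one_hot(k_idx, K).float() * conf.unsqueeze(1)   # (N, K)
    centers = (weights.t() @ p) / weights.sum(dim=0).clamp_min(1e-12).unsqueeze(1)
    # Eq. (14): weighted cross-entropy between p_i and its (fixed) center
    ce = -(centers[k_idx].detach() * p.clamp_min(1e-12).log()).sum(dim=1)
    return (conf * ce).sum() / conf.sum().clamp_min(1e-12)
```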
The proposed regularization term relies on the quality of clustering. However, ensuring high-quality clustering during the initial stages of training is often challenging. To prevent the adverse effects of poor clustering results, we introduce a warm-up phase during which the objective function does not include the regularization term. Our training framework is summarized in Algorithm 1.
Algorithm 1: Training Algorithm
Input: Noisy dataset $\tilde{\mathcal{D}}$, total number of training epochs $S$, warm-up epochs $S_w$, confidence threshold $\delta$, and regularization strength $\lambda$
Output: Classification network $f(\cdot; \boldsymbol{\theta})$
1: for $s = 1, \dots, S$ do
2:   Sample mini-batches from $\tilde{\mathcal{D}}$ and compute $\mathcal{L}_{\mathrm{cls}}$ and $\mathcal{L}_{\mathrm{TCC}}'$
3:   if $s \leq S_w$ then update the networks by minimizing $\mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{TCC}}'$ (warm-up)
4:   else compute $\mathcal{L}_{\mathrm{reg}}$ from the clustering results and update by minimizing Equation (15)
5: end for
To improve the alignment between clustering and classification while reducing the number of parameters, prior studies frequently share part of the parameters between $g(\cdot)$ and $f(\cdot; \boldsymbol{\theta})$. We adopt this choice as well: $g(\cdot)$ is structured as an encoder built on the backbone network, while $f(\cdot; \boldsymbol{\theta})$ is the composition of the same backbone network and a classification head.
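A minimal sketch of this parameter sharing; the backbone, dimensions, and head shapes are placeholders rather than the authors' architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedBackboneModel(nn.Module):
    """g(.) and f(.) on one backbone: g = backbone + projection head producing
    the representations v_i; f = backbone + classification head."""

    def __init__(self, backbone: nn.Module, feat_dim: int, rep_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                      # e.g., a ResNet without its fc layer
        self.proj = nn.Linear(feat_dim, rep_dim)      # head for g(.)
        self.cls = nn.Linear(feat_dim, num_classes)   # head for f(.)

    def forward(self, x: torch.Tensor):
        h = self.backbone(x)
        v = F.normalize(self.proj(h), dim=1)          # L2-normalized representation v_i
        logits = self.cls(h)                          # class logits z_i
        return v, logits
```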