
DYMatch: Semi-Supervised Learning with Dynamic Pseudo Labeling and Feature Consistency

1 School of IoT Engineering, Jiangnan University, 1800 Lihu Avenue, Wuxi 214122, China
2 Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, School of Artificial Intelligence and Computer Science, Jiangnan University, 1800 Lihu Avenue, Wuxi 214122, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(1), 229; https://doi.org/10.3390/app14010229
Submission received: 2 December 2023 / Revised: 18 December 2023 / Accepted: 25 December 2023 / Published: 26 December 2023

Abstract

A considerable number of approaches based on consistency regularization and pseudo-labeling have been proposed in semi-supervised learning (SSL) so far. These approaches significantly enhance SSL by effectively utilizing large amounts of unlabeled data to improve the model’s performance. However, existing methods may fail to utilize the unlabeled data efficiently, mainly due to the challenges faced in pseudo-label estimation. In this paper, we begin by analyzing the impact of the pseudo-label estimation method on training the SSL model. Furthermore, we emphasize that an effective pseudo-label estimation method should reflect the differences in recognition performance among samples from different categories, and should also maintain both high-quantity and high-quality pseudo labels during each training iteration. Based on the above analysis, we propose DYMatch, an innovative SSL method employing a dynamic estimation process. Firstly, a dynamic pseudo-label estimation method based on the Gaussian mixture model is proposed to dynamically estimate the confidence threshold for different categories. Secondly, a feature-correlation consistency regularization method is introduced to further enhance the learning on unlabeled data. The experimental results show that DYMatch is a simple and effective SSL method, especially when the labeled data is limited.

1. Introduction

In the field of computer vision, deep learning methods have consistently been at the forefront, and their superior performance heavily relies on the availability of sufficient, accurately annotated labeled data [1,2]. However, the acquisition of enough labeled data is a labor-intensive and expensive task. Various methods are available to address this issue, such as weakly supervised learning [3,4], zero-shot learning [5], and semi-supervised learning [6]. Among them, semi-supervised learning offers an effective approach by employing a small amount of labeled data and a large amount of unlabeled data for training. It shows great potential in practical applications and can significantly reduce the dependence of training on laborious data collection and annotation [7,8,9,10,11,12].
The main challenge in SSL lies in effectively utilizing unlabeled data to improve the model’s generalization performance [6]. Consistency regularization [13,14,15] and pseudo-labeling [16,17] represent the two most widely utilized paradigms designed for SSL, and their combined approaches have also demonstrated good performance [9,10,18,19]. The key idea underlying these methods is grounded in the low-entropy assumption, that is, the model should maintain consistency in its predictions when different perturbations are applied to the same unlabeled data [6]. However, a potential limitation of employing pseudo-labeling methods to address this problem is the requirement to set a fixed confidence threshold [7,8,9,18] or to implement a stage-wise threshold adjustment method [20] for the selection of suitable pseudo labels.
In pseudo-labeling methods designed with a fixed confidence threshold, the model selects unlabeled data with prediction scores surpassing the threshold for training and directly discards those falling below it. For example, UDA [18] and FixMatch [9] employ a fixed high-confidence threshold to maintain the quality of the chosen pseudo labels. However, during the early training stages, a fixed high-confidence threshold (e.g., 0.99) may exclude too much unlabeled data from training, leaving a large amount of unlabeled data unused. This approach overlooks the fact that, in the early training stages, the model has a limited ability to distinguish unlabeled samples, making it challenging to generate high-confidence predictions. An enhanced alternative to the fixed threshold is to employ a stage-wise threshold adjustment strategy. For example, Dash [20] and AdaMatch [21] opt to gradually increase the threshold as training progresses. The utilization of unlabeled data in such methods is improved compared to methods using a fixed confidence threshold. However, the performance of the stage-wise threshold adjustment strategy is directly influenced by the pre-defined hyper-parameters, and the setting of these hyper-parameters is largely decoupled from the model’s training process in most cases. This is especially notable in disregarding the varying difficulties associated with learning data from different categories. Furthermore, optimal hyper-parameters have to be searched for anew on different datasets. FlexMatch [10] attempts to employ different local (class-based) thresholds for samples from different categories. Although the setting of the local threshold takes into account the model’s different learning status across categories, it is still derived from a pre-defined fixed global (dataset-based) threshold. In essence, these methods fundamentally ignore the crucial effect of the model’s training process on the pseudo-label estimation, whether for global or local thresholds.
Essentially, the core requirement for a pseudo-label estimation method is to maintain a balance between the quantity and quality of pseudo labels during the training process. The methods utilizing a high fixed threshold explicitly abandon the requirement on quantity, concentrating solely on generating high-quality pseudo labels. However, when labeled data are extremely rare, the challenge with the high fixed threshold lies in its difficulty in ensuring the quality of pseudo labels. Conversely, the stage-wise threshold adjustment method is equivalent to compromising the quality of pseudo labels in favor of increasing their quantity. Such a strategy is susceptible to introducing incorrect pseudo labels, leading to cognitive bias during the training process. Therefore, a novel dynamic pseudo-label estimation method based on the Gaussian mixture model is proposed in this paper. We assume that the prediction scores for a specific category in a training iteration are sampled from a Gaussian mixture distribution comprising two components, namely the positive and the negative distribution. This assumption can be extended to the whole of the unlabeled data. By fitting this Gaussian mixture model, the model can assign each category a local threshold that aligns with the current training status. Specifically, for a given category, we consider the prediction scores of this category as a weighted sum of the positive and negative distributions, and employ maximum likelihood estimation to estimate the parameters of each Gaussian distribution. Consequently, the dynamic pseudo-label estimation method can adaptively generate the optimal local threshold at different training iterations so as to stabilize the quantity and quality of pseudo labels. Intuitively, during the early training stages, the dynamic pseudo-label estimation method employs a relatively low threshold to encourage the model to train on more unlabeled data and accelerate convergence. As the classification performance improves and the prediction confidence increases, a higher threshold is generated to filter out incorrect pseudo labels, alleviating the cognitive bias of the model.
Inspired by previous research [22,23], we recognize the significance of constraining the training process of feature representations in SSL tasks. Therefore, in DYMatch, we also propose a feature-correlation consistency regularization method, whose integration with the dynamic pseudo-label estimation method can effectively enhance the model’s classification performance. Unlike other feature-constraint methods, the feature-correlation consistency regularization method is selectively applied to the unlabeled data with prediction scores exceeding the confidence threshold generated by the dynamic pseudo-label estimation method. Moreover, we implement feature-correlation consistency regularization and dynamic pseudo-label estimation at different layers of the model, thereby effectively avoiding the influence of strong coupling between these two methods on the model’s performance.
The paper is structured as follows. Section 2 reviews the common pseudo-labeling and consistency regularization methods in SSL, and also introduces the relevant background on the Gaussian mixture model and its applications in SSL. Section 3 details the motivation and implementation of DYMatch. In Section 4, we present the results of the comparative experiments and ablation studies. The paper is summarized in Section 5.

2. Related Works

2.1. Consistency Regularization and Pseudo-Labeling Methods in SSL

The SSL task addressed in this paper can be defined as a classification task with $C$ categories. We define $\mathcal{X} = \{(X_b, p_b) : b \in (1, \ldots, B)\}$ as the batches of labeled training samples, where $X_b$ and $p_b$ represent the $b$-th batch of labeled samples and the labels corresponding to these samples, respectively, $B$ represents the number of these batches, and $b$ denotes the index of a batch. In addition, let $\mathcal{U} = \{U_b : b \in (1, \ldots, B)\}$ be the batches of unlabeled data. We use $N_L$ and $N_U$ to denote the number of labeled and unlabeled samples in each batch of training data, typically with $N_U = \sigma N_L$. $P_{model}(y \mid z_i; \theta)$ denotes the prediction score of the model with parameters $\theta$ for input $z_i$, where $z_i$ represents the $i$-th sample in a training batch.
In SSL, the motivation of consistency regularization often follows the smoothness assumption or the low-density assumption [6], which means that the model should generate similar predictions for a specific unlabeled sample when subjected to various perturbations. Therefore, the consistency regularization method can be viewed as finding, with the help of the unlabeled data, a smooth manifold on which the whole dataset lies [24]. A common strategy for introducing perturbations is to employ data augmentation methods [25,26], which apply image transformations while preserving the semantic information of the data. Specifically, consistency regularization demands that an unlabeled sample $u_i$ should exhibit similarity to its corresponding augmented sample, i.e., $\mathrm{Augment}(u_i)$. Therefore, most SSL models apply the consistency regularization loss term in Equation (1) to a batch of unlabeled data:
$$\frac{1}{N_U} \sum_{i=1}^{N_U} \left\| P_{model}\left(y \mid \mathrm{Augment}_1(u_i); \theta\right) - P_{model}\left(y \mid \mathrm{Augment}_2(u_i); \theta\right) \right\|_2^2 \tag{1}$$
where $\mathrm{Augment}_1(\cdot)$ and $\mathrm{Augment}_2(\cdot)$ represent two different data augmentation operations, and $u_i$ denotes the $i$-th unlabeled sample in a batch. The Mean Teacher approach [27] contains a teacher model and a student model. The student model closely resembles the regular model of the Π-Model [13], and the teacher model has the same network structure as the student model but employs an exponential moving average of the student’s weights for its parameter updates. In Mean Teacher, the predictions of the student model and the teacher model are used to calculate the consistency regularization loss. Specifically, $\mathrm{Augment}_2(\cdot)$ in Equation (1) corresponds to the predictions of the model updated with the Exponential Moving Average (EMA) method. This operation yields a relatively stable feature representation, significantly improving the classification performance. Dual Student [28] is an enhanced teacher-student model that overcomes the effect of tightly coupled teacher-student models by introducing another student model to replace the teacher model. These two student models have different initial states and are trained with distinct optimization strategies during the training process. Furthermore, Dual Student also introduces two innovative concepts, namely the stable sample and the stabilization constraint, to prevent the performance bottlenecks associated with a coupled EMA teacher-student model. Virtual Adversarial Training (VAT) [29] proposed the concept of adversarial training based on consistency regularization, which aims to generate adversarial transformations of unlabeled data to perturb the predictions of the model. Essentially, it is a training strategy where $\mathrm{Augment}_2(\cdot)$ in Equation (1) represents an adversarial transformation, and consistency regularization is performed between the predictions for the original unlabeled data and the perturbed one. UDA [18] can be viewed as an unsupervised data augmentation method that analyzes the contribution of noise perturbation methods to consistency regularization and replaces simple perturbation techniques with advanced data augmentation methods, such as RandAugment [25] and AutoAugment [26]. Within the consistency regularization framework, UDA extends the data augmentation method from fully-supervised learning to SSL. In this paper, DYMatch also utilizes a strong-weak data augmentation method, similar to FixMatch [9], to introduce perturbations to the unlabeled data. However, in contrast to the feature-correlation consistency regularization applied by DYMatch in the feature space, existing consistency regularization methods tend to impose consistency on the model’s predictions, and their strong coupling with pseudo-labeling may restrict the model’s classification performance.
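As a concrete illustration, the following PyTorch sketch implements the generic consistency term of Equation (1); `augment1` and `augment2` are placeholders for whichever perturbation scheme a given method plugs in (an EMA teacher, adversarial noise, RandAugment, and so on), not APIs of any specific library.

```python
# A minimal sketch of the consistency regularization loss in Equation (1).
import torch

def consistency_loss(model, u_batch, augment1, augment2):
    # Predictive distributions of the model under two different perturbations
    p1 = torch.softmax(model(augment1(u_batch)), dim=1)
    p2 = torch.softmax(model(augment2(u_batch)), dim=1)
    # Squared L2 distance between the two distributions, averaged over the batch
    return ((p1 - p2) ** 2).sum(dim=1).mean()
```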
The difference between pseudo-labeling and consistency regularization methods is that consistency regularization methods typically rely on constraints imposed by various data transformations. In contrast, the pseudo-labeling method relies more on obtaining good pseudo labels with high confidence and adding them to the training set as labeled data. By defining $q_i = P_{model}(y \mid u_i)$, the loss function for unlabeled data can be derived as follows:
$$\frac{1}{N_U} \sum_{i=1}^{N_U} \mathbb{1}\left(\max(q_i) \geq \eta\right) H(\hat{q}_i, q_i) \tag{2}$$
where $\hat{q}_i = \arg\max(q_i)$, $\eta$ is a hyper-parameter known as the confidence threshold, $H(\cdot)$ is the standard cross-entropy loss function, and $\mathbb{1}(\cdot)$ is a mask function based on $\eta$. Additionally, the $\arg\max$ operation is used to generate a “one-hot” distribution based on the model’s predictions. In Pseudo-Label [16], the model is first trained in the fully-supervised manner on labeled data. Subsequently, the same model is employed to predict the unlabeled data, and the predictions with the maximum confidence are treated as the pseudo labels. In FlexMatch [10], the Curriculum Pseudo Labeling (CPL) method, a curriculum learning method, is employed to learn the unlabeled data based on the model’s training status. Specifically, the significant contribution of CPL lies in its flexible adjustment of the thresholds for each category during the training process, allowing more unlabeled data to be used for training. The dynamic pseudo-label estimation method in DYMatch is based on the model’s training status. It dynamically selects local thresholds for different categories while simultaneously estimating a global threshold. This strategy effectively ensures that the model can achieve the optimal combination of pseudo labels in terms of both quantity and quality at any training iteration.
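For concreteness, the following PyTorch sketch shows the fixed-threshold variant of Equation (2); it uses hard pseudo labels and a single global threshold `eta` (e.g., 0.95), the setting used by FixMatch-style methods.

```python
# A minimal sketch of the fixed-threshold pseudo-labeling loss in Equation (2).
import torch
import torch.nn.functional as F

def pseudo_label_loss_fixed(logits_u, eta=0.95):
    probs = torch.softmax(logits_u.detach(), dim=1)
    conf, pseudo = probs.max(dim=1)                      # max(q_i) and argmax(q_i)
    mask = (conf >= eta).float()                         # the mask function 1(.)
    loss = F.cross_entropy(logits_u, pseudo, reduction="none")
    return (loss * mask).mean()                          # averaged over N_U
```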

2.2. Gaussian Mixture Model and Its Application in SSL

Gaussian Mixture Models (GMMs) are commonly employed in machine learning to address clustering problems. A GMM is a mixture model which assumes that each observed sample is drawn from one of several multivariate Gaussian distributions whose component memberships are unknown. Specifically, the GMM assumes that the observed samples in the dataset belong to $K$ different categories. Due to the lack of label information, a prior probability $p_k$ is employed to denote the likelihood that a sample belongs to category $k$, and the distribution corresponding to this category is modeled as a multivariate normal distribution $N(\mu_k, \Sigma_k)$. Once the parameter estimates of the model are obtained, the posterior distribution of an observed sample can be calculated, ultimately determining its category. In theory, the GMM is capable of fitting datasets with arbitrary distributions, and it shows better performance when handling datasets that can be modeled as multivariate Gaussian clusters [30]. In this paper, the model’s prediction scores for the samples are continuous variables; therefore, the GMM is selected to model and analyze them.
Within SSL tasks, there is a closely related task known as noisy label learning, also referred to as robust training [31]. In this task, the GMM is an important and effective method for estimating noise rates [32,33]. The estimated noise rate is widely used to reweight samples in robust classifiers [34,35] or to determine the quantity of samples considered as clean training samples [36,37,38].
A novel SSL framework, DivideMix [39], has been proposed to deal with the challenge of training with noisy labels. Firstly, DivideMix proposes the concept of co-division, a process of training two networks simultaneously. The GMM is employed to fit the loss distribution of the training data, which aims to partition the training data into labeled and unlabeled sets. The labeled set can be considered as the samples with correct labels, while the unlabeled set can be regarded as the samples with noisy labels. The separated datasets are then used to train an SSL model. By iteratively repeating the previous process, the model gradually overcomes the influence of noisy labels. Inspired by DivideMix, DYMatch integrates this dynamic estimation method into semi-supervised classification learning. Specifically, in DYMatch, the dynamic estimation method is employed to yield the local threshold for a specific category, which effectively improves the utilization of unlabeled data.

3. DYMatch

The processing pipeline of unlabeled data in DYMatch is shown in Figure 1. In this section, we provide a detailed description of how the DYMatch model deals with the semi-supervised classification task. Additionally, we analyze DYMatch’s two core components, namely, the dynamic pseudo-label estimation method based on the Gaussian mixture model and the feature-correlation consistency regularization method.

3.1. Motivation of Dynamic Pseudo-Label Estimation Method

The core idea of the pseudo-labeling method is to integrate more unlabeled data into the training set by employing an appropriate confidence threshold, thereby enriching the distribution of the training set and effectively improving the classification and generalization performance of the model. In this section, inspired by Wang et al. [12], we analyze the model’s prediction scores for a specific category through the lens of a binary classification problem. The analysis of this binary classification problem provides the motivation for the design of the dynamic pseudo-label estimation method.
Assume a binary classification problem where the true distribution is a uniform mixture of two Gaussian distributions, that is, the sample set $X_s$ contains positive samples ($+1$) and negative samples ($-1$). The input data $X$ follows the conditional distributions:
$$X \mid Y = -1 \sim N(\mu_1, \sigma_1^2), \qquad X \mid Y = +1 \sim N(\mu_2, \sigma_2^2) \tag{3}$$
Furthermore, we assume that $\mu_2 > \mu_1$. Given that binary classification models often employ a $\mathrm{Sigmoid}(\cdot)$ function as the output activation function, the prediction score of the model can be defined as:
$$s(x) = \frac{1}{1 + \exp\left(-\beta\left(x - \frac{\mu_1 + \mu_2}{2}\right)\right)} \tag{4}$$
where $\beta$ is used to denote the current training status of the model. Typically, during the training process, the model should gradually become more confident in its classification performance, so $\beta$ is a gradually increasing positive parameter. $(\mu_1 + \mu_2)/2$ is the Bayes optimal linear decision boundary. In the early training stages, $\beta$ is relatively small, so $s(x)$ stays close to 0.5 for inputs near the decision boundary, indicating that the model fails to generate a confident prediction for the input data $x$. As the model becomes more confident, $\beta$ is expected to grow gradually during training, and the prediction scores move away from 0.5. When $x - (\mu_1 + \mu_2)/2$ is positive and large enough, $s(x)$ tends towards 1; in this case, the input data $x$ is viewed as a positive sample. When $x - (\mu_1 + \mu_2)/2$ is negative and its absolute value is large enough, $s(x)$ approaches 0, and the model assigns the input data $x$ to the negative class.
Based on the above analysis, we can employ a fixed threshold $\eta \in (0.5, 1)$ to partition the prediction scores. The input data $x$ will be assigned the pseudo label $+1$ when $x$ satisfies the following condition:
$$s(x) = \frac{1}{1 + \exp\left(-\beta\left(x - \frac{\mu_1 + \mu_2}{2}\right)\right)} > \eta \tag{5}$$
Equation (5) can be rewritten as:
$$x > \frac{\mu_1 + \mu_2}{2} + \frac{1}{\beta}\log\left(\frac{\eta}{1 - \eta}\right) \tag{6}$$
Likewise, $x$ will be assigned the pseudo label $-1$ if:
$$s(x) = \frac{1}{1 + \exp\left(-\beta\left(x - \frac{\mu_1 + \mu_2}{2}\right)\right)} < 1 - \eta \tag{7}$$
which is equivalent to:
$$x < \frac{\mu_1 + \mu_2}{2} - \frac{1}{\beta}\log\left(\frac{\eta}{1 - \eta}\right) \tag{8}$$
Finally, when $1 - \eta \leq s(x) \leq \eta$, the model cannot generate a high-confidence prediction for the input $x$. Generally, the utilization of unlabeled data is directly affected by the confidence threshold $\eta$: as $\eta$ increases, the utilization of unlabeled data decreases. Moreover, in the early training stages, when $\beta$ is small and the model cannot generate confident predictions, using a higher threshold may result in a lower utilization rate of the unlabeled data and a slower convergence.
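To make the interplay of $\beta$ and $\eta$ tangible, the short numeric sketch below samples the toy mixture of Equation (3) and evaluates the acceptance regions of Equations (6) and (8); all parameter values are illustrative only.

```python
# Utilization of pseudo labels in the toy model of Equations (6) and (8):
# it shrinks when the threshold eta grows or the confidence parameter beta is small.
import numpy as np

mu1, mu2, sigma = -1.0, 1.0, 1.0
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(mu1, sigma, 5000), rng.normal(mu2, sigma, 5000)])

for beta, eta in [(0.5, 0.95), (2.0, 0.95), (2.0, 0.99)]:
    margin = np.log(eta / (1 - eta)) / beta
    pos = x > (mu1 + mu2) / 2 + margin        # Eq. (6): pseudo label +1
    neg = x < (mu1 + mu2) / 2 - margin        # Eq. (8): pseudo label -1
    print(f"beta={beta}, eta={eta}: utilization={np.mean(pos | neg):.2f}")
```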
If we integrate over $x$, the following conditional probabilities can be obtained:
$$P(Y_p = 1 \mid Y = 1) = \Phi\left(\frac{\frac{\mu_2 - \mu_1}{2} - \frac{1}{\beta}\log\frac{\eta}{1-\eta}}{\sigma_2}\right) \tag{9}$$
$$P(Y_p = 1 \mid Y = -1) = \Phi\left(\frac{\frac{\mu_1 - \mu_2}{2} - \frac{1}{\beta}\log\frac{\eta}{1-\eta}}{\sigma_1}\right) \tag{10}$$
$$P(Y_p = -1 \mid Y = -1) = \Phi\left(\frac{\frac{\mu_2 - \mu_1}{2} - \frac{1}{\beta}\log\frac{\eta}{1-\eta}}{\sigma_1}\right) \tag{11}$$
$$P(Y_p = -1 \mid Y = 1) = \Phi\left(\frac{\frac{\mu_1 - \mu_2}{2} - \frac{1}{\beta}\log\frac{\eta}{1-\eta}}{\sigma_2}\right) \tag{12}$$
where Φ is the cumulative distribution function of a standard normal distribution.
If $P(Y = 1) = P(Y = -1) = 0.5$, we can obtain the following formulas:
$$P(Y_p = 1) = \frac{1}{2}\Phi\left(\frac{\frac{\mu_2 - \mu_1}{2} - \frac{1}{\beta}\log\frac{\eta}{1-\eta}}{\sigma_2}\right) + \frac{1}{2}\Phi\left(\frac{\frac{\mu_1 - \mu_2}{2} - \frac{1}{\beta}\log\frac{\eta}{1-\eta}}{\sigma_1}\right) \tag{13}$$
$$P(Y_p = -1) = \frac{1}{2}\Phi\left(\frac{\frac{\mu_2 - \mu_1}{2} - \frac{1}{\beta}\log\frac{\eta}{1-\eta}}{\sigma_1}\right) + \frac{1}{2}\Phi\left(\frac{\frac{\mu_1 - \mu_2}{2} - \frac{1}{\beta}\log\frac{\eta}{1-\eta}}{\sigma_2}\right) \tag{14}$$
When $\sigma_1 \neq \sigma_2$, we obtain $P(Y_p = 1) \neq P(Y_p = -1)$. Specifically, when the standard deviations $\sigma_1$ and $\sigma_2$ of the two normal distributions are unequal, the samples of these two categories have different variability. In a binary classification task, this gap may affect the model’s decision boundary. The model may pay more attention to the category with the smaller standard deviation, as its samples are relatively more concentrated. As a result, $P(Y_p = 1)$ and $P(Y_p = -1)$ are not equal. The model adjusts the decision boundary according to the distribution properties of the different classes, making more confident predictions for the category with the smaller standard deviation while being more cautious for the category with the larger standard deviation. This leads to unequal prediction performance of the model on these two classes. In fact, when a larger confidence threshold $\eta$ is used, the imbalance in the pseudo-label estimation becomes more obvious. An imbalanced pseudo-label estimation method may distort decision boundaries and lead to cognitive bias in pseudo-labeling. A straightforward resolution for this situation is to use different local thresholds for different categories to estimate pseudo labels.
According to Equations (13) and (14), we define $a_1 = \frac{1}{2\sigma_1}$, $a_2 = \frac{1}{2\sigma_2}$, $b_1 = \frac{1}{\beta\sigma_1}\log\left(\frac{\eta}{1-\eta}\right)$, $b_2 = \frac{1}{\beta\sigma_2}\log\left(\frac{\eta}{1-\eta}\right)$, and $z = \mu_2 - \mu_1$, so that $P(Y_p = 1) + P(Y_p = -1)$ can be written as:
$$P(Y_p = 1) + P(Y_p = -1) = \frac{1}{2}\Phi(a_1 z - b_1) + \frac{1}{2}\Phi(-a_1 z - b_1) + \frac{1}{2}\Phi(a_2 z - b_2) + \frac{1}{2}\Phi(-a_2 z - b_2) \tag{15}$$
From Equation (15), we consider the function:
$$f(z) = \frac{1}{2}\Phi(a_1 z - b_1) + \frac{1}{2}\Phi(-a_1 z - b_1) \tag{16}$$
By taking the derivative with respect to $z$ in Equation (16), we obtain:
$$f'(z) = \frac{1}{2} a_1 \left[\phi(a_1 z - b_1) - \phi(-a_1 z - b_1)\right] \tag{17}$$
where $\phi$ is the probability density function of the standard normal distribution. According to its symmetry, Equation (17) can be rewritten as:
$$f'(z) = \frac{1}{2} a_1 \left[\phi(a_1 z - b_1) - \phi(a_1 z + b_1)\right] \tag{18}$$
Since $b_1 > 0$ for $\eta \in (0.5, 1)$ and $z > 0$, we have $|a_1 z - b_1| < |a_1 z + b_1|$, and therefore $\phi(a_1 z - b_1) > \phi(a_1 z + b_1)$. From this derivation, we can conclude that $f'(z) > 0$, that is, $f(z)$ is monotonically increasing on the interval $(0, \infty)$. Extending this conclusion to Equation (15), $P(Y_p = 1) + P(Y_p = -1)$ is also monotonically increasing in $z$. Therefore, the utilization rate of pseudo labels, $P(Y_p = 1) + P(Y_p = -1)$, decreases as $\mu_2 - \mu_1$ becomes smaller. Specifically, when the distributions of two categories are more similar, the model may face challenges in accurately distinguishing the samples of these two categories. As the differentiation between the two categories diminishes, more samples may be confused in the feature space, and the model cannot generate confident predictions about them. Therefore, a suitable pseudo-label threshold is needed to balance the utilization rate between these categories; otherwise, there may not be enough samples for training the model to distinguish categories that are already challenging to differentiate.
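As a quick numeric check of this monotonicity claim, the sum of Equations (13) and (14) can be evaluated directly; the parameter values below are illustrative only.

```python
# Pseudo-label utilization P(Yp=1) + P(Yp=-1) grows with the class
# separation z = mu2 - mu1, as derived from Equations (15)-(18).
import numpy as np
from scipy.stats import norm

beta, eta, sigma1, sigma2 = 2.0, 0.95, 1.0, 1.5
c = np.log(eta / (1 - eta)) / beta                    # threshold margin
for z in [0.5, 1.0, 2.0, 4.0]:                        # class separation
    util = 0.5 * (norm.cdf((z / 2 - c) / sigma2) + norm.cdf((-z / 2 - c) / sigma1)
                  + norm.cdf((z / 2 - c) / sigma1) + norm.cdf((-z / 2 - c) / sigma2))
    print(f"z={z}: utilization={util:.3f}")
```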
In summary, an effective pseudo-labeling method should take into account the change in the model’s classification performance across categories during the training process. Specifically, the confidence threshold $\eta$ should gradually increase along with the model’s performance parameter $\beta$ during training. This ensures enough unlabeled data for the model to train on in the early training iterations, while, in the later iterations, the threshold $\eta$ can filter out incorrect predictions to mitigate the cognitive bias of the model. Moreover, given the model’s differing classification performance across categories, with some being easier to classify than others, the pseudo-labeling method needs to adjust the threshold $\eta$ for each class to set thresholds equitably across classes. The main contribution of this paper lies in dynamically adjusting class-based local thresholds and the dataset-based global threshold based on the model’s training status. Additionally, the combination of feature-correlation consistency regularization and dynamic pseudo-label estimation further enhances the performance of the SSL model. In the following, we provide a detailed description of these two methods used in DYMatch.

3.2. Dynamic Pseudo-Label Estimation Method Based on Gaussian Mixture Model

During the training process, it is crucial to dynamically assign corresponding confidence thresholds to each class, and the model’s prediction scores for the training data can accurately reflect the current training status for the corresponding category. Therefore, in DYMatch, we propose a dynamic pseudo-label estimation method based on GMM. Firstly, utilizing the model prediction scores during training allows us to obtain class-based local thresholds, dynamically partitioning unlabeled data into positive and negative samples. The final dynamic local threshold is then updated using the Exponential Moving Average (EMA) method based on the local thresholds at each training iteration. Additionally, the global dynamic threshold is estimated by using the EMA of the prediction scores from the model, and the dynamic local threshold is also used to adjust the global dynamic threshold.
Therefore, at the beginning of training, the confidence threshold may be relatively small, enabling the utilization of more unlabeled data with potentially correct predictions for training. As training progresses, the confidence thresholds are dynamically adjusted and generally tend to increase. This adjustment filters out more unlabeled data with potentially incorrect model predictions, reducing the training bias of the model.

3.2.1. Dynamic Local Confidence Threshold Estimation

Dynamic local confidence threshold estimation aims to estimate class-specific local thresholds that account for the inter-class diversity. Furthermore, the dynamic local confidence threshold is steadily increased during training to ensure the discarding of unlabeled data with incorrect predictions.
Therefore, the Gaussian mixture model is employed to distinguish the positive from the negative samples in the model’s prediction scores for a specific category. Here, positive samples correspond to unlabeled data that the model predicts correctly, while negative samples correspond to unlabeled data with incorrect predictions. The prediction score that separates positive and negative samples can be regarded as the local threshold for the corresponding category. Specifically, we assume that the prediction scores $s^c$ for category $c$ are sampled from a Gaussian mixture distribution $P(s^c)$ with two components, positive and negative. Local thresholds are dynamically generated by fitting a Gaussian mixture model to the prediction scores:
$$P(s^c) = w_n^c\, N\!\left(s^c \mid \mu_n^c, (\sigma_n^c)^2\right) + w_p^c\, N\!\left(s^c \mid \mu_p^c, (\sigma_p^c)^2\right) \tag{19}$$
where $N(\mu, \sigma^2)$ denotes a Gaussian distribution. The parameters $(w_n^c, \mu_n^c, (\sigma_n^c)^2)$ and $(w_p^c, \mu_p^c, (\sigma_p^c)^2)$ represent the weight, mean, and variance of the negative and positive sample distributions, respectively. The EM algorithm is then employed to infer the posterior probability $P(pos \mid s^c, \mu_p^c, (\sigma_p^c)^2)$. This posterior probability can be directly used to generate the pseudo label, and the dynamic local threshold $\tilde{\eta}^l(c)$ corresponding to category $c$ can be defined as:
$$\tilde{\eta}^l(c) = \underset{s^c}{\arg\max}\; P\!\left(pos \mid s^c, \mu_p^c, (\sigma_p^c)^2\right) \tag{20}$$
By using the EMA method, we obtain the final dynamic local threshold, which reflects the learning status of the model for a specific category:
$$\eta_t^l(c) = \begin{cases} 1/C, & \text{if } t = 0 \\ \lambda\, \eta_{t-1}^l(c) + (1 - \lambda)\, \tilde{\eta}_t^l(c), & \text{otherwise} \end{cases} \tag{21}$$
where $\lambda \in (0, 1)$ denotes the momentum decay in EMA, and $t$ represents the $t$-th iteration of the training process. Meanwhile, $\tilde{\eta}_t^l(c) = \mathrm{DLTE}(\mathrm{DA}(q_i(c)))$, where $\mathrm{DLTE}(\cdot)$ represents the dynamic local threshold estimation method based on the GMM, and the $q_i(c)$ denote the prediction scores of unlabeled data predicted as class $c$. $\mathrm{DA}(\cdot)$ represents the distribution alignment strategy from ReMixMatch [8], which balances the distributions of prediction scores. The local thresholds are initialized to $1/C$, where $C$ represents the number of categories. Finally, $\eta_t^l = [\eta_t^l(1), \eta_t^l(2), \ldots, \eta_t^l(C)]$ contains the confidence thresholds for all categories.
In the implementation of the dynamic local threshold estimation method, we maintain a queue $m(c) \in \mathbb{R}^Q$ of prediction scores with length $Q$ for each category. The queues of all categories form a prediction memory bank $M = [m(1), m(2), \ldots, m(C)]$, $M \in \mathbb{R}^{C \times Q}$, where $C$ represents the number of categories. Specifically, the queue $m(c)$ stores the prediction scores $s^c$ of the unlabeled data predicted by the model as category $c$, and these prediction scores are used to fit a Gaussian mixture model. Subsequently, the dynamic threshold estimation method can adjust the threshold $\eta^l(c)$ corresponding to category $c$ based on the fitted Gaussian mixture model. As a result, $\eta^l(c)$ can align with the model’s classification ability at different training stages. Moreover, the EM algorithm has a negligible effect on training time and does not impose an additional burden on the training.
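The sketch below illustrates this procedure with sklearn’s GaussianMixture; the class and function names are illustrative, and taking the lowest score that the fitted GMM assigns to the higher-mean (positive) component is one plausible reading of Equation (20), not necessarily the authors’ exact implementation.

```python
# A minimal sketch of dynamic local threshold estimation (Section 3.2.1):
# per-class score queues (the memory bank M) plus the EMA update of Eq. (21).
from collections import deque
import numpy as np
from sklearn.mixture import GaussianMixture

def dlte(scores: np.ndarray) -> float:
    """Fit a two-component GMM to one class's scores and return the score
    separating the positive (high-mean) from the negative component."""
    gmm = GaussianMixture(n_components=2, random_state=0).fit(scores.reshape(-1, 1))
    pos = int(np.argmax(gmm.means_))                         # positive component index
    post = gmm.predict_proba(scores.reshape(-1, 1))[:, pos]  # P(pos | s^c)
    pos_scores = scores[post > 0.5]
    return float(pos_scores.min()) if pos_scores.size else float(scores.mean())

class LocalThresholds:
    def __init__(self, num_classes: int, queue_len: int = 100, momentum: float = 0.999):
        self.queues = [deque(maxlen=queue_len) for _ in range(num_classes)]
        self.momentum = momentum
        self.eta = np.full(num_classes, 1.0 / num_classes)   # initialized to 1/C

    def update(self, probs: np.ndarray) -> np.ndarray:
        """probs: (N, C) distribution-aligned scores of one unlabeled batch."""
        preds, conf = probs.argmax(axis=1), probs.max(axis=1)
        for c in range(len(self.queues)):
            self.queues[c].extend(conf[preds == c].tolist()) # fill queue m(c)
            if len(self.queues[c]) >= 10:                    # enough points to fit a GMM
                eta_c = dlte(np.asarray(self.queues[c]))
                self.eta[c] = self.momentum * self.eta[c] + (1 - self.momentum) * eta_c
        return self.eta
```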

3.2.2. Dynamic Global Confidence Threshold Estimation

The global threshold estimation shares similar characteristics with the local one; that is, both reflect the training status of the model and maintain a steady increase during training. However, unlike the local threshold estimation, the global threshold estimation should capture the model’s training status over the entire dataset. Therefore, we define the global threshold as the model’s average prediction score for the unlabeled data. Specifically, the global threshold is estimated as the EMA of the prediction scores in each training iteration. We initialize the global threshold to $1/C$, where $C$ represents the number of categories. The global threshold $\eta_t^g$ is defined as:
$$\eta_t^g = \begin{cases} 1/C, & \text{if } t = 0 \\ \lambda\, \eta_{t-1}^g + (1 - \lambda)\, \frac{1}{N_U} \sum_{i=1}^{N_U} \max\left(\mathrm{DA}(q_i)\right), & \text{otherwise} \end{cases} \tag{22}$$
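A minimal sketch of this update, assuming `probs` holds the distribution-aligned scores $\mathrm{DA}(q_i)$ of one unlabeled batch:

```python
# EMA update of the global threshold (Eq. 22); eta_g is initialized to 1/C.
import numpy as np

def update_global_threshold(eta_g: float, probs: np.ndarray, momentum: float = 0.999) -> float:
    batch_conf = float(probs.max(axis=1).mean())   # mean of max DA(q_i) over the batch
    return momentum * eta_g + (1 - momentum) * batch_conf
```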

3.2.3. Dynamic Pseudo-Label Estimation

Having obtained the class-specific dynamic local thresholds and the dataset-specific dynamic global threshold, we employ the dynamic local thresholds to adjust the dynamic global threshold and obtain the final dynamic confidence threshold:
$$\eta_t(c) = \eta_t^g \cdot \mathrm{Norm}\left(\eta_t^l(c)\right), \qquad \mathrm{Norm}\left(\eta_t^l(c)\right) = \frac{\eta_t^l(c)}{\max\{\eta_t^l(c) : c \in [C]\}} \tag{23}$$
where $\mathrm{Norm}(\cdot)$ denotes the maximum normalization function. Finally, the loss function for dynamic pseudo-label estimation on unlabeled data at the $t$-th iteration can be defined as:
$$L_{pl} = \frac{1}{N_U} \sum_{i=1}^{N_U} \mathbb{1}\left(\max(\mathrm{DA}(q_i)) > \eta_t\left(\arg\max(\mathrm{DA}(q_i))\right)\right) \cdot H\!\left(\mathrm{DA}(\hat{q}_i), \mathrm{DA}(Q_i)\right) \tag{24}$$
where $\arg\max(q_i)$ represents the model’s prediction for an unlabeled sample, and $\eta_t(\arg\max(q_i))$ is the confidence threshold corresponding to that category. $\mathbb{1}(\cdot)$ is a mask function based on the confidence threshold. In addition, we employ $\hat{q}_i$ and $Q_i$ to represent $P_{model}(y \mid a(u_i); \theta)$ and $P_{model}(y \mid A(u_i); \theta)$, where $a(\cdot)$ and $A(\cdot)$ represent the weak augmentation and the strong augmentation method, respectively.
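For illustration, the following PyTorch sketch applies the per-class dynamic threshold of Equation (23) as a mask on the pseudo-label loss; it uses hard pseudo labels from the weak view for simplicity, whereas Equation (24) is written over the aligned distributions, and it assumes distribution alignment has already been applied to `weak_probs`.

```python
# A sketch of the dynamic pseudo-label loss (Eq. 24) with per-class thresholds.
import torch
import torch.nn.functional as F

def dynamic_pl_loss(weak_probs, strong_logits, eta_local, eta_global):
    eta = eta_global * eta_local / eta_local.max()   # final thresholds, Eq. (23)
    conf, pseudo = weak_probs.max(dim=1)             # confidence and hard pseudo-label
    mask = conf > eta[pseudo]                        # per-class dynamic threshold
    if mask.sum() == 0:
        return strong_logits.new_zeros(())
    return F.cross_entropy(strong_logits[mask], pseudo[mask])
```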

3.3. Consistency Regularization Method Based on Feature-Correlation

In methods like FixMatch, MixMatch, and ReMixMatch, consistency regularization is typically performed on the prediction scores. The strong coupling between consistency regularization and the pseudo-labeling method, which also relies on the model’s prediction scores, may constrain the performance of the model. Therefore, in this paper, we apply the consistency regularization method to the feature maps fed to the classification module $g(\cdot)$, and this strategy focuses on unlabeled data with prediction scores exceeding the confidence threshold. Specifically, for a selected unlabeled sample, we employ the negative cosine similarity to measure the correlation between the two differently augmented versions of this unlabeled sample. The two data augmentations correspond to the strong and weak data augmentations.
However, it should be emphasized that there exist differences between the strongly augmented and the weakly augmented versions of the same unlabeled data; consequently, they cannot be considered identical for feature extraction. Therefore, a learnable linear projection module, denoted as $h(\cdot)$, is applied to the strongly augmented feature representation of the unlabeled data to weaken the constraint in the consistency regularization method. The module $h(\cdot)$ consists of two consecutive linear layers, with a ReLU layer following the first linear layer:
$$h(x_i) = \mathrm{Linear}\left(\mathrm{ReLU}\left(\mathrm{Linear}(x_i)\right)\right) \tag{25}$$
Finally, the loss function of the consistency regularization method based on feature-correlation can be defined as:
$$L_{cr} = \frac{1}{N_{mask}} \sum_{i=1}^{N_U} \left(1 - \mathrm{COS}\!\left(h(f(A(u_i))), f(a(u_i))\right)\right) \cdot Mask_i \tag{26}$$
where $\mathrm{COS}(\cdot)$ denotes the cosine similarity function, $f(a(u_i)) \in \mathbb{R}^{d \times 1}$ represents the feature representation of the weakly augmented unlabeled data before the classification module $g(\cdot)$, and $f(A(u_i)) \in \mathbb{R}^{d \times 1}$ represents the strongly augmented representation. In our consistency regularization method, $f(A(u_i))$ is fed to the learnable linear projection module $h(\cdot)$, where $h: \mathbb{R}^{d \times 1} \to \mathbb{R}^{d \times 1}$. $Mask$ represents a mask vector in which all unlabeled data with prediction scores exceeding the dynamic threshold are marked as 1, while the others are marked as 0. $N_{mask}$ is the number of unlabeled samples marked as 1 in $Mask$.
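A minimal PyTorch sketch of this loss is given below; treating the weak-view feature as a detached target is our assumption for stability, not something the paper specifies.

```python
# A sketch of the feature-correlation consistency loss (Eq. 26) with the
# two-layer projection head h(.) of Eq. (25).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

def feature_consistency_loss(projector, feat_strong, feat_weak, mask):
    # Only the strong view passes through h(.); the weak view serves as the
    # target (detached here by assumption).
    sim = F.cosine_similarity(projector(feat_strong), feat_weak.detach(), dim=1)
    n_mask = mask.sum().clamp(min=1)                 # N_mask in Eq. (26)
    return ((1.0 - sim) * mask).sum() / n_mask
```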

3.4. Loss Function of DYMatch

In DYMatch, the final loss function for training the network is:
$$L = L_{sup} + \lambda_{pl} L_{pl} + \lambda_{cr} L_{cr} \tag{27}$$
where $L_{sup}$ represents the standard cross-entropy loss used for training with labeled data, and $L_{pl}$ and $L_{cr}$ are the losses for dynamic pseudo-label estimation and feature-correlation consistency regularization on unlabeled data, respectively. $\lambda_{pl}$ and $\lambda_{cr}$ are fixed scalar hyper-parameters representing the relative weights of $L_{pl}$ and $L_{cr}$, respectively. $L_{sup}$ is defined as:
$$L_{sup} = \frac{1}{N_L} \sum_{i=1}^{N_L} H\!\left(l_i, P_{model}(y \mid a(x_i))\right) \tag{28}$$
In the loss function for unlabeled data, we use the loss term $L_{cr}$ to obtain more discriminative feature representations, minimizing the difference between the feature representations of the weakly and strongly augmented unlabeled data. This method ensures that the feature representations of the two differently augmented samples remain consistent during the training process, thereby improving the model’s classification performance. Meanwhile, $L_{pl}$ ensures a balance in the quantity and quality of pseudo labels.
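Under the stated assumptions, one training step could be wired together as in the sketch below; `distribution_alignment`, the `return_feat` flag, and the threshold variables are illustrative glue around the pieces sketched in the previous subsections, not the authors’ code.

```python
# A sketch of one DYMatch training step combining Eqs. (24), (26), and (27).
import torch
import torch.nn.functional as F

def train_step(model, projector, eta_local, eta_global,
               x_lb, y_lb, u_weak, u_strong, lam_pl=2.0, lam_cr=2.0):
    loss_sup = F.cross_entropy(model(x_lb), y_lb)          # L_sup, Eq. (28)

    feat_w, logits_w = model(u_weak, return_feat=True)     # weak view
    feat_s, logits_s = model(u_strong, return_feat=True)   # strong view
    probs_w = distribution_alignment(logits_w.softmax(dim=1)).detach()

    eta = eta_global * eta_local / eta_local.max()         # Eq. (23)
    conf, pseudo = probs_w.max(dim=1)
    mask = (conf > eta[pseudo]).float()

    # L_pl (Eq. 24) with hard pseudo-labels, and L_cr (Eq. 26) on features
    loss_pl = (F.cross_entropy(logits_s, pseudo, reduction="none") * mask).mean()
    sim = F.cosine_similarity(projector(feat_s), feat_w.detach(), dim=1)
    loss_cr = ((1.0 - sim) * mask).sum() / mask.sum().clamp(min=1)

    return loss_sup + lam_pl * loss_pl + lam_cr * loss_cr  # Eq. (27)
```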

4. Experiments

4.1. Datasets

The experiments were conducted on various object recognition datasets, such as CIFAR-10 [40,41], CIFAR-100 [40], SVHN [42], and STL10 [43], along with some domain adaptation datasets, in order to validate the performance of DYMatch.
The CIFAR-10 and CIFAR-100 datasets each contain 50,000 training samples and 10,000 testing samples, and consist of 10 and 100 categories, respectively. The SVHN dataset includes 10 categories, with 73,257 samples used for training and 26,032 samples used for testing. The STL-10 dataset is designed for evaluating SSL methods. It includes 5000 labeled samples and 100,000 unlabeled samples. Compared to other standard SSL datasets, STL-10 has fewer labeled data but compensates by offering a considerable amount of unlabeled data. Additionally, each category contains different proportions of labeled and unlabeled data. These settings make the STL-10 dataset more challenging for SSL training and closer to real-world tasks.
Three domain adaptation datasets were also employed for the performance evaluation: Office31 [44], Office-Home [41] and DomainNet [45]. The Office31 dataset comprises 4110 images distributed across three domains, namely Amazon, Webcam, and DSLR, which share 31 object categories commonly found in offices. The Office-Home dataset consists of 64 object categories sampled from four domains, denoted as Artistic, Clip Art, Product, and Real-World, and comprises around 15,500 images. The DomainNet dataset, containing 345 categories of common objects from six domains, is the largest domain adaptation dataset.
Following the standard settings used for SSL datasets, we randomly select a specific number of labeled samples to constitute the labeled part of the training set; the remaining data, with their labels discarded, form the unlabeled part. In DYMatch, both weak and strong augmentation methods are applied to the unlabeled data. The weak augmentation involves flip-and-shift transformations, while the strong augmentation combines the RandAugment [25] method with the Cutout [46] method.

4.2. Experimental Settings

The baseline models employed for the comprehensive performance comparison encompassed Mean Teacher [27], Pseudo-Label [16], MixMatch [7], ReMixMatch [8], UDA [18], FixMatch [9], FlexMatch [10], DoubleMatch [23], FeatMatch [22], Meta Pseudo Labels (MPL) [19], SimMatch [47], Semi-Clustering [48], SoftMatch [11] and FreeMatch [12]. Mean Teacher and Pseudo-Label, respectively, employ the consistency regularization method and the pseudo-labeling method, both of which have become widely used in SSL. MixMatch and UDA both explore the impact of different data augmentation methods on SSL. Building on MixMatch, ReMixMatch suggests that balancing the distribution of prediction scores can significantly enhance the model’s classification performance. FixMatch stands out as an attempt to combine consistency regularization and pseudo-labeling, resulting in good classification performance. DoubleMatch and FeatMatch refine the consistency regularization method, taking inspiration from FixMatch, while FlexMatch enhances the pseudo-labeling method upon the foundation laid by FixMatch. SoftMatch and FreeMatch take into account the impact of the model’s training status on the confidence threshold. Meta Pseudo Labels introduces an innovative training paradigm. SimMatch and Semi-Clustering strive to integrate methods from other fields, including self-supervised learning and deep clustering strategies, into FixMatch. The hyper-parameters of all baseline methods are consistent with those in their corresponding papers.
To ensure a fair comparison, we followed the guidelines recommended in [49] and shared the same training pipeline for each dataset, including the optimizer, the learning rate decay schedule, and the backbone module. In our experiments on all benchmark datasets, the model was trained with the SGD optimizer. The momentum and weight decay in SGD were set to 0.9 and 1 × 10−3, respectively. The Nesterov method was not utilized in SGD. The learning rate was initialized at 0.02 and gradually decreased with a cosine annealing scheduler. The Wide ResNet network [50] was employed as the backbone module. In detail, we employed Wide ResNet-28-2 for CIFAR-10, SVHN and the domain adaptation datasets. For the CIFAR-100 and STL-10 datasets, the Wide ResNet-28-8 and Wide ResNet-37-2 networks were employed as the backbone, respectively. Moreover, we report the average and the standard deviation of the results for each model, obtained by training three times for each number of labeled data on each dataset.
DYMatch involves five hyper-parameters: the length of the queue in the dynamic local threshold estimation ($Q$), the momentum decay in the EMA method ($\lambda$), the output feature dimension of the linear projection module $h(\cdot)$ ($d$), and the relative weight hyper-parameters ($\lambda_{pl}$ and $\lambda_{cr}$). During training, $\lambda$ and $Q$ were set to 0.999 and 100, respectively. In most cases, $\lambda_{pl}$ and $\lambda_{cr}$ were set to 2.0. However, when only a minimal amount of labeled data is available (i.e., 40 labeled samples for the CIFAR-10, SVHN and STL-10 datasets, and 400 labeled samples for CIFAR-100), $\lambda_{pl}$ and $\lambda_{cr}$ were adjusted to 3.0. This means that in these situations with limited labeled data, more attention should be paid to the constraints on training with the unlabeled data. Moreover, since different backbones were employed for the different datasets, $d$ varies as well: for the CIFAR-10 and SVHN datasets, $d$ in $h(\cdot)$ was set to 128; for CIFAR-100, $d$ was set to 512; and for STL-10, $d$ was set to 256. The labeled batch size $N_L$ was set to 64. We set $\sigma$ to 7, meaning that the unlabeled batch size $N_U$ was 7 times $N_L$ for all benchmark datasets.

4.3. Results

4.3.1. Experimental Results for Standard SSL Datasets

In this subsection, we present the classification performance of DYMatch and the other baseline methods on the standard SSL datasets in Table 1, Table 2, Table 3 and Table 4. Results for Mean Teacher and Pseudo-Label with 40 labels per class are not included due to their poor performance on CIFAR-100. MPL, SoftMatch and FreeMatch obtained the best classification performance under a few specific settings of the number of labeled data. However, DYMatch demonstrates good classification performance, outperforming the other baseline methods under most labeled data settings on each standard SSL dataset.
As shown in Table 1, Table 2 and Table 3, which report the results on the CIFAR-10, CIFAR-100, and SVHN datasets, DYMatch consistently obtains good classification performance under most settings of the number of labeled data. In the experiments with 40 labeled data per class on CIFAR-10, the error rate of DYMatch was slightly higher than that of MPL by 0.07%, but it was lower than those of the other baseline methods. DYMatch also performed worse than MPL when employing 2500 labeled samples on CIFAR-100. Furthermore, as the quantity of labeled data decreases, DYMatch gradually demonstrates its robust classification performance. When utilizing a very small number of labeled samples, for example, only 4 labels per class, DYMatch exhibited the lowest error rates, which were 1.73% and 0.46% lower than those of MPL on the CIFAR-10 and CIFAR-100 datasets, respectively. The classification accuracy of DYMatch was marginally lower, by 0.03%, than that of SoftMatch when 250 labeled samples were used on SVHN. However, under the other settings of the number of labeled data, DYMatch achieves better performance than SoftMatch. In general, most models obtained better classification performance on the SVHN dataset, resulting in a relatively small performance gap among the outstanding models.
As shown in Table 4, DYMatch also achieved very competitive classification performance on the STL-10 dataset. DYMatch obtains the lowest error rate when using 1000 labeled data, and when only 40 labeled data are available, DYMatch’s error rate is only 2.02% higher than that of FreeMatch.

4.3.2. Experimental Results for Domain Adaptation Datasets

Domain adaptation datasets pose a challenge for training due to the presence of multiple domains within each category and the significant differences among these domains. This intrinsic gap complicates the process of obtaining optimal performance even with fully-supervised training. Typically, the domain adaptation datasets are more complex and challenging in the field of SSL.
Initially, we created the training and testing sets from each of these datasets following the strategy employed in [51]. The proposed DYMatch was evaluated along with FlexMatch, ReMixMatch, FixMatch, DoubleMatch, SimMatch and SoftMatch on these datasets, and their classification performance was compared. For a fair comparison, the training pipeline applied to these domain adaptation datasets was the same as that employed on the SSL benchmark datasets. We set the weight of the loss term for the unlabeled data following the settings used in the corresponding original papers. According to Table 5, an accuracy of 56.63% can be obtained through fully-supervised training on the most complex domain adaptation dataset, DomainNet. All semi-supervised learning methods included in the comparison struggle to achieve comparable classification performance, showing distinct performance gaps. With only 500 labeled data for each category on the DomainNet dataset, DYMatch exhibits an accuracy slightly lower than that of SoftMatch, by 0.8%. However, under all other settings, DYMatch consistently demonstrates the best classification performance.

4.4. Ablation Study

Since DYMatch combines two effective methods, i.e., the dynamic pseudo-label estimation method based on the Gaussian mixture model and the consistency regularization method based on feature correlation, the ablation studies were conducted to provide a deeper understanding of the factors contributing to DYMatch’s good performance. In this subsection, the results are presented only for the experiments involving 40 and 4000 labeled data on CIFAR-10.
As shown in Table 6, the error rates of DYMatch_PL (i.e., DYMatch with only dynamic pseudo-label estimation method based on the Gaussian mixture model) were lower than those of DYMatch_CR (i.e., DYMatch with only consistency regularization method based on feature correlation), but the combination of these two methods led to the better performance of DYMatch than DYMatch_PL and DYMatch_CR.
The comparison between DYMatch_CR and DYMatch_PL highlights their differences in utilizing the unlabeled data. In DYMatch_CR, the feature maps after the backbone $f(\cdot)$ were used for feature-correlation consistency regularization. However, owing to the lack of selection on the unlabeled data, a large amount of data that cannot be accurately identified by the model may greatly hinder the convergence of the model. On the contrary, DYMatch_PL employed the dynamic pseudo-label estimation method, which effectively filters the unlabeled data during the training process. Consequently, compared to DYMatch_CR, DYMatch_PL proved to be more effective in utilizing the unlabeled data. Moreover, the superior performance of DYMatch indicates that the combination of these two methods results in a more efficient utilization of unlabeled data.
We also compared the performance of DYMatch when applying the feature-correlation consistency regularization method at different layers, i.e., after the backbone $f(\cdot)$ or after the classification module $g(\cdot)$. DYMatch_F denotes DYMatch using the feature representation after the backbone $f(\cdot)$, and DYMatch_G denotes DYMatch employing the feature representation after the classification module $g(\cdot)$. Table 7 shows that the error rate of DYMatch_G is significantly higher than that of DYMatch_F. This result demonstrates that performing consistency regularization after the classification module $g(\cdot)$ limits the performance of the model due to the coupling with the dynamic pseudo-label estimation method.
According to the results shown in Table 7, the dynamic pseudo-label estimation method based on the Gaussian mixture model in DYMatch effectively utilizes unlabeled data. Simultaneously, the feature-correlation consistency regularization method focuses on the better pseudo labels, which are selected by the dynamic pseudo-label estimation method. Therefore, the consistency regularization method based on feature correlation can be used as an enhancement to the dynamic pseudo-label estimation method, and the effective combination of these two methods enables the model to obtain excellent classification performance.
Moreover, we also compared DYMatch with other SSL methods on different metrics within the pseudo-labeling method, such as the variation of the confidence threshold and the quantity and accuracy of the selected pseudo labels. As shown in Figure 2a,b, the confidence threshold of DYMatch gradually increases during the training process. This allows for the utilization of more unlabeled data in the early stages of training, while becoming more cautious in the later stages. This behavior aligns with the analysis in Section 3.1. Correspondingly, as shown in Figure 2c, DYMatch obtains better pseudo-label accuracy and classification performance compared to FixMatch, AdaMatch and FlexMatch.

4.5. Results for Different Optimizers and Learning Rate Decay Methods

The classification performance of the model was influenced by the choice of optimizers and their hyper-parameters. Table 8 reveals that SGD with a momentum of 0.9 produced the best results, while the Adam optimizer yielded the worst results. Additionally, the Nesterov method did not lead to a noticeable enhancement in performance. Furthermore, despite experimenting with different initial learning rates, we were not able to obtain a significant improvement in classification performance from this setting.
In DYMatch, we employ a learning rate decay method based on the cosine annealing strategy. To verify the performance of the model, we compared this method with two alternatives, namely the Exp-warmup method and training without learning rate decay. The results in Table 9 show that the cosine annealing scheduler achieved the best classification performance.

5. Conclusions and Future Work

Although semi-supervised learning methods have made rapid progress in recent years, most of the more advanced methods are designed on the basis of increasingly complicated learning algorithms. These methods often introduce complex data perturbation methods or integrate other complex algorithms into the framework of SSL methods. In this paper, we proposed DYMatch, a new SSL algorithm that obtains better performance on a variety of standard SSL benchmark datasets. The main contribution of this paper lies in the introduction of a dynamic pseudo-label estimation method based on Gaussian mixture models. This strategy effectively improves the utilization of unlabeled data by dynamically estimating class-based confidence thresholds, thereby enhancing the model’s generalization performance. Moreover, the combination of this strategy with a feature-correlation consistency regularization method also significantly enhances the efficiency of utilizing unlabeled data. The experimental results illustrate the effectiveness of DYMatch on SSL classification tasks. We hope that this direction can serve as inspiration for further research, such as designing better feature regularization methods that can more effectively utilize multi-domain datasets. If you are interested in our work and want to apply our methods to other fields, please feel free to contact the authors.

Author Contributions

Conceptualization, Z.M.; Methodology, Z.M.; Software, Z.M.; Validation, Z.M.; Formal analysis, Z.M.; Writing—original draft, Z.M.; Writing—review & editing, Z.M., F.P. and J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: CIFAR10, CIFAR-100: https://www.cs.toronto.edu/%7Ekriz/cifar.html, SVHN: http://ufldl.stanford.edu/housenumbers/, STL10: https://ai.stanford.edu/%7Eacoates/stl10/, Office31: https://faculty.cc.gatech.edu/~judy/domainadapt/, Office-Home: https://www.hemanthdv.org/officeHomeDataset.html, DomainNet: https://ai.bu.edu/DomainNet/; (accessed on 1 December 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
2. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
3. Wang, Z.; Mekala, D.; Shang, J. X-class: Text classification with extremely weak supervision. arXiv 2020, arXiv:2010.12794.
4. Zheng, M.; You, S.; Wang, F.; Qian, C.; Zhang, C.; Wang, X.; Xu, C. Ressl: Relational self-supervised learning with weak augmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 2543–2555.
5. Mylonas, N.; Karlos, S.; Tsoumakas, G. Zero-shot classification of biomedical articles with emerging mesh descriptors. In Proceedings of the 11th Hellenic Conference on Artificial Intelligence, Athens, Greece, 2–4 September 2020; pp. 175–184.
6. Chapelle, O.; Scholkopf, B.; Zien, A. Semi-supervised learning. IEEE Trans. Neural Netw. 2009, 20, 542.
7. Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; Raffel, C.A. Mixmatch: A holistic approach to semi-supervised learning. Adv. Neural Inf. Process. Syst. 2019, 32, 5049–5059.
8. Berthelot, D.; Carlini, N.; Cubuk, E.D.; Kurakin, A.; Sohn, K.; Zhang, H.; Raffel, C. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv 2019, arXiv:1911.09785.
9. Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.A.; Cubuk, E.D.; Kurakin, A.; Li, C. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Adv. Neural Inf. Process. Syst. 2020, 33, 596–608.
10. Zhang, B.; Wang, Y.; Hou, W.; Wu, H.; Wang, J.; Okumura, M.; Shinozaki, T. Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. Adv. Neural Inf. Process. Syst. 2021, 34, 18408–18419.
11. Chen, H.; Tao, R.; Fan, Y.; Wang, Y.; Wang, J.; Schiele, B.; Xie, X.; Raj, B.; Savvides, M. Softmatch: Addressing the quantity-quality trade-off in semi-supervised learning. arXiv 2023, arXiv:2301.10921.
12. Wang, Y.; Chen, H.; Heng, Q.; Hou, W.; Fan, Y.; Wu, Z.; Wang, J.; Savvides, M.; Shinozaki, T.; Raj, B. Freematch: Self-adaptive thresholding for semi-supervised learning. arXiv 2022, arXiv:2205.07246.
13. Sajjadi, M.; Javanmardi, M.; Tasdizen, T. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Adv. Neural Inf. Process. Syst. 2016, 29, 1163–1171.
14. Laine, S.; Aila, T. Temporal ensembling for semi-supervised learning. arXiv 2016, arXiv:1610.02242.
15. Bachman, P.; Alsharif, O.; Precup, D. Learning with pseudo-ensembles. Adv. Neural Inf. Process. Syst. 2014, 27, 3365–3373.
16. Lee, D.H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop Chall. Represent. Learn. ICML 2013, 3, 896.
17. Xie, Q.; Luong, M.T.; Hovy, E.; Le, Q.V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10687–10698.
18. Xie, Q.; Dai, Z.; Hovy, E.; Luong, T.; Le, Q. Unsupervised data augmentation for consistency training. Adv. Neural Inf. Process. Syst. 2020, 33, 6256–6268.
19. Pham, H.; Dai, Z.; Xie, Q.; Le, Q.V. Meta pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11557–11568.
20. Xu, Y.; Shang, L.; Ye, J.; Qian, Q.; Li, Y.; Sun, B.; Li, H.; Jin, R. Dash: Semi-supervised learning with dynamic thresholding. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 11525–11536.
21. Berthelot, D.; Roelofs, R.; Sohn, K.; Carlini, N.; Kurakin, A. Adamatch: A unified approach to semi-supervised learning and domain adaptation. arXiv 2021, arXiv:2106.04732.
22. Kuo, C.W.; Ma, C.Y.; Huang, J.B.; Kira, Z. Featmatch: Feature-based augmentation for semi-supervised learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16. Springer International Publishing: Cham, Switzerland, 2020; pp. 479–495.
23. Wallin, E.; Svensson, L.; Kahl, F.; Hammarstrand, L. Doublematch: Improving semi-supervised learning with self-supervision. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), IEEE, Montreal, QC, Canada, 21–25 August 2022; pp. 2871–2877.
24. Belkin, M.; Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. Adv. Neural Inf. Process. Syst. 2001, 14.
25. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical data augmentation with no separate search. arXiv 2019, arXiv:1909.13719.
26. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 113–123.
27. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30, 1195–1204.
28. Ke, Z.; Wang, D.; Yan, Q.; Ren, J.; Lau, R.W. Dual student: Breaking the limits of the teacher in semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6728–6736.
29. Miyato, T.; Maeda, S.; Koyama, M.; Ishii, S. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1979–1993.
30. Reynolds, D.A. Gaussian mixture models. Encycl. Biom. 2009, 741, 659–663.
31. Frénay, B.; Verleysen, M. Classification in the presence of label noise: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2013, 25, 845–869.
32. Arazo, E.; Ortego, D.; Albert, P.; O’Connor, N.; McGuinness, K. Unsupervised label noise modeling and loss correction. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 312–321.
33. Song, H.; Kim, M.; Park, D.; Shin, Y.; Lee, J.G. Robust learning by self-transition for handling noisy labels. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 1490–1500.
34. Zhang, Z.; Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. Adv. Neural Inf. Process. Syst. 2018, 31, 8778–8788.
35. Zhang, Y.; Sugiyama, M. Approximating instance-dependent noise via instance-confidence embedding. arXiv 2021, arXiv:2103.13569.
36. Song, H.; Kim, M.; Lee, J.G. Selfie: Refurbishing unclean samples for robust deep learning. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 5907–5915.
37. Han, B.; Yao, Q.; Yu, X.; Niu, G.; Xu, M.; Hu, W.; Tsang, I.; Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Adv. Neural Inf. Process. Syst. 2018, 31, 8527–8537.
38. Chen, P.; Liao, B.B.; Chen, G.; Zhang, S. Understanding and utilizing deep neural networks trained with noisy labels. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 1062–1070.
39. Li, J.; Socher, R.; Hoi, S.C.H. Dividemix: Learning with noisy labels as semi-supervised learning. arXiv 2020, arXiv:2002.07394.
40. Krizhevsky, A.; Hinton, G. Learning multiple layers of features from tiny images. Tech. Rep. 2009, 7.
41. Venkateswara, H.; Eusebio, J.; Chakraborty, S.; Panchanathan, S. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5018–5027.
42. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading Digits in Natural Images with Unsupervised Feature Learning. 2011. Available online: http://ufldl.stanford.edu/housenumbers/ (accessed on 1 December 2023).
43. Coates, A.; Ng, A.; Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; JMLR Workshop and Conference Proceedings. pp. 215–223.
44. Saenko, K.; Kulis, B.; Fritz, M.; Darrell, T. Adapting visual category models to new domains. In Proceedings of the Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; Proceedings, Part IV 11. Springer: Berlin/Heidelberg, Germany, 2010; pp. 213–226.
45. Peng, X.; Bai, Q.; Xia, X.; Huang, Z.; Saenko, K.; Wang, B. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1406–1415.
46. DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with cutout. arXiv 2017, arXiv:1708.04552.
47. Zheng, M.; You, S.; Huang, L.; Wang, F.; Qian, C.; Xu, C. Simmatch: Semi-supervised learning with similarity matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14471–14481.
48. Lerner, B.; Shiran, G.; Weinshall, D. Boosting the performance of semi-supervised learning with unsupervised clustering. arXiv 2020, arXiv:2012.00504.
49. Oliver, A.; Odena, A.; Raffel, C.A.; Cubuk, E.D.; Goodfellow, I. Realistic evaluation of deep semi-supervised learning algorithms. Adv. Neural Inf. Process. Syst. 2018, 31, 3235–3246.
50. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146.
51. Sun, J.; Mao, Z.; Li, C.; Zhou, C.; Wu, X.J. Feature Space Renormalization for Semi-supervised Learning. arXiv 2023, arXiv:2311.04055.
Figure 1. A diagram of the training process for unlabeled data in DYMatch. First, strong and weak augmentations are used to generate two different views of the unlabeled data. The feature maps h(f(ũ)) and f(ũ) are used to compute the loss of the feature-correlation consistency regularization method, and Q_i denotes the pseudo labels selected by the dynamic pseudo-label estimation method based on the Gaussian mixture model. See the method section for more details.
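To relate the diagram to computation, the following is a minimal per-batch sketch of the unlabeled-data objective pictured in Figure 1, under our own assumptions: `model` returns a (features, logits) pair, `thresholds` is the per-class vector produced by the dynamic estimation step, and the correlation term is one plausible way to instantiate feature-correlation consistency rather than the paper's exact loss.

```python
# Illustrative sketch of the unlabeled-data objective (not the official code).
import torch
import torch.nn.functional as F

def unlabeled_loss(model, u_weak, u_strong, thresholds):
    feat_w, logits_w = model(u_weak)    # weak view provides pseudo-labels
    feat_s, logits_s = model(u_strong)  # strong view is the learning target

    with torch.no_grad():
        probs = torch.softmax(logits_w, dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= thresholds[pseudo]).float()  # class-wise threshold

    # Pseudo-label cross-entropy on confident samples only.
    ce = (F.cross_entropy(logits_s, pseudo, reduction="none") * mask).mean()

    # Feature-correlation consistency: align the batch-wise cosine-similarity
    # matrices of the two views' features (one plausible instantiation).
    def corr(f):
        f = F.normalize(f, dim=-1)
        return f @ f.t()

    fc = F.mse_loss(corr(feat_s), corr(feat_w).detach())
    return ce + fc

# Shape demo with a dummy two-headed model on random data.
class Dummy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.f = torch.nn.Linear(32, 16)  # "backbone"
        self.g = torch.nn.Linear(16, 10)  # classifier
    def forward(self, x):
        h = self.f(x)
        return h, self.g(h)

loss = unlabeled_loss(Dummy(), torch.randn(8, 32), torch.randn(8, 32),
                      thresholds=torch.full((10,), 0.5))
```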
Figure 2. Variations in the thresholds, the number of pseudo labels, and the accuracy of pseudo labels for DYMatch and other SSL methods at each iteration on CIFAR-10 with 40 labels. (a) Class-average thresholds. (b) Class-average number of pseudo labels. (c) Accuracy of pseudo labels.
Table 1. The Error Rates for CIFAR-10 with Five Different Numbers of Labelled Images. The best result is in bold.

| Model | 40 Labels | 250 Labels | 500 Labels | 2000 Labels | 4000 Labels |
|---|---|---|---|---|---|
| Mean Teacher [27] | 70.09 ± 0.26 | 47.33 ± 4.71 | 42.01 ± 5.86 | 12.17 ± 0.22 | 9.19 ± 0.19 |
| Pseudo-Label [16] | 74.61 ± 1.60 | 49.98 ± 1.17 | 40.55 ± 1.70 | 21.96 ± 0.42 | 16.21 ± 0.11 |
| MixMatch [7] | 36.19 ± 6.48 | 11.08 ± 0.87 | 9.65 ± 0.94 | 7.03 ± 0.15 | 6.24 ± 0.06 |
| ReMixMatch [8] | 9.88 ± 1.03 | 6.27 ± 0.34 | 6.04 ± 0.04 | 5.53 ± 0.18 | 5.14 ± 0.04 |
| UDA [18] | 10.62 ± 3.75 | 5.43 ± 0.96 | 4.80 ± 0.09 | 4.73 ± 0.14 | 4.32 ± 0.08 |
| FixMatch [9] | 7.47 ± 0.28 | 5.07 ± 0.65 | 4.85 ± 0.06 | 4.54 ± 0.12 | 4.26 ± 0.05 |
| FlexMatch [10] | 4.97 ± 0.06 | 4.98 ± 0.49 | **4.47 ± 0.34** | 4.18 ± 0.12 | 4.19 ± 0.01 |
| Semi-Clustering [48] | 7.39 ± 0.61 | 5.51 ± 0.25 | 5.40 ± 0.23 | 4.71 ± 0.12 | 4.62 ± 0.09 |
| DoubleMatch [23] | 14.02 ± 5.71 | 5.56 ± 0.42 | 5.12 ± 0.43 | 4.66 ± 0.23 | 4.65 ± 0.17 |
| FeatMatch [22] | 7.88 ± 0.52 | 7.50 ± 0.64 | 6.21 ± 0.11 | 4.95 ± 0.20 | 4.91 ± 0.18 |
| MPL [19] | 6.62 ± 0.91 | 5.76 ± 0.26 | 5.89 ± 0.23 | 4.40 ± 0.04 | **3.89 ± 0.07** |
| SimMatch [47] | 5.18 ± 0.15 | 4.99 ± 0.13 | 4.62 ± 0.21 | 4.22 ± 0.14 | 4.31 ± 0.01 |
| SoftMatch [11] | 5.06 ± 0.02 | 4.84 ± 0.10 | 4.66 ± 0.12 | 4.33 ± 0.10 | 4.27 ± 0.02 |
| FreeMatch [12] | 4.97 ± 0.09 | 4.85 ± 0.10 | 4.57 ± 0.10 | 4.21 ± 0.05 | 4.14 ± 0.02 |
| DYMatch (ours) | **4.89 ± 0.11** | **4.80 ± 0.07** | 4.53 ± 0.08 | **4.15 ± 0.11** | 3.96 ± 0.02 |
Table 2. The Error Rates for SVHN with Five Different Numbers of Labelled Images. The best result is in bold.

| Model | 40 Labels | 250 Labels | 500 Labels | 1000 Labels | 4000 Labels |
|---|---|---|---|---|---|
| Mean Teacher [27] | 36.09 ± 3.98 | 6.45 ± 2.43 | 3.82 ± 0.17 | 3.75 ± 0.10 | 3.39 ± 0.11 |
| Pseudo-Label [16] | 64.61 ± 5.60 | 21.16 ± 0.88 | 14.35 ± 0.37 | 10.19 ± 0.41 | 5.71 ± 0.07 |
| MixMatch [7] | 30.60 ± 8.39 | 3.78 ± 0.26 | 3.64 ± 0.46 | 3.27 ± 0.31 | 2.89 ± 0.06 |
| ReMixMatch [8] | 24.04 ± 9.13 | 3.10 ± 0.50 | 3.02 ± 0.33 | 2.83 ± 0.30 | 2.42 ± 0.09 |
| UDA [18] | 5.12 ± 4.27 | 2.74 ± 2.76 | 2.55 ± 0.09 | 2.35 ± 0.07 | 2.28 ± 0.10 |
| FixMatch [9] | 3.81 ± 1.18 | 2.48 ± 0.38 | 2.35 ± 0.05 | 2.28 ± 0.11 | 2.20 ± 0.09 |
| FlexMatch [10] | 8.19 ± 3.20 | 6.59 ± 2.29 | 6.79 ± 1.20 | 6.72 ± 0.30 | 4.45 ± 0.12 |
| Semi-Clustering [48] | 3.09 ± 0.54 | 2.30 ± 0.03 | 2.29 ± 0.03 | 2.26 ± 0.12 | 2.15 ± 0.12 |
| DoubleMatch [23] | 16.50 ± 13.73 | 2.41 ± 0.53 | 2.50 ± 0.23 | 2.25 ± 0.09 | 2.15 ± 0.09 |
| FeatMatch [22] | 4.01 ± 0.13 | 3.34 ± 0.19 | 3.08 ± 0.12 | 3.10 ± 0.06 | 2.62 ± 0.08 |
| MPL [19] | 9.33 ± 8.02 | 2.29 ± 0.03 | 2.30 ± 0.12 | 2.28 ± 0.07 | 2.18 ± 0.12 |
| SimMatch [47] | 3.19 ± 0.74 | 2.26 ± 0.21 | 2.14 ± 0.20 | 2.08 ± 0.04 | 2.97 ± 0.07 |
| SoftMatch [11] | 2.31 ± 0.00 | **2.15 ± 0.05** | 2.16 ± 0.10 | 2.08 ± 0.04 | 1.93 ± 0.00 |
| FreeMatch [12] | 3.79 ± 0.03 | 4.09 ± 0.66 | 4.38 ± 0.07 | 4.31 ± 0.00 | 2.67 ± 0.10 |
| DYMatch (ours) | **2.26 ± 0.13** | 2.18 ± 0.11 | **2.11 ± 0.02** | **2.04 ± 0.03** | **1.90 ± 0.01** |
Table 3. The Error Rates for CIFAR-100 with Four Different Numbers of Labelled Images. The best result is in bold.

| Model | 400 Labels | 1000 Labels | 2500 Labels | 10,000 Labels |
|---|---|---|---|---|
| Mean Teacher [27] | - | 60.99 ± 0.21 | 53.91 ± 0.57 | 36.21 ± 0.19 |
| Pseudo-Label [16] | - | 62.78 ± 0.33 | 57.38 ± 0.46 | 36.21 ± 0.19 |
| MixMatch [7] | 67.61 ± 1.32 | 54.73 ± 0.24 | 39.94 ± 0.37 | 28.31 ± 0.33 |
| ReMixMatch [8] | 44.28 ± 2.06 | 37.86 ± 0.55 | 27.43 ± 0.31 | 23.03 ± 0.56 |
| UDA [18] | 59.28 ± 0.88 | 45.20 ± 0.71 | 33.13 ± 0.22 | 24.50 ± 0.25 |
| FixMatch [9] | 48.85 ± 1.75 | 40.55 ± 0.09 | 28.29 ± 0.11 | 23.03 ± 0.56 |
| FlexMatch [10] | **39.94 ± 1.62** | 38.44 ± 1.20 | **26.49 ± 0.20** | 21.90 ± 0.15 |
| Semi-Clustering [48] | 66.57 ± 0.23 | 54.01 ± 0.12 | 44.35 ± 0.03 | 35.99 ± 0.23 |
| DoubleMatch [23] | 42.61 ± 1.15 | 42.05 ± 0.34 | 27.47 ± 0.19 | 21.69 ± 0.26 |
| FeatMatch [22] | 48.70 ± 0.42 | 41.85 ± 0.23 | 30.43 ± 0.04 | 26.83 ± 0.04 |
| MPL [19] | 44.49 ± 0.99 | 38.13 ± 0.23 | 27.43 ± 0.22 | 22.79 ± 0.18 |
| SimMatch [47] | 48.82 ± 1.51 | 37.12 ± 0.52 | 32.54 ± 0.27 | 26.42 ± 0.18 |
| SoftMatch [11] | 49.64 ± 1.46 | 36.42 ± 0.34 | 33.05 ± 0.05 | 27.26 ± 0.03 |
| FreeMatch [12] | 49.24 ± 2.16 | 37.15 ± 0.44 | 32.79 ± 0.21 | 27.17 ± 0.11 |
| DYMatch (ours) | 44.03 ± 0.62 | **35.42 ± 0.25** | 31.13 ± 0.10 | **20.34 ± 0.08** |
Table 4. The Error Rates for STL10 on Two Different Numbers of Labelled Images. The best result is in bold.

| Model | 40 Labels | 1000 Labels |
|---|---|---|
| Mean Teacher [27] | - | 21.43 ± 2.39 |
| Pseudo-Label [16] | 74.76 ± 0.99 | 27.99 ± 0.80 |
| MixMatch [7] | 54.93 ± 0.96 | 10.41 ± 0.61 |
| ReMixMatch [8] | 32.12 ± 6.24 | 5.23 ± 0.45 |
| UDA [18] | 37.42 ± 8.44 | 7.66 ± 0.56 |
| FixMatch [9] | 35.97 ± 4.14 | 5.17 ± 0.63 |
| FlexMatch [10] | 29.15 ± 4.16 | 5.77 ± 0.23 |
| Semi-Clustering [48] | 40.90 ± 0.23 | 4.78 ± 0.29 |
| DoubleMatch [23] | 34.34 ± 0.34 | **4.46 ± 0.20** |
| FeatMatch [22] | 51.26 ± 0.33 | 6.94 ± 0.32 |
| MPL [19] | 35.76 ± 3.29 | 6.45 ± 0.29 |
| SimMatch [47] | 37.57 ± 0.23 | 5.63 ± 0.35 |
| SoftMatch [11] | 21.42 ± 3.48 | 5.73 ± 0.24 |
| FreeMatch [12] | **15.56 ± 0.55** | 5.63 ± 0.15 |
| DYMatch (ours) | 17.58 ± 0.67 | 5.35 ± 0.12 |
Table 5. The Error Rates for Other Object Recognition Datasets with Different Numbers of Labelled Images (the number of categories in each domain adaptation dataset is given in parentheses). The best result is in bold.

| Model | Office31 (31), 10 Labels per Class | Office-Home (65), 10 Labels per Class | Office-Home (65), 50 Labels per Class | DomainNet (345), 50 Labels per Class | DomainNet (345), 500 Labels per Class |
|---|---|---|---|---|---|
| ReMixMatch [8] | 30.12 ± 0.33 | 68.69 ± 0.07 | 46.42 ± 0.36 | 82.54 ± 0.11 | 76.23 ± 0.23 |
| FixMatch [9] | 28.36 ± 0.20 | 55.73 ± 0.23 | 46.27 ± 0.23 | 79.12 ± 0.03 | 73.79 ± 0.15 |
| FlexMatch [10] | 25.13 ± 0.33 | 54.56 ± 0.30 | 43.99 ± 0.03 | 66.12 ± 0.15 | 53.45 ± 0.23 |
| DoubleMatch [23] | 26.30 ± 0.30 | 56.45 ± 0.13 | 45.25 ± 0.13 | 70.32 ± 0.32 | 55.43 ± 0.12 |
| SimMatch [47] | 26.09 ± 0.12 | 53.65 ± 0.09 | 45.01 ± 0.04 | 65.95 ± 0.20 | 51.33 ± 0.04 |
| SoftMatch [11] | 22.50 ± 0.32 | 52.66 ± 0.14 | 40.91 ± 0.12 | 64.24 ± 0.18 | **50.13 ± 0.05** |
| DYMatch (ours) | **20.31 ± 0.26** | **52.29 ± 0.26** | **40.42 ± 0.10** | **63.81 ± 0.10** | 50.93 ± 0.07 |
| Fully-Supervised | 13.54 | 32.48 | | 43.37 | |
Table 6. The Results of the Ablation Study on the Influence of the Dynamic Pseudo-Label Estimation Method and the Feature-Correlation-Based Consistency Regularization Method on DYMatch (PL represents the dynamic pseudo-label estimation method based on the Gaussian mixture model, and CR represents the consistency regularization method based on feature correlation). The best result is in bold.

| Model | CR | PL | CIFAR-10 (40 Labels) | CIFAR-10 (4000 Labels) |
|---|---|---|---|---|
| DYMatch_PL | | ✓ | 8.16 ± 0.06 | 5.26 ± 0.05 |
| DYMatch_CR | ✓ | | 10.58 ± 0.16 | 8.41 ± 0.24 |
| DYMatch | ✓ | ✓ | **4.93 ± 0.11** | **3.96 ± 0.02** |
Table 7. The Results of the Ablation Study for Features from Different Modules. The best result is in bold.

| Model | Method | CIFAR-10 (40 Labels) | CIFAR-10 (4000 Labels) |
|---|---|---|---|
| DYMatch_F | After backbone f(·) | **4.93 ± 0.11** | **3.96 ± 0.02** |
| DYMatch_G | After classification module g(·) | 6.31 ± 0.08 | 5.06 ± 0.07 |
Table 8. The Results of the Ablation Study on Different Optimizers with Different Learning Rates (lr) and Momenta (mom). The best result is in bold.

| Optimizer | Hyper-Parameters | CIFAR-10 (40 Labels) | CIFAR-10 (4000 Labels) |
|---|---|---|---|
| SGD | lr = 0.02, mom = 0.9 | 4.93 ± 0.11 | **3.96 ± 0.02** |
| SGD | lr = 0.02, mom = 0.9, Nesterov | 5.05 ± 0.10 | 4.10 ± 0.12 |
| SGD | lr = 0.01, mom = 0.9 | 4.96 ± 0.09 | 4.06 ± 0.06 |
| SGD | lr = 0.01, mom = 0.9, Nesterov | 5.01 ± 0.13 | 4.14 ± 0.03 |
| SGD | lr = 0.03, mom = 0.9 | **4.92 ± 0.03** | 3.99 ± 0.10 |
| SGD | lr = 0.03, mom = 0.9, Nesterov | 5.11 ± 0.02 | 4.04 ± 0.09 |
| Adam | lr = 0.01, wd = 1 × 10⁻³ | 13.13 ± 0.13 | 10.34 ± 0.10 |
Table 9. The Results for Different Learning Rate Decay Schedulers. The best result is in bold.

| Learning Rate Decay Scheduler | CIFAR-10 (40 Labels) | CIFAR-10 (4000 Labels) |
|---|---|---|
| Cosine (min_lr = 1 × 10⁻³) | **4.93 ± 0.11** | **3.96 ± 0.02** |
| Exp-Warmup (min_lr = 1 × 10⁻³) | 5.13 ± 0.08 | 4.10 ± 0.04 |
| No Decay | 8.02 ± 0.23 | 6.23 ± 0.11 |
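For reference, a minimal sketch of the best-performing configuration from Tables 8 and 9 (SGD with lr = 0.02, momentum = 0.9, and cosine decay toward min_lr = 1 × 10⁻³) might look as follows; the placeholder model, the step count, and the clamped-lambda realization of the learning-rate floor are our assumptions, not details taken from the paper.

```python
# Sketch of the optimizer/scheduler setting that performed best in the
# ablations: SGD (lr = 0.02, momentum = 0.9, no Nesterov) with cosine decay
# clamped at a floor of 1e-3 (one way to realize "min_lr").
import math
import torch

model = torch.nn.Linear(10, 10)      # placeholder network
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)

total_steps = 2 ** 20                # placeholder training length
min_lr, base_lr = 1e-3, 0.02
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: max(min_lr / base_lr,
                     0.5 * (1 + math.cos(math.pi * step / total_steps))),
)

# Each training iteration would call optimizer.step() then scheduler.step().
```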
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
