3.2. Problem
Let $\mathcal{D}$ be a dataset, and let the subscripts $tr$ and $te$ refer to the training and testing subsets of the data. Image classification can be formulated as the problem of learning a classifier $f$ from a set of training data $\mathcal{D}_{tr} = \{(x_i, y_i)\}_{i=1}^{N_{tr}}$, where $y_i$ is the ground-truth label in $C$ categories corresponding to $x_i$, and $N_{tr}$ is the number of samples in the training dataset. In our setting, $f$ is a classifier from the CNN. The goal of a vanilla image classification problem is to improve the accuracy on the unlabeled test examples $\mathcal{D}_{te} = \{x_j\}_{j=1}^{N_{te}}$. However, due to the diversity of the datasets and the fuzzy differences between different categories, the accuracy on the test samples remains difficult to improve.
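For concreteness, the following is a minimal PyTorch sketch of this vanilla supervised setting: the classifier $f$ is trained on the labeled training subset with cross-entropy and then evaluated on the test subset. The loaders, device, and function names are illustrative placeholders, not the exact configuration used in this work.

```python
# Minimal sketch of the baseline supervised setup; loaders and names are
# illustrative assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn

def train_baseline(f, train_loader, optimizer, device="cuda"):
    """Train classifier f on labeled D_tr with cross-entropy only."""
    ce = nn.CrossEntropyLoss()
    f.train()
    for x, y in train_loader:                 # (x_i, y_i) in D_tr
        x, y = x.to(device), y.to(device)
        logits = f(x)                         # C-way class scores f(x_i)
        loss = ce(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

@torch.no_grad()
def evaluate(f, test_loader, device="cuda"):
    """Accuracy on the test subset D_te (labels used only for scoring)."""
    f.eval()
    correct = total = 0
    for x, y in test_loader:
        pred = f(x.to(device)).argmax(dim=1).cpu()
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total
```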
3.4. Separation Loss
As shown in
Figure 1, different classes of training and test datasets are mixed together. The decision boundary for the trained network remains fuzzy, leading to poor model performance and low accuracy of teat-end classification. Hence, it is necessary to improve the discrimination between different classes.
The purpose of a new separation loss function, $\mathcal{L}_{sep}$, is to improve the inter-class dispersion so that the boundaries between different categories can be separable, and samples in the same category can be more closely associated with each other. The core of the separation loss is to reduce the similarity between different classes. Since the network is trained using batch-wise samples, we inevitably encounter situations in which the numbers of samples in different classes are imbalanced. We hence calculate the covariance matrix of the output of each category's samples and then minimize the structural similarity [27] between every two categories' covariance matrices as follows:

$$\mathcal{L}_{sep} = \sum_{c=1}^{C} \sum_{c'=c+1}^{C} \left| \mathrm{SSIM}\big( \mathrm{Cov}(f(B_c)),\, \mathrm{Cov}(f(B_{c'})) \big) \right| \quad (2)$$

where $B$ represents batch-wise data, $f(B_c)$ generates the categorical output for the samples of class $c$ in the batch, $\mathrm{Cov}(\cdot)$ calculates the covariance matrix of the categorical features as in Equation (3), and $|\cdot|$ takes the absolute value to accelerate the convergence.
$$\mathrm{Cov}(X) = \frac{1}{n-1}\,(X - \bar{X})^{\top}(X - \bar{X}) \quad (3)$$

where $\bar{X}$ is the data mean, $n$ is the number of samples of the corresponding class in the batch, and $X$ is either $f(B_c)$ or $f(B_{c'})$. The $\mathrm{SSIM}(\cdot,\cdot)$ can be computed as in Equation (4):
$$\mathrm{SSIM}(x, y) = \frac{\left(2\mu_x \mu_y + \epsilon_1\right)\left(2\sigma_{xy} + \epsilon_2\right)}{\left(\mu_x^2 + \mu_y^2 + \epsilon_1\right)\left(\sigma_x^2 + \sigma_y^2 + \epsilon_2\right)} \quad (4)$$

where $x$ and $y$ are the batch-wise features (here, the two categorical covariance matrices $\mathrm{Cov}(f(B_c))$ and $\mathrm{Cov}(f(B_{c'}))$); $\mu_x$, $\mu_y$ and $\sigma_x$, $\sigma_y$ are their means and standard deviations, and $\sigma_{xy}$ is the cross-covariance between $x$ and $y$. $\epsilon_1$ and $\epsilon_2$ are two small constants that stabilize the division with a weak denominator. This loss function is derived from the structural similarity index measure (SSIM) [27]. It has the advantage of measuring the luminance, contrast, and structural differences between $x$ and $y$; therefore, $\mathrm{SSIM}(x, y)$ is more capable of measuring the similarity between any two different categorical samples. In addition, the range of $|\mathrm{SSIM}(x, y)|$ is from 0 to 1, where 1 indicates high similarity between the batch features and 0 means they are not similar.
During training, minimizing $\mathcal{L}_{sep}$ leads to minimal similarity between every two categories; hence, it achieves inter-class dispersion.
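To make the computation concrete, the following PyTorch sketch implements the separation loss as described in Equations (2)-(4): per-class covariance matrices of the categorical outputs, SSIM between every pair of covariance matrices, and the absolute values summed over class pairs. Treating the softmax outputs as the categorical features and the values of `eps1` and `eps2` are illustrative assumptions, not the authors' exact settings.

```python
# A minimal sketch of the separation loss (Equations (2)-(4)); softmax outputs
# are assumed as the "categorical features", and names are illustrative.
import torch

def _ssim(x, y, eps1=1e-4, eps2=1e-4):
    """SSIM between two flattened covariance matrices (Equation (4))."""
    x, y = x.flatten(), y.flatten()
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()        # cross-covariance
    return ((2 * mu_x * mu_y + eps1) * (2 * cov_xy + eps2)) / \
           ((mu_x**2 + mu_y**2 + eps1) * (var_x + var_y + eps2))

def _cov(feats):
    """Sample covariance matrix of one class's batch features (Equation (3))."""
    centered = feats - feats.mean(dim=0, keepdim=True)
    return centered.t() @ centered / max(feats.size(0) - 1, 1)

def separation_loss(logits, labels):
    """Sum |SSIM| over every pair of classes in the batch (Equation (2))."""
    probs = torch.softmax(logits, dim=1)
    classes = labels.unique()
    covs = [_cov(probs[labels == c]) for c in classes if (labels == c).sum() > 1]
    loss = logits.new_zeros(())
    for i in range(len(covs)):
        for j in range(i + 1, len(covs)):
            loss = loss + _ssim(covs[i], covs[j]).abs()
    return loss
```

In training, this term would be added to the standard cross-entropy loss on the labeled batch, e.g., `ce(logits, y) + separation_loss(logits, y)`, with any weighting factor being a design choice.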
3.5. Confident Pseudo Labeling
By combining separation loss with cross-entropy loss, we can improve the discrimination of classifier
f using the training dataset. To improve performance on the test dataset, we leverage transductive learning to mitigate the difference between the training and test datasets. Transductive learning can train on both the labeled training data and the test samples (without true labels); hence, the difference between them can be minimized [
15].
To obtain knowledge from the test dataset, we first generate confident pseudo labels. Previous work either utilized hard pseudo labels or the predicted class probability. In contrast to previous approaches, we aim to continuously train on the newly confident pseudo-labeled test data. In this stage, we also take advantage of the initially trained classifier $f$ to generate initial pseudo labels and examples for the test data. We define a confident pseudo label in the following equation:

$$(\hat{x}_j, \hat{y}_j) = \Big(x_j, \; \arg\max_{c} P(c \mid x_j)\Big) \quad \text{if} \quad \max_{c} P(c \mid x_j) > p \quad (5)$$

where $\max_{c} P(c \mid x_j)$ represents the confidence, $\hat{y}_j$ is the confident label, and $\hat{x}_j$ is its corresponding confident sample. Here, $P(c \mid x_j)$ is the predicted probability of class $c$ given the observation $x_j$; $\max_c(\cdot)$ takes the dominant class probability, which must be higher than the threshold $p$, where $p$ is between 0 and 1. The confident samples and their confident labels are able to push the decision boundary of classifier $f$ toward the test dataset.
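A minimal sketch of this selection rule is given below, assuming softmax probabilities from $f$; the function name and loader interface are illustrative assumptions.

```python
# A minimal sketch of confident pseudo-label selection (Equation (5)).
import torch

@torch.no_grad()
def confident_pseudo_labels(f, test_loader, p=0.9, device="cuda"):
    """Keep test samples whose dominant class probability exceeds threshold p."""
    f.eval()
    samples, labels = [], []
    for x, _ in test_loader:                       # test labels are never used
        probs = torch.softmax(f(x.to(device)), dim=1)
        conf, pred = probs.max(dim=1)              # dominant class probability
        keep = conf > p                            # confidence filter
        samples.append(x[keep.cpu()])
        labels.append(pred[keep].cpu())
    return torch.cat(samples), torch.cat(labels)   # confident (x, y) pairs
```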
We can construct a pseudo-labeled test domain $\mathcal{D}_{pl} = \{(\hat{x}_j, \hat{y}_j)\}_{j=1}^{N_{pl}}$, which consists of confident test examples with their confident pseudo labels, where $\hat{x}_j \in \mathcal{D}_{te}$, $\hat{y}_j \in \{1, \dots, C\}$, and $N_{pl}$ is controlled by $p$: $N_{pl}$ increases if $p$ decreases, and $N_{pl}$ decreases if $p$ increases.
However, this pseudo labeling method generates confident pseudo labels with only a single high probability threshold. The classifier $f$ can be updated in the early stages of training but may not be able to train on more examples in successive iterations, since all high-probability samples are already treated as confident samples. Therefore, we propose to continuously generate confident examples over $T$ rounds of adjustment learning so that the classifier $f$ can be updated in each round. In adjustment learning, the pseudo-labeled test domain becomes:

$$\mathcal{D}_{pl}^{t} = \big\{(\hat{x}_j^{t}, \hat{y}_j^{t})\big\}_{j=1}^{N_{pl}^{t}} \quad (6)$$

where $\hat{x}_j^{t} \in \mathcal{D}_{te}$ and $\hat{y}_j^{t} \in \{1, \dots, C\}$, and $t$ is between $1$ and $T$. To remove noisy pseudo labels of the predicted target domain in every $t$, we constrain the size of the $t$-th updated domain to be not larger than the target-domain sample size $N_{te}$, which means $N_{pl}^{t} \leq N_{te}$.
In addition, $\mathcal{D}_{pl}^{t}$ is updated using Equation (6) with a probability threshold $p_t$ for every $t$; it also meets the requirements ($\hat{x}_j^{t} \in \mathcal{D}_{te}$ and $N_{pl}^{t} \leq N_{te}$), so we obtain confident examples and pseudo labels during each $t$-th iteration, and the classifier $f$ leans toward the test data. Over $T$ iterations, we then form a set of probability thresholds $\{p_1, p_2, \dots, p_T\}$. This approach produces confident examples and pseudo labels in each recurrent training interval.
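The per-round domain update could be sketched as follows, reusing `confident_pseudo_labels` from the sketch above; the decreasing threshold schedule is an illustrative assumption, and $f$ is assumed to be retrained between rounds as shown in the training-step sketch after Equation (7).

```python
# A minimal sketch of the T-round pseudo-label adjustment (Equation (6));
# the threshold schedule {p_1, ..., p_T} shown here is an assumption.
def build_adjusted_domains(f, test_loader, thresholds=(0.95, 0.9, 0.85, 0.8)):
    """Yield the pseudo-labeled test domain D_pl^t for t = 1..T."""
    for t, p_t in enumerate(thresholds, start=1):
        x_hat, y_hat = confident_pseudo_labels(f, test_loader, p=p_t)
        # The selected set can never exceed the target-domain sample size.
        assert x_hat.size(0) <= len(test_loader.dataset)
        yield t, x_hat, y_hat        # D_pl^t for the t-th adjustment round
```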
During training, the constructed pseudo-labeled test data domain $\mathcal{D}_{pl}^{t}$ keeps optimizing the trained classifier $f$ after minimizing the cross-entropy and separation loss functions. The pseudo-labeled test data are also trained by minimizing the cross-entropy loss. Therefore, the loss function for $\mathcal{D}_{pl}^{t}$ in each training iteration is given by:

$$\mathcal{L}_{pl} = -\frac{1}{N_{pl}^{t}} \sum_{j=1}^{N_{pl}^{t}} \sum_{c=1}^{C} \hat{y}_{jc}^{t} \, \log P\big(c \mid \hat{x}_j^{t}\big) \quad (7)$$

where $N_{pl}^{t}$ is the number of confident samples in the $t$-th adjustment learning, $\hat{y}_{jc}^{t}$ is the confident pseudo-labeled binary indicator of class $c$ for the confident sample $\hat{x}_j^{t}$ in the $t$-th adjustment training, and $P(c \mid \hat{x}_j^{t})$ is the predicted probability of class $c$ given the input of the confident sample $\hat{x}_j^{t}$.
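One adjustment-learning training step could then combine the supervised losses with Equation (7), as in the sketch below; the weighting factors `lam_sep` and `lam_pl` are illustrative assumptions. Note that cross-entropy with hard pseudo labels is equivalent to the one-hot indicator form of Equation (7).

```python
# A minimal sketch of one adjustment-learning step: cross-entropy on labeled
# training data, the separation loss (Eq. (2)), and the pseudo-label
# cross-entropy of Eq. (7); the weights lam_sep and lam_pl are assumptions.
import torch.nn.functional as F

def training_step(f, x, y, x_hat_t, y_hat_t, optimizer, lam_sep=1.0, lam_pl=1.0):
    logits = f(x)                                  # labeled training batch
    logits_pl = f(x_hat_t)                         # confident pseudo-labeled batch D_pl^t
    loss = (F.cross_entropy(logits, y)
            + lam_sep * separation_loss(logits, y)
            + lam_pl * F.cross_entropy(logits_pl, y_hat_t))   # Equation (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```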
3.6. Categorical Maximum Mean Discrepancy
The proposed confident pseudo labeling process can optimize the network parameters, but it does not necessarily minimize the differences between the training and test data. To reduce this discrepancy, we also compute the maximum mean discrepancy (MMD) loss [28], which is a frequently used distance-based loss function that reduces the divergence between the training and test data. However, the MMD loss in its conventional form focuses only on marginal distribution alignment, which is more suitable for problems with large domain divergence. As shown in
Figure 1, the training and test data overlap, suggesting the marginal distribution alignment is not important for these cow teat images. Due to the fuzzy boundaries between different categories, conditional distribution alignment is required. Hence, we propose a categorical MMD (CMMD) loss, which attempts to align the conditional distribution of each category of training and test data.
$$\mathcal{L}_{CMMD} = \frac{1}{C} \sum_{c=1}^{C} \left\| \frac{1}{n_c^{tr}} \sum_{i=1}^{n_c^{tr}} \phi\big(x_i^{c}\big) - \frac{1}{n_c^{te}} \sum_{j=1}^{n_c^{te}} \phi\big(\hat{x}_j^{c}\big) \right\|_{\mathcal{H}}^{2} \quad (8)$$

where $n_c^{tr}$ and $n_c^{te}$ are the numbers of samples in each class $c$ of the training data and the confident pseudo-labeled test data, respectively; $\phi(\cdot)$ is the feature mapping to the reproducing kernel Hilbert space $\mathcal{H}$; and $x_i^{c}$ and $\hat{x}_j^{c}$ are the categorical samples of class $c$ from the training and pseudo-labeled test data. This proposed CMMD loss measures the per-class discrepancy between the training and test datasets.
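A possible implementation of this per-class alignment is sketched below, assuming a Gaussian kernel over network features; the kernel choice, bandwidth, and feature source are illustrative assumptions rather than the exact formulation. In practice, this term would be added to the cross-entropy, separation, and pseudo-label losses with a weighting factor.

```python
# A minimal sketch of the categorical MMD (Equation (8)) with a Gaussian
# kernel; kernel, bandwidth, and feature choice are illustrative assumptions.
import torch

def _gaussian_kernel(a, b, sigma=1.0):
    """k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)) for all pairs."""
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def _mmd2(src, tgt, sigma=1.0):
    """Squared MMD between two feature sets (biased estimate)."""
    return (_gaussian_kernel(src, src, sigma).mean()
            + _gaussian_kernel(tgt, tgt, sigma).mean()
            - 2 * _gaussian_kernel(src, tgt, sigma).mean())

def cmmd_loss(train_feats, train_labels, test_feats, pseudo_labels, num_classes):
    """Average per-class squared MMD between training features and confident
    pseudo-labeled test features; classes missing on either side are skipped."""
    losses = []
    for c in range(num_classes):
        src = train_feats[train_labels == c]
        tgt = test_feats[pseudo_labels == c]
        if len(src) > 0 and len(tgt) > 0:
            losses.append(_mmd2(src, tgt))
    return torch.stack(losses).mean() if losses else train_feats.new_zeros(())
```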