3.1.1. Initialization with Hard Labels
Whether in a supervised or an unsupervised person re-identification setting, our goal is to learn a feature embedding function $f_\theta(\cdot)$, whose parameters are collectively denoted as $\theta$. Since the identity information of the pedestrian images is not available, for a pedestrian data set $X = \{x_1, x_2, \dots, x_N\}$, the label of each training image $x_i$ is initially assigned by its index $i$, and this index is fed into the network as the pseudo-label of the image. In this way, each training image is assumed to fall into an individual class by itself. The feature vector $v$ of each image $x$ is normalized by $\ell_2$ regularization so that $\lVert v \rVert = 1$ via $v \leftarrow v / \lVert v \rVert$. Then the probability that an image $x$ belongs to the $i$-th class is defined as
$$P(i \mid x) = \frac{\exp(V_i^{\top} v / \tau)}{\sum_{j=1}^{N} \exp(V_j^{\top} v / \tau)}, \qquad (1)$$
where $V$ is the memory bank [22] that stores the feature of each class, $V_j$ is the $j$-th column of the memory bank, which represents the feature of the $j$-th class, and $N$ is the number of classes, which equals the number of pedestrian images. The temperature $\tau$ is set to 0.07 following [22].
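As a minimal sketch of Equation (1), assuming the memory bank is kept as a single PyTorch tensor of $\ell_2$-normalized rows (the helper name `class_probabilities` and the tensor layout are our assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def class_probabilities(v, memory_bank, tau=0.07):
    """Non-parametric softmax over the memory bank (Equation (1)).

    v:           (D,) L2-normalized feature of one image
    memory_bank: (N, D) one L2-normalized feature per class
    Returns the (N,) probability of the image belonging to each class.
    """
    logits = memory_bank @ v / tau   # V_j^T v / tau for every class j
    return F.softmax(logits, dim=0)
```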
Then, the non-parametric instance classification loss is adopted to train the network. The hard cross-entropy loss function is defined as
$$L_{hard} = -\sum_{j=1}^{N} q_j \log P(j \mid x), \qquad (2)$$
where $q_i$ denotes the label of class $i$: the label of the ground-truth class is set to 1, and the labels of all other classes are 0.
The purpose of using hard labels to initialize the network is to minimize the Euclidean distance between each sample feature and its own class feature in the memory bank, and to maximize the Euclidean distance between the sample feature and the features of all other classes in the memory bank. As a result, the network completes initialization over all the unlabeled images and obtains an initial discriminative ability.
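Under the same assumptions, and reusing `class_probabilities` from the sketch above, the hard-label objective reduces to the negative log-probability of the image's own index:

```python
def hard_label_loss(v, memory_bank, index, tau=0.07):
    """Hard cross-entropy of Equation (2): the one-hot target q selects
    the image's own index i, so the sum collapses to -log P(i | x)."""
    p = class_probabilities(v, memory_bank, tau)
    return -torch.log(p[index])
```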
3.1.2. Dynamic Adaptive Label Allocation
After initializing the network with hard labels, the model has learned to recognize each unlabeled image, and each training sample learns to push the other training images away during training. However, images with the same identity should lie closer in the feature space, and forcing images of the same person to have clearly different representations has a negative effect on the network. Therefore, a softened-label method is proposed to pull images with the same identity closer together. Firstly, for each training sample, the model finds the K features in the memory bank that are most similar to the sample feature, using the Euclidean distance to measure similarity: the smaller the Euclidean distance, the higher the degree of similarity between features. It then assigns pseudo-labels to these K similar features, which are denoted as $\{V_j \mid j \in \mathcal{S}_i\}$, where $\mathcal{S}_i$ is the index set of the K nearest neighbors of $x_i$; the corresponding labels are denoted $q_j$. The difference from a clustering method is that clustering treats similar features as the same class and assigns them the same hard label, whereas this method gives similar features softened labels, so that the model not only predicts each image into its ground-truth class but also finds it acceptable to predict the training image into a similar class, as sketched below.
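A minimal sketch of this nearest-neighbor step under the same PyTorch memory-bank assumptions as above (the helper name `k_nearest_classes` is ours):

```python
def k_nearest_classes(v, memory_bank, index, k):
    """Indices of the K memory-bank features closest to v in Euclidean
    distance, excluding the sample's own entry (a sketch, not paper code)."""
    dists = torch.cdist(v.unsqueeze(0), memory_bank).squeeze(0)  # (N,)
    dists[index] = float("inf")           # do not pick the sample itself
    return torch.topk(dists, k, largest=False).indices           # S_i
```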
Equation (3) shows how SSL generates pseudo-labels: a suitable hyperparameter, found through experiments, balances the weight between the sample's own class and the similar classes. For data $x_i$, the target label distribution of SSL is
$$q_j = \begin{cases} 1 - \lambda, & j = i, \\ \lambda / K, & j \in \mathcal{S}_i, \\ 0, & \text{otherwise}, \end{cases} \qquad (3)$$
where $\lambda$ is a manually chosen value. SSL replaces the one-hot hard label with a softened label by introducing the small manual parameter $\lambda$ to adjust the probability distribution. Notice, however, that $\lambda$ is a fixed value, so every input sample receives the same expected probability, and so do all of its similar categories. In fact, different sample features have different correlations with their similar features: the smaller the Euclidean distance between the sample feature and a similar feature in the feature space, the higher the confidence that they belong to the same class.
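For comparison, a sketch of the SSL target distribution as we reconstruct Equation (3); `lam` stands for the manual parameter $\lambda$, and the default value 0.1 is an assumption:

```python
def ssl_soft_labels(index, neighbors, n_classes, lam=0.1):
    """SSL-style fixed softening (Equation (3) as reconstructed here):
    a constant mass lam is split uniformly over the K similar classes,
    regardless of how close each similar feature actually is."""
    q = torch.zeros(n_classes)
    q[neighbors] = lam / len(neighbors)
    q[index] = 1.0 - lam
    return q
```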
Based on this point of view, the DALA method is proposed to assign pseudo-labels to images. Firstly, the Euclidean distances between the sample feature and the other features in the memory bank are calculated. Then the K similar features nearest to the sample feature are found, and softened labels are assigned to the sample feature and those similar features according to Equation (4). For data $x_i$, the target label distribution of our method is
$$q_j = \begin{cases} 1 - \alpha \sum_{k \in \mathcal{S}_i} e^{-d_{i,k}}, & j = i, \\ \alpha\, e^{-d_{i,j}}, & j \in \mathcal{S}_i, \\ 0, & \text{otherwise}, \end{cases} \qquad (4)$$
where $\alpha$ denotes a multiplicative scaling coefficient used to reduce the confidence of the similar features, and $d_{i,j}$ denotes the Euclidean distance between feature $v_i$ and feature $V_j$. For convenience in what follows, $q_i$ is used to represent the sample feature label and $q_j$ ($j \in \mathcal{S}_i$) to represent the similar feature labels. When $d_{i,j}$ is relatively large, the similar feature receives a lower confidence and thus a smaller label; conversely, when $d_{i,j}$ is relatively small, the similar feature receives a higher confidence and a bigger label.
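A sketch of the DALA allocation as we reconstruct Equation (4), reusing `k_nearest_classes` from above; the exponential decay $e^{-d_{i,j}}$ and the default `alpha` are our reading of the text rather than verified constants:

```python
def dala_soft_labels(v, memory_bank, index, k, alpha=0.1):
    """Distance-aware softening (Equation (4) as reconstructed here):
    each similar class j gets weight alpha * exp(-d_ij), so closer
    neighbors receive bigger labels; the remaining mass stays on the
    sample's own class. alpha must be small enough that the residual
    label q[index] stays positive."""
    neighbors = k_nearest_classes(v, memory_bank, index, k)
    d = torch.linalg.norm(memory_bank[neighbors] - v, dim=1)  # d_ij
    q = torch.zeros(memory_bank.size(0))
    q[neighbors] = alpha * torch.exp(-d)
    q[index] = 1.0 - q[neighbors].sum()
    return q
```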
Compared with SSL's pseudo-label allocation method, DALA is more robust. When the model trains on a sample feature, it assigns each similar feature a pseudo-label whose size matches the similarity between the sample feature and that similar feature. This gives the model considerable freedom and reduces the impact on model accuracy when wrong categories appear among the similar features.
3.1.3. Fine-Tuning the Model with Softened Labels
By taking reliable classes into account, the confidence of the sample feature class is reduced and the confidence of the similar feature classes is increased, which guides the network to smoothly learn the similarity among images of the same identity. The network is fine-tuned by the softened cross-entropy loss function (Equation (5)):
$$L_{soft} = -\sum_{j=1}^{N} q_j \log P(j \mid x). \qquad (5)$$
The fine-tuned network can not only reduce the Euclidean distance between each image feature and its ground-truth feature in the memory bank, but also reduce the distance between the sample feature and its similar features. After each iteration, the model updates the memory bank. The same update rule as MoCo [23] is adopted, with the difference that DALA applies the momentum update to the representations of the same sample rather than to the encoder. The memory bank is updated as follows:
$$V_i \leftarrow m V_i + (1 - m)\, v_i,$$
where $V_i$ denotes the feature vector of the $i$-th image in the memory bank, $v_i$ denotes the new feature vector corresponding to that image, and $m \in [0, 1)$ is a momentum coefficient. When $m$ is 0, the new feature completely replaces the previous feature in the memory bank. When $m$ is greater than 0 and less than 1, the updated feature retains part of the previous sample's feature, and $m$ serves to stabilize the learning process. Further, $m$ cannot be 1; otherwise, the memory bank would never be updated. In our experiments, a relatively small momentum works better than a larger value, which suggests that retaining a small part of the previous samples' characteristics benefits the learning of the model.
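A sketch of this momentum update under the same assumptions; the convention $V_i \leftarrow m V_i + (1-m)\,v_i$ matches the boundary cases described in the text ($m = 0$ replaces the stored feature, $m = 1$ would freeze the bank):

```python
def update_memory_bank(memory_bank, v, index, m=0.1):
    """Blend the new feature v into slot i of the memory bank, then
    re-normalize so the entry stays on the unit sphere. A small m
    reflects the observation that low momentum works better here."""
    memory_bank[index] = m * memory_bank[index] + (1.0 - m) * v
    memory_bank[index] = F.normalize(memory_bank[index], dim=0)
    return memory_bank
```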
Through the softened classification network, the model can gradually learn sample features that are close to those of their similar images. The learning of the reliable classes is softened and gentle. Owing to the DALA method, when a wrong image is included in the reliable set, its negative effect is reduced. Besides this, the relatively weak supervision signal gives the model more freedom and a higher potential.