Discriminatively Unsupervised Learning Person Re-Identification via Considering Complicated Images

Quan, Rong; Xu, Biaoyi; Liang, Dong

doi:10.3390/s23063259


by Rong Quan †, Biaoyi Xu † and Dong Liang *

School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Sensors 2023, 23(6), 3259; https://doi.org/10.3390/s23063259

Submission received: 16 February 2023 / Revised: 10 March 2023 / Accepted: 16 March 2023 / Published: 20 March 2023

(This article belongs to the Special Issue Person Re-Identification Based on Computer Vision)


Abstract

State-of-the-art purely unsupervised learning person re-ID methods first cluster all the images into multiple clusters and assign each clustered image a pseudo label based on the cluster result. Then, they construct a memory dictionary that stores all the clustered images, and subsequently train the feature extraction network based on this dictionary. All these methods directly discard the unclustered outliers in the clustering process and train the network only based on the clustered images. The unclustered outliers are complicated images containing different clothes and poses, with low resolution, severe occlusion, and so on, which are common in real-world applications. Therefore, models trained only on clustered images will be less robust and unable to handle complicated images. We construct a memory dictionary that considers complicated images consisting of both clustered and unclustered images, and design a corresponding contrastive loss by considering both kinds of images. The experimental results show that our memory dictionary that considers complicated images and contrastive loss can improve the person re-ID performance, which demonstrates the effectiveness of considering unclustered complicated images in unsupervised person re-ID.

Keywords:

purely unsupervised learning person re-ID; contrastive loss; unclustered outliers; complicated images

1. Introduction

Purely unsupervised learning person re-identification (re-ID) refers to recognizing the same person appearing in images captured by different cameras without any ground-truth labels. With the rapid development of unsupervised representation learning [1,2,3], the performance of purely unsupervised learning person re-ID based on contrastive loss has been gradually improving [4,5,6,7,8]. Moreover, purely unsupervised learning person re-ID does not require any labeled information or complex training processes, which makes it easy to implement and deploy in real-world application scenarios. Therefore, purely unsupervised learning person re-ID has great research potential and a promising future.

Most existing purely unsupervised learning person re-ID methods execute iterative training in the following way. Firstly, they generate an initial feature representation for each image from the initial feature extraction network. Based on the initial image feature representations, they cluster all the images into multiple clusters and assign a pseudo label to each clustered image according to the cluster results. Knowing the image feature representations and pseudo labels, these methods build an instance-level [9,10] or cluster-level [11] memory dictionary. Then, they sample a batch of query images and compute the contrastive loss by comparing the query images with all the instance representations in the memory dictionary. Next, these methods train the feature extraction network based on the contrastive loss, and update the instance representations in the memory dictionary based on the trained feature extraction network.

Memory dictionary construction and contrastive loss calculation are two key components of existing purely unsupervised learning person re-ID methods. Ge et al. [9] constructed an instance-level memory dictionary consisting of all the clustered images and computed a cluster-level contrastive loss. They first generated each cluster’s representation by averaging the features of all the images in this cluster, and then computed the cluster-level contrastive loss by comparing the query samples with all the cluster representations. Based on the obtained contrastive loss, they updated the feature representations of the query images’ corresponding instances in the dictionary. Using the cluster-level contrastive loss can reduce the error caused by noisy pseudo labels and also simplify the calculation. However, the instance-level feature update in the memory dictionary may lead to an unbalanced update and inconsistent representations for the clusters, especially when the number of images contained in each cluster varies significantly [11]. To solve this problem, Dai et al. [11] directly constructed a cluster-level memory dictionary and used the cluster centroid as each cluster’s representation in the dictionary. They computed the cluster-level contrastive loss by comparing the query images with all the cluster representations, and directly updated the cluster representations based on the query images. Computing the contrastive loss and updating the feature representations at the cluster level avoids the unbalanced cluster update and inconsistent cluster representation problems, and thus results in better performance. We follow [11] and use the cluster-level memory dictionary and contrastive loss in our work.

We found that the existing cluster-based purely unsupervised learning person re-ID methods directly discard the unclustered outliers in the clustering process and train the feature extraction network only on the clustered images. We observe that these unclustered outliers often contain pedestrians that are difficult to recognize owing to large variations in clothes and pose, severe occlusion, low resolution, different lighting, and so on. Such complicated images are also frequently encountered, especially in real-world applications. Ignoring these complicated images and training the feature extraction network only on easily clustered common images results in a less robust network that is unable to handle complicated images. Therefore, in addition to the easily clustered common images, our method also considers the complicated images that are hard to cluster during the clustering process.

We propose an unsupervised learning person re-ID method that considers complicated images and is trained on both easily clustered common images and unclustered complicated images. Specifically, we propose a novel memory dictionary that considers complicated images and stores both clustered common images and unclustered complicated images, and a corresponding cluster-level contrastive loss which compares the query instances with not only the positive and negative instances, but also the unclustered complicated instances in the dictionary. We assume that the unclustered complicated images always have different person IDs from the query images and are difficult to recognize. Based on this assumption, we construct our cluster-level dictionary as follows. For the clustered images, we average the feature representations of all the images in each cluster and use the obtained average feature as the cluster’s feature representation. For the unclustered complicated images, we randomly sample one image from the complicated images and use its feature to represent the unclustered complicated instance in the dictionary. When computing the contrastive loss, we consider three types of instances in the dictionary for each query image, i.e., a positive instance which contains the same person as the query image, negative instances which contain different people, and the complicated instance. Since we assume that the complicated instance contains a different person from the query images and is difficult to distinguish, we treat the complicated images differently during the calculation of the contrastive loss.

The main contributions of this paper are three-fold:

We exploit the unclustered complicated images in the clustering stage to increase the trained model’s ability to recognize various images, and thus make our method more robust and suitable for real-world complex applications.
We construct a novel memory dictionary which considers complicated images, consisting of both easily clustered common images and unclustered complicated images, and design a more effective contrastive loss by comparing the query samples with not only the positive and negative instances, but also the complicated instances in the dictionary.
We demonstrate that our proposed method outperforms other state-of-the-art purely unsupervised learning person re-ID methods and some unsupervised domain adaptation methods.

2. Related Works

Existing unsupervised person re-ID methods can be divided into two classes: unsupervised domain adaptation methods and purely unsupervised learning person re-ID methods. Unsupervised domain adaptation methods use information from additional source domains to help train the model and update the feature representations in the target domain, while the purely unsupervised learning methods train the model completely based on unlabeled data.

Unsupervised domain adaptation. Unsupervised domain adaptation methods utilize transfer learning to improve the person re-ID performance on the target domain. Existing unsupervised domain adaptation methods can further be divided into two main categories: pseudo label-based methods [4,12,13,14,15,16,17,18] and domain translation-based methods [19,20,21,22,23]. Pseudo label-based methods first pre-train the model on the source domain and extract the features of the instances in the target domain based on the pre-trained model. Next, they generate pseudo labels for the instances in the target domain by either clustering their features or by measuring their feature similarities with the example features. Clustering-based methods are the most common unsupervised domain adaptation person re-ID methods since they achieve state-of-the-art performance. The two key goals of clustering-based methods are generating pseudo labels with higher accuracy and preventing the final person re-ID results from suffering from noisy pseudo labels. Toward this end, Fu et al. [4] explored both whole-body and local body-part similarity to construct multiple clusters, and thus generated more accurate multi-scale pseudo labels. Ge et al. [13] proposed a mutual mean-teaching method to alleviate the errors induced by noisy pseudo labels which are generated from directly clustering on the target domain. They gradually refined the pseudo labels in the target domain by alternately refining the hard pseudo labels offline and the soft pseudo labels online. Zhai et al. [12] proposed an augmented discriminative clustering method to more extensively use rich unlabeled images in the target domain. Re-ID models based on augmented clusters are more discriminative. Zhang et al. [12] proposed an augmentation framework-based self-training method, which progressively improved the model’s performance by alternately executing conservative and promising stages.
Although unsupervised domain adaptation re-ID methods can achieve a promising performance by exploiting information from the source domain, they still need supervised information on the source domain and their performances are susceptible to the difference between the target domain and the source domain.

Purely unsupervised learning person re-ID. Purely unsupervised learning person re-ID methods train models entirely on unlabeled data [5,6,7,9,10,24,25,26,27,28]. The training process usually consists of four main steps: clustering to generate pseudo labels, constructing a memory dictionary, computing the contrastive loss, and updating the feature representations. Pseudo label generation and memory dictionary construction are two key parts of this kind of method. Lin et al. [5] proposed a bottom-up clustering method by considering both the diversity between images of different people and the similarity between images of the same person. They treated each image as an individual cluster and gradually grouped similar images into one cluster to finally obtain a good balance between the diversity across clusters and the similarity within clusters. Wang et al. [6] exploited camera-aware proxies in each cluster to further distinguish the images from different cameras, and thus generate more reliable pseudo labels. Based on the camera-aware pseudo labels, they designed both intra-camera and inter-camera contrastive losses to enhance the model’s identity discrimination ability. Wang et al. [10] formulated person re-ID as a multi-label classification problem. They first assigned a single-class image label and then proposed a memory-based multi-label classification loss to merge the single-label and multi-label classification into one framework. Ge et al. [9] proposed a self-paced contrastive learning method to gradually generate more accurate cluster results. Based on this, they calculated more accurate feature representations for the memory dictionary. Zhang et al. [27] refined the pseudo labels temporally based on the pseudo label similarities between every two successive training iterations with clustering consensus.
The current purely unsupervised learning person re-ID methods leverage various means to remove the label noise and obtain more accurate pseudo labels, and update the feature representations based on these refined pseudo labels. However, these methods directly discard unclustered outliers when generating pseudo labels, which are complicated images that are hard to cluster owing to large variations in clothes and pose, severe occlusion, low resolution, different light, and so on. Training the model solely based on the easily clustered images will result in the model being unable to handle complicated images and impractical for real-world scenarios. In this work, we construct a memory dictionary and train the feature extraction network based on both easily clustered images and unclustered complicated images to increase the trained model’s generalization ability.

3. Methods

Figure 1 shows the overall framework of the proposed purely unsupervised learning person re-ID method considering complicated images. Given a set of N training images $\{I_{1}, I_{2}, \dots, I_{N}\}$, we first use the feature extraction network $f_{θ}$ to generate the initial image feature representations $F = \{f_{1}, f_{2}, \dots, f_{N}\}$, and cluster the image feature representations into K clusters $\{C_{1}, C_{2}, \dots, C_{K}\}$. Next, we construct a cluster-level memory dictionary D with $K + 1$ items, including K cluster centroids $\{c_{1}, c_{2}, \dots, c_{K}\}$ and the image feature representation $h$ of a random unclustered complicated image. Then, we sample a batch of M query images $\{Q_{1}, Q_{2}, \dots, Q_{M}\}$ from the training dataset, input them into the feature extraction network to obtain the query feature representations $\{q_{1}, q_{2}, \dots, q_{M}\}$, and compute a cluster-level contrastive loss by comparing the query representations with the positive, negative, and complicated feature representations in D. Finally, we train the feature extraction network with this contrastive loss and use the trained network to update the cluster centroids in D. The blue arrows in Figure 1 show the network training process and the pink arrows show the representation updating process of D after network training. Figure 1 illustrates one training iteration, and many iterations are needed to reach the final person re-ID result. Next, we introduce the details of the proposed method.

3.1. Memory Dictionary Construction

Existing purely unsupervised learning person re-ID methods first cluster all the training images into multiple clusters and assign each clustered image a pseudo label based on the cluster results. They directly discard the unclustered outliers and completely ignore them during network training and feature updating processes, which is unreasonable since the unclustered images are also very important. These unclustered complicated images mostly contain pedestrians with severe occlusion, low resolution, or large variations in clothes, pose, or light, which are frequently present in real-world application scenarios. Such kinds of complicated images are the focus of most supervised learning person re-ID methods, but are discarded and completely ignored by most purely unsupervised learning person re-ID methods. Consequently, the trained network will fail to handle complicated images and thus have low robustness and generalization ability.

To take all the images into consideration during training, we construct a novel memory dictionary which considers complicated images, consisting of both easily clustered common images and the unclustered complicated images. Specifically, we first use the widely used clustering method DBSCAN [29] to cluster all the images into K clusters $\{C_{1}, C_{2}, \dots, C_{K}\}$. For each cluster $C_{i}$, we compute its cluster centroid $c_{i}$ as the average feature of all the images belonging to it:

$$c_{i} = \frac{1}{|C_{i}|} \sum_{j = 1}^{|C_{i}|} f_{j} \qquad (1)$$

where $|C_{i}|$ represents the number of images in cluster $C_{i}$ and $f_{j}$ represents the feature of the jth image in cluster $C_{i}$.

For the unclustered images in the clustering process, we assume that they contain different people from the query images and are very difficult to recognize. Considering that the unclustered images are also different from each other and may contain images of the same person as the query image, we do not simply use all the unclustered images or their average features to represent the complicated image set in the dictionary. Instead, we randomly select one image from the unclustered images at each training iteration and use the selected image’s feature to represent the complicated image set of D. Although the unclustered images may contain images of the same person as the query images, randomly selecting one image at each training iteration can dramatically decrease the probability that the complicated image and the query image contain the same person.
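The probability argument above can be made concrete with a short calculation. This is only an illustrative sketch: the symbols $U$, $s$, and $m$, the uniform-sampling model, and the numbers are assumptions introduced here, not quantities reported in the paper.

```latex
% U: set of unclustered images; s: how many of them share the query's identity;
% m: number of images drawn uniformly without replacement (m = 1 in the method).
P(\text{at least one sampled image shares the query's identity})
  = 1 - \frac{\binom{|U| - s}{m}}{\binom{|U|}{m}},
\qquad
P\big|_{m = 1} = \frac{s}{|U|}.
% Hypothetical numbers: |U| = 1000, s = 5 gives 0.005 for m = 1,
% about 0.010 for m = 2, and 1 for m = |U| (using every unclustered image).
```

Under this simple model the collision probability grows with the number of sampled images, which is consistent with the choice of representing the complicated image set by a single sample.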

We use labels 1 to K as the pseudo labels of all the clusters, and label 0 as the pseudo label of the complicated image set. After these operations, we construct a memory dictionary that considers complicated images, D, which covers all kinds of training images.
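The dictionary construction described in this subsection can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names are hypothetical, features are toy lists, and the clustering output is assumed to follow the scikit-learn DBSCAN convention in which clusters are numbered from 0 and outliers are labeled -1.

```python
import random

def mean_vector(vectors):
    """Average a list of equal-length feature vectors (Equation (1))."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def build_memory_dictionary(features, labels, rng):
    """Return {pseudo_label: vector} with K centroids plus one complicated slot.

    Assumptions (not from the paper): labels[i] == -1 marks an unclustered
    image, clusters are numbered 0..K-1, and at least one outlier exists.
    """
    clusters = {}
    outliers = []
    for feat, lab in zip(features, labels):
        if lab == -1:
            outliers.append(feat)
        else:
            # Shift by one so that clusters use pseudo labels 1..K.
            clusters.setdefault(lab + 1, []).append(feat)
    dictionary = {k: mean_vector(v) for k, v in clusters.items()}
    # Pseudo label 0: one randomly selected unclustered image represents
    # the whole complicated image set for this iteration.
    dictionary[0] = list(rng.choice(outliers))
    return dictionary

# Toy data: two clusters of two images each and two unclustered images.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [5.0, 5.0], [-5.0, 5.0]]
labels = [0, 0, 1, 1, -1, -1]
D = build_memory_dictionary(features, labels, random.Random(0))
```

Calling `build_memory_dictionary` again at the next iteration redraws the entry stored under label 0, which mirrors the per-iteration resampling described above.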

3.2. Contrastive Loss Calculation

After constructing the memory dictionary D, we randomly sample a batch of $M = P \times T$ query images from the training images. In detail, we sample P persons and T images for each person, following the setting in [11,30]. We design our contrastive loss, which considers complicated images, based on the ClusterNCE loss of [11]. Specifically, for each query image $Q_{i}$ with feature representation $q_{i}$, we compute its contrastive loss as:

$$L_{h} = -\log \frac{\exp(q_{i} \cdot c_{+} / τ)}{\sum_{k = 1}^{K} \exp(q_{i} \cdot c_{k} / τ) + η \exp(q_{i} \cdot h / τ)} \qquad (2)$$

where $c_{+}$ is the representation of the positive cluster which $Q_{i}$ belongs to, $c_{k}$ is the representation of cluster $C_{k}$, $h$ is the representation of the complicated images in the dictionary, $τ$ is the temperature hyper-parameter, and $η$ is the hyper-parameter used to balance the influences of the common and complicated images. This contrastive loss considering complicated images not only requires the query image representation to have the smallest distance to the positive cluster representation and the largest distances to the negative cluster representations, but also requires the distance between the query image representation and the complicated image representation to be the largest.
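Equation (2) can be evaluated for a single query as in the sketch below. The function name and the toy vectors are hypothetical, and the maximum-logit shift is a standard numerical-stability step added here rather than something described in the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def complicated_contrastive_loss(q, centroids, positive, h, tau, eta):
    """Loss of Equation (2) for one query feature q.

    centroids: list of the K cluster representations c_1..c_K
    positive:  index into `centroids` of the query's own cluster (c_+)
    h:         representation of the sampled complicated image
    """
    logits = [dot(q, c) / tau for c in centroids]
    hard = dot(q, h) / tau
    # Subtract the largest logit before exponentiating to avoid overflow;
    # this does not change the value of the loss.
    m = max(logits + [hard])
    denom = sum(math.exp(l - m) for l in logits) + eta * math.exp(hard - m)
    return -(logits[positive] - m - math.log(denom))

# Toy example; tau = 1.0 is used only to keep the numbers readable.
q = [1.0, 0.0]
centroids = [[1.0, 0.0], [0.0, 1.0]]
h = [0.0, 1.0]
loss = complicated_contrastive_loss(q, centroids, 0, h, tau=1.0, eta=2.0)
loss_without_h = complicated_contrastive_loss(q, centroids, 0, h, tau=1.0, eta=0.0)
```

Setting `eta=0.0` removes the complicated term from the denominator, so comparing `loss` with `loss_without_h` shows how much the complicated instance contributes for a given query.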

3.3. Training and Updating

At each training iteration, the obtained contrastive loss is used to train the feature extraction network. By considering both the easily clustered common images and unclustered complicated images, the trained feature extraction network is able to generate more accurate feature representations for all the training images. Next, we update the cluster representations and the unclustered image’s feature representation using the trained feature extraction network. We first use a momentum updating method [11] to update the cluster representations in the memory dictionary. Specifically, we use the trained feature extraction network to recalculate each query image’s feature representation, and based on this we update the representations of the clusters that the query images belong to. The updating formula is as follows:

$$c_{k} \leftarrow μ c_{k} + (1 - μ) q_{i}^{*} \qquad (3)$$

where $q_{i}^{*}$ is the recalculated feature representation of the query image $Q_{i}$ from the trained feature extraction network and $Q_{i}$ belongs to cluster $C_{k}$. $c_{k}$ is the original vector representation of cluster $C_{k}$ in D and $μ$ is the momentum parameter that controls the updating degree of the cluster representation. A small $μ$ value indicates a large change to the original cluster representation at each update, and a big $μ$ value indicates a small change to the original cluster representation at each update. For the unclustered complicated images, we recalculate the feature representation of the selected complicated image and replace it in the complicated image set. In the next iteration, we again randomly select a complicated image’s feature representation as the representation of the complicated image set in D.
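The effect of the momentum parameter in Equation (3) can be checked with a small sketch. The function name and the toy vectors are hypothetical, and any re-normalization of the updated centroid is left out because the text does not describe one.

```python
def momentum_update(c_k, q_star, mu):
    """Equation (3): blend the stored centroid with the recalculated query feature."""
    return [mu * c + (1.0 - mu) * q for c, q in zip(c_k, q_star)]

c_k = [1.0, 0.0]     # stored cluster representation
q_star = [0.0, 1.0]  # recalculated feature of a query image from that cluster

small_mu = momentum_update(c_k, q_star, mu=0.1)  # mostly replaced by q_star
large_mu = momentum_update(c_k, q_star, mu=0.9)  # stays close to the old c_k
```

With `mu=0.1` the centroid moves most of the way toward the new query feature, while with `mu=0.9` it barely changes, matching the description of small and large momentum values.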

The above process describes one iterative training, and the whole training framework of our method is shown in Algorithm 1.

Algorithm 1: Training framework of the unsupervised learning person re-ID method which considers complicated images.

4. Results

4.1. Experimental Settings

4.1.1. Datasets and Evaluation Metrics

To evaluate our proposed unsupervised learning person re-ID method which considers complicated images, we conducted experiments on two widely used person re-ID datasets, Market-1501 [31] and MSMT17 [19]. The Market-1501 dataset contains 32,668 images of 1501 distinct identities, which were collected from six cameras. The MSMT17 dataset contains 126,441 bounding boxes of 4101 distinct identities, which were collected from 15 cameras. MSMT17 is the larger and more challenging of the two datasets, and its images have significant scene and lighting variations. We use the mean average precision (mAP) and cumulative matching characteristic (CMC) top-1, top-5, and top-10 as the evaluation metrics of person re-ID.
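For readers unfamiliar with these metrics, the sketch below computes mAP and CMC top-k from ranked match lists. It is a simplified illustration with hypothetical function names: each query is represented by a list of booleans (gallery sorted by descending similarity, True where the gallery image shows the query's identity), and the usual protocol detail of excluding same-camera gallery images of the query identity is assumed to have been applied beforehand.

```python
def average_precision(ranked_matches):
    """Mean of the precision values at the ranks where a correct match occurs."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def cmc_top_k(ranked_matches, k):
    """1.0 if a correct match appears within the first k ranks, else 0.0."""
    return 1.0 if any(ranked_matches[:k]) else 0.0

def evaluate(all_ranked_matches, ks=(1, 5, 10)):
    """Average both metrics over all queries."""
    n = len(all_ranked_matches)
    mean_ap = sum(average_precision(r) for r in all_ranked_matches) / n
    cmc = {k: sum(cmc_top_k(r, k) for r in all_ranked_matches) / n for k in ks}
    return mean_ap, cmc

# Two toy queries over a three-image gallery.
rankings = [[True, False, True], [False, True, False]]
mean_ap, cmc = evaluate(rankings)
```

In this toy case the first query is matched at rank 1 and the second only at rank 2, so top-1 accuracy is lower than top-5 accuracy.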

4.1.2. Implementation Details

We use ResNet-50 [32] as the backbone of the feature extraction network and initialize it with parameters pretrained on ImageNet. Following the settings in [11], we remove all the sub-module layers after the fourth convolutional layer, add a global average pooling layer, a batch normalization layer, and an L2-normalization layer in turn, and finally output a 2048-dimensional feature representation for each image. Each image is resized to 256 × 128 pixels before it is input into the feature extraction network. Before training, we conduct data augmentation on each dataset by random horizontal flipping, cropping, and erasing. We set P = 16 and T = 16, i.e., we sample a batch of 256 images from 16 identities in each training iteration. We use the Adam optimizer to train the network parameters. We set $τ$ to 0.01, $μ$ to 0.1, and $η$ to 2. The initial learning rate is 0.0035, and it is reduced to 1/20 of its original value every 20 epochs. We train the model for 80 epochs.
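The P × T batch sampling can be sketched as below. The function name, the `indices_by_label` mapping, and the fallback to sampling with replacement for clusters smaller than T are assumptions made for illustration; the text does not specify how small clusters are handled or whether unclustered images are ever drawn as queries.

```python
import random

def sample_pk_batch(indices_by_label, P, T, rng):
    """Pick P pseudo labels, then T image indices for each of them."""
    labels = rng.sample(sorted(indices_by_label), P)
    batch = []
    for lab in labels:
        pool = indices_by_label[lab]
        if len(pool) >= T:
            batch.extend(rng.sample(pool, T))      # without replacement
        else:
            batch.extend(rng.choices(pool, k=T))   # small cluster: allow repeats
    return batch

# Toy pseudo-labeled dataset: image indices grouped by cluster label.
indices_by_label = {1: [0, 1, 2, 3], 2: [4, 5], 3: [6, 7, 8]}
batch = sample_pk_batch(indices_by_label, P=2, T=3, rng=random.Random(0))
```

With the settings reported above, the same call would use `P=16` and `T=16` to produce a batch of 256 image indices.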

4.2. Comparison with Other Unsupervised Person Re-ID Methods

We compared our method with other state-of-the-art unsupervised person re-ID methods on the Market-1501 and MSMT17 datasets, including both purely unsupervised learning methods (BUC [5], SSL [33], MMCL [10], HCT [34], CycAs [35], UGA [36], SPCL [9], IICS [7], OPLG [28], RLCC [27], ICE [25], PPLR [37], Cluster-ReID [11], TAUDL [38], and UTAL [39]) and some unsupervised domain adaptation methods (MMCL [10], AD-Cluster [12], MMT [13], SPCL [9], TDR [40], and ECN [41]). The comparison results are shown in Table 1 and Table 2, where the baselines with “*” are the unsupervised domain adaptation person re-ID methods, and those without “*” are purely unsupervised learning person re-ID methods. We can see from Table 1 and Table 2 that our method outperforms all the compared purely unsupervised learning person re-ID methods on the Market-1501 dataset, and is comparable with the state-of-the-art purely unsupervised learning person re-ID methods on the MSMT17 dataset. In addition, although the unsupervised domain adaptation methods exploit additional source domain information such as other labeled data or trained person re-ID models during their training process, our method still outperforms some state-of-the-art unsupervised domain adaptation methods, which further demonstrates the superiority of our proposed method. Our model is based on Cluster-ReID [11], where we further consider the unclustered outliers in the clustering process by designing a new memory dictionary and contrastive loss that consider complicated images. We can see from the last two rows of Table 1 and Table 2 that the performance of our method is better than that of Cluster-ReID. By comparing the performances of our method and Cluster-ReID, we can conclude that considering both the easily clustered common images and the unclustered complicated images in cluster-based purely unsupervised learning person re-ID can obtain better person re-ID performance than considering only easily clustered common images.

4.3. Ablation Study

We use the unclustered outliers in the clustering process to help construct a memory dictionary considering complicated images in our method. We assume that the complicated images contain different people than the query samples and are difficult to recognize. Furthermore, considering that the unclustered complicated images are very likely to be different from each other, we randomly sample one image from the complicated images and use its features to represent the complicated instance in the memory dictionary. Using only one image of the complicated image set to represent the complicated instance at each training iteration reduces the risk that the sampled complicated image contains the same person as the query image, a situation that is inconsistent with our assumption and causes calculation errors. To demonstrate the effectiveness of the sampling strategy, we conducted an ablation experiment on the Market-1501 dataset in which we randomly sampled more than one image from the complicated image set. The experimental results are shown in Table 3, where the numbers in the first column represent the number of images sampled from the complicated image set in each training iteration. As we can see from Table 3, sampling more images from the complicated image set reduces the person re-ID performance of our method, which demonstrates the effectiveness of sampling only one image to represent the complicated image set in the dictionary.

As mentioned before, we use the momentum updating method to update the vector representation of each cluster in the memory dictionary. The updating formula is shown in Equation (3). At each iteration and update, the updated vector representation of a cluster is composed of its original vector representation and the recalculated feature representations of the query images belonging to this cluster, where the ratios of the original vector representation and the recalculated image feature representations are $μ$ and $1 - μ$, respectively. $μ$ is known as the momentum in the momentum updating method. Usually, a small $μ$ value indicates a significant change to the original cluster representation at each update, and a big $μ$ value indicates a small change to the original cluster representation at each update. Here, we attempt to use different momentum values and observe the person re-ID performances when using these different momentum values. Specifically, we experiment with 33 different momentum values and their person re-ID performances are reported in Figure 2. As we can see from Figure 2, momentum values smaller than 0.9 can always generate good person re-ID performances, while momentum values larger than 0.9 always result in a poor performance. Therefore, we set $μ$ as 0.1 in our method.

5. Conclusions

We proposed a novel cluster-based, purely unsupervised learning person re-ID method that considers complicated images, where we constructed a memory dictionary and a contrastive loss that consider not only the easily clustered common images, but also the complicated images that are hard to cluster. At each iteration, we use the average features to represent each cluster instance, randomly sample a complicated image to represent the complicated instance in the memory dictionary, and compute the contrastive loss based on a comparison between the query images and both kinds of instances. The experimental results show that using our proposed memory dictionary and contrastive loss can clearly improve the person re-ID performance, which demonstrates the effectiveness of considering complicated images during the iterative training process, as well as the effectiveness of our method. We found that the sampled complicated image sometimes contains the same person as the query images, which is contrary to our assumption and will therefore introduce error into the final person re-ID result. To tackle this issue, in future work we plan to design a more effective sampling strategy that selects, at each iteration, a more useful complicated image that is certain to contain a different person from the query images.

Author Contributions

Methodology, B.X.; Writing—original draft, R.Q.; Supervision, D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under grant 62272229 and the Natural Science Foundation of Jiangsu Province under grant BK20222012.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Ye, M.; Zhang, X.; Yuen, P.C.; Chang, S.F. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 142–149.
2. Wu, Z.; Xiong, Y.; Yu, S.X.; Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3733–3742.
3. Tian, Y.; Krishnan, D.; Isola, P. Contrastive multiview coding. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 776–794.
4. Fu, Y.; Wei, Y.; Wang, G.; Zhou, Y.; Shi, H.; Huang, T.S. Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 6112–6121.
5. Lin, Y.; Dong, X.; Zheng, L.; Yan, Y.; Yang, Y. A bottom-up clustering approach to unsupervised person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 8738–8745.
6. Wang, M.; Lai, B.; Huang, J.; Gong, X.; Hua, X.S. Camera-aware proxies for unsupervised person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; pp. 2764–2772.
7. Xuan, S.; Zhang, S. Intra-inter camera similarity for unsupervised person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 11926–11935.
8. Liang, D.; Kang, B.; Liu, X.; Gao, P.; Tan, X.; Kaneko, S.I. Cross-scene foreground segmentation with supervised and unsupervised model communication. Pattern Recognit. 2021, 117, 1079–1095.
9. Ge, Y.; Zhu, F.; Chen, D.; Zhao, R. Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. Adv. Neural Inf. Process. Syst. 2020, 33, 11309–11321.
10. Wang, D.; Zhang, S. Unsupervised person re-identification via multi-label classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10981–10990.
11. Dai, Z.; Wang, G.; Yuan, W.; Zhu, S.; Tan, P. Cluster contrast for unsupervised person re-identification. In Proceedings of the Asian Conference on Computer Vision, Macau, China, 4–8 December 2022; pp. 1142–1160.
12. Zhai, Y.; Lu, S.; Ye, Q.; Shan, X.; Chen, J.; Ji, R.; Tian, Y. Ad-cluster: Augmented discriminative clustering for domain adaptive person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9021–9030.
13. Ge, Y.; Chen, D.; Li, H. Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification. arXiv 2020, arXiv:2001.01526.
14. Song, L.; Wang, C.; Zhang, L.; Du, B.; Zhang, Q.; Huang, C.; Wang, X. Unsupervised domain adaptive re-identification: Theory and practice. Pattern Recognit. 2020, 102, 107173.
15. Kang, B.; Liang, D.; Mei, J.; Tan, X.; Zhou, Q.; Zhang, D. Robust RGB-T Tracking via Graph Attention-Based Bilinear Pooling. IEEE Trans. Neural Netw. Learn. Syst. 2022, 4, 1–12.
16. Chen, Y.; Xia, R.; Zou, K.; Yang, K. FFTI: Image inpainting algorithm via features fusion and two-steps inpainting. J. Vis. Commun. Image Represent. 2023, 91, 1037–1076.
17. Yu, H.X.; Zheng, W.S.; Wu, A.; Guo, X.; Gong, S.; Lai, J.H. Unsupervised person re-identification by soft multilabel learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2148–2157.
18. Zhang, X.; Cao, J.; Shen, C.; You, M. Self-training with progressive augmentation for unsupervised cross-domain person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 8222–8231.
19. Wei, L.; Zhang, S.; Gao, W.; Tian, Q. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 79–88.
20. Kang, B.; Liang, D.; Ding, W.; Zhou, H.; Zhu, W.P. Grayscale-thermal tracking via inverse sparse representation-based collaborative encoding. IEEE Trans. Image Process. 2019, 29, 3401–3415.
21. Chen, Y.; Zhu, X.; Gong, S. Instance-guided context rendering for cross-domain person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 232–242.
22. Deng, W.; Zheng, L.; Ye, Q.; Kang, G.; Yang, Y.; Jiao, J. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 994–1003.
23. Ge, Y.; Zhu, F.; Chen, D.; Zhao, R.; Wang, X.; Li, H. Structured Domain Adaptation with Online Relation Regularization for Unsupervised Person Re-ID. IEEE Trans. Neural Netw. Learn. Syst. 2022, 5, 1–14.
24. Fan, H.; Zheng, L.; Yan, C.; Yang, Y. Unsupervised person re-identification: Clustering and fine-tuning. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2018, 14, 1–18.
25. Chen, H.; Lagadec, B.; Bremond, F. Ice: Inter-instance contrastive encoding for unsupervised person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 14960–14969.
26. Liang, D.; Geng, Q.; Wei, Z.; Vorontsov, A.; Kim, L.; Wei, M.; Zhou, H. Anchor retouching via model interaction for robust object detection in aerial images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5619213.
27. Zhang, X.; Ge, Y.; Qiao, Y.; Li, H. Refining pseudo labels with clustering consensus over generations for unsupervised object re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 3436–3445.
28. Zheng, Y.; Tang, S.; Teng, G.; Ge, Y.; Liu, K.; Qin, J.; Qi, D.; Chen, D. Online pseudo label generation by hierarchical cluster dynamics for adaptive person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 8371–8381.
29. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the KDD-96 Proceedings, Portland, OR, USA, 2–4 August 1996; pp. 226–231.
30. Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737.
31. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; pp. 1116–1124.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
33. Lin, Y.; Xie, L.; Wu, Y.; Yan, C.; Tian, Q. Unsupervised person re-identification via softened similarity learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3390–3399.
34. Zeng, K.; Ning, M.; Wang, Y.; Guo, Y. Hierarchical clustering with hard-batch triplet loss for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 13657–13665.
35. Wang, Z.; Zhang, J.; Zheng, L.; Liu, Y.; Sun, Y.; Li, Y.; Wang, S. Cycas: Self-supervised cycle association for learning re-identifiable descriptions. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 72–88.
36. Wu, J.; Yang, Y.; Liu, H.; Liao, S.; Lei, Z.; Li, S.Z. Unsupervised graph association for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8321–8330.
37. Cho, Y.; Kim, W.J.; Hong, S.; Yoon, S.E. Part-based Pseudo Label Refinement for Unsupervised Person Re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7308–7318.
38. Li, M.; Zhu, X.; Gong, S. Unsupervised person re-identification by deep learning tracklet association. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 737–753.
39. Li, M.; Zhu, X.; Gong, S. Unsupervised tracklet person re-identification. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 1770–1782.
40. Isobe, T.; Li, D.; Tian, L.; Chen, W.; Shan, Y.; Wang, S. Towards discriminative representation learning for unsupervised person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 19–25 June 2021; pp. 8526–8536.
41. Zhong, Z.; Zheng, L.; Luo, Z.; Li, S.; Yang, Y. Invariance matters: Exemplar memory for domain adaptive person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 598–607.

Figure 1. The flowchart of our proposed person re-ID method. The memory dictionary considering complicated images is made up of the clustered common images and unclustered complicated images, where the cluster instance is represented by the cluster centroid and the unclustered instance is represented by a randomly sampled complicated image. Furthermore, the contrastive loss is calculated by comparing the query images with two kinds of instances in the dictionary. The blue arrows show the training process of the feature extraction network and the pink arrows show the feature representation updating process.

Figure 2. Performance comparison of using different momentum values during the momentum updating process. The horizontal axis represents the momentum values and the vertical axis represents the person re-ID performance.
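The momentum updating process referred to in this caption can be sketched as follows. The exact update rule is not restated in this part of the article, so the form below, a convex combination of the stored cluster representation and the current query feature followed by re-normalization, is an assumption based on how cluster-level memories are commonly updated; the function name and arguments are hypothetical.

```python
import numpy as np

def momentum_update(centroid, query_feature, momentum):
    """Assumed update rule: blend the stored centroid with a query feature.

    A momentum close to 1 keeps the stored representation almost unchanged;
    a momentum close to 0 replaces it with the latest query feature.
    """
    updated = momentum * centroid + (1.0 - momentum) * query_feature
    return updated / np.linalg.norm(updated)  # keep the memory entry unit-length

# Toy usage: the same centroid updated with several momentum values.
centroid = np.array([1.0, 0.0])
query = np.array([0.0, 1.0])
for m in (0.1, 0.5, 0.9):
    print(m, momentum_update(centroid, query, m))
```

Sweeping the `momentum` argument in this way corresponds to moving along the horizontal axis of the figure.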

Table 1. Comparison between our method and other state-of-the-art unsupervised person re-ID methods on the Market-1501 dataset. The methods marked with “*” are unsupervised domain adaptation person re-ID methods, while those without “*” are purely unsupervised learning person re-ID methods.

Method	Market-1501
Method	Source	mAP	Top 1	Top 5	Top 10
MMCL [10] *	Duke	60.4	84.4	92.8	95.0
AD-Cluster [12] *	Duke	68.3	86.7	94.4	96.5
MMT [13] *	MSMT17	75.6	89.3	95.8	97.5
SPCL [9] *	MSMT17	77.5	89.7	96.1	97.6
TDR [40] *	Duke	83.4	94.2	-	-
BUC [5]	None	38.3	66.2	79.6	84.5
SSL [33]	None	37.8	71.7	83.8	87.4
MMCL [10]	None	45.5	80.3	89.4	92.3
HCT [34]	None	56.4	80.0	91.6	95.2
CycAs [35]	None	64.8	84.8	-	-
UGA [36]	None	70.3	87.2	-	-
SPCL [9]	None	73.1	88.1	95.1	97.0
IICS [7]	None	72.1	88.8	95.3	96.9
OPLG [28]	None	78.1	91.1	96.4	97.7
RLCC [27]	None	77.7	90.8	96.3	97.5
ICE [25]	None	79.5	92.0	97.0	98.1
PPLR [37]	None	81.5	92.8	97.1	98.1
Cluster-ReID [11]	None	83.0	92.9	97.2	98.0
Ours	None	83.8	93.3	96.9	97.8

Table 2. Comparison between our method and other state-of-the-art unsupervised person re-ID methods on the MSMT17 dataset. The methods marked with “*” are unsupervised domain adaptation person re-ID methods, while those without “*” are purely unsupervised learning person re-ID methods.

Method	MSMT17
Method	Source	mAP	Top 1	Top 5	Top 10
MMT [13] *	Market	24.0	50.1	63.5	69.3
SPCL [9] *	Market	26.8	53.7	65.0	69.8
ECN [41] *	Duke	10.2	30.2	41.5	46.8
MMCL [10] *	Duke	16.2	43.6	54.3	58.9
TDR [40] *	Duke	36.3	66.6	-	-
MMCL [10]	None	11.2	35.4	44.8	49.8
CycAs [35]	None	26.7	50.1	-	-
UGA [36]	None	21.7	49.5	-	-
SPCL [9]	None	19.1	42.3	55.6	61.2
TAUDL [38]	None	12.5	28.4	-	-
UTAL [39]	None	13.1	31.4	-	-
IICS [7]	None	18.6	45.7	57.7	62.8
OPLG [28]	None	26.9	53.7	65.3	70.2
RLCC [27]	None	27.9	56.5	68.4	73.1
ICE [25]	None	29.8	59.0	71.7	77.0
PPLR [37]	None	31.4	61.1	73.4	77.8
Cluster-ReID [11]	None	33.0	62.0	71.8	76.7
Ours	None	34.9	61.9	72.7	77.0

Table 3. Comparison of randomly sampling different numbers of images from the complicated image set on the Market-1501 dataset.

Number	Market-1501
Number	mAP	Top 1	Top 5	Top 10
1	83.8	93.3	96.9	97.8
2	82.9	92.7	96.7	98.0
3	82.3	92.5	96.6	97.8
4	80.9	91.7	96.6	97.7

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Quan, R.; Xu, B.; Liang, D. Discriminatively Unsupervised Learning Person Re-Identification via Considering Complicated Images. Sensors 2023, 23, 3259. https://doi.org/10.3390/s23063259

