Review

Person Re-Identification in Special Scenes Based on Deep Learning: A Comprehensive Survey

1 School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
2 College of Information Engineering, Jiangmen Polytechnic, Jiangmen 529000, China
3 KeYi College, Zhejiang Sci-Tech University, Shaoxing 312369, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(16), 2495; https://doi.org/10.3390/math12162495
Submission received: 6 July 2024 / Revised: 3 August 2024 / Accepted: 12 August 2024 / Published: 13 August 2024

Abstract: Person re-identification (ReID) refers to the task of retrieving target persons from image libraries captured by multiple distinct cameras. Over the years, person ReID has achieved favorable recognition results under typical visible light conditions, yet considerable room for improvement remains in challenging conditions. Open challenges and research gaps include multi-modal data fusion, semi-supervised and unsupervised learning, domain adaptation, ReID in 3D space, fast ReID, decentralized learning, and end-to-end systems. Core problems such as occlusion, viewpoint variation, illumination change, background clutter, low resolution, and open-set settings also remain unsolved. For the first time, this paper categorizes and analyzes recent research using person ReID in special scenarios as the basis for classification. Starting from the perspectives of person ReID methods and research directions, we survey the current state of research in special scenarios. In addition, this work conducts a detailed experimental comparison of deep learning-based person ReID methods, covering both system development and comparative methodologies. Finally, we offer a prospective analysis of future research directions in person ReID and discuss unresolved concerns within the field.

1. Introduction

The installation of surveillance equipment in public spaces is aimed at identifying specific individuals through the video footage it provides when necessary. However, manually searching through over a hundred hours of surveillance video footage for a target pedestrian is extraordinarily time-consuming and labor-intensive. To address this issue, the research community has formally introduced the concept of person re-identification (ReID). Through ReID, one can utilize trained models to automatically locate specific individuals across multiple video clips.
Facial recognition, a common method for identifying pedestrians, is mature and extensively applied, but it depends heavily on pre-learned facial features; in everyday pedestrian image data, there is often insufficient facial-feature information available for learning [1]. Moreover, the facial images used in facial recognition are usually high-definition and clearly visible, whereas the video clips provided by surveillance equipment are often blurry [2], and it is frequently difficult to capture a full face. Consequently, facial recognition exhibits inherent limitations in pedestrian retrieval scenarios. Person ReID, on the other hand, does not require high-resolution images of pedestrians to achieve effective retrieval, giving it advantages that facial recognition lacks.
In recent years, person ReID has garnered significant attention from both industry and academia, making substantial progress in top-tier computer vision conferences. Under certain conditions, some advanced works [3,4,5] have even achieved recognition capabilities surpassing human eyesight, reaching an accuracy of 95.6% on the general-scenario Market1501 [6] dataset. However, in specific scenarios such as occlusion, infrared imaging, and video data, person ReID methods still face considerable challenges. These challenges include potential color discrepancies between different surveillance devices, where clothing of the same color may present different visual appearances under different devices [7], the unpredictability of pedestrian postures [8], and the variability of camera viewpoints [9]. Additionally, factors such as weather or illumination changes [10], variations in pedestrian attire [11], and occlusions by irrelevant objects [12] can all affect the performance of person ReID systems. In conventional scenes, images are RGB images in which the human body is fully visible, whereas occluded images often obscure key body features, making recognition much harder. Compared with conventional images, infrared images lack color, an important discriminative cue, so recognition difficulty likewise increases. Video data, relative to still images, contain large amounts of repeated, redundant information, which raises resource consumption. Each of these special scenarios therefore calls for a dedicated, targeted solution. Currently, there is no universally powerful person ReID model applicable to all scenarios; researchers typically design specialized models for individual scenes to address their specific conditions.
Despite the remarkable progress of ReID techniques, challenges such as occlusion, pose change, background clutter, lighting change, camera viewpoint change, low-resolution images, and cross-domain generalization persist. These issues hinder the accuracy and robustness of ReID algorithms in real-world scenes. In addition, multi-source data ReID matches persons using multiple types of information; beyond the problems faced by general ReID, it must also handle the discrepancies that arise when heterogeneous data types, such as low-resolution images, infrared images, depth images, textual descriptions, and sketch images, are matched against general images. The use of these cross-modal and cross-type data makes multi-source data ReID research both more practical and more challenging. According to the analysis in a review of person ReID research on multi-source data [13], multi-source data ReID techniques have the advantage over general ReID techniques of fully exploiting all types of data to learn cross-modal and cross-type feature transformations.
In addition, model training is an essential step, and many techniques are available for it, such as combining different loss functions, adding generic training tricks, and tuning hyperparameters. In person ReID, the training model and the loss function are key factors for improving recognition accuracy. Deep learning-based models, such as CNNs, are commonly used to extract global features from pedestrian images; specific architectures, such as ResNet or Inception, are pre-trained to learn generalized features on large-scale datasets (e.g., ImageNet) and then fine-tuned on the ReID task. The design of the loss function is crucial for teaching the model to distinguish between different pedestrians, with Softmax Loss used for the classification task and Triplet Loss focusing on learning relative distances between samples. Methods such as PCB (Part-based Convolutional Baseline) and DG-Net further enhance feature discriminability through part-level representations and joint discriminative-generative learning, respectively. In addition, Generative Adversarial Networks (GANs) such as SPGAN are used to generate synthetic data to improve the model's generalization ability. Multi-task learning frameworks have also been used to jointly optimize pedestrian recognition and attribute prediction. Unsupervised and semi-supervised learning methods show strong potential for learning from unlabeled data when labeled data are limited. The choice among these methods and models [14] depends on the application scenario, the data characteristics, and the specific problem to be solved, and continued advances in the ReID field are driving the development of new models and methods. As illustrated in Figure 1, a complete person ReID model training process proceeds as follows. The input images first undergo preprocessing before being fed into the designed person ReID model. During the training phase, a loss function supervises the learning of the model; during the testing phase, the model's effectiveness is evaluated. The model computes the features of the query image, followed by the features of all pedestrian images in the gallery set. A specific distance metric is then used to compute the feature distance between the query image and each gallery image, and the gallery images are sorted by proximity of their feature distances; the closer the distance, the more likely it is that the two images depict the same individual. Finally, the search results are assessed and evaluated.
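To make the interplay of these components concrete, the following minimal PyTorch sketch pairs a cross-entropy identity loss (the "Softmax Loss" above) with a triplet loss on top of an ImageNet-pretrained ResNet-50. It is an illustrative sketch only: the margin, batch layout, and backbone choice are assumptions, and practical pipelines mine hard triplets inside identity-balanced batches rather than pre-arranging them.

```python
import torch
import torch.nn as nn
import torchvision

# Minimal ReID training sketch: ResNet-50 backbone, identity-classification
# head, and a combined Softmax (cross-entropy) + Triplet loss. The margin,
# feature dimension, and batch layout are illustrative assumptions.
class ReIDNet(nn.Module):
    def __init__(self, num_identities: int):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.classifier = nn.Linear(2048, num_identities)

    def forward(self, x):
        f = self.features(x).flatten(1)   # global feature, shape (B, 2048)
        return f, self.classifier(f)      # feature + identity logits

model = ReIDNet(num_identities=751)       # e.g., Market-1501 training IDs
id_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=0.3)

images = torch.randn(30, 3, 256, 128)     # stand-in for a training batch
labels = torch.randint(0, 751, (30,))
feats, logits = model(images)

# Real pipelines mine anchor/positive/negative triplets within the batch;
# here the batch is assumed to be pre-arranged as consecutive triplets.
a, p, n = feats[0::3], feats[1::3], feats[2::3]
loss = id_loss(logits, labels) + triplet_loss(a, p, n)
loss.backward()
```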
ReID technology plays a key role in security monitoring and intelligent analysis, but it faces the following main challenges:
  • Occlusion issue: pedestrians may be occluded during movement, leading to incomplete recognition features.
  • Illumination variation: image quality changes under different lighting conditions, affecting recognition accuracy.
  • Pose variation: different poses of pedestrians affect the extraction of their silhouette features.
  • Cluttered background: complex backgrounds may interfere with the recognition of pedestrian features.
  • Low-resolution images: surveillance cameras may produce low-resolution images due to distance or hardware limitations, reducing recognition rates.
  • Cross-domain generalization: images captured by different cameras may have domain differences, requiring algorithms to have generalization capabilities.
  • Multi-modal data fusion: how to effectively integrate data from various sensors such as visible light and infrared.
  • Privacy protection: personal privacy issues need to be considered when using ReID technology.
  • Real-time performance: surveillance systems require rapid response for real-time monitoring.
  • Large-scale data processing: with the increase in surveillance equipment, processing large amounts of data becomes a challenge.
In conclusion, person ReID technology is in great demand despite the numerous challenges it encounters. This paper concentrates on the application of person ReID in specialized contexts, comparing and reviewing notable contributions and exploring prospective research avenues. The main structure of this paper is as follows: (1) person ReID methods are categorized into six types based on the methodological approach; (2) person ReID methods are categorized into eight types based on the research direction; (3) datasets and experimental comparisons are presented.

2. Overview of Person Re-Identification Methods

2.1. Global and Local Feature-Based Person Re-Identification

Global person ReID, as outlined in [15], constitutes a foundational deep learning-based strategy for re-identifying individuals. This method treats the entire pedestrian image as a whole for feature extraction and matching, offering a straightforward implementation. It can also be easily integrated with other methods to form a multi-branch network for person ReID, enhancing recognition performance. Despite its advantages, global person ReID has limitations. First, the approach relies solely on global image features, frequently neglecting essential fine-grained details such as clothing patterns and shoe types. Second, the method lacks robustness, with low tolerance for variations in illumination, occlusion, and other interference factors. The presence of such disturbances within a dataset can severely compromise the system's performance, leading to a substantial decline in recognition accuracy.
Local feature-based methods, discussed next, also have some drawbacks and research gaps:
  • Feature alignment [16]: local feature methods must solve the feature alignment problem; when pedestrian posture and viewing angle change, feature matching is prone to errors, which degrades recognition performance.
  • Viewpoint sensitivity: local features are sensitive to perspective changes, so pedestrian images captured by different cameras may be difficult to match due to viewpoint differences.
To tackle the aforementioned challenges, several researchers have suggested concentrating on local features within the domain of person ReID. The diverse local feature extraction techniques can generally be divided into two principal categories:
  • Generating local features through simple geometric partitioning: This methodology entails segmenting the pedestrian image into discrete segments, such as upper body, lower body, head, etc., by leveraging predefined rules or templates. Each part is then processed independently to extract local features that capture the distinct characteristics of that particular region.
  • Generating meaningful body part features through human parsing or pose estimation: This approach utilizes advanced techniques such as human parsing or pose estimation algorithms to identify and segment the pedestrian image into semantically meaningful body parts, such as arms, legs, torso, and head. Subsequently, each anatomical region undergoes feature extraction, yielding distinctive attributes tailored to its unique characteristics, thereby furnishing refined and discernible cues for the ReID process.
By focusing on local features, these methods aim to enhance the robustness of person ReID to variations in lighting, occlusion, and other challenging conditions, as well as to capture finer details that may be missed by global approaches.
The first technique entails partitioning the complete pedestrian image into vertical, horizontal, or grid-like segments. It offers simplicity and straightforward implementation, coupled with an aptitude for capturing localized features. Nevertheless, it exhibits limitations, particularly in addressing shifts in posture, occlusions, and disparities in camera perspectives. Such challenges can precipitate misalignment and mismatches in the feature correspondence of segmented regions, culminating in diminished overall system efficacy.
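A minimal sketch of this geometric partitioning follows: the backbone feature map is split into uniform horizontal stripes, each pooled into its own local descriptor, in the spirit of stripe-based baselines such as PCB. The stripe count and tensor sizes are illustrative assumptions.

```python
import torch

# Sketch of simple geometric partitioning: split a backbone feature map into
# horizontal stripes and pool each stripe into a separate local descriptor.
def stripe_features(feature_map: torch.Tensor, num_stripes: int = 6):
    b, c, h, w = feature_map.shape
    stripe_h = h // num_stripes                  # assumes h divisible by stripes
    parts = []
    for i in range(num_stripes):
        stripe = feature_map[:, :, i * stripe_h:(i + 1) * stripe_h, :]
        parts.append(stripe.mean(dim=(2, 3)))    # average-pool each stripe
    return parts                                 # list of (B, C) descriptors

fmap = torch.randn(8, 2048, 24, 8)   # e.g., ResNet output for a 384x128 input
parts = stripe_features(fmap)
print(len(parts), parts[0].shape)    # 6 local descriptors of shape (8, 2048)
```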
The second method involves generating meaningful body parts through human parsing or pose estimation [17]. These body parts are related to the human body, such as the head, upper body, lower body, arms, and legs. They can also be external human appearance features, such as hats, hair, clothing, trousers, backpacks, and shoes. This methodology is adept at producing more insightful local feature representations and adeptly addressing perturbations stemming from alterations in pedestrian posture, occlusions, and camera perspectives. However, it incurs the drawback of necessitating the training of supplementary human parsing or pose estimation models, potentially leading to considerable computational demands and an increase in model dimensions.
Although the global feature-based method is effective in extracting overall pedestrian features, it still has some specific limitations:
  • Sensitivity to occlusion and view angle changes: global feature models are usually more sensitive to changes in the pedestrian’s pose and view angle, and these factors can easily lead to feature distortion and affect the recognition accuracy.
  • Lack of local information representation: global features may not be able to adequately capture the local details of pedestrians, such as the texture and color of clothing, which are very important for distinguishing different pedestrians in practical applications.
  • Sensitivity to changes in lighting and background: changes in lighting conditions and background may have a significant impact on the global features, leading to a decrease in the accuracy.

2.2. Semantic Attribute-Based Person Re-Identification

Semantic attributes are crucial for describing the visual characteristics of pedestrians [18] and can provide valuable information for person ReID. In this context, semantic attributes serve as auxiliary information to enhance the generalization and robustness of feature representations. These attributes can be regarded as high-level features describing pedestrian appearance; common examples include gender, age, clothing color, and shoe type. By incorporating semantic attribute information as additional input to the network, the model can learn how to combine semantic attributes with pedestrian appearance features, thereby better expressing and distinguishing between different pedestrians [19]. This approach can improve the generalization and robustness of the algorithm because semantic attributes provide prior knowledge that helps the network learn the differences between pedestrians, preventing overfitting to the training data. Using semantic attributes can also help address variations in pedestrian appearance due to factors such as lighting and occlusion. For example, if the algorithm has not seen a pedestrian wearing a long-sleeved shirt in the training set, it might misidentify them as another pedestrian in the test set; if, however, the algorithm can recognize the long-sleeved shirt as a semantic attribute, that knowledge can assist in distinguishing the current pedestrian. Nevertheless, although semantic attribute-based person ReID provides a way to describe persons from different perspectives, some drawbacks and research gaps remain. Inadequate generalization: attribute-based approaches may perform well on specific datasets but can face challenges when generalized to other datasets or realistic scenarios, especially in complex settings involving occlusion and pose changes. Labeling dependency: many semantic attribute-based methods rely on high-quality labeled data, which may limit their feasibility in real-world applications where data labeling is costly.
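The following sketch shows how attribute prediction is typically attached as an auxiliary task: a shared feature feeds both an identity classifier and binary attribute classifiers, and the two losses are combined. The three-attribute set, dimensions, and 0.5 weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of semantic-attribute-assisted ReID: a shared feature supervised by
# both an identity loss and an auxiliary attribute loss (e.g., gender,
# long sleeves, backpack). All sizes and weights are illustrative.
class AttributeReIDHead(nn.Module):
    def __init__(self, feat_dim=2048, num_ids=751, num_attrs=3):
        super().__init__()
        self.id_head = nn.Linear(feat_dim, num_ids)
        self.attr_head = nn.Linear(feat_dim, num_attrs)

    def forward(self, feat):
        return self.id_head(feat), self.attr_head(feat)

head = AttributeReIDHead()
feat = torch.randn(16, 2048)                     # backbone features
id_logits, attr_logits = head(feat)

id_labels = torch.randint(0, 751, (16,))
attr_labels = torch.randint(0, 2, (16, 3)).float()

loss = nn.CrossEntropyLoss()(id_logits, id_labels) \
     + 0.5 * nn.BCEWithLogitsLoss()(attr_logits, attr_labels)
```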
Although person ReID methods based on semantic analysis have made some progress in improving the accuracy and robustness, there are still some limitations and research gaps:
  • Limitations of datasets: existing datasets have limitations in terms of size, quality, and diversity, especially when dealing with video data, and the ambiguity and complexity of the datasets create difficulties in feature extraction and model training.
  • Challenges in feature extraction: semantic analysis-based approaches need to extract useful semantic information from complex pedestrian images, but how to effectively extract semantic features from pedestrian images that contribute to person ReID is a challenge.
  • Real-time problem: in application scenarios that require a fast response, semantic analysis-based algorithms may not be able to meet the real-time requirements, especially when dealing with high-resolution videos.
Future research can be conducted in the following areas:
  • Developing more effective semantic feature extraction and fusion methods to improve the accuracy and robustness.
  • Constructing larger, high-quality cross-modal datasets to support a wider range of experiments and applications.
  • Exploring unsupervised or semi-supervised learning methods to reduce the reliance on large amounts of labeled data.

2.3. Viewpoint-Invariant Person Re-Identification

In person ReID, pedestrians captured from different camera viewpoints can exhibit distinct feature representations, which presents an opportunity to enhance the accuracy of person ReID by leveraging viewpoint information [20]. The utilization of viewpoint information in person ReID can be categorized into three approaches:
  • Viewpoint Segmentation Methods: This approach [21] segments the pedestrian image into different viewpoint regions and utilizes the features from each region for person ReID. By focusing on specific viewpoint regions, the method aims to capture more consistent features across different viewpoints.
  • Viewpoint Transformation Methods: This approach [22] leverages the transformation relationship between pedestrian images captured from different viewpoints. It aligns the transformed images to a reference viewpoint and then uses the aligned images for person ReID. This method can compensate for viewpoint variations and improve the consistency of feature representation.
  • Multi-view Feature Fusion Methods: This approach [23] fuses the features extracted from pedestrian images captured from different viewpoints to enhance the accuracy of person ReID. By combining features from multiple viewpoints, the method can create a more robust and comprehensive representation of the individual.
It should be emphasized that methods utilizing viewpoint information are not inherently incompatible and can be amalgamated to forge enhanced person ReID techniques. Such amalgamation fosters a synergistic outcome, harnessing the individual merits of each methodology to counteract the challenges arising from viewpoint discrepancies in the ReID process.
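As a minimal illustration of the multi-view fusion idea, the sketch below L2-normalizes and averages descriptors of the same identity captured from different viewpoints; published methods use more elaborate fusion schemes, so this is only a conceptual baseline under an assumed feature dimension.

```python
import torch
import torch.nn.functional as F

# Sketch of multi-view feature fusion: descriptors of one identity from
# different camera viewpoints are normalized and averaged into a single,
# more viewpoint-robust representation.
def fuse_views(view_feats):
    stacked = torch.stack([F.normalize(f, dim=-1) for f in view_feats])
    return F.normalize(stacked.mean(dim=0), dim=-1)

front, side, back = torch.randn(2048), torch.randn(2048), torch.randn(2048)
fused = fuse_views([front, side, back])   # one descriptor per identity
```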
It also has some drawbacks and research gaps:
  • Accuracy of viewpoint information: The accuracy of viewpoint estimation in existing research directly affects the performance of person ReID. If the viewpoint information is inaccurate, it may lead to mismatches in feature space, thereby affecting recognition effectiveness.
  • Scarcity of multi-view samples: current datasets are limited in the collection of multi-view samples, which poses a challenge to multi-view modeling and restricts the model’s generalization ability and effective use of viewpoint information.
  • Insufficient deep mining of viewpoint information: Although viewpoint information is crucial for person ReID, the in-depth impact of viewpoint changes has not been fully explored. More research is needed to explore how to better utilize viewpoint information to enhance recognition accuracy.
  • Computational complexity: methods based on viewpoint information may increase the computational burden of the model, especially when performing viewpoint estimation and feature alignment, which could significantly increase the consumption of computational resources.
  • Model generalization ability: although methods based on viewpoint information may perform well on specific datasets, the generalization ability of the model still needs to be improved, especially in dealing with person ReID tasks in various real-world scenarios.
Future research can proceed in the following directions:
  • Utilize 3D modeling [24] and simulation techniques to increase the diversity and accuracy of viewpoint information.
  • Develop more efficient viewpoint estimation methods to improve the model’s accuracy and robustness.
  • Explore unsupervised learning methods to reduce reliance on annotated data.
  • Combine fine-grained image recognition techniques to enhance the model’s feature expression capability.
  • Enhance the model’s adaptability to viewpoint changes through multi-modal data fusion.

2.4. Domain-Aware Person Re-Identification Methods

In the context of person ReID, “domain” denotes the array of distinct settings and circumstances within which data are amassed, encompassing factors such as camera placement, luminosity levels, environmental backdrop, and the temporal aspect of image acquisition. The variances among domains result in notable visual inconsistencies in pedestrian imagery, thereby presenting formidable challenges to the task of person ReID. Consequently, methodologies for cross-domain person ReID have been developed to surmount these obstacles. Currently, methods based on domain information mainly fall into two categories: domain adaptation-based methods and domain generalization-based methods.
Domain adaptation-based methods [25] typically employ deep neural networks. These methods first train a deep neural network on the source domain data and then fine-tune the network on the target domain data to adapt to the features of the target domain. The advantage of this approach is that it can directly utilize the target domain data to improve performance. However, the downside is that it requires a significant amount of target domain data for fine-tuning, and the target domain data must have some similarity to the source domain data to achieve good results.
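A schematic of this train-then-fine-tune recipe is sketched below; the learning rates and the choice of layers to freeze are illustrative assumptions rather than settings reported in any particular paper.

```python
import torch
import torchvision

# Sketch of domain adaptation by fine-tuning: a network trained on the source
# domain is adapted to the target domain with a smaller learning rate and
# partially frozen early layers (rates and freezing depth are assumptions).
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")

# Stage 1: supervised training on labeled source-domain data (loop omitted).
source_opt = torch.optim.Adam(model.parameters(), lr=3e-4)
# ... train on source-domain batches ...

# Stage 2: fine-tune on target-domain data, freezing generic low-level layers
# so that only the more domain-sensitive deep layers adapt.
for name, p in model.named_parameters():
    if name.startswith(("conv1", "bn1", "layer1")):
        p.requires_grad = False
target_opt = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=3e-5)
# ... fine-tune on (pseudo-)labeled target-domain batches ...
```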
Domain generalization-based methods, on the other hand, focus more on enhancing the model’s generalization ability to accurately identify unseen domains. These methods often employ data augmentation techniques such as sample resampling, image rotation, and color transformation to generate more training data. Additionally, some methods utilize unsupervised domain adaptation techniques for domain generalization, such as DASSL (Domain-Adaptive Self-Supervised Learning) [26] and others. These methodologies commonly employ model regularization strategies to mitigate the propensity for overfitting and enhance the model’s capacity for generalization. It should be highlighted that domain-informed methods must often deliberate on the equilibrium between the source and target domains to optimize recognition efficacy.

2.5. Attention-Based Person Re-Identification

The basic idea of attention mechanisms is to assign different weights to different regions of an image, allowing the model to concentrate its focus on the most meaningful parts, thereby enhancing the model’s performance [27]. Attention mechanisms can be used at different levels within deep learning models, such as global feature extraction, local feature extraction, and multi-scale feature fusion, enabling the model to recognize pedestrians more accurately.
Attention-based person ReID methods commonly employ spatial attention and channel attention mechanisms. Spatial attention learns the importance of each image region, strengthening the representation of informative areas; channel attention learns the importance of each feature channel, strengthening the feature representation as a whole.
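Both mechanisms can be sketched compactly, in the spirit of SE- and CBAM-style modules; the reduction ratio, kernel size, and tensor shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Channel attention: per-channel weights from globally pooled statistics.
class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))          # per-channel importance
        return x * w[:, :, None, None]

# Spatial attention: a per-location weight map from channel-pooled features.
class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)        # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w                             # reweight spatial locations

feat = torch.randn(4, 256, 24, 8)
feat = SpatialAttention()(ChannelAttention(256)(feat))
```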
Beyond these two, attention mechanisms applied to person ReID also include multi-attention fusion, which combines multiple attention modules, and temporal attention, which is used for video person ReID and focuses primarily on the temporal dimension to better capture a pedestrian's motion and pose changes across frames.
It is important to note that while attention mechanisms can improve the performance of person ReID, they still have robustness issues with interference factors such as lighting and occlusions present in the dataset. Therefore, in practical applications, it is necessary to combine other methods to enhance the accuracy and robustness of person ReID.
Although person re-identification methods based on the attention mechanism perform well in improving the recognition accuracy, there are still some drawbacks and research gaps:
  • Accuracy of attention distribution: The effectiveness of the attention mechanism relies on its ability to accurately recognize key regions in the image. If the attention distribution is inaccurate, it may cause the model to focus on the wrong features, thus affecting the recognition performance.
  • Computational resource consumption: the attention mechanism may increase the computational complexity of the model, especially when dealing with high-resolution images or videos, which may lead to a significant consumption of computational resources.
Future research can be conducted in the following areas:
  • Developing more accurate and robust learning methods for attention distribution.
  • Optimizing the attention mechanism to reduce the consumption of computational resources.
  • Enhancing the generalization ability of the model, especially for applications across datasets and multiple scenes.

2.6. Person Re-Identification Based on Image Generation

Owing to the constrained scope of person ReID datasets, the application of data augmentation techniques can amplify the volume of the training dataset, thereby bolstering the model’s resilience and capacity for generalization. The basic idea is to generate fake pedestrian images with different poses, occlusions, lighting, and other variations using an image generator model and add them to the original dataset to increase the number of samples [28]. Furthermore, this approach can be employed to synthesize pedestrian images enriched with a broader spectrum of attributes, thereby enhancing the model’s capacity for generalization. This can be achieved through training methodologies that incorporate the pedestrian’s 3D model or point cloud data.
The use of Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) is a common method for generating fake pedestrian images. GANs consist of a generator and a discriminator, with the generator responsible for generating fake images and the discriminator for distinguishing between real and fake images. Both the generator and the discriminator are optimized through adversarial training, resulting in the generation of more diverse fake images. In contrast to GANs, VAEs employ variational Bayesian methods to model the data distribution, enabling the creation of synthetic images. A distinct advantage of VAEs lies in their capacity to precisely manipulate image generation by modulating the parameters of the latent space, such as the mean and variance, to produce images with varied characteristics.
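A deliberately tiny sketch of the adversarial training loop follows. The linear generator and discriminator and the 64x32 image size are placeholders; practical ReID GANs use deep convolutional architectures and often condition on pose or identity.

```python
import torch
import torch.nn as nn

# Toy GAN step for data augmentation: a generator maps noise to fake
# pedestrian images; a discriminator scores realism. Architectures and the
# flattened 3x64x32 image size are illustrative placeholders.
G = nn.Sequential(nn.Linear(128, 3 * 64 * 32), nn.Tanh())
D = nn.Sequential(nn.Linear(3 * 64 * 32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(16, 3 * 64 * 32) * 2 - 1      # stand-in for real images
z = torch.randn(16, 128)

# Discriminator step: push real images toward 1, generated images toward 0.
fake = G(z).detach()
d_loss = bce(D(real), torch.ones(16, 1)) + bce(D(fake), torch.zeros(16, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to fool the discriminator into scoring fakes as real.
g_loss = bce(D(G(z)), torch.ones(16, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```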
Although data generation methods can expand the training set and improve model performance, they also come with some issues. For instance, the generated fake images may contain noise or unrealistic features, which can degrade the model's performance. Therefore, when using data generation methods, it is necessary to carefully select the generator model and control the quality of the generated fake images. The work in [29] explores a person ReID method based on a generative model, which learns the representation of an individual by generating images of a specific person from random noise; this approach decouples identity from other instance-specific information, such as pose and background, allowing for interesting transformations between different identities.

3. Research Directions in Person Re-Identification

3.1. Occlusion Person Re-Identification

In real-world surveillance scenarios, pedestrians may be occluded by other pedestrians, vehicles, or buildings, leading to a decrease in the accuracy of person ReID. To address this issue, the research direction of occlusion person ReID has emerged. Researchers [30] have proposed specialized datasets for occlusion person ReID, such as Partial-REID, Occluded-Duke, and Occluded-REID, among others. Unlike the datasets used for image-based person ReID, these datasets place a greater emphasis on scenarios where pedestrians are occluded.
The major drawbacks and research gaps faced by occluded person re-identification include the following:
  • Complexity of occlusion processing: pedestrians are often occluded by other objects in surveillance scenarios, resulting in the loss of key feature information and making it difficult for vision-based re-identification algorithms to accurately match pedestrian identities.
  • Feature alignment problems: occlusion situations can result in parts of the pedestrian’s body not being visible, which poses a challenge for feature alignment, especially when the occluded area overlaps with key feature areas.
To address the challenges of occlusion [31] in person ReID, researchers have proposed various methods. An emerging strategy integrates pose estimation with local feature extraction, exemplified by Pose-guided Feature Alignment (PFA) [32]. This technique leverages pose estimation to deduce the spatial coordinates of obscured pedestrian segments and subsequently aligns the features correspondingly. This alignment process optimizes the utilization of visible parts for enhancing the accuracy of ReID. Additionally, there are methods that utilize viewpoint information. For example, He et al. [33] proposed a local feature-based spatial feature reconstruction module that can fuse local features from different viewpoints to generate global features. Other methods include those that use reconstruction and attention mechanisms. These methods can all contribute to improving the accuracy of occlusion person ReID to some extent.
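The intuition shared by such occlusion-aware methods can be sketched as a visibility-weighted part distance: part-to-part distances count less when a part is occluded in either image, with visibility scores supplied, for example, by keypoint confidences from a pose estimator. The scores and dimensions below are fabricated for illustration, and PFA's actual formulation differs in detail.

```python
import torch

# Sketch of visibility-aware part matching: per-part distances are weighted
# by the product of the two images' visibility scores, so occluded parts
# contribute little to the final distance.
def occluded_distance(query_parts, gallery_parts, q_vis, g_vis):
    # query_parts, gallery_parts: (P, D); q_vis, g_vis: (P,) in [0, 1]
    d = (query_parts - gallery_parts).pow(2).sum(dim=1).sqrt()  # per-part L2
    w = q_vis * g_vis               # a part counts only if visible in both
    return (w * d).sum() / w.sum().clamp(min=1e-6)

qp, gp = torch.randn(6, 256), torch.randn(6, 256)
q_vis = torch.tensor([1.0, 1.0, 0.2, 0.1, 1.0, 0.9])  # lower body occluded
g_vis = torch.ones(6)
print(occluded_distance(qp, gp, q_vis, g_vis))
```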

3.2. Video Person Re-Identification

Video person ReID [34] is a technology that identifies and tracks the same individual across different surveillance camera views by analyzing the pedestrian's trajectory and appearance features within a video sequence. Unlike image-based person ReID, which deals with static images, video person ReID must model sequence data and is therefore better at capturing the dynamic changes and movement characteristics of pedestrians. It is essential to consider temporal and spatial features such as the pedestrian's movement trajectory, direction, and speed. In addition to these spatio-temporal features, multi-modal features such as pedestrian pose, behavior, and audio can also be used for multi-modal fusion.
Common video person ReID datasets include MARS, DukeMTMC-VideoReID, and PRID2011, among others. In practical applications, video data are voluminous, necessitating efficient algorithms and computer hardware for processing and storage. Currently, video person ReID methods based on deep learning are widely studied and applied. MSCAN [35] is a multi-task video person ReID network that transforms the person ReID problem into multiple sub-tasks. It simultaneously learns global and local features, achieving good performance. Wei et al. [36] proposed a Multi-scale Spatio-Temporal Attention (MSTA) model that focuses on the importance of local regions of pedestrians for each frame of an entire video sequence. It uses multi-scale feature fusion in the spatial-temporal dimension to effectively improve the accuracy of person ReID.
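The temporal attention idea underlying such models can be sketched as follows: a small scoring layer assigns each frame a weight, and frame-level features are aggregated by softmax-normalized weights rather than uniform averaging. The dimensions are illustrative, and models such as MSTA add multi-scale spatial components on top of this basic mechanism.

```python
import torch
import torch.nn as nn

# Sketch of temporal attention pooling for video ReID: learn a weight per
# frame and aggregate frame features into one clip-level descriptor.
class TemporalAttentionPool(nn.Module):
    def __init__(self, feat_dim: int = 2048):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frame_feats):                        # (B, T, D)
        w = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1)
        return (w * frame_feats).sum(dim=1)                # (B, D)

clip = torch.randn(4, 8, 2048)             # 8 frames per tracklet
video_feat = TemporalAttentionPool()(clip)
```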
Although video person re-identification technology has a wide range of application prospects in the field of intelligent surveillance and security, there are still some shortcomings and research gaps in the existing methods:
  • Insufficient use of spatio-temporal information: many existing methods do not make full use of the spatio-temporal information in the video, resulting in highly similar features in consecutive frames and a lack of effective modeling of person motion and temporal changes.
  • Computational complexity: in order to improve recognition accuracy, existing methods often increase computational overhead by introducing complex operations, such as image complementation networks and multi-scale graph convolutional networks, which are not conducive to deployment on devices with limited computational resources.
  • Handling of occlusion and view angle changes: In surveillance environments with high occlusion or complex backgrounds, it is difficult to extract accurate person silhouette or body part information, which affects the performance of person ReID.
Future research can be conducted in the following areas:
  • Exploring more effective spatio-temporal feature extraction methods to fully utilize the spatio-temporal information in videos.
  • Designing lightweight models to reduce computational overhead and improve model deployability on mobile devices.

3.3. Visible-Infrared Person Re-Identification

Visible-infrared person ReID [37] refers to the process of performing person ReID under two different wavelengths: visible light and infrared light. This approach is primarily used to address person ReID challenges in nighttime or low-light conditions. Traditional person ReID methods only handle RGB images, while visible-infrared person ReID requires handling cross-modal matching between visible light images and infrared images, which have very different visual features. This cross-modal person ReID task is thus extremely challenging.
There are various methods to address visible-infrared person ReID, primarily focusing on how to embed the features of images from two different modalities into the same feature space. Wu et al. [38] proposed a deep zero-padding framework to adaptively learn shared features across the two modalities; Wang et al. [39] utilized a Generative Adversarial Network to generate cross-modal pedestrian images to illustrate the feature differences caused by different modalities; Ye et al. [40] introduced a dual-stream network to share information across different modalities and model perspective information.
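The dual-stream idea in these works can be sketched as below: each modality passes through its own shallow stem, and a shared trunk embeds both into a common feature space where cross-modal distances are meaningful. The layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of a dual-stream visible-infrared network: modality-specific stems,
# shared deeper layers, one common embedding space.
class DualStream(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.rgb_stem = nn.Sequential(nn.Conv2d(3, 64, 7, 2, 3), nn.ReLU())
        self.ir_stem = nn.Sequential(nn.Conv2d(1, 64, 7, 2, 3), nn.ReLU())
        self.shared = nn.Sequential(
            nn.Conv2d(64, dim, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, rgb=None, ir=None):
        if rgb is not None:
            return self.shared(self.rgb_stem(rgb))
        return self.shared(self.ir_stem(ir))

net = DualStream()
f_rgb = net(rgb=torch.randn(2, 3, 256, 128))
f_ir = net(ir=torch.randn(2, 1, 256, 128))   # same embedding space as f_rgb
```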
Visible light-infrared person re-identification technology still has some drawbacks and research gaps, although it has obvious advantages at night or in poorly illuminated environments:
  • Significant modal differences: there are significant differences between visible and infrared images in terms of imaging principles and visual performance, such as resolution, contrast, and color information, and these differences bring challenges to feature extraction and information fusion.
  • Difficulty in feature fusion [41]: since the feature distributions of the two modalities may differ significantly, how to effectively fuse the features of the two modalities to improve the recognition accuracy is a difficult task.
  • Occlusion and change of view angle: in real surveillance scenarios, pedestrians may be occluded or appear in different view angles, which poses a challenge to feature-based person ReID systems.
Future research can be conducted in the following areas:
  • Developing new algorithms or improving existing algorithms to handle modal differences and fusing multi-modal features more efficiently.
  • Expanding and diversifying datasets with more scenes and pedestrian samples to improve the generalization ability of the model.

3.4. Unsupervised Person Re-Identification

Unsupervised person ReID [21] refers to the task of matching images of the same individual across different scenes or times without any labeled identity information. Unlike supervised learning, unsupervised person ReID does not require labeled data, which can save time and labor costs. However, without the constraints of labeled data, the model is more susceptible to noise and outliers, leading to lower recognition accuracy. Additionally, the model cannot be optimized for specific tasks such as localization and detection.
Unsupervised person ReID primarily employs methods such as clustering, self-supervised learning, and Generative Adversarial Networks (GANs). Among these, clustering is the most straightforward. For example, Fan et al. [42] used the K-means algorithm to compute cluster centers in the pedestrian feature space and took the resulting cluster assignments as pseudo-labels. Another commonly used method is self-supervised learning, which trains the model by constructing a similarity or contrastive loss; for instance, Chen et al. [43] utilized GANs to generate new-perspective images of pedestrians as references and conducted contrastive learning between the original and generated views. Furthermore, GANs are widely used on their own as an unsupervised learning method, since they can generate fake data that mimic the distribution of real data.
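The clustering-based recipe can be sketched as follows, assuming features have already been extracted by an encoder; the cluster count is a hyperparameter assumption, and methods such as [42] refine this basic loop considerably.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of clustering-based unsupervised ReID: cluster features of unlabeled
# images and use cluster indices as pseudo-identity labels for a
# supervised-style training round.
features = np.random.randn(5000, 2048).astype(np.float32)  # extracted feats
kmeans = KMeans(n_clusters=500, n_init=10).fit(features)
pseudo_labels = kmeans.labels_      # one pseudo-identity per image

# Typical loop: retrain the encoder on (image, pseudo_label) pairs, then
# re-extract features and re-cluster, repeating until performance plateaus.
```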
Unsupervised person ReID datasets are typically the same as those used for image-based person ReID, but they do not use the labeled parts of the data. Although the recognition accuracy of unsupervised person ReID is lower, it still has high practical value in certain application scenarios, such as pedestrian tracking in surveillance videos.

3.5. Text-Based Person Re-Identification

In practical applications, pedestrian images may not be available, and text-based person ReID is a technique that uses textual descriptions to re-identify pedestrian images. Unlike traditional image-based person ReID, text-based person ReID uses a textual description of a pedestrian as the “query image” instead of a real pedestrian image.
The key to solving the problem of text-based person ReID lies in how to integrate the common features between textual descriptions and pedestrian images. Li et al. [44] used a gated neural attention model with attention mechanisms to learn the common features between textual descriptions and pedestrian images. Liu et al. [45] used adversarial generative networks to learn the latent common features between text and images to bridge the gap between different modalities.
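One common way to learn such a shared space is a symmetric contrastive loss over paired image and text embeddings, sketched below. The stand-in linear projections, feature dimensions, and temperature are illustrative assumptions; the cited works instead employ gated attention [44] and adversarial learning [45].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a joint text-image embedding: project both modalities into one
# space and pull matched (image, caption) pairs together within a batch.
img_proj = nn.Linear(2048, 512)   # on top of an image-feature extractor
txt_proj = nn.Linear(768, 512)    # on top of a text encoder

img_feat = F.normalize(img_proj(torch.randn(16, 2048)), dim=-1)
txt_feat = F.normalize(txt_proj(torch.randn(16, 768)), dim=-1)

# Symmetric InfoNCE-style loss: the i-th caption describes the i-th image.
logits = img_feat @ txt_feat.t() / 0.07       # temperature is an assumption
targets = torch.arange(16)
loss = (F.cross_entropy(logits, targets)
        + F.cross_entropy(logits.t(), targets)) / 2
```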
Research on text-based person ReID requires specialized datasets, such as CUHK-PEDES [44]. Unlike other types of person ReID datasets, the pedestrian images in these datasets are usually accompanied by corresponding textual descriptions, allowing researchers to train and evaluate the performance of text-based person ReID algorithms using these datasets.
Text-based person ReID can compensate for the lack of pedestrian image data, but in some real-world scenarios, textual description information may not be sufficient or may be ambiguous, thereby affecting the performance of the ReID algorithm.

3.6. Cross-Resolution Person Re-Identification

Different cameras, environments, and conditions can lead to pedestrian images with varying resolutions and clarity. These factors can affect the performance of person ReID, as pedestrian images captured at different resolutions may exhibit different features, which can in turn impact the accuracy of feature extraction and matching. Cross-resolution person ReID aims to address this issue, with the core idea being to leverage information from images at different resolutions to establish a cross-resolution person ReID model.
Currently, methods for addressing cross-resolution person ReID mainly fall into two directions. The first direction is to extract features at different resolutions. For example, Li et al. [2] proposed a multi-scale image-based person ReID method, which extracts features at multiple scales and fuses them together to improve the accuracy of person ReID. Additionally, Liu et al. [46] proposed a scale-aware feature fusion method that can adaptively fuse features at different resolutions to further enhance the performance of person ReID.
The second direction is to generate or recover low-resolution images. For instance, Li et al. [47] used adversarial learning to restore low-resolution images to increase the robustness of cross-resolution person ReID. Furthermore, Tao et al. [48] proposed a cross-resolution person ReID method based on self-supervised learning, which can generate high-resolution images from low-resolution pedestrian images, thereby improving the accuracy of person ReID.
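The first direction, scale-aware feature fusion, can be sketched as follows: the input is encoded at several resolutions and the per-scale descriptors are combined with learned weights. The toy encoder, scale set, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of scale-aware fusion for cross-resolution ReID: encode the image at
# several input resolutions and fuse the descriptors with learned weights.
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
scale_weights = nn.Parameter(torch.zeros(3))   # learned fusion weights

def multi_scale_feature(img):                  # img: (B, 3, H, W)
    feats = []
    for s in (1.0, 0.5, 0.25):                 # simulated resolution drops
        x = F.interpolate(img, scale_factor=s, mode="bilinear",
                          align_corners=False)
        feats.append(encoder(x))               # (B, 64) descriptor per scale
    w = torch.softmax(scale_weights, dim=0)
    return sum(wi * fi for wi, fi in zip(w, feats))

feat = multi_scale_feature(torch.randn(2, 3, 256, 128))
```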
The major drawbacks and research gaps faced by cross-resolution person re-identification (CPR) include the following:
  • Resolution mismatch problem: in real surveillance scenarios, due to the different distances between pedestrians and cameras, the resolution of the captured pedestrian images may vary greatly, resulting in a significant degradation in the performance of the standard pedestrian ReID model.
  • Limitations of super-resolution techniques: Many cross-resolution person ReID methods rely on super-resolution (SR) techniques to enhance low-resolution images, but these methods usually preset a single scaling ratio, which may not be able to efficiently recover all the details of the image and may introduce noise.
  • Computational resources and time consumption: methods using multiscale super-resolution fusion can improve the recognition accuracy, but they require a lot of computational resources and inference time.
Future research can be carried out in the following areas:
  • Developing more effective feature extraction and fusion methods to improve the accuracy and robustness of cross-resolution person re-identification.
  • Designing lightweight models to reduce computational resource consumption and improve real-time performance.

4. Datasets and Experimental Comparisons

4.1. Datasets

This section delineates the commonly employed datasets and evaluation standards in the field of person ReID. Numerous open-source datasets are presently available, each possessing distinct attributes and targeting particular problems. Due to space constraints, only the most prevalent datasets, or those pertaining to the specific domains discussed here, are described.
  • Market1501 [49] is one of the most widely used datasets for person ReID. The video clips in this dataset are sourced from six cameras located in front of a supermarket at Tsinghua University, with five high-resolution cameras and one low-resolution camera. Market-1501 utilizes the DPM (Deformable Part Model) to automatically generate pedestrian bounding boxes. It encompasses 1501 unique identities, totaling 32,668 images. These images are partitioned into three sections: the training set (train) consists of 751 pedestrians and 12,936 images; the query set (query) comprises 750 pedestrians and 3368 images; and the gallery set (gallery) includes 750 pedestrians and 19,732 images. In addition to the conventional Market-1501, there is an extended version known as Market-1501+500K. In Market-1501+500K, the number of images in the test set is increased to 519,732, with the addition of a substantial number of distractors.
  • DukeMTMC-ReID [50] is also one of the commonly utilized datasets in person ReID. The video clips in this dataset are sourced from eight cameras located within the Duke University campus. The pedestrian bounding boxes in this dataset are manually annotated. It encompasses 1404 unique identities, totaling 36,411 images. These images are partitioned into three sections: the training set includes 702 pedestrians and 16,522 images; the query set includes 702 pedestrians and 2228 images; and the gallery set includes 702 pedestrians and 17,761 images.
  • CUHK03 [51] derives its video clips from ten (five pairs) cameras located within the campus of the Chinese University of Hong Kong. This dataset has two testing protocols, with the newer protocol being more popular currently. Under the new testing protocol, CUHK03 contains 1467 unique identities, totaling 14,096 images. The new testing protocol has two annotation modes: Detected and Labeled. In the Detected mode, which uses the DPM (Deformable Part Model) algorithm to automatically generate pedestrian detection boxes, the training set comprises 767 pedestrians and 7365 images. The query set includes 700 pedestrians and 1400 images. The gallery set includes 700 pedestrians and 5332 images. In the Labeled mode, where pedestrian detection boxes are manually annotated, the training set comprises 767 pedestrians and 7368 images. The query set includes 700 pedestrians and 1400 images. The gallery set includes 700 pedestrians and 5328 images.
  • Occluded-DukeMTMC [32] is a sub-dataset of the DukeMTMC-ReID dataset, with the majority of its data containing pedestrians in occluded situations, specifically designed for testing the performance of models under occlusion. Its query set consists of 2210 images, selected from the query and gallery sets of DukeMTMC-ReID. The training set excludes images that are the same as those in the gallery set, totaling 15,618 images. The gallery set is consistent with the gallery set of DukeMTMC-ReID, containing 17,661 images.
  • SYSU-MM01 [38] is a cross-modal person ReID dataset. Its video clips are captured by four general RGB cameras and two near-infrared cameras on the campus of Sun Yat-sen University. The dataset contains 491 identities, comprising 287,628 visible light images and 15,792 infrared images. The training set comprises 22,258 visible light images and 11,909 near-infrared images, captured from indoor and outdoor cameras. For testing, the dataset provides two evaluation settings: all-search mode and indoor-search mode. The query set consists of 3803 infrared images captured by the two infrared cameras. In all-search mode, the gallery set includes all images captured by the four RGB cameras, while in indoor-search mode it includes only images captured by the two indoor cameras.
  • RSTPReid [52] is a person ReID dataset designed for textual queries. The dataset contains 20,505 images of 4101 individuals, with each individual having five corresponding images taken by different cameras, each accompanied by two sentences of textual description. For data partitioning, 3701, 200, and 200 identities are used for training, validation, and testing, respectively.
  • DukeMTMC-VideoReID [53] is a person ReID dataset intended for video testing. The dataset includes 1812 unique identities, with 4832 pedestrian trajectories and 815,420 images. Among these, 408 pedestrians are used as distractors, 702 for training, and 702 for testing.

4.2. Experimental Comparisons

Market-1501, DukeMTMC-reID, and CUHK03 are considered conventional datasets and are also the most frequently used as experimental benchmarks in person ReID. Occluded-DukeMTMC, SYSU-MM01, RSTPReid, and DukeMTMC-VideoReID belong to specialized datasets in the field of person ReID, designed to test the experimental performance of person ReID under certain specific scenarios. In addition to the datasets mentioned above, there are numerous other open-source person ReID datasets available for researchers to choose from. Researchers should select a dataset that aligns with their specific research scenarios to ensure the relevance and applicability of their experiments.
To provide a more intuitive comparison of person ReID research under special scenarios over the past decade, Table 1 and Table 2 are presented in this paper. These tables provide a horizontal overview of various person ReID approaches and directions, along with experimental performance comparisons and a summary of the experimental resources and backbone networks used.
Table 1 reveals that each person ReID approach and method achieved effective recognition results, with high experimental accuracy rates on the Market1501, DukeMTMC, and CUHK03 datasets. Notably, on the Market1501 dataset, the mean average precision (mAP) surpassed 60%. From a historical perspective, person ReID methods have evolved from initial global-local approaches based on geometric granularity to other granularity methods, such as viewpoint-, semantic-, and attention-based approaches, to aid in the training of person ReID models.
On the Market1501 dataset, the Rank-1 accuracy increased from 89.9% in 2019 to 96.1% in 2024, an improvement of 6.2 percentage points. The mAP also rose from 73.9% to 89.9%, an increase of 16 percentage points. The comparison suggests that person ReID methods based on viewpoint information and attention mechanisms exhibit superior performance and have gradually become the mainstream research direction.
Table 2 indicates that with the deepening of research, an increasing number of specific issues have been recognized, especially in the past two years, where significant breakthroughs have been achieved in person ReID across various specialized domains. Among these, the development of unsupervised person ReID, which is more aligned with real-world applications, shows a strong research trend and momentum. In this direction, the current recognition accuracy has gradually caught up with that of image-based person ReID in supervised scenarios, with mAP and Rank-1 reaching 85.80% and 94.50%, respectively. Furthermore, due to the significant differences observed in data from single scenarios, it is challenging to complete all tasks with a unified model. Consequently, person ReID still presents numerous unresolved issues.
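For reference, the evaluation protocol behind these mAP and Rank-1 numbers can be sketched as follows: gallery images are ranked by distance to each query, Rank-1 measures whether the nearest gallery image shares the query's identity, and mAP averages precision over all true matches. The sketch omits the camera-based filtering applied by standard benchmarks.

```python
import numpy as np

# Sketch of Rank-1 and mAP computation from a query-gallery distance matrix.
def rank1_and_map(dist, q_ids, g_ids):
    # dist: (Q, G) distances; q_ids: (Q,) query IDs; g_ids: (G,) gallery IDs
    order = np.argsort(dist, axis=1)             # nearest gallery items first
    rank1_hits, aps = [], []
    for i in range(dist.shape[0]):
        matches = g_ids[order[i]] == q_ids[i]    # relevance flags by rank
        if not matches.any():                    # no true match in gallery
            continue
        rank1_hits.append(matches[0])
        hits = np.cumsum(matches)
        precision = hits[matches] / (np.flatnonzero(matches) + 1)
        aps.append(precision.mean())             # average precision per query
    return float(np.mean(rank1_hits)), float(np.mean(aps))

dist = np.random.rand(10, 100)
q_ids = np.random.randint(0, 5, 10)
g_ids = np.random.randint(0, 5, 100)
print(rank1_and_map(dist, q_ids, g_ids))
```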

5. Future Research Directions

This paper primarily introduces the mainstream methods and research directions for addressing person ReID. It is important to note that these methods and directions are not isolated from one another; they can be combined to achieve more effective results. For instance, one can integrate attention maps from pose estimation with local features to address occlusion in person ReID or combine global features with semantic attribute features to tackle video person ReID. Moreover, the methods and directions presented above do not encompass all that is available in the field of person ReID. New methods and directions are continuously emerging, such as Black ReID, which explores the challenges posed by individuals wearing black clothing. This paper will focus on introducing the following person ReID research directions that have not been fully explored.

5.1. Multi-Modal Person Re-Identification

In the past, person ReID predominantly relied on explicit, single-modal datasets. However, in real-world scenarios, most data are uncontrolled. A good person ReID system should be capable of handling varying resolutions, different modalities, various environments, and multiple domains. Current research in this direction is extensive; the approaches covered in this paper, such as visible-infrared person ReID, cross-resolution person ReID, text-based person ReID, and depth-based person ReID, all fall under the category of cross-modal person ReID. Yet there are other, less-explored multi-modal approaches; for example, visible-sketch person ReID is a little-studied direction. New cross-modal settings of this kind continue to emerge, and studying how to combine common RGB images with other modalities is evidently a future challenge and research hotspot.

5.2. Cross-Dressing Person ReID

Dress changes are a common issue in person ReID, as the same individual may wear different clothes and shoes at different times and in different scenarios, leading to changes in their appearance features and posing challenges to ReID. Specifically, clothing changes can affect the accuracy and robustness of person ReID. If the ReID system solely focuses on the individual’s appearance features and neglects the impact of clothing changes, it may mistakenly identify different individuals as the same person or incorrectly classify different appearance features of the same individual as those of different individuals. This can lead to increased false identification and missed identification rates, thereby reducing the system’s accuracy and robustness.

5.3. Unsupervised and Semi-Supervised Person Re-Identification

Due to the labor-intensive nature of manually annotating datasets, some researchers have focused on using weakly supervised and unsupervised methods for person ReID to alleviate the high cost of annotating person images across multiple cameras. Compared to supervised learning, unsupervised methods reduce expensive data annotation costs and have shown great potential in real-world person ReID applications. However, due to the lack of annotated information, the recognition performance of these methods is often unsatisfactory. Therefore, how to improve person ReID performance in the absence of annotations remains a significant research area.

5.4. Other Person Re-Identification

In addition to unsupervised person ReID, there are two related directions aimed at reducing manual annotation: Weakly Supervised Person ReID and Semi-Supervised Person ReID.
Weakly Supervised Person ReID involves person ReID without using precise annotations. Instead, it relies on non-identity annotations such as camera perspective, pose, and attributes. This approach reduces the reliance on exhaustive and accurate identity annotations, which can be time-consuming and costly.
Semi-Supervised Person ReID, on the other hand, utilizes both labeled and unlabeled data: the labeled data provide direct supervision, while the unlabeled data, which carry no identity annotations, are additionally exploited during training, for example through pseudo-labeling. This approach leverages the large amounts of unlabeled data that are often available in real-world scenarios to supplement the labeled data and potentially improve the performance of the person ReID system.
A recent work [80] proposes a novel person ReID task, instruct-ReID, which requires the model to retrieve images according to given image or language instructions. It introduces the large-scale OmniReID benchmark and an adaptive triplet loss as baselines for this new setting; for orientation, the standard triplet loss that such adaptive variants build on is sketched after this paragraph.
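The adaptive variant of [80] is not reproduced here; the sketch below shows only the standard triplet loss widely used as a ReID training objective, with an assumed margin value.

```python
# The standard triplet loss (margin value assumed); [80]'s adaptive variant
# modifies this objective but is not reproduced here.
import torch

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    # Pull anchors toward same-identity embeddings and push them at least a
    # margin farther from different-identity embeddings.
    d_ap = torch.norm(anchor - positive, dim=1)
    d_an = torch.norm(anchor - negative, dim=1)
    return torch.clamp(d_ap - d_an + margin, min=0).mean()

loss = triplet_loss(torch.randn(16, 128), torch.randn(16, 128), torch.randn(16, 128))
```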

6. Conclusions

Person ReID has garnered extensive attention from researchers in recent years and has achieved significant breakthroughs across various fields. In this paper, we have presented, for the first time, a comprehensive review and analysis of person ReID work categorized by special scenarios. We have examined in depth the status quo and empirical results of leading approaches in these scenarios, tracing the underlying ideas and research trajectories. The comparative analysis shows that person ReID still holds considerable research potential in areas such as unsupervised learning, occlusion handling, and cross-modal recognition. Finally, we have discussed prospective research directions in the domain of person ReID.

Funding

This study is partially supported by the scientific research project of Zhejiang Provincial Department of Education (No. 21030074-F) and the scientific research project of Keyi College, Zhejiang Sci-Tech University (No. KY2024001).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Chen, Y.C.; Zhu, X.; Zheng, W.S.; Lai, J.H. Person re-identification by camera correlation aware feature augmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 392–408.
2. Li, X.; Zheng, W.S.; Wang, X.; Xiang, T.; Gong, S. Multi-scale learning for low-resolution person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3765–3773.
3. Park, H.; Ham, B. Relation network for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11839–11847.
4. Ning, X.; Gong, K.; Li, W.; Zhang, L.; Bai, X.; Tian, S. Feature refinement and filter network for person re-identification. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 3391–3402.
5. Zhu, K.; Guo, H.; Liu, Z.; Tang, M.; Wang, J. Identity-guided human semantic parsing for person re-identification. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 346–363.
6. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1116–1124.
7. Huang, Y.; Zha, Z.J.; Fu, X.; Zhang, W. Illumination-invariant person re-identification. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 365–373.
8. Zheng, L.; Huang, Y.; Lu, H.; Yang, Y. Pose-invariant embedding for deep person re-identification. IEEE Trans. Image Process. 2019, 28, 4500–4509.
9. Gray, D.; Tao, H. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In Proceedings of the Computer Vision–ECCV 2008: 10th European Conference on Computer Vision, Marseille, France, 12–18 October 2008; Springer: Cham, Switzerland, 2008; pp. 262–275.
10. Bhuiyan, A.; Mirmahboub, B.; Perina, A.; Murino, V. Person re-identification using robust brightness transfer functions based on multiple detections. In Proceedings of the Image Analysis and Processing—ICIAP 2015: 18th International Conference, Genoa, Italy, 7–11 September 2015; Springer: Cham, Switzerland, 2015; pp. 449–459.
11. Yang, Q.; Wu, A.; Zheng, W.S. Person re-identification by contour sketch under moderate clothing change. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 2029–2046.
12. Zhuo, J.; Chen, Z.; Lai, J.; Wang, G. Occluded person re-identification. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6.
13. Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2872–2893.
14. Zhang, L.; Fu, X.; Huang, F.; Yang, Y.; Gao, X. An Open-World, Diverse, Cross-Spatial-Temporal Benchmark for Dynamic Wild Person Re-Identification. Int. J. Comput. Vis. 2024, 1–24.
15. Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019.
16. Niu, K.; Yu, H.; Qian, X.; Fu, T.; Li, B.; Xue, X. Synthesizing Efficient Data with Diffusion Models for Person Re-Identification Pre-Training. arXiv 2024, arXiv:2406.06045.
17. Sun, Y.; Xu, Q.; Li, Y.; Zhang, C.; Li, Y.; Wang, S.; Sun, J. Perceive where to focus: Learning visibility-aware part-level features for partial person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 393–402.
18. Wang, X.; Zheng, S.; Yang, R.; Zheng, A.; Chen, Z.; Tang, J.; Luo, B. Pedestrian attribute recognition: A survey. Pattern Recognit. 2022, 121, 108220.
19. Lin, Y.; Zheng, L.; Zheng, Z.; Wu, Y.; Hu, Z.; Yan, C.; Yang, Y. Improving person re-identification by attribute and identity learning. Pattern Recognit. 2019, 95, 151–161.
20. Sun, X.; Zheng, L. Dissecting person re-identification from the viewpoint of viewpoint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 608–617.
21. Li, P.; Wu, K.; Huang, W.; Zhou, S.; Wang, J. Camera-aware Label Refinement for Unsupervised Person Re-identification. arXiv 2024, arXiv:2403.16450.
22. Wang, D.; Chen, Y.; Tao, L.; Hu, C.; Tie, Z.; Ke, W. AEA-Net: Affinity-supervised entanglement attentive network for person re-identification. Pattern Recognit. Lett. 2023, 172, 237–244.
23. Nguyen, V.D.; Khaldi, K.; Nguyen, D.; Mantini, P.; Shah, S. Contrastive viewpoint-aware shape learning for long-term person re-identification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 1041–1049.
24. Rao, H.; Miao, C. A Survey on 3D Skeleton Based Person Re-Identification: Approaches, Designs, Challenges, and Future Directions. arXiv 2024, arXiv:2401.15296.
25. Wang, W.; Chen, Y.; Wang, D.; Tie, Z.; Tao, L.; Ke, W. Joint attribute soft-sharing and contextual local: A multi-level features learning network for person re-identification. Vis. Comput. 2024, 40, 2251–2264.
26. Achituve, I.; Maron, H.; Chechik, G. Self-supervised learning for domain adaptation on point clouds. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 123–133.
27. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
28. Wei, L.; Zhang, S.; Gao, W.; Tian, Q. Person transfer GAN to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 79–88.
29. Asperti, A.; Fiorilla, S.; Orsini, L. A generative approach to person reidentification. Sensors 2024, 24, 1240.
30. Dou, S.; Jiang, X.; Tu, Y.; Gao, J.; Qu, Z.; Zhao, Q.; Zhao, C. DROP: Decouple Re-Identification and Human Parsing with Task-specific Features for Occluded Person Re-identification. arXiv 2024, arXiv:2401.18032.
31. Chen, Z.; Ge, Y. Occluded cloth-changing person re-identification. arXiv 2024, arXiv:2403.08557.
32. Miao, J.; Wu, Y.; Liu, P.; Ding, Y.; Yang, Y. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 542–551.
33. He, L.; Liang, J.; Li, H.; Sun, Z. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7073–7082.
34. Boujou, M.; Iguernaissi, R.; Nicod, L.; Merad, D.; Dubuisson, S. GAF-Net: Video-Based Person Re-Identification via Appearance and Gait Recognitions. In Proceedings of the 19th International Conference on Computer Vision Theory and Applications, Roma, Italy, 27–29 February 2024; pp. 493–500.
35. Li, D.; Chen, X.; Zhang, Z.; Huang, K. Learning deep context-aware features over body and latent parts for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 384–393.
36. Zhang, W.; He, X.; Yu, X.; Lu, W.; Zha, Z.; Tian, Q. A multi-scale spatial-temporal attention model for person re-identification in videos. IEEE Trans. Image Process. 2019, 29, 3365–3373.
37. Du, Y.; Zhao, Z.; Su, F. YYDS: Visible-Infrared Person Re-Identification with Coarse Descriptions. arXiv 2024, arXiv:2403.04183.
38. Wu, A.; Zheng, W.S.; Yu, H.X.; Gong, S.; Lai, J. RGB-infrared cross-modality person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5380–5389.
39. Wang, G.; Zhang, T.; Cheng, J.; Liu, S.; Yang, Y.; Hou, Z. RGB-infrared cross-modality person re-identification via joint pixel and feature alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3623–3632.
40. Ye, M.; Lan, X.; Li, J.; Yuen, P. Hierarchical discriminative learning for visible thermal person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
41. Nie, J.; Lin, S.; Kot, A.C. Color Space Learning for Cross-Color Person Re-Identification. arXiv 2024, arXiv:2405.09487.
42. Fan, H.; Zheng, L.; Yan, C.; Yang, Y. Unsupervised person re-identification: Clustering and fine-tuning. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2018, 14, 1–18.
43. Chen, H.; Wang, Y.; Lagadec, B.; Dantcheva, A.; Bremond, F. Joint generative and contrastive learning for unsupervised person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2004–2013.
44. Li, S.; Xiao, T.; Li, H.; Zhou, B.; Yue, D.; Wang, X. Person search with natural language description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1970–1979.
45. Liu, J.; Zha, Z.J.; Hong, R.; Wang, M.; Zhang, Y. Deep adversarial graph attention convolution network for text-based person search. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 665–673.
46. Li, Y.; Liu, L.; Zhu, L.; Zhang, H. Person re-identification based on multi-scale feature learning. Knowl.-Based Syst. 2021, 228, 107281.
47. Li, Y.J.; Chen, Y.C.; Lin, Y.Y.; Du, X.; Wang, Y.C.F. Recover and identify: A generative dual model for cross-resolution person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8090–8099.
48. Cheng, Z.; Dong, Q.; Gong, S.; Zhu, X. Inter-task association critic for cross-resolution person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2605–2615.
49. Yu, Z.; Tiwari, P.; Hou, L.; Li, L.; Li, W.; Jiang, L.; Ning, X. MV-ReID: 3D multi-view transformation network for occluded person re-identification. Knowl.-Based Syst. 2024, 283, 111200.
50. Zheng, Z.; Zheng, L.; Yang, Y. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3754–3762.
51. Li, W.; Zhao, R.; Xiao, T.; Wang, X. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 152–159.
52. Zhu, A.; Wang, Z.; Li, Y.; Wan, X.; Jin, J.; Wang, T.; Hu, F.; Hua, G. DSSL: Deep surroundings-person separation learning for text-based person retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, China, 20–24 October 2021; pp. 209–217.
53. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 17–35.
54. Wei, L.; Zhang, S.; Yao, H.; Gao, W.; Tian, Q. GLAD: Global–local-alignment descriptor for scalable person re-identification. IEEE Trans. Multimed. 2018, 21, 986–999.
55. Wang, G.; Lai, J.; Huang, P.; Xie, X. Spatial-temporal person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8933–8940.
56. Chen, X.; Liu, X.; Liu, W.; Zhang, X.P.; Zhang, Y.; Mei, T. Explainable person re-identification with attribute-guided metric distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 11813–11822.
57. Hong, P.; Wu, A.; Zheng, W.S. Semi-supervised person re-identification by attribute similarity guidance. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 6471–6477.
58. Shi, X.; Liu, H.; Shi, W.; Zhou, Z.; Li, Y. Boosting Person Re-Identification with Viewpoint Contrastive Learning and Adversarial Training. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
59. Ni, H.; Li, Y.; Gao, L.; Shen, H.T.; Song, J. Part-aware transformer for generalizable person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 11280–11289.
60. Pu, N.; Zhong, Z.; Sebe, N.; Lew, M.S. A memorizing and generalizing framework for lifelong person re-identification. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13567–13585.
61. Somers, V.; De Vleeschouwer, C.; Alahi, A. Body part-based representation learning for occluded person re-identification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 1613–1623.
62. Huang, M.; Hou, C.; Yang, Q.; Wang, Z. Reasoning and tuning: Graph attention network for occluded person re-identification. IEEE Trans. Image Process. 2023, 32, 1568–1582.
63. Khatun, A.; Denman, S.; Sridharan, S.; Fookes, C. Pose-driven attention-guided image generation for person re-identification. Pattern Recognit. 2023, 137, 109246.
64. Jin, X.; Lan, C.; Zeng, W.; Wei, G.; Chen, Z. Semantics-aligned representation learning for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11173–11180.
65. Bian, Y.; Liu, M.; Wang, X.; Tang, Y.; Wang, Y. Occlusion-Aware Feature Recover Model for Occluded Person Re-Identification. IEEE Trans. Multimed. 2023, 26, 5284–5295.
66. Tan, L.; Xia, J.; Liu, W.; Dai, P.; Wu, Y.; Cao, L. Occluded Person Re-identification via Saliency-Guided Patch Transfer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 5070–5078.
67. Xia, J.; Tan, L.; Dai, P.; Zhao, M.; Wu, Y.; Cao, L. Attention disturbance and dual-path constraint network for occluded person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6198–6206.
68. Yu, C.; Liu, X.; Wang, Y.; Zhang, P.; Lu, H. TF-CLIP: Learning text-free CLIP for video-based person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6764–6772.
69. Wang, K.; Ding, C.; Pang, J.; Xu, X. Context sensing attention network for video-based person re-identification. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–20.
70. Kim, M.; Kim, S.; Park, J.; Park, S.; Sohn, K. PartMix: Regularization strategy to learn part discovery for visible-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18621–18632.
71. Fang, X.; Yang, Y.; Fu, Y. Visible-infrared person re-identification via semantic alignment and affinity inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 11270–11279.
72. Feng, J.; Wu, A.; Zheng, W.S. Shape-erased feature learning for visible-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22752–22761.
73. Almansoori, M.K.; Fiaz, M.; Cholakkal, H. DDAM-PS: Diligent Domain Adaptive Mixer for Person Search. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 6688–6697.
74. Cho, Y.; Kim, W.J.; Hong, S.; Yoon, S.E. Part-based pseudo label refinement for unsupervised person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7308–7318.
75. Lan, L.; Teng, X.; Zhang, J.; Zhang, X.; Tao, D. Learning to purification for unsupervised person re-identification. IEEE Trans. Image Process. 2023, 32, 3338–3353.
76. Shao, Z.; Zhang, X.; Ding, C.; Wang, J.; Wang, J. Unified pre-training with pseudo texts for text-to-image person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 11174–11184.
77. Yan, S.; Dong, N.; Zhang, L.; Tang, J. CLIP-driven fine-grained text-image person re-identification. IEEE Trans. Image Process. 2023, 32, 6032–6046.
78. Sun, R.; Yang, Z.; Zhao, Z.; Zhang, X. Dual-stream coupling network with wavelet transform for cross-resolution person re-identification. J. Syst. Eng. Electron. 2023, 34, 682–695.
79. Wu, L.Y.; Liu, L.; Wang, Y.; Zhang, Z.; Boussaid, F.; Bennamoun, M.; Xie, X. Learning resolution-adaptive representations for cross-resolution person re-identification. IEEE Trans. Image Process. 2023, 32, 4800–4811.
80. He, W.; Deng, Y.; Tang, S.; Chen, Q.; Xie, Q.; Wang, Y.; Bai, L.; Zhu, F.; Zhao, R.; Ouyang, W.; et al. Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 17521–17531.
Figure 1. A complete person ReID model training process.
Table 1. Comparison of experimental results on person re-identification ideas. Entries are mAP/Rank-1 (%) on Market1501, DukeMTMC, and CUHK03; "–" denotes a result or detail not reported.

| Category | Algorithm | Publish | Experimental Condition | Backbone | Market1501 | DukeMTMC | CUHK03 |
|---|---|---|---|---|---|---|---|
| Global and local | GLAD [54] | TMM19 | GeForce GTX 1080 GPU, Intel i7 CPU | Resnet50 | 73.9/89.9 | 62.20/80.0 | –/86.0 |
| Global and local | St-Reid [55] | AAAI19 | – | Resnet50 | 86.7/97.2 | 82.80/94.0 | –/– |
| Semantic attribute | AMD [56] | ICCV21 | – | Resnet50 | 87.2/94.8 | 75.5/88.2 | –/– |
| Semantic attribute | ASG [57] | ICPR20 | – | Resnet50 | 66.6/87.0 | 58.1/76.7 | –/– |
| Viewpoint-invariant | MV-ReID [49] | KBS24 | NVIDIA RTX 5000 | Resnet50 | 89.9/96.1 | 83.7/92.7 | –/– |
| Viewpoint-invariant | VRN [58] | ICASSP23 | NVIDIA GeForce GTX 1080 | Resnet-ibn-101 | 90.2/95.8 | –/– | –/– |
| Domain-aware | PAT [59] | ICCV23 | – | Vit | 81.5/92.4 | –/– | 26.0/25.4 |
| Domain-aware | MEGE [60] | TPAMI23 | A100 GPU | Resnet50 | 46.6/67.6 | 21.8/36.1 | 47.8/49.3 |
| Attention | BPBreID [61] | WACV23 | NVIDIA Quadro RTX 8000 GPU | Resnet50 | 87.0/95.1 | 78.3/89.6 | –/– |
| Attention | PTGAT [62] | TIP23 | NVIDIA 2080 GPU | Resnet50 | 88.2/95.3 | 80.2/89.1 | –/– |
| Image generation | PAGN [63] | PR23 | – | Resnet50 | 78.6/93.5 | 78.1/84.2 | 93.6/95.1 |
| Image generation | SAN [64] | AAAI20 | – | Resnet50 | 88.0/96.1 | 75.5/87.9 | 76.4/80.1 |
Table 2. Comparison of experimental results on person ReID directions. Entries are mAP/Rank-1/Rank-10 (%); "–" denotes a result or detail not reported.

| Category | Algorithm | Publish | Experimental Condition | Backbone | mAP/Rank-1/Rank-10 (%) |
|---|---|---|---|---|---|
| Occlusion | OAFR [65] | TMM23 | NVIDIA GeForce RTX A6000 GPU | Swin Transformer | 66.4/76.7/– |
| Occlusion | SPT [66] | AAAI24 | Nvidia 3090 Ti | Vit | 57.4/68.6/87.5 |
| Occlusion | ADP [67] | AAAI24 | – | Vit | 63.8/74.5/89.6 |
| Video | TF-CLIP [68] | AAAI24 | NVIDIA Tesla A30 GPU | Vit | 89.4/93.0/– |
| Video | CSA-Net [69] | TOMM23 | – | Resnet50 | 84.1/89.7/– |
| Visible-infrared | PartMix [70] | CVPR23 | – | Resnet50 | 74.62/77.78/– |
| Visible-infrared | SAAI [71] | ICCV23 | RTX3090 GPU | Resnet50 | 77.03/75.90/– |
| Visible-infrared | SFL [72] | CVPR23 | – | Resnet50 | 72.33/77.11/97.03 |
| Unsupervised | DDAM-PS [73] | WACV24 | NVIDIA RTX A6000 GPU | Resnet50 | 79.50/81.30/– |
| Unsupervised | PPLR [74] | CVPR22 | – | Resnet50 | 81.50/92.80/98.10 |
| Unsupervised | PuriReid [75] | TIP23 | NVIDIA TITAN V GPU | Resnet50 | 85.80/94.50/98.70 |
| Text-based | UniPT [76] | ICCV23 | Nvidia Tesla V100 GPU | Deit-small Vit | –/68.50/90.38 |
| Text-based | TIReID [77] | TIP23 | RTX3090 24GB GPU | Resnet50 | –/69.57/91.15 |
| Cross-resolution | PEN [78] | JSEE23 | GeForce RTX 2070 | Unet Resnet50 | 82.30/86.00/– |
| Cross-resolution | CRReID [79] | TIP23 | – | Vit Resnet50 | 88.60/89.20/99.80 |