Article

Pedestrian Re-Identification Based on Weakly Supervised Multi-Feature Fusion

School of Electronic Engineering, Guangxi University of Science and Technology, Liuzhou 545026, China
* Author to whom correspondence should be addressed.
Algorithms 2024, 17(10), 426; https://doi.org/10.3390/a17100426
Submission received: 25 July 2024 / Revised: 4 September 2024 / Accepted: 20 September 2024 / Published: 24 September 2024
(This article belongs to the Section Evolutionary Algorithms and Machine Learning)

Abstract

This article proposes a weakly supervised multi-feature fusion method for pedestrian re-identification. A multi-feature fusion mechanism maps feature information from different layers into the same feature space and fuses it into joint deep and shallow features, with the goal of fully exploiting the rich information in the image and improving the performance and robustness of the re-identification model. In addition, by matching target persons against unprocessed surveillance videos, the method only requires knowing that a person's identity appears in a video; no frame of the video needs identity annotation during training. This simplifies the annotation of training images by replacing accurate annotations with broad ones: the pedestrian identities that appear in a video are grouped into one package, and each package is assigned a video-level label. This greatly reduces the annotation workload and transforms the weakly supervised pedestrian re-identification challenge into a multi-instance, multi-label learning problem. Experimental results show that the proposed method is effective and significantly improves mAP.

1. Introduction

Currently, most research efforts focus on supervised learning methods, which have surpassed human-level performance on standard benchmarks. In practical application scenarios, however, these methods still show obvious gaps: they generalize relatively poorly and need a large amount of labeled data to train the model, which runs contrary to the original goal of reducing the burden on human annotators. In addition, data labeling is expensive, and labeled data are difficult to obtain.
In traditional pedestrian re-identification training, detection accuracy benefits greatly from the accurate annotation of existing datasets: the training images are already cropped pedestrian images within each bounding box, and each image is assigned the appropriate label, as shown in Figure 1a. Performing similar labeling on raw footage recorded by multiple non-overlapping surveillance cameras is unrealistic, since annotators find it difficult to remember strangers who are unrelated to them, and surveillance videos contain many pedestrians and strong occlusions. Annotating a large number of pedestrian images in a short period is therefore very difficult and error-prone, as well as expensive and time consuming. Beyond the limitations of the dataset, feature extraction and utilization also have a significant impact on the recognition accuracy of the model. Most researchers build pedestrian re-identification models that predict only from the deep features extracted by the fully connected layer of a CNN. However, this approach can reduce the model's accuracy and generalization ability, because it ignores useful shallow feature information in the image. Many researchers have therefore proposed alternative methods. Zhang et al. [1] proposed a multi-level supervised network (MLSN), consisting of a backbone network and several sub-networks, that exploits both low-level and high-level information to re-identify pedestrians. Other methods apply a deep CNN to top-level feature maps [2] or directly design a more suitable metric loss function to learn the distance between two high-level features [3]. Although many multi-level feature learning methods for pedestrian re-identification already exist, how to better obtain and utilize multi-level information remains a challenging task.
To alleviate these issues, this article designs a weakly supervised multi-feature fusion pedestrian re-identification model. A multi-feature fusion mechanism extracts feature information from different layers into the same feature space and fuses it into joint deep and shallow features, fully exploiting the rich information in the image and improving the performance and robustness of the model. In addition, by matching the target person against unprocessed surveillance videos, it suffices to know that a person's identity appears in the video; no frame of the video needs identity annotation during training. Replacing accurate annotations with broad annotations simplifies the labeling of training images: the pedestrian identities that appear in a video are placed in one package, and each package is assigned a video-level label. This greatly reduces the annotation work and transforms the weakly supervised pedestrian re-identification challenge into a multi-instance, multi-label learning problem, as shown in Figure 1b. A video may therefore carry multiple video-level labels. During inference, the Re-ID task compares pedestrian images obtained from the videos with query images, so images are not organized into packages at application time, as shown in Figure 1c. Extensive experiments demonstrate the feasibility of the proposed weakly supervised model, and its performance is validated against relevant methods on weakly labeled datasets.

2. Related Works

2.1. Pedestrian Re-ID Research

Re-ID was initially studied as a sub-task of multi-camera tracking [4]. In 1997, Huang and Russell [5] first tracked pedestrians across multiple cameras using appearance features and designed a tracking model that fused color and spatiotemporal features based on Bayesian theory. Javed et al. [6] and Chen et al. [7] placed the luminance transfer function in the same subspace as the spatiotemporal features to address changing pedestrian colors when tracking with multiple cameras. In 2005, Zajdel et al. [8] attempted to encode the probabilistic relationship between labels and features from a trajectory using a dynamic Bayesian network. In 2006, Gheissari et al. [9] used visual appearance alone to match pedestrians recorded by different cameras. With the rapid development of large datasets and evaluation metrics, metric learning-based approaches were proposed in 2011: Dikmen et al. [10] and Zheng et al. [11] used metric learning to minimize the distance of the optimal matching pairs, making discriminative similarity metrics a research hotspot [12,13]. In 2012, AlexNet [14] achieved striking results on ImageNet, highlighting the great potential of deep learning in the image domain. After 2014, convolutional neural networks (CNNs) were introduced into pedestrian re-identification tasks, allowing networks to obtain better features from images. Yi et al. [15] and Li et al. [16] designed Siamese neural networks to learn better pedestrian features from the training images and determine whether a pair of input images belonged to the same pedestrian. Several datasets for pedestrian re-identification training and validation, such as CUHK03 [16] and Market-1501 [17], were subsequently made publicly available, making this task a popular research topic in the computer vision community [18]. In 2016, He et al. [19] proposed the residual learning framework, which made deeper network training possible and improved the accuracy of deeper networks. In 2017, Bak and Carr [20] proposed a One-Shot metric learning method that achieved good performance by learning deep texture from a single instance. The literature [21] estimated sequence similarity with deep neural networks by dividing long video clips into multiple short snippets and aggregating the top-ranked snippet similarities while maintaining appearance diversity and temporal information, which was successful on multiple datasets. In 2020, He et al. [22] proposed momentum contrast for unsupervised visual representation learning, constructing a dynamic dictionary with a queue and a moving-average encoder; this enables the construction of a large and consistent dictionary on the fly, facilitating contrastive unsupervised learning. Many excellent results have also appeared in the weakly supervised field in recent years, many of them published in top journals. The history of the development of pedestrian re-identification is shown in Figure 2.
The complete pipeline of the pedestrian re-identification task includes five steps: obtaining raw surveillance video data, frame extraction and preprocessing, pedestrian annotation in the images, model building and training, and pedestrian retrieval. Most current research focuses on the last two steps; as a result, existing Re-ID methods can be divided into two main families: strongly supervised learning and weakly supervised learning [23,24].

2.2. Strongly Supervised Pedestrian Re-Identification Research

2.2.1. Traditional Methods Based on Feature Expression and Metric Learning

The two most important parts of the pedestrian re-identification task are feature extraction and the distance metric. Initial research focused on constructing robust appearance feature descriptions to recognize the same pedestrian across different cameras. Since 2005, features such as color histograms, SIFT [25], Gabor filters [26], and their fusion have been used for pedestrian re-identification [3]. Among them, the color features of pedestrian appearance have the greatest impact on the results, so color and texture features are usually used for fine localization, for example, the color invariant signature in the literature [27] and salience matching in the literature [28]. In addition, the literature [29] has shown that exploiting the silhouette and symmetric structure of pedestrians can also improve recognition accuracy.
In conventional pedestrian re-identification systems, a good distance metric is crucial, because in real environments visual features change under complex environmental influences. A large number of metric learning methods have therefore been developed, with mainstream distance metric learning based on the Mahalanobis distance. Under the learned distance transformation, different images of the same pedestrian are pulled closer and images of different pedestrians are pushed apart in the metric space, so that the learned metrics are more discriminative for re-identification and more robust to the variations of cross-camera pedestrian images. Examples include Boosting [30] and relative distance comparison [31], as well as the KISSME [32] algorithm for large-scale data and its XQDA extension for multi-scene situations. Although these methods have achieved good performance in recent years, the complex and variable environment causes metric-learning-based models to perform poorly in practice.

2.2.2. Pedestrian Re-Identification Based on Deep Learning

Convolutional neural networks have brought great progress to pedestrian re-identification, and deep learning-based algorithms outperform traditional algorithms by a wide margin. On the one hand, deep neural networks are built to mine deeper features; on the other hand, distance metrics studied over many years are integrated into the model as loss functions, and minimizing the loss continuously updates and optimizes the network parameters to obtain stronger recognition capability.
Yi et al. [15] and Li et al. [16] were the first to design Siamese neural networks that use deep learning to extract better pedestrian features and determine whether a pair of input images belongs to the same pedestrian. A large number of studies based on Siamese convolutional neural networks followed. In the literature [33], Varior et al. merged a Long Short-Term Memory (LSTM) module into the Siamese network so that image parts can be processed sequentially while remembering their spatial connections, thus enhancing the extraction of deep features.
Sun et al. [34] used a Part-based Convolutional Baseline (PCB) to emphasize the consistency of each part's content and obtain more fine-grained information. Wang et al. [35] designed the Multiple Granularity Network (MGN) to fuse global and partial features, making the model more effective in scenarios with large variance. Zhang et al. [36] proposed an effective Relation-Aware Global Attention (RGA) module, which captures global structural information for better attention learning. Recently, Gu et al. [37] proposed a novel Twins Contrastive Mechanism (TCM) to provide more appropriate supervision for Re-ID architecture search; it reduces the category overlap between training and validation data and achieves good performance.
On the widely used Market-1501 dataset, leading image-based Re-ID methods have reached a Rank-1 accuracy of 96.1%, exceeding human performance. In particular, MSINet obtains the best mAP of 89.6% and a Rank-1 accuracy of 95.1% on Market-1501, with Gu et al. using Neural Architecture Search (NAS) to significantly improve retrieval performance. Advanced strongly supervised methods have surpassed human recognition ability on single datasets; however, they do not generalize well to other datasets. Weakly supervised methods are more relevant to real-world application scenarios, have received extensive attention from academia and industry, and have motivated researchers to shift toward more challenging settings, i.e., open-world datasets or weakly supervised learning [38].

2.3. Weakly Supervised Pedestrian Re-Identification Research

With the continuous development of pedestrian re-identification, closed-world datasets can no longer meet the requirements of model training, and labeled data are difficult to scale in the open world, so more and more researchers focus on weakly supervised learning methods [39]. Reducing the reliance on strong label information alleviates the problems of small data size and restricted application environments. Weakly supervised algorithms have developed to a certain extent but are not yet mature, and there is still much room for improvement compared with strongly supervised learning. How to construct datasets suitable for weakly supervised methods, and how to convert existing datasets into usable ones for training better weakly supervised models, is the focus of future efforts and the main research content of this paper. Weakly supervised learning methods fall mainly into two categories: semi-supervised learning and unsupervised learning.

2.3.1. Semi-Supervised Re-ID

The concept of semi-supervised learning first appeared in the 1970s [40]. Semi-supervised learning (SSL) aims to complete the learning task with a small amount of labeled data plus unlabeled data. Wu et al. [41] proposed an incremental sampling method (EUG) that gradually increases the number of selected pseudo-labeled candidates: one labeled tracklet per identity first initializes the CNN model, then the candidates with the most reliable pseudo-labels are sampled from the unlabeled tracklets, and finally the CNN model is updated with the selected data.
For the setting where only a small fraction of pedestrians are labeled, the literature [42] generates multi-view clustering pseudo-labels by constructing a set of heterogeneous convolutional neural networks fine-tuned on the labeled portion, and then clustering the unlabeled samples by integrating the features of the multiple heterogeneous CNNs. The network parameters are then updated with both real and pseudo-labeled training data while minimizing the recognition and verification losses, and the process iterates until the pseudo-label estimates no longer change.

2.3.2. Unsupervised Re-ID

Unsupervised methods mainly focus on domain adversarial learning or constructing pseudo-labels using unsupervised clustering. However, these methods have some limitations. Firstly, they do not take into account the Intra-Domain Variations (IDVs) within the target domain, and the features of the same pedestrian in different states can vary considerably. Secondly, these methods focus more on the current features rather than the global features, which may have a large impact on the early stage of training. Early unsupervised pedestrian re-identification methods mainly focused on learning invariant components such as dictionaries, metrics, or saliency, but this led to limitations in discriminability or scalability. Therefore, further research and improvement of unsupervised methods are needed to overcome these limitations and improve the performance and reliability of pedestrian re-identification. The literature [43] proposes a soft multi-label deep learning algorithm for unsupervised pedestrian re-identification, which learns the soft multi-label of each unlabeled pedestrian by comparing the unlabeled person with a set of known reference pedestrians and utilizes the soft multi-labels to address the learning of discriminative information in the absence of pairs of labels in the disjoint camera views. The literature [44] develops a PatchNet to select the portion of interest from the feature map and learn the discriminative features, instead of learning the discriminative features from the whole image, and designs the image-level feature learning loss to utilize all the patch features of the same image as its image-level guide. For deep unsupervised methods [45], cross-camera label estimation is a popular approach. Dynamic Graph Matching (DGM) formulates label estimation as a bipartite graph matching problem. The literature [46] introduces Transferable Joint Attribute-Identity Deep Learning (TJ-AIDL) for the simultaneous learning of attribute semantics and the identity discriminative feature representation space, enabling unsupervised learning in the target domain. For end-to-end unsupervised Re-ID, iterative clustering and Re-ID model learning are proposed. Similarly, the relationships between the samples are used in a hierarchical clustering framework. Soft multi-label learning mines soft-label information from reference sets for unsupervised learning.
Despite the progress of unsupervised cross-domain pedestrian re-identification techniques, their recognition results on publicly available Re-ID datasets are still significantly different from the supervised methods. Future research will focus on improving the performance of unsupervised methods, including exploring new feature representations, improving unsupervised algorithms, and optimizing the unsupervised learning process. In addition, research is needed to investigate how to apply these techniques more efficiently to video surveillance, intelligent security, and other fields, to realize the practical application and promotion of unsupervised cross-domain pedestrian re-identification techniques.

2.4. The Problems That Need to Be Further Addressed in Pedestrian Re-Identification

The difficulties and problems of pedestrian re-identification are mainly divided into the following three aspects:
(1) Pedestrian images in real scenes have many problems, such as changes in pedestrian posture, inconsistent shooting viewpoints, object occlusion, etc., and the global features extracted from the images by a CNN are easily interfered with by these problems.
(2) In traditional pedestrian re-identification settings, it is assumed that each labeled image is a cropped pedestrian image within a bounding box; performing this labeling across the multiple non-overlapping camera views of raw video surveillance is expensive and time consuming, and it remains challenging to make better use of existing datasets and to train with weakly labeled data.
(3) In practice, models are usually required to have high recognition efficiency, but pedestrian re-identification involves heavy data processing because of the many occlusions and small targets, and a large number of enhancements are added to improve performance; these increase the accuracy of the model while also dramatically increasing its spatiotemporal complexity. Training such networks is computationally complex and time consuming, especially without strong computing power, and the intensive computation of deep networks places heavy demands on hardware, making deployment on general-purpose hardware devices difficult.
To address the above difficulties, this paper proposes a weakly supervised algorithm using multi-feature fusion that simplifies the annotation of pedestrian images and transforms the pedestrian re-identification challenge into a multi-instance, multi-label learning problem. The method feeds the pedestrian image data into a ResNet-50 network in the form of packages and obtains classification probabilities through the fully connected layer; these probabilities are modeled as unary terms in a probabilistic graphical model, while relationships between images, based on features such as appearance and texture, are modeled as binary terms. The classification loss function is formed by summing the unary and binary terms, pseudo-image-level labels are obtained by minimizing this loss, and finally the generated pseudo-labels supervise the learning of the deep Re-ID model.

3. Our Approach

Although a weak annotation cannot directly obtain accurate information about pedestrian images, it can reflect the overall dependency relationship between pedestrian images and help establish image difference models across camera views. The overall algorithm model architecture is shown in Figure 3. The orange solid line with a diamond icon represents the forward propagation process during training, the gray solid line represents the initialization and pseudo-label generation process, and the blue solid line with a triangle icon represents the backward propagation process.

3.1. Weakly Supervised Multi-Feature Fusion Pedestrian Re-Identification Algorithm Network Architecture

The weakly supervised multi-feature fusion Re-ID algorithm model proposed in this article consists of four core modules:
(1) Input the pedestrian image data into the ResNet-50 network model in the form of packages to extract multiple features, and use the fully connected layer to obtain rough classification probabilities, which serve as the unary term in the probabilistic graphical model [47].
(2) Introduce a multi-feature fusion module to capture feature information of the pedestrian images from different perspectives, such as appearance and texture [48,49,50], and construct the feature relationships between different pedestrian images as binary terms in the graph model to smooth the generated label information. At test or application time, only these first two modules need to run to obtain relatively accurate detection results.
(3) Combine the unary and binary terms to form a classification loss function, and minimize this loss function to obtain pseudo-image-level labels for each image.
(4) Use the generated pseudo-labels to supervise the learning of deep Re-ID models.
The focus of this article is to construct a mechanism for weakly supervised pedestrian re-identification; therefore, ResNet-50 is used as the feature extraction backbone. Specifically, the last layer of the feature extraction module is a fully connected layer with a SoftMax classifier for rough classification. The resulting person classification probabilities are treated as rough estimates of the pedestrian labels, indicating how likely each person ID is to exist in the package. These rough estimates are then refined to generate more accurate pseudo-image-level labels for each pedestrian image. Package constraints are used to avoid incorrect labeling, reducing the chance of assigning a pedestrian label to an image whose identity does not appear in the package-level annotation; conversely, they encourage assigning the pedestrian IDs that do appear in the package-level annotation to the corresponding images. This constraint mechanism improves the robustness and accuracy of the model, ensures consistent and reliable label allocation, and effectively addresses uncertainty and errors in data annotation. A minimal sketch of the rough-classification stage is given below.
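The following is a minimal sketch of module (1), assuming torchvision's ResNet-50; the class and variable names (`PackageClassifier`, `num_ids`) are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn
from torchvision import models

class PackageClassifier(nn.Module):
    """ResNet-50 backbone + fully connected SoftMax head.

    Given a package (batch) of pedestrian images, returns per-image rough
    classification probabilities over the m identities, later used as the
    unary term of the probabilistic graphical model.
    """
    def __init__(self, num_ids: int):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.fc = nn.Linear(2048, num_ids)

    def forward(self, package: torch.Tensor):
        f = self.features(package).flatten(1)   # (n_images, 2048) deep features
        probs = torch.softmax(self.fc(f), dim=1)  # rough identity probabilities P
        return f, probs

# Usage: one package of 8 images, m = 751 identities (Market-1501 train split).
model = PackageClassifier(num_ids=751)
feats, P = model(torch.randn(8, 3, 256, 128))
```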

3.2. Image Pseudo-Label Loss Function

In weakly supervised Re-ID, a package-level label $l$ is provided, but the image-level labels $y$ are unknown. Assume a package contains $n$ person IDs and the entire dataset contains $m$ IDs. The initial image-level label $Y_j$ for each image $x_j$, calculated from its package-level label $l$ using Equation (1), then serves as a package constraint for deriving the pseudo-image-level labels $\dot{y}$, which can be further used for supervised model learning.
$$Y_j = \left[ Y_j^1, \ldots, Y_j^k, \ldots, Y_j^m \right], \qquad Y_j^k = \begin{cases} \dfrac{1}{n}, & k \in l \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
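A small sketch of Equation (1), where `package_ids` (the IDs present in the package label $l$) and `num_ids` ($m$) are illustrative names:

```python
import numpy as np

def initial_label(package_ids: list, num_ids: int) -> np.ndarray:
    """Equation (1): Y_j^k = 1/n if identity k is in the package label l, else 0."""
    n = len(package_ids)
    Y = np.zeros(num_ids)
    Y[package_ids] = 1.0 / n
    return Y

Y_j = initial_label(package_ids=[12, 57, 301, 444, 689], num_ids=751)  # 5 IDs/package
```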
This article uses a probabilistic graphical model to generate pseudo-image-level labels for the pedestrian images inside a package. Assume $x_i$ is an image of a person in a package, where $i$ is the image index within the package. Assigning label $y_i$ to $x_i$ incurs a label loss. The label loss $L(y \mid x)$ is defined by Equation (2).
$$L(y \mid x) = \sum_{i \in U} \Phi(y_i \mid x_i) + \sum_{(i,j) \in V} \Psi(y_i, y_j \mid x_i, x_j) \tag{2}$$
In Equation (2), $U$ is the set of images and $V$ is the set of image pairs. $\Phi(y_i \mid x_i)$ is a unary term that measures the loss of assigning label $y_i$ to the person image; $\Psi(y_i, y_j \mid x_i, x_j)$ is a pairwise term that measures the penalty for the label assignment of an image pair $(x_i, x_j)$. Mathematically, the graphical modeling aims to smooth uncertain predictions of person identities: the unary term provides a rough estimate for an image, while the pairwise term smooths the label information of multiple images based on multiple different features.
A unary term is usually defined using Equation (3).
$$\Phi(y_i \mid x_i) = -\,P_i^{y_i} \log\left( Y_i^{y_i} \odot P_i^{y_i} \right) \tag{3}$$
In Equation (3), $P_i$ is the rough classification probability vector for image $x_i$ produced by the last layer of the feature extraction network, and $\odot$ denotes the element-wise product. Because image noise can make the classification values given by a single unary term inconsistent, a binary term constructed from fused multi-feature information is used for smoothing and interaction. A small sketch of the unary term follows.
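The sketch below illustrates the unary term of Equation (3); the small epsilon guarding the logarithm and the function name are illustrative additions, and the minus sign reflects that Equation (2) is minimized.

```python
import numpy as np

def unary_term(P_i: np.ndarray, Y_i: np.ndarray, y_i: int, eps: float = 1e-12) -> float:
    """Phi(y_i | x_i) = -P_i[y_i] * log(Y_i[y_i] * P_i[y_i]).

    P_i: rough classification probabilities from the network (length m).
    Y_i: package-constraint vector from Equation (1); zero entries make
         identities absent from the package label very costly to assign.
    """
    return float(-P_i[y_i] * np.log(Y_i[y_i] * P_i[y_i] + eps))
```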

3.3. Multi-Feature Fusion Algorithm

Generally speaking, when the lighting is not strong, color features dominate, but they are more sensitive to viewpoint changes than other features, whereas texture features make contour boundaries easy to process. This method therefore selects color and texture features to extract the most salient appearance and posture contours for distinguishing the identities of different pedestrians.
To exploit both low-level and high-level information for pedestrian re-identification, this article introduces a multi-feature fusion mechanism. The module's network structure is shown in Figure 4: a backbone network extracts feature maps, and multiple sub-networks process the shallow features. The backbone is the multi-stage ResNet-50 network. Shallow feature maps tend to capture detailed information, while deep feature maps tend to present global information. Sub-networks with the widely used SoftMax classifier take the shallow feature maps from different layers of the backbone and project them into the same feature space. The extracted deep and shallow features are then concatenated into the joint pedestrian features, which form the binary and unary terms of the label loss function used to obtain appropriate image-level labels. A minimal sketch of this fusion module is given below.
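The following is a minimal sketch of the fusion idea, assuming torchvision's ResNet-50: feature maps from each stage are pooled, projected into the same feature space by per-stage sub-networks, and concatenated into the joint deep-shallow feature. The projection dimension and class name are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiFeatureFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        r50 = models.resnet50(weights="IMAGENET1K_V1")
        self.stem = nn.Sequential(r50.conv1, r50.bn1, r50.relu, r50.maxpool)
        self.stages = nn.ModuleList([r50.layer1, r50.layer2, r50.layer3, r50.layer4])
        self.pool = nn.AdaptiveAvgPool2d(1)
        # One projection sub-network per stage maps each (shallow or deep)
        # feature map into the same dim-dimensional feature space.
        self.proj = nn.ModuleList([nn.Linear(c, dim) for c in (256, 512, 1024, 2048)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)
        feats = []
        for stage, proj in zip(self.stages, self.proj):
            x = stage(x)
            feats.append(proj(self.pool(x).flatten(1)))  # projected stage feature
        return torch.cat(feats, dim=1)                   # joint feature (4 * dim)

joint = MultiFeatureFusion()(torch.randn(2, 3, 256, 128))  # -> shape (2, 1024)
```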
Assume $X = \{x_1, x_2, \ldots, x_n\}$ is the available dataset, where $x$ is a training image, and let $R = \{\xi \mid \xi \in \mathbb{R}^N\}$ denote the sample space. For the given feature spaces $A$ and $B$, the two feature vectors $T_1$ and $T_2$ extracted from the same image $\xi$ can be represented as $A = \{T_1 \mid T_1 \in \mathbb{R}^p\}$ and $B = \{T_2 \mid T_2 \in \mathbb{R}^q\}$.
For a given detection image $x$, a $d$-dimensional feature vector is extracted and represented by Equation (4).

$$x = (x_1, x_2, \ldots, x_d) \in \mathbb{R}^d \tag{4}$$
The combined feature vectors are represented by Equation (5).
$$U = \left[ x_i, \ldots, x_n;\; x_j, \ldots, x_n;\; x_d, \ldots, x_n \right] \tag{5}$$
The similarity between the two image samples can be determined based on the Mahalanobis distance [51], calculated using Equation (6).
$$W_{ij} = \exp\left( -\left( x_i - x_j \right)^{T} M \left( x_i - x_j \right) \right) \tag{6}$$
The binary term is defined using Equation (7).
$$\Psi(y_i, y_j \mid x_i, x_j) = \zeta(y_i, y_j)\, Y_i^{y_i}\, Y_j^{y_j}\, W_{ij} \tag{7}$$
Like the unary terms, the binary terms are constrained by the package-level annotations of $y_i$ and $y_j$. Binary terms can provide important knowledge that unary terms cannot capture (such as structural context dependencies). The Potts model provides the simple label compatibility function $\zeta(y_i, y_j) \in \{0, 1\}$ in Equation (7), given by Equation (8).
$$\zeta(y_i, y_j) = \begin{cases} 0, & y_i = y_j \\ 1, & \text{otherwise} \end{cases} \tag{8}$$
It penalizes assigning different labels to similar images. A sketch of Equations (6)-(8) is given below.
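The following sketch illustrates the Mahalanobis-based similarity, the Potts compatibility function, and the resulting binary term. The metric $M$ would be learned in practice; an identity matrix stands in here, and all names are illustrative.

```python
import numpy as np

def similarity(x_i: np.ndarray, x_j: np.ndarray, M: np.ndarray) -> float:
    """Equation (6): W_ij = exp(-(x_i - x_j)^T M (x_i - x_j))."""
    d = x_i - x_j
    return float(np.exp(-d @ M @ d))

def potts(y_i: int, y_j: int) -> int:
    """Equation (8): 0 if the labels agree, 1 otherwise."""
    return 0 if y_i == y_j else 1

def binary_term(y_i, y_j, Y_i, Y_j, W_ij) -> float:
    """Equation (7): penalize giving different labels to similar images."""
    return potts(y_i, y_j) * Y_i[y_i] * Y_j[y_j] * W_ij

W = similarity(np.random.rand(1024), np.random.rand(1024), np.eye(1024))
```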
By minimizing the label loss of Equation (2), the pseudo-image-level label $\dot{y}_i$ of person image $x_i$ is obtained; once generated, these labels are used to update the network parameters. An illustrative minimization is sketched below.
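The paper does not specify the exact solver for Equation (2); the greedy, ICM-style sweep below is a simple stand-in that combines the unary term of Equation (3) with the pairwise term of Equations (7) and (8). All names are illustrative.

```python
import numpy as np

def assign_pseudo_labels(P, Y, W, candidates, n_iters=10, eps=1e-12):
    """P: (n, m) rough probabilities; Y: (n, m) package constraints;
    W: (n, n) pairwise similarities; candidates: IDs in the package label."""
    n = P.shape[0]
    # Initialize each image with its most probable in-package identity.
    labels = [int(candidates[np.argmax(P[i, candidates])]) for i in range(n)]
    for _ in range(n_iters):
        for i in range(n):
            best, best_cost = labels[i], np.inf
            for k in candidates:
                cost = -P[i, k] * np.log(Y[i, k] * P[i, k] + eps)  # unary, Eq. (3)
                for j in range(n):
                    if j != i and labels[j] != k:                  # Potts, Eq. (8)
                        cost += Y[i, k] * Y[j, labels[j]] * W[i, j]  # pairwise, Eq. (7)
                if cost < best_cost:
                    best, best_cost = k, cost
            labels[i] = best
    return labels  # pseudo-image-level labels used to supervise the Re-ID model
```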

3.4. Weakly Supervised Triplet Loss

Although this article does not directly compare against other losses, many previous studies and practices have shown that triplet losses are well suited to pedestrian re-identification. A triplet loss pulls anchor samples and positive samples of the same class closer in the feature space and pushes anchor samples and negative samples of other classes further apart, thereby enhancing the discrimination of different identities. This aligns well with the goal of the pedestrian re-identification task, namely measuring the similarity between pairs of images in the given training and testing datasets. Figure 5 visualizes the triplets.
The requirement that an anchor image $x_i^a$ be closer to every positive image $x_i^p$ of the same person than to any image $x_i^n$ of anyone else can be expressed by Equation (9).
$$\left\| f(x_i^a) - f(x_i^p) \right\|_2^2 + \Delta < \left\| f(x_i^a) - f(x_i^n) \right\|_2^2, \quad \forall \left( f(x_i^a), f(x_i^p), f(x_i^n) \right) \in \Gamma \tag{9}$$
In Equation (9), $\Delta$ is the margin enforced between the positive and negative sample pairs, and $\Gamma$ is the set of all possible triplets in the training dataset, with cardinality $N$.
In order to accelerate the convergence speed of the Re-ID model and increase recognition accuracy, this article proposes a weakly supervised triplet loss function, which is defined by Equation (10).
$$L_{\text{weak\_triplet}} = \sum_{k=1}^{N} \left( \Delta + \operatorname*{median}_{\substack{j \neq k;\; j = 1, \ldots, n \\ \dot{y}_k = \dot{y}_j}} \left\| z_k - z_j \right\|_2^2 - \min_{\substack{j \neq k;\; j = 1, \ldots, n \\ \dot{y}_k \neq \dot{y}_j}} \left\| z_k - z_j \right\|_2^2 \right) \tag{10}$$
In Equation (10), $n$ is the number of samples in the training batch and $\|\cdot\|_2^2$ denotes the squared $L_2$ norm. The image-level pseudo-labels $\dot{y}_k$ and $\dot{y}_j$ relax the maximum operation of the standard triplet loss to a median operation; $\dot{y}_k = \dot{y}_j$ means that samples $k$ and $j$ belong to the same class.
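The sketch below illustrates Equation (10) in PyTorch: the hard-positive maximum of the standard batch-hard triplet loss is relaxed to the median over same-pseudo-label samples. The outer hinge (relu) is a standard triplet-loss detail assumed here, and the names are illustrative.

```python
import torch

def weak_triplet_loss(z: torch.Tensor, pseudo: torch.Tensor, margin: float = 0.3):
    """z: (n, d) embeddings; pseudo: (n,) image-level pseudo-labels."""
    dist = torch.cdist(z, z).pow(2)                  # ||z_k - z_j||_2^2
    same = pseudo.unsqueeze(0) == pseudo.unsqueeze(1)
    eye = torch.eye(len(z), dtype=torch.bool)
    losses = []
    for k in range(len(z)):
        pos = dist[k][same[k] & ~eye[k]]             # same pseudo-label, j != k
        neg = dist[k][~same[k]]                      # different pseudo-label
        if len(pos) == 0 or len(neg) == 0:
            continue                                  # no valid triplet for this anchor
        losses.append(torch.relu(margin + pos.median() - neg.min()))
    return torch.stack(losses).mean() if losses else z.new_zeros(())

loss = weak_triplet_loss(torch.randn(16, 128), torch.randint(0, 5, (16,)))
```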

4. Experiment

4.1. Experimental Setup and Environment

The feature extraction network uses the ResNet-50 backbone, initialized with ImageNet pre-trained parameters; the remaining parameters are initialized by sampling from a normal distribution. Training uses mini-batches of 128 images and an initial learning rate of 0.01 (0.09 for the fully connected layers), reduced to 0.001 after 100 iterations, with a momentum of 0.8 and a weight decay of 0.0005 throughout. The experimental hardware configuration and software environment are shown in Table 1. A sketch of this optimizer setup is given below.
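A minimal sketch of the training setup described above, assuming standard PyTorch SGD; the backbone-vs-FC parameter split and the decay factor are illustrative.

```python
import torch
from torchvision import models

model = models.resnet50()  # stands in for the Re-ID network
base_params = [p for n, p in model.named_parameters() if "fc" not in n]
fc_params = [p for n, p in model.named_parameters() if "fc" in n]
optimizer = torch.optim.SGD(
    [{"params": base_params, "lr": 0.01},   # initial learning rate
     {"params": fc_params, "lr": 0.09}],    # higher rate for the FC layers
    momentum=0.8, weight_decay=0.0005)
# Reduce the base learning rate from 0.01 to 0.001 after 100 iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100], gamma=0.1)
```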

4.2. Experimental Dataset

For the convenience of comparison with other methods, the dataset used in this article’s experiment was directly set to weakly annotated from the publicly available open-source dataset. Because the testing process is the same as fully supervised, it is only necessary to replace the training set in the original dataset with packet-level annotations. The specific approach is also very simple. You can choose to put some images in a package, and the package-level annotation is the weak label of these images. For example, if each image in the original dataset has detailed annotation information {XingFa C1_1, QiaoQian C2_1, 6, QiangGuo C5, and 05} is summarized as a package-level label, it is {XingFa, QiangGuo, QiaoQian}. One just needs to know that there are pedestrian images inside, without needing to know the detailed information. When a package contains n pedestrian IDs, it is represented as n IDs/package. The weakly supervised learning setting determined in this article based on experimental results is 5 IDs per packet.
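The following sketch illustrates converting a strongly annotated training set into package-level weak annotations: per-image identity labels are discarded and only the set of identities per package is kept. The function and field names are illustrative.

```python
import random

def make_packages(images, ids_per_package=5):
    """images: list of (image_path, identity) pairs with full annotations.
    Returns (package_images, package_label) pairs; the package label is
    just the set of identities present, with no per-image assignment."""
    by_id = {}
    for path, pid in images:
        by_id.setdefault(pid, []).append(path)
    all_ids = list(by_id)
    random.shuffle(all_ids)
    packages = []
    for i in range(0, len(all_ids), ids_per_package):
        ids = all_ids[i:i + ids_per_package]
        imgs = [p for pid in ids for p in by_id[pid]]
        packages.append((imgs, set(ids)))  # weak, package-level label
    return packages
```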

4.3. Experimental Results

The training results are shown in Figure 6. As the number of training epochs increases, the training loss gradually decreases while the validation loss remains relatively stable, as shown in Figure 6a. The average precision-recall curve gradually stabilizes as the epochs increase, as shown in Figure 6b. At score_threshold = 0.5, the model recall is 72.29%, as shown in Figure 6c, and the harmonic mean curve in Figure 6d reaches an average of 78%. The average accuracy of the model is 80.17%, as shown in Figure 6e.
From these plots, it can be seen that the model exhibits a good learning trend during the training process, can balance precision and recall to a certain extent, and has a high average accuracy. However, in practical applications, the score threshold must be adjusted according to the specific task requirements to achieve the best balance between precision and recall.

4.4. Ablation Experiment

(1) Verifying the probabilistic graph module and the multi-feature fusion module. When neither module is used, performance is relatively poor; specific numbers are given in Table 2. Without the probabilistic graph module, the rough estimate $Y_j$ from Equation (1) serves as the pseudo-label for each image. With the probabilistic graph module, the generated pseudo-label information is more accurate: recognition accuracy improves significantly on the Market-1501 and CUHK03 datasets, with Rank-1 on CUHK03 rising from 56.4% to 62.3%, an increase of 5.9%. The probabilistic graph module proposed in this article thus has a positive effect on the recognition accuracy of the model.
The binary term constructed from multi-feature fusion information smooths the generated image labels, and with more accurate label supervision a Re-ID model with higher recognition accuracy can be trained. As with the probabilistic graph module, experiments were conducted on the CUHK03 and Market-1501 datasets; the results are shown in Table 2. When the binary terms provided by the multi-feature fusion module participate in pseudo-label generation and model training, the accuracy on both test sets improves significantly, especially on CUHK03, where accuracy rises from 56.4% to 67.5%, an increase of 11.1%. Using both modules together yields higher accuracy than either module alone, because both modules focus on generating more accurate label information and reinforce each other, making the label allocation within a package more accurate.
(2) Expanding the number of weakly annotated categories significantly improves model accuracy. The training set of CUHK03 was divided into four subsets with different numbers of categories to demonstrate their impact on model accuracy. Table 3 presents the Rank-1, Rank-5, and Rank-10 accuracies; it is evident that as the number of weakly labeled categories increases, the recognition accuracy of the model improves significantly.
(3) According to the experimental results, more weakly labeled categories benefit model accuracy; however, more pedestrians per package hurt it. During model initialization, pseudo-labels must be assigned to images, and as the number of pedestrians grows, the uncertainty of pseudo-label allocation increases, making it harder for images to obtain correct labels. Experiments were therefore conducted on the CUHK03 and PRID2011 datasets, grouping the test-set images into packages with 1 pedestrian ID (equivalent to fully supervised), 2, 3, or 10 pedestrian IDs, or a random number of pedestrian IDs. The results in Table 4 show that the accuracy of the model trained with weakly labeled data is not significantly different from that of the model trained with strongly labeled data (for example, the Rank-1 accuracies of 69.6% and 71.5% in Table 4), while weak annotation saves far more computational cost and annotation effort than strong annotation. However, as the number of person IDs per package increases, the accuracy of the weakly supervised method gradually decreases: when the number of pedestrian IDs per package grows from 2 to 10, model accuracy drops by 14.9%, indicating that the increase in uncertain identity labels makes it difficult to allocate correct labels. Notably, the random-number experiment performs attractively (Rank-1 accuracy of 69.6% vs. 71.5%); here each package contains a random number of person IDs (5 on average), reflecting the state of the real world. The results indicate that solving the weakly supervised Re-ID problem is feasible and attractive in practice.

4.5. Comparison with Existing Technology

Strongly supervised learning requires training data with complete and accurate labeling information; whereas unsupervised learning does not rely on any labeling information, and weakly supervised learning falls somewhere in between. By using partially labeled or incompletely labeled data to train the model, the reliance on a large amount of fully labeled data can be alleviated to a certain extent, improving the flexibility and usefulness of learning.
Since existing strongly supervised pedestrian re-identification methods cannot be applied directly in the weakly supervised setting of this paper, the proposed method is compared with unsupervised and weakly supervised pedestrian re-identification methods, as well as with strongly supervised methods, to confirm its effectiveness. Table 5 compares the experimental results of the proposed method with strongly supervised, unsupervised, and weakly supervised pedestrian re-identification models. For strongly supervised learning, PCB and PCB+RPP [34], MGN, SVDNet [52], MSMG-Net [53], and MSAN [54] are used. For unsupervised learning, CAMEL [55], PAUL, OIMI [56], DAL [57], DBC [58], and UTAL [59] are used. For weakly supervised learning, CV-MIML, WSDDN [60], HSLR [61], and SSLR [61] are used in the comparative experiments; CAMEL is a traditional two-stage framework that first extracts image features and then learns an asymmetric representation. Because the best-performing unsupervised models in Table 5, DBC and PAUL, report no experiments on the CUHK03 dataset, this paper compares them only on the Market-1501 and DukeMTMC datasets. The results in Table 5 show that the proposed weakly supervised method obtains a significant performance improvement over the unsupervised Re-ID methods and is considerably better than the best-performing unsupervised models, DBC and PAUL. The mAP of the proposed method on DukeMTMC is 59.6%, slightly below the 60.7% of the weakly supervised approach WSDDN but above the 59.3% of the otherwise best-performing CV-MIML on the same dataset. Compared with other weakly supervised methods, the proposed method thus greatly improves performance and is effective; its experiments on the CUHK03 dataset also perform well.
To further demonstrate the effectiveness of the proposed method, it is compared with the strongly supervised methods PCB, PCB+RPP, MGN, SVDNet, MSMG-Net, and MSAN; the results are shown in Table 5. The proposed model still shows some gap to the strongly supervised models. Given that the method saves annotation work, reduces cost, and is easy to deploy, weakly supervised Re-ID provides a good balance between annotation effort and accuracy. These results validate the effectiveness and feasibility of the proposed method.
To visualize the performance of the proposed method, the CMC curves of some compared methods on the Market-1501 dataset are shown in Figure 7. The weakly supervised multi-feature fusion model outperforms the purely unsupervised algorithms on Market-1501 and is not far from the performance of the strongly supervised algorithms.
Finally, the recognition results for test images from the DukeMTMC and Market-1501 datasets are visualized in Figure 8 and Figure 9. The first column is the query image, covering pedestrian images with occlusion, severe lighting changes, and low resolution; the returned matching results follow, with correctly matched images marked by green dashed boxes. The incorrectly matched images are all close in appearance to the query images.

5. Conclusions

(1) This article aims to eliminate the expensive labeling work of traditional pedestrian re-identification (strong supervision requires large, accurately annotated datasets, which raises annotation costs) by constructing a weakly supervised multi-feature fusion pedestrian re-identification algorithm. In this weakly supervised setting, there is no need to annotate specific individuals in the surveillance video; the only requirement is to indicate whether someone appears in a given video, that is, whether they are in a package. Under such a setting, given a set of detected person images, the model can search for a given person and the videos in which they appear.
(2) This article transforms the weakly supervised pedestrian re-identification problem into a multi-instance, multi-label problem, using multi-feature fusion to capture the dependency relationships between the images in each weakly annotated package. The method can mine potential intra-class variations within packages across all camera views and potential cross-view variations of the same person across packages, and it uses a graph model to assign pseudo-labels to images.
(3) Finally, this model allows training without fine-grained image labels. The experimental results verify the feasibility of training the model with far less annotation work (while improving accuracy over unsupervised methods) and also demonstrate the effectiveness of the proposed multi-feature fusion module and graph model.

Author Contributions

C.Q., L.Z., Q.P., G.L. (Guixing Lin): conceptualization, methodology, software, writing—original draft preparation, writing—review and editing, data curation. Z.W.: supervision, project administration, funding acquisition, resource, validation. C.Q., G.L. (Guanlin Lu): investigation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National (Guangxi) College Students' Innovation and Entrepreneurship Training Program (grant 202310594038; funder C.Q.); the National Natural Science Foundation of China (grants 62466004, 61962007, and 62266009; funder Z.W.); the Key Projects of the Guangxi Natural Science Foundation (grant 2018GXNSFDA294001; funder Z.W.); and the Guangxi Key Laboratory of Big Data in Finance and Economics (grant FEDOP2022A06; funder Z.W.).

Data Availability Statement

The dataset used in this study can be found at https://pan.baidu.com/s/1fIKtQbZ7CZ4ONmsqv6ke-g?pwd=1234 (extraction code: 1234). Accessed on 25 July 2024.

Acknowledgments

I would like to extend my sincere gratitude to my supervisor, Wang Zhiwen, for his instructive advice and useful suggestions on my thesis. I am deeply grateful for his help in the completion of this thesis.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, J.; Jiang, F. Multi-level supervised network for person re-identification. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2072–2076. [Google Scholar]
  2. Li, W.; Zhu, X.; Gong, S. Harmonious attention network for person re-identification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2285–2294. [Google Scholar]
  3. Yuan, Y.; Chen, W.; Yang, Y.; Wang, Z. In defense of the triplet loss again: Learning robust person re-identification with fast approximated triplet loss and label distillation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1454–1463. [Google Scholar]
  4. Wang, X. Intelligent multi-camera video surveillance: A review. Pattern Recognit. Lett. 2013, 34, 3–19. [Google Scholar] [CrossRef]
  5. Huang, T.; Russell, S. Object identification in a Bayesian context. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, San Diego, CA, USA, 23–29 August 1997; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1997; pp. 1276–1282. [Google Scholar]
  6. Javed, O.; Shafique, K.; Shah, M. Appearance modeling for tracking in multiple non-overlapping cameras. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 26–33. [Google Scholar]
  7. Chen, K.W.; Lai, C.C.; Lee, P.J.; Chen, C.S.; Hung, Y.P. Adaptive learning for target tracking and true linking discovering across multiple non-overlapping cameras. IEEE Trans. Multimed. 2011, 13, 625–638. [Google Scholar] [CrossRef]
  8. Zajdel, W.; Zivkovic, Z.; Krose, B.J.A. Keeping track of humans: Have I seen this person before? In Proceedings of the 2005 IEEE International Conference on Robotics and Automation, Barcelona, Spain, 18–22 April 2005; pp. 2081–2086. [Google Scholar]
  9. Gheissari, N.; Sebastian, T.B.; Hartley, R. Person reidentification using spatiotemporal appearance. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006; pp. 1528–1535. [Google Scholar]
  10. Dikmen, M.; Akbas, E.; Huang, T.S.; Ahuja, N. Pedestrian recognition with a learned metric. In Proceedings of the Computer Vision-ACCV 2010, Queenstown, New Zealand, 8–12 November 2010; Springer: Heidelberg, Germany, 2011; pp. 501–512. [Google Scholar]
  11. Zheng, W.S.; Gong, S.; Xiang, T. Person re-identification by probabilistic relative distance comparison. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 649–656. [Google Scholar]
  12. Wang, Y.; Hu, R.; Liang, C.; Zhang, C.; Leng, Q. Camera compensation using feature projection matrix for person re-identification. In Proceedings of the 2013 IEEE International Conference on Multimedia and Expo (ICME), San Jose, CA, USA, 15–19 July 2013; pp. 1–6. [Google Scholar]
  13. Wang, Z.; Hu, R.; Liang, C.; Yu, Y.; Jiang, J.; Ye, M.; Leng, Q. Zero-Shot person re-identification via Cross-View consistency. IEEE Trans. Multimed. 2016, 18, 260–272. [Google Scholar] [CrossRef]
  14. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  15. Yi, D.; Lei, Z.; Liao, S.; Li, S.Z. Deep metric learning for person re-identification. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 34–39. [Google Scholar]
16. Li, W.; Zhao, R.; Xiao, T.; Wang, X. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 152–159. [Google Scholar]
17. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1116–1124. [Google Scholar]
  18. Wang, W.C. Research on Pedestrian Re-identification Algorithm Based on Weak Supervision. Master’s Thesis, University of Electronic Science and Technology, Hangzhou, China, 2023. [Google Scholar]
  19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  20. Bak, S.; Carr, P. One-Shot metric learning for person re-identification. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1571–1580. [Google Scholar]
  21. Chen, D.; Li, H.; Xiao, T.; Yi, S.; Wang, X. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1169–1178. [Google Scholar]
  22. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9726–9735. [Google Scholar]
  23. Xia, D.; Guo, F.; Liu, H.; Xia, Y. A review on the progress of open pedestrian re-identification. Data Acquis. Process. 2021, 36, 449–467. [Google Scholar]
  24. Wang, S.; Xiao, S. A Review of Pedestrian Re-identification Research. J. Beijing Inst. Technol. 2022, 48, 1100–1112. [Google Scholar]
  25. Zhao, R.; Ouyang, W.; Wang, X. Unsupervised salience learning for person re-identification. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 3586–3593. [Google Scholar]
  26. Li, W.; Wang, X. Locally aligned feature transforms across views. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 3594–3601. [Google Scholar]
  27. Kviatkovsky, I.; Adam, A.; Rivlin, E. Color invariants for person reidentification. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1622–1634. [Google Scholar] [CrossRef]
  28. Zhao, R.; Ouyang, W.; Wang, X. Person re-identification by salience matching. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 2528–2535. [Google Scholar]
  29. Farenzena, M.; Bazzani, L.; Perina, A.; Murino, V.; Cristani, M. Person re-identification by symmetry-driven accumulation of local features. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2360–2367. [Google Scholar]
  30. Gray, D.; Tao, H. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In Proceedings of the Computer Vision-ECCV 2008, Marseille, France, 12–18 October 2008; Springer: Heidelberg, Germany, 2008; pp. 262–275. [Google Scholar]
  31. Zheng, W.S.; Gong, S.; Xiang, T. Reidentification by relative distance comparison. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 653–668. [Google Scholar] [CrossRef]
  32. Köstinger, M.; Hirzer, M.; Wohlhart, P.; Roth, P.M.; Bischof, H. Large scale metric learning from equivalence constraints. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2288–2295. [Google Scholar]
  33. Varior, R.R.; Shuai, B.; Lu, J.; Xu, D.; Wang, G. A siamese long short-term memory architecture for human re-identification. In Proceedings of the Computer Vision-ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 135–153. [Google Scholar]
  34. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the Computer Vision-ECCV 2018, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 501–508. [Google Scholar]
  35. Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 274–282. [Google Scholar]
  36. Zhang, Z.; Lan, C.; Zeng, W.; Jin, X.; Chen, Z. Relation-aware global attention for person re-identification. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3183–3192. [Google Scholar]
  37. Gu, J.; Wang, K.; Luo, H.; Chen, C.; Jiang, W.; Fang, Y.; Zhao, J. Msinet: Twins contrastive search of multi-scale interaction for object reid. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19243–19253. [Google Scholar]
  38. Zhang, M.; Yu, Z.; Han, Y.; Li, T. A review of pedestrian re-identification for complex scenes. Comput. Sci. 2022, 49, 138–150. [Google Scholar]
  39. Qi, L.; Yu, P.; Gao, Y. A review of pedestrian re-identification research in weakly supervised scenarios. J. Softw. 2020, 31, 2883–2902. [Google Scholar]
  40. Agrawala, A. Learning with a probabilistic teacher. IEEE Trans. Inf. Theory 1970, 16, 373–379. [Google Scholar] [CrossRef]
  41. Wu, Y.; Lin, Y.; Dong, X.; Yan, Y.; Ouyang, W.; Yang, Y. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5177–5186. [Google Scholar]
  42. Xin, X.; Wang, J.; Xie, R.; Zhou, S.; Huang, W.; Zheng, N. Semi-supervised person re-identification using multi-view clustering. Pattern Recognit. 2019, 88, 285–297. [Google Scholar] [CrossRef]
  43. Yu, H.X.; Zheng, W.S.; Wu, A.; Guo, X.; Gong, S.; Lai, J.H. Unsupervised person re-identification by soft multilabel learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 2143–2152. [Google Scholar]
  44. Yang, Q.; Yu, H.X.; Wu, A.; Zheng, W.S. Patch-based discriminative feature learning for unsupervised person re-identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3628–3637. [Google Scholar]
  45. Chen, L.; Ye, F.; Huang, T.; Huang, L.; Weng, B.; Xu, C.; Hu, J. An unsupervised pedestrian re-identification method based on inter-domain merging in camera domain. Comput. Res. Dev. 2023, 60, 415–425. [Google Scholar]
  46. Wang, J.; Zhu, X.; Gong, S.; Li, W. Transferable joint attribute-identity deep learning for unsupervised person re-identification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2275–2284. [Google Scholar]
  47. Wang, G.; Wang, G.; Zhang, X.; Lai, J.; Yu, Z.; Lin, L. Weakly-supervised person Re-ID: Differentiable graphical learning and a new benchmark. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 2142–2156. [Google Scholar] [CrossRef] [PubMed]
  48. Yang, J.; Zhang, C.; Li, Z.; Tang, Y.; Wang, Z. Discriminative feature mining with relation regularization for person re-identification. Inf. Process. Manag. 2023, 60, 103295. [Google Scholar] [CrossRef]
  49. Wang, Z.; Feng, J.; Zhang, Y. Pedestrian detection in infrared image based on depth transfer learning. Multimed. Tools Appl. 2022, 81, 39655–39674. [Google Scholar] [CrossRef]
  50. Zhou, R.; Chang, X.; Shi, L.; Shen, Y.D.; Yang, Y.; Nie, F. Person reidentification via Multi-Feature fusion with adaptive graph learning. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 1592–1601. [Google Scholar] [CrossRef]
  51. Weinberger, K.Q.; Saul, L.K. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 2009, 10, 207–244. [Google Scholar]
  52. Sun, Y.; Zheng, L.; Deng, W.; Wang, S. SVDNet for pedestrian retrieval. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3800–3808. [Google Scholar]
  53. Cui, X.; Liang, Y.; Zhang, W. Dual-Branch Person Re-Identification Algorithm Based on Multi-Feature Representation. Electronics 2023, 12, 1869. [Google Scholar] [CrossRef]
  54. Li, M.; Yuan, L.; Wen, X.; Wang, J.; Xie, G.; Jia, Y. Multi-Scale Attention Network Based on Multi-Feature Fusion for Person Re-Identification. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar] [CrossRef]
  55. Yu, H.X.; Wu, A.; Zheng, W.S. Cross-view asymmetric metric learning for unsupervised person re-identification. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 994–1002. [Google Scholar]
  56. Xiao, T.; Li, S.; Wang, B.; Lin, L.; Wang, X. Joint Detection and Identification Feature Learning for Person Search. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3376–3385. [Google Scholar] [CrossRef]
  57. Chen, Y.; Zhu, X.; Gong, S. Deep association learning for unsupervised video person re-identification. arXiv 2018, arXiv:1808.07301. [Google Scholar]
  58. Ding, G.; Khan, S.H.; Tang, Z.; Zhang, J.; Porikli, F.M. Towards better Validity: Dispersion based Clustering for Unsupervised Person Re-identification. arXiv 2019, arXiv:1906.01308. [Google Scholar]
  59. Li, M.; Gong, S. Unsupervised tracklet person re-identification. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 1770–1782. [Google Scholar] [CrossRef] [PubMed]
  60. Bilen, H.; Vedaldi, A. Weakly supervised deep detection networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2846–2854. [Google Scholar]
  61. Dong, Q.; Zhu, X.; Gong, S. Single-label multi-class image classification by deep logistic regression. Proc. AAAI Conf. Artif. Intell. 2019, 33, 3486–3493. [Google Scholar] [CrossRef]
Figure 1. Weak supervision diagram.
Figure 2. Development history of pedestrian re-identification [19,22].
Figure 3. Weakly supervised algorithm model network architecture.
Figure 4. Multi-feature fusion module.
Figure 5. Schematic diagram of triplet loss.
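As a companion to the schematic in Figure 5, the standard triplet loss can be written as follows. This is the common textbook formulation, given here only for reference; the margin m and the choice of Euclidean distance are generic assumptions, not values reported in this paper.

```latex
% Triplet loss over a batch of N (anchor, positive, negative) triplets.
% f(.) is the embedding network, m > 0 is the margin, [z]_+ = max(z, 0).
\mathcal{L}_{\mathrm{tri}} =
  \sum_{i=1}^{N}
  \Big[ \big\| f(x_i^{a}) - f(x_i^{p}) \big\|_2
      - \big\| f(x_i^{a}) - f(x_i^{n}) \big\|_2 + m \Big]_{+}
```

Minimizing this loss pulls same-identity pairs (anchor, positive) closer in the embedding space than different-identity pairs (anchor, negative) by at least the margin m.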
Figure 6. Model performance results.
Figure 7. CMC curves for the Market-1501 dataset.
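For readers unfamiliar with how a CMC curve such as the one in Figure 7 is produced, the sketch below computes one from a query-by-gallery distance matrix. It is a minimal illustration under simplified assumptions: it ignores the same-camera/junk-image filtering applied by the official Market-1501 protocol, and the function and variable names are ours, not taken from the paper's code.

```python
import numpy as np

def cmc_curve(dist, query_ids, gallery_ids, max_rank=50):
    """CMC: fraction of queries whose first correct match appears within rank k."""
    num_queries = dist.shape[0]
    hits = np.zeros(max_rank)
    for i in range(num_queries):
        order = np.argsort(dist[i])                     # gallery sorted by ascending distance
        matches = gallery_ids[order] == query_ids[i]    # True where the identity matches
        if not matches.any():
            continue                                    # query has no ground truth in gallery
        first_hit = int(np.argmax(matches))             # 0-based rank of first correct match
        if first_hit < max_rank:
            hits[first_hit:] += 1                       # counts toward every rank >= first hit
    return hits / num_queries                           # CMC values at ranks 1..max_rank
```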
Figure 8. Partial retrieval visualization results on DukeMTMC.
Figure 9. Partial retrieval visualization results on Market-1501.
Table 1. Experimental Environment.

| Configuration | Parameter |
|---|---|
| CPU | 13th Gen Intel(R) Core(TM) i5-13600 @ 3.50 GHz |
| GPU | NVIDIA GeForce RTX 4090 |
| Python | 3.9 |
| PyTorch | 1.12.0 |
| CUDA | 11.4 |
| cuDNN | 8.0.2 |
Table 2. Module effectiveness analysis.

| Module | Market-1501 mAP | Market-1501 Rank-1 | CUHK03 mAP | CUHK03 Rank-1 |
|---|---|---|---|---|
| Neither module (baseline) | 57.4 | 81.9 | 38.5 | 56.4 |
| Probability map module only | 61.1 | 87.2 | 43.3 | 62.3 |
| Multi-feature fusion module only | 63.5 | 89.6 | 45.8 | 67.5 |
| Probability map module + multi-feature fusion module | 65.4 | 91.1 | 49.8 | 71.2 |
Table 3. Data scalability analysis (CUHK03).

| Category | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|
| 36 | 12.6 | 35.6 | 59.6 |
| 227 | 34.3 | 50.3 | 64.3 |
| 467 | 52.6 | 64.5 | 75.0 |
| 767 | 71.2 | 80.3 | 88.0 |
Table 4. Data diversity analysis and comparison with state-of-the-art methods.

| ID/Wrap | CUHK03 mAP | CUHK03 Rank-1 | PRID2011 mAP | PRID2011 Rank-1 |
|---|---|---|---|---|
| 1 (full supervision) | 68.1 | 71.5 | 56.3 | 68.0 |
| 2 | 61.4 | 68.0 | 52.6 | 63.0 |
| 3 | 55.7 | 59.4 | 48.2 | 60.4 |
| 10 | 46.5 | 48.2 | 39.9 | 54.7 |
| Stochastic averaging (5) | 62.3 | 69.6 | 53.0 | 65.4 |
Table 5. Comparative experiments.

| Method | Supervision | Market-1501 mAP | Market-1501 Rank-1 | DukeMTMC mAP | DukeMTMC Rank-1 | CUHK03 mAP | CUHK03 Rank-1 | Effectiveness |
|---|---|---|---|---|---|---|---|---|
| PCB | Strong | 81.6 | 93.8 | 69.2 | 83.3 | 57.5 | 63.7 | More complicated |
| PCB + RPP [34] | Strong | 80.9 | 93.3 | 68.1 | 82.9 | - | - | More complicated |
| MGN | Strong | 86.9 | 95.7 | 78.4 | 88.7 | 68.0 | 67.4 | Better, more complicated |
| SVDNet [52] | Strong | 62.1 | 82.3 | 56.8 | 76.7 | - | - | More complicated |
| MSMG-Net [53] | Strong | 88.6 | 96.3 | 80.1 | 89.9 | - | - | Multi-scale, multi-granularity supervision; difficult to deploy |
| MSAN [54] | Strong | 88.6 | 96.1 | - | - | 68.6 | 72.2 | More complicated |
| CAMEL | Unsupervised | 26.3 | 54.5 | - | - | 31.9 | 39.4 | Worse |
| PAUL | Unsupervised | 40.1 | 68.5 | 53.2 | 72.0 | - | - | Worse |
| OIM [56] | Unsupervised | 13.5 | 33.7 | 43.8 | 51.1 | - | - | Poor |
| DAL [57] | Unsupervised | 23.0 | 49.3 | - | - | - | - | Poor |
| DBC [58] | Unsupervised | 43.8 | 64.3 | 66.1 | 75.2 | - | - | Worse |
| UTAL [59] | Unsupervised | 35.2 | 49.9 | 36.6 | 48.3 | - | - | Worse |
| CV-MIML | Weakly supervised | - | - | 59.53 | 78.5 | - | - | Better |
| WSDDN [60] | Weakly supervised | 47.1 | 59.2 | 60.7 | 65.4 | - | - | Better |
| HSLR [61] | Weakly supervised | 35.8 | 56.4 | 54.7 | 61.7 | - | - | Worse |
| SSLR [61] | Weakly supervised | 31.2 | 51.9 | 50.0 | 56.3 | - | - | Worse |
| OURS | Weakly supervised | 65.4 | 86.6 | 59.6 | 79.3 | 49.8 | 61.2 | Better, lower cost, easier to deploy |
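The mAP figures in Tables 2 and 5 follow the usual re-ID evaluation: average precision is computed per query over the ranked gallery and then averaged over all queries. The sketch below shows that computation in a simplified form, again omitting the junk/same-camera filtering used by the official Market-1501 and DukeMTMC protocols; names and structure are illustrative, not the authors' code.

```python
import numpy as np

def mean_average_precision(dist, query_ids, gallery_ids):
    """Mean of per-query average precision over a ranked gallery."""
    average_precisions = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                      # gallery sorted by ascending distance
        matches = (gallery_ids[order] == query_ids[i]).astype(np.float64)
        if matches.sum() == 0:
            continue                                     # no ground truth for this query
        cumulative_hits = np.cumsum(matches)
        precision_at_k = cumulative_hits / (np.arange(matches.size) + 1)
        average_precisions.append((precision_at_k * matches).sum() / matches.sum())
    return float(np.mean(average_precisions))
```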
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
