Article

Dual-Stage Attribute Embedding and Modality Consistency Learning-Based Visible–Infrared Person Re-Identification

Zhuxuan Cheng, Huijie Fan, Qiang Wang, Shiben Liu and Yandong Tang
1 School of Information Engineering, Shenyang University of Chemical Technology, Shenyang 110142, China
2 State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
3 Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110003, China
4 Key Laboratory of Manufacturing Industrial Integrated, Shenyang University, Shenyang 110044, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(24), 4892; https://doi.org/10.3390/electronics12244892
Submission received: 24 November 2023 / Revised: 1 December 2023 / Accepted: 1 December 2023 / Published: 5 December 2023
(This article belongs to the Special Issue Lifelong Machine Learning-Based Efficient Robotic Object Perception)

Abstract

Visible–infrared person re-identification (VI-ReID) is an emerging technology for realizing all-weather smart surveillance systems. To address the problems that pedestrian discriminative information is difficult to obtain and easy to lose and that the modality difference is large in the VI-ReID task, in this paper we propose a two-stage attribute embedding and modality consistency learning-based VI-ReID method. First, the attribute information embedding module introduces the fine-grained pedestrian information contained in the attribute labels into the transformer backbone, enabling the backbone to extract identity-discriminative pedestrian features. After the pedestrian features are obtained, the attribute embedding enhancement module is utilized to realize the second-stage attribute information embedding, which reduces the adverse effect of losing discriminative person information as the network deepens. Finally, the modality consistency learning loss is designed to constrain the network to mine the consistency information between the two modalities in order to reduce the impact of the modality difference on the recognition results. The results show that our method reaches 74.57% mAP on the SYSU-MM01 dataset in All-Search mode and 87.02% mAP on the RegDB dataset in IR-to-VIS mode, with performance improvements of 6.00% and 2.56%, respectively, demonstrating that our proposed method achieves optimal performance compared to existing state-of-the-art methods.

1. Introduction

Person Re-identification (ReID) technology aims to obtain the required clues by retrieving the target pedestrians across visible light cameras. Specifically, given an image (Query) of the target person to be retrieved, person re-identification technology matches a number of images with high similarity to the target person from a pedestrian gallery. Conventional single-modality pedestrian retrieval mainly addresses the challenges of backgrounds, body poses, viewpoints, and occlusions. These challenges can lead to increased intra-class discrepancies and decreased inter-class discrepancies among pedestrian features. Specifically, intra-class discrepancies indicate the variation of the same pedestrian across multiple images, while inter-class discrepancies indicate the variation between different pedestrians. For single-modality ReID, researchers have proposed supervised ReID approaches [1,2,3,4], unsupervised ReID approaches [5,6], and lifelong ReID approaches [7,8], which can substantially improve identification efficiency. The above approaches show that significant breakthroughs have been made in the research area of single-modality person re-identification.
However, single-modality ReID has difficulty achieving all-weather pedestrian retrieval in practical application scenarios using only the pedestrian images captured by visible light cameras, as it is difficult for visible images to provide effective person discriminative information in low-light scenarios (e.g., at night), while infrared cameras are able to solve this problem very well. Therefore, researchers have proposed the visible–infrared person re-identification (VI-ReID) task. As shown in Figure 1, the VI-ReID task matches the query images from one modality with a pedestrian gallery from another modality. VI-ReID has drawn much attention in the area of computer vision due to its wide application prospects in the field of all-weather intelligent surveillance.
The VI-ReID task is challenging because of the modality differences caused by the heterogeneous imaging principles of infrared and visible cameras; moreover, as shown in Figure 2, problems arising from backgrounds, body poses, viewpoints, and occlusions in single-modality ReID continue to exist in VI-ReID as well. Due to the problems mentioned above, VI-ReID is more complex and challenging than single-modality ReID.
The main challenges in the VI-ReID task, such as modality differences and extraction of identity-discriminative features, need to be further addressed. In this paper, a novel two-stream VI-ReID network is proposed to extract fine-grained features with identity discrimination while reducing modality differences by introducing pedestrian attribute information. First, the attribute information is embedded into the transformer backbone to explore the intrinsic connections between pedestrian attributes and images at the feature level by leveraging the powerful feature extraction ability of the transformer architecture. Then, attribute embedding enhancement is applied to the features in the later stage of the network to establish a two-stage attribute information embedding process that enhances the features’ discrimination ability for different identities, thereby mitigating the impact of easily lost fine-grained information during the training stage. Finally, a modality consistency learning loss is designed to encourage each modality’s classifier to provide consistent identity predictions for the same identity feature, enabling the network to better learn cross-modality consistency features and reduce modality differences.
By reducing the modality differences and learning identity-discriminative features, our network can eliminate intra-class discrepancies and promote inter-class discrepancies, making for improved recognition of pedestrians in complex lighting conditions.
The main contributions of this paper are threefold:
  • We propose a novel attribute information embedding module to learn fine-grained information with modality consistency, which is the first exploration of fusing attribute information with token embeddings in the transformer backbone.
  • We design an attribute embedding enhancement module to implement the secondary embedding of attribute information to ensure that the learned fine-grained discriminative features are not destroyed during training.
  • To reduce modality differences, we design a modality consistency learning loss that can eliminate distribution discrepancy between the predictions for pedestrian images with the same identity.
Having introduced the research motivation and application prospects of the VI-ReID task, in the next section we review current research progress and related approaches in single-modality and cross-modality ReID. The methods section then presents the proposed VI-ReID method based on dual-stage attribute embedding and modality consistency learning, elaborating the attribute information embedding module, the attribute embedding enhancement module, and the modality consistency learning loss. Finally, the superiority of our network is validated by comparing it with state-of-the-art (SOTA) methods on two mainstream datasets and carrying out ablation experiments to verify the effectiveness of the different modules.

2. Related Work

2.1. Single-Modality Person Re-Identification

Single-modality person re-identification typically deals with visible-to-visible matching; existing research has focused on dealing with the challenges of backgrounds, body poses, viewpoints, and occlusions in visible images.
For supervised person re-identification, in 2018 Sun et al. [1] divided the human body into multiple horizontal grids by part-based representation to learn discriminative information. In 2020, Wang et al. [2] utilized a two-stream network including a backbone and a keypoint detection network to jointly extract body part features. In 2022, Zhang et al. [3] proposed the N-tuple loss to uniformly optimize the distances between multiple samples from different classes in the feature distribution. In 2023, Yang et al. [4] designed a Support Example Miner (SEM) and a variant of the triplet loss to correct outlier samples that cause adverse effects. For unsupervised ReID, in 2020 Yu et al. [5] embedded an asymmetric metric into an unsupervised ReID network that applies transformations to features captured from different camera views to address the resulting distortions. In 2022, Song et al. [6] designed a softened label training method that utilizes the cosine similarity among pedestrian features to find identity-identifiable instances. For lifelong ReID, in 2022 Huang et al. [7] proposed a meta-based Coordinated Data Replay (CDR) strategy to alleviate catastrophic forgetting. In 2023, Pu et al. [8] proposed a memorizing and generalizing framework, introducing Adaptive Knowledge Accumulation (AKA) to improve the generalization ability to unseen domains.
Moreover, as supplemental information, pedestrian attributes have proven to be a valid information source for extracting fine-grained features. Using pedestrian attributes as supplemental information can enhance the performance of networks on tasks such as ReID and face recognition. In 2017, Liu et al. [9] first introduced attribute labels to the ReID task, and additionally provided attribute annotations for existing single-modality ReID datasets. In 2019, Li et al. [10] designed an attribute-identity united prediction dictionary learning network that uses predicted attributes as a bridge to build relations between different domains and generates the labels of target instances.

2.2. Visible–Infrared Cross-Modality Person Re-Identification

VI-ReID is challenging because it must bridge the gap between heterogeneous infrared and visible images, which exhibit a large discrepancy between the two modalities. Researchers have proposed a range of approaches to alleviate the modality difference in VI-ReID. These approaches can be broadly categorized into three groups depending on their implementation principles: metric learning, image translation, and feature alignment.
To alleviate modality difference, methods based on image translation aim to use image generation networks to mutually transform two modalities at the image level or generate images of an intermediate modality for two different modalities. In 2018, Dai et al. [11] proposed a visible–infrared generative adversarial network (GAN) to perform image transformation and utilized an alternating generator and discriminator training method to obtain modality-consistent features. In 2019, Wang et al. [12] designed a visible–infrared pairwise image generation network that interconverts cross-modality images by separating the features and decoding the exchanged features. Wang et al. [13] utilized a two-stage discrepancy reduction network containing an encoder layer to process inter-modality and intra-modality differences separately during the image transformation process. In 2020, Li et al. [14] constructed an intermediate modality to mitigate modality difference by using a lightweight architecture to convert cross-modality images to obtain X-modal images. In 2021, Wei et al. [15] designed a modality co-learning network in which the pedestrian images pass through a synchronous modality generation module to improve the nonlinear characterization of the synchronous modality and construct the synchronous modality images that retain the spatial information and pedestrian structure information.
Methods based on metric learning focus on how to use designed loss functions or metric strategies to minimize the feature discrepancy between two modalities and maximize the feature discrepancy between different identities. In 2019, Feng et al. [16] designed a Cross-Modality Euclidean Constraint (CMEC) to eliminate modality difference by maximising the intra-class cross-modality feature similarity of pedestrians. They additionally introduced a View Classifier (VC) loss to constrain the view classifier when learning view-related information. In 2020, Wu et al. [17] designed a Modality-Aware Similarity Preserving (MASP) strategy to make the inter-modality and intra-modality similarity between two instances as equal as possible. This helps to alleviate modality-specific information in the modality-shared space. Ye et al. [18] constructed a graph structure and proposed a Local Aggregation Learning (LAL) loss to encourage the model to extract instance-level feature representations more easily in the early stages. In 2021, Hao et al. [19] designed an MC (Modality Confusion) loss to mitigate the modality discrepancy by setting the true and confused labels of two modalities such that the embedded information extracted by the backbone cannot be correctly classified into the corresponding modality, thereby utilizing the modality confusion to mitigate the modality difference.
The key to feature alignment-based methods lies in how to use the learned features of cross-modality images to construct or transform new features with smaller modality difference. The largest difference between feature alignment and image transformation is the object that is aligned or transformed: the former often operates on features, while the latter generally operates on images. To achieve feature-level modality alignment, in 2020 Lu et al. [20] introduced an inter-modality specific shared representation transfer network to uncover the potential modality-shared and modality-specific representations. In 2021, Chen et al. [21] constructed a dual-level feature search space to fuse the features from two modalities. This dual-level structure can jointly select identity-related clues from the channel and spatial information. Fu et al. [22] proposed a BN (Batch Normalization)-oriented search structure in which the BN layer in the backbone has two alternative paths: independent parameters for the two modalities or shared parameters for the two modalities. Wu et al. [23] proposed a unified modality feature alignment model that uses channel attention-guided instance normalization in the final feature extraction layer to reduce modality difference.
With the rise of transformer networks, researchers have proposed VI-ReID methods based on the Vision Transformer [24]. In 2022, Chen et al. [25] designed a structure-aware positional transformer (SPOT) model that learns the modality-invariant structure feature for each modality. Jiang et al. [26] proposed a Cross-Modality Transformer (CMT) to achieve query-adaptive feature alignment through an instance-level alignment module. In 2023, Liang et al. [27] designed a Cross-Modality Transformer-based network (CMTR) able to generate identity-discriminative features and learn the information of each modality. Zhao et al. [28] proposed a Discriminative Feature Learning Network (DFLN) containing a spatial representation perception module to extract long-term dependencies between different positions. Lu et al. [29] proposed a Progressive Modality-Shared Transformer (PMT) and a Modality-Shared Enhancement (MSE) loss to mitigate the cross-modality difference.
In addition, although attribute information has been applied in single-modality ReID, it takes on a new meaning in the context of VI-ReID, namely, learning the consistency relationship between infrared and visible pedestrian images. As shown in Figure 3, because the attribute information is fixed in the images of both modalities, the consistency relationship between the two modalities can be learned from the attribute information in order to learn more identity-discriminative representations. In 2020, Zhang et al. [30] designed a method that adopts attribute labels as auxiliary information to mine fine-grained features and reduce modality difference, based on the viewpoint that certain color-irrelevant pedestrian recognition clues are consistent in the images of the two modalities. However, they only fine-tuned the network using pedestrian attributes, without considering the intrinsic connections between images at the feature level or the problem that attribute information can easily be lost during training.

3. Methods

As shown in Figure 4, to learn fine-grained information with identity discrimination and reduce modality difference, this paper proposes a VI-ReID method based on dual-stage attribute embedding and modality consistency learning. We first formulate the VI-ReID task in this section. Then, the attribute information embedding module (AIEM) and attribute embedding enhancement module (AEEM) are elaborated separately. Finally, the modality consistency learning (MCL) loss is introduced. The MCL loss, id loss, hetero-center triplet loss, and attribute learning loss are jointly used to optimize our proposed network.
For the input images, $vis$ denotes the visible branch and $ir$ denotes the infrared branch. The visible input can be denoted as $x_{(i,j)}^{vis}$ and the infrared input as $x_{(i,j)}^{ir}$, where $(i,j)$ indicates the $j$th training sample of the $i$th pedestrian in a training batch and $x_{(i,j)}^{vis}, x_{(i,j)}^{ir} \in \mathbb{R}^{H \times W \times C}$, where $H \times W$ is the image size and $C$ denotes the total number of channels. In a training batch, $P$ pedestrian identities are sampled randomly; then, $K$ visible and $K$ infrared pedestrian images corresponding to each identity are sampled as inputs to the network, so that $i \in \{1, 2, \ldots, P\}$ and $j \in \{1, 2, \ldots, K\}$.
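To make the batch construction concrete, the following minimal sketch shows one way such a P × K cross-modality batch could be assembled in Python; the index structure `ids_by_modality` and the function name are illustrative assumptions rather than the authors' implementation.

```python
import random

def sample_pk_batch(ids_by_modality, P=8, K=4):
    """Assemble one training batch: P identities, each with K visible and K infrared images.

    ids_by_modality is assumed to look like
        {"vis": {pid: [image_index, ...]}, "ir": {pid: [image_index, ...]}}
    which is an illustrative structure, not taken from the paper.
    """
    pids = random.sample(sorted(ids_by_modality["vis"].keys()), P)
    vis_batch, ir_batch = [], []
    for pid in pids:
        vis_batch += [(pid, idx) for idx in random.sample(ids_by_modality["vis"][pid], K)]
        ir_batch += [(pid, idx) for idx in random.sample(ids_by_modality["ir"][pid], K)]
    return vis_batch, ir_batch  # each of length P * K
```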

3.1. Attribute Information Embedding Module

Instead of being used to enhance discriminative features, as in single-modality ReID, the perception of pedestrian attributes in VI-ReID can help to learn the consistency relationship between two modalities. However, the intrinsic connection between pedestrian attributes and features is often neglected. In this paper, an attribute information embedding module (AIEM) is introduced in the embedding stage with the aim of learning the consistent pedestrian attribute information in two modalities by fusing attribute information embeddings with token embeddings. The structure of AIEM is shown in Figure 5.
In AIEM, the input image $x$ is first divided into several patches of size $\mathbb{R}^{L \times C \times P \times P}$, where $P$ indicates the size of each patch and $L$ indicates the length of the patch sequence. The patches are then reshaped and transformed by a linear projection into a sequence of token embeddings of shape $\mathbb{R}^{L \times D}$, where $D$ denotes the dimension of the embeddings. A supplemental class token embedding is integrated into the patch sequence to learn class information, and a set of position embeddings is added to the sequence to learn the positional relationships between different patches.
The transformer can naturally integrate attribute embeddings into the feature extraction process, which is an advantage over CNNs. In AIEM, the attribute embeddings are generated from the attribute labels through a fully connected layer that is shared by both modalities, as the attribute information is consistent across the modal images of the same pedestrian. Specifically, the input image is denoted as $x^m$ ($m = vis, ir$), the patch sequence of the input image as $\{x_{p_1}^m, x_{p_2}^m, \ldots, x_{p_N}^m\}$, the position embeddings as $\{e_{p_1}^{pos}, e_{p_2}^{pos}, \ldots, e_{p_N}^{pos}\}$, and the attribute label corresponding to $x^m$ as $\{a_{p_1}, a_{p_2}, \ldots, a_{p_N}\}$; the input embedding $I(x^m)$ of the transformer backbone can then be expressed as follows:
$$I(x^m) = LP\big(\{x_{p_1}^m, x_{p_2}^m, \ldots, x_{p_N}^m\}\big) + \{e_{p_1}^{pos}, e_{p_2}^{pos}, \ldots, e_{p_N}^{pos}\} + FC\big(\{a_{p_1}, a_{p_2}, \ldots, a_{p_N}\}\big)$$
where $LP$ denotes the linear projection in Figure 4 and $FC$ denotes the fully connected layer. After that, $I(x^m)$ is input to the transformer backbone to extract the pedestrian features. The extracted feature can be expressed as follows:
$$f^m = E_T\big(I(x^m)\big)$$
where $E_T$ indicates the backbone with $n$ transformer layers and $f^m$ denotes the image features of the two modalities.
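A minimal PyTorch-style sketch of this first-stage embedding is given below. It assumes a ViT-B/16-like patch layout, treats the attribute label as an eight-dimensional binary vector, and broadcasts the projected attribute embedding over all patch tokens; these choices, along with the class and parameter names, are our assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class AttributeInformationEmbedding(nn.Module):
    """Sketch of AIEM: a shared fully connected layer projects the attribute vector,
    and the result is added to the patch token embeddings together with the position
    embeddings before the transformer layers."""

    def __init__(self, num_patches, patch_dim=16 * 16 * 3, embed_dim=768, num_attrs=8):
        super().__init__()
        self.linear_proj = nn.Linear(patch_dim, embed_dim)            # LP(.)
        self.attr_fc = nn.Linear(num_attrs, embed_dim)                # FC(.), shared by vis and ir
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # class token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patches, attrs):
        # patches: (B, N, patch_dim) flattened image patches; attrs: (B, num_attrs) binary labels
        tokens = self.linear_proj(patches)                 # token embeddings, (B, N, D)
        attr_emb = self.attr_fc(attrs).unsqueeze(1)        # (B, 1, D), broadcast over the N tokens
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        return torch.cat([cls, tokens + attr_emb], dim=1) + self.pos_embed  # I(x^m)
```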

3.2. Attribute Embedding Enhancement Module

The fine-grained information of pedestrian attributes has a high capacity for identity discrimination; however, as the network becomes deeper, it is easy for this information to be lost during the training stage. In order to address this challenge, we design a lightweight attribute embedding enhancement module (AEEM) at the end of the backbone to realize the dual-stage embedding of attribute information, with channel attention used as a guiding mask to achieve the effect of attribute selection. The architecture of the AEEM is shown in Figure 4.
Specifically, the attribute embedding enhancement module first utilizes a fully connected layer and a ReLU function to embed the attribute labels into vectors, then computes the channel weights $w_1$ and $w_2$ and gradually weights the visible and infrared features in order to filter out the attribute-embedded features that are effective for identification. The above process can be expressed by the following equation:
$$w_1 = \mathrm{Sigmoid}\big(\delta\big(FC_2\big([[\, f^m,\ \delta(FC_1(a)) \,]]\big)\big)\big), \qquad w_2 = \mathrm{Sigmoid}\big(BN(f^m \times w_1)\big), \qquad \hat{f}^m = f^m \times w_2$$
where $FC_1$ and $FC_2$ denote the fully connected layers, $\delta$ denotes the ReLU function, $BN$ denotes batch normalization, $[[\cdot]]$ denotes the concatenation operation, $\hat{f}^m$ denotes the visible and infrared modality features enhanced with attribute information, and $a$ denotes the attribute label corresponding to the pedestrian image $x^m$.
In order to constrain the network to better learn the fine-grained information contained in the attribute labels, we set up an attribute classifier to obtain the network’s prediction of the attribute information and optimize it through the cross-entropy loss, which can be calculated as follows:
$$\mathcal{L}_{atr} = -\frac{1}{PK}\sum_{i=1}^{P}\sum_{j=1}^{K} a_i \log C_{atr}\big(f_{(i,j)}^{m}\big)$$
where $C_{atr}(\cdot)$ denotes the attribute classifier and $a_i$ denotes the attribute label corresponding to the $i$th pedestrian.
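The gating described above can be sketched as follows in PyTorch; the layer dimensions, the use of a plain linear head for the attribute classifier, and the module name are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeEmbeddingEnhancement(nn.Module):
    """Sketch of AEEM: attribute labels are embedded by FC1 + ReLU, concatenated with the
    backbone feature, and two successive sigmoid gates w1 and w2 re-weight the feature
    channels to keep the attribute-related, identity-relevant information."""

    def __init__(self, feat_dim=768, num_attrs=8):
        super().__init__()
        self.fc1 = nn.Linear(num_attrs, feat_dim)               # FC1: attribute label -> embedding
        self.fc2 = nn.Linear(2 * feat_dim, feat_dim)            # FC2 on the concatenated vector
        self.bn = nn.BatchNorm1d(feat_dim)
        self.attr_classifier = nn.Linear(feat_dim, num_attrs)   # C_atr, assumed to be a linear head

    def forward(self, f, a):
        # f: (B, feat_dim) backbone feature; a: (B, num_attrs) binary attribute labels
        attr = F.relu(self.fc1(a))
        w1 = torch.sigmoid(F.relu(self.fc2(torch.cat([f, attr], dim=1))))
        w2 = torch.sigmoid(self.bn(f * w1))
        f_hat = f * w2                                   # attribute-enhanced feature
        attr_logits = self.attr_classifier(f_hat)        # prediction used by the attribute loss
        return f_hat, attr_logits
```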

3.3. Modality Consistency Learning

Inspired by Mean-Teacher [23,31,32], we introduce a pair of modality-specific classifiers and a pair of modality-mean classifiers to obtain identity predictions in the two modalities, as shown in Figure 4. Given a feature $f_i^m$ used for identity prediction, if the two modality-mean classifiers provide the same identity prediction, it can be considered that the network has learned the modality consistency information belonging to that identity.
The modality consistency learning loss $\mathcal{L}_{mcl}$ can be obtained by constraining the predictions of the two modality-mean classifiers using the KL (Kullback–Leibler) divergence, which is calculated as follows:
$$\mathcal{L}_{mcl} = \frac{1}{PK}\sum_{i=1}^{P}\sum_{j=1}^{K} d_{KL}\big(\bar{C}_{vis}(f_{(i,j)}^{vis}) \,\|\, \bar{C}_{ir}(f_{(i,j)}^{vis})\big) + \frac{1}{PK}\sum_{i=1}^{P}\sum_{j=1}^{K} d_{KL}\big(\bar{C}_{ir}(f_{(i,j)}^{ir}) \,\|\, \bar{C}_{vis}(f_{(i,j)}^{ir})\big)$$
where $C_{vis}$ and $C_{ir}$ respectively denote the modality-specific classifiers for the visible and infrared modalities, $\bar{C}_{vis}$ and $\bar{C}_{ir}$ respectively denote the corresponding modality-mean classifiers, and $d_{KL}(\cdot\,\|\,\cdot)$ denotes the KL divergence. The parameters of the modality-mean classifiers are determined jointly by themselves and by the modality-specific classifiers, and are updated in a time-averaged manner as follows:
$$\bar{\theta}_t^m = (1-\alpha)\,\bar{\theta}_{t-1}^m + \alpha\,\theta_t^m$$
where $\bar{\theta}_t^m$ indicates the parameters of the modality-mean classifiers at the $t$th iteration, $\theta_t^m$ indicates the parameters of the modality-specific classifiers at the $t$th iteration, and $\alpha \in (0,1]$ is a hyperparameter used to control the update rate.
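The following sketch shows how the consistency term and the time-averaged update could look in PyTorch. The pairing of classifiers inside the KL divergence and the function names follow our reading of the reconstructed equations above and should be treated as assumptions.

```python
import torch
import torch.nn.functional as F

def mcl_loss(vis_mean_on_vis, ir_mean_on_vis, ir_mean_on_ir, vis_mean_on_ir):
    """KL-based consistency between the identity predictions (logits) of the two
    modality-mean classifiers for the same feature, one term per query modality."""
    term_vis = F.kl_div(F.log_softmax(ir_mean_on_vis, dim=1),
                        F.softmax(vis_mean_on_vis, dim=1), reduction="batchmean")
    term_ir = F.kl_div(F.log_softmax(vis_mean_on_ir, dim=1),
                       F.softmax(ir_mean_on_ir, dim=1), reduction="batchmean")
    return term_vis + term_ir

@torch.no_grad()
def update_mean_classifier(mean_clf, specific_clf, alpha=0.3):
    """Time-averaged update of the modality-mean classifier parameters from the
    corresponding modality-specific classifier (alpha = 0.3 as in Section 4.2)."""
    for p_mean, p_spec in zip(mean_clf.parameters(), specific_clf.parameters()):
        p_mean.data.mul_(1.0 - alpha).add_(p_spec.data, alpha=alpha)
```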

3.4. Loss Function

In addition to the proposed attribute learning loss and modality consistency learning loss, we utilize two basic losses from the person re-identification task, the id loss $\mathcal{L}_{id}$ for identity classification and the hetero-center triplet loss $\mathcal{L}_{hc\_tri}$, to train our network; $\mathcal{L}_{id}$ is computed by the following equation:
$$\mathcal{L}_{id} = -\frac{1}{PK}\sum_{i=1}^{P}\sum_{j=1}^{K} y_i \log C_{id}\big(f_{(i,j)}^{m}\big) - \frac{1}{PK}\sum_{i=1}^{P}\sum_{j=1}^{K} y_i \log C_{vis}\big(f_{(i,j)}^{vis}\big) - \frac{1}{PK}\sum_{i=1}^{P}\sum_{j=1}^{K} y_i \log C_{ir}\big(f_{(i,j)}^{ir}\big)$$
where $C_{id}$ is the modality-shared classifier, $C_{vis}$ is the visible modality-specific classifier, $C_{ir}$ is the infrared modality-specific classifier, and $y_i$ is the identity label.
The hetero-center triplet loss relaxes the strict constraint of the conventional triplet loss, leading to a more accurate projection of images from different modalities into the same feature space. The hetero-center triplet loss is expressed by the following equation:
$$\mathcal{L}_{hc\_tri} = \sum_{i=1}^{P}\Big[\rho + \big\|c_i^{vis} - c_i^{ir}\big\|_2 - \min_{\substack{n \in \{vis, ir\} \\ j \neq i}} \big\|c_i^{vis} - c_j^{n}\big\|_2\Big]_{+} + \sum_{i=1}^{P}\Big[\rho + \big\|c_i^{ir} - c_i^{vis}\big\|_2 - \min_{\substack{n \in \{vis, ir\} \\ j \neq i}} \big\|c_i^{ir} - c_j^{n}\big\|_2\Big]_{+}$$
where $\rho$ indicates the margin parameter, $\|\cdot\|_2$ indicates the Euclidean distance, $[\cdot]_{+} = \max(\cdot, 0)$, and $c_i^{vis}$ and $c_i^{ir}$ respectively indicate the visible and infrared feature centers of the $i$th identity. Combining the whole network, the total loss of the proposed method is expressed as follows:
$$\mathcal{L} = \mathcal{L}_{id} + \mathcal{L}_{hc\_tri} + \lambda \mathcal{L}_{atr} + \mathcal{L}_{mcl}$$
where $\lambda$ is a hyperparameter that adjusts the contribution of the attribute learning loss to the total loss.
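As a reference for how these terms could be combined, the sketch below implements a center-based hetero-center triplet term and the weighted sum of the four losses; the batch layout (P identities with K samples per modality sharing one label tensor) and the helper names are assumptions, while the margin and λ follow the values reported in Section 4.2.

```python
import torch
import torch.nn.functional as F

def hetero_center_triplet_loss(feat_vis, feat_ir, labels, margin=0.1):
    """Sketch of the hetero-center triplet loss: for each identity, pull the visible and
    infrared feature centers together and push away the nearest center of any other
    identity from either modality. labels: (B,) identity labels shared by both batches."""
    pids = labels.unique()
    c_vis = torch.stack([feat_vis[labels == p].mean(dim=0) for p in pids])  # (P, D)
    c_ir = torch.stack([feat_ir[labels == p].mean(dim=0) for p in pids])    # (P, D)
    centers = torch.cat([c_vis, c_ir], dim=0)                               # (2P, D)
    loss = feat_vis.new_zeros(())
    for i in range(len(pids)):
        pos = torch.norm(c_vis[i] - c_ir[i], p=2)
        mask = torch.ones(len(centers), dtype=torch.bool)
        mask[i] = False                  # exclude this identity's visible center
        mask[i + len(pids)] = False      # exclude this identity's infrared center
        neg_vis = torch.norm(c_vis[i] - centers[mask], dim=1).min()
        neg_ir = torch.norm(c_ir[i] - centers[mask], dim=1).min()
        loss = loss + F.relu(margin + pos - neg_vis) + F.relu(margin + pos - neg_ir)
    return loss

def total_loss(loss_id, loss_hc_tri, loss_atr, loss_mcl, lam=0.2):
    # Weighted combination of the four objectives (lambda = 0.2 in the paper)
    return loss_id + loss_hc_tri + lam * loss_atr + loss_mcl
```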

4. Experiments

In this section, we first introduce the two mainstream VI-ReID datasets, the evaluation metrics, and the implementation details, then validate the advantages of our proposed method through a comparison with SOTA methods. Finally, an ablation study is implemented for the two modules and the proposed modality consistency learning (MCL) loss to verify their effectiveness, and a visual analysis is presented to demonstrate the superiority of our proposed method.

4.1. Datasets and Evaluation Metrics

SYSU-MM01 [33] is a large-scale VI-ReID dataset that contains 5792 infrared images and 287,628 visible images of 395 pedestrians captured in both outdoor and indoor environments. All the images were captured by two near-infrared cameras and four visible cameras. The testing set contains 3803 infrared images as the query set and 301 (single-shot) or 3010 (multi-shot) randomly selected visible images as the gallery set. There are two evaluation modes in SYSU-MM01: Indoor-Search and All-Search. For the SYSU-MM01 dataset, we utilize the attribute labels proposed by Zhang et al. [30], consisting of gender (male or female), wearing glasses (yes or no), sleeve type (long or short), hairstyle (long or short), length of lower-body clothing (long or short), type of lower-body clothing (dress or pants), carrying a satchel (yes or no), and carrying a backpack (yes or no). For all annotated attribute labels, the value of negative samples is 0 and that of positive samples is 1.
RegDB [34] was captured by a pair of aligned cameras (one far-infrared and one visible) and includes a total of 412 pedestrians and 8240 images; ten images were captured for each person in each modality. RegDB has two evaluation modes: visible-to-infrared (VIS to IR), which retrieves the infrared gallery set using a visible query set, and infrared-to-visible (IR to VIS), which retrieves the visible gallery set using an infrared query set. The RegDB dataset was annotated with the same eight pedestrian attributes as the SYSU-MM01 dataset.
Evaluation Metrics. For the two experimental datasets described above, we employed the standard evaluation metrics used in most VI-ReID methods, namely, the cumulative matching characteristic (CMC) curve and the mean average precision (mAP), to evaluate our network. The former calculates the probability of the target pedestrian appearing in the top N retrieved results; we adopted Rank-1 in our experiments. The latter calculates the average precision over all query identities.
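For reference, a minimal sketch of how Rank-1 and mAP can be computed from a query–gallery distance matrix is given below; the camera-aware filtering used in the official SYSU-MM01 evaluation protocol is omitted, so this is a simplification rather than the exact official evaluation script.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    """dist: (num_query, num_gallery) distance matrix; ids are integer identity arrays."""
    rank1_hits, average_precisions = [], []
    for q in range(dist.shape[0]):
        order = np.argsort(dist[q])                                   # gallery sorted by distance
        matches = (gallery_ids[order] == query_ids[q]).astype(np.int32)
        rank1_hits.append(matches[0])                                 # CMC at rank 1
        if matches.sum() == 0:
            continue                                                  # query identity absent from gallery
        precision_at_hits = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        average_precisions.append((precision_at_hits * matches).sum() / matches.sum())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))
```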

4.2. Implementation Details

The proposed method was realized in the PyTorch 1.7 deep learning framework on one NVIDIA RTX A6000 GPU. ViT-B/16 pretrained on the ImageNet [35] dataset was utilized as the feature extractor, and the number of transformer layers n was set to 4. All pedestrian images input to the network were resized to 384 × 144, and horizontal flipping and random erasing were used for data augmentation. Additional Gaussian blur and color jitter were applied to the infrared images. The hyperparameters $\alpha$ and $\lambda$ were set to 0.3 and 0.2, respectively, and the margin parameter of the hetero-center triplet loss was set to 0.1. The batch size in the training stage was set to 64; eight pedestrian identities were randomly selected in each training batch, with four visible images and four infrared images selected for each identity. The Adam optimizer was used for training, with the base learning rate set to $3 \times 10^{-4}$ and the weight decay set to $1 \times 10^{-4}$.
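A short sketch of a training configuration consistent with these settings is shown below; the exact transform composition and the infrared-only augmentations (Gaussian blur, color jitter) used by the authors are not reproduced here, so treat this as an approximation.

```python
import torch
from torchvision import transforms

# Visible-branch augmentation: resize to 384x144, horizontal flip, random erasing.
# (The additional Gaussian blur and color jitter for the infrared branch are omitted.)
train_transform = transforms.Compose([
    transforms.Resize((384, 144)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.RandomErasing(),
])

def build_optimizer(model):
    # Adam with base learning rate 3e-4 and weight decay 1e-4, as reported above
    return torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4)
```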

4.3. Comparison with State-of-the-Art Methods

In this section, to verify the effectiveness of the proposed method, we compare its performance with different methods on existing mainstream datasets. The following methods were selected for the comparative experiments: (1) image translation-based methods: cmGAN [11], AlignGAN [12], D2RL [13], XIV-ReID [14], SMCL [15], Hi-CMD [36]; (2) metric learning-based methods: MSR [16], CMSP [17], DDAG [18], MCLNet [19], MAUM [37]; (3) feature alignment-based methods: cm-SSFT [20], NFS [21], CMNAS [22], MPANet [23]; (4) transformer-based methods: SPOT [25], CMT [26], CMTR [27], DFLN-ViT [28], PMT [29].
Table 1 shows the comparative results on the SYSU-MM01 and RegDB datasets, where the optimal performance is shown in bold and “-” indicates that the corresponding paper did not provide the data.
The results of the comparison show that our proposed method realizes the optimal Rank-1 and mAP performance in All-Search mode, which are 1.39% and 6.00% higher than those of the sub-optimal method CMT, and achieves the optimal Rank-1 and mAP performance in Indoor-Search mode, which are 1.90% and 3.77% higher than those of CMT. On the RegDB dataset, the proposed method achieves a Rank-1 of 93.42% and mAP of 88.61% in VIS-to-IR mode and a Rank-1 of 92.25% and mAP of 87.02% in IR-to-VIS mode. The Rank-1 result of our proposed method is 1.75% lower than that of the sub-optimal CMT method in VIS-to-IR mode, which may be because ViT does not perform as well as CNNs on small-scale datasets. Analysing the experimental results, it can be concluded that the reasons for the better performance of our method are as follows.
(1) In terms of extracting discriminative pedestrian feature representations, we utilize a pair of transformers with shared weights to learn the representations of two modalities. Then, the attribute information embedding module (AIEM) introduces the pedestrian attribute information into the input embedding and explores the intrinsic connection between the pedestrian attributes and the features during feature extraction to enhance the identity discriminative ability of the features.
(2) The attribute embedding enhancement module (AEEM) enhances the attribute information in the pedestrian features extracted from the backbone, thereby reducing the detrimental effects of identity discrimination information loss due to the deepening network. To better constrain the network to learn attribute information, an attribute classifier is designed and the attribute learning loss is calculated for optimization.
(3) In mitigating modality differences, our method introduces a modality consistency learning strategy to learn modality consistency information by constraining the modality-mean classifier to provide consistent prediction results for different modal features of the same identity.

4.4. Ablation Study

In order to validate the effectiveness of the Attribute Information Embedding (AIE) module, Attribute Embedding Enhancement (AEE) module, and Modality Consistency Learning (MCL) loss, we implemented a detailed ablation study on SYSU-MM01, adding our designed modules to the baseline one by one. The ablation study compared the following four settings: (1) Baseline, using ViT-B/16 as the backbone and adopting the id loss and hetero-center triplet loss as the loss function; (2) Baseline + AIE; (3) Baseline + AIE + AEE; and (4) Baseline + AIE + AEE + MCL. Table 2 shows the ablation results for all settings.
In All-Search mode, the Rank-1 and mAP of Baseline + AIE are improved by 10.64% and 13.59%, respectively, compared with the baseline, which means that AIE effectively introduces pedestrian attribute information into the network by using attribute labels and achieves the purpose of enhancing the identity discrimination ability of the features. After adding AEE, Rank-1 and mAP are improved by a further 11.37% and 10.71%, respectively, which means that AEE achieves the purpose of enhancing the attribute information in the pedestrian features and, to a certain extent, reduces the impact of the loss of pedestrian attribute information caused by the deepening of the network. After adding the MCL loss, Rank-1 and mAP are improved by 2.69% and 2.91%, respectively, indicating that the MCL loss plays its role of constraining the network to learn the modality consistency information. Similar results are obtained in Indoor-Search mode. In conclusion, the ablation study shows that the modules and MCL loss of our method play an obvious role in enhancing the feature identity discrimination ability and reducing the modality difference.

4.5. Visual Analysis

In order to validate the effectiveness of the proposed method more intuitively, as shown in Figure 6, we generated Class Activation Maps [38] (CAM) on the SYSU-MM01 dataset in All-Search mode. Darker colors on the map indicate that the network pays more attention to those regions. From the class activation maps, it is apparent that, compared with the baseline method, which only pays attention to a small area of the person's body, the method proposed in this paper focuses on more fine-grained information related to pedestrian attributes, which is key to the VI-ReID task. Thus, the network can significantly enhance the identity discrimination ability of pedestrian features by introducing the pedestrian attribute information.
In order to explore the effect of the proposed method on the feature distances, as shown in Figure 7, the intra-class and inter-class feature distances were visualized on the SYSU-MM01 dataset in All-Search mode; blue indicates the intra-class feature distance, green indicates the inter-class feature distance, the vertical line denotes the average of the corresponding distribution, and the initial feature distance distribution denotes the distance distribution of the untrained features. Compared with the baseline method, the average intra-class feature distance of the features extracted by our network decreases, indicating that the proposed method succeeds in reducing the modality difference. Meanwhile, the average inter-class feature distance increases, which indicates that the proposed method has better identity discrimination ability.
The identity discrimination and modality difference elimination ability of the network can be more intuitively seen via t-SNE [39]. As shown in Figure 8, each color denotes an identity, with each circle and fork denoting the learned visible and infrared modality features, respectively, and the dotted line in the initial features indicating the modality difference. It is obvious that the visible and infrared features of the same identity extracted by the baseline are scattered, while the feature distribution of our proposed method is more compact.
In order to validate the improvement of our network on the pedestrian retrieval results, infrared and visible images were used as query images on the SYSU-MM01 dataset in All-Search mode. The retrieval results are shown in Figure 9, where red frames denote incorrect retrievals and green frames denote correct retrievals. From the retrieval results, it is obvious that our proposed method recognizes pedestrian identities in the images well and achieves a higher retrieval accuracy.

5. Conclusions

In this paper, we propose a novel dual-stage attribute embedding and modality consistency learning network to address the problem of pedestrian discriminative features being difficult to obtain and easily lost during the training process. The proposed attribute information embedding module can better learn modality-invariant information via the attribute embedding fused with the token embeddings. We have additionally designed an attribute embedding enhancement module to minimize the impact of the loss of learned pedestrian discriminative information resulting from network deepening. Finally, the modality consistency learning loss is used to learn the consistency information between the two modalities. Numerous experimental results on the mainstream datasets prove that our method performs favorably compared to SOTA methods. Considering the potential promotion of temporal information to pedestrian recognition, in future studies, we intend to start working on video-based ReID tasks [40] and utilize images of pedestrians in time series to study person features.

Author Contributions

Methodology, Z.C.; formal analysis, H.F. and Q.W.; writing—original draft preparation, Z.C.; writing—review and editing, Z.C. and S.L.; project administration, Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (62273339, 62073205, U20A20200) and the Youth Innovation Promotion Association Foundation of Chinese Academy of Sciences (2019203).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available in publicly accessible repositories. The data presented in this study are openly available in the SYSU-MM01 and RegDB datasets at https://doi.org/10.1109/ICCV.2017.575 and https://doi.org/10.3390/s17030605, reference numbers [33,34].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline). In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 480–496. [Google Scholar]
  2. Wang, G.; Yang, S.; Liu, H.; Wang, Z. High-Order Information Matters: Learning Relation and Topology for Occluded Person Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6448–6457. [Google Scholar]
  3. Zhang, Z.; Lan, C.; Zeng, W.; Chen, Z.; Chang, S. Beyond Triplet Loss: Meta Prototypical N-Tuple Loss for Person Re-identification. IEEE Trans. Multimed. 2020, 24, 4158–4169. [Google Scholar] [CrossRef]
  4. Yang, S.; Zhang, Y.; Zhao, Q.; Pu, Y.; Yang, H. Prototype-Based Support Example Miner and Triplet Loss for Deep Metric Learning. Electronics 2023, 12, 3315. [Google Scholar] [CrossRef]
  5. Yu, H.; Wu, A.; Zheng, W. Unsupervised Person Re-Identification by Deep Asymmetric Metric Embedding. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 956–973. [Google Scholar] [CrossRef] [PubMed]
  6. Song, Y.; Liu, S.; Yu, S.; Zhou, S. Adaptive Label Allocation for Unsupervised Person Re-Identification. Electronics 2022, 11, 763. [Google Scholar] [CrossRef]
  7. Huang, Z.; Zhang, Z.; Lan, C.; Zeng, W. Lifelong Unsupervised Domain Adaptive Person Re-identification with Coordinated Anti-forgetting and Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14268–14277. [Google Scholar]
  8. Pu, N.; Zhong, Z.; Sebe, N.; Lew, M. A Memorizing and Generalizing Framework for Lifelong Person Re-Identification. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13567–13585. [Google Scholar] [CrossRef] [PubMed]
  9. Liu, X.; Zhao, H.; Tian, M.; Sheng, L. HydraPlus-Net: Attentive Deep Features for Pedestrian Analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 350–359. [Google Scholar]
  10. Li, L.; Yan, S.; Yu, Z.; Tao, D. Attribute-Identity Embedding and Self-Supervised Learning for Scalable Person Re-Identification. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 3472–3485. [Google Scholar] [CrossRef]
  11. Dai, P.; Ji, R.; Wang, H.; Wu, Q.; Huang, Y. Cross-Modality Person Re-Identification with Generative Adversarial Training. In Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 677–683. [Google Scholar]
  12. Wang, G.; Zhang, T.; Cheng, J.; Liu, S. RGB-Infrared Cross-Modality Person Re-Identification via Joint Pixel and Feature Alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3622–3631. [Google Scholar]
  13. Wang, Z.; Wang, Z.; Zheng, Y.; Chuang, Y.Y.; Satoh, S.I. Learning to Reduce Dual-Level Discrepancy for Infrared-Visible Person Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 618–626. [Google Scholar]
  14. Li, D.; Wei, X.; Hong, X.; Gong, Y. Infrared-visible Cross-Modal Person Re-Identification with an X Modality. In Proceedings of the AAAI conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 4610–4617. [Google Scholar]
  15. Wei, Z.; Yang, X.; Wang, N.; Gao, X. Syncretic Modality Collaborative Learning for Visible Infrared Person Re-Identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 225–234. [Google Scholar]
  16. Feng, Z.; Lai, J.; Xie, X. Learning Modality-Specific Representations for Visible-Infrared Person Re-Identification. IEEE Trans. Image Process. 2019, 29, 579–590. [Google Scholar] [CrossRef] [PubMed]
  17. Wu, A.; Zheng, W.; Gong, S.; Lai, J. Person Re-identification by Cross-Modality Similarity Preservation. Int. J. Comput. Vis. 2020, 128, 1765–1785. [Google Scholar] [CrossRef]
  18. Ye, M.; Shen, J.; Crandall, D.; Shao, L.; Luo, J. Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-identification. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 229–247. [Google Scholar]
  19. Hao, X.; Zhao, S.; Ye, M.; Shen, J. Cross-Modality Person Re-Identification via Modality Confusion and Center Aggregation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 16383–16392. [Google Scholar]
  20. Lu, Y.; Wu, Y.; Liu, B.; Zhang, T.; Li, B. Cross-Modality Person Re-Identification With Shared-Specific Feature Transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13376–13386. [Google Scholar]
  21. Chen, Y.; Wan, L.; Li, Z.; Jing, Q.; Sun, Z. Neural Feature Search for RGB-Infrared Person Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 587–597. [Google Scholar]
  22. Fu, C.; Hu, Y.; Wu, X.; Shi, H. Cross-Modality Neural Architecture Search for Visible-Infrared Person Re-Identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 11803–11812. [Google Scholar]
  23. Wu, A.; Dai, P.; Chen, J.; Lin, C.; Wu, Y. Discover Cross-Modality Nuances for Visible-Infrared Person Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4330–4339. [Google Scholar]
  24. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  25. Chen, C.; Ye, M.; Qi, M.; Wu, J.; Jiang, J.; Lin, C. Structure-Aware Positional Transformer for Visible-Infrared Person Re-Identification. IEEE Trans. Image Process. 2022, 31, 2352–2364. [Google Scholar] [CrossRef] [PubMed]
  26. Jiang, K.; Zhang, T.; Liu, X.; Qian, B.; Zhang, Y.; Wu, F. Cross-Modality Transformer for Visible-Infrared Person Re-Identification. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 480–496. [Google Scholar]
  27. Liang, T.; Jin, Y.; Liu, W.; Li, Y. Cross-Modality Transformer with Modality Mining for Visible-Infrared Person Re-Identification. IEEE Trans. Multimed. 2023, 1–13, Early Access. [Google Scholar]
  28. Zhao, J.; Wang, H.; Zhou, Y.; Yao, R.; Chen, S.; Saddik, A. Spatial-Channel Enhanced Transformer for Visible-Infrared Person Re-Identification. IEEE Trans. Multimed. 2023, 25, 3668–3680. [Google Scholar] [CrossRef]
  29. Lu, H.; Zou, X.; Zhang, P. Learning Progressive Modality-Shared Transformers for Effective Visible-Infrared Person Re-identification. In Proceedings of the AAAI conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 1835–1843. [Google Scholar]
  30. Zhang, S.; Chen, C.; Song, W.; Gan, Z. Deep Feature Learning with Attributes for Cross-Modality Person Re-Identification. J. Electronic Imaging 2020, 29, 033017. [Google Scholar] [CrossRef]
  31. Tarvainen, A.; Valpola, H. Mean Teachers are Better Role Models: Weight-Averaged Consistency Targets Improve Semi-Supervised Deep Learning Results. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  32. Ge, Y.; Chen, D.; Li, H. Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  33. Wu, A.; Zheng, W.; Yu, H.; Gong, S.; Lai, J. RGB-Infrared Cross-Modality Person Re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5390–5399. [Google Scholar]
  34. Nguyen, D.; Hong, H.; Kim, K.; Park, K. Person Recognition System Based on a Combination of Body Images from Visible Light and Thermal Cameras. Sensors 2017, 17, 605. [Google Scholar] [CrossRef]
  35. Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Li, F. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  36. Choi, S.; Lee, S.; Kim, Y.; Kim, T.; Kim, C. Hi-CMD: Hierarchical Cross-Modality Disentanglement for Visible-Infrared Person Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10254–10263. [Google Scholar]
  37. Liu, J.; Sun, Y.; Zhu, F.; Pei, H.; Yang, Y.; Li, W. Learning Memory-Augmented Unidirectional Metrics for Cross-modality Person Re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19344–19353. [Google Scholar]
  38. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359. [Google Scholar]
  39. van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  40. Davila, D.; Du, D.; Funk, C.; Van Pelt, J.; Collins, R.; Corona, K. MEVID: Multi-view Extended Videos with Identities for Video Person Re-Identification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 1634–1643. [Google Scholar]
Figure 1. Schematic of the visible–infrared person re-identification task.
Figure 2. Images of one pedestrian captured by different visible light cameras.
Figure 3. Example of attribute information in VI-ReID; each of the colours represents different attribute information. It can be seen that the attribute information is consistent across the two modalities.
Figure 4. Overall network structure of our proposed method. To reduce the modality differences and learn identity-discriminative features, we propose a dual-stage attribute embedding and modality consistency learning network. (1) In the first stage, AIEM fuses the attribute information with token embeddings for fine-grained feature extraction. (2) In the second stage, AEEM implements a secondary embedding to enhance the attribute information. (3) The modality consistency learning, attribute learning loss, hetero-center triplet loss, and id loss are combined to jointly optimize the network.
Figure 5. Attribute Information Embedding Module; * indicates the class token.
Figure 6. Visualization of the class activation maps in two modalities.
Figure 7. Visualization of the intra-class and inter-class feature distance.
Figure 8. Visualization of learned features.
Figure 9. The retrieval results of our proposed method.
Table 1. Performance comparison with SOTA methods on the SYSU-MM01 and RegDB datasets. Each cell reports Rank-1 / mAP (%).

Method | Venue | SYSU-MM01 All-Search | SYSU-MM01 Indoor-Search | RegDB VIS to IR | RegDB IR to VIS
cmGAN | IJCAI-2018 | 26.97 / 27.80 | 31.63 / 42.19 | - / - | - / -
D2RL | CVPR-2019 | 28.90 / 29.20 | - / - | 43.40 / 44.10 | - / -
MSR | TIP-2019 | 37.35 / 38.11 | 39.64 / 50.88 | 48.43 / 48.67 | - / -
Hi-CMD | CVPR-2020 | 34.94 / 35.94 | - / - | 70.93 / 66.04 | - / -
AlignGAN | ICCV-2019 | 42.40 / 40.70 | 45.90 / 54.30 | 57.90 / 53.60 | 56.30 / 53.40
CMSP | IJCV-2020 | 43.56 / 44.98 | 48.62 / 57.50 | 65.07 / 64.50 | - / -
cm-SSFT | CVPR-2020 | 47.70 / 54.10 | - / - | 65.40 / 65.60 | 63.80 / 64.20
XIV-ReID | AAAI-2020 | 49.92 / 50.73 | - / - | 62.21 / 60.18 | - / -
DDAG | ECCV-2020 | 54.75 / 53.02 | 61.02 / 67.98 | 69.34 / 63.46 | 68.06 / 61.80
NFS | CVPR-2021 | 56.91 / 55.45 | 62.79 / 69.79 | 80.54 / 72.10 | 77.95 / 69.79
DFLN-ViT | TMM-2022 | 59.84 / 57.70 | 62.13 / 69.03 | - / - | - / -
CM-NAS | ICCV-2021 | 61.99 / 60.02 | 67.01 / 72.95 | 84.54 / 80.32 | 82.57 / 78.31
CMTR | TMM-2023 | 62.58 / 61.33 | 67.02 / 73.78 | 80.62 / 74.42 | 81.06 / 73.75
SPOT | TIP-2022 | 65.34 / 62.25 | 69.42 / 74.63 | 80.35 / 72.46 | 79.37 / 72.26
MCLNet | ICCV-2021 | 65.40 / 61.98 | 72.56 / 76.58 | 80.31 / 73.07 | 75.93 / 69.49
SMCL | ICCV-2021 | 67.39 / 61.78 | 68.84 / 75.56 | 83.93 / 79.83 | 83.05 / 78.57
PMT | AAAI-2023 | 67.53 / 64.98 | 71.66 / 76.52 | 84.83 / 76.55 | 84.16 / 75.13
MPANet | CVPR-2021 | 70.58 / 68.24 | 76.74 / 80.95 | 83.70 / 80.90 | 82.80 / 80.70
MAUM | CVPR-2022 | 71.68 / 68.79 | 76.97 / 81.94 | 87.87 / 85.09 | 86.95 / 84.34
CMT | ECCV-2022 | 71.88 / 68.57 | 76.90 / 79.91 | 95.17 / 87.30 | 91.97 / 84.46
Ours | - | 73.27 / 74.57 | 78.80 / 83.68 | 93.42 / 88.61 | 92.25 / 87.02
Table 2. Ablation study on the SYSU-MM01 dataset. “Base” indicates the baseline method and bold indicates the optimal performance. Each cell reports Rank-1 / mAP (%).

Setting | All-Search | Indoor-Search
Base | 48.57 / 47.36 | 67.86 / 61.55
Base + AIE | 59.21 / 60.95 | 70.24 / 71.48
Base + AIE + AEE | 70.58 / 71.66 | 75.95 / 81.84
Base + AIE + AEE + MCL | 73.27 / 74.57 | 78.80 / 83.68
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
