Article

Progressive Discriminative Feature Learning for Visible-Infrared Person Re-Identification

1 Department of Criminal Investigation, Hunan Police Academy, Changsha 410138, China
2 Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(14), 2825; https://doi.org/10.3390/electronics13142825
Submission received: 15 June 2024 / Revised: 6 July 2024 / Accepted: 12 July 2024 / Published: 18 July 2024
(This article belongs to the Special Issue Deep Learning-Based Image Restoration and Object Identification)

Abstract

The visible-infrared person re-identification (VI-ReID) task aims to retrieve the same pedestrian between visible and infrared images. VI-ReID is a challenging task due to the huge modality discrepancy and complex intra-modality variations. Existing works mainly complete the modality alignment at one stage. However, aligning modalities at different stages has positive effects on the intra-class and inter-class distances of cross-modality features, which are often ignored. Moreover, discriminative features carrying identity information may be corrupted in the process of modality alignment, further degrading the performance of person re-identification. In this paper, we propose a progressive discriminative feature learning (PDFL) network that adopts different alignment strategies at different stages to alleviate the discrepancy and learn discriminative features progressively. Specifically, we first design an adaptive cross fusion module (ACFM) to learn identity-relevant features via modality alignment with channel-level attention. To better preserve identity information, we propose a dual-attention-guided instance normalization module (DINM), which guides instance normalization to align the two modalities into a unified feature space through channel and spatial information embedding. Finally, we generate multiple part features of a person to mine subtle differences. Multi-loss optimization is imposed during the training process for more effective learning supervision. Extensive experiments on the public SYSU-MM01 and RegDB datasets validate that our proposed method performs favorably against most state-of-the-art methods.

1. Introduction

Person re-identification (Re-ID) is a challenging task in video surveillance. It can be deemed an image retrieval problem that aims to re-associate a specific pedestrian across non-overlapping cameras [1]. The challenges of Re-ID mainly come from backgrounds, body poses, viewpoints, and occlusions. These challenges can lead to increased intra-class variations as well as decreased inter-class variations among pedestrian features. Specifically, intra-class variations indicate the differences of the same pedestrian across multiple images, while inter-class variations mean the differences between different pedestrians. Due to its widespread applications in real-world surveillance and social security, person Re-ID has received much research attention. Existing Re-ID methods [2,3,4,5,6,7,8,9,10] mainly study pedestrian images captured by visible cameras. However, these methods are not practicable under low-light conditions (e.g., at night or in rainy weather) because it is difficult to extract effective identity clues [11]. One solution to this problem is to capture images in low-light conditions with infrared cameras and then perform visible-infrared person re-identification (VI-ReID). The main challenge of VI-ReID comes from the modality discrepancy between heterogeneous images: because of different imaging mechanisms, the colors and textures of a person differ across modalities. Due to this modality discrepancy, VI-ReID is more complex and challenging than visible-visible person re-identification (VV-ReID).
Some pioneering methods have been proposed to alleviate the modality discrepancy [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]. These methods mainly address the challenge through three approaches: metric learning, image translation, and modality alignment. As shown in Figure 1a, most existing metric learning-based methods constrain the network to obtain discriminative and modality-shared features with specifically designed loss functions. Figure 1b illustrates the process of image translation-based methods, which alleviate the modality discrepancy between visible and infrared inputs by converting the two modalities to an intermediate modality at the image level. However, these methods are limited by the performance of existing image generation networks. The modality alignment-based methods aim to extract consistent modality semantics in the aligned visible and infrared domains by changing the distribution of features or pixels. Their main challenge is that changing the feature distribution can damage identity information and thus adversely affect the VI-ReID task.
To fully alleviate modality discrepancy and explore identity-discriminative information, we propose a novel VI-ReID framework, termed the progressive discriminative feature learning (PDFL) network. As shown in Figure 1c, the motivation of PDFL is to align visible features with infrared features by instance normalization (IN) while preserving identity information.
Moreover, we observe that directly applying IN at different stages of feature extraction has different impacts on the intra-class and inter-class distances of cross-modality features. As shown in Figure 2, the blue section denotes the intra-class distance distribution and the green section denotes the inter-class distance distribution of multi-modal features. We expect the intra-class distances to be as small as possible, which means the blue section should lie as close to the origin as possible. In contrast, we expect the inter-class distances to be as large as possible, which means the green section should lie as far from the origin as possible. We further define a variable $\delta$ to measure the distance between the two distributions (i.e., between their means, marked by the vertical lines in Figure 2); in principle, the larger $\delta$ is, the better. We observe that multi-stage IN performs better in reducing the intra-class distance than IN applied at any single stage, where $\delta_1 < \delta_2 < \delta_3 < \delta_4$. This indicates that multi-stage IN can effectively reduce the discrepancy between features of the same identity but different modalities. However, the effect of IN on the inter-class distance is not as pronounced as that on the intra-class distance. In other words, IN does not enhance the network's ability to distinguish pedestrians with different identities. One possible reason is that IN changes the feature distribution, and this change damages the identity information to some extent, which adversely affects identification.
Based on the above observations, we design two modules to avoid damaging identity information during the alignment process. Specifically, at an early stage of feature extraction, the adaptive cross fusion module (ACFM) adaptively fuses the input feature with the instance-normalized feature of the two modalities. ACFM balances the alleviation of modality discrepancy and the preservation of identity information by fusing two features with heterogeneous distributions. At a late stage of feature extraction, the dual-attention-guided instance normalization module (DINM) extracts more discriminative modality-consistent features. DINM ensures that the alignment process does not damage identity information by means of dual-attention guidance: it generates guidance masks that embed both channel and spatial information, where the channel information ensures that IN is not performed on identity-relevant channels, and the spatial information provides accurate body locations to cope with variable pedestrian poses.
The introduction of local information has been proven to be beneficial for mining discriminative features in the Re-ID task [2,4,5,6,7,23,28]. Inspired by this observation, we introduce part-level feature learning (PFL) to generate different body part features of a person and learn subtle differences.
The main contributions of this paper can be summarized as follows:
  • We propose a novel progressive discriminative feature learning (PDFL) network, which achieves feature-level multi-stage modality alignment for VI-ReID. This enables our network to learn discriminative features in a unified feature space.
  • To align features while preserving identity information, an adaptive cross fusion module (ACFM) and a dual-attention-guided instance normalization module (DINM) are proposed to align modalities progressively through feature fusion and selectively applied instance normalization.
  • Extensive experiments are conducted on two mainstream benchmarks of VI-ReID, and the results validate the superior performance of PDFL against the existing methods.

2. Related Works

2.1. Single-Modality Person ReID

Single-modality person Re-ID typically deals with visible-to-visible matching, and existing research has focused on handling intra-class variations in backgrounds, body poses, viewpoints, and occlusions in visible images. In existing deep learning-based approaches [5,6,7,23,28,29], improvements in this task come mainly from two aspects: representation learning and metric learning. The representation-based approaches exploit local and global information. Sun et al. [2] divided the human body into multiple horizontal grids by part-based representation to obtain discriminative features. Hou et al. [3] further enhanced the global discriminative power of body features by utilizing attention information. Wang et al. [4] utilized a two-stream network that includes a backbone and a keypoint detection network to jointly extract body part features. Sarfraz et al. [5] incorporated the detected joint positions into the CNN to learn effective body part representations. Zhang et al. [6] focused on distinguishing body parts by mining global structure-related information and considered the relationship between each body part feature and the global features. Wu et al. [7] designed a channel-level attention module to extract local features in different channels. Li et al. [8] proposed a lightweight network to focus on combined pattern information while reducing the retrieval cost. Meng et al. [9] proposed a metric learning-based method that computes the consistency of the spatial graphs of consecutive frames. Zhang et al. [10] proposed an N-tuple loss to jointly optimize the distances between multiple samples from multiple classes in the feature space. Although the above methods have achieved great breakthroughs in addressing intra-class variations, most existing VV-ReID methods have difficulty performing cross-modality image retrieval in complex lighting environments.

2.2. Visible-Infrared Cross-Modality ReID

For the VI-ReID task, it is challenging to bridge heterogeneous visible and infrared images due to the large discrepancy between the two modalities. To address this challenge, many VI-ReID approaches [30] have been proposed. Nguyen et al. [31] constructed the first VI-ReID dataset to decrease the impact of noise in person recognition. Ye et al. [11] utilized cross-modality feature distributions and contextual information as auxiliary information to improve retrieval performance. Wu et al. [12] constructed a new large-scale VI-ReID dataset and proposed a deep zero-padding network that learns features in a unified space. Wei et al. [17] proposed a co-attentive learning mechanism to learn discriminative features in each modality and bridge the cross-modality gap. Other methods mitigate the modality gap from the image translation perspective, which first achieves modality unification and then learns modality-shared representations. Examples include the GAN-based approaches AlignGAN [13] and JSIA [14]. AlignGAN designed a generative adversarial network to transfer cross-modality features into a unified feature space. JSIA generated cross-modality paired images and performed both set-level and instance-level alignments. Furthermore, some attention-based discriminative feature learning networks have been proposed. Chen et al. [18] used an attention module to combine dual-level feature extraction and differentiable search to learn coarse-grained identity-related features. Wu et al. [23] exploited nuanced but identity-relevant information with a modality alleviation module and a pattern alignment module. Liang et al. [25] first applied a transformer network to the VI-ReID task and designed a novel modality embedding to encode cross-modality information.
However, existing approaches focus on a specific stage to align or replenish the modality information. These approaches ignore the potential benefit of multi-stage alignment for decreasing the modality discrepancy. Furthermore, identity information is often impaired in the alignment process, which inevitably limits the performance of VI-ReID.

3. Proposed Methods

Figure 3 provides an overview of the proposed PDFL network, which adopts a one-stream ResNet-50 [32] as the backbone. The feature extracted by convolutional layer 1 is fed into the ACFM, which adaptively fuses the instance-normalized feature with the input feature; the former alleviates the modality discrepancy, while the latter retains the identity information. The output features of convolutional layers 3 and 4 are respectively fed into a DINM, which generates guidance masks to guide IN to further align the two modalities while maintaining the discriminative ability of the features. PFL generates different body part features of a person to discover subtle differences. The whole network is supervised by the identity classification loss, triplet loss, separation loss, and modality learning loss.
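To make the overall data flow concrete, the following PyTorch sketch shows one plausible way to attach the three modules to a ResNet-50 backbone. It is a minimal illustration under our own assumptions about the module interfaces (the ACFM, DINM, and PartLevelFeatureLearning classes sketched in the following subsections), not the authors' released code.

```python
import torch
import torch.nn as nn
import torchvision


class PDFLBackbone(nn.Module):
    """Sketch of the PDFL forward pass: ResNet-50 stages with ACFM after the
    first residual stage and DINM after stages 3 and 4, then part-level learning."""

    def __init__(self, acfm, dinm3, dinm4, pfl):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)  # load ImageNet weights in practice
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1, self.layer2 = resnet.layer1, resnet.layer2
        self.layer3, self.layer4 = resnet.layer3, resnet.layer4
        self.acfm, self.dinm3, self.dinm4, self.pfl = acfm, dinm3, dinm4, pfl

    def forward(self, x):                  # x: (B, 3, 384, 144) visible or infrared images
        z = self.layer1(self.stem(x))      # early-stage feature ("convolutional layer 1" in the text)
        z = self.acfm(z)                   # adaptive cross fusion with the IN branch
        z = self.layer3(self.layer2(z))
        z = self.dinm3(z)                  # dual-attention-guided IN at a late stage
        z = self.layer4(z)
        z = self.dinm4(z)
        return self.pfl(z)                 # final descriptor (plus part masks in the PFL sketch of Section 3.3)


# Example wiring with ResNet-50 channel widths (256, 1024, 2048):
# model = PDFLBackbone(ACFM(256), DINM(1024), DINM(2048), PartLevelFeatureLearning(2048))
```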

3.1. Adaptive Cross Fusion Module

Extracting and fusing information from visible and infrared images is the key to VI-ReID. Previous works simply multiplied or concatenated the features of the two modalities at a specific stage [16,17,18,22], but this fusion method cannot compensate for the huge differences between visible and infrared images. Moreover, the modality-consistent information outside the specific stage has not been effectively mined.
The problem caused by the modality discrepancy can be attributed to the channel semantic heterogeneity between the two modalities [33]. The key question is how to minimize the discrepancy between channel-level semantics while maintaining the identity information of the features. IN is an effective way to reduce channel-level differences between the two modalities. Nevertheless, directly using IN may impair identity information because it changes the original feature distribution, thereby adversely affecting the VI-ReID task. To overcome these limitations, we design an ACFM that learns modality-aligned features while preserving identity information through an adaptive integration process.
For each input image $x$, its feature $\mathbf{Z} \in \mathbb{R}^{H \times W \times C}$ extracted by the first convolutional layer of the backbone is denoted as the input of ACFM, where $H \times W$ is the feature size and $C$ denotes the total number of channels. We first adopt IN to obtain the instance-normalized feature $\hat{\mathbf{Z}} \in \mathbb{R}^{H \times W \times C}$. The IN can be expressed as:
$$\hat{\mathbf{Z}}_k = \mathrm{IN}(\mathbf{Z}_k) = \frac{\mathbf{Z}_k - \mathrm{E}[\mathbf{Z}_k]}{\sqrt{\mathrm{Var}[\mathbf{Z}_k] + \epsilon}}, \tag{1}$$
where $\mathbf{Z}_k \in \mathbb{R}^{H \times W}$ is the $k$-th channel of the feature $\mathbf{Z}$, the mean $\mathrm{E}[\cdot]$ and variance $\mathrm{Var}[\cdot]$ are calculated per channel, and $\epsilon$ is a smoothing term applied to avoid division by zero. In ACFM, we adopt a cross-attention block to capture the inter-relationship between the instance-normalized feature and the input feature and thus obtain the identity-relevant aligned feature. Specifically, we take the instance-normalized feature as the query $\mathbf{Q} \in \mathbb{R}^{C \times HW}$, while the key $\mathbf{K} \in \mathbb{R}^{HW \times C}$ and value $\mathbf{V} \in \mathbb{R}^{HW \times C}$ are taken from the input feature. The $(\mathbf{Q}, \mathbf{K}, \mathbf{V})$ triplet is generated by independent $1 \times 1$ convolution layers, and we then apply cross-attention between the vectorized features of the two branches via:
$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathbf{V} \times \mathrm{Softmax}\!\left(\frac{\mathbf{Q}^{T}\mathbf{K}}{\sqrt{d_k}}\right), \tag{2}$$
where $\mathrm{Softmax}(\cdot)$ denotes the softmax function and $d_k$ is a scaling factor. Finally, the output of the cross-attention operation is added to the branch of the instance-normalized feature through a skip connection to obtain the fused feature $f_a$. Furthermore, we introduce an adaptive fusion strategy with learnable weights to automatically adjust the contribution of each branch:
$$f_a = \alpha \, \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) + \beta \, \hat{\mathbf{Z}}, \tag{3}$$
where $\alpha$ and $\beta$ are two learnable parameters, both initialized to 0.5.
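A minimal PyTorch sketch of ACFM following the equations above is given below. The exact tensor layout of the $(\mathbf{Q}, \mathbf{K}, \mathbf{V})$ matrices and the choice of $d_k = HW$ for the scaling factor are our assumptions; the skip connection is realized through the learnable $\beta$-weighted IN branch.

```python
import torch
import torch.nn as nn


class ACFM(nn.Module):
    """Sketch of the adaptive cross fusion module: the instance-normalized feature
    provides the query, the raw input provides key/value, and the two branches are
    fused with learnable weights, both initialized to 0.5."""

    def __init__(self, channels):
        super().__init__()
        self.inorm = nn.InstanceNorm2d(channels)
        self.q_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.k_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.v_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.beta = nn.Parameter(torch.tensor(0.5))

    def forward(self, z):                                 # z: (B, C, H, W)
        b, c, h, w = z.shape
        z_hat = self.inorm(z)                             # per-channel IN, Eq. (1)
        q = self.q_conv(z_hat).flatten(2)                 # (B, C, HW), query from the IN branch
        k = self.k_conv(z).flatten(2)                     # (B, C, HW), key from the input branch
        v = self.v_conv(z).flatten(2)                     # (B, C, HW), value from the input branch
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)  # (B, C, C) channel affinity
        fused = (attn @ v).view(b, c, h, w)               # identity-relevant aligned feature
        return self.alpha * fused + self.beta * z_hat     # adaptive fusion, Eq. (3)
```

In this layout the attention matrix is $C \times C$, so the module captures channel-level inter-relationships between the two branches, consistent with the channel-level attention described above.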

3.2. Dual-Attention-Guided Instance Normalization Module

DINM regards modality alignment as a selective process: identity-irrelevant features are aligned to ensure that the modality discrepancy is alleviated, while identity-relevant features are preserved to avoid adverse effects on the ReID task. The process is guided by dual attention, meaning that the identity-relevant information to be retained is determined by dual-attention masks that embed both channel and spatial information: the channel information indicates the attention to identity-relevant channels, and the spatial information locates the identity-relevant feature regions. The spatial information in the horizontal and vertical directions is embedded separately, which gives DINM better robustness to variable pedestrian poses.
As shown in Figure 3, the input of DINM is the feature $\mathbf{Z} \in \mathbb{R}^{H \times W \times C}$ extracted by convolutional layer 3 or 4 of the backbone, and DINM first generates two masks $m^h$ and $m^w$. As shown in Figure 4, following coordinate attention [34], we factorize the global average pooling into a pair of 1D feature encoding operations, called X Pooling and Y Pooling. For X Pooling, given the input feature $\mathbf{Z}$, the output of the $c$-th channel at height $h$ can be formulated as:
$$p_c^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} \mathbf{Z}_c(h, i), \tag{4}$$
similarly, for Y Pooling, the output of the $c$-th channel at width $w$ can be formulated as:
$$p_c^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} \mathbf{Z}_c(j, w). \tag{5}$$
As described above, X Pooling and Y Pooling provide a global receptive field and embed positional information. The outputs of X Pooling and Y Pooling are then concatenated and sent to a shared $1 \times 1$ convolutional layer $\mathrm{Conv}_1$, yielding:
$$f_p = \delta\left(\mathrm{Conv}_1\left(\left[p^{h}, p^{w}\right]\right)\right), \tag{6}$$
where $f_p \in \mathbb{R}^{C/r \times (H+W)}$ is the intermediate feature that embeds horizontal and vertical spatial information, and $r$ is the channel dimension reduction ratio; to balance performance and complexity, $r$ is set to 16. $[\cdot, \cdot]$ denotes the concatenation operation along the spatial dimension, and $\delta(\cdot)$ is a non-linear activation function. DINM then splits $f_p$ along the spatial dimension into two separate tensors $s^h \in \mathbb{R}^{C/r \times H}$ and $s^w \in \mathbb{R}^{C/r \times W}$. Another two $1 \times 1$ convolutional blocks, $\mathrm{Conv}_h(\cdot)$ and $\mathrm{Conv}_w(\cdot)$, are used to separately transform $s^h$ and $s^w$ into two tensors with the same number of channels as the input $\mathbf{Z}$, yielding:
$$m^{h} = \sigma\left(\mathrm{Conv}_h\left(s^{h}\right)\right), \quad m^{w} = \sigma\left(\mathrm{Conv}_w\left(s^{w}\right)\right). \tag{7}$$
The outputs $m^h$ and $m^w$ are used as masks to guide IN, and $\sigma(\cdot)$ is the sigmoid activation function. In addition to channel information, DINM encodes spatial information, which helps the whole network recognize the variable poses of pedestrians more accurately. Finally, the output feature $f_d$ can be represented by:
$$f^{h} = m^{h} \times \mathbf{Z} + \left(1 - m^{h}\right) \times \hat{\mathbf{Z}}, \quad f^{w} = m^{w} \times \mathbf{Z} + \left(1 - m^{w}\right) \times \hat{\mathbf{Z}}, \quad f_d = f^{h} + f^{w}, \tag{8}$$
where $f^h$ and $f^w$ are instance-normalized features guided by $m^h$ and $m^w$, respectively, and $\hat{\mathbf{Z}}$ is the IN result of $\mathbf{Z}$.
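The following sketch illustrates how the dual-attention masks could guide IN, mirroring the coordinate-attention-style pooling described above. The intermediate batch normalization and the minimum mid-channel width are implementation assumptions.

```python
import torch
import torch.nn as nn


class DINM(nn.Module):
    """Sketch of dual-attention-guided instance normalization: coordinate-style
    pooling produces per-height and per-width masks that decide, per channel and
    position, how much of the instance-normalized feature to mix in (Eqs. (4)-(8))."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.inorm = nn.InstanceNorm2d(channels)
        self.conv1 = nn.Sequential(nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, z):                                      # z: (B, C, H, W)
        b, c, h, w = z.shape
        z_hat = self.inorm(z)                                  # plain IN of the input
        p_h = z.mean(dim=3, keepdim=True)                      # X Pooling: (B, C, H, 1)
        p_w = z.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # Y Pooling: (B, C, W, 1)
        f_p = self.conv1(torch.cat([p_h, p_w], dim=2))         # shared 1x1 conv: (B, C/r, H+W, 1)
        s_h, s_w = torch.split(f_p, [h, w], dim=2)
        m_h = torch.sigmoid(self.conv_h(s_h))                  # height mask: (B, C, H, 1)
        m_w = torch.sigmoid(self.conv_w(s_w)).permute(0, 1, 3, 2)  # width mask: (B, C, 1, W)
        f_h = m_h * z + (1 - m_h) * z_hat                      # keep identity-relevant responses
        f_w = m_w * z + (1 - m_w) * z_hat
        return f_h + f_w
```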

3.3. Part-Level Feature Learning

To introduce fine-grained local information, we generate different body part features of a person to discover subtle differences. The output feature of the last DINM, $f_d$, is used to generate attention maps with multiple $1 \times 1$ convolutional layers. Each attention map makes the model focus on a different region of the image. Therefore, we obtain the part-level attention maps $M = [M_1, M_2, \ldots, M_n]$ with $1 \times 1$ convolutional blocks $\mathrm{Conv}_k$ as follows:
$$M_k = \sigma\left(\mathrm{Conv}_k\left(f_d\right)\right), \quad k \in [1, n], \tag{9}$$
where $\sigma(\cdot)$ is the sigmoid activation function, $\mathrm{Conv}_k$ denotes the $k$-th convolutional layer, and $M_k$ denotes the attention map that focuses on body part $k$. With these attention maps, we can divide the feature $f_d$ into $n$ body part features as follows:
$$b_k = M_k \times f_d, \quad k = 1, 2, \ldots, n. \tag{10}$$
Once the input feature is split into $n$ body part features according to the attention maps, the $k$-th body part feature $B_k = \mathrm{GAP}(b_k) \in \mathbb{R}^{C}$ is pooled by global average pooling $\mathrm{GAP}(\cdot)$. Finally, the output feature $f_{final} \in \mathbb{R}^{(n+1)C}$ of the whole network is formulated as:
$$f_{final} = \left[B_1, B_2, \ldots, B_n, \mathrm{GAP}(f_d)\right]. \tag{11}$$
The output feature of the whole network is thus obtained by concatenating the global feature $\mathrm{GAP}(f_d)$ with the local features $B_k$.
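A sketch of PFL consistent with the equations above is shown below; returning the stacked masks alongside the feature is our own convenience so that the separation loss in Section 3.4 can reuse them.

```python
import torch
import torch.nn as nn


class PartLevelFeatureLearning(nn.Module):
    """Sketch of part-level feature learning: n single-channel attention maps split
    the feature into body-part features, whose pooled vectors are concatenated with
    the pooled global feature (Eqs. (9)-(11))."""

    def __init__(self, channels, n_parts=6):
        super().__init__()
        self.part_convs = nn.ModuleList([nn.Conv2d(channels, 1, 1) for _ in range(n_parts)])

    def forward(self, f_d):                                            # f_d: (B, C, H, W)
        masks = [torch.sigmoid(conv(f_d)) for conv in self.part_convs]  # n maps of shape (B, 1, H, W)
        parts = [(m * f_d).mean(dim=(2, 3)) for m in masks]             # GAP of each masked part: (B, C)
        global_feat = f_d.mean(dim=(2, 3))                              # GAP(f_d): (B, C)
        f_final = torch.cat(parts + [global_feat], dim=1)               # (B, (n+1)C)
        return f_final, torch.cat(masks, dim=1)                         # masks reused by the separation loss
```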

3.4. Objective Function

Once the feature $f_{final}$ has been extracted by the network, we calculate two basic ReID losses to supervise the feature learning: the identity classification loss $L_{id}$ (a cross-entropy loss) and the triplet loss $L_{tri}$.
Furthermore, to ensure that the attention maps $M$ in Equation (9) capture different parts, we apply a separation loss that forces the attention maps to focus on different body parts. After reshaping the attention maps from $M \in \mathbb{R}^{H \times W \times n}$ to $M \in \mathbb{R}^{(HW) \times n}$, the separation loss can be expressed as:
$$L_{sep} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left(M^{T} M\right)_{ij}, \tag{12}$$
where $(M^{T}M)_{ij}$ is the element of $M^{T}M$ in row $i$ and column $j$. By reducing the overlap between every pair of attention maps, the separation loss constrains the body part features to attend to diverse areas.
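A sketch of the separation loss is given below, assuming the attention maps are passed in as a (B, n, H, W) tensor as returned by the PFL sketch above; whether the maps are additionally normalized before computing the Gram matrix is not specified in the text, so we omit it.

```python
import torch


def separation_loss(masks):
    """Sketch of Eq. (12): penalize the overlap between every pair of part attention
    maps so that different parts attend to different regions.
    masks: (B, n, H, W) attention maps."""
    b, n, h, w = masks.shape
    m = masks.reshape(b, n, h * w)                     # flatten each map: (B, n, HW)
    gram = m @ m.transpose(1, 2)                       # (B, n, n), entry (i, j) = <M_i, M_j>
    off_diag = gram - torch.diag_embed(torch.diagonal(gram, dim1=1, dim2=2))
    # sum of off-diagonal entries = 2 * sum_{i<j}, so dividing by n(n-1)
    # reproduces the 2/(n(n-1)) * sum_{i<j} factor in Eq. (12)
    return off_diag.sum(dim=(1, 2)).mean() / (n * (n - 1))
```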
We also apply a modality learning loss to make the classifiers provide the same identity predictions for the same pedestrian regardless of the modality. Following [23], we set two mean classifiers for modality-invariant features. The classification predictions for visible (infrared) features provided by the modality-specific classifiers and the mean classifiers are constrained to be consistent via the KL divergence:
$$L_{ml} = \sum_{w=1}^{N_v} d_{KL}\!\left(C_v\!\left(f_v^{w}\right) \,\Big\|\, \bar{C}_r\!\left(f_v^{w}\right)\right) + \sum_{q=1}^{N_r} d_{KL}\!\left(C_r\!\left(f_r^{q}\right) \,\Big\|\, \bar{C}_v\!\left(f_r^{q}\right)\right), \tag{13}$$
where $C_v(\cdot)$ and $C_r(\cdot)$ denote the modality-specific classifiers of the visible and infrared modalities, and $\bar{C}_v(\cdot)$ and $\bar{C}_r(\cdot)$ are the corresponding mean classifiers. $f_v^w$ and $f_r^q$ denote the $w$-th visible feature and the $q$-th infrared feature in a mini-batch, and $N_v$ and $N_r$ are the numbers of visible and infrared samples.
The total loss L of the network can be expressed as:
$$L = L_{id} + L_{tri} + \lambda_1 L_{sep} + \lambda_2 L_{ml}, \tag{14}$$
where $\lambda_1$ and $\lambda_2$ are hyperparameters that balance the contributions of the different loss functions.
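A sketch of how the modality learning loss and the total objective could be computed is shown below. The argument names indicate which classifier (modality-specific or mean) produces each set of logits and are illustrative rather than the authors' actual interface; note that PyTorch's F.kl_div expects log-probabilities as its first argument and probabilities as its second.

```python
import torch.nn.functional as F


def modality_learning_loss(pred_v_by_cv, pred_v_by_mean_cr, pred_r_by_cr, pred_r_by_mean_cv):
    """Sketch of Eq. (13): the modality-specific prediction for a sample is pulled
    toward the mean classifier of the other modality via KL divergence.
    All inputs are raw logits of shape (batch, num_identities)."""
    kl_v = F.kl_div(F.log_softmax(pred_v_by_mean_cr, dim=1),
                    F.softmax(pred_v_by_cv, dim=1), reduction="batchmean")
    kl_r = F.kl_div(F.log_softmax(pred_r_by_mean_cv, dim=1),
                    F.softmax(pred_r_by_cr, dim=1), reduction="batchmean")
    return kl_v + kl_r


def total_loss(l_id, l_tri, l_sep, l_ml, lambda1=0.5, lambda2=2.5):
    """Eq. (14): weighted sum of the four losses; the default weights follow Section 4.2."""
    return l_id + l_tri + lambda1 * l_sep + lambda2 * l_ml
```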

4. Experiments

4.1. Datasets and Evaluation Metrics

SYSU-MM01 [12] is a large-scale VI-ReID dataset that includes 287,628 visible images and 5792 infrared images of 395 pedestrians captured in both indoor and outdoor environments. All the images are captured by four visible cameras and two near-infrared cameras. The testing set contains 3803 infrared images as the query, and 301 (3010) visible images as the gallery for the single-shot (multi-shot) setting. SYSU-MM01 has two evaluation modes, all-search and indoor-search, and both treat visible images as the gallery set and infrared images as the query set.
RegDB [31] is captured by a pair of aligned cameras (one visible and one far-infrared), and it contains a total of 412 pedestrians and 8240 images. Ten images are captured for each person in each modality separately. RegDB has two evaluation settings: one is visible to infrared, which searches the infrared gallery set from a visible query set, and the other setting is infrared to visible, which searches the visible gallery set from an infrared query set.
Evaluation Metrics. For both datasets, we adopt the standard evaluation metrics used in most VI-ReID methods, namely the cumulative matching characteristic (CMC) curve and the mean average precision (mAP). The former calculates the probability that a query identity appears in the top-N retrieved results; we report the Rank-1 accuracy in our experiments. The latter calculates the average precision over all query identities.
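The sketch below computes Rank-1 accuracy and mAP for a single query-gallery split, which is enough to convey the definitions; the official SYSU-MM01 protocol additionally repeats the gallery sampling and filters same-camera matches, which we omit here.

```python
import numpy as np


def rank1_and_map(dist, query_ids, gallery_ids):
    """Minimal sketch of the two evaluation metrics.
    dist: (num_query, num_gallery) distance matrix;
    query_ids, gallery_ids: integer identity arrays."""
    order = np.argsort(dist, axis=1)                     # gallery indices sorted by distance
    matches = gallery_ids[order] == query_ids[:, None]   # True where the retrieved identity is correct
    rank1 = matches[:, 0].mean()                         # CMC at rank 1
    aps = []
    for row in matches:
        hits = np.where(row)[0]                          # ranked positions of correct matches
        if hits.size == 0:
            continue
        precision_at_hits = np.arange(1, hits.size + 1) / (hits + 1)  # precision at each correct retrieval
        aps.append(precision_at_hits.mean())             # average precision for this query
    return rank1, float(np.mean(aps))
```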

4.2. Implementation Details

The proposed method is implemented in PyTorch and trained on a single NVIDIA RTX A6000 GPU. For each mini-batch, we randomly sample 8 identities for each modality and 8 images for each identity. All input images are first resized to $384 \times 144$. We use the Adam optimizer, and the initial learning rate is set to $3.5 \times 10^{-4}$, which is decayed by factors of 0.1 and 0.01 at the 70th and 120th epochs, respectively. The whole training process consists of 140 epochs. The hyperparameters $\lambda_1$ and $\lambda_2$ are set to 0.5 and 2.5, respectively.
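This optimization schedule translates directly into a few lines of PyTorch; the placeholder model and the comment stand in for the actual PDFL network and the identity-balanced sampler, which are outside the scope of this sketch.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for the PDFL network
optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4)
# learning rate decays to 0.1x at epoch 70 and to 0.01x at epoch 120, over 140 epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[70, 120], gamma=0.1)

for epoch in range(140):
    # ... one pass over mini-batches of 8 identities x 8 images per modality, resized to 384x144 ...
    scheduler.step()
```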

4.3. Comparison with State-of-the-Art Methods

Comparisons on SYSU-MM01. We compare our PDFL with state-of-the-art (SOTA) methods under all settings. As shown in Table 1, PDFL achieves the best performance in all settings, which strongly supports the effectiveness of our method. Specifically, PDFL achieves a Rank-1 accuracy of 73.64% and an mAP of 70.69% in the all-search single-shot setting, improving the Rank-1 accuracy by 1.96% and the mAP by 1.90% over the SOTA method MAUM. PDFL also achieves a Rank-1 accuracy of 79.70% and an mAP of 83.01% in the indoor-search single-shot setting, improving the Rank-1 accuracy and mAP by 2.73% and 1.07%, respectively, over MAUM. Although images in SYSU-MM01 vary heavily across modalities in terms of illumination, background, perspective, and pedestrian pose, our PDFL handles the VI-ReID problem by alleviating the modality discrepancy and mining discriminative features.
We also show qualitative comparison results on the SYSU-MM01 dataset in Figure 5. The baseline model produces many incorrect retrieval results, marked by the red boxes in Figure 5. In contrast, our model alleviates modality differences while preserving more identity information, and therefore identifies pedestrians more reliably. Figure 5 also shows that, from Rank-1 to Rank-10, our proposed model maintains high retrieval accuracy.
Comparisons on RegDB. As shown in Table 1, our PDFL also has distinct advantages over the SOTA methods on RegDB, which is a small-scale dataset. Specifically, PDFL outperforms the other recent works with a Rank-1 accuracy of 88.16% and an mAP of 86.34% in the visible-to-infrared setting, and a Rank-1 accuracy of 87.48% and an mAP of 85.24% in the infrared-to-visible setting. These results show the effectiveness of our approach in processing different modality information and extracting discriminative pedestrian features, regardless of whether the infrared images in the dataset are near-infrared or far-infrared.

4.4. Ablation Study

In this section, we perform detailed ablation studies on SYSU-MM01 to evaluate the contribution of each module and the effect of the number of body parts in the PFL on the results.
Effectiveness of each module: To validate the contribution of each module, we gradually add the proposed modules to the network. “Base” denotes the baseline network that adopts ResNet-50 to extract pedestrian features. As shown in Table 2, compared with “Base”, “Base+ACFM” substantially increases the performance, by roughly 4 percentage points in single-shot Rank-1 accuracy and mAP. This indicates that the interaction of input features with instance-normalized features at an early stage via ACFM greatly reduces the modality discrepancy, so the model can better identify pedestrians. The results of “Base+ACFM+DINM” indicate that, at a late stage, the modality-aligned features derived through DINM effectively mitigate the discrepancy between modalities while avoiding the erosion of identity information that may occur with the direct application of IN. Finally, the results of “Base+ACFM+DINM+PFL” show that generating multiple part features of a person makes it feasible to mine nuanced distinctions more effectively, thereby enhancing the discriminative capacity of the features.
Effectiveness of different loss functions: We also validate the effectiveness of the different losses by adding them to the model one by one. As shown in Table 3, the separation loss $L_{sep}$ reduces the overlap between every pair of attention maps and thus effectively constrains our model to focus on different body areas. Moreover, the modality learning loss $L_{ml}$ further improves the performance of PDFL. This indicates that $L_{ml}$ effectively constrains the predictions for the same identity and that the mean classifiers help the network learn more modality-invariant features.
Number of body parts: The number of body parts also influences the performance of the model. Intuitively, the number of body parts $n$ determines the granularity of the features learned by the network. However, finer granularity does not necessarily lead to better performance. As illustrated in Figure 6, the Rank-1 accuracy and mAP improve at first as the number of body parts increases. When $n$ is equal to 7 or more, the performance drops significantly because some body part features become very similar to others when the parts are too small. As a result, an over-increased $n$ diminishes the discriminative ability of the body part features. The experiments show that the model with $n = 6$ achieves the best performance on the SYSU-MM01 dataset, so we set $n = 6$ in PFL.

5. Visualization Analysis

In Figure 7, the learned features are visualized via t-SNE. In Figure 7a, the features of the backbone that is only pre-trained on ImageNet are shown; features of the same identity but different modalities are dispersed. In Figure 7b, we show the features learned by the baseline model: the distances between different classes are noticeably smaller, while the intra-class distances, especially the intra-class differences across modalities, remain large, indicating that the modality discrepancy has not been well eliminated. Correspondingly, Figure 7c shows the features learned by the proposed PDFL: the distances between different classes are significantly increased, while the intra-class distances, in particular the cross-modality differences within each class, are reduced, indicating that the modality discrepancy is well suppressed. Since ACFM and DINM alleviate the modality discrepancy by aligning the feature distributions progressively, each identity is clustered into a more compact feature cluster.
In addition, we adopt attention maps to further illustrate the discriminative features learned from images of the two modalities. In Figure 8, the attention maps from our PDFL focus on identity-relevant information such as hair, shoes, and glasses. Furthermore, PDFL performs well even in scenes with large changes in pedestrian pose, which reveals that using attention embedded with spatial information to guide IN is effective.

6. Conclusions

In this paper, we proposed a progressive discriminative feature learning (PDFL) network to learn discriminative features progressively for VI-ReID. PDFL is designed to both alleviate the modality discrepancy and utilize identity information to distinguish between different pedestrians. Therefore, the proposed PDFL focuses on extracting modality-aligned features while ensuring the discriminative ability of the features. The proposed adaptive cross fusion module (ACFM) achieves the interaction between the input feature and the instance-normalized feature to mine discriminative modality-aligned features at an early stage. We have also designed a dual-attention-guided instance normalization module (DINM) to align the two modalities into a unified feature space while preserving identity information at a late stage, which is achieved by selectively applying IN via guidance masks. Then, part-level feature learning (PFL) generates different body part features of one pedestrian to learn more discriminative features. Extensive experiments on SYSU-MM01 and RegDB demonstrate that our method is superior to the SOTA methods.

Author Contributions

Conceptualization, F.Z.; Data curation, Y.S.; Formal analysis, F.Z. and H.Y.; Funding acquisition, Y.S.; Investigation, S.F.; Methodology, Z.C.; Project administration, H.Y. and Y.S.; Resources, S.F.; Software, F.Z. and Z.C.; Supervision, S.F.; Validation, F.Z. and Y.S.; Visualization, F.Z. and H.Y.; Writing—review and editing, S.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (52075529), the Youth Innovation Promotion Association of Chinese Academy of Sciences (2021199) and the National Key Program (2021YFC3002002).

Data Availability Statement

The dataset SYSU-MM01 is an open dataset and can be downloaded at https://doi.org/10.1109/ICCV.2017.575. The dataset RegDB is an open dataset and can be downloaded at https://doi.org/10.3390/s17030605.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zeng, Z.; Li, Z.; Cheng, D.; Zhang, H.; Zhan, K.; Yang, Y. Two-stream multirate recurrent neural network for video-based pedestrian reidentification. IEEE Trans. Ind. Inform. 2018, 14, 3179–3186. [Google Scholar] [CrossRef]
  2. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496. [Google Scholar]
  3. Hou, R.; Ma, B.; Chang, H.; Gu, X.; Shan, S.; Chen, X. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9309–9318. [Google Scholar]
  4. Wang, G.; Yang, S.; Liu, H.; Wang, Z.; Yang, Y.; Wang, S.; Yu, G.; Zhou, E.; Sun, J. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6449–6458. [Google Scholar]
  5. Sarfraz, M.S.; Schumann, A.; Eberle, A.; Stiefelhagen, R. A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 28–23 June 2018; pp. 420–429. [Google Scholar]
  6. Zhang, Z.; Lan, C.; Zeng, W.; Jin, X.; Chen, Z. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3186–3195. [Google Scholar]
  7. Wu, L.; Liu, D.; Zhang, W.; Chen, D.; Ge, Z.; Boussaid, F.; Bennamoun, M.; Shen, J. Pseudo-pair based self-similarity learning for unsupervised person re-identification. IEEE Trans. Image Process. 2022, 31, 4803–4816. [Google Scholar] [CrossRef] [PubMed]
  8. Li, H.; Wu, G.; Zheng, W.S. Combined depth space based architecture search for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 6729–6738. [Google Scholar]
  9. Meng, J.; Zheng, W.-S.; Lai, J.-H.; Wang, L. Deep graph metric learning for weakly supervised person re-identification. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6074–6093. [Google Scholar] [CrossRef] [PubMed]
  10. Zhang, Z.; Lan, C.; Zeng, W.; Chen, Z.; Chang, S.-F. Beyond triplet loss: Meta prototypical N-Tuple loss for person re-identification. IEEE Trans. Multimedia 2022, 24, 4158–4169. [Google Scholar] [CrossRef]
  11. Ye, M.; Cheng, Y.; Lan, X.; Zhu, H. Improving night-time pedestrian retrieval with distribution alignment and contextual distance. IEEE Trans. Ind. Inform. 2020, 16, 615–624. [Google Scholar] [CrossRef]
  12. Wu, A.; Zheng, W.-S.; Yu, H.-X.; Gong, S.; Lai, J. Rgb-infrared cross-modality person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5390–5399. [Google Scholar]
  13. Wang, G.; Zhang, T.; Cheng, J.; Liu, S.; Yang, Y.; Hou, Z. Rgb-infrared cross-modality person re-identification via joint pixel and feature alignment. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3622–3631. [Google Scholar]
  14. Wang, G.A.; Yang, T.; Cheng, J.; Chang, J.; Liang, X.; Hou, Z. Cross-modality paired-images generation for rgb-infrared person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12144–12151. [Google Scholar]
  15. Feng, Z.; Lai, J.; Xie, X. Learning modality-specific representations for visible-infrared person re-identification. IEEE Trans. Image Process. 2020, 29, 579–590. [Google Scholar] [CrossRef] [PubMed]
  16. Wu, A.; Zheng, W.-S.; Gong, S.; Lai, J. RGB-IR person re-identification by cross-modality similarity preservation. Int. J. Comput. Vis. 2020, 128, 1765–1785. [Google Scholar] [CrossRef]
  17. Wei, X.; Li, D.; Hong, X.; Ke, W.; Gong, Y. Co-attentive lifting for infrared-visible person re-identification. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1028–1037. [Google Scholar]
  18. Chen, Y.; Wan, L.; Li, Z.; Jing, Q.; Sun, Z. Neural feature search for RGB-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 587–597. [Google Scholar]
  19. Zhang, D.; Zhang, Z.; Ju, Y.; Wang, C.; Xie, Y.; Qu, Y. Dual mutual learning for cross-modality person re-identification. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5361–5373. [Google Scholar] [CrossRef]
  20. Fu, C.; Hu, Y.; Wu, X.; Shi, H.; Mei, T.; He, R. CM-NAS: Cross-modality neural architecture search for visible-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 11823–11832. [Google Scholar]
  21. Hao, X.; Zhao, S.; Ye, M.; Shen, J. Cross-modality person re-identification via modality confusion and center aggregation. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 16403–16412. [Google Scholar]
  22. Wei, Z.; Yang, X.; Wang, N.; Gao, X. Syncretic modality collaborative learning for visible infrared person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 225–234. [Google Scholar]
  23. Wu, Q.; Dai, P.; Chen, J.; Lin, C.W.; Wu, Y.; Huang, F.; Zhong, B.; Ji, R. Discover cross-modality nuances for visible-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 4330–4339. [Google Scholar]
  24. Lu, H.; Zou, X.; Zhang, P. Learning progressive modality-shared transformers for effective visible-infrared person re-identification. arXiv 2022, arXiv:2212.00226. [Google Scholar] [CrossRef]
  25. Liang, T.; Jin, Y.; Liu, W.; Li, Y. Cross-modality transformer with modality mining for visible-infrared person re-identification. IEEE Trans. Multimedia 2023, 25, 8432–8444. [Google Scholar] [CrossRef]
  26. Liu, J.; Sun, Y.; Zhu, F.; Pei, H.; Yang, Y.; Li, W. Learning memory-augmented unidirectional metrics for cross-modality person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19344–19353. [Google Scholar]
  27. Zheng, H.; Zhong, X.; Huang, W.; Jiang, K.; Liu, W.; Wang, Z. Visible-infrared person re-identification: A comprehensive survey and a new setting. Electronics 2022, 11, 454. [Google Scholar] [CrossRef]
  28. Ma, L.; Guan, Z.; Dai, X.; Gao, H.; Lu, Y. A Cross-Modality Person Re-Identification Method Based on Joint Middle Modality and Representation Learning. Electronics 2023, 12, 2687. [Google Scholar] [CrossRef]
  29. Gohar, I.; Riaz, Q.; Shahzad, M.; Hashmi, M.Z.U.H.; Tahir, H.; Haq, M.E.U. Person re-identification using deep modeling of temporally correlated inertial motion patterns. Sensors 2020, 20, 949. [Google Scholar] [CrossRef] [PubMed]
  30. Uddin, M.K.; Bhuiyan, A.; Bappee, F.K.; Islam, M.M.; Hasan, M. Person Re-Identification with RGB–D and RGB–IR Sensors: A Comprehensive Survey. Sensors 2023, 23, 1504. [Google Scholar] [CrossRef] [PubMed]
  31. Nguyen, D.T.; Hong, H.G.; Kim, K.W.; Park, K.R. Person recognition system based on a combination of body images from visible light and thermal cameras. Sensors 2017, 17, 605. [Google Scholar] [CrossRef] [PubMed]
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  33. Zhang, Y.; Kang, Y.; Zhao, S.; Shen, J. Dual-Semantic Consistency Learning for Visible-Infrared Person Re-Identification. IEEE Trans. Inf. Forensics Secur. 2022, 18, 1554–1565. [Google Scholar] [CrossRef]
  34. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13708–13717. [Google Scholar]
  35. Li, D.; Wei, X.; Hong, X.; Gong, Y. Infrared-visible cross-modal person re-identification with an x modality. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 4610–4617. [Google Scholar]
  36. Ye, M.; Shen, J.; Crandall, D.J.; Shao, L.; Luo, J. Dynamic dual-attentive aggregation learning for visible-infrared person re-identification. In Proceedings of the Computer Vision–ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 229–247. [Google Scholar]
Figure 1. (a) Illustration of metric learning. These methods constrain the network to obtain discriminative and modality-shared features with specifically designed loss functions. (b) Illustration of image translation-based methods. These methods alleviate the modality discrepancy by converting two different modalities to an intermediate modality. (c) Illustration of proposed PDFL. The motivation of our PDFL is to align visible features with infrared features by instance normalization (IN) while preserving identity information, which is important for identification but likely to be damaged in the alignment process.
Figure 2. (a–d) show the intra-class and inter-class distances of cross-modality features when using instance normalization (IN) at different stages of feature extraction. The intra-class and inter-class distances are indicated in blue and green, respectively. The vertical lines mark the means of the intra-class and inter-class distances.
Figure 3. Framework of the proposed PDFL. (1) The features of the two modalities are sent into the ACFM and DINM to alleviate the modality discrepancy progressively while maintaining the identity information. (2) PFL generates different body part features of one person to learn more discriminative features. The output of the last DINM, $f_d$, is concatenated with the generated body part features for identity prediction after global average pooling.
Figure 4. The operation of X Pooling and Y Pooling in DINM. (a) X Pooling, (b) Y Pooling.
Figure 5. The comparison results on the SYSU-MM01 dataset. (a) The person re-identification results of the baseline model. (b) The person re-identification results of our PDFL.
Figure 6. Ablation results with different numbers of body parts in part-level feature learning.
Figure 7. The learned features are visualized via t-SNE. Different colors represent different identities in the testing set of SYSU-MM01. The circles and crosses indicate visible features and infrared features, respectively. (a) Feature distribution of the baseline method that is only pre-trained on ImageNet; (b) feature distribution of the baseline method; (c) feature distribution of PDFL.
Figure 8. Attention maps extracted by the baseline and PDFL. The middle row and the bottom row are extracted by the baseline and PDFL, respectively.
Table 1. Performance comparison with SOTA methods on the SYSU-MM01 and RegDB datasets. All-Search and Indoor-Search refer to SYSU-MM01 (single-shot); VIS to IR and IR to VIS refer to RegDB. Rank-1 accuracy (%) and mAP (%) are reported.

| Method | All-Search Rank-1 | All-Search mAP | Indoor-Search Rank-1 | Indoor-Search mAP | VIS to IR Rank-1 | VIS to IR mAP | IR to VIS Rank-1 | IR to VIS mAP |
|---|---|---|---|---|---|---|---|---|
| MSR [15] | 37.35 | 38.11 | 39.64 | 50.88 | 48.43 | 48.67 | - | - |
| AlignGAN [13] | 42.40 | 40.70 | 45.90 | 54.30 | 57.90 | 53.60 | 56.30 | 53.40 |
| CMSP [16] | 43.56 | 44.98 | 48.62 | 57.50 | 65.07 | 64.50 | - | - |
| SSFT [21] | 47.70 | 54.10 | - | - | 65.40 | 65.60 | 63.80 | 64.20 |
| XIVReID [35] | 49.92 | 50.73 | - | - | 62.21 | 60.18 | - | - |
| DDAG [36] | 54.75 | 53.02 | 61.02 | 67.98 | 69.34 | 63.46 | 68.06 | 61.80 |
| NFS [18] | 56.91 | 55.45 | 62.79 | 69.79 | 80.54 | 72.10 | 77.95 | 69.79 |
| CMNAS [20] | 61.99 | 60.02 | 67.01 | 72.95 | 84.54 | 80.32 | 82.57 | 78.31 |
| CMTR [25] | 62.58 | 61.33 | 67.02 | 73.78 | 80.62 | 74.42 | 81.06 | 73.75 |
| MCLNet [12] | 65.40 | 61.98 | 72.56 | 76.58 | 80.31 | 73.07 | 75.93 | 69.49 |
| SMCL [22] | 67.39 | 61.78 | 68.84 | 75.56 | 83.93 | 79.83 | 83.05 | 78.57 |
| PMT [24] | 67.53 | 64.98 | 71.66 | 76.52 | 84.83 | 76.55 | 84.16 | 75.13 |
| MPANet [23] | 70.58 | 68.24 | 76.74 | 80.95 | 83.70 | 80.90 | 82.80 | 80.70 |
| MAUM [26] | 71.68 | 68.79 | 76.97 | 81.94 | 87.87 | 85.09 | 86.95 | 84.34 |
| Ours | 73.64 | 70.69 | 79.70 | 83.01 | 88.16 | 86.34 | 87.48 | 85.24 |
Table 2. The ablation results of different modules on SYSU-MM01 (all-search). Rank-1 accuracy (%) and mAP (%) are reported for the single-shot and multi-shot settings.

| Setting | Single-Shot Rank-1 | Single-Shot mAP | Multi-Shot Rank-1 | Multi-Shot mAP |
|---|---|---|---|---|
| Base | 66.80 | 64.45 | 73.77 | 59.53 |
| Base + ACFM | 71.18 | 68.19 | 74.16 | 62.65 |
| Base + ACFM + DINM | 73.10 | 70.64 | 75.72 | 65.76 |
| Base + ACFM + DINM + PFL | 73.64 | 70.69 | 75.65 | 65.82 |
Table 3. The ablation results of different loss functions on SYSU-MM01 (all-search). Rank-1 accuracy (%) and mAP (%) are reported for the single-shot and multi-shot settings.

| Setting | Single-Shot Rank-1 | Single-Shot mAP | Multi-Shot Rank-1 | Multi-Shot mAP |
|---|---|---|---|---|
| PDFL + $L_{id}$ + $L_{tri}$ | 64.02 | 62.76 | 63.67 | 56.71 |
| PDFL + $L_{id}$ + $L_{tri}$ + $L_{sep}$ | 67.09 | 65.11 | 67.35 | 59.05 |
| PDFL + $L_{id}$ + $L_{tri}$ + $L_{sep}$ + $L_{ml}$ | 73.64 | 70.69 | 75.65 | 65.82 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhou, F.; Cheng, Z.; Yang, H.; Song, Y.; Fu, S. Progressive Discriminative Feature Learning for Visible-Infrared Person Re-Identification. Electronics 2024, 13, 2825. https://doi.org/10.3390/electronics13142825

