Article

Identity-Guided Spatial Attention for Vehicle Re-Identification

Kai Lv, Sheng Han and Youfang Lin
Beijing Key Laboratory of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
*
Author to whom correspondence should be addressed.
Sensors 2023, 23(11), 5152; https://doi.org/10.3390/s23115152
Submission received: 30 April 2023 / Revised: 20 May 2023 / Accepted: 25 May 2023 / Published: 28 May 2023
(This article belongs to the Section Vehicular Sensing)

Abstract

In vehicle re-identification, identifying a specific vehicle from a large image dataset is challenging due to occlusion and complex backgrounds. Deep models struggle to identify vehicles accurately when critical details are occluded or the background is distracting. To mitigate the impact of these noisy factors, we propose Identity-guided Spatial Attention (ISA) to extract more beneficial details for vehicle re-identification. Our approach begins by visualizing the high-activation regions of a strong baseline method and identifying the noisy objects involved during training. ISA then generates an attention map that highlights the most discriminative areas, without the need for manual annotation. Finally, the ISA map refines the embedding feature in an end-to-end manner to improve vehicle re-identification accuracy. Visualization experiments demonstrate ISA’s ability to capture nearly all vehicle details, while results on three vehicle re-identification datasets show that our method outperforms state-of-the-art approaches.

1. Introduction

This paper investigates the challenge of identifying a specific vehicle from a vast image gallery, a task known as vehicle re-identification [1,2,3,4,5]. The accuracy of this task relies heavily on deep learning techniques [6,7,8,9] and computational resources, and it is further constrained by the availability of large-scale datasets. However, practical datasets of vehicle images obtained from traffic cameras are often riddled with noise, including occlusion and background artifacts. These nuisances adversely affect the efficacy of deep learning models and significantly compromise re-identification accuracy. Therefore, effective training on noisy vehicle images is crucial for achieving reliable vehicle re-identification results.
In vehicle re-identification, methods primarily rely on visual appearance, rather than the license plate, to distinguish vehicles. Although the license plate is the most distinctive characteristic of a vehicle, it can be ambiguous and untrustworthy under certain circumstances. Firstly, the low resolution of the license plate region in images captured alongside the vehicle may make it difficult to decipher the characters accurately. Secondly, the license plate can be obscured or even falsified, rendering it unreliable as an identifier. Therefore, the majority of the existing literature adopts visual appearance-based approaches to tackle the challenge of vehicle re-identification.
Vehicle details are important in enabling the identification of distinct vehicles. As such, vehicle re-identification methods should concentrate on differentiating vehicles from other sources of noise, such as backgrounds or occluders. To achieve this, some existing methods employ attention mechanisms to emphasize vehicle-specific details. For example, He et al. [10] leverage a local region detector to identify areas such as windows, lights and brand logos. Similarly, Meng et al. [11] segment a vehicle into four distinct views and implement a shared visible attention mechanism to extract view-aware features. Wang et al. [12] propose a method that detects vehicle keypoints and extracts features from local regions. These approaches [10,11,12] achieve competitive accuracy in vehicle re-identification, underscoring the significance of vehicle details in extracting discriminative features from noisy data.
In order to visualize the regions to which discriminative features attend, as depicted in Figure 1, we employ the Class Activation Map (CAM) technique [13], which facilitates the generation of activation maps for input images. The method presented in [14] serves as a robust baseline for re-identification tasks, surpassing the performance of numerous vehicle re-identification approaches [10,11,12]. In this paper, our objective is to examine the attention map of the aforementioned baseline in order to discern the origins of the extracted features. It is important to note that the proposed method builds upon the foundation laid by the baseline described in [14].
The baseline method exhibits two primary issues with regard to attention. Firstly, as illustrated in the first two examples of Figure 1, the baseline extracts features from several patches that are unrelated to the vehicle. The activation regions encompass not only vehicle components but also various other objects, such as guardrails and green belts. Clearly, these objects are not integral to the vehicle and introduce significant noise into the feature embedding. Secondly, as demonstrated in the last two examples of Figure 1, the baseline may yield fragmented feature attention. In such instances, the baseline fails to concentrate on the vehicle’s appearance, instead directing attention towards the majority of the background. However, the background serves as a source of noise for vehicle re-identification and should not be incorporated into the generation of feature embeddings.
We propose the Identity-guided Spatial Attention (ISA) method to regulate the feature-extraction process, ensuring that the features are focused on vehicle-specific details. Specifically, we propose Single Identity-guided Spatial Attention (SISA), which assigns importance scores to each class at the spatial level. Multiple SISAs are integrated to generate an ISA map, emphasizing the discriminative regions within the input image. Subsequently, an ISA module is incorporated into the vehicle re-identification framework in an end-to-end fashion. During this procedure, the neural network’s attention is enhanced, shifting from a few localized details to a more comprehensive coverage of discriminative regions.
ISA is capable of filtering out noisy factors, yielding more relevant and focused attention. As demonstrated in Figure 1, the attention maps reveal that the proposed method zeroes in on specific vehicle details, such as lights and wheels. Notably, the algorithm disregards noisy areas, including guardrails and green belts, which are deemed irrelevant. In contrast, the baseline method extracts features from these unrelated objects. Furthermore, experimental results suggest that the relevant and focused features contribute to improved re-identification accuracy.
Additionally, the proposed method can be readily implemented in an unsupervised manner. In contrast, to thoroughly exploit vehicle details, some approaches [10,11,12] incorporate supervised detection components, such as vehicle keypoint or region detection. This inevitably leads to increased computational cost and can adversely impact the method’s applicability. In the present study, no supplementary manual annotation is employed for the identification of discriminative regions. Moreover, visualized results substantiate the accuracy of discriminative regions identified via our method.
In summary, the primary contributions of this paper are as follows:
  • We introduce a spatial attention method to eliminate noisy factors and concentrate more on vehicle-specific details. An attention map is generated to highlight the discriminative regions of the input image.
  • We propose an ISA module that leverages the ISA map to produce an attention matrix. The feature maps are refined by the attention weight, resulting in the acquisition of robust features.
  • Distinguished from previous attention methods, ISA constitutes an unsupervised technique, necessitating no supplementary manual annotation and readily adaptable to other vehicle re-identification frameworks.

2. Related Work

2.1. Vehicle Re-Identification

To achieve a discriminative feature representation for vehicle re-identification, previous methods have utilized vehicle viewpoint or orientation cues. However, since vehicles are captured by cameras from various angles, images of identical cars from different orientations can vary significantly, making it challenging to identify similar vehicles of the same orientation. To address this issue, Wang et al. [12] propose a novel framework that includes orientation-invariant feature embedding and spatial-temporal regularization. Additionally, they utilize a key point regressor to obtain vehicle key points, which can distinguish similar cars based on subtle differences.
Another approach to vehicle re-identification is to learn viewpoint-aware deep metrics, as demonstrated by Chu et al. [15], who use a two-branch network. Zhou and Shao [16] adopt an adversarial training architecture with a viewpoint-aware attention model to infer multi-view features from single-view input. Zhou et al. [17] address the uncertainty of different vehicle orientations by using an LSTM to model transformations across continuous orientation variations of the same vehicle. Zhu et al. [18] propose training re-identification models for vehicle, orientation and camera separately. They then penalize the final similarity between testing images based on orientation and camera similarity.
Previous works have also exploited local details and regions to address vehicle re-identification problems [10,19,20,21]. Meng et al. [11] use a parser to segment four views of a vehicle and propose a parsing-based view-aware embedding network (PVEN) to generate fine-grained representations: vehicles are parsed into four views, features are aligned via mask average pooling and then enhanced using common-visible attention. He et al. [10] address the near-duplicate problem in vehicle re-identification by proposing a part-regularized approach that enhances local features. Liu et al. [22] introduce a multi-branch model that learns global and regional features simultaneously; they use adaptive ratio weights of regional features for the fusion process and propose a Group–Group loss to optimize the distance within and across vehicle image groups. Shen et al. [23] propose a two-stage framework that incorporates important visual-spatial-temporal path information for regularization. Khorramshahi et al. [24] present a self-supervised attention approach for vehicle re-identification to extract vehicle-specific discriminative features. PCRNet [25] also utilizes vehicle parsing to learn discriminative part-level features, model the correlation among vehicle parts and achieve precise part alignment for vehicle re-identification.
In this paper, we also utilize discriminative regions to improve re-identification accuracy, where a region mask is generated to highlight vehicle details without additional manual annotation.

2.2. Attention Methods

The attention mechanism was originally introduced in Natural Language Processing [26,27,28,29]. Bahdanau et al. [26] propose an extension to the encoder–decoder model that learns to align and translate jointly in machine translation. The model automatically searches for parts of a source sentence that are useful in predicting a target word, rather than deploying a hard segment. Luong et al. [30] introduce global and local attention mechanisms to neural machine translation, where global attention considers all source words and local attention focuses on a subset of source words at a time. Vaswani et al. [31] propose Transformer, which replaces the recurrent layers with multi-headed self-attention. In this way, the sequence transduction model is entirely based on the attention mechanism. Shaw et al. [32] extend the self-attention mechanism to consider relative positions or distances between sequence elements.
Recently, attention mechanisms have been widely used in image classification [33,34,35,36], semantic segmentation [37,38,39] and object recognition [40,41,42,43,44], due to their ability to boost the performance of deep neural networks. Hu et al. [45] focus on channel relationships and propose the Squeeze-and-Excitation block, which adaptively calibrates channel-wise feature responses by explicitly modeling interdependencies. Wang et al. [46] propose the Residual Attention Network by stacking attention modules that generate attention-aware features. Woo et al. [47] propose the convolutional block attention module, which utilizes attention-based feature refinement with two different modules: channel and spatial. Park et al. [48] present a new approach to enhancing the representation power of networks via a bottleneck attention module. Bello et al. [49] introduce a two-dimensional relative self-attention mechanism as a stand-alone computational primitive, considering a possible alternative to convolutions. Li et al. [50] present the Harmonious Attention Convolutional Neural Network for joint learning of attention selection and feature representations and introduce a cross-attention interaction mechanism. Based on predicted part quality scores, Wang et al. [51] propose an identity-aware attention module to highlight pixels of the target pedestrian and handle the occlusion between pedestrians with a coarse identity-aware feature. Liu et al. [52] propose the Multi-Task Attention Network for multi-task learning, where a task-specific attention module is designed for each task to allow for automatic learning of both task-shared and task-specific features.
In this paper, we realize the attention mechanism by recognizing discriminative areas during training, due to the importance of regional clues in re-identification.

3. Method

In this section, we describe the proposed Identity-guided Spatial Attention (ISA) for vehicle re-identification. Firstly, we describe the spatial attention map that identifies comprehensive discriminative regions of the vehicle in the input image. Secondly, we describe the composition of the ISA module. Finally, we illustrate the network framework that incorporates the spatial attention module.

3.1. Preliminaries

The objective of vehicle re-identification is to retrieve gallery images that correspond to the same identity as a given query image. To achieve this, a model is trained on a dataset of N vehicle images and their corresponding labels, denoted as $\{\langle I_i, y_i \rangle\}_{i=1}^{N}$. During training, the model is trained using an image classification approach that includes a Fully Connected (FC) layer. At test time, the FC layer is removed and the model maps vehicle images to feature vectors. The similarity of two images is then calculated by measuring the distance between their corresponding feature vectors, using the cosine distance metric in this paper. The images in the gallery set are ranked based on their similarity to the query image and relevant metrics are computed for evaluation.
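As an illustration of this retrieval step, the following PyTorch sketch ranks gallery embeddings by cosine distance to a query embedding; the tensor names are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_feat: torch.Tensor, gallery_feats: torch.Tensor) -> torch.Tensor:
    """Rank gallery images by cosine distance to a single query feature.

    query_feat:    (c,)   embedding of the query image
    gallery_feats: (G, c) embeddings of the gallery images
    Returns the gallery indices sorted from most to least similar.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    q = F.normalize(query_feat.unsqueeze(0), dim=1)   # (1, c)
    g = F.normalize(gallery_feats, dim=1)             # (G, c)
    cosine_sim = (q @ g.t()).squeeze(0)               # (G,)
    cosine_dist = 1.0 - cosine_sim                    # smaller = more similar
    return torch.argsort(cosine_dist)                 # ascending distance
```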
However, we notice that the model is vulnerable to noise interference, such as backgrounds or occlusions, as shown in Figure 1. In addition, some sample images exhibit scattered attention, which negatively affects feature extraction. To overcome these challenges, we propose a spatial attention mechanism that can be incorporated into existing re-identification networks. Specifically, our approach concentrates the model’s attention and filters out noise factors. Consequently, the feature vectors contain less noisy information, resulting in improved re-identification accuracy.

3.2. Framework

The model architecture is shown in Figure 2 and it can be trained end-to-end without requiring regional annotations for learning ISA. To obtain a more informative feature map, we utilize a ResNet-50 variant as the backbone, which strikes a balance between accuracy and efficiency for most re-identification algorithms. The last classification layer that was originally trained for ImageNet [53] is removed and the stride of the last pooling layer is set to 1.
To integrate the proposed spatial attention into the re-identification framework, we introduce an identity-guided spatial attention module, which is detailed in Section 3.3 and Section 3.4. The ISA module can be broken down into two steps. In the first step, the 3D feature map is passed through a global average pooling (GAP) layer and an FC layer within the ISA module, generating an attention map called the spatial attention map. This map highlights discriminative regions while compressing noisy factors and shares the same size as the input feature map along the height and width dimensions.
In the second step, the ISA module is used to weight the feature map through element-wise multiplication, resulting in a refined feature map. This refined feature map is then passed through a GAP layer and an FC layer. To optimize the model, we employ both cross-entropy loss and triplet loss. Cross-entropy loss treats the re-identification task as a classification problem and is calculated as follows:
$L_{CE} = -\sum_{i=1}^{B} y_i^{T} \log \hat{y}_i,$
where $B$ is the batch size during training, $y_i$ is a one-hot vector representing the ground-truth label of the image $I_i$ and $\hat{y}_i$ is the corresponding probability distribution over all categories predicted by the model.
To further improve the model’s similarity learning, we utilize triplet loss which employs triplets of samples from different identities. Let Θ ( · ) denote the deep model that maps raw images to feature vectors. The triplet loss is computed as follows:
$L_{Tri} = \max\bigl( d(\Theta(I_a), \Theta(I_p)) - d(\Theta(I_a), \Theta(I_n)) + margin,\; 0 \bigr),$
where $I_a$ is an anchor image, $I_p$ is a positive instance with the same label as $I_a$ and $I_n$ is a negative instance with a different label. These three samples are all drawn from a training batch using a hard example mining strategy. The $L_2$ distance metric function $d(\cdot)$ is used to compute the distance between feature vectors, and the $margin$ term encourages separation between positive and negative pairs during backpropagation.
The total loss of our method is a combination of the cross-entropy loss and the triplet loss, as follows:
$L = L_{CE} + \alpha \cdot L_{Tri},$
where α is a hyper-parameter that controls the weight ratio between the two losses.
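A minimal sketch of this training objective is given below. The batch-hard mining used here is one common realization of the hard example mining mentioned above and is an assumption, not necessarily the authors' exact variant; `logits`, `feats` and `labels` are illustrative names for the classifier outputs, embeddings and identity labels of a batch.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, feats, labels, margin=0.3, alpha=1.0):
    """L = L_CE + alpha * L_Tri with batch-hard triplet mining (a common choice)."""
    # Cross-entropy over identity classes.
    l_ce = F.cross_entropy(logits, labels)

    # Pairwise L2 distances between all embeddings in the batch.
    dist = torch.cdist(feats, feats, p=2)                    # (B, B)
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)     # (B, B)

    # Hardest positive: farthest sample with the same identity.
    d_pos = (dist * same_id.float()).max(dim=1).values
    # Hardest negative: closest sample with a different identity
    # (assumes each batch contains at least two identities).
    d_neg = dist.masked_fill(same_id, float('inf')).min(dim=1).values

    l_tri = F.relu(d_pos - d_neg + margin).mean()
    return l_ce + alpha * l_tri
```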

3.3. Identity-Guided Spatial Attention

The process of the Identity-Guided Spatial Attention (ISA) module is shown in Figure 3. We first introduce the concept of the Single Identity-guided Spatial Attention (SISA) map, which indicates the importance of different regions within an image. Building upon SISA, we then describe the Identity-guided Spatial Attention (ISA) framework, which is obtained by aggregating multiple SISA maps. The ISA framework consists of the single ISA map for the target identity and the single ISA maps for the remaining identities. Finally, we utilize the obtained ISA map to extract features that focus on key details.

3.3.1. Single Identity-guided Spatial Attention Map

In this part, we describe the generation of the Single Identity-guided Spatial Attention (SISA) map, which indicates the importance of different regions in an image. The proposed method is based on the ID-discriminative Embedding (IDE) model. We use the Global Average Pooling (GAP) layer and a fully connected layer of the IDE model. The result of applying GAP to the feature map $F$ is denoted as $f = (f_1, f_2, \ldots, f_k, \ldots, f_c)$, where $c$ is the number of channels and $f_k$ is calculated as follows:
$f_k = \frac{1}{h \cdot w} \sum_{i=1}^{h} \sum_{j=1}^{w} F_{i,j,k},$
where h and w denote the height and width of the feature map, respectively.
The resulting feature map from the GAP layer is then converted into a 1D tensor via the fully connected layer. Each value in the 1D tensor represents the probability of belonging to a particular category. This process can be expressed as:
$p = \mathrm{softmax}(W \cdot f),$
where $W$ is the $n \times c$ weight matrix of the fully connected layer and $n$ is the number of categories in the training set. The $\mathrm{softmax}(\cdot)$ function normalizes the resulting $n \times 1$ tensor into a probability distribution over the predicted classes. The output tensor $p$ is then used in the loss functions. For the sake of simplicity, the activation function and normalization operations that are commonly used are omitted from the above description.
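The following PyTorch sketch mirrors this classification head for a single feature map F of shape (c, h, w); names are illustrative, and the commonly used activation/normalization layers are likewise omitted.

```python
import torch

def classification_head(feature_map: torch.Tensor, W: torch.Tensor):
    """Global average pooling followed by a fully connected layer.

    feature_map: (c, h, w) backbone output F
    W:           (n, c)    weights of the FC layer, n = number of identities
    Returns (f, p): pooled feature f of shape (c,) and class probabilities p of shape (n,).
    """
    f = feature_map.mean(dim=(1, 2))     # f_k = mean over all spatial units (GAP)
    logits = W @ f                       # g = W . f, the input to the softmax
    p = torch.softmax(logits, dim=0)     # probability over the n identities
    return f, p
```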
The spatial attention map generated via the ISA module for a specific identity label $l$ is denoted as $M^{l}$, an $h \times w$ tensor. It is calculated by taking the sum of each channel of the feature map $F$, multiplied by the corresponding weight in the fully connected parameter $W$. More specifically, the calculation is as follows:
$M^{l}_{i,j} = \sum_{k=1}^{c} W_{l,k} \cdot F_{i,j,k},$
where each spatial unit $(i, j)$ of the feature map $F$ is utilized. This results in an $h \times w$ tensor that indicates the importance of different regions for the specific identity $l$.
It is worth noting that for each class, the average value of the single ISA map is equal to the corresponding dimension of the softmax input. In other words, for identity l, we have:
$\frac{1}{h \cdot w} \sum_{j=1}^{h} \sum_{k=1}^{w} M^{l}_{j,k} = g_l.$
The input to the softmax function is denoted as $g$; it is obtained as the product of the learned weight matrix $W$ and the feature vector $f$. Thus, $M^{l}_{i,j}$ represents the activation score of position $(i, j)$, which directly affects the predicted probability of the $l$-th category: the higher the value of $M^{l}_{i,j}$, the greater the contribution of that position to the predicted probability of the $l$-th category, and vice versa.
Furthermore, we introduce a hyperparameter r which represents the discriminative ratio and is used to control the size of the recognized discriminative regions. For a single ISA map, we can use a threshold to separate high and low activation levels and control the area of discriminative regions. Thus, r is defined as the ratio of the area of the discriminative regions to the entire image.
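As a concrete illustration, the sketch below computes a single ISA map with the equation above and binarizes it so that the top r fraction of spatial units are kept as discriminative regions; treating r as a top-fraction threshold is our interpretation of the discriminative ratio, and all names are illustrative.

```python
import torch

def single_isa_map(feature_map: torch.Tensor, W: torch.Tensor, l: int) -> torch.Tensor:
    """Compute M^l_{i,j} = sum_k W[l, k] * F[i, j, k] for identity l.

    feature_map: (c, h, w) backbone output F
    W:           (n, c)    FC layer weights
    Returns an (h, w) activation map for identity l.
    """
    return torch.einsum('c,chw->hw', W[l], feature_map)

def discriminative_mask(isa_map: torch.Tensor, r: float = 0.5) -> torch.Tensor:
    """Keep the top r fraction of spatial units as discriminative regions."""
    h, w = isa_map.shape
    k = max(1, int(r * h * w))
    threshold = isa_map.flatten().topk(k).values.min()
    return (isa_map >= threshold).float()      # (h, w) binary mask
```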

3.3.2. Generating Identity-Guided Attention Map

The identity-guided attention map $M$ consists of two parts: the single ISA map $M_{gt}$ of the ground-truth identity $gt$ and the aggregated single ISA maps $M_{\neg gt}$ of the remaining identities.
To improve the accuracy of discriminative region recognition, we first utilize the single ISA map of identity $gt$. Discriminative regions are indicated by high activation scores in $M_{gt}$. We define $M_{gt}$ as:
$M_{gt} = M^{l}, \quad \text{where } l = gt.$
Using a single spatial attention map is not always sufficient for accurate recognition of discriminative regions. To overcome this limitation, we propose to use multiple spatial attention maps. Specifically, we utilize all $n$ identities except for the ground-truth identity $gt$ and define the multi-identity attention map as a sum of their individual spatial attention maps, denoted as:
$M_{\neg gt} = -\left( \sum_{l=1,\, l \neq gt}^{n} M^{l} \right),$
where $n$ is the number of identities during training. In contrast to $M_{gt}$, the discriminative regions of each $M^{l}$ with $l \neq gt$ are indicated by low activation scores; thus, we apply a reverse operation (the negation above) to these attention maps. This approach allows us to consider multiple categories to obtain a refined and robust spatial attention map, rather than relying solely on the single attention map of identity $gt$. To illustrate the effectiveness of the proposed approach, two examples of discriminative region recognition results are presented in Figure 4, which shows that important patches such as lights and the sunroof are highlighted.
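Under this reading, the full ISA map could be assembled as sketched below: the ground-truth map is used directly, while the maps of the other identities are summed and negated before being combined. The additive combination of the two parts is our assumption, since the text only states that M "consists of" them; all names are illustrative.

```python
import torch

def isa_map(feature_map: torch.Tensor, W: torch.Tensor, gt: int) -> torch.Tensor:
    """Aggregate the single ISA map of the ground-truth identity with the
    reversed sum of the single ISA maps of all other identities (assumed sum)."""
    # Single ISA maps M^l for every identity l, shape (n, h, w).
    all_maps = torch.einsum('nc,chw->nhw', W, feature_map)
    m_gt = all_maps[gt]
    # Sum of the remaining identities, then reverse (negate) so that regions
    # unimportant for the other classes receive high scores.
    others = all_maps.sum(dim=0) - m_gt
    m_not_gt = -others
    return m_gt + m_not_gt
```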
In the field of re-identification, previous works have also explored the utilization of attention mechanisms. For instance, the method Quality-aware Part Models (QPM) [51] proposes an identity-aware attention module to emphasize pixels associated with the target pedestrian, thereby addressing occlusion issues between pedestrians using a coarse identity-aware feature. In contrast, our method focuses specifically on vehicle re-identification, tackling occlusions and extracting essential information from within the vehicles. Furthermore, while QPM derives attention information from the quality scores of various body parts, our approach leverages attention obtained through a classification task, specifically the class activation map. This distinction enables us to accentuate the discriminative regions within the vehicles for effective re-identification purposes.

3.4. Features with Identity-Guided Spatial Attention

In this section, we propose to use the ISA map to enhance the performance of vehicle re-identification. In widely used re-identification pipelines, the GAP layer is employed to generate a 1D embedding feature tensor by averaging all spatial units. However, this operation results in the loss of significant information at the spatial level. Therefore, we aim to focus more deeply on regions that contain abundant clues during training. To achieve this, we implement an attention weight matrix on the feature map before the GAP operation.
In Section 3.3, we present the ISA map for determining whether a region is discriminative or not, but it does not provide a specific importance score. To address this limitation, we introduce a parameter $w_{impact}$, which represents the attention weight of discriminative regions and is greater than zero. Additionally, we define an $h \times w$ tensor $A$ as the attention weight matrix, where each element of $A$ is equal to $w_{impact}$ or zero. The calculation procedure is expressed as follows:
$F' = F \odot (A + 1),$
where the $\odot$ operation denotes the Hadamard product. The resulting tensor $F'$ is the refined 3D feature tensor, which is then passed through the GAP layer. The weight matrix $A$ can be easily obtained by binarizing the ISA map $M$.
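A minimal sketch of this refinement step, assuming the binarized ISA mask from Section 3.3 and the scalar weight w_impact; tensor names and shapes are illustrative.

```python
import torch

def refine_features(feature_map: torch.Tensor, isa_mask: torch.Tensor,
                    w_impact: float = 0.2) -> torch.Tensor:
    """F' = F ⊙ (A + 1), where A = w_impact on discriminative units and 0 elsewhere.

    feature_map: (c, h, w) backbone output F
    isa_mask:    (h, w)    binarized ISA map (1 = discriminative, 0 = not)
    Returns the refined embedding after global average pooling, shape (c,).
    """
    A = w_impact * isa_mask                           # attention weight matrix
    refined = feature_map * (A + 1.0).unsqueeze(0)    # Hadamard product per channel
    return refined.mean(dim=(1, 2))                   # GAP over spatial dimensions
```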

4. Experimental Results

In this section, we evaluate our proposed method on three widely used vehicle re-identification datasets: VeRi-776 [55], VehicleID [56] and VERI-Wild [57]. We also describe the evaluation metrics used and implementation details.
The VeRi-776 [55] dataset contains 51,035 images of 776 vehicles captured by 20 cameras. The training set consists of 37,778 images of 576 vehicles and the remaining 13,257 images of 200 vehicles are used for testing. The test set includes 1678 query images, with the rest forming the gallery set. The dataset presents challenges such as various viewpoints, complex backgrounds and different capture distances.
The VehicleID [56] dataset has 221,763 images of 26,267 vehicles. The training set has 110,178 images of 13,134 vehicles and the remaining vehicle images form the test set. The test set has three subsets: small, medium and large, with 800, 1600 and 2400 vehicles, respectively. The dataset contains only two vehicle orientations, front and back, and there is only one matching image in the gallery for each query image.
The VERI-Wild [57] dataset contains 416,314 images of 40,671 identities captured by 174 cameras over one month. The training set contains 277,797 images of 30,671 identities. The test set is split into three subsets: small, medium and large, with 3000, 5000 and 10,000 identities, respectively. The dataset involves complex backgrounds, various viewpoints and different illumination and weather conditions. There can be multiple matching images in the gallery set for a probe image.
We use mean Average Precision (mAP) and rank-k accuracy as evaluation metrics, following [57]. The mAP metric is suitable for scenarios where each probe image has multiple matching gallery images. The average precision of one query image is calculated by:
$\mathrm{AP} = \sum_{g=1}^{G} p(g)\, \Delta r(g),$
where $p(g)$ denotes the precision over the first $g$ gallery images in the retrieved list, and $\Delta r(g) = 1$ when the $g$-th image shares the same identity as the query (and 0 otherwise). Then, mAP is defined as:
$\mathrm{mAP} = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{AP}(q),$
where $Q$ is the size of the query set. The rank-k accuracy specifies the percentage of probe images that are correctly matched with one of the top k images in the gallery set. Finally, we provide the implementation details of our proposed method.
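Before turning to the implementation details, the following sketch shows how AP and rank-k could be computed for a single query from a boolean vector marking which ranked gallery images match the query identity; it follows the standard definitions and may differ in detail from the official evaluation code.

```python
import numpy as np

def average_precision(matches: np.ndarray) -> float:
    """AP for one query. matches[g] is True when the g-th ranked gallery
    image shares the query identity."""
    if matches.sum() == 0:
        return 0.0
    precision_at_g = np.cumsum(matches) / (np.arange(len(matches)) + 1)
    # Precision at every matching position, normalized by the number of matches
    # (i.e., precision weighted by the change in recall).
    return float((precision_at_g * matches).sum() / matches.sum())

def rank_k(matches: np.ndarray, k: int) -> bool:
    """True when at least one of the top-k gallery images matches the query."""
    return bool(matches[:k].any())

# mAP is the mean of average_precision(...) over all Q queries; rank-k accuracy
# is the fraction of queries for which rank_k(...) is True.
```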

4.1. Implementation Details

We adopt IBN-Net-50 [58], a variant of ResNet-50 [54], as the backbone network in our experiments. IBN-Net-50 replaces part of the batch normalization layers with instance normalization, leading to more discriminative features at a negligible increase in computation cost. We set the stride of the last pooling layer to 1 to obtain larger feature maps. The proposed model is implemented using the PyTorch deep learning framework.
We resize input images to 256 × 256 during both the training and testing phases. The batch size is set to 64 and the number of images belonging to one category in a batch is set according to the properties of the different datasets. For VeRi-776 [55], we randomly select 16 images of each vehicle identity per batch, as each category has an average of 75.6 images. For VehicleID [56] and VERI-Wild [57], each class has 4 images per batch. We train the proposed method on top of a stable re-identification network for 120 epochs. We use SGD as the optimizer, with an initial learning rate of 0.01. We adopt a warm-up strategy for the first 10 epochs, during which the learning rate increases linearly from 0.0001 to 0.01. After the 30th epoch, the learning rate decays to 7.7 × 10^−5 following a cosine annealing strategy. We employ a batch normalization layer following the GAP layer and, during testing, use its output as the final embedding features for evaluation. On an NVIDIA RTX A4000 GPU, processing a batch of 128 images takes 0.505 s and requires about 3000 MB of GPU memory; these figures are specific to this GPU and may differ for other models.
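The learning-rate schedule described above could be reproduced roughly as follows. The behavior between epochs 10 and 30 (held constant at the base rate here) is not stated explicitly in the text and is an assumption.

```python
import math

def learning_rate(epoch: int, base_lr: float = 0.01, warmup_epochs: int = 10,
                  plateau_epochs: int = 30, total_epochs: int = 120,
                  min_lr: float = 7.7e-5) -> float:
    """Warm-up + cosine annealing schedule (one plausible reading of the paper).

    Epochs [0, warmup):       linear increase from 1e-4 to base_lr.
    Epochs [warmup, plateau): constant base_lr (assumption; not stated explicitly).
    Epochs [plateau, total):  cosine decay from base_lr down to min_lr.
    """
    if epoch < warmup_epochs:
        start = 1e-4
        return start + (base_lr - start) * epoch / warmup_epochs
    if epoch < plateau_epochs:
        return base_lr
    progress = (epoch - plateau_epochs) / (total_epochs - plateau_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```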
The discriminative ratio r of our method is set to 0.5. We apply random erasing and horizontal flipping during training for data augmentation, following the parameter settings recommended for image classification tasks [59]. Specifically, the random erasing parameters are set to $s_l = 0.02$, $s_h = 0.4$ and $r_1 = 0.3$, where the area of the erased rectangle is randomly selected from the range $(s_l, s_h)$ and $r_1$ controls the aspect ratio of the rectangular region. The probability of performing random erasing is set to 0.5, and horizontal flipping is also applied with a probability of 0.5.
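The augmentation pipeline maps closely onto standard torchvision transforms, as sketched below; torchvision's RandomErasing scale and ratio arguments play the role of (s_l, s_h) and r_1, although its sampling procedure differs slightly from the original random erasing formulation.

```python
import torchvision.transforms as T

# Training-time augmentation: resize, horizontal flip and random erasing,
# roughly matching the settings reported above (p = 0.5 for both).
train_transform = T.Compose([
    T.Resize((256, 256)),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),  # RandomErasing operates on tensors, so it comes last
    T.RandomErasing(p=0.5, scale=(0.02, 0.4), ratio=(0.3, 1 / 0.3), value=0),
])
```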

4.2. Parameters Analysis

The attention weight $w_{impact}$. The parameter $w_{impact}$ denotes the attention weight of the obtained regions during training. We conduct experiments with $w_{impact} \in \{0.1, 0.2, 0.3\}$; the results are shown in Table 1. It can be seen that when $w_{impact} = 0.2$, ISA achieves the highest mAP regardless of the parameter r.

4.3. Comparison with State-of-the-Art

In order to evaluate the effectiveness of our proposed method, we compare it with several state-of-the-art methods including RAM [19], VAMI [16], PGAN [60], AAVER [61], PRN [10], PVEN [11] and FDA-Net [57]. RAM extracts both global and regional features using a region branch to focus on more details. PRN employs a key region detection model to allow the re-identification network to pay more attention to important areas, which are manually selected in advance. Similarly, PGAN detects more regions to achieve higher accuracy. PVEN introduces a parsing model to segment a vehicle into four views. However, note that PRN, PGAN and PVEN require extra detection or segmentation methods with corresponding manual annotations.

4.3.1. Evaluation on VeRi-776

Table 2 shows the comparison of the proposed method with several state-of-the-art approaches, with +STR indicating spatiotemporal details utilized in the corresponding approaches. The results demonstrate that our method outperforms other methods, including those involving auxiliary data. The proposed method achieves an mAP of 0.821, which is higher than the other methods. Regarding rank1, the proposed method obtains a rank1 of 0.973, which outperforms most of the compared methods.
Our proposed method outperforms TransREID [62] and SAVER [24] by a significant margin among the methods that do not involve extra annotations. While SAVER achieves a competitive mAP (0.796) on VeRi-776, our proposed method outperforms it in the small set of VehicleID with a much higher rank1 value (0.871 vs. 0.799). These results demonstrate the effectiveness of our method.
Compared to state-of-the-art methods that use additional annotations, such as VARID [63], our method improves the mAP metric by 2.5%. VARID uses viewpoint labels in the training process. In contrast, our proposed method does not require any additional annotations, yet achieves higher mAP and rank1 values.
Table 2. We compare our method with several state-of-the-art vehicle re-identification approaches on the VeRi-776 dataset [55]. The evaluation criteria employed are mAP, rank1 and rank5. Methods that require extra annotations are denoted by §. Numbers in bold are the highest values.
Methods | mAP | rank1 | rank5
LOMO [64] | 0.096 | 0.253 | 0.465
BOW-CN [65] § | 0.122 | 0.339 | 0.536
EALN [66] § | 0.574 | 0.843 | 0.94
BIR [67] § | 0.707 | 0.904 | 0.97
RAM [19] § | 0.615 | 0.886 | 0.94
VAMI+STR [16] § | 0.613 | 0.859 | 0.918
GSTE [68] | 0.594 | 0.962 | 0.989
VANet [15] § | 0.663 | 0.897 | 0.959
AAVER [61] § | 0.663 | 0.901 | 0.943
PRN [10] § | 0.743 | 0.943 | 0.987
PRF [5] § | 0.779 | 0.964 | 0.985
PCRNet [25] § | 0.786 | 0.954 | 0.984
TransREID [62] | 0.782 | 0.965 | -
TransREID+views [62] § | 0.796 | 0.970 | 0.984
PVEN [11] § | 0.795 | 0.956 | 0.984
PGAN [60] § | 0.793 | 0.965 | 0.983
SAVER [24] | 0.796 | 0.964 | 0.986
VARID [63] § | 0.793 | 0.96 | 0.992
DFNet [69] § | 0.809 | 0.97 | 0.990
baseline | 0.805 | 0.953 | 0.982
ISA (Ours) | 0.821 | 0.973 | 0.990

4.3.2. Evaluation on VehicleID

The performance comparison on VehicleID is presented in Table 3, where we compare the proposed method with state-of-the-art methods, including RAM [19], PRN [10] and PCRNet [25]. Our method, which integrates local and global features, achieves superior performance in terms of rank1 and rank5 compared to most previous works. While PCRNet [25] achieves good performance on VehicleID and outperforms our method in terms of rank1 on the large subset, it requires manually labeled parsing data to carefully exploit vehicle features. In summary, our proposed method shows competitive results on the VehicleID dataset without relying on auxiliary information.

4.3.3. Evaluation on VERI-Wild

Table 4 presents the comparison results on the VERI-Wild dataset. The compared methods include GSTE [68], FDA-Net [57], SAVER [24] and PCRNet [25]. The proposed ISA achieves 0.830, 0.781 and 0.710 mAP on the small, medium and large subsets, respectively. GSTE [68] leverages spatio-temporal features to improve vehicle re-identification, but it still lags behind our method. FDA-Net [57] utilizes a multi-task framework to learn view-specific feature representations, but it requires extra annotations. SAVER [24] and PCRNet [25] achieve good performance by introducing additional semantic segmentation data, but our method outperforms both of them on all three subsets without extra annotations. These results demonstrate that the proposed method can effectively recognize vehicles in diverse scenarios and has great potential in practical applications.

5. Conclusions

In this paper, we proposed a novel Identity-guided Spatial Attention method for vehicle re-identification that exploits multiple spatial attention maps. Our method contributes by incorporating spatial attention for vehicle-specific details, introducing the Identity-guided Spatial Attention (ISA) module for feature refinement and offering an unsupervised technique that is widely applicable. It enables the model to focus on important vehicle-specific details, enhances feature representation through the ISA module and can be easily integrated into various vehicle re-identification frameworks without requiring additional manual annotation. The experimental results on three benchmark datasets demonstrate the superiority of the proposed method compared with state-of-the-art approaches. Our method achieves state-of-the-art performance on both the VeRi-776 and VERI-Wild datasets and outperforms most of the compared methods on the VehicleID dataset. Unlike many other approaches that rely on additional annotations to facilitate their training process, our method offers two significant advantages. Firstly, these extra annotations, such as key points, orientation or segmentation details, are expensive to obtain. By eliminating the need for such annotations, our method reduces the financial burden associated with data collection and annotation efforts. Secondly, the absence of these annotations also reduces the computational resources required during the training process. This efficiency makes our method more accessible and feasible for real-world deployment, as it can be trained and executed using fewer computational resources, ultimately increasing its practicality and wide applicability.
One limitation of this study is the reliance on a specific dataset for evaluation. While we have achieved promising results using this dataset, the generalizability of our method to other datasets and real-world scenarios may vary. Therefore, we will perform further validation on diverse datasets to establish the robustness and effectiveness of our approach in different practical scenarios.

Author Contributions

Conceptualization, Y.L.; Methodology, K.L.; Validation, S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China grant number 62206013.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhou, J.; Dong, Q.; Zhang, Z.; Liu, S.; Durrani, T.S. Cross-Modality Person Re-Identification via Local Paired Graph Attention Network. Sensors 2023, 23, 4011. [Google Scholar] [CrossRef] [PubMed]
  2. Pan, W.; Huang, L.; Liang, J.; Hong, L.; Zhu, J. Progressively Hybrid Transformer for Multi-Modal Vehicle Re-Identification. Sensors 2023, 23, 4206. [Google Scholar] [CrossRef]
  3. Lv, K.; Du, H.; Hou, Y.; Deng, W.; Sheng, H.; Jiao, J.; Zheng, L. Vehicle Re-Identification with Location and Time Stamps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 399–406. [Google Scholar]
  4. Sheng, H.; Lv, K.; Liu, Y.; Ke, W.; Lyu, W.; Xiong, Z.; Li, W. Combining pose invariant and discriminative features for vehicle reidentification. IEEE Internet Things J. 2020, 8, 3189–3200. [Google Scholar] [CrossRef]
  5. Lv, K.; Sheng, H.; Xiong, Z.; Li, W.; Zheng, L. Pose-Based View Synthesis for Vehicles: A Perspective Aware Method. IEEE Trans. Image Process. 2021, 29, 5163–5174. [Google Scholar] [CrossRef]
  6. Wang, S.; Wu, Z.; Hu, X.; Lin, Y.; Lv, K. Skill-based Hierarchical Reinforcement Learning for Target Visual Navigation. IEEE Trans. Multimed. 2023. [Google Scholar] [CrossRef]
  7. Hu, X.; Wu, Z.; Lv, K.; Wang, S.; Lin, Y. Agent-centric relation graph for object visual navigation. arXiv 2021, arXiv:2111.14422. [Google Scholar]
  8. Zhang, H.; Lin, Y.; Han, S.; Lv, K. Lexicographic Actor-Critic Deep Reinforcement Learning for Urban Autonomous Driving. IEEE Trans. Veh. Technol. 2022, 72, 4308–4319. [Google Scholar] [CrossRef]
  9. Lv, K.; Sheng, H.; Xiong, Z.; Li, W.; Zheng, L. Improving driver gaze prediction with reinforced attention. IEEE Trans. Multimed. 2020, 23, 4198–4207. [Google Scholar] [CrossRef]
  10. He, B.; Li, J.; Zhao, Y.; Tian, Y. Part-regularized near-duplicate vehicle re-identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3997–4005. [Google Scholar]
  11. Meng, D.; Li, L.; Liu, X.; Li, Y.; Yang, S.; Zha, Z.J.; Gao, X.; Wang, S.; Huang, Q. Parsing-based View-aware Embedding Network for Vehicle Re-Identification. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7103–7112. [Google Scholar]
  12. Wang, Z.; Tang, L.; Liu, X.; Yao, Z.; Yi, S.; Shao, J.; Yan, J.; Wang, S.; Li, H.; Wang, X. Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 379–387. [Google Scholar]
  13. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar]
  14. Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  15. Chu, R.; Sun, Y.; Li, Y.; Liu, Z.; Zhang, C.; Wei, Y. Vehicle re-identification with viewpoint-aware metric learning. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8282–8291. [Google Scholar]
  16. Zhou, Y.; Shao, L. Viewpoint-aware attentive multi-view inference for vehicle re-identification. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6489–6498. [Google Scholar]
  17. Zhou, Y.; Shao, L. Vehicle re-identification by adversarial bi-directional lstm network. In Proceedings of the IEEE Workshop on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 653–662. [Google Scholar]
  18. Zhu, X.; Luo, Z.; Fu, P.; Ji, X. VOC-ReID: Vehicle re-identification based on vehicle-orientation-camera. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 602–603. [Google Scholar]
  19. Liu, X.; Zhang, S.; Huang, Q.; Gao, W. Ram: A region-aware deep model for vehicle re-identification. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6. [Google Scholar]
  20. Zapletal, D.; Herout, A. Vehicle Re-identification for Automatic Video Traffic Surveillance. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1568–1574. [Google Scholar]
  21. Shen, D.; Zhao, S.; Hu, J.; Feng, H.; Cai, D.; He, X. ES-Net: Erasing Salient Parts to Learn More in Re-Identification. IEEE Trans. Image Process. 2021, 30, 1676–1686. [Google Scholar] [CrossRef]
  22. Liu, X.; Zhang, S.; Wang, X.; Hong, R.; Tian, Q. Group-group loss-based global-regional feature learning for vehicle re-identification. IEEE Trans. Image Process. 2019, 29, 2638–2652. [Google Scholar] [CrossRef]
  23. Shen, Y.; Xiao, T.; Li, H.; Yi, S.; Wang, X. Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1900–1909. [Google Scholar]
  24. Khorramshahi, P.; Peri, N.; Chen, J.; Chellappa, R. The Devil Is in the Details: Self-supervised Attention for Vehicle Re-identification. In Proceedings of the ECCV, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12359, pp. 369–386. [Google Scholar]
  25. Liu, X.; Liu, W.; Zheng, J.; Yan, C.; Mei, T. Beyond the Parts: Learning Multi-view Cross-part Correlation for Vehicle Re-identification. In Proceedings of the ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 907–915. [Google Scholar]
  26. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the ICLR, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  27. Bollmann, M.; Bingel, J.; Søgaard, A. Learning attention for historical text normalization by learning to pronounce. In Proceedings of the ACL, Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 332–344. [Google Scholar]
  28. Xu, M.; Wong, D.F.; Yang, B.; Zhang, Y.; Chao, L.S. Leveraging Local and Global Patterns for Self-Attention Networks. In Proceedings of the ACL, Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 3069–3075. [Google Scholar]
  29. Chen, H.; Huang, S.; Chiang, D.; Dai, X.; Chen, J. Combining Character and Word Information in Neural Machine Translation Using a Multi-Level Attention. In Proceedings of the NAACL, Association for Computational Linguistics, New Orleans, LA, USA, 1–6 June 2018; pp. 1284–1293. [Google Scholar]
  30. Luong, T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the EMNLP, The Association for Computational Linguistics, Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421. [Google Scholar]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the NeurIPS, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  32. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-Attention with Relative Position Representations. In Proceedings of the NAACL, Association for Computational Linguistics, New Orleans, LA, USA, 1–6 June 2018; pp. 464–468. [Google Scholar]
  33. Zhao, H.; Jia, J.; Koltun, V. Exploring Self-Attention for Image Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 10073–10082. [Google Scholar]
  34. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Zhang, Z.; Lin, H.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. ResNeSt: Split-Attention Networks. arXiv 2020, arXiv:2004.08955. [Google Scholar]
  35. Chen, G.; Lin, C.; Ren, L.; Lu, J.; Zhou, J. Self-Critical Attention Learning for Person Re-Identification. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 9636–9645. [Google Scholar]
  36. Zheng, H.; Fu, J.; Zha, Z.; Luo, J. Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-Grained Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 15–20 June 2019; pp. 5012–5021. [Google Scholar]
  37. Zhou, L.; Gong, C.; Liu, Z.; Fu, K. SAL: Selection and Attention Losses for Weakly Supervised Semantic Segmentation. IEEE Trans. Multimed. 2021, 23, 1035–1048. [Google Scholar] [CrossRef]
  38. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-Cross Attention for Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 603–612. [Google Scholar]
  39. Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid Attention Network for Semantic Segmentation. In Proceedings of the 2018 British Machine Vision Conference, Newcastle, UK, 3–6 September 2018; BMVA Press: Durham, UK, 2018; p. 285. [Google Scholar]
  40. Gao, G.; Zhao, W.; Liu, Q.; Wang, Y. Co-Saliency Detection With Co-Attention Fully Convolutional Network. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 877–889. [Google Scholar] [CrossRef]
  41. Wang, W.; Song, H.; Zhao, S.; Shen, J.; Zhao, S.; Hoi, S.C.H.; Ling, H. Learning Unsupervised Video Object Segmentation Through Visual Attention. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3064–3074. [Google Scholar]
  42. Zhu, L.; Chen, J.; Hu, X.; Fu, C.; Xu, X.; Qin, J.; Heng, P. Aggregating Attentional Dilated Features for Salient Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 3358–3371. [Google Scholar] [CrossRef]
  43. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  44. Zheng, H.; Fu, J.; Zha, Z.; Luo, J.; Mei, T. Learning Rich Part Hierarchies with Progressive Attention Networks for Fine-Grained Image Recognition. IEEE Trans. Image Process. 2020, 29, 476–488. [Google Scholar] [CrossRef]
  45. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  46. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164. [Google Scholar]
  47. Woo, S.; Park, J.; Lee, J.Y.; So Kweon, I. Cbam: Convolutional block attention module. In Proceedings of the ECCV, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  48. Park, J.; Woo, S.; Lee, J.; Kweon, I.S. BAM: Bottleneck Attention Module. In Proceedings of the BMVC, Newcastle, UK, 3–6 September 2018; BMVA Press: Durham, UK, 2018; pp. 147–160. [Google Scholar]
  49. Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q.V. Attention augmented convolutional networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3286–3295. [Google Scholar]
  50. Li, W.; Zhu, X.; Gong, S. Harmonious attention network for person re-identification. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2285–2294. [Google Scholar]
  51. Wang, P.; Ding, C.; Shao, Z.; Hong, Z.; Zhang, S.; Tao, D. Quality-aware part models for occluded person re-identification. IEEE Trans. Multimed. 2022. [Google Scholar] [CrossRef]
  52. Liu, S.; Johns, E.; Davison, A.J. End-to-end multi-task learning with attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 15–20 June 2019; pp. 1871–1880. [Google Scholar]
  53. Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Li, F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, FL, USA, 20–25 June 2009; IEEE Computer Society: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  54. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  55. Liu, X.; Liu, W.; Mei, T.; Ma, H. A deep learning-based approach to progressive vehicle re-identification for urban surveillance. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 869–884. [Google Scholar]
  56. Liu, H.; Tian, Y.; Yang, Y.; Pang, L.; Huang, T. Deep relative distance learning: Tell the difference between similar vehicles. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 2167–2175. [Google Scholar]
  57. Lou, Y.; Bai, Y.; Liu, J.; Wang, S.; Duan, L. Veri-wild: A large dataset and a new method for vehicle re-identification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 15–20 June 2019; pp. 3235–3243. [Google Scholar]
  58. Pan, X.; Luo, P.; Shi, J.; Tang, X. Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net. In Proceedings of the European Conference Computer Vision, Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11208, pp. 484–500. [Google Scholar]
  59. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random Erasing Data Augmentation. In Proceedings of the AAAI, New York, NY, USA, 7–12 February 2020; pp. 13001–13008. [Google Scholar]
  60. Zhang, X.; Zhang, R.; Cao, J.; Gong, D.; You, M.; Shen, C. Part-guided attention learning for vehicle instance retrieval. IEEE Trans. Intell. Transp. Syst. 2020, 23, 3048–3060. [Google Scholar] [CrossRef]
  61. Khorramshahi, P.; Kumar, A.; Peri, N.; Rambhatla, S.S.; Chen, J.C.; Chellappa, R. A dual-path model with adaptive attention for vehicle re-identification. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6132–6141. [Google Scholar]
  62. He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. Transreid: Transformer-based object re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  63. Li, Y.; Liu, K.; Jin, Y.; Wang, T.; Lin, W. VARID: Viewpoint-aware re-identification of vehicle based on triplet loss. IEEE Trans. Intell. Transp. Syst. 2020, 23, 1381–1390. [Google Scholar] [CrossRef]
  64. Liao, S.; Hu, Y.; Zhu, X.; Li, S.Z. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2197–2206. [Google Scholar]
  65. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Santiago, Chile, 7–13 December 2015; pp. 1116–1124. [Google Scholar]
  66. Lou, Y.; Bai, Y.; Liu, J.; Wang, S.; Duan, L.Y. Embedding adversarial learning for vehicle re-identification. IEEE Trans. Image Process. 2019, 28, 3794–3807. [Google Scholar] [CrossRef]
  67. Wu, M.; Zhang, Y.; Zhang, T.; Zhang, W. Background segmentation for vehicle re-identification. In Proceedings of the International Conference on Multimedia Modeling, Daejeon, Republic of Korea, 5–8 January 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 88–99. [Google Scholar]
  68. Bai, Y.; Lou, Y.; Gao, F.; Wang, S.; Wu, Y.; Duan, L.Y. Group-sensitive triplet embedding for vehicle reidentification. IEEE Trans. Multimed. 2018, 20, 2385–2399. [Google Scholar] [CrossRef]
  69. Bai, Y.; Liu, J.; Lou, Y.; Wang, C.; Duan, L. Disentangled Feature Learning Network and a Comprehensive Benchmark for Vehicle Re-Identification. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 6854–6871. [Google Scholar] [CrossRef] [PubMed]
  70. Huang, Y.; Liang, B.; Xie, W.; Liao, Y.; Kuang, Z.; Zhuang, Y.; Ding, X. Dual domain multi-task model for vehicle re-identification. IEEE Trans. Intell. Transp. Syst. 2020, 23, 2991–2999. [Google Scholar] [CrossRef]
Figure 1. Illustrative examples of attention on vehicle images. We employ the activation map [13] to demonstrate the regions to which the features are directed. The baseline refers to the method proposed by Luo et al. [14]. Contrasting the baseline, which yields irrelevant and dispersed attention, our method exhibits relevant and focused attention.
Figure 2. The structure of our network. The network includes ResNet-50 [54] as the backbone network, which generates a 3D feature map. The process can be divided into two steps. Firstly, the Identity-guided Spatial Attention (ISA) module is introduced to create an attention weight matrix. Secondly, the feature map is refined using the attention weight matrix through Hadamard product. Both steps include GAP and fully connected layers and employ cross-entropy and triplet loss for training.
Figure 3. The process of the Identity-guided Spatial Attention (ISA) module, which aims to generate the identity-guided spatial attention map $M$, consists of two parts: the single ISA map $M_{gt}$ for the ground-truth identity $gt$ and the aggregated ISA map $M_{\neg gt}$ for the other identities $l \neq gt$.
Figure 4. Two visualized spatial attention maps. Each pair consists of an original image and a corresponding image that highlights the recognized discriminative regions. Examples (a) and (b) illustrate that our method does not focus on the occluder and the background, respectively.
Table 1. Parameter analysis of $w_{impact}$. Experiments are conducted on VeRi-776 [55].
w_impact | mAP | rank1
0.1 | 0.793 | 0.957
0.2 | 0.814 | 0.970
0.3 | 0.824 | 0.978
Table 3. We compare our proposed method with state-of-the-art approaches on VehicleID [56], which is divided into three subsets: small, medium and large. Evaluation metrics include mAP, rank1 and rank5. The approaches that require extra annotations are denoted by §. Numbers in bold are the highest values.
Method | Small (mAP / rank1 / rank5) | Medium (mAP / rank1 / rank5) | Large (mAP / rank1 / rank5)
VAMI [16] § | - / 0.631 / 0.833 | - / 0.529 / 0.751 | - / 0.473 / 0.703
AAVER [61] § | - / 0.747 / 0.938 | - / 0.686 / 0.900 | - / 0.635 / 0.856
EALN [66] § | 0.775 / 0.751 / 0.881 | 0.742 / 0.718 / 0.839 | 0.71 / 0.693 / 0.814
RAM [19] § | - / 0.752 / 0.915 | - / 0.723 / 0.870 | - / 0.677 / 0.845
PRN [10] § | - / 0.784 / 0.923 | - / 0.750 / 0.883 | - / 0.742 / 0.864
SAVER [24] | - / 0.799 / 0.952 | - / 0.776 / 0.911 | - / 0.753 / 0.883
PGAN [60] | - / - / - | - / - / - | 0.839 / 0.778 / 0.921
TransReID [62] | - / 0.823 / 0.961 | - / - / - | - / - / -
PVEN [11] § | - / 0.847 / 0.970 | - / 0.806 / 0.945 | - / 0.778 / 0.920
PCRNet [25] § | - / 0.866 / 0.981 | - / 0.822 / 0.963 | - / 0.804 / 0.942
DDM [70] | 0.823 / 0.757 / 0.905 | 0.802 / 0.743 / 0.889 | 0.785 / 0.731 / 0.853
VARID [63] § | 0.885 / 0.858 / 0.969 | 0.847 / 0.812 / 0.941 | 0.824 / 0.795 / 0.922
baseline | 0.903 / 0.849 / 0.972 | 0.879 / 0.817 / 0.96 | 0.845 / 0.78 / 0.93
ISA (Ours) | 0.910 / 0.871 / 0.987 | 0.891 / 0.831 / 0.961 | 0.860 / 0.791 / 0.947
Table 4. We present a comparison of the state-of-the-art vehicle re-identification methods on VERI-Wild [57], which has been divided into three subsets based on the number of vehicle instances. To evaluate the performance of the methods, multiple metrics such as mAP, rank1 and rank5 are used. Methods marked with § require extra annotations for training. Numbers in bold are the highest values.
Method | Small (mAP / rank1 / rank5) | Medium (mAP / rank1 / rank5) | Large (mAP / rank1 / rank5)
DRDL [56] | 0.225 / 0.570 / 0.750 | 0.193 / 0.519 / 0.710 | 0.148 / 0.446 / 0.610
GSTE [68] | 0.314 / 0.605 / 0.801 | 0.262 / 0.521 / 0.749 | 0.195 / 0.454 / 0.665
FDA-Net [57] § | 0.351 / 0.640 / 0.828 | 0.298 / 0.578 / 0.783 | 0.228 / 0.494 / 0.705
AAVER [61] | 0.622 / 0.758 / 0.927 | 0.536 / 0.682 / 0.888 | 0.416 / 0.586 / 0.815
SAVER [24] | 0.809 / 0.945 / 0.981 | 0.753 / 0.927 / 0.974 | 0.677 / 0.895 / 0.958
PCRNet [25] § | 0.812 / 0.925 / - | 0.753 / 0.893 / - | 0.671 / 0.85 / -
VARID [63] § | 0.754 / 0.753 / 0.952 | 0.708 / 0.688 / 0.918 | 0.642 / 0.632 / 0.832
baseline | 0.801 / 0.923 / 0.965 | 0.756 / 0.908 / 0.956 | 0.682 / 0.843 / 0.953
ISA (Ours) | 0.830 / 0.949 / 0.988 | 0.781 / 0.941 / 0.988 | 0.710 / 0.916 / 0.983