Article

Vehicle Re-Identification Based on UAV Viewpoint: Dataset and Method

1
School of Computer Science and Engineering, Central South University, South Lushan Road, Changsha 410083, China
2
School of Geosciences and Info-Physics, Central South University, South Lushan Road, Changsha 410083, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(18), 4603; https://doi.org/10.3390/rs14184603
Submission received: 8 August 2022 / Revised: 8 September 2022 / Accepted: 11 September 2022 / Published: 15 September 2022

Abstract

High-resolution remote sensing images bring a large amount of data as well as challenges to traditional vision tasks. Vehicle re-identification (ReID), an essential vision task that can utilize remote sensing images, has been widely used in suspect vehicle searches, cross-border vehicle tracking, traffic behavior analysis, and automatic toll collection systems. Although vehicle ReID has been studied extensively, most existing work is based on fixed surveillance cameras and does not take full advantage of high-resolution remote sensing images. Compared with images collected by fixed surveillance cameras, high-resolution remote sensing images based on Unmanned Aerial Vehicles (UAVs) offer rich viewpoints and a wide range of scale variations. These characteristics bring richer information to vehicle ReID tasks and have the potential to improve the performance of vehicle ReID models. However, to the best of our knowledge, there is a shortage of large open-source datasets for vehicle ReID based on UAV views, which hinders research on UAV-view-based vehicle ReID. To address this issue, we construct a large-scale vehicle ReID dataset named VRU (the abbreviation of Vehicle Re-identification based on UAV), which consists of 172,137 images of 15,085 vehicles captured by UAVs, in which each vehicle has multiple images from various viewpoints. Compared with the existing UAV-based vehicle ReID datasets, the VRU dataset has a larger volume and is fully open-source. Since most existing vehicle ReID methods are designed for fixed surveillance cameras, they adapt poorly to UAV-based vehicle images with multi-viewpoint and multi-scale characteristics. This work therefore proposes a Global Attention and full-Scale Network (GASNet) for the vehicle ReID task based on UAV images. To verify its effectiveness, GASNet is compared with baseline models on the VRU dataset. The experimental results show that GASNet achieves 97.45% Rank-1 and 98.51% mAP, outperforming the baselines by 3.43%/2.08% in terms of Rank-1/mAP. Our major contributions can be summarized as follows: (1) the provision of an open-source UAV-based vehicle ReID dataset and (2) the proposal of a state-of-the-art model for UAV-based vehicle ReID.


1. Introduction

Along with the advancement of remote sensing technology, the spatial resolution of remote sensing images has reached the “sub-meter level” [1,2], while remote sensing images acquired by Unmanned Aerial Vehicles (UAVs) can reach even higher resolution. Furthermore, the high mobility and flexibility of UAVs yield remote sensing images with rich viewpoints and a wide range of scale variations. Although those remote sensing images provide rich information for vision tasks, they also bring challenges.
The focus of this work is the vehicle Re-IDentification (ReID) task through UAVs. As an important vision task, vehicle ReID has been widely used in applications such as video surveillance [3], intelligent transportation [4], and urban computing [5]. The goal of vehicle ReID is to discover, localize, and track a queried target vehicle in a large volume of vehicle data. Most existing vehicle ReID works are based on fixed surveillance cameras, as shown in Figure 1a. Fixed surveillance cameras cannot provide multi-view and multi-scale vehicle images due to their fixed locations and limited viewing angles. In contrast, UAV-based remote sensing images (as shown in Figure 1b) have the potential to improve the performance of vehicle ReID models because they bring richer information to the vehicle ReID task. However, UAV-based vehicle ReID work is still in its exploratory stage. To the best of our knowledge, there are only two UAV-based datasets for the vehicle ReID task, namely VRAI [6] and UAV-VeID [7], neither of which is fully open source. Therefore, to facilitate future UAV-based vehicle ReID work, this paper constructs a large-scale open-source UAV-based vehicle ReID dataset named Vehicle ReID based on UAV (VRU).
To obtain multi-view and multi-scale vehicle images in realistic scenarios, we deployed five UAVs to shoot videos of vehicles at multiple scenes (such as highways, intersections, and parking lots), in various periods (such as morning, noon, afternoon, and night), and under various weather conditions (such as sunny, cloudy, and drizzly). More than 15 h of video data were selected from the captured videos to produce the dataset of multi-view and multi-scale vehicle images. The final VRU dataset consists of 172,137 images from 15,085 vehicle instances. Compared with the existing UAV-based vehicle ReID datasets, the VRU dataset has the largest data volume.
Since most existing vehicle ReID models are trained and tested on vehicle images taken by fixed surveillance cameras, it is difficult to adapt them to UAV-based multi-view and multi-scale vehicle images. Therefore, we propose a Global Attention and full-Scale Network (GASNet), which can extract view- and scale-invariant features from multi-view and multi-scale vehicle images, thus improving the capability of re-identifying vehicles. To verify the effectiveness of GASNet, we compared it with two mainstream ReID models, namely MGN [8] and SCAN [9]. The experimental results show that GASNet achieves 97.45% Rank-1 and 98.51% mAP, outperforming both MGN [8] and SCAN [9]. Moreover, the ablation experiments show that the global attention module can effectively utilize multi-view information, while the full-scale module can aggregate information from different scales. By integrating the two modules, GASNet outperforms the baseline models on the VRU dataset.
The main contributions of this work can be summarized as follows.
(1) We provide a large open-source dataset, named VRU, for vehicle ReID tasks from the perspective of UAVs. Benefiting from the maneuverability and flexibility of UAVs, VRU is characterized by rich scales and diverse views. In addition, to reflect the practical scenarios in the real world, the vehicle images collected for VRU are shot in multiple weather conditions, multiple time periods, and multiple urban traffic scenes. Therefore, VRU will bring numerous challenges to the vehicle ReID task and may inspire novel research works.
(2) We propose a vehicle ReID model named GASNet, which integrates the global attention and full-scale convolution modules to take advantage of the rich scales and diverse views in the VRU dataset.
(3) We conduct comprehensive experiments to evaluate the effectiveness of the proposed GASNet model. The experimental results show that GASNet achieves 97.45% Rank-1 and 98.51% mAP and outperforms state-of-the-art vehicle ReID models. This illustrates that GASNet can effectively address the multi-view and multi-scale challenges brought by UAV-based vehicle ReID images.
The rest of this work is organized as follows. Section 2 reviews the related works. Section 3 describes the VRU dataset and presents the GASNet model. Section 4 presents the experimental results of evaluating the proposed dataset and model. Section 5 discusses the significance and implications of the proposed dataset and model. Section 6 concludes this work and discusses future work.

2. Related Works

In recent years, vehicle ReID has received increasing attention. Usually, a vehicle ReID model first utilizes a convolutional neural network to extract vehicle features and then employs metric learning to optimize the distances among the extracted features [9,10,11,12]. However, most metric learning methods do not consider variations in viewpoint and scale, even though the resulting changes in visual appearance can reduce their effectiveness. Therefore, we propose a new vehicle ReID method that addresses viewpoint and scale changes.
Concerning vehicle ReID datasets, most existing ones are obtained from fixed surveillance cameras. For example, Liu et al. [11] constructed a relatively small vehicle ReID dataset named VeRi, which consists of about 40,000 images from 619 vehicles captured by 20 surveillance cameras. The dataset is also labeled with luggage racks, vehicle types, colors, and brands. Liu et al. [13] then proposed the VeRi-776 dataset by extending the VeRi dataset in terms of data volume and adding new tags associated with license plates and spatiotemporal traces. During the same period, Liu et al. [14] proposed a dataset named VehicleID, which consists of 221,763 images of 26,267 vehicles. Compared with the aforementioned datasets, the VERI-Wild [15] dataset, proposed by Lou et al., has a larger volume of data, consisting of 416,314 images from 40,671 vehicles. The datasets mentioned above were all captured by fixed surveillance cameras. However, this type of dataset is limited by the data collection equipment and suffers from insufficient diversity of vehicle viewpoints and scales. For visual tasks, images from various viewpoints may be extremely different in visual appearance, while scale variation may also lead to significant changes in feature distribution [16]. Thus, the variation of viewpoints and scales is of great value for improving the generalization performance of ReID models.
To increase the viewpoint diversity of vehicles in fixed surveillance camera scenarios, Zhou et al. [17] use Generative Adversarial Networks (GANs) to generate vehicle images with various viewpoints. However, the fidelity and diversity of the multi-view vehicle images generated by GANs are still much lower than those of authentic multi-view images. Therefore, it is desirable to build a vehicle ReID dataset consisting of authentic vehicle images with multi-view and multi-scale features. As mentioned above, the remote sensing images captured by UAVs can provide multi-view and multi-scale features for vehicles. However, to the best of our knowledge, there are only two existing UAV-based vehicle ReID datasets, namely VRAI [6] and UAV-VeID [7]. The VRAI dataset is the first UAV-based vehicle ReID dataset, containing 137,613 images from 13,022 vehicles. The UAV-VeID dataset is smaller than the VRAI dataset, consisting of only 41,917 images from 4601 vehicles. Although these two datasets were obtained by UAVs, neither of them is fully open source. In contrast, our VRU dataset not only has the largest data volume but also will be completely open-source.
A few works have utilized multi-scale and multi-view information in ReID datasets to improve ReID performance. For example, from the multi-scale perspective, the MGN [8] model extracts global and local features from an object through a multi-branch network; Zhou et al. [18] use multiple convolutional streams to detect multi-scale features; and Chen et al. [19] formulate a novel deep pyramid feature learning CNN architecture for multi-scale feature fusion. From the multi-view perspective, Wang et al. [20] set 20 key points for a vehicle instance and cluster all key points into four view-based region masks to obtain orientation-invariant feature vectors in different views. However, none of these approaches consider all perspectives simultaneously. Therefore, this paper proposes the GASNet model, which can extract vehicle features with view invariance and scale invariance, thus making full use of the information provided by the UAV vehicle ReID dataset. In addition, the imbalance of information across different spatial locations and channels also affects model performance [9]. GASNet draws on the idea of SCAN [9] and uses a channel attention mechanism and a spatial attention mechanism to address this imbalance.
The existing works on vehicle ReID tasks have identified numerous techniques to speed up the training process and improve the performance of vehicle ReID models. For example, the batch size has a significant impact on the accuracy of a vehicle ReID model; in general, larger batch sizes lead to better-trained models. In this paper, we also compare our GASNet with the baseline under various batch sizes. The selection of loss functions also affects model performance [21]. Most vehicle ReID models utilize ID Loss and Triplet Loss jointly for training [22,23,24]. However, the optimization objectives of these two losses lie in different feature spaces, which may lead to a trade-off between them, i.e., the decrease of one loss may come at the cost of an increase of the other during training. To address this issue, Luo et al. [21] proposed the BNNeck module, which adds a Batch Normalization (BN) layer between the ID Loss and the Triplet Loss. Normalizing the features reduces the interference between the ID Loss and the Triplet Loss, which in turn makes it easier for the two losses to converge simultaneously. Thus, the BNNeck module can speed up the training process and improve model performance. To enhance the performance of the GASNet model, this work also adopts the aforementioned training tricks.
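To make the BNNeck idea concrete, the following is a minimal PyTorch sketch; the class name BNNeckHead and its interface are ours, not the implementation of [21]. The pre-BN feature feeds the Triplet Loss, while the BN-normalized feature feeds the ID (cross-entropy) classifier, which eases the conflict between the two losses.

```python
import torch.nn as nn


class BNNeckHead(nn.Module):
    """Minimal BNNeck sketch (after Luo et al. [21]; class name is ours):
    the pre-BN feature feeds the triplet loss, while the BN-normalized
    feature feeds the ID classifier and is used at inference time."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(feat_dim)
        self.bn.bias.requires_grad_(False)   # commonly used trick: freeze the BN shift
        self.classifier = nn.Linear(feat_dim, num_classes, bias=False)

    def forward(self, feat):
        feat_bn = self.bn(feat)              # normalized feature for the ID loss / inference
        logits = self.classifier(feat_bn)
        return feat, feat_bn, logits         # (triplet feature, inference feature, ID logits)
```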

3. Materials and Methods

To take full advantage of high-resolution remote sensing images for vehicle ReID research, we construct a vehicle ReID dataset based on the UAV perspective. Based on this, we propose a new model to address the multi-view and multi-scale challenges of this dataset. This section will introduce our dataset and method.

3.1. Dataset

Currently, the data source for the vehicle ReID task is mainly road surveillance cameras. However, due to the fixed shooting positions, it is difficult to collect vehicle images from different viewpoints with various scales.

3.1.1. Data Collection

Recently, due to breakthroughs in UAV flight time, automatic control algorithms, and wireless data transmission, it has become possible to construct large-scale UAV-based datasets for the vehicle ReID task. Furthermore, since UAVs have better mobility and flexibility than surveillance cameras, this paper uses UAVs to construct a vehicle image dataset, named VRU, for the vehicle ReID task.
To collect vehicle image data under various scenes, five ‘DJI Mavic 2 Pro’ UAVs are deployed. The cameras record at 30 frames per second with a resolution of 3840 × 2160 pixels, and the videos are saved in MOV format. The configurations of the UAVs and the attached cameras are enumerated in Table A9 and Table A10 (in Appendix A.2), respectively.

3.1.2. Multi-View and Multi-Scale

We design different shooting strategies for vehicles in different scenarios to ensure the viewpoint and scale diversity of the remotely sensed vehicle images. For the vehicles in parking lots, we capture vehicle instances through two UAVs that rotate from opposite directions and change their flying altitudes from time to time. For a moving vehicle, five UAVs capture the vehicle simultaneously. Four UAVs obtain the front, rear, left, and right views of the same vehicle, respectively. The remaining one rotates and shoots video within a pre-defined range of the vehicle. We set the height range of a flying UAV from 15 to 60 m. Each UAV flies at a different height, and the height difference between adjacent UAVs is more than 5 m. The shooting angle is between 40 and 80 degrees. With the above shooting strategy, we can obtain images of vehicles from different viewpoints with various scales as shown in Figure 2.

3.1.3. Multi-Scene, Multi-Time, and Multi-Weather

As the data support for the vehicle ReID task, a dataset has a great impact on the performance of a model. A good dataset should fit the data distribution in real scenarios. As for vehicle images, the types of vehicles that appear on different roads may vary greatly. For example, due to urban traffic management regulations, large vehicles cannot appear in the downtown area during the daytime, although they are allowed on suburban roads. For the same reason, tricycles may appear on suburban roads rather than downtown. Therefore, as illustrated in Figure 3a, to obtain data fitting the real scenarios, we selected representative downtown/suburb scenes, such as elevated outer-ring roads (first row), urban traffic intersections (second row), and parking lots (third row), as data collection locations.
In vision tasks, illumination variation can directly affect the recognition results of a model. The illumination changes over time. Furthermore, the types and quantity of vehicles on roads may vary at different periods. For example, during rush hour, a large number of cars appear on the urban roads. To fit the varied illumination conditions and vehicle type distribution in real scenarios, we choose four time periods to remotely sense vehicles as shown in Figure 3b, where the first, the second, the third, and the fourth rows are taken in the morning, noon, afternoon, and evening, respectively.
Weather conditions may affect the distribution of vehicles as well as the clarity of the remote sensing images. Thus, to fit real scenarios, this paper collects data under three weather conditions, namely sunny, cloudy, and drizzly days, corresponding to the first to third rows in Figure 3c.

3.1.4. Data Processing

After obtaining the original videos taken by the UAVs, we selected 200 video sequences, each with a duration of 5 min, as the initial data for building the dataset. We then extracted one frame per second from these video sequences, yielding 60,000 raw images in total for constructing the UAV-based vehicle ReID dataset.
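The per-second frame sampling described above can be reproduced with a short script. The following is an illustrative sketch using OpenCV; the function name and file paths are ours, not part of a released toolchain.

```python
import cv2
from pathlib import Path


def sample_frames(video_path: str, out_dir: str, fps: int = 30) -> int:
    """Save one frame per second from a video (illustrative sketch).
    With 30 fps footage, every 30th frame is kept."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % fps == 0:
            cv2.imwrite(f"{out_dir}/{Path(video_path).stem}_{saved:05d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```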

3.1.5. Data Annotation

Since the same vehicle may be captured by different UAVs at the same moment, we first align the timestamps of all the images so that the vehicle images can be annotated correctly. To improve the annotation accuracy, we recruited 20 volunteers familiar with the vehicle ReID task and the corresponding dataset to manually locate and annotate all raw images. By localizing and cropping the vehicles in the 60,000 raw images, a total of 172,137 images containing 15,085 vehicle instances were obtained. A comparison of raw images and vehicle images in the VRU dataset is shown in Figure 4.
By cropping the raw images with multiple vehicles, we obtain final vehicle images, each of which contains only one vehicle. Since each image is cropped to fit the size of the vehicle, the pixel width of an image can be approximated as the corresponding vehicle scale. Thus, we can plot the distribution of the vehicle scales, as shown in Figure 5, where we also compare the scale distribution of the VehicleID dataset and VERI-Wild dataset, both of which are collected through surveillance cameras. From Figure 5, we can observe that our VRU dataset has the largest scale diversity.
This paper also counts the number of vehicles associated with each viewpoint in the VRU dataset. The statistical results in Figure 6 show that the distribution of images corresponding to the four viewpoints is relatively uniform. This fully illustrates the viewpoint diversity of the VRU dataset.
Meanwhile, this paper also counts the number of images corresponding to each vehicle instance in the VRU dataset. The statistical results are shown in Figure 7, from which it can be observed that more than 97% of the vehicle instances have more than three images. For most of the vehicle instances, the number of images ranges from 8 to 15. In summary, the VRU dataset contains rich viewpoint and scale information, which provides data support for training a more robust vehicle ReID model.

3.1.6. Dataset Partitioning

Following the general practice of vehicle ReID dataset division, the VRU dataset is divided into a training set and three test sets, namely small, medium, and large test sets. The training set includes 80,532 images from 7085 vehicle instances. The small, medium, and large test sets contain 13,920 images from 1200 vehicle instances, 27,345 images from 2400 vehicle instances, and 91,595 images from 8000 vehicle instances, respectively. In a ReID dataset, the test set usually consists of a query set and a gallery set, and the vehicle ReID task is to retrieve images in the gallery set that match the vehicles in the query set. Therefore, we put one image of each vehicle instance in the VRU test set into the gallery set, and the remaining images are used as the query set.
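The query/gallery split described above can be expressed compactly; the following Python sketch (function and variable names are ours) shows one way to implement the protocol of one gallery image per vehicle instance.

```python
import random
from collections import defaultdict


def split_query_gallery(samples, seed=0):
    """samples: list of (image_path, vehicle_id) pairs from a VRU test set.
    Puts one image per vehicle into the gallery and the rest into the query set,
    mirroring the protocol described above (illustrative sketch)."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for path, vid in samples:
        by_id[vid].append(path)
    gallery, query = [], []
    for vid, paths in by_id.items():
        rng.shuffle(paths)
        gallery.append((paths[0], vid))           # one image per identity
        query.extend((p, vid) for p in paths[1:])  # remaining images
    return query, gallery
```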

3.1.7. Dataset Comparison

To the best of our knowledge, the existing UAV-based vehicle ReID datasets are VRAI and UAV-VeID, which are also multi-view and multi-scale. In contrast, our dataset is not only multi-view and multi-scale but also multi-weather and multi-illumination, and it has a larger data volume. This allows the images in the VRU dataset to provide richer information to vehicle ReID models. Additionally, our dataset will be fully open source. A comparison of the VRU dataset with the VRAI and UAV-VeID datasets is shown in Table 1.

3.2. GASNet

Most existing vehicle ReID methods are developed on data acquired by fixed surveillance cameras and are therefore difficult to adapt to UAV remote sensing images with multi-view and multi-scale characteristics: the appearance of the same vehicle instance differs greatly across viewpoints, and the information provided by vehicle images varies across scales. Therefore, to cope with the multi-view and multi-scale challenges brought by UAV remote sensing images, this paper proposes a vehicle ReID model (called GASNet) based on relation-aware global attention and a full-scale mechanism. GASNet captures features with global information by introducing a relation-aware global attention mechanism [25] and introduces a full-scale network [18] to correlate features of different scales, extracting features that do not change with viewpoint and scale and thereby improving the generalization ability of the model.

3.2.1. Overall Structure of GASNet

As shown in Figure 8, the GASNet model consists of a backbone network that extracts viewpoint-invariant features and a branch network that extracts scale-invariant features. Following common practice in ReID, we use ResNet50 [26] as the base network, insert the relation-aware global attention module (abbreviated as GA) block by block after the second residual block to form the backbone network, and attach the full-scale module (abbreviated as FS) after the third residual block to form the branch network. A BNNeck structure is added at the end of both the backbone network and the branch network to optimize the feature distribution so that the whole network can be trained faster and better.
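The following PyTorch sketch outlines this structure under our reading of Figure 8; it is a structural illustration, not the authors' implementation. The `ga_module` and `fs_module` arguments stand in for the relation-aware global attention [25] and full-scale [18] modules described in the next two subsections, GA is shown at a single insertion point, and the branch feature dimension `branch_dim` is an assumption.

```python
import torch.nn as nn
from torchvision.models import resnet50


class GASNetSkeleton(nn.Module):
    """Structural sketch of GASNet as described above (Figure 8), not the authors' code.
    `ga_module` acts on the 512-channel map after the 2nd residual stage;
    `fs_module` is assumed to map the 1024-channel map after the 3rd stage
    to `branch_dim` channels."""

    def __init__(self, num_classes, ga_module, fs_module, branch_dim=512):
        super().__init__()
        r = resnet50()
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2 = r.layer1, r.layer2
        self.ga = ga_module                      # global attention after the 2nd residual stage
        self.layer3, self.layer4 = r.layer3, r.layer4
        self.fs = fs_module                      # full-scale branch after the 3rd residual stage
        self.pool = nn.AdaptiveAvgPool2d(1)
        # BNNeck on both outputs: BN-ed features feed the ID classifiers,
        # pre-BN features feed the triplet loss (see Section 2).
        self.bn_main, self.bn_branch = nn.BatchNorm1d(2048), nn.BatchNorm1d(branch_dim)
        self.cls_main = nn.Linear(2048, num_classes, bias=False)
        self.cls_branch = nn.Linear(branch_dim, num_classes, bias=False)

    def forward(self, x):
        x = self.ga(self.layer2(self.layer1(self.stem(x))))
        x3 = self.layer3(x)
        f_main = self.pool(self.layer4(x3)).flatten(1)    # viewpoint-invariant backbone feature
        f_branch = self.pool(self.fs(x3)).flatten(1)      # scale-invariant branch feature
        logits = (self.cls_main(self.bn_main(f_main)),
                  self.cls_branch(self.bn_branch(f_branch)))
        return (f_main, f_branch), logits
```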

3.2.2. Global Attention Module

Attention mechanisms can be divided into local attention and global attention according to their learned attention weights. Local attention focuses on locally salient regions of the target, while global attention captures the overall information of the target. One of the challenges of the vehicle ReID task based on UAV remote sensing images is that the vehicle viewpoint changes frequently. To handle this, it is necessary to extract vehicle features that do not change with the viewpoint, that is, features that contain the overall vehicle information. For this purpose, we introduce the relation-aware global attention module, which consists of a globally-aware spatial attention mechanism and a globally-aware channel attention mechanism. This module treats the feature at each position of the feature map as a node and, by emphasizing the relationships between nodes, mines correlations and semantic information over the global scope to extract vehicle features that do not change with the viewing angle.
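As a rough illustration of the spatial half of this idea, the sketch below treats every spatial position as a node, computes pairwise affinities between nodes, and turns each node's relation summary into an attention weight. It is a deliberately simplified stand-in of our own design; the actual relation-aware global attention of [25] stacks the relation vectors with the original features and also includes a channel counterpart.

```python
import torch
import torch.nn as nn


class SimpleGlobalSpatialAttention(nn.Module):
    """Much-simplified illustration of relation-aware spatial attention
    (not the exact RGA formulation of [25]): each spatial position is a node,
    pairwise node affinities are summarized into a per-position attention weight."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.embed = nn.Conv2d(channels, channels // reduction, 1)
        self.score = nn.Sequential(nn.Conv2d(1, 1, 1), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        e = self.embed(x).flatten(2)              # (b, c', h*w): node embeddings
        rel = torch.bmm(e.transpose(1, 2), e)     # (b, h*w, h*w): pairwise relations
        summary = rel.mean(dim=2).view(b, 1, h, w)  # summarize each node's relations
        attn = self.score(summary)                # (b, 1, h, w) attention map
        return x * attn
```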

3.2.3. Full-Scale Module

Vehicle ReID datasets based on UAV remote sensing images have rich scale variation, which brings both challenges and opportunities for improving the performance of ReID models. To extract highly discriminative features with scale invariance, we introduce a full-scale convolutional structure. The structure consists of multiple convolutional streams with receptive fields of different sizes, each attending to features of a different scale, thereby producing multi-scale feature maps. These feature maps are then dynamically fused through channel-wise weights generated by a unified aggregation gate to obtain full-scale features. Scale-invariant vehicle features can thus be captured through the full-scale convolutional structure.
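The sketch below illustrates this design in the spirit of OSNet [18]: parallel streams with growing receptive fields are fused by a shared, channel-wise aggregation gate. Stream depths and channel sizes here are illustrative assumptions rather than the exact configuration used in GASNet.

```python
import torch.nn as nn


class FullScaleBlockSketch(nn.Module):
    """Simplified full-scale block: stream t stacks t 3x3 convolutions,
    giving receptive fields of 3, 5, 7, 9; a unified aggregation gate
    produces channel-wise weights that dynamically fuse the streams."""

    def __init__(self, in_channels, out_channels, num_streams=4):
        super().__init__()

        def conv3(c):
            return nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False),
                                 nn.BatchNorm2d(c), nn.ReLU(inplace=True))

        self.reduce = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.streams = nn.ModuleList(
            nn.Sequential(*[conv3(out_channels) for _ in range(t)])
            for t in range(1, num_streams + 1)
        )
        self.gate = nn.Sequential(                # unified aggregation gate, shared by all streams
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = self.reduce(x)
        out = 0
        for stream in self.streams:
            y = stream(x)
            out = out + self.gate(y) * y          # dynamic, channel-wise fusion
        return out
```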

4. Results

To verify the performance of the proposed GASNet model, we implement the network with the PyTorch framework and train and test it on a Tesla A100 GPU. The learning rate is set to 0.00035, the Adam optimizer is used, the network is constrained by a triplet loss function and a cross-entropy loss function, and all experiments are trained for 60 epochs.
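A minimal training loop matching these settings is sketched below; the batch-hard triplet mining, the margin value, and the assumption that the model returns a (feature, logits) pair are ours, not a description of the authors' code.

```python
import torch
import torch.nn as nn


def batch_hard_triplet_loss(feats, labels, margin=0.3):
    """Compact batch-hard triplet loss: for each anchor, take the farthest
    positive and the closest negative within the mini-batch (margin assumed)."""
    dist = torch.cdist(feats, feats)                               # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos = (dist * same.float()).max(dim=1).values                  # hardest positive per anchor
    neg = dist.masked_fill(same, float("inf")).min(dim=1).values   # hardest negative per anchor
    return torch.relu(pos - neg + margin).mean()


def train(model, train_loader, epochs=60, device="cuda"):
    """Illustrative loop matching the stated settings: Adam, lr = 0.00035,
    cross-entropy (ID) loss plus triplet loss, 60 epochs.
    `model` is assumed to return (features, logits) for a batch of images."""
    model = model.to(device)
    id_loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4)
    for epoch in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            feats, logits = model(images)
            loss = id_loss_fn(logits, labels) + batch_hard_triplet_loss(feats, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```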

4.1. Evaluation Indicators

The main metrics for evaluating a vehicle ReID model include mean Average Precision (mAP), the Cumulative Matching Characteristic (CMC) curve, and the Rank-N table. Consider a query q provided to a ReID model, which retrieves a sequence of n images from the gallery, where exactly $N_q$ images match the query. Generally, the higher a matched image ranks in the retrieval sequence, the better the retrieval precision. Specifically, if all $N_q$ matched images occupy the top $N_q$ positions of the retrieval sequence, the total retrieval precision of query q reaches its highest value. Let $G(k)$ denote whether a matched image is at the k-th position of the retrieved sequence, and let $P(k)$ be the precision among the top-k retrieved images. Then $P(k)G(k)$ represents the retrieval precision at the k-th position, and $\sum_{k=1}^{n} P(k)G(k)$ reflects the total retrieval precision associated with query q. Therefore, the AP of query q is defined as the total retrieval precision divided by $N_q$, as shown in Formula (1), and mAP is the average of the APs over all M queries, which assesses the overall performance of a ReID model.
$$AP_q = \frac{\sum_{k=1}^{n} P(k) \times G(k)}{N_q}, \qquad mAP = \frac{\sum_{i=1}^{M} AP_{q_i}}{M} \quad (1)$$
In contrast to AP, which reflects the retrieval precision from the perspective of matching queries, an alternative metric (called CMC@k) intends to characterize the retrieval precision from the perspective of matching positions, as shown in Formula (2).
$$CMC@k = \frac{\sum_{i=1}^{M} F(q_i, k)}{M} \quad (2)$$
In Formula (2), $F(q_i, k)$ indicates whether query $q_i$'s matched images appear among the top k images of the retrieved sequence. Thus, CMC@k measures the average retrieval success of all queries within the top k positions of the retrieved sequence. As a single-value metric, mAP can only reflect the performance of a ReID model as a whole, while CMC@k, as a multi-value metric over varying k, can reflect a ReID model's precision distribution in terms of matching positions.
If most of the CMC@k values of two ReID models are almost the same, it is relatively difficult to compare the two models. Therefore, instead of comparing all CMC@k values, one can select only a few representative CMC@k values for comparison, which is also known as the Rank-N table. Rank-1 and Rank-5 are the most common values, indicating the probability of a matched image appearing in the top 1 and top 5 of the retrieved sequence, respectively. Similar to most ReID works, this paper adopts mAP and Rank-N as the evaluation metrics.
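For concreteness, the per-query AP of Formula (1) and the per-query indicator behind Formula (2) can be computed from a ranked 0/1 match list as in the sketch below (a NumPy illustration with function names of our choosing).

```python
import numpy as np


def ap_and_cmc(ranked_match_flags, max_rank=5):
    """ranked_match_flags: 1/0 sequence over the retrieved images for one query,
    ordered by similarity; 1 marks a gallery image of the queried vehicle.
    Returns (AP, CMC indicator up to max_rank) following Formulas (1) and (2)."""
    flags = np.asarray(ranked_match_flags, dtype=float)
    n_matches = flags.sum()                              # N_q
    if n_matches == 0:
        return 0.0, np.zeros(max_rank)
    positions = np.arange(1, len(flags) + 1)
    precision_at_k = np.cumsum(flags) / positions        # P(k)
    ap = float((precision_at_k * flags).sum() / n_matches)   # Formula (1)
    cmc = np.zeros(max_rank)
    first_hit = int(np.argmax(flags))                    # rank of the first match (0-based)
    cmc[first_hit:] = 1.0                                # query counted as a hit from that rank on
    return ap, cmc

# mAP = mean of AP over all queries; CMC@k = mean of cmc[k-1] over all queries (Formula (2)).
```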

4.2. Ablation Experiments

To verify that the global attention module (abbreviated as GA) and the full-scale module (abbreviated as FS) can improve the performance of the model, we conduct ablation experiments while keeping the experimental conditions unchanged. First, we construct the baseline model (abbreviated as baseline) by removing both GA and FS. Then, two comparison models (abbreviated as baseline+GA and baseline+FS) are constructed by adding GA and FS to the baseline, respectively. Finally, the performance of the baseline model, the comparison models, and the GASNet model is compared. As the batch size can affect the experimental results [21], we test four batch sizes: 32, 64, 96, and 128.

4.2.1. The Ablation Experiment for GA

To verify the performance enhancement brought by the GA module, this section designs ablation experiments for the GA module. Figure 9 shows the corresponding experimental results on the small test set of the VRU dataset. It can be seen from Figure 9 that baseline+GA outperforms the baseline model in terms of Rank-1, Rank-5, and mAP across all batch sizes. This illustrates the benefit of the GA module. More specifically, the performance gap is largest at small batch sizes. This implies that the GA module makes the baseline rely less on the diverse information provided by large batches, as the model can learn highly discriminative features through the integration of global information even with small batches. Meanwhile, we also conducted experiments on the medium and large test sets of the VRU dataset. The experimental results are enumerated in Table A1, Table A2, Table A3 and Table A4 (in Appendix A.1) and show trends similar to those on the small test set.

4.2.2. The Ablation Experiments for FS

Figure 10 shows the experimental results of the FS module on the small test set of the VRU dataset, from which it can be seen that baseline+FS outperforms the baseline model in terms of Rank-1 and mAP across all batch sizes. In addition, we also conducted experiments on the medium and large test sets of the VRU dataset; the experimental results are shown in Table A5, Table A6, Table A7 and Table A8 in Appendix A.1 and show trends similar to those on the small test set.
The above experiments effectively verify the performance improvement from the FS module, especially when the GPU cannot support experiments with a large batch size. It can also be inferred that increasing the batch size, up to a point, greatly helps improve model precision.

4.3. The Performance Comparison Experiment

To demonstrate the performance of the GASNet model, this section compares it with MGN and SCAN on the VRU dataset. MGN extracts the global and local features of a vehicle through a backbone network and two branch networks, respectively, and finally fuses the two types of features to improve a ReID model's precision. SCAN uses both a channel-attention mechanism and a spatial-attention mechanism to optimize a ReID model's weights and forces the model to focus on highly discriminative regions to improve the model's performance.
Based on the results of the two previous ablation experiments, we set the batch size as 128 for better performance. The results of the comparison experiments are shown in Table 2.
It can be observed from Table 2 that GASNet outperforms both MGN and SCAN on all three test sets. Compared to the models that only add FS or GA alone, GASNet performs optimally on three test sets generated from the VRU dataset. The experimental results effectively verify the superiority of the GASNet model.

5. Discussion

As mentioned in Section 3, compared with the existing vehicle ReID datasets collected through the fixed surveillance cameras, our VRU dataset can provide vehicle images with multiple viewpoints, multiple scales, multiple scenes, multiple time periods, and multiple weather conditions. This data diversity not only provides challenges for most of the existing vehicle ReID models but also provides opportunities to improve the generalization performance of the existing models. As illustrated in Table 1, although there are two UAV-based ReID datasets, only our VRU dataset is fully open-source. Therefore, our VRU dataset can provide a lot of research opportunities for the vehicle ReID research community. For example, although the proposed GASNet model has explicitly utilized the multi-view and multi-scale characteristics of the VRU dataset, future works can further explicitly consider the multi-scene, multi-time, and multi-weather characteristics to design novel vehicle ReID models. Moreover, the future ReID models can regard our VRU dataset as a benchmark dataset to evaluate their performance because our VRU dataset consists of vehicle images collected from various scenarios reflecting various real-world scenes. Our experiment study on the VRU dataset has illustrated that the existing vehicle ReID models for the fixed surveillance cameras perform relatively poorly, while our GASNet model can significantly outperform these models because our model explicitly considers the multi-view and multi-scale characteristics of the VRU dataset.
It has to be admitted that although UAVs have better mobility and flexibility than fixed surveillance systems, they suffer from limited battery capacity, so a UAV has a rather limited working time. In comparison, a fixed system can monitor a fixed area continuously; thus, a UAV-based system is unsuitable for regular monitoring. The benefits of UAVs are that they can work in uncovered areas (monitoring dead spots) and can also be used to track target vehicles once they have been identified. In addition, the short continuous working duration of UAVs can be compensated for by increasing the number of UAVs. Since a typical UAV can only monitor a road section for about half an hour and takes one and a half hours to fully recharge, four UAVs are enough to monitor a road section in rotation. Therefore, to implement a practical UAV-based vehicle ReID system, we offer the following suggestions to address the short working time of UAVs: (1) multiple UAVs should be prepared to take duty in turn to cover an area; (2) to improve the accuracy of the vehicle ReID task, multiple UAVs should shoot from different directions at the same time to provide multi-view vehicle images; (3) to provide information at different scales, multiple UAVs should shoot from different heights simultaneously; (4) for tasks with strict processing-time requirements, a UAV with real-time data transmission capability is required.

6. Conclusions

This paper studies the task of vehicle ReID based on UAV remote sensing images. To provide vehicle data with rich viewing angles and a wide range of scale variations, this paper proposes a large-scale open-source vehicle ReID dataset consisting of UAV remote sensing images, named VRU. The dataset covers multiple backgrounds, illumination conditions, weather conditions, viewpoints, and scales, which is expected to promote research on vehicle ReID based on UAV remote sensing images. To address the multi-viewpoint and multi-scale challenges of vehicle ReID based on UAV remote sensing images, this paper proposes a novel vehicle ReID method based on relation-aware global attention and a full-scale mechanism. The relation-aware global attention module forces the network to pay more attention to highly discriminative features, and the full-scale module fuses features of different scales to obtain full-scale vehicle features, thereby improving the discriminative performance of the model. In the future, we will explicitly consider the multi-weather, multi-illumination, and multi-background characteristics of the VRU dataset to design novel vehicle ReID models. Furthermore, we will try to take advantage of both fixed surveillance cameras and UAVs by integrating these two types of data for the vehicle ReID task.

Author Contributions

Conceptualization, M.L. and H.L.; methodology, H.L.; software, Y.X.; validation, M.L., H.L. and Y.X.; formal analysis, Y.X. and M.L.; investigation, Y.X. and M.L.; resources, H.L.; data curation, H.L.; writing—original draft preparation, Y.X. and H.L.; writing—review and editing, H.L. and M.L.; visualization, Y.X.; supervision, H.L.; project administration, H.L.; funding acquisition, M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (No. U20A20182 and No. 42271481).

Data Availability Statement

The VRU dataset is hosted on GitHub: https://github.com/GeoX-Lab/ReID (accessed on 12 July 2022).

Acknowledgments

The authors thank the anonymous reviewers and the editors for their valuable comments to improve our manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1

To verify the performance of GASNet on VRU, we conducted numerous comparative experiments, the results of which are enumerated below.
Table A1. Test results of the baseline model and the global attention model with input batch 32 on the VRU dataset.
Models | VRU | Batch size | Rank-1 (%) | Rank-5 (%) | mAP (%)
Baseline | Small | 32 | 88.25 | 98.95 | 92.98
Baseline | Medium | 32 | 83.17 | 97.44 | 89.41
Baseline | Big | 32 | 70.23 | 92.04 | 79.77
Baseline+GA | Small | 32 | 95.24 | 99.65 | 97.28
Baseline+GA | Medium | 32 | 92.84 | 99.13 | 95.68
Baseline+GA | Big | 32 | 86.00 | 97.45 | 91.04
Table A2. Test results of the baseline model and the global attention model with input batch 64 on the VRU dataset.
Models | VRU | Batch size | Rank-1 (%) | Rank-5 (%) | mAP (%)
Baseline | Small | 64 | 94.24 | 99.70 | 96.78
Baseline | Medium | 64 | 90.56 | 99.04 | 94.34
Baseline | Big | 64 | 82.78 | 96.91 | 89.00
Baseline+GA | Small | 64 | 96.19 | 99.63 | 97.61
Baseline+GA | Medium | 64 | 94.28 | 99.25 | 96.59
Baseline+GA | Big | 64 | 88.32 | 98.07 | 92.63
Table A3. Test results of the baseline model and the global attention model with input batch 96 on the VRU dataset.
Models | VRU | Batch size | Rank-1 (%) | Rank-5 (%) | mAP (%)
Baseline | Small | 96 | 95.38 | 99.60 | 97.33
Baseline | Medium | 96 | 92.86 | 99.23 | 95.77
Baseline | Big | 96 | 85.10 | 97.78 | 90.70
Baseline+GA | Small | 96 | 96.40 | 99.76 | 97.95
Baseline+GA | Medium | 96 | 94.92 | 99.24 | 96.92
Baseline+GA | Big | 96 | 88.99 | 98.26 | 93.11
Table A4. Test results of the baseline model and the global attention model with input batch 128 on the VRU dataset.
Models | VRU | Batch size | Rank-1 (%) | Rank-5 (%) | mAP (%)
Baseline | Small | 128 | 96.08 | 99.64 | 97.74
Baseline | Medium | 128 | 93.33 | 99.25 | 96.02
Baseline | Big | 128 | 86.86 | 98.05 | 91.85
Baseline+GA | Small | 128 | 96.93 | 99.63 | 98.20
Baseline+GA | Medium | 128 | 94.62 | 99.37 | 96.79
Baseline+GA | Big | 128 | 88.97 | 98.19 | 93.09
Table A5. Test results of the baseline model and the full-scale model with input batch 32 on the VRU dataset.
Models | VRU | Batch size | Rank-1 (%) | Rank-5 (%) | mAP (%)
Baseline | Small | 32 | 88.25 | 98.95 | 92.98
Baseline | Medium | 32 | 83.17 | 97.44 | 89.41
Baseline | Big | 32 | 70.23 | 92.04 | 79.77
Baseline+FS | Small | 32 | 90.55 | 99.22 | 96.17
Baseline+FS | Medium | 32 | 90.33 | 98.45 | 93.98
Baseline+FS | Big | 32 | 82.24 | 95.90 | 88.21
Table A6. Test results of the baseline model and the full-scale model with input batch 64 on the VRU dataset.
Models | VRU | Batch size | Rank-1 (%) | Rank-5 (%) | mAP (%)
Baseline | Small | 64 | 94.24 | 99.70 | 96.78
Baseline | Medium | 64 | 90.56 | 99.04 | 94.34
Baseline | Big | 64 | 82.78 | 96.91 | 89.00
Baseline+FS | Small | 64 | 95.87 | 99.61 | 97.61
Baseline+FS | Medium | 64 | 93.70 | 99.22 | 96.18
Baseline+FS | Big | 64 | 87.91 | 97.58 | 92.21
Table A7. Test results of the baseline model and the full-scale model with input batch 96 on the VRU dataset.
Models | VRU | Batch size | Rank-1 (%) | Rank-5 (%) | mAP (%)
Baseline | Small | 96 | 95.38 | 99.60 | 97.33
Baseline | Medium | 96 | 92.86 | 99.23 | 95.77
Baseline | Big | 96 | 85.10 | 97.78 | 90.70
Baseline+FS | Small | 96 | 96.55 | 99.50 | 97.95
Baseline+FS | Medium | 96 | 94.39 | 99.11 | 96.53
Baseline+FS | Big | 96 | 88.84 | 98.01 | 92.92
Table A8. Test results of the baseline model and the full-scale model with input batch 128 on the VRU dataset.
Models | VRU | Batch size | Rank-1 (%) | Rank-5 (%) | mAP (%)
Baseline | Small | 128 | 96.08 | 99.64 | 97.74
Baseline | Medium | 128 | 93.33 | 99.25 | 96.02
Baseline | Big | 128 | 86.86 | 98.05 | 91.85
Baseline+FS | Small | 128 | 96.43 | 99.66 | 97.89
Baseline+FS | Medium | 128 | 94.76 | 99.10 | 96.76
Baseline+FS | Big | 128 | 89.38 | 98.07 | 93.27

Appendix A.2

To construct the vehicle ReID dataset from the UAV perspective, we used five DJI ‘Mavic 2 Pro’ UAVs for data collection. The configuration of the ‘DJI Mavic 2 Pro’ and its attached camera are enumerated in Table A9 and Table A10, respectively.
Table A9. The configuration of the ‘DJI Mavic 2 Pro’ (aircraft).
Parameters | Value
Takeoff Weight | 907 g
Dimensions | Folded: 214 × 91 × 84 mm; Unfolded: 322 × 242 × 84 mm
Diagonal Distance | 354 mm
Max Ascent Speed | 5 m/s (S-mode), 4 m/s (P-mode)
Max Descent Speed | 3 m/s (S-mode), 3 m/s (P-mode)
Max Speed | 72 km/h (S-mode, near sea level, no wind)
Max Service Ceiling Above Sea Level | 6000 m
Max Flight Time | 31 min (at a consistent 25 kph, no wind)
Overall Flight Time | 25 min (in normal flight, 15% remaining battery level)
Max Flight Distance | 18 km (at a consistent 50 kph, no wind)
Hovering Accuracy Range | Vertical: ±0.1 m (vision positioning), ±0.5 m (GPS positioning); Horizontal: ±0.3 m (vision positioning), ±1.5 m (GPS positioning)
All data are from the DJI official website.
Table A10. The configuration of the attached camera.
Parameters | Value
Sensor | 1″ CMOS; Effective Pixels: 20 million
Lens | FOV: approx. 77°; 35 mm Format Equivalent: 28 mm; Aperture: f/2.8–f/11; Shooting Range: 1 m to ∞
ISO Range | Video: 100–6400; Photo: 100–3200 (auto), 100–12,800 (manual)
Shutter Speed | Electronic Shutter: 8–1/8000 s
Still Image Size | 5472 × 3648
Still Photography Modes | Single shot; Burst shooting: 3/5 frames; Auto Exposure Bracketing (AEB): 3/5 bracketed frames at 0.7 EV bias; Interval: 2/3/5/7/10/15/20/30/60 s (JPEG), 5/7/10/15/20/30/60 s (RAW)
Video Resolution | 4K: 3840 × 2160 24/25/30 p; 2.7K: 2688 × 1512 24/25/30/48/50/60 p; FHD: 1920 × 1080 24/25/30/48/50/60/120 p
Color Mode | Dlog-M (10-bit); supports HDR video (HLG 10-bit)
Max Video Bitrate | 100 Mbps
Photo Format | JPEG/DNG (RAW)
Video Format | MP4/MOV
All data are from the DJI official website.

References

  1. Zhang, S.; Shao, H.; Li, X.; Xian, W.; Shao, Q.; Yin, Z.; Lai, F.; Qi, J. Spatiotemporal Dynamics of Ecological Security Pattern of Urban Agglomerations in Yangtze River Delta Based on LUCC Simulation. Remote Sens. 2022, 14, 296. [Google Scholar] [CrossRef]
  2. Wang, T.; Zhang, Y.; Zhang, Y.; Zhang, Z.; Xiao, X.; Yu, Y.; Wang, L. A Spliced Satellite Optical Camera Geometric Calibration Method Based on Inter-Chip Geometry Constraints. Remote Sens. 2021, 13, 2832. [Google Scholar] [CrossRef]
  3. Valera, M.; Velastin, S. Intelligent distributed surveillance systems: A review. IEEE Proc.-Vis. Image Signal Process. 2005, 152, 192–204. [Google Scholar] [CrossRef]
  4. Zhang, J.; Wang, F.Y.; Wang, K.; Lin, W.H.; Xu, X.; Chen, C. Data-Driven Intelligent Transportation Systems: A Survey. IEEE Trans. Intell. Transp. Syst. 2011, 12, 1624–1639. [Google Scholar] [CrossRef]
  5. Zheng, Y.; Capra, L.; Wolfson, O.; Yang, H. Urban Computing: Concepts, Methodologies, and Applications. ACM Trans. Intell. Syst. Technol. 2014, 5, 1–55. [Google Scholar] [CrossRef]
  6. Wang, P.; Jiao, B.; Yang, L.; Yang, Y.; Zhang, S.; Wei, W.; Zhang, Y. Vehicle Re-identification in Aerial Imagery: Dataset and Approach. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 460–469. [Google Scholar] [CrossRef]
  7. Teng, S.; Zhang, S.; Huang, Q.; Sebe, N. Viewpoint and Scale Consistency Reinforcement for UAV Vehicle Re-Identification. Int. J. Comput. Vis. 2021, 129, 719–735. [Google Scholar] [CrossRef]
  8. Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning Discriminative Features with Multiple Granularities for Person Re-Identification. In Proceedings of the 26th ACM Multimedia Conference (MM), Seoul, Korea, 22–26 October 2018; pp. 274–282. [Google Scholar] [CrossRef]
  9. Teng, S.; Liu, X.; Zhang, S.; Huang, Q. SCAN: Spatial and Channel Attention Network for Vehicle Re-Identification. In Proceedings of the 19th Pacific-Rim Conference on Multimedia (PCM), Hefei, China, 21–22 September 2018; pp. 350–361. [Google Scholar] [CrossRef]
  10. Yan, K.; Tian, Y.; Wang, Y.; Zeng, W.; Huang, T. Exploiting Multi-Grain Ranking Constraints for Precisely Searching Visually-Similar Vehicles. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 562–570. [Google Scholar] [CrossRef]
  11. Liu, X.; Liu, W.; Ma, H.; Fu, H. Large-scale vehicle re-identification in urban surveillance videos. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 July 2016; pp. 1–6. [Google Scholar] [CrossRef]
  12. Shen, Y.; Xiao, T.; Li, H.; Yi, S.; Wang, X. Learning Deep Neural Networks for Vehicle Re-ID with Visual-spatio-temporal Path Proposals. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1918–1927. [Google Scholar] [CrossRef]
  13. Liu, X.; Liu, W.; Mei, T.; Ma, H. A Deep Learning-Based Approach to Progressive Vehicle Re-identification for Urban Surveillance. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 869–884. [Google Scholar] [CrossRef]
  14. Liu, H.; Tian, Y.; Wang, Y.; Pang, L.; Huang, T. Deep Relative Distance Learning: Tell the Difference Between Similar Vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2167–2175. [Google Scholar] [CrossRef]
  15. Lou, Y.; Bai, Y.; Liu, J.; Wang, S.; Duan, L.Y. VERI-Wild: A Large Dataset and a New Method for Vehicle Re-Identification in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3230–3238. [Google Scholar] [CrossRef]
  16. Tan, W.; Yan, B.; Bare, B. Feature Super-Resolution: Make Machine See More Clearly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3994–4002. [Google Scholar] [CrossRef]
  17. Zhou, Y.; Shao, L. Cross-view GAN based vehicle generation for re-identification. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 4–7 September 2017; pp. 1–12. [Google Scholar] [CrossRef]
  18. Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-Scale Feature Learning for Person Re-Identification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 3701–3711. [Google Scholar] [CrossRef]
  19. Chen, Y.; Zhu, X.; Gong, S. Person Re-Identification by Deep Learning Multi-Scale Representations. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2590–2600. [Google Scholar] [CrossRef]
  20. Wang, Z.; Tang, L.; Liu, X.; Yao, Z.; Yi, S.; Shao, J.; Yan, J.; Wang, S.; Li, H.; Wang, X. Orientation Invariant Feature Embedding and Spatial Temporal Regularization for Vehicle Re-identification. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 379–387. [Google Scholar] [CrossRef]
  21. Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of Tricks and A Strong Baseline for Deep Person Re-identification. In Proceedings of the 32nd IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–20 June 2019; pp. 1487–1495. [Google Scholar] [CrossRef]
  22. Kim, K.T.; Choi, J.Y. Deep Neural Networks Learning based on Multiple Loss Functions for Both Person and Vehicles Re-Identification. J. Korea Multimed. Soc. 2020, 23, 891–902. [Google Scholar] [CrossRef]
  23. Chen, H.; Lagadec, B.; Bremond, F. Partition and reunion: A two-branch neural network for vehicle re-identification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–20 June 2019; pp. 184–192. [Google Scholar]
  24. Franco, A.O.R.; Soares, F.F.; Neto, A.V.L.; de Macedo, J.A.F.; Rego, P.A.L.; Gomes, F.A.C.; Maia, J.G.R. Vehicle Re-Identification by Deep Feature Embedding and Approximate Nearest Neighbors. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), New York, NY, USA, 19–24 July 2020; pp. 1–8. [Google Scholar]
  25. Zhang, Z.; Lan, C.; Zeng, W.; Jin, X.; Chen, Z. Relation-Aware Global Attention for Person Re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 3183–3192. [Google Scholar] [CrossRef]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
Figure 1. The comparison of vehicle images by surveillance cameras and UAVs.
Figure 2. Images in VRU with different viewpoints and scales.
Figure 3. Images in VRU with different scenes, time, and weather.
Figure 4. The comparison of raw images and pictures in the VRU dataset.
Figure 5. The distribution of vehicle scales.
Figure 6. The distribution of vehicle views.
Figure 7. The distribution of image quantity per vehicle.
Figure 8. The overall structure of GASNet.
Figure 9. The results of the ablation experiment for GA.
Figure 10. The results of the ablation experiment for FS.
Table 1. Comparison of datasets.
Datasets | VRU | UAV-VeID | VRAI
Identities | 15,085 | 4601 | 13,022
Images | 172,137 | 41,917 | 137,613
Multi-view | ✓ | ✓ | ✓
Multi-scale | ✓ | ✓ | ✓
Weather | ✓ | × | ×
Lighting | ✓ | × | ×
Open-source | ✓ | × | ×
Table 2. The performance comparison with other ReID models.
Models | VRU | Rank-1 (%) | Rank-5 (%) | mAP (%)
MGN | Small | 81.72 | 95.08 | 82.48
MGN | Medium | 78.75 | 93.75 | 80.06
MGN | Big | 66.25 | 87.15 | 71.53
SCAN | Small | 75.22 | 95.03 | 83.95
SCAN | Medium | 67.27 | 90.51 | 77.34
SCAN | Big | 52.44 | 79.63 | 64.51
Baseline | Small | 96.08 | 99.64 | 97.74
Baseline | Medium | 93.33 | 99.25 | 96.02
Baseline | Big | 86.86 | 98.05 | 91.85
Baseline+GA | Small | 96.93 | 99.63 | 98.20
Baseline+GA | Medium | 94.62 | 99.37 | 96.79
Baseline+GA | Big | 88.97 | 98.19 | 93.09
Baseline+FS | Small | 96.43 | 99.66 | 97.89
Baseline+FS | Medium | 94.76 | 99.10 | 96.76
Baseline+FS | Big | 89.38 | 98.07 | 93.27
GASNet | Small | 97.45 | 99.66 | 98.51
GASNet | Medium | 95.59 | 99.33 | 97.31
GASNet | Big | 90.29 | 98.40 | 93.93

