V2ReID: Vision-Outlooker-Based Vehicle Re-Identification
Abstract
1. Introduction
- We applied, for the first time, the VOLO [10] architecture to vehicle re-identification and showed that attending to neighbouring pixels can enhance ReID performance.
- We evaluated our method on a large-scale ReID benchmark dataset and obtained state-of-the-art results using only visual cues.
- We provide an understandable and thorough guide on how to train your model by comparing different experiments using various hyperparameters.
2. Related Work
2.1. A Brief History of Re-Identification
2.2. Attention Mechanism in Re-Identification
“In its most generic form, attention could be described as merely an overall level of alertness or ability to engage with surroundings.”[16]
Method | Year | Model | VeRi-776 mAP (%) | VeRi-776 Rank-1 (%) | Vehicle-ID S (mAP/R-1 %) | Vehicle-ID M (mAP/R-1 %) | Vehicle-ID L (mAP/R-1 %)
---|---|---|---|---|---|---|---
LF | 2017 | OIFE [7] | 48.00 | 89.43 | - | - | 67.00/82.90
LF | 2018 | RAM [8] | 61.50 | 88.60 | 75.20/91.50 | 72.30/87.00 | 67.70/84.50
LF | 2019 | PRN + RR [9] | 74.30 | 94.30 | 78.40/92.30 | 75.00/88.30 | 74.20/86.40
ML | 2017 | Siamese-CNN + PathLSTM [26] | 58.27 | 83.49 | - | - | -
ML | 2017 | PROVID [27] | 53.42 | 81.56 | - | - | -
ML | 2017 | NuFACT [27] | 48.47 | 76.76 | 48.90/69.51 | 43.64/65.34 | 38.63/60.72
ML | 2018 | JFSDL [28] | 53.53 | 82.90 | 54.80/85.29 | 48.29/78.79 | 41.29/70.63
ML | 2019 | VANet [29] | 66.34 | 89.78 | 88.12/97.29 | 83.17/95.14 | 80.35/92.97
ML | 2020 | MidTriNet + UT [30] | - | 89.15 | 91.70/97.70 | 90.10/96.40 | 86.10/94.80
AM | 2018 | RNN-HA [31] | 56.80 | 74.79 | - | - | -
AM | 2018 | RNN-HA (ResNet + 672) [31] | - | - | 83.8/88.1 | 81.9/87.0 | 81.1/87.4
AM | 2019 | AAVER [18] | 61.18 | 88.97 | 74.69/93.82 | 68.62/89.95 | 63.54/85.64
AM | 2020 | SPAN w/ CPDM [32] | 68.90 | 94.00 | - | - | -
UL | 2017 | XVGAN [33] | 24.65 | 60.20 | 52.89/80.84 | - | -
UL | 2018 | GAN + LSRO + re-ranking [34] | 64.78 | 88.62 | 86.50/87.38 | 83.44/86.88 | 81.25/84.63
UL | 2019 | SSL + re-ranking [35] | 69.90 | 89.69 | 88.67/91.92 | 88.13/91.81 | 86.67/90.83
3. Proposed V2ReID
3.1. Rise of the Transformers
1. Encoder–decoder: This refers to the original Transformer structure and is typically used in neural machine translation (sequence-to-sequence modelling).
2. Encoder-only: The outputs of the encoder are used as a representation of the input. This structure is usually used for classification or sequence labelling problems.
3. Decoder-only: Here, the cross-attention module is removed. Typically, this structure is used for sequence generation, such as language modelling.
3.2. Transformer in Vision
3.2.1. Reshaping and Preparing the Input
Algorithm 1: PyTorch-style command for non-overlapping vs. overlapping patches.

```python
# non-overlapping patches
self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

# overlapping patches
self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=stride_size)
```
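As a quick sanity check of the two settings, the snippet below counts the patch grid produced by each projection; the 224 × 224 input size, the embedding dimension of 384 and the stride of 12 are illustration values, not the settings used in the paper.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)                              # dummy RGB image
non_overlap = nn.Conv2d(3, 384, kernel_size=16, stride=16)   # stride = patch size
overlap = nn.Conv2d(3, 384, kernel_size=16, stride=12)       # stride < patch size

print(non_overlap(x).shape[-2:])  # torch.Size([14, 14]) -> 196 non-overlapping patches
print(overlap(x).shape[-2:])      # torch.Size([18, 18]) -> 324 overlapping patches
```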
3.2.2. Self-Attention
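As a reminder of the mechanism this subsection builds on, below is a minimal single-head sketch of scaled dot-product self-attention as defined in [39]; the function and argument names are ours.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (B, N, C) token sequence; w_q, w_k, w_v: (C, d) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # queries, keys, values
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    attn = scores.softmax(dim=-1)                      # each token attends to all N tokens
    return attn @ v                                    # (B, N, d) attended features
```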
3.2.3. Transformer Encoder Block
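For reference, a pre-norm Transformer encoder block can be sketched as follows: multi-head self-attention and an MLP, each preceded by layer normalization [55] and wrapped in a residual connection. This is the generic block, not a transcript of the paper's implementation.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                                    # x: (B, N, C)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # self-attention + residual
        return x + self.mlp(self.norm2(x))                   # MLP + residual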
3.2.4. Data-Hungry Architecture
3.2.5. Combining Transformers and CNNs in Vision
3.3. Vision Outlooker
3.3.1. Outlook Attention
Outlook attention uses two simple linear layers to project each spatial token:
- into outlook weights $\mathbf{A} \in \mathbb{R}^{H \times W \times K^4}$;
- into value representations $\mathbf{V} \in \mathbb{R}^{H \times W \times C}$.
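The sketch below is a single-head, stride-1 rendering of outlook attention following the description in the VOLO paper [10]; it omits the multi-head split, attention scaling, and the downsampled variant used in practice, and all module names are ours.

```python
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.K = kernel_size
        self.v_proj = nn.Linear(dim, dim)                   # value representations V
        self.attn_proj = nn.Linear(dim, kernel_size ** 4)   # outlook weights A
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, H, W, C)
        B, H, W, C = x.shape
        K = self.K
        # values, unfolded into K x K local windows around every location
        v = self.v_proj(x).permute(0, 3, 1, 2)              # (B, C, H, W)
        v = F.unfold(v, K, padding=K // 2, stride=1)        # (B, C*K*K, H*W)
        v = v.reshape(B, C, K * K, H * W).permute(0, 3, 2, 1)   # (B, H*W, K*K, C)
        # outlook weights come directly from the centre token: no query-key product
        attn = self.attn_proj(x).reshape(B, H * W, K * K, K * K).softmax(dim=-1)
        out = attn @ v                                       # weighted local aggregation
        out = out.permute(0, 3, 2, 1).reshape(B, C * K * K, H * W)
        out = F.fold(out, (H, W), K, padding=K // 2, stride=1)  # accumulate overlaps
        return self.out_proj(out.permute(0, 2, 3, 1))        # back to (B, H, W, C)
```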
3.3.2. Class Attention
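A class-attention layer in the spirit of CaiT [71] can be sketched as below: only the [cls] token forms queries, so it aggregates information from the patch tokens without updating them. The head count and residual placement are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ClassAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                  # x: (B, 1 + N, C), [cls] token first
        cls_tok = x[:, :1]                 # queries: the [cls] token only
        out, _ = self.attn(cls_tok, x, x)  # keys/values: [cls] + patch tokens
        return torch.cat([cls_tok + out, x[:, 1:]], dim=1)
```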
3.4. Transformers in Vehicle Re-Identification
- Multi-head attention modules are able to capture long-range dependencies and push the models to capture more discriminative parts compared to CNN-based methods;
- Transformers are able to preserve detailed and discriminative information because they do not use convolution and downsampling operators.
3.5. Designing Your Loss Function
3.5.1. Classification Loss
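For reference, the identity (classification) loss is the standard softmax cross-entropy over vehicle IDs; with classifier logits $z_j$, ground-truth identity $y$ and $N$ identities, it reads

$$
\mathcal{L}_{\mathrm{ID}} = -\sum_{i=1}^{N} q_i \log p_i, \qquad
p_i = \frac{\exp(z_i)}{\sum_{j=1}^{N}\exp(z_j)},
$$

where $q_i = 1$ if $i = y$ and $q_i = 0$ otherwise (one-hot targets; see Section 3.6.2 for the smoothed variant).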
3.5.2. Metric Loss
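As one concrete metric loss, the sketch below implements the batch-hard triplet loss of Hermans et al. [91]: for every anchor, the hardest positive and hardest negative within the mini-batch are selected. The function name and the margin value are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(feats, labels, margin=0.3):
    """feats: (B, D) embeddings; labels: (B,) vehicle IDs."""
    dist = torch.cdist(feats, feats)                        # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)       # (B, B) same-identity mask
    hardest_pos = (dist * same.float()).max(dim=1).values   # furthest same-ID sample
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values  # closest other-ID
    return F.relu(hardest_pos - hardest_neg + margin).mean()
```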
3.5.3. Combining Classification and Metric Loss
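A minimal way of combining the two objectives, reusing the triplet helper sketched above, is shown below; the equal weighting of the two terms is an assumption of this sketch.

```python
import torch.nn as nn

id_criterion = nn.CrossEntropyLoss()   # ID loss on the classifier logits

def total_loss(logits, global_feat, targets):
    # classification loss on the logits, metric (triplet) loss on the raw global feature
    return id_criterion(logits, targets) + batch_hard_triplet_loss(global_feat, targets)
```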
3.6. Techniques to Improve Your Re-Identification Model
3.6.1. Batch Normalization Neck
Algorithm 2: PyTorch-style command for BNNeck.

```python
# x = output of the network (global feature)
global_feat = x
if neck == 'no':
    feat = global_feat
else:
    # BNNeck: self.bottleneck = nn.BatchNorm1d(embed_dim), defined in __init__
    feat = self.bottleneck(global_feat)
# self.classifier = nn.Linear(embed_dim, num_classes), defined in __init__
x_cls = self.classifier(feat)
# return: x_cls for the ID loss, global_feat for the triplet loss
return x_cls, global_feat
```
3.6.2. Label Smoothing
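For completeness, label smoothing as proposed by Szegedy et al. [99] replaces the one-hot targets $q_i$ of the ID loss with

$$
q_i =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{N}, & i = y,\\[4pt]
\dfrac{\varepsilon}{N}, & i \neq y,
\end{cases}
$$

where $N$ is the number of vehicle identities and $\varepsilon$ is a small constant; $\varepsilon = 0.1$ is a common choice, though the exact value is not restated in this outline.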
3.7. V2ReID Architecture
- Preparing the input data (1)–(2): The model accepts as input mini-batches of three-channel RGB images of shape (H × W × C), where H and W are the height and width. All the images then go through data augmentation such as normalization, resizing, padding, flipping, etc. After the data transform, the images are split into non-overlapping or overlapping patches. While ViT uses one convolutional layer for non-overlapping patch embedding, VOLO uses four layers. Besides the number of layers, there is also a difference in the size of the patches: in order to encode expressive finer-level features, VOLO reduces the patch size (P) from 16 × 16 to 8 × 8. The total number of patches is then N = HW/P².
- VOLO Backbone (3)–(7): VOLO comprises Outlooker (3), Transformer (5) and Class Attention (7) blocks. A [cls] token (6) is added before the class attention layers (7). Depending on the model variant (D1–D5), the number of layers per block differs. After the patch embeddings (2) go through the Outlooker block (3), the tokens are downsampled (4). Positional encoding is then added, and the tokens are fed into the Transformer blocks.
- Classifying the vehicle (8)–(10): The output features (8) are passed through the classifier heads (10), on which the different losses are computed. Optionally, when using the BNNeck, it is inserted at (9). A minimal forward-pass sketch tying these steps together is given after this list.
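The sketch below ties steps (1)–(10) together in PyTorch-style pseudocode. All module names are ours, and taking the post-class-attention [cls] token as the global descriptor is an assumption of the sketch rather than a statement of the exact implementation.

```python
import torch.nn as nn

class V2ReIDHead(nn.Module):
    def __init__(self, backbone, embed_dim, num_ids, use_bnneck=True):
        super().__init__()
        self.backbone = backbone                      # VOLO blocks, steps (3)-(7)
        self.use_bnneck = use_bnneck
        self.bottleneck = nn.BatchNorm1d(embed_dim)   # BNNeck, step (9)
        self.classifier = nn.Linear(embed_dim, num_ids, bias=False)

    def forward(self, images):                        # (B, 3, H, W) mini-batch, step (1)
        tokens = self.backbone(images)                # (B, 1 + N, C), [cls] token first
        global_feat = tokens[:, 0]                    # output features, step (8)
        feat = self.bottleneck(global_feat) if self.use_bnneck else global_feat
        logits = self.classifier(feat)                # classifier head, step (10)
        return logits, global_feat                    # ID loss on logits, triplet on global_feat
```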
4. Datasets and Evaluation
4.1. Datasets
4.2. Evaluation
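A simplified sketch of the two standard ReID metrics, CMC Rank-1 and mAP, computed from a query-gallery distance matrix is shown below. It omits the same-camera/junk filtering applied in the official VeRi-776 protocol, and all names are illustrative.

```python
import numpy as np

def rank1_and_map(dist, q_ids, g_ids):
    """dist: (num_query, num_gallery) distances; q_ids, g_ids: 1-D identity arrays."""
    r1_hits, aps = [], []
    for i, qid in enumerate(q_ids):
        order = np.argsort(dist[i])                        # gallery sorted by distance
        matches = (np.asarray(g_ids)[order] == qid).astype(int)
        r1_hits.append(matches[0])                         # Rank-1: is the first retrieval correct?
        hit_pos = np.where(matches == 1)[0]
        precisions = [(k + 1) / (p + 1) for k, p in enumerate(hit_pos)]
        aps.append(np.mean(precisions) if precisions else 0.0)
    return float(np.mean(r1_hits)), float(np.mean(aps))    # Rank-1, mAP
```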
5. Experiments and Results
5.1. Implementation Details
5.1.1. Data Preparation
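A typical training transform pipeline of the kind listed in Section 3.7 (resizing, flipping, padding, normalization, random erasing) is sketched below; the image size and all augmentation parameters are placeholders rather than the exact settings used in our experiments.

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize((224, 224)),                     # resize to the network input size
    T.RandomHorizontalFlip(p=0.5),
    T.Pad(10),
    T.RandomCrop((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                std=[0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.5),
])
```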
5.1.2. Experimental Protocols
5.2. Results
5.2.1. Baseline Model
5.2.2. The Importance of Pre-Training
5.2.3. The Importance of the Loss Function
5.2.4. The Importance of the Learning Rate
“The learning rate is perhaps the most important hyperparameter. If you have time to tune only one hyperparameter, tune the learning rate.”[104]
5.2.5. Using Different Optimizers
5.2.6. Going Deeper
5.2.7. The Importance of the LR Scheduler
- Linear warm-up: Figure 16 visualizes how the loss and mAP varied depending on the number of warm-up epochs. Without any warm-up (blue), the spike in the loss was deeper and the model took longer to recover from it. When using a warm-up of 50 epochs (green), the spike was narrower. Finally, with 75 warm-up epochs, there was no spike during training.
- Number of restart epochs: Figure 17 shows the evolution of the learning rate using different numbers of restart epochs (140, 150, 190) and decay rates (0.1 or 0.8). The decay rate is the factor by which the learning rate is multiplied at every restart: LR × decay_rate. When using 150 restart epochs with a decay rate of 0.8 (orange), the mAP score dipped but recovered quickly and reached a higher score than the two other settings. When restarting after 140 epochs (blue) or 190 epochs (green), both with a decay rate of 0.1, there was no dip in the mAP during training; however, the resulting values were lower. A sketch of this schedule is given after this list.
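The helper below sketches the schedule described above: a linear warm-up followed by cosine annealing with warm restarts, where the peak learning rate is multiplied by the decay rate at every restart. The default values are placeholders taken from the settings discussed in this section, not a definitive configuration.

```python
import math

def lr_at_epoch(epoch, base_lr=0.01, warmup_epochs=10,
                restart_epochs=150, decay_rate=0.8, min_lr=1e-6):
    if epoch < warmup_epochs:
        # linear warm-up from 0 up to the base learning rate
        return base_lr * (epoch + 1) / warmup_epochs
    cycle, pos = divmod(epoch - warmup_epochs, restart_epochs)
    peak = base_lr * (decay_rate ** cycle)            # LR x decay_rate at every restart
    # cosine decay from the current peak down to min_lr within the cycle
    return min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * pos / restart_epochs))
```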
5.2.8. Visualization of the Ranking List
1. Our model was able to identify the correct type of vehicle (model, colour).
2. The same vehicle can be identified from different angles/perspectives (see the first and last rows).
3. Occlusion and illumination can interfere with the model's performance (see the first and second rows).
4. Using information on the background and the timestamp would enhance our model's predictive ability. Looking at the third row, the retrieved vehicle was very similar to the query vehicle. However, when looking at the background, there was information (a black car) that was not detected. As for the fourth row, there was no red writing on the wrong match; furthermore, that truck carried more sand than the truck in the query.
5. Overall, the model was highly accurate at predicting the correct matches. As humans, we would have to look more than twice to grasp the tiny differences between the query and the retrieved gallery images.
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Zhang, J.; Wang, F.Y.; Wang, K.; Lin, W.H.; Xu, X.; Chen, C. Data-driven intelligent transportation systems: A survey. IEEE Trans. Intell. Transp. Syst. 2011, 12, 1624–1639. [Google Scholar] [CrossRef]
- Zheng, Y.; Capra, L.; Wolfson, O.; Yang, H. Urban computing: Concepts, methodologies, and applications. ACM Trans. Intell. Syst. Technol. 2014, 5, 1–55. [Google Scholar] [CrossRef]
- Liu, X.; Liu, W.; Ma, H.; Fu, H. Large-scale vehicle re-identification in urban surveillance videos. In Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11-15 July 2016; pp. 1–6. [Google Scholar]
- Liu, W.; Zhang, Y.; Tang, S.; Tang, J.; Hong, R.; Li, J. Accurate estimation of human body orientation from RGB-D sensors. IEEE Trans. Cybern. 2013, 43, 1442–1452. [Google Scholar] [CrossRef]
- Deng, J.; Hao, Y.; Khokhar, M.S.; Kumar, R.; Cai, J.; Kumar, J.; Aftab, M.U. Trends in vehicle re-identification past, present, and future: A comprehensive review. Mathematics 2021, 9, 3162. [Google Scholar]
- Yan, C.; Pang, G.; Bai, X.; Liu, C.; Xin, N.; Gu, L.; Zhou, J. Beyond triplet loss: Person re-identification with fine-grained difference-aware pairwise loss. IEEE Trans. Multimed. 2021, 24, 1665–1677. [Google Scholar] [CrossRef]
- Wang, Z.; Tang, L.; Liu, X.; Yao, Z.; Yi, S.; Shao, J.; Yan, J.; Wang, S.; Li, H.; Wang, X. Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 379–387. [Google Scholar]
- Liu, X.; Zhang, S.; Huang, Q.; Gao, W. Ram: A region-aware deep model for vehicle re-identification. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6. [Google Scholar]
- He, B.; Li, J.; Zhao, Y.; Tian, Y. Part-regularized near-duplicate vehicle re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 3997–4005. [Google Scholar]
- Yuan, L.; Hou, Q.; Jiang, Z.; Feng, J.; Yan, S. Volo: Vision outlooker for visual recognition. arXiv 2021, arXiv:2106.13112. [Google Scholar] [CrossRef]
- Wang, H.; Hou, J.; Chen, N. A Survey of Vehicle Re-Identification Based on Deep Learning. IEEE Access 2019, 7, 172443–172469. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25, pp. 1097–1105. [Google Scholar]
- Gazzah, S.; Essoukri, N.; Amara, B. Vehicle Re-identification in Camera Networks: A Review and New Perspectives. In Proceedings of the ACIT’2017 The International Arab Conference on Information Technology, Yassmine Hammamet, Tunisia, 22–24 December 2017; pp. 22–24. [Google Scholar]
- Khan, S.D.; Ullah, H. A survey of advances in vision-based vehicle re-identification. Comput. Vis. Image Underst. 2019, 182, 50–63. [Google Scholar] [CrossRef] [Green Version]
- Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2872–2893. [Google Scholar] [CrossRef]
- Lindsay, G.W. Attention in psychology, neuroscience, and machine learning. Front. Comput. Neurosci. 2020, 14, 29. [Google Scholar] [CrossRef]
- Teng, S.; Liu, X.; Zhang, S.; Huang, Q. Scan: Spatial and channel attention network for vehicle re-identification. In Proceedings of the Pacific Rim Conference on Multimedia, Hefei, China, 21–22 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 350–361. [Google Scholar]
- Khorramshahi, P.; Kumar, A.; Peri, N.; Rambhatla, S.S.; Chen, J.C.; Chellappa, R. A dual-path model with adaptive attention for vehicle re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6132–6141. [Google Scholar]
- Zhao, B.; Wu, X.; Feng, J.; Peng, Q.; Yan, S. Diversified visual attention networks for fine-grained object classification. IEEE Trans. Multimed. 2017, 19, 1245–1256. [Google Scholar] [CrossRef]
- Mnih, V.; Heess, N.; Graves, A. Recurrent models of visual attention. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, USA, 8–13 December 2014; Volume 27. [Google Scholar]
- Naphade, M.; Wang, S.; Anastasiu, D.C.; Tang, Z.; Chang, M.C.; Yang, X.; Yao, Y.; Zheng, L.; Chakraborty, P.; Lopez, C.E.; et al. The 5th ai city challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 4263–4273. [Google Scholar]
- Wu, M.; Qian, Y.; Wang, C.; Yang, M. A multi-camera vehicle tracking system based on city-scale vehicle Re-ID and spatial-temporal information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 4077–4086. [Google Scholar]
- Huynh, S.V. A strong baseline for vehicle re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 4147–4154. [Google Scholar]
- Fernandez, M.; Moral, P.; Garcia-Martin, A.; Martinez, J.M. Vehicle Re-Identification based on Ensembling Deep Learning Features including a Synthetic Training Dataset, Orientation and Background Features, and Camera Verification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 4068–4076. [Google Scholar]
- Liu, H.; Tian, Y.; Yang, Y.; Pang, L.; Huang, T. Deep relative distance learning: Tell the difference between similar vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2167–2175. [Google Scholar]
- Shen, Y.; Xiao, T.; Li, H.; Yi, S.; Wang, X. Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1900–1909. [Google Scholar]
- Liu, X.; Liu, W.; Mei, T.; Ma, H. Provid: Progressive and multimodal vehicle reidentification for large-scale urban surveillance. IEEE Trans. Multimed. 2017, 20, 645–658. [Google Scholar] [CrossRef]
- Zhu, J.; Zeng, H.; Du, Y.; Lei, Z.; Zheng, L.; Cai, C. Joint feature and similarity deep learning for vehicle re-identification. IEEE Access 2018, 6, 43724–43731. [Google Scholar] [CrossRef]
- Chu, R.; Sun, Y.; Li, Y.; Liu, Z.; Zhang, C.; Wei, Y. Vehicle re-identification with viewpoint-aware metric learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 8282–8291. [Google Scholar]
- Organisciak, D.; Sakkos, D.; Ho, E.S.; Aslam, N.; Shum, H.P. Unifying Person and Vehicle Re-Identification. IEEE Access 2020, 8, 115673–115684. [Google Scholar] [CrossRef]
- Wei, X.S.; Zhang, C.L.; Liu, L.; Shen, C.; Wu, J. Coarse-to-fine: A RNN-based hierarchical attention model for vehicle re-identification. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 575–591. [Google Scholar]
- Chen, T.S.; Liu, C.T.; Wu, C.W.; Chien, S.Y. Orientation-aware Vehicle Re-identification with Semantics-guided Part Attention Network. arXiv 2020, arXiv:2008.11423. [Google Scholar]
- Zhou, Y.; Shao, L. Cross-View GAN Based Vehicle Generation for Re-identification. In Proceedings of the BMVC, London, UK, 4–7 September 2017; Volume 1, pp. 1–12. [Google Scholar]
- Wu, F.; Yan, S.; Smith, J.S.; Zhang, B. Joint semi-supervised learning and re-ranking for vehicle re-identification. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 278–283. [Google Scholar]
- Wu, F.; Yan, S.; Smith, J.S.; Zhang, B. Vehicle re-identification in still images: Application of semi-supervised learning and re-ranking. Signal Process. Image Commun. 2019, 76, 261–271. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 x 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image Transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
- Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of Transformers. arXiv 2021, arXiv:2106.04554. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with Transformers. In Proceedings of the European Conference on Computer Vision, Virtual, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Fang, Y.; Liao, B.; Wang, X.; Fang, J.; Qi, J.; Wu, R.; Niu, J.; Liu, W. You only look at one sequence: Rethinking Transformer in vision through object detection. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Volume 34, pp. 26183–26197. [Google Scholar]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 6881–6890. [Google Scholar]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Volume 34, pp. 12077–12090. [Google Scholar]
- He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. Transreid: Transformer-based object re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15013–15022. [Google Scholar]
- Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022. [Google Scholar] [CrossRef]
- Liu, Y.; Zhang, Y.; Wang, Y.; Hou, F.; Yuan, J.; Tian, J.; Zhang, Y.; Shi, Z.; Fan, J.; He, Z. A Survey of Visual Transformers. arXiv 2021, arXiv:2111.06091. [Google Scholar]
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. 2021, 54, 200. [Google Scholar] [CrossRef]
- Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
- Xu, Y.; Wei, H.; Lin, M.; Deng, Y.; Sheng, K.; Zhang, M.; Tang, F.; Dong, W.; Huang, F.; Xu, C. Transformers in computational visual media: A survey. Comput. Vis. Media 2022, 8, 33–62. [Google Scholar] [CrossRef]
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pvt v2: Improved baselines with pyramid vision Transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional Transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
- Battaglia, P.W.; Hamrick, J.B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. Relational inductive biases, deep learning, and graph networks. arXiv 2018, arXiv:1806.01261. [Google Scholar]
- Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 843–852. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision And Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
- Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck Transformers for visual recognition. In Proceedings of the IEEE/CVF Conference On Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16519–16529. [Google Scholar]
- Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Yan, Z.; Tomizuka, M.; Gonzalez, J.; Keutzer, K.; Vajda, P. Visual Transformers: Token-based image representation and processing for computer vision. arXiv 2020, arXiv:2006.03677. [Google Scholar]
- D’Ascoli, S.; Touvron, H.; Leavitt, M.L.; Morcos, A.S.; Biroli, G.; Sagun, L. Convit: Improving vision Transformers with soft convolutional inductive biases. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 2286–2296. [Google Scholar]
- Yuan, K.; Guo, S.; Liu, Z.; Zhou, A.; Yu, F.; Wu, W. Incorporating convolution designs into visual Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 579–588. [Google Scholar]
- Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. Cvt: Introducing convolutions to vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 22–31. [Google Scholar]
- Li, Y.; Zhang, K.; Cao, J.; Timofte, R.; Van Gool, L. Localvit: Bringing locality to vision Transformers. arXiv 2021, arXiv:2104.05707. [Google Scholar]
- Graham, B.; El-Nouby, A.; Touvron, H.; Stock, P.; Joulin, A.; Jégou, H.; Douze, M. Levit: A vision Transformer in convnet’s clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 12259–12269. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision Transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in Transformer. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Volume 34, pp. 15908–15919. [Google Scholar]
- Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.H.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-token vit: Training vision Transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 558–567. [Google Scholar]
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision Transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
- Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; Jégou, H. Going deeper with image Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 32–42. [Google Scholar]
- Zhou, D.; Kang, B.; Jin, X.; Yang, L.; Lian, X.; Jiang, Z.; Hou, Q.; Feng, J. Deepvit: Towards deeper vision Transformer. arXiv 2021, arXiv:2103.11886. [Google Scholar]
- Li, D.; Hu, J.; Wang, C.; Li, X.; She, Q.; Zhu, L.; Zhang, T.; Chen, Q. Involution: Inverting the inherence of convolution for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Montreal, QC, Canada, 10–17 October 2021; pp. 12321–12330. [Google Scholar]
- Jiang, Z.H.; Hou, Q.; Yuan, L.; Zhou, D.; Shi, Y.; Jin, X.; Wang, A.; Feng, J. All tokens matter: Token labelling for training better vision Transformers. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Volume 34, pp. 18590–18602. [Google Scholar]
- Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision, Santiago, Chile, 7–13 December 2015; pp. 1116–1124. [Google Scholar]
- Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 17–35. [Google Scholar]
- Zhu, H.; Ke, W.; Li, D.; Liu, J.; Tian, L.; Shan, Y. Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 4692–4702. [Google Scholar]
- Lu, T.; Zhang, H.; Min, F.; Jia, S. Vehicle Re-identification Based on Quadratic Split Architecture and Auxiliary Information Embedding. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2022. [Google Scholar] [CrossRef]
- Shen, F.; Xie, Y.; Zhu, J.; Zhu, X.; Zeng, H. Git: Graph interactive Transformer for vehicle re-identification. arXiv 2021, arXiv:2107.05475. [Google Scholar]
- Lian, J.; Wang, D.; Zhu, S.; Wu, Y.; Li, C. Transformer-Based Attention Network for Vehicle Re-Identification. Electronics 2022, 11, 1016. [Google Scholar] [CrossRef]
- Li, H.; Li, C.; Zheng, A.; Tang, J.; Luo, B. MsKAT: Multi-Scale Knowledge-Aware Transformer for Vehicle Re-Identification. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19557–19568. [Google Scholar] [CrossRef]
- Luo, H.; Chen, W.; Xu, X.; Gu, J.; Zhang, Y.; Liu, C.; Jiang, Y.; He, S.; Wang, F.; Li, H. An empirical study of vehicle re-identification on the AI City Challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4095–4102. [Google Scholar]
- Yu, Z.; Pei, J.; Zhu, M.; Zhang, J.; Li, J. Multi-attribute adaptive aggregation Transformer for vehicle re-identification. Inf. Process. Manag. 2022, 59, 102868. [Google Scholar] [CrossRef]
- Gibbs, J.W. Elementary Principles in Statistical Mechanics—Developed with Especial Reference to the Rational Foundation of Thermodynamics; C. Scribner’s Sons: New York, NY, USA, 1902; Available online: www.gutenberg.org/ebooks/50992 (accessed on 3 August 2022).
- Bridle, J.S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing; Springer: Berlin/Heidelberg, Germany, 1990; pp. 227–236. [Google Scholar]
- Lu, C. Shannon equations reform and applications. BUSEFAL 1990, 44, 45–52. [Google Scholar]
- Zheng, L.; Zhang, H.; Sun, S.; Chandraker, M.; Yang, Y.; Tian, Q. Person re-identification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1367–1376. [Google Scholar]
- Liu, W.; Wen, Y.; Yu, Z.; Yang, M. Large-margin SoftMax loss for convolutional neural networks. In Proceedings of the ICML, New York, NY, USA, 20–22 June 2016; p. 7. [Google Scholar]
- Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 212–220. [Google Scholar]
- Chen, B.; Deng, W.; Shen, H. Virtual class enhanced discriminative embedding learning. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 31, pp. 1946–1956. [Google Scholar]
- Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737. [Google Scholar]
- Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
- Chen, W.; Chen, X.; Zhang, J.; Huang, K. Beyond triplet loss: A deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 403–412. [Google Scholar]
- Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; Wei, Y. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6398–6407. [Google Scholar]
- Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 499–515. [Google Scholar]
- Zhu, X.; Luo, Z.; Fu, P.; Ji, X. VOC-ReID: Vehicle re-identification based on vehicle-orientation-camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 602–603. [Google Scholar]
- Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33, pp. 18661–18673. [Google Scholar]
- Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
- Luo, H.; Jiang, W.; Gu, Y.; Liu, F.; Liao, X.; Lai, S.; Gu, J. A strong baseline and batch normalization neck for deep person re-identification. IEEE Trans. Multimed. 2019, 22, 2597–2609. [Google Scholar] [CrossRef] [Green Version]
- Liu, X.; Liu, W.; Mei, T.; Ma, H. A deep learning-based approach to progressive vehicle re-identification for urban surveillance. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 869–884. [Google Scholar]
- Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
- Fan, X.; Jiang, W.; Luo, H.; Fei, M. Spherereid: Deep hypersphere manifold embedding for person re-identification. J. Vis. Commun. Image Represent. 2019, 60, 51–58. [Google Scholar] [CrossRef]
- Goodfellow, I.J.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Available online: http://www.deeplearningbook.org (accessed on 3 June 2022).
- Zhang, T.; Li, W. k-decay: A new method for learning rate schedule. arXiv 2020, arXiv:2004.05909. [Google Scholar]
Year | Title | Setting | Based on |
---|---|---|---|
2017 | Vehicle Re-identification in Camera Networks: A Review and New Perspectives [13] | Closed | sensor, vision |
2019 | A Survey of Advances in Vision-Based Vehicle Re-identification [14] | Closed | sensor, vision |
2019 | A Survey of Vehicle Re-Identification Based on Deep Learning [11] | Closed | vision |
2021 | Trends in Vehicle Re-identification Past, Present, and Future: A Comprehensive Review [5] | Closed | sensor, vision |
Method | Description | Advantages | Disadvantages |
---|---|---|---|
Local feature (LF) | Focuses on the local areas of vehicles using key point location and region segmentation | Able to capture unique visual cues; can be combined with global features | Extraction of local features is resource intensive |
Metric learning (ML) | Focuses on the details of the vehicle by learning the similarity of vehicles | Achieves high accuracy | Needs to design a loss function |
Unsupervised learning (UL) | No need for labelled data | Improves the generalization ability; solves the domain shift | Training is unstable |
Attention mechanism (AM) | Model learns to identify what areas need to be paid attention to; self-adaptively extracts features | Learns what areas to focus on; extracts features of distinguishing regions | Poor effect when using few labelled data or complex backgrounds |
Year | Model | VeRi-776 mAP (%) | VeRi-776 Rank-1 (%) | Vehicle-ID S (mAP/R-1 %) | Vehicle-ID M (mAP/R-1 %) | Vehicle-ID L (mAP/R-1 %)
---|---|---|---|---|---|---
2021 | TransReID * [47] | 82.30 | 97.10 | - | - | -
2021 | TransReID (ViT-Base) [47] | 78.2 | 96.5 | 82.3/96.1 | - | -
2021 | GiT * [79] | 80.34 | 96.86 | 84.65/ - | 80.52/ - | 77.94/ -
2022 | VAT * [83] | 80.40 | 97.5 | 84.50/ - | 80.50/ - | 78.20/ -
2022 | QSA * [78] | 82.20 | 97.30 | 88.50/98.00 | 84.70/96.30 | 80.10/92.10
2022 | DCAL [77] | 80.20 | 96.90 | - | - | -
2022 | MsKAT * [81] | 82.00 | 97.10 | 86.30/97.40 | 81.80/95.50 | 74.90/93.90
2022 | TANet † [80] | 80.50 | 95.4 | 88.20/82.9 | 87.0/81.5 | 85.9/79.6
Column Name | Values | Comments
---|---|---
ID | natural number | identifier of the experiment
pre-trained | ✓ / ✗ | true (pre-trained) / false (from scratch)
loss | |
BNNeck | ✓ / ✗ | using / not using the batch normalization neck
Specifications | Value |
---|---|
variant (Section 5.2.6) | VOLO-D1 |
pre-trained (Section 5.2.2) | false |
optimizer (Section 5.2.5) | SGD |
momentum | 0.9 |
base learning rate (Section 5.2.4) | |
weight decay | |
loss function (Section 5.2.3) | ID loss |
LR scheduler (Section 5.2.7) | cosine annealing |
warm-up epochs (Section 5.2.7) | 10 |
ID | BNNeck | Loss | LR | Weight Decay | Pre-Trained | mAP % | R-1 % |
---|---|---|---|---|---|---|---|
1 | ✗ | ✗ | 15.75 | 23.42 | |||
✓ | 14.29 | 35.63 | |||||
2 | * | ✗ | 43.95 | 77.11 | |||
✓ | 63.87 | 91.12 | |||||
3 | ✗ | 54.67 | 84.44 | ||||
✓ | 73.12 | 94.39 | |||||
4 | ✗ | 57.39 | 87.72 | ||||
✓ | 78.02 | 96.24 | |||||
5 | ✓ | ✗ | 59.71 | 89.39 | |||
✓ | 77.41 | 95.88 |
ID | BNNeck | Loss | LR | mAP % | R-1 % |
---|---|---|---|---|---|
1 | ✗ | 63.87 | 91.12 | ||
64.77 | 92.07 | ||||
68.91 | 93.68 | ||||
2 | 73.12 | 94.39 | |||
77.04 | 96.06 | ||||
4.51 | 12.93 | ||||
3 | 76.10 | 95.35 | |||
78.02 | 96.24 | ||||
0.94 | 1.54 | ||||
4 | ✓ | 70.73 | 94.57 | ||
72.89 | 94.87 | ||||
77.41 | 95.88 |
Neck | LR | mAP % | R-1 % |
---|---|---|---|
LNNeck | 28.6 | 58.76 | |
73.85 | 95.11 | ||
3.73 | 11.26 | ||
2.01 | 5.42 |
BNNeck | Learning Rate | mAP % | R-1 % |
---|---|---|---|
✗ | 76.10 | 95.35 | |
77.38 | 95.94 | ||
77.72 | 96.42 | ||
78.00 | 96.90 | ||
78.02 | 96.24 | ||
77.88 | 96.30 | ||
6.25 | 21.69 | ||
6.42 | 20.91 | ||
5.90 | 19.30 | ||
3.38 | 9.95 | ||
0.94 | 1.54 | ||
✓ | 70.73 | 94.57 | |
72.89 | 94.87 | ||
74.94 | 95.11 | ||
75.00 | 95.41 | ||
75.35 | 96.72 | ||
75.15 | 95.76 | ||
75.67 | 95.64 | ||
76.44 | 96.42 | ||
76.73 | 96.00 | ||
77.41 | 95.88 | ||
75.98 | 95.70 | ||
75.52 | 95.23 | ||
75.37 | 95.70 |
Optimizer | LR | mAP % | R-1 % |
---|---|---|---|
SGD | 78.02 | 96.24 | |
AdamW | 0.75 | 1.19 | |
70.09 | 93.44 | ||
73.22 | 94.27 | ||
74.32 | 94.93 | ||
63.52 | 87.18 | ||
RMSProp | 0.73 | 0.89 | |
68.74 | 92.55 | ||
73.67 | 94.75 | ||
65.41 | 89.33 |
ID | BNNeck | LR | Variant | Batch Size | mAP % | R-1 % |
---|---|---|---|---|---|---|
1 | ✗ | D1 | 128 | 77.23 | 96.72 | |
256 | 78.02 | 96.24 | ||||
2 | D2 | 128 | 76.60 | 95.94 | ||
256 | 76.18 | 95.41 | ||||
3 | ✓ | D1 | 128 | 73.99 | 95.58 | |
256 | 77.41 | 95.88 | ||||
4 | D2 | 128 | 77.06 | 96.24 | ||
256 | 77.16 | 97.02 |
Model | Variant | # Params | # Layers | Batch Size | Runtime (h) | mAP % | R-1 % |
---|---|---|---|---|---|---|---|
BN, LR = 0.015 | D1 | 26.6 M | 18 | 256 | 11.05 | 77.41 | 95.88 |
D2 | 58.7 M | 24 | 256 | 16.68 | 77.16 | 97.02 | |
D3 † | 86.3 M | 36 | 128 | 24.12 | 75.18 | 95.88 | |
D4 | 193 M | 36 | 128 | 31.69 | 78.77 | 96.66 | |
D5 | 296 M | 48 | 128 | 44.29 | 80.30 | 97.13 | |
LR = 0.002 | D1 | 256 | 10.72 | 78.02 | 96.24 | ||
D2 | 128 | 18.08 | 76.60 | 95.94 | |||
D3 | 128 | 24.40 | 76.19 | 94.93 | |||
D4 | 128 | 32.02 | 78.51 | 96.78 | |||
D5 | 128 | 44.68 | 79.12 | 97.19 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).