D-TransT: Deformable Transformer Tracking
Abstract
1. Introduction
- Compared with other object-tracking networks, Transformer Tracking (TransT) requires more training time to converge.
- TransT ignores multi-scale features, leaving room for further improvement in tracking performance.
2. Related Work
2.1. Siamese-Based Networks
2.2. Vision Transformer
2.3. Deformable Attention Modules
3. Deformable Transformer Tracking
- Feature extraction module: extracts features from both the search region and the template.
- Feature fusion module: replaces the traditional correlation operation with a cross-attention mechanism to fuse the two sets of features.
- Prediction head module: performs classification and bounding-box regression on the fused features to produce the tracking result (a minimal code sketch of this three-module pipeline follows the list).
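As a rough illustration only, here is a minimal sketch of how these three modules might compose; the module names and signatures are our own, not the paper's:

```python
import torch.nn as nn

class TrackerPipeline(nn.Module):
    """Illustrative three-module tracking pipeline:
    feature extraction -> attention-based fusion -> prediction head."""

    def __init__(self, backbone: nn.Module, fusion: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # shared feature extractor (e.g., a ResNet)
        self.fusion = fusion      # cross-attention fusion instead of correlation
        self.head = head          # classification + bounding-box regression

    def forward(self, template, search):
        f_z = self.backbone(template)         # template features
        f_x = self.backbone(search)           # search-region features
        fused = self.fusion(f_z, f_x)         # fuse the two feature sets
        cls_logits, boxes = self.head(fused)  # per-location scores and boxes
        return cls_logits, boxes
```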
3.1. Overall Architecture
3.1.1. Feature Extraction Module
3.1.2. Feature Fusion Module
3.1.3. Prediction Head Network
3.2. Training Loss
4. Experiments
4.1. Implementation Details
4.2. State-of-the-Art Comparison
4.3. Ablation Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020.
- Zhang, Z.; Peng, H. Deeper and wider Siamese networks for real-time visual tracking. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017.
- Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like pure transformer for medical image segmentation. arXiv 2021, arXiv:2105.05537.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- LeCun, Y.; Bengio, Y. Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks; MIT Press: Cambridge, MA, USA, 1995.
- Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSWin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
- Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021.
- Danelljan, M.; Robinson, A.; Khan, F.; Felsberg, M. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016.
- Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. ECO: Efficient convolution operators for tracking. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Ma, C.; Huang, J.; Yang, X.; Yang, M. Hierarchical convolutional features for visual tracking. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015.
- Jiang, B.; Luo, R.; Mao, J.; Xiao, T.; Jiang, Y. Acquisition of localization confidence for accurate object detection. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Bolme, D.S.; Beveridge, J.; Draper, B.; Lui, Y. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010.
- Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012.
- Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 583–596.
- Lukežič, A.; Vojíř, T.; Čehovin Zajc, L.; Matas, J.; Kristan, M. Discriminative correlation filter tracker with channel and spatial reliability. Int. J. Comput. Vis. 2018, 126, 671–688.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the 2021 International Conference on Learning Representations, Virtual, 3–7 May 2021.
- Bertinetto, L.; Valmadre, J.; Henriques, J.; Vedaldi, A.; Torr, P. Fully-convolutional Siamese networks for object tracking. In Proceedings of the 2016 European Conference on Computer Vision Workshops, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016.
- Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with Siamese region proposal network. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
- Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122.
- Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of Siamese visual tracking with very deep networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25, Lake Tahoe, NV, USA, 3–6 December 2012.
- Danelljan, M.; Van Gool, L.; Timofte, R. Probabilistic regression for visual tracking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020.
- Voigtlaender, P.; Luiten, J.; Torr, P.; Leibe, B. Siam R-CNN: Visual tracking by re-detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015.
- Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Proceedings of the 2020 European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021.
- Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014.
- Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. LaSOT: A high-quality benchmark for large-scale single object tracking. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- Huang, L.; Zhao, X.; Huang, K. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1562–1577.
- Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Danelljan, M.; Bhat, G.; Khan, F.; Felsberg, M. ATOM: Accurate tracking by overlap maximization. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- Zha, Y.; Wu, M.; Qiu, Z.; Dong, S.; Yang, F.; Zhang, P. Distractor-aware visual tracking by online Siamese network. IEEE Access 2019, 7, 89777–89788.
- Bhat, G.; Danelljan, M.; Van Gool, L.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
- Wu, Y.; Lim, J.; Yang, M.-H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
Notation | Description
---|---
$z_q$ | Query element
$x$ | A linear transformation of the input vector
$q$ | Index for a query element
$k$ | Index for a key element
$\Omega_k$ | The set of all key elements $k$
$A_{qk}$ | The normalized attention weight
$K$ | Number of key elements
$\Delta p_{qk}$ | The position offset of the sampling point relative to the reference point
$x^l$ | The features of the $l$-th layer
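The symbols filled in above follow the deformable attention notation of Deformable DETR (Zhu et al., cited in the references), on which D-TransT builds. In simplified single-head form, each query attends to only $K$ sampled points around its reference point $p_q$:

$$\mathrm{DeformAttn}(z_q, p_q, x) \;=\; W \sum_{k=1}^{K} A_{qk} \cdot W' x\big(p_q + \Delta p_{qk}\big), \qquad \sum_{k=1}^{K} A_{qk} = 1,$$

where the offsets $\Delta p_{qk}$ and the weights $A_{qk}$ are both predicted from the query feature $z_q$ by linear layers, and $x(\cdot)$ is evaluated by bilinear interpolation. Below is a minimal PyTorch sketch of this mechanism, assuming a single feature scale and a single attention head; all class and variable names are illustrative, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-head, single-scale deformable attention sketch.

    Each query z_q attends to K sampled points around its reference
    point p_q; the offsets and attention weights come from z_q itself.
    """

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(dim, num_points * 2)     # predicts offsets
        self.weight_head = nn.Linear(dim, num_points)         # predicts weight logits
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)  # value projection W'
        self.out_proj = nn.Linear(dim, dim)                   # output projection W

    def forward(self, queries, ref_points, feat):
        # queries:    (B, Nq, C)  query features z_q
        # ref_points: (B, Nq, 2)  reference points p_q as (x, y) in [0, 1]
        # feat:       (B, C, H, W) input feature map x
        B, Nq, _ = queries.shape
        H, W = feat.shape[-2:]
        value = self.value_proj(feat)                             # (B, C, H, W)

        offsets = self.offset_head(queries).view(B, Nq, self.num_points, 2)
        weights = self.weight_head(queries).softmax(dim=-1)       # sums to 1 over K

        # Sampling locations p_q + offset (offsets in pixels, normalized to [0, 1]).
        scale = torch.tensor([W, H], dtype=feat.dtype, device=feat.device)
        locs = ref_points.unsqueeze(2) + offsets / scale          # (B, Nq, K, 2)
        grid = 2.0 * locs - 1.0                                   # grid_sample range

        sampled = F.grid_sample(value, grid, align_corners=False) # (B, C, Nq, K)
        out = (sampled * weights.unsqueeze(1)).sum(dim=-1)        # weighted sum over K
        return self.out_proj(out.transpose(1, 2))                 # (B, Nq, C)
```

Because each query touches only $K$ sampled locations rather than every position on the feature map, this attention is cheaper than full attention and converges faster, which is the motivation for using it in both attention modules of the tracker.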
Method | Source | LaSOT AUC | LaSOT P_Norm | LaSOT P | OTB100 EAO | OTB100 Robustness | OTB100 Accuracy
---|---|---|---|---|---|---|---
D-TransT | Ours | 65.6 | 73.3 | 69.1 | 0.521 | 0.124 | 0.609
TransT | CVPR2021 | 64.9 | 73.2 | 68.9 | 0.501 | 0.111 | 0.593
Ocean | ECCV2020 | 56.0 | 65.1 | 56.6 | 0.489 | 0.117 | 0.592
SiamR-CNN | CVPR2020 | 64.8 | 72.2 | - | 0.408 | 0.220 | 0.609
PrDiMP | CVPR2020 | 59.8 | 68.8 | 60.8 | 0.442 | 0.165 | 0.618
SiamRPN++ | CVPR2019 | 49.6 | 56.9 | 49.1 | 0.414 | 0.234 | 0.600
DaSiamRPN | ECCV2018 | - | - | - | 0.383 | 0.276 | 0.586
SiamRPN | CVPR2018 | - | - | - | 0.358 | 0.276 | 0.586
ATOM | CVPR2019 | 51.5 | 57.6 | 50.5 | - | - | -
SiamFC | ECCVW2016 | 33.6 | 42.0 | 33.8 | 0.292 | 0.283 | 0.540
Evaluation Metrics | Description
---|---
AUC | The area under the curve of the success plot
P and P_Norm | Precision and normalized precision scores
EAO | Expected average overlap
Robustness | Failure rate
Accuracy | Average overlap over successfully tracked frames
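To make the first two metrics concrete, here is a minimal sketch (our own, not the paper's evaluation code) of how success-plot AUC and precision are conventionally computed from per-frame IoU and center-error values; on OTB-style benchmarks the precision threshold is customarily 20 pixels:

```python
import numpy as np

def success_auc(ious: np.ndarray, thresholds: np.ndarray = np.linspace(0, 1, 21)) -> float:
    """AUC of the success plot: mean fraction of frames whose IoU
    with the ground-truth box exceeds each overlap threshold."""
    return float(np.mean([(ious > t).mean() for t in thresholds]))

def precision_score(center_errors: np.ndarray, pixel_threshold: float = 20.0) -> float:
    """Precision: fraction of frames whose predicted box center lies
    within `pixel_threshold` pixels of the ground-truth center."""
    return float((np.asarray(center_errors) <= pixel_threshold).mean())

# Example: four frames with per-frame IoUs and center errors (pixels).
ious = np.array([0.8, 0.6, 0.3, 0.0])
errors = np.array([5.0, 12.0, 30.0, 80.0])
print(success_auc(ious), precision_score(errors))
```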
Method | D-ECA | D-CFA | ECA | CFA | Training GPU Hours | AUC | P_Norm | P
---|---|---|---|---|---|---|---|---
TransT | | | √ | √ | 85 | 64.9 | 73.2 | 68.9
D-TransT | √ | | | √ | 65 | 65.3 | 73.4 | 69.0
D-TransT | | √ | √ | | 65 | 65.2 | 73.3 | 68.9
D-TransT | √ | √ | | | 60 | 65.6 | 73.3 | 69.1
Method | D-ECA | D-CFA | ECA | CFA | Training GPU Hours | EAO | Robustness | Accuracy
---|---|---|---|---|---|---|---|---
TransT | | | √ | √ | 85 | 0.501 | 0.111 | 0.593
D-TransT | √ | | | √ | 65 | 0.511 | 0.118 | 0.592
D-TransT | | √ | √ | | 65 | 0.518 | 0.123 | 0.600
D-TransT | √ | √ | | | 60 | 0.521 | 0.124 | 0.609