Improved Fully Convolutional Siamese Networks for Visual Object Tracking Based on Response Behaviour Analysis
Abstract
1. Introduction
- (1) CNN-based trackers. CNNs were the first deep learning models used in the visual object tracking field due to their powerful target representation. Wang et al. [7] proposed a tracking algorithm that used fully convolutional networks pre-trained on image classification tasks, which outperformed the majority of contemporary trackers in both precision and success rate. Nam et al. [8] pre-trained a CNN on a large set of videos with tracking ground truths to obtain a generic target representation. However, CNN-based trackers have inherent limitations, including computational complexity and the need for large-scale supervised training data.
- (2) RNN-based trackers. These excel at handling the temporal information of video frames, such as object movement or motion. Yang et al. [9] embedded a long short-term memory (LSTM) network into a recurrent filter learning network to achieve state-of-the-art tracking. Ma et al. [10] exploited a pyramid multi-directional recurrent network to memorise target appearance. However, RNN-based trackers are generally difficult to train and have a considerable number of parameters that require tuning, so they remain few in number.
- (3) GAN-based trackers. These can generate the desired positive training samples in the feature space to tackle the issue of sample imbalance [11]. Guo et al. [12] proposed a task-guided generative adversarial network (TGGAN) to learn the general appearance distribution that a target may undergo through a sequence. As with RNN-based trackers, GAN-based trackers are difficult to train and evaluate, so they, too, remain few in number.
- (4) SNN-based trackers. Recently, Siamese networks (SNNs), which follow a tracking-by-similarity-comparison strategy, have received significant attention from the visual tracking community due to their favourable performance [13,14,15,16,17]. SNN-based trackers formulate visual object tracking as learning a general similarity map by cross-correlating the feature representations of the target template and the search region. Owing to their satisfactory balance between performance and efficiency, SNN-based trackers have become the most widely used and researched trackers in recent years.
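To make the cross-correlation step concrete, the following is a minimal NumPy sketch. Single-channel feature maps and a scalar offset `b` are simplifying assumptions; real Siamese trackers cross-correlate multi-channel CNN embeddings of the template and search region.

```python
import numpy as np

def response_map(template_feat: np.ndarray, search_feat: np.ndarray, b: float = 0.0) -> np.ndarray:
    """Slide the template embedding over the search embedding and record
    the similarity (inner product) at every offset, plus a scalar bias b."""
    th, tw = template_feat.shape
    sh, sw = search_feat.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for u in range(out.shape[0]):
        for v in range(out.shape[1]):
            window = search_feat[u:u + th, v:v + tw]
            out[u, v] = np.sum(window * template_feat) + b
    return out
```

The predicted target position is then simply the location of the maximum of this map, which is why the quality of the response map dominates tracking performance.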
- A new distractor-detection method is proposed that analyses the response map without additional training. Experimental comparison demonstrates that the proposed response behaviour analysis module can be embedded into other response map- or score map-based trackers to improve tracking performance, making it a strategy applicable to many other trackers.
- The behaviour of real targets and distractors can be observed and recognised by analysing the dynamic pattern of the contours in the response map. This enables a simple, effective and dynamic analysis of the movement trend of the target and the surrounding distractors over time, predicting the potential impact the distractors have on the target object.
- The performance of the classic SiamFC can be significantly improved by adopting the response analysis model during tracking. This shows that for classical visual target tracking algorithms such as SiamFC, tracking performance can be improved substantially through well-designed but simple strategies, without necessarily reconstructing complex network structures or incurring long training periods.
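The contour-based behaviour analysis described above can be sketched as follows. The boundary test (cells of the super-level set with a neighbour below the level) and the distance definition are illustrative stand-ins; the paper's actual definitions are given by its Equations (3) and (5), which are not reproduced here.

```python
import numpy as np

def isohypse_points(resp: np.ndarray, level: float) -> np.ndarray:
    """Boundary cells of the super-level set {resp >= level}: a discrete
    stand-in for one isohypse contour of a normalised response map."""
    mask = resp >= level
    h, w = mask.shape
    pts = []
    for y in range(h):
        for x in range(w):
            if not mask[y, x]:
                continue
            # a cell lies on the contour if any 4-neighbour falls below the level
            nb = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            if any(not (0 <= yy < h and 0 <= xx < w) or not mask[yy, xx] for yy, xx in nb):
                pts.append((y, x))
    return np.array(pts, dtype=float)

def contour_distance(c1: np.ndarray, c2: np.ndarray) -> float:
    """Mean of the point-wise minimum Euclidean distances from contour c1 to c2."""
    d = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=-1)  # d[i, j]
    return float(d.min(axis=1).mean())
```

Tracking this distance over consecutive frames gives the movement trend: a shrinking distance between the target contour and a distractor contour signals an approaching distractor.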
2. Related Work
2.1. SNN-Based Trackers
2.2. Discriminative Object Representation and Improvement Solutions
3. Proposed Method
3.1. Siamese Trackers
3.2. Improved SiamFC Tracker Based on Response Behaviour Analysis
3.2.1. Response Isohypse Contour
3.2.2. Distractor Approaching Analysis
3.2.3. Object Centre Switching Strategy
3.2.4. Pseudo-Code of the Proposed Method
Algorithm 1: Proposed tracking method
Input: the video sequences (N is the total number of sequences)
for each frame do
    compute the response map
    compute the target position offset
    normalise the response map
    find the j isohypse contours
    if [...] then
        compute [...] (Equation (3))
        compute [...] (Equation (5))
        if [...] then
            [...]
        end if
        compute [...] (Equation (6))
        if [...] then
            [...]
        end if
    else
        use the position of the previous frame
    end if
end for
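One iteration of the loop in Algorithm 1 might look as follows in Python. The min-max normalisation is standard; the approaching test and the fallback rule are hypothetical placeholders for the conditions defined by Equations (3), (5) and (6), and `epsilon` stands for the distance threshold listed in the symbol table.

```python
import numpy as np
from collections import deque

def normalise(resp: np.ndarray) -> np.ndarray:
    """Min-max normalise the response map to [0, 1]."""
    lo, hi = resp.min(), resp.max()
    return (resp - lo) / (hi - lo) if hi > lo else np.zeros_like(resp)

def track_step(resp, prev_pos, dist_history: deque, epsilon: float):
    """One frame of the tracking loop: trust the response peak unless the
    recent mean contour distance indicates a distractor is too close."""
    r = normalise(resp)
    peak = np.unravel_index(np.argmax(r), r.shape)
    if dist_history and np.mean(dist_history) < epsilon:
        return prev_pos  # distractor approaching: keep the previous position
    return peak          # otherwise follow the response peak
```

A short `deque` of recent target-to-distractor contour distances plays the role of the "mean distance before the kth frame" from the symbol table.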
4. Experiments and Discussion
4.1. Datasets
4.2. Evaluation Metrics
4.3. Implementation Details
4.4. Performance Evaluation
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Source Code
Abbreviations
| Symbols and abbreviations | Full meaning |
| --- | --- |
|  | Target template patch |
|  | Search patch |
|  | Learnable parameter of Siamese trackers |
|  | Scalar offset value |
|  | Central position of the target |
|  | Response map denoting the similarity between the target template and the search patch |
|  | x coordinate of the ith point in the c1 contour of the kth frame |
|  | Distance between the ith point of contour c1 and the jth point of contour c2 in the kth frame |
|  | Mean distance before the kth frame |
|  | Response map of the ith frame |
|  | Normalised response map of the ith frame |
|  | The jth contour of the ith frame |
|  | Distance threshold |
|  | Angle between the two highest peaks |
| AO | Average overlap |
| SR | Success rate |
| IoU | Intersection over Union |
| FPS | Frames per second |
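The overlap metrics above (AO and SR) are built on IoU, which can be computed for axis-aligned boxes as in the sketch below; the `(x, y, w, h)` box layout is an assumption for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # intersection width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # intersection height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

AO is then the IoU averaged over all frames of a sequence, while SR is the fraction of frames whose IoU exceeds a chosen threshold.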
References
- Duer, S.; Bernatowicz, D.; Wrzesień, P.; Duer, R. The diagnostic system with an artificial neural network for identifying states in multi-valued logic of a device wind power. In Proceedings of the International Conference: Beyond Databases, Architectures and Structures, Poznan, Poland, 18–20 September 2018; pp. 442–454. [Google Scholar]
- Majewski, M.; Kacalak, W. Smart control of lifting devices using patterns and antipatterns. In Proceedings of the Computer Science Online Conference, Prague, Czech Republic, 26–29 April 2017; pp. 486–493. [Google Scholar]
- Duer, S.; Zajkowski, K.; Płocha, I.; Duer, R. Training of an artificial neural network in the diagnostic system of a technical object. Neural Comput. Appl. 2013, 22, 1581–1590. [Google Scholar] [CrossRef]
- Duer, S.; Zajkowski, K. Taking decisions in the expert intelligent system to support maintenance of a technical object on the basis information from an artificial neural network. Neural Comput. Appl. 2013, 23, 2185–2197. [Google Scholar] [CrossRef]
- Kacalak, W.; Majewski, M. New intelligent interactive automated systems for design of machine elements and assemblies. In Proceedings of the International Conference on Neural Information Processing, Doha, Qatar, 12–15 November 2012; pp. 115–122. [Google Scholar]
- Marvasti-Zadeh, S.M.; Cheng, L.; Ghanei-Yakhdan, H.; Kasaei, S. Deep Learning for Visual Tracking: A Comprehensive Survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 3943–3968. [Google Scholar] [CrossRef]
- Wang, L.; Ouyang, W.; Wang, X.; Lu, H. Visual tracking with fully convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 3119–3127. [Google Scholar]
- Nam, H.; Han, B. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4293–4302. [Google Scholar]
- Yang, T.; Chan, A.B. Recurrent filter learning for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 2010–2019. [Google Scholar]
- Ma, D.; Bu, W.; Wu, X. Multi-Scale Recurrent Tracking via Pyramid Recurrent Network and Optical Flow. In Proceedings of the BMVC, Newcastle, UK, 3–6 September 2018; p. 242. [Google Scholar]
- Song, Y.; Ma, C.; Wu, X.; Gong, L.; Bao, L.; Zuo, W.; Shen, C.; Lau, R.W.; Yang, M.-H. Vital: Visual tracking via adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8990–8999. [Google Scholar]
- Guo, J.; Xu, T.; Jiang, S.; Shen, Z. Generating reliable online adaptive templates for visual tracking. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 226–230. [Google Scholar]
- Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 4277–4286. [Google Scholar]
- Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese Box Adaptive Network for Visual Tracking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6667–6676. [Google Scholar]
- Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware Siamese Networks for Visual Object Tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
- Guo, Q.; Wei, F.; Zhou, C.; Rui, H.; Song, W. Learning Dynamic Siamese Network for Visual Object Tracking. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P. Fully-Convolutional Siamese Networks for Object Tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
- Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High Performance Visual Tracking with Siamese Region Proposal Network. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Valmadre, J.; Bertinetto, L.; Henriques, J.F.; Vedaldi, A.; Torr, P. End-to-End Representation Learning for Correlation Filter Based Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 702–715. [Google Scholar]
- Bertinetto, L.; Valmadre, J.; Golodetz, S.; Miksik, O.; Torr, P.H. Staple: Complementary learners for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1401–1409. [Google Scholar]
- Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “siamese” time delay neural network. Adv. Neural Inf. Processing Syst. 1993, 6, 737–744. [Google Scholar] [CrossRef]
- Tao, R.; Gavves, E.; Smeulders, A.W. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1420–1429. [Google Scholar]
- Zhu, Z.; Wu, W.; Zou, W.; Yan, J. End-to-end flow correlation tracking with spatial-temporal attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 548–557. [Google Scholar]
- Wang, Q.; Teng, Z.; Xing, J.; Gao, J.; Hu, W.; Maybank, S. Learning attentions: Residual attentional siamese network for high performance online visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4854–4863. [Google Scholar]
- Marvasti-Zadeh, S.M.; Khaghani, J.; Cheng, L.; Ghanei-Yakhdan, H.; Kasaei, S. CHASE: Robust Visual Tracking via Cell-Level Differentiable Neural Architecture Search. In Proceedings of the BMVC, Online, 22–25 November 2021. [Google Scholar]
- Marvasti-Zadeh, S.M.; Khaghani, J.; Ghanei-Yakhdan, H.; Kasaei, S.; Cheng, L. COMET: Context-aware IoU-guided network for small object tracking. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
- Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar]
- Mayer, C.; Danelljan, M.; Bhat, G.; Paul, M.; Paudel, D.P.; Yu, F.; Van Gool, L. Transforming model prediction for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 8731–8740. [Google Scholar]
- Li, X.; Ma, C.; Wu, B.; He, Z.; Yang, M.-H. Target-aware deep tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1369–1378. [Google Scholar]
- Yang, L.; Jiang, P.; Wang, F.; Wang, X. Region-based fully convolutional siamese networks for robust real-time visual tracking. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 2567–2571. [Google Scholar]
- Dai, K.; Wang, Y.; Yan, X. Long-term object tracking based on siamese network. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3640–3644. [Google Scholar]
- Voigtlaender, P.; Luiten, J.; Torr, P.H.; Leibe, B. Siam r-cnn: Visual tracking by re-detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6578–6588. [Google Scholar]
- Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning Discriminative Model Prediction for Tracking. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
- Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669. [Google Scholar]
- Wu, Y.; Lim, J.; Yang, M. Object Tracking Benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1834–1848. [Google Scholar] [CrossRef] [PubMed]
- Huang, L.; Zhao, X.; Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed]
- Miller, G.A. WordNet: A lexical database for English. Commun. ACM 1995, 38, 39–41. [Google Scholar] [CrossRef]
- Fan, H.; Bai, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Huang, M.; Liu, J.; Harshit; et al. LaSOT: A High-quality Large-scale Single Object Tracking Benchmark. Int. J. Comput. Vis. 2021, 129, 439–461. [Google Scholar] [CrossRef]
| Trackers | OTB100 Precision | OTB100 Success | GOT-10k AO | GOT-10k SR | LaSOT Precision | LaSOT Success | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DiMP-BRA | 0.904 | 68.6 | 0.705 | 0.819 | 0.651 | 0.673 | 15.1 |
| DiMP | 0.902 | 68.4 | 0.696 | 0.816 | 0.642 | 0.663 | 15.2 |
| DaSiamRPN | 0.88 | 65.9 | 0.444 | 0.53 | 0.605 | 0.615 | 134.4 |
| SiamFC-RBA | 0.85 | 62.6 | 0.517 | 0.584 | 0.420 | 0.382 | 42.7 |
| SiamRPN | 0.83 | 61.9 | 0.517 | 0.615 | 0.570 | 0.588 | 3.17 |
| SiamFC | 0.77 | 57.4 | 0.348 | 0.353 | 0.372 | 0.319 | 43.8 |
| CFNet | 0.76 | 57.4 | 0.261 | 0.243 | 0.312 | 0.258 | 2541 |
| Staple | 0.77 | 38.49 | 0.246 | 0.248 | 0.278 | 0.240 | 28.7 |
| CSK | 0.52 | 57.2 | 0.205 | 0.174 | 0.149 | 0.125 | 133.3 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Huang, X.; Cao, S.; Dong, C.; Song, T.; Xu, Z. Improved Fully Convolutional Siamese Networks for Visual Object Tracking Based on Response Behaviour Analysis. Sensors 2022, 22, 6550. https://doi.org/10.3390/s22176550