ParallelTracker: A Transformer Based Object Tracker for UAV Videos
Round 1
Reviewer 1 Report
In this paper, the authors propose an efficient object detection and tracking framework for remote sensing video data. The proposed method, namely ParallelTracker, is based on ViT and consists of three novel components, i.e., PEM, TPM, and TPKM. The proposed method is technically sound and the experiments have illustrated the effectiveness of ParallelTracker. Overall, the studied problem in this paper is important and interesting for related areas. However, after perusing the details of this paper, I have the following concerns:
Major concerns:
1. The motivation and contribution of this paper need to be highlighted. As the major task of this paper is UAV video tracking, the authors should focus on the characteristics of UAV video. Unfortunately, this is not clear enough for me. The challenges in UAV video and the differences between UAV videos and videos from other domains should be detailed. Moreover, the proposed method is also like a general framework, which can easily be developed in other domains. Then, why does this paper focus on UAV video tracking?
2. The writing of this paper needs to be improved largely. The details of the proposed method are not clear enough for readers. For example, in Section 3.1, how are template tokens downsampled by the patch embedding layers? What are the physical meanings of the target query t_q and the template tokens t_t? And the differences between online templates and templates are not detailed in this paper. Moreover, as the major concept in this paper, the Template Prior Knowledge is not clearly defined. That makes it really hard to understand the proposed method and its technical contributions.
3. In the experiment setting, the training details of this paper are not clear enough. How the open domain datasets such as TrackingNet [49], LaSOT [50], GOT-10k [51], and COCO [52] are used should be detailed. And the experiments are not sufficient. The sensitivity of the major parameter λ in Equation (4) should be discussed in experiments rather than the values given directly.
I hope the authors can take full consideration of these concerns. Besides, there are still several minor concerns as follows:
Minor concerns:
1. The citation and references are not standardized. For example, the proposed ParallelTracker is based on ViT [42]. But in the full paper, reference [42] of ViT is not cited at all.
2. The abbreviation is not well-addressed. For example, in the Abstract, "a Prior Knowledge Extractor Module (PEM)" should be "a Prior knowledge Extractor Module (PEM)", and "a Template Features Parallel Enhancing Module (TPM)" should be "a Template features Parallel enhancing Module (TPM)".
In summary, I think this paper studies an important problem and experiments illustrate good results of the proposed method with a comparison to several state-of-the-art methods. However, I still have some major concerns which I hope the authors can fully consider and improve the manuscript.
Author Response
Reviewer #1
Comment 1:
The motivation and contribution of this paper need to be highlighted. As the major task of this paper is UAV video tracking, the authors should focus on the characteristics of UAV video. Unfortunately, this is not clear enough for me. The challenges in UAV video and the differences between UAV videos and videos from other domains should be detailed. Moreover, the proposed method is also like a general framework, which can easily be developed in other domains. Then, why does this paper focus on UAV video tracking?
Response:
Thank you. Challenges in UAV video tracking can be summarized as follows:
Firstly, UAV videos are prone to motion blur, occlusion, and background clutter due to the rapid movement of the drone. The drone's six-degrees-of-freedom (6-DOF) movement and altitude variation further complicate the task by causing objects to appear at different scales. As a result, extracting stable features from UAV videos is considerably more difficult than from videos in other fields. Furthermore, prevalent ViT-based tracking algorithms exhibit reduced accuracy when operating under such challenging conditions.
Secondly, object tracking in UAV videos requires rapid and accurate localization of moving objects, making real-time performance a critical requirement. Additionally, it is necessary to reduce the number of parameters and the convergence difficulty of the tracker since it needs to be deployed on the UAV platform. However, existing ViT-based trackers often suffer from poor convergence and have a large number of parameters, which hinders their deployment on UAV platforms.
Therefore, the contribution of our ParallelTracker is mainly aimed at the above two aspects:
(1) Enhancing the accuracy of ViT-based tracking algorithms for UAV video by addressing the challenges of motion blur, camera motion, and occlusion.
(2) Improving the convergence of ViT-based object tracking algorithms by reducing the number of parameters and facilitating their training and implementation on UAV platforms.
Specifically, our proposed solution to these challenges involves three modules. Firstly, the Prior knowledge Extractor Module (PEM) extracts local spatial information from the template image, imbuing the Vision Transformer with inductive biases similar to those of CNNs, such as locality and spatial invariance. This approach helps alleviate the computational burden during training. Secondly, the Template features Parallel enhancing Module (TPM), a parallel attention module, extracts object-oriented discriminative features that can overcome occlusion, motion blur, and background clutter by leveraging the prior knowledge of template images obtained from PEM. Finally, the Template Prior Knowledge Merge Module (TPKM) extracts features and facilitates information exchange between templates and search regions. This module further enhances the discriminative power of the model, leading to better tracking performance in challenging UAV video sequences.
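To make the parallel enhancement idea more concrete, the sketch below shows one generic way two cross-attention branches can enrich template tokens and prior tokens side by side. This is only an illustration under our own assumptions (token sizes, residual connections, and module names are ours); it is not the authors' TPM implementation, which is defined in the paper.

```python
# Illustrative sketch only: a generic "parallel attention" pattern in which
# template tokens and prior tokens enhance each other through two cross-attention
# branches run side by side. Shapes and names are our assumptions.
import torch
import torch.nn as nn

class ParallelCrossAttention(nn.Module):
    def __init__(self, dim=96, num_heads=4):
        super().__init__()
        self.attn_t2p = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_p2t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_p = nn.LayerNorm(dim)

    def forward(self, template_tokens, prior_tokens):
        # Branch 1: template tokens query the prior tokens.
        t_enh, _ = self.attn_t2p(template_tokens, prior_tokens, prior_tokens)
        # Branch 2 (in parallel): prior tokens query the template tokens.
        p_enh, _ = self.attn_p2t(prior_tokens, template_tokens, template_tokens)
        # Residual connections keep the original token content.
        return self.norm_t(template_tokens + t_enh), self.norm_p(prior_tokens + p_enh)

tpl = torch.randn(2, 1024, 96)   # template tokens (T, N, C); sizes are illustrative
pri = torch.randn(2, 256, 96)    # pooled prior tokens (T, N/4, C); sizes are illustrative
enhanced_tpl, enhanced_pri = ParallelCrossAttention()(tpl, pri)
```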
Comment 2:
The writing of this paper needs to be improved largely. The details of the proposed method are not clear enough for readers. For example, in Section 3.1, how are template tokens downsampled by the patch embedding layers? What are the physical meanings of the target query t_q and the template tokens t_t? And the differences between online templates and templates are not detailed in this paper. Moreover, as the major concept in this paper, the Template Prior Knowledge is not clearly defined. That makes it really hard to understand the proposed method and its technical contributions.
Response:
Thank you. (1) The patch embedding layer is composed of convolutional layers that increase the channel resolution while reducing the spatial resolution. The specific operation of the patch embedding layers is as follows: given T templates (i.e., the first template and T − 1 online templates) with the size of T × ht × wt × 3, we first map them into patch embeddings using a convolutional layer with stride 4 and kernel size 7. Then, we flatten the patch embeddings, resulting in a token sequence with the size of T × ht/4 × wt/4 × C, where C is the number of channels and is set to 96, and ht and wt are the height and width of the template and are set to 128. (An illustrative code sketch of this step is given after point (4) below.)
(2) The target query is initialized with a random vector and upsampled through bicubic interpolation to match the shape of the template tokens. The template tokens, in turn, are the tensors obtained by feeding the template images into the patch embedding layers.
(3) The templates used in this study are image patches of size 128 × 128, cropped from the initial frames of the videos. The online templates are image patches of the same size, but they are cropped from video frames selected by the PSH algorithm. During tracking, two scenarios need to be considered: in the initial frames, the online templates are identical to the templates, whereas in subsequent frames the online templates are dynamically selected from the current frames by the PSH.
(4) We propose PEM to extract local spatial information from the template image, boosting the capacity of the tracker while mitigating the computational burden during training. The utilization of local spatial information in the Vision Transformer enables the model to capture specific local features of the object. Additionally, this approach imbues the model with inductive biases similar to those of CNNs, such as locality and spatial invariance. Prior knowledge is thus defined as the local spatial features of the target, which inform the model's understanding of the target's spatial context.
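For concreteness, here is a minimal, self-contained sketch of the patch embedding and target-query initialization described in points (1) and (2). It is our own PyTorch illustration under the stated settings (stride 4, kernel size 7, C = 96, 128 × 128 templates); the padding, the exact size of the initial random query, and the variable names are our assumptions, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbed(nn.Module):
    """Convolutional patch embedding: 3 -> 96 channels, spatial size divided by 4."""
    def __init__(self, in_ch=3, embed_dim=96):
        super().__init__()
        # Padding 3 is our assumption; it keeps the output exactly h_t/4 x w_t/4.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=7, stride=4, padding=3)

    def forward(self, x):                       # x: (T, 3, h_t, w_t)
        x = self.proj(x)                        # (T, 96, h_t/4, w_t/4)
        return x.flatten(2).transpose(1, 2)     # token sequence: (T, h_t/4 * w_t/4, 96)

T, h_t, w_t, C = 2, 128, 128, 96
templates = torch.randn(T, 3, h_t, w_t)         # first template + T - 1 online templates
template_tokens = PatchEmbed()(templates)       # (2, 1024, 96), i.e. a 32 x 32 token grid

# Target query: a random vector upsampled by bicubic interpolation to the
# spatial shape of the template tokens (the 1 x 1 starting size is our guess).
target_query = torch.randn(1, C, 1, 1)
target_query = F.interpolate(target_query, size=(h_t // 4, w_t // 4),
                             mode="bicubic", align_corners=False)
target_query = target_query.flatten(2).transpose(1, 2)  # (1, 1024, 96)
```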
Comment 3:
In the experiment setting, the training details of this paper are not clear enough. How the open domain datasets such as TrackingNet [49], LaSOT [50], GOT-10k [51], and COCO [52] are used should be detailed. And the experiments are not sufficient. The sensitivity of the major parameter λ in Equation (4) should be discussed in experiments rather than the values given directly.
Response:
Thank you. The training dataset comprises the train splits of several benchmark datasets, including LaSOT [50], GOT-10k [51], COCO [52], and TrackingNet [49]. More specifically, the training dataset includes 118,287 images from COCO2017, 10,000 video sequences from GOT-10k, 1,120 video sequences from LaSOT, and 10,214 video sequences from the TrackingNet dataset. The sizes of the search and template image patches are 320 × 320 pixels and 128 × 128 pixels respectively, corresponding to 25 and 4 times the bounding box area. Data augmentations include horizontal flip and brightness jittering.
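As a concrete illustration of how these patch sizes relate to the stated area factors, the following hedged sketch crops a square region covering k times the bounding-box area around the target and resizes it to the network input (k = 25 for the 320 × 320 search region, k = 4 for the 128 × 128 template). The padding strategy and the function name are our assumptions, not the authors' released training code.

```python
# Sketch (our assumption) of area-factor cropping: a square covering
# area_factor * (w * h) around the target, resized to out_size x out_size.
import math
import cv2
import numpy as np

def crop_target_region(image, bbox, area_factor, out_size):
    """image: H x W x 3 array, bbox: (x, y, w, h) in pixels."""
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0
    side = int(math.ceil(math.sqrt(area_factor * w * h)))  # square with k times the box area
    pad = side  # pad generously so crops near the image border keep their full size
    mean_color = image.mean(axis=(0, 1)).tolist()
    padded = cv2.copyMakeBorder(image, pad, pad, pad, pad,
                                cv2.BORDER_CONSTANT, value=mean_color)
    x1 = int(round(cx - side / 2.0)) + pad
    y1 = int(round(cy - side / 2.0)) + pad
    crop = padded[y1:y1 + side, x1:x1 + side]
    return cv2.resize(crop, (out_size, out_size))

frame = np.random.randint(0, 255, (720, 1280, 3), dtype=np.uint8)
bbox = (600, 300, 60, 40)                                                # hypothetical target box
search = crop_target_region(frame, bbox, area_factor=25, out_size=320)   # 320 x 320 search patch
template = crop_target_region(frame, bbox, area_factor=4, out_size=128)  # 128 x 128 template patch
```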
We adopt the λ weights used in DETR [43] for Equation (4), and the loss described by Equation (4) follows its usage in DETR. To evaluate the effect of these parameters on our tracker, we conducted sensitivity experiments on the UAV123 dataset, as detailed in Table 1. Since Equation (4) is only active during Stage 1, we exclusively employ this stage of ParallelTracker for testing. Proportional variations in these two parameters have little impact on accuracy. The experimental results have been added to Section 4.3.5.
Table 1. Sensitivity analysis for λL1 and λGIoU.

|  | λL1 | λGIoU | AUC |
| --- | --- | --- | --- |
| ParallelTracker-stage1 | 5 | 2 | 68.42 |
| ParallelTracker-stage1 | 2 | 5 | 68.27 |
| ParallelTracker-stage1 | 1 | 1 | 68.38 |
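For readers who want to reproduce the weighting, the sketch below shows a DETR-style box loss of the form L = λGIoU · LGIoU + λL1 · LL1, with the default weights set to the first row of Table 1 (λL1 = 5, λGIoU = 2, i.e. the DETR values). The generalized IoU implementation is our own illustrative version, not the paper's code.

```python
# Illustrative DETR-style box loss: L = lambda_giou * L_giou + lambda_l1 * L_l1.
# Boxes are (x1, y1, x2, y2), normalized to [0, 1]; defaults follow Table 1, row 1.
import torch

def generalized_iou(pred, gt):
    # Intersection of predicted and ground-truth boxes
    lt = torch.max(pred[:, :2], gt[:, :2])
    rb = torch.min(pred[:, 2:], gt[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_p + area_g - inter
    iou = inter / union.clamp(min=1e-6)
    # Smallest enclosing box for the GIoU penalty term
    lt_c = torch.min(pred[:, :2], gt[:, :2])
    rb_c = torch.max(pred[:, 2:], gt[:, 2:])
    wh_c = (rb_c - lt_c).clamp(min=0)
    area_c = (wh_c[:, 0] * wh_c[:, 1]).clamp(min=1e-6)
    return iou - (area_c - union) / area_c

def box_loss(pred, gt, lambda_l1=5.0, lambda_giou=2.0):
    l_giou = (1.0 - generalized_iou(pred, gt)).mean()
    l_l1 = torch.abs(pred - gt).sum(dim=1).mean()
    return lambda_giou * l_giou + lambda_l1 * l_l1

pred = torch.tensor([[0.10, 0.10, 0.50, 0.50]])
gt = torch.tensor([[0.15, 0.12, 0.55, 0.52]])
print(box_loss(pred, gt))
```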
Comment 4:
The citation and references are not standardized. For example, the proposed ParallelTracker is based on ViT [42]. But in the full paper, reference [42] of ViT is not cited at all.
Response:
Thank you. We have added this reference in the Introduction (Line 46).
Comment 5:
The abbreviation is not well-addressed. For example, in the Abstract, "a Prior Knowledge Extractor Module (PEM)" should be "a Prior knowledge Extractor Module (PEM)", and "a Template Features Parallel Enhancing Module (TPM)" should be "a Template features Parallel enhancing Module (TPM)".
Response:
Thank you. We have fixed it.
Reviewer 2 Report
Detecting objects and tracking their position in visual data are vital tasks in remote sensing. This paper presents a methodology for object detection and tracking in remotely sensed video data.
In this regard, modern transformer-based approaches usually require extensive training time to converge and do not fully utilize the information from templates in tracking. Addressing this drawback to accelerate convergence and further improve tracking accuracy, the authors propose the ParallelTracker that extracts prior knowledge from templates for better convergence and enriches template features in a parallel manner.
The proposed approach combines three central components: 1) a Prior knowledge Extractor Module (PEM) to extract spatial information from the template image; 2) a Template features Parallel enhancing Module (TPM), a parallel attention module for capturing more target-oriented features and enhancing feature extraction; and 3) a simple location prediction head to complete the tracking. Combining these three components contributes to the convergence of the tracker and to the balance between performance and speed of the tracking process.
The work addresses an important topic in the field. It is mainly well-written and clear (except for some parts I recommend supporting with visual examples, e.g., the features computed in equations 1-3). Therefore, I recommend considering the publication of the paper in Remote Sensing after addressing the following comments:
· Line 130: "Firstly, we obtain downsampled template tokens by using a fully convolutional patch embedding layer. Next, these tokens are fed into an average pooling layer, which reduces their resolution while preserving vital information". Since the prior spatial information extraction is one of the main contributions of the proposed approach, I strongly recommend adding a real example of this stage.
· In the same section: what are the parameters of the downsampling? Are they automatically extracted according to the input video's spatial resolution, or should one re-train the network for different sensors and scene configurations?
· In Figure 2, tile (b)-TPM: the left side converts prior tokens to prior tokens. What is modified in the prior token during this process? If you use the tokens from the output to feed the next step, you should indicate that in the term you use for the output tokens to prevent confusion, i.e., don't use the same term ("prior tokens") as in the input.
· I assume the templates' linearization includes all three bands of an RGB image. If so, is the system adaptable for data with more/less number of bands?
· How does the proposed method perform under varying illumination effects, for example, changing from clear to a cloudy sky or overcast?
· The conclusion is concise and not informative.
Good luck
Author Response
Comment 1:
Line 130: "Firstly, we obtain downsampled template tokens by using a fully convolutional patch embedding layer. Next, these tokens are fed into an average pooling layer, which reduces their resolution while preserving vital information". Since the prior spatial information extraction is one of the main contributions of the proposed approach, I strongly recommend adding a real example of this stage.
Response:
Thank you. We obtained quantitative results by comparing our proposed ParallelTracker, which includes the PEM module, with a version of ParallelTracker without PEM; see experiments #1 and #4 in Table 3 of Section 4.3 in the original manuscript. At the same time, we visualized a real case to address this concern; please refer to Figure 2 for details. The results show that after removing the PEM, our tracker's performance drops significantly, which is consistent with the results of the ablation study in Section 4.3.1.
Figure 2. Visualization results of PEM ablation study. ParallelTracker without PEM loses tracking in Frame #50.
Comment 2:
In the same section: what are the parameters of the downsampling? Are they automatically extracted according to the input video's spatial resolution, or should one re-train the network for different sensors and scene configurations?
Response:
Thank you. We have now provided a detailed explanation of the patch embedding layer in the text. The input size is fixed, and if a new image with a different size is used as input, it should be resized to the same size before being fed into our network (as is the case for almost all transformer-based networks). If resizing the input images is to be avoided, it would be advisable to retrain the network.
Comment 3:
In Figure 2, tile (b)-TPM: the left side converts prior tokens to prior tokens. What is modified in the prior token during this process? If you use the tokens from the output to feed the next step, you should indicate that in the term you use for the output tokens to prevent confusion, i.e., don't use the same term ("prior tokens") as in the input.
Response:
Thank you. We now have modified it from “prior tokens” to “enhanced prior tokens”.
Comment 4:
I assume the templates' linearization includes all three bands of an RGB image. If so, is the system adaptable for data with more/less number of bands?
Response:
Thank you. Yes, the linearization in the patch embedding layer can process any number of bands of images, just like the convolution layer in a CNN.
Comment 5:
How does the proposed method perform under varying illumination effects, for example, changing from clear to a cloudy sky or overcast?
Response:
Thank you. Our method is robust to changes in illumination. As shown in Figure 4, our algorithm performs well in such cases. The CNN-based tracker (DiMP) also handles this situation well.
Figure 4. Visualization under varying illumination
Comment 6:
The conclusion is concise and not informative.
Response:
Thank you. We have modified the conclusion to include more information. Please see the text.
We have developed a new end-to-end tracking algorithm, called ParallelTracker, which incorporates prior knowledge and parallel attention mechanisms to integrate image priors with the feature extraction and interaction processes. Specifically, the PEM module is designed to address the lack of spatial prior information in the ViT-based tracker and to enhance efficiency, the TPM module leverages the prior information in the template image to extract object-oriented and discriminative features in the search frames, and the TPKM module facilitates information exchange between templates and search regions. In addition, PSH incorporates an object template update strategy to enhance the tracker's ability to accommodate changes in object shape and occlusions.
Experimental results demonstrate that ParallelTracker outperforms state-of-the-art algorithms on UAV videos. ParallelTracker also maintains accuracy comparable to the latest methods in close-range video tracking scenarios, showing its strong generalization ability. Moreover, ParallelTracker significantly reduces the number of epochs that popular methods require to converge, which leads to a noteworthy reduction in the convergence difficulty of the ViT-based tracker.
Reviewer 3 Report
In this work, the authors tackle the problem of visual object tracking (VOT) in videos shot by unmanned aerial vehicles (UAVs). They propose a simple vision transformer (ViT) tracking system, as well as a number of techniques incorporating prior knowledge and parallel attention mechanisms, to support proper feature extraction and information interaction, and are thus able to reach effective convergence and improved performance. The introduction and the related work sections present the problem and showcase the latest developments in the literature to solve it. This eases the reader into the topic and paves the way for a better understanding of what this manuscript has to offer. The proposed scheme is then well described, aided with appropriate figures and mathematical formulation. Next, the experiments, computations and their results are described clearly. This section exhibits its strengths not only through the achieved metric values, but rather through the comparison with a large number of counterpart algorithms from the literature. The proposed scheme exhibits comparable or superior performance, both in quantitative and qualitative analyses. The conclusions section is relatively very short. Furthermore, it is titled "Conclusions and Future Work"; however, no suggestions for future work are provided. This section needs to be revised. The references are adequate and relatively recent.
Overall, this manuscript is very well written. However, a proof-reading of the manuscript would improve its presentation. The following is a non-exhaustive list of language mistakes that should be corrected:
1. Line 51.
2. The caption of Table 2.
3. The title of section 5 needs to be updated.
Author Response
Comment 1:
The following is a non-exhaustive list of language mistakes that should be corrected:
- Line 51.
- The caption of Table 2.
- The title of section 5 needs to be updated.
Response:
Thank you. Line 5 was changed to "the information from templates is not fully utilized in and integrated into tracking".
Line 51 was changed to "The current algorithms used for tracking unmanned aerial vehicles (UAVs) face several challenges, as highlighted in previous studies [13,55]".
The caption of Table 2 was changed to "Comparison on LaSOT [50] and UAV123 [53] datasets. The best two results are shown in red and blue fonts.".
The caption of Table 5 changed to "Ablation for TPKM".
The caption of Table 6 changed to "Ablation for PSH".
The title of section 5 was updated to "Conclusions".
Round 2
Reviewer 2 Report
The authors have addressed all my comments. Good luck