Article
Peer-Review Record

Adaptive and Anti-Drift Motion Constraints for Object Tracking in Satellite Videos

Remote Sens. 2024, 16(8), 1347; https://doi.org/10.3390/rs16081347
by Junyu Fan and Shunping Ji *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4:
Submission received: 20 February 2024 / Revised: 6 April 2024 / Accepted: 8 April 2024 / Published: 11 April 2024
(This article belongs to the Section Earth Observation Data)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposed an improved DiMP tracker for object tracking in satellite videos. The main contributions are 1) a novel bounding box regression branch to improve tracker performance, 2) an LSTM-based trajectory prediction module, and 3) an anti-drift module to improve tracking performance.

Strength:
Overall, this paper looks good to me. The paper is well-written and easy to follow. The proposed methods are technically sound. The experiments demonstrate the effectiveness of the proposed method.

Weakness:
- The ADM is claimed as a major contribution. However, the ablation study does not solely examine this part. It would be useful to further evaluate the effectiveness of the proposed ADM. This also holds for TDEM. In other words, it is better to ablate ADM and TDEM separately instead of ablating the whole MCB.
- It is better to put section 4.2 (ablation study) after section 4.3 (Results and analysis) for better readability.
- Some other tracking methods can be discussed in the related literature such as,
Wilson, D., Alshaabi, T., Van Oort, C., Zhang, X., Nelson, J., & Wshah, S. (2022). Object Tracking and Geo-Localization from Street Images. Remote Sensing, 14(11), 2575.

Minor comments:
- Fig. 2 temple patch => template patch. Also, the fonts are too small in Fig. 2
- Fig. 1 More captions are needed to describe the proposed method.

Comments on the Quality of English Language

No obvious grammar errors noted

Author Response

Reviewer 1:

 

Comments and Suggestions for Authors:

This paper proposed an improved DiMP tracker for object tracking in satellite videos. The main contributions are 1) a novel bounding box regression branch to improve tracker performance, 2) an LSTM-based trajectory prediction module, and 3) an anti-drift module to improve tracking performance.

 

Strength:

Overall, this paper looks good to me. The paper is well-written and easy to follow. The proposed methods are technically sound. The experiments demonstrate the effectiveness of the proposed method.

Answer: Dear Reviewer#1, thank you for your positive comments.

 

Weakness:

1.1 The ADM is claimed as a major contribution. However, the ablation study does not solely examine this part. It would be useful to further evaluate the effectiveness of the proposed ADM. This also holds for TDEM. In other words, it is better to ablate ADM and TDEM separately instead of ablating the whole MCB.

Answer: Dear Reviewer, following your suggestion, we designed new ablation experiments for the internal components of MCB, ADM and TDEM, to assess their contributions independently. We have added detailed descriptions of these ablation experiments in Section 4.2.3. The new content demonstrates the effectiveness of our proposed method.

 

Section 4.2.3 added in the manuscript:

The proposed MCB consists of two main components: TDEM and ADM. TDEM utilizes historical trajectory information to estimate trajectory distribution, while ADM utilizes TDEM to detect tracking drift and implement motion constraints and compensation during the tracking process. The effectiveness of TDEM and ADM is demonstrated through ablation experiments conducted on the SatSOT dataset. Without ADM, we utilize peak responses in the response map outputted by TCB to evaluate the current tracking state. When the value of peak response falls below a given threshold, indicating tracking uncertainty, TDEM is employed for motion compensation. Results in Table 4 indicate that the improvement of tracking performance is influenced by the specified threshold. The best performance is achieved when the threshold is set to 0.3. Furthermore, the combination of ADM and TDEM brings a more significant improvement in tracker performance compared to introducing TDEM alone, further confirming the effectiveness of MCB.

 

Table 4. Ablation study of TDEM and ADM in the MCB on SatSOT, where "/0.1", "/0.2", "/0.3", and "/0.4" denote setting the response threshold to 0.1, 0.2, 0.3, and 0.4 respectively, with the best results highlighted in red.

| Tracker           | TDEM | ADM | Prec.(%) | Succ.(%) |
|-------------------|------|-----|----------|----------|
| TCB+AERB          | -    | -   | 61.6     | 46.2     |
| TCB+AERB+TDEM/0.1 | ✓    | -   | 60.9     | 45.6     |
| TCB+AERB+TDEM/0.2 | ✓    | -   | 61.3     | 45.9     |
| TCB+AERB+TDEM/0.3 | ✓    | -   | 63.5     | 47.2     |
| TCB+AERB+TDEM/0.4 | ✓    | -   | 62.2     | 46.3     |
| TCB+AERB+MCB      | ✓    | ✓   | 66.3     | 49.0     |
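The TDEM-only fallback described in Section 4.2.3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_position`, its parameters, and the (x, y) position representation are assumptions introduced here. Only the rule itself (fall back to the TDEM estimate when the peak response drops below the threshold) and the default of 0.3 come from the text above.

```python
from typing import Tuple

Point = Tuple[float, float]


def select_position(peak_response: float,
                    observed_xy: Point,
                    predicted_xy: Point,
                    threshold: float = 0.3) -> Point:
    """Pick the target position for the current frame.

    peak_response: peak value of the response map (hypothetical input,
        standing in for the TCB output described in the text).
    observed_xy:   position taken from the response-map peak.
    predicted_xy:  position estimated from the trajectory history
        (standing in for the TDEM output).
    threshold:     response threshold; 0.3 gave the best result in Table 4.
    """
    # A peak response below the threshold is treated as tracking
    # uncertainty, so the motion-based estimate replaces the observation.
    if peak_response < threshold:
        return predicted_xy
    return observed_xy


# Low-confidence frame: the trajectory estimate is used.
low = select_position(0.25, (10.0, 10.0), (12.0, 11.0))
# High-confidence frame: the observation is kept.
high = select_position(0.80, (10.0, 10.0), (12.0, 11.0))
```

A single fixed threshold makes the result sensitive to its value, which is consistent with the spread across the "/0.1" to "/0.4" rows in Table 4.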

 

 

1.2 It is better to put section 4.2 (ablation study) after section 4.3 (Results and analysis) for better readability.

Answer: Thank you. We have carefully considered your suggestion. While we greatly appreciate your valuable opinion, we believe that maintaining the current section order is the better choice. Section 4.3 provides a comprehensive analysis of the tracking algorithm's performance and discusses future work directions, which leads logically into the subsequent summary section. Swapping Sections 4.2 and 4.3 might disrupt this logical order and confuse readers about the paper's structure and content. Therefore, we hope to keep the current section order. If you have any other suggestions on this matter, we are very willing to listen and make appropriate adjustments.

 

 

1.3 Some other tracking methods can be discussed in the related literature such as,

Wilson, D., Alshaabi, T., Van Oort, C., Zhang, X., Nelson, J., & Wshah, S. (2022). Object Tracking and Geo-Localization from Street Images. Remote Sensing, 14(11), 2575.

Answer: Thank you for recommending the research by Wilson et al. (2022) on Geo-Localization from street images. We find this study relevant to our work, particularly in its application to the field of autonomous driving. We have cited this paper in the introduction section to further enrich our research background. The revised part of the introduction section:

Single object tracking, predicting the dynamic state of targets based on initial video frame cues, is fundamental research for applications such as visual surveillance [1], human-computer interaction [2], and autonomous driving [3,4].

 

[4] Wilson, D.; Alshaabi, T.; Van Oort, C.; Zhang, X.; Nelson, J.; Wshah, S. Object Tracking and Geo-Localization from Street Images. Remote Sens. 2022, 14, 2575. https://doi.org/10.3390/rs14112575.

 

 

Minor comments:

1.4 Fig. 2 temple patch => template patch. Also, the fonts are too small in Fig. 2

Answer: Thank you. We have corrected the spelling error in Figure 2, changing "temple patch" to "template patch." Additionally, we have adjusted the font size in Figure 2 to improve readability and ensure compliance with the journal's requirements. The revised caption of Figure 2:

Figure 2. The architecture of AERB. Using the template patch and the test patch as input, the branch introduces spatial attention by considering interactions between patches. HG and SE networks are utilized to enhance corner localization perception in both spatial and channel dimensions.

 

 

1.5 Fig. 1 More captions are needed to describe the proposed method.

Answer: Thank you. We have added more detailed captions to Figure 1 to describe our proposed method more clearly. This will help readers better understand our work.

 

The revised caption of Figure 1:

Figure 1. The architecture of the proposed tracker. TCB takes the output of the feature extraction network as input and combines AERB to observe the target state in the current frame. In MCB, TDEM uses the historical motion information of the target to estimate the trajectory distribution, and ADM detects the tracking state and introduces motion constraints into the tracking process according to the trajectory distribution estimated by TDEM.

Reviewer 2 Report

Comments and Suggestions for Authors

1. It is suggested to move the first paragraph of Section 1 to a better position.

2. Lines 46-47: why would a top-down viewing angle degrade tracking? In autonomous driving, which the authors mention, objects are also occluded by others.

3. In Table 1, what is the rotation axis for ROT: perpendicular to the surface of the Earth, or something else?

4. It is suggested to open the source code.

Author Response

Reviewer 2:

 

Comments and Suggestions for Authors:

2.1 It is suggested to put the first paragraph of section 1 to a better position.

Answer: Dear Reviewer#2, thank you for your positive comments and detailed suggestions, which have greatly helped to improve our manuscript. We have made the following changes: (1) The content on the development of satellite video tracking technology has been moved to the beginning of the section to better highlight the background and significance of our research topic. (2) The introduction of object tracking algorithm classifications has been moved to the third paragraph, before the discussion of the advantages and disadvantages of mainstream tracking algorithms in dealing with the challenges of satellite video. (3) Local revisions have been made throughout Section 1 to improve the fluency of the writing.

 

 

2.2 Lines 46-47: why would a top-down viewing angle degrade tracking? In autonomous driving, which the authors mention, objects are also occluded by others.

Answer: Thank you. We would like to clarify that this is a characteristic specific to video satellite Earth observation. Top-down-view video satellites capture large images from a very great distance to the moving targets. As a result, moving objects in satellite videos typically appear small and at low resolution, so they lack distinguishable features for target recognition. This differs from the scenario in autonomous driving.

 

 

2.3 In Table 1, what is the rotation axis for ROT: perpendicular to the surface of the Earth, or something else?

Answer: Thank you. We confirm that the rotation axis is perpendicular to the surface of the Earth.

 

 

2.4 It is suggested to open the source code.

Answer: Thank you. We will open-source the code soon at http://gpcv.whu.edu.cn/data/.

 

 

Reviewer 3 Report

Comments and Suggestions for Authors

The proposed work addresses satellite videos in general but addresses a particular class of them. Specifically, it appears to address only videos which observe targets on or close to the Earth (i.e., aeroplanes or similar types of fast-moving vehicles). There is a class of videos that is not addressed. This class regards videos in space, in which non-cooperative targets are tracked. That is, satellite videos in which old satellites (or other types of space garbage) are tracked; such works occupy a significant part of the current literature in satellite videos. The proposed work does not seem to address these cases. This is fine, but it should be clarified in the introduction or title. The reason is that these two classes require different treatments (i.e. satellite tracking videos often require the consideration of Earth and satellite motion). As such, it is not clear if the proposed method can be generalised to these cases. Thus, my recommendation is either to clarify the domain of application of the proposed method, or present experiments that demonstrate the ability of the method to generalise to these cases.

 

 

 

The work in [52] for tracking evaluation regards only one tracked object. It is not clear if the proposed method copes with multiple tracking targets. My recommendation is that the paper should clarify whether it regards single or multiple-object tracking. If it deals with single object tracking, it should explicitly evaluate cases where multiple moving objects appear in the scene and occlude the target. If it deals with multiple object tracking, it should employ a metric that evaluates the tracking of multiple objects, specifically for cases where occlusions may lead to identity (ID) switches, using a more relevant metric, such as one of the following:

- Luiten, J., Os̆ep, A., Dendorfer, P. et al. HOTA: A Higher Order Metric for Evaluating Multi-object Tracking. Int J Comput Vis 129, 548–578 (2021). https://doi.org/10.1007/s11263-020-01375-2

- Mohammadjavad Abbaspour, Mohammad Ali Masnadi-Shirazi, Online multi-object tracking with δ-GLMB filter based on occlusion and identity switch handling, Image and Vision Computing, Volume 127, 2022, 104553, ISSN 0262-8856, https://doi.org/10.1016/j.imavis.2022.104553.

- Keni Bernardin, Alexander Elbs, R. Stiefelhagen, Multiple Object Tracking Performance Metrics and Evaluation in a Smart Room Environment, EURASIP Journal on Image and Video Processing, 2008(1), DOI: 10.1155/2008/246309

 

Author Response

Reviewer3:

 

Comments and Suggestions for Authors:

3.1 The proposed work addresses satellite videos in general but addresses a particular class of them. Specifically, it appears to address only videos which observe targets on or close to the Earth (i.e., aeroplanes or similar types of fast-moving vehicles). There is a class of videos that is not addressed. This class regards videos in space, in which non-cooperative targets are tracked. That is, satellite videos in which old satellites (or other types of space garbage) are tracked; such works occupy a significant part of the current literature in satellite videos. The proposed work does not seem to address these cases. This is fine, but it should be clarified in the introduction or title. The reason is that these two classes require different treatments (i.e. satellite tracking videos often require the consideration of Earth and satellite motion). As such, it is not clear if the proposed method can be generalised to these cases. Thus, my recommendation is either to clarify the domain of application of the proposed method, or present experiments that demonstrate the ability of the method to generalise to these cases.

Answer: Dear Reviewer#3, thank you for your valuable comments and suggestions. We have further elaborated on the research scope of this article. In the revised manuscript, we have clarified the domain of application in the first paragraph of the introduction to emphasize that our research primarily deals with satellite videos obtained from staring video satellite technology, which captures data on or near the Earth's surface. The moving targets in these videos are mainly cars, airplanes, and ships. We acknowledge that the proposed method may require adaptations to generalize to other classes of satellite videos, such as those tracking non-cooperative targets in space.

 

The revised part of the introduction section:

Single object tracking, predicting the dynamic state of targets based on initial video frame cues, is fundamental research for applications such as visual surveillance [1], human-computer interaction [2], and autonomous driving [3,4]. In the Earth observation field, remarkable advancements in video satellite technology [5,6] have been witnessed in recent years. By employing a stare observation approach [7], video satellites are capable of continuously observing specific regions, providing valuable video data of the Earth's surface. This development has facilitated the emergence of a new task: tracking objects of interest in satellite videos. This task enables real-time monitoring and tracking of various objects of interest, such as vehicles, aircraft, ships, and trains, over a broad region of the Earth's surface, leading to a wide range of substantial applications including traffic and environmental monitoring [8], military reconnaissance [9], and disaster management [10].

 

 

3.2 The work in [52] for tracking evaluation regards only one tracked object. It is not clear if the proposed method copes with multiple tracking targets. My recommendation is that the paper should clarify whether it regards single or multiple-object tracking. If it deals with single object tracking, it should explicitly evaluate cases where multiple moving objects appear in the scene and occlude the target. If it deals with multiple object tracking, it should employ a metric that evaluates the tracking of multiple objects, specifically for cases where occlusions may lead to identity (ID) switches, using a more relevant metric, such as one of the following:

- Luiten, J., Os̆ep, A., Dendorfer, P. et al. HOTA: A Higher Order Metric for Evaluating Multi-object Tracking. Int J Comput Vis 129, 548–578 (2021). https://doi.org/10.1007/s11263-020-01375-2

- Mohammadjavad Abbaspour, Mohammad Ali Masnadi-Shirazi, Online multi-object tracking with δ-GLMB filter based on occlusion and identity switch handling, Image and Vision Computing, Volume 127, 2022, 104553, ISSN 0262-8856, https://doi.org/10.1016/j.imavis.2022.104553.

- Keni Bernardin, Alexander Elbs, R. Stiefelhagen, Multiple Object Tracking Performance Metrics and Evaluation in a Smart Room Environment, EURASIP Journal on Image and Video Processing, 2008(1), DOI: 10.1155/2008/246309

Answer: Thank you. We would like to clarify that the proposed method specifically addresses the task of single-object tracking in satellite video scenes. We have revised the text of the manuscript to explicitly mention this. Regarding the performance of the tracker in the presence of multiple moving objects (which we interpret as interference from similar objects) and occlusions, we have conducted quantitative analyses on these challenging scenarios. The proposed method demonstrates the best performance in terms of precision and success rate when dealing with similar object interference and partial occlusion.
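The response reports precision and success rate but does not restate how they are computed. The sketch below assumes the definitions commonly used in single-object tracking benchmarks: precision is the fraction of frames whose center location error is within a pixel threshold, and success is the average, over a sweep of overlap thresholds, of the fraction of frames whose IoU exceeds the threshold. The function names, the (x, y, w, h) box layout, the 5-pixel default, and the 21-point threshold sweep are all assumptions made here for illustration, not values taken from the manuscript.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), top-left corner


def center_error(a: Box, b: Box) -> float:
    """Euclidean distance between the centers of two boxes."""
    ax, ay = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx, by = b[0] + b[2] / 2, b[1] + b[3] / 2
    return math.hypot(ax - bx, ay - by)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision(pred: Sequence[Box], gt: Sequence[Box],
              pixel_threshold: float = 5.0) -> float:
    """Fraction of frames with center error within pixel_threshold."""
    hits = sum(center_error(p, g) <= pixel_threshold
               for p, g in zip(pred, gt))
    return hits / len(gt)


def success(pred: Sequence[Box], gt: Sequence[Box],
            n_thresholds: int = 21) -> float:
    """Mean success rate over overlap thresholds evenly spaced in [0, 1]."""
    overlaps: List[float] = [iou(p, g) for p, g in zip(pred, gt)]
    thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps)
             for t in thresholds]
    return sum(rates) / len(rates)


# Two frames: one perfect prediction, one complete miss.
gt_boxes = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
pred_boxes = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 10.0, 10.0)]
prec = precision(pred_boxes, gt_boxes)
succ = success(pred_boxes, gt_boxes)
```

Both metrics score one target per sequence, which is why they suit the single-object setting the authors describe and not the identity-switch cases that multi-object metrics are designed for.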

 

The revised part of the manuscript:

Abstract:

Object tracking in satellite videos has garnered significant attention due to its increasing importance. However, several challenging attributes, such as the presence of tiny objects, occlusions, similar objects, and background clutter interference, make it a difficult task. Many recent tracking algorithms have been developed to tackle these challenges in tracking a single object of interest, but they still have some limitations in addressing them effectively. This paper introduces a novel correlation filter-based tracker, which uniquely integrates attention-enhanced bounding box regression and motion constraints for improved single-object tracking performance.

 

Keywords: single object tracking

 

Introduction:

(1) Furthermore, single object tracking in satellite videos frequently encounters issues of occlusion and interference. While these issues also exist in general tracking scenarios, they are intensified in the satellite context due to the top-down viewing angle. For instance, moving vehicles often face full occlusion from buildings or natural landscapes in satellite videos. Moreover, interference factors such as atmospheric turbulence, changes in illumination, similar objects, and background clutter further hinder the object tracking process, leading to tracking drift.

(2) We propose an anti-drift module for satellite video single object tracking, which models the difference between the observation distribution and motion trend distribution of the target to detect drift.

 

Related Literature:

2.4. Single Object Tracking in Satellite Video

 

 

The precision and success rate plots of the trackers under the similar object interference, partial occlusion, and full occlusion challenges appear on pages 17-18 of the revised manuscript.

 

Reviewer 4 Report

Comments and Suggestions for Authors

 Object tracking in satellite videos has garnered significant attention due to its pivotal role in various domains such as environmental monitoring, urban planning, and national security. The ability to accurately track objects in satellite imagery enables precise analysis of dynamic phenomena and facilitates informed decision-making processes. As satellite technology continues to advance, the demand for robust object tracking algorithms becomes increasingly pronounced, underscoring the urgency of developing effective solutions to meet evolving needs.

Strengths:

Innovative Attention Mechanism: This paper introduces a unique dual-attention mechanism, significantly improving target perception and corner localization crucial for accurate object tracking in challenging satellite video scenarios.

 

Effective Motion Feature Integration: By utilizing motion features through an LSTM network, the proposed method overcomes limitations of small-sized and low-resolution target appearance features, enhancing tracking performance over time.

 

Weakness:

 

While demonstrating significant improvements over recent trackers, the scalability and generalization of the proposed method may vary across different satellite video datasets and scenarios, necessitating further validation on diverse datasets.

 

Recommendation:

 

Researchers interested in advancing satellite video object tracking should explore and potentially extend the proposed method, conducting additional experiments on varied datasets to enhance scalability and generalizability. Additionally, assessing computational efficiency for real-time applications would be beneficial.

Author Response

Reviewer 4:

 

Comments and Suggestions for Authors:

 Object tracking in satellite videos has garnered significant attention due to its pivotal role in various domains such as environmental monitoring, urban planning, and national security. The ability to accurately track objects in satellite imagery enables precise analysis of dynamic phenomena and facilitates informed decision-making processes. As satellite technology continues to advance, the demand for robust object tracking algorithms becomes increasingly pronounced, underscoring the urgency of developing effective solutions to meet evolving needs.

 

Strengths:

Innovative Attention Mechanism: This paper introduces a unique dual-attention mechanism, significantly improving target perception and corner localization crucial for accurate object tracking in challenging satellite video scenarios.

 

Effective Motion Feature Integration: By utilizing motion features through an LSTM network, the proposed method overcomes limitations of small-sized and low-resolution target appearance features, enhancing tracking performance over time.

 

Weakness:

While demonstrating significant improvements over recent trackers, the scalability and generalization of the proposed method may vary across different satellite video datasets and scenarios, necessitating further validation on diverse datasets.

Answer: Dear Reviewer#4, thank you for your valuable comments and suggestions on our manuscript. We would like to address your concerns and provide clarifications as follows:

 

Recommendation:

4.1 Researchers interested in advancing satellite video object tracking should explore and potentially extend the proposed method, conducting additional experiments on varied datasets to enhance scalability and generalizability.

4.2 Additionally, assessing computational efficiency for real-time applications would be beneficial.

Answer: Thank you for your valuable comments and suggestions. Satellite video object tracking datasets are very rare; the open-source SatSOT dataset and the satellite video data provided by the SatVideoDT challenge are the two that we can use.

 

The SatSOT dataset contains 105 moving target video sequences from three different video satellites: Jilin-1, Skybox, and Carbonite-2. These video sequences cover 11 typical challenges encountered in complex satellite video scenes. The large-scale satellite video data from SatVideoDT contains 1,126 moving object tracking sequences from the Jilin-1 satellite.

When more diverse datasets become available in the future, we will further test our method.

 

 

Regarding computational efficiency, we evaluated the tracking efficiency of the proposed tracker in Section 4.2.5 of the paper. On the GTX1060 testing platform, it runs at about 25 FPS (25.05 FPS at the best-performing setting, ε = 0.05), which is close to real-time performance.

 

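The tracking efficiency of the proposed tracker, reported on page 15 of the revised manuscript:

| ε    | Prec.(%) | Succ.(%) | FPS   |
|------|----------|----------|-------|
| 0.07 | 65.8     | 48.3     | 25.29 |
| 0.06 | 64.6     | 47.6     | 25.49 |
| 0.05 | 66.3     | 49.0     | 25.05 |
| 0.04 | 63.9     | 47.2     | 24.85 |
| 0.03 | 64.1     | 47.4     | 24.84 |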
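For readers who want to reproduce an FPS figure like those above, a throughput measurement can be sketched as follows. The names `measure_fps`, `frame_budget_ms`, and the `track_frame` callable are hypothetical and stand in for whatever per-frame tracking call is being timed; the response does not describe how the authors measured FPS, so this is only one plausible procedure.

```python
import time
from typing import Callable, Sequence


def frame_budget_ms(target_fps: float) -> float:
    """Per-frame time budget, in milliseconds, for a target frame rate."""
    return 1000.0 / target_fps


def measure_fps(track_frame: Callable[[object], object],
                frames: Sequence[object]) -> float:
    """Average frames per second of track_frame over a sequence.

    track_frame is a placeholder for the tracker's per-frame update;
    model loading and first-frame initialization should be done before
    calling this so that only steady-state tracking is timed.
    """
    start = time.perf_counter()
    for frame in frames:
        track_frame(frame)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return len(frames) / elapsed if elapsed > 0 else float("inf")


# A 25 FPS target leaves 40 ms of processing time per frame.
budget = frame_budget_ms(25.0)

# Dummy workload standing in for a real tracker update.
fps = measure_fps(lambda frame: sum(range(1000)), list(range(200)))
```

At a 25 FPS target the budget is 40 ms per frame, so the 24.84 to 25.49 FPS range in the table corresponds to roughly 39 to 40 ms of processing per frame.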

 

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

I find that the comments of the review have been fully addressed and that the paper merits publication.
