Article

A Contrastive-Augmented Memory Network for Anti-UAV Tracking in TIR Videos

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190, China
3 Key Laboratory of Target Cognition and Application Technology (TCAT), Chinese Academy of Sciences, Beijing 100190, China
4 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(24), 4775; https://doi.org/10.3390/rs16244775
Submission received: 21 October 2024 / Revised: 16 December 2024 / Accepted: 19 December 2024 / Published: 21 December 2024

Abstract

With the development of unmanned aerial vehicle (UAV) technology, the threat of UAV intrusion is no longer negligible. Therefore, drone perception, especially anti-UAV tracking technology, has garnered considerable attention. However, both traditional Siamese and transformer-based trackers struggle in anti-UAV tasks due to the small target size, cluttered backgrounds and model degradation. To alleviate these challenges, a novel contrastive-augmented memory network (CAMTracker) is proposed for anti-UAV tracking tasks in thermal infrared (TIR) videos. The proposed CAMTracker conducts tracking through a two-stage scheme, searching for possible candidates in the first stage and matching the candidates with the template for the final prediction in the second. In the first stage, an instance-guided region proposal network (IG-RPN) is employed to calculate the correlation features between the templates and the search images and to generate candidate proposals. In the second stage, a contrastive-augmented matching module (CAM), along with a refined contrastive loss function, is designed to enhance the discrimination ability of the tracker under the guidance of a contrastive learning strategy. Moreover, to avoid model degradation, an adaptive dynamic memory module (ADM) is proposed to maintain a dynamic template that copes with the feature variation of the target in long sequences. Comprehensive experiments have been conducted on the Anti-UAV410 dataset, where the proposed CAMTracker outperforms advanced tracking algorithms on all the evaluation metrics, with gains of at least 2.40%, 4.12%, 5.43% and 5.48% in precision, success rate, success AUC and state accuracy, respectively.

1. Introduction

In recent years, advancements in technologies such as intelligent control, microelectronics and sensors have substantially influenced the development of unmanned aerial vehicles (UAVs), particularly in terms of miniaturization, cost efficiency and reduced energy consumption [1,2,3,4,5]. Consequently, due to their portability and flexibility, along with the open-source programming systems, UAVs have received widespread attention and been applied in multiple scenarios [6], including aerial photography [7], geographical mapping, intelligent monitoring [8] and rescue [9,10]. At the same time, however, the advantages of UAVs lead to their increasing abuse, which has brought a significant negative impact on public safety [11]. Therefore, anti-UAV technologies are critically important in both practical applications and research endeavors.
In order to address the public safety issues caused by misused UAVs, multiple anti-UAV methods are introduced, such as radar detection [12], acoustic detection [13] and visual-based methods [14,15]. Among the methods, both radar detection systems and acoustic detection systems are able to perceive targets in a 360-degree surrounding area. But radar systems struggle with detecting small drone targets, especially in complex environments, while acoustic detection systems are sensitive to noise interference in urban environments. Compared to other anti-UAV technologies, visual-based approaches offer higher efficiency, lower power consumption and easier deployment, thereby being extensively adopted to counter the potential threat of illegal drone intrusions [6]. In addition, techniques in thermal infrared (TIR) mode provide stability under different weather conditions and low-light circumstances [16]. As a vital and fundamental step in computer vision, visual object tracking allows the continuous real-time monitoring of particular targets [17]. Combined with the capacity of full-time and full-weather observation afforded by thermal infrared sensors, TIR object tracking offers a promising solution to anti-UAV tasks.
Although TIR object tracking methods are effective for all-time and all-weather anti-UAV tracking tasks, difficulties remain that stem from the TIR videos themselves and from the drone targets they contain. Unlike RGB videos, frames in TIR videos are usually grayscale and of lower resolution. Grayscale frames provide no color features and insufficient texture features, while low resolution results in inadequate structure and edge features [18,19]. Similarity interference and intensity change are also typical challenges in TIR target tracking tasks [16]. Meanwhile, the drone targets in TIR videos are usually smaller than the targets in general tracking tasks, leading to sparse target features. In addition, under some circumstances, drone targets move fast in the video. Therefore, to capture the fast-moving drone targets, sudden camera movements are inevitable [20]. Moreover, in long videos, the background may change greatly, and the appearance of the targets may also shift. In general, the main challenges in TIR anti-UAV tracking include (1) limited target features, due to the quality of TIR videos and the small scale of drone targets; (2) sudden camera movement caused by the fast-moving drone targets; (3) interference from thermal crossover and similar distractors; and (4) target appearance variation and dynamic background change in long-term tracking tasks.
As an important area of computer vision, visual single object tracking is applied in a wide range of social systems, and great improvements have been achieved in this field [17]. Current single object trackers can be broadly categorized into two types: correlation filter trackers and deep learning trackers [21].
Correlation filters were first introduced to the field of tracking by Bolme et al. in MOSSE [22]. Thanks to their computational efficiency and stability, correlation filter trackers (CF trackers) have been improved from various perspectives [21]. CSK [23] introduces the fast Fourier transform into the cyclic matrix, enhancing the efficiency of the algorithm and improving the accuracy by converting the image space into a nonlinear space. KCF [24] employs kernel functions to perform tracking in high-dimensional space. To avoid the limitation of a single kernel function, MKCF [25] conducts an adaptive fusion of multiple kernel functions to train the correlation filter. STAPLE [26] strengthens feature representation by applying the histogram of oriented gradients (HOG) in feature extraction. To reduce the impact of scale variation, Danelljan et al. propose a discriminative scale space tracker, also known as DSST [27], which trains correlation filters on a scale pyramid representation. SRDCF [28] alleviates boundary effects by employing a spatial weight function to limit the filter coefficients. STRCF [29] introduces the alternating direction method of multipliers (ADMM) so that each sub-problem has a closed-form solution. To address the boundary problem caused by the use of the cyclic matrix in generating training samples, BACF [30] enlarges the object search area and improves the quality of generated samples. However, the fixed spatial constraints applied to balance the boundary effects ignore the diversity information of the targets. Therefore, ASRCF [31] performs an adaptive spatial regularization and obtains more reliable filter coefficients in the tracking process. Adaptive approaches are also employed in [32], where the authors propose an adaptive spatial-aware CF to increase the weights of the target areas.
Although they provide remarkable inference speed, most CF trackers lack flexibility and generalization because they rely on handcrafted features and regularization strategies. Therefore, CF trackers struggle with diverse targets and complex backgrounds [17].
With the breakthroughs [33,34,35] made by deep learning methods in the ImageNet large-scale visual recognition competition (ILSVRC) [36] and visual object tracking (VOT) [37,38,39], deep learning methods have gained considerable traction in computer vision. Deep learning and convolutional neural networks (CNNs) have also been introduced to visual object tracking tasks. One category of deep learning trackers (DL trackers) integrates features extracted from deep learning networks with traditional CF trackers [40,41,42]. However, these trackers are not capable of handling various tracking tasks [17]. Therefore, end-to-end trained DL trackers, especially those based on Siamese networks, are the main focus of investigation, as they better exploit feature representation and feature understanding abilities.
The concept of the Siamese network was first proposed by Tao et al. in SINT [43]. Following that, Bertinetto et al. designed SiamFC [44], employing a fully convolutional structure and a cross-correlation layer to calculate the similarity between the template and search areas. Inspired by SiamFC, many Siamese network-based methods have been proposed. SiamRPN [45] and SiamRPN++ [46] introduce RPN into Siamese networks and employ classification and regression prediction to locate targets. Apart from the anchor-based approaches, anchor-free methods like SiamCAR [47] and SiamBAN [48] perform classification and regression prediction directly on the response map, leading to significant progress in tracking speed. Besides the Siamese networks that search for the target in a particular region guided by the previous tracking results, global instance search (GIS) approaches seek targets in the whole image, which alleviates the impact of sudden change. Among them, GlobalTrack [49] introduces the RCNN structure into Siamese networks, conducting a two-stage refinement to enhance the prediction of bounding boxes in a global view of the search image. SiamRCNN [50] proposes a two-stage re-detection and a tracklet mechanism to reduce the disturbance from distractors.
With Transformer [51] showing promising ability in various downstream tasks in computer vision, a number of Transformer-based trackers have been proposed [52,53,54,55,56,57,58,59]. TransT [53] integrates CNN feature extraction with Transformer feature fusion, exploiting the advantage of the attention mechanism in long-distance dependencies. STARK [52] proposes a template updating strategy on the CNN transformer structure to learn spatial and temporal information at the same time. SwinTrack [54] applies swin-transformer as the feature extractor, enhancing object tracking through better feature representation. Unlike the former methods, OSTrack [55] introduces a one-stream tracking framework based on Transformer, instead of the two-stage framework of feature extraction and relation modeling. After that, SeqTrack [56] and ARTrack [57] convert the tracking task into a sequence generation problem and employ an autoregressive strategy to predict the bounding boxes. AQATrack [59] designs a novel spatial-temporal transformer to learn deep spatial-temporal constraints and performs prediction with the assistance of a spatio-temporal feature fusion module.
Although general visual object tracking has made remarkable progress, the differences in image quality and target characteristics between TIR frames and RGB frames result in unsatisfactory performance of general trackers on TIR tracking tasks. Therefore, trackers for TIR tracking tasks are investigated. MCFTS [60] employs a correlation filter-based ensemble tracker with multilayer convolutional features. Nevertheless, TIR tracking is still hindered by scarce benchmarks. To alleviate this problem, Liu et al. propose a TIR pedestrian tracking dataset called PTB-TIR [61] and a large-scale TIR object tracking dataset named LSOTB-TIR [62], which contains 1400 sequences and over 600 K frames with 47 categories of targets. Based on these benchmarks, plenty of trackers for TIR targets have been proposed. HSSNet [63], designed by Li et al., utilizes hierarchical convolutional features to obtain a better spatial and semantic feature representation of TIR targets. Liu et al. [64] propose a multi-level similarity-based Siamese network, namely, MLSSNet, which contains a structure similarity module and a semantic similarity module to compute the similarity for TIR objects in different aspects. Furthermore, MMNet [65] performs multi-level feature matching for TIR targets by adopting a discriminative matching module for inter-class recognition and a fine-grained aware module for intra-class recognition. To reduce the influence of occlusion and distractors, Yuan et al. [66] design a spatial-temporal memory network, collecting high-quality results in previous frames for tracking.
The aforementioned TIR trackers have shown strong performance on regular TIR tracking tasks like pedestrian tracking and animal tracking, where the targets are of normal size or the backgrounds are quite clean. These trackers often struggle with anti-UAV tracking tasks, as the targets are usually smaller and some city backgrounds are cluttered [67]. Hence, drone tracking with TIR sensors has received attention from researchers. Jiang et al. [68] propose a large-scale benchmark called Anti-UAV, which contains over 300 video pairs with both visible light and infrared modalities. Huang et al. present Anti-UAV410 [20], consisting of 410 infrared sequences with 438k bounding boxes. SiamSTA [69] designs a two-stage re-detection mechanism inspired by SiamRCNN, reducing the interference of occlusion and camera movement. Shi et al. propose GASiam [70], employing graph attention to support the re-detection of drone targets. UTTracker [14], presented by Yu et al., tackles target appearance variation with a multi-region local tracker. UTTracker also contains a global detection branch to cope with quick position switches of targets.
Nevertheless, some problems remain unsolved in TIR anti-UAV tracking. First, the usual solution to camera movement or out-of-view objects is to apply a global detection algorithm for assistance, which makes the tracking pipeline more complicated. Second, most trackers tend to focus more on learning the similarity between the ground truth and the templates than on the differences between distractors and the templates. This results in weak instance discrimination when dealing with thermal crossover and distractors. Finally, long-term anti-UAV tracking tasks are still challenging, as the target appearances and backgrounds change dynamically in long sequences. Memory modules utilized in current TIR anti-UAV trackers usually rigidly stack previous tracking results together or simply adopt the latest high-quality tracking result as a dynamic template. These approaches cannot flexibly combine the common discriminative features in history results with the features of recent predictions.
To address the remaining challenges, in this paper, the anti-UAV tracking task is converted to a global instance search (GIS) problem so that targets can be re-detected without other algorithms. The GIS problem is then divided into two subproblems, proposal detection and instance discrimination. Proposal detection gathers regions that are similar to the template, and instance discrimination finds the target among the proposals. Inspired by the ability of contrastive learning to distinguish inter-class samples [71], a contrastive learning strategy is adopted to better extract the discriminative features of the templates. To maintain high tracking performance in long sequences, a memory module is also necessary. Based on these ideas, a contrastive-augmented memory network (CAMTracker) for TIR anti-UAV tracking tasks is proposed.
Overall, CAMTracker takes the whole image as input to re-detect targets without other algorithms. CAMTracker contains a Siamese feature extraction backbone and utilizes a two-stage tracking framework, composed of a proposal detection stage and an instance discrimination stage. The proposal detection stage returns possible target regions through an instance-guided region proposal network (IG-RPN). The instance discrimination stage distinguishes the target among the candidates, which is executed mainly by a contrastive-augmented matching module (CAM). CAM has a classification branch, a regression branch and an embedding branch. Each branch is supervised by a corresponding loss. Specifically, a refined contrastive loss function is proposed to supervise the embedding branch. Furthermore, an adaptive dynamic memory module (ADM) is employed to help achieve high-quality long-term tracking. Inspired by the capability of the attention mechanism in capturing key information [51], an attention-like updating strategy is applied in ADM to better extract common features of the initial template and tracking results, while the unique features in recent predictions remain. Experiments on the Anti-UAV410 dataset demonstrate the robustness and effectiveness of CAMTracker when handling long-term UAV tracking in TIR mode with interference from distractors and thermal crossover.
The main contributions of this article are as follows:
(1) A contrastive-augmented memory network is proposed for anti-UAV tracking tasks in TIR videos, which employs a GIS tracking structure and counters the interference from camera movement. Comprehensive experiments have shown that the proposed CAMTracker achieves better performance than other mainstream algorithms on benchmarks in the TIR anti-UAV field.
(2) A contrastive-augmented matching module is designed for better instance discrimination. Apart from regular regression and classification prediction, another branch trained by contrastive loss helps the tracker classify the target.
(3) A refined contrastive loss function is designed to better balance the emphasis on positive and negative samples, thus strengthening the discrimination ability of the tracker.
(4) An adaptive dynamic memory module is proposed to execute memory updating and provide the latest critical features of the targets, thereby improving the tracker’s long-term tracking capability.

2. Materials and Methods

The proposed CAMTracker contains a pair of weight-sharing backbones for feature extraction, a two-stage tracking framework consisting of a proposal detection stage and an instance discrimination stage, and an adaptive dynamic memory module. Figure 1 provides an overview of the architecture. The proposal detection stage performs coarse detection under the guidance of target information and generates possible candidates through an instance-guided region proposal network (IG-RPN). The instance discrimination stage, which includes a contrastive-augmented matching module (CAM) and an adaptive dynamic memory module (ADM), distinguishes the target from the candidates in the search frame and performs fine localization.

2.1. Multi-Scale Siamese Feature Extraction

Considering the insufficient color and texture features in infrared images, together with the small scale of UAV targets, it is challenging to track the targets with only single-scale features. Therefore, a multi-scale Siamese feature extraction module is applied to fuse semantic and spatial characteristics and obtain robust deep features. The Siamese network contains two weight-shared branches, namely, a template branch and a search branch. Each branch is composed of a ResNet and a feature pyramid network (FPN). Given a template frame $I_z$ and a search frame $I_x$, the frame pair first passes through the backbone network and outputs feature maps from different levels. After that, the FPN takes the multi-level features as input and returns corresponding multi-scale image features $Z = \{Z_i \mid i \in O_L\} = \phi(\varphi(I_z))$ and $X = \{X_i \mid i \in O_L\} = \phi(\varphi(I_x))$, where $\varphi$ and $\phi$ represent the backbone and the FPN, respectively. $O_L$ means the output layer set for FPN. Moreover, an RoIAlign algorithm is applied to extract the initial template target feature $F_z$, which can be calculated as
$$F_z = \delta(Z_i, B_0), \tag{1}$$
where $\delta$ denotes the RoIAlign algorithm and $B_0$ indicates the ground truth box of the target in the initial frame. $Z_i$ means the template frame feature from the level corresponding to the target size. The weight-sharing network acquires semantically consistent features of the template frame and the search frame, thereby enhancing the identification of common objects in the image pair. Moreover, the multi-scale design not only obtains deep semantic features but also preserves the spatial details of the targets, benefiting precise target localization.

2.2. IG-RPN for Proposal Detection

The proposal detection is mainly executed by an instance-guided region proposal network (IG-RPN), which contains an RPN module and a correlation encoder in front, as Figure 2 depicts, where ⊙ indicates correlation calculation. The initial target feature from the template branch and the feature of the search frame from the search branch are encoded together by a correlation operation. Then, the encoded correlation feature is inputted into the RPN to generate possible proposals through classification and regression.
Assume that $F_z \in \mathbb{R}^{C \times K \times K}$ is the template feature and $F_x \in \mathbb{R}^{C \times H \times W}$ is the search frame feature from the template branch and the search branch, respectively, where $K \times K$ is the output size of the RoIAlign layer, and $H$, $W$ and $C$ are the height, width and number of feature channels of the search frame feature maps. The process of the correlation encoder can be expressed as follows:
$$\varphi(F_z) = Conv_{K \times K}(F_z), \tag{2}$$
$$Corr(F_z, F_x) = Conv_{1 \times 1}(F_x \otimes \varphi(F_z)), \tag{3}$$
where $Conv_{K \times K}$ and $Conv_{1 \times 1}$ denote convolution operations with kernel sizes of $K \times K$ and $1 \times 1$, respectively, and $\otimes$ indicates the correlation calculation. In the correlation encoder, $F_z$ is first convolved with a $K \times K$ filter and converted to an integrated target feature with the size of $1 \times 1$. After that, correlation between $F_x$ and the integrated target feature $\varphi(F_z)$ is conducted, which outputs a correlation map. The correlation map is then refined by a $1 \times 1$ convolution layer and delivered to the RPN. The RPN outputs classification scores and regression results, and the proposal list is formed by the regressed bounding boxes with the highest scores.
Conventionally, tracking approaches perform correlation calculation between features of templates and neighborhood search regions to infer the location of targets, where the target sizes in the templates and the search regions are similar. With a global search strategy, however, there is a considerable gap between the size of the aligned template feature and the possible target size in the global features. Therefore, the template feature is converted into a single-pixel feature for the correlation operation to maintain the size balance.
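The correlation encoder described above can be illustrated with a minimal sketch on nested lists. Two points are assumptions rather than details from the text: the $K \times K$ convolution is taken to keep $C$ output channels, and the correlation is taken to be depth-wise, so that a $1 \times 1$ template kernel simply rescales each channel of the search feature. The function names are hypothetical.

```python
# Sketch of the correlation encoder (template collapse, correlation, 1x1 refinement).
# Assumptions, not from the paper: the K x K convolution outputs one value per
# output channel, and the correlation is depth-wise (per-channel scaling).

def conv_kxk(f_z, weight):
    """Collapse a C x K x K template feature into a length-C_out vector.

    weight has shape C_out x C x K x K. Without padding, a K x K kernel applied
    to a K x K input yields a single 1 x 1 response per output channel.
    """
    out = []
    for w_out in weight:
        total = 0.0
        for c, plane in enumerate(f_z):
            for i, row in enumerate(plane):
                for j, value in enumerate(row):
                    total += w_out[c][i][j] * value
        out.append(total)
    return out


def depthwise_correlation(f_x, kernel):
    """Correlate a C x H x W search feature with a C x 1 x 1 template kernel."""
    return [[[kernel[c] * v for v in row] for row in plane]
            for c, plane in enumerate(f_x)]


def conv_1x1(feat, weight):
    """Mix channels at every pixel; weight has shape C_out x C."""
    height, width = len(feat[0]), len(feat[0][0])
    return [[[sum(w_row[c] * feat[c][y][x] for c in range(len(feat)))
              for x in range(width)] for y in range(height)]
            for w_row in weight]


def correlation_encoder(f_z, f_x, w_kxk, w_1x1):
    """Return the refined correlation map that would be passed to the RPN."""
    kernel = conv_kxk(f_z, w_kxk)
    return conv_1x1(depthwise_correlation(f_x, kernel), w_1x1)
```

Because the template is reduced to one value per channel, the output keeps the full $H \times W$ resolution of the search feature regardless of how large the target appears in the global view.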

2.3. Contrastive-Augmented Instance Matching

Since there is insufficient color and texture information, it remains challenging to extract discriminative features from thermal infrared frames. Generally, the correlation features calculated from the features of the template and the search regions are delivered to the heads for classification and regression predictions, both in single-stage Siamese networks and GIS methods. Multiple approaches have been designed to acquire a more accurate correlation representation for more precise predictions. However, for single-stage Siamese networks, it is difficult to identify the crucial information of the infrared drone when background noise is present in the search region. On the other hand, numerous distractors are introduced to the classification and regression heads in GIS approaches, seriously affecting the stability of the tracker. Therefore, it is unreliable to capture the infrared UAV by relying on the correlation features only.
To tackle this challenge, a contrastive-augmented matching module (CAM) is proposed, which incorporates contrastive learning strategies into the anti-UAV tracker to enhance the ability to distinguish targets from distractors. The structure of CAM is illustrated in Figure 3, mainly consisting of a regular branch and an embedding branch. In Figure 3, Ⓗ means the Hadamard product. The regular branch performs classification and regression as in most GIS methods. Meanwhile, the unique embedding branch is responsible for mapping templates and proposals into an embedding space and calculating the similarity scores between the templates and candidates. The similarity scores are used to assist the classification prediction during the inference process.
To be specific, let $Z \in \mathbb{R}^{C \times H \times W}$ and $X_t^i \in \mathbb{R}^{C \times H \times W}$ represent the template feature and the $i$-th proposal feature in the $t$-th frame, respectively, where $C$, $H$ and $W$ denote the number of channels, height and width of the features. Before entering the two branches, $Z$ and $X_t^i$ pass through two projection layers, respectively. Both projection layers are convolutional layers with $3 \times 3$ kernels and padding of 1 pixel. Then, in the regular branch, the Hadamard product is applied to the projected template and proposal features to output the modulated feature. The modulated feature passes through a global average pooling layer and a fully connected layer, becoming a $1024 \times 1 \times 1$ vector. The vector is processed by two fully connected layers to obtain the classification score and regression offsets, respectively.
$$Cls_t^i = FC_{cls}(FC(GAP(p_z(Z) \mathbin{\text{Ⓗ}} p_x(X_t^i)))), \tag{4}$$
$$Box_t^i = FC_{reg}(FC(GAP(p_z(Z) \mathbin{\text{Ⓗ}} p_x(X_t^i)))), \tag{5}$$
where $p_z$ and $p_x$ represent the projection functions that process the template feature $Z$ and the proposal feature $X_t^i$, both realized by $3 \times 3$ convolution layers with padding of one pixel. Ⓗ in the equations means the Hadamard product. $FC_{cls}$, $FC_{reg}$, $FC$ and $GAP$ denote the fully connected layer for classification, the fully connected layer for regression, the shared fully connected layer and the global average pooling layer, respectively.
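The data flow of the regular branch can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the inputs are taken to be already projected by $p_z$ and $p_x$, no activation functions are inserted because the text does not specify any, and the layer widths in the usage example are tiny placeholders rather than the 1024-dimensional vector mentioned above. All function names are hypothetical.

```python
# Sketch of the regular branch: Hadamard product -> GAP -> shared FC -> two heads.
# Assumptions: inputs are already projected, no activations, toy layer widths.

def hadamard(a, b):
    """Element-wise product of two C x H x W features."""
    return [[[u * v for u, v in zip(row_a, row_b)]
             for row_a, row_b in zip(plane_a, plane_b)]
            for plane_a, plane_b in zip(a, b)]


def global_average_pool(feat):
    """Average every channel of a C x H x W feature into one value."""
    return [sum(sum(row) for row in plane) / (len(plane) * len(plane[0]))
            for plane in feat]


def linear(vec, weight, bias):
    """Fully connected layer; weight has shape out_dim x in_dim."""
    return [sum(w * v for w, v in zip(w_row, vec)) + b
            for w_row, b in zip(weight, bias)]


def regular_branch(z_proj, x_proj, shared, cls_head, reg_head):
    """Return (classification logit, box offsets) for one proposal.

    shared, cls_head and reg_head are (weight, bias) pairs.
    """
    pooled = global_average_pool(hadamard(z_proj, x_proj))
    hidden = linear(pooled, *shared)
    cls_logit = linear(hidden, *cls_head)[0]
    box_offsets = linear(hidden, *reg_head)
    return cls_logit, box_offsets
```

For a single-channel $2 \times 2$ template and proposal, the branch could be exercised like this:

```python
z = [[[1.0, 2.0], [3.0, 4.0]]]
x = [[[1.0, 1.0], [1.0, 1.0]]]
score, offsets = regular_branch(
    z, x,
    shared=([[2.0]], [0.0]),
    cls_head=([[1.0]], [0.5]),
    reg_head=([[1.0], [0.0], [-1.0], [2.0]], [0.0, 0.0, 0.0, 0.0]),
)
```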
Meanwhile, the embedding branch calculates the embedding features of the template and candidates, as well as their cosine similarities. As Figure 3 demonstrates, global average pooling is first performed on the projected template and candidate features. Subsequently, the embedding features of particular channels are computed by a multi-layer perceptron, which is composed of two fully connected layers. Finally, the branch calculates cosine similarities between the embedding features of the template and candidates. An embedding feature $f_{embed} \in \mathbb{R}^{C \times 1 \times 1}$ can be obtained through the following function:
$$f_{embed} = FC_2(FC_1(GAP(X))), \tag{6}$$
where $X$ indicates the template feature or candidate features. $GAP$ means the global average pooling layer, and $FC_1$ and $FC_2$ denote the two fully connected layers in the multi-layer perceptron. Consequently, the cosine similarity between the template embedding feature $z$ and the $i$-th candidate embedding feature $x_t^i$ in the $t$-th frame can be derived from
$$s_t^i = \frac{z \cdot x_t^i}{\lVert z \rVert \cdot \lVert x_t^i \rVert}, \tag{7}$$
in which $\cdot$ in the numerator represents the inner product of vectors.
During the process of inference, the module determines whether the candidates are targets depending not only on the classification scores but also the cosine similarities. The similarity scores of the candidates are obtained through a softmax function:
$$Sim_t^i = \frac{\exp(s_t^i)}{\sum_{k \in candidates} \exp(s_t^k)}, \tag{8}$$
where $i$ means the $i$-th candidate in the $t$-th frame. By multiplying the similarity score and the classification score, the final discrimination scores of the candidates in the $t$-th search frame can be acquired:
$$Final_t = \{ Cls_t^i \times Sim_t^i \mid i = 1, 2, \ldots, N \}, \tag{9}$$
where N indicates the number of proposals in the t-th frame. During the tracking process, the highest scored proposal is determined as the predicted result in the frame.
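The inference-time scoring described above (cosine similarity, softmax over candidates, multiplication with the classification score, then selection of the top proposal) can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that every embedding has a non-zero norm.

```python
import math

# Sketch of the score fusion at inference time.
# Assumption: embeddings are non-zero, so the cosine similarity is defined.


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(values):
    """Softmax over the candidates of one frame (shifted for stability)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_target(template_embed, candidate_embeds, cls_scores):
    """Return the index of the best proposal and all fused scores."""
    sims = [cosine_similarity(template_embed, x) for x in candidate_embeds]
    sim_scores = softmax(sims)
    final = [c * s for c, s in zip(cls_scores, sim_scores)]
    best = max(range(len(final)), key=final.__getitem__)
    return best, final
```

In the toy call below, the second candidate has the highest classification score, but the first one is selected because its embedding is aligned with the template:

```python
best, final = select_target(
    [1.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    [0.6, 0.9, 0.5],
)
```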

2.4. Offline Training with Refined Contrastive Loss

To enhance the network’s ability to learn stronger semantic similarity and accurate location, CAMTracker is trained offline with a multi-task loss function, which consists of the loss from IG-RPN and the loss from CAM and can be denoted as
$$L_{total} = L_{IG\text{-}RPN} + \lambda_{CAM} L_{CAM}, \tag{10}$$
where $L_{total}$ means the overall loss for training. $L_{IG\text{-}RPN}$ and $L_{CAM}$ indicate the loss for the IG-RPN module and the loss for the CAM module. $\lambda_{CAM}$ means the weight factor to balance $L_{IG\text{-}RPN}$ and $L_{CAM}$.
For training IG-RPN in the proposal detection stage, a loss function similar to that of RPN is adopted, which contains a binary cross-entropy (BCE) function [33] for classification and a smooth-L1 loss [72] for regression:
$$L_{IG\text{-}RPN} = \lambda_c L_{cls} + \lambda_r L_{reg}, \tag{11}$$
$$L_{cls} = \frac{1}{N_c} \sum_i L_{bce}(y_i, \hat{y}_i), \tag{12}$$
$$L_{reg} = \frac{1}{N_r} \sum_i \hat{y}_i \, L_{\mathrm{smooth}_{L1}}(b_i, \hat{b}_i). \tag{13}$$
In Equations (11)–(13), $L_{cls}$ and $L_{reg}$ denote the loss for the classification branch and the regression branch in the IG-RPN module, respectively. $\lambda_c$ and $\lambda_r$ are weight factors for $L_{cls}$ and $L_{reg}$. In Equations (12) and (13), $y_i$ and $b_i$ indicate the predicted classification score and bounding box offsets, respectively. $\hat{y}_i$ and $\hat{b}_i$ denote the ground truth label and bounding box offsets, respectively. $N_c$ and $N_r$ denote the number of the samples assigned for classification training and regression training, respectively.
In Equation (12), $L_{bce}$ indicates the BCE loss function, which can be written as
$$L_{bce}(y_i, \hat{y}_i) = -\left[ \hat{y}_i \log y_i + (1 - \hat{y}_i) \log (1 - y_i) \right]. \tag{14}$$
In Equation (13), $L_{\mathrm{smooth}_{L1}}$ indicates the smooth-L1 loss function, which can be written as
$$L_{\mathrm{smooth}_{L1}}(b_i, \hat{b}_i) = \sum_{j \in \{x, y, w, h\}} \mathrm{smooth}_{L1}(b_i^j - \hat{b}_i^j), \tag{15}$$
where $\mathrm{smooth}_{L1}$ is the smooth-L1 function as follows:
$$\mathrm{smooth}_{L1}(x) = \begin{cases} \dfrac{x^2}{2}, & \text{if } |x| < 1, \\ |x| - \dfrac{1}{2}, & \text{otherwise}. \end{cases} \tag{16}$$
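As a compact illustration of Equations (11)–(16), the following sketch evaluates the IG-RPN loss for a handful of samples. Several choices are assumptions made only for this example: all samples are used for classification, $N_r$ is taken as the number of positive samples, both weight factors default to 1.0, and predictions are clamped by a small `eps` to keep the logarithm finite. The function names are hypothetical.

```python
import math

# Sketch of the IG-RPN loss: BCE for classification, smooth-L1 for regression.
# Assumptions: N_c = number of samples, N_r = number of positives,
# lambda_c = lambda_r = 1.0, predictions clamped by eps.


def bce(y_pred, y_true, eps=1e-12):
    """Binary cross-entropy between a predicted score and a 0/1 label."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred)
             + (1.0 - y_true) * math.log(1.0 - y_pred))


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def box_loss(pred, target):
    """Sum of smooth-L1 terms over the (x, y, w, h) offsets."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def ig_rpn_loss(scores, labels, boxes, gt_boxes, lambda_c=1.0, lambda_r=1.0):
    """Weighted sum of the classification and regression losses."""
    n_c = len(scores)
    n_r = max(1, sum(labels))
    l_cls = sum(bce(y, t) for y, t in zip(scores, labels)) / n_c
    # Only positive samples (label 1) contribute to the regression term.
    l_reg = sum(t * box_loss(b, g)
                for t, b, g in zip(labels, boxes, gt_boxes)) / n_r
    return lambda_c * l_cls + lambda_r * l_reg
```

Multiplying the box term by the label means that the offsets of negative samples, however large, leave the regression loss unchanged.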
Similarly, the regular branch in CAM is also trained by a BCE loss and a smooth-L1 loss. Meanwhile, the embedding branch is trained by a refined contrastive loss.
The embedding branch is designed to find an embedding space where the positive proposals and the template have similar embedding features while the embeddings of negative proposals are orthogonal to them. In anti-UAV tracking tasks, the negative proposals do not come from a single coherent class. The diversity of negative proposals is neglected if only the BCE loss is applied. Therefore, to perform ideal representation learning for classification, a contrastive-like strategy can be adopted, taking the diversity of negative proposals into consideration by viewing each of them as a class. A softmax-based info-noise contrastive estimation loss (infoNCE Loss) is usually employed for this kind of situation [73]:
$$L_{info} = -\sum_{x^+} \log \frac{\exp(z \cdot x^+ / \tau)}{\exp(z \cdot x^+ / \tau) + \sum_{x^-} \exp(z \cdot x^- / \tau)}, \tag{17}$$
where $z$ means the query, namely, the normalized template embedding feature in the tracking scenarios. $x^+$ and $x^-$ represent the positive candidate sample and negative candidate sample, respectively. $z \cdot x^+$ indicates the inner product between $z$ and $x^+$, which reflects the similarity between $z$ and $x^+$, and $\tau$ is the temperature factor to avoid an extremely high inner product. The loss function tends to keep the positive samples close to the template and negative samples far away from the template in the embedding space.
However, infoNCE Loss does not consider the positive and negative samples in a balanced manner. Equation (17) can be written as
L i n f o = x + l o g e x p ( z · x + / τ ) e x p ( z · x + / τ ) + x e x p ( z · x / τ ) = x + l o g e x p ( z · x + / τ ) + x e x p ( z · x / τ ) e x p ( z · x + / τ ) = x + l o g ( 1 + x e x p ( z · x / τ ) e x p ( z · x + / τ ) ) .
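As a quick numerical check, the softmax form of Equation (17) and the rewritten form of Equation (18) can be evaluated side by side. The sketch below is only illustrative: the function names and the toy two-dimensional embeddings are hypothetical, and the embeddings are assumed to be unit-normalized.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(z, positives, negatives, tau=0.1):
    """Equation (17): negative log of the softmax ratio, summed over positives."""
    neg = sum(math.exp(dot(z, x) / tau) for x in negatives)
    loss = 0.0
    for xp in positives:
        pos = math.exp(dot(z, xp) / tau)
        loss -= math.log(pos / (pos + neg))
    return loss

def info_nce_rewritten(z, positives, negatives, tau=0.1):
    """Equation (18): log(1 + sum of negatives / positive), summed over positives."""
    neg = sum(math.exp(dot(z, x) / tau) for x in negatives)
    return sum(math.log1p(neg / math.exp(dot(z, xp) / tau)) for xp in positives)

# Toy example: one template, one positive and two negatives (all unit vectors).
z = (1.0, 0.0)
positives = [(0.8, 0.6)]
negatives = [(0.0, 1.0), (0.6, -0.8)]
a = info_nce(z, positives, negatives)
b = info_nce_rewritten(z, positives, negatives)
```

The two values agree up to floating-point error, which is all that the rewriting in Equation (18) claims.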
Assume p is a positive sample and n is a negative sample; it can be inferred that
L i n f o p = L i n f o e x p ( z · p / τ ) e x p ( z · p / τ ) p = l o g ( 1 + x e x p ( z · x / τ ) e x p ( z · x + / τ ) ) x e x p ( z · x / τ ) e x p ( z · p / τ ) x e x p ( z · x / τ ) e x p ( z · p / τ ) e x p ( z · p / τ ) e x p ( z · p / τ ) p = e x p ( z · p / τ ) e x p ( p · p / τ ) + x e x p ( z · x / τ ) x e x p ( z · x / τ ) ( e x p ( z · p / τ ) ) 2 e x p ( z · p / τ ) p = x e x p ( z · x / τ ) ( e x p ( z · p / τ ) ) 2 + x e x p ( z · x / τ ) e x p ( z · p / τ ) e x p ( z · p / τ ) p = x e x p ( z · x / τ ) P 2 + x P · e x p ( z · x / τ ) P p .
L i n f o n = L i n f o e x p ( z · n / τ ) e x p ( z · n / τ ) n = x + l o g ( 1 + exp ( z · n / τ ) e x p ( z · x + / τ ) ) e x p ( z · n / τ ) e x p ( z · x + / τ ) e x p ( z · n / τ ) e x p ( z · x + / τ ) e x p ( z · n / τ ) e x p ( z · n / τ ) n = x + e x p ( z · x + / τ ) e x p ( z · x + / τ ) + e x p ( z · n / τ ) 1 e x p ( z · x + / τ ) e x p ( z · n / τ ) n = x + 1 e x p ( z · x + / τ ) + e x p ( z · n / τ ) e x p ( z · n / τ ) n = x + 1 e x p ( z · x + / τ ) + N N n .
For simplicity, we use $P = \exp(z \cdot p / \tau)$ and $N = \exp(z \cdot n / \tau)$. It can be inferred from Equations (19) and (20) that positive samples have a larger impact on the infoNCE loss than negative samples. Therefore, for a better balance between positive samples and negative samples, the contrastive loss is refined as
$$L_{con} = \log \left( 1 + \sum_{x^+} \sum_{x^-} \frac{\exp(z \cdot x^- / \tau)}{\exp(z \cdot x^+ / \tau)} \right) = \log \left( 1 + \sum_{x^+} \sum_{x^-} \exp\left( (z \cdot x^- - z \cdot x^+) / \tau \right) \right), \tag{21}$$
and the whole loss for CAM is formulated as
$$L_{CAM} = \lambda_c L_{cls} + \lambda_r L_{reg} + \lambda_{con} L_{con}, \tag{22}$$
where $L_{cls}$ and $L_{reg}$ are defined as in Equations (12) and (13). $\lambda_c$, $\lambda_r$ and $\lambda_{con}$ are the corresponding weights of the losses.
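A minimal sketch of the refined contrastive loss in Equation (21) and the combined CAM loss in Equation (22) is given below. The function names are hypothetical, the embeddings are assumed to be unit-normalized, and the default temperature and weights are the values reported in the implementation details (0.1, 1, 1 and 0.01).

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def refined_contrastive_loss(z, positives, negatives, tau=0.1):
    """Equation (21): a single log over all positive/negative pairs."""
    total = 0.0
    for xp in positives:
        s_pos = dot(z, xp)
        for xn in negatives:
            total += math.exp((dot(z, xn) - s_pos) / tau)
    return math.log1p(total)

def cam_loss(l_cls, l_reg, l_con, lambda_c=1.0, lambda_r=1.0, lambda_con=0.01):
    """Equation (22): weighted sum of the three CAM loss terms."""
    return lambda_c * l_cls + lambda_r * l_reg + lambda_con * l_con

# Toy example with a 2-D unit template, one positive and one negative.
z = (1.0, 0.0)
l_con = refined_contrastive_loss(z, [(1.0, 0.0)], [(0.0, 1.0)])
```

With a single positive sample, Equation (21) coincides with Equation (18); the two forms differ only when several positives are present, because the refined loss places the sum over positives inside the logarithm.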

2.5. Template Updating with Adaptive Dynamic Memory Module

In long sequences, the appearance of the target may change noticeably during tracking. To follow these changes in the target feature, a dynamic template for online tracking is introduced to fully utilize the information of previous tracking results. Traditionally, online tracking algorithms take the latest prediction as the dynamic template, which is easily misled by hard negative samples. Other methods construct memory banks that record historical high-quality results for prediction, which incurs considerable computational cost. To address the drawbacks of these conventional methods, inspired by [51], an adaptive updating scheme that maintains a single dynamic template is designed for the instance discrimination stage. When a new high-quality prediction arrives, the dynamic template is updated with the most recent prediction under the guidance of the initial template, which not only follows the latest results but also takes historical predictions into consideration.
The adaptive dynamic template updating strategy is shown in Figure 4, where Ⓒ indicates concatenation and ⊗ means matrix multiplication. Specifically, the initial template $T_{init} \in \mathbb{R}^{C \times K \times K}$ is regarded as the dynamic template at the beginning of online tracking. At the $t$-th frame of the sequence, the network first determines whether the predicted result $r_t = [x_{min}, y_{min}, x_{max}, y_{max}]$ is a high-quality prediction according to the classification score threshold $thres_c$. If $r_t$ is a high-quality prediction, the target feature $R_t \in \mathbb{R}^{C \times K \times K}$ is obtained through RoIAlign from the bounding box predicted in $r_t$. After that, an integrated feature $F_t \in \mathbb{R}^{2C \times K \times K}$ is obtained by concatenating $R_t$ and $D_t$, where $D_t \in \mathbb{R}^{C \times K \times K}$ indicates the dynamic template at the $t$-th frame. $F_t$ now contains information from historical predictions and the latest high-quality result. Subsequently, $F_t$ is reshaped to the size of $2KK \times C$ and $T_{init}$ is reshaped to the size of $C \times KK$, so that the two can be multiplied. The reshaped $F_t$ and the reshaped $T_{init}$ are then multiplied to get the pixel-to-pixel similarity matrix $w \in \mathbb{R}^{2KK \times KK}$, which is normalized by a softmax function along a single dimension. The process can be denoted as
$$F_t = Concat(R_t, D_t), \tag{23}$$
$$w = Softmax\left( Res_{2HW \times C}(F_t) \otimes Res_{C \times HW}(T_{init}) / C \right), \tag{24}$$
where $Concat$, $Softmax$ and ⊗ mean concatenation, the softmax function along a single dimension and matrix multiplication, respectively. $Res_{M \times N}$ indicates the reshape operation with an output size of $M \times N$; here $H = W = K$ for the $K \times K$ RoI features. $C$ is a normalization parameter to control the magnitude in $w$. In detail, each pixel in $w$ can be formally denoted as
$$w_{ij} = \frac{\exp\left( (f_t^{i,\cdot} \odot t_{init}^{\cdot,j}) / s \right)}{\sum_{0 \le k < 2HW} \exp\left( (f_t^{k,\cdot} \odot t_{init}^{\cdot,j}) / s \right)}, \tag{25}$$
where $i$ and $j$ are the coordinates of the pixel, and $f_t$ and $t_{init}$ indicate the reshaped $F_t$ and $T_{init}$, respectively. $f_t^{i,\cdot}$ is the $i$-th row of $f_t$ and $t_{init}^{\cdot,j}$ is the $j$-th column of $t_{init}$. ⊙ means the vector dot product.
Afterwards, $w$ is regarded as a weight map to guide the updating of the dynamic template. Pixels in $w$ express the similarity between the integrated historical feature and the initial template, hence indicating the common critical information of the historical target features. Therefore, by multiplying $f_t$ and $w$, target features in historical predictions are captured and refined under the guidance of the initial template, which works as the query in the following module:
$$D_{t+1} = Res_{C \times H \times W}(f_t^{T} \otimes w), \tag{26}$$
where $D_{t+1}$ is the dynamic template in the $(t+1)$-th frame and $f_t^{T}$ means the transpose of $f_t$. Through the adaptive updating scheme, almost all high-quality predictions in the sequence contribute to the tracking results. Meanwhile, the information of early predictions is gradually weakened by the repeated concatenation of the dynamic template with the latest prediction; thus, the information of the latest prediction is emphasized.
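The update in Equations (23)–(26) can be sketched with plain Python lists. This is a simplified illustration under several assumptions that are not in the text: features are stored pixel-major (one row of $C$ channel values per pixel), the concatenation is modelled as stacking the pixel rows of $R_t$ and $D_t$, and the normalization constant is exposed as a hypothetical `scale` argument because the text writes it as $C$ in Equation (24) and as $s$ in Equation (25).

```python
import math

def adm_update(r_t, d_t, t_init, scale=1.0):
    """Return the next dynamic template as a list of HW rows with C channels.

    r_t, d_t, t_init: lists of HW rows, each row holding C channel values.
    """
    f_t = r_t + d_t                      # stacked pixels: 2HW rows (Eq. 23)
    channels = len(f_t[0])
    new_template = []
    for t_pixel in t_init:               # one column j of w per template pixel
        logits = [sum(a * b for a, b in zip(f_row, t_pixel)) / scale
                  for f_row in f_t]
        peak = max(logits)               # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        norm = sum(exps)
        w_col = [e / norm for e in exps] # softmax over the 2HW rows (Eq. 25)
        # Column j of f_t^T (x) w: weighted mix of all stacked pixels (Eq. 26).
        new_template.append([
            sum(w * f_row[c] for w, f_row in zip(w_col, f_t))
            for c in range(channels)
        ])
    return new_template

# Toy example: HW = 2 pixels, C = 2 channels.
r_t = [[1.0, 2.0], [1.0, 2.0]]
d_t = [[1.0, 2.0], [1.0, 2.0]]
t_init = [[0.5, 0.5], [0.5, 0.5]]
d_next = adm_update(r_t, d_t, t_init)
```

Each pixel of the new template is a convex combination of the pixels in the latest prediction and the current dynamic template, weighted by their similarity to the corresponding pixel of the initial template.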
Compared with maintaining a memory bank, a single dynamic template requires far less computation. Nevertheless, high-quality predictions in adjacent frames usually share high similarity, so updating repeatedly on these frames contributes little while still occupying computing resources. Therefore, the updating module is activated only every 10 frames. The high-quality prediction with the highest classification score during the past 10 frames is used as the input of ADM.
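The update schedule described above can be sketched as follows. The function name is hypothetical, the interval of 10 frames and the score threshold of 0.7 follow the implementation details, and treating a score equal to the threshold as high quality is an assumption.

```python
def select_update_frames(scores, interval=10, thres_c=0.7):
    """Pick, for each full window of `interval` frames, the best high-quality frame.

    scores: per-frame classification scores of the final predictions.
    Returns the frame indices whose predictions would be passed to ADM.
    """
    picks = []
    for start in range(0, len(scores) - interval + 1, interval):
        window = scores[start:start + interval]
        best = max(range(interval), key=lambda k: window[k])
        if window[best] >= thres_c:      # assumed: ">=" marks a high-quality result
            picks.append(start + best)
    return picks

# Toy example with three windows of 10 frames each.
scores = [0.1] * 9 + [0.9] + [0.5] * 10 + [0.8, 0.95] + [0.2] * 8
frames = select_update_frames(scores)
```

In this example the second window contains no high-quality prediction, so the dynamic template is left unchanged for that window.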

2.6. Overall Online Tracking Inference with Local Bias

During the online tracking procedure, two CAM modules are run. One takes the initial template and the proposals as input, while the other takes the dynamic template and the proposals as input. Assume the output scores for the $t$-th frame are $S_{init}^{t} \in \mathbb{R}^{N}$ and $S_{dyn}^{t} \in \mathbb{R}^{N}$, where $N$ is the number of proposals. The final classification score is the average of $S_{init}^{t}$ and $S_{dyn}^{t}$:
$$Cls^{t} = \left\{ \tfrac{1}{2} \left( S_{init}^{t}[i] + S_{dyn}^{t}[i] \right) \mid i = 1, 2, \ldots, N \right\}. \tag{27}$$
At the same time, a simple local bias is adopted during online tracking to make better use of temporal information. For each proposal, the distance between the predicted bounding box $B^{prop}$ and the last final prediction box $B^{last}$ is considered. Let the centers of $B^{prop}$ and $B^{last}$ be $C^{prop}$ and $C^{last}$, respectively. A distance score is defined as
$$Dis_{i}^{t} = \left( 1 + \exp\left( -\lVert C_{i}^{prop} - C_{i}^{last} \rVert_2 / r \right) \right) / 2, \quad i = 1, 2, \ldots, N, \tag{28}$$
where $i$ indexes the $i$-th proposal and $r$ is the side length of the square that covers the same area as $B_{i}^{last}$. By Equation (28), a score map is constructed to suppress proposals far away from the last prediction. The final score of a proposal is
$$Final^{t} = \left\{ Cls_{i}^{t} \times Dis_{i}^{t} \mid i = 1, 2, \ldots, N \right\}. \tag{29}$$
The proposal with the highest final score is deemed the target in the frame.
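The score fusion with local bias can be sketched as below. The function names and the box format `(x_min, y_min, x_max, y_max)` are illustrative assumptions, and the distance score is taken to decay with distance so that far-away proposals are suppressed, as the text describes.

```python
import math

def box_center(box):
    """Center of a box given as (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def final_scores(s_init, s_dyn, proposals, last_box):
    """Average the two CAM scores and weight them by the distance score."""
    cx, cy = box_center(last_box)
    # r: side of the square with the same area as the last predicted box.
    r = math.sqrt((last_box[2] - last_box[0]) * (last_box[3] - last_box[1]))
    scores = []
    for a, b, box in zip(s_init, s_dyn, proposals):
        cls = 0.5 * (a + b)
        px, py = box_center(box)
        dist = math.hypot(px - cx, py - cy)
        dis = (1.0 + math.exp(-dist / r)) / 2.0   # in (0.5, 1]
        scores.append(cls * dis)
    return scores

# Toy example: a nearby proposal with a lower score beats a distant one.
last_box = (0.0, 0.0, 10.0, 10.0)
proposals = [(0.0, 0.0, 10.0, 10.0), (100.0, 100.0, 110.0, 110.0)]
scores = final_scores([0.6, 0.8], [0.6, 0.8], proposals, last_box)
best = max(range(len(scores)), key=lambda k: scores[k])
```

Since the distance score never drops below 0.5, a distant proposal is damped rather than excluded, which leaves room for re-detection after the target reappears elsewhere.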

3. Results

The proposed CAMTracker is evaluated mainly on two datasets: the Anti-UAV410 dataset [20] for anti-UAV tracking tasks and the LSOTB-TIR dataset [62] for generalized evaluation. Quantitative and qualitative evaluations are conducted on CAMTracker and other state-of-the-art trackers. An ablation study is performed to verify the effectiveness of the critical modules in CAMTracker.

3.1. Dataset

3.1.1. Anti-UAV410

Anti-UAV410 contains 410 sequences and over 150,000 frames. The numbers of sequences in training, validation and test sets are 200, 90 and 120, respectively. The frame size of all sequences is 640 × 512. Sequences in Anti-UAV410 cover various scenarios, including different light conditions and diverse backgrounds. Meanwhile, the dataset covers multiple difficulties in anti-UAV tracking tasks. Therefore, six attributes of sequences are defined correspondingly, including out-of-view (OV), occlusion (OC), thermal crossover (TC), fast motion (FM), scale variation (SV) and dynamic background clutter (DBC).
Apart from the challenging scenarios, the sequences in the dataset are also labeled in accordance with the target sizes. Four scales are defined based on the length of the diagonal of the bounding boxes, including normal size (NS): [50, ∞), medium size (MS): [30, 50), small size (SS): [10, 30) and tiny size (TS): [2, 10).
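The size labels can be expressed as a small lookup on the bounding-box diagonal. The function name is hypothetical, and returning `None` for diagonals below 2 is an assumption, since the text does not define that range.

```python
import math

def size_category(width, height):
    """Map a bounding box to NS / MS / SS / TS by its diagonal length."""
    diagonal = math.hypot(width, height)
    if diagonal >= 50:
        return "NS"   # normal size: [50, inf)
    if diagonal >= 30:
        return "MS"   # medium size: [30, 50)
    if diagonal >= 10:
        return "SS"   # small size: [10, 30)
    if diagonal >= 2:
        return "TS"   # tiny size: [2, 10)
    return None       # below the smallest defined scale (assumed unlabeled)
```

For example, a 3 × 4 box has a diagonal of 5 and falls into the tiny-size group.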

3.1.2. LSOTB-TIR

LSOTB-TIR is a widely used TIR tracking dataset with 1400 sequences and more than 600,000 frames of various image sizes, of which 1280 sequences are for training and 120 sequences are for testing. LSOTB-TIR contains four scenarios defined by how the camera is mounted: surveillance camera, hand-held camera, drone-mounted camera and vehicle-mounted camera. A total of 47 and 22 target classes are included in the training set and test set of LSOTB-TIR, respectively.

3.2. Implementation Details

The proposed approach is implemented on WSL2 running Ubuntu 22.04.2, with an Intel i5-13600KF CPU and an NVIDIA GeForce RTX 4090 GPU. The algorithm is programmed in an environment of Python 3.7, PyTorch 1.13 and CUDA 11.7. ResNet-50 and a feature pyramid network (FPN) are employed as the backbone architecture. The interval of the ADM module is set to 10 frames. The score threshold for high-quality tracking results is set to 0.7.
For model training, CAMTracker is optimized by stochastic gradient descent (SGD). The momentum and weight decay are set to 0.9 and $1 \times 10^{-4}$, respectively. During training, each image pair is generated from the same sequence. The maximum distance between the two frames in an image pair is set to 300. The model is trained on Anti-UAV410 for 12 epochs, with an initial learning rate of 0.01. The learning rate is decayed by a factor of 0.1 at the beginning of the 8th and 11th epochs. The weights in Equations (11) and (22) are set as $\lambda_c = 1$, $\lambda_r = 1$ and $\lambda_{con} = 0.01$. The temperature factor in CAM is set to 0.1.
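The step schedule can be written as a small function. This sketch assumes that epochs are counted from 1 and that the rate is multiplied by 0.1 at each milestone; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(8, 11), gamma=0.1):
    """Learning rate used during `epoch` (1-indexed) under a step schedule."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= gamma   # decay once for every milestone already reached
    return lr

# Learning rate for each of the 12 training epochs.
schedule = [learning_rate(e) for e in range(1, 13)]
```

Under these assumptions the rate is 0.01 for epochs 1–7, 0.001 for epochs 8–10 and 0.0001 for epochs 11–12.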

3.3. Evaluation Metrics

Following [74], one-pass evaluation (OPE) is adopted to evaluate the performance of the proposed CAMTracker. The target position is given in the first frame, and the algorithm then tracks the target in subsequent frames. The evaluation metrics include the precision rate ($P$), success rate ($S_r$) and success AUC ($S_{auc}$) from [74] and the state accuracy ($SA$) from [68].

3.3.1. Precision Rate

As one of the most popular metrics in the field of object tracking, the precision rate assesses the tracking accuracy through the center location error (CLE) between the predicted bounding box $Box_{pred}$ and the ground truth bounding box $Box_{gt}$. The formula for CLE is
$$CLE = \lVert C_{pred} - C_{gt} \rVert_2, \tag{30}$$
where $C_{pred}$ and $C_{gt}$ indicate the centers of the predicted bounding box and the ground truth, respectively. If the CLE is less than a threshold, the frame is judged as successfully tracked; otherwise, it is considered a failure. Precision is the proportion of successfully tracked frames, which can be written as
$$P = \frac{1}{N} \sum_{t=1}^{N} \epsilon(CLE < thres_p), \tag{31}$$
where $N$ is the number of frames in all sequences, CLE is the center distance between the ground truth and the predicted target in the $t$-th frame, and $thres_p$ denotes the threshold. $\epsilon$ is an indicator function that returns 1 when the input is true and 0 otherwise. In the experiments, the threshold is conventionally set to 20 pixels.
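A minimal sketch of the precision rate in Equations (30) and (31), with hypothetical function and argument names and centers given as `(x, y)` pixel coordinates:

```python
import math

def precision_rate(pred_centers, gt_centers, thres_p=20.0):
    """Fraction of frames whose center location error is below thres_p pixels."""
    hits = 0
    for (px, py), (gx, gy) in zip(pred_centers, gt_centers):
        cle = math.hypot(px - gx, py - gy)   # Equation (30)
        if cle < thres_p:
            hits += 1
    return hits / len(gt_centers)

# Toy example: errors of 0, 25, 30 and 5 pixels.
pred = [(0, 0), (0, 0), (30, 0), (3, 4)]
gt = [(0, 0), (0, 25), (0, 0), (0, 0)]
p = precision_rate(pred, gt)
```

Two of the four frames fall within the 20-pixel threshold, so the precision rate of this toy example is 0.5.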

3.3.2. Success Rate

Although the precision rate gives a summary of the tracking results, the distance between the centers of bounding boxes cannot reflect the tracking performance comprehensively. It is also important that the predicted bounding boxes cover the right regions. The success rate is defined as the proportion of frames in which the Intersection over Union (IoU) between $Box_{pred}$ and $Box_{gt}$ is larger than a threshold:
$$S_r = \frac{1}{N} \sum_{t=1}^{N} \epsilon(IoU_t > thres_s), \tag{32}$$
where $thres_s$ indicates the threshold, and $IoU_t$ denotes the IoU between the ground truth and the predicted bounding box in the $t$-th frame. During the experiments, $thres_s$ is conventionally set to 0.5.
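The success rate of Equation (32) can be sketched as follows, assuming boxes are given as `(x_min, y_min, x_max, y_max)`; the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes, thres_s=0.5):
    """Fraction of frames whose IoU exceeds thres_s."""
    hits = sum(1 for p, g in zip(pred_boxes, gt_boxes) if iou(p, g) > thres_s)
    return hits / len(gt_boxes)

# Toy example: one perfect prediction and one shifted by half a box width.
gt = [(0, 0, 10, 10), (0, 0, 10, 10)]
pred = [(0, 0, 10, 10), (5, 0, 15, 10)]
s_r = success_rate(pred, gt)
```

The shifted box has an IoU of 1/3, below the 0.5 threshold, so only the first frame counts as a success.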

3.3.3. Success AUC

Using one particular threshold for the precision rate and success rate may not be representative enough. Therefore, the precision plot and success plot are adopted to show the precision rate and success rate of the tracker at different thresholds. Because the area under the curve (AUC) of the success plot is bounded, it is used as the success AUC to evaluate the performance more comprehensively.
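One common way to approximate the success AUC is to average the success rate over evenly spaced IoU thresholds. The sketch below assumes 21 thresholds between 0 and 1, which is a widespread convention but is not stated in the text; the function name is hypothetical.

```python
def success_auc(ious, num_thresholds=21):
    """Average success rate over evenly spaced IoU thresholds in [0, 1]."""
    thresholds = [k / (num_thresholds - 1) for k in range(num_thresholds)]
    rates = [sum(1 for v in ious if v > t) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)

# Toy example: per-frame IoU values of a short sequence.
auc = success_auc([0.9, 0.7, 0.2, 0.0])
```

Unlike the success rate at a single threshold, this score rewards tighter boxes even when every frame already exceeds an IoU of 0.5.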

3.3.4. State Accuracy

In anti-UAV tasks, targets are often out of view or occluded. Hence, the presence of the drone targets should be taken into consideration. The state accuracy proposed by [68] introduces the visibility of the targets into the evaluation metrics, and can be calculated by
$$SA = \frac{1}{N} \sum_{t=1}^{N} \left( IoU_t \times \delta(v_t > 0) + p_t \times \left( 1 - \delta(v_t > 0) \right) \right), \tag{33}$$
where $\delta(\cdot)$ denotes an indicator function that outputs 1 when the input is true and 0 otherwise. $v_t$ represents the visibility flag in the ground truth, while $p_t$ is the predicted visibility result: $p_t = 1$ when the prediction is empty, and $p_t = 0$ otherwise. State accuracy thus measures the tracking performance on visible targets while also evaluating the ability to judge whether the target is present.
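Equation (33) can be sketched directly. The function and argument names are hypothetical: `visible` holds the ground-truth visibility flags $v_t$ and `pred_empty` marks the frames in which the tracker reports no target.

```python
def state_accuracy(ious, visible, pred_empty):
    """Mean of IoU on visible frames and of correct 'absent' calls on the rest."""
    total = 0.0
    for iou_t, v_t, empty in zip(ious, visible, pred_empty):
        if v_t > 0:
            total += iou_t                 # target visible: reward overlap
        else:
            total += 1.0 if empty else 0.0 # target absent: reward an empty prediction
    return total / len(ious)

# Toy example: frames 2 and 3 have no visible target.
sa = state_accuracy(
    ious=[0.8, 0.0, 0.0, 0.6],
    visible=[1, 0, 0, 1],
    pred_empty=[False, True, False, False],
)
```

Here the tracker correctly reports an empty prediction in only one of the two frames without a target, giving a state accuracy of (0.8 + 1 + 0 + 0.6) / 4 = 0.6.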

3.4. Quantitative Evaluation

To comprehensively evaluate the performance of the proposed CAMTracker, a comparison with 21 state-of-the-art trackers is performed, among which there are five correlation filter-based trackers, including DSST [27], Staple [26], SRDCF [28], KCF [24] and ECO [75], and 16 deep learning-based trackers, including GlobalTrack [49], SiamRCNN [50], AiATrack [76], KeepTrack [77], MixformerV2-B [78], Stark-ST101 [52], SwinTrack-Base, SwinTrack-Tiny [54], DiMP50 [79], PrDiMP50 [80], ToMP50, ToMP101 [81], ATOM [82], SiamBAN [48], SiamCAR [47], TCTrack [83], SeqTrack [56], ROMTrack [84] and AQATrack [59]. During the experiments, the parameters of the compared trackers are kept at their default values.
The overall performance of CAMTracker, along with that of the other trackers, is reported in Table 1, where the best, second-best and third-best results for each metric are highlighted in red, green and blue, respectively. Generally speaking, correlation filter-based trackers struggle in anti-UAV tasks compared to deep learning-based trackers. Among deep learning-based trackers, GIS trackers, including GlobalTrack, SiamRCNN and CAMTracker, clearly outperform the others. Some trackers that take the whole frame as input, such as AiATrack and KeepTrack, also perform well in the task. In addition, CAMTracker reaches the best quantitative performance on all the metrics, surpassing other trackers by at least 2.40%, 4.12%, 5.43% and 5.48% on precision, success rate, success AUC and state accuracy, respectively. Apart from the statistics, Figure 5 illustrates the precision curve and success curve. As the plots show, CAMTracker keeps leading across thresholds in both the precision and success curves, which further supports the effectiveness of the proposed tracker.

3.5. Attribute-Based Evaluation

To comprehensively depict the effectiveness and robustness of CAMTracker, the performance of the proposed method on sequences with different attributes is analyzed.
The precision and success curves of the trackers under various situations are presented in Figure 6. CAMTracker exceeds the others in all the scenarios. Especially under occlusion (OC) and dynamic background clutter (DBC) scenes, CAMTracker obtains great advantages over the other trackers. In scale variation (SV) and out-of-view (OV) scenarios, CAMTracker slightly leads GlobalTrack, and both of them outperform the others with relatively big gaps. Comprehensively, the experiment results show the ability and superiority of CAMTracker on tracking drones under diverse challenges in TIR videos.
Besides different scenarios, experiments are conducted to assess the performance of the trackers on the different target sizes defined in Section 3.1.1. The results are presented in Figure 7 and Figure 8.
According to Figure 7 and Figure 8, it can be inferred that (1) as the target size decreases, the overall performance of the trackers declines as well; (2) when the target is of normal size, CAMTracker achieves only a small lead; (3) a big advantage is obtained by CAMTracker in tracking tiny targets, which is 7.20% in precision and 7.70% in success AUC; and (4) the smaller the target size is, the better CAMTracker performs compared with other trackers. These experiments demonstrate the versatility of CAMTracker in confronting heterogeneous challenges, whether in scenarios or target sizes.

3.6. Qualitative Evaluation

Besides the aforementioned quantitative evaluation and attribute evaluation, CAMTracker is also evaluated qualitatively by presenting visual results on particular sequences, aiming to show the performance of CAMTracker in an intuitive way. In this part, six trackers are selected for comparison, including ECO, SiamRCNN, GlobalTrack, Stark-ST101, SwinTrack-Base and MixformerV2-B.
Figure 9 visualizes tracking results on some challenging sequences. The results of the first sequence show the ability of CAMTracker when facing thermal crossover. As the drone remains in front of the building, other trackers gradually lose the target, while the prediction from CAMTracker stays on the ground truth. The second row depicts a typical occlusion scene. When the target becomes obstructed by the tree, all the trackers fail to capture the drone. CAMTracker, along with GlobalTrack, re-acquires the target in the first frame in which it reappears. The third sequence illustrates the fast motion situation caused by camera movement. Fast motion also leads to a blurred target. Most trackers cannot follow the target during the fast motion stage, while CAMTracker succeeds in every frame during the camera movement, even when the target is blurred. The targets in the fourth and fifth rows are both of tiny scale. The fourth and fifth sequences show the tracking results under a cluttered background and with the target out of view, respectively. Unlike other trackers, CAMTracker achieves much better performance under these two scenarios. The visualized results further support the capability and superiority of CAMTracker in handling challenging situations in TIR anti-UAV tracking tasks.

3.7. Ablation Study

An ablation study on critical modules is conducted on the Anti-UAV410 dataset to verify the effectiveness of the proposed components in CAMTracker. The results of the ablation study are presented in Table 2, where the best performance on each metric is highlighted in bold. GlobalTrack is the baseline used to analyze the usefulness of the proposed CAM and ADM. In addition, when CAM is applied, the performance of the tracker is evaluated with both the common contrastive loss and the refined one.
From Table 2, it can be inferred that both CAM and ADM have a positive impact on the tracker. The utilization of ADM improves the success AUC by 4.08% and state accuracy by 4.11%, which is a substantial improvement. CAM also enhances the performance of the tracker by more than 1.8% on success AUC and state accuracy. In particular, employing the refined contrastive loss yields about a 0.9% improvement on success AUC and state accuracy. Combined, CAM, ADM and the refined contrastive loss contribute 2.40%, 4.12%, 5.43% and 5.48% improvements over the baseline on precision, success rate, success AUC and state accuracy, respectively.

3.8. Generalization Evaluation

To evaluate the generalization of the proposed CAMTracker to generalized TIR tracking tasks, experiments are conducted on the LSOTB-TIR dataset. There are 16 trackers applied in the comparison, including DSST [27], CFNet [85], UDT [86], SRDCF [28], BACF [30], SiamFC [44], SiamMask [87], SiamRPN++ [46], ATOM [82], ECO [75], MDNet [88], ECO-stir [89], ECO-TIR [62], AQATrack [59], SeqTrack [56] and ROMTrack [84]. The precision plot and success plot of the trackers on the test set of LSOTB-TIR are shown in Figure 10. The proposed CAMTracker scores 71.0% on precision rate and 58.7% on success AUC. The results show that CAMTracker performs at an average level on generalized TIR tracking tasks. ROMTrack, SeqTrack and AQATrack achieve much better performance than other trackers, which shows the ability of transformer-based trackers in tracking generalized objects. CAMTracker trails the top tracker by 13.4% in precision rate and 13.2% in success AUC, which shows that further study is needed.

3.9. Evaluation in Scenarios with Close Distractors

During the tracking procedure, it is important to handle the interference from distractors, especially distractors close to the target. To assess the ability of CAMTracker to handle close distractors, qualitative experiments on several representative sequences in the LSOTB-TIR dataset are conducted. The tracking results are shown in Figure 11.
In the first sequence, when the target car passes the first close distractor, CAMTracker is not disturbed. At frame 49, the top half of the image is corrupted, and the tracker takes a nearby car as the target. But at frame 53, the tracker finds the target again.
In the second sequence, there are two birds. At frame 50, the distractor comes close to the target and the tracker is affected. The tracker then recovers the target until the distractor approaches again at frame 62, after which the tracker is again able to re-detect the target.
In the third sequence, the distractor looks almost the same as the target. During the tracking procedure, the tracker is confused several times and cannot successfully predict the target.
The qualitative results indicate that the proposed CAMTracker is able to distinguish some of the close distractors by means of the local bias and the discriminative features. The cars in the first sequence are similar but not identical, so the tracker can complete the tracking task successfully. In the second sequence, when the tracker is distracted, the distractor becomes blurred due to the quick motion, and the tracker fortuitously re-detects the target. In the third sequence, the target and the distractor share the same appearance. A slight change in posture distracts the tracker, leading to tracking failure.

4. Discussion

As the experiments in Section 3 show, the proposed CAMTracker outperforms the state-of-the-art trackers in TIR anti-UAV tracking tasks, while achieving only average performance in general TIR tracking tasks. According to Table 1, CAMTracker achieves the best performance on all four metrics, and leads the second-best tracker by 2.40%, 4.12%, 5.43% and 5.48% on precision, success rate, success AUC and state accuracy, respectively. The high scores in the precision rate show the ability of CAMTracker to locate the target, while the advantage in the success rate and success AUC shows the capability of CAMTracker to delineate the target precisely. The solid lead in the SA score indicates that CAMTracker performs much better than other trackers when the target cannot be seen, such as in occlusion or out-of-view scenarios.
The attribute-based comparison in Figure 6 shows CAMTracker has a great advantage in scenarios with occlusion, leading by 8.2% in precision rate and 7.8% in success AUC. The performance of CAMTracker on out-of-view sequences leads the second-best tracker by 2.3% in precision rate and 4.1% in success AUC, while surpassing the third-best tracker by 10.3% in precision and 9.8% in success AUC. The results show the advantage of the GIS strategy applied in CAMTracker, which places more emphasis on re-detection during tracking and remains alert to possible target loss. Compared to other GIS-based trackers, CAMTracker outperforms GlobalTrack in thermal crossover sequences by 2.8% in precision rate and 5.5% in success AUC, which verifies the effectiveness of the CAM module in capturing the discriminative information of drone targets. In dynamic background clutter sequences, CAMTracker has an advantage of 7.9% in precision rate and 5.4% in success AUC. This result supports the effectiveness of the ADM module in long-term tracking tasks.
As shown in Figure 7 and Figure 8, CAMTracker achieves the best performance on targets of all sizes. Specifically, CAMTracker has only a small advantage on normal-sized targets: 0.3% in precision rate and 1.3% in success rate. As the target size decreases, the advantage of CAMTracker increases. In tiny target sequences, CAMTracker outperforms the second-best tracker by 7.1% in precision rate and 7.7% in success rate. The results expose the weakness of general-purpose trackers when facing small, specific targets. The good performance of CAMTracker in tracking tiny targets further supports the ability of the CAM module to extract the discriminative features of small targets.
Even though the proposed CAMTracker has achieved considerable performance in TIR anti-UAV tracking tasks, some challenges in anti-UAV tracking remain open. Figure 12 shows two typical challenging sequences that remain to be solved. The first row is from sequence “20190925_124000_1_7” in the Anti-UAV410 dataset. The extreme thermal crossover causes the failure. In addition, the camera is not focused on the target, which blurs the edges of the target. It is hard even for human eyes to recognize the target. The second row is from sequence “3700000000002_133828_2”. The main cause of failure in this sequence is the cluttered background, in addition to the tiny size of the target. It is easy for the target to become submerged in the background. Therefore, CAMTracker needs to be improved when facing these kinds of scenarios.
Besides the challenging sequences in anti-UAV tasks, the tracking performance of CAMTracker in generalized scenarios leaves room for improvement. According to Figure 10, CAMTracker performs at an ordinary level compared to other trackers. Unlike in anti-UAV tracking tasks, targets in generalized tracking tasks usually move smoothly and are seldom invisible in the sequences. Therefore, trackers that focus on local regions and spatio-temporal information usually perform better in these scenarios. CAMTracker places more emphasis on a global view, which introduces more distractors and noise in generalized tracking tasks. In addition, insufficient learning of spatio-temporal constraints in CAMTracker results in a heavy dependence on the target appearance, which restricts the performance in generalized scenarios, especially crowded scenes. Therefore, the balance between local and global regions and the ability to capture spatio-temporal constraints are the main areas in which the proposed CAMTracker needs to improve.

5. Conclusions

In this article, a contrastive-augmented memory network (CAMTracker) with remarkable performance in long-term anti-UAV tracking tasks in thermal infrared (TIR) videos is proposed. In CAMTracker, a two-stage tracking framework is built to handle camera movement, out-of-view and fast motion situations, where the first stage coarsely searches possible candidate proposals through an instance-guided RPN (IG-RPN). Subsequently, in the second stage, to improve the tracker’s discrimination ability when facing distractors and thermal crossover, a contrastive learning strategy and a refined contrastive loss are incorporated into a matching module, namely, a contrastive-augmented matching module. Furthermore, to mitigate model degradation caused by target appearance variation and dynamic backgrounds, a dynamic template is maintained by adaptively combining previous predictions to assist the instance matching in long-term tracking tasks.
Comprehensive experiments have been conducted on the Anti-UAV410 dataset and demonstrated the robustness and effectiveness of the proposed CAMTracker, which achieves 88.56% on precision, 84.87% on success rate, 66.68% on success AUC and 67.10% on state accuracy. The performance surpasses that of most state-of-the-art tracking algorithms by at least 2.40%, 4.12%, 5.43% and 5.48% on precision, success rate, success AUC and state accuracy, respectively, which proves the superiority of the proposed CAMTracker.
Although it achieves promising performance on anti-UAV tracking tasks, CAMTracker still has two main limitations. One is that, as a GIS-based tracker, CAMTracker tends to focus on the global view, therefore introducing more distractors and noise. When facing crowds of similar objects, it is more difficult to distinguish the target. The other limitation is the lack of understanding of spatio-temporal constraints. CAMTracker emphasizes re-detection to counter challenges in anti-UAV tasks, resulting in inadequate utilization of spatio-temporal information, which affects the tracking performance in generalized scenarios.
In future studies, approaches to switching trackers between global and local search can be introduced. Moreover, spatio-temporal strategies can be integrated into the tracker to exploit the information in sequences more comprehensively.

Author Contributions

Conceptualization, Z.W. and Y.H.; methodology, Z.W., J.Y. and Y.L.; resources, G.Z. and F.L.; writing—original draft preparation, Z.W.; supervision, Y.H. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Anti-UAV410 dataset is available at https://github.com/HwangBo94/Anti-UAV410 (accessed on 20 December 2024). LSOTB-TIR dataset is available at https://github.com/QiaoLiuHit/LSOTB-TIR (accessed on 20 December 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fan, J.; Yang, X.; Lu, R.; Xie, X.; Li, W. Design and implementation of intelligent inspection and alarm flight system for epidemic prevention. Drones 2021, 5, 68.
  2. Filkin, T.; Sliusar, N.; Ritzkowski, M.; Huber-Humer, M. Unmanned aerial vehicles for operational monitoring of landfills. Drones 2021, 5, 125.
  3. McEnroe, P.; Wang, S.; Liyanage, M. A survey on the convergence of edge computing and AI for UAVs: Opportunities and challenges. IEEE Internet Things J. 2022, 9, 15435–15459.
  4. Wang, Z.; Cao, Z.; Xie, J.; Zhang, W.; He, Z. RF-based Drone Detection Enhancement via a Generalized Denoising and Interference-removal Framework. IEEE Signal Process. Lett. 2024, 31, 929–933.
  5. Zhou, T.; Xin, B.; Zheng, J.; Zhang, G.; Wang, B. Vehicle Detection Based on YOLOv7 for Drone Aerial Visible and Infrared Images. In Proceedings of the 2024 6th International Conference on Image Processing and Machine Vision, New York, NY, USA, 12–14 January 2024; pp. 30–35.
  6. Wang, B.; Li, Q.; Mao, Q.; Wang, J.; Chen, C.P.; Shangguan, A.; Zhang, H. A Survey on Vision-Based Anti Unmanned Aerial Vehicles Methods. Drones 2024, 8, 518.
  7. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for uav tracking. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 445–461.
  8. Huang, B.; Xu, T.; Jiang, S.; Chen, Y.; Bai, Y. Robust visual tracking via constrained multi-kernel correlation filters. IEEE Trans. Multimed. 2020, 22, 2820–2832.
  9. Jafferis, N.T.; Helbling, E.F.; Karpelson, M.; Wood, R.J. Untethered flight of an insect-sized flapping-wing microscale aerial vehicle. Nature 2019, 570, 491–495.
  10. Cliff, O.M.; Saunders, D.L.; Fitch, R. Robotic ecology: Tracking small dynamic animals with an autonomous aerial vehicle. Sci. Robot. 2018, 3, eaat8409.
  11. Svanström, F.; Alonso-Fernandez, F.; Englund, C. Drone detection and tracking in real-time by fusion of different sensing modalities. Drones 2022, 6, 317.
  12. Li, Y.; Fu, M.; Sun, H.; Deng, Z.; Zhang, Y. Radar-based UAV swarm surveillance based on a two-stage wave path difference estimation method. IEEE Sens. J. 2022, 22, 4268–4280.
  13. Sun, Y.; Li, J.; Wang, L.; Xv, J.; Liu, Y. Deep Learning-based drone acoustic event detection system for microphone arrays. Multimed. Tools Appl. 2024, 83, 47865–47887.
  14. Yu, Q.; Ma, Y.; He, J.; Yang, D.; Zhang, T. A unified transformer based tracker for anti-uav tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 3036–3046.
  15. Elsayed, M.; Reda, M.; Mashaly, A.S.; Amein, A.S. LERFNet: An enlarged effective receptive field backbone network for enhancing visual drone detection. Vis. Comput. 2024, 1–14.
  16. Yuan, D.; Zhang, H.; Shu, X.; Liu, Q.; Chang, X.; He, Z.; Shi, G. Thermal infrared target tracking: A comprehensive review. IEEE Trans. Instrum. Meas. 2023, 73, 5000419.
  17. Marvasti-Zadeh, S.M.; Cheng, L.; Ghanei-Yakhdan, H.; Kasaei, S. Deep learning for visual tracking: A comprehensive survey. IEEE Trans. Intell. Transp. Syst. 2021, 23, 3943–3968.
  18. Gao, Z.; Li, D.; Wen, G.; Kuai, Y.; Chen, R. Drone based RGBT tracking with dual-feature aggregation network. Drones 2023, 7, 585.
  19. Zhang, F.; Peng, H.; Yu, L.; Zhao, Y.; Chen, B. Dual-modality space-time memory network for RGBT tracking. IEEE Trans. Instrum. Meas. 2023, 72, 1–12.
  20. Huang, B.; Li, J.; Chen, J.; Wang, G.; Zhao, J.; Xu, T. Anti-UAV410: A Thermal Infrared Benchmark and Customized Scheme for Tracking Drones in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 2852–2865.
  21. Kumar, A.; Vohra, R.; Jain, R.; Li, M.; Gan, C.; Jain, D.K. Correlation filter based single object tracking: A review. Inf. Fusion 2024, 112, 102562.
  22. Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2544–2550.
  23. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Proceedings, Part IV 12. Springer: Berlin/Heidelberg, Germany, 2012; pp. 702–715.
  24. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596.
  25. Tang, M.; Feng, J. Multi-kernel correlation filter for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 3038–3046.
  26. Bertinetto, L.; Valmadre, J.; Golodetz, S.; Miksik, O.; Torr, P.H. Staple: Complementary learners for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1401–1409.
  27. Danelljan, M.; Häger, G.; Khan, F.; Felsberg, M. Accurate scale estimation for robust visual tracking. In Proceedings of the British Machine Vision Conference, Nottingham, UK, 1–5 September 2014; BMVA Press: Durham, UK, 2014.
  28. Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 4310–4318.
  29. Li, F.; Tian, C.; Zuo, W.; Zhang, L.; Yang, M.H. Learning spatial-temporal regularized correlation filters for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4904–4913.
  30. Kiani Galoogahi, H.; Fagg, A.; Lucey, S. Learning background-aware correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1135–1143.
  31. Dai, K.; Wang, D.; Lu, H.; Sun, C.; Li, J. Visual tracking via adaptive spatially-regularized correlation filters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4670–4679.
  32. Tang, F.; Ling, Q. Spatial-aware correlation filters with adaptive weight maps for visual tracking. Neurocomputing 2019, 358, 369–384.
  33. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
  34. Chatfield, K. Return of the devil in the details: Delving deep into convolutional nets. arXiv 2014, arXiv:1405.3531.
  35. Simonyan, K. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  36. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
  37. Kristan, M.; Pflugfelder, R.; Leonardis, A.; Matas, J.; Porikli, F.; Cehovin, L.; Nebehay, G.; Fernandez, G.; Vojir, T. The vot2013 Challenge: Overview and Additional Results. In Proceedings of the Computer Vision Winter Workshop, Křtiny, Czech Republic, 3–5 February 2014; pp. 59–66.
  38. Kristan, M.; Matas, J.; Leonardis, A.; Felsberg, M.; Cehovin, L.; Fernandez, G.; Vojir, T.; Hager, G.; Nebehay, G.; Pflugfelder, R. The visual object tracking vot2015 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Washington, DC, USA, 7–13 December 2015; pp. 1–23.
  39. Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Cehovin Zajc, L.; Vojir, T.; Bhat, G.; Lukezic, A.; Eldesokey, A.; et al. The sixth visual object tracking vot2018 challenge results. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018.
  40. Liu, P.; Liu, C.; Zhao, W.; Tang, X. Multi-level context-adaptive correlation tracking. Pattern Recognit. 2019, 87, 216–225.
  41. Zhang, J.; Jin, X.; Sun, J.; Wang, J.; Sangaiah, A.K. Spatial and semantic convolutional features for robust visual object tracking. Multimed. Tools Appl. 2020, 79, 15095–15115.
  42. Zhu, X.F.; Wu, X.J.; Xu, T.; Feng, Z.H.; Kittler, J. Robust visual object tracking via adaptive attribute-aware discriminative correlation filters. IEEE Trans. Multimed. 2021, 24, 301–312.
  43. Tao, R.; Gavves, E.; Smeulders, A.W. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1420–1429.
  44. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Proceedings, Part II 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 850–865.
  45. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8971–8980.
  46. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J.S. Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 15–20.
  47. Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6269–6277.
  48. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6668–6677.
  49. Huang, L.; Zhao, X.; Huang, K. Globaltrack: A simple and strong baseline for long-term tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11037–11044.
  50. Voigtlaender, P.; Luiten, J.; Torr, P.H.; Leibe, B. Siam r-cnn: Visual tracking by re-detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6578–6588.
  51. Vaswani, A. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
  52. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10448–10457.
  53. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 8126–8135.
  54. Lin, L.; Fan, H.; Zhang, Z.; Xu, Y.; Ling, H. Swintrack: A simple and strong baseline for transformer tracking. Adv. Neural Inf. Process. Syst. 2022, 35, 16743–16754.
  55. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 341–357.
  56. Chen, X.; Peng, H.; Wang, D.; Lu, H.; Hu, H. Seqtrack: Sequence to sequence learning for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14572–14581.
  57. Wei, X.; Bai, Y.; Zheng, Y.; Shi, D.; Gong, Y. Autoregressive visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 9697–9706.
  58. Bai, Y.; Zhao, Z.; Gong, Y.; Wei, X. Artrackv2: Prompting autoregressive tracker where to look and how to describe. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 19048–19057.
  59. Xie, J.; Zhong, B.; Mo, Z.; Zhang, S.; Shi, L.; Song, S.; Ji, R. Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 19300–19309.
  60. Liu, Q.; Lu, X.; He, Z.; Zhang, C.; Chen, W.S. Deep convolutional neural networks for thermal infrared object tracking. Knowl.-Based Syst. 2017, 134, 189–198.
  61. Liu, Q.; He, Z.; Li, X.; Zheng, Y. PTB-TIR: A thermal infrared pedestrian tracking benchmark. IEEE Trans. Multimed. 2019, 22, 666–675.
  62. Liu, Q.; Li, X.; He, Z.; Li, C.; Li, J.; Zhou, Z.; Yuan, D.; Li, J.; Yang, K.; Fan, N.; et al. LSOTB-TIR: A large-scale high-diversity thermal infrared object tracking benchmark. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 3847–3856.
  63. Li, X.; Liu, Q.; Fan, N.; He, Z.; Wang, H. Hierarchical Spatial-aware Siamese Network for Thermal Infrared Object Tracking. Knowl.-Based Syst. 2019, 166, 71–81.
  64. Liu, Q.; Li, X.; He, Z.; Fan, N.; Yuan, D.; Wang, H. Learning Deep Multi-Level Similarity for Thermal Infrared Object Tracking. IEEE Trans. Multimed. 2019, 23, 2114–2126.
  65. Liu, Q.; Yuan, D.; Fan, N.; Gao, P.; Li, X.; He, Z. Learning Dual-Level Deep Representation for Thermal Infrared Tracking. IEEE Trans. Multimed. 2023, 25, 1269–1281.
  66. Yuan, D.; Shu, X.; Liu, Q.; He, Z. Aligned Spatial-Temporal Memory Network for Thermal Infrared Target Tracking. IEEE Trans. Circuits Syst. II Express Briefs 2023, 70, 1224–1228.
  67. Huang, B.; Dou, Z.; Chen, J.; Li, J.; Shen, N.; Wang, Y.; Xu, T. Searching Region-Free and Template-Free Siamese Network for Tracking Drones in TIR Videos. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5000315.
  68. Jiang, N.; Wang, K.; Peng, X.; Yu, X.; Wang, Q.; Xing, J.; Li, G.; Ye, Q.; Jiao, J.; Han, Z.; et al. Anti-UAV: A large-scale benchmark for vision-based UAV tracking. IEEE Trans. Multimed. 2021, 25, 486–500.
  69. Huang, B.; Chen, J.; Xu, T.; Wang, Y.; Jiang, S.; Wang, Y.; Wang, L.; Li, J. SiamSTA: Spatio-temporal attention based Siamese tracker for tracking UAVs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 12–17 October 2021; pp. 1204–1212.
  70. Shi, X.; Zhang, Y.; Shi, Z.; Zhang, Y. Gasiam: Graph attention based siamese tracker for infrared anti-uav. In Proceedings of the 2022 3rd International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA), Changchun, China, 20–22 May 2022; pp. 986–993.
  71. Hu, H.; Wang, X.; Zhang, Y.; Chen, Q.; Guan, Q. A comprehensive survey on contrastive learning. Neurocomputing 2024, 610, 128645.
  72. Girshick, R. Fast r-cnn. arXiv 2015, arXiv:1504.08083.
  73. Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
  74. Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2411–2418.
  75. Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017; pp. 6638–6646.
  76. Gao, S.; Zhou, C.; Ma, C.; Wang, X.; Yuan, J. Aiatrack: Attention in attention for transformer visual tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 146–164.
  77. Mayer, C.; Danelljan, M.; Paudel, D.P.; Van Gool, L. Learning target candidate association to keep track of what not to track. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 13444–13454.
  78. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 13608–13618.
  79. Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191.
  80. Chen, S.; Qiu, C.; Huang, Y.; Zhang, Z. Robust Probabilistic Discriminative Model Prediction Tracker via Improved Model Update Strategy. 2021. Available online: https://www.semanticscholar.org/paper/Robust-Probabilistic-Discriminative-Model-Tracker-Chen-Qiu/2c1e7fc5edb772ab51583ca5c5a66a87b4060986 (accessed on 20 December 2024).
  81. Mayer, C.; Danelljan, M.; Bhat, G.; Paul, M.; Paudel, D.P.; Yu, F.; Van Gool, L. Transforming model prediction for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 8731–8740.
  82. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4660–4669.
  83. Cao, Z.; Huang, Z.; Pan, L.; Zhang, S.; Liu, Z.; Fu, C. TCTrack: Temporal contexts for aerial tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 14798–14808.
  84. Cai, Y.; Liu, J.; Tang, J.; Wu, G. Robust object modeling for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 9589–9600.
  85. Valmadre, J.; Bertinetto, L.; Henriques, J.; Vedaldi, A.; Torr, P.H. End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017; pp. 2805–2813.
  86. Wang, N.; Song, Y.; Ma, C.; Zhou, W.; Liu, W.; Li, H. Unsupervised deep tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1308–1317.
  87. Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; Torr, P.H. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1328–1338.
  88. Nam, H.; Han, B. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4293–4302.
  89. Zhang, L.; Gonzalez-Garcia, A.; Van De Weijer, J.; Danelljan, M.; Khan, F.S. Synthetic data generation for end-to-end thermal infrared tracking. IEEE Trans. Image Process. 2018, 28, 1837–1850.
Figure 1. The main architecture of CAMTracker. The tracker contains four main parts: a pair of backbones, an instance-guided region proposal network (IG-RPN), a contrastive-augmented matching module (CAM) and an adaptive dynamic memory module (ADM).
Figure 2. The structure of the IG-RPN, which contains a correlation encoder and an RPN head to generate possible proposals.
Figure 3. Illustration of CAM. The module is mainly composed of a regular branch and an embedding branch. In the figure, GAP, MLP and Norm denote global average pooling, multi-layer perceptron and normalization, respectively.
Figure 4. The demonstration of ADM.
Figure 5. The overall precision plot (a) and success plot (b) of CAMTracker and other compared trackers on the test set of Anti-UAV410.
Figure 6. Attribute-based comparisons of CAMTracker and other trackers on Anti-UAV410. The attributes include fast motion (FM), occlusion (OC), out-of-view (OV), scale variation (SV), thermal crossover (TC) and dynamic background clutter (DBC). Among the subplots, (a–f) are precision plots, and (g–l) are success plots. The numbers in the legend indicate the precision scores or success AUC scores of the corresponding trackers.
Figure 7. Target size-based comparisons on precision plots of CAMTracker and other trackers on Anti-UAV410. The sizes include normal size (a), medium size (b), small size (c) and tiny size (d).
Figure 8. Target size-based comparisons on success plots of CAMTracker and other trackers on Anti-UAV410. The sizes include normal size (a), medium size (b), small size (c) and tiny size (d).
Figure 9. Qualitative evaluation on some challenging sequences.
Figure 10. The precision plot (a) and success plot (b) of CAMTracker and other compared trackers on the test set of LSOTB-TIR.
Figure 11. Qualitative evaluation on sequences containing close distractors.
Figure 12. Failure cases on some challenging sequences.
Table 1. Overall performance of CAMTracker and other trackers on the test set of Anti-UAV410.

| Methods | P (%) | S_r (%) | S_auc (%) | SA (%) |
|---|---|---|---|---|
| DSST [27] | 44.40 | 37.60 | 29.00 | 29.09 |
| Staple [26] | 38.01 | 30.17 | 24.16 | 24.25 |
| SRDCF [28] | 54.04 | 45.15 | 34.96 | 35.17 |
| KCF [24] | 41.43 | 34.89 | 26.57 | 26.64 |
| ECO [75] | 62.59 | 55.35 | 42.61 | 42.88 |
| ATOM [82] | 70.13 | 65.21 | 51.02 | 51.37 |
| SiamBAN [48] | 67.34 | 62.31 | 47.01 | 47.32 |
| SiamCAR [47] | 64.74 | 60.45 | 46.63 | 46.93 |
| GlobalTrack [49] | 86.16 | 80.75 | 61.25 | 61.62 |
| SiamRCNN [50] | 82.76 | 77.70 | 60.24 | 60.54 |
| DiMP50 [79] | 75.91 | 71.26 | 56.27 | 56.66 |
| PrDiMP50 [80] | 75.07 | 70.33 | 54.33 | 54.69 |
| AiATrack [76] | 82.35 | 77.54 | 59.13 | 59.56 |
| KeepTrack [77] | 80.97 | 73.09 | 56.47 | 56.80 |
| Stark-ST101 [52] | 78.62 | 74.04 | 56.76 | 57.15 |
| ToMP50 [81] | 73.96 | 69.63 | 54.71 | 55.09 |
| ToMP101 [81] | 75.21 | 70.28 | 54.72 | 55.10 |
| TCTrack [83] | 60.53 | 55.21 | 41.40 | 41.64 |
| MixformerV2-B [78] | 80.66 | 76.02 | 59.26 | 59.65 |
| SwinTrack-Tiny [54] | 71.51 | 67.73 | 52.79 | 53.15 |
| SwinTrack-Base [54] | 76.49 | 72.11 | 55.39 | 55.74 |
| SeqTrack-B256 [56] | 74.00 | 68.63 | 52.54 | 52.87 |
| ROMTrack [84] | 67.22 | 62.22 | 47.91 | 48.13 |
| AQATrack [59] | 76.57 | 71.71 | 55.46 | 55.81 |
| CAMTracker (ours) | 88.56 | 84.87 | 66.68 | 67.10 |
Table 2. Overall performance of CAMTracker and other trackers on the test set of Anti-UAV410.

| CAM | ADM | Contrastive Loss | P (%) | S_r (%) | S_auc (%) | SA (%) |
|---|---|---|---|---|---|---|
| - | - | - | 86.16 | 80.75 | 61.25 | 61.62 |
| - | ✓ | - | 87.63 (↑1.47) | 83.80 (↑3.05) | 65.33 (↑4.08) | 65.73 (↑4.11) |
| ✓ | - | common | 86.01 (↓0.15) | 81.75 (↑1.00) | 63.11 (↑1.86) | 63.49 (↑1.87) |
| ✓ | - | refined | 86.46 (↑0.30) | 82.38 (↑1.63) | 63.93 (↑2.68) | 64.32 (↑2.70) |
| ✓ | ✓ | common | 87.87 (↑1.71) | 84.31 (↑3.56) | 66.23 (↑4.98) | 66.65 (↑5.03) |
| ✓ | ✓ | refined | 88.56 (↑2.40) | 84.87 (↑4.12) | 66.68 (↑5.43) | 67.10 (↑5.48) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, Z.; Hu, Y.; Yang, J.; Zhou, G.; Liu, F.; Liu, Y. A Contrastive-Augmented Memory Network for Anti-UAV Tracking in TIR Videos. Remote Sens. 2024, 16, 4775. https://doi.org/10.3390/rs16244775
