Review

Aerial Video Trackers Review

College of Software, Xinjiang University, Urumqi 830000, China
* Author to whom correspondence should be addressed.
Entropy 2020, 22(12), 1358; https://doi.org/10.3390/e22121358
Submission received: 24 October 2020 / Revised: 26 November 2020 / Accepted: 27 November 2020 / Published: 30 November 2020
(This article belongs to the Section Entropy Reviews)

Abstract:
Target tracking technology based on aerial videos is widely used in many fields; however, this technology faces challenges such as image jitter, target blur, high data dimensionality, and large changes in the target scale. In this paper, the research status of aerial video tracking and the characteristics, background complexity and tracking diversity of aerial video targets are summarized. Based on the findings, the key technologies related to tracking are elaborated according to the target type, the number of targets and the applicable scenarios. The tracking algorithms are classified according to the type of target, and the target tracking algorithms based on deep learning are classified according to the network structure. Commonly used aerial photography datasets are described, and the accuracies of commonly used target tracking methods are evaluated on an aerial photography dataset, UAV123, and a long-video dataset, UAV20L. Potential problems are discussed, and possible future research directions and corresponding development trends in this field are analyzed and summarized.

1. Introduction

Visual target tracking is an important topic in the field of computer vision. Its purpose is to accurately locate, identify and track a target after obtaining continuous images through a collector. An overview of domestic and international research progress and visualization achievements reveals that visual target-tracking technology has unique social application value in terms of convenience, high efficiency, safety, reliability, high cost performance and low energy consumption [1] in the fields of medical diagnosis, human-computer interaction, public safety [2], video surveillance and posture estimation [3].
However, there are some differences between aerial target tracking technology and standard ground target tracking technology. Differences among aerial photography instruments, environments and target states lead to high information content, strong heterogeneity and high dimensionality of aerial images and videos. Available image processing algorithms, such as image denoising [4], image enhancement [5] and image mosaicking [6], can satisfy the real-time processing requirements of aerial image target recognition, but difficult problems and challenges remain in the realization of target tracking, including the following.

1.1. Target Specificity

Aerial photography instruments differ in light sensitivity and are limited by their flight height. In aerial photography images, there are often targets that are visible to the naked eye but occupy only a few pixels, as well as objects that are blurred or resemble the background color and texture [7,8]. Based on these characteristics, this study classifies aerial photography targets into the following six types:
  • Dim small targets: Targets whose imaging size is relatively small due to the shooting angle and shooting distance, namely, targets whose imaging size is less than 0.12% of the total number of pixels [9] (see the sketch after this list).
  • Weakly fuzzy targets: Targets for which the image is blurred due to the exposure time or flight jitter.
  • Weak-contrast targets: In a recognition environment with low noise and a low signal-to-noise ratio (SNR), the recognition target and moving background are similar in terms of color features and texture features. Hence, the contrast between the recognition target and the background is low, and the texture feature is not readily identified, but there is no missing target category.
  • Occluded targets: Targets that are temporarily occluded by the complex environmental background or are hidden for a long time during aerial photography tracking.
  • Fast-moving targets: Targets that exhibit dodging, fleeing and fast movement, which include image debris that is caused by the shaking of the UAV fuselage, obstacle avoidance and the influence of wind speed.
  • Common targets: Targets with normal behavior and clear images.
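To make the pixel-ratio criterion above concrete, the following minimal Python sketch flags a target as a dim small target when its bounding box covers less than 0.12% of the frame; the frame resolution and box size are hypothetical values chosen only for illustration.

```python
# Minimal sketch: flag a target as "dim small" when its bounding box covers
# less than 0.12% of the frame's pixels (the threshold quoted above [9]).
def is_dim_small_target(box_w, box_h, frame_w, frame_h, ratio=0.0012):
    """box_w, box_h: target bounding-box size in pixels; frame_w, frame_h: frame size."""
    return (box_w * box_h) / (frame_w * frame_h) < ratio

# Hypothetical 1920x1080 aerial frame with a 40x30-pixel target.
print(is_dim_small_target(40, 30, 1920, 1080))  # True: 1200/2073600 is about 0.06%
```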

1.2. Background Complexity

Aerial photography can be roughly divided into three types: urban architectural landscape (e.g., urban roads, urban buildings, and large-scale event sites) photography, suburban open area (plains, grassland, and open areas in urban suburbs) photography, and complex and harsh environment (deserts, mountains, gullies and natural disaster sites) photography. Due to the diverse environments, the pixel values of aerial photography targets and backgrounds are relatively low, and the texture features, spatial features and color features of the background differ substantially, which causes strong interference with aerial photography targets, especially in the case of complex environmental changes, sudden unknown static or mobile threats to the aerial photography equipment, and other aerial photography challenges. This paper summarizes methods for overcoming target occlusion caused by the high image-resolution-to-target-pixel ratio and the high feature dimensionality in aerial photography.

1.3. Tracking Diversity

Aerial image acquisition equipment produces a variety of data forms, including ordinary red, green, blue (RGB) color images (visible light images), infrared thermal images (gray images), GPS navigation information and acquisition equipment number information. Therefore, by combining various data features, the identification and tracking of occluded targets and weak targets can be realized. By using a single-UAV working mode or a multi-UAV collaborative tracking mode, the number of available target features (spatial three-dimensional and multiangle features) can be increased to improve the tracking accuracy and tracking success rate. However, problems such as collaborative path planning, data normalization and image edge computation are encountered.
According to the characteristics of aerial video shooting targets, this study classifies and compares target-tracking methods and identifies the characteristics and usage scenarios of each method. The main contributions of this paper can be described as follows.
  • We conduct a comprehensive benchmark test of aerial video trackers based on handcrafted features and deep learning.
  • We take the target scale and definition as the classification criteria and conduct a complete comparative analysis of the three tracking schemes.
  • We benchmark 20 trackers based on handcrafted features, deep features, Siamese networks and attention mechanisms.
  • We compare the performance of the trackers in various challenging environments so that relevant researchers can better understand the research progress on aerial video tracking.
The remainder of this paper is organized as follows. In Section 1, we explain the definition of aerial video target tracking from three perspectives: the target type, the shooting background and the tracking method. In Section 2, we compare the relevant datasets that can be used for aerial target tracking. In Section 3, relevant tracking methods are introduced from three aspects: ordinary targets, weak targets and moving targets. In Section 4, we investigate and compare the structures of neural network trackers. In Section 5, we show the evaluation results of different trackers under the UAV123 and UAV20L standards through experimental comparison and discuss the comparison between different trackers and the potential problems of aerial target tracking. In Section 6, we discuss future research directions for aerial target tracking.

2. Aerial Video Datasets

Due to differences in the sensors of aerial photography equipment, parameters may vary among datasets [10]. A single-frame image in a dataset may contain multiple targets, but the frequency with which targets appear is not stable, and the target position and attitude change with the shooting angle. Therefore, although various traditional aerial photography datasets can reflect real-world application requirements, their applicability is typically limited.
Aerial photography data are typically acquired by low-altitude drones. The number of videos in Table 1 represents the number of videos in the dataset, shortest video frames represents the number of frames in the video sequence with the fewest frames in the dataset, longest video frames represents the number of frames in the video sequence with the most frames in the dataset, total video frames represents the sum of the numbers of frames of all the video sequences in the dataset, and average video frames is obtained by dividing the total number of frames in the dataset by the number of videos. OTB and VOT are common target datasets, which are suitable for short-term tracking. The LaSOT dataset, consisting of 3.52 million manually annotated images and 1400 videos, is focused on long-term tracking and is by far the largest target dataset with dense annotation. However, these datasets contain substantial amounts of nonaerial target information and are not suitable for aerial target tracking. UAV123, ALOV300++, and Temple Color 128 are excellent special aerial photography datasets with rich types. Among them, the objects, such as dancers, completely transparent glass, octopuses, birds and camouflaged soldiers, exhibit occlusion, complete occlusion and sudden movement of the target, which are more in line with practical scenarios. UAV123 has a wide variety of scenes, which include urban landscapes, roads, buildings, sites, beaches and ports. The targets include cars, trucks, ships, people, groups and air vehicles, and the activities include walking, cycling, water skiing, driving and swimming. The long-term complete and partial occlusions of the target, scale changes, light changes, view changes, background clutter, camera motion and other effects are labeled. UAV123 has recently become increasingly popular due to its practical applications, such as navigation, wildlife surveillance, and crowd surveillance.
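As a small illustration of how the statistics reported in Table 1 are derived, the following Python sketch computes the shortest, longest, total and average video frame counts from a list of per-video frame counts; the counts used here are placeholders rather than values from any real dataset.

```python
# Minimal sketch of the Table 1 statistics, computed from per-video frame counts.
# The counts below are placeholders, not values from a real aerial dataset.
frame_counts = [109, 2629, 915, 487, 1742]   # hypothetical frames per video

num_videos = len(frame_counts)
shortest_video_frames = min(frame_counts)
longest_video_frames = max(frame_counts)
total_video_frames = sum(frame_counts)
average_video_frames = total_video_frames / num_videos

print(num_videos, shortest_video_frames, longest_video_frames,
      total_video_frames, round(average_video_frames, 1))
```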

3. Traditional Target Tracking Algorithm

The combination of a UAV with infrared equipment can solve the tracking problem of weak and hidden targets [21]. However, due to the high dimensionality of the data features, this approach is not suitable for the tracking analysis of fast-moving targets and exhibits low real-time performance. Many challenges remain in the real-time tracking of aerial photography. In addition, target loss caused by target deformation and scale changes is an urgent problem to be solved. This section summarizes tracking methods according to the target categories introduced in the problem definition; weak targets are defined in Section 1.

3.1. Common Targets

The traditional template-matching target tracking strategy is to construct a tracker based on sparse representation. The best candidate box can be identified via template matching, but the background and the target cannot be distinguished well. Reference [22] proposes the adaptive structural local sparse appearance (ASLA) algorithm, which increases the tracking accuracy and reduces the influence of occlusion through an alignment pooling operation on the sparse codes. Incremental subspace learning and sparse representation are then adopted in the update module to address drift and partial occlusion.
Various target trackers realize satisfactory short-time tracking performance, whereas others realize satisfactory long-time tracking performance. In Reference [23], the MUlti-Store Tracker (MUSTer) algorithm combines these two types of trackers: for short-time tracking, a powerful integrated correlation filter (ICF) method is used for short-term storage. The use of key-point matching tracking and random sample consensus [24] estimation in the integrated long-term module enables the integration of long-term memory and provides additional information for output control.
To overcome the high dimensionality of the data features, Reference [25] utilized the principal component analysis and scale-invariant feature transform (PCA-SIFT) algorithm, which improved SIFT and introduced PCA to reduce the dimensionality of aerial target features. Due to the loss of information during dimensionality reduction, this method is suitable only for processing clear aerial video images of targets. To overcome background interference and background shade, Reference [26] uses the appearance of the target and the background environment to build a tracker from two angles. The tracker is robust to changes in the appearance of the target during tracking. First, background patch information and foreground patch information are obtained, and multiangle information is associated through camera calibration. An adaptive model update strategy based on the response distribution and prior tracking results is used to reduce the possibility of model drift and enhance tracking stability. Reference [27] designed a robust tracker based on a key-patch sparse representation and designed patches for the occluded part. First, using patch sparsity, patches are obtained from known images and assigned scores. Second, key patches are selected according to the position and occlusion scenario, and corresponding contribution factors are designed for the sampled patches to emphasize the contributions of the selected key patches. This method increases the accuracy of partially occluded target tracking.
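The PCA step of a PCA-SIFT-style pipeline such as the one in Reference [25] can be sketched in a few lines; the descriptors below are random placeholders standing in for real SIFT features, and the reduced dimensionality of 36 is an assumed value, not one taken from the cited work.

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch of the dimensionality-reduction step in a PCA-SIFT-style pipeline:
# project 128-D SIFT descriptors onto a lower-dimensional basis learned by PCA.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 128))   # 500 keypoints x 128-D descriptors (placeholder)

pca = PCA(n_components=36)                  # assumed reduced dimensionality
reduced = pca.fit_transform(descriptors)    # shape: (500, 36)
print(reduced.shape, round(float(pca.explained_variance_ratio_.sum()), 3))
```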

3.2. Weak Targets

In weak target tracking, two main challenges are encountered. First, the distance between the aerial photography equipment and the tracked target is relatively large, so the target occupies a relatively low percentage of pixels on the imaging plane and is vulnerable to interference by various types of noise clutter, resulting in missed or lost targets. Second, environmental factors (complex background, wind speed, and equipment jitter) lead to target blur and target loss. In this paper, dim small targets, weakly blurred targets and weak-contrast targets are discussed and analyzed.

3.2.1. Dim Small Targets

To reduce the omission rate of dim small targets and increase the tracking accuracy, the relative local contrast measure (RLCM) multiscale detection algorithm was used in Reference [28]. The algorithm calculates the multiscale RLCM for each pixel of the original infrared image to enhance the real target and suppress all types of interference (such as high-brightness backgrounds, complex background edges and pixel-sized noise with high brightness). An adaptive threshold is then used to extract the real target. Formulas (1)–(3) calculate the RLCM of the center pixel of the center cell at each location.
$$\mathrm{RLCM} = \min_{i}\left(\frac{I_{mean_0}}{I_{mean_i}}\, I_{mean_0} - I_{mean_i}\right), \qquad (1)$$
$$I_{mean_0} = \frac{1}{K_1}\sum_{j=1}^{K_1} G_0^{\,j}, \qquad (2)$$
$$I_{mean_i} = \frac{1}{K_2}\sum_{j=1}^{K_2} G_i^{\,j}, \quad i = 1, 2, \ldots, 8, \qquad (3)$$
where $I_{mean_0}/I_{mean_i}$ can be understood as an enhancement factor for the central cell [that is, cell(0)] in the $i$th direction, and $I_{mean_0}$ and $I_{mean_i}$ denote the average gray values of the $K_1$ or $K_2$ maximal pixels in cell(0) and cell($i$), respectively. $K_1$ and $K_2$ are the numbers of maximal gray values that are considered, and $G_0^{\,j}$ and $G_i^{\,j}$ are the $j$th maximal gray values of cell(0) and cell($i$), respectively.
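To make Formulas (1)–(3) concrete, the sketch below evaluates the RLCM at one pixel location following the reconstructed form above, assuming the surrounding patch is split into a 3 × 3 grid of square cells with cell(0) at the centre; the values of $K_1$, $K_2$ and the patch size are illustrative assumptions.

```python
import numpy as np

# Sketch of Formulas (1)-(3) at one pixel: split the local patch into a 3x3 grid
# of cells, take cell(0) as the centre cell and cells 1..8 as its neighbours.
def rlcm_at(patch, k1=2, k2=4):
    """patch: square grayscale array whose side is a multiple of 3 (e.g. 9x9)."""
    s = patch.shape[0] // 3
    cells = [patch[r*s:(r+1)*s, c*s:(c+1)*s] for r in range(3) for c in range(3)]
    center, neighbours = cells[4], cells[:4] + cells[5:]

    def mean_of_max(cell, k):        # average of the k largest gray values in a cell
        return float(np.sort(cell, axis=None)[-k:].mean())

    i0 = mean_of_max(center, k1)     # I_mean0
    contrasts = [(i0 / mean_of_max(c, k2)) * i0 - mean_of_max(c, k2)
                 for c in neighbours]                 # one contrast term per direction
    return min(contrasts)            # RLCM: minimum over the 8 directions

patch = np.ones((9, 9)); patch[4, 4] = 200.0          # bright point on a flat background
print(round(rlcm_at(patch), 1))                       # large value -> likely a small target
```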
In Reference [29], an online multitarget tracker was designed by using high confirmations (strong detections) and low confirmations (weak detections) in the framework of the probability hypothesis density particle filter, which performed well in terms of tracking accuracy, number of missing targets and speed. The calculation flowchart is presented in Figure 1.
Strong detections are used to propagate target tags and promote target initialization, whereas weak detections are used only to support label propagation. Early association (EA) is executed prior to the update phase to reduce the extensive computational cost incurred by the labeling process. The associated detections $Z_k^+$ inherit the corresponding identity information and are used only to update the tracked states. After the EA phase, unassociated weak detections are discarded, while unassociated strong detections $Z_k$ are retained for the initialization of new particles. Strong detections generate new particles, as expressed in Formula (4), where $N(\cdot)$ is a Gaussian distribution, $x_k^i$ represents the relative weight of each new particle, and $X_{k,\lambda}^i$ is the $i$th particle. Strong detections generate new particles independently, modeled from the estimated state according to the function $N(\cdot)$ and dynamically updated based on parameters such as the detection size and video frame rate using the covariance matrix $\Sigma$. Moreover, unassociated strong detections initialize new particles, as expressed in Formula (5), where $|\cdot|$ denotes the cardinality of the specified set and $Z_k$ represents the combined detections. $\Sigma_k$ is a standard-deviation matrix that changes with time; it defines the relationship between the target detection box and the weight of the new particles. These values can be learned from the training set. State evaluation is then conducted as expressed in Formula (6), where each state $X_{k,\lambda} \in X_k$ is estimated as the average of all resampled particles sharing the same identity.
$$X_{k,\lambda}^i \sim p_k\!\left(X_{k,\lambda}^i \mid Z_k^+\right) = \frac{1}{\left|Z_k^+\right|}\sum_{z_k^+ \in Z_k^+} N\!\left(X_{k,\lambda}^i;\, z_k^+,\, \Sigma\right), \qquad (4)$$
$$X_{k,\lambda}^i \sim p_k\!\left(X_{k,\lambda}^i \mid z_k\right) = \frac{1}{\left|Z_k\right|}\sum_{z_k \in Z_k} N\!\left(X_{k,\lambda}^i;\, z_k,\, \Sigma_k\right), \qquad (5)$$
$$X_{k,\lambda} = \frac{1}{\left|\chi_{k,\lambda}\right|}\sum_{X_{k,\lambda}^i \in \chi_{k,\lambda}} X_{k,\lambda}^i. \qquad (6)$$
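A minimal numerical sketch of Formulas (5) and (6) is given below: particles are drawn around an unassociated strong detection with a Gaussian spread, and the target state is then estimated as the mean of the particles sharing the same identity. The detection state, the standard-deviation matrix and the particle count are hypothetical values, not parameters from Reference [29].

```python
import numpy as np

# Sketch of Formulas (5) and (6): new particles are sampled around an
# unassociated strong detection z_k with spread Sigma_k, and the state is
# estimated as the average of the particles that share the same identity.
rng = np.random.default_rng(0)

z_k = np.array([120.0, 80.0, 32.0, 32.0])        # detection state (x, y, w, h), hypothetical
sigma_k = np.diag([4.0, 4.0, 2.0, 2.0])          # standard-deviation matrix (assumed)
particles = rng.multivariate_normal(z_k, sigma_k ** 2, size=100)   # Formula (5)

state_estimate = particles.mean(axis=0)          # Formula (6): mean of resampled particles
print(state_estimate.round(2))
```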
Reference [30] realizes the feature binding of the target's grayscale and spatial relation via compressed sensing, thereby constructing a Gaussian target model to overcome the high similarity between the small target and the background noise. Reference [31] combines particle swarm optimization (PSO) and a particle filter to optimize the sampling process of the particle filter and overcome the scarcity of small-target features. In addition, the algorithm introduces a local PSO reset method to overcome the particle collapse problem in the particle filter for multitarget detection and tracking.

3.2.2. Weak Blurred Targets

Infrared detection systems are typically used to find and track weakly blurred targets. Reference [32] applied the Wiener filter to the processing of the original infrared image. First, the motion blur is processed and noise interference is suppressed; the gradient method is then used to sharpen the processed image and enhance the target edges. This method can substantially reduce motion blur, increase image quality and enhance the performance of the detection system. Reference [33] constructed a nonlinear blur kernel with multiple moving components. A blind deconvolution technique that uses a piecewise linear model was introduced to estimate the unknown kernels. This method is combined with noise reduction technology based on wavelet multiframe decomposition and the peak signal-to-noise ratio (PSNR). The algorithm is highly effective in accurately identifying various blur kernels and provides important research strategies for image deblurring. Reference [34] proposes a new motion-blur computation method for ray tracing. This method provides an analysis of the blurred visibility of each ray motion and considers the time dimension. The algorithm can use any standard ray tracing acceleration structure without modification. Reference [35] proposes a frame-by-frame intermittent tracking method driven by an actuator, which is used for motion-blur-free video shooting of fast-moving objects. By controlling the frame and shutter timing of the camera to reduce the motion blur and by synchronizing the vibration with a free-vibration-type actuator, the motion blur can be reduced in free-view high-frame-rate video shooting.
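In the spirit of the Wiener-filter pipeline described for Reference [32], the sketch below suppresses noise in a synthetic infrared frame with an adaptive Wiener filter and then sharpens edges with a Laplacian (gradient) term. It is only an illustration of the denoise-then-sharpen idea: true motion deblurring would additionally require an estimate of the blur kernel, and the image, window size and sharpening weight here are all assumed values.

```python
import numpy as np
from scipy import ndimage, signal

# Sketch of a Wiener-filter-plus-sharpening step: suppress noise with an
# adaptive Wiener filter, then sharpen edges by subtracting a scaled Laplacian.
rng = np.random.default_rng(0)
infrared = np.zeros((64, 64)); infrared[30:34, 30:34] = 1.0   # toy target
infrared += 0.1 * rng.normal(size=infrared.shape)             # additive noise

denoised = signal.wiener(infrared, mysize=5)                  # adaptive Wiener filtering
sharpened = denoised - 0.5 * ndimage.laplace(denoised)        # gradient-based edge sharpening

print(round(float(sharpened.max()), 3), round(float(sharpened.std()), 3))
```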

3.2.3. Weak-Contrast Targets

For the recognition and tracking of weak-contrast targets, most algorithms require prior information about the targets; otherwise, they are affected by heavy noise clutter [36]. Reference [37] proposed a new method based on image fusion and mathematical morphology. Based on a steerable-pyramid description, the original images are fused, and target tracking on the fused image is realized via mathematical morphology. Reference [38] conducted an in-depth analysis of the background characteristics, weak target characteristics and motion characteristics and proposed a moving average method. Based on foreground extraction, the difference between adjacent frames, which reflects the continuity of a moving target, is calculated to eliminate interference points and reduce the false alarm rate. The track-before-detect method proposed in Reference [39] operates directly on the original sensor signal without a separate explicit detection stage. The probability density function of the target state is generated at the original pixel level, the probability indicator of target presence is calculated, and a Bayesian particle filter is used to complete the target tracking. Reference [40] proposed a feedback neural network for weak-contrast target motion tracking against a naturally cluttered background. To form a feedback loop, the model delays the output and forwards the feedback signal to the previous neural layer.
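Mathematical morphology of the kind mentioned for Reference [37] is often applied to weak targets through a white top-hat transform, which subtracts the morphological opening from the image so that the smooth background is suppressed and small bright structures stand out. The sketch below illustrates this generic operation on a synthetic frame; it is not the specific fusion-plus-morphology method of the cited work, and the image, structuring-element size and threshold are assumptions.

```python
import numpy as np
from scipy import ndimage

# Sketch: a white top-hat transform (image minus its morphological opening)
# suppresses the smooth background and highlights small, low-contrast targets.
rng = np.random.default_rng(1)
frame = np.linspace(0.2, 0.6, 64)[None, :].repeat(64, axis=0)   # smooth background gradient
frame[20:23, 40:43] += 0.15                                     # low-contrast small target
frame += 0.02 * rng.normal(size=frame.shape)                    # sensor noise

tophat = ndimage.white_tophat(frame, size=9)                    # background-suppressed response
candidates = tophat > tophat.mean() + 3 * tophat.std()          # simple adaptive threshold
print(int(candidates.sum()), np.argwhere(candidates)[:3])       # candidate target pixels
```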

3.3. Occluded Targets and Fast-Moving Targets

In the course of UAV dynamic tracking, especially when fast movement occurs [41] and relabeling is necessary after the target is lost for a short time [42], the typical method determines the target area continuously through the video sequence [43]. Scholars have also proposed the correlation filter tracking algorithm [44] and the circulant structure of tracking-by-detection with kernels (CSK) algorithm [45]. The tracking efficiency is high, but the tracking performance for multiscale targets is poor, and it is difficult to resume tracking of a missing target. To overcome this problem, Reference [46] improved the scale-adaptive multifeature fusion (SAMF) algorithm on the basis of kernelized correlation filters (KCF) [47]. A multifeature (grayscale, histograms of oriented gradients (HOG), and color names (CN)) fusion method was used to realize feature complementation, and a multiscale search strategy was used to realize scale-adaptive tracking and increase the tracking accuracy. However, because the algorithm must conduct seven types of scale detection calculations, its speed is much lower than that of KCF. Reference [48] combines filtering and context-aware information [49] and uses an intermittent learning method to enhance the network's context awareness and improve its modeling of occluded objects. In Reference [49], the frame with the best tracking result is used as the key frame in the follow-up tracking, which optimizes the quality of the training set and reduces the computational cost, thereby overcoming the poor robustness of filter methods in complex scenes.
Reference [50] used vector field guidance for multitarget tracking in aerial videos. By improving the vector field guidance method of a single UAV and defining a variable confrontation tracking track, the cooperative confrontation tracking of a UAV group on a moving target group is used to solve the problem of the limited visual range of a UAV when tracking multiple ground targets, which is suitable for processing aerial video images of fast-moving targets. To solve the problem of visual control of target tracking in visible-light aerial photography, Reference [51] adopted a vision-based ground target tracking control strategy to realize the real-time tracking of aerial photography targets. Aiming at the regional cooperative search problem of multiple UAVs, Reference [52] described the changes in the environment and target state during the search process based on a search information graph model and established a motion model for the dynamic analysis of UAVs to ensure the accuracy of model prediction, thereby realizing the accurate tracking of complex targets with motion trajectories. To address the abnormal filter response caused by background interference in aerial video, a clipping matrix and a regularization term were introduced in Reference [53] to expand the search area and suppress the distortion. The spatially regularized correlation filter (SRDCF) algorithm, proposed in Reference [54], adds a spatial penalty term to discriminative correlation filters (DCF) to mitigate the boundary effect and realize superior performance in large-scale movement and complex scenes. However, the need to reuse multiframe information during tracking creates a computational cost problem. The spatial-temporal regularized correlation filters (STRCF), proposed in Reference [55], add spatial and temporal regularization terms to address the problems encountered with SRDCF, and tracking requires only the information of the previous frame, ensuring time efficiency. Most available filter algorithms attempt to introduce a predefined regularization term to improve learning of the target object, but it is difficult to adapt them to special scenarios in practice. To overcome this problem, Reference [56] proposed an online adaptive spatiotemporal regularization learning method. By introducing spatially local change information into the spatial regularization, the DCF can focus on the trusted parts of the target object. The algorithm realizes satisfactory tracking performance on four aviation datasets. Reference [57] evaluated the target state by establishing an unscented Kalman filter based on a multi-interaction model, which reduces the network's evaluation error for the moving target but also increases the computational consumption.
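The correlation-filter family discussed above (KCF, SAMF, SRDCF, STRCF) shares the same core step: a filter is learned in the frequency domain against a Gaussian label and evaluated on a search patch via the FFT. The single-channel MOSSE/DCF-style sketch below illustrates only this core idea, without the kernel trick, multiple feature channels or the spatial/temporal regularization terms of the cited trackers; the patch and parameters are placeholders.

```python
import numpy as np

# Minimal single-channel correlation-filter sketch: learn the filter in the
# frequency domain against a Gaussian label, then evaluate a search patch.
def gaussian_label(h, w, sigma=2.0):
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
    return np.roll(np.roll(g, -(h // 2), axis=0), -(w // 2), axis=1)  # peak at (0, 0)

def train_filter(template, lam=1e-2):
    F = np.fft.fft2(template)
    G = np.fft.fft2(gaussian_label(*template.shape))
    return (G * np.conj(F)) / (F * np.conj(F) + lam)   # conj(H): learned filter

def response(search, H_conj):
    Z = np.fft.fft2(search)
    return np.real(np.fft.ifft2(Z * H_conj))           # correlation response map

rng = np.random.default_rng(0)
template = rng.normal(size=(64, 64))                   # placeholder target patch
H_conj = train_filter(template)
resp = response(template, H_conj)                      # evaluating on the same patch
print(np.unravel_index(resp.argmax(), resp.shape))     # peak near (0, 0), i.e. no shift
```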

4. Target Tracking Algorithm Based on a Deep Learning Network

With the development of computer vision, many visual target tracking frameworks have been proposed and applied to aerial video target tracking. This section briefly introduces tracking algorithms based on deep features, tracking algorithms based on Siamese networks and target tracking algorithms based on attention mechanisms.

4.1. Deep Features

A deep learning network, as represented by the convolutional neural network (CNN), can automatically learn effective features of the target from large training sets, which not only effectively overcomes background noise but also realizes satisfactory tracking performance [58,59].
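The sketch below shows the generic idea of using a pretrained CNN backbone as a deep-feature extractor for candidate patches; the choice of VGG-16 and the layer cut-off are assumptions for illustration, not the specific networks used by the trackers cited in this section.

```python
import torch
from torchvision import models

# Sketch: use the convolutional part of a pretrained backbone as a deep-feature
# extractor for a tracking candidate patch (backbone and cut-off are assumed).
backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:17]
backbone.eval()                              # conv layers up to the conv3 block

patch = torch.rand(1, 3, 224, 224)           # placeholder, already-normalized candidate patch
with torch.no_grad():
    feat = backbone(patch)                   # deep convolutional features of the patch
print(feat.shape)                            # torch.Size([1, 256, 28, 28])
```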
Reference [60] designed a lightweight CNN for learning the common attributes of multidomain videos to address scenarios such as target occlusion and target deformation in practical tracking. The network tracking structure uses online fine-tuning to improve the real-time performance of the tracking algorithm. Reference [61] added RoIAlign on this basis to accelerate feature extraction and classify internal targets through a multitask loss, adding discriminative parameters for targets with similar semantics. The network structure is illustrated in Figure 2. First, the first three convolution layers share the multidomain features learned by the network (e.g., robustness to illumination changes, motion blur, or size changes), and the adaptive RoIAlign extracts CNN features of each region of interest (RoI) to improve the feature quality and reduce the computational complexity. Layers FC4 and FC5 are mainly used to distinguish the background and the target, and the unique characteristics of each video domain are stored in the FC6 branch with a softmax cross-entropy loss.
The online tracking process of the RT-MDNet algorithm is described in Algorithm 1.
Algorithm 1 Online tracking process of the RT-MDNet algorithm
Input: Pretrained RT-MDNet convolution weights $w = \{w_i\}$, where $w_i$ is the weight of a convolutional layer, and the initial target state $X_1$.
Output: Adjusted target state $X^*$.
1: Randomly initialize the weights $w_6$ of the last domain-specific layer.
2: Use the bounding box regression technique to train the bounding box regression function bbox.
3: for each frame $i$ do
4:  if ($i$ == 1)
5:   Acquire the convolution feature of the first frame image $\alpha(W)$.
6:  else
7:   Acquire the convolution features of the second and subsequent frame images $\alpha(w_\gamma)$.
8:  Draw a positive sample $S_i^+$ and a negative sample $S_i^-$.
9:  Use $S_i^+$ and $S_i^-$ to update $w = \{w_j\}$: $w = conv(S_i^+, S_i^-)$, $j = 4, 5, 6$.
10: Set the long-term update frame index $T_l^i$ and the short-term update frame index $T_s^i$.
11: Draw the target candidate sample states $x_i$.
12: Find the optimal state of the target position: $x^* = \arg\max f^+(x_i)$, where $f^+(x_i)$ is the target score evaluated by the network.
13: if $f^+(x_i) \ge 0.5$, then draw a positive sample $S_i^+$ and a negative sample $S_i^-$; the long-term update frame index set is $T_l = \bigcup_{i=1}^{n} T_l^i$, and the short-term update frame index set is $T_s = \bigcup_{i=1}^{n} T_s^i$.
14: if $|T_l| > \tau_l$, then $T_l = T_l \setminus \{\min_v t_l^v\}$, where $t_l^v$ is the rate of change of the appearance of the long-term target.
15: if $|T_s| > \tau_s$, then $T_s = T_s \setminus \{\min_v t_s^v\}$, where $t_s^v$ is the rate of change of the appearance of the short-term target.
16: Use bbox to adjust the optimal state of the target position: $x_i^* = bbox(x^*)$.
17: if ($i$ % 10 == 0)
18:  then use $S_{v \in T_l}^+$ and $S_{v \in T_s}^-$ to update $w = \{w_j\}$: $w = conv(S_{v \in T_l}^+, S_{v \in T_s}^-)$.
19: else if $f^+(x_i) < 0.5$
20:  then use $S_{v \in T_s}^+$ and $S_{v \in T_s}^-$ to update $w = \{w_j\}$: $w = conv(S_{v \in T_s}^+, S_{v \in T_s}^-)$.
21: end for
Reference [62] proposed the EArly Stopping Tracker (EAST) to convert the adaptive tracking problem into a decision-making process. The network structure is illustrated in Figure 3. The network uses the offline reinforcement learning method to learn an agent for a single-frame image. Based on this agent, it decides to select a layer in a series of feature layers to realize target monitoring or to use the next layer to conduct the same processing. However, this method exhibits reduced accuracy with increasing speed.
The action selection process for the EAST network is described in Algorithm 2, where action_4 denotes four groups of actions, and action is an action(i) value.
Algorithm 2 Action selection process for the EAST network
Input: Feature map; action index eigth_actionindex {}; the action value $h_l$ from the first four layers; action list $action = action(i)$ ($i \in 1, 2, \ldots, 8$).
Output: Current conv layer action value.
1: Calculate the corresponding average value $F_l$ of the first $l$ layers: $F_l = \sum_{k=1}^{l} F_k / l$.
2: Construct the current state of the feature map: ($F_l$, $h_l$).
3: Use vector merging to calculate the feature sequence: feature_list = $F_l$ + action_4.
4: Conduct feature reorganization of feature_list: feature_map = fc(feature_list).
5: Compare feature_map and eigth_actionindex and choose the action with the highest score: sam_action = sam(feature_map, eigth_actionindex).
6: if sam_action = Stop, then stop early ("EAST"); the subsequent target location processing is not conducted.
7: else output the value of sam_action.
The discriminative correlation filter [63] shows substantial advantages in visual target tracking, and the combination of a filter tracking framework and a deep neural network effectively improves the performance of tracking algorithms [64,65]. Reference [66] proposed the multiple experts using entropy minimization (MEEM) algorithm within a tracking-by-detection framework to overcome the model drift caused by tracking failure or misalignment of the training samples. Subsequently, the efficient convolution operators for tracking (ECO) algorithm was proposed in Reference [67]; it simplified continuous convolution operators (C-COT) [68] by modifying the number of model update frames, thereby reducing the model size, increasing the speed and reducing the risk of model overfitting. Simultaneously, according to the tracking results on the training set, components are generated by using a Gaussian mixture model (GMM) to ensure the diversity of the training set. However, the deep features of the network are not exploited sufficiently, and the large amount of computation reduces the tracking speed of the network. Based on ECO, Reference [69] treated the deep features and the shallow features separately, which substantially increased the robustness and tracking accuracy of the network structure.
To increase the network robustness, the multicue correlation filter tracking algorithm (MCCT), proposed in Reference [70], analyzes the fusion results that are obtained from the decision layers of multiple trackers to ensure the reliability of the results. The superimposed selection of adaptive strategies successfully distinguishes unreliable samples (in which there are occlusions or deformed data) to further avoid the problem of insufficient training due to sample contamination. Reference [56] combined the output of the Conv3 layer of the VGG-M [71] network with HOG-CN to increase the robustness of the model.
To overcome the difficulty of matching the training depth feature with the actual target information, the target-aware deep tracking (TADT) method, proposed in Reference [72], uses the global average of the backpropagation gradient to complete feature screening, evaluates the importance of each filter through a regression function, and applies a weighted supplement to the deep feature.

4.2. Siamese Network

To overcome the high computational burden and low speed of previous deep neural network methods, the Siamese network, which introduces similarity learning into the matching of the target image and the search image, was proposed; it balances the costs of tracking speed and tracking accuracy and has gradually become the preferred solution to the tracking problem [73,74].
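The core similarity step shared by the Siamese trackers reviewed below can be sketched as a cross-correlation in which the embedded template feature acts as the kernel over the embedded search feature; the feature sizes here are illustrative and do not correspond to any specific tracker.

```python
import torch
import torch.nn.functional as F

# Sketch of the Siamese similarity step: correlate the embedded template
# feature phi(z) against the embedded search feature phi(x); the peak of the
# response map indicates the most likely target location.
template_feat = torch.rand(1, 256, 6, 6)       # phi(z): embedded template (placeholder)
search_feat = torch.rand(1, 256, 22, 22)       # phi(x): embedded search region (placeholder)

response = F.conv2d(search_feat, template_feat)         # cross-correlation -> (1, 1, 17, 17)
peak = torch.nonzero(response[0, 0] == response.max())  # location of the response peak
print(response.shape, peak[0].tolist())
```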
Simplifying the target tracking problem into learning a general similarity mapping is an effective solution. The Siamese instance search for tracking (SINT) algorithm, proposed in Reference [75], learns a matching function through a Siamese network. The target feature of the first frame is used as a template, the subsequent sampled features are matched against it, and the target with the highest score is selected as the final target. The algorithm uses a region pooling layer to accelerate the model and demonstrates the feasibility of combining a deep neural network with traditional methods. Reference [76] also calculated the similarity between each position of the template and the image to be tested through template matching and selected the target with the highest similarity as the final target. The discriminative subspace learning model (DSLM) network, proposed in Reference [77], solves the problems of target occlusion and background interference by learning the relationship between the target module and the characteristics of the search area. Reference [78] constructed an asymmetric Siamese network (CFNet) that not only ensures the tracking accuracy but also simplifies the network structure. In Reference [79], DCF was used to complete the filtering, a probability heat map of the calculated result mapped to the target position was used to complete the online learning and tracking, and end-to-end training was realized.
These trackers simplify the problem of target tracking to the problem of learning a generic similarity map by learning the correlation between the feature representation of the target module and the search area. They do not consider the complex and changeable target scale, appearance or pixels in the actual tracking process. In Reference [80], tracking was decomposed into two parallel and collaborative threads—fast discriminative scale space tracking (FDSST) was used for fast tracking, and a Siamese network was used for accurate verification, thereby realizing both high accuracy and high speed. The Siamese region proposal network (SiamRPN) algorithm, which is proposed in Reference [81], overcomes the limitation of spatial invariance of the Siamese network. It is composed of a Siamese subnetwork and a region proposal subnetwork. The network completes the offline end-to-end training via large-scale image analysis, constructs a one-shot detection task to avoid time-consuming multiscale tests and obtains accurate candidate regions. SiamRPN increases the model accuracy and reduces the model size. DaSiamRPN, proposed in Reference [82], enriches the types of training data in the dataset via data augmentation, reduces the impacts of difficult negative samples on the network training, and improves the network generalization and discrimination performances. The interference recognition module in the network overcomes the low recognition accuracy caused by the lack of a self-updating model.
Early Siamese trackers could not directly employ deep backbone networks, because padding destroys the strict translation invariance on which they rely. The SiamRPN++ algorithm, proposed in Reference [83] based on Reference [81], effectively solves this problem by modifying the sampling strategy. The network structure is illustrated in Figure 4. The method recombines the positioning features and deep semantic features obtained by ResNet and improves the feature expression performance following the sequence of features from low to high level, from small to large, and from thin to thick. This design is similar to the traditional feature pyramid network (FPN) [84]. To compensate for the loss of translation invariance caused by padding, the model shifts the training sample labels to alleviate the center bias introduced by the deep network.
The SiamRPN block of the SiamRPN++ algorithm is described in Algorithm 3.
Algorithm 3 SiamRPN block
Input: Feature maps ($\varphi(z)$, $\varphi(x)$), where $\varphi(z)$ is the feature vector of the template frame and $\varphi(x)$ is the feature vector of the detection frame.
Output: Classification results and regression results of bbox.
1: Convolve $\varphi(x)$ with $\varphi(z)$ as the kernel to obtain the classification anchor sequence: $A_{w \times h \times 2k}^{cls} = [adj\_1]_{cls} \star [adj\_2]_{cls}$.
2: Convolve $\varphi(x)$ with $\varphi(z)$ as the kernel to obtain the regression anchor sequence: $A_{w \times h \times 4k}^{reg} = [adj\_3]_{reg} \star [adj\_4]_{reg}$.
3: Calculate the positive sample sequence $S^+$ and the negative sample sequence $S^-$ by computing the intersection over union (IoU) of all anchor sequences with the target ground-truth box.
4: Calculate the regression offsets $dx$, $dy$, $dw$, $dh$ of $A_{w \times h \times 4k}^{reg}$ and the binary classification labels $\{0, 1\}$ of $A_{w \times h \times 2k}^{cls}$.
5: Reshape $A_{w \times h \times 4k}^{reg}$.
6: Conduct bbox regression using the smooth $L_1$ loss: $\mathrm{smooth}_{L_1}(x, \sigma) = \begin{cases} 0.5\,\sigma^2 x^2, & |x| < \frac{1}{\sigma^2} \\ |x| - \frac{1}{2\sigma^2}, & |x| \ge \frac{1}{\sigma^2} \end{cases}$
7: Remove the anchor sequences with label = −1 from $A_{w \times h \times 2k}^{cls}$.
8: Use the cross-entropy function to calculate the classification results for the output of step 7.
9: Output the regression results of bbox from step 6 and the classification results from step 8.
Siam R-CNN, proposed in Reference [85], is a redetection architecture based on the trajectory dynamic programming algorithm (TDPA). Based on the Siamese framework, the self-motion and mutual motion of all potential objects are modeled, and the detected information is summarized into tracklets to complete the detection. This method is suitable for long-term tracking and can address tracking failure after the target has been occluded for a long time. The Siamese box adaptive network (SiamBAN), proposed in Reference [86], simplifies the tracking problem into parallel classification and regression and directly conducts classification and regression operations on targets in a unified fully convolutional network (FCN). This avoids the computational complexity that the introduction of an RPN adds to the Siamese network and increases the network flexibility and generalization performance. The unsupervised deep tracker (UDT), proposed in Reference [87], applies unsupervised learning to target tracking, uses three consecutive frames to evaluate the prediction deviation to increase the accuracy of the tracker, and applies a sensitive loss function to assign a weight to each sample to overcome the noise caused by the random initialization of the target box in unsupervised training.

4.3. Attention Mechanism

Challenges remain in ensuring the real-time performance and applicability of trackers, and some available tracking algorithms cannot distinguish between the target and the background, which makes it difficult to handle changes in the target shape and background in real time. The attention mechanism module within a deep learning network reinforces important features in the image, thereby helping to address issues such as target tracking failures [88].
Reference [89] proposed the residual attentional Siamese network (RASNet) algorithm and reconstructed the filtering mode of the Siamese network based on a CNN, thereby effectively avoiding the overfitting problem. The algorithm separates representation learning from discriminative learning, enhances the discrimination performance and adaptability of the algorithm, and realizes real-time tracking. The network structure is illustrated in Figure 5 and contains three attention mechanisms. General attention introduces the attention mechanism to integrate the common features of targets and highlight the commonality of features. Residual attention considers differences in learning objectives. Channel attention adapts to various objectives and eliminates noise.
The attention fusion process of the RASNet algorithm is described in Algorithm 4.
Algorithm 4 Attention fusion process of the RASNet algorithm
Input: Feature map.
Output: Trace box $q$ with the largest response value.
1: The feature map is downsampled and upsampled by the residual attention mechanism to obtain the target semantic feature sequence feature_R.
2: The general attention mechanism is used to extract the information of multiframe feature maps, and the common feature sequence feature_G of the feature maps is obtained.
3: The dual attention feature is calculated: feature_D = feature_R + feature_G.
4: Calculate the channel weights: channel_score = Sigmoid(ChannelAttention(feature map)).
5: The fusion feature sequence is calculated: feature_list = feature_D $\otimes$ channel_score.
6: The trace box $q$ with the largest response value in feature_list is identified via the weighted cross-correlation $f_{p,q} = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\sum_{c=0}^{d-1} \alpha_{i,j}\,\beta_c\,\varphi_{i,j,c}(Z)\,\varphi_{p+i,q+j,c}(X) + b$, where $\alpha$ represents the dual attention, $\beta$ represents the channel attention, $Z$ represents the template image, $p$ is a real box in $Z$, $X$ represents the search image and $q$ is the trace box in $X$.
Reference [90] proposed the stacked channel-spatial attention (SCSAtt) tracker, which maintains the model's speed and increases its robustness. SCSAtt uses weight allocation to highlight the importance of each channel's features (the channel attention module) and uses the spatial attention module to highlight the most informative area of the feature map to determine the target location. The network structure is summarized in Figure 6.
The Channel-Spatial attention calculation process in the SCSAtt algorithm is described in Algorithm 5.
Algorithm 5 Channel-spatial attention calculation process in the SCSAtt algorithm
Input: Feature map $FM \in \mathbb{R}^{H \times W \times C}$.
Output: Channel-spatial attention $\Lambda(\phi(z))$.
1: Use global max-pooling to obtain the $FM$ object feature: $F_{max}^{1 \times 1 \times C} = fc_2(\mathrm{ReLU}(fc_1(\mathrm{GPool}_{max}(FM^{H \times W \times C}))))$.
2: Use global average-pooling to obtain the $FM$ feature: $F_{avg}^{1 \times 1 \times C} = fc_2(\mathrm{ReLU}(fc_1(\mathrm{GPool}_{avg}(FM^{H \times W \times C}))))$.
3: Use elementwise summation to fuse the two feature vectors: $\varphi_C(\cdot)^{1 \times 1 \times C} = \sigma(F_{max}^{1 \times 1 \times C} \oplus F_{avg}^{1 \times 1 \times C})$.
4: Calculate the channel attention feature map $CA$: $CA = \varphi_C(FM) \otimes FM$.
5: Calculate $S_{max}^{H \times W \times 1}$ for $CA$ with global max-pooling: $S_{max}^{H \times W \times 1} = \mathrm{GPool}_{max}(CA^{H \times W \times C})$.
6: Calculate $S_{avg}^{H \times W \times 1}$ for $CA$ with global average-pooling: $S_{avg}^{H \times W \times 1} = \mathrm{GPool}_{avg}(CA^{H \times W \times C})$.
7: Calculate the spatial attention, where $\vartheta^{3 \times 3}$ is a 3 × 3 convolution layer: $\varphi_S(\cdot)^{H \times W \times 1} = \sigma(\vartheta^{3 \times 3}(\mathrm{concat}[S_{max}^{H \times W \times 1}, S_{avg}^{H \times W \times 1}]))$.
8: Use the channel attention feature map to determine the ultimate effect on the spatial attention feature map $SA$: $SA^{H \times W \times C} = \varphi_S(\cdot)^{H \times W \times 1} \otimes CA^{H \times W \times C}$.
9: Calculate the final stacked channel-spatial attention: $\Lambda(\phi(z)) = CA \oplus SA$.
Similar to SCSAtt, the feature integrated correlation filter network (FICFNet) algorithm, proposed in Reference [91], is a two-branch parallel connection network structure that unifies the three processes of feature extraction, feature integration and DCF learning. The feature integration module of the network cascades the shallow features and the deep features and uses the channel attention mechanism to adaptively combine the channel weights into the integrated features; the obtained target timing information can solve the problems of target occlusion and target deformation.

5. Experiment

5.1. Datasets

5.1.1. Baseline Assessment

To accurately evaluate the model performance, experiments were conducted on the aerial datasets UAV123 [11] and UAV20L [11]. UAV123 contains 123 fully annotated HD video sequences with over 110K frames captured from a low-altitude aerial perspective. Each video sequence is labeled with 12 attribute categories: Aspect Ratio Change (ARC), Background Clutter (BC), Camera Motion (CM), Fast Motion (FM), Full Occlusion (FOC), Illumination Variation (IV), Low Resolution (LR), Out-of-View (OV), Partial Occlusion (POC), Similar Object (SOB), Scale Variation (SV), and Viewpoint Change (VC). A video sequence may have several attributes, depending on the shooting conditions, and the frequency differs among the attributes. UAV20L is a subset of UAV123 and contains 20 long video sequences. The UAV dataset is tagged with the size and location of the target in each video sequence and can be used for model initialization and model evaluation.

5.1.2. Evaluating Indicators

In this paper, two evaluation indicators, accuracy and success, are used to complete the quantitative analysis of the models. Accuracy refers to the percentage of frames for which the target center position error is within a specified range; the center position error is defined as the average Euclidean distance between the center position of the real box $(x_0^{gt}, y_0^{gt})$ and the center position of the tracking prediction box $(x_0^{tr}, y_0^{tr})$, as illustrated in Figure 7a. The success rate is the proportion of frames in the video sequence for which the overlap score (calculated as the intersection ratio) of the real box and the prediction box exceeds a threshold, as presented in Figure 7b. The center position error is a widely used standard, but it cannot easily evaluate the performance of the tracker when the target is lost. The accuracy curve is generated accordingly, and the value at a threshold of 20 pixels is adopted as the accuracy evaluation index [16]. When the center position error cannot evaluate changes in the target scale, the performance of the tracker can be assessed with a complementary evaluation index based on the area overlap ratio, which is calculated as expressed in Formula (7).
$$S = \frac{\left|R_{tr} \cap R_{gt}\right|}{\left|R_{tr} \cup R_{gt}\right|}, \qquad (7)$$
where $R_{tr}$ represents the prediction box of the tracking result, $R_{gt}$ represents the real (ground-truth) target bounding box, and ∩ and ∪ represent the intersection and union, respectively, of the two areas. This article uses the one-pass evaluation (OPE) accuracy and success graphs to complete the model evaluation, ranking the tracking algorithms by the area under the curve (AUC) of the success graph. The parameter standards follow the default UAV123 settings.
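Both indicators can be computed directly from the predicted and ground-truth boxes, as in the sketch below; the two boxes (in x, y, w, h form) are placeholder values used only to illustrate the 20-pixel accuracy threshold and the overlap score of Formula (7).

```python
import numpy as np

# Sketch of the two evaluation indicators: centre-error accuracy at a 20-pixel
# threshold and the overlap score S of Formula (7). Boxes are (x, y, w, h).
def center_error(box_a, box_b):
    ca = np.array([box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2])
    cb = np.array([box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2])
    return float(np.linalg.norm(ca - cb))            # Euclidean centre distance

def overlap_score(box_a, box_b):                     # Formula (7): IoU of the two boxes
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union

gt, pred = (100, 100, 40, 40), (108, 104, 40, 40)    # hypothetical ground truth / prediction
print(center_error(gt, pred) <= 20)                  # frame counted as accurate at 20 px
print(round(overlap_score(gt, pred), 3))             # frame counted as a success above threshold
```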
The algorithm codes are implemented on a server with an NVIDIA TITAN V GPU in MATLAB and Python, and the configuration parameters of the experimental environment are shown in Table 2. The codes of the trackers that we reproduced are obtained from GitHub repositories, and the URLs are shown in Table 3. The training models of all tracking algorithms adopt the original models without retraining.

5.2. Evaluation in UAV123

5.2.1. Overall Evaluation

In this paper, a total of 20 tracking algorithms are compared. Figure 8 presents the results of the algorithms on UAV123, the aerial photography dataset, and Table 4 shows the characteristics of the tracking algorithms. Among them, SiamRPN++, Siam R-CNN, SiamBAN, SCSAtt, DaSiamRPN, and UDT are trackers based on Siamese networks. RT-MDNet, ECO, C-COT, MCCT, TADT, and DeepSTRCF are deep-feature-based trackers. STRCF, SRDCF, MEEM, MUSTER, DSST, ECO-HC, KCF and SAMF are trackers based on handcrafted features. The trackers based on Siamese networks realize the best performances on the two measurement standards, with accuracy and success rates of 0.840 and 0.788, respectively, thereby outperforming the other tracking algorithms. This is a major breakthrough for tracking in the field of deep learning.
On the UAV123 dataset, a comparison of the Siamese network model structures shows that SiamRPN++ utilizes a deep network, ResNet, to fully extract target features by recombining the features of shallow and deep layers. The network structure is relatively complex, but its advantage lies in combining a Siamese network with a deep structure to complete feature extraction. Siam R-CNN uses a Siamese network to apply Faster R-CNN to the tracking problem and uses dynamic programming to address occlusion and target disappearance, which makes it suitable for long-term tracking and severely occluded scenes. However, its network structure is the most complex, and its computational burden is large. SiamBAN uses the representational capability of a fully convolutional network to simplify the tracking problem into classification and regression, thereby avoiding the hyperparameter problem. The accuracy and success rates of the SCSAtt tracker are 0.776 and 0.69, respectively; hence, the attention mechanism is an effective mechanism that helps the network increase the tracking accuracy. Since the structure of the DaSiamRPN algorithm cannot utilize deep features, there are gaps in its accuracy and success rates compared with the methods based on deep features, which demonstrates the importance of deep features. The UDT algorithm is the first unsupervised tracking algorithm implemented in a Siamese network framework, and its accuracy is consistent with that of SRDCF.
The trackers based on deep features are gradually being optimized. While the tracking speed of RT-MDNet far exceeds that of ECO, it realizes the same success rate and accuracy as ECO; hence, the multidomain combination method is effective. By introducing deep features on the basis of the STRCF algorithm, the DeepSTRCF algorithm improves substantially on the results of the STRCF algorithm.
Which models perform best?
Compared with other tracking algorithms, the SiamRPN++, SiamBAN, and SCSAtt networks have the best tracking performance; they not only cope with various challenges but also meet real-time requirements. This is because these algorithms do not update the network parameters during online tracking, thus avoiding the time consumption caused by a large amount of computation.
Which models are more robust?
The Siam R-CNN algorithm uses the TDPA mechanism to address the problem of tracking failure after serious occlusion and target loss in online tracking, thus improving the robustness of the model. The ECO algorithm uses GMM to ensure the diversity of training sets and reduce the risk of model overfitting. DeepSTRCF improves the robustness of the model by fusing CNN features, HOG and CN. The MCCT algorithm comprehensively considers the tracking results of multiple trackers to ensure the reliability of the tracking results, and filters unreliable samples through an adaptive strategy to improve the robustness of the model.
Which models are lightweight?
The SiamBAN algorithm simplifies the tracking problem into parallel classification and regression and directly classifies targets in an FCN, which reduces the computational complexity and ensures a simple network structure and strong flexibility. The RT-MDNet algorithm simplifies the tracking problem into target recognition and achieves better tracking performance by considering the interference of similar objects in the loss function. The TADT algorithm assumes that the tracking task needs only the information of the specific channels related to the target, eliminates the other redundant channels, reduces the feature information used in the tracking process, and increases the tracking speed.
Which models are suitable for long-term tracking?
The DaSiamRPN algorithm improves the generalization ability of the model by enhancing the diversity of training samples, and uses a local-to-global strategy to solve the problem of target loss during long-term tracking.

5.2.2. Attribute Evaluation

To fully evaluate the performance of the trackers in a variety of challenging scenarios, this article compares 12 different attributes in terms of accuracy and success on the UAV123 dataset. Table 5 and Table 6 present the evaluation results of these attributes for all target tracking algorithms, and Figure 9 compares the methods based on deep learning. According to the experimental results, the trackers based on Siamese networks can effectively handle various challenging scenes; for scenes in the categories of Aspect Ratio Change (ARC), Camera Motion (CM), Illumination Variation (IV) and Viewpoint Change (VC), the results are especially outstanding. Hence, the Siamese network structure performs satisfactorily in solving tracking problems such as target scale change, rapid target motion and interference from target-background similarity. In addition, compared with the attention-mechanism approach, the deep-feature-tracker approach performs better in the categories of Background Clutter (BC), Full Occlusion (FOC), Low Resolution (LR) and Partial Occlusion (POC); thus, rich deep features can effectively overcome the problems of target occlusion and deformation.
The visualization results of each tracker on the aerial photography dataset UAV123 are shown in Figure 10. The first row shows the tracking results on the video sequence bike, the second row on the video sequence building, the third row on the video sequence group, and the fourth row on the video sequence boat. As Figure 10 shows, under a simple background, as in bike, the trackers achieve good tracking results. However, when the background contains objects similar to the target, as in building and group, the interference is severe, and some trackers have difficulty distinguishing the target from similar objects. We can also see that when the target is severely occluded or temporarily disappears, as in group, the trackers fail to track it. When the target is small, as in boat, the target occupies a small proportion of the image, its features are difficult to obtain, and the tracking accuracy of some trackers is poor.
The speed comparison among all the trackers is shown in Figure 11, where the success rate vs. fps is plotted for the UAV123 dataset. Compared with other algorithms, the SN-based trackers have higher frame rates because their network parameters are not updated during online tracking. Among the CNN-based trackers, RT-MDNet has the highest frame rate and outperforms the other CNN-based trackers because it adds an adaptive RoI layer between the convolution layers and the fully connected layers. This approach greatly reduces the computational complexity of the tracking process and enables a higher frame rate during tracking. Among the CF-based trackers, ECO has the highest frame rate and the best performance: the factorized convolution operation makes the tracker more efficient, enabling it to achieve a higher frame rate and better results. ECO-HC uses only hand-crafted features (HOG and Color Names), further reducing the computation of the model and thereby achieving a higher fps than ECO. KCF also has a high fps, but it has the lowest success rate because it extracts only HOG features.

5.3. Evaluation in UAV20L

UAV20L is a representative aerial long-video dataset. This paper compares the performance of 10 representative long-video trackers. According to the evaluation report in Figure 12, the Siamese network trackers still perform at a high level and far surpass the other trackers that are based on deep features. In addition, we analyzed the evaluation results on the 12 independent attributes that are provided by UAV20L: Aspect Ratio Change (ARC), Background Clutter (BC), Camera Motion (CM), Fast Motion (FM), Full Occlusion (FOC), Illumination Variation (IV), Low Resolution (LR), Out-of-View (OV), Partial Occlusion (POC), Similar Object (SOB), Scale Variation (SV), and Viewpoint Change (VC). Table 7 and Table 8 present the scores of the 10 trackers on these attributes.
The Siamese network trackers perform better on Scale Variation (SV), Aspect Ratio Change (ARC), Fast Motion (FM), Partial Occlusion (POC), Out-of-View (OV), Viewpoint Change (VC), Camera Motion (CM), and Similar Object (SOB), whereas the deep neural network trackers show unique advantages on Background Clutter (BC), Full Occlusion (FOC), Illumination Variation (IV) and Low Resolution (LR). Among them, the MCCT algorithm uses an adaptive strategy to remove contaminated samples; it is effective against background interference and achieves a success rate nearly 20% higher than that of the Siamese networks. The TADT algorithm uses a callback function to ensure that the deep convolutional network retains the target localization features after convolutional learning, which helps it cope with complete occlusion and low resolution: its Full Occlusion (FOC) success rate is 0.307, approximately 6% higher than that of Siam R-CNN, and its Low Resolution (LR) success rate is 0.432.
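The advantage of the Siamese trackers observed on both UAV123 and UAV20L stems from casting tracking as template matching between deep features. The following PyTorch sketch isolates a SiamFC-style matching head: the template features act as a correlation kernel over the search-region features, producing a response map whose peak indicates the target location. The small backbone here is a stand-in for illustration, not the architecture of any tracker evaluated above.

```python
import torch
import torch.nn.functional as F
from torch import nn

class SiameseHead(nn.Module):
    """Minimal SiamFC-style matching head: correlate template and search features."""
    def __init__(self):
        super().__init__()
        # Stand-in backbone; real trackers use AlexNet/ResNet variants.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3), nn.ReLU(),
        )

    def forward(self, template, search):
        z = self.backbone(template)      # (1, C, Hz, Wz) template features
        x = self.backbone(search)        # (1, C, Hx, Wx) search-region features
        # Use the template features as a correlation kernel over the search region.
        response = F.conv2d(x, z)        # (1, 1, Hx-Hz+1, Wx-Wz+1) response map
        return response

head = SiameseHead()
resp = head(torch.rand(1, 3, 127, 127), torch.rand(1, 3, 255, 255))
peak = resp.flatten().argmax()           # peak location indicates the target position
```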

5.4. Comparison and Summary

For a single target, the available tracking algorithms are relatively mature when the motion trajectory and background are relatively simple, and good results can be obtained by using correlation filters, deep learning and other methods. For multicamera collaborative tracking, methods that combine geographic information have been proposed, but they still cannot solve the problem of multiple UAVs collaboratively tracking multiple targets in complex scenarios. Table 9 summarizes and compares 35 aerial photography target tracking algorithms with good performance.
For aerial photography target tracking across various ranges, environments and targets, both the tracking speed and the recognition accuracy must be considered. Therefore, the methods discussed in this paper can be divided into two categories: those that increase accuracy and those that increase tracking speed. Target position information can be used to establish a motion model that offers a fast tracking speed but poor tracking accuracy; when tracking is implemented by model matching, the tracking accuracy is high, but the processing speed is slower. Correlation filtering algorithms have been applied successfully in single-target tracking because they transform the data processing from the spatial domain into the frequency domain, which substantially increases the processing speed (see the sketch below). Therefore, for a single target with a relatively simple motion trajectory and background, the available target tracking algorithms and technologies are relatively mature, and methods combining filtering and deep learning can yield superior results.
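To illustrate the frequency-domain formulation mentioned above, the following sketch implements a simplified single-channel MOSSE-style correlation filter: training and detection reduce to element-wise operations on FFTs, which is why CF trackers such as KCF and ECO-HC reach high frame rates. It is a toy illustration under simplifying assumptions, not the exact formulation of any tracker in Table 9.

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    """Desired response: a Gaussian peak centered on the target."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))

def train_filter(patch, label, lam=1e-3):
    """Closed-form correlation filter in the Fourier domain (single channel)."""
    F_ = np.fft.fft2(patch)
    G = np.fft.fft2(label)
    H = (G * np.conj(F_)) / (F_ * np.conj(F_) + lam)   # element-wise, no matrix inverse
    return H

def detect(H, patch):
    """Apply the learned filter to a new patch and return the response peak location."""
    response = np.real(np.fft.ifft2(H * np.fft.fft2(patch)))
    return np.unravel_index(response.argmax(), response.shape)

patch = np.random.rand(64, 64)                       # grayscale target patch (toy data)
H = train_filter(patch, gaussian_label(64, 64))
dy, dx = detect(H, np.roll(patch, (3, 5), axis=(0, 1)))   # shifted patch moves the peak
```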
Compared with traditional correlation filtering methods, target tracking based on deep learning realizes substantial improvements in accuracy and detection speed, especially with Siamese network structures. However, because deep learning depends strongly on data and the amount of annotated data in target tracking is insufficient, current frameworks cannot yet yield fully satisfactory results, and the interpretability of deep learning methods remains limited. Summarizing the available target tracking algorithms, the following challenges must still be overcome.
  • Changes in the target attitude. Multiple postures of the same moving target reduce the accuracy of target recognition, which is a common source of interference in target tracking. When the target attitude changes, its features differ from those at the original attitude, and the target is easily lost, resulting in tracking failure. An attention mechanism can help networks focus on the important information of the target and reduce the probability of target loss during tracking (a minimal attention block is sketched after this list). Using attention mechanisms in deep learning network algorithms to ensure accurate target localization is a promising research direction.
  • Long-term tracking. During long-term tracking, due to the height and speed limits of aerial photography, the scale of the tracked target in the video changes as the tracking time increases. If the tracking box cannot adapt, it contains redundant background feature information, leading to erroneous updates of the target model parameters. Conversely, accelerated flight causes the target scale to increase continuously; since the tracking box then cannot contain all of the target's feature information, parameter update errors also occur. According to the experimental results of this paper, the Siamese network achieves satisfactory performance in long-term tracking but cannot conduct online real-time tracking. Constructing a suitable long-term target tracking model, according to the characteristics of long-term tracking tasks and their connection points with short-term tracking, that combines deep features and transfer learning remains a substantial challenge.
  • Target tracking in a complex background environment. Against a complex background, such as at night, under substantial changes in illumination intensity or with heavy occlusion, the target may exhibit reflection, occlusion or transient disappearance during movement. If the moving target is similar to the background, tracking failure occurs because the corresponding model of the target cannot be found. The main strategies for solving the occlusion problem are as follows. The deep features of the target can be fully extracted so that the network can handle the occlusion problem. During offline training, occluded targets can be added to the training samples so that the network fully learns coping strategies for when a target is blocked, and the trained offline network can then be used to track the target. Multi-UAV collaborative tracking can utilize target information from multiple angles and effectively address target tracking against a complex background.
  • Real-time tracking. Real-time tracking has always been a difficult problem in the field of target tracking. Current tracking methods based on deep learning have the advantage of learning from a large amount of data. However, in the target tracking process, only the annotation of the first frame is completely accurate, and it is difficult to extract sufficient training data for the network. Deep learning network models are complex and have many training parameters; if the network is adjusted online during the tracking stage to maintain tracking performance, the tracking speed is severely affected. Large-scale aerial photography datasets are gradually becoming available, which include rich target classes and cover various situations encountered in practical applications. Many tracking algorithms continue to learn deep features from these datasets in an end-to-end manner, which is expected to enable target tracking algorithms to realize real-time tracking while maintaining satisfactory accuracy.
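Following the first challenge above, the sketch below shows a minimal channel-plus-spatial attention block in the spirit of the SCSAtt/RASNet designs reviewed earlier; the layer sizes and the reduction parameter are illustrative choices of ours, not the published architectures.

```python
import torch
from torch import nn

class ChannelSpatialAttention(nn.Module):
    """Reweight a feature map over channels ('what') and locations ('where')."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel = nn.Sequential(            # squeeze-and-excitation style channel gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(            # spatial gate from pooled channel statistics
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel(x)                                   # emphasize informative channels
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.max(1, keepdim=True).values], dim=1)
        return x * self.spatial(pooled)                           # emphasize informative locations

feat = torch.rand(1, 128, 25, 25)             # e.g. backbone features of a search region
out = ChannelSpatialAttention(128)(feat)      # same shape, attention-weighted
```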

6. Future Directions

6.1. Cooperative Tracking and Path Planning of Multiple Drones

As the sensing field of a single UAV is limited and the 3D feature information of the target and scene is lost, it is necessary to use multiple UAVs cooperatively. However, in multi-UAV cooperative tracking, because the surveillance cameras provide discrete information, a mechanism for rapidly fusing information among multiple cameras is lacking, and multicamera coordination is necessary for efficient target tracking [92,93]. The problem of cooperative path planning [94] is also encountered. Although satisfactory planning and design results have been obtained, multiple challenges remain, such as locally optimal solutions [95] and long iteration times [96].

6.2. Long-Term Tracking and Abnormal Discovery

With the frequent occurrence of abnormal events in public areas, technology for detecting abnormal crowd behavior from aerial video has become a research hotspot at home and abroad in recent years [97]. Long-term tracking and monitoring are required, which poses new challenges for aerial photography tracking. By scope, abnormal events can be divided into two groups: abnormal group events and abnormal individual events [98]. Such events must be detected, and an alarm raised, during the tracking process. Using target behavior prediction and security situational awareness to realize real-time anomaly warning is a key problem to be solved in the future.

6.3. Visualization and Intelligent Analysis of Aerial Photography Data

UAVs rely on a variety of wireless network technologies to realize real-time video surveillance and to transmit related images or videos to a mobile command platform or backend system for intelligent identification and analysis, providing a decision-making basis for manpower deployment, emergency response and technical support. However, due to the lack of corresponding technical support and solutions, it is inconvenient to share information among aerial video devices and to establish and improve an integrated aerial video application platform, which constrains the role of intelligent monitoring systems in public security. Building on intelligent analysis and the deployment of 5G networks, realizing real-time tracking and security situational awareness prediction through visualization is essential for the future application of visualization platforms.

Author Contributions

J.J. contributed to performing the experiments and writing the report. Y.Q. contributed to project administration and funding acquisition. Z.L. contributed to review, editing, and supervising. Z.Y. contributed to conceptualization. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Science Foundation of China under Grant 61966035, by the International Cooperation Project of the Science and Technology Department of the Autonomous Region “Data-Driven Construction of Sino-Russian Cloud Computing Sharing Platform” (2020E01023), and by the National Science Foundation of China under Grant U1803261.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bonatti, R.; Ho, C.; Wang, W.; Choudhury, S.; Scherer, S.A. Towards a Robust Aerial Cinematography Platform: Localizing and Tracking Moving Targets in Unstructured Environments. arXiv 2019, arXiv:1904.02319. [Google Scholar]
  2. Zheng, Z.; Yao, H. A Method for UAV Tracking Target in Obstacle Environment. In Proceedings of the 2019 Chinese Automation Congress (CAC), Hangzhou, China, 22–24 November 2019; pp. 4639–4644. [Google Scholar]
  3. Zhang, S.; Zhao, X.; Zhou, B. Robust Vision-Based Control of a Rotorcraft UAV for Uncooperative Target Tracking. Sensors 2020, 20, 3474. [Google Scholar] [CrossRef] [PubMed]
  4. Wu, D.; Du, X.; Wang, K. An effective approach for underwater sonar image denoising based on sparse representation. In Proceedings of the 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC), Chongqing, China, 27–29 June 2018; pp. 389–393. [Google Scholar]
  5. Chen, Y.; Yu, M.; Jiang, G.; Peng, Z.; Chen, F. End-to-end single image enhancement based on a dual network cascade model. J. Vis. Commun. Image Represent. 2019, 61, 284–295. [Google Scholar] [CrossRef]
  6. Qiu, S.; Zhou, D.; Du, Y. The image stitching algorithm based on aggregated star groups. Signal Image Video Process. 2019, 13, 227–235. [Google Scholar] [CrossRef]
  7. Laguna, G.J.; Bhattacharya, S. Path planning with Incremental Roadmap Update for Visibility-based Target Tracking. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 4–8 November 2019; pp. 1159–1164. [Google Scholar]
  8. Yang, X.; Shi, J.; Zhou, Y.; Wang, C.; Hu, Y.; Zhang, X.; Wei, S. Ground Moving Target Tracking and Refocusing Using Shadow in Video-SAR. Remote Sens. 2020, 12, 3083. [Google Scholar] [CrossRef]
  9. Zhang, W.; Cong, M.; Wang, L. Algorithms for optical weak small targets detection and tracking: Review. In Proceedings of the International Conference on Neural Networks and Signal Processing, Nanjing, China, 14–17 December 2003; Volume 1, pp. 643–647. [Google Scholar] [CrossRef]
  10. De Oca, A.M.M.; Bahmanyar, R.; Nistor, N.; Datcu, M. Earth observation image semantic bias: A collaborative user annotation approach. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 2462–2477. [Google Scholar] [CrossRef] [Green Version]
  11. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for uav tracking. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 445–461. [Google Scholar]
  12. Smeulders, A.W.; Chu, D.M.; Cucchiara, R.; Calderara, S.; Dehghan, A.; Shah, M. Visual tracking: An experimental survey. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 1442–1468. [Google Scholar]
  13. Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Cehovin Zajc, L.; Vojir, T.; Bhat, G.; Lukezic, A.; Eldesokey, A.; et al. The sixth visual object tracking vot2018 challenge results. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–53. [Google Scholar]
  14. Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Cehovin Zajc, L.; Vojir, T.; Hager, G.; Lukezic, A.; Eldesokey, A.; et al. The visual object tracking vot2017 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 1949–1972. [Google Scholar]
  15. Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2411–2418. [Google Scholar]
  16. Wu, Y.; Lim, J.; Yang, M.H. Object Tracking Benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1834–1848. [Google Scholar] [CrossRef] [Green Version]
  17. Liang, P.; Blasch, E.; Ling, H. Encoding color information for visual tracking: Algorithms and benchmark. IEEE Trans. Image Process. 2015, 24, 5630–5644. [Google Scholar] [CrossRef]
  18. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–21 June 2019; pp. 5374–5383. [Google Scholar]
  19. Kiani Galoogahi, H.; Fagg, A.; Huang, C.; Ramanan, D.; Lucey, S. Need for speed: A benchmark for higher frame rate object tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1125–1134. [Google Scholar]
  20. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Nie, Q.; Cheng, H.; Liu, C.; Liu, X.; et al. Visdrone-det 2018: The vision meets drone object detection in image challenge results. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 437–468. [Google Scholar]
  21. Hu, Y.; Xiao, M.; Zhang, K.; Wang, X. Aerial infrared target tracking in complex background based on combined tracking and detecting. Math. Probl. Eng. 2019, 2019, 1–17. [Google Scholar] [CrossRef] [Green Version]
  22. Jia, X.; Lu, H.; Yang, M.H. Visual tracking via adaptive structural local sparse appearance model. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 18–20 June 2012; pp. 1822–1829. [Google Scholar]
  23. Hong, Z.; Chen, Z.; Wang, C.; Mei, X.; Prokhorov, D.; Tao, D. Multi-store tracker (muster): A cognitive psychology inspired approach to object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 749–758. [Google Scholar]
  24. Raguram, R.; Chum, O.; Pollefeys, M.; Matas, J.; Frahm, J.M. USAC: A universal framework for random sample consensus. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 2022–2038. [Google Scholar] [CrossRef]
  25. Ke, Y.; Sukthankar, R. PCA-SIFT: A more distinctive representation for local image descriptors. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 27 June–2 July 2004; Volume 2, p. II-II. [Google Scholar]
  26. Zhou, X.; Li, J.; Chen, S.; Cai, H.; Liu, H. Multiple perspective object tracking via context-aware correlation filter. IEEE Access 2018, 6, 43262–43273. [Google Scholar] [CrossRef]
  27. He, Z.; Yi, S.; Cheung, Y.M.; You, X.; Tang, Y.Y. Robust object tracking via key patch sparse representation. IEEE Trans. Cybern. 2016, 47, 354–364. [Google Scholar] [CrossRef] [PubMed]
  28. Han, J.; Liang, K.; Zhou, B.; Zhu, X.; Zhao, J.; Zhao, L. Infrared small target detection utilizing the multiscale relative local contrast measure. IEEE Geosci. Remote Sens. Lett. 2018, 15, 612–616. [Google Scholar] [CrossRef]
  29. Sanchez-Matilla, R.; Poiesi, F.; Cavallaro, A. Online multi-target tracking with strong and weak detections. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 84–99. [Google Scholar]
  30. Wang, C.; Song, F.; Qin, S. Infrared small target tracking by discriminative classification based on Gaussian mixture model in compressive sensing domain. In International Conference on Optical and Photonics Engineering (icOPEN 2016); International Society for Optics and Photonics: Bellingham, WA, USA, 2017; Volume 10250, p. 102502L. [Google Scholar]
  31. Liu, M.; Huang, Z.; Fan, Z.; Zhang, S.; He, Y. Infrared dim target detection and tracking based on particle filter. In Proceedings of the 2017 36th Chinese Control Conference (CCC), Dalian, China, 26–28 July 2017; pp. 5372–5378. [Google Scholar]
  32. Li, S.J.; Fan, X.; Zhu, B.; Cheng, Z.D. A method for small infrared targets detection based on the technology of motion blur recovery. Acta Photonica Sin. 2017, 37, 06100011–06100017. [Google Scholar]
  33. Raj, N.N.; Vijay, A.S. Adaptive blind deconvolution and denoising of motion blurred images. In 2016 IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT); IEEE: Piscataway, NJ, USA, 2016; pp. 1171–1175. [Google Scholar]
  34. Shkurko, K.; Yuksel, C.; Kopta, D.; Mallett, I.; Brunvand, E. Time Interval Ray Tracing for Motion Blur. IEEE Trans. Vis. Comput. Graph. 2017, 24, 3225–3238. [Google Scholar] [CrossRef]
  35. Inoue, M.; Gu, Q.; Jiang, M.; Takaki, T.; Ishii, I.; Tajima, K. Motion-blur-free high-speed video shooting using a resonant mirror. Sensors 2017, 17, 2483. [Google Scholar] [CrossRef] [Green Version]
  36. Bi, Y.; Bai, X.; Jin, T.; Guo, S. Multiple feature analysis for infrared small target detection. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1333–1337. [Google Scholar] [CrossRef]
  37. Qiang, Z.; Du, X.; Sun, L. Remote sensing image fusion for dim target detection. In Proceedings of the 2011 International Conference on Advanced Mechatronic Systems, Zhengzhou, China, 11–13 August 2011; pp. 379–383. [Google Scholar]
  38. Wu, D.; Zhang, L.; Lin, L. Based on the moving average and target motion information for detection of weak small target. In Proceedings of the 2018 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS), Xiamen, China, 25–26 January 2018; pp. 641–644. [Google Scholar]
  39. Rollason, M.; Salmond, D. Particle filter for track-before-detect of a target with unknown amplitude viewed against a structured scene. IET Radar Sonar Navig. 2018, 12, 603–609. [Google Scholar] [CrossRef]
  40. Wang, H.; Peng, J.; Yue, S. A feedback neural network for small target motion detection in cluttered backgrounds. In International Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2018; pp. 728–737. [Google Scholar]
  41. Martin, D.; Gustav, F.; Fahad Shahbaz, K.; Michael, F. Accurate scale estimation for robust visual tracking. In Proceedings of the British Machine Vision Conference, Nottingham, UK, 1–5 September 2014. [Google Scholar]
  42. Cheng, H.; Lin, L.; Zheng, Z.; Guan, Y.; Liu, Z. An autonomous vision-based target tracking system for rotorcraft unmanned aerial vehicles. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 1732–1738. [Google Scholar]
  43. Li, F.; Yao, Y.; Li, P.; Zhang, D.; Zuo, W.; Yang, M.H. Integrating boundary and center correlation filters for visual tracking with aspect ratio variation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 2001–2009. [Google Scholar]
  44. Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2544–2550. [Google Scholar]
  45. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. Exploiting the circulant structure of tracking-by-detection with kernels. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2012; pp. 702–715. [Google Scholar]
  46. Li, Y.; Zhu, J. A scale adaptive kernel correlation filter tracker with feature integration. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 254–265. [Google Scholar]
  47. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef] [Green Version]
  48. Li, Y.; Fu, C.; Huang, Z.; Zhang, Y.; Pan, J. Intermittent Contextual Learning for Keyfilter-Aware UAV Object Tracking Using Deep Convolutional Feature. IEEE Trans. Multimed. 2020. [Google Scholar] [CrossRef]
  49. Li, Y.; Fu, C.; Huang, Z.; Zhang, Y.; Pan, J. Keyfilter-aware real-time uav object tracking. arXiv 2020, arXiv:2003.05218. [Google Scholar]
  50. Oh, H.; Kim, S.; Shin, H.S.; Tsourdos, A. Coordinated standoff tracking of moving target groups using multiple UAVs. IEEE Trans. Aerosp. Electron. Syst. 2015, 51, 1501–1514. [Google Scholar]
  51. Greatwood, C.; Bose, L.; Richardson, T.; Mayol-Cuevas, W.; Chen, J.; Carey, S.J.; Dudek, P. Tracking control of a UAV with a parallel visual processor. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 4248–4254. [Google Scholar]
  52. Song, R.; Long, T.; Wang, Z.; Cao, Y.; Xu, G. Multi-UAV Cooperative Target Tracking Method using sparse a search and Standoff tracking algorithms. In Proceedings of the 2018 IEEE CSAA Guidance, Navigation and Control Conference (CGNCC), Xiamen, China, 10–12 August 2018; pp. 1–6. [Google Scholar]
  53. Huang, Z.; Fu, C.; Li, Y.; Lin, F.; Lu, P. Learning aberrance repressed correlation filters for real-time uav tracking. In Proceedings of the IEEE International Conference on Computer Vision, Long Beach, CA, USA, 16–20 June 2019; pp. 2891–2900. [Google Scholar]
  54. Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 4310–4318. [Google Scholar]
  55. Li, F.; Tian, C.; Zuo, W.; Zhang, L.; Yang, M.H. Learning spatial-temporal regularized correlation filters for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4904–4913. [Google Scholar]
  56. Li, Y.; Fu, C.; Ding, F.; Huang, Z.; Lu, G. AutoTrack: Towards High-Performance Visual Tracking for UAV with Automatic Spatio-Temporal Regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–18 June 2020; pp. 11923–11932. [Google Scholar]
  57. Che, F.; Niu, Y.; Li, J.; Wu, L. Cooperative Standoff Tracking of Moving Targets Using Modified Lyapunov Vector Field Guidance. Appl. Sci. 2020, 10, 3709. [Google Scholar] [CrossRef]
  58. Wang, L.; Ouyang, W.; Wang, X.; Lu, H. Stct: Sequentially training convolutional networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1373–1381. [Google Scholar]
  59. Yun, S.; Choi, J.; Yoo, Y.; Yun, K.; Young Choi, J. Action-decision networks for visual tracking with deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Venice, Italy, 22–29 October 2017; pp. 2711–2720. [Google Scholar]
  60. Zhang, X.; Zhang, X.; Du, X.; Zhou, X.; Yin, J. Learning Multi-Domain Convolutional Network for RGB-T Visual Tracking. In Proceedings of the 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, 13–15 October 2018; pp. 1–6. [Google Scholar]
  61. Jung, I.; Son, J.; Baek, M.; Han, B. Real-time mdnet. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 83–98. [Google Scholar]
  62. Huang, C.; Lucey, S.; Ramanan, D. Learning policies for adaptive tracking with deep feature cascades. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 105–114. [Google Scholar]
  63. Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Convolutional features for correlation filter based visual tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 13–16 December 2015; pp. 58–66. [Google Scholar]
  64. Qi, Y.; Zhang, S.; Qin, L.; Yao, H.; Huang, Q.; Lim, J.; Yang, M.H. Hedged deep tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4303–4311. [Google Scholar]
  65. Xia, H.; Zhang, Y.; Yang, M.; Zhao, Y. Visual tracking via deep feature fusion and correlation filters. Sensors 2020, 20, 3370. [Google Scholar] [CrossRef]
  66. Zhang, J.; Ma, S.; Sclaroff, S. MEEM: Robust Tracking via Multiple Experts Using Entropy Minimization. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 188–203. [Google Scholar]
  67. Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646. [Google Scholar]
  68. Danelljan, M.; Robinson, A.; Khan, F.S.; Felsberg, M. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 472–488. [Google Scholar]
  69. Bhat, G.; Johnander, J.; Danelljan, M.; Shahbaz Khan, F.; Felsberg, M. Unveiling the power of deep tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 483–498. [Google Scholar]
  70. Wang, N.; Zhou, W.; Tian, Q.; Hong, R.; Wang, M.; Li, H. Multi-cue correlation filters for robust visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4844–4853. [Google Scholar]
  71. Ke, H.; Chen, D.; Li, X.; Tang, Y.; Shah, T.; Ranjan, R. Towards brain big data classification: Epileptic EEG identification with a lightweight VGGNet on global MIC. IEEE Access 2018, 6, 14722–14733. [Google Scholar] [CrossRef]
  72. Li, X.; Ma, C.; Wu, B.; He, Z.; Yang, M.H. Target-aware deep tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1369–1378. [Google Scholar]
  73. Lukezic, A.; Matas, J.; Kristan, M. D3S-A Discriminative Single Shot Segmentation Tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–18 June 2020; pp. 7133–7142. [Google Scholar]
  74. Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines; AAAI: Menlo Park, CA, USA, 2020; pp. 12549–12556. [Google Scholar]
  75. Tao, R.; Gavves, E.; Smeulders, A.W. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1420–1429. [Google Scholar]
  76. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 850–865. [Google Scholar]
  77. Tang, W.; Yu, P.; Wu, Y. Deeply learned compositional models for human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 190–206. [Google Scholar]
  78. Valmadre, J.; Bertinetto, L.; Henriques, J.; Vedaldi, A.; Torr, P.H. End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Venice, Italy, 22–29 October 2017; pp. 2805–2813. [Google Scholar]
  79. Wang, Q.; Gao, J.; Xing, J.; Zhang, M.; Hu, W. Dcfnet: Discriminant correlation filters network for visual tracking. arXiv 2017, arXiv:1704.04057. [Google Scholar]
  80. Fan, H.; Ling, H. Parallel tracking and verifying: A framework for real-time and high accuracy visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5486–5494. [Google Scholar]
  81. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8971–8980. [Google Scholar]
  82. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117. [Google Scholar]
  83. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2019; pp. 4282–4291. [Google Scholar]
  84. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  85. Voigtlaender, P.; Luiten, J.; Torr, P.H.; Leibe, B. Siam r-cnn: Visual tracking by re-detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–18 June 2020; pp. 6578–6588. [Google Scholar]
  86. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese Box Adaptive Network for Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–18 June 2020; pp. 6668–6677. [Google Scholar]
  87. Wang, N.; Song, Y.; Ma, C.; Zhou, W.; Liu, W.; Li, H. Unsupervised deep tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1308–1317. [Google Scholar]
  88. Zhao, L.; Ishag Mahmoud, M.A.; Ren, H.; Zhu, M. A Visual Tracker Offering More Solutions. Sensors 2020, 20, 5374. [Google Scholar] [CrossRef]
  89. Wang, Q.; Teng, Z.; Xing, J.; Gao, J.; Hu, W.; Maybank, S. Learning attentions: Residual attentional siamese network for high performance online visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4854–4863. [Google Scholar]
  90. Rahman, M.M.; Fiaz, M.; Jung, S.K. Efficient Visual Tracking with Stacked Channel-Spatial Attention Learning. IEEE Access 2020. [Google Scholar] [CrossRef]
  91. Li, D.; Wen, G.; Kuai, Y.; Porikli, F. End-to-end feature integration for correlation filter tracking with channel attention. IEEE Signal Process. Lett. 2018, 25, 1815–1819. [Google Scholar] [CrossRef]
  92. Ru, C.J.; Qi, X.M.; Guan, X.N. Distributed cooperative search control method of multiple UAVs for moving target. Int. J. Aerosp. Eng. 2015, 2015. [Google Scholar] [CrossRef] [Green Version]
  93. Nikodem, M.; Słabicki, M.; Surmacz, T.; Mrówka, P.; Dołęga, C. Multi-Camera Vehicle Tracking Using Edge Computing and Low-Power Communication. Sensors 2020, 20, 3334. [Google Scholar] [CrossRef] [PubMed]
  94. Zhong, Y.; Yao, P.; Sun, Y.; Yang, J. Method of multi-UAVs cooperative search for Markov moving targets. In Proceedings of the 2017 29th Chinese Control And Decision Conference (CCDC), Chongqing, China, 28–30 November 2017; pp. 6783–6789. [Google Scholar]
  95. Ramirez-Atencia, C.; Bello-Orgaz, G.; R-Moreno, M.D.; Camacho, D. Solving complex multi-UAV mission planning problems using multi-objective genetic algorithms. Soft Comput. 2017, 21, 4883–4900. [Google Scholar] [CrossRef]
  96. Oh, H.; Kim, S.; Tsourdos, A. Road-map–assisted standoff tracking of moving ground vehicle using nonlinear model predictive control. IEEE Trans. Aerosp. Electron. Syst. 2015, 51, 975–986. [Google Scholar]
  97. Da Costa, J.R.; Nedjah, N.; de Macedo Mourelle, L.; da Costa, D.R. Crowd abnormal detection using artificial bacteria colony and Kohonen’s neural network. In Proceedings of the 2017 IEEE Latin American Conference on Computational Intelligence (LA-CCI), Arequipa, Peru, 8–10 November 2017; pp. 1–6. [Google Scholar]
  98. Cong, Y.; Yuan, J.; Liu, J. Abnormal event detection in crowded scenes using sparse representation. Pattern Recognit. 2013, 46, 1851–1864. [Google Scholar] [CrossRef]
Figure 1. Probability hypothesis density particle filter framework calculation process. At time k, strong detections Z_k^+ and weak detections Z_k are associated with the predicted state that is calculated from the predicted particles. After the early association, the two detection subsets are used for tracking: the associated detections inherit the identity of the corresponding trajectory and are used to track the state, while the unassociated strong detections are used to initialize new states. After updating and resampling, the particles x_k are used to estimate the state X_k.
Figure 2. RT-MDNet structure. The model consists of K branches with a shared layer and a domain-specific layer. Green and red represent positive samples and negative samples, respectively, in each domain.
Figure 3. EArly Stopping Tracker (EAST) network structure. The optimal feature layer is determined by the agent's action.
Figure 4. SiamRPN++ network structure. Given a target template and a search region, the dense output prediction is obtained by fusing the outputs of the SiamRPN blocks. The middle SiamRPN block is shown on the right; it is divided into two parts: a classification branch and a bounding box regression branch.
Figure 5. RASNet structure. The RASNet consists of a shared feature extractor, attention mechanisms (general attention, residual attention, and channel attention), and a weighted cross-correlation layer (WXCorr).
Figure 6. Stacked channel-spatial attention (SCSAtt) structure. Channel attention and spatial attention are combined to learn "what" and "where" to emphasize or suppress, thereby effectively locating the target information.
Figure 7. Evaluation indicators.
Figure 8. Overall accuracy and success rate of the trackers in the UAV123 benchmark test. The abscissa is the threshold, and the ordinate is the precision value.
Figure 9. Result comparison of the deep learning trackers. The abscissa is the attribute, and the ordinate is the precision value.
Figure 10. Visualization of tracking results in different test sequences. The test sequences include bicycles, boats, buildings and people.
Figure 11. The success rate and frame rate of trackers on the UAV123 dataset.
Figure 12. Overall accuracy and success rates of the trackers in the UAV20L benchmark test.
Table 1. Common aerial video datasets.
Datasets | Number of Videos | Shortest Video Frames | Average Video Frames | Longest Video Frames | Total Video Frames
UAV123 [11] | 120 | 109 | 915 | 9085 | 112,578
UAV20L [11] | 20 | 1717 | 2934 | 5527 | 58,670
ALOV300++ [12] | 314 | 19 | 483 | 5975 | 151,657
VOT-2014 [13] | 25 | 164 | 409 | 1210 | 10,000
VOT-2017 [14] | 60 | 41 | 356 | 1500 | 21,000
OTB2013 [15] | 51 | 71 | 578 | 3872 | 29,491
OTB2015 [16] | 100 | 71 | 590 | 3872 | 59,040
Temple Color 128 [17] | 129 | 71 | 429 | 3872 | 55,346
LaSOT [18] | 1400 | 1000 | 2506 | 11,397 | 3.52 M
NFS [19] | 100 | 169 | 3830 | 20,665 | 383,000
VisDrone 2018 [20] | 288 | - | 10,209 | - | 261,908
Table 2. Configuration parameters of experimental environment.
Parameter Name | Version or Value
Operating system | Windows 10
CPU | Intel Xeon 3.60 GHz
GPU | NVIDIA TITAN V/12 G
CUDA | CUDA 10.1
RAM | 32 GB
Table 3. The URLs of the implemented tracking algorithm code. P represents an implementation in Python, and M represents an implementation in MATLAB.
Table 4. Tracker characteristics.
Tracker | Base Network | Feature | Online-Learning | Real-Time
SiamRPN++ | SiamRPN | CNN | N | Y
SiamBAN | SiamFC | CNN | N | Y
Siam R-CNN | SiamFC | CNN | Y | N
DaSiamRPN | SiamRPN | CNN | Y | Y
SCSAtt | SiamFC | CNN | N | Y
UDT | SiamFC | CNN | N | Y
RTMDNet | MDNet | CNN | Y | Y
ECO | C-COT | CNN, HOG, CN | Y | N
ECO-HC | C-COT | HOG, CN | Y | N
C-COT | C-COT | CNN | N | N
MCCT | DCF | CNN | Y | N
TADT | TADT | CNN | N | Y
DeepSTRCF | STRCF | CNN, HOG, CN | Y | N
MEEM | MEEM | CNN | Y | N
STRCF | SRDCF | HOG, CN, Gray | Y | N
SRDCF | SRDCF | HOG, CN | Y | N
SAMF | KCF | HOG, CN, Gray | N | N
MUSTER | MUSTER | HOG, CN | N | N
DSST | CF | HOG, CN, Gray | N | N
KCF | CF | HOG | N | N
Table 5. The precision results of various trackers under the UAV123 dataset attributes. The best-performing tracker is displayed in red, and the second-best performer is in yellow.
Tracker | ARC | BC | CM | FM | FOC | IV | LR | OV | POC | SOB | SV | VC
Siam R-CNN | 0.854 | 0.714 | 0.889 | 0.822 | 0.776 | 0.809 | 0.706 | 0.839 | 0.809 | 0.812 | 0.828 | 0.875
SiamBAN | 0.796 | 0.645 | 0.848 | 0.805 | 0.671 | 0.766 | 0.719 | 0.789 | 0.765 | 0.777 | 0.813 | 0.824
SiamRPN++ | 0.818 | 0.655 | 0.863 | 0.774 | 0.661 | 0.815 | 0.690 | 0.816 | 0.771 | 0.800 | 0.820 | 0.876
DaSiamRPN | 0.756 | 0.668 | 0.786 | 0.737 | 0.633 | 0.710 | 0.663 | 0.693 | 0.701 | 0.747 | 0.754 | 0.753
SCSAtt | 0.722 | 0.541 | 0.775 | 0.690 | 0.562 | 0.678 | 0.626 | 0.721 | 0.695 | 0.78 | 0.749 | 0.747
ECO | 0.654 | 0.624 | 0.721 | 0.652 | 0.576 | 0.710 | 0.683 | 0.590 | 0.669 | 0.747 | 0.707 | 0.680
RTMDNet | 0.720 | 0.689 | 0.767 | 0.641 | 0.579 | 0.723 | 0.689 | 0.659 | 0.700 | 0.754 | 0.735 | 0.702
MCCT | 0.683 | 0.616 | 0.720 | 0.614 | 0.573 | 0.704 | 0.621 | 0.659 | 0.683 | 0.741 | 0.700 | 0.681
TADT | 0.667 | 0.669 | 0.723 | 0.617 | 0.609 | 0.669 | 0.664 | 0.626 | 0.694 | 0.728 | 0.692 | 0.655
DeepSTRCF | 0.644 | 0.594 | 0.696 | 0.586 | 0.520 | 0.664 | 0.597 | 0.618 | 0.630 | 0.717 | 0.667 | 0.640
UDT | 0.618 | 0.516 | 0.654 | 0.600 | 0.474 | 0.599 | 0.585 | 0.580 | 0.578 | 0.668 | 0.639 | 0.599
SRDCF | 0.587 | 0.526 | 0.627 | 0.524 | 0.501 | 0.600 | 0.579 | 0.576 | 0.608 | 0.678 | 0.639 | 0.593
STRCF | 0.586 | 0.563 | 0.658 | 0.5554 | 0.488 | 0.538 | 0.589 | 0.570 | 0.587 | 0.648 | 0.643 | 0.581
ECO-HC | 0.653 | 0.608 | 0.712 | 0.587 | 0.569 | 0.653 | 0.631 | 0.599 | 0.653 | 0.698 | 0.690 | 0.640
C-COT | 0.586 | 0.502 | 0.658 | 0.554 | 0.487 | 0.536 | 0.584 | 0.388 | 0.587 | 0.648 | 0.643 | 0.581
MEEM | 0.563 | 0.516 | 0.595 | 0.418 | 0.460 | 0.509 | 0.580 | 0.476 | 0.526 | 0.629 | 0.591 | 0.680
SAMF | 0.497 | 0.530 | 0.558 | 0.402 | 0.458 | 0.524 | 0.539 | 0.469 | 0.506 | 0.611 | 0.541 | 0.518
MUSTER | 0.516 | 0.581 | 0.570 | 0.406 | 0.463 | 0.489 | 0.527 | 0.296 | 0.495 | 0.629 | 0.552 | 0.537
DSST | 0.482 | 0.500 | 0.520 | 0.367 | 0.406 | 0.524 | 0.475 | 0.256 | 0.505 | 0.604 | 0.538 | 0.502
KCF | 0.424 | 0.454 | 0.483 | 0.300 | 0.374 | 0.418 | 0.436 | 0.386 | 0.451 | 0.578 | 0.471 | 0.436
Table 6. The success results of various trackers under the UAV123 dataset attributes. The best-performing tracker is displayed in red, and the second-best performer is in yellow.
Tracker | ARC | BC | CM | FM | FOC | IV | LR | OV | POC | SOB | SV | VC
Siam R-CNN | 0.795 | 0.648 | 0.839 | 0.753 | 0.638 | 0.765 | 0.614 | 0.772 | 0.738 | 0.749 | 0.778 | 0.842
SiamRPN++ | 0.751 | 0.564 | 0.804 | 0.706 | 0.509 | 0.756 | 0.570 | 0.728 | 0.692 | 0.721 | 0.761 | 0.832
SiamBAN | 0.724 | 0.549 | 0.783 | 0.723 | 0.510 | 0.699 | 0.590 | 0.707 | 0.678 | 0.695 | 0.746 | 0.772
DaSiamRPN | 0.680 | 0.574 | 0.738 | 0.660 | 0.464 | 0.653 | 0.524 | 0.631 | 0.625 | 0.659 | 0.692 | 0.709
SCSAtt | 0.597 | 0.445 | 0.691 | 0.564 | 0.379 | 0.592 | 0.592 | 0.600 | 0.588 | 0.673 | 0.655 | 0.645
ECO | 0.497 | 0.479 | 0.599 | 0.463 | 0.358 | 0.534 | 0.470 | 0.506 | 0.548 | 0.629 | 0.588 | 0.530
RTMDNet | 0.524 | 0.463 | 0.608 | 0.454 | 0.326 | 0.574 | 0.464 | 0.553 | 0.596 | 0.617 | 0.622 | 0.536
MCCT | 0.521 | 0.512 | 0.618 | 0.464 | 0.360 | 0.593 | 0.411 | 0.543 | 0.553 | 0.615 | 0.578 | 0.546
TADT | 0.501 | 0.525 | 0.613 | 0.456 | 0.396 | 0.544 | 0.479 | 0.499 | 0.564 | 0.610 | 0.582 | 0.513
DeepSTRCF | 0.503 | 0.444 | 0.605 | 0.427 | 0.318 | 0.529 | 0.398 | 0.513 | 0.512 | 0.601 | 0.560 | 0.519
UDT | 0.499 | 0.422 | 0.569 | 0.480 | 0.308 | 0.499 | 0.499 | 0.500 | 0.482 | 0.563 | 0.548 | 0.481
SRDCF | 0.431 | 0.401 | 0.545 | 0.366 | 0.301 | 0.457 | 0.359 | 0.465 | 0.468 | 0.532 | 0.510 | 0.441
STRCF | 0.418 | 0.425 | 0.512 | 0.359 | 0.289 | 0.385 | 0.388 | 0.470 | 0.469 | 0.550 | 0.516 | 0.426
ECO-HC | 0.491 | 0.459 | 0.598 | 0.414 | 0.368 | 0.511 | 0.404 | 0.520 | 0.525 | 0.585 | 0.561 | 0.476
C-COT | 0.584 | 0.382 | 0.539 | 0.357 | 0.289 | 0.381 | 0.382 | 0.471 | 0.462 | 0.547 | 0.510 | 0.421
MEEM | 0.362 | 0.389 | 0.426 | 0.242 | 0.258 | 0.360 | 0.304 | 0.329 | 0.380 | 0.516 | 0.405 | 0.357
SAMF | 0.362 | 0.408 | 0.450 | 0.283 | 0.249 | 0.362 | 0.269 | 0.349 | 0.392 | 0.500 | 0.430 | 0.354
MUSTER | 0.516 | 0.439 | 0.432 | 0.243 | 0.242 | 0.354 | 0.296 | 0.297 | 0.347 | 0.471 | 0.405 | 0.385
DSST | 0.482 | 0.389 | 0.346 | 0.200 | 0.226 | 0.331 | 0.256 | 0.293 | 0.342 | 0.401 | 0.322 | 0.299
KCF | 0.422 | 0.341 | 0.347 | 0.187 | 0.210 | 0.296 | 0.210 | 0.257 | 0.321 | 0.379 | 0.307 | 0.283
Table 7. The precision results of various trackers under the UAV20L dataset attributes. The best-performing tracker is displayed in red, and the second-best performer in yellow.
Tracker | ARC | BC | CM | FM | FOC | IV | LR | OV | POC | SOB | SV | VC
Siam R-CNN | 0.522 | 0.191 | 0.597 | 0.642 | 0.349 | 0.439 | 0.521 | 0.641 | 0.578 | 0.683 | 0.597 | 0.561
DaSiamRPN | 0.517 | 0.191 | 0.595 | 0.641 | 0.346 | 0.436 | 0.520 | 0.637 | 0.572 | 0.667 | 0.584 | 0.558
SiamRPN | 0.514 | 0.190 | 0.596 | 0.642 | 0.351 | 0.437 | 0.518 | 0.641 | 0.574 | 0.678 | 0.581 | 0.549
MCCT | 0.516 | 0.382 | 0.54 | 0.534 | 0.418 | 0.563 | 0.475 | 0.575 | 0.573 | 0.618 | 0.586 | 0.495
ECO | 0.489 | 0.382 | 0.567 | 0.493 | 0.409 | 0.551 | 0.486 | 0.546 | 0.554 | 0.559 | 0.567 | 0.507
TADT | 0.521 | 0.383 | 0.588 | 0.614 | 0.444 | 0.518 | 0.550 | 0.534 | 0.577 | 0.587 | 0.588 | 0.505
PTAV | 0.489 | 0.382 | 0.567 | 0.493 | 0.409 | 0.551 | 0.486 | 0.546 | 0.554 | 0.559 | 0.567 | 0.507
DeepSTRCF | 0.488 | 0.381 | 0.566 | 0.508 | 0.429 | 0.523 | 0.512 | 0.549 | 0.556 | 0.563 | 0.566 | 0.503
UDT | 0.446 | 0.378 | 0.496 | 0.492 | 0.427 | 0.437 | 0.445 | 0.478 | 0.487 | 0.521 | 0.489 | 0.402
SRDCF | 0.389 | 0.252 | 0.482 | 0.327 | 0.331 | 0.411 | 0.429 | 0.495 | 0.491 | 0.522 | 0.481 | 0.414
SAMF | 0.382 | 0.330 | 0.443 | 0.308 | 0.351 | 0.416 | 0.419 | 0.384 | 0.445 | 0.457 | 0.443 | 0.363
Table 8. The success results of various trackers under the UAV20L dataset attributes. The best-performing tracker is displayed in red, and the second-best performer in yellow.
Tracker | ARC | BC | CM | FM | FOC | IV | LR | OV | POC | SOB | SV | VC
Siam R-CNN | 0.490 | 0.137 | 0.569 | 0.544 | 0.241 | 0.431 | 0.432 | 0.623 | 0.549 | 0.691 | 0.691 | 0.57
DaSiamRPN | 0.489 | 0.131 | 0.564 | 0.541 | 0.225 | 0.430 | 0.424 | 0.605 | 0.543 | 0.687 | 0.691 | 0.552
SiamRPN | 0.483 | 0.136 | 0.557 | 0.537 | 0.238 | 0.427 | 0.416 | 0.618 | 0.533 | 0.682 | 0.678 | 0.561
MCCT | 0.403 | 0.327 | 0.463 | 0.347 | 0.285 | 0.428 | 0.337 | 0.448 | 0.456 | 0.563 | 0.563 | 0.497
ECO | 0.42 | 0.288 | 0.506 | 0.321 | 0.267 | 0.498 | 0.341 | 0.501 | 0.495 | 0.565 | 0.565 | 0.51
TADT | 0.464 | 0.321 | 0.537 | 0.445 | 0.307 | 0.504 | 0.432 | 0.448 | 0.525 | 0.591 | 0.591 | 0.563
PTAV | 0.42 | 0.288 | 0.506 | 0.321 | 0.267 | 0.498 | 0.341 | 0.501 | 0.495 | 0.565 | 0.565 | 0.51
DeepSTRCF | 0.474 | 0.297 | 0.556 | 0.397 | 0.286 | 0.531 | 0.408 | 0.552 | 0.545 | 0.61 | 0.61 | 0.556
UDT | 0.4 | 0.319 | 0.456 | 0.404 | 0.309 | 0.43 | 0.349 | 0.433 | 0.441 | 0.514 | 0.514 | 0.43
SRDCF | 0.305 | 0.203 | 0.384 | 0.207 | 0.214 | 0.327 | 0.24 | 0.407 | 0.383 | 0.463 | 0.463 | 0.39
SAMF | 0.281 | 0.268 | 0.349 | 0.143 | 0.22 | 0.37 | 0.275 | 0.307 | 0.356 | 0.371 | 0.371 | 0.349
Table 9. Comparison of aerial video tracking methods.
Category | Method | Applicable Target | Applicable Scenario | Number of Targets
Manual features | ASLA [22] | Common objectives | Severe target occlusion | Single target
 | MUSTer [23] | Common objectives | Short/long-time tracking | Single target
 | Characteristics of the cascade [62] | Common objectives | Hover aerial shot | Single target
 | Moving average method [38] | Weak small targets | Smaller target | Single target
 | Grayscale features, spatial features [35] | Weak/background similar targets | Complex background/small target | Single target
Filter tracking | Bayesian trackers [39] | Blurred objectives | Common scenario | Many objectives
 | Wiener filtering [32] | Blurred objectives | Blurred target | Single target
 | Vector field characteristics [50] | Fast/multitarget | Fast-moving speed/wide field of vision | Many objectives
 | Feedback ESTMD [40] | Moving small target | Complicated background | Single target
 | ARCF [53] | Moving target | Severe occlusion/background interference | Single target
 | DSST [41] | Moving target | Common scenario | Single target
 | KCF [47] | Moving target | Common scenario | Single target
 | SRDCF [54] | Moving target | Large range of motion/complex scenes | Single target
 | STRCF [55] | Moving target | Common scenario | Single target
 | AutoTrack [56] | Moving target | Common scenario | Single target
Scale estimate | SAMF [46] | Moving target | Scale change | Single target
Depth features | RT-MDNet [61] | Moving target | Complicated background | Single target
 | MEEM [66] | Multiscale target | General background | Single target
 | C-COT [68] | Common objectives | General background | Single target
 | ECO [67] | Common objectives | General background | Single target
 | ECO+ [69] | Common objectives | Background complex/multiscale | Single target
 | MCCT [70] | Common objectives | Target occlusion/complex background | Single target
 | TADT [72] | Target deformation | Background interference/common scenario | Single target
 | DeepSTRCF [55] | Similar objectives | Common scenario | Single target
Siamese network | SiamFC [76] | Target deformation | General background | Single target
 | PTAV [80] | Common objectives | Common scenario | Single target
 | SiamRPN [81] | Weak small targets | Common scenario | Single target
 | DaSiamRPN [82] | Moving target | Long track | Single target
 | SiamRPN++ [83] | Moving target | Various scenarios | Single target
 | Siam R-CNN [85] | Multiscale target | Severe occlusion/common scenario | Single target
 | SiamBAN [86] | Common objectives | Various scenarios | Single target
 | UDT [87] | Multiscale target | Severe occlusion | Single target
Attention mechanism | RASNet [89] | Common objectives | General background | Single target
 | SCSAtt [90] | Common objectives | Target scales vary substantially | Single target
 | FICFNet [91] | Moving target | Severe deformation/occlusion of the target | Single target
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
