1. Introduction
Unmanned aerial vehicles (UAVs) have played an important role in aviation remote sensing because of their flexibility and safety. With the development of artificial intelligence in recent years, UAVs have been widely used in many computer vision applications, such as intelligent surveillance [1], disaster rescue [2], smart agriculture [3], and city security [4]. Among these applications, object tracking is one of the most fundamental and important algorithms.
Object tracking can be divided into single-object tracking (SOT) and multiobject tracking (MOT). SOT aims to estimate the location of a given object in a video sequence. Numerous studies have proposed visual tracking methods, such as discriminative trackers [5,6,7,8,9] and Siamese-based trackers [10,11,12,13,14]. These effective models mainly focus on object tracking in surveillance scenarios. Different from generic object tracking, UAV object tracking follows objects from an aerial perspective. From our analysis of UAV object-tracking data, we find two main differences between UAV object tracking and generic object tracking.
The first challenge is the small size of objects in UAV scenes. The relative distance between the UAV and the objects is larger than that between other sensors and their targets. Most objects in video sequences captured by UAVs occupy only a tiny fraction of the image pixels. The information we can obtain from these small objects is limited, especially the appearance features used to represent them. Furthermore, the appearance features of small objects are easily influenced by the background. Most existing object-tracking methods cannot track objects successfully with such limited features in complex UAV scenes.
The second challenge usually occurs during drone flight. To capture objects continuously, the camera mounted on a UAV moves with the tracked object frequently. According to our observations, trackers usually lose the object after the UAV camera moves suddenly, which we call the target-forgetting problem.
In fact, small objects and camera motion are two common attributes in the aviation remote sensing field. For example, urban traffic surveillance is characterized by high spatial complexity [15]. To obtain more information about traffic scenes, UAVs usually fly at high altitudes, which leads to small object sizes. In addition, the shooting angles of the onboard cameras must be adjusted frequently; otherwise, the UAV will lose tracked objects that move very quickly.
Many research works have shown their effectiveness for UAV object tracking [16,17,18,19,20,21,22,23,24,25,26], and some of them have addressed the problems mentioned above. To represent small objects, Wang [18] employed locally adaptive regression kernel (LARK) features to encode the edge information of the objects. However, the edges of objects are sometimes fuzzy when the background is complex or the object is small and blurred. Li [17] proposed a geometric transformation based on background feature points to handle camera motion. However, this method did not utilize the deep features of CNN-based trackers. The problems caused by small objects and camera motion are summarized as difficult attributes in the UAV database proposed by Du [27].
Through experimental results and observation, we find that temporal information in sequences can enhance the features of small objects and overcome the target-forgetting problem. Therefore, to cope with the target-missing problems caused by the insufficient representation of small objects and sudden camera movement, we design a memory-aware attention mechanism that leverages the temporal memory in video clips. Many existing trackers [28,29,30] have already utilized temporal information to boost tracking performance. However, these trackers extract temporal information from high-level features of images and ignore the temporal information contained in low-level features, even though low-level features are useful for object tracking because they contain location and boundary information [9]. Thus, the proposed memory-aware attention mechanism utilizes the temporal information contained in low-level features to improve the feature representation ability of trackers and also encourages trackers to learn the pattern of camera motion in UAV object tracking. We embed the memory-aware attention mechanism into the discriminative model prediction (DiMP) tracker [7], resulting in a new framework named temporal memory-guided DiMP (TMDiMP).
In addition, there is a lack of benchmark datasets devoted to visual tracking with small objects and camera motion, which are ubiquitous in real-world scenes and are common causes of tracking failure. Thus, to study the problems caused by small objects and camera motion, we build a UAV object-tracking benchmark named VIPUOTB. We also define the average proportion of the target size to an image (APTS) and the average moving distance between adjacent frames during camera motion (AMDAF) to measure the normalized object size and the camera motion speed. The APTS of VIPUOTB is the smallest compared with generic object-tracking datasets [31,32,33] and other UAV object-tracking datasets [17,27,34,35,36]. Its AMDAF is also the largest among UAV object-tracking datasets. VIPUOTB is used to verify that our TMDiMP can cope with the tracking problems caused by small objects and camera motion.
In particular, the research scope of our study is aerial remote-sensing image processing based on advanced artificial intelligence technology. Specifically, we focus on designing a powerful UAV object-tracking method to mitigate the issues caused by small objects and camera motion. During our study, we also generate a new UAV object-tracking dataset that contains smaller objects and faster camera motion speed compared with other datasets. The main contributions of our work are summarized as follows.
We present a specially designed framework with end-to-end training capability, called TMDiMP, which embeds a novel memory-aware attention mechanism. Temporal memory is utilized to generate discriminative features of small objects and overcome the object-forgetting problem caused by camera motion, so that the tracker can obtain more accurate results in complex UAV scenes.
We build a UAV object-tracking dataset collected in urban scenes, named VIPUOTB, which contains various object categories and different attributes. All video sequences in our dataset are labeled manually by several experts to avoid subjective factors. Compared with other existing UAV datasets, VIPUOTB is different in terms of object size, camera motion speed, location distribution, etc.
The quantitative and subjective experimental results illustrate that our proposed TMDiMP achieves competitive performance compared with state-of-the-art methods on our VIPUOTB dataset and three public datasets: UAVDT, UAV123, and VisDrone.
The structure of this paper is as follows: work related to ours is reviewed in Section 2; a description of the proposed TMDiMP is given in Section 3; the generated dataset VIPUOTB is introduced in Section 4; the experimental results obtained by our method are presented in Section 5; and a discussion and conclusions are presented in Section 6 and Section 7, respectively.
3. Proposed Framework
In this section, we first introduce the memory-aware attention mechanism. Then, the overall TMDiMP framework is described, and the details of the network are outlined.
3.1. The Memory-Aware Attention Mechanism
The attention mechanism is a key component widely used in different fields, such as feature representation [49] and network architecture design [50], especially in the Transformer proposed by Vaswani et al. [51] for natural language processing. An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors [51].
The self-attention mechanism proposed in the Transformer can automatically focus on the interesting regions of an image, which contain more useful information for the task at hand. Inspired by self-attention, we propose a memory-aware attention mechanism to generate discriminative feature maps of small objects and overcome the object-forgetting problem in trackers.
The low-level features of targets contain location and boundary information, which is beneficial for object tracking [9]. In addition, extracting low-level features does not introduce much computation in the online tracking stage. Therefore, we first extract the low-level features $F^{l}_{t}$ and $F^{l}_{t-1}$ of the current frame $I_{t}$ and the previous frame $I_{t-1}$. Then, $F^{l}_{t}$ and $F^{l}_{t-1}$ are transformed into two feature spaces $h$ and $g$ to obtain the key $K$ and the query $Q$, respectively,

$$K = h(F^{l}_{t}), \quad Q = g(F^{l}_{t-1}). \quad (1)$$

A batchwise matrix multiplication and a softmax layer are then applied, resulting in a corresponding attention map $A$, which contains the temporal memory,

$$A = \mathrm{Softmax}(Q \otimes K^{\top}), \quad (2)$$

where $\otimes$ denotes batchwise matrix multiplication and $A$ indicates the attention of all points on the key $K$ for each point on the query $Q$.
In classic self-attention [51], the corresponding attention map $A$ is then multiplied by the value $V$ and added back to $V$ for direct feature enhancement. The value $V$ is usually obtained by transforming $F^{l}_{t}$ with another simple convolutional layer. The memory feature $F_{m}$, which is the output of the attention mechanism, can be calculated by Equation (3),

$$F_{m} = A \otimes V + V, \quad (3)$$

where $V = \theta(F^{l}_{t})$, and $\theta$ is a convolutional layer.
As described above, to obtain location and boundary information, we use the low-level features to compute the corresponding attention map $A$ in Equation (2). Meanwhile, discriminative trackers, which serve as our baseline, usually rely on semantic information to distinguish the target object from the background, and low-level features contain less semantic information than high-level features. Therefore, we use the high-level features as the value $V$, so that the memory-aware attention mechanism can merge $A$ and the high-level features in the attention processing.
Due to the different sizes of the low-level and high-level features, we need to downsample $A$ to the same size as $V$. Then, an enhancing layer is employed to merge $A$ and $V$. Thus, Equation (3) is modified as Equation (4) to calculate $F_{m}$. We have

$$F_{m} = \mathrm{Enhance}(\mathrm{Down}(A), V), \quad (4)$$

where $V = \Phi(I_{t})$. $\mathrm{Enhance}(\cdot)$, $\mathrm{Down}(\cdot)$, and $\Phi(\cdot)$ denote the enhancing layer, the downsampling layer, and the backbone, respectively.
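To make the data flow concrete, the following is a minimal PyTorch sketch of the temporal attention computation in Equations (1) and (2). The use of 1 × 1 convolutions for the transforms h and g and the channel sizes are illustrative assumptions; only the overall flow follows the description above.

```python
import torch
import torch.nn as nn


class TemporalAttentionMap(nn.Module):
    """Sketch of Equations (1) and (2): temporal attention from two low-level feature maps."""

    def __init__(self, low_channels):
        super().__init__()
        # 1x1 convolutions standing in for the feature-space transforms h and g (assumption).
        self.h = nn.Conv2d(low_channels, low_channels, kernel_size=1)  # key transform h
        self.g = nn.Conv2d(low_channels, low_channels, kernel_size=1)  # query transform g

    def forward(self, f_low_cur, f_low_prev):
        k = self.h(f_low_cur).flatten(2)    # B x C x N, key from the current frame
        q = self.g(f_low_prev).flatten(2)   # B x C x N, query from the previous frame
        # Equation (2): batchwise matrix multiplication followed by softmax.
        a = torch.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # B x N x N attention map
        return a
```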
3.2. Overall View of TMDiMP
As shown in Figure 1, our proposed TMDiMP comprises three components: a backbone, a memory-aware attention module, and two prediction branches. The workflow of our method is summarized in Algorithm 1.
Algorithm 1: The online tracking process of our TMDiMP.
In detail, the proposed TMDiMP takes a pair of adjacent frames as the backbone input. Here, we adopt ResNet50 pretrained on ImageNet as the backbone because ResNet50 can balance accuracy and speed during the tracking process.
First, the temporal information across adjacent frames is calculated to obtain the temporal attention map. Specifically, the current frame $I_{t}$ and the previous frame $I_{t-1}$ are first processed by the shallow layers of ResNet50 to obtain the low-level features $F^{l}_{t}$ and $F^{l}_{t-1}$. Then, $F^{l}_{t}$ and $F^{l}_{t-1}$ are fed into the carefully designed memory-aware attention module, where two convolutional layers, $h$ and $g$, are implemented to generate the key $K$ and the query $Q$, respectively. To avoid losing location and boundary information, the transform operation does not change the size of the low-level features. Then, the temporal attention map $A$ is calculated according to $K$ and $Q$.
Secondly, the memory features are generated by enhancing the high-level features of $I_{t}$ with $A$. As described above, $A$ is first merged with the high-level features to obtain semantic information. In TMDiMP, we extract the features of the current frame $I_{t}$ from two residual blocks of ResNet50 as the values $V_{1}$ and $V_{2}$, respectively. Then, two downsampling layers are utilized to downsample $A$ to the same sizes as $V_{1}$ and $V_{2}$; the downsampled attention maps are denoted by $A_{1}$ and $A_{2}$. We generate two values, $V_{1}$ and $V_{2}$, because the employed bounding box estimation branch can cope with multiscale features to obtain a more accurate target size. Then, we merge $A_{1}$ and $A_{2}$ with $V_{1}$ and $V_{2}$ by an adaptive method to obtain the memory features $F^{1}_{m}$ and $F^{2}_{m}$. We have
$$F^{1}_{m} = \mathrm{Enhance}(\mathrm{Cat}(A_{1}, V_{1})), \quad (5)$$

$$F^{2}_{m} = \mathrm{Enhance}(\mathrm{Cat}(A_{2}, V_{2})), \quad (6)$$

where $\mathrm{Cat}(\cdot)$ denotes the concatenate operation.
Finally, the tracking results are predicted by the two prediction branches according to the memory features. The bounding box estimation branch takes $F^{1}_{m}$ and $F^{2}_{m}$ to estimate the width and height of the target. The target classification branch takes $F^{2}_{m}$ to predict the score map of the current frame, which contains the location center of the target. The tracking result is determined by the size and location center of the target. For more details of the two branches, refer to DiMP [7].
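The multiscale merge described above can be sketched as follows. The reduction of the N × N attention map to a single-channel spatial map before resizing, the concatenation-plus-convolution form of the enhancing layer, and the channel sizes are all assumptions made for illustration; only the two-value structure feeding the prediction branches follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryFeatureMerge(nn.Module):
    """Hypothetical merge of the temporal attention with two high-level values V1 and V2."""

    def __init__(self, c1=512, c2=1024):
        super().__init__()
        # Enhancing layers realized as concat + 3x3 convolution (assumption).
        self.enhance1 = nn.Conv2d(c1 + 1, c1, kernel_size=3, padding=1)
        self.enhance2 = nn.Conv2d(c2 + 1, c2, kernel_size=3, padding=1)

    def forward(self, attn, low_hw, v1, v2):
        b = attn.shape[0]
        # Collapse the B x N x N attention to a single-channel spatial map before
        # resizing (one plausible realization of the downsampling step).
        attn_map = attn.mean(dim=1).view(b, 1, *low_hw)
        a1 = F.interpolate(attn_map, size=v1.shape[-2:], mode='bilinear', align_corners=False)
        a2 = F.interpolate(attn_map, size=v2.shape[-2:], mode='bilinear', align_corners=False)
        f_m1 = self.enhance1(torch.cat([a1, v1], dim=1))  # memory feature for the bbox branch
        f_m2 = self.enhance2(torch.cat([a2, v2], dim=1))  # also used by the classification branch
        return f_m1, f_m2
```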
The detailed architecture of our TMDiMP is shown in Table 2. The memory features are more discriminative and robust than the original image features, so the prediction branches can use them to estimate more accurate results than the baseline, as demonstrated by the experiments conducted.
4. The VIPUOTB Dataset
In this section, we describe our data collection and annotation, present various statistics in comparison with other public datasets, and showcase different aspects of our dataset.
4.1. Data Collection and Annotation
A challenging UAV tracking dataset called VIPUOTB, captured by cameras mounted on a DJI drone, is collected in urban scenes, where the road conditions are complex and the number of targets, especially small and similar targets, is large. These targets are also denser and closer to each other in urban scenes. More than three domain experts, students who have worked in this field for more than one year, annotated over 16,000 individual frames from 50 video sequences using the LabelImg software.
Figure 2 shows some frames of video sequences with annotated small objects. Due to the small size of the objects, we used object areas instead of whole images in the subjective comparison. An example of obtaining an object area is illustrated in Figure 3.
We find that the ground truth can be affected by the subjective factors of different experts. Taking Figure 4 as an example, the red bounding box and the green bounding box are given by two experts. However, the red box labels the whole person, whereas the green box labels only the main part of the body. To ensure consistency, we ask several domain experts to annotate each video clip and the same expert to annotate consecutive frames, so as to reduce the influence of subjective factors on small objects. The final annotated ground truth is obtained by the agreement $B = (c, w, h)$, where $c$, $w$, and $h$ are the agreed location center, width, and height of the object, respectively, combined from the experts' annotations.
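For illustration only, one simple way to realize such an agreement is to average the experts' boxes; whether VIPUOTB uses exactly this rule is an assumption, and the helper below is hypothetical.

```python
import numpy as np

# Hypothetical agreement rule: average the experts' annotations element-wise.
# Boxes are assumed to be given as (cx, cy, w, h).
def agree(expert_boxes):
    boxes = np.asarray(expert_boxes, dtype=float)  # shape (E, 4), one row per expert
    cx, cy, w, h = boxes.mean(axis=0)
    return cx, cy, w, h

print(agree([[100.0, 50.0, 12.0, 30.0], [102.0, 52.0, 10.0, 28.0]]))  # (101.0, 51.0, 11.0, 29.0)
```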
To ensure annotation quality, two Ph.D. students whose research area is object tracking randomly checked the annotation results on 300 samples extracted from different video clips and promptly revised any errors. We repeated this error-correction process three times.
Statistics on VIPUOTB and two well-known UAV datasets are summarized in Table 3. Four important objective criteria, namely the types of attributes, the categories of objects, the average proportion of the target size to an image (APTS), and the average moving distance between adjacent frames during camera motion (AMDAF), are used to measure the diversity of the datasets. The details of APTS and AMDAF are given in Section 4.2 and Section 4.3.
4.2. Normalized Object Size
We compare the APTS among these three datasets because the small size of objects is one of the important factors we focus on. APTS can be calculated according to Equation (7),

$$\mathrm{APTS} = \frac{1}{N} \sum_{n=1}^{N} \frac{w_{n} \times h_{n}}{W_{n} \times H_{n}}, \quad (7)$$

where $N$ is the total number of frames in a video sequence, $w_{n}$ and $h_{n}$ denote the width and height of the object in the $n$th frame, and $W_{n}$ and $H_{n}$ denote the width and height of the $n$th frame. We find that the average object sizes in UAVDT and UAV123 are approximately the same and are more than ten times larger than those in our database.
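A small NumPy helper mirroring Equation (7) may clarify the computation; the way the annotations are passed in (per-frame object width/height plus frame width/height) is an assumption about the storage format.

```python
import numpy as np

def apts(object_sizes, frame_sizes):
    """Average proportion of the target size to an image (APTS), as in Equation (7)."""
    object_sizes = np.asarray(object_sizes, dtype=float)  # shape (N, 2): w_n, h_n
    frame_sizes = np.asarray(frame_sizes, dtype=float)    # shape (N, 2): W_n, H_n
    ratios = (object_sizes[:, 0] * object_sizes[:, 1]) / (frame_sizes[:, 0] * frame_sizes[:, 1])
    return ratios.mean()

# e.g., a 30x15-pixel object in 1920x1080 frames occupies about 0.02% of the image.
print(apts([[30, 15], [32, 16]], [[1920, 1080], [1920, 1080]]))
```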
4.3. Fast Camera Motion
To compare the camera motion speed on different datasets, we manually select all frames in which camera motion occurs. The camera motion speed is reflected by the average moving distance of the target between adjacent frames. The AMDAF can be calculated according to Equation (8),

$$\mathrm{AMDAF} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{M_{k}-1} \sum_{m=2}^{M_{k}} \sqrt{(x_{m}-x_{m-1})^{2} + (y_{m}-y_{m-1})^{2}}, \quad (8)$$

where $K$ is the total number of video clips containing camera motion, $M_{k}$ is the number of frames in the $k$th clip, and $x_{m}$ and $y_{m}$ are the abscissa and ordinate of the target center in the $m$th frame. The AMDAF of VIPUOTB is the largest among the three datasets and is approximately three times those of UAVDT and UAV123.
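Similarly, a hypothetical helper for Equation (8), assuming each camera-motion clip is available as a list of target-center coordinates:

```python
import numpy as np

def amdaf(clips):
    """Average moving distance between adjacent frames during camera motion (AMDAF)."""
    per_clip = []
    for centers in clips:
        centers = np.asarray(centers, dtype=float)   # shape (M_k, 2): x_m, y_m
        steps = np.diff(centers, axis=0)             # displacements between adjacent frames
        per_clip.append(np.linalg.norm(steps, axis=1).mean())
    return float(np.mean(per_clip))

# Two toy clips: the center moves 5 px per frame in one and 13 px in the other.
print(amdaf([[[0, 0], [3, 4], [6, 8]], [[10, 10], [15, 22]]]))  # -> 9.0
```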
4.4. Attributes
A summary of the 14 tracking attributes present in our proposed VIPUOTB dataset is shown in Table 4. We define two new attributes, NSO and FCM, according to APTS and AMDAF. In addition, from observations of real road environments, we find two specific attributes, FS and MS, that seriously affect tracking performance. The distribution of these attributes over our dataset is shown in Figure 5: Figure 5a shows the number of sequences with different attributes, and Figure 5b shows the number of sequences with different numbers of attributes. From Figure 5a, we can observe that some attributes, such as SO and CM, occur more frequently. In particular, a large proportion of the videos contain small-object and camera-motion cases, which we mainly focus on in this paper, and among them, a considerable share of the sequences have no fewer than four challenge factors. We can also see from Figure 5b that a substantial share of the sequences contain no fewer than five challenge factors over the whole dataset.
Seven attributes, SO, CM, BC, ARC, IV, SV, and LO, are common to the three datasets VIPUOTB, UAVDT, and UAV123. Beyond these common attributes, our VIPUOTB contains NSO, FCM, FM, NI, FS, and MS, which do not exist in UAVDT. Compared with UAV123, the difference is that the attributes of out-of-view (OV) and viewpoint change (VC) are not considered in our VIPUOTB, although both have recently become hot research topics. In the future, we will add more video clips with OV and VC attributes to our dataset and improve the performance of our proposed method on these two attributes while maintaining its performance on the other attributes.
4.5. Categories
There are five object categories, pedestrian, car, bus, UAV, and bicycle, in our VIPUOTB dataset, all of which frequently appear in city scenes. The UAVDT dataset consists of cars, trucks, and buses. The bird, building, and wakeboard categories in UAV123 are not considered in our VIPUOTB, since they are not as important or common as the other categories for object tracking in city scenes.
4.6. Object Location
Figure 6 presents heatmaps of object locations obtained by superimposing all the binary maps consisting of the object ground truth and the background. The lighter the color in a heatmap, the more frequently objects appear at that location. Figure 6a–c are the location maps of all the objects in UAV123, UAVDT, and VIPUOTB, respectively. We can observe from Figure 6 that the objects in UAVDT and UAV123 are mainly located in the center of the image. This centralized distribution causes deep neural networks to learn a strong center bias as prior knowledge in the training stage [13,52]. In contrast to the location distributions of UAVDT and UAV123, the objects in VIPUOTB are evenly distributed over different locations. These observations show that our dataset has higher location diversity than the two other datasets.
5. Experiments
In this section, we perform an extensive evaluation of our proposed TMDiMP through several experiments. First, we discuss the important parameter settings of the network. Secondly, an attribute-based evaluation is given to verify the effectiveness of TMDiMP in different situations. Thirdly, the performance of our proposed method on four UAV tracking benchmarks is compared with that of state-of-the-art methods, including MDNet [9], STRCF [39], ECO-HC and ECO [40], VITAL [41], ATOM [42], KYS [28], DiMP [7], PrDiMP [8], AUTO [22], TrDiMP [29], and STM [30]. Fourth, we verify the effectiveness of the memory-aware attention module utilized in our method. All the methods were implemented in PyTorch on a PC with an Intel i7-11700K CPU and an RTX 3090 GPU.
The tracking performance was measured by the precision, normalized precision, and success of one-pass evaluation (OPE) [53]. The precision is computed using the center location error (CLE) between the estimated location and the ground truth. Because the precision metric is sensitive to target size and image resolution, the normalized precision metric was proposed in [32] to address this problem. Trackers are ranked by the precision and normalized precision metrics using a threshold of 20 pixels and the area under the curve (AUC) between 0 and 0.5, respectively. The success is computed using the intersection over union (IoU) between the estimated bounding box and the ground truth, and trackers are ranked by the success metric using the AUC between 0 and 1. The complete code and our dataset will be released upon publication.
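For reference, the precision and success metrics described above can be sketched as follows; the (x, y, w, h) box convention and the 20-pixel threshold are the usual choices and are assumed here rather than taken from the released evaluation toolkit. The normalized precision additionally scales the center error by the ground-truth box size and is omitted from this sketch.

```python
import numpy as np

def center_errors(pred, gt):
    """Center location error (CLE) per frame for (x, y, w, h) boxes."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    pc = pred[:, :2] + pred[:, 2:] / 2
    gc = gt[:, :2] + gt[:, 2:] / 2
    return np.linalg.norm(pc - gc, axis=1)

def iou(pred, gt):
    """Intersection over union per frame for (x, y, w, h) boxes."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def precision_at(pred, gt, threshold=20.0):
    """Fraction of frames whose CLE is within the given pixel threshold."""
    return float((center_errors(pred, gt) <= threshold).mean())

def success_auc(pred, gt, steps=101):
    """Area under the success curve: mean success rate over IoU thresholds in [0, 1]."""
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0.0, 1.0, steps)
    return float(np.mean([(overlaps >= t).mean() for t in thresholds]))
```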
5.1. Parameter Settings
5.1.1. The Level of Features
As we know, low-level features contain more location and boundary information, and feature extraction in the low-level stage requires fewer computational resources. Therefore, we use features obtained from a shallow layer of ResNet50 to calculate the attention map from $Q$ and $K$. We conduct an experiment to determine which of the two candidate layers of ResNet50 yields more representative features. The comparison results are recorded in Table 5. The tracking performance obtained by using attention maps calculated from the shallower layer is better on all four metrics, including precision, normalized precision, success, and FPS. In addition, the tested FPS values differ for two reasons. First, the data preprocessing time differs because of the different image sizes in the four datasets. Secondly, if targets are missed, discriminative online-learning trackers redetect them, which increases the time consumption of online tracking.
5.1.2. The Number of Frames Applied in Memory-Aware Attention Module
The number of frames taken by the memory-aware attention module needs to be determined manually in our method. A single frame provides insufficient information; in contrast, too many frames introduce additional valueless information and noise. We use memory features accumulated over different numbers of frames to estimate the tracking results, which are recorded in Table 6. Note that the number of frames does not affect the time consumption of the algorithm because only low-level features are extracted to calculate the attention map. We can see that the values of precision, normalized precision, and success decrease as the number of frames used increases. Two consecutive frames retain the most effective temporal memory; therefore, we choose adjacent frames as the input of our memory-aware attention module.
5.2. Attribute-Based Evaluation
To further explore the effectiveness of the TMDiMP tracker in different situations, we also evaluate the trackers on all attributes in Figure 7. We find that almost all trackers cannot achieve the same performance on NSO and FCM as on SO and CM, which indicates that NSO and FCM are more challenging than SO and CM. As shown in Figure 7a–d, our TMDiMP shows the best performance on the SO and NSO attributes and the second-best performance on CM and FCM, which we mainly focus on in this paper. The success values of DiMP are 3.9%, 1.3%, 2.7%, and 0.5% lower than those of TMDiMP on these four challenging attributes.
In general, our method can obtain competitive results on most attributes. However, the performance is unsatisfactory when TMDiMP addresses the attributes of night and object blur. Through analysis, we find that video clips with object blur are captured mostly at night. Unfortunately, our method cannot extract discriminative features in a dark environment. We will pay more attention to these two attributes in the future.
5.3. State-of-the-Art Comparisons
The performance of our proposed model is evaluated both objectively and subjectively.
5.3.1. VIPUOTB
The overall performance of all tracking methods on VIPUOTB is reported by the precision, normalized precision, and success plots of OPE, as shown in Figure 8. TMDiMP outperforms all state-of-the-art methods on the precision, normalized precision, and success metrics. Our method improves over the baseline DiMP in the precision, normalized precision, and success of OPE by 4.8%, 2.3%, and 2.9%, respectively.
5.3.2. UAVDT
Figure 9 illustrates the precision, normalized precision, and success plots of all competitors on UAVDT, and the performance score of each tracker is given in the legend of the figure. The proposed TMDiMP performs favorably, with a precision of 83.3%, a normalized precision of 72.8%, and a success of 62.9%. Compared with the baseline DiMP, our method improves the precision, normalized precision, and success of OPE by 4.6%, 5.5%, and 5.0%, respectively. However, TrDiMP achieves the best precision, and STM achieves the best normalized precision of 76.7% and the best success among all the methods, both higher than those of our proposed model. Through observation and analysis of the results, we find that our tracker usually fails to track objects that are occluded for a long time; in this situation, similar objects are usually selected by TMDiMP, which is illustrated in Section 6.
5.3.3. UAV123
We also compare all competitors on UAV123 and illustrate the precision, normalized precision, and success plots in Figure 10. The highest precision of 87.6%, normalized precision of 83.0%, and success of 66.8% are achieved by PrDiMP. Our model ranks second on all three metrics, with approximately the same performance as the baseline DiMP. Similar to UAVDT, the failures are caused by long-term occluded targets; a failure example is given in Section 6.
5.3.4. VisDrone
The performance of each tracker on the VisDrone dataset is exhibited in Figure 11. Our tracker TMDiMP achieves the best performance, with a precision of 85.0%, a normalized precision of 79.5%, and a success of 64.3%. It surpasses ECO and STM by 1.8%, 0.7%, and 1.0% in the precision, normalized precision, and success plots, respectively. In addition, our tracker has relative gains of 4.2% in precision, 3.7% in normalized precision, and 3.0% in success over DiMP.
5.3.5. Average CLE and IoU
Table 7 reports the average CLE and IoU of all compared trackers on the four benchmarks. It shows that only TMDiMP achieves at least a top-three result on every dataset.
5.3.6. Subjective Comparison
To provide a more intuitive exhibition, the subjective assessment results of the top seven trackers (TMDiMP, TrDiMP, STM, PrDiMP, DiMP, ATOM, and ECO) on a challenging sequence in our VIPUOTB dataset are illustrated in Figure 12.
This sequence exhibits three challenging factors: a tiny bicycle, camera motion, and severe occlusion by trees. Because the proportion of target pixels to the total pixels of an image is too small, trackers are easily disturbed by the cluttered background. We can observe from the 40th frame that the pedestrian begins to cycle through the trees. The PrDiMP misses this bicycle in the 56th frame after the target passes through trees. At the 193rd frame, the camera mounted on the UAV moves suddenly to capture the target, and all other trackers tend to drift away from the bicycle temporarily. Our TMDiMP finds the object again when the camera is stable, which can be found in the 203rd frame. This experiment illustrates that the proposed memory-aware attention mechanism can encourage our tracker to learn the pattern of camera motion.
We also compare the subjective results of the top seven trackers (TMDiMP, TrDiMP, STM, VITAL, KYS, ATOM, and ECO) on a sequence from the public VisDrone dataset, as shown in Figure 13. The tracked target is a moving pedestrian. At the beginning of the sequence, the pedestrian walks behind a flag, and all trackers essentially lose the target when it is occluded. Only our TMDiMP retracks the pedestrian.
5.3.7. Score Map Visualization
As introduced in Section 3, our TMDiMP and the baseline DiMP distinguish the target from the background by the target classification scores, which determine the center location of the target. Figure 14 visualizes the tracking results and their corresponding score maps generated by DiMP and our TMDiMP. The yellow and red rectangles denote the ground-truth bounding boxes and the tracking results, respectively. The first column of Figure 14 gives the tracking results from the 76th to the 78th frame of a sequence, from which we can see that the DiMP tracker loses the target in the 77th frame. The corresponding score map of the 77th frame in the second column of Figure 14 exhibits two highlighted areas, and the lighter area is marked with a purple rectangle. In contrast, the fourth column shows a consistent highlighted area across the three consecutive frames, and our model tracks the target object without being affected by the similar object in its surroundings, which indicates that the temporal memory can enhance the features of small objects.
5.4. Ablation Study
In this paper, we present a novel memory-aware attention mechanism inspired by classic self-attention [51]. To verify the effectiveness of our design, we replace the memory-aware attention module in Figure 1 with a classic self-attention module and test the modified framework on the different datasets. In the classic self-attention module, the value $V$ is obtained by processing the low-level features $F^{l}_{t}$ of the current frame $I_{t}$ with a simple convolutional layer. Then, the enhanced low-level features are calculated by adding the product of the value $V$ and the temporal attention map $A$ back to $V$. Finally, the subsequent residual blocks of the backbone take the enhanced low-level features to generate the high-level features. Different from the classic attention module, our proposed framework first extracts the high-level features of the current frame $I_{t}$ as the values $V$. Then, the well-designed memory-aware attention module calculates the memory features by enhancing the values $V$ with the temporal attention map $A$. The tracking results predicted using our memory features and those predicted using the features from the classic self-attention variant are recorded in Table 8, which suggests that our TMDiMP can utilize high-level features to generate the more discriminative features mentioned in Section 3.
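For comparison with the sketch in Section 3.1, the ablation's classic self-attention variant could be realized roughly as follows, enhancing the low-level features directly as in Equation (3); the 1 × 1 value transform is an illustrative assumption.

```python
import torch
import torch.nn as nn


class ClassicSelfAttentionEnhance(nn.Module):
    """Sketch of the ablation variant: enhance the low-level features directly (Equation (3))."""

    def __init__(self, low_channels):
        super().__init__()
        self.theta = nn.Conv2d(low_channels, low_channels, kernel_size=1)  # value transform

    def forward(self, f_low_cur, attn):
        # attn: B x N x N temporal attention map from Equation (2), with N = H * W.
        b, c, h, w = f_low_cur.shape
        v_map = self.theta(f_low_cur)                   # value V
        v = v_map.flatten(2)                            # B x C x N
        enhanced = torch.bmm(v, attn.transpose(1, 2))   # attention applied to V
        return enhanced.view(b, c, h, w) + v_map        # A (x) V + V, fed to the remaining backbone
```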
5.5. Implementation Details
The adjacent frames are resized to 288 × 288. The baseline DiMP [7] and the previous works PrDiMP [8] and TrDiMP [29] use the training splits of LaSOT [31], TrackingNet [32], GOT-10k [33], and COCO [54] for offline training. However, COCO contains only single images, so we use ImageNetVid [55] instead of COCO in this work. Our framework is trained for 50 epochs with 2600 iterations per epoch and 10 image pairs per batch. The Adam optimizer [56] is employed with an initial learning rate of 0.01 and a decay factor of 0.2 every 15 epochs.
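A hypothetical training configuration mirroring these hyperparameters is shown below; the model itself and the data pipeline are placeholders and not specified here.

```python
import torch

# `model` is a placeholder for the TMDiMP network; only the optimizer and the
# learning-rate schedule stated above are reflected here.
def build_optimizer_and_scheduler(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # initial learning rate 0.01
    # Decay the learning rate by a factor of 0.2 every 15 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.2)
    return optimizer, scheduler

EPOCHS, ITERS_PER_EPOCH, PAIRS_PER_BATCH = 50, 2600, 10
```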
6. Discussion
In this section, we present failure cases of our TMDiMP in Figure 15, which shows three challenging video sequences from UAVDT (first row), UAV123 (second row), and VIPUOTB (third row). The ground truth and the tracking results of our method are marked with blue and red rectangles, respectively. In the first row of Figure 15, the target in UAVDT is occluded by a similar object at the beginning of the video, and TMDiMP tracks the similar object when the two cars separate completely in the 224th frame. Similarly, our method tracks another person when the target person is occluded by a tent in UAV123. The third row of Figure 15 shows one of the most challenging video sequences in our dataset, which contains serious object blur and occlusion. We can see from the eighth frame that the target car's appearance is blurred when it passes through the trees. Our TMDiMP misses the target in all the following frames of this video. In fact, all the state-of-the-art methods we tested fail to track the object in this situation.
We can conclude from these three failure cases that our tracker cannot handle the long-term disappearance of targets during tracking. This problem may be addressed by improving the online training performance of the tracker. When targets disappear, the online training mechanism can guide trackers to continuously search for lost targets globally according to their historical appearances. However, it is challenging to find representative historical appearances because trackers sometimes assign high confidence scores to negative samples. In addition, the computational resources onboard a UAV are limited, so the online training mechanism may affect the real-time performance of trackers. We will pay more attention to designing an efficient online training mechanism in our future work.
7. Conclusions
In this study, we focus on mitigating UAV object-tracking problems caused by small objects and camera motion by using advanced artificial intelligence technology. To this end, a novel tracker called TMDiMP is proposed. TMDiMP is a discriminative tracker with end-to-end training ability that utilizes a carefully designed memory-aware attention mechanism to generate more discriminative features of small objects and overcome the object-forgetting problem caused by camera motion. We also build a UAV object-tracking dataset, VIPUOTB, which differs from existing datasets in terms of object size, camera motion speed, location distribution, etc. Compared with other UAV object-tracking datasets, our VIPUOTB contains the smallest objects and the fastest camera motion. Various experiments, including parameter settings, attribute-based evaluation, objective comparison, subjective comparison, and an ablation study, are conducted to verify the effectiveness of our proposed method. Through the analysis of the experimental results, we conclude that our TMDiMP can achieve better performance on our VIPUOTB dataset and three public datasets, UAV123, UAVDT, and VisDrone, compared with state-of-the-art methods.
The failure cases show that our tracker misses targets when they disappear for a long time. In the future, we will pay more attention to data with multiple challenging attributes, such as long-term object blur and occlusion. We will solve these problems by improving the online training performance of the tracker. In addition, we will expand our generated dataset constantly by adding more video sequences and attributes, such as out-of-view objects.