(1) Challenge analysis of the OTB dataset

This part reports the success rate plots on the OTB-50 dataset, which consists of 50 videos with relatively high tracking complexity drawn from OTB-100, under eleven challenge attributes: scale variation (SV), low resolution (LR), occlusion (OC), deformation (DF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC), and illumination variation (IV).

More details of the performance of the proposed algorithm are shown in Figure 8. Overall, the proposed algorithm performs well on all 11 challenges. For the motion blur, deformation, and low resolution attributes, the proposed algorithm outperforms SiamFC, which is also based on a Siamese network. SiamRPN combines a Siamese network with a region proposal network and achieves good tracking precision and speed, but the algorithm proposed in this paper performs better under the background clutter challenge, indicating that it can extract the key features of the target. For the out-of-view attribute, the proposed algorithm performs much better than the other nine compared trackers, which is attributed to the proposed lightweight target-aware attention learning network and the attention learning loss function. In addition, the proposed algorithm performs better than most neural network-based trackers under the background clutter challenge, which indicates that the proposed lightweight target-aware attention learning network and attention learning loss function can effectively modify the pre-trained deep features to remove redundant information while enhancing the feature channels that are most important to the target representation, thereby improving the feature representation of the target. Overall, the proposed algorithm achieves good performance under several challenging attributes of the OTB-50 dataset.

(2) Qualitative experimental analysis of the OTB dataset

To qualitatively evaluate the proposed method, Figure 9 shows tracking results of the proposed algorithm and other trackers on six challenging video sequences. Among the compared trackers, SiamRPN is a deep learning-based algorithm that also introduces a region proposal network into tracking, CF2 is a correlation filter-based algorithm, and SiamFC is a Siamese network-based algorithm similar to the one proposed in this paper.

These six video sequences contain many different challenges, including deformation (Bird1, MotorRolling, Skiing), occlusion (Soccer, Tiger), out-of-view (Bird1, Soccer), and background clutter (Football1, MotorRolling). SiamFC and the proposed algorithm can re-find the target after the occlusion ends, while the other trackers are unable to locate the target again because untrustworthy samples are introduced during their model updates. CF2 and CREST drift rapidly in scenes where the target leaves the field of view, and SiamFC and CF2 cannot adapt to the scale changes in the Bird1 and MotorRolling sequences. As tracking progresses, CREST, CF2, and SiamFC lose the target one by one as their tracking drifts. In contrast, the algorithm proposed in this paper adapts well to these challenges thanks to the introduction of a lightweight target-aware attention learning network and an attention learning loss function that learn the channel weight information of the target. In the scenarios shown in Figure 9, the performance of the proposed algorithm is significantly better than that of the other trackers.

**Figure 8.** Comparison of 11 attribute challenge results.

**Figure 9.** Visualization of tracking results for focused challenge scenarios.

#### *4.3. TC-128 Dataset Experiments*

In this paper, the proposed method is evaluated on the Temple-Color (TC-128) dataset, which contains 128 videos. The evaluation follows the OTB protocol and uses precision and success plots under one-pass evaluation (OPE) to compare the different trackers.
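For clarity, the sketch below illustrates how the OPE precision and success curves are conventionally computed on OTB-style benchmarks; it is a minimal illustration with placeholder names, not the evaluation toolkit used in the experiments. Boxes are assumed to be axis-aligned in (x, y, w, h) format; the reported precision score is usually the value at the 20-pixel threshold and the success score is the area under the success curve.

```python
# Minimal sketch of OPE success/precision curves (illustrative, not the official toolkit).
import numpy as np

def iou(pred, gt):
    """Per-frame intersection-over-union for (x, y, w, h) boxes, shape (N, 4)."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def success_curve(pred, gt, thresholds=np.linspace(0, 1, 21)):
    """Fraction of frames whose overlap exceeds each threshold; its mean is the AUC score."""
    overlaps = iou(pred, gt)
    return np.array([(overlaps > t).mean() for t in thresholds])

def precision_curve(pred, gt, thresholds=np.arange(0, 51)):
    """Fraction of frames whose center location error is within each pixel threshold."""
    pred_c = pred[:, :2] + pred[:, 2:] / 2.0
    gt_c = gt[:, :2] + gt[:, 2:] / 2.0
    err = np.linalg.norm(pred_c - gt_c, axis=1)
    return np.array([(err <= t).mean() for t in thresholds])
```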

(1) Quantitative evaluation on TC-128 dataset: The proposed algorithm is compared quantitatively with 10 other trackers, including ECO [33], CREST [28], HCFTstar [34], CF2 [27], CACF [35], KCF [23], DSST [25], LOT [36], and CSK [37].

As shown in Figure 10, the proposed algorithm ranks in the top two among all trackers in both precision and success rate. Compared with the deep learning-based CF2 algorithm, the proposed algorithm achieves a success rate that is 5.0% higher on TC-128, probably because CF2 uses unprocessed pre-trained deep features, while the proposed algorithm learns the most effective target channel weights through the designed lightweight target-aware attention learning network, so that the features better represent the appearance of the target. Moreover, the success rate of the proposed algorithm on TC-128 is 1.2% higher than that of CREST, which learns linear regression on a single convolutional layer. It can also be seen that CREST, which uses only one layer of deep features for target modeling, outperforms CF2, which uses multiple layers of deep features; this illustrates the advantage of linear regression modeling on the network. The tracking robustness of the proposed algorithm is also greater than that of CACF, which introduces contextual information. The figure further shows that trackers that model the target with hand-crafted features, such as KCF, perform significantly worse than trackers that use deep features. The ECO algorithm combines color features and deep features to represent the target and is sensitive to the target's color features, so its performance on the TC-128 dataset, which is designed around color, is better than that of the algorithm proposed in this paper.

(2) Challenge analysis on the TC-128 dataset: In this section, the success rates of the trackers related to this work are evaluated on the TC-128 dataset under 11 challenge attributes, including scale variation (SV), low resolution (LR), occlusion (OC), deformation (DF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC), and illumination variation (IV).

**Figure 10.** Success and precision rates on the TC-128 dataset.

Figure 11 shows the results of the proposed algorithm and other state-of-the-art trackers under the 11 attribute challenges, and it is clear that the proposed algorithm outperforms the other trackers in overall performance. Thanks to the channel weight learning of the lightweight target-aware attention learning network, the proposed algorithm outperforms the other trackers in the cases of background clutter, motion blur, and deformation. ECO outperforms the proposed algorithm in the deformation challenge due to its use of multi-feature fusion, but the proposed algorithm outperforms the other trackers in several challenge scenarios involving background clutter, motion blur, and out-of-view. In these scenarios, the targets often undergo severe appearance changes or complex background disturbances; the compared trackers update their models with samples that may contain noise, which prevents them from obtaining an accurate model of the target appearance and leads to tracking failures. In contrast, this work introduces the lightweight target-aware attention learning network to improve the modeling capability of the deep features, allowing the tracker to adapt to tracking tasks in complex scenes.

#### *4.4. UAV123 Dataset Experiment*

To further illustrate the performance of the proposed algorithm, it is evaluated on the UAV123 dataset in this paper. Compared with typical visual object tracking datasets such as OTB and TC-128, the UAV123 dataset provides low-altitude aerial videos for target tracking. UAV123 is also one of the largest target tracking datasets, containing 123 video sequences with over 110,000 frames and an average sequence length of 915 frames. The UAV123 dataset has become increasingly popular due to real-life applications such as navigation, wildlife monitoring, and crowd surveillance. An algorithm that strikes a good balance between accuracy and real-time speed is more practical for tracking such targets.

As shown in Figure 12, the proposed algorithm is tested on the UAV123 dataset and compared with 10 other trackers, including SRDCF [26], CREST [28], CF2 [27], SiamRPN [10], DSST [25], Struck [38], ECO [33], TADT [39], KCF [23], and CSK [37]. Thanks to the lightweight target-aware attention learning network introduced into the Siamese network framework, the proposed algorithm outperforms the TADT algorithm in both precision and success rate. Moreover, the success rate of the proposed algorithm on UAV123 is 5.8% higher than that of CREST, which learns linear regression on a single convolutional layer. The CREST algorithm, using only one layer of deep features, outperforms CF2 and SRDCF, which use multiple layers of deep features. Trackers using hand-crafted features, such as DSST and KCF, perform significantly worse than trackers using deep features.

**Figure 11.** Comparison of 11 attribute challenge results.

**Figure 12.** Success and precision rates on the UAV123 dataset.

#### *4.5. VOT2016 Dataset Experiment*

The VOT dataset is a very popular dataset in the field of target tracking. It uses two metrics, accuracy and robustness, to evaluate tracker performance, and the expected average overlap (EAO) to rank the trackers. In this paper, the proposed algorithm is compared with other trackers on the VOT2016 dataset; the compared trackers include SiamRPN++ [40], SiamRPN [10], TADT [39], DeepSRDCF [41], MDNet [42], SRDCF [26], HCF [27], DAT [43], and KCF [23]. The results of these trackers are obtained from the official results, and Figure 13 shows the ranking of all trackers.
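For reference, the sketch below illustrates the two raw VOT measures mentioned above: accuracy is the mean overlap over frames in which the tracker has not failed, and robustness counts failures (frames where the overlap drops to zero and the tracker is re-initialized). This is an illustrative simplification, not the VOT toolkit; the burn-in length is an assumed convention, and the EAO computation, which aggregates per-sequence overlap curves, is not reproduced here.

```python
# Hedged sketch of VOT-style accuracy and robustness for one sequence (not the official toolkit).
import numpy as np

def vot_accuracy_robustness(overlaps, burnin=10):
    """overlaps: per-frame IoU with the ground truth for one sequence.
    Frames within `burnin` frames after a failure/re-initialization are ignored
    when computing accuracy (burn-in value assumed here)."""
    overlaps = np.asarray(overlaps, dtype=float)
    failures = int((overlaps == 0).sum())            # robustness: number of failures
    valid = np.ones_like(overlaps, dtype=bool)
    for i in np.where(overlaps == 0)[0]:             # mask the failure frame and its burn-in window
        valid[i:i + burnin] = False
    accuracy = overlaps[valid].mean() if valid.any() else 0.0
    return accuracy, failures
```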

As can be seen from Figure 13, thanks to the proposed lightweight target-aware attention learning network and the weight learning approach of the attention learning loss function, the proposed algorithm ranks third among all the compared trackers and performs better than the TADT algorithm, which uses a regression loss function and a scale loss function for feature filtering. The performance of the proposed algorithm is weaker than that of the SiamRPN and SiamRPN++ trackers. SiamRPN introduces a region proposal network to provide accurate proposal regions and a classification-regression mechanism to determine the target location and obtain a more accurate target scale through regression. SiamRPN++, in turn, builds on SiamRPN by introducing a deeper neural network to extract target features, so it performs far ahead of the other trackers, which also shows that deeper neural networks provide more powerful feature representation.

Table 2 shows more detailed information comparing all the trackers, including the expected average overlap (EAO), overlap (Overlap), and failures (Failures); the top three results for each metric are marked in red, green, and blue, respectively. As can be seen from the table, the proposed algorithm performs well overall on all three metrics, which reflects the ability of the proposed attention learning loss function and lightweight target-aware attention learning network to learn reliable target features. The last column of the table shows the number of tracking failures, where the proposed algorithm ranks fourth, not far behind the second-place SiamRPN and the third-place TADT, so there is still room for improvement.

**Figure 13.** EAO score ranking of the compared trackers on the VOT2016 dataset.

**Table 2.** Overall performance on the VOT2016 dataset; the top three trackers are marked in red, green, and blue, respectively.


#### *4.6. LaSOT Dataset Experiment*

To further demonstrate the effectiveness of our method, the performance of the proposed algorithm is evaluated on the LaSOT dataset in this work. Compared with the above tracking datasets, the LaSOT dataset has a larger scale and poses more complex challenges to the tracker during tracking. LaSOT considers the connection between visual appearance and natural language, providing not only bounding box labels but also rich natural language descriptions. It contains 1400 video sequences with an average sequence length of 2500 frames, and the test set contains 280 video sequences, with 4 videos per category.

As shown in Figure 14, our method achieves third place in both precision and success rate. Compared with tracking algorithms based on correlation filters, our method also obtains good performance. However, the performance of our method is not competitive enough against the state-of-the-art tracking methods on the LaSOT dataset. The reason for this is that our algorithm cannot handle the challenge of target disappearance and reappearance during long-term tracking.

**Figure 14.** Success and precision plots of OPE on the LaSOT dataset.

#### *4.7. Discussions*

Siamese network trackers based on pre-trained deep features have achieved good performance in recent years. The pre-trained deep features are trained in advance on large-scale datasets and therefore contain feature information for a large number of object classes. However, for a given tracking video the tracked object is always the same, so the pre-trained features contain redundant information. To remove redundant and interfering information from the pre-trained features and learn more accurate target information, this work presents a novel tracking method built on the proposed lightweight target-aware attention learning network. This network uses the reliable information provided by the ground-truth annotation of the target in the first frame of each video to train its weights online; it obtains gradient information by backpropagation to determine the contribution of the different feature channels in the target feature layer, and re-models the channels of the template feature by weighting them according to this contribution. A compact and effective deep feature is thus obtained, which better distinguishes the object from the background. The network is a single-convolutional-layer network that is relatively easy to implement and, compared with complex convolutional neural networks, has far fewer parameters. A limitation worth addressing is that although our method can refine the target features, it cannot handle tracking failure, so its performance is constrained by the target disappearance and reappearance challenge in long-term tracking.
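To make the above description concrete, the following is a minimal sketch of this idea as we read it, not the authors' released code: a single-convolution regression head is fitted online to the first frame with Adam, and the gradients back-propagated to the pre-trained template feature are pooled per channel to score each channel's contribution, which is then used to re-weight the template. The names `feat`, `label`, and the hyper-parameters are placeholders.

```python
# Illustrative sketch of gradient-based channel re-weighting under the stated assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def learn_channel_weights(feat, label, iters=50, lr=1e-2):
    """feat: (1, C, H, W) pre-trained feature of the first (template) frame.
    label: (1, 1, H, W) regression target, e.g. a Gaussian centred on the
    ground-truth box given in frame 1 (assumed construction)."""
    c = feat.shape[1]
    head = nn.Conv2d(c, 1, kernel_size=1, bias=True)       # lightweight single-layer head
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(iters):                                  # online training on frame 1 only
        opt.zero_grad()
        loss = F.mse_loss(head(feat), label)
        loss.backward()
        opt.step()

    # Back-propagate once more, now through the input feature, and pool the
    # gradients per channel to obtain importance weights.
    feat_req = feat.clone().detach().requires_grad_(True)
    F.mse_loss(head(feat_req), label).backward()
    weights = feat_req.grad.mean(dim=(2, 3)).abs().squeeze(0)   # (C,) channel contributions
    return weights / (weights.sum() + 1e-12)

def reweight_template(feat, weights):
    """Re-model the template feature by scaling each channel by its importance."""
    return feat * weights.view(1, -1, 1, 1)
```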

#### **5. Conclusions**

In this paper, a novel Siamese network-based target tracking method is proposed to address the problem that different feature channels often have different importance for the target representation. The method enhances the feature representation of the tracked target by designing a lightweight target-aware attention learning network and a redesigned attention learning loss function, and learns the most effective feature channel weights for the target with the Adam optimization method. The lightweight target-aware attention learning network uses reliable information from the first frame of each video sequence to train its weights online, obtains gradient information by backpropagation to determine the contribution of the different feature channels in the target feature layer to modeling the target, and re-models the target by weighting the channels of the template features according to this contribution. The network is relatively easy to implement, and its small number of parameters facilitates fast computation. Finally, the proposed algorithm is evaluated on the OTB, TC-128, UAV123, VOT2016, and LaSOT datasets, and both quantitative and qualitative analyses show that the method achieves satisfactory performance, demonstrating the effectiveness of the proposed lightweight target-aware attention learning network and attention learning loss function in a Siamese network framework-based tracker.
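For completeness, the sketch below shows how a channel-re-weighted template feature is typically used within a Siamese matching step: the weighted template acts as a correlation kernel over the search-region feature, and the peak of the response map gives the predicted target position. This is an assumed, generic Siamese formulation with placeholder names; the exact pipeline of the proposed tracker may differ.

```python
# Generic Siamese cross-correlation with a channel-re-weighted template (illustrative sketch).
import torch
import torch.nn.functional as F

def siamese_response(template_feat, search_feat, weights):
    """template_feat: (1, C, h, w) template feature; weights: (C,) channel importance;
    search_feat: (1, C, H, W) feature of the current search region."""
    kernel = template_feat * weights.view(1, -1, 1, 1)   # channel-wise re-weighting
    return F.conv2d(search_feat, kernel)                 # cross-correlation response map

def locate(response):
    """Return the (row, col) of the response peak as the predicted target centre."""
    _, _, _, w = response.shape
    idx = torch.argmax(response.view(-1)).item()
    return divmod(idx, w)
```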

**Author Contributions:** Conceptualization: Y.Z., J.Z., R.D. and F.L.; methodology: Y.Z., J.Z., R.D. and F.L.; software: J.Z., R.D., H.Z.; validation: J.Z., R.D., H.Z.; analysis: Y.Z., R.D. and F.L.; investigation: H.Z.; resources: Y.Z., F.L.; writing—original draft preparation: J.Z., R.D.; writing—review and editing: Y.Z., J.Z., R.D.; visualization: R.D., H.Z.; supervision: H.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Natural Science Foundation of China (61873246, 62072416, 6167241, 61702462), Program for Science & Technology Innovation Talents in Universities of Henan Province (21HASTIT028), Natural Science Foundation of Henan (202300410495), Zhongyuan Science and Technology Innovation Leadership Program (214200510026).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** No conflict of interest exists in the submission of this manuscript, and the manuscript is approved by all authors for publication.
