**2. Related Work**

Many researchers have contributed to the field of visual tracking, and many excellent trackers have been proposed. In this section, we discuss the trackers most closely related to our work.

#### *2.1. Lightweight Network-Based Tracker*

Real-time target tracking is a highly active research topic. However, increasing tracking speed inevitably affects tracking accuracy, so many researchers have studied how to increase tracking speed without sacrificing accuracy. Zhao et al. [6] use a pruned convolutional neural network to construct the tracker, which is trained by a mutual learning method to further improve localization accuracy. Cheng et al. [7] propose a real-time semantic segmentation method based on extended convolution smoothing and lightweight up-sampling on top of a lightweight network, which achieves high segmentation accuracy while maintaining high-speed real-time performance. Zhao et al. [8] design a lightweight memory network that only needs reliable target frame information to fine-tune network parameters online, so as to enhance the memory of the target appearance; at the same time, it maintains good discriminative performance without a complicated update strategy. Unlike these methods, this paper designs a lightweight network for online learning of the most salient features of the target and achieves redundant feature channel trimming by back-propagating the weights to determine the importance of the feature channels.

#### *2.2. Siamese Network-Based Tracker*

In recent years, the combination of Siamese networks and target tracking has brought target tracking into a new stage. Bertinetto et al. [9] propose a new fully convolutional Siamese network structure. In an initial offline phase, the deep convolutional network is trained to solve a general similarity learning problem; simple online evaluation of this function during tracking then achieves very competitive performance, and the frame rate at runtime far exceeds real-time requirements. Li et al. [10] develop a model consisting of a Siamese network and a region proposal network, which discards traditional multi-scale testing and online fine-tuning, divides the network into a template branch and a detection branch, and uses a large amount of data for offline training to achieve good tracking results. Gao et al. [11] propose a Siamese Attentional Key-point Network for target tracking, designing a new Siamese lightweight hourglass network and a novel cross-attentional module to obtain more accurate target features, together with a key-point detection approach to accurately locate the target and regress its scale.

#### **3. Proposed Method**

#### *3.1. Basic Siamese Network for Visual Tracking*

Siamese networks were originally applied to template matching problems and were later introduced into object tracking. A Siamese network is composed of two branches with the same structure and shared weights. These two branches extract the deep features of the target and of the search region, respectively, and a cross-correlation is then computed to find the highest response value in the search region; the position of this peak is the final target position. The whole process can be expressed by the following formula:

$$f(z, \mathbf{x}) = \varphi(z) * \varphi(\mathbf{x}) + b \cdot \mathbf{1} \tag{1}$$

where *z* represents the target template from the initial frame, *x* represents the search region, *b* · **1** denotes the bias value, and ∗ denotes the convolution operation.
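As a minimal illustration (not the authors' implementation), the cross-correlation of Equation (1) can be sketched in NumPy; here `phi_z` and `phi_x` stand for the template and search-region feature maps produced by the shared backbone, and the bias is a scalar for simplicity:

```python
import numpy as np

def cross_correlation(phi_z, phi_x, b=0.0):
    """Dense cross-correlation of template features phi_z (C, h, w)
    over search features phi_x (C, H, W), as in Eq. (1).
    Returns a response map of shape (H - h + 1, W - w + 1)."""
    C, h, w = phi_z.shape
    _, H, W = phi_x.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # inner product of the template with each search window
            out[i, j] = np.sum(phi_z * phi_x[:, i:i + h, j:j + w])
    return out + b
```

The arg-max of the returned response map gives the estimated target position in the search region.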

As shown in Figure 2, the proposed tracker contains a pre-trained feature extraction network, a lightweight target-aware attention learning network, and a Siamese network matching module. The VGG feature extraction network is a very deep convolutional network for image classification that achieves state-of-the-art performance on the ImageNet challenge dataset. It is trained offline; the proposed lightweight target-aware attention learning network is trained online using only the target information given in the first frame, and the cross-correlation operation of the Siamese network is then used to locate the target. The attention learning loss function used to train the lightweight target-aware attention learning network is redesigned on the basis of the MSE loss function, the Adam optimization method is used for training, and the importance of each feature channel is determined from the gradient values of back-propagation. The resulting importance weights are applied to the original depth features to represent the target, and finally the template matching method of the Siamese network is used to locate the target. The calculation process is shown in Formula (2):

$$f_{\text{new}}(z, \mathbf{x}) = (\varphi(z) \odot \alpha) * \varphi(\mathbf{x}) + b \cdot \mathbf{1} \tag{2}$$

where *z* denotes the template image, *x* denotes the image of the search region, *b* · **1** denotes the bias value, *α* is the channel attention weight vector of the feature channels, ⊙ denotes the Hadamard (channel-wise) product, ∗ denotes the convolution operation, and *f*new(*z*, *x*) denotes the response score.
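A hedged NumPy sketch of Equation (2): the only change from the plain Siamese correlation is that each template channel is first scaled by its attention weight `alpha` before the cross-correlation (names and shapes are illustrative, not the authors' code):

```python
import numpy as np

def weighted_response(phi_z, phi_x, alpha, b=0.0):
    """Eq. (2): weight each template channel by alpha (the Hadamard
    product along the channel axis) before cross-correlating with
    the search-region features."""
    zw = phi_z * alpha[:, None, None]  # broadcast (C,) over (C, h, w)
    C, h, w = zw.shape
    _, H, W = phi_x.shape
    out = np.full((H - h + 1, W - w + 1), float(b))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] += np.sum(zw * phi_x[:, i:i + h, j:j + w])
    return out
```

Setting a channel's weight to zero removes that channel's contribution entirely, which is exactly the redundant-channel trimming effect described above.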

**Figure 2.** Overview of our network architecture for visual tracking.

#### *3.2. Attentional Learning Loss Function*

Most trackers based on correlation filtering use cyclically shifted samples to train their regression models, while Chen et al. [12] propose using a single-layer convolution to solve the linear regression problem, training it by gradient descent to solve the regression problem in target tracking; this paper is inspired by that work. In the linear regression model of [12], the objective is to learn a linear function from the training samples $X \in \mathbb{R}^{m \times n}$ and the corresponding regression targets $Y \in \mathbb{R}^{m}$. Each row $x_i$ of $X$ represents a training sample of feature dimension $n$, and the corresponding regression target $y_i$ is the $i$-th element of $Y$. The objective is then to learn the coefficients $w$ of the regression function $f(x) = w^{T} x$ during the offline training process by minimizing the following objective function:

$$\underset{w}{\text{arg min}} \; \|Xw - Y\|^2 + \lambda \|w\|^2 \tag{3}$$

In Equation (3), $\|\cdot\|$ denotes the Euclidean norm, and *λ* is the regularization parameter that prevents overfitting.
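Equation (3) is a standard ridge regression and has the closed-form minimizer $w = (X^{T}X + \lambda I)^{-1} X^{T} Y$. A small NumPy sketch (illustrative, separate from the paper's gradient-based training):

```python
import numpy as np

def ridge_regression(X, Y, lam):
    """Closed-form minimizer of ||X w - Y||^2 + lam * ||w||^2, Eq. (3):
    w = (X^T X + lam I)^{-1} X^T Y."""
    n = X.shape[1]
    # solve the regularized normal equations rather than forming an inverse
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ Y)
```

With a small regularizer and noise-free targets, the recovered `w` matches the generating coefficients; the work in [12] instead reaches this minimizer iteratively with a single-layer convolution trained by gradient descent.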

The gradient values generated during the training of neural networks are a good indicator of channel saliency for different target classes [13]. This paper introduces this idea into a Siamese network-based tracker: training generates a set of weights that represent the contribution of each feature channel to modeling the target, thereby enhancing the target modeling capability of the pre-trained depth features. To this end, this paper redefines the input of Equation (3), which can be expressed as minimizing the following function:

$$\underset{w'}{\text{arg min}} \sum_{i} \left( \left( Z_i \cdot w_i' \right) * X_i - Y_i \right)^2 + \lambda' \sum_{i} w_i'^2 \tag{4}$$

where · is the dot product operation, ∗ denotes the convolution operation, *Z* is the template depth feature, and *X* is the search-area depth feature; they are obtained from the same frame, and *Z* is located at the center of *X*. *λ′* is the regularization parameter, and *w′* is the regression weight vector obtained by training the network, with one weight per feature channel of *Z* and *X*.
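The loss of Equation (4) can be sketched numerically as follows. This is an assumption-laden illustration, not the authors' code: it uses a single label map `Y` shared by all channels, and a naive dense correlation in place of the network's convolution layer:

```python
import numpy as np

def corr(z, x):
    """Single-channel dense cross-correlation ('valid' mode)."""
    h, w = z.shape
    H, W = x.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(z * x[i:i + h, j:j + w])
    return out

def attention_loss(Z, X, Y, w, lam):
    """Eq. (4): each template channel Z[i], scaled by its weight w[i],
    is correlated with the search channel X[i]; the squared error
    against the label map Y is summed, plus an L2 penalty on w."""
    loss = lam * np.sum(w ** 2)
    for i in range(Z.shape[0]):
        r = corr(w[i] * Z[i], X[i])
        loss += np.sum((r - Y) ** 2)
    return loss
```

Back-propagating this loss through `w` yields the per-channel gradient information used to rank channel importance.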

The comparison of the target response maps is shown in Figure 3. Figure 3a shows the response obtained from the feature channels weighted after learning with the attention learning loss function, and Figure 3b shows the response obtained directly from the original features.

**Figure 3.** Comparison of the before and after learning characteristics of attentional learning loss.

Finally, the regression weights *w′* are mapped by the sigmoid function to obtain the channel weights corresponding to the sample images.

$$\alpha_i = 1 \Big/ \left(1 + e^{-w_i'}\right) \tag{5}$$

where *αi* denotes the *i*-th value in *α*, with *αi* ∈ (0, 1), and *w′i* denotes the *i*-th value in *w′*.
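Equation (5) is the element-wise logistic sigmoid; a one-line NumPy sketch:

```python
import numpy as np

def channel_attention(w):
    """Eq. (5): map regression weights element-wise into (0, 1)
    channel attention weights via the sigmoid function."""
    return 1.0 / (1.0 + np.exp(-w))
```

The mapping is monotone, so the relative ranking of channels learned by the regression is preserved while the weights are squashed into a bounded range.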

In summary, the loss function generates gradient information by training on the target information given in the first frame. Under the attentional learning loss function, the back-propagated gradient values determine how much each feature channel contributes to expressing the target. The resulting importance weights are applied to the original depth features to represent the target, and the template matching method of the Siamese network is then used to locate the target. Note, however, that this loss function assumes the error between the model output and the ground-truth value obeys a Gaussian distribution; when this condition is not satisfied, its usefulness is limited.

#### *3.3. Lightweight Target-Aware Attention Learning Network*

In a pre-trained deep classification network, each feature channel contains a specific target feature pattern, and all feature channels together construct a feature space containing prior information about different objects. The pre-trained network identifies object classes mainly through a subset of these feature channels, so the channels should not be weighted equally when used to represent the tracked target.

As shown in Figure 4, the lightweight target-aware attention learning network proposed in this paper is built on a single-layer convolutional network, which is used in the same way as a general neural network, with its kernel set to match the size of the target template. However, to obtain better object appearance features, the proposed network only uses the object information given in the first frame for training and does not require complex offline training, while using the Adam optimization method to obtain the network parameters.

**Figure 4.** Lightweight target-aware attention learning network.

(1) Parameter learning process.

A search region *X* is cropped around the given first-frame target as the initial training sample, and *w′* is a set of initial target feature-channel weights initialized to 1. In the subsequent learning process, gradient information computed from the difference between the per-channel responses and the labels is used to update these values online; the larger the gradient value, the smaller the contribution of that feature channel to the target model. Equation (4) guides the online learning process, and the Adam optimization method is used to optimize the network, empirically setting the learning rate to , the momentum to 0.9, the weight decay to 1000, and the maximum number of iterations to 100. Compared with the traditional stochastic gradient descent (SGD) optimization method, Adam is an improvement and extension with high computational efficiency and a small memory footprint. Moreover, whereas the learning rate of SGD is fixed, Adam adapts per-parameter step sizes using running estimates of the first and second moments of the gradients, which improves performance on sparse-gradient problems.
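For reference, a single Adam update on the channel-weight vector can be sketched as follows. This is a generic Adam step in NumPy, not the authors' implementation, and the hyperparameter values are the standard defaults (the paper's learning rate is not specified):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (step index t starts at 1).
    m and v are running estimates of the gradient's first and
    second moments; bias correction accounts for their zero init."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Iterating this step with the gradient of Equation (4) is what updates the channel weights online from the first-frame sample.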

(2) Obvious characteristic of the lightweight target-aware attention learning network.

The network designed in this paper is implemented as a single-layer convolutional network, which learns an optimal representation of the target appearance by adjusting the feature-channel weights through simple single-layer convolution operations, using the proposed attention learning loss function for online learning and thus generating an optimal set of channel weighting parameters. This approach is computationally simple, requires no complex model-update strategy, does not occupy excessive memory, and is easy to implement. Moreover, the number of network parameters is small, which enables fast computation and real-time online tracking.
