2.2.1. Triple Siamese Network Structure

The traditional Siamese network always uses the initial image as the tracking target. However, the velocity spectrum varies in the lateral direction, so as tracking proceeds, the information contained in the initial sample becomes insufficient to track subsequent targets. We address this problem by adding a target-feature update to the tracker. The velocity spectrum sample at the initial position provides the most basic characteristics of the target and plays a leading role in the tracking process. As the target position moves laterally, however, tracking that also fuses the features of the previous position is better than tracking with the initial sample alone. Based on this consideration, we propose a triple Siamese network that extends the traditional Siamese network with an update branch for the current target: the current prediction result is used as the target when tracking the velocity spectrum at the next position, so the current target is updated with each tracking result. The structure of the triple velocity spectrum tracking network is shown in Figure 3.

**Figure 3.** Structure of the triple velocity spectrum tracking network.

Unlike the traditional Siamese network, the triple network consists of three branches: an initial branch *I* that takes the initial target as input, a search branch *S* that takes the search area as input, and an update branch *U* that takes the tracking result at the previous location as input. The backbone models in the three branches share the same CNN architecture. Passing *I*, *S*, and *U* through this same network model yields the responses *ϕ*(*I*), *ϕ*(*S*), and *ϕ*(*U*), which are embedded into the feature space of subsequent tasks. For the initial branch *I*, the specified initial velocity spectrum is used as the sample, and *I* remains unchanged throughout the tracking process. The update branch *U* holds the previous target tracking result and is updated as tracking proceeds. To embed the information of these branches, we cross-correlate the feature map of the search area with the feature maps of the initial branch and the update branch. As in the traditional Siamese network, the triple network uses a fully convolutional network for feature extraction, and each channel generates a corresponding mapping response *R*. Since there are three branches, two mapping results are generated:

$$R_1 = \varphi(S) \ast \varphi(I) \tag{2}$$

$$R_2 = \varphi(S) \ast \varphi(U) \tag{3}$$

where ∗ denotes the cross-correlation operation, *R*<sub>1</sub> is the mapping result of the initial branch and the search branch, and *R*<sub>2</sub> is the mapping result of the update branch and the search branch. To use the information in both mappings, the two results are fused by weighting into *R*<sub>*f*</sub>:

$$R_f = a_1 R_1 + a_2 R_2 \tag{4}$$

where *a*<sub>1</sub> and *a*<sub>2</sub> are the weight coefficients, and *a*<sub>1</sub> + *a*<sub>2</sub> = 1.
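
As an illustration of Equations (2)–(4), the following minimal sketch computes the two response maps and their weighted fusion on toy single-channel feature maps. It is not the implementation used in this work. The function names (`cross_correlate`, `fuse_responses`, `peak`, `track_step`), the default weight `a1 = 0.5`, and the choice to refresh the update branch by cropping the search feature map at the response peak are all assumptions made for the example. The feature maps are assumed to have already been produced by the shared backbone *ϕ*.

```python
def cross_correlate(search, template):
    """Valid-mode 2-D cross-correlation of a search map with a smaller template.

    Both arguments are lists of rows (single-channel feature maps).
    """
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            # Sum of element-wise products over the window at (i, j).
            row.append(sum(search[i + u][j + v] * template[u][v]
                           for u in range(th) for v in range(tw)))
        response.append(row)
    return response


def fuse_responses(r1, r2, a1=0.5):
    """Eq. (4): R_f = a1 * R_1 + a2 * R_2, with a2 = 1 - a1."""
    a2 = 1.0 - a1
    return [[a1 * x + a2 * y for x, y in zip(row1, row2)]
            for row1, row2 in zip(r1, r2)]


def peak(response):
    """Return (row, col) of the first maximum in a response map."""
    best_pos, best_val = (0, 0), response[0][0]
    for i, row in enumerate(response):
        for j, val in enumerate(row):
            if val > best_val:
                best_pos, best_val = (i, j), val
    return best_pos


def track_step(phi_i, phi_s, phi_u, a1=0.5):
    """One tracking step on precomputed feature maps.

    Returns the peak position of the fused response and a new update
    template cropped from the search map at that position (an assumed
    update rule for this sketch).
    """
    r1 = cross_correlate(phi_s, phi_i)   # Eq. (2): initial vs. search
    r2 = cross_correlate(phi_s, phi_u)   # Eq. (3): update vs. search
    rf = fuse_responses(r1, r2, a1)      # Eq. (4): weighted fusion
    i, j = peak(rf)
    th, tw = len(phi_i), len(phi_i[0])
    new_u = [row[j:j + tw] for row in phi_s[i:i + th]]
    return (i, j), new_u


if __name__ == "__main__":
    phi_s = [[0, 0, 0, 0],
             [0, 1, 2, 0],
             [0, 3, 4, 0],
             [0, 0, 0, 0]]
    phi_i = [[1, 2], [3, 4]]   # initial template, kept fixed
    phi_u = [[1, 0], [0, 0]]   # template from the previous position
    position, phi_u = track_step(phi_i, phi_s, phi_u)
    print(position, phi_u)
```

In this sketch the initial template is never modified, while the update template returned by `track_step` is passed into the next step. This mirrors the roles of *I* and *U* described above.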
