2.2.1. M-O SiamRPN Framework

As shown in Figure 2, the proposed M-O SiamRPN consists of three critical components: (1) Siamese ResNet backbone for feature extraction applied to the classification branch and regression branch; (2) Multiple-order information generation module based on selected first-order features; (3) Region proposal network (RPN) under multiple constraints:


$$Cor^{\text{cls}} = T^{\text{cls}} \otimes S^{\text{cls}} \tag{9}$$

$$Cor^{\text{reg}} = T^{\text{reg}} \otimes S^{\text{reg}} \tag{10}$$

where *Cor*cls and *Cor*reg denote the correlation responses of the classification branch and the regression branch, *T*cls and *T*reg represent the template-branch features fed into the RPN for classification and regression, *S*cls and *S*reg are the search-branch features fed into the RPN for classification and regression, and ⊗ denotes the correlation operation. The RPN outputs the locations and category scores of the targets by sliding anchor boxes of different scales over the features of the two branches. For the regression procedure, a weight adaptive joint multiple intersection over union loss function is proposed to optimize the framework: the weight adaptive scale balances the information between positive and negative samples, and the multiple intersection over union improves the accuracy of anchor box regression.
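Equations (9) and (10) can be illustrated with a minimal single-channel sketch in NumPy, where the template feature map slides over the search feature map as a correlation kernel. The real framework correlates multi-channel learned features inside the RPN; `cross_correlate` and the toy maps below are only illustrative.

```python
import numpy as np

def cross_correlate(template, search):
    """Valid 2-D cross-correlation: slide the template feature map over
    the search feature map, as in Equations (9)-(10)."""
    th, tw = template.shape
    sh, sw = search.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template * search[i:i + th, j:j + tw])
    return out

T = np.ones((3, 3))          # toy template feature map
S = np.zeros((7, 7))
S[2:5, 2:5] = 1.0            # target embedded in the search region
resp = cross_correlate(T, S)
print(np.unravel_index(resp.argmax(), resp.shape))  # peak at the target location
```

The peak of the response map marks the most likely target position, which is the role the correlation maps play before the RPN head scores the anchors.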

**Figure 2.** M-O SiamRPN with weight adaptive joint multiple intersection over union framework.

2.2.2. First-Order Feature Selection Criteria Based on Spatial Continuity

The essence of convolution is to extract informative content from samples with kernels, so important features can be selected by a reasonable judgment of the convolution kernels themselves. In practical image processing, a large contiguous region in a convolution kernel produces feature maps with large blocks of contiguous pixels, which means that the feature map retains more essential information. Since kernels with different parameters emphasize different features, the spatial information contained in the convolutional kernel weights is the key to texture and edge extraction.

Here, we propose the concept of spatial continuity of convolutional kernels to integrate the originally scattered spatial information in the kernels, using the spatial information extraction ability as measurement criteria. The spatial continuity coefficient is calculated quantitatively and used to rank the convolution kernels, selecting the feature map with richer spatial information, as shown in Figure 3. For convolutional layers close to the input, this means that the top-ranked feature maps extract more sufficient local information, as shown in Figure 4.

**Figure 3.** Schematic of convolutional kernel spatial information ranking.

**Figure 4.** Convolutional kernels and feature maps arranged by spatial continuity from largest to smallest (9 randomly selected): (**a**) Input sample; (**b**) convolutional kernels and the corresponding feature maps.

The quantitative calculation of spatial continuity includes connected component analysis and weight information content. It is mathematically described as the sum of the products of the connected areas of all connected domains and the Frobenius norm [33] of the corresponding connected blocks in the matrix, and can be written as:

$$\mathrm{SC} = \sum_{k=1}^{m} C_k \,\|\boldsymbol{W}\|_F, \quad (x, y) \in A_k \tag{11}$$

$$\|\boldsymbol{W}\|_F = \sqrt{\mathrm{Tr}(\boldsymbol{W}^{\mathrm{T}}\boldsymbol{W})} = \sqrt{\sum_{x,y} W_{x,y}^2} \tag{12}$$

where *W* is the weight matrix corresponding to the convolution kernel, *m* is the total number of connected blocks in *W*, *Ak* is the *k*th connected block of *W*, and *Ck* is the area of *Ak*. The Frobenius norm in Equation (11) is taken over the elements belonging to *Ak*.

The connected domain is an effective analysis tool in the fields of pattern recognition and image segmentation, and mainly comes in four-neighborhood and eight-neighborhood variants, as shown in Figure 5. Here we choose the eight-neighborhood domain. To obtain the connected domain area *Ck*, the binarized weight matrix *M* first needs to be calculated:

$$M_{x,y} = \begin{cases} 1, & |W_{x,y}| \geq \text{threshold} \\ 0, & |W_{x,y}| < \text{threshold} \end{cases} \tag{13}$$

**Figure 5.** Diagram of 4-neighborhood and 8-neighborhood: (**a**) 4-neighborhood; (**b**) 4-diagonal neighborhood; (**c**) 8-neighborhood.

Here, the threshold value is a hyperparameter. After calculating the matrix *M*, all the eight-connected domains in the matrix are traversed, and their connected areas are calculated at the same time. The exploration directions and positions during the search process are defined as:

$$directions = \left[[-1,-1],\ [-1,0],\ [-1,1],\ [0,-1],\ [0,1],\ [1,-1],\ [1,0],\ [1,1]\right] \tag{14}$$

$$\begin{cases} x' = x + directions[k][0] \\ y' = y + directions[k][1] \end{cases}, \quad 0 \leq k < 8 \tag{15}$$

where *directions* denotes the array of search directions, *x* and *y* are the horizontal and vertical coordinates of the current position, and *x*′, *y*′ are the coordinates of the next position to be explored. Visited elements larger than the threshold are added to the connected block, and the elements linked to the block continue to be explored. The total number of explored elements larger than the threshold is the area of the connected domain *Ck*.

Finally, the convolutional kernels are sorted by their spatial continuity values, and the corresponding features are selected according to the selection proportion *p*. In the M-O SiamRPN framework, the first block in ResNet50 is chosen as the candidate to ensure that the features retain sufficient local information, which is conducive to the representation of texture by second-order features. Meanwhile, to keep the framework simple, *p* is set to 30% after several experiments, i.e., the features corresponding to the top 30% of the SC-ranked convolutional kernels are selected.
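The procedure of Equations (11)–(15) can be sketched in NumPy as follows; `spatial_continuity`, the flood-fill traversal, and the 30% selection at the end are an illustrative reading of the method under the stated equations, not the authors' implementation.

```python
import numpy as np

def spatial_continuity(W, threshold=0.1):
    """Spatial continuity (Eqs. 11-13): binarize the kernel weights, find
    8-connected blocks, and sum area * Frobenius norm of each block.
    `threshold` is the hyperparameter of Eq. (13)."""
    M = (np.abs(W) > threshold).astype(np.uint8)
    directions = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]       # Eq. (14)
    visited = np.zeros_like(M, dtype=bool)
    sc = 0.0
    for x in range(M.shape[0]):
        for y in range(M.shape[1]):
            if M[x, y] and not visited[x, y]:
                stack, block = [(x, y)], []
                visited[x, y] = True
                while stack:                              # depth-first flood fill
                    cx, cy = stack.pop()
                    block.append((cx, cy))
                    for dx, dy in directions:             # Eq. (15)
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < M.shape[0] and 0 <= ny < M.shape[1]
                                and M[nx, ny] and not visited[nx, ny]):
                            visited[nx, ny] = True
                            stack.append((nx, ny))
                area = len(block)                         # C_k
                frob = np.sqrt(sum(W[p] ** 2 for p in block))  # ||W||_F over A_k
                sc += area * frob
    return sc

# rank kernels and keep the top p = 30%
kernels = [np.random.randn(3, 3) for _ in range(10)]
ranked = sorted(range(len(kernels)),
                key=lambda i: spatial_continuity(kernels[i]), reverse=True)
selected = ranked[:int(0.3 * len(kernels))]
```

A fully contiguous 3 × 3 kernel of ones (one block of area 9, Frobenius norm 3) gives SC = 27, while a kernel entirely below the threshold gives SC = 0, matching the intuition that large contiguous regions rank highest.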

### 2.2.3. Multiple-Order Feature for Blurred Edges of Tiny Target

The backbone of SiamRPN is replaced with a ResNet50, which increases the network depth and improves the semantic representation of the features, but it is still essentially a first-order feature. For the description of blurred edges of tiny targets, the assistance of second-order features is required [34]. The second-order information based on normalized covariance is added to the residual block as an embedded injection layer, shown in Figure 6b. The first-order features applied to produce the second-order information are selected by the *SC* proposed in Section 2.2.2. This subsection focuses on the generation of second-order information with normalized covariance and the fusion of multi-order features.

**Figure 6.** Generation of multi-order features in residual blocks: (**a**) Traditional residual block; (**b**) proposed multi-information residual block.

After a pair of samples is fed into the Siamese framework, a set of first-order features, i.e., *f* 1-O ∈ ℝ*h*×*w*×*c* , is obtained after extraction and *SC* selection. *f* 1-O can be rewritten in vector form by taking the same position across different channels, *f* 1-O = *X* = {*X*1, *X*2, ... , *Xn*}, and its corresponding covariance matrix *f* 2-O is calculated as:

$$f^{2\text{-O}} = \frac{1}{n}\left(\sum_{i=1}^{n} X_i X_i^{\mathrm{T}}\right) + \varepsilon I \tag{16}$$

$$XX^{\mathrm{T}} = \begin{pmatrix} \mathrm{cov}(X_1, X_1) & \mathrm{cov}(X_1, X_2) & \cdots & \mathrm{cov}(X_1, X_n) \\ \mathrm{cov}(X_2, X_1) & \mathrm{cov}(X_2, X_2) & \cdots & \mathrm{cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(X_n, X_1) & \mathrm{cov}(X_n, X_2) & \cdots & \mathrm{cov}(X_n, X_n) \end{pmatrix} \tag{17}$$

$$\mathrm{cov}(X, Y) = E[(X - \overline{X})(Y - \overline{Y})] \tag{18}$$

where *I* is the identity matrix, *ε* denotes a small positive number, cov represents the covariance calculation, and *E*(•) is the mean value. The second-order covariance with channel interactions is obtained by transposed outer product calculations, whose powerful mathematical statistical guarantees have been verified in various classification applications. The vector used to calculate the covariance contains descriptions of different channels for the same location, as shown in Figure 7. Therefore, the obtained second-order information simultaneously captures higher-order characteristics and channel complementarity [35]. Different from other second-order methods, feature maps with high spatial continuity are selected for the second-order calculations, which avoids a dimensional surge and does not require an additional dimensionality reduction step.
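A minimal NumPy sketch of Equation (16) follows, assuming the covariance is taken across channels at each spatial position, so that the result is *c* × *c* and involves no dimensional surge; `second_order` and the toy shapes are illustrative only.

```python
import numpy as np

def second_order(f, eps=1e-4):
    """Covariance-based second-order information (Eq. 16): each spatial
    position gives one c-dimensional vector across channels; the result
    is a c x c channel covariance matrix."""
    h, w, c = f.shape
    X = f.reshape(h * w, c).T          # c x n, n = h*w spatial positions
    X = X - X.mean(axis=1, keepdims=True)
    cov = (X @ X.T) / (h * w)          # channel-by-channel covariance
    return cov + eps * np.eye(c)       # eps*I keeps it positive definite

f = np.random.randn(8, 8, 4)           # toy first-order feature map
C = second_order(f)
print(C.shape)                          # (4, 4): no dimensional surge
```

The `eps * I` term guarantees positive definiteness, which the normalization step in Equations (19)–(23) relies on.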

**Figure 7.** The generation of 2-O information based on covariance.

The resulting second-order *f* 2-O lies on a Riemannian manifold and cannot be fused directly; it needs to be mapped into Euclidean space [36]. So, *f* 2-O is normalized based on eigenvalue decomposition, as shown in Figure 7. Since the covariance matrix *f* 2-O is symmetric and positive definite, it can be decomposed by singular value decomposition (SVD) as:

$$f^{2\text{-O}} = U\Lambda U^{\mathrm{T}} \tag{19}$$

where Λ = diag(*λ*1, *λ*2, ... , *λn*) is the diagonal matrix of eigenvalues and *U* = [*u*1, *u*2, ... , *un*] is the matrix of corresponding eigenvectors. It can be seen from Equation (19) that the power normalization of the covariance matrix *f* 2-O can be expressed as a power operation on the eigenvalues:

$$(f^{2\text{-O}})^{\omega} = U\Lambda^{\omega}U^{\mathrm{T}} \tag{20}$$

When *ω* = 0.5, this is the square-root normalization of the matrix. Due to the slow speed of matrix decomposition on the GPU, an iterative method is employed to calculate the square root, meeting the demand for highly real-time localization. Let *f* 2-O = *Z*2; then, for the equation:

$$F(Z) = Z^2 - f^{2\text{-O}} = 0 \tag{21}$$

the Denman–Beavers iteration [37] is applied to obtain the solution, avoiding the drawbacks of pathological cases of SVD and its poor support on the GPU. The mathematical definition of the Denman–Beavers iteration is:

$$K\_{k+1} = \frac{1}{2} (K\_k + O\_k^{-1}) \tag{22}$$

$$O\_{k+1} = \frac{1}{2} (O\_k + K\_k^{-1}) \tag{23}$$

where the initial *K*0 is set to *f* 2-O, *O*0 is the identity matrix *I*, and *Kk* and *Ok* converge globally to (*f* 2-O)1/2 and (*f* 2-O)−1/2, respectively. Fifteen iterations have been experimentally demonstrated to be sufficient.
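Equations (22) and (23) can be checked with a plain NumPy sketch; `db_sqrt` is illustrative (the framework's GPU implementation would be batched and differentiable).

```python
import numpy as np

def db_sqrt(A, iters=15):
    """Denman-Beavers iteration (Eqs. 22-23): K_k converges to A^{1/2}
    and O_k to A^{-1/2}. Fifteen iterations, as reported above."""
    K = A.copy()
    O = np.eye(A.shape[0])
    for _ in range(iters):
        K_next = 0.5 * (K + np.linalg.inv(O))   # Eq. (22), using old O_k
        O = 0.5 * (O + np.linalg.inv(K))        # Eq. (23), using old K_k
        K = K_next
    return K, O

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
K, O = db_sqrt(A)
print(np.allclose(K @ K, A))              # K is the matrix square root
print(np.allclose(K @ O, np.eye(2)))      # O is its inverse
```

Note that both updates use the *previous* iterates *Kk* and *Ok*; updating them sequentially would change the iteration.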

As an injection layer in the residual block, the second-order information based on the normalized covariance is piecewise differentiable, so it can be trained end-to-end by back-propagation. From the chain rule, the partial derivative of the loss function *L* with respect to *f* 2-O can be described as:

$$\frac{\partial L}{\partial f^{2\text{-O}}} = U\left\{\left(Q^{\mathrm{T}} \odot \left(U^{\mathrm{T}}\frac{\partial L}{\partial U}\right)\right) + \left(\frac{\partial L}{\partial \Lambda}\right)_{\mathrm{diag}}\right\}U^{\mathrm{T}} \tag{24}$$

where *Q* is the interleaved matrix consisting of elements *qij*:

$$q_{ij} = \begin{cases} 1/(\lambda_i - \lambda_j), & i \neq j \\ 0, & i = j \end{cases} \tag{25}$$

From the combination of the above equations, the different partial derivatives are obtained as:

$$\frac{\partial L}{\partial U} = \left\{\frac{\partial L}{\partial Z} + \left(\frac{\partial L}{\partial Z}\right)^{\mathrm{T}}\right\}U\Lambda^{1/2} \tag{26}$$

$$\frac{\partial L}{\partial \Lambda} = (\Lambda)'\,U^{\mathrm{T}}\frac{\partial L}{\partial Z}U \tag{27}$$

$$(\Lambda)' = \mathrm{diag}\left(\frac{1}{2\sqrt{\lambda_1}}, \frac{1}{2\sqrt{\lambda_2}}, \dots, \frac{1}{2\sqrt{\lambda_n}}\right) \tag{28}$$

In a similar manner to the summation of features in the residual module, second-order features are fused by summing matrix elements after channel adjustment:

$$f^{\text{M-O}}(x) = f^{1\text{-O}}(x) \oplus x \oplus f^{2\text{-O}} \tag{29}$$

where *x* denotes the input of the current module, ⊕ is element-wise matrix summation, and element alignment by convolution is omitted here. With the added nonlinearity, the expressive power of the overall network is further increased by injecting second-order information to enhance the representation of textures and edges.

### 2.2.4. Weight Adaptive Joint Multiple Intersection over Union Loss Function

During the training process, only anchor boxes whose coverage exceeds the threshold are labeled as positive samples, while all other cases are negative samples. This setting leads to an imbalanced proportion of positive and negative samples [38]. In an attempt to improve overall classification performance, the classifier tends to focus less on the minority class and favor the majority class, making the minority samples difficult to recognize [39]. In addition, fast and accurate regression of the anchor box location is also crucial to network optimization.

The loss function *L* of the SiamRPN-based framework consists of two parts: the classification loss *L*cls, represented by the cross-entropy loss, and the loss function *L*reg used to optimize the target position. Therefore, we redesign the loss function for each of the two branches, and the weight adaptive joint multiple intersection over union loss function is proposed to cope with the issue of imbalanced sample information.

For classification loss *L*cls, the classical binary cross-entropy loss function is expressed as:

$$L\_{\rm CE}(y\_i, p\_i) = -(y\_i \cdot \log(p\_i) + (1 - y\_i) \cdot \log(1 - p\_i))\tag{30}$$

where *yi* is the label of the sample *xi* and *pi* is the output probability. The errors are accumulated equally for each sample, without considering the effect of the majority of negative sample information on the training process. According to the cost-sensitive principle, the penalty weight of minority category samples is additionally increased to enhance the focus on positive samples, improving the identification efficiency of positive samples. Here, a sample mean distribution coefficient *AG* is defined as:

$$AG = S\_{\text{all}}/n \tag{31}$$

where *S*all is the total number of samples and *n* is the number of sample categories. The implication of *AG* is the number of samples contained in each category in the ideal case of a uniform distribution. Then, a weighting scale *r* is defined:

$$r = AG/n_c \tag{32}$$

where *nc* denotes the number of samples actually classified into category *c*. For negative samples, the number of samples within the category is larger than *AG*, so the scale *r* is less than 1, which reduces their weights. On the contrary, for positive samples, whose number is less than *AG*, the scale *r* is greater than 1 and the weight is increased. The scaling of the weights thus adapts automatically to the imbalance of the sample categories. By adding the weight adaptive scale *r* to *L*CE, the new loss function of the classification branch is:

$$L\_{\rm cls} = -\frac{1}{N} \sum\_{i=1}^{N} r\_i [y\_i \log(p\_i) + (1 - y\_i) \log(1 - p\_i)] \tag{33}$$

Differentiating Equation (33), the gradient is obtained as:

$$\mathbf{g} = r\_i[y\_i \cdot (p\_i - 1) + (1 - y\_i) \cdot p\_i] \tag{34}$$

It can be seen that the gradient values are scaled up when the samples are positive, so that the total gradient contribution of both categories to the model optimization process is balanced.
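A toy NumPy sketch of Equations (31)–(33) for the binary case follows; `weight_adaptive_bce` is an illustrative reading in which *AG* = *N*/2 for two categories.

```python
import numpy as np

def weight_adaptive_bce(y, p, eps=1e-12):
    """Weight adaptive cross-entropy (Eqs. 31-33): r_i = AG / n_c scales
    up the scarce positives and scales down the abundant negatives."""
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    AG = len(y) / 2                      # Eq. (31): samples per class if balanced
    r = np.where(y == 1, AG / max(n_pos, 1), AG / max(n_neg, 1))  # Eq. (32)
    ce = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return np.mean(r * ce)               # Eq. (33)

y = np.array([1, 0, 0, 0, 0, 0, 0, 0])   # 1 positive vs. 7 negatives
p = np.full(8, 0.5)
loss = weight_adaptive_bce(y, p)
print(loss)
```

With uniform predictions of 0.5, the single positive gets weight 4 and each negative 4/7, so the two classes contribute equal total gradient, as the balancing argument above describes.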

The second part of the loss function is *L*reg for predicting the position of the target. Smooth-*l*1 is used in SiamRPN for regression optimization, but the four important vertices of the anchor boxes are not included in the optimization metric, which is less robust to rotations, scale changes, and non-overlapping boundaries. The intersection over union (*IoU*) [25] focuses on the overlap between the predicted anchor box and the groundtruth, treating the whole process as a regression calculation, which is essentially the cross-entropy of the overlap rate:

$$IoU = \frac{S_1 \cap S_2}{S_1 \cup S_2} \tag{35}$$

$$L_{IoU} = -p\ln(IoU) - (1 - p)\ln(1 - IoU) \tag{36}$$

where *S*1 and *S*2 are the prediction box and the groundtruth, respectively, and *p* is the corresponding probability. *IoU* indicates the overlapping relationship of the two boxes, which obviously ranges from 0 to 1. However, the non-overlapping case is not captured by the *IoU*. In this work, multiple intersection over union (MIoU) is proposed to optimize the regression of the predicted anchor box, designing a more complete loss function that contains the generalized intersection over union (*GIoU*) and the distance intersection over union (*DIoU*) [40,41].

The non-overlapping part of the two boxes is the concern of the *GIoU*, which aims to bring the two boxes infinitely close by minimizing the non-overlapping area, as shown in Figure 8b. First, it is necessary to find the smallest box *S*3 that completely encloses *S*1 and *S*2; then the area of *S*3 not occupied by the two boxes is calculated as:

$$GIoU = IoU - \frac{|S_3 - (S_1 \cup S_2)|}{S_3} \tag{37}$$

$$L_{GIoU} = 1 - IoU + \frac{|S_3 - (S_1 \cup S_2)|}{S_3} \tag{38}$$

**Figure 8.** Schematic diagram of different intersection over union: (**a**) *IoU*; (**b**) *GIoU*; (**c**) *DIoU*.

Obviously, when the prediction box *S*1 and the groundtruth *S*2 do not overlap, i.e., *IoU* = 0, the value of *GIoU* is not zero, ensuring that the gradient of the loss function can be propagated backwards. However, when *S*1 contains *S*2, *GIoU* degenerates to *IoU* and the exact regression calculation cannot be implemented through both *GIoU* and *IoU*. Therefore, *DIoU* is introduced and its visualization is shown in Figure 8c. Constraints on the distance between the centers of the two boxes *S*1 and *S*2 are incorporated in the regression process through *DIoU* to better characterize the overlap information, which is calculated as:

$$DIoU = IoU - \frac{d(S_1, S_2)}{R(S_3)} \tag{39}$$

$$L_{DIoU} = 1 - IoU + \frac{d(S_1, S_2)}{R(S_3)} \tag{40}$$

where *d*(*S*1, *S*2) denotes the distance between the centroids of the two boxes and *R*(*S*3) is the diagonal length of *S*3. Thus, the regression loss function *L*reg based on MIoU is a combination of *GIoU* and *DIoU*:

$$L_{\text{reg}} = \alpha\left(1 - IoU + \frac{|S_3 - (S_1 \cup S_2)|}{S_3}\right) + \beta\left(1 - IoU + \frac{d(S_1, S_2)}{R(S_3)}\right) \tag{41}$$

where *α* and *β* are hyperparameters. The loss function of the overall framework *L* can be expressed as:

$$Loss = L_{\text{cls}} + L_{\text{reg}} \tag{42}$$

$$Loss = \frac{1}{N}\sum_{i=1}^{N} r_i L_{\text{CE}}(y_i, p_i) + \alpha\left(1 - IoU + \frac{|S_3 - (S_1 \cup S_2)|}{S_3}\right) + \beta\left(1 - IoU + \frac{d(S_1, S_2)}{R(S_3)}\right) \tag{43}$$
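The MIoU regression term of Equation (41) can be sketched for axis-aligned boxes as follows; `miou_loss` is a minimal NumPy illustration using the same *d*/*R* distance term as Equations (39)–(41), with boxes given as (x1, y1, x2, y2).

```python
import numpy as np

def miou_loss(b1, b2, alpha=0.5, beta=0.5):
    """MIoU regression loss (Eq. 41): weighted sum of the GIoU and DIoU
    losses; alpha and beta are the hyperparameters."""
    # intersection and union of the two boxes
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = a1 + a2 - inter
    iou = inter / union

    # smallest enclosing box S3: GIoU area term (Eq. 38) and DIoU diagonal
    ex1, ey1 = min(b1[0], b2[0]), min(b1[1], b2[1])
    ex2, ey2 = max(b1[2], b2[2]), max(b1[3], b2[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    l_giou = 1 - iou + (enclose - union) / enclose

    # centroid distance over enclosing-box diagonal (Eq. 40)
    d = np.hypot((b1[0] + b1[2]) / 2 - (b2[0] + b2[2]) / 2,
                 (b1[1] + b1[3]) / 2 - (b2[1] + b2[3]) / 2)
    diag = np.hypot(ex2 - ex1, ey2 - ey1)
    l_diou = 1 - iou + d / diag

    return alpha * l_giou + beta * l_diou

print(miou_loss((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes -> 0.0
```

For identical boxes the loss vanishes, while for non-overlapping boxes both the enclosing-area and centroid-distance terms stay non-zero, so the gradient can still be propagated, which is the motivation for combining *GIoU* and *DIoU* above.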
