*2.4. Semantic-Guided Anchor-Adaptive Learning (SGAAL)*

Anchors are the basis of modern object detection; they are usually a set of manually designed boxes that serve as references for classification and bounding box regression. However, previous video SAR moving target shadow detectors mostly adopt preset anchors of fixed shape, size, and scale. In other words, their scales and aspect ratios are unchangeable, which can degrade the feature learning of moving target shadows. Moreover, the raw anchors are arranged densely and uniformly across the feature map, which does not match the characteristics of video SAR images, as in Figure 12a. Moving target shadows in video SAR images are distributed sparsely and unevenly, so dense and uniform anchors generate many false alarms. Therefore, inspired by Wang et al. [45], we design a novel semantic-guided anchor-adaptive learning (SGAAL) tool to generate high-quality location-adaptive and shape-adaptive anchors in the RPN, as in Figure 12b. Here, high-level deep semantic features are adopted to guide anchor generation, which ensures higher anchor quality [45].

**Figure 12.** Sketch map of different anchor distributions. (**a**) The raw distribution; (**b**) the improved distribution with SGAAL. Anchors are marked in blue boxes.

The aim of SGAAL is to adaptively obtain the anchor location and the corresponding shape, that is, the parameters (*x*, *y*, *w*, *h*) of anchors in the image *I*, where (*x*, *y*) denotes the spatial coordinate of the anchor center, *w* denotes the width of the anchor box, and *h* denotes the height of the anchor box. Therefore, SGAAL can be described by

$$p(x, y, w, h \mid I) = p(x, y \mid I) \cdot p(w, h \mid x, y, I) \tag{9}$$

where *p*(*x*, *y*|*I*) denotes the prediction of the anchor location for a given image *I*, and *p*(*w*, *h*|*x*, *y*, *I*) denotes the prediction of the anchor shape for a given image *I* and the corresponding known location. In other words, SGAAL first adaptively predicts the location (*x*, *y*) of anchors and then adaptively predicts their shape (*w*, *h*). Figure 13 shows the detailed implementation process of SGAAL. In Figure 13, the input semantic feature is denoted by *Q*; its height, width, and number of channels are denoted by *H*, *W*, and *C*, respectively.

**Figure 13.** Detailed implementation process of SGAAL.

From Figure 13, we use a 1 × 1 conv *WL* to predict the anchor location; its channel number is set to 1, and it encodes the whole *H* × *W* location space. This 1 × 1 conv layer is followed by a *sigmoid* activation function, defined as 1/(1 + *e*<sup>−*x*</sup>), to represent the occurrence probability of the shadow location. Here, the location threshold is denoted by *εL*. That is, when the value at location (*x*, *y*) is larger than *εL*, this location is assigned a positive "1" label (i.e., the network should generate anchors at this location); otherwise, it is assigned a negative "0" label (i.e., the network should not generate anchors at this location). As locations containing shadows occupy only a small portion of the whole feature map, we adopt the focal loss (FL) of RetinaNet [39] to train this anchor location prediction network so that it is not overwhelmed by the large number of negative samples, i.e.,

$$\mathrm{loss}_{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t), \quad p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases} \tag{10}$$

where *y* denotes the ground-truth class (*y* = 1 for the positive class; otherwise negative), *p* denotes the predicted probability ranging from 0 to 1, *γ* denotes the focusing parameter, set to 2 empirically, and *αt* denotes the weighting factor, set to 0.25 empirically.
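As a concrete illustration, the anchor location branch and its focal-loss training objective can be sketched in PyTorch roughly as follows. This is a minimal sketch, not the authors' released implementation: the class and function names, the `in_channels` argument, and the value passed for the threshold *εL* are assumptions, and the focal loss follows the simplified form of Equation (10) with a constant *αt*.

```python
import torch
import torch.nn as nn


class AnchorLocationBranch(nn.Module):
    """Sketch of the anchor-location branch W_L in Figure 13 (assumed names)."""

    def __init__(self, in_channels: int, eps_l: float):
        super().__init__()
        # 1 x 1 conv with a single output channel: one shadow-occurrence
        # probability for every position of the H x W feature map.
        self.w_l = nn.Conv2d(in_channels, 1, kernel_size=1)
        # Location threshold epsilon_L; its value is tuned experimentally
        # in Section 5.4, so whatever is passed here is only a placeholder.
        self.eps_l = eps_l

    def forward(self, q: torch.Tensor):
        # q: (N, C, H, W) input semantic feature Q.
        prob = torch.sigmoid(self.w_l(q))      # (N, 1, H, W), values in (0, 1)
        # Positions above epsilon_L receive the positive "1" label (anchors are
        # generated there); the remaining positions receive the negative "0" label.
        loc_mask = prob > self.eps_l           # boolean mask, (N, 1, H, W)
        return prob, loc_mask


def focal_loss(prob: torch.Tensor, target: torch.Tensor,
               alpha_t: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss of Equation (10); target holds the 0/1 location labels."""
    p_t = torch.where(target == 1, prob, 1.0 - prob)
    loss = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))
    return loss.mean()
```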

We use a 1 × 1 conv *WS* to predict the anchor shape; its channel number is set to 2 because we need to obtain the anchor width *w* and height *h*. The anchor shape is predicted across the whole *H* × *W* location space; however, the shape predictions whose corresponding location predictions are lower than the threshold *εL* are filtered out. This threshold *εL* will be determined experimentally in Section 5.4. The bounded IoU loss [72] is used to train this anchor shape prediction network because it is more sensitive to box spatial locations, i.e.,

$$\mathrm{loss}_{BIoU} = -\log\left(1 - \frac{G \cap P}{G \cup P}\right) \tag{11}$$

where *G* denotes the ground-truth box and *P* denotes the predicted box.
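A corresponding sketch of the shape branch, including the filtering by *εL* and the IoU ratio that appears in Equation (11), might look as follows. The names, the (x1, y1, x2, y2) box format, and the way the mask is applied are assumptions for illustration; the full bounded IoU loss of [72] is more elaborate than the single IoU term computed here.

```python
import torch
import torch.nn as nn


class AnchorShapeBranch(nn.Module):
    """Sketch of the anchor-shape branch W_S in Figure 13 (assumed names)."""

    def __init__(self, in_channels: int):
        super().__init__()
        # 1 x 1 conv with two output channels: the predicted width w and
        # height h of the anchor at every position of the H x W map.
        self.w_s = nn.Conv2d(in_channels, 2, kernel_size=1)

    def forward(self, q: torch.Tensor, loc_mask: torch.Tensor) -> torch.Tensor:
        shape = self.w_s(q)                    # (N, 2, H, W): (w, h) per position
        # Discard shape predictions whose location probability fell below
        # epsilon_L: only masked-in positions keep their (w, h) estimates.
        return shape * loc_mask.float()


def box_iou(g: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """IoU between matched ground-truth boxes g and predicted boxes p,
    both of shape (M, 4) in (x1, y1, x2, y2) format."""
    lt = torch.maximum(g[:, :2], p[:, :2])     # top-left of the intersection
    rb = torch.minimum(g[:, 2:], p[:, 2:])     # bottom-right of the intersection
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_g = (g[:, 2] - g[:, 0]) * (g[:, 3] - g[:, 1])
    area_p = (p[:, 2] - p[:, 0]) * (p[:, 3] - p[:, 1])
    union = area_g + area_p - inter
    return inter / union.clamp(min=1e-6)
```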

As a result, the anchor location and shape are obtained by combining the outputs of *WL* and *WS*. Note that Wang et al. [45] pointed out that the feature for a large anchor should encode content over a large region, while that for a small anchor should have a correspondingly smaller scope. Thus, following their practice, we also devise an anchor-guided feature adaptation component that transforms the feature at each individual location *i* according to the underlying anchor shape, i.e.,

$$q_i' = A(q_i, w_i, h_i) \tag{12}$$

where *qi* denotes the *i*-th location element of the raw feature map *Q*, *qi'* denotes the *i*-th location element of the transformed feature map *Q'*, and *wi* and *hi* denote the width and height of the anchor at the *i*-th location. Moreover, *A* is a 3 × 3 deformable convolutional layer *WA*: an offset field is first predicted from the output of the anchor shape prediction branch, and the learned offsets are then applied to the original feature map to obtain the final feature map *Q'*.

Finally, based on the adaptively learned anchors, high-quality proposals are generated and then mapped onto the transformed feature map *Q'* by RoIAlign to extract their corresponding feature regions for the subsequent classification and regression in Fast R-CNN, as in Figure 2. In this way, the optimized anchors can adaptively match shadow locations and shapes, enabling better false-alarm suppression and more attentive shadow feature learning.
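For completeness, mapping the generated proposals onto the adapted feature map *Q'* with RoIAlign can be sketched as follows. The 7 × 7 output size, the feature-map stride, and the variable names are placeholder assumptions; the paper does not specify these values.

```python
import torch
from torchvision.ops import roi_align


def extract_proposal_features(q_prime: torch.Tensor, proposals, stride: int = 16):
    # q_prime:   (N, C, H, W) adapted feature map Q' from the feature adaptation step
    # proposals: list of N tensors, each (K_i, 4) in (x1, y1, x2, y2) image coordinates
    # spatial_scale converts image coordinates to feature-map coordinates given the
    # stride of Q' relative to the input image (16 here is only a placeholder).
    return roi_align(q_prime, proposals, output_size=(7, 7),
                     spatial_scale=1.0 / stride, sampling_ratio=2, aligned=True)
```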
