### 2.2.1. Boundary-Aware Feature Extraction (BAFE)

Traditional feature extraction is implemented across the entire 2D space without distinguishing direction, i.e., the left and right boundaries along the *x*-direction and the top and bottom boundaries along the *y*-direction. As a result, important boundary-sensitive features are not extracted. Thus, BAFE is designed to solve this problem and to ensure the accuracy of the subsequent boundary localization.

Figure 12 shows the implementation of BAFE. BAFE contains two parallel branches, i.e., *x*-boundary feature extraction and *y*-boundary feature extraction. Here, we take the *x*-boundary feature extraction as an example to introduce the details; the *y*-boundary feature extraction follows the same reasoning. First, we use a convolutional block attention module (CBAM) [61] to better capture the direction-specific information of the ROI region. Then, a 1 × 1 conv with a softmax activation normalizes the attention map, which is applied to the raw feature maps by element-wise multiplication. Afterwards, we sum the features along the *y*-direction and use a 1 × 3 asymmetric conv to obtain the *x*-direction features $F_x$. The above can be described by

$$F_x = \sum_{y} F_{\text{GCIM-Block}}(y, \cdot) \odot M_x(y, \cdot) \tag{6}$$

where $\odot$ denotes element-wise multiplication and $M_x$ denotes the attention map of the *x*-boundary. Finally, $F_x$ is split evenly into two subsets, i.e., $F_{x\text{-}right}$ and $F_{x\text{-}left}$, to represent the features of the right and left boundaries.

**Figure 12.** Implementation of the boundary-aware feature extraction (BAFE) used in BABP-Block.
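To make the pipeline concrete, the following is a minimal PyTorch sketch of the *x*-boundary branch. The layer sizes, the softmax axis, and the channel-wise split are our assumptions for illustration; `CBAM` refers to the module sketched after Eq. (8) below.

```python
import torch
import torch.nn as nn

class XBoundaryFeatureExtraction(nn.Module):
    """Sketch of the x-boundary branch of BAFE (Eq. (6)). Layer sizes are
    illustrative assumptions; CBAM is the module of [61] sketched below."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.cbam = CBAM(channels)  # direction-aware refinement
        self.attn_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv1x3 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_roi: torch.Tensor):
        # f_roi: ROI features F_GCIM-Block, shape (N, C, H, W)
        f = self.cbam(f_roi)
        # 1x1 conv + softmax to normalize the attention map M_x;
        # normalizing along the y-axis (dim=2) is our assumption here
        m_x = torch.softmax(self.attn_conv(f), dim=2)
        # element-wise weighting of the raw features, then summation
        # along the y-direction -> (N, C, W), cf. Eq. (6)
        f_x = (f_roi * m_x).sum(dim=2)
        # 1x3 asymmetric conv along the x-direction
        f_x = self.conv1x3(f_x)
        # even split into right/left boundary features (channel-wise here)
        f_x_right, f_x_left = f_x.chunk(2, dim=1)
        return f_x_right, f_x_left
```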

Figure 13 shows the implementation of CBAM. Let its input be $F_{\text{GCIM-Block}} \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ are the height and width of the feature maps and $C$ is the channel number. The channel attention is responsible for generating a channel-dimension weight matrix $W_{CA} \in \mathbb{R}^{1 \times 1 \times C}$ to measure the importance levels of the $C$ channels; the spatial attention is responsible for generating a space-dimension weight matrix $W_{SA} \in \mathbb{R}^{H \times W \times 1}$ to measure the importance levels of the spatial elements across the entire $H \times W$ space. Both are constrained to the range from 0 to 1 by a sigmoid activation, which can enrich the nonlinearity of neural networks for better performance, as suggested by [61]. The result of the channel attention is denoted by $F_{CA} = F \times W_{CA}$, and the result of the spatial attention is denoted by $F_{SA} = F_{CA} \times W_{SA}$. It should be noted that here the spatial attention is executed after the channel attention; it is also feasible to reverse their order.

**Figure 13.** Implementation of the convolutional block attention module (CBAM) used in BAFE.

For the channel attention, a max-pooling (MaxPool) is used to capture the local response, and an average-pooling (AvgPool) is used to capture the global response. A shared multi-layer perceptron (MLP) refines both descriptors for a better fusion of the local and global responses. Finally, the sum of the two results is normalized by a sigmoid function to obtain $W_{CA}$. The above is described by

$$W_{CA} = \sigma\{\text{MLP}[\text{MaxPool}(F)] + \text{MLP}[\text{AvgPool}(F)]\} \tag{7}$$

where $\sigma$ denotes the sigmoid activation defined by $\sigma(x) = 1/(1 + e^{-x})$.
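As a reference, below is a minimal PyTorch sketch of Eq. (7); the two-layer MLP with a reduction ratio of 16 follows the common CBAM configuration [61], and the class name is ours.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of Eq. (7): W_CA = sigma(MLP(MaxPool(F)) + MLP(AvgPool(F)))."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # shared two-layer MLP; reduction ratio 16 follows [61]
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (N, C, H, W); both poolings act over the whole H x W space
        max_desc = torch.amax(f, dim=(2, 3))   # local (salient) response, (N, C)
        avg_desc = torch.mean(f, dim=(2, 3))   # global response, (N, C)
        w_ca = torch.sigmoid(self.mlp(max_desc) + self.mlp(avg_desc))
        return w_ca.view(f.size(0), -1, 1, 1)  # (N, C, 1, 1), broadcastable to F
```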

For the spatial attention, MaxPool and AvgPool are also used. However, unlike in the channel attention, they both operate along the channel dimension here, yielding 2D feature maps. Their results are concatenated directly and convolved by a conv layer, producing the 2D spatial attention map. Finally, the result is normalized by a sigmoid activation to obtain $W_{SA}$. The above is described by

$$W_{SA} = \sigma\{f_{7 \times 7}([\text{MaxPool}(F_{CA}), \text{AvgPool}(F_{CA})])\} \tag{8}$$

where $f_{7 \times 7}$ is a 7 × 7 conv whose kernel size is recommended by the original report [61].
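Analogously, a minimal sketch of Eq. (8) and of the channel-then-spatial composition of Figure 13 (reusing the `ChannelAttention` sketch above) could look as follows.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of Eq. (8): W_SA = sigma(f_7x7([MaxPool(F_CA); AvgPool(F_CA)]))."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # 2-channel input (max map + avg map) -> 1-channel spatial weight map
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f_ca: torch.Tensor) -> torch.Tensor:
        # pooling along the channel dimension yields two 2D maps, (N, 1, H, W)
        max_map = torch.amax(f_ca, dim=1, keepdim=True)
        avg_map = torch.mean(f_ca, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as in Figure 13."""

    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        f_ca = f * self.ca(f)         # F_CA = F x W_CA (broadcast multiply)
        return f_ca * self.sa(f_ca)   # F_SA = F_CA x W_SA
```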

### 2.2.2. Boundary Bucketing Coarse Localization (BBCL)

After the boundary-sensitive features are obtained by the previous BAFE stage, we follow the bucketing idea [62] to predict the box boundary, referred to as BBCL. The specific implementation scheme is consistent with Wang et al. [59]. This scheme divides the target space into multiple buckets, also called discrete grid cells [60]. The coarse boundary localization is completed by searching for the correct bucket, i.e., the one in which the boundary resides. Figure 14 shows the implementation of BBCL. The candidate regions are divided into 2*k* buckets in both the *x*-direction and the *y*-direction, with *k* buckets corresponding to each boundary. Here, *k* is equal to 14 because the feature map size is 14 × 14. From Figure 14, we adopt a fully-connected (FC) layer to serve as a binary classifier that predicts whether the boundary is located in, or is the closest to, each bucket on each side, based on the ship boundary-aware features $F_{x\text{-}right}$, $F_{x\text{-}left}$, $F_{y\text{-}right}$, and $F_{y\text{-}left}$. We thereby obtain the boundary probabilities of the four sides, denoted by $s_{x\text{-}right}$, $s_{x\text{-}left}$, $s_{y\text{-}right}$, and $s_{y\text{-}left}$; they will also be utilized for the final boundary-guided classification rescoring, which will be introduced in Section 2.2.4. Afterwards, the maximum activation value is projected onto the raw feature maps to obtain the corresponding index value. Finally, the four boundary positions are obtained, i.e., *x*-right, *x*-left, *y*-right, and *y*-left. In this way, the coarse boundary of a ship is predicted.

**Figure 14.** Implementation of the boundary bucketing coarse localization (BBCL) used in BABP-Block.
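For illustration, the sketch below shows one possible PyTorch realization of BBCL for the *x*-direction with *k* = 14; the per-bucket FC classifiers and the bucket-center projection are simplified assumptions rather than the exact scheme of [59].

```python
import torch
import torch.nn as nn

class BucketingCoarseLocalizationX(nn.Module):
    """Sketch of BBCL along the x-direction; k = 14 buckets per side.
    The FC heads and the bucket-to-coordinate projection are simplified."""

    def __init__(self, feat_channels: int = 128, k: int = 14):
        super().__init__()
        self.k = k
        # binary classifier per bucket on each side (sigmoid scores)
        self.fc_left = nn.Linear(feat_channels * k, k)
        self.fc_right = nn.Linear(feat_channels * k, k)

    def forward(self, f_x_left, f_x_right, roi_x1, roi_x2):
        # f_x_left / f_x_right: side-aware features from BAFE, (N, feat_channels, k)
        s_left = torch.sigmoid(self.fc_left(f_x_left.flatten(1)))     # (N, k)
        s_right = torch.sigmoid(self.fc_right(f_x_right.flatten(1)))  # (N, k)
        # the maximum activation indicates the bucket holding the boundary
        idx_left = s_left.argmax(dim=1)
        idx_right = s_right.argmax(dim=1)
        # project bucket indices back to coordinates (bucket centers),
        # counting left-side buckets from x1 and right-side buckets from x2
        bucket_w = (roi_x2 - roi_x1) / (2 * self.k)
        x_left = roi_x1 + (idx_left.float() + 0.5) * bucket_w
        x_right = roi_x2 - (idx_right.float() + 0.5) * bucket_w
        # s_left / s_right are also kept for the BGCR rescoring (Section 2.2.4)
        return x_left, x_right, s_left, s_right
```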

### 2.2.3. Boundary Regression Fine Localization (BRFL)

After the coarse boundary of a ship is obtained, we need to finely adjust the box toward the GT box in order to eliminate the boundary effects of the buckets, as shown in Figure 15. This process is the same as the traditional bounding box regression scheme. Specifically, we adopt a 4-way FC layer to complete this task, i.e., the center point correction and the width-height adjustment. Since this process operates on the predicted boundary box, the distance between the initial box and the GT box becomes smaller. Consequently, such a regression task becomes easier and copes better with the cross-scale effect, because the positioning difficulty is shared with the previous boundary prediction stage.

**Figure 15.** Implementation of the boundary regression fine localization (BRFL) used in BABP-Block.
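As a sketch of this fine adjustment, the function below applies deltas (dx, dy, dw, dh) from the 4-way FC layer to the coarse box predicted by BBCL; the exact parameterization is not spelled out in the text, so the standard R-CNN-style one is assumed.

```python
import torch

def apply_box_deltas(coarse_box: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """Sketch of BRFL: refine the coarse (x1, y1, x2, y2) box from BBCL with
    4-way FC outputs (dx, dy, dw, dh); standard parameterization assumed."""
    x1, y1, x2, y2 = coarse_box.unbind(dim=1)
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas.unbind(dim=1)
    # center point correction and width-height adjustment
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * torch.exp(dw), h * torch.exp(dh)
    return torch.stack([cx - 0.5 * w, cy - 0.5 * h,
                        cx + 0.5 * w, cy + 0.5 * h], dim=1)
```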

Up to this point, we have obtained the bounding box regression results from the regression branch in Figure 11.

### 2.2.4. Boundary-Guided Classification Rescoring (BGCR)

BBCL offers the localization reliability of the predicted boundary box, that is, the boundary probabilities of the four sides $s_{x\text{-}right}$, $s_{x\text{-}left}$, $s_{y\text{-}right}$, and $s_{y\text{-}left}$ introduced previously. A similar rescoring idea appears in FCOS [63], where the final classification score is computed using the predicted center-ness score and the raw classification score together. Intuitively, fully leveraging both cues is conducive to maintaining the optimum box, i.e., the one with both high classification confidence and accurate localization. Thus, we arrange a boundary-guided classification rescoring (BGCR) strategy to reach this aim, which is described by

$$s' = \alpha \cdot s + \beta \cdot \frac{1}{4}\left(s_{x\text{-}right} + s_{x\text{-}left} + s_{y\text{-}right} + s_{y\text{-}left}\right) \tag{9}$$

where $s$ denotes the original confidence score of the classification network (i.e., the two FC layers in Figure 11), $s'$ denotes the final confidence score, $\alpha$ denotes the weight coefficient of the original confidence score, and $\beta$ denotes the weight coefficient of the localization reliability. In our work, $\alpha$ and $\beta$ are both set to 0.5 considering the trade-off between the spatial localization reliability and the classification reliability [64,65]. In terms of the total localization reliability, we directly average the four sides' boundary probabilities because they seem to be equally important. Finally, the resulting score $s'$ will be inputted to the non-maximum suppression (NMS) algorithm [48] to remove repeated detections.
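A one-to-one sketch of Eq. (9) in PyTorch, with $\alpha = \beta = 0.5$ as in the text:

```python
import torch

def bgcr_rescore(s, s_x_right, s_x_left, s_y_right, s_y_left,
                 alpha: float = 0.5, beta: float = 0.5) -> torch.Tensor:
    """Sketch of Eq. (9): fuse the raw classification score with the averaged
    boundary probabilities; the fused score s' is then fed into NMS."""
    loc_reliability = (s_x_right + s_x_left + s_y_right + s_y_left) / 4.0
    return alpha * s + beta * loc_reliability
```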
