### 2.1.1. Content-Aware Feature Reassembly (CAFR)

The standard ROI pooling size for box prediction is 7 × 7, while that for mask prediction is 14 × 14 [47]. Therefore, to maintain feature consistency between box and mask, before the global context modeling, we propose CAFR to up-sample the raw box feature maps from 7 × 7 to 14 × 14, which also offers better modeling benefits in a larger feature space. We observe that this practice yields a notable accuracy gain, although some speed is sacrificed (see Section 5.1). Note that we abandon the common nearest-neighbor or bilinear interpolation for this purpose because they merely consider the sub-pixel neighborhood, failing to capture the rich global context semantic information required by the dense prediction task. Nor do we use deconvolution, because it applies the same kernel across the entire space without considering the underlying content, and is limited by a restricted field of view.

In contrast, our proposed CAFR enables instance-specific content-aware handling while considering global context information, resulting in adaptive up-sampling kernels. Such a content-aware paradigm is also suggested by Wang et al. [51]. Figure 4 shows the implementation of CAFR. CAFR contains two processes: (i) content-aware kernel prediction and (ii) feature reassembly.

**Figure 4.** Implementation of the content-aware feature reassembly (CAFR) in the GCIM-Block.

The former encodes the contents so as to predict the up-sampling kernel *K*. The input is the ROI's pooled feature maps, denoted by *F*ROI. To reduce the computational burden, we first adopt a 1 × 1 conv for channel compression, where the compression ratio is set to 0.5, i.e., from the raw 256 channels to 128, in consideration of the accuracy-speed trade-off. Then, a 3 × 3 conv is used to encode the entire content, whose kernel number is 2<sup>2</sup> × *n*<sup>2</sup>. Here, 2 denotes the up-sampling ratio, and *n* denotes the interpolation neighborhood scope to be considered, which is set to 5 empirically. To obtain the up-sampling kernel *K* across the entire 14 × 14 feature space, the encoded content feature maps are shuffled in space, leading to a tensor with a 14 × 14 × *n*<sup>2</sup> dimension.

Finally, it is normalized via a softmax function defined by *e*<sup>*x<sub>i</sub>*</sup>/∑<sub>*j*</sub> *e*<sup>*x<sub>j</sub>*</sup>, leading to the final up-sampling kernel *K*. Each 1 × 1 × *n*<sup>2</sup> tensor along the depth direction represents the corresponding kernel for a single up-sampling operation from the raw location *l*(*i*, *j*) to the required location *l'*(*i'*, *j'*). Briefly, the above is described by
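As a shape sanity check, the shuffle-and-normalize step can be sketched with PyTorch's `pixel_shuffle`; the tensor sizes follow the text (ratio 2, *n* = 5), but this is only an illustration, not the authors' exact code:

```python
import torch
import torch.nn.functional as F

# Encoded content: batch 1, 2^2 * n^2 = 4 * 25 = 100 channels on the 7 x 7 ROI grid
# (up-sampling ratio 2 and neighborhood n = 5 as stated in the text).
encoded = torch.randn(1, 4 * 25, 7, 7)

# Pixel shuffle rearranges the 2 x 2 sub-kernels in depth into space,
# yielding one n^2-dim kernel per up-sampled location.
shuffled = F.pixel_shuffle(encoded, upscale_factor=2)  # (1, 25, 14, 14)

# Softmax along the kernel dimension normalizes each 1 x 1 x n^2 kernel.
K = torch.softmax(shuffled, dim=1)                     # each kernel sums to 1
```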

$$K = \text{softmax}\{\text{shuffle}[\text{conv}_{3\times3}(\text{conv}_{1\times1}(F_{\text{ROI}}))]\}\tag{1}$$

The latter implements the feature reassembly, i.e., a convolution operation between the *n* × *n* neighborhood of the location *l*, denoted by *N*(*l*, *n*), and the predicted kernel *K<sub>l'</sub>* ∈ *K* corresponding to the required location *l'*. The above is described by

$$F'_{\text{ROI}} = \bigcup_{l \in F_{\text{ROI}},\, l' \in F'_{\text{ROI}}} \mathcal{N}(l, n) \otimes \mathcal{K}_{l'} \tag{2}$$

where *F'*ROI denotes the output feature maps of CAFR, and ⊗ denotes the convolution operator.
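Under the sizes stated above (256 input channels, ratio 2, *n* = 5), the whole CAFR pipeline can be sketched in PyTorch roughly as follows. This is an illustrative CARAFE-style reassembly written by the editor, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAFR(nn.Module):
    """Sketch of content-aware feature reassembly, assuming C = 256 input
    channels, up-sampling ratio 2, and neighborhood n = 5 as in the text."""
    def __init__(self, c=256, n=5, ratio=2):
        super().__init__()
        self.n, self.ratio = n, ratio
        self.compress = nn.Conv2d(c, c // 2, 1)                    # 256 -> 128
        self.encode = nn.Conv2d(c // 2, ratio**2 * n**2, 3, padding=1)

    def forward(self, f_roi):                                      # (B, 256, 7, 7)
        k = self.encode(self.compress(f_roi))                      # (B, 4*n^2, 7, 7)
        k = F.pixel_shuffle(k, self.ratio)                         # (B, n^2, 14, 14)
        k = torch.softmax(k, dim=1)                                # normalized kernels
        # Gather each source location's n x n neighborhood, then replicate it to
        # its ratio x ratio targets so every location l' sees the neighborhood
        # N(l, n) of its source location l.
        b, c, h, w = f_roi.shape
        patches = F.unfold(f_roi, self.n, padding=self.n // 2)     # (B, C*n^2, H*W)
        patches = patches.view(b, c * self.n**2, h, w)
        patches = F.interpolate(patches, scale_factor=self.ratio, mode="nearest")
        patches = patches.view(b, c, self.n**2, h * self.ratio, w * self.ratio)
        # Reassembly: weighted sum of each neighborhood with its predicted kernel.
        return (patches * k.unsqueeze(1)).sum(dim=2)               # (B, 256, 14, 14)
```

The per-location weighted sum is the ⊗ of Equation (2) written as a batched multiply-accumulate.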

### 2.1.2. Multi Receptive-Field Feature Response (MRFFR)

Inspired by the idea of multi-resolution analysis (MRA) [52], widely used in the wavelet transform community, we propose MRFFR to analyze ships at resolutions from fine to coarse, which improves the richness of global context information, i.e., from single-scale context to multi-scale contexts. Specifically, we adopt multiple dilated convolutions [53] with different dilated rates *r* to reach this aim, as shown in Figure 5, where boxes of different scales or colors represent different context scopes. MRFFR can not only excite multi-resolution feature responses but also capture multi-scope context information, which is conducive to better global context modeling.

**Figure 5.** Multi receptive-field feature response of SAR ships.

Figure 6 depicts the implementation of MRFFR. We adopt four 3 × 3 convs with different dilated rates to trigger different resolution responses. More convs might bring better accuracy but would reduce speed. Then, the four results are concatenated directly. Finally, we propose a dimension reduction squeeze-and-excitation (DRSE) to balance the contributions of different scope contexts and to achieve the channel reduction convenient for subsequent processing. DRSE models channel correlation to suppress useless channels and highlight valuable ones while reducing the channel dimension, which reduces the risk of training oscillation due to excessive irrelevant contextual backgrounds. We observe that only moderate contexts enable better box and mask prediction.

**Figure 6.** Implementation of the multi receptive-field feature response (MRFFR) in GCIM-Block. DRSE denotes the dimension reduction squeeze-and-excitation.

The above is described by

$$F_{\text{MRFFR}} = f_{\text{DRSE}}\left\{\left[f^{2}_{3\times3}\left(F'_{\text{ROI}}\right), f^{3}_{3\times3}\left(F'_{\text{ROI}}\right), f^{4}_{3\times3}\left(F'_{\text{ROI}}\right), f^{5}_{3\times3}\left(F'_{\text{ROI}}\right)\right]\right\}\tag{3}$$

where *F'*ROI denotes the input, *F*MRFFR denotes the output, *f*<sup>*r*</sup><sub>3×3</sub> denotes a 3 × 3 conv with a dilated rate *r*, and *f*DRSE denotes the DRSE operation to reduce channels from 1024 to 256.
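A minimal PyTorch sketch of this branch structure, assuming 256 input channels and the dilated rates 2, 3, 4, 5 that appear in Equation (3); a plain 1 × 1 conv stands in for DRSE here, since DRSE is detailed separately below:

```python
import torch
import torch.nn as nn

class MRFFR(nn.Module):
    """Sketch of the multi receptive-field branch: four parallel dilated 3x3
    convs, concatenation, then channel reduction (DRSE approximated by 1x1 conv)."""
    def __init__(self, c=256, rates=(2, 3, 4, 5)):
        super().__init__()
        # padding = r preserves the 14 x 14 spatial size for a 3 x 3 dilated conv.
        self.branches = nn.ModuleList(
            [nn.Conv2d(c, c, 3, padding=r, dilation=r) for r in rates])
        self.reduce = nn.Conv2d(c * len(rates), c, 1)  # stand-in: 1024 -> 256

    def forward(self, x):                              # (B, 256, 14, 14)
        multi = torch.cat([b(x) for b in self.branches], dim=1)  # (B, 1024, 14, 14)
        return self.reduce(multi)                      # (B, 256, 14, 14)
```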

Figure 7 depicts the implementation of DRSE. The input is denoted by *X* and the output by *Y*. In the collateral branch, a global average pooling is used to gather global spatial information; a 1 × 1 conv and a sigmoid activation function are then used to squeeze channels so as to highlight important ones. The squeeze ratio *p* is set to 4 (1024 → 256). In the main branch, the input channel number is reduced directly by a 1 × 1 conv and a ReLU activation. A broadcast element-wise multiplication is used for compressed channel weighting. In this way, DRSE models the channel correlation of the input feature maps in a reduced-dimension space. It uses the weights learned in the reduced-dimension space to pay attention to the important features of the main branch, avoiding the potential information loss of a crude dimension reduction. In short, the above is described by

$$Y = \text{ReLU}(\text{conv}\_{1 \times 1}(X)) \odot \sigma(\text{conv}\_{1 \times 1}(GAP(X))) \tag{4}$$

where *σ* denotes the sigmoid function and ⊙ denotes the broadcast element-wise multiplication.

**Figure 7.** Implementation of dimension reduction squeeze-and-excitation (DRSE) in the MRFFR.
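Equation (4) can be sketched as a small PyTorch module; the channel numbers follow the text (1024 → 256, *p* = 4), while the rest is an illustrative assumption rather than the authors' code:

```python
import torch
import torch.nn as nn

class DRSE(nn.Module):
    """Sketch of Equation (4): a reduced main branch gated by squeeze-and-
    excitation weights, assuming 1024 input channels and squeeze ratio p = 4."""
    def __init__(self, c_in=1024, p=4):
        super().__init__()
        c_out = c_in // p                           # 1024 -> 256
        self.main = nn.Conv2d(c_in, c_out, 1)       # direct channel reduction
        self.gate = nn.Conv2d(c_in, c_out, 1)       # channel weights branch
        self.gap = nn.AdaptiveAvgPool2d(1)          # global average pooling

    def forward(self, x):                           # (B, 1024, H, W)
        main = torch.relu(self.main(x))             # ReLU(conv_1x1(X))
        w = torch.sigmoid(self.gate(self.gap(x)))   # sigma(conv_1x1(GAP(X)))
        return main * w                             # broadcast element-wise multiply
```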

### 2.1.3. Global Feature Self-Attention (GFSA)

GFSA follows the basic idea of the non-local neural networks [54] to achieve the global context feature self-attention. It can be described by

$$\mathbf{y}_{i} = \frac{1}{\zeta(\mathbf{x})} \sum_{\forall j} f\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right) g\left(\mathbf{x}_{j}\right) \tag{5}$$

where **x** denotes the input, and *i* and *j* are index positions in the input feature maps across the whole *H* × *W* space. *f* is a pairwise function that represents the spatial correlation between *i* and *j*. *g* is a unary function that represents the input feature maps at position *j*. For a given *i*, *j* enumerates the whole *H* × *W* space, resulting in a sequence of spatial correlations between *i* and every position in the input feature maps. Through ∑<sub>∀*j*</sub> *f*(**x***<sub>i</sub>*, **x***<sub>j</sub>*)*g*(**x***<sub>j</sub>*), the output **y***<sub>i</sub>* at position *i* is related to the entire space. This means that global long-range spatial dependencies are captured. Finally, *ζ*(**x**) is used to normalize the response.
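A toy NumPy reading of Equation (5) with a Gaussian pairwise function *f*(**x***<sub>i</sub>*, **x***<sub>j</sub>*) = *e*<sup>**x***<sub>i</sub>*<sup>T</sup>**x***<sub>j</sub>*</sup>: the normalized weights *f*/*ζ* reduce exactly to a softmax over *j*. Sizes are assumed for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))          # 6 positions (H*W = 6), 4 channels
g = rng.normal(size=(6, 4))          # embedded features g(x_j)

scores = x @ x.T                     # x_i^T x_j for every (i, j) pair
# f / zeta: Gaussian pairwise weights normalized per row i.
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
y = weights @ g                      # each y_i depends on every position j

# The same result via an explicit (numerically stable) softmax over j.
def softmax(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

assert np.allclose(y, softmax(scores) @ g)
```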

We instantiate Equation (5) in Figure 8. Notably, Equation (5) only illustrates the process of calculating a single feature vector at the *i*-position (**y***i*) and the essence of achieving the global context feature self-attention. In the instantiation, however, the feature vectors at every position (**y**) are computed in parallel through matrix calculation in consideration of computational efficiency, and for simplicity we use existing operators such as convolution and softmax to achieve the global context feature self-attention. Specifically, in Figure 8, features at the *i*-position are denoted by *φ* using a 1 × 1 conv *Wφ*, and features at the *j*-position are denoted by *θ* using a 1 × 1 conv *Wθ*. We model the unary function *g* as a linear embedding instantiated through a 1 × 1 conv *Wg*, and embed features into a *C*/4 channel space to reduce computational burdens. Moreover, the pairwise function *f* is modeled as the Gaussian function *e*<sup>**x***<sub>i</sub>*<sup>T</sup>**x***<sub>j</sub>*</sup>, and the normalization factor *ζ*(**x**) is modeled as ∑<sub>∀*j*</sub> *e*<sup>**x***<sub>i</sub>*<sup>T</sup>**x***<sub>j</sub>*</sup>. Therefore, we can instantiate *f* and the normalization process together through a softmax calculation along the dimension *j*. Since *Wφ* and *Wθ* are learnable, the spatial correlation *f* is obtained from adaptive learning between *φ* and *θ*. Note that in the global self-attention process, the feature tensors need to be reshaped between three dimensions and two dimensions, which is implemented through the permute and flatten operations, respectively. The response at the *i*-position **y***i* is obtained by a matrix multiplication.

**Figure 8.** Implementation of the global feature self-attention (GFSA) in GCIM-Block.

Since we embed features into the *C*/4 channel space to reduce computational burdens before the global self-attention process, we need to recover the channel dimension after the attention process through a 1 × 1 conv *Wz* for the adding operation.

Finally, we obtain the final global feature self-attention output *F*GFSA, which is transmitted to the subsequent boundary-aware box prediction. Here, *F*GFSA is the final output of the GCIM-Block, i.e., *F*GCIM−Block.
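The steps above can be sketched as a compact non-local style PyTorch module, assuming *C* = 256 and the *C*/4 embedding mentioned in the text; this is an editor's illustration, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class GFSA(nn.Module):
    """Sketch of global feature self-attention in the non-local style,
    assuming C = 256 channels and a C/4 embedding space."""
    def __init__(self, c=256):
        super().__init__()
        self.phi = nn.Conv2d(c, c // 4, 1)      # W_phi
        self.theta = nn.Conv2d(c, c // 4, 1)    # W_theta
        self.g = nn.Conv2d(c, c // 4, 1)        # W_g (unary embedding)
        self.z = nn.Conv2d(c // 4, c, 1)        # W_z: recover channel dimension

    def forward(self, x):                       # (B, C, H, W)
        b, c, h, w = x.shape
        phi = self.phi(x).flatten(2)            # (B, C/4, HW)
        theta = self.theta(x).flatten(2)        # (B, C/4, HW)
        g = self.g(x).flatten(2)                # (B, C/4, HW)
        # f and zeta together: softmax over j of the embedded dot products.
        attn = torch.softmax(phi.transpose(1, 2) @ theta, dim=-1)  # (B, HW, HW)
        y = (attn @ g.transpose(1, 2)).transpose(1, 2)             # (B, C/4, HW)
        y = y.reshape(b, c // 4, h, w)
        return x + self.z(y)                    # channel recovery, then residual add
```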

### *2.2. Boundary-Aware Box Prediction Block (BABP-Block)*

The traditional box prediction is implemented by estimating the bounding box's center offset (Δ*x*, Δ*y*) and the corresponding width and height offsets (Δ*w*, Δ*h*) with respect to its ground truth (GT) to optimize network parameters, as shown in Figure 9a. Yet, this paradigm is not well suited to SAR ships, for the following two reasons.

**Figure 9.** Different box prediction forms. (**a**) The traditional estimation of the bounding box's center offset (Δ*x*, Δ*y*) and the corresponding width and height offsets (Δ*w*, Δ*h*). (**b**) The bounding box's boundary estimation of this paper, which contains two basic steps, i.e., boundary prediction and location fine regression. The red box denotes the prediction box. The green box denotes the ground truth box. The green dot denotes the center of the ground truth box. The red arrow denotes the trend of bounding box regression.

On the one hand, as shown in Figure 10, SAR ships often exhibit a huge scale difference due to a huge resolution difference, e.g., 1 m resolution for TerraSAR-X [55] and 20 m resolution for Sentinel-1 [56]. This situation is called the cross-scale effect [39], e.g., the extremely small ships in Figure 10c vs. the extremely large ships in Figure 10d. For example, the smallest ship in SSDD has only 28 pixels while the largest one has 62,878 pixels [57], where the scale ratio reaches 62,878/28 ≈ 2245. For commonly used two-stage models, presetting a series of prior anchors is required for the RPN. Yet, no matter how the prior anchors are set, it is still difficult to cover such a dataset with a large scale difference. Since the proportion of small ships in the dataset is higher than that of large ships, the size of the prior anchor is always closer to the small ships, leaving a long spatial distance from the large ships. Consequently, the adjustment of anchors for large ships becomes rather difficult if the traditional scheme shown in Figure 9a is adopted, because it is time-consuming to adjust a small anchor to a large GT box, placing a great burden on the network training. As a result, the positioning accuracy of large ships becomes poor.

**Figure 10.** Some cross-scale SAR ships. (**a**) Ship size distribution in SSDD. (**b**) Ship size distribution in HRSID. (**c**) Small ships. (**d**) Large ships.

On the other hand, it is rather challenging to locate the center of an SAR ship. Generally, different parts of the ship's hull have different materials, resulting in differential radar electromagnetic scattering (i.e., radar cross section, RCS [58]). This makes the pixel brightness distribution of the ship in an SAR image extremely uneven. In many cases, the strong scattering points of the ship lie not at the geometric center of the hull, but at the bow or stern. This phenomenon may directly lead to the failure of center-based localization.

As shown above, we abandon the traditional scheme in Figure 9a and adopt the boundary learning scheme in Figure 9b to implement the box prediction. We design a boundary-aware box prediction block (BABP-Block) to reach this goal, inspired by the grid idea from Wang et al. [59] and Lu et al. [60]. From Figure 9b, BABP-Block consists of two basic steps. (i) The first is to predict the coarse boundary of a ship, marked by yellow dotted lines on the *x*-left, *x*-right, *y*-top and *y*-down sides (i.e., four yellow activated grids). (ii) The second is to adjust the box finely from the boundary box to the GT box. This stage is the same as the traditional scheme, but it is obviously much easier to adjust the resulting coarse boundary box to the GT box so as to achieve the final finer box, because the distance to be adjusted is greatly reduced. Such a coarse-to-fine prediction scheme divides the task into two stages, where each stage is responsible for its own sub-task, resulting in a dual supervision of training and enabling better box prediction. Once the box prediction becomes more accurate, the mask prediction becomes more accurate as well. BABP-Block contains four main design concepts, i.e., (1) boundary-aware feature extraction (BAFE), (2) boundary bucketing coarse localization (BBCL), (3) boundary regression fine localization (BRFL), and (4) boundary-guided classification rescoring (BGCR). Its workflow is depicted in Figure 11. The input is the feature maps of GCIM-Block's output *F*GCIM−Block.

**Figure 11.** Workflow of the boundary-aware box prediction block (BABP-Block). Here, BAFE denotes the boundary-aware feature extraction, BBCL denotes the boundary bucketing coarse localization, BRFL denotes the boundary regression fine localization, and BGCR denotes the boundary-guided classification rescoring.
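The coarse-to-fine idea behind BBCL and BRFL can be illustrated with a toy decode of a single box side; the bucket count, search range, and offset below are purely hypothetical values for illustration, not the paper's settings:

```python
import numpy as np

def decode_side(lo, hi, bucket_scores, fine_offset, n_buckets=7):
    """Toy coarse-to-fine estimate of one box side within the range [lo, hi]:
    (i) bucketing picks the most activated boundary bucket (coarse step),
    (ii) a small regressed offset refines the bucket center (fine step)."""
    width = (hi - lo) / n_buckets
    coarse = lo + (np.argmax(bucket_scores) + 0.5) * width  # bucket center
    return coarse + fine_offset * width                     # small refinement

# Hypothetical example: bucket 2 activates most strongly for the left side.
scores = np.array([0.1, 0.1, 0.7, 0.05, 0.02, 0.02, 0.01])
left = decode_side(lo=0.0, hi=70.0, bucket_scores=scores, fine_offset=-0.1)
# Bucket width 10: coarse center 25.0, refined by -0.1 * 10 -> 24.0.
```

Because the regression only covers the residual within one bucket, the distance to adjust stays small even for very large ships, which is the point of the two-step scheme.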
