1. Introduction
Synthetic aperture radar (SAR) is an outstanding microwave sensor. It provides high-resolution observation images by measuring objects' radar scattering characteristics, independent of both light and weather [1,2,3,4,5], and is extensively used in the measurement [6,7], transportation [8], ocean [9,10], and remote sensing [11,12] communities. Ship surveillance is currently a research highlight because it benefits disaster relief, traffic control, and fishery monitoring [13]. Compared with optical [14], infrared [15], and hyperspectral [16] sensors, SAR is more suitable for ocean ship surveillance because of its stronger adaptability to changeable marine climates. Consequently, ship surveillance using SAR is receiving increasing attention [17,18,19,20,21,22,23,24].
Traditional methods [17,25,26,27] generally rely on hand-crafted features designed via expert experience, which is laborious and time-consuming and limits broader generalization. Today, convolutional neural networks (CNNs) offer many elegant schemes with both high efficiency and high accuracy. For example, LeCun et al. [28] proposed LeNet5 for handwritten character recognition. Krizhevsky et al. [29] proposed AlexNet, which showed great performance in the 2012 ImageNet competition. Simonyan et al. [30] deepened networks to extract more discriminative features and proposed VGG for image classification. Girshick [31] used deep convolutional networks to build Fast R-CNN for object detection. Ren et al. [32] proposed Faster R-CNN, which achieved state-of-the-art object detection accuracy on the PASCAL VOC datasets. Consequently, an increasing number of scholars are working on CNN-based SAR ship detection [19,20,21,22,23,24]. For example, Cui et al. [19] proposed a dense attention pyramid network to detect multi-scale SAR ships. Zhang et al. [20] proposed a balance scene learning mechanism to improve performance on complex inshore ships. Sun et al. [21] applied an anchor-free method to SAR ship detection. Zhang et al. [22] designed a depthwise separable convolutional neural network for faster detection speed. Song et al. [24] developed an automatic methodology to generate robust training data for ship detection. However, according to the investigation in [33], most existing reports focus on detecting ships at the box level, i.e., SAR ship box detection. Regrettably, only a few reports detect ships at the box level and pixel level simultaneously, i.e., SAR ship instance segmentation.
Some works [34,35,36,37] have studied SAR ship instance segmentation. Wei et al. [34] released the HRSID dataset and offered some common research baselines, but they did not make methodological contributions. Su et al. [35] applied CNN-based models to remote sensing image instance segmentation, but the characteristics of SAR ships were not considered, which hinders further accuracy improvement. Gao et al. [36] proposed an anchor-free model, but it cannot handle complex scenes and cases [38]. Zhao et al. [37] proposed a synergistic attention for SAR ship instance segmentation, but their method still misses many small ships and inshore ones. Most of these existing models have limited box positioning ability, which hinders further improvement of segmentation accuracy.
Thus, we propose a global context boundary-aware network (GCBANet) for better SAR ship instance segmentation. We design a global context information modeling block (GCIM-Block) to capture the spatial long-range dependences of ship surroundings, resulting in larger receptive fields; thus, background interference can be mitigated. We also design a boundary-aware box prediction block (BABP-Block) to estimate the ship box boundary rather than the box center and width-height. This enables better cross-scale prediction, because aligning each side of the box to the target boundary is much easier than moving the box as a whole while tuning its size, especially for cross-scale targets. Here, cross-scale means that targets exhibit a large pixel-scale difference [39]. A large scale difference usually arises from a large resolution difference [40]. SAR ships have this cross-scale characteristic, i.e., small ships are extremely small and large ones are extremely large [39]. Such a huge scale difference increases instance segmentation difficulty. BABP-Block tackles this problem.
We conducted ablation studies to confirm the effectiveness of GCIM-Block and BABP-Block. Combining them, GCBANet significantly surpasses the other nine competitive models on the two public datasets SSDD [41] and HRSID [34]. Specifically, on SSDD, it achieves 2.8% higher box AP and 3.5% higher mask AP than the existing best model; on HRSID, the gains are 2.7% and 1.9%, respectively. The source code and results are available online on our website [42].
The main contributions of this article are as follows:
GCBANet is proposed for better SAR ship instance segmentation.
GCIM-Block and BABP-Block are proposed to ensure GCBANet’s good performance.
GCBANet significantly outperforms the other nine competitive models.
The rest of this article is arranged as follows. Section 2 introduces the methodology of GCBANet. Section 3 introduces the experiments. Results are shown in Section 4. Ablation studies are described in Section 5. Finally, a summary of this article is given in Section 6.
2. Methodology
Figure 1 shows the architecture of the proposed GCBANet. GCBANet follows the state-of-the-art cascade structure [43,44] for high-quality SAR ship instance segmentation, which sets three stages to progressively refine the box prediction (B1, B2, and B3) and the mask prediction (M1, M2, and M3). This paradigm has been demonstrated to deliver optimal instance segmentation performance [45].
The backbone network is used to extract SAR ship features. Without losing generality, the common ResNet-101 [46] is selected as GCBANet's backbone network. The region proposal network (RPN) [32] is used to generate initial region candidates, i.e., regions of interest (ROIs). ROIAlign [47] is used to extract the feature subsets of ROIs from the backbone network's feature maps F for the subsequent box-mask refined prediction. ROIAlign's input parameters are determined by the previous box prediction, i.e., RPN→ROIAlign-1, B1→ROIAlign-2, B2→ROIAlign-3, and B3→ROIAlign-4. The resulting feature subset is denoted by $F_i$. The box prediction in the i-th stage is conducted by learning on $F_i$, whose refined location regression is then inputted into the next stage. The mask prediction in the i-th stage is implemented by learning on the next stage's feature subset $F_{i+1}$. The final box prediction B3 and mask prediction M3 are post-processed by non-maximum suppression (NMS) [48] to delete duplicate detections.
We observe from the information flow direction (B1→M1, B2→M2, and B3→M3) that the mask prediction mainly relies on the previous box prediction. Therefore, to further improve the segmentation performance of the mask prediction, one should first improve the detection performance of the box prediction. In this way, the overall instance segmentation can be improved (instance segmentation contains both box detection and mask segmentation). This is also a direct scheme to boost the performance of two-stage instance segmentation models [49]. Thus, considering the task characteristics of SAR ships, we design two blocks, a GCIM-Block (marked by a green circle) and a BABP-Block (marked by a magenta circle), to reach this goal. Their resulting benefits are transmitted to the final box prediction B3 and mask prediction M3 for better performance.
Next, we will introduce the GCIM-Block and the BABP-Block in detail in the following two sub-sections.
2.1. Global Context Information Modeling Block (GCIM-Block)
Ships in SAR images have various surroundings, as in Figure 2, e.g., river courses, islands, inshore facilities, harbors, and wakes. Moreover, because of the special imaging mechanism of SAR, ships are also accompanied by cross-shape sidelobes, speckle noise, and granular pixel distribution [50]. These various surroundings pose differential effects on ship instance segmentation, so it is necessary to take them into consideration for better background discrimination ability in the box prediction. Therefore, we design a global context information modeling block (GCIM-Block) to model global background context information, which can capture the spatial long-range dependences of ships to decrease false alarms and missed detections. GCIM-Block offers three main design concepts, i.e., (1) content-aware feature reassembly (CAFR), (2) multi receptive-field feature response (MRFFR), and (3) global feature self-attention (GFSA). Its workflow is shown in Figure 3; the input is the ROI feature subset $F_i$ and the output, denoted by $F_{GCIM}$, is transmitted to the BABP-Block.
2.1.1. Content-Aware Feature Reassembly (CAFR)
The standard ROI pooling size of the box prediction is 7 × 7 while that of the mask prediction is 14 × 14 [47]. Therefore, to maintain feature consistency between box and mask, before the global context modeling, we propose CAFR to up-sample the raw box feature maps from 7 × 7 to 14 × 14, which also offers better modeling benefits in a larger feature space. We observe that this practice offers a notable accuracy gain although some speed is sacrificed (see Section 5.1). Note that we abandon the common nearest-neighbor or bilinear interpolation to reach this goal because they merely consider a sub-pixel neighborhood, failing to capture the rich global context semantic information required by the dense prediction task. We also do not use deconvolution because it applies the same kernel across the entire space without considering the underlying global context content, and it is limited by a restricted field of view. Differently, our proposed CAFR enables instance-specific, content-aware handling while considering global context information, resulting in adaptive up-sampling kernels. Such a content-aware paradigm is also suggested by Wang et al. [51].
Figure 4 shows the implementation of CAFR. CAFR contains two processes—(i) content-aware kernel prediction and (ii) feature reassembly operation.
The former is used to encode contents so as to predict the up-sampling kernel K. The input is the ROI's pooled feature maps, denoted by $F_i \in \mathbb{R}^{7 \times 7 \times 256}$. To reduce the computational burden, we first adopt a 1 × 1 conv for channel compression where the compression ratio is set to 0.5, i.e., from the raw 256 channels to 128, in consideration of the accuracy-speed trade-off. Then, a conv is used to encode the entire content, whose kernel number is $2^2 n^2$. Here, 2 denotes the up-sampling ratio, and n denotes the interpolation neighborhood scope to be considered, which is set to 5 empirically. To achieve the up-sampling kernel K across the entire feature space, the encoded content feature maps are shuffled in space, leading to a tensor with a $14 \times 14 \times n^2$ dimension. Finally, it is normalized via a softmax function defined by $\mathrm{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}$, leading to the final up-sampling kernel K. The $1 \times 1 \times n^2$ tensor along the depth direction represents the corresponding kernel for a single up-sampling operation from the raw location l to the required location l′. Briefly, the above is described by

$$K = \mathrm{softmax}\left(\mathrm{shuffle}\left(\mathrm{conv}\left(\mathrm{conv}_{1 \times 1}\left(F_i\right)\right)\right)\right) \tag{1}$$
The latter is to implement the feature reassembly, i.e., a convolution operation between the n × n neighbors of the location l, denoted by $N(F_{i,l}, n)$, and the predicted kernel $K_{l'}$ corresponding to the required location l′. The above is described by

$$F'_{l'} = N\left(F_{i,l}, n\right) \ast K_{l'} \tag{2}$$

where $F'$ denotes the output feature maps of CAFR, and $\ast$ denotes the convolution operator.
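For illustration, a minimal PyTorch sketch of CAFR under the stated settings (7 × 7 → 14 × 14, n = 5, channel compression 256 → 128) is given below. It follows the content-aware reassembly paradigm of Wang et al. [51]; the class and variable names are our own illustrative choices, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CAFR(nn.Module):
    """Content-aware feature reassembly: (B, 256, 7, 7) -> (B, 256, 14, 14)."""
    def __init__(self, c=256, n=5, up=2):
        super().__init__()
        self.n, self.up = n, up
        self.compress = nn.Conv2d(c, c // 2, 1)             # 1x1 conv: 256 -> 128
        self.encode = nn.Conv2d(c // 2, up * up * n * n, 3, padding=1)  # 2^2*n^2 kernels

    def forward(self, x):
        b, c, h, w = x.shape
        k = self.encode(self.compress(x))                   # (B, 4n^2, 7, 7)
        k = F.pixel_shuffle(k, self.up)                     # shuffle: (B, n^2, 14, 14)
        k = F.softmax(k, dim=1)                             # normalize each kernel
        # n x n neighbors N(F_l, n) of every raw location l
        nbrs = F.unfold(x, self.n, padding=self.n // 2)     # (B, c*n^2, h*w)
        nbrs = nbrs.view(b, c * self.n * self.n, h, w)
        # map every required location l' back to its raw location l
        nbrs = F.interpolate(nbrs, scale_factor=self.up, mode="nearest")
        nbrs = nbrs.view(b, c, self.n * self.n, self.up * h, self.up * w)
        # reassembly: convolve the neighbors with the predicted kernel K_l'
        return (nbrs * k.unsqueeze(1)).sum(dim=2)           # (B, 256, 14, 14)

A quick shape check, e.g., CAFR()(torch.randn(1, 256, 7, 7)).shape, yields (1, 256, 14, 14), matching the box-mask consistency goal above.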
2.1.2. Multi Receptive-Field Feature Response (MRFFR)
Inspired by the idea of multi-resolution analysis (MRA) [52], widely used in the wavelet transform community, we propose MRFFR to analyze ships in resolution from fine to coarse, which improves the richness of global context information, i.e., from a single-scale context to multi-scale contexts. Specifically, we adopt multiple dilated convolutions [53] with different dilated rates r to reach this aim, as shown in Figure 5, where boxes of different scales or colors represent different context scopes. MRFFR can not only excite multi-resolution feature responses but also capture multi-scope context information, conducive to better global context modeling.
Figure 6 depicts the implementation of MRFFR. We adopt four 3 × 3 convs with different dilated rates to trigger different resolution responses; more branches might bring better accuracy but would reduce speed. Then, the four results are concatenated directly. Finally, we propose a dimension-reduction squeeze-and-excitation (DRSE) to balance the contributions of different scope contexts and to achieve the channel reduction convenient for the subsequent processing. DRSE can model channel correlation to suppress useless channels and highlight valuable ones while reducing the channel dimension, which reduces the risk of training oscillation due to excessive irrelevant contextual backgrounds. We observe that only moderate contexts enable better box and mask prediction. The above is described by

$$Y = \mathrm{DRSE}\left(\mathrm{cat}\left[\Phi_{r_1}(X), \Phi_{r_2}(X), \Phi_{r_3}(X), \Phi_{r_4}(X)\right]\right) \tag{3}$$

where X denotes the input, Y denotes the output, $\Phi_r$ denotes a 3 × 3 conv with a dilated rate r, and DRSE denotes the DRSE operation to reduce channels from 1024 to 256.
Figure 7 depicts the implementation of DRSE. The input is denoted by X and the output by Y. In the collateral branch, a global average pooling (GAP) is used to achieve global spatial information, and a 1 × 1 conv with a sigmoid activation function is used to squeeze channels so as to highlight important ones. The squeeze ratio is set to 4 (1024/256 = 4). In the main branch, the input channel number is reduced directly by a 1 × 1 conv with a ReLU activation. A broadcast element-wise multiplication is used for compressed channel weighting. In this way, DRSE models the channel correlation of the input feature maps in a reduced-dimension space and uses the weights learned there to attend to the important features of the main branch, avoiding the potential information loss of a crude dimension reduction. In short, the above is described by

$$Y = \sigma\left(\mathrm{conv}_{1 \times 1}\left(\mathrm{GAP}(X)\right)\right) \otimes \mathrm{ReLU}\left(\mathrm{conv}_{1 \times 1}(X)\right) \tag{4}$$

where $\sigma$ denotes the sigmoid function and $\otimes$ denotes the broadcast element-wise multiplication.
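Under these definitions, DRSE and the full MRFFR pipeline can be sketched compactly in PyTorch as below; the four dilated rates (1, 2, 3, 4) are our assumption for illustration, since the text only states that the four rates differ.

import torch
import torch.nn as nn


class DRSE(nn.Module):
    """Dimension-reduction squeeze-and-excitation: 1024 -> 256 channels (sketch)."""
    def __init__(self, c_in=1024, c_out=256):
        super().__init__()
        # collateral branch: GAP -> 1x1 conv -> sigmoid channel weights
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(c_in, c_out, 1), nn.Sigmoid())
        # main branch: direct 1x1 reduction with ReLU
        self.reduce = nn.Sequential(nn.Conv2d(c_in, c_out, 1), nn.ReLU())

    def forward(self, x):
        return self.reduce(x) * self.gate(x)  # broadcast element-wise multiplication


class MRFFR(nn.Module):
    """Four dilated 3x3 convs (multi receptive fields) fused by DRSE (sketch)."""
    def __init__(self, c=256, rates=(1, 2, 3, 4)):  # rates assumed for illustration
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=r, dilation=r) for r in rates)
        self.drse = DRSE(c * len(rates), c)          # 1024 -> 256

    def forward(self, x):                            # x: (B, 256, 14, 14)
        y = torch.cat([b(x) for b in self.branches], dim=1)  # (B, 1024, 14, 14)
        return self.drse(y)                          # (B, 256, 14, 14)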
2.1.3. Global Feature Self-Attention (GFSA)
GFSA follows the basic idea of the non-local neural networks [54] to achieve the global context feature self-attention. It can be described by

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f\left(x_i, x_j\right) g\left(x_j\right) \tag{5}$$

where x denotes the input, and i and j are index positions in the inputted feature maps across the whole H × W space. f is a pairwise function used to represent the spatial correlation between i and j, and g is a unary function used to represent the inputted feature maps at position j. For a given i, j enumerates the whole H × W space, resulting in a sequence of spatial correlations between i and every position in the inputted feature maps. Through $\sum_{\forall j}$, the i-position's output $y_i$ is related to the entire space, meaning that global long-range spatial dependencies are captured. Finally, $C(x)$ is used to normalize the response.
We instantiate Equation (5) in Figure 8. Notably, Equation (5) only illustrates the process of calculating a single feature vector at the i-position ($y_i$) and the essence of achieving the global context feature self-attention. In the instantiation, however, the feature vectors at every position are computed in parallel through matrix calculation in consideration of computational efficiency, and we use existing operators such as convolution and softmax to achieve the global context feature self-attention for simplicity. Specifically, in Figure 8, features at the i-position are embedded using a 1 × 1 conv $\theta$, and features at the j-position are embedded using a 1 × 1 conv $\phi$. We model the unary function g as a linear embedding, which is instantiated through a 1 × 1 conv, and embed features into a halved channel space to reduce computational burdens. Moreover, the pairwise function f is modeled as the Gaussian function $f(x_i, x_j) = e^{\theta(x_i)^{T} \phi(x_j)}$, and the normalization factor $C(x)$ is modeled as $\sum_{\forall j} f(x_i, x_j)$. Therefore, we can instantiate f and the normalization process together through a softmax function along the dimension j. Since $\theta$ and $\phi$ are learnable, the spatial correlation f is obtained from adaptive learning between $\theta(x_i)$ and $\phi(x_j)$. Note that in the global self-attention process, the feature tensors need to be transposed or reshaped between three dimensions and two dimensions, which is implemented through the permute and flatten operations, respectively. The response at the i-position is obtained by a matrix multiplication. Since we embed features into a halved channel space before the global self-attention process, we need to recover the channel number afterwards through a 1 × 1 conv for the residual adding operation. Finally, we achieve the final global feature self-attention output, i.e., $F_{GCIM}$, which denotes the final output of the GCIM-Block and will be transmitted to the subsequent boundary-aware box prediction.
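The instantiation maps directly onto a standard embedded-Gaussian non-local block; a minimal PyTorch sketch (using the halved embedding channel stated above) could read:

import torch
import torch.nn as nn
import torch.nn.functional as F


class GFSA(nn.Module):
    """Global feature self-attention (embedded-Gaussian non-local block, sketch)."""
    def __init__(self, c=256):
        super().__init__()
        self.theta = nn.Conv2d(c, c // 2, 1)    # 1x1 conv embedding x_i
        self.phi = nn.Conv2d(c, c // 2, 1)      # 1x1 conv embedding x_j
        self.g = nn.Conv2d(c, c // 2, 1)        # unary function g
        self.recover = nn.Conv2d(c // 2, c, 1)  # channel recovery for the residual add

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).permute(0, 2, 1)  # (B, HW, C/2)
        k = self.phi(x).flatten(2)                     # (B, C/2, HW)
        v = self.g(x).flatten(2).permute(0, 2, 1)      # (B, HW, C/2)
        # Gaussian f plus normalization C(x), realized as a softmax along dimension j
        attn = F.softmax(q @ k, dim=-1)                # (B, HW, HW)
        y = (attn @ v).permute(0, 2, 1).reshape(b, c // 2, h, w)
        return x + self.recover(y)                     # residual connection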
2.2. Boundary-Aware Box Prediction Block (BABP-Block)
The traditional box prediction is implemented by estimating the bounding box's center offset $(\delta_x, \delta_y)$ and the corresponding width-height offset $(\delta_w, \delta_h)$ with respect to its ground truth (GT) to optimize the network parameters, as shown in Figure 9a. Yet, this paradigm is not very suitable for SAR ships, for the following two reasons.
On the one hand, as shown in Figure 10, SAR ships often exhibit a huge scale difference due to a huge resolution difference, e.g., 1 m resolution for TerraSAR-X [55] and 20 m resolution for Sentinel-1 [56]. This situation is called the cross-scale effect [39], e.g., the extremely small ships in Figure 10c vs. the extremely large ships in Figure 10d. For example, the smallest ship in SSDD has only 28 pixels while the largest one has 62,878 pixels [57], so the scale ratio reaches 62,878/28 ≈ 2245. For commonly used two-stage models, presetting a series of prior anchors is required for RPN. Yet, no matter how the prior anchors are set, it is difficult to cover such a dataset with a large scale difference. Since the proportion of small ships in the dataset is higher than that of large ships, the size of the prior anchor is always closer to the small ships, leaving a long spatial distance to the large ships. This makes the anchor adjustment for large ships rather difficult under the traditional scheme shown in Figure 9a, because it is time-consuming to adjust a small anchor to a large GT box, imposing a great burden on network training. As a result, the positioning accuracy of large ships becomes poor.
On the other hand, it is rather challenging to locate the center of an SAR ship. Generally, different parts of a ship's hull have different materials, resulting in differential radar electromagnetic scattering (i.e., radar cross section, RCS [58]). This makes the pixel brightness distribution of the ship in an SAR image extremely uneven. In many cases, the strong scattering points of the ship are not at the geometric center of the hull but at the bow or stern. This phenomenon may directly lead to the failure of center-based positioning.
For the above reasons, we abandon the traditional scheme in Figure 9a and adopt the boundary learning scheme in Figure 9b to implement the box prediction. We design a boundary-aware box prediction block (BABP-Block) to reach this goal, inspired by the grid ideas of Wang et al. [59] and Lu et al. [60]. From Figure 9b, BABP-Block consists of two basic steps. (i) The first is to predict the coarse boundary of a ship, marked by yellow dotted lines, in the x-left, x-right, y-top, and y-down directions (i.e., the four yellow activated grids). (ii) The second is to adjust the box finely from the boundary box to the GT box. This step is the same as the traditional scheme, but it is obviously much easier to adjust the resulting coarse boundary box to the GT box so as to achieve the final finer box, because the distance to be adjusted is greatly reduced. Such a coarse-to-fine prediction scheme divides the task into two stages where each stage is responsible for its own sub-task, resulting in a dual supervision of training and enabling better box prediction. Once the box prediction becomes more accurate, the mask prediction becomes more accurate as well. BABP-Block offers four main design concepts, i.e., (1) boundary-aware feature extraction (BAFE), (2) boundary bucketing coarse localization (BBCL), (3) boundary regression fine localization (BRFL), and (4) boundary-guided classification rescoring (BGCR). Its workflow is depicted in Figure 11. The input is the GCIM-Block's output feature maps $F_{GCIM}$.
2.2.1. Boundary-Aware Feature Extraction (BAFE)
The traditional feature extraction is implemented across the entire 2D space without distinguishing directions, i.e., the four boundary directions including x-left, x-right, y-top, and y-down. As a result, important boundary-sensitive features are not extracted. Thus, BAFE is arranged to solve this problem so as to ensure the subsequent boundary localization accuracy.
Figure 12 shows the implementation of BAFE. BAFE contains two parallel branches, i.e., the x-boundary feature extraction and the y-boundary feature extraction. Here, we take the x-boundary feature extraction as an example to introduce the details; the same reasoning applies to the y-boundary feature extraction. First, we use a convolutional block attention module (CBAM) [61] to better capture the direction-specific information of the ROI region. Then, a 1 × 1 conv with a softmax activation is used to normalize the attention map, which is weighted onto the raw feature maps by a matrix element-wise multiplication. Afterwards, we sum the features along the y-direction and use an asymmetric conv to achieve the features along the x-direction, $F_x$. The above can be described by

$$F_x = \mathrm{conv}_{asym}\left(\sum_{y}\left(\mathrm{softmax}\left(A_x\right) \otimes F\right)\right) \tag{6}$$

where $A_x$ denotes the attention map of the x-boundary. Finally, $F_x$ is split into two subsets evenly, i.e., $F_{x\text{-}right}$ and $F_{x\text{-}left}$, to represent the features of the right and left boundaries.
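The x-branch can be sketched in PyTorch as below. We stand in for CBAM with an identity placeholder (a CBAM sketch follows in the next part), and the 1D kernel size of the asymmetric conv and the spatial left/right split convention are our assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class XBoundaryBranch(nn.Module):
    """x-boundary feature extraction of BAFE (sketch)."""
    def __init__(self, c=256, attention=None):
        super().__init__()
        self.cbam = attention or nn.Identity()     # CBAM in the paper; placeholder here
        self.attn = nn.Conv2d(c, 1, 1)             # 1x1 conv producing the attention map A_x
        self.asym = nn.Conv1d(c, c, 3, padding=1)  # asymmetric (1D) conv, size assumed

    def forward(self, f):                          # f: (B, C, H, W), H = W = 14
        f = self.cbam(f)
        a = self.attn(f)                           # (B, 1, H, W)
        a = F.softmax(a.flatten(2), dim=-1).view_as(a)  # softmax-normalized A_x
        fx = (f * a).sum(dim=2)                    # weight, then sum along y -> (B, C, W)
        fx = self.asym(fx)                         # features along the x-direction F_x
        f_left, f_right = fx.split(fx.size(-1) // 2, dim=-1)  # even split per boundary
        return f_right, f_left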
Figure 13 shows the implementation of CBAM. Let its input be $F \in \mathbb{R}^{H \times W \times C}$, where H and W are the height and width of the feature maps and C is the channel number. The channel attention is responsible for generating a channel-dimension weight matrix $M_c \in \mathbb{R}^{1 \times 1 \times C}$ to measure the importance levels of the C channels; the space attention is responsible for generating a space-dimension weight matrix $M_s \in \mathbb{R}^{H \times W \times 1}$ to measure the importance levels of the spatial elements across the entire H × W space. Both range from 0 to 1 via a sigmoid activation, which enriches the nonlinearity of neural networks for better performance, as suggested by [61]. The result of the channel attention is denoted by $F' = M_c(F) \otimes F$, and the result of the space attention is denoted by $F'' = M_s(F') \otimes F'$. It should be noted that here the space attention is executed following the channel attention; it is also feasible to change their order.

For the channel attention, a max-pooling (MaxPool) is used to capture the local response, and an average-pooling (AvgPool) is used to capture the global response. A multi-layer perceptron (MLP) is used to refine them for better fusion between the local and global responses. Finally, the results are normalized by a sigmoid function to obtain $M_c$. The above is described by

$$M_c(F) = \sigma\left(\mathrm{MLP}\left(\mathrm{MaxPool}(F)\right) + \mathrm{MLP}\left(\mathrm{AvgPool}(F)\right)\right) \tag{7}$$

where $\sigma$ denotes the sigmoid activation defined by $\sigma(x) = 1/(1 + e^{-x})$.

For the space attention, MaxPool and AvgPool are also used; differently, they both operate on the channel dimension to achieve 2D feature maps. Their results are concatenated directly and convolved by a common conv layer, producing the 2D spatial attention map. Finally, the results are normalized by a sigmoid activation to obtain $M_s$. The above is described by

$$M_s(F) = \sigma\left(f^{7 \times 7}\left(\left[\mathrm{MaxPool}(F); \mathrm{AvgPool}(F)\right]\right)\right) \tag{8}$$

where $f^{7 \times 7}$ is a 7 × 7 conv recommended by the original report [61].
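Since $M_c$ and $M_s$ are fully specified above, CBAM admits a direct PyTorch sketch (the MLP reduction ratio of 16 follows the original report [61]):

import torch
import torch.nn as nn


class CBAM(nn.Module):
    """Convolutional block attention module: channel then spatial attention (sketch)."""
    def __init__(self, c, reduction=16):         # reduction ratio from [61]
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(c, c // reduction, 1), nn.ReLU(),
                                 nn.Conv2d(c // reduction, c, 1))
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)  # the recommended 7x7 conv

    def forward(self, f):                         # f: (B, C, H, W)
        # channel attention: M_c = sigmoid(MLP(MaxPool(F)) + MLP(AvgPool(F)))
        mc = torch.sigmoid(self.mlp(f.amax(dim=(2, 3), keepdim=True)) +
                           self.mlp(f.mean(dim=(2, 3), keepdim=True)))
        f = mc * f                                # F' = M_c(F) * F
        # space attention: M_s = sigmoid(conv7x7([MaxPool(F'); AvgPool(F')]))
        ms = torch.sigmoid(self.spatial(torch.cat(
            [f.amax(dim=1, keepdim=True), f.mean(dim=1, keepdim=True)], dim=1)))
        return ms * f                             # F'' = M_s(F') * F'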
2.2.2. Boundary Bucketing Coarse Localization (BBCL)
After the boundary-sensitive features are achieved by the previous BAFE stage, we follow the bucketing idea [62] to predict the box boundary, referred to as BBCL. The specific implementation scheme is consistent with Wang et al. [59]. This scheme divides the target space into multiple buckets, also called discrete grid cells [60]. The coarse boundary localization is completed by searching for the correct bucket, i.e., the one in which the boundary resides.
Figure 14 shows the implementation of BBCL. The candidate region is divided into 2k buckets in both the x-direction and the y-direction, with k buckets corresponding to each boundary. Here, k is equal to 14 because the feature map's size is 14 × 14. From Figure 14, we adopt a fully-connected (FC) layer to serve as a binary classifier to predict whether the boundary is located in, or is the closest to, each bucket on each side, based on the ship boundary-aware features $F_{x\text{-}left}$, $F_{x\text{-}right}$, $F_{y\text{-}top}$, and $F_{y\text{-}down}$. We thus achieve the boundary probabilities of the four sides, denoted by $p_{left}$, $p_{right}$, $p_{top}$, and $p_{down}$. It should be noted that these four boundary probabilities will be utilized for the final boundary-guided classification rescoring, which will be introduced in Section 2.2.4. Afterwards, the maximum activation value is projected back into the raw feature maps to achieve the corresponding index value. Finally, the four boundary positions are obtained, i.e., x-left, x-right, y-top, and y-down. In this way, the coarse boundary of a ship is predicted successfully.
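A minimal sketch of the bucketing classifier for one side (here the left boundary) is given below; the flattened feature dimension and the index-to-coordinate projection are our assumptions for illustration.

import torch
import torch.nn as nn


class BucketingSide(nn.Module):
    """Coarse localization of one boundary via k-bucket binary classification (sketch)."""
    def __init__(self, c=256, k=14, side_len=7):
        super().__init__()
        self.fc = nn.Linear(c * side_len, k)  # FC binary classifier over k buckets

    def forward(self, f_side, x0, bucket_w):
        """f_side: (B, C, side_len) boundary-aware features, e.g., F_x-left;
        x0 / bucket_w: origin and bucket width of the candidate region."""
        p = torch.sigmoid(self.fc(f_side.flatten(1)))   # per-bucket boundary probability
        idx = p.argmax(dim=1)                           # bucket in which the boundary resides
        boundary = x0 + (idx.float() + 0.5) * bucket_w  # project the index back to coordinates
        return boundary, p.amax(dim=1)                  # coarse position and reliability p_left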
2.2.3. Boundary Regression Fine Localization (BRFL)
After the coarse boundary of a ship is obtained, we need to finely adjust the box close to the GT box in order to eliminate the boundary effects of buckets, as shown in
Figure 15. This process is the same as the traditional bounding box regression scheme. Specifically, we adopt a 4-way FC layer to complete this task, i.e., the center point correction and width-height adjustment. Since this process is operated in the predicted boundary box, the distance between the initial box and the GT box becomes smaller. Consequently, such a regression task will become more relaxing to deal with the cross-scale effect, because the positioning difficulty is shared by the previous boundary prediction stage.
Up to this point, we have achieved the bounding box regression results via the regression branch in Figure 11.
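As a sketch, the fine localization is simply a 4-way FC head whose offsets follow the traditional (dx, dy, dw, dh) parameterization; the input feature dimension (1024) is an assumption.

import torch
import torch.nn as nn

fine_reg = nn.Linear(1024, 4)  # 4-way FC layer: (dx, dy, dw, dh)

def apply_deltas(box, deltas):
    """Refine the coarse boundary box; box: (B, 4) as (x1, y1, x2, y2)."""
    w, h = box[:, 2] - box[:, 0], box[:, 3] - box[:, 1]
    cx, cy = box[:, 0] + 0.5 * w, box[:, 1] + 0.5 * h
    cx, cy = cx + deltas[:, 0] * w, cy + deltas[:, 1] * h  # center-point correction
    w, h = w * deltas[:, 2].exp(), h * deltas[:, 3].exp()  # width-height adjustment
    return torch.stack([cx - 0.5 * w, cy - 0.5 * h,
                        cx + 0.5 * w, cy + 0.5 * h], dim=1)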
2.2.4. Boundary-Guided Classification Rescoring (BGCR)
BBCL offers the localization reliability of the predicted boundary box, that is, the boundary probabilities of the four sides $p_{left}$, $p_{right}$, $p_{top}$, and $p_{down}$, as previously introduced. The idea of rescoring also appears in FCOS [63], where the final classification score is computed from the predicted center-ness score and the raw classification score together. It is a direct intuition that fully leveraging both should be conducive to keeping the optimum box with both high classification confidence and accurate localization. Thus, we arrange a boundary-guided classification rescoring (BGCR) strategy to reach this aim, which is described by

$$s^{*} = \alpha \cdot s + \beta \cdot \frac{1}{4}\left(p_{left} + p_{right} + p_{top} + p_{down}\right) \tag{9}$$

where s denotes the original confidence score of the classification network (i.e., the two FC layers in Figure 11), $s^{*}$ denotes the final confidence score, $\alpha$ denotes the weight coefficient of the original confidence score, and $\beta$ denotes the weight coefficient of the localization reliability. In our work, $\alpha$ and $\beta$ are both set to 0.5 considering the trade-off between the spatial localization reliability and the classification reliability [64,65]. Here, in terms of the total localization reliability, we directly average the four sides' boundary probabilities because they appear to be equally important. Finally, the resulting score $s^{*}$ is inputted to the non-maximum suppression (NMS) algorithm [48] to remove repeated detections.
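The rescoring reduces to a weighted average; as a one-function sketch:

import torch

def bgcr_rescore(s, p_left, p_right, p_top, p_down, alpha=0.5, beta=0.5):
    """Fuse the classification score s with the averaged boundary probabilities."""
    p_mean = (p_left + p_right + p_top + p_down) / 4.0
    return alpha * s + beta * p_mean  # s*, later fed into NMS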