*Article* **Quad-FPN: A Novel Quad Feature Pyramid Network for SAR Ship Detection**

**Tianwen Zhang, Xiaoling Zhang \* and Xiao Ke**

School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China; twzhang@std.uestc.edu.cn (T.Z.); xke@std.uestc.edu.cn (X.K.) **\*** Correspondence: xlzhang@uestc.edu.cn

**Abstract:** Ship detection from synthetic aperture radar (SAR) imagery is a fundamental and significant marine mission. It plays an important role in marine traffic control, marine fishery management, and marine rescue. Nevertheless, some challenges still hinder accuracy improvements in SAR ship detection, e.g., complex background interferences, multi-scale ship feature differences, and indistinctive small ship features. Therefore, to address these problems, a novel quad feature pyramid network (Quad-FPN) is proposed for SAR ship detection in this paper. Quad-FPN consists of four unique FPNs, i.e., a DEformable COnvolutional FPN (DE-CO-FPN), a Content-Aware Feature Reassembly FPN (CA-FR-FPN), a Path Aggregation Space Attention FPN (PA-SA-FPN), and a Balance Scale Global Attention FPN (BS-GA-FPN). To confirm the effectiveness of each FPN, extensive ablation studies are conducted. We conduct experiments on five open SAR ship detection datasets, i.e., the SAR ship detection dataset (SSDD), Gaofen-SSDD, Sentinel-SSDD, SAR-Ship-Dataset, and the high-resolution SAR images dataset (HRSID). Qualitative and quantitative experimental results jointly reveal Quad-FPN's superior SAR ship detection performance over 12 other competitive state-of-the-art convolutional neural network (CNN)-based SAR ship detectors. To confirm the excellent migration application capability of Quad-FPN, ship detection is conducted on two other large-scene Sentinel-1 SAR images. The satisfactory detection results indicate the practical application value of Quad-FPN in marine surveillance.

**Keywords:** synthetic aperture radar (SAR); ship detection; convolutional neural network (CNN); deep learning (DL); feature pyramid network (FPN); quad feature pyramid network (Quad-FPN)

#### **1. Introduction**

Synthetic aperture radar (SAR) is an advanced active microwave sensor for high-resolution remote sensing observation of the Earth [1]. Its all-day and all-weather working capacity makes it play an important role in marine surveillance [2]. As a fundamental marine mission, SAR ship detection is of great value in marine traffic control, fishery management, and emergent salvage at sea [3,4]. Thus, up to now, the topic of SAR ship detection has received continuous attention from an increasing number of scholars [5–15].

In earlier years, the standard solution was to design ship features manually, e.g., constant false alarm rate (CFAR) [1], saliency [2], super-pixel [3], and transformation [4]. Yet, these traditional methods are algorithmically complex, cumbersome to design by hand, and weak in generalization, limiting their migration applications. Moreover, they often use limited ship images for theoretical analysis to define ship features, but these features cannot reflect the characteristics of ships with various sizes under different backgrounds. This causes their poor multi-scale and multi-scene detection performance.

Fortunately, in recent years, with the rise of deep learning (DL) and convolutional neural networks (CNNs), state-of-the-art DL-based/CNN-based SAR ship detectors have helped solve the above-mentioned problems, to some degree. Compared with traditional methods, CNN-based ones have significant advantages, i.e., simplicity, high efficiency, and high accuracy, because they can enable computational models with multiple
processing layers to learn data representations with multiple levels of abstraction. This can effectively improve detection accuracy. Thus, nowadays, many scholars [5–15] in the SAR ship detection community are paying close attention to CNN-based methods.

For instance, based on Fast R-CNN [16], Li et al. [9] proposed a binarized normed gradient-based method to extract SAR ship-like regions. Based on Faster R-CNN [17], Lin et al. [14] designed a squeeze-and-excitation rank mechanism to improve detection performance. Based on you only look once (YOLO) [18], Zhang et al. [10] integrated the multi-scale mechanism, concatenation mechanism, and anchor box mechanism for small ship detection. Based on RetinaNet [19], Yang et al. [11] tried to suppress false alarms in ship detection by loss weighting. Based on the single shot multi-box detector (SSD) [20], Wang et al. [7] proposed an optimized version to enhance small ship detection while improving detection speed. Based on Cascade R-CNN [21], Wei et al. [12] designed a robust SAR ship detector named HR-SDNet for multi-level ship feature extraction.

Since the feature pyramid network (FPN) was proposed by Lin et al. [22], it has been a standard solution for multi-scale SAR ship detection. Owing to different resolutions, incident angles, satellites, etc., SAR ships possess various sizes. FPN can detect ships of different sizes at different resolution levels based on more reasonable semantic features from backbone networks, enabling better detection performance. Thus, it has received a wide range of attention, e.g., Wei et al. [12] optimized its structure to present a high-resolution FPN for better multi-scale detection. Cui et al. [13] adopted a convolutional block attention module to improve its performance. Lin et al. [14] added a squeeze-and-excitation module at the top of FPN to activate important features. Zhao et al. [15] designed an attention receptive pyramid network to detect ships of various sizes under complex backgrounds.

However, SAR ship detection is still a challenging issue due to complex background interferences (e.g., port facilities, sea clutters, and volatile sea states), multi-scale ship feature differences, and indistinctive small ship features. Thus, this paper proposes a novel quad feature pyramid network (Quad-FPN) for SAR ship detection. Figure 1 shows Quad-FPN's structure. From Figure 1, four FPNs constitute it, i.e., a DEformable COnvolutional FPN (DE-CO-FPN), a Content-Aware Feature Reassembly FPN (CA-FR-FPN), a Path Aggregation Space Attention FPN (PA-SA-FPN), and a Balance Scale Global Attention FPN (BS-GA-FPN). Their implementations form a pipeline that enhances detection performance progressively. We conduct extensive ablation studies to confirm each FPN's effectiveness. Experimental results on five open SAR ship detection datasets (i.e., SSDD [5], Gaofen-SSDD [6], Sentinel-SSDD [6], SAR-Ship-Dataset [7], and HRSID [8]) reveal that Quad-FPN offers superior detection accuracy compared with 12 other competitive state-of-the-art CNN-based SAR ship detectors. Finally, we also perform ship detection on two other large-scene SAR images from the Sentinel-1 satellite. The satisfactory detection results confirm the excellent migration application capability of Quad-FPN. The software is available online on our website [23].

**Figure 1.** Pipeline structure of Quad-FPN.

The main contributions of this paper are as follows:


The rest of this paper is arranged as follows. Section 2 introduces Quad-FPN. Section 3 introduces our experiments. Results are shown in Section 4. Ablation studies are presented in Section 5. Finally, a summary of this paper is made in Section 6.

#### **2. Quad-FPN**

Quad-FPN is built on the classical Faster R-CNN [17] and FPN [22], which are both important solutions for mainstream detection tasks. Figure 2 shows Quad-FPN's overview. Four basic FPNs, i.e., DE-CO-FPN, CA-FR-FPN, PA-SA-FPN, and BS-GA-FPN, constitute its network architecture. Their implementation presents a pipeline that improves SAR ship detection performance progressively.

**Figure 2.** Network architecture of Quad-FPN. (**a**) DE-CO-FPN; (**b**) CA-FR-FPN; (**c**) PA-SA-FPN; and (**d**) BS-GA-FPN.

The overall design idea of Quad-FPN is as follows.


#### *2.1. DEformable COnvolutional FPN (DE-CO-FPN)*

The core idea of DE-CO-FPN is that we use the deformable convolution [24] to extract ship features. The extracted features contain more useful ship shape information while alleviating complex background interferences. Previous work [5–15] mostly adopted standard or dilated convolutions [25] to extract features. However, both have limited geometric modeling ability due to their regular kernels, so their ability to extract the shape features of multi-scale ships is bound to be poor, causing poor multi-scale detection performance. For inshore ships, the standard and dilated convolutions cannot restrain interferences of port facilities; for ships parked side by side at ports, they also cannot eliminate interferences from nearby ship hulls. Thus, to solve this problem, the deformable convolution is used to establish DE-CO-FPN. Figure 3 shows their intuitive comparison. From Figure 3, it is obvious that the deformable convolution can extract ship shape features more effectively; it can suppress the interference of complex backgrounds, especially for more complex inshore scenes. Finally, ships are likely to be separated successfully from complex backgrounds. Thus, this deformable convolution process can be regarded as an extraction of salient objects in various scenes, which plays the role of spatial attention.

**Figure 3.** Different convolutions. (**a**) Standard convolution; (**b**) dilated convolution; and (**c**) deformable convolution.

In the deformable convolution, the standard convolution kernel is augmented with offsets $\Delta\mathbf{p}_n$ that are adaptively learned during training to model targets' shape features, i.e.,

$$\mathbf{y}(\mathbf{p}_0) = \sum_{\mathbf{p}_n \in \mathcal{R}} \mathbf{w}(\mathbf{p}_n) \times \mathbf{x}(\mathbf{p}_0 + \mathbf{p}_n + \Delta\mathbf{p}_n) \tag{1}$$

where $\mathbf{p}_0$ denotes each location, $\mathcal{R}$ denotes the convolution region, $\mathbf{w}$ denotes the weight parameters, $\mathbf{x}$ denotes the input, $\mathbf{y}$ denotes the output, and $\Delta\mathbf{p}_n$ denotes the learned offset at the *n*-th location. It should be noted that, compared with standard convolutions, training deformable ones is in fact more time-consuming and needs more GPU memory, because the learned offsets add extra network parameters and increase network complexity; fitting these offsets reasonably takes time. Yet, in this paper, to obtain better accuracy on ships with various shapes, we have not studied this issue deeply for the time being. This problem will be considered with due attention in our future work.

In Equation (1), $\Delta\mathbf{p}_n$ is typically fractional. Thus, we use bilinear interpolation to ensure the smooth implementation of convolutions, i.e.,

$$\mathbf{x}(\mathbf{p}) = \sum_{\mathbf{q}} G(\mathbf{q}, \mathbf{p}) \times \mathbf{x}(\mathbf{q}) \tag{2}$$

where **p** denotes the fractional location to be interpolated, **q** enumerates all integral spatial locations in the feature map **x**, and *G*(·,·) denotes the bilinear interpolation kernel defined by

$$G(\mathbf{q}, \mathbf{p}) = g(q_x, p_x) \times g(q_y, p_y), \text{ where } g(a, b) = \max(0, 1 - |a - b|) \tag{3}$$

In experiments, we add another convolution layer to learn the offsets $\Delta\mathbf{p}_n$. Then, the standard convolution combined with $\Delta\mathbf{p}_n$ is performed on the input feature maps. Finally, ship features with rich shape information ($A_1$, $A_2$, $A_3$, $A_4$, and $A_5$ in Figure 2a) are transferred to the subsequent FPNs for further operations.
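As a concrete illustration, the following minimal PyTorch sketch pairs an offset-learning convolution with torchvision's `DeformConv2d`, which internally performs the bilinear sampling of Equations (2) and (3). It is a sketch only; the 3 × 3 kernel size and channel widths are our assumptions, not the exact training code.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformConvBlock(nn.Module):
    """Sketch of the deformable convolution in DE-CO-FPN (Equation (1))."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        # one extra convolution layer learns 2 offsets (dy, dx) per kernel location
        self.offset = nn.Conv2d(c_in, 2 * k * k, kernel_size=k, padding=k // 2)
        # DeformConv2d samples x at p_0 + p_n + Delta p_n via bilinear interpolation
        self.deform = DeformConv2d(c_in, c_out, kernel_size=k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x))

# usage: a 3x3 deformable convolution on a 256-channel feature map
feat = torch.randn(1, 256, 64, 64)
out = DeformConvBlock(256, 256)(feat)   # -> (1, 256, 64, 64)
```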

#### *2.2. Content-Aware Feature Reassembly FPN (CA-FR-FPN)*

The core idea of CA-FR-FPN is that we design a CA-FR-Module (marked by a circle in Figure 2b) to enhance feature transmission benefits when performing the up-sampling multi-level feature fusion. Previous work [5–15] added a feature fusion branch from top to bottom via feature up-sampling. This up-sampling is often completed by nearest-neighbor or bilinear interpolation, but both merely consider sub-pixel neighborhoods and thus cannot effectively capture the rich semantic information required by dense detection tasks [26], especially for densely distributed small ships. That is, features of small ships are easily diluted because of their poor conspicuousness, leading to feature loss. Thus, to solve this problem, we propose a CA-FR-Module in the top-to-bottom up-sampling feature fusion branch to achieve a feature reassembly. It can be aware of important contents in feature maps and attach importance to key small ship features, thereby improving feature transmission benefits. Figure 2b shows the network architecture of CA-FR-FPN. From Figure 2b, for the five scale levels ($B_1$, $B_2$, $B_3$, $B_4$, and $B_5$), four CA-FR-Modules are used for feature reassembly. In practice, the CA-FR-Module essentially performs a task similar to the 2× up-sampling operation. Figure 4 shows the implementation process of CA-FR-Module. From Figure 4, there are two basic steps: (1) kernel prediction, and (2) content-aware feature reassembly.

**Figure 4.** Implementation process of CA-FR-Module in CA-FR-FPN. (**a**) Kernel prediction; (**b**) content-aware feature reassembly.

#### **Step 1: Kernel Prediction**

Figure 4a shows the implementation process of the kernel prediction. In Figure 4, the dimension of the feature maps **F** is *L* × *L* × *C*, where *L* denotes the size and *C* denotes the channel width. Overall, the kernel prediction (denoted by *ψ*) is responsible for generating an adaptive feature reassembly kernel $\mathbf{W}_l$ at each original location *l*, according to the *k* × *k* neighbors of the feature maps $\mathbf{F}_l$, in a content-aware manner, i.e.,

$$\mathbf{W}_l = \psi(N(\mathbf{F}_l, k)) \tag{4}$$

where $N(\mathbf{F}_l, k)$ denotes the *k* × *k* neighbors of $\mathbf{F}_l$ and $\mathbf{W}_l$ denotes the reassembly kernel.

To enhance the content-aware benefits of the kernel prediction, we first design a convolution layer to amplify the input feature maps **F** by α times (from *C* to α·*C* channels). This convolution layer's kernel number is set to α·*C*, where α is an experimental hyperparameter studied in Section 5.2.2. Then, we adopt another convolution layer to encode the content of the input features so as to obtain the reassembly kernels. Here, we set the kernel width to 2² × *k* × *k*, where 2 comes from the requirement of the 2× up-sampling operation (the purpose is to enlarge the size of the feature maps to 2*L*), and *k* × *k* comes from the *k* × *k* neighbors of the feature maps $\mathbf{F}_l$. Afterwards, the content-encoded features are reshaped to a 2*L* × 2*L* × (*k* × *k*) dimension via the pixel shuffle operation [27]. Finally, each reassembly kernel is normalized by a soft-max function spatially to reflect the weight of each sub-content.

In summary, the above operations can be described by:

$$\mathbf{W}_l = \mathrm{softmax}\{\mathrm{shuffle}[f_{encode}(f_{amplify}(\mathbf{F}_l))]\} \tag{5}$$

where $f_{amplify}$ denotes the feature amplification operation, $f_{encode}$ denotes the content encoding operation, $\mathrm{shuffle}$ denotes the pixel shuffle operation, $\mathrm{softmax}$ denotes the soft-max function defined by $e^{X_i} / \sum_j e^{X_j}$, and $\mathbf{W}_l$ denotes the generated reassembly kernel.

#### **Step 2: Content-Aware Feature Reassembly**

Figure 4b shows the implementation process of the content-aware feature reassembly. Overall, the content-aware feature reassembly (denoted by *φ*) is responsible for generating the final up-sampled feature maps $\mathbf{F}'_{l'}$, i.e.,

$$\mathbf{F}'_{l'} = \phi(N(\mathbf{F}_l, k), \mathbf{W}_l) \tag{6}$$

where *k* denotes the *k* × *k* neighbors and $\mathbf{W}_l$ denotes the reassembly kernel in Equation (4), which corresponds to the location $l'$ of the feature maps after up-sampling from the original location *l*. For each reassembly kernel $\mathbf{W}_l$, this step reassembles the features within a local region via the function *φ* in Equation (6). Similar to the standard convolution, *φ* can be implemented as a weighted sum. Thus, for a target location $l'$ and the corresponding square region $N(\mathbf{F}_l, k)$ centered at $l = (i, j)$, the reassembly output is described by

$$\mathbf{F}'_{l'} = \sum_{n=-r}^{r} \sum_{m=-r}^{r} \mathbf{W}_l(n, m) \times \mathbf{F}_{(i+n,\, j+m)}, \quad r = \lfloor k/2 \rfloor \tag{7}$$

where the two sums run over the corresponding square region $N(\mathbf{F}_l, k)$. Moreover, *k* is set to 5 in our work, the optimal value reported in [26].

With the reassembly kernel $\mathbf{W}_l$, each pixel in the region around the original location *l* contributes to the up-sampled pixel $l'$ differently, based on the content of features rather than the location distance. Semantic features from the pyramid top are thus transferred to the bottom with better transmission benefits. Finally, the pyramid top's features are fused into the bottom to enhance the feature expression ability of small ships.
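For clarity, the following PyTorch sketch assembles the two steps above into a CARAFE-style [26] 2× up-sampler. It is a minimal illustration under the paper's settings (*k* = 5; *α* studied in Section 5.2.2), not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAFRModule(nn.Module):
    """Sketch of the CA-FR-Module: kernel prediction + content-aware reassembly."""
    def __init__(self, c, alpha=8, k=5, scale=2):
        super().__init__()
        self.k, self.scale = k, scale
        self.amplify = nn.Conv2d(c, alpha * c, kernel_size=1)               # f_amplify
        self.encode = nn.Conv2d(alpha * c, scale**2 * k * k, 3, padding=1)  # f_encode

    def forward(self, x):                          # x: (N, C, L, L)
        n, c, h, w = x.shape
        s, k = self.scale, self.k
        # Step 1: kernel prediction (Equation (5))
        kernels = self.encode(self.amplify(x))     # (N, s^2*k*k, L, L)
        kernels = F.pixel_shuffle(kernels, s)      # reshape to (N, k*k, sL, sL)
        kernels = F.softmax(kernels, dim=1)        # spatially normalize each k*k kernel
        # Step 2: reassembly (Equation (7)): weighted sum over k*k neighborhoods
        patches = F.unfold(x, k, padding=k // 2)   # (N, C*k*k, L*L)
        patches = patches.view(n, c * k * k, h, w)
        patches = F.interpolate(patches, scale_factor=s, mode="nearest")  # map l' -> l
        patches = patches.view(n, c, k * k, s * h, s * w)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)  # F'_(l'): (N, C, sL, sL)
```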

#### *2.3. Path Aggregation Space Attention FPN (PA-SA-FPN)*

The core idea of PA-SA-FPN is that we add an extra path aggregation branch with a space attention module (PA-SA-Module, marked by a circle in Figure 2c) from the pyramid bottom to the top. Previous work [5–15] often transmitted high-level strong semantic features to the bottom to improve the expressiveness of the whole pyramid. Yet, the low-level location information from the pyramid bottom was not transmitted to the top. This can lead to inaccurate positioning of large ship bounding boxes, reducing the detection performance of large ships. Thus, we add an extra path aggregation branch (bottom-to-top) to handle this problem. Moreover, to further improve path aggregation benefits, we design a PA-SA-Module to concentrate on important spatial information and avoid interference from complex port facilities. Figure 2c shows PA-SA-FPN's architecture. From Figure 2c, the location information of the pyramid bottom is transmitted to the top ($C_1 \to C_2 \to C_3 \to C_4 \to C_5$) by feature down-sampling. In this way, the top semantic features are enriched with more ship spatial information, which improves the feature expression ability of large ships. Moreover, before the down-sampling, the low-level feature maps are refined by a PA-SA-Module to improve path aggregation benefits [28].

Figure 5 shows the implementation process of PA-SA-Module. In Figure 5, the input feature maps are denoted by **Q** and the output ones by $\mathbf{Q}'$. First, a global average pooling (GAP) [29] is used to obtain the average response in space, and a global max pooling (GMP) [29] is used to obtain the maximum response in space. Then, their results are concatenated as the synthetic feature maps, denoted by **S**. Unlike the previous convolutional block attention module [28], we design a space encoder $f_{space\text{-}encode}$ to encode the space information and represent the spatial correlation. This can improve spatial attention gains because features in the coding space are more concentrated. Then, the output of $f_{space\text{-}encode}$ is activated by a sigmoid function to represent each pixel's importance level in the original space, i.e., an importance-level weight matrix $\mathbf{W}_S$. Finally, an elementwise multiplication is conducted between the original feature maps **Q** and $\mathbf{W}_S$ to obtain the output $\mathbf{Q}'$.

**Figure 5.** Implementation process of PA-SA-Module in PA-SA-FPN.

In short, the above can be described by

$$\mathbf{Q}' = \mathbf{Q} \odot \mathbf{W}_S \tag{8}$$

where **Q** denotes the input feature maps, $\mathbf{Q}'$ denotes the output feature maps, $\odot$ denotes the elementwise multiplication, and $\mathbf{W}_S$ denotes the importance-level weight matrix, i.e.,

$$\mathbf{W}_S = \mathrm{sigmoid}\{f_{space\text{-}encode}(\mathrm{GAP}(\mathbf{Q}) \text{ © } \mathrm{GMP}(\mathbf{Q}))\} \tag{9}$$

where GAP denotes the global average pooling, GMP denotes the global max pooling, $f_{space\text{-}encode}$ denotes the space encoder, © denotes the concatenation operation, and sigmoid is an activation function defined by $1/(1 + e^{-x})$.

Finally, the feature pyramid becomes stronger when it possesses both the top-to-bottom and bottom-to-top branches. Each level then has rich spatial location information and abundant semantic information, which helps improve the detection performance of large ships.
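A minimal sketch of the PA-SA-Module follows. Here, the GAP and GMP are taken along the channel axis to produce spatial response maps, and the 7 × 7 kernel of the space encoder is our assumption (the text does not state its size).

```python
import torch
import torch.nn as nn

class PASAModule(nn.Module):
    """Sketch of the PA-SA-Module (spatial attention, Equations (8)-(9))."""
    def __init__(self, kernel_size=7):
        super().__init__()
        # space encoder f_space-encode: 2 pooled maps -> 1 importance map
        self.space_encode = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, q):                           # q: (N, C, H, W)
        gap = q.mean(dim=1, keepdim=True)           # average response in space
        gmp = q.max(dim=1, keepdim=True).values     # maximum response in space
        s = torch.cat([gap, gmp], dim=1)            # synthetic feature maps S
        w_s = torch.sigmoid(self.space_encode(s))   # importance-level weights W_S
        return q * w_s                              # Q' = Q (element-wise) W_S
```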

#### *2.4. Balance Scale Global Attention FPN (BS-GA-FPN)*

The core idea of BS-GA-FPN is that we further refine the features of each level in the pyramid to address the feature level imbalance of different-scale ships. SAR ships often present different characteristics at different pyramid levels, i.e., multi-scale ship feature differences exist. Due to differences in resolution, satellite shooting distance, and slicing method, ships of many scales exist in SAR ship datasets. For example, in SSDD, the smallest ship is 7 × 7 pixels while the biggest is 211 × 298 pixels. Such a huge size gap results in large ship feature differences, which makes detection very difficult. In the computer vision community, Pang et al. [30] found that such feature level imbalance may weaken the feature expression capacity of FPN, but previous work [5–15] in the SAR ship detection community was not aware of this problem. Thus, to handle this problem, we design a BS-GA-Module to further process pyramid features to recover a balanced BS-GA-FPN. The implementation of the BS-GA-Module consists of four steps: (1) feature pyramid resizing, (2) balanced multi-scale feature fusion, (3) global attention (GA) refinement, and (4) feature pyramid recovery, as in Figure 6.

**Figure 6.** Implementation process of BS-GA-Module. (**a**) Feature pyramid resizing; (**b**) balanced multi-scale feature fusion; (**c**) GA refinement; and (**d**) feature pyramid recovery.

#### **Step 1: Feature Pyramid Resizing**

Figure 6a shows the graphical description of the feature pyramid resizing. In Figure 6a, the feature maps at different levels of PA-SA-FPN are denoted by $C_1$, $C_2$, $C_3$, $C_4$, and $C_5$. To facilitate the fusion of balanced features while preserving their semantic hierarchy, we resize each detection scale to a unified resolution by max-pooling or up-sampling. Here, $C_3$ is selected as the unified resolution level because it lies in the middle of the pyramid, maintaining a trade-off between top semantic information and bottom spatial information. The above can be described by

$$H_1 = \mathrm{MaxPool}^{4\times}(C_1),\ H_2 = \mathrm{MaxPool}^{2\times}(C_2),\ H_3 = C_3,\ H_4 = \mathrm{UpSampling}^{2\times}(C_4),\ H_5 = \mathrm{UpSampling}^{4\times}(C_5) \tag{10}$$

where $H_1$, $H_2$, $H_3$, $H_4$, and $H_5$ are the resized feature maps from the original ones, $\mathrm{UpSampling}^{n\times}$ denotes *n*-times up-sampling, and $\mathrm{MaxPool}^{n\times}$ denotes *n*-times max-pooling.

#### **Step 2: Balanced Multi-Scale Feature Fusion**

Figure 6b shows the graphical description of the balanced multi-scale feature fusion. After obtaining feature maps with the same unified resolution, the balanced multi-scale feature fusion is executed by

$$\mathbf{I}(i,j) = \frac{1}{5} \sum_{k=1}^{5} H_k(i,j) \tag{11}$$

where *k* denotes the *k*-th detection level, (*i, j*) denotes the spatial location of feature maps, and **I** denotes the output integrated features. From Equation (11), the features from each scale (*H*1, *H*2, *H*3, *H*4, and *H*5) are uniformly fused as the output **I** (a mean operation). Here, the average operation fully reflects the balanced idea of SAR ship scale feature fusion.

Finally, the output **I** with condensed multi-scale information will contain balanced semantic features of various resolutions. In this way, big ship features and small ones can complement each other to facilitate the information flow.
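The resizing and fusion of Equations (10) and (11) reduce to a few lines; below is a sketch assuming nearest-neighbor up-sampling.

```python
import torch.nn.functional as F

def balanced_fusion(c1, c2, c3, c4, c5):
    """Sketch of Equations (10)-(11): resize C1-C5 to C3's resolution, then average."""
    h1 = F.max_pool2d(c1, kernel_size=4, stride=4)            # MaxPool 4x
    h2 = F.max_pool2d(c2, kernel_size=2, stride=2)            # MaxPool 2x
    h4 = F.interpolate(c4, scale_factor=2, mode="nearest")    # UpSampling 2x
    h5 = F.interpolate(c5, scale_factor=4, mode="nearest")    # UpSampling 4x
    return (h1 + h2 + c3 + h4 + h5) / 5.0                     # integrated features I
```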

#### **Step 3: GA Refinement**

To make features from different scales become more discriminative, we also propose a GA refinement mechanism to further refine balanced features in Equation (11). This can enhance their global response ability. That is, the network will pay more attention to important spatial global information (feature self-attention), as in Figure 6c.

The GA refinement can be described by

$$O_i = \frac{1}{\xi(\mathbf{I})} \sum_{\forall j} f(I_i, I_j) \times g(I_j) \tag{12}$$

where $I_i$ denotes the input at the *i*-th location, $O_i$ denotes the output at the *i*-th location, *f*(·,·) is a function used to calculate the similarity between the locations $I_i$ and $I_j$, *g*(·) is a function to characterize the feature representation at the *j*-th location, and *ξ*(·) denotes a normalization coefficient (the overall response of the input). The *i*-th location denotes the current location's response, and the *j*-th location denotes the global response.

In Equation (12), *g*(·) can be regarded as a linear embedding,

$$g(I_j) = W_g I_j \tag{13}$$

where $W_g$ is a weight matrix to be learned, and we use a 1 × 1 convolutional layer to obtain this weight matrix during training.

Furthermore, one simple extension of the Gaussian function is to compute similarity *f*(·) in an embedding space,

$$f(I_i, I_j) = e^{\theta(I_i)^T \phi(I_j)} \tag{14}$$

where $\theta(I_i) = W_\theta I_i$ and $\phi(I_j) = W_\phi I_j$ are two embeddings; $W_\theta$ and $W_\phi$ are weight matrices to be learned, obtained by two other 1 × 1 convolutional layers.

As above, the normalized coefficient *ξ*(·) is set to

$$\xi(\mathbf{I}) = \sum_{\forall j} f(I_i, I_j) \tag{15}$$

Finally, the whole GA refinement is instantiated as:

$$O_i = \sum_{\forall j} \left( \frac{e^{\theta(I_i)^T \phi(I_j)}}{\sum_{\forall j} e^{\theta(I_i)^T \phi(I_j)}} \times W_g I_j \right) \tag{16}$$

where $e^{\theta(I_i)^T \phi(I_j)} / \sum_{\forall j} e^{\theta(I_i)^T \phi(I_j)}$ can be achieved by a soft-max function.

Figure 6c shows the graphical description of the above GA refinement. From Figure 6c, two 1 × 1 convolutional layers are used to compute *θ* and *φ*. Then, by the matrix multiplication $\theta^T \phi$, the similarity *f* is obtained. Another 1 × 1 convolutional layer is used to characterize the feature representation *g*. Then, *f* followed by a soft-max function multiplies *g* to obtain the feature self-attention output $\mathbf{O} = \{O_i \mid i \in \mathbf{I}\}$. Finally, **O** is further processed by one more 1 × 1 convolutional layer (marked in a dotted box) so that it matches the dimension of the original input **I**, facilitating the follow-up element-wise addition; this is similar to the residual/skip connections of ResNet. Consequently, the refined features $\mathbf{I}'$ combining the feature self-attention information are obtained, which will be further processed in the subsequent steps, i.e.,

$$\mathbf{I}' = W_O \mathbf{O} + \mathbf{I} \tag{17}$$

where $W_O$ is also a weight matrix to be learned, and another 1 × 1 convolutional layer can be used to obtain it during training.

In essence, the GA refinement can directly capture long-range dependence of each location (global response) by calculating the interaction between two different arbitrary positions. It is equivalent to constructing a convolutional kernel with the same size as the feature map **I**, to maintain more useful ship information, making feature maps more discriminative. More detailed theories about this global attention can be found in [31].
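The GA refinement above is an embedded-Gaussian non-local block [31]; a minimal PyTorch sketch follows (the embedding width `c_inner = c // 2` is our assumption).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GARefinement(nn.Module):
    """Sketch of the GA refinement (Equations (12)-(17))."""
    def __init__(self, c, c_inner=None):
        super().__init__()
        c_inner = c_inner or max(c // 2, 1)
        self.theta = nn.Conv2d(c, c_inner, 1)      # theta(I_i) = W_theta I_i
        self.phi = nn.Conv2d(c, c_inner, 1)        # phi(I_j) = W_phi I_j
        self.g = nn.Conv2d(c, c_inner, 1)          # g(I_j) = W_g I_j (Equation (13))
        self.w_o = nn.Conv2d(c_inner, c, 1)        # W_O restores the channel width

    def forward(self, i_maps):                     # i_maps: (N, C, H, W)
        n, _, h, w = i_maps.shape
        theta = self.theta(i_maps).flatten(2).transpose(1, 2)  # (N, HW, C')
        phi = self.phi(i_maps).flatten(2)                      # (N, C', HW)
        g = self.g(i_maps).flatten(2).transpose(1, 2)          # (N, HW, C')
        attn = F.softmax(theta @ phi, dim=-1)      # f / xi: Equations (14)-(16)
        o = (attn @ g).transpose(1, 2).reshape(n, -1, h, w)    # O_i for all i
        return self.w_o(o) + i_maps                # I' = W_O O + I (Equation (17))
```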

**Step 4: Feature Pyramid Recovery**

Figure 6d shows the graphical description of the feature pyramid recovery. From Figure 6d, the refined features $\mathbf{I}'$ are resized again using the reverse procedure of Equation (10) to recover a balanced feature pyramid, i.e.,

$$D_1 = \mathrm{UpSampling}^{4\times}(\mathbf{I}'),\ D_2 = \mathrm{UpSampling}^{2\times}(\mathbf{I}'),\ D_3 = \mathbf{I}',\ D_4 = \mathrm{MaxPool}^{2\times}(\mathbf{I}'),\ D_5 = \mathrm{MaxPool}^{4\times}(\mathbf{I}') \tag{18}$$

where $D_1$, $D_2$, $D_3$, $D_4$, and $D_5$ denote the recovered feature maps at different levels after the ship scale balance operations. They constitute the final network architecture of BS-GA-FPN. Ultimately, $D_1$–$D_5$ in BS-GA-FPN possess more multi-scale balanced features, which are responsible for the final ship detection.

#### **3. Experiments**

Our experiments are run on a personal computer with an i9-9900K CPU and an RTX 2080 Ti GPU, based on PyTorch. Quad-FPN and the 12 other competitive SAR ship detectors are implemented under the MMDetection toolbox [32] to ensure the fairness of comparison.

#### *3.1. Experimental Datasets*


#### *3.2. Experimental Details*

ResNet-50 pretrained on ImageNet [33] serves as Quad-FPN's backbone network. Images in SSDD, Gaofen-SSDD, Sentinel-SSDD, SAR-Ship-Dataset, and HRSID are resized to 512 × 512, 160 × 160, 160 × 160, 256 × 256, and 800 × 800 pixels, respectively, for training. We train Quad-FPN for 12 epochs with a batch size of 2, due to the limited GPU memory. Stochastic gradient descent (SGD) [34] serves as the optimizer with a 0.1 learning rate, a 0.9 momentum, and a 0.0001 weight decay. Moreover, the learning rate is reduced by a factor of 10 from the 8th epoch to the 11th epoch to ensure an adequate loss reduction. Following Wei et al. [12], a soft non-maximum suppression (Soft-NMS) [35] algorithm is used to suppress duplicate detections with an intersection over union (IOU) threshold of 0.5.
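In MMDetection config terms, the schedule above corresponds roughly to the following fragment. This is a sketch; the step epochs `[8, 11]` are our reading of the standard 1× schedule, not a quote of the authors' configuration.

```python
# Hypothetical MMDetection-style config fragment for the training schedule above.
optimizer = dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001)
lr_config = dict(policy='step', step=[8, 11])          # reduce lr by 10x at these epochs
runner = dict(type='EpochBasedRunner', max_epochs=12)  # 12 training epochs
data = dict(samples_per_gpu=2)                         # batch size of 2
```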

#### *3.3. Loss Function*

Following Cui et al. [13], the cross entropy (CE) serves as the classification loss $L_{cls}$,

$$L_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \left[ p_i^* \log(p_i) + (1 - p_i^*) \log(1 - p_i) \right] \tag{19}$$

where $p_i$ denotes the predictive class probability, $p_i^*$ denotes the ground truth class label, and *N* denotes the number of predictions. The smooth L1 loss serves as the regression loss $L_{reg}$,

$$L_{reg} = \frac{1}{N} \sum_{i=1}^{N} p_i^* \, \mathrm{smooth}_{L1}(t_i - t_i^*), \text{ where } \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{20}$$

where $t_i$ denotes the predictive bounding box and $t_i^*$ denotes the ground truth box.

#### *3.4. Evaluation Indices*

Evaluation indices from the PASCAL dataset [5] are adopted by this paper, including the recall (*r*), precision (*p*), and mean average precision (mAP) [36], i.e.,

$$r = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN}),\ p = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP}),\ \mathrm{mAP} = \int_0^1 p(r)\, dr \tag{21}$$

where *TP* denotes the number of true positives, *FN* denotes that of false negatives, *FP* denotes that of false positives, and *p*(*r*) denotes the precision-recall curve. In this paper, mAP measures the final detection accuracy because it considers both precision and recall.

Moreover, frames per second (FPS) is used to measure the detection speed. It is defined by 1/*t*, where *t* is the time, in seconds (s), to detect one image.
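As a worked illustration of Equation (21), the sketch below computes the PASCAL-style AP for the single ship class from ranked detections. The function name and the pre-computed IOU matching flags are our own hypothetical scaffolding.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve (Equation (21)).
    scores: detection confidences; is_tp: 1 if the detection matches a
    ground truth at IOU >= 0.5, else 0; num_gt: number of ground truths."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(hits)                        # cumulative true positives
    fp = np.cumsum(1.0 - hits)                  # cumulative false positives
    recall = tp / num_gt                        # r = TP / (TP + FN)
    precision = tp / (tp + fp)                  # p = TP / (TP + FP)
    r = np.concatenate([[0.0], recall])         # integrate p(r) over recall steps
    return float(np.sum((r[1:] - r[:-1]) * precision))
```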

#### **4. Results**

#### *4.1. Quantitative Results on Five Datasets*

Tables 1–5 show the quantitative comparison with the 12 other competitive state-of-the-art CNN-based SAR ship detectors on SSDD, Gaofen-SSDD, Sentinel-SSDD, SAR-Ship-Dataset, and HRSID. From Tables 1–5, one can clearly find that:


#### *4.2. Qualitative Results on Five Datasets*

Figures 7–11 show the qualitative results on SSDD, Gaofen-SSDD, Sentinel-SSDD, SAR-Ship-Dataset, and HRSID. Here, we only compare Quad-FPN with the second-best detector, due to limited pages.

**Figure 7.** SAR ship detection results on SSDD. (**a**) Ground truths; (**b**) detection results of the second-best DCN [24]; and (**c**) detection results of the first-best Quad-FPN. Missed detections are marked by red boxes; false alarms are marked by orange boxes.


[Table: quantitative results of the 13 CNN-based detectors, including HR-SDNet [12], DAPN [13], SER Faster R-CNN [14], ARPN [15], and Quad-FPN (Ours). The best detector is bold and the second-best is underlined.]


[Table: quantitative results of the 13 CNN-based detectors on a further dataset, including HR-SDNet [12], DAPN [13], SER Faster R-CNN [14], ARPN [15], and Quad-FPN (Ours). The best detector is bold and the second-best is underlined.]


**Figure 8.** SAR ship detection results on Gaofen-SSDD. (**a**) Ground truths; (**b**) detection results of the second-best Free-Anchor [41]; and (**c**) detection results of the first-best Quad-FPN. Missed detections are marked by red boxes; false alarms are marked by orange boxes.

**Figure 9.** SAR ship detection results on Sentinel-SSDD. (**a**) Ground truths; (**b**) detection results of the second-best Free-Anchor [41]; and (**c**) detection results of the first-best Quad-FPN. Missed detections are marked by red boxes; false alarms are marked by orange boxes.

**Figure 10.** SAR ship detection results on SAR-Ship-Dataset. (**a**) Ground truths; (**b**) detection results of the second-best Free-Anchor [41]; and (**c**) detection results of the first-best Quad-FPN. Missed detections are marked by red boxes; false alarms are marked in orange.

**Figure 11.** SAR ship detection results on HRSID. (**a**) Ground truths; (**b**) detection results of the second-best Guided Anchoring [40]; and (**c**) detection results of the first-best Quad-FPN. Missed detections are marked by red boxes; false alarms are marked by orange boxes.

Taking SSDD in Figure 7 as an example, we can draw the following conclusions:

1. Quad-FPN can successfully detect various SAR ships of different sizes under various backgrounds. This shows its excellent detection performance with strong scale- and scene-adaptation. Compared with the second-best CNN-based ship detector DCN [24], Quad-FPN can improve the detection confidence scores. For example, in the first detection sample of Figure 7, Quad-FPN increases the confidence score from 0.96 to 1.0, indicating its higher detection credibility.


Moreover, from the detection results of the second sample on Gaofen-SSDD in Figure 8, Quad-FPN can remove false alarms from ship-like man-made facilities while successfully detecting the ship moored at port, even under strong speckle noise interference, i.e., a rather low signal-to-noise ratio (SNR). This shows that Quad-FPN has both keen judgment and robust anti-noise performance. Similarly, the detection results of the first three samples on SAR-Ship-Dataset in Figure 10 also reveal its excellent anti-noise performance. Finally, from the detection results of the third sample on SAR-Ship-Dataset in Figure 10, a large ship parked at port is detected by Quad-FPN again. This is because PA-SA-FPN can transmit the low-level location information from the pyramid bottom to the pyramid top, which brings more accurate positioning of large ship bounding boxes. Correspondingly, the feature learning benefits of large ships are enhanced, thereby avoiding their missed detections. Given the above, Quad-FPN offers state-of-the-art SAR ship detection performance.

#### *4.3. Large-Scene Application in Sentinel-1 SAR Images*

We conduct ship detection on two other large-scene Sentinel-1 SAR images to confirm the good migration capability of Quad-FPN. Figure 12 shows the coverage areas of the two large-scene Sentinel-1 SAR images; both areas lie on the world's major shipping routes, which is why they are selected. Table 6 shows their descriptions. From Table 6, the VV-polarization SAR images are selected given that ships generally exhibit higher backscattering values in VV polarization [42]. In addition, the interferometric wide-swath (IW) mode of Sentinel-1 is selected specifically because it is the main mode for acquiring data in areas of maritime surveillance interest [42]. The ship ground truths are annotated by SAR experts using the automatic identification system (AIS) and Google Earth, providing a more reliable performance evaluation. These two SAR images are resized to 24,000 × 16,000 pixels. Then, following [43], they are cut into 800 × 800 sub-images for training and testing because of the limited GPU memory. Finally, they are input into Quad-FPN for the actual SAR ship detection. After that, the detection results of these sub-images are merged back into the original large-scene SAR images.

**Figure 12.** Coverage areas of two large-scene Sentinel-1 SAR images. (**a**) Singapore Strait; (**b**) Gulf of Cadiz.



Figure 13 shows the visualization SAR ship detection results of Quad-FPN on the two large-scene SAR images. From Figure 13, most ships can be detected by Quad-FPN successfully, which shows its good migration application capability in ocean surveillance.


**Figure 13.** Detection results in two large-scene Sentinel-1 SAR images. (**a**) Image 1; (**b**) Image 2. Detections are marked by blue boxes.

#### 4.3.1. Quantitative Comparison with State-of-the-Art

Tables 7 and 8 show the quantitative comparison with the 12 other competitive CNN-based SAR ship detectors. To be clear, in Tables 7 and 8, the GPU time ($t_{GPU}$) is selected to compare speed because modern CNN-based detectors always run on GPUs. From Tables 7 and 8, one can find that Quad-FPN achieves the best detection accuracy on the two large-scene SAR images, showing its good migration capability.

On Image 1, Quad-FPN offers an accuracy of 83.96% mAP, superior to the second-best PANET [37] (83.96% mAP > 80.51% mAP); on Image 2, Quad-FPN offers an accuracy of 87.03% mAP, superior to the second-best PANET [37] (87.03% mAP > 84.33% mAP). Admittedly, Quad-FPN's detection speed is relatively modest in contrast to others; thus, further detection speed improvements can be pursued in the future.

#### 4.3.2. Quantitative Comparison with CFAR

Finally, we perform an experiment to compare performance with a classical and commonly used two-parameter CFAR detector. Following the standard implementation process from Deng et al. [44], we obtain the CFAR detection results using the Sentinel-1 toolbox [45]. Tables 9 and 10 show the quantitative detection results.


[Tables 7 and 8: detection counts, mAPs, and GPU times of the 13 detectors on the two large-scene images, including HR-SDNet [12], DAPN [13], SER Faster R-CNN [14], ARPN [15], and Quad-FPN (Ours). The best detector is bold and the second-best is underlined.]

**Table 9.** Quantitative evaluation indices comparison with CFAR on Image 1.

**Table 10.** Quantitative evaluation indices comparison with CFAR on Image 2.


In Tables 9 and 10, the traditional CFAR is not evaluated with the mAP metric from the DL community; instead, F1 is used to represent accuracy, defined by:

$$\text{F1} = 2 \times \frac{p \times r}{p + r} \tag{22}$$

Moreover, in Tables 9 and 10, CFAR detectors usually run on CPUs, whereas modern DL-based methods always run on GPUs; to ensure a reasonable comparison, the CPU time ($t_{CPU}$) is selected for the speed comparison. From Tables 9 and 10, Quad-FPN is greatly superior to CFAR in terms of detection accuracy, i.e., 0.74 F1 of CFAR on Image 1 << 0.84 F1 of Quad-FPN on Image 1, and 0.69 F1 of CFAR on Image 2 << 0.84 F1 of Quad-FPN on Image 2. The detection speed of Quad-FPN is also greatly superior to CFAR, i.e., 223.15 s CPU time of Quad-FPN on Image 1 << 884.00 s CPU time of CFAR on Image 1, and 226.08 s CPU time of Quad-FPN on Image 2 << 735.00 s CPU time of CFAR on Image 2. Therefore, Quad-FPN can better meet the needs of practical applications.

#### **5. Ablation Study**

In this section, ablation studies are conducted to verify the effectiveness of each FPN. We also discuss the advantages of each innovation. Here, we take the SSDD dataset as an example to show the results, due to limited space. Table 11 shows the effectiveness of the Quad-FPN pipeline (DE-CO-FPN → CA-FR-FPN → PA-SA-FPN → BS-GA-FPN). From Table 11, the detection accuracy is improved step by step from left to right in the Quad-FPN pipeline architecture (89.92% mAP → 93.61% mAP → 94.58% mAP → 95.29% mAP). This shows each FPN's effectiveness from the perspective of the overall structure.


**Table 11.** Effectiveness of the Quad-FPN pipeline.

To be clear, the sequence of the four FPNs should be kept unchanged; otherwise, the final accuracy cannot reach the best level, according to our experiments. Some detailed analysis can be found in Section 2 (i.e., the overall design idea of Quad-FPN).

#### *5.1. Ablation Study on DE-CO-FPN*

We conduct two experiments with respect to DE-CO-FPN. Experiment 1 in Section 5.1.1 confirms the effectiveness of DE-CO-FPN directly. Experiment 2 in Section 5.1.2 confirms the advantage of the deformable convolution.

#### 5.1.1. Experiment 1: Effectiveness of DE-CO-FPN

Table 12 shows the ablation study results on DE-CO-FPN. In Table 12, "✘" denotes removing DE-CO-FPN (the other three FPNs are reserved) and "✓" denotes using DE-CO-FPN. From Table 12, DE-CO-FPN improves the accuracy by ~3% mAP, which shows its effectiveness. With it, SAR ship features extracted by the network contain useful shape information while alleviating complex background interferences.

**Table 12.** Effectiveness of DE-CO-FPN.


#### 5.1.2. Experiment 2: Different Types of Convolutions

Table 13 shows the ablation study results on different convolution types. In Table 13, "Standard" denotes the traditional regular convolution in Figure 3a, "Dilated" denotes the dilated convolution in Figure 3b, and "Deformable" denotes the deformable convolution in Figure 3c. From Table 13, the deformable convolution achieves the best detection accuracy because it can more effectively model various ships' shapes by its adaptive kernel offset learning. This adaptive kernel offset learning can extract the shape and edge features of ships accurately, to suppress the interference of complex backgrounds, especially for the complex inshore scenes. In this way, ships can be separated successfully from complex backgrounds. Thus, this deformable convolution process can be regarded as an extraction of salient objects in various scenes, which plays a role of spatial attention. Accordingly, the accuracy on the overall dataset is improved.

**Table 13.** Different types of convolutions.


#### *5.2. Ablation Study on CA-FR-FPN*

With respect to CA-FR-FPN, we conduct two experiments. Experiment 1 in Section 5.2.1 confirms the effectiveness of CA-FR-FPN directly. Experiment 2 in Section 5.2.2 determines the appropriate feature amplification factor *α* in CA-FR-Module.

#### 5.2.1. Experiment 1: Effectiveness of CA-FR-FPN

Table 14 shows the ablation study results on CA-FR-FPN. In Table 14, "✘" denotes removing CA-FR-FPN (i.e., not using the CA-FR-Module; the other three FPNs are reserved) and "✓" denotes using CA-FR-FPN. From Table 14, CA-FR-FPN improves the detection accuracy by ~1% mAP because it can be aware of more valuable information for feature up-sampling. Its adaptive content-aware kernel can improve the transmission benefits of the information flow, thereby improving detection performance. This is because it can effectively capture the rich semantic information required by dense detection tasks, especially for densely distributed small ships, avoiding the feature loss caused by small ship features' poor conspicuousness. Accordingly, the accuracy on the overall dataset is improved.


**Table 14.** Effectiveness of CA-FR-FPN.

5.2.2. Experiment 2: Different Feature Amplification Factors

Table 15 shows the ablation study results on the feature amplification factor *α* in CA-FR-Module. In Table 15, "✘" denotes not amplifying features. From Table 15, whenever features are amplified, regardless of the value of *α*, the detection accuracy improves compared with not amplifying features. Therefore, feature amplification can indeed enhance the content-aware benefits of the kernel prediction. This is because, in the embedded feature amplification space, the amount of information in the feature maps is effectively increased, promoting better correctness of the kernel prediction. Finally, in Quad-FPN, *α* is set to 8, an optimal (saturated) value, to obtain better detection accuracy (95.29% mAP).

**Table 15.** Different feature amplification factors.


#### *5.3. Ablation Study on PA-SA-FPN*

We conduct three experiments with respect to PA-SA-FPN. Experiment 1 in Section 5.3.1 confirms the effectiveness of PA-SA-FPN directly. Experiment 2 in Section 5.3.2 confirms the effectiveness of PA-SA-Module. Experiment 3 in Section 5.3.3 confirms the advantage of PA-SA-Module.

#### 5.3.1. Experiment 1: Effectiveness of PA-SA-FPN

Table 16 shows the ablation study results on PA-SA-FPN. In Table 16, "✘" denotes removing PA-SA-FPN (the other three FPNs are reserved) and "✓" denotes using PA-SA-FPN. From Table 16, PA-SA-FPN improves the detection accuracy by ~1.5% mAP because the low-level spatial location information in the pyramid bottom is transmitted to the top in PA-SA-FPN. In this way, the positioning of large ship bounding boxes becomes more accurate. Accordingly, the accuracy on the overall dataset is improved.



#### 5.3.2. Experiment 2: Effectiveness of PA-SA-Module

Table 17 shows the ablation study results on PA-SA-Module. From Table 17, PA-SA-Module can effectively enhance the detection accuracy by ~1% mAP because it enables more pivotal spatial information in the pyramid bottom to be effectively transmitted to the top. This can improve path aggregation benefits. In this way, the features of large ships become richer and more discriminative. Accordingly, the accuracy on the overall dataset is improved.

**Table 17.** Effectiveness of PA-SA-Module.


#### 5.3.3. Experiment 3: Different Attention Types

Table 18 shows the ablation study results on different attention types. In Table 18, "SE" denotes the squeeze-and-excitation mechanism [36] and "CBAM" denotes the convolutional block attention module [28]. From Table 18, PA-SA-Module is superior to the others because it causes key spatial global information to be transmitted more efficiently, which means it is more suitable for PA-SA-FPN. Moreover, different from the previous CBAM, our designed space encoder $f_{space\text{-}encode}$ can encode the space information and represent the spatial correlation more effectively. This can improve spatial attention gains because the features in the coding space are more concentrated.

**Table 18.** Different attention types.


#### *5.4. Ablation Study on BS-GA-FPN*

We conduct three experiments with respect to BS-GA-FPN. Experiment 1 in Section 5.4.1 is used to confirm the effectiveness of BS-GA-FPN, directly. Experiment 2 in Section 5.4.2 is used to confirm the effectiveness of GA. Experiment 3 in Section 5.4.3 is used to confirm the advantage of GA.

#### 5.4.1. Experiment 1: Effectiveness of BS-GA-FPN

Table 19 shows the ablation study results on BS-GA-FPN. In Table 19, "✘" denotes removing BS-GA-FPN (the other three FPNs are reserved); "✓" denotes using BS-GA-FPN. From Table 19, BS-GA-FPN can play an important role in ensuring higher detection accuracy because it can improve the accuracy by ~1% mAP. In this way, ship multi-scale features can be effectively balanced, which can achieve a stronger feature expression capacity of the final FPN. Accordingly, the accuracy on the overall dataset is improved.

**Table 19.** Effectiveness of BS-GA-FPN.


#### 5.4.2. Experiment 2: Effectiveness of GA

Table 20 shows the ablation study results on GA. From Table 20, GA can improve the detection accuracy because, when various ship multi-scale features are refined by it, they become more discriminative. This feature self-attention can amplify important global information and suppress tiresome interferences, which enhances the feature expressiveness of FPN. Essentially, GA is able to directly capture the long-range dependence of each location (global response) by calculating the interaction between two arbitrary positions. The whole GA refinement is essentially equivalent to constructing a convolutional kernel with the same size as the feature map, to maintain more useful ship information. Accordingly, the accuracy on the overall dataset is improved.

**Table 20.** Effectiveness of GA.


#### 5.4.3. Experiment 3: Different Refinement Types

Table 21 shows the ablation study results of different refinement types. In Table 21, we compare three refinement types: a convolutional layer, SE [36], and CBAM [28]. From Table 21, GA offers the best detection accuracy because it can directly capture the long-range dependence of each location (global response) to maintain more useful ship information, making the feature maps more discriminative. Different from traditional convolution refinement types, its receptive field is wider, i.e., the size of the whole input feature map, resulting in better spatial correlation learning. Accordingly, the accuracy on the overall dataset is improved.

**Table 21.** Different refinement types.


#### **6. Conclusions**

Aiming at some challenges in SAR ship detection, e.g., complex background interferences, multi-scale ship feature differences, and indistinctive small ship features, a novel Quad-FPN is proposed for SAR ship detection in this paper. Quad-FPN consists of four unique FPNs that guarantee its excellent detection performance, i.e., DE-CO-FPN, CA-FR-FPN, PA-SA-FPN, and BS-GA-FPN. In DE-CO-FPN, we adopt the deformable convolution to extract SAR ship features that contain more useful ship shape information while alleviating complex background interferences. In CA-FR-FPN, we design a CA-FR-Module to enhance feature transmission benefits when performing the up-sampling multi-level feature fusion. In PA-SA-FPN, we add an extra path aggregation branch with a space attention module from the pyramid bottom to the top. In BS-GA-FPN, we further refine features from each feature level in the pyramid to address the feature level imbalance of different-scale ships. We perform extensive ablation studies to confirm the effectiveness of each FPN. Experimental results on five open datasets jointly reveal that Quad-FPN offers superior SAR ship detection performance compared with 12 other competitive state-of-the-art CNN-based SAR ship detectors. Moreover, the satisfactory detection results in two large-scene Sentinel-1 SAR images show Quad-FPN's excellent migration capability in ocean surveillance. Quad-FPN is a two-stage SAR ship detector whose four FPNs' internal implementations differ from previous work. They are well-designed improvements that ensure state-of-the-art detection performance, without bells and whistles, and they enable Quad-FPN's excellent ship scale-adaptability and detection scene-adaptability.

Our future work is as follows:


**Author Contributions:** Conceptualization, T.Z.; methodology, T.Z.; software, T.Z.; validation, T.Z.; formal analysis, T.Z.; investigation, T.Z.; resources, T.Z.; data curation, T.Z.; writing—original draft preparation, T.Z.; writing—review and editing, X.Z. and X.K.; visualization, T.Z.; supervision, X.Z.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Natural Science Foundation of China (61571099).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** No new data were created or analyzed in this study. Data sharing is not applicable to this article.

**Acknowledgments:** The authors would like to thank the editors and the four anonymous reviewers for their valuable comments that can greatly improve our manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

