Article

Ship Detection in Synthetic Aperture Radar Images Based on BiLevel Spatial Attention and Deep Poly Kernel Network

by Siyuan Tian 1, Guodong Jin 1,*, Jing Gao 1, Lining Tan 1, Yuanliang Xue 1, Yang Li 1 and Yantong Liu 2,*
1 Xi’an Research Institute of Hi-Tech, Xi’an 710025, China
2 Department of Computer and Information Engineering, Kunsan National University, Gunsan 54150, Republic of Korea
* Authors to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(8), 1379; https://doi.org/10.3390/jmse12081379
Submission received: 3 July 2024 / Revised: 31 July 2024 / Accepted: 9 August 2024 / Published: 12 August 2024
(This article belongs to the Section Ocean Engineering)

Abstract

Synthetic aperture radar (SAR) is a technique widely used in the field of ship detection. However, due to high ship density, foreground-background imbalance, and varying target sizes, achieving lightweight and high-precision multiscale ship object detection remains a significant challenge. In response to these challenges, this research presents YOLO-MSD, a multiscale SAR ship detection method. Firstly, we propose a Deep Poly Kernel Backbone Network (DPK-Net) that utilizes the Optimized Convolution (OC) Module to reduce data redundancy and the Poly Kernel (PK) Module to improve the feature extraction capability and scale adaptability. Secondly, we design a BiLevel Spatial Attention Module (BSAM), which consists of the BiLevel Routing Attention (BRA) and the Spatial Attention Module. The BRA is first utilized to capture global information. Then, the Spatial Attention Module is used to improve the network’s ability to localize the target and capture high-quality detailed information. Finally, we adopt a Powerful-IoU (P-IoU) loss function, which adapts to the ship size and effectively guides the anchor box to achieve faster and more accurate detection. Using HRSID and SSDD as experimental datasets, mAP values of 90.2% and 98.8% are achieved, respectively, outperforming the baseline by 5.9% and 6.2% with a model size of 12.3 M. Furthermore, the network exhibits excellent performance across various ship scales.

1. Introduction

Radar systems are able to differentiate between different sources of radiation by means of their signals, enabling the identification and classification of objects [1]. Synthetic aperture radar (SAR) is a unique active microwave imaging radar technology [2] that can overcome the limitations of bad weather, such as clouds, fog, rain, and snow, and provide high-resolution radar images of maritime targets. In addition, the resolution of SAR images does not change with the observation distance, which enables long-distance detection and supports long-term, continuous, dynamic, and real-time monitoring of vast sea areas. Therefore, SAR image target detection has been widely researched and applied in many fields, such as agriculture, forestry, water resource management, geology, and military research. It is also worth noting that ship detection in SAR images has received considerable attention from researchers both at home and abroad.
Deep learning methods offer the advantages of self-learning, self-improvement, and weight sharing. One-stage algorithms such as SSD [3] and the YOLO series [4,5,6,7], as well as two-stage detection networks such as Faster R-CNN [8] and Cascade R-CNN [9], have been widely used for object detection. In the field of ship detection, deep learning methods are increasingly influential.
The main challenge in using deep learning models for ship detection is dealing with targets at different scales. Various approaches have been proposed to achieve accurate multiscale ship detection. The most common approaches rely on attention modules (ABP [10], AMMRF [11], and A-BFPN [12]) or multilayer feature fusion (DFF-YOLOv5 [13], LMSD-YOLO [14], MANet [15], and Quad-FPN [16]). All of these detection methods perform feature fusion based on existing features. However, the existing features do not adequately account for the various sizes of the objects, and the detection accuracy decreases in multiscale ship target detection. In addition, the semantic details of the ships extracted by these approaches are easily submerged, which is unfavorable for ship detection. In particular, ship targets may become blurred under the influence of clutter, and degradation of the signal-to-noise ratio makes it more challenging to distinguish ship targets. Therefore, it is necessary to capture as much information about the ship as possible and make the best use of it to reduce the impact of clutter on the detection process. Moreover, many methods have high computational complexity and consume a large amount of memory without reducing redundant parameters.
In order to effectively solve the above problems, the YOLO-MSD method is proposed in this paper. Firstly, we design the Deep Poly Kernel Network (DPK-Net), which consists of the Optimized Convolution (OC) Module and Poly Kernel (PK) Module. PConv is introduced in the OC module, which makes the network lighter and ensures more efficient extraction of spatial features. The PK module employs multiple parallel depthwise separable convolutions to capture contextual information at different scales, which enhances the depth and breadth of feature extraction and improves the detection accuracy and scale adaptability for targets of different sizes. Secondly, we propose a BiLevel Spatial Attention Module (BSAM), which combines the BiLevel Routing Attention (BRA) and the Spatial Attention Module. The BRA receives global information initially, followed by the Spatial Attention Module, which enhances the network’s capacity to localize the target and acquire high-quality detailed information. Finally, we adopt the Powerful-IoU (P-IoU) loss function. This function can combine a target-size adaptive penalty factor and a gradient-adjusting function based on the anchor box quality to guide the anchor box to faster regression. Through this series of designs, YOLO-MSD demonstrates excellent performance and broad applicability in ship detection tasks.
The main contributions of our work can be summarized as follows:
  • We construct DPK-Net in the backbone network, which consists of the OC module and PK module. The OC module aims to reduce data redundancy and optimize the efficiency of information processing. The PK module extracts dense ship features from different receptive fields. These features are adaptively fused along the channel dimensions to collect contextual information more efficiently.
  • We design the BSAM attention mechanism to obtain global information through sparsity while preserving ship detail information and achieving faster regression through P-IoU.
  • Extensive experiments have been conducted on the SSDD and HRSID datasets, and the results demonstrate the effectiveness of the proposed model.

2. Related Work

2.1. Multiscale Ship Detection

In SAR ship detection, ship targets exhibit multiscale diversity, and attention modules can be an effective solution.
Specifically, Fu et al. [10] designed an attention-guided balanced pyramid (ABP) structure in the FBR-Net network in order to improve the attention paid to small vessels. Tang et al. [11] proposed a multiscale attention mechanism with receptive field convolutional blocks (AMMRF), which can effectively use the positional information of the feature maps to distinguish between the vessels and the background. Li et al. [12] more fully utilized semantic features and multilayer complementary features to build an attention-guided balanced feature pyramid network (A-BFPN).
Multilayer feature fusion is also a commonly used method. Li et al. [13] obtained context fusion information by cascading and juxtaposing a number of pyramid modules containing different combinations of convolutional layers. Guo et al. [14] designed a depth-adaptive spatial feature fusion module for the multiscale problem in rotating object detection. Suo et al. [15] integrated low-level spatial data with high-level semantic data to address the challenge of detecting ships with significant size variations. Zhang et al. [16] developed four distinct feature pyramid modules and arranged them sequentially to create a Quad-FPN, enhancing the model’s ability to detect features across multiple scales.
Although the above neural networks are effective to a degree in multiscale ship detection applications, they still have specific problems. They usually require a large amount of computational resources, especially in the training phase, which demands substantial data and computing power, resulting in high cost and energy consumption. In addition, it is difficult for them to account for targets of various scales during feature extraction, so detailed information is lost, and there is still room for improvement in detection performance.

2.2. Attention Mechanism

In the field of computer vision, attention mechanisms are designed to improve the efficiency of image feature extraction by assigning weights to the spatial and channel information in neural networks. This process generates image weight coefficients that amplify targets and diminish backgrounds, thereby aiding subsequent imaging tasks. Attention mechanisms can be classified into several types: hard attention, soft attention, self-attention, global attention, local attention, and multihead attention [17]. Notable attention mechanism algorithms include Squeeze-and-Excitation Networks (SENet), Selective Kernel Networks (SKNet), Convolutional Block Attention Module (CBAM), Criss-Cross Attention Network (CCNet), Object Context Network (OCNet), and Dual Attention Network (DANet) [18,19,20,21,22]. The Transformer model, which employs an encoder-decoder architecture, has also attracted significant interest in recent years [23].
Attention mechanisms hold significant promise for ship detection applications. For instance, Chen et al. [24] enhanced the feature extraction capabilities of backbone networks by utilizing attention mechanisms and multilevel features. Zhu et al. [25] introduced a hierarchical attention-based SAR ship detection method, which integrates global and local attention modules and applies a hierarchical attention strategy at both the image and target levels. Yasir et al. [26] incorporated a convolutional block attention module into the feature fusion module of the YOLO-tiny framework, assigning different weights to each feature map image to highlight effective features. Shan et al. [27] proposed the SimAM attention mechanism to enhance spatial features in images, improving both the accuracy of ship detection and the computational efficiency of the network. Zhou et al. [28] developed a sidelobe-aware sensing mechanism to mitigate the impact of strong scattering points and enhance ship information, thereby improving the model’s ability to recognize ship targets by identifying and suppressing sidelobe noise in images.
Although attention mechanisms can enhance the performance of ship detection, they come with certain limitations. These mechanisms add to the model’s complexity, leading to higher computational costs, making the training process more challenging, and hindering the real-time performance of detection tasks.

2.3. Loss Function

The loss function allows for a thorough evaluation of the deviation between the model’s predicted values and actual outcomes. As a pivotal element in model training, the loss function significantly influences both the model’s performance and its practical applicability. Efficient and precise ship detection can be achieved by using a well-crafted loss function. For instance, Zhu et al. [29] implemented the CIoU loss to speed up convergence and enhance overall performance. Yang et al. [30] developed a novel, straightforward, and efficient $E_{1/2}IoU$ loss that balances the impact of both high-quality and low-quality samples on the loss, thereby making it more effective for SAR image ship detection using unsupervised domain adaptation. Additionally, Hu et al. [31] incorporated the Normalized Wasserstein Distance (NWD) into the loss function to improve the regression for small ships and enhance the model’s capability for multiscale detection.
Nonetheless, the aforementioned approaches encounter challenges in balancing the loss across targets of varying scales and exhibit slow convergence in multiscale ship detection. To address these issues, our method integrates a penalty factor, where the target box size acts as the denominator, and utilizes a P-IoU loss function tailored to the quality of the anchor box. This strategy enhances the scale adaptation and robustness.

3. Materials and Methods

3.1. The Overview of YOLO-MSD

Our proposed YOLO-MSD framework builds upon YOLOv7-tiny [7] as its foundational structure. The architecture comprises three main components: the backbone network, responsible for feature extraction; the feature fusion network, which reprocesses and refines the features obtained from the backbone; and the classification prediction network, which conducts the final detection predictions. In this process, the predicted outcomes are combined with the ground truth data and fed into the loss function for further computation and optimization. Ultimately, non-maximum suppression is employed to discard redundant detection boxes, ensuring the precise localization of ship targets.
The overall network architecture is depicted in Figure 1. Initially, a deep polykernel backbone network is established. This backbone network serves as the foundation for the subsequent modules and processes. Specifically, the introduction of PConv within the backbone facilitates the design of an OC module aimed at minimizing unnecessary computations and memory access. Additionally, a PK module is incorporated at the terminus of the backbone network to extract multiscale target features and capture the local context.
Subsequently, further processing and dimensionality adjustment of the feature map is achieved by fine fusion of ELAN-Tiny, SPP-Tiny, MP and CBL modules at the neck. The BSAM attention mechanism is integrated at the neck to thoroughly capture global image information alongside ship details.
The Head section focuses on feature classification and regression, which not only enhances the feature representation capability through the Conv, BN and LeakyReLU combination, ELAN-Tiny module and Maxpool operation but also provides effective support for final decision-making through dimensionality reduction (e.g., halving the feature map size after Maxpool). In particular, the SP module combined with the Cat operation enables feature map splicing and fusion, which significantly increases the number of channels in the feature map and further enriches feature information. Furthermore, a novel P-IoU loss function is implemented in the final regression phase to achieve faster convergence and enhanced accuracy.

3.2. Deep Poly Kernel Network (DPK-Net)

In this section, we will introduce DPK-Net in detail. The specific structure is shown in Figure 2. Within the DPK-Net architecture, the input image undergoes initial sampling via two CBL modules, reducing its size to one-fourth of the original dimensions. Subsequently, the network integrates three optimized convolutional modules, three MP modules, one ELAN-Tiny module, and one deep poly kernel module on top of the baseline model YOLOv7-tiny. This arrangement produces three distinct feature maps, each on a different scale.

3.2.1. Optimized Convolution Module (OC)

Figure 3a illustrates the OC module’s architecture, which features two distinct branches. The left branch enhances the network’s receptive field by passing through the CBL module. Meanwhile, the right branch sequentially processes through the CBL and PBL modules, enabling comprehensive feature extraction with a lightweight design and minimizing the parameter count, which is typically increased by stacking standard convolutional modules. Ultimately, the output feature maps from both branches are concatenated, followed by dimensionality reduction to decrease the channel count for the final output. Incorporating PConv into this module enhances network efficiency and spatial feature extraction while maintaining a lightweight structure. As depicted in Figure 3c, PConv optimizes this process by selectively applying filters to some input channels, leaving the others unchanged [32]. For contiguous or regular memory access, the first or last $c_p$ consecutive channels are convolved as representatives of the entire feature map.
To generalize, we assume that the input and output feature maps possess an equal number of channels. Consequently, the FLOPs of PConv are only:
$$h \times w \times k^2 \times c_p^2 \tag{1}$$
Since $r = \frac{c_p}{c} = \frac{1}{4}$, the FLOPs of PConv are only 1/16 of those of a regular convolution. In addition, PConv requires less memory access:
$$h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p \tag{2}$$
That is, the memory access of PConv is 1/4 that of the regular convolution.
Hence, the construction of the OC module is designed to minimize superfluous computations and optimize memory access efficiency.
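To make the PConv operation concrete, the following is a minimal PyTorch sketch of partial convolution as described above and in [32], assuming the leading $c_p = C/4$ channels are convolved while the remaining channels pass through unchanged; the class name, defaults, and layer choices are illustrative rather than the authors’ implementation.

```python
import torch
import torch.nn as nn


class PConv(nn.Module):
    """Partial convolution: only the leading c_p channels are filtered."""

    def __init__(self, dim: int, kernel_size: int = 3, ratio: float = 0.25):
        super().__init__()
        self.cp = int(dim * ratio)  # number of channels that are convolved (c_p)
        self.conv = nn.Conv2d(self.cp, self.cp, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); split into the c_p channels to convolve and the untouched rest
        x1, x2 = torch.split(x, [self.cp, x.size(1) - self.cp], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)
```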

3.2.2. Poly Kernel Module (PK)

Depthwise separable convolution is composed of two stages, as illustrated in Figure 4. First, depthwise convolution is applied to the input features. In this step, each convolution kernel is linked to a specific channel, meaning that each channel undergoes a convolution operation using only its corresponding kernel. The output feature map retains the same number of channels and convolution kernels as the input feature map. Subsequently, the feature maps are processed using pointwise convolution. This involves weighting and combining the feature maps from the previous step along the channel dimension, resulting in new feature maps in which the number of convolution kernels matches the number of output channels [33].
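As a minimal illustration of this two-stage decomposition (not the authors’ code), the following PyTorch sketch implements a depthwise convolution with one kernel per channel followed by a 1 × 1 pointwise convolution; names and defaults are illustrative.

```python
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (one kernel per channel) followed by 1x1 pointwise fusion."""

    def __init__(self, c_in: int, c_out: int, kernel_size: int = 3):
        super().__init__()
        # depthwise stage: groups == c_in, so each channel is convolved with its own kernel
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size,
                                   padding=kernel_size // 2, groups=c_in, bias=False)
        # pointwise stage: 1x1 convolution combines the maps along the channel dimension
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```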
As shown in Figure 5, the PK module first utilizes a small kernel convolution to obtain local information, followed by a set of parallel depthwise separable convolutions to capture contextual information across multiple scales [34].
The PK module in the DPK block can be mathematically represented as follows,
$$L = \mathrm{Conv}_{k_s \times k_s}(X) \tag{3}$$
$$Z^{(m)} = \mathrm{DWConv}_{k^{(m)} \times k^{(m)}}(L), \quad m = 1, \ldots, 4 \tag{4}$$
where $X$ is the initial ship localized feature, $L \in \mathbb{R}^{C \times H \times W}$ is the local feature extracted by the $k_s \times k_s$ convolution, and $Z^{(m)} \in \mathbb{R}^{C \times H \times W}$ is the contextual feature extracted by the $m$-th $k^{(m)} \times k^{(m)}$ depthwise separable convolution (DWConv). In our experiments, we set $k_s = 3$ and $k^{(m)} = (m + 1) \times 2 + 1$.
The interrelationships between the various channels were then characterized by fusing the local and contextual features through a convolution of size 1 × 1:
$$P = \mathrm{Conv}_{1 \times 1}\left(L + \sum_{m=1}^{4} Z^{(m)}\right) \tag{5}$$
where $P \in \mathbb{R}^{C \times H \times W}$ denotes the output features. The 1 × 1 convolution acts as a channel fusion technique, enabling the integration of features with varying receptive field sizes.
Our PK module enables the extraction of features from the backbone that encompass various scales and convolution depths, facilitating the effective detection of large, medium, and small ship targets simultaneously. Additionally, it allows the capture of extensive contextual information while preserving the integrity of the local texture features. This helps to extract ship features, especially those of small vessels, from background clutter and improves the reliability of ship detection performance.
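A minimal PyTorch sketch of the PK module following Equations (3)–(5) is given below. It assumes the number of output channels equals the number of input channels, uses depthwise convolutions with kernel sizes 5, 7, 9, and 11 (from $k^{(m)} = (m + 1) \times 2 + 1$), and leaves the channel mixing to the final 1 × 1 fusion convolution; it is a sketch of the idea, not the authors’ implementation.

```python
import torch.nn as nn


class PKModule(nn.Module):
    """Sketch of Eqs. (3)-(5): local small-kernel conv, parallel depthwise convs, 1x1 fusion."""

    def __init__(self, dim: int, ks: int = 3):
        super().__init__()
        # L = Conv_{ks x ks}(X): local feature extraction, Eq. (3)
        self.local = nn.Conv2d(dim, dim, ks, padding=ks // 2)
        # Z^(m) = DWConv_{k^(m) x k^(m)}(L) with k^(m) in {5, 7, 9, 11}, Eq. (4)
        self.context = nn.ModuleList(
            [nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in (5, 7, 9, 11)]
        )
        # P = Conv_{1x1}(L + sum_m Z^(m)): channel fusion, Eq. (5)
        self.fuse = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        local = self.local(x)
        ctx = sum(dw(local) for dw in self.context)
        return self.fuse(local + ctx)
```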

3.3. BiLevel Spatial Attention Module (BSAM)

The attention mechanism enables the network to retain features carrying critical information by calculating the similarity or correlation between features and assigning appropriate weights to different features. At the same time, a well-designed attention mechanism can also mitigate the missed and false detections caused by ship targets that are too dense or occlude each other. In this paper, the BSAM attention module is designed.
Figure 6 depicts the detailed implementation of the BSAM attention. Commencing with an intermediate feature map, our module derives the attention map along two distinct dimensions sequentially. Subsequently, this attention map undergoes multiplication with the input feature map to accomplish adaptive feature refinement.
Given an intermediate feature map $X \in \mathbb{R}^{H \times W \times C}$ as input, the BiLevel Routing Attention module is utilized to generate an attention map $M_B \in \mathbb{R}^{H \times W \times C}$, and the Spatial Attention Module is utilized to generate a 2D spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$. The overall attention process can be summarized as
$$F' = M_B(X) \otimes X, \quad F'' = M_s(F') \otimes F' \tag{6}$$

3.3.1. BiLevel Routing Attention (BRA)

The BRA modifies attention weights according to the features of the input image. This enables the network to apply varying levels of focus to different locations or attributes, enhancing the detection of ship targets at multiple scales. Importantly, this adjustment does not overburden the model computationally [35].
BiFormer, a derivative of the Transformer [23] model, incorporates dynamic sparse attention, enhancing computational flexibility and feature discernment via BRA. Initially, it eliminates most non-essential key-value pairs at the coarse region level, preserving only a small subset of routing regions. Subsequently, it implements detailed token-to-token attention within the selected regions. As shown in Figure 7, the bi-level routing attention mechanism first divides the input feature map $X \in \mathbb{R}^{H \times W \times C}$ into $S \times S$ non-overlapping regions, so that $X$ is reshaped into $X^r \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$ and each region contains $\frac{HW}{S^2}$ feature vectors. Equation (7) is then used to derive $Q, K, V \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$:
$$Q = X^r W^q, \quad K = X^r W^k, \quad V = X^r W^v \tag{7}$$
where $W^q, W^k, W^v \in \mathbb{R}^{C \times C}$ are the projection weights of $Q$, $K$, and $V$, respectively. Then, the region-wise means of $Q$ and $K$ are computed to obtain $Q^r$ and $K^r$, respectively. Equation (8) is employed to compute the adjacency matrix $A^r$, which assesses the semantic similarity across regions.
$$A^r = Q^r (K^r)^{\mathrm{T}} \tag{8}$$
The matrix $A^r$ is pruned using Equation (9), keeping only the top-$k$ connections for each region, to obtain the routing index matrix $I^r$,
$$I^r = \mathrm{topkIndex}(A^r) \tag{9}$$
where the $i$-th row of $I^r$ contains the indexes of the $k$ regions most relevant to the $i$-th region. Gathering $K$ and $V$ according to $I^r$ using Equation (10) yields $K^g$ and $V^g$.
$$K^g = \mathrm{gather}(K, I^r), \quad V^g = \mathrm{gather}(V, I^r) \tag{10}$$
Ultimately, the routing index matrix facilitates the application of fine-grained token-to-token attention from one region to another on $Q$, $K^g$, and $V^g$.
$$M_B(X) = \mathrm{Attention}(Q, K^g, V^g) + \mathrm{LCE}(V) \tag{11}$$
where LCE (local context enhancement) is a depthwise separable convolution with a kernel size of 5 and a stride of 1.
The computation of the BRA consists of three parts: linear projection, region-to-region routing, and token-to-token attention. The total amount of computations is therefore:
$$\begin{aligned}
\mathrm{FLOPs} &= \mathrm{FLOPs}_{proj} + \mathrm{FLOPs}_{routing} + \mathrm{FLOPs}_{attn} \\
&= 3HWC^2 + 2(S^2)^2 C + 2HW\frac{kHW}{S^2}C \\
&= 3HWC^2 + C\left(2S^4 + \frac{k(HW)^2}{S^2} + \frac{k(HW)^2}{S^2}\right) \\
&\geq 3HWC^2 + 3C\left(2S^4 \cdot \frac{k(HW)^2}{S^2} \cdot \frac{k(HW)^2}{S^2}\right)^{\frac{1}{3}} \\
&= 3HWC^2 + 3C(2k^2)^{\frac{1}{3}}(HW)^{\frac{4}{3}}
\end{aligned} \tag{12}$$
where $O((HW)^2)$ is the complexity of ordinary attention, $C$ is the token embedding dimension (i.e., the number of channels of the feature map), and $k$ is the number of regions to attend to (the “$k$” in “top-$k$”). Here, the inequality between the arithmetic and geometric means has been applied. Equality in Equation (12) holds if and only if $2S^4 = \frac{k(HW)^2}{S^2}$. Therefore:
$$S = \left(\frac{k}{2}(HW)^2\right)^{\frac{1}{6}} \tag{13}$$
In other words, BRA achieves $O((HW)^{\frac{4}{3}})$ complexity if we scale the region partition factor $S$ with respect to the input resolution according to Equation (13).
Compared with the traditional Transformer self-attention structure, BRA is less computationally intensive and significantly reduces the memory pressure. At the same time, while keeping the model lightweight, the mechanism ensures that the model can maximize the retention of fine-grained contextual feature information. It also reduces the impact of noise interference and mitigates the limitations imposed by clutter in the SAR images. This enables the capture of remote dependencies more effectively.
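To make the routing procedure concrete, the following is a simplified single-head PyTorch sketch of BRA following Equations (7)–(11). It assumes channel-last input whose height and width are divisible by $S$, omits the multi-head splitting and other optimizations of the official BiFormer implementation, and uses illustrative parameter defaults.

```python
import torch
import torch.nn as nn


class BiLevelRoutingAttention(nn.Module):
    """Simplified single-head sketch of BRA (Eqs. (7)-(11))."""

    def __init__(self, dim: int, num_regions: int = 7, topk: int = 4):
        super().__init__()
        self.S = num_regions              # region partition factor S
        self.topk = topk                  # k in top-k routing
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        # LCE: depthwise convolution on V with kernel size 5, stride 1 (Eq. (11))
        self.lce = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) with H and W divisible by S
        B, H, W, C = x.shape
        S, k = self.S, self.topk
        hr, wr = H // S, W // S
        n = hr * wr                                               # tokens per region

        # partition into S*S non-overlapping regions: (B, S^2, n, C)
        xr = x.view(B, S, hr, S, wr, C).permute(0, 1, 3, 2, 4, 5).reshape(B, S * S, n, C)
        q, key, v = self.qkv(xr).chunk(3, dim=-1)                 # Eq. (7)

        # region-level routing: mean queries/keys, adjacency, top-k indices (Eqs. (8)-(9))
        adj = q.mean(dim=2) @ key.mean(dim=2).transpose(-1, -2)   # (B, S^2, S^2)
        idx = adj.topk(k, dim=-1).indices                         # (B, S^2, k)

        # gather keys and values of the routed regions (Eq. (10))
        idx_e = idx[..., None, None].expand(-1, -1, -1, n, C)     # (B, S^2, k, n, C)
        kg = torch.gather(key.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx_e)
        vg = torch.gather(v.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx_e)
        kg = kg.reshape(B, S * S, k * n, C)
        vg = vg.reshape(B, S * S, k * n, C)

        # fine-grained token-to-token attention within the routed regions (Eq. (11))
        attn = (q @ kg.transpose(-1, -2)) * self.scale
        out = attn.softmax(dim=-1) @ vg                           # (B, S^2, n, C)

        # restore the spatial layout and add the local context enhancement LCE(V)
        out = out.reshape(B, S, S, hr, wr, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        v_sp = v.reshape(B, S, S, hr, wr, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        out = out + self.lce(v_sp.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        return out
```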

3.3.2. Spatial Attention Module

To concentrate on global spatial information, spatial attention must be computed. First, average-pooling and max-pooling are conducted along the channel axis to consolidate the feature maps’ channel information [36], generating two 2D maps: $F^s_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^s_{max} \in \mathbb{R}^{1 \times H \times W}$. Pooling along the channel axis allows for better handling of critical spatial information. Then, we concatenate them to form an information descriptor. Finally, a spatial attention map $M_s(F) \in \mathbb{R}^{H \times W}$ is generated using a convolutional layer and a sigmoid operation to highlight key pixels and suppress clutter from interfering with ship features. In short, spatial attention is computed as
$$M_s(F) = \sigma\big(f^{3 \times 3}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{3 \times 3}([F^s_{avg}; F^s_{max}])\big) \tag{14}$$
where $\sigma$ represents the sigmoid function and $f^{3 \times 3}$ signifies a convolution with a 3 × 3 filter size.
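A minimal PyTorch sketch of this spatial attention branch, following Equation (14) with an assumed 3 × 3 convolution, is shown below; it is illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Spatial attention branch of BSAM, following Eq. (14)."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        # 2 input channels: channel-wise average map and max map stacked together
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        avg_map = x.mean(dim=1, keepdim=True)    # F_avg^s: (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)    # F_max^s: (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return attn * x                          # M_s(F) applied to the feature map
```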

3.4. Powerful-IoU Loss Function (P-IoU)

Numerous metrics employed by sophisticated detectors rely on the IoU, which is a crucial evaluation metric for current loss functions. Simply put, IoU quantifies the overlap between the detection box and the target box.
$$\mathrm{IoU}(B_a, B_b) = \frac{|B_a \cap B_b|}{|B_a \cup B_b|} \tag{15}$$
where $B_a$ and $B_b$ denote the predicted and actual boxes, respectively. The loss function is defined as
$$L_{\mathrm{IoU}} = 1 - \mathrm{IoU}(B_a, B_b) \tag{16}$$
As shown in Figure 8a, the D-IoU and C-IoU [37] are defined as follows:
$$L_{\mathrm{DIoU}} = 1 - \mathrm{IoU}(B_a, B_b) + \frac{\rho^2(d, d^{gt})}{c^2} \tag{17}$$
where $d$ and $d^{gt}$ denote the center points of the predicted and ground-truth boxes, respectively, $\rho(\cdot)$ is the Euclidean distance, and $c$ is the diagonal length of the smallest enclosing box that covers both boxes.
$$L_{\mathrm{CIoU}} = 1 - \mathrm{IoU}(B_a, B_b) + \frac{\rho^2(d, d^{gt})}{c^2} + \alpha \upsilon \tag{18}$$
where $\alpha$ serves as the loss trade-off parameter and $\upsilon$ evaluates the similarity between aspect ratios.
$$\alpha = \frac{\upsilon}{\left(1 - \mathrm{IoU}(B_a, B_b)\right) + \upsilon}, \quad \upsilon = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{19}$$
where $w$ and $h$ are the width and height of the bounding box, and $w^{gt}$ and $h^{gt}$ are the width and height of the ground truth (GT) box, respectively.
The penalization factors used are flawed, leading to an increase in the anchor box size and slower convergence during regression. Specifically, these factors are inadequate because they do not precisely differentiate between the anchor box and target box, fail to properly consider the target size, and may underperform in certain scenarios. Employing factors based on the anchor box size and smallest enclosing box of the target as the denominator in the penalty term is inappropriate. This causes the anchor box region to expand during regression, negatively impacting efficiency. Thus, the IoU-based regression loss function requires a more suitable penalty term to enhance the performance.
To address these shortcomings, we utilize P-IoU, integrating a penalty factor that incorporates the target box size in the denominator and considers the quality of the adapted anchor boxes [38]. This method ensures that the anchor box regresses more efficiently along a direct path, resulting in quicker convergence and improved accuracy. Here, the penalty factor P, adjusted to the target size, is defined as follows,
$$P = \left(\frac{d_{w1}}{w^{gt}} + \frac{d_{w2}}{w^{gt}} + \frac{d_{h1}}{h^{gt}} + \frac{d_{h2}}{h^{gt}}\right) \Big/ 4 \tag{20}$$
where $d_{w1}$, $d_{w2}$, $d_{h1}$, and $d_{h2}$ are the absolute distances between the corresponding edges of the predicted box and the target box, and $w^{gt}$ and $h^{gt}$ are the width and height of the target box, as shown in Figure 8b.
Using $P$ as a penalty factor in the loss function avoids expanding the anchor box. This is because the denominators in $P$ depend solely on the target box size and are unaffected by the anchor box size or the smallest enclosing box of the two. Unlike the penalty factors in other loss functions, $P$ does not change when the anchor box is enlarged. Furthermore, $P$ reaches zero only if the anchor box fully overlaps with the target box. In addition, $P$ adapts to the target size. Consequently, we employ a penalty function that is adjusted according to the quality of the anchor box:
$$f(x) = 1 - e^{-x^2} \tag{21}$$
$$\mathrm{PIoU} = \mathrm{IoU} - f(P), \quad -1 \leq \mathrm{PIoU} \leq 1 \tag{22}$$
$$L_{\mathrm{PIoU}} = 1 - \mathrm{PIoU} = L_{\mathrm{IoU}} + f(P), \quad 0 \leq L_{\mathrm{PIoU}} \leq 2 \tag{23}$$
The PIoU loss function guides the anchoring box to faster regression along the effective path and, thus, faster convergence. In particular, the combination of target-size adaptive tuning and loss adjustment for the importance of ship targets is fine-tuned to optimize the requirements specific to ship detection. This method effectively addresses the challenge of identifying multiscale ship targets, particularly in maritime environments where target sizes vary significantly. It enhances the detection accuracy and adaptability to complex conditions, thereby increasing the precision and robustness of ship detection.
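For reference, the following is a minimal PyTorch sketch of the P-IoU loss assembled from Equations (15) and (20)–(23), assuming axis-aligned boxes in $(x_1, y_1, x_2, y_2)$ format; it is a sketch of the formulas above, not the authors’ training code.

```python
import torch


def piou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """P-IoU loss, Eqs. (20)-(23); pred and target are (N, 4) boxes in (x1, y1, x2, y2)."""
    # IoU between predicted and target boxes (Eq. (15))
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # target-size-adaptive penalty factor P (Eq. (20))
    wgt = (target[:, 2] - target[:, 0]).clamp(min=eps)
    hgt = (target[:, 3] - target[:, 1]).clamp(min=eps)
    dw1 = (pred[:, 0] - target[:, 0]).abs()
    dw2 = (pred[:, 2] - target[:, 2]).abs()
    dh1 = (pred[:, 1] - target[:, 1]).abs()
    dh2 = (pred[:, 3] - target[:, 3]).abs()
    p = (dw1 / wgt + dw2 / wgt + dh1 / hgt + dh2 / hgt) / 4

    # quality-adaptive penalty f(P) (Eq. (21)) and final loss L_PIoU = L_IoU + f(P) (Eq. (23))
    f = 1 - torch.exp(-p ** 2)
    return (1 - iou) + f
```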

4. Experiment and Results

4.1. Experimental Platform

The detailed implementation of the proposed YOLO-MSD method is outlined as follows: This method is developed using Python, and the experiments are conducted within a deep learning framework built on Pytorch 1.11.0. YOLOv7-tiny is chosen as the baseline model for this study. The hardware setup includes an Intel(R) Xeon(R) Silver 4210R CPU running at 2.40 GHz, an NVIDIA RTX A6000 GPU, and 512 GB of RAM.

4.2. Datasets

4.2.1. HRSID

The HRSID dataset [39] comprises data obtained from satellite sensors such as TerraSAR-X, Sentinel-1B, and TanDEM-X. From the original 136 large-scale SAR satellite images, 5604 cropped images of 800 × 800 pixels were derived. Among the 16,951 annotated ships, the distribution is 54.5% small, 43.5% medium, and 2% large vessels. The HRSID dataset includes SAR images with resolutions of 0.5 m, 1 m, and 3 m, with all ships annotated using horizontal bounding boxes (HBB). It features a variety of maritime scenes ranging from simple to complex. Throughout the model training phase, the dataset was divided into training, validation, and testing sets at an 8:1:1 ratio.

4.2.2. SSDD

The SSDD dataset comprises 1160 SAR images, with dimensions ranging from 190 to 526 pixels in height and 214 to 668 pixels in width, encompassing 2456 targets [40]. On average, there are 2.12 ships per image. This dataset predominantly sources its data from the RadarSat-2, Sentinel-1, and TerraSAR-X sensors with resolutions between 1 m and 15 m. The target areas are cropped to approximately 500 × 500 pixels. The ship target positions were manually annotated using the PASCAL VOC format. The dataset primarily contains small targets that exhibit diverse features near coasts, in open seas, and across various scales, making it suitable for evaluating the robustness of models. During the training phase of the model, the dataset was segmented into training, validation, and testing subsets, adhering to an 8:1:1 split ratio.

4.3. Model Evaluation

To assess the detection performance of YOLO-MSD in comparison with other methods, we evaluate the detection results using Precision (P), Recall (R), mean average precision (mAP), parameters, and FLOPs. To determine the accuracy of the prediction boxes, we calculate the Intersection over Union (IoU) between the predicted boxes and the ground truth [40]. IoU represents the ratio of the intersection area to the union area of the predicted box and the ground truth, as described by Equation (24).
$$\mathrm{IoU} = \frac{B_p \cap B_{gt}}{B_p \cup B_{gt}} \tag{24}$$
where $B_p$ represents the prediction box and $B_{gt}$ represents the actual ground truth. A higher Intersection over Union (IoU) value indicates more accurate detection results.
For ship detection, three outcomes are possible: true positives (TP), false positives (FP), and false negatives (FN). True positives refer to the number of accurately detected ships, false positives refer to the number of erroneously detected ships, and false negatives refer to the number of missed ships. Precision is defined as the proportion of correctly detected ships out of all detected ships, while Recall is the proportion of correctly detected ships out of the total number of actual ships. Equations (25) and (26) are used to compute the detection accuracy and completeness rates.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{25}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{26}$$
Assessing the detection model based solely on Precision (P) or Recall (R) can be inadequate. Consequently, the F1 score is employed to integrate both P and R for a more holistic evaluation of the model. The formula for the F1 score is provided in Equation (27).
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{27}$$
Average Precision (AP) provides a more comprehensive evaluation of various detection methods. By plotting Recall on the horizontal axis and Precision on the vertical axis, AP represents the area under the Precision-Recall (P-R) curve. The calculation formula for AP is presented in Equation (28).
$$AP = \int_0^1 P(R)\,\mathrm{d}R \tag{28}$$
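As a simple illustration (not the evaluation code used in the experiments), the metrics in Equations (25)–(28) can be computed as follows, where the AP integral is approximated by trapezoidal integration over a sampled precision-recall curve:

```python
import numpy as np


def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, Recall, and F1 from raw detection counts, Eqs. (25)-(27)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1


def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Area under the precision-recall curve, Eq. (28), assuming recalls sorted ascending."""
    return float(np.trapz(precisions, recalls))
```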
Using the pixel count within the ship’s predicted bounding box, ships are classified as small, medium, or large based on COCO index definitions. Subsequently, their detection accuracies are computed. Table 1 provides several definitions for the COCO index.
To enhance the evaluation of the model performance, we introduced additional metrics, namely frames per second (FPS), model parameters, and floating-point operations (FLOPs). The FPS is defined as:
$$FPS = 1/T \tag{29}$$
where T denotes the detection time for a single image. FPS indicates the average frame rate of the validation datasets. The parameters of the convolutional layer can be obtained by Equation (30).
$$\mathrm{Params} = k_H \times k_W \times (C_{in}/g) \times C_{out} \tag{30}$$
where $k_H$ and $k_W$ denote the convolution kernel’s dimensions, $C_{in}$ represents the number of input feature map channels, $C_{out}$ represents the number of output feature map channels, and $g$ is the number of convolution groups. The total model parameters are obtained by summing the parameters of all layers. The FLOPs can be obtained by Equation (31).
$$\mathrm{FLOPs} = (2 \times k_H \times k_W \times C_{in}/g - 1) \times C_{out} \times H_{out} \times W_{out} \tag{31}$$
where $C_{out} \times H_{out} \times W_{out}$ is the total number of units included in the output feature map.
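A small sketch of Equations (30) and (31) for a single convolutional layer is given below; function names are illustrative, and the totals for a network are obtained by summing over all layers as described above.

```python
def conv_params(k_h: int, k_w: int, c_in: int, c_out: int, groups: int = 1) -> int:
    """Parameter count of one convolutional layer, Eq. (30)."""
    return k_h * k_w * (c_in // groups) * c_out


def conv_flops(k_h: int, k_w: int, c_in: int, c_out: int,
               h_out: int, w_out: int, groups: int = 1) -> int:
    """FLOPs of one convolutional layer, Eq. (31): operations per output unit times units."""
    return (2 * k_h * k_w * (c_in // groups) - 1) * c_out * h_out * w_out
```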

4.4. Experimental Results

4.4.1. Comparison with Existing Methods

To further validate the proposed modeling algorithms, we conducted comparative experiments on the HRSID and SSDD datasets under identical conditions. In the experiments, we selected eight classical object detection algorithms for comparison, namely, the two-stage Faster R-CNN [8] and Cascade R-CNN, the one-stage YOLOv5_n, YOLOv7 [7], YOLOv7-tiny, SSD [3], and EfficientDet [41], along with anchorless detection algorithms RetinaNet [42] and CenterNet [43]. Additionally, ten other SAR ship detection methods were evaluated: CSD-YOLO [44], Pow-FAN [45], BL-Net [46], FEPS-Net [47], PPA-Net [48], CMFT [49], MLSDNet [50], FBR-Net [10], MANet [15] and Quad-FPN [16]. The detailed experimental results are presented in Table 2.
The experimental results on HRSID and SSDD indicate that YOLO-MSD, while not achieving the highest accuracy and completeness among all algorithms, excels in mean Average Precision (mAP). Specifically, YOLO-MSD enhances the mAP by 5.9% on the HRSID dataset and by 6.2% on the SSDD dataset compared to the baseline YOLOv7-tiny.
These findings indicate the effectiveness of the proposed method for ship SAR image detection. Faster R-CNN [8] demonstrates the poorest performance, primarily due to its two-stage detection process, which first generates pre-selected frames that are often compromised by the scattering noise prevalent in SAR images. Among single-stage detection algorithms, SSD and EfficientDet exhibit high accuracy but fall short in Recall, achieving less than 50% on the HRSID dataset. The YOLO series, in contrast, achieves higher mAP. In anchorless algorithms, the absence of pre-generated anchor boxes leads to only one anchor box prediction per position. This limitation may result in undetected overlapping or blurred regions, thus reducing the Recall and mAP of RetinaNet and CenterNet. YOLO-MSD achieves the highest mAP on both SSDD and HRSID datasets among the evaluated methods, with an average accuracy of 90.2% on the HRSID dataset. However, the algorithm’s performance on HRSID is inferior to that on SSDD, likely due to the complexity of the HRSID dataset. This complexity arises from scenarios like closely aligned ships in harbors, the simultaneous entry of multiple small targets, and significant cross-scale differences among ships, which challenge the network’s capabilities. Despite these challenges, YOLO-MSD remains superior to the other algorithms.
To assess the model’s complexity and detection speed, we utilized parameters, FLOPs, and FPS metrics. As illustrated in Table 2, YOLO-MSD achieves 45 FPS, with 12.3 M parameters and 33.8 G FLOPs. These results indicate that YOLO-MSD is capable of real-time detection. The introduction of an attention mechanism and additional parameters enhances its accuracy, albeit with increased complexity.
In comparison to all other evaluated algorithms, the proposed YOLO-MSD ship detection model excels in detection performance, demonstrating substantial overall performance. These findings validate the feasibility and effectiveness of the proposed improvements in this study.

4.4.2. Ablation Experiment

For our ablation experiments, we utilized the publicly available SSDD and HRSID datasets as benchmarks to assess the effectiveness of model enhancements across different datasets. The baseline algorithm was YOLOv7-tiny, enabling us to examine the effects of various module modifications. To validate the module efficacy and evaluate the YOLO-MSD model’s detection performance, we employed P, R, mAP50, mAP50-95, and scale-specific accuracy metrics ($AP_S$, $AP_M$, and $AP_L$). Consistent parameter settings were maintained throughout all experiments: 300 epochs, a batch size of 16, and an initial learning rate of 0.01.
The results shown in Table 3 indicate that Experiments 1 and 9 serve as the baseline models, with mAP50 values of 92.6% and 84.3%, respectively. Introducing DPK-Net, BSAM, and P-IoU into the baseline model greatly improves the Precision, Recall, mAP50, and mAP50-95 of the algorithm; the mAP values on the HRSID and SSDD datasets reach 90.2% and 98.8%, improvements of 5.9% and 6.2% over the baseline, respectively.
The results shown in Table 4 indicate that Experiments 1–8 use different combinations of modules. Experiment 1 builds the baseline model on the HRSID dataset, whose mAP50 is 84.3%. Experiment 2 adds the OC module to the backbone network; the numbers of parameters and computations are reduced due to the introduction of PConv, yet all metrics improve over the baseline model. In Experiment 3, after adding only the PK module to the backbone network, all indexes improve significantly compared with the baseline model, and $AP_L$ reaches its highest value of 62.9%, which indicates that the PK module is particularly effective for large object detection. In Experiment 4, adding only the BSAM attention mechanism causes a slight decrease in $AP_S$, while all other metrics improve over the baseline model. In Experiment 5, adding only the P-IoU loss function causes a slight decrease in $AP_M$, while all other indicators improve over the baseline model. In Experiment 6, the PK module is added on top of Experiment 2 to form the DPK-Net. All indexes improve, although $AP_L$ decreases slightly compared with Experiment 3; this indicates that the combination of the PK and OC modules is beneficial for detecting large targets and provides a balanced effect across targets of different scales. Experiment 7 adds the P-IoU loss function on top of Experiment 4, and all indexes improve compared with Experiment 5; the effect on $AP_S$ is especially significant, indicating that the combination of BSAM and P-IoU is more effective for detecting small targets. All blocks are used in Experiment 8, and the indicators achieve a well-balanced effect. The mAP increases from 84.3% to 90.2% compared to the baseline model, indicating a substantial improvement in the detection accuracy of ship targets across various scales, although $AP_M$ and $AP_L$ decrease by 0.1% and 0.7%, respectively, compared with Experiment 6. In addition, the multiscale performance analysis of ships based on SSDD is shown in the Supplementary Material, Table S1.
The PR curve illustrates the relationship between precision and recall during the model training. Typically, a curve that extends closer to the upper right indicates a superior model performance. Figure 9 shows a comparison of the PR curves from the ablation experiments using the HRSID and SSDD datasets. The figure demonstrates that, in comparison to the baseline YOLOv7-tiny model, the YOLO-MSD model achieves the best performance, evidenced by the largest area under the curve.
Figure 10 illustrates the training loss curves for YOLOv7-tiny and YOLO-MSD on both the HRSID and SSDD datasets. Initially, both models show a similar reduction in training loss. However, as training progresses, YOLO-MSD demonstrates a more rapid decline in training loss compared to YOLOv7-tiny after ten training sessions. In conclusion, the YOLO-MSD model introduced in this study effectively reduces loss and accelerates model convergence.
Figure 11 and Figure 12 display the ship detection results for the HRSID and SSDD datasets. YOLOv7-tiny exhibits notable shortcomings in detecting ships of varying sizes and fails to effectively identify ships of different dimensions simultaneously. It also misclassifies certain objects as ships and erroneously detects multiple adjacent ships as a single entity. In contrast, the enhanced model presented in this research significantly outperforms the baseline model, effectively overcoming these limitations.
To further demonstrate the module’s semantic perceptual capabilities, we compare the results of image heat maps across various scales and scenarios. These heat maps illustrate the model’s activity levels in different regions or input data features, reflecting its perception of diverse semantic categories. The heat maps reveal that the improved model delineates object boundaries and texture features more precisely than the baseline model. Figure 13 presents the heat maps of the images at different scales and scenes on the SSDD and HRSID datasets.
As shown in Figure 13a, the improved model outperforms the baseline model in detecting small ships, providing a clearer depiction of the ship boundaries. This enhancement is attributed to the model’s superior feature extraction capabilities, particularly in capturing ship contours and fine details. Figure 13b shows that both models perform well in medium ship detection, but the improved model produces clearer heat maps. In Figure 13c, the improved model outperforms the baseline model in detecting large ships. The inclusion of the BSAM and the P-IoU loss function significantly improves edge feature extraction for large ships with trailing shadows and image blurring, demonstrating the model’s ability to integrate multiple features and enhance its recognition of larger ships.
Figure 13d illustrates that both the baseline and improved models perform well in detecting offshore ships and effectively extracting ship features. Figure 13e,f shows that the improved model surpasses the baseline in detecting inshore ships and densely packed ships. Our proposed method effectively captures image information across multiple scales and identifies ship features of various sizes and shapes, thus enhancing the model’s comprehension of the image.
In summary, the YOLO-MSD model demonstrates significant improvements over the baseline model.

5. Discussion

While YOLO-MSD demonstrates strong detection performance in most scenarios, it still encounters issues with missed detections and inaccuracies. For instance, in situations with numerous overlapping bounding boxes, such as those depicted in Figure 12, where ships are densely docked and their bounding boxes overlap, the model struggles. The indistinct boundaries between features and the presence of redundant elements in the extracted features hinder the detection of centrally located ships, demonstrating the model’s limitations in feature recognition and refinement. Furthermore, the SAR ship dataset is characterized by a scarcity of positive samples, a complex background with numerous negative samples, and a data-dependent deep detection model. This results in suboptimal data utilization during detection and inadequate performance in near-shore and complex environments. Therefore, to enhance the practical utility of YOLO-MSD, it is essential to improve the balance between positive and negative sample effectiveness and bolster robustness and generalization capabilities in diverse scenarios.

6. Conclusions

This study introduces a novel multiscale ship detection algorithm for SAR images, leveraging the YOLO-MSD framework. By utilizing an enhanced DPK-Net as the backbone network and incorporating the BSAM attention mechanism alongside the P-IoU loss function, the proposed algorithm significantly enhances detection performance. Experimental results validate the superior capabilities of the YOLO-MSD model in SAR ship detection tasks. Specifically, when compared to the baseline YOLOv7-tiny algorithm, the proposed method shows a precision improvement of 3.8% and 7.9%, a recall improvement of 4.1% and 2.6%, and a mAP50 improvement of 5.9% and 6.2% on the HRSID and SSDD datasets, respectively.
There are several potential areas for improvement in this research. Future studies within the current framework could focus on these key areas. First, to enhance the model’s generalization ability in specific scenarios, it may be necessary to augment scenario-specific datasets, apply advanced data enhancement techniques, or implement domain adaptation methods. Second, integrating multisource data is crucial for boosting the accuracy and robustness of ship detection. Combining SAR images with data from other sensors, such as optical and radar, can provide more comprehensive target information, particularly under adverse weather conditions. By thoroughly investigating these directions, we can significantly improve the performance and practicality of ship detection technology, thereby providing robust support for related research and practical applications.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jmse12081379/s1, Table S1: Multi-scale performance analysis of SSDD-based ships.

Author Contributions

Conceptualization, S.T. and G.J.; methodology, S.T.; software, S.T. and Y.X.; validation, J.G.; formal analysis, Y.L. (Yang Li); investigation, Y.X.; resources, G.J.; data curation, Y.L. (Yantong Liu); writing—original draft preparation, S.T. and Y.X.; writing—review and editing, S.T.; visualization, S.T.; supervision, L.T. and Y.L. (Yang Li); project administration, G.J.; funding acquisition, G.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 61673017, 61403398, 62001115).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. These data are not publicly available due to the need for future work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Dudczyk, J.; Rybak, Ł. Application of Data Particle Geometrical Divide Algorithms in the Process of Radar Signal Recognition. Sensors 2023, 23, 8183. [Google Scholar] [CrossRef] [PubMed]
  2. Li, J.; Xu, C.; Su, H.; Gao, L.; Wang, T. Deep Learning for SAR Ship Detection: Past, Present and Future. Remote Sens. 2022, 14, 2712. [Google Scholar] [CrossRef]
  3. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar] [CrossRef]
  4. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOV4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  5. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  6. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOV6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  7. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOV7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  9. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  10. Fu, J.; Sun, X.; Wang, Z.; Fu, K. An Anchor-Free Method Based on Feature Balancing and Refinement Network for Multi-scale Ship Detection in SAR Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 1331–1344. [Google Scholar] [CrossRef]
  11. Tang, H.; Gao, S.; Li, S.; Wang, P.; Liu, J.; Wang, S.; Qian, J. A Lightweight SAR Image Ship Detection Method Based on Improved Convolution and YOLOv7. Remote Sens. 2024, 16, 486. [Google Scholar] [CrossRef]
  12. Li, X.; Li, D.; Liu, H.; Wan, J.; Chen, Z.; Liu, Q. A-BFPN: An attention-guided balanced feature pyramid network for SAR ship detection. Remote Sens. 2022, 14, 3829. [Google Scholar] [CrossRef]
  13. Li, Y.G.; Zhu, W.G.; Li, C.X.; Zeng, C.Z. SAR image near-shore ship object detection method in complex background. Int. J. Remote Sens. 2023, 44, 924–952. [Google Scholar] [CrossRef]
  14. Guo, Y.; Chen, S.; Zhan, R.; Wang, W.; Zhang, J. LMSD-YOLO: A lightweight YOLO algorithm for multi-scale SAR ship detection. Remote Sens. 2022, 14, 4801. [Google Scholar] [CrossRef]
  15. Suo, Z.; Zhao, Y.; Hu, Y. An Effective Multi-Layer Attention Network for SAR Ship Detection. J. Mar. Sci. Eng. 2023, 11, 906. [Google Scholar] [CrossRef]
  16. Zhang, T.; Zhang, X.; Ke, X. Quad-FPN: A Novel Quad Feature Pyramid Network for SAR Ship Detection. Remote Sens. 2021, 13, 2771. [Google Scholar] [CrossRef]
  17. Cordonnier, J.B.; Loukas, A.; Jaggi, M. Multi-head attention: Collaborate instead of concatenate. arXiv 2020, arXiv:2006.16362. [Google Scholar]
  18. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar] [CrossRef]
  19. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
  20. Yuan, Y.; Wang, J. OCNet: Object Context Network for Scene Parsing. arXiv 2018, arXiv:1809.00916. [Google Scholar]
  21. Lin, X.; Guo, Y.; Wang, J. Global Correlation Network: End-to-End Joint Multi-Object Detection and Tracking. arXiv 2021, arXiv:2103.12511. [Google Scholar]
  22. Fu, J.; Liu, J.; Tian, H.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar] [CrossRef]
  23. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar] [CrossRef]
  24. Chen, S.; Zhan, R.; Zhang, J. Regional attention-based single shot detector for SAR ship detection. J. Eng. 2019, 21, 7381–7384. [Google Scholar] [CrossRef]
  25. Zhu, C.; Zhao, D.; Liu, Z.; Mao, Y. Hierarchical Attention for Ship Detection in SAR Images. In Proceedings of the IGARSS 2020–2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020. [Google Scholar] [CrossRef]
  26. Yasir, M.; Shanwei, L.; Mingming, X.; Hui, S.; Hossain, S.; Colak, A.T.I.; Wang, D.; Jianhua, W.; Dang, K.B. Multi-scale ship object detection using SAR images based on improved Yolov5. Front. Mar. Sci. 2023, 9, 1086140. [Google Scholar] [CrossRef]
  27. Shan, H.; Fu, X.; Lv, Z.; Zhang, Y. SAR ship detection algorithm based on deep dense sim attention mechanism network. IEEE Sens. J. 2023, 23, 16032–16041. [Google Scholar] [CrossRef]
  28. Zhou, Y.; Liu, H.; Ma, F.; Pan, Z.; Zhang, F. A sidelobe-aware small ship detection network for synthetic aperture radar imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  29. Zhu, H.; Xie, Y.; Huang, H.; Jing, C.; Rong, Y.; Wang, C. DB-YOLO: A duplicate bilateral YOLO network for multi-scale ship detection in SAR images. Sensors 2021, 21, 8146. [Google Scholar] [CrossRef] [PubMed]
  30. Yang, Y.; Chen, J.; Sun, L.; Zhou, Z.; Huang, Z.; Wu, B. Unsupervised Domain-Adaptive SAR Ship Detection Based on Cross-Domain Feature Interaction and Data Contribution Balance. Remote Sens. 2024, 16, 420. [Google Scholar] [CrossRef]
  31. Hu, B.; Miao, H. An Improved Deep Neural Network for Small Ship Detection in SAR Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 2596–2609. [Google Scholar] [CrossRef]
  32. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2023. [Google Scholar] [CrossRef]
  33. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  34. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly Kernel Inception Network for Remote Sensing Detection. arXiv 2024, arXiv:2403.06258. [Google Scholar]
  35. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R.W. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2023. [Google Scholar] [CrossRef]
  36. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar] [CrossRef]
  37. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar] [CrossRef]
  38. Liu, C.; Wang, K.; Li, Q.; Zhao, F.; Zhao, K.; Ma, H. Powerful-IoU: More straightforward and faster bounding box regression loss with a nonmonotonic focusing mechanism. Neural Netw. 2024, 170, 276–284. [Google Scholar] [CrossRef] [PubMed]
  39. Wei, S.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A high-resolution SAR images dataset for ship detection and instance segmentation. IEEE Access 2020, 8, 120234–120254. [Google Scholar] [CrossRef]
  40. Zhang, T.; Zhang, X.; Li, J.; Xu, X.; Wang, B.; Zhan, X.; Xu, Y.; Ke, X.; Zeng, T.; Su, H.; et al. SAR ship detection dataset (SSDD): Official release and comprehensive data analysis. Remote Sens. 2021, 13, 3690. [Google Scholar] [CrossRef]
  41. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  42. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 42, 318–327. [Google Scholar] [CrossRef]
  43. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
  44. Chen, Z.; Liu, C.; Filaretov, V.F.; Yukhimets, D.A. Multi-Scale Ship Detection Algorithm Based on YOLOv7 for Complex Scene SAR Images. Remote Sens. 2023, 15, 2071. [Google Scholar] [CrossRef]
  45. Xiao, M.; He, Z.; Li, X.; Lou, A. Power Transformations and Feature Alignment Guided Network for SAR Ship Detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  46. Zhang, T.; Zhang, X.; Liu, C.; Shi, J.; Wei, S.; Ahmad, I.; Zhan, X.; Zhou, Y.; Pan, D.; Li, J.; et al. Balance learning for ship detection from synthetic aperture radar remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 182, 190–207. [Google Scholar] [CrossRef]
  47. Bai, L.; Yao, C.; Ye, Z.; Xue, D.; Lin, X.; Hui, M. Feature Enhancement Pyramid and Shallow Feature Reconstruction Network for SAR Ship Detection. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2023, 16, 1042–1056. [Google Scholar] [CrossRef]
  48. Tang, G.; Zhao, H.; Claramunt, C.; Zhu, W.; Wang, S.; Wang, Y.; Ding, Y. PPA-Net: Pyramid Pooling Attention Network for Multi-Scale Ship Detection in SAR Images. Remote Sens. 2023, 15, 2855. [Google Scholar] [CrossRef]
  49. He, J.; Su, N.; Xu, C.; Liao, Y.; Yan, Y.; Zhao, C.; Hou, W.; Feng, S. A Cross-Modality Feature Transfer Method for Target Detection in Sar Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  50. Chang, H.; Fu, X.; Dong, J.; Liu, J.; Zhou, Z. MLSDNet: Multiclass Lightweight SAR Detection Network Based on Adaptive Scale Distribution Attention. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
Figure 1. The overall network structure of YOLO-MSD. The DPK-Net, comprising the OC Module and the PK Module, is first constructed as the backbone network; the BSAM is then introduced at the neck; and finally, the P-IoU loss is applied at the regression stage.
Figure 2. The structure of the DPK-Net. The backbone network extracts the critical features of the input image and outputs feature maps at three different scales.
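To make the backbone's output contract concrete, the following PyTorch sketch shows a toy backbone that, like DPK-Net, maps one input image to feature maps at three scales (strides 8, 16, and 32). The layer widths and the use of plain convolutions are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class ThreeScaleBackbone(nn.Module):
    """Toy stand-in for DPK-Net's output contract: one input image in,
    three feature maps out at strides 8, 16, and 32 (channel widths are
    illustrative assumptions, not the paper's exact configuration)."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(  # stride 8
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU())
        self.stage2 = nn.Sequential(  # stride 16
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.SiLU())
        self.stage3 = nn.Sequential(  # stride 32
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.SiLU())

    def forward(self, x):
        p3 = self.stage1(x)
        p4 = self.stage2(p3)
        p5 = self.stage3(p4)
        return p3, p4, p5

feats = ThreeScaleBackbone()(torch.randn(1, 3, 640, 640))
print([f.shape[-1] for f in feats])  # [80, 40, 20]
```

These three maps are what the neck, where the BSAM is inserted, subsequently fuses.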
Figure 3. Detailed design of the OC Module. (a) shows the structure of the OC Module. (b,c) show a detailed comparison between regular convolution and PConv.
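To make the comparison in (b,c) concrete, the sketch below implements partial convolution (PConv), in which only a fraction of the channels are convolved and the remainder pass through untouched. The 1/4 split ratio is an assumption for illustration, and this block does not reproduce the authors' full OC Module.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: a 3x3 conv is applied to only a fraction of the
    channels; the remaining channels are passed through unchanged. This is a
    minimal sketch of the PConv idea, not the authors' exact OC Module."""
    def __init__(self, channels: int, n_div: int = 4):
        super().__init__()
        self.dim_conv = channels // n_div          # channels that are convolved
        self.dim_keep = channels - self.dim_conv   # channels passed through
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_conv, x_keep = torch.split(x, [self.dim_conv, self.dim_keep], dim=1)
        return torch.cat((self.conv(x_conv), x_keep), dim=1)

# quick shape check
y = PConv(64)(torch.randn(1, 64, 80, 80))
print(y.shape)  # torch.Size([1, 64, 80, 80])
```

Because only a quarter of the channels enter the 3x3 kernel, the parameter and FLOP counts drop accordingly, which is the redundancy reduction the OC Module targets.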
Figure 4. Depthwise separable convolution.
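Depthwise separable convolution factorizes a standard convolution into a per-channel (depthwise) convolution followed by a 1 x 1 pointwise convolution that mixes channels; a minimal sketch:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel 3x3 conv (groups equal
    to the number of input channels) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

print(DepthwiseSeparableConv(32, 64)(torch.randn(1, 32, 40, 40)).shape)
```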
Figure 5. The PK Module starts with a small-kernel convolution to capture local information and then applies parallel DWConvs to capture multiscale context.
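A minimal sketch of this poly kernel pattern follows, assuming a 3 x 3 local convolution, parallel depthwise convolutions with kernel sizes 5, 7, and 9, and sum-then-1 x 1 fusion; the actual PK Module's kernel sizes and fusion scheme may differ.

```python
import torch
import torch.nn as nn

class PolyKernelBlock(nn.Module):
    """Sketch of the poly kernel idea: a small-kernel conv captures local
    detail, then parallel depthwise convs with growing kernels add multiscale
    context. Kernel sizes and sum-fusion are illustrative assumptions."""
    def __init__(self, channels: int, kernels=(5, 7, 9)):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.context = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels, bias=False)
            for k in kernels
        )
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        y = self.local(x)
        y = y + sum(branch(y) for branch in self.context)  # add multiscale context
        return self.fuse(y)

print(PolyKernelBlock(64)(torch.randn(1, 64, 40, 40)).shape)
```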
Figure 6. The structure of the BSAM attention mechanism.
Figure 7. The overall structure of BRA.
Figure 8. IoU-based losses. The loss functions in (a) incorporate size information by using the diagonal length of the smallest box enclosing the anchor and target boxes (the gray dashed box) as the denominator of their penalty terms. In contrast, the P-IoU loss in (b) uses only the edge lengths of the target box as the denominator of its penalty factor.
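The distinction in Figure 8 can be written out in a few lines: a P-IoU-style loss penalizes the four anchor-target edge offsets normalized by the target box's own width and height rather than by an enclosing-box diagonal. The sketch below is a simplified illustration of that idea; the penalty definition used here (the four offsets averaged and mapped through 1 - exp(-P^2)) is our reading of the Powerful-IoU formulation and is not the authors' training code.

```python
import torch

def piou_style_loss(pred, target, eps=1e-7):
    """Sketch of a P-IoU-style loss: (1 - IoU) plus a penalty whose edge
    offsets are normalized by the *target* box size only, with no enclosing-
    box diagonal. Boxes are (x1, y1, x2, y2); constants are assumptions."""
    # intersection-over-union
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # edge offsets normalized by the target box's width and height
    w_t = target[..., 2] - target[..., 0]
    h_t = target[..., 3] - target[..., 1]
    dw1 = (pred[..., 0] - target[..., 0]).abs()
    dw2 = (pred[..., 2] - target[..., 2]).abs()
    dh1 = (pred[..., 1] - target[..., 1]).abs()
    dh2 = (pred[..., 3] - target[..., 3]).abs()
    p = ((dw1 + dw2) / (w_t + eps) + (dh1 + dh2) / (h_t + eps)) / 4
    return 1 - iou + (1 - torch.exp(-p ** 2))

pred = torch.tensor([[10.0, 10.0, 50.0, 40.0]])
gt = torch.tensor([[12.0, 8.0, 52.0, 42.0]])
print(piou_style_loss(pred, gt))
```

Because the denominator is the target size itself, the penalty scales with the ship being regressed, which is what allows the loss to adapt to small and large ships alike.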
Figure 9. PR curves for ablation experiments: (a) is based on HRSID, and (b) is based on SSDD.
Figure 10. The loss curves of the proposed YOLO-MSD and the original YOLOv7-tiny model: (a) is based on HRSID, and (b) is based on SSDD.
Figure 11. Ship detection results for the HRSID.
Figure 12. Ship detection results for the SSDD.
Figure 13. Heat map of images at different scales and in different scenes. (a) Small ship. (b) Medium ship. (c) Large ship. (d) Offshore ships. (e) Inshore ships. (f) Dense inshore ships.
Table 1. The definition of some COCO indicators.

Metric | Meaning
AP | AP for IoU = 0.50:0.05:0.95
AP50 | AP for IoU = 0.50
APS | AP for small targets (area < 32²)
APM | AP for medium targets (32² < area < 96²)
APL | AP for large targets (area > 96²)
FPS | Frames per second
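The size buckets and the IoU sweep in Table 1 follow the standard COCO convention; the short sketch below makes the thresholds explicit (areas are in pixels of the ground-truth box).

```python
import numpy as np

# COCO conventions assumed in Table 1: AP averages over 10 IoU thresholds,
# and targets are binned by pixel area at 32^2 and 96^2.
IOU_THRESHOLDS = np.linspace(0.50, 0.95, 10)  # 0.50, 0.55, ..., 0.95

def size_bucket(area: float) -> str:
    """Assign a ground-truth box to the APS / APM / APL bucket."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

print(len(IOU_THRESHOLDS), size_bucket(25 * 25), size_bucket(120 * 90))
```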
Table 2. Experimental results of comparative experiments.

Method | P (HRSID) | R (HRSID) | mAP (HRSID) | P (SSDD) | R (SSDD) | mAP (SSDD) | FPS | Params (M) | FLOPs (G)
Faster-RCNN | 0.378 | 0.560 | 0.454 | 0.502 | 0.944 | 0.851 | 13 | 41.32 | 51.4
Cascade R-CNN | 0.739 | 0.634 | 0.651 | 0.908 | 0.941 | 0.905 | 25 | 68.93 | 119.0
SSD | 0.928 | 0.438 | 0.681 | 0.936 | 0.552 | 0.899 | 92 | 23.7 | 30.4
EfficientDet | 0.969 | 0.331 | 0.484 | 0.959 | 0.533 | 0.713 | 29 | 3.8 | 2.3
YOLOv5_n | 0.890 | 0.717 | 0.776 | 0.925 | 0.833 | 0.897 | 95 | 1.9 | 4.5
YOLOv7 | 0.847 | 0.724 | 0.819 | 0.928 | 0.782 | 0.902 | 56 | 37.1 | 105.1
YOLOv7-tiny | 0.864 | 0.747 | 0.843 | 0.889 | 0.876 | 0.926 | 96 | 6.008 | 13.0
RetinaNet | 0.980 | 0.395 | 0.534 | 0.976 | 0.623 | 0.698 | 34 | 36.3 | 10.1
CenterNet | 0.948 | 0.696 | 0.788 | 0.948 | 0.604 | 0.785 | 48 | 32.6 | 6.7
CSD-YOLO | 0.932 | 0.804 | 0.861 | 0.959 | 0.959 | 0.986 | - | - | -
Pow-FAN | 0.885 | 0.837 | 0.897 | 0.946 | 0.965 | 0.963 | 31 | 136 | -
BL-Net | 0.915 | 0.897 | 0.886 | 0.912 | 0.961 | 0.952 | 5 | 47.84 | 17.8
FEPS-Net | - | - | 0.897 | - | - | 0.960 | 32 | 37.31 | -
PPA-Net | 0.903 | 0.882 | 0.893 | 0.952 | 0.912 | 0.952 | - | - | -
CMFT | 0.813 | 0.911 | 0.896 | 0.924 | 0.981 | 0.973 | - | - | -
MLSDNet | - | - | 0.897 | - | - | 0.974 | 75 | 6.8 | 18.4
FBR-Net | - | - | 0.896 | 0.928 | 0.940 | 0.941 | 25 | 32.51 | 41.3
MANet | 0.871 | 0.782 | 0.863 | 0.953 | 0.949 | 0.957 | - | - | -
Quad-FPN | 0.880 | 0.873 | 0.861 | 0.895 | 0.958 | 0.953 | 11 | - | -
Ours | 0.902 | 0.788 | 0.902 | 0.968 | 0.902 | 0.988 | 45 | 12.3 | 33.8
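The "Params (M)" column reports trainable parameters in millions; for any PyTorch model this figure can be obtained directly, as the short sketch below shows (the tiny model is only an illustrative stand-in), while FLOPs additionally require an input-size-dependent profiler and are not computed here.

```python
import torch.nn as nn

def params_in_millions(model: nn.Module) -> float:
    """Count trainable parameters and report them in millions,
    matching the Params (M) convention used in Table 2."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# A tiny stand-in model; substitute any detector to reproduce its Params (M) entry.
toy = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.SiLU(), nn.Conv2d(32, 64, 3, padding=1))
print(f"{params_in_millions(toy):.3f} M")  # 0.019 M for this toy model
```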
Table 3. Comparison of results for different improvement points added to the model.

Experiment | OC Module | PK Module | BSAM | P-IoU | Dataset | P | R | mAP50 | mAP50-95
1 | - | - | - | - | SSDD | 0.889 | 0.876 | 0.926 | 0.578
2 | √ | - | - | - | SSDD | 0.934 | 0.880 | 0.948 | 0.606
3 | - | √ | - | - | SSDD | 0.908 | 0.894 | 0.945 | 0.575
4 | - | - | √ | - | SSDD | 0.895 | 0.868 | 0.931 | 0.552
5 | - | - | - | √ | SSDD | 0.941 | 0.909 | 0.966 | 0.651
6 | √ | √ | - | - | SSDD | 0.935 | 0.921 | 0.967 | 0.658
7 |  |  |  |  | SSDD | 0.927 | 0.951 | 0.970 | 0.657
8 | √ | √ | √ | √ | SSDD | 0.968 | 0.902 | 0.988 | 0.661
9 | - | - | - | - | HRSID | 0.864 | 0.747 | 0.843 | 0.555
10 | √ | - | - | - | HRSID | 0.885 | 0.823 | 0.852 | 0.586
11 | - | √ | - | - | HRSID | 0.865 | 0.754 | 0.878 | 0.555
12 | - | - | √ | - | HRSID | 0.874 | 0.772 | 0.856 | 0.560
13 | - | - | - | √ | HRSID | 0.871 | 0.766 | 0.854 | 0.559
14 | √ | √ | - | - | HRSID | 0.895 | 0.785 | 0.886 | 0.568
15 |  |  |  |  | HRSID | 0.884 | 0.788 | 0.877 | 0.583
16 | √ | √ | √ | √ | HRSID | 0.902 | 0.788 | 0.902 | 0.592
“√” indicates that the current module was used, and “-” indicates that it was not used.
Table 4. Multiscale performance analysis of HRSID-based ships.

Experiment | OC Module | PK Module | BSAM | P-IoU | mAP | APS | APM | APL | Params (M) | FLOPs (G)
1 | - | - | - | - | 0.843 | 0.462 | 0.748 | 0.215 | 6.008 | 13.0
2 | √ | - | - | - | 0.852 | 0.487 | 0.758 | 0.279 | 6.007 | 13.0
3 | - | √ | - | - | 0.878 | 0.496 | 0.784 | 0.629 | 11.233 | 18.8
4 | - | - | √ | - | 0.856 | 0.460 | 0.753 | 0.259 | 8.599 | 30.6
5 | - | - | - | √ | 0.854 | 0.464 | 0.746 | 0.280 | 6.008 | 13.0
6 | √ | √ | - | - | 0.886 | 0.518 | 0.785 | 0.433 | 9.689 | 16.0
7 |  |  |  |  | 0.877 | 0.494 | 0.758 | 0.285 | 10.905 | 31.1
8 | √ | √ | √ | √ | 0.902 | 0.534 | 0.784 | 0.426 | 12.281 | 33.6
“√” indicates that the current module was used, and “-” indicates that it was not used.