1. Introduction
The Synthetic Aperture Radar (SAR) system boasts unparalleled capabilities for all-weather surveillance, offering distinct advantages over traditional optical remote sensing satellites for terrestrial observation [1,2]. Variations in the brightness of SAR imagery are intricately linked to the backscatter coefficient of the target area, mirroring critical physical parameters such as surface texture, soil moisture, and the complex dielectric constant. SAR technology has shown exceptional promise in various fields, including marine environmental monitoring, geological surveys, agricultural development, emergency management, and naval reconnaissance [3,4,5,6].
Ship detection is pivotal for ensuring maritime safety and providing essential data support. However, the utility of optical imagery is significantly hampered by its inherent limitations, chiefly its inability to facilitate real-time monitoring of vessels during nocturnal hours. Moreover, optical imagery often fails in conditions of sea fog, dense cloud coverage, rainfall, and other extreme or adverse weather scenarios, leading to a complete loss of the detection targets in the imagery [7]. In contrast, SAR boasts remarkable penetration capabilities and provides consistent monitoring across all weather conditions, be it day or night, under cloud cover, or amidst rainfall. These inherent benefits of SAR technology effectively offset the shortcomings of optical satellites, delivering a robust, continuous, and reliable data source for real-time ship monitoring.
In the past, a large number of excellent machine learning and machine vision solutions have been applied to ship identification from the perspectives of data processing and image processing, such as Support Vector Machines (SVMs) [8], threshold segmentation [9], statistical feature methods [3], and various manually crafted feature-based methods [10,11]. Nevertheless, these approaches exhibit limited performance in scenarios characterized by complex environmental backgrounds, ship deformations, and prevalent noise [12]. Furthermore, while these traditional methods may show exceptional capabilities in certain specific scenarios, they lack versatility and adaptability across diverse situations [13]. Deep learning has proven adept at navigating the intricacies of object detection tasks, showcasing remarkable adaptability in complex scenarios [14,15]. Its proficiency in multi-class detection, alongside its accuracy and data processing capabilities, has propelled its application to the analysis of SAR imagery, and these techniques have yielded substantial improvements in ship detection using SAR images [16]. General object detection algorithms are predominantly classified into two broad categories: one-stage and two-stage methods. One-stage detection algorithms, such as SSD [17] and YOLO [18], approach detection as an end-to-end regression challenge; though they compromise on the accuracy of detecting smaller objects, they exhibit a significant advantage in real-time detection. In contrast, classical two-stage detection algorithms, such as R-CNN [19], Faster R-CNN [20], Mask R-CNN [21], and SPP-Net [22], enhance the model's sensitivity towards smaller objects but introduce computational complexity that can hinder overall detection efficiency.
Detecting ships on water surfaces demands algorithms that deliver both high speed and robust performance. Given the typical characteristics of images obtained through satellite remote sensing, a target ship usually occupies only a minor portion of the overall pixel count, making it a small target. To this end, a vast array of detection algorithms specifically tailored for ship identification has been introduced in recent research. For instance, Kang et al. [23] utilized a blend of downsampled shallow layers and upsampled deeper layers to create region proposals. Li et al. [24] applied high-frequency sub-band channel fusion to enhance ship features in SAR imagery. Yu et al. [25] enhanced the YOLOv5 model with a bidirectional feature pyramid network to improve the detection of ships at various scales. Chen et al. [26] introduced the SAS-FPN module, merging atrous spatial pyramid pooling with shuffle attention to minimize the loss of ship features. Li et al. [27] achieved rotation-invariant properties of the SAR ship hull in the frequency domain by applying a Fourier transform to the multi-scale features derived from the Feature Pyramid Network (FPN). Additionally, Chen et al. [28] formulated a novel loss function integrated with GIoU to diminish the network's sensitivity to scale variations, thereby enhancing detection of ship targets in SAR imagery with complex backgrounds. Despite the availability of advanced deep learning methods for ship detection, SAR imagery, with its unique radar imaging characteristics, presents considerable interpretation challenges. Factors such as self-scattering mechanisms, terrain interference, and atmospheric noise contribute to salt-and-pepper noise, blurred edges, and shape distortion during imaging, posing significant hurdles to accurately locating and identifying ship shapes [29].
Detecting ships using SAR imagery presents numerous challenges: (1) In SAR imagery, the majority of ships are depicted by only a small cluster of pixels, imposing high requirements on the detection algorithm's ability to recognize small targets. (2) Salt-and-pepper noise and radar interference within SAR images lead to blurred boundaries, ghosting effects, and misplacement of the identified ships, significantly undermining the reliability of target identification and localization. (3) The visual interpretation of SAR imagery depends heavily on the expertise of domain specialists, and considerable inconsistencies in dataset annotation quality further complicate ship detection.
To tackle the aforementioned challenges, this paper introduces a Multi-Location Cross-Attention Ship Detection Network (LCAS-DetNet), based on the YOLO architecture [18]. The architecture comprises a feature extraction component, an intermediate neck layer, and a detection head. To mitigate the effects of noise interference in detection tasks, we present the Multi-Location Cross-Attention (MLCA) framework. MLCA integrates both global and local perspectives to pinpoint ship locations with high precision. It employs a dual-branch structure that acts as a filter during feature extraction, identifying relevant targets amid perceptual features and eliminating non-informative and random noise points. To enhance the accuracy of determining the spatial location of ships, our approach maintains the spatial coordinates of the target across multiple scales by leveraging a residual structure for location data. Notably, drawing inspiration from Wise-IoUv3 [30], this research integrates a focusing mechanism into the loss calculation by adopting the degree of deviation as a weight factor, which helps minimize the influence of dataset quality discrepancies on gradient updates. The main contributions are as follows:
We propose a Multi-Location Cross-Attention (MLCA) mechanism to enhance the network's focus on key regions, suppress salt-and-pepper noise interference, and highlight ship features.
A multi-scale cascaded feature extractor with spatial coordinates is proposed, which extracts ship features while retaining the multi-scale spatial coordinates of the ships.
We integrate the Wise-IoUv3 [30] loss with its focusing mechanism into the bounding-box regression loss calculation, which helps minimize the influence of dataset quality differences on gradient updates.
We propose a Multi-Location Cross-Attention Ship Detection Network (LCAS-DetNet), a multi-scale cascaded object detection algorithm. LCAS-DetNet was tested on two ship detection datasets (HRSID [31] and SSDD [32]), and the results demonstrate state-of-the-art performance.
2. Methods
This study proposes LCAS-DetNet for detecting surface ships in SAR images. LCAS-DetNet provides a small-target detection solution with advantages in denoising and real-time performance for ship detection tasks. As shown in Figure 1, it follows the YOLO structure, consisting of a multi-location feature extractor, a feature fusion neck layer, and a detection head. To more accurately pinpoint the locations of small target ships, this study introduces an attention scheme called Multi-Location Cross-Attention (MLCA) into the feature extractor. MLCA implicitly re-encodes the spatial locations of targets in the feature extractor, enhancing the network's spatial perception and object localization capabilities. At the same time, it acts as a filter, calculating correlations to filter out salt-and-pepper noise in SAR data.
Through its Multi-Location Cross-Attention mechanism, LCAS-DetNet demonstrates significant potential for enhancing the precision of ship localization. This improvement is primarily facilitated by integrating implicit global position encoding, achieved through the attention mechanism, with local position encoding enabled by convolutional operations. This dual encoding framework effectively delineates the spatial relationships between a ship and its surrounding environment, as well as the interactions among ships at varying distances. Consequently, the design offers more possibilities for multi-target monitoring.
2.1. Multi-Location Cross-Attention
The success of ViT [33] has brought the Transformer into visual computing, prompting increased discussion among researchers regarding the significance of attention mechanisms in computer vision. Inspired by [34,35], this study proposes a Multi-Location Cross-Attention (MLCA) mechanism to further explore the utility of attention mechanisms in SAR images and ship detection tasks.
As shown in Figure 2, MLCA consists of local and global branches that obtain features at different scales and spatial coordinate locations. This form of attention is referred to as cross-attention [36]. Within each branch, additional branches are embedded to acquire attention from both channel and spatial perspectives, named Multi-Location Attention (MLA). On one hand, this stabilizes ship features against interference caused by noise from multiple dimensions. On the other hand, the spatial position of ships can be implicitly encoded through a correlation calculation.
To maintain the plug-and-play characteristic of MLCA, the channel dimensions of the input and output, as well as the tensor shape, remain consistent after the complex internal transformations. Assuming the input $x \in \mathbb{R}^{C \times H \times W}$, the output $y \in \mathbb{R}^{C \times H \times W}$. After $x$ undergoes normalization–convolution–activation, the channel dimension is increased to $2C$, providing homologous raw material for the dual-branch structure.
As shown in Figure 3, after the channel-expanded feature is sliced along the channel dimension, the two halves are separately input into the global and local branches. Patches of different scales are obtained through serialization; assume the patch window parameter is $q$. The window $q$ plays different roles in the two branches: in the global branch, $q$ represents the number of patch windows, while in the local branch, $q$ represents the size of a patch window.
$q$ is a hyperparameter that can take different values depending on the task and dataset. In the global branch, the serialized tensor has shape $q \times \frac{HW}{q} \times C$; in the local branch, it has shape $\frac{HW}{q^2} \times q^2 \times C$. MLCA thus obtains spatial positions and receptive fields of different scales within the same layer; such cross-attention introduces richer features and spatial information into the network.
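A minimal sketch of the two serialization schemes under the tensor shapes assumed above (divisibility of $HW$ by $q$ and of $H$, $W$ by the window size is assumed; the paper's exact slicing may differ):

```python
import torch

def serialize_global(x, q):
    # x: (B, C, H, W) -> (B, q, H*W//q, C): q patch windows, each a
    # length-(H*W/q) token sequence (assumed layout).
    B, C, H, W = x.shape
    return x.flatten(2).transpose(1, 2).reshape(B, q, H * W // q, C)

def serialize_local(x, q):
    # x: (B, C, H, W) -> (B, H*W//q**2, q*q, C): non-overlapping q x q
    # windows, each flattened into q*q tokens (assumed layout).
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // q, q, W // q, q)
    x = x.permute(0, 2, 4, 3, 5, 1)          # (B, H/q, W/q, q, q, C)
    return x.reshape(B, (H // q) * (W // q), q * q, C)
```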
As illustrated in Figure 3a, the MLA module contains channel and spatial branches; we take the global branch as an example.
In the channel branch, the input is divided into two parts along the channel dimension, resulting in two different sequences of the same shape. One sequence serves as the input reserve, while the other serves as an index to search for features with high similarity. As shown in Equation (1), the two sequences measure spatial distance through a dot product in the channel dimension, which serves as the evaluation criterion for feature correlation and, during backpropagation, guides the weight update of the reserve sequence. We believe that such a design helps resist interference caused by water surface fluctuations and abnormal weather conditions at sea, which can cause water bodies that should exhibit surface scattering to be confused with the volume scattering from ships.
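A minimal sketch of this channel-wise dot-product interaction, with the two halves acting as reserve and index as described (the variable names and softmax scaling are illustrative, not the paper's):

```python
import torch
import torch.nn.functional as F

def channel_branch(x):
    # x: (B, N, L, 2C) - serialized windows with doubled channels.
    reserve, index = x.chunk(2, dim=-1)        # two (B, N, L, C) halves
    # Dot product over the channel dimension measures feature similarity
    # between token positions within each window (Equation (1) in spirit).
    sim = torch.matmul(index, reserve.transpose(-2, -1))   # (B, N, L, L)
    attn = F.softmax(sim / reserve.shape[-1] ** 0.5, dim=-1)
    return torch.matmul(attn, reserve)          # re-weighted reserve
```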
In the spatial branch, the emphasis is on pixel-level correlation on the spatial plane, that is, along the $H \times W$ dimensions. As shown in Figure 3, when the feature map is input into the spatial branch, it is divided into two slices along the $h$ dimension. As shown in Equation (2), a pixel-level dot-product interaction is performed between the slices. To ensure the integrity of the attention over the feature maps, we apply a linear transformation with learnable weights to each slice and then perform a dot product between each original slice and the transformed counterpart of the other, guiding backpropagation and updating the learnable weight coefficients. It is worth noting that, to ensure multi-perspective learning of attention, the local branch performs the slice interaction along the $w$ dimension. We believe that a pixel-level, multi-perspective dot product can better cope with the challenges brought by salt-and-pepper noise in SAR imaging: such high-granularity attention calculation can accurately focus on the pixel values of the target object while ignoring small salt-and-pepper noise.
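A minimal sketch of the spatial-slice interaction, assuming each slice is paired with a learnable transform of the other (the elementwise product stands in for the pixel-level dot product; names are illustrative):

```python
import torch
import torch.nn as nn

class SpatialBranch(nn.Module):
    # Operates on (B, C, H, W); splits along H (global branch) and lets
    # each half interact with a learnable projection of the other.
    def __init__(self, width):
        super().__init__()
        self.proj_a = nn.Linear(width, width, bias=False)
        self.proj_b = nn.Linear(width, width, bias=False)

    def forward(self, x):
        a, b = x.chunk(2, dim=2)            # two (B, C, H/2, W) slices
        # Pixel-level products between each slice and the learnable
        # transform of the other (Equation (2) in spirit).
        att_a = a * self.proj_b(b)
        att_b = b * self.proj_a(a)
        return torch.cat([att_a, att_b], dim=2)
```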
We sum the obtained channel and spatial attention to obtain the Multi-Location Attention, as shown in Equation (3), and add the global and local attention linearly to obtain the final MLCA.
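With illustrative names for the attention maps, this composition can be written as:

$$\mathrm{MLA} = A_{\mathrm{channel}} + A_{\mathrm{spatial}}, \qquad \mathrm{MLCA} = \mathrm{MLA}_{\mathrm{global}} + \mathrm{MLA}_{\mathrm{local}}$$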
2.2. Multi-Scale Cascade Feature Extractor
The entire multi-scale cascade feature extractor consists of stacked MLCA blocks, as shown in Figure 4a. After passing through each MLCA block, a feature map of shape $C \times H \times W$ becomes $2C \times \frac{H}{2} \times \frac{W}{2}$. Through the cascading design, our feature extractor allows information to be passed and exchanged between different layers, combining features from different layers to produce richer and more complex feature representations, which enhances the network's representational and learning capabilities. In the SAR image ship detection task of this study, we obtain 8×, 16×, and 32× downsampled features from the multi-scale cascade feature extractor and pass them to the neck layer for feature fusion.
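A minimal sketch of the cascade under the assumption that each stage doubles channels while halving resolution; `stage` here is a stand-in for an MLCA block plus downsampling module, and a stem mapping the image to $c$ channels is assumed:

```python
import torch.nn as nn

def stage(c_in, c_out):
    # Stand-in for one MLCA block followed by the downsampling module.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.GELU())

class CascadeExtractor(nn.Module):
    # Five stages; stages 3-5 emit the 8x/16x/32x downsampled features
    # that are forwarded to the neck for fusion.
    def __init__(self, c=32):
        super().__init__()
        self.stages = nn.ModuleList(
            stage(c * 2 ** i, c * 2 ** (i + 1)) for i in range(5)
        )

    def forward(self, x):
        outs = []
        for i, blk in enumerate(self.stages):
            x = blk(x)
            if i >= 2:                # keep outputs of stages 3, 4, 5
                outs.append(x)
        return outs                   # [8x, 16x, 32x] features
```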
MLCA module: This is the core of LCAS-DetNet, designed to handle the challenges posed by small targets and noise interference. The MLCA module is a dual-branch structure, consisting of a local branch and a global branch that form the cross-attention, as shown in Figure 2. The input feature of the MLCA module is passed through layer normalization–convolution–GELU (NCG) to provide raw material for the two branches. NCG is a reusable module that ensures the learnability of the attention mechanism in feature maps using convolution, normalization, and the GELU activation function, continuously updating the relevance weights through backpropagation; it performs a nonlinear transformation of the features while computing the relevance of different targets. Upon entering the branches, the feature map is serialized and undergoes NCG to increase the dimensionality, serving as the input for the MLA block. After the transformation shown in Figure 4, the feature map generates attention maps for both the global and local branches. These attention maps are concatenated and then linearly integrated; following a residual connection with the input, an output with attention is obtained. The network extracts multi-scale features and obtains multi-scale attention maps by vertically stacking MLCA blocks, enriching the network's receptive field.
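A minimal sketch of the NCG composition (the normalization variant and kernel size are assumptions):

```python
import torch.nn as nn

class NCG(nn.Module):
    # Normalization -> convolution -> GELU, as described in the text.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.norm = nn.GroupNorm(1, c_in)  # channel-wide LayerNorm stand-in
        self.conv = nn.Conv2d(c_in, c_out, 1)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.conv(self.norm(x)))
```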
Downsampling module: This is an important part of the feature extractor, designed to decrease the feature map size while retaining important information. It consists of four parts: Depthwise Separable Convolution (DWConv) [37], batch normalization [38], the GELU activation function [39], and MaxPooling [40], as illustrated in Figure 4b. DWConv [37] is an advanced convolutional operation that combines depthwise convolution with pointwise convolution; it significantly reduces the number of parameters while extracting both spatial and channel features. Batch normalization [38] adjusts the input values of the activation function to lie within a more responsive region, thereby increasing the gradient and mitigating gradient vanishing. GELU [39] is a nonlinear activation function based on the Gaussian error function, endowing the network with superior expressive and abstraction capabilities. MaxPooling [40] reduces the spatial size of the feature map by half.
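A minimal sketch of this module following the stated DWConv–BN–GELU–MaxPool order (kernel sizes are assumptions):

```python
import torch.nn as nn

class Downsample(nn.Module):
    # DWConv -> BatchNorm -> GELU -> MaxPool, halving spatial size.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in),  # depthwise
            nn.Conv2d(c_in, c_out, 1),                         # pointwise
            nn.BatchNorm2d(c_out),
            nn.GELU(),
            nn.MaxPool2d(2),                          # H, W -> H/2, W/2
        )

    def forward(self, x):
        return self.block(x)
```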
Patch embedding module: This is a residual structure used for the multi-scale localization of target ships. The module consists of an adaptive convolution whose kernel size equals the patch size; it encodes the implicit local positions of the target ships in units of patches, as illustrated in Figure 4c. As the network deepens, the positional information from the upper layer is transmitted to the next layer through residual connections. Furthermore, as the feature scale decreases layer by layer while the patch size remains unchanged, the receptive field of the patch embedding gradually expands, allowing position information to be acquired from multiple scales and perspectives and leading to better ship localization, as shown in Figure 5.
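A minimal sketch of such a residual, patch-sized convolutional position encoding (the depthwise grouping and even-kernel crop are implementation assumptions):

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Residual convolutional position encoding: kernel size = patch size,
    # so each output pixel aggregates one patch-sized neighborhood.
    def __init__(self, channels, patch=4):
        super().__init__()
        self.pos = nn.Conv2d(channels, channels, patch,
                             padding=patch // 2, groups=channels)

    def forward(self, x):
        p = self.pos(x)
        # Crop to match x when the even kernel makes the output 1 larger.
        return x + p[..., :x.shape[2], :x.shape[3]]
```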
2.3. Loss Function
Within LCAS-DetNet, the loss function comprises two components: a category classification loss ($L_{cls}$) and a bounding-box regression loss ($L_{box}$).
The category classification loss is depicted in Equation (4); it determines whether the predicted category corresponds to the real label category and outputs the confidence score of each category, where $y$ and $\hat{y}$ represent the true distribution and the predicted distribution, respectively.
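A standard (binary) cross-entropy form consistent with this description, with $y$ the true distribution and $\hat{y}$ the predicted one, would be:

$$L_{cls} = -\sum_{i} \big[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \big]$$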
The bounding-box regression loss is depicted in Equation (5), comprising both the DFL loss and the Wise-IoUv3 loss [30]. We integrate the focusing mechanism into the bounding-box regression loss calculation by adopting the degree of deviation as a weight factor; this design helps minimize the influence of dataset quality differences on gradient updates, where $l$ is the true distribution, while $l^-$ and $l^+$ are the nearest integers to the left and right of $l$, and the DFL term measures the difference between $l$ and the predicted distribution. We use $W_g$ and $H_g$ to represent the size of the minimum enclosing box. $x$, $y$ and $x^{gt}$, $y^{gt}$ are the horizontal and vertical coordinates of the predicted box and the real box. The outlier degree $\beta$ is an index used to assess the quality of anchor boxes, and $\alpha$ and $\delta$ are two definable parameters used to construct the non-monotone focusing coefficient $r$.
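A minimal sketch of the non-monotone focusing coefficient from Wise-IoUv3 [30], $r = \beta / (\delta\,\alpha^{\beta-\delta})$, with the outlier degree $\beta$ computed against a running mean of the IoU loss (the $\alpha$ and $\delta$ values here are illustrative):

```python
import torch

def wiou_v3_weight(loss_iou, loss_iou_mean, alpha=1.9, delta=3.0):
    # Outlier degree: ratio of each box's IoU loss to the running mean;
    # detached so the weight itself does not propagate gradients.
    beta = loss_iou.detach() / loss_iou_mean
    # Non-monotone focusing coefficient r = beta / (delta * alpha**(beta - delta)).
    return beta / (delta * alpha ** (beta - delta))

# The final loss scales the distance-weighted IoU loss (Wise-IoU v1) by r.
```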
2.4. Network Design
The feature extractor part of our network consists of a cascade of MLCA blocks, illustrated in the purple region of Figure 1. We input a SAR image with dimensions $3 \times H \times W$ into the network and acquire multi-scale features with dimensions $(8c, \frac{H}{8}, \frac{W}{8})$, $(16c, \frac{H}{16}, \frac{W}{16})$, and $(32c, \frac{H}{32}, \frac{W}{32})$ after passing through the third, fourth, and fifth MLCA blocks, respectively, where $c$ is set to 32.
Drawing inspiration from the YOLO framework [18], we incorporate its straightforward and efficient feature fusion and detection methodology, showcased in the green and yellow regions of Figure 1. The task of the neck is to fuse the multi-scale features from the feature extractor; through a series of conv–batch normalization–SiLU (CBS) modules, C2f modules, and upsampling modules, the network gathers information from different semantic spaces. In the head, multiple prediction bounding boxes at different scales are derived by performing regression and classification on feature maps at three distinct scales; an NMS [41] algorithm is then employed to remove redundant detections exhibiting substantial overlap, yielding the final output in the form of $(cls, score, x, y, h, w)$, where $cls$ is the label index value corresponding to the target and $score$ is the confidence score of the target. $x$, $y$ and $h$, $w$ represent the position information and scale information of the bounding box, respectively.
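A minimal sketch of this decoding and NMS step using torchvision (the thresholds, class-agnostic NMS, and corner-box input layout are assumptions):

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, labels, iou_thr=0.5, conf_thr=0.25):
    # boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,); labels: (N,)
    keep = scores > conf_thr                 # drop low-confidence boxes
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = nms(boxes, scores, iou_thr)       # remove overlapping duplicates
    # Convert to (cls, score, x, y, h, w) with center coordinates.
    x = (boxes[keep, 0] + boxes[keep, 2]) / 2
    y = (boxes[keep, 1] + boxes[keep, 3]) / 2
    w = boxes[keep, 2] - boxes[keep, 0]
    h = boxes[keep, 3] - boxes[keep, 1]
    return torch.stack([labels[keep].float(), scores[keep], x, y, h, w], dim=1)
```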