Article

LRTransDet: A Real-Time SAR Ship-Detection Network with Lightweight ViT and Multi-Scale Feature Fusion

1 School of Software and Microelectronics, Peking University, Beijing 102600, China
2 School of Integrated Circuits, Peking University, Beijing 100871, China
3 Beijing Aerospace Automatic Control Institute, Beijing 100039, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(22), 5309; https://doi.org/10.3390/rs15225309
Submission received: 13 October 2023 / Revised: 2 November 2023 / Accepted: 6 November 2023 / Published: 9 November 2023

Abstract
In recent years, significant strides have been made in the field of synthetic aperture radar (SAR) ship detection through the application of deep learning techniques. These advanced methods have substantially improved the accuracy of ship detection. Nonetheless, SAR images present distinct challenges, including complex backgrounds, small ship targets, and noise interference, which place demanding requirements on detectors. In this paper, we introduce LRTransDet, a real-time SAR ship detector. LRTransDet leverages a lightweight vision transformer (ViT) and a multi-scale feature fusion neck to address these challenges effectively. First, our model implements a lightweight backbone that combines convolutional neural networks (CNNs) and transformers, thus enabling it to simultaneously capture both local and global features from input SAR images. Moreover, we boost the model’s efficiency by incorporating the faster weighted feature fusion (Faster-WF2) module and coordinate attention (CA) mechanism within the feature fusion neck. These components optimize computational resources while maintaining the model’s performance. To overcome the challenge of detecting small ship targets in SAR images, we refine the original loss function by combining the normalized Wasserstein distance (NWD) metric with the intersection over union (IoU) scheme. This combination improves the detector’s ability to efficiently detect small targets. To evaluate the performance of our proposed model, we conducted experiments on four challenging datasets (the SSDD, the SAR-Ship Dataset, the HRSID, and the LS-SSDD-v1.0). The results demonstrate that our model surpasses both general object detectors and state-of-the-art SAR ship detectors in terms of detection accuracy (97.8% on the SSDD and 93.9% on the HRSID) and speed (74.6 FPS on the SSDD and 75.8 FPS on the HRSID), all while requiring only 3.07 M parameters. Additionally, we conducted a series of ablation experiments to illustrate the impact of the EfficientViT backbone, the Faster-WF2 module, the CA mechanism, and the NWD metric on multi-scale feature fusion and detection performance.

1. Introduction

Synthetic aperture radar (SAR) is a microwave remote sensing technology suited to vast, extensive, and long-term monitoring thanks to its penetration capability and strong anti-interference ability [1,2]. Ship detection in SAR images plays a significant role in military and civil scenarios, such as military reconnaissance, maritime traffic control, and sea rescue [3,4]. Traditional ship-detection methods usually distinguish ships by modeling the statistical distribution of the cluttered background, as typified by the constant false alarm rate (CFAR) method [5]. CFAR has achieved great success against simple backgrounds; however, it loses detection accuracy and generalization in complex background scenarios, such as small ship detection or rough sea conditions.
Recently, deep learning methods have become mainstream in many applications for their high accuracy and strong flexibility, such as human detection [6], urban road transport detection [7], 3D point cloud processing [8,9], and trajectory prediction [10]. In ship-related research, ship trajectory prediction algorithms can provide early warnings to avoid collisions [11,12], and ship-detection algorithms can effectively improve marine transport management. In particular, convolutional neural networks (CNNs) play an important role in ship detection owing to their strong feature extraction capability and compatibility with parallel computing on GPUs. Detection algorithms are usually divided into two categories: two-stage detection and one-stage detection. Two-stage detection algorithms, represented by Faster R-CNN [13], Mask R-CNN [14], etc., use a region proposal network (RPN) at the first stage to generate numerous regions of interest (RoI) with different scales and shapes. At the second stage, these regions are processed and refined to locate and classify objects. Generally, two-stage algorithms can achieve superior detection accuracy but have huge computational costs and model sizes, which makes them challenging to train and deploy on edge devices. To address this problem, one-stage algorithms that work without generating region proposals have been proposed, such as You Only Look Once (YOLO) [15], RetinaNet [16], Fully Convolutional One-Stage Object Detection (FCOS) [17], and YOLOX [18]. For instance, the YOLO series presets anchor boxes on feature maps and performs regression to refine them, while YOLOX assumes that objects lie at the centers of anchor points and predicts bounding boxes directly for detection. One-stage algorithms have attracted more attention because they can reduce both computational costs and model size with minimal accuracy loss. In optical ship detection, YOLO-based detectors have demonstrated that the “backbone-neck-head” detection paradigm in YOLO has a powerful ability in feature extraction [19,20]. The YOLO series also shows great generality and flexibility in other scenarios, such as ship depth estimation [21] and ship instance segmentation [22]. In SAR ship detection, where imagery is generally simpler than optical imagery, researchers have been wrestling with redundant computations in YOLO to make detectors lighter and faster [23,24].
In real-world object detection, objects appear at varying scales. While large objects are easier to detect owing to their distinct low-level features such as edge information, the detection accuracy for small objects is insufficient because they lack high-level features [25]. In general object detectors, multi-scale feature fusion modules are commonly used to improve the detection accuracy of small objects. For example, AugFPN [26] implements consistent supervision and residual feature augmentation to narrow the semantic gaps and reduce the information loss of the highest pyramid level in feature pyramid networks (FPNs). DenserNet [27] aggregates feature maps at different semantic levels to produce more keypoint features; through a weakly supervised triplet ranking loss, it succeeds in large-scale localization and image retrieval with efficient computation. These feature fusion modules make full use of the complementarity between different levels of features, as well as the correlation between multi-scale features [28]. In SAR imagery, small ships are also widespread, and they are difficult to locate and classify because of missing color information and vague edges. Therefore, an efficient multi-scale feature fusion method suitable for real-time SAR ship detection is needed.
In the past few years, transformers [29] have been increasingly adopted in natural language processing (NLP) models. Their strong generalization ability and impressive performance have allowed them to be gradually deployed in diverse scenarios. In computer vision tasks, the vision transformer (ViT) [30] was the first classification model incorporating transformer blocks, and it achieved excellent performance. However, ViT models usually have substantial computational costs, which limits their feasibility for deployment on resource-constrained devices. To alleviate this problem, there has been a growing focus on developing lightweight ViT models. In particular, EfficientViT [31] has emerged as a solution that attains higher accuracy, faster inference speeds, and lower computational complexity. Although some researchers are exploring the integration of ViT into real-time SAR ship detectors, few works use lightweight ViT.
This paper presents LRTransDet, a real-time SAR ship detector based on a lightweight ViT, Faster-WF2 modules, and the coordinate attention (CA) mechanism. We also propose a corresponding solution for small ship detection in SAR images. Our model achieves superior performance on four challenging multi-scale SAR ship datasets. The contributions of this work are as follows:
  • This paper constructs a novel real-time SAR ship-detection network, named LRTransDet, with a lightweight ViT, a faster weighted feature fusion neck, and an optimized loss function. Compared with SOTA detectors, our model obtains higher detection accuracy with lower time and space complexity while maintaining real-time inference speed;
  • In terms of feature extraction, we embed the latest lightweight ViT network into the backbone. Our backbone combines the global modeling ability of transformers and the locality of CNNs, reducing computational complexity while still ensuring high-quality multi-scale feature extraction;
  • We reconstruct the detector’s neck based on Faster-WF2 modules and the CA mechanism, thereby enhancing the feature fusion across multi-scale feature maps with complex backgrounds. We also conduct ablation experiments to prove the effectiveness of these components in the neck;
  • For situations where numerous small targets exist in SAR images, we propose a loss function that combines the Complete Intersection over Union (CIoU) measurement scheme with the normalized Wasserstein distance. This optimization empowers the detector to excel at capturing small ship targets;
  • To demonstrate the performance of our proposed model, we conduct tests on four challenging real-world SAR datasets: the SSDD, the SAR-Ship Dataset, the HRSID, and the LS-SSDD-v1.0. Compared with general object detectors and state-of-the-art SAR ship detectors, our model achieves a mean average precision (mAP) of 97.8%, 95.1%, 93.9%, and 76.2% on the four datasets, respectively, while demanding lower computational resources (3.85, 0.96, 9.4, and 9.4 GFLOPs, respectively).
In the rest of the article, we detail some of the related works in Section 2, introduce the architecture and module details of our proposed LRTransDet in Section 3, analyze the experimental results for our detector in Section 4, show the limitations of our model and future work in Section 5, and draw our conclusions in Section 6.

2. Related Work

2.1. SAR Ship-Detection Models

Deep learning-based methods have delivered great achievements in SAR ship detection. Nevertheless, detecting small and vague ships remains challenging because of information loss in large scenes and complex backgrounds. In SAR ship detection, researchers often combine multi-scale feature fusion with other effective components, such as attention mechanisms, to detect small ships precisely. FBUA-Net [32] uses a global context-guided feature balance pyramid and a united attention module to balance the semantic information of different-level features and to suppress the scattering noise around ships. MFTF-Net [33] proposes a local enhancement and transformer module, a four-scale residual feature fusion network, and fCBAM attention to enrich information and reduce interference for small ships. ATSD [34] implements spatial insertion attention and weighted cascade feature fusion to improve localization accuracy and extract multi-scale ship features. By employing various forms of multi-scale fusion and attention mechanisms, the semantic features of small ships are enhanced, making it easier to distinguish small ships from noise and backgrounds.
Although the aforementioned algorithms improve detection accuracy effectively, they usually require large computation and storage resources, which hinders their application in real-time ship detection [35]. Consequently, researchers have turned to model compression as a means of reducing inference time in SAR ship detection. LPEDet [36] replaces the backbone of YOLOX with a lightweight NLCNet, which consists of multiple depthwise separable convolution modules; it also designs a position-enhanced attention to compensate for accuracy loss at little computational cost. LssDet [23] constructs a lightweight path aggregation feature pyramid network to reduce network redundancy. Lite-YOLOv5 [24] designs a lightweight cross-stage partial module to reduce the amount of calculation and performs network pruning to obtain a compact detector. These approaches effectively decrease model complexity, making real-time detection and hardware deployment feasible. However, they can generally maintain, but hardly improve, detection accuracy while compressing the model, likely because accuracy and model complexity form a trade-off. Therefore, how to improve accuracy while reducing computational cost and memory consumption remains an open problem, and it is the one addressed in this paper.

2.2. Transformers

Transformers were first applied to machine translation tasks in NLP [29]. A typical transformer structure, illustrated in Figure 1, contains an encoder and a decoder, and stacking multiple such structures can form a larger network. Specifically, the transformer module includes the input, positional encoding, multi-head attention (MHA), a feed-forward network (FFN), etc. Within the encoder, attention weights are assigned to the inputs to generate encoded hidden outputs. These encoded representations then serve as the foundation for the decoder module, which leverages this information to produce an output sequence.
The attention mechanism is the core module of a transformer. The self-attention computation uses three vectors, Q, K, and V, which represent the input sequence in the corresponding subspaces. Most transformers use softmax attention, which is calculated as follows:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{D_k}}\right) \cdot V \quad (1)
A multi-head attention mechanism contains multiple attention heads, and each head is assigned different Q, K, and V projections, so the input sequence can learn more subspace representations. This setting enhances the learning and generalization capabilities of the model. The FFN module is deployed after MHA, and it consists of two linear transformation layers and an activation layer for information integration.
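As a minimal illustration of the softmax attention in Equation (1), the following PyTorch sketch computes scaled dot-product attention for a batch of token sequences; the tensor shapes and head count are illustrative assumptions, not values from this paper.

import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # q, k, v: (batch, heads, tokens, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, heads, tokens, tokens)
    weights = F.softmax(scores, dim=-1)             # attention weights per query token
    return weights @ v                              # weighted sum of the value vectors

# toy usage: 2 samples, 4 heads, 64 tokens, head dimension 32 (illustrative)
q = torch.randn(2, 4, 64, 32)
k = torch.randn(2, 4, 64, 32)
v = torch.randn(2, 4, 64, 32)
out = softmax_attention(q, k, v)                    # (2, 4, 64, 32)

Note that the (tokens × tokens) score matrix is what makes softmax attention quadratic in the sequence length, which motivates the linear attention used later in Section 3.2.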
Due to the great success of transformer-based models in NLP, many researchers have tried to apply them to computer vision tasks. Because of their capability to capture a wide range of information from the input sequence, transformers can substitute for CNNs in extracting features from input images [37]. In [30], the researchers built ViT, an image classification network containing only transformer structures. The model first splits the input image into a set of 2D patches, then feeds each patch as a token into the transformer structure to calculate attention, and finally uses a multilayer perceptron (MLP) head to output the classification results. The experimental results show that ViT matches the best CNN baselines while taking up fewer resources. In addition, the study advises that, in most cases, ViT needs to be pretrained on larger datasets and then fine-tuned for different downstream tasks.
To leverage the advantages of both transformers and CNNs, some works have combined CNN backbones with transformers to extract local features while capturing long-range dependencies from the inputs. For example, Conformer [38] enhances the learning effect of the model through a hybrid network structure that cascades CNN-based units to extract local features and transformer-based structures to obtain global features. The model consists of two branches, a CNN and a transformer, designed as ResNet and ViT, respectively. In addition, the model uses a feature coupling unit (FCU) as a bridge between the two branches to integrate the feature information from the CNN and the transformer.
In object detection tasks, DEtection TRansformer (DETR) [39] uses a transformer encoder–decoder structure and implements an end-to-end object detection model through a set-based prediction loss. DETR uses a CNN backbone for feature extraction and dimensionality reduction of the input images, and it then applies transformer structures to learn the global information of the feature maps. The output embeddings are then passed into an FFN to compute the bounding boxes and classification results. However, DETR requires far more training time than other detectors and performs poorly on small targets. A series of subsequent works [40,41,42] have mitigated these problems.
Benefiting from the success of transformers in computer vision, recent works have focused on breaking through the computing bottleneck of ViT and implementing lighter ViT structures, aiming to reduce the resource consumption of models and to facilitate deployment on edge devices. For instance, S. Mehta et al. proposed MobileViT in 2022 [43], which combines CNNs and transformers and uses the “Unfold-Transformer-Fold” mechanism to reduce the parameters of the model. H. Tao et al. designed LighterViT without convolution units [44]. Focusing on the limited capacity of the traditional self-attention mechanism in thinner networks, Y. Chenglin et al. proposed the Light Vision Transformer (LVT) [45] with two enhanced self-attention mechanisms, CSA and RASA, to improve the low-level and high-level features in their model.
At present, some works have attempted to employ transformers in SAR ship detection. Xia et al. proposed CRTransSar, the first application of transformers to SAR ship-detection models [46]. CRTransSar is based on a Swin Transformer structure and delivers a good performance. Later, to extract more information from small target ships, Shi et al. designed a detector [47] that replaces the original attention mechanism in the Swin Transformer with a deformable attention mechanism. Furthermore, inspired by the multi-head self-attention in transformers, Yu et al. designed a multiple hybrid attention mechanism ship detector named MHASD [48], which enhances both the accuracy and speed of the model. Thanks to the application of ViT, the performance of these models has improved impressively. However, the softmax function in ViT has a high computational complexity, which limits deployment on hardware platforms.

3. Methodology

This section mainly introduces the details of our proposed LRTransDet. The overall architecture of the model will be introduced in Section 3.1. Lightweight ViT, Faster-WF2, coordinate attention, and our optimized loss function in the model will be introduced in Section 3.2, Section 3.3 and Section 3.4. Finally, the metric for evaluating the performance of our model will be introduced in Section 3.5.

3.1. Model Architecture

Our proposed model, LRTransDet, is a faster and superior SAR ship-detection model. Its overall architecture is shown in Figure 2, and its features are as follows: LRTransDet is a one-stage SAR ship-detection model whose design is based on YOLOv5, and it includes a backbone built on a lightweight ViT, a faster weighted feature fusion neck, and a detection head. In addition, considering that certain SAR images contain many small ships that are difficult to detect, we optimized the loss function to improve the detection ability of the model.
YOLOv5, the fifth version of the YOLO series, has been used in many object detection tasks [49]. There are three parts to YOLOv5: the backbone, the neck, and the head. The backbone extracts features from input images and has three basic units: the CBS module (convolution, batch norm, and SiLU), the cross stage partial (CSP) module, and the spatial pyramid pooling-fast (SPPF) module. In the neck, YOLOv5 uses a Feature Pyramid Network (FPN) and a Path Aggregation Network (PANet) for top-down and bottom-up fusion of feature maps. The major units in the neck are also the CBS and CSP modules; however, unlike the CSP unit in the backbone, the CSP in the neck does not contain a residual structure. The outputs at the three different scales of the neck, which correspond to small, medium, and large targets in the feature maps, are sent to the head for further detection and positioning.
Compared with some of the latest YOLO models, such as YOLOv7 [50], which demonstrate better performance on the general COCO dataset, YOLOv5 is still a more stable and mature choice for industrial applications, including remote sensing. Moreover, the later models have more parameters than their YOLOv5 counterparts. As a result, our model uses YOLOv5-small (YOLOv5s) as the baseline for the real-time SAR ship-detection task. First of all, we embed the latest lightweight ViT feature extraction framework into the backbone, which ensures the model can extract rich information from multi-scale input images while maintaining low computational costs.
Secondly, we propose a faster weighted feature fusion neck. We designed a set of faster and better feature fusion modules with an attention mechanism, which improve feature fusion performance while reducing the computation of the original YOLOv5s neck.
Finally, we optimized the loss function for small ships in SAR images by combining a Wasserstein distance-based measurement with the original CIoU measurement scheme. These components of our model are explained in the following subsections.

3.2. Embedding the Lighter ViT into the Backbone

As mentioned in Section 2, transformers have achieved great success in computer vision. Based on the multi-head self-attention mechanism, transformers can calculate the attention weights of the input sequence in different subspaces so as to obtain more information. To a certain extent, transformers can reduce computation while maintaining model performance. However, there is still a computing bottleneck in ViT: in the original transformer structure, attention is computed with softmax attention, which takes a large amount of time, has a large space complexity, and occupies a great deal of computing resources. Some studies have also shown that not all attention heads contribute significantly to the results [51].
In one-stage detectors, the task of the backbone is to extract effective information from the image and pass it to subsequent parts of the model for further classification and positioning. In SAR ship datasets, there are many small ships, and the images in certain datasets are large, so information is easily lost when a simple deep convolutional network is used for feature extraction. Transformers, on the other hand, can capture a wide range of features, but the classical transformer is limited in speed and in extracting local features. To ensure both quality and speed in feature extraction, we embedded EfficientViT into the backbone of LRTransDet.
As shown in Figure 3a, the core part of our backbone is composed of six stages, with EfficientViT blocks at its core. Stage 0 and stage 1 only contain the CBH (convolution, batch norm, and hardswish) module, the depthwise separable convolution (DSConv), and the mobile inverted bottleneck convolution (MBConv) from MobileNet-V2 [52]; a minimal sketch of these building blocks is given at the end of this subsection. DSConv is composed of depth-wise convolution (DWConv) and point-wise convolution (PWConv): the former is computed once per input channel of the feature maps, and the latter uses a 1 × 1 kernel to perform a convolution along the depth direction of each feature map. Compared with ordinary convolution, DSConv therefore decreases the complexity at the same network depth, and this structure has been widely used in the MobileNet series. Since the transformation of the number of feature map channels in the MBConv structure is opposite to that in the ordinary ResNet structure, the residual structure in MBConv is called an inverted residual structure. The MBConv in our backbone first uses a 1 × 1 Conv to expand the number of channels of the feature map; it then uses a 3 × 3 DWConv and a 1 × 1 PWConv to extract features from the feature map. Finally, it uses the residual connection to enhance information fusion. In an MBConv, gradient propagation is enhanced and memory consumption during inference is reduced. Stage 3 and stage 4 include EfficientViT modules, which are the core of the backbone. Different from the softmax attention mechanism mentioned in Section 2, the linear attention in the EfficientViT block calculates the similarity function as follows:
\mathrm{Context}_i = \sum_{j=1}^{N} \frac{\mathrm{Sim}(Q_i, K_j)}{\sum_{j=1}^{N} \mathrm{Sim}(Q_i, K_j)} V_j = \sum_{j=1}^{N} \frac{\phi(Q_i)\,\phi(K_j)^{T}}{\sum_{j=1}^{N} \phi(Q_i)\,\phi(K_j)^{T}} V_j \quad (2)
In this equation, we assume that the input is x, with Q = x W_Q, K = x W_K, and V = x W_V, where N is the number of tokens in x and W_Q, W_K, and W_V are three learnable matrices. Sim(Q_i, K_j) = φ(Q_i) φ(K_j)^T is the linear similarity function, where φ(·) is the kernel function; in our work, we chose φ(·) = ReLU(·). With this choice, we can employ the associative law of matrix multiplication, which reduces the calculation complexity and improves the calculation speed.
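To make the benefit of the ReLU kernel concrete, the following PyTorch sketch (our illustration, not the authors' released implementation) applies the associative law to Equation (2): the d × d matrix φ(K)^T V is computed once, so the cost scales linearly with the number of tokens N rather than quadratically.

import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, tokens, dim); phi(.) = ReLU(.) as in Equation (2)
    q = torch.relu(q)
    k = torch.relu(k)
    # associativity: build phi(K)^T V (dim x dim) instead of the (tokens x tokens)
    # similarity matrix, dropping the cost from O(N^2 d) to O(N d^2)
    kv = k.transpose(-2, -1) @ v                  # (batch, dim, dim)
    k_sum = k.sum(dim=-2)                         # (batch, dim), denominator term
    num = q @ kv                                  # (batch, tokens, dim)
    den = (q * k_sum.unsqueeze(-2)).sum(dim=-1, keepdim=True) + eps
    return num / den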
The details of the EfficientViT module are shown in Figure 4. We use the lightweight multi-scale attention (lightweight MSA) mechanism as the main block of the EfficientViT module to generate the attention results. Specifically, a 1 × 1 convolution produces the Q/K/V tokens, in which the image patches play a role similar to the processing units in NLP tasks. Considering that the ReLU-based attention mechanism has certain performance limitations, an aggregation step generates multi-scale tokens in the lightweight MSA mechanism. After obtaining these tokens, the ReLU-based attention mechanism is applied to calculate the attention weights for each Q/K/V group. The outputs are then concatenated and sent to a 1 × 1 convolution for the subsequent feature fusion.
This design combines the advantages of CNNs and transformers: it extracts the local features of SAR images through step-by-step downsampling convolution operations, while the transformer blocks ensure that global features are not lost. Next, we send the outputs from S2, S3, and the SPPF to the neck for feature fusion and detection.
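For reference, the sketch below shows minimal PyTorch versions of the DSConv and MBConv building blocks used in the early stages of this backbone; the channel counts, expansion ratio, and activation placement are simplifying assumptions rather than the exact configuration of Figure 3.

import torch
import torch.nn as nn

class DSConv(nn.Module):
    # depthwise 3x3 convolution per channel followed by a 1x1 pointwise convolution
    def __init__(self, c_in, c_out):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.Hardswish()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class MBConv(nn.Module):
    # inverted residual: 1x1 expand -> 3x3 depthwise -> 1x1 project, with a skip connection
    def __init__(self, c, expand=4):
        super().__init__()
        hidden = c * expand
        self.block = nn.Sequential(
            nn.Conv2d(c, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.Hardswish(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.Hardswish(),
            nn.Conv2d(hidden, c, 1, bias=False), nn.BatchNorm2d(c),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection enhances gradient propagation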

3.3. A Faster Weighted Feature Fusion Neck

The main difficulty of SAR ship detection is that input images have complex backgrounds and various ship scales; meanwhile, real-time operation requires both speed and quality from the detector. Therefore, how to extract valid information from multi-scale feature maps at high speed remains a key problem in neck design. The FPN-PANet-based neck in YOLOv5 requires a large amount of calculation and is slow, and its simple channel-wise feature map concatenation easily loses information in the multi-level feature pyramid, which affects the accuracy of the SAR ship detector. To solve this problem, we propose a faster weighted feature fusion neck based on the Faster-WF2 module, whose overall architecture is shown in Figure 5. It mainly consists of Faster-WF2 modules and the CA mechanism, which ensure the performance of feature fusion while reducing the computation of the neck.
Recently, DSConv has often been used to reduce computation in lightweight convolutional networks. However, since DSConv noticeably affects the accuracy of the network and requires a large amount of memory access for temporary variables, this convolution operation is difficult to deploy in every network. Inspired by the recent FasterNet [53], we utilize two convolution methods, partial convolution (PConv) and PWConv, in the neck and optimize the existing CSP to reduce calculation complexity.
Because there is little difference between channels in SAR ship images, we considered that some general CNN structures perform redundant operations; in fact, in convolution layers, some weights contribute little to the outputs. Therefore, we implement the FasterBlock with partial convolution in the detector to reduce the calculation in SAR ship detection, with little performance loss or even improvements.
The details of FasterBlock and PConv are shown in Figure 6 below. In FasterBlock, we use PConv to reduce the computational redundancy and memory usage. By setting the segmentation coefficient x = C_p / C, the input feature map F = C × W × H is divided into two parts with shapes f_1 = C_p × W × H and f_2 = (C − C_p) × W × H. We only perform the standard convolution on f_1 and then directly concatenate f_2 to the convolution results. Next, to make full use of the information in each channel, we add two PWConv layers after PConv to process the feature maps. The overall structure of the convolution in the FasterBlock resembles a “T-shape” convolution, which directs more attention to the center of each feature map; in fact, a convolution operation carries more meaningful weights at the center than at the edges. This “T-shape” convolution also improves the efficiency of the convolution layers by reducing FLOPs [53].
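A minimal PyTorch sketch of this idea is shown below (our own illustration, assuming a split ratio of 1/4 as in FasterNet rather than the exact value used here): only the first C_p channels pass through the 3 × 3 convolution, and two pointwise convolutions then mix information across all channels.

import torch
import torch.nn as nn

class PConv(nn.Module):
    # partial convolution: apply a 3x3 convolution to the first C_p channels only
    def __init__(self, channels, split_ratio=0.25):
        super().__init__()
        self.c_p = int(channels * split_ratio)
        self.conv = nn.Conv2d(self.c_p, self.c_p, 3, padding=1, bias=False)

    def forward(self, x):
        f1, f2 = x[:, :self.c_p], x[:, self.c_p:]
        return torch.cat([self.conv(f1), f2], dim=1)  # untouched channels are concatenated back

class FasterBlock(nn.Module):
    # PConv followed by two pointwise convolutions (the "T-shape" pattern), with a residual
    def __init__(self, channels, expand=2):
        super().__init__()
        hidden = channels * expand
        self.pconv = PConv(channels)
        self.pw = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x):
        return x + self.pw(self.pconv(x))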
Furthermore, instead of the simple channel concatenation in YOLOv5, we use the weighted feature fusion operation from Bi-FPN [54], which sets a group of learnable parameters for the feature maps of different scales. Through this operation, we can distinguish the importance of the different feature maps during training so that our model retains more of the important information about small ship targets. The weighted fusion in Figure 5 is calculated as shown in Equation (3):
O = \mathrm{Conv}\!\left(\mathrm{SiLU}\!\left(\frac{\alpha f_1 + \beta f_2}{\alpha + \beta + \epsilon}\right)\right) \quad (3)
where α and β are the two learnable parameters, and f_1 and f_2 are the corresponding feature maps to be fused.
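A possible PyTorch implementation of Equation (3) is sketched below; it is illustrative, and the normalization constant and convolution settings are assumptions.

import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    # learnable normalized fusion of two same-shape feature maps, Equation (3)
    def __init__(self, channels, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))   # alpha, beta
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.act = nn.SiLU()

    def forward(self, f1, f2):
        w = torch.relu(self.w)                 # keep the fusion weights non-negative
        fused = (w[0] * f1 + w[1] * f2) / (w.sum() + self.eps)
        return self.conv(self.act(fused))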
To compensate for the possible information loss caused by reducing parameters and FLOPs, we use the coordinate attention mechanism [55] to enhance the model’s attention to position information during the feature fusion process. Compared with other commonly used attention mechanisms, the CA mechanism has a smaller computational scale and enables the network to focus on a wider range of targets, which is helpful for ship detection in large-scale SAR images. Its calculation process is shown in Figure 7 below.
Specifically, the CA module contains two parts, coordinate information embedding and coordinate attention generation, which encode each channel of the feature map and generate attention along the vertical and horizontal directions. The coordinate embedding process is shown in Equation (4):
z_c^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), \qquad z_c^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \quad (4)
In Equation (4), the input feature map x_c with the size of W × H × C is split into two parts, z_c^h and z_c^w, with sizes of 1 × H × C and W × 1 × C after average pooling. These two parts are then concatenated to generate temporary feature maps, which are used to compute the attention weights in the height and width directions. Finally, the output Y of the CA module can be described as follows:
\mathrm{AttentionWeights}^{h} = \sigma\!\left(F^{h}_{1\times1}\!\left(\delta\!\left(F_{1}([z^{h}, z^{w}])\right)\right)\right), \qquad \mathrm{AttentionWeights}^{w} = \sigma\!\left(F^{w}_{1\times1}\!\left(\delta\!\left(F_{1}([z^{h}, z^{w}])\right)\right)\right) \quad (5)
Y_c(i, j) = x_c(i, j) \times \mathrm{AttentionWeights}_c^{h}(i) \times \mathrm{AttentionWeights}_c^{w}(j) \quad (6)
where σ(·) is the sigmoid function, δ(·) is the swish function, and F(·) denotes the convolution operation.
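The following PyTorch sketch of a coordinate attention block follows Equations (4)–(6); the channel-reduction ratio is an assumption made for illustration.

import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    # coordinate attention: pool along H and W separately, then build per-axis weights
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)   # F_1
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.SiLU()                                    # delta(.)
        self.conv_h = nn.Conv2d(mid, channels, 1)               # F^h_{1x1}
        self.conv_w = nn.Conv2d(mid, channels, 1)               # F^w_{1x1}

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                       # (b, c, h, 1): average over width
        z_w = x.mean(dim=2, keepdim=True)                       # (b, c, 1, w): average over height
        y = torch.cat([z_h, z_w.transpose(2, 3)], dim=2)        # concatenate along the spatial axis
        y = self.act(self.bn(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                   # (b, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.transpose(2, 3)))   # (b, c, 1, w)
        return x * a_h * a_w                                    # Equation (6)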
Finally, we combine the Faster-WF2 module and CA so that our neck achieves better feature fusion quality with faster inference. Assuming an input image size of 640 × 640, the detailed structure of our network is shown in Table 1. Each Faster-WF2 module includes one or more convolution operations, i.e., FasterBlock and/or the weighted fusion unit, and their number can be dynamically adjusted according to the complexity and accuracy requirements of the task. In our experiments, the number of these components was set to 1.

3.4. Loss Function

Intersection over Union (IoU) measures the degree of overlap between the predicted bounding box of the detector and the ground truth bounding box of the target, and it is an important metric for evaluating the positioning accuracy of detectors in object detection tasks. At present, an IoU threshold is usually used in YOLOv5s to determine whether the current anchor matches the target, which is effective for most object detection tasks. However, in certain SAR images, many small ships are represented by only a few pixels, which makes IoU-based indicators particularly sensitive to small displacements of these targets. Although simply lowering the IoU threshold allows them to match more anchors, it also greatly affects the overall training quality. To solve this problem, we use the normalized Wasserstein distance (NWD)-based measurement [56] to optimize the loss function. First, we reconstruct all the bounding boxes from the head as two-dimensional Gaussian distributions. Second, we use the NWD to calculate the similarity between these bounding boxes, and we then decide which bounding box is assigned to each small ship target.
Specifically, the Wasserstein distance between the distributions p(x) and q(x) is calculated as follows:
W(p, q) = \inf_{\gamma \in \Gamma(p, q)} \iint \gamma(x, y)\, d(x, y)\, dx\, dy \quad (7)
In this formula, γ ∈ Γ(p, q) is a joint distribution of p(x) and q(x), and d(x, y) can be any distance. Compared with other commonly used measures such as the KL divergence or the JS divergence, the Wasserstein distance has the advantage of measuring the distance between two completely non-overlapping distributions; in this extreme case, the JS divergence is constant and the KL divergence is meaningless.
For small ship targets in SAR images, there may be background pixels near the boundary of their bounding boxes. To distinguish between the ship and the background, we use a two-dimensional Gaussian distribution to assign weights to the pixels in the bounding box:
\mathcal{N}(x \mid \mu, \Sigma) = \frac{\exp\!\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)}{2\pi\,|\Sigma|^{\frac{1}{2}}} \quad (8)
where x, μ, and Σ denote the coordinate (x, y), the mean vector, and the covariance matrix of the Gaussian distribution, respectively. For two bounding boxes Box_A = (x_a, y_a, w_a, h_a) and Box_B = (x_b, y_b, w_b, h_b), we convert them into the 2D Gaussian distributions G_{Box_A} = N_A(μ_A, Σ_A) and G_{Box_B} = N_B(μ_B, Σ_B). The Wasserstein distance between them can be described as follows:
W_2^2(G_{Box_A}, G_{Box_B}) = \lVert \mu_A - \mu_B \rVert_2^2 + \left\lVert \Sigma_A^{\frac{1}{2}} - \Sigma_B^{\frac{1}{2}} \right\rVert_F^2 \quad (9)
where \lVert \cdot \rVert_F denotes the Frobenius norm.
Finally, since IoU values lie in [0, 1], we transform W_2^2 into its normalized exponential form:
\mathrm{NWD}(G_{Box_A}, G_{Box_B}) = \exp\!\left(-\frac{\sqrt{W_2^2(N_A, N_B)}}{C}\right) \quad (10)
where C is a constant determined by the different datasets.
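Under the common convention that a box (cx, cy, w, h) maps to a Gaussian with mean (cx, cy) and diagonal covariance diag(w²/4, h²/4), as in the NWD paper [56] (treated here as an assumption), a small PyTorch sketch of Equations (9) and (10) is:

import torch

def nwd(box_a, box_b, c=12.8):
    # boxes given as (cx, cy, w, h); c is a dataset-dependent constant (value here is illustrative)
    mu_a, mu_b = box_a[:2], box_b[:2]
    # diagonal covariance: Sigma^(1/2) = diag(w/2, h/2)
    sqrt_sig_a = box_a[2:] / 2.0
    sqrt_sig_b = box_b[2:] / 2.0
    w2 = torch.sum((mu_a - mu_b) ** 2) + torch.sum((sqrt_sig_a - sqrt_sig_b) ** 2)  # Equation (9)
    return torch.exp(-torch.sqrt(w2) / c)                                           # Equation (10)

# toy usage: two nearby small boxes
a = torch.tensor([10.0, 10.0, 4.0, 4.0])
b = torch.tensor([11.0, 10.0, 4.0, 5.0])
print(nwd(a, b))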
The loss function of our proposed method is shown in Equation (11), where N is the number of detection heads. It is divided into three parts, L_box, L_obj, and L_cls, which measure the loss of the detector on bounding box regression, object confidence, and classification, respectively.
\mathrm{Loss} = \sum_{i}^{N} \left(\lambda_{box} L_{box} + \lambda_{obj} L_{obj} + \lambda_{cls} L_{cls}\right) \quad (11)
In YOLOv5, the calculation of L_box is based on CIoU as follows:
L_{box} = \sum_{i}^{A} L_{CIoU}(\mathrm{Pred}_{box}, \mathrm{Gt}_{box}) \quad (12)
where A is the number of anchors assigned to the target, Pred_box is the box output by the detector, and Gt_box is the ground truth. Furthermore, L_CIoU can be described as follows:
L_{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^{2}(\mathrm{Pred}_{box}, \mathrm{Gt}_{box})}{c^{2}} + \alpha \upsilon \quad (13)
where
\upsilon = \frac{4}{\pi^{2}}\left(\arctan\frac{w_{Gt}}{h_{Gt}} - \arctan\frac{w_{Pred}}{h_{Pred}}\right)^{2}, \qquad \alpha = \frac{\upsilon}{(1 - \mathrm{IoU}) + \upsilon} \quad (14)
In Equation (14), α is the weight function and υ measures the similarity between the aspect ratio of the ground truth and that of the prediction box. w_Gt and h_Gt are the width and height of the ground truth, respectively, and w_Pred and h_Pred are the width and height of the prediction box, respectively.
CIoU improves the convergence speed and stability during network training. To leverage the advantages of both CIoU and the NWD, we designed a ratio η to balance the NWD metric and the CIoU scheme. Our bounding box regression loss can be calculated as follows:
L_{box} = \eta L_{NWD} + (1 - \eta) L_{CIoU} = \eta\left(1 - \mathrm{NWD}(\mathrm{Pred}_{box}, \mathrm{Gt}_{box})\right) + (1 - \eta) L_{CIoU} \quad (15)
Overall, in our optimized loss function L_box, we use the normalized Wasserstein distance to improve our model’s performance on small-target detection. First, considering that most small objects are not strict rectangles, we reconstruct the bounding boxes as two-dimensional Gaussian distributions; such a distribution can distinguish the importance of the pixels in the center from those at the edge, which better describes the difference between foreground and background. Second, since IoU calculates the Jaccard similarity coefficient of two limited sample sets [57], we instead use the Wasserstein distance to measure the distance between these two-dimensional Gaussian distributions; the Wasserstein distance can be calculated even between two completely non-overlapping distributions, and it also offers scale invariance and smoothness with respect to location deviation, which are advantageous for detecting small objects [56]. Finally, for this indicator to be used in the loss function of our model, the Wasserstein distance needs to be normalized to [0, 1].
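A hedged sketch of the combined box regression loss in Equation (15) is given below; it reuses the NWD computation above together with torchvision's complete_box_iou_loss (available in recent torchvision releases and expecting corner-format boxes), and the ratio η and constant C are illustrative values rather than the tuned ones.

import torch
from torchvision.ops import complete_box_iou_loss

def box_regression_loss(pred_xyxy, gt_xyxy, eta=0.5, c=12.8):
    # Equation (15): eta * L_NWD + (1 - eta) * L_CIoU
    l_ciou = complete_box_iou_loss(pred_xyxy, gt_xyxy, reduction="none")

    # convert corner boxes to (cx, cy, w, h) for the NWD term
    def to_cxcywh(b):
        return torch.stack([(b[:, 0] + b[:, 2]) / 2, (b[:, 1] + b[:, 3]) / 2,
                            b[:, 2] - b[:, 0], b[:, 3] - b[:, 1]], dim=1)

    p, g = to_cxcywh(pred_xyxy), to_cxcywh(gt_xyxy)
    w2 = ((p[:, :2] - g[:, :2]) ** 2).sum(dim=1) + ((p[:, 2:] / 2 - g[:, 2:] / 2) ** 2).sum(dim=1)
    l_nwd = 1.0 - torch.exp(-torch.sqrt(w2) / c)

    return (eta * l_nwd + (1.0 - eta) * l_ciou).mean()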
In addition, for L_obj and L_cls, we use the BCEWithLogitsLoss of YOLOv5 [49]:
L_{obj} = \mathrm{BCE}(p_{0}, p_{IoU}; w_{obj}), \qquad L_{cls} = \mathrm{BCE}(c_{pred}, p_{Gt}; w_{cls}) \quad (16)

3.5. Evaluating Metrics

In object detection tasks, mAP is often used as one of the most important indicators for evaluating the performance of a model. To calculate the mAP, we first need to determine the precision and recall. Precision (P) and recall (R) are defined as follows:
P = \frac{TP}{TP + FP} \quad (17)
R = \frac{TP}{TP + FN} \quad (18)
In these formulas, a true positive (TP) is an instance predicted as positive whose corresponding ground truth is also positive. A false positive (FP) is an instance predicted as positive whose corresponding ground truth is negative. A false negative (FN) is an instance predicted as negative whose corresponding ground truth is positive. A detection result is identified as a TP only when its IoU with the ground truth is greater than a given IoU threshold (0.5 is the most commonly used threshold).
By varying the confidence threshold, we can draw a precision–recall (PR) curve. The mAP is obtained by calculating the area enclosed by the PR curve and the axes:
mAP = \int_{0}^{1} P(R)\, dR \quad (19)
The F_1-score is also a comprehensive indicator for analyzing the performance of the model, and it is defined as follows:
F_1 = \frac{2 \times P \times R}{P + R} \quad (20)
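As a simple numerical illustration of Equations (17)–(20) with toy counts (not results from this paper), precision, recall, F1, and an AP approximation obtained by integrating a sampled PR curve can be computed as follows:

import numpy as np

tp, fp, fn = 90, 10, 15                               # toy detection counts
precision = tp / (tp + fp)                            # Equation (17)
recall = tp / (tp + fn)                               # Equation (18)
f1 = 2 * precision * recall / (precision + recall)    # Equation (20)

# AP: area under a sampled precision-recall curve, Equation (19)
recalls = np.array([0.0, 0.2, 0.5, 0.8, 0.857])
precisions = np.array([1.0, 0.98, 0.95, 0.92, 0.90])
ap = np.trapz(precisions, recalls)
print(precision, recall, f1, ap)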
In addition to these indicators, we also use floating point operations (FLOPs) and the number of parameters to evaluate the time complexity and space complexity of our model. The time complexity of a convolution operation can be calculated as follows:
\mathrm{FLOPs} \sim O\!\left(M^{2} \cdot K^{2} \cdot \mathrm{Channel}_{input} \cdot \mathrm{Channel}_{output}\right) \quad (21)
where M is the size of the output feature map produced by the convolution operation, K is the kernel size of the convolution operation, and Channel_input and Channel_output are the numbers of input and output channels, respectively. Therefore, the overall time complexity of the model is as follows:
\mathrm{FLOPs}_{Net} \sim O\!\left(\sum_{i=1}^{D} M_i^{2} \cdot K_i^{2} \cdot \mathrm{Channel}_{i,input} \cdot \mathrm{Channel}_{i,output}\right) \quad (22)
In Equation (22), D is the number of convolutional layers in the model. The space complexity of the model can be measured by its number of parameters:
\mathrm{Params} \sim O\!\left(\sum_{i=1}^{D} K_i^{2} \cdot \mathrm{Channel}_{i,input} \cdot \mathrm{Channel}_{i,output}\right) \quad (23)
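For instance, a quick back-of-the-envelope helper following Equations (21)–(23) could be written as below; this rough sketch ignores biases, normalization layers, and non-convolution operations, and the layer configuration is purely illustrative.

def conv_cost(out_size, kernel, c_in, c_out):
    # per-layer multiply-accumulate count and parameter count, Equations (21) and (23)
    flops = out_size ** 2 * kernel ** 2 * c_in * c_out
    params = kernel ** 2 * c_in * c_out
    return flops, params

# toy 3-layer stack on a 640x640 input downsampled by 2 at each layer (illustrative)
layers = [(320, 3, 3, 32), (160, 3, 32, 64), (80, 3, 64, 128)]
total_flops = sum(conv_cost(*l)[0] for l in layers)
total_params = sum(conv_cost(*l)[1] for l in layers)
print(f"{total_flops / 1e9:.2f} GFLOPs, {total_params / 1e6:.2f} M params")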
Furthermore, we use the frames per second (FPS) of the model to judge the real-time performance of the detector. Specifically, the FPS was tested with batch sizes of 1 and 32 in our experiments.

4. Experiment

In this section, we verify the performance of our proposed model. First, we introduce the four datasets and the environment settings used in the experiments. Second, we compare the detection results of our model with popular general object detection models and the latest state-of-the-art (SOTA) SAR ship-detection models to show the advantages of our proposed method. Finally, we design a series of ablation experiments to show the function and contribution of our proposed lightweight ViT-based backbone, Faster-WF2 module, CA mechanism, and optimized loss function.

4.1. Datasets

We used four sizable, real-world datasets from public repositories: the SAR Ship-Detection Dataset (SSDD) [58], the SAR-Ship Dataset [59], the High-Resolution SAR Image Dataset (HRSID) [60], and the Large-Scale SAR Ship-Detection Dataset-v1.0 (LS-SSDD-v1.0) [61]. These datasets were used to verify the effectiveness and generalization of our proposed method. The distribution of ships in the four datasets is shown in Figure 8. In addition, the statistics of these real-world public datasets are shown in Table 2.
The SSDD dataset was released in 2017; it contains 1160 images of complex scenes, and 2456 ships of different sizes. The images in the SSDD have sizes that vary from 217 × 214 to 526 × 646 , and resolutions that vary from 1 m to 15 m. In our experiments, we resized all of the images to 512 × 512 , and divided the dataset into a training set and a test set with the proportion of 8:2.
The SAR-Ship Dataset was released in 2019 for complex backgrounds; it contains 43,819 images and includes 59,535 ships. The images in the SAR-Ship Dataset have sizes of 256 × 256 pixels, and resolutions from 1.7 × 4.3 m to 25 × 25 m. We also used the SAR-Ship Dataset for the ablation experiment. The whole dataset was divided into a training set and a test set with the proportion of 8:2.
The HRSID dataset was released in 2020. It contains 5604 images with 800 × 800 pixels and 16,951 ships. The resolutions of the images are 0.5 m, 1 m, and 3 m. Moreover, 98% of the ships are small and medium ships, which is appropriate for high-resolution detection tasks. We divided the dataset into a training set with 65% of the images and a test set with 35% of the images.
The LS-SSDD-v1.0 Dataset was released in 2020 for large-scale scene tasks, and it contains 15 images with 24,000 × 16,000 pixels, 6003 small ships, and 12 medium ships. The images were cut into 9000 sub-images with 800 × 800 pixels, which were divided into a training set with 6000 images and a test set with 3000 images in our experiment. The resolution of the images in LS-SSDD-v1.0 was 5 × 20 m.

4.2. Experiment Settings and Hyperparameters

Our experimental environment configuration is shown in Table 3. All of the experiments used the same environment.
In our experiments, we compared the performance of our detector with both general object detection models and state-of-the-art SAR ship detectors. We set the initial learning rate to 0.01, the momentum to 0.937, and the weight decay to 0.0005, and we used an SGD optimizer during the training stage. For general object detection models other than YOLOv5s, we used the MMDetection framework provided by OpenMMLab [62], a toolbox that supports various mainstream object detection frameworks. Finally, for the SSDD, we checked the image sizes in the dataset; the average image size is 481 × 331, so we set the input size to 512 × 512, which does not cause a large loss of image information. For the SAR-Ship Dataset, we used its standard image resolution of 256 × 256 as the input, whereas the HRSID and LS-SSDD-v1.0 were used at their standard image resolution of 800 × 800 to prove that LRTransDet can process inputs of multiple scales.

4.3. Results from the SSDD

The SSDD is the most commonly used dataset in SAR ship-detection tasks thus far [63]. We compared LRTransDet with general object detectors such as Faster R-CNN, RetinaNet, FCOS, YOLOX, and our baseline YOLOv5s; the results are shown in Table 4. Although FCOS and YOLOX obtained higher precision and recall (97.0% and 95.6%, respectively), our model had advantages in F_1-score, mAP, parameters, FLOPs, and FPS (95.0%, 97.8%, 3.07 M, 3.85 G, and 74.6/964 for batch sizes of 1/32, respectively). The results on the SSDD show that our model is lighter and faster than the general detectors, and its results were obtained in real time.
As shown in Table 5, compared with the latest SOTA SAR ship detectors, our model is also superior in mAP, parameters, and FLOPs. The experimental results show that our model exceeds ATSD and MHASD, whose FLOPs and parameters are respectively the lowest in this table, by 1% in mAP. Compared with LPEDet, which otherwise performs best in this table, our model not only has a slight advantage in mAP but also achieves lower FLOPs and parameters. The experiments on the SSDD show that our model obtains higher accuracy and speed with lower memory usage.
The precision–recall curve (PR curve) for our proposed model is shown in Figure 9, and the visualization results on the SSDD are shown in Figure 10. They show that our proposed model detects multiple small offshore ships and large inshore ships in the SSDD particularly well, and it also assigns high confidence to these targets.

4.4. Results from the SAR-Ship Dataset

There is a great deal of noise and variation in the complex backgrounds of the SAR-Ship Dataset images; as such, we chose this dataset to verify the noise resistance and robustness of LRTransDet. The specific results are shown in Table 6 and Table 7, and the PR curve, shown in Figure 11, is evidently smooth. Compared with the general detectors, our model keeps a high precision, recall, and F_1-score in detection tasks with complex background scenarios. In addition, LRTransDet showed clear superiority over all of the general benchmarks in mAP, parameters, FLOPs, and FPS. The comprehensive results show that our model has a higher detection accuracy (95.1%), a faster detection speed (76.4 FPS with a batch of 1, and 1619 FPS with a batch of 32), and lower computational complexity (0.96 GFLOPs and 3.07 M params.).
Table 7 shows the results of LRTransDet and other SOTA models on the SAR-Ship Dataset. It is not easy to collect every indicator for all the SOTA detectors because some of the methods do not report accurate parameters and FLOPs. Compared with the other latest detectors, our model has the best mAP, parameters, and FLOPs, which verifies that LRTransDet is more comprehensively capable than the SOTA models. The results also prove that our model can deal well with complex scenes and the impact of noise, and that it can obtain more information at lower computation and memory costs.
The visualization results on the SAR-Ship Dataset are shown in Figure 12. It can be clearly observed that our model can handle noisy images and keep a high detection confidence.

4.5. Results on Large-Scale Datasets

We also verified that our model can handle large-scale SAR ship scenes by conducting experiments on two representative large-scale SAR ship datasets, the HRSID and LS-SSDD-v1.0. The PR curves of LRTransDet on these two datasets are shown in Figure 13. In Table 8 and Table 9, we compare LRTransDet with some popular general object detectors. Our model achieved large gains on these two datasets, reaching 93.9% and 76.2% mAP on the HRSID and LS-SSDD-v1.0, respectively, and exceeding the best general detector by 1.2% and 1.1%. In addition, compared with these general object detectors, our model maintains high precision, recall, and F_1-score at the same time, so LRTransDet has a better and more stable performance. In terms of computational complexity, our model had lower time and space complexity (9.4 GFLOPs and 3.07 M params.) than the general detectors. For FPS, our model has a faster inference speed (75.8 FPS with a batch of 1, and 578 FPS with a batch of 32), so it functions better in real time.
In addition, we also compared our model with the latest SOTA SAR ship detectors on the HRSID and LS-SSDD-v1.0, and the results are shown in Table 10 and Table 11, respectively. LRTransDet performs better on large-scale image detection tasks than the latest SAR ship detectors. This is particularly the case on the HRSID, where LRTransDet obtains clear advantages in mAP and parameters: it has a 3.6% higher mAP than the best SOTA detector, FBUA-Net, as shown in Table 10, and over 2 M fewer parameters than LPEDet (3.07 M vs. 5.68 M). Although ATSD has 2 fewer GFLOPs, our model achieves a 5.8% higher mAP (93.9% vs. 88.1%). Meanwhile, the FPS on the two large-scale datasets proves that our model is suitable for real-time scenarios.
On the LS-SSDD-v1.0, compared with the lightest SOTA model, Lite-YOLOv5 (as shown in Table 11), our detector achieves more than 3% higher mAP while using about 2 M more parameters and 5 more GFLOPs. Considering the three indicators comprehensively, these results prove that our proposed model keeps a relatively low calculation requirement while delivering better accuracy.
The visualization results on the two large-scale datasets are shown in Figure 14 and Figure 15. Figure 14 shows the detection results and ground truth on the HRSID. Compared with the SSDD, the inshore images of the HRSID contain many complex scenes and small targets, and the results show that our model captures these ships well. In addition, Figure 15 shows the results and corresponding ground truth on the LS-SSDD-v1.0. The targets are smaller than those in the other datasets, so the ships in this dataset are more difficult to detect.

4.6. Ablation Experiment

To prove the roles of our proposed lightweight ViT-based backbone, the Faster-WF2 module, the CA mechanism, and the optimized loss function designed for small ships, we conducted a series of ablation experiments. In Table 12, we compare the experimental results of two different backbones on the SAR-Ship Dataset. In this ablation, we compare our EfficientViT backbone with CSPDarkNet53, revealing that our chosen backbone reduces both GFLOPs (from 2.35 G to 0.96 G) and parameters (from 6.40 M to 3.07 M) by more than two-fold. Although our mAP lags behind that of CSPDarkNet53 by 0.4%, our backbone still delivers a comprehensive performance, and the fewer calculations and higher inference speed make our model suitable for deployment on hardware platforms. This comparison also proves that the MBConv operation in our backbone obviously reduces computational complexity, while the power of transformers to capture global information compensates, to a certain extent, for the performance loss caused by reducing computational complexity.
Second, we compare the results of using different numbers of Faster-WF2 modules in Table 13. By adding more Faster-WF2 modules, both the mAP and the calculation cost are improved. Compared with the model without Faster-WF2 modules, the detector using four Faster-WF2 modules achieved 0.8% higher precision, 0.4% higher recall, a 0.7% higher F_1-score, and 0.6% higher mAP, while reducing the parameters by 0.55 M and the FLOPs by 0.13 G. These results show that our Faster-WF2 module improves the performance of the detector by using the FasterBlock to remove computational redundancy in SAR images and the weighted fusion unit to aggregate more information from the different scales of the feature maps.
To further verify the effectiveness of the CA mechanism in our neck structure, we compare the performance of our detector with and without the CA mechanism in Table 14. After applying the coordinate attention mechanism, the mAP improved by 1.2% (from 93.9% to 95.1%), while the parameters and FLOPs increased only slightly. This ablation result shows that the CA mechanism captures the location information in the feature maps well and has a clear impact on feature fusion in the neck.
Finally, we evaluate the effectiveness of the NWD-optimized loss function in Table 15. The experimental results show that our optimization of the loss function is effective: the model with the NWD metric improves the mAP by 1.6% compared with the traditional CIoU-based loss function. Considering that the bounding boxes in the SAR-Ship Dataset are small [59], this result proves that the NWD metric effectively captures the information of small ships in SAR images. Moreover, since the backgrounds in this dataset are complex, our ablation results also verify that the NWD-optimized loss function enables the model to better handle complex scenes.

5. Discussion and Future Works

LRTransDet consists of a lightweight ViT-based backbone and a faster weighted feature fusion neck. The lightweight ViT structure can capture more SAR image information while demanding minimal computational resources. The Faster-WF2 module in our neck reduces redundant calculations in SAR images, thereby enhancing the processing speed of our model during the multi-scale feature fusion. The loss function combined with the NWD metric optimizes the performance of our model in terms of ship detection. The experimental results on the four challenging SAR datasets have demonstrated the advantages of our proposed detector in this paper.
However, there are still some limitations. As shown in Figure 16a,d, when the image is an inshore image with a complex background that includes many ships, our detector may occasionally miss detections. For instance, in the lower left corner of Figure 16a, the ship marked by the red rectangle eluded detection. As shown in Figure 16d, only five ships were detected in the region marked by the red rectangle, and the detector missed most of the ships in the corresponding ground truth. Moreover, as shown in Figure 16f, when the ground truth bounding boxes overlap, our model may deliver a false detection by detecting multiple overlapping ships as one ship (an example of this error is marked by the yellow rectangle in Figure 16e). Finally, in Figure 16g, LRTransDet had difficulty distinguishing objects that closely resemble ships. These bottlenecks that affect the model’s performance are also the directions in which we need to improve next.
In the future, we will continue to address these limitations and improve the performance of our model. Our focus will be on enhancing the overall performance of the model while also exploring quantization and compression techniques and streamlining the deployment of our detector on hardware platforms. In addition, we will further explore and optimize the detection algorithm on datasets with more complex information.

6. Conclusions

In this paper, we introduced LRTransDet, an efficient, multi-scale, and noise-resistant SAR ship-detection model. To extract rich features from SAR images, we integrated EfficientViT into the backbone, which combines the ability of CNNs to extract local features with the ability of transformers to extract global features. In addition, to reduce the redundant calculation in the neck and improve the fusion efficiency of the feature pyramid, we designed a faster weighted feature fusion neck based on the Faster-WF2 module and the coordinate attention mechanism. To better detect small ships, we combined the NWD metric with the CIoU metric to optimize the loss function, ensuring strong performance in small ship detection. According to the experimental results on the SSDD, the SAR-Ship Dataset, the HRSID, and the LS-SSDD-v1.0, our detector achieved superior performance compared with general object detection models and several of the latest SOTA ship-detection models. Through a series of ablation experiments, we showed that our proposed backbone, neck, and optimized loss function improve the detection performance of the network without consuming excessive computing and storage resources.
In summary, LRTransDet delivers superior performance while maintaining lower resource consumption and higher processing speed. This makes it suitable for ship-detection tasks on edge devices, which is of great significance for the industrial application of lightweight SAR ship-detection models. In the future, we will try to deploy our model on a hardware platform and continue to design detectors that are better suited to SAR ship detection.

Author Contributions

Conceptualization, K.F.; methodology, K.F. and L.L.; software, K.F.; validation, L.L., X.W. and X.C.; data curation, L.L.; writing—original draft preparation, K.F.; writing—review and editing, L.L. and X.C.; visualization, K.F. and L.L.; supervision, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Fu, J.; Sun, X.; Wang, Z.; Fu, K. An Anchor-Free Method Based on Feature Balancing and Refinement Network for Multiscale Ship Detection in SAR Images. IEEE Trans. Geosci. Remote. Sens. 2021, 59, 1331–1344. [Google Scholar] [CrossRef]
  2. Lv, J.; Chen, J.; Huang, Z.; Wan, H.; Zhou, C.; Wang, D.; Wu, B.; Sun, L. An Anchor-Free Detection Algorithm for SAR Ship Targets with Deep Saliency Representation. Remote Sens. 2023, 15, 103. [Google Scholar] [CrossRef]
  3. Hong, Z.; Yang, T.; Tong, X.; Zhang, Y.; Jiang, S.; Zhou, R.; Han, Y.; Wang, J.; Yang, S.; Liu, S. Multi-Scale Ship Detection From SAR and Optical Imagery Via A More Accurate YOLOv3. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6083–6101. [Google Scholar] [CrossRef]
  4. Yang, G.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Wang, J.; Zhang, X. Algorithm/Hardware Codesign for Real-Time On-Satellite CNN-Based Ship Detection in SAR Imagery. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  5. Leng, X.; Ji, K.; Yang, K.; Zou, H. A Bilateral CFAR Algorithm for Ship Detection in SAR Images. IEEE Geosci. Remote. Sens. Lett. 2015, 12, 1536–1540. [Google Scholar] [CrossRef]
  6. Zhou, X.; Zhang, L. SA-FPN: An effective feature pyramid network for crowded human detection. Appl. Intell. 2022, 52, 12556–12568. [Google Scholar] [CrossRef]
  7. Chen, J.; Wang, Q.; Peng, W.; Xu, H.; Li, X.; Xu, W. Disparity-Based Multiscale Fusion Network for Transportation Detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 18855–18863. [Google Scholar] [CrossRef]
  8. Chen, A.; Zhang, K.; Zhang, R.; Wang, Z.; Lu, Y.; Guo, Y.; Zhang, S. PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 5291–5301. [Google Scholar]
  9. Zong, C.; Wan, Z. Container ship cell guide accuracy check technology based on improved 3D point cloud instance segmentation. Brodogr. Teor. Praksa Brodogr. Pomor. Teh. 2022, 73, 23–35. [Google Scholar] [CrossRef]
  10. Xu, Y.; Bazarjani, A.; Chi, H.g.; Choi, C.; Fu, Y. Uncovering the Missing Pattern: Unified Framework Towards Trajectory Imputation and Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 9632–9643. [Google Scholar]
  11. Qian, L.; Zheng, Y.; Li, L.; Ma, Y.; Zhou, C.; Zhang, D. A New Method of Inland Water Ship Trajectory Prediction Based on Long Short-Term Memory Network Optimized by Genetic Algorithm. Appl. Sci. 2022, 12, 4073. [Google Scholar] [CrossRef]
  12. Zheng, Y.; Lv, X.; Qian, L.; Liu, X. An Optimal BP Neural Network Track Prediction Method Based on a GA&ACO Hybrid Algorithm. J. Mar. Sci. Eng. 2022, 10, 1399. [Google Scholar] [CrossRef]
  13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  14. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  15. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  16. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
  17. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 9626–9635. [Google Scholar]
  18. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  19. Tian, Y.; Wang, X.; Zhu, S.; Xu, F.; Liu, J. LMSD-Net: A Lightweight and High-Performance Ship Detection Network for Optical Remote Sensing Images. Remote Sens. 2023, 15, 4358. [Google Scholar] [CrossRef]
  20. Zheng, Y.; Zhang, Y.; Qian, L.; Zhang, X.; Diao, S.; Liu, X.; Cao, J.; Huang, H. A lightweight ship target detection model based on improved YOLOv5s algorithm. PLoS ONE 2023, 18, 1–23. [Google Scholar] [CrossRef]
  21. Zheng, Y.; Liu, P.; Qian, L.; Qin, S.; Liu, X.; Ma, Y.; Cheng, G. Recognition and Depth Estimation of Ships Based on Binocular Stereo Vision. J. Mar. Sci. Eng. 2022, 10, 1153. [Google Scholar] [CrossRef]
  22. Yasir, M.; Zhan, L.; Liu, S.; Wan, J.; Hossain, M.S.; Isiacik Colak, A.T.; Liu, M.; Islam, Q.U.; Raza Mehdi, S.; Yang, Q. Instance segmentation ship detection based on improved Yolov7 using complex background SAR images. Front. Mar. Sci. 2023, 10, 1113669. [Google Scholar] [CrossRef]
  23. Yan, G.; Chen, Z.; Wang, Y.; Cai, Y.; Shuai, S. LssDet: A Lightweight Deep Learning Detector for SAR Ship Detection in High-Resolution SAR Images. Remote Sens. 2022, 14, 5148. [Google Scholar] [CrossRef]
  24. Xu, X.; Zhang, X.; Zhang, T. Lite-YOLOv5: A Lightweight Deep Learning Detector for On-Board Ship Detection in Large-Scene Sentinel-1 SAR Images. Remote Sens. 2022, 14, 1018. [Google Scholar] [CrossRef]
  25. Zeng, N.; Wu, P.; Wang, Z.; Li, H.; Liu, W.; Liu, X. A Small-Sized Object Detection Oriented Multi-Scale Feature Fusion Approach With Application to Defect Detection. IEEE Trans. Instrum. Meas. 2022, 71, 1–14. [Google Scholar] [CrossRef]
  26. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving Multi-Scale Feature Learning for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12592–12601. [Google Scholar]
  27. Liu, D.; Cui, Y.; Yan, L.; Mousas, C.; Yang, B.; Chen, Y. DenserNet: Weakly Supervised Visual Localization Using Multi-Scale Feature Aggregation. Proc. AAAI Conf. Artif. Intell. 2021, 35, 6101–6109. [Google Scholar] [CrossRef]
  28. Huang, L.; Chen, C.; Yun, J.; Sun, Y.; Tian, J.; Hao, Z.; Yu, H.; Ma, H. Multi-Scale Feature Fusion Convolutional Neural Network for Indoor Small Target Detection. Front. Neurorobotics 2022, 16, 881021. [Google Scholar] [CrossRef]
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. Adv. Neural Inf. Process. Syst. (NeurIPS) 2017, 30, 600–610. [Google Scholar]
  30. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual-Only, 3–7 May 2021. [Google Scholar]
  31. Cai, H.; Li, J.; Hu, M.; Gan, C.; Han, S. EfficientViT: Lightweight Multi-Scale Attention for On-Device Semantic Segmentation. arXiv 2023, arXiv:2205.14756v3. [Google Scholar] [CrossRef]
  32. Bai, L.; Yao, C.; Ye, Z.; Xue, D.; Lin, X.; Hui, M. A Novel Anchor-Free Detector Using Global Context-Guide Feature Balance Pyramid and United Attention for SAR Ship Detection. IEEE Geosci. Remote. Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  33. Zha, M.; Qian, W.; Yang, W.; Xu, Y. Multifeature Transformation and Fusion-Based Ship Detection With Small Targets and Complex Backgrounds. IEEE Geosci. Remote. Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  34. Yao, C.; Xie, P.; Zhang, L.; Fang, Y. ATSD: Anchor-Free Two-Stage Ship Detection Based on Feature Enhancement in SAR Images. Remote Sens. 2022, 14, 6058. [Google Scholar] [CrossRef]
  35. Li, J.; Xu, C.; Su, H.; Gao, L.; Wang, T. Deep Learning for SAR Ship Detection: Past, Present and Future. Remote Sens. 2022, 14, 2712. [Google Scholar] [CrossRef]
  36. Feng, Y.; Chen, J.; Huang, Z.; Wan, H.; Xia, R.; Wu, B.; Sun, L.; Xing, M. A Lightweight Position-Enhanced Anchor-Free Algorithm for SAR Ship Detection. Remote Sens. 2022, 14, 1908. [Google Scholar] [CrossRef]
  37. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  38. Peng, Z.; Huang, W.; Gu, S.; Xie, L.; Wang, Y.; Jiao, J.; Ye, Q. Conformer: Local Features Coupling Global Representations for Visual Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 367–376. [Google Scholar]
  39. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  40. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable {DETR}: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations, Virtual-Only, 3–7 May 2021. [Google Scholar]
  41. Wang, Y.; Guizilini, V.C.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries. In Proceedings of the 5th Conference on Robot Learning, London, UK, 8–11 November 2022; pp. 180–191. [Google Scholar]
  42. Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity. In Proceedings of the International Conference on Learning Representations, Virtual-Only, 25–29 April 2022. [Google Scholar]
  43. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. In Proceedings of the International Conference on Learning Representations, Virtual-Only, 25–29 April 2022. [Google Scholar]
  44. Huang, T.; Huang, L.; You, S.; Wang, F.; Qian, C.; Xu, C. LightViT: Towards Light-Weight Convolution-Free Vision Transformers. arXiv 2022, arXiv:2207.05557. [Google Scholar]
  45. Yang, C.; Wang, Y.; Zhang, J.; Zhang, H.; Wei, Z.; Lin, Z.; Yuille, A. Lite Vision Transformer with Enhanced Self-Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11998–12008. [Google Scholar]
  46. Xia, R.; Chen, J.; Huang, Z.; Wan, H.; Wu, B.; Sun, L.; Yao, B.; Xiang, H.; Xing, M. CRTransSar: A Visual Transformer Based on Contextual Joint Representation Learning for SAR Ship Detection. Remote Sens. 2022, 14, 1488. [Google Scholar] [CrossRef]
  47. Shi, H.; Chai, B.; Wang, Y.; Chen, L. A Local-Sparse-Information-Aggregation Transformer with Explicit Contour Guidance for SAR Ship Detection. Remote Sens. 2022, 14, 5247. [Google Scholar] [CrossRef]
  48. Yu, N.; Ren, H.; Deng, T.; Fan, X. A Lightweight Radar Ship Detection Framework with Hybrid Attentions. Remote Sens. 2023, 15, 2743. [Google Scholar] [CrossRef]
  49. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; NanoCode012; Kwon, Y.; Michael, K.; TaoXie.; Fang, J. ultralytics/yolov5: V7.0—YOLOv5 SOTA Realtime Instance Segmentation. Available online: https://zenodo.org/records/7347926 (accessed on 8 November 2023).
  50. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  51. Michel, P.; Levy, O.; Neubig, G. Are Sixteen Heads Really Better than One? Adv. Neural Inf. Process. Syst. (NeurIPS) 2019, 32. [Google Scholar]
  52. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  53. Chen, J.; Kao, S.h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 12021–12031. [Google Scholar]
  54. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  55. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  56. Wang, J.; Xu, C.; Yang, W.; Yu, L. A Normalized Gaussian Wasserstein Distance for Tiny Object Detection. arXiv 2022, arXiv:2110.13389. [Google Scholar]
  57. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  58. Li, J.; Qu, C.; Shao, J. Ship detection in SAR images based on an improved faster R-CNN. In Proceedings of the 2017 SAR in Big Data Era: Models, Methods and Applications (BIGSARDATA), Beijing, China, 13–14 November 2017; pp. 1–6. [Google Scholar] [CrossRef]
  59. Wang, Y.; Wang, C.; Zhang, H.; Dong, Y.; Wei, S. A SAR Dataset of Ship Detection for Deep Learning under Complex Backgrounds. Remote Sens. 2019, 11, 765. [Google Scholar] [CrossRef]
  60. Wei, S.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A High-Resolution SAR Images Dataset for Ship Detection and Instance Segmentation. IEEE Access 2020, 8, 120234–120254. [Google Scholar] [CrossRef]
  61. Zhang, T.; Zhang, X.; Ke, X.; Zhan, X.; Shi, J.; Wei, S.; Pan, D.; Li, J.; Su, H.; Zhou, Y.; et al. LS-SSDD-v1.0: A Deep Learning Dataset Dedicated to Small Ship Detection from Large-Scale Sentinel-1 SAR Images. Remote Sens. 2020, 12, 2997. [Google Scholar] [CrossRef]
  62. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
  63. Li, J.; Chen, J.; Cheng, P.; Yu, Z.; Yu, L.; Chi, C. A Survey on Deep-Learning-Based Real-Time SAR Ship Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3218–3247. [Google Scholar] [CrossRef]
  64. Bai, L.; Yao, C.; Ye, Z.; Xue, D.; Lin, X.; Hui, M. Feature Enhancement Pyramid and Shallow Feature Reconstruction Network for SAR Ship Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 1042–1056. [Google Scholar] [CrossRef]
  65. Zhou, K.; Zhang, M.; Wang, H.; Tan, J. Ship Detection in SAR Images Based on Multi-Scale Feature Extraction and Adaptive Feature Fusion. Remote Sens. 2022, 14, 755. [Google Scholar] [CrossRef]
  66. Li, L.; Jiang, L.; Zhang, J.; Wang, S.; Chen, F. A Complete YOLO-Based Ship Detection Method for Thermal Infrared Remote Sensing Images under Complex Backgrounds. Remote Sens. 2022, 14, 1534. [Google Scholar] [CrossRef]
  67. Yang, S.; An, W.; Li, S.; Wei, G.; Zou, B. An Improved FCOS Method for Ship Detection in SAR Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8910–8927. [Google Scholar] [CrossRef]
  68. Li, K.; Zhang, M.; Xu, M.; Tang, R.; Wang, L.; Wang, H. Ship Detection in SAR Images Based on Feature Enhancement Swin Transformer and Adjacent Feature Fusion. Remote Sens. 2022, 14, 3186. [Google Scholar] [CrossRef]
  69. Yu, J.; Wu, T.; Zhou, S.; Pan, H.; Zhang, X.; Zhang, W. An SAR Ship Object Detection Algorithm Based on Feature Information Efficient Representation Network. Remote Sens. 2022, 14, 3489. [Google Scholar] [CrossRef]
  70. Gao, S.; Liu, J.M.; Miao, Y.H.; He, Z.J. A High-Effective Implementation of Ship Detector for SAR Images. IEEE Geosci. Remote. Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  71. Shao, Z.; Zhang, X.; Wei, S.; Shi, J.; Ke, X.; Xu, X.; Zhan, X.; Zhang, T.; Zeng, T. Scale in Scale for SAR Ship Instance Segmentation. Remote Sens. 2023, 15, 629. [Google Scholar] [CrossRef]
  72. Yu, N.; Ren, H.; Deng, T.; Fan, X. HA-SARSD: An Effective SAR Ship detector via the Hybrid Attention Residual Module. In Proceedings of the 2023 IEEE Radar Conference (RadarConf23), San Antonio, TX, USA, 1–5 May 2023; pp. 1–6. [Google Scholar] [CrossRef]
  73. Wang, S.; Gao, S.; Zhou, L.; Liu, R.; Zhang, H.; Liu, J.; Jia, Y.; Qian, J. YOLO-SD: Small Ship Detection in SAR Images by Multi-Scale Convolution and Feature Transformer Module. Remote Sens. 2022, 14, 5268. [Google Scholar] [CrossRef]
  74. Zhu, M.; Hu, G.; Zhou, H.; Wang, S.; Feng, Z.; Yue, S. A Ship Detection Method via Redesigned FCOS in Large-Scale SAR Images. Remote Sens. 2022, 14, 1153. [Google Scholar] [CrossRef]
Figure 1. Architecture of a transformer. The multi-head attention mechanism is the main block in the transformer. In this figure, we assume the length of the input sequence is two.
Figure 2. The overall structure of our proposed LRTransDet.
Figure 3. The overall structure of our lightweight-based backbone. (a) The detail framework of our backbone. (b) The detail of the depthwise separable convolution (DSConv). (c) The calculation of the mobile inverted bottleneck convolution (MBConv).
Figure 4. The dataflow of the EfficientViT module. Each EfficientViT module consists of a lightweight MSA and a MBConv for capturing global features and local features.
Figure 5. Architecture of the faster weighted Feature Fusion Neck. The main blocks for our proposed neck are the Faster-WF2 module and CA mechanism.
Figure 6. Details of FasterBlock in the neck.
Figure 7. The structure of CA.
Figure 8. Description of the size of the bounding boxes for the four datasets. (a) SSDD. (b) The SAR-Ship Dataset. (c) HRSID. (d) LS-SSDD-v1.0.
Figure 9. The precision–recall curve of LRTransDet on the SSDD [58].
Figure 10. The visualization results of our designed LRTransDet on SSDD [58]. (a) Detection results. (b) Corresponding ground truth.
Figure 11. The precision–recall curve of LRTransDet on the SAR-Ship Dataset [59].
Figure 12. The visualization results of our designed LRTransDet on the SAR-Ship Dataset [59]. (a) Detection results. (b) Corresponding ground truth.
Figure 13. The precision–recall curves of LRTransDet on two large-scale datasets. (a) The HRSID [60]. (b) The LS-SSDD-v1.0 [61].
Figure 14. The visualization results of our designed LRTransDet on the HRSID [60]. (a) Detection results. (b) Corresponding ground truth.
Figure 15. The visualization results of our designed LRTransDet on the LS-SSDD-v1.0 [61]. (a) Detection results. (b) Corresponding ground truth.
Figure 16. The visualization of the performance limitations on detection. (a–d) Two sets of missed detection examples and the corresponding ground truth. (e–h) Two sets of false detection examples and the corresponding ground truth.
Table 1. The details of LRTransDet.
Input | Layer Name | Kernel Size | Stride | Output
640 × 640 × 3 | CBH | 3 | 2 | 320 × 320 × 8
320 × 320 × 8 | DSConv | 3 | 1 | 320 × 320 × 8
320 × 320 × 8 | MBConv | 3 | 2 | 160 × 160 × 16
160 × 160 × 16 | MBConv | 3 | 1 | 160 × 160 × 16
160 × 160 × 16 | MBConv | 3 | 2 | 80 × 80 × 32
80 × 80 × 32 | MBConv | 3 | 1 | 80 × 80 × 32
80 × 80 × 32 | MBConv | 3 | 2 | 40 × 40 × 64
40 × 40 × 64 | LightWeight MSA | 1/3/5 | 1 | 40 × 40 × 64
40 × 40 × 64 | MBConv | 3 | 1 | 40 × 40 × 64
40 × 40 × 64 | LightWeight MSA | 1/3/5 | 1 | 40 × 40 × 64
40 × 40 × 64 | MBConv | 3 | 1 | 40 × 40 × 64
40 × 40 × 64 | MBConv | 3 | 2 | 20 × 20 × 128
20 × 20 × 128 | LightWeight MSA | 1/3/5 | 1 | 20 × 20 × 128
20 × 20 × 128 | MBConv | 3 | 1 | 20 × 20 × 128
20 × 20 × 128 | LightWeight MSA | 1/3/5 | 1 | 20 × 20 × 128
20 × 20 × 128 | MBConv | 3 | 1 | 20 × 20 × 128
20 × 20 × 128 | SPPF | 5 | 1 | 20 × 20 × 512
20 × 20 × 512 | Conv | 1 | 1 | 20 × 20 × 256
20 × 20 × 256 | Upsample | – | – | 40 × 40 × 256
40 × 40 × 256 | Faster-WF2 | 1/3 | 1 | 40 × 40 × 256
40 × 40 × 256 | CA | 1 | 1 | 40 × 40 × 256
40 × 40 × 256 | Conv | 1 | 1 | 40 × 40 × 128
40 × 40 × 128 | Upsample | – | – | 80 × 80 × 128
80 × 80 × 128 | Faster-WF2 | 1/3 | 1 | 80 × 80 × 128
80 × 80 × 128 | CA | 1 | 1 | 80 × 80 × 128
80 × 80 × 128 | Conv | 3 | 2 | 40 × 40 × 128
40 × 40 × 128 | Faster-WF2 | 1/3 | 1 | 40 × 40 × 256
40 × 40 × 256 | CA | 1 | 1 | 40 × 40 × 256
40 × 40 × 256 | Conv | 3 | 2 | 20 × 20 × 256
20 × 20 × 256 | Faster-WF2 | 1/3 | 1 | 20 × 20 × 512
The kernel size of LightWeight MSA was 1/3/5, which means there are three types of convolution in the block. The details of them can be found in Figure 4. The kernel size of Faster-WF2 was 1/3, which means there are two types of convolution in the block. The details of them can be found in Figure 5 and Figure 6.
Table 2. Statistics of the four datasets.
Details | SSDD | SAR-Ship Dataset | HRSID | LS-SSDD-v1.0
Satellite | RadarSat-2, TerraSAR-X, Sentinel-1 | Gaofen-3, Sentinel-1 | Sentinel-1B, TerraSAR-X, TanDEM | Sentinel-1
Polarization | HH, HV, VV, VH | Single, Dual, Full | HH, HV, VV | VV, VH
Resolution (m) | 1–15 | 1.7 × 4.3–25 × 25 | 0.5, 1, 3 | 5 × 20
Image number | 1160 | 43,819 | 5604 | 9000
Train/test ratio | 928/232 | 35,055/8764 | 3643/1961 | 6000/3000
Ship number | 2456 | 59,535 | 16,951 | 6015
Image size (pixel²) | 217 × 214–526 × 646 | 256 × 256 | 800 × 800 | 800 × 800
Average image size (pixel²) | 481 × 331 | 256 × 256 | 800 × 800 | 800 × 800
Table 3. Experimental environment configuration.
Item | Description
Python version | 3.7.11
Pytorch version | 1.12.1
CPU type | Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
GPU type | NVIDIA Corporation GA100 [GRID A100 PCIe 40GB] (rev a1)
Linux version | Red Hat Enterprise Linux Server v7.0
CUDA and CUDNN version | CUDA 11.4 + CUDNN 8.8.0
Table 4. Results of the comparison with the general object detection models on the SSDD [58]. We calculated the FPS with a batch of 1 and of 32.
Methods | Precision (%) | Recall (%) | F1-Score (%) | mAP (%) | Params. (M) | FLOPs (G) | FPS (b@1) | FPS (b@32)
Faster-RCNN [13] | 87.3 | 89.2 | 88.2 | 91.3 | 41.3 | 51.6 | 44.6 | 785
RetinaNet [16] | 81.2 | 67.4 | 73.7 | 79.2 | 36.3 | 40.1 | 51.8 | 796
FCOS [17] | 80.3 | 95.6 | 87.3 | 89.4 | 32.1 | 38.6 | 51.5 | 775
YOLOX [18] | 97.0 | 63.2 | 76.5 | 94.7 | 8.94 | 8.52 | 65.3 | 833
YOLOv5s [49] | 96.1 | 94.0 | 95.0 | 97.5 | 7.03 | 10.21 | 73.5 | 863
LRTransDet | 96.8 | 93.2 | 95.0 | 97.8 | 3.07 | 3.85 | 74.6 | 964
The best results are in bold.
Table 5. Comparison results with state-of-the-art SAR ship-detection models on the SSDD [58].
Methods | mAP (%) | Parameters (M) | FLOPs (G)
ATSD [34] | 96.8 | 61.5 | 7.25
LSIA-CGSD [47] | 94.7 | 33.52 | 36.8
MHASD [48] | 96.8 | 5.5 | 13.7
LPEDet [36] | 97.4 | 5.68 | 18.38
FBUA-Net [32] | 96.2 | 36.54 | 71.11
CRTransSar [46] | 97.0 | 96 | –
FEPS-Net [64] | 96.0 | 37.31 | –
LRTransDet | 97.8 | 3.07 | 3.85
The best results are in bold.
Table 6. Results of the comparison with the general object detection models on the SAR-Ship Dataset [59]. We calculated the FPS with a batch of 1 and of 32.
Methods | Precision (%) | Recall (%) | F1-Score (%) | mAP (%) | Params. (M) | FLOPs (G) | FPS (b@1) | FPS (b@32)
Faster-RCNN [13] | 93.2 | 87.6 | 90.3 | 92.7 | 41.3 | 26.2 | 47.4 | 1009
RetinaNet [16] | 86.9 | 76.6 | 81.4 | 87.2 | 36.3 | 13.1 | 61.0 | 1050
FCOS [17] | 89.3 | 96.3 | 92.7 | 94.7 | 32.1 | 12.57 | 50.3 | 1016
YOLOX [18] | 87.9 | 86.6 | 87.2 | 89.5 | 8.94 | 2.13 | 63.3 | 1052
YOLOv5s [49] | 91.3 | 91.8 | 91.5 | 94.8 | 7.03 | 2.56 | 76.6 | 1591
LRTransDet | 92.6 | 91.5 | 92.0 | 95.1 | 3.07 | 0.96 | 76.4 | 1619
The best results are in bold.
Table 7. Results of the comparison with the state-of-the-art SAR ship-detection models on the SAR-Ship Dataset [59].
Methods | mAP (%) | Parameters (M) | FLOPs (G)
MFTF-Net [33] | 92.1 | 59.2 | 9.27
MSSDNet [65] | 95.1 | 25.8 | –
IC-CYSDM [66] | 92.8 | 4.1 | 9.3
Improved-FCOS [67] | 94.1 | 40.6 | –
SRDet [2] | 95.1 | 35.2 | –
ESTDNet [68] | 95.0 | – | –
FIERNet [69] | 92.01 | – | –
LRTransDet | 95.1 | 3.07 | 0.96
The best results are in bold.
Table 8. Results of the comparison with the general object detection models on the HRSID [60]. We calculated the FPS with a batch of 1 and of 32.
Methods | Precision (%) | Recall (%) | F1-Score (%) | mAP (%) | Params. (M) | FLOPs (G) | FPS (b@1) | FPS (b@32)
Faster-RCNN [13] | 80.2 | 82.1 | 81.1 | 79.4 | 41.3 | 134 | 41.8 | 442
RetinaNet [16] | 64.8 | 55.1 | 59.6 | 62.6 | 36.3 | 128 | 55.6 | 476
FCOS [17] | 77.3 | 87.9 | 82.2 | 80.9 | 32.1 | 123 | 48.5 | 441
YOLOX [18] | 94.5 | 54.6 | 69.2 | 90.5 | 8.94 | 20.8 | 63.3 | 564
YOLOv5s [49] | 91.9 | 85.0 | 88.3 | 92.7 | 7.03 | 24.9 | 68.5 | 575
LRTransDet | 92.7 | 87.8 | 90.2 | 93.9 | 3.07 | 9.4 | 75.8 | 578
The best results are in bold.
Table 9. Results of the comparison with general object detection models on the LS-SSDD-v1.0 [61]. We calculated the FPS with a batch of 1 and of 32.
Methods | Precision (%) | Recall (%) | F1-Score (%) | mAP (%) | Params. (M) | FLOPs (G) | FPS (b@1) | FPS (b@32)
Faster-RCNN [13] | 69.0 | 43.2 | 53.1 | 58.2 | 41.3 | 134 | 41.8 | 442
RetinaNet [16] | 52.8 | 77.4 | 62.7 | 54.6 | 36.3 | 128 | 55.6 | 476
FCOS [17] | 53.5 | 88.0 | 66.5 | 65.0 | 32.1 | 123 | 48.5 | 441
YOLOX [18] | 75.6 | 65.6 | 70.2 | 73.6 | 8.94 | 20.8 | 63.3 | 564
YOLOv5s [49] | 83.8 | 67.8 | 74.9 | 75.1 | 7.03 | 24.9 | 68.5 | 575
LRTransDet | 82.6 | 68.8 | 75.1 | 76.2 | 3.07 | 9.4 | 75.8 | 578
The best results are in bold.
Table 10. Results of the comparison with the state-of-the-art SAR ship-detection models on the HRSID [60].
Methods | mAP (%) | Parameters (M) | FLOPs (G)
ATSD [34] | 88.1 | 61.5 | 7.25
LSIA-CGSD [47] | 87.8 | 33.52 | 36.8
SAR-Net [70] | 87.5 | 42.6 | 104.2
LPEDet [36] | 89.7 | 5.68 | 18.38
FBUA-Net [32] | 90.3 | 36.54 | 194.12
FEPS-Net [64] | 90.7 | 37.31 | –
SISNet [71] | 92.4 | 118.10 | –
LRTransDet | 93.9 | 3.07 | 9.4
The best results are in bold.
Table 11. Results of the comparison with state-of-the-art SAR ship-detection models on the LS-SSDD-v1.0 [61].
Methods | mAP (%) | Parameters (M) | FLOPs (G)
HA-SARSD [72] | 76.0 | 6.1 | 15.8
SAR-Net [70] | 76.2 | 42.6 | 104.2
MHASD [48] | 75.5 | 5.12 | 13.7
LssDet [23] | 74.8 | 2.25 | 4.49
Lite-YOLOv5 [24] | 73.15 | 1.04 | 4.44
YOLO-SD [73] | 74.4 | 59.60 | –
R-FCOS [74] | 75.5 | – | –
LRTransDet | 76.2 | 3.07 | 9.4
The best results are in bold.
Table 12. Ablation experiment results from the SAR-Ship Dataset [59] when verifying the effect of the lightweight ViT backbone.
Backbone | Precision (%) | Recall (%) | F1-Score (%) | mAP (%) | Parameters (M) | FLOPs (G)
EfficientViT | 92.6 | 91.5 | 92.0 | 95.1 | 3.07 | 0.96
CSPDarkNet53 | 92.3 | 92.1 | 92.1 | 95.5 | 6.40 | 2.35
The best results are in bold.
Table 13. Ablation experiment results from the SAR-Ship Dataset [59] when verifying the effect of the Faster-WF2 module.
Number of Faster-WF2 | Precision (%) | Recall (%) | F1-Score (%) | mAP (%) | Parameters (M) | FLOPs (G)
4 | 92.6 | 91.5 | 92.0 | 95.1 | 3.07 | 0.96
3 | 91.7 | 91.1 | 91.4 | 94.8 | 3.49 | 1.02
2 | 91.3 | 89.7 | 90.5 | 93.8 | 3.60 | 1.07
1 | 91.1 | 90.0 | 90.5 | 94.0 | 3.60 | 1.08
0 | 91.8 | 90.9 | 91.3 | 94.5 | 3.62 | 1.09
The best results are in bold.
Table 14. Ablation experiment results from the SAR-Ship Dataset [59] when verifying the effect of coordinate attention.
Use CA | Precision (%) | Recall (%) | F1-Score (%) | mAP (%) | Parameters (M) | FLOPs (G)
✓ | 92.6 | 91.5 | 92.0 | 95.1 | 3.07 | 0.964
– | 90.5 | 89.6 | 90.0 | 93.9 | 3.05 | 0.962
The best results are in bold. ✓ means we set CA mechanism in this model.
Table 15. Ablation experiment results from the SAR-Ship Dataset [59] when verifying the effect of the NWD optimized loss function.
Use NWD | Precision (%) | Recall (%) | F1-Score (%) | mAP (%) | Parameters (M) | FLOPs (G)
✓ | 92.6 | 91.5 | 92.0 | 95.1 | 3.07 | 0.96
– | 89.8 | 89.4 | 89.6 | 93.5 | 3.07 | 0.96
The best results are in bold. ✓ means we use NWD optimized loss function in this model.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
