1. Introduction
Synthetic aperture radar (SAR) offers all-weather imaging, which makes it well suited to reconnaissance [1]. With the development of airborne and spaceborne SAR systems such as Gaofen-3, Sentinel-1, TerraSAR-X, and Radarsat-2, SAR has been widely used in military and civil fields. As a basic maritime task, SAR ship detection has important value in maritime traffic control, fishery management, and maritime emergency rescue [2]. Up to now, the SAR ship detection field can be roughly divided into two development stages: traditional methods and deep learning-based methods.
Traditional methods fall into three categories: (1) polarization information [3,4]; (2) wavelet transform [5,6]; (3) statistical characteristics [7,8]. Constant false alarm rate (CFAR) [9,10] is the most widely used of these traditional methods. A CFAR detector adaptively calculates the detection threshold by estimating the statistics of the background clutter, thereby maintaining a constant false-alarm probability. However, traditional methods rely on cumbersome manual design, involve complicated calculation, and transfer poorly to new scenes, which restricts their practical application. In addition, they demand extensive domain expertise from researchers and are prone to over-fitting.
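To make the CFAR idea concrete, the following is a minimal sketch of a one-dimensional cell-averaging CFAR detector in NumPy. The training/guard-cell sizes and the square-law clutter model behind the scaling factor are illustrative assumptions, not details taken from the cited detectors.

```python
import numpy as np

def ca_cfar(signal, num_train=8, num_guard=2, pfa=1e-3):
    """Cell-averaging CFAR: for each cell under test (CUT), estimate the
    clutter power from surrounding training cells (excluding guard cells)
    and declare a detection when the CUT exceeds an adaptive threshold."""
    n = len(signal)
    # Scaling factor that keeps the false-alarm probability constant
    # under an exponential (square-law) clutter model.
    num_cells = 2 * num_train
    alpha = num_cells * (pfa ** (-1.0 / num_cells) - 1.0)
    detections = np.zeros(n, dtype=bool)
    half = num_train + num_guard
    for i in range(half, n - half):
        # Training cells on both sides of the CUT, skipping the guard cells.
        lead = signal[i - half : i - num_guard]
        lag = signal[i + num_guard + 1 : i + half + 1]
        noise = np.mean(np.concatenate([lead, lag]))
        detections[i] = signal[i] > alpha * noise
    return detections
```

The threshold `alpha * noise` tracks the local clutter level, which is exactly why CFAR keeps its false-alarm rate constant across scenes of varying brightness.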
As deep learning theory has matured, its range of applications has steadily widened. Deep convolutional neural networks (DCNNs) have shown high reliability and accuracy in target detection, since they learn stable and efficient features automatically. Consequently, traditional hand-crafted feature extraction is gradually being replaced by convolutional neural networks. Current mainstream DCNN-based target detection methods fall into two categories: two-stage and one-stage.
In two-stage networks, the first stage generates region proposals containing the approximate positions of targets, and the second stage refines the target category and precise location within those proposals. Representative two-stage networks are region-based fully convolutional networks (R-FCN) [11] and faster region-based CNN (Faster R-CNN) [12]. In contrast, one-stage networks omit the region proposal stage and directly predict target classification probabilities and location coordinates. Typical one-stage networks include the Single Shot MultiBox Detector (SSD) [13], You Only Look Once (YOLO) [14], and RetinaNet [15]. Two-stage networks have an advantage in accuracy but are slow; one-stage networks have an advantage in detection speed, which benefits deployment on mobile devices with strict real-time requirements. As one-stage networks have developed, their detection performance has gradually surpassed that of two-stage networks, and they have become the mainstream approach in target detection.
Ship detection methods based on deep learning have shown outstanding performance on SAR images. A SAR ship region extraction method based on the binarized normalized gradient and Fast R-CNN was proposed in [16,17]. Subsequently, exploiting the context area around ships, Kang et al. [18] proposed a multi-layer fusion convolutional neural network consisting of a region proposal network (RPN) with high network resolution and a target detection network with contextual features. YOLO-V2-reduced for SAR ship detection, proposed by Chang et al. [19], removed parameters and layers from YOLO-v2 and thereby improved detection speed. In [20], a squeeze-and-excitation rank mechanism was designed to enhance the SAR ship detection ability of Faster R-CNN. Wang et al. [21] realized automatic ship detection on GF-3 multi-resolution images based on RetinaNet and the focal loss.
Zhang et al. [22] designed a lightweight depthwise separable convolution neural network (DS-CNN) based on an anchor box mechanism, a concatenation mechanism, and a multi-scale detection mechanism. To enhance the features of both the low-level and high-level layers, Zhang et al. [23] designed a two-way feature fusion module that includes a semantic aggregation block and a feature reuse block.
For multi-scale SAR ship detection, Cui et al. [24] designed a dense attention pyramid network (DAPN) that connects a convolutional block attention module (CBAM) to each cascaded feature map from the top to the bottom of the pyramid network. For ship detection in high-resolution SAR images, Wei et al. [25] designed a high-resolution ship detection network (HR-SDNet), which makes full use of high-resolution and low-resolution convolutional feature maps through a new high-resolution feature pyramid network (HRFPN). Zhang et al. [26] proposed a quad feature pyramid network (Quad-FPN) for multi-scale SAR ship detection under complex backgrounds. Because labeling SAR images is difficult, achieving good detection results with fewer SAR images has become a hot research direction. Rostami et al. [27] proposed a framework for training a deep neural network for classification without a large number of SAR images, and Zhang et al. [28] proposed a multitask learning-based object detector (MTL-Det) that learns more discriminative target features without increasing the cost of manual labeling.
Although the above methods perform well, three problems remain. (1) Most methods focus on improving detection accuracy, while detection speed, which is particularly important in emergency military decision-making and maritime rescue, is neglected to some extent. (2) Most methods have a huge network scale and numerous parameters, which hinders hardware migration (a network can be transplanted to a field-programmable gate array (FPGA) or digital signal processor (DSP) only if it has fewer than about 4 million parameters [29]). (3) The aforementioned methods need strengthening for multi-scale SAR ship detection under complex backgrounds. Therefore, this paper proposes a fast and lightweight network for multi-scale SAR ship detection under complex backgrounds. While increasing detection speed and reducing parameters, the proposed FASC-Net maintains quite satisfactory detection performance. The main contributions of our work are summarized as follows.
A Channel-Attention Path Enhancement block (CAPE-Block) is designed by adding a bottom-up enhancement path with a channel-attention mechanism based on feature pyramid networks (FPN), which is used to shorten the path of information transmission and enhance the precise positioning information stored in the low-level feature maps.
A fast and lightweight detection network is designed based on CAPE-Block, ASIR-Block, Focus-Block, and SPP-Block for multi-scale SAR ship detection under complex backgrounds.
A novel loss function is designed for the training of FASC-Net. Binary cross-entropy loss is used to calculate the objectness loss and the classification loss, and GIoU loss is used to calculate the loss of the prediction box. Three hyperparameters are introduced to balance the weights of the three sub-losses.
Compared with other excellent methods (e.g., Faster R-CNN [17], SSD [13], YOLO-V4 [30], DAPN [24], HR-SDNet [25], and Quad-FPN [26]), a series of comparative experiments and ablation studies on the SSDD dataset [31], the SAR-Ship-Dataset [32], and the HRSID dataset [33] illustrates that our FASC-Net achieves a higher mean average precision (mAP) and a faster detection speed with a smaller number of parameters.
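For reference, the GIoU term of the loss function described in the contributions above can be sketched as follows. This is a generic NumPy implementation of the standard GIoU loss for axis-aligned boxes, not code from FASC-Net itself; the `(x1, y1, x2, y2)` box convention is an assumption.

```python
import numpy as np

def giou_loss(box_a, box_b):
    """GIoU loss for axis-aligned boxes given as (x1, y1, x2, y2).
    GIoU = IoU - |C \\ (A U B)| / |C|, where C is the smallest box
    enclosing both A and B; the loss is 1 - GIoU."""
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    # Intersection area (zero if the boxes do not overlap).
    iw = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    ih = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = iw * ih
    area_a = (xa2 - xa1) * (ya2 - ya1)
    area_b = (xb2 - xb1) * (yb2 - yb1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C penalizes distant, non-overlapping boxes.
    cw = max(xa2, xb2) - min(xa1, xb1)
    ch = max(ya2, yb2) - min(ya1, yb1)
    c_area = cw * ch
    giou = iou - (c_area - union) / c_area
    return 1.0 - giou
```

Unlike plain IoU loss, GIoU still produces a useful gradient when the predicted and ground-truth boxes do not overlap, which is why it is commonly preferred for box regression.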
The rest of our paper is organized as follows. Section 2 introduces the datasets and the structure of FASC-Net. Section 3 introduces the evaluation criteria, data augmentation methods, and a series of comparative experiments. Section 4 discusses the effects of the key technologies through ablation studies. Section 5 concludes this paper.
4. Discussion
In this section, we discuss the roles of ASIR-Block (AB), Focus-Block (FB), SPP-Block (SB), and CAPE-Block (CB) through ablation studies. Following the ablation studies in [38], four variants of FASC-Net are designed as follows:
FPNet: Composed of traditional convolutional layers and an FPN-Block, FPNet has the same network width and depth as FASC-Net. The traditional convolution layers are used to downsample and extract features. The same data augmentation methods, training procedure, and loss function as FASC-Net are used.
A-FPNet: A-FPNet is obtained by replacing all the traditional convolutional layers of FPNet with ASIR-Blocks. The effect of ASIR-Blocks can be evaluated by comparing the parameters and performance of FPNet and A-FPNet.
FA-FPNet: After replacing the first ASIR-Block-2 used for down-sampling in A-FPNet with a Focus-Block, we obtain FA-FPNet. The effect of Focus-Block can be verified by comparing the parameters and performances of FA-FPNet and A-FPNet.
FAS-FPNet: FAS-FPNet is obtained by adding an SPP-Block to FA-FPNet. The effect of SPP-Block can be verified by comparing the performances of FA-FPNet and FAS-FPNet, and the effect of CAPE-Block can be verified by comparing the performances of FAS-FPNet and FASC-Net.
The number of parameters, experimental results, and detailed configurations of these variants are shown in Table 9.
ASIR-Block: The first and second rows of Table 9 indicate that the mAP of A-FPNet drops by 0.9% compared with FPNet, but A-FPNet has only 1/14 the parameters of FPNet and 2.5 times its FPS. This proves that ASIR-Block can extract stable target features with few parameters and accelerate detection, because the Channel-Shuffle mechanism and the depthwise convolution in ASIR-Block work together to extract more detailed features with fewer parameters.
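The Channel-Shuffle operation mentioned above can be sketched in a few lines of NumPy. This shows only the generic ShuffleNet-style shuffle, not the full internal layout of ASIR-Block, which is described elsewhere in the paper.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Channel-Shuffle: after grouped or depthwise convolutions, permute
    channels so that information can flow across groups in the next layer.
    x has shape (N, C, H, W) with C divisible by `groups`."""
    n, c, h, w = x.shape
    assert c % groups == 0
    # (N, C, H, W) -> (N, g, C/g, H, W) -> swap the two group axes -> flatten.
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)
```

With 8 channels and 2 groups, channel order `[0..7]` becomes `[0, 4, 1, 5, 2, 6, 3, 7]`, interleaving the two groups so each subsequent grouped convolution sees features from both.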
Focus-Block: The second and third rows of Table 9 show that, compared with A-FPNet, the mAP and FPS of FA-FPNet increase by 0.4% and 1.6, respectively. The improvement appears modest because the images in the SSDD dataset are only 480 × 480; Focus-Block is designed to quickly down-sample high-resolution remote sensing images without losing information, so its advantage emerges more clearly on higher-resolution imagery.
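The lossless down-sampling property of Focus-Block can be illustrated with a space-to-depth slicing sketch. This assumes the block follows the common YOLOv5-style Focus layer (every 2 × 2 pixel neighbourhood is moved into the channel dimension); the paper's exact variant may differ.

```python
import numpy as np

def focus_slice(x):
    """Space-to-depth slicing: halve H and W while quadrupling C, so
    every input pixel is preserved in the output.
    x has shape (N, C, H, W) with even H and W; output is (N, 4C, H/2, W/2)."""
    return np.concatenate(
        [x[..., ::2, ::2],     # top-left pixel of each 2x2 patch
         x[..., 1::2, ::2],    # bottom-left
         x[..., ::2, 1::2],    # top-right
         x[..., 1::2, 1::2]],  # bottom-right
        axis=1)
```

Because the output contains a permutation of exactly the input pixels, the spatial resolution is halved without discarding any information, unlike strided convolution or pooling.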
SPP-Block: The third and fourth rows of Table 9 show that, compared with FA-FPNet, the mAP of FAS-FPNet increases by 2.2% while its FPS decreases by only 2.8%, which proves that SPP-Block extracts the most important context features without materially affecting detection speed. SPP-Block fuses features through Maxpool operations of different kernel sizes, enlarging the receptive field of the feature maps; this form of fusion has little effect on the running speed of the entire network, yet performance improves significantly because the block separates out the most significant context features.
CAPE-Block: Compared with FAS-FPNet, the mAP of FASC-Net increases by 2.7%, while its FPS decreases by 5.8 and some parameters are added. This shows that upgrading the FPN-Block to a CAPE-Block does shorten the path of information transmission and exploits the precise positioning information stored in the low-level features to enhance detection. Although the bottom-up enhancement path and the channel-attention mechanism slightly reduce detection speed, the detection performance is greatly improved.
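The channel-attention mechanism inside CAPE-Block can be illustrated with a generic squeeze-and-excitation-style sketch. The paper does not spell out CAPE-Block's exact attention design in this section, so the bottleneck structure and weight shapes below are assumptions for illustration only.

```python
import numpy as np

def channel_attention(x, w1, w2):
    """SE-style channel attention: squeeze each channel by global average
    pooling, pass the result through a two-layer bottleneck, and rescale
    the channels with the learned weights.
    x: (C, H, W); w1: (C//r, C); w2: (C, C//r) for reduction ratio r."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    s = x.mean(axis=(1, 2))          # squeeze: one statistic per channel
    z = np.maximum(w1 @ s, 0.0)      # excitation: ReLU bottleneck
    a = sigmoid(w2 @ z)              # per-channel weights in (0, 1)
    return x * a[:, None, None]      # rescale channels by their weights
```

The attention weights let the fusion path emphasize channels that carry precise low-level positioning information and suppress the rest, at the cost of the small extra parameter count and latency noted above.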