1. Introduction
Ship detection is one of the most significant missions of marine surveillance. Owing to their all-weather, day-and-night operation [1] and their ability to image relatively wide areas at constant resolution [2], synthetic aperture radars (SAR) such as TerraSAR-X, COSMO-SkyMed, RADARSAT-2, Sentinel-1, and GF-3 are widely applied in ship detection [3,4,5,6,7].
Traditional ship detection methods are mainly based on three aspects: (1) statistical characteristics [8,9,10,11]; (2) wavelet transform [12,13]; and (3) polarization information [14,15]. Among these methods, the constant false alarm rate (CFAR) detector and its variants [8,9,10] are the most widely used. CFAR detectors adaptively calculate detection thresholds by estimating the statistics of the background clutter, thereby maintaining a constant probability of false alarm. However, the detection threshold depends on the assumed distribution of sea clutter, which is not robust enough for detecting multi-scale ships in diverse scenes. Moreover, CFAR-based methods require land masking and post-processing to reduce false alarms, which limits their degree of automation.
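The adaptive thresholding idea behind CFAR can be made concrete with a minimal one-dimensional cell-averaging CFAR sketch; the window sizes and false-alarm rate below are illustrative assumptions rather than values from the cited detectors.

import numpy as np

def ca_cfar_1d(power, num_train=16, num_guard=4, pfa=1e-4):
    # Cell-averaging CFAR over a 1-D power profile.
    # num_train / num_guard are the training / guard cells on EACH side of the cell under test (CUT).
    n_train_total = 2 * num_train
    # Threshold multiplier that keeps the false-alarm rate constant under an
    # exponential (single-look intensity) clutter model.
    alpha = n_train_total * (pfa ** (-1.0 / n_train_total) - 1.0)
    detections = np.zeros(len(power), dtype=bool)
    half = num_train + num_guard
    for i in range(half, len(power) - half):
        lead = power[i - half:i - num_guard]          # training cells before the CUT
        lag = power[i + num_guard + 1:i + half + 1]   # training cells after the CUT
        noise_level = np.concatenate([lead, lag]).mean()
        detections[i] = power[i] > alpha * noise_level
    return detections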
Recently, with the rapid development of deep convolutional networks [16,17,18,19,20], great progress has been made in deep convolutional neural network (CNN)-based object detection [21,22,23,24,25,26,27,28,29,30,31,32,33]. Current CNN-based detectors can generally be divided into one-stage detectors [25,26,29,31] and two-stage detectors [22,23,24,28,32,33]. One-stage detectors include you only look once (YOLO) [25] and its derivative versions [34,35], the single shot detector (SSD) [26], and RetinaNet [29], among others. YOLO reframes object detection as a regression problem: the input image is divided into S × S grid cells, and YOLO directly predicts bounding boxes and class probabilities for each grid cell. SSD generates a set of default boxes of different scales and aspect ratios at each feature map location to better match the shapes of objects. RetinaNet introduced the focal loss to overcome the extreme foreground-background class imbalance. On the other hand, the faster region-based CNN (Faster R-CNN) [23] and region-based fully convolutional networks (R-FCN) [28] are representative two-stage detectors. Faster R-CNN generates anchors of different scales and aspect ratios through the region proposal network (RPN); the feature map and the proposals rescaled from these anchors are then fed into the Fast R-CNN sub-network to predict the locations and class probabilities of bounding boxes. Different from the per-region sub-network of Faster R-CNN, R-FCN is a fully convolutional network with computation shared over the entire image; it introduced position-sensitive score maps to address the dilemma between translation invariance in image classification and translation variance in object detection [28]. The feature pyramid network (FPN) [24] combines low-level and high-level features for a more comprehensive feature representation, which yields outstanding performance on multi-scale object detection. In summary, one-stage detectors are superior in detection speed, which benefits from their single-network detection pipeline, whereas two-stage detectors achieve higher accuracy, especially for small, densely arranged objects.
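To illustrate how the focal loss down-weights the easy background examples that dominate one-stage detectors, the following is a minimal binary focal-loss sketch in NumPy; γ = 2 and α = 0.25 are the defaults reported for RetinaNet, and the function name is our own.

import numpy as np

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    # Focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    # p: predicted foreground probabilities in (0, 1); y: labels (1 = ship, 0 = background).
    # Easy, well-classified examples (p_t close to 1) are strongly down-weighted,
    # so the huge number of easy background anchors no longer dominates training.
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))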
Deep CNNs have also been widely applied to SAR ship detection in recent years. As a typical one-stage detection method, YOLOv2 was utilized to detect ships in SAR imagery [36]. Wang et al. [37] utilized RetinaNet for automatic ship detection in multi-resolution GF-3 imagery. Zhang et al. [38] proposed a lightweight feature optimizing network with lightweight feature extraction and an attention mechanism for better feature representation. On the other hand, many two-stage detectors have been proposed for higher detection accuracy. Ref. [39] proposed an improved Faster R-CNN for SAR ship detection. A multilayer fusion light-head detector [40] was proposed to improve detection speed. Jiao et al. [41] proposed a densely connected neural network, which utilizes a modified FPN, for multi-scale and multi-scene ship detection. Refs. [42,43,44,45] added attention mechanisms into CNNs, since attention mechanisms adaptively recalibrate feature responses to increase representation power [19,46].
Although many CNN-based methods have been proposed for SAR ship detection, they still encounter bottlenecks in the following respects: (1) Quite different from natural images, SAR images present strip-like ships from a bird's-eye perspective with various rotation angles, often densely arranged against complex inshore backgrounds, as shown in Figure 1a. The horizontal bounding box of an inclined ship contains a relatively large redundant region, which introduces background noise. Moreover, the horizontal bounding boxes of two densely arranged ships can have a high Intersection-over-Union (IoU), leading to missed detections after the non-maximum suppression (NMS) operation [47,48]. Under these circumstances, the limited capability of detection with horizontal bounding boxes is exposed. (2) In R-CNN-based object detection methods, an IoU threshold is utilized to distinguish positive and negative samples in the Fast R-CNN sub-network. A relatively low IoU threshold results in high recall but low precision due to the generation of noisy bounding boxes. On the contrary, a relatively high IoU threshold leads to insufficient positive samples, and the resulting overfitted model causes missed detections. Figure 2 shows the detection results under IoU thresholds of 0.5 and 0.7, respectively: some noisy background regions are detected as ships in Figure 2a, and missed detections appear in Figure 2b.
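To make the overlap issue of point (1) concrete, the following sketch uses hypothetical coordinates for two strip-like ships berthed side by side at 45 degrees and computes the IoU of their enclosing horizontal boxes; although the rotated ships themselves do not overlap, their horizontal boxes exceed a typical NMS threshold of 0.5, so one of the two true detections would be suppressed.

import numpy as np

def horizontal_box(cx, cy, w, h, angle_deg):
    # Axis-aligned box (x1, y1, x2, y2) enclosing a rotated rectangle.
    theta = np.deg2rad(angle_deg)
    corners = np.array([[ w / 2,  h / 2], [ w / 2, -h / 2],
                        [-w / 2,  h / 2], [-w / 2, -h / 2]])
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    pts = corners @ rot.T + np.array([cx, cy])
    return pts[:, 0].min(), pts[:, 1].min(), pts[:, 0].max(), pts[:, 1].max()

def iou(a, b):
    # IoU of two axis-aligned boxes given as (x1, y1, x2, y2).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

# Two hypothetical 80 x 10 ships at 45 degrees, lying side by side without touching.
ship_a = horizontal_box(50.0, 50.0, 80, 10, 45)
ship_b = horizontal_box(41.5, 58.5, 80, 10, 45)
print(iou(ship_a, ship_b))  # about 0.6, above a typical NMS threshold of 0.5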
Due to this inherent drawback of horizontal bounding boxes, rotated bounding boxes have gradually been developed in optical remote sensing [47,48,49,50]. Ref. [47] proposed a rotation dense feature pyramid network (R-DFPN), in which a dense FPN and multi-scale region-of-interest (ROI) align are used to detect ships in different scenes. In [48], a multi-category rotation detector was proposed for small, cluttered, and rotated objects. Li et al. [49] proposed a rotatable region-based residual network (R3-Net) for multi-oriented vehicle detection. The lightweight ROI Transformer [50] was proposed to decrease computational complexity.
To address these problems in SAR ship detection, a multi-stage rotational region-based network (MSR2N) is proposed for arbitrary-oriented ship detection. As shown in Figure 1b, rotated bounding boxes locate ships more accurately with less redundant background noise and do not overlap with each other even in a dense arrangement; therefore, the rotated bounding box representation is adopted in this paper. In the feature extraction module, the FPN is applied to fuse high-resolution features and semantically strong features from the backbone network, which enhances feature representation. To generate rotated anchors and proposals, a rotational RPN (RRPN) is utilized. In addition, a multi-stage rotational detection network (MSRDN) is applied in MSR2N. The MSRDN, which contains an initial rotational detection network (IRDN) and two refined rotational detection networks (RRDN), is trained stage by stage. Three increasing IoU thresholds are selected to sample the positive and negative proposals in the three stages, respectively. The increasing IoU thresholds guarantee sufficient positive samples to avoid overfitting while reducing close false positives. Compared with other methods, the proposed MSR2N achieves state-of-the-art performance on SAR ship detection, especially for densely arranged ships against complex inshore backgrounds. The main contributions of the proposed MSR2N are enumerated as follows:
(1) Considering the characteristics of SAR images, the MSR2N framework is proposed in this paper, which is more beneficial for arbitrary-oriented ship detection than horizontal bounding box based methods.
(2) In the RRPN, a rotation-angle-dependent strategy is utilized to generate anchors with multiple scales, aspect ratios, and rotation angles, which represent arbitrary-oriented ships more adequately (a minimal sketch of this anchor enumeration follows this list).
(3) The MSRDN is proposed, in which three increasing IoU thresholds are chosen to resample and refine proposals successively. As the proposals are refined more accurately, the number of refined proposals also increases.
(4) A multi-stage loss function is employed to accumulate the losses of the RRPN and the three stages of the MSRDN to train the entire network.
(5) Compared with other methods, the proposed MSR2N achieves state-of-the-art performance on the SAR ship detection dataset (SSDD).
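As a concrete illustration of contribution (2), the following is a minimal sketch of rotation-angle-dependent anchor enumeration at a single feature-map location; the scales, aspect ratios, and angles shown are illustrative assumptions, not the exact settings of MSR2N.

import itertools

def rotated_anchors(cx, cy, scales=(32, 64, 128), ratios=(1/3, 1/5, 1/7),
                    angles=(-60, -30, 0, 30, 60, 90)):
    # Enumerate rotated anchors (cx, cy, w, h, angle) at one feature-map location.
    # ratio is defined here as h / w, so small ratios give long, strip-like boxes.
    anchors = []
    for scale, ratio, angle in itertools.product(scales, ratios, angles):
        w = scale / ratio ** 0.5   # long side
        h = scale * ratio ** 0.5   # short side
        anchors.append((cx, cy, w, h, angle))
    return anchors  # 3 scales x 3 ratios x 6 angles = 54 anchors per location

The proposals regressed from such anchors would then be resampled in the MSRDN at progressively higher IoU thresholds, stage by stage, as described above.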
This paper is organized as follows. Section 2 describes the proposed MSR2N in detail. In Section 3, ablation and comparative experiments are carried out on SSDD, which verify the effectiveness of the proposed MSR2N. Section 4 draws the conclusions of this paper.