Article

Fully Deformable Convolutional Network for Ship Detection in Remote Sensing Imagery

1 School of Energy and Power Engineering, Nanjing University of Science and Technology (NJUST), Nanjing 210094, China
2 Xi’an Research Institute of High-Tech, Xi’an 710025, China
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(8), 1850; https://doi.org/10.3390/rs14081850
Submission received: 8 March 2022 / Revised: 10 April 2022 / Accepted: 11 April 2022 / Published: 12 April 2022
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

In high spatial resolution remote sensing imagery (HRSI), ship detection plays a fundamental role in a wide variety of applications. Despite the remarkable progress made by many methods, ship detection remains challenging due to the dense distribution, the complex background, and the huge differences in scale and orientation of ships. To address these problems, a novel fully deformable convolutional network (FD-Net) is proposed for dense and multiple-scale ship detection in HRSI, which can effectively extract features at variable scales, orientations and aspect ratios by integrating deformable convolution into the entire network structure. To promote more accurate spatial and semantic information flow in the network, an enhanced feature pyramid network (EFPN) is designed that uses deformable convolution to construct bottom-up feature maps. Additionally, considering the feature-level imbalance in feature fusion, an adaptive balanced feature integrated (ABFI) module is connected after the EFPN to model the scale-sensitive dependence among feature maps and highlight the valuable features. To further enhance the generalization ability of FD-Net, tailored data augmentation and training methods are jointly designed for model training. Extensive experiments conducted on two public remote sensing datasets, DIOR and DOTA, strongly demonstrate the effectiveness of our method in the remote sensing field.

1. Introduction

With the rapid development of airborne and spaceborne sensors, high spatial resolution remote sensing images (HRSI) have become widely available, offering rich information with which to observe and interpret the Earth. Automatic ship detection in HRSI is a challenging problem that has received substantial attention because of its wide practical applications, such as sea rescue, maritime security and maritime transport management [1,2]. Moreover, there is also an increasing demand for online ship detection based on Unmanned Airborne Systems (UAS) [3] and satellite platforms.
The past decades have seen remarkable development in automatic ship detection technology of HRSI, and many effective and representative methods have been proposed, such as template matching [4,5], grayscale statistics [6,7], visual saliency-based methods [8,9] and traditional machine-learning-based methods [10,11,12]. However, these methods mostly rely on handcrafted features, which are not universal and thus have difficulties in expressing high-level semantic information in different complex environments.
With the rapid progress of CNNs [13,14,15,16] and CNN-based object detection methods [17,18,19,20], automatic ship detection technology for remote sensing images has developed rapidly and a considerable number of methods have been proposed [21,22,23,24]. However, ship detection in HRSI still faces many challenges compared to natural images. Firstly, there are always many small objects due to the long imaging distance, so the information about these objects is unclear and limited. Secondly, there may be a huge scale difference between the large and small objects in an image, which leads to poor detection results for small objects. Thirdly, objects in such images are unevenly distributed and are often densely packed in complex backgrounds, where it is difficult to distinguish the characteristics of small and dense objects; this leads to false positives and missed detections. In addition, objects can appear in any direction and with any aspect ratio, resulting in morphological differences that seriously affect detection accuracy. Some typical examples of ship detection under different situations are shown in Figure 1. Thus, although existing object detectors have achieved outstanding results in ship detection, many problems remain to be solved for accurate ship detection in HRSI.
To address the above issues, a ship detection method for HRSI named the fully deformable convolutional network (FD-Net) is proposed. FD-Net can detect ships with variable scale, orientation and shape by integrating deformable convolution into the entire network structure; it takes the VFNet [25] network as the main structure and adds an enhanced feature pyramid network (EFPN) and an adaptive balanced feature integrated (ABFI) module. The EFPN aims to promote more accurate spatial and semantic information flow in the network through deformable convolutional networks and a path aggregation network. Additionally, considering the feature-level imbalance in feature fusion, the ABFI module is connected after the EFPN to generate adaptive weight factors by modeling the scale-sensitive dependence among feature maps and highlighting the object features. By transforming the scale-sensitive dependence into a channel attention problem, the fused feature maps can adaptively select features of appropriate scales. To further improve the performance of FD-Net, we propose a novel data augmentation method named crop mosaic data augmentation, which improves the diversity of the dataset while preserving object information as much as possible. In addition, the effects of several training methods are evaluated for the proposed FD-Net. To cope with the extreme scale changes in remote sensing datasets and effectively detect multi-scale ship objects, we adopt large-scale jitter in the training stage. Moreover, to further enhance the generalization ability of FD-Net, Stochastic Weights Averaging (SWA) [26] is also adopted in the training stage.
Two public datasets are used to verify our experimental results. DOTA [27] is a large public remote sensing dataset that poses challenges for multi-scale object detection and contains 403,318 instances of 17 common object categories in 2806 images. DIOR [28] is a publicly available remote sensing dataset that consists of 23,463 images and 192,472 instances of 20 object categories. Extensive experiments are conducted on these public remote sensing image datasets, which indicate that FD-Net outperforms the methods involved in the comparison in this paper. The experimental results demonstrate the effectiveness of our proposed method, which can meet the needs of ship detection in different situations, such as extremely small scale, dense distribution, and different directions and aspect ratios. In general, we discuss and demonstrate the effectiveness of fully deformable convolutions for feature extraction in object detection tasks, and propose a novel adaptive weighted feature fusion strategy and a novel crop mosaic data augmentation method, which can also be applied to other object detection tasks due to their generality.
Our main contributions are as follows:
  • A novel fully deformable convolutional network for ship detection in HRSI is proposed, which is named FD-Net.
  • In FD-Net, we design an enhanced feature pyramid network (EFPN) to improve the ability to detect ships with variable scale, orientation and shape by integrating deformable convolution into the entire network structure. At the same time, we design an adaptive balanced feature integrated (ABFI) module to detect dense and small objects by modeling the scale-sensitive dependence among feature maps and highlighting the object feature.
  • We propose a novel crop mosaic data augmentation method to improve the diversity of the dataset while preserving the target information as much as possible.
  • We evaluate the effect of several training methods on the accuracy of the proposed model. Experiments verify that our method achieves higher detection accuracy than other remote sensing ship detection methods on the two public remote sensing datasets mentioned above. Ablation experiments confirm that every part of our method contributes positively to the detection results.
The rest of this paper is organized as follows. Section 2 briefly reviews related work. Then, we introduce the proposed method for ship detection in HRSI in Section 3. Section 4 explains the details and environment of the experimental realization and compares the proposed method with state-of-the-art object detection methods. The ablation experiment is carried out, and the influence of different networks on the proposed method is discussed in Section 5. Finally, Section 6 gives the conclusion.

2. Related Work

2.1. Multiple-Scale Object Detection Methods

One of the main challenges for object detection in HRSI is the multi-scale problem, especially for small objects, which can easily be submerged in deep feature maps. Convolutional neural networks have become the most popular and effective basis for object detection, and CNN-based detectors are mainly divided into one-stage and two-stage algorithms. One-stage algorithms are represented by the single-shot multi-box detector (SSD) [19], the You Only Look Once (YOLO) series [18,29,30,31] and RetinaNet [32]. Two-stage algorithms are built on region proposal networks and are represented by R-CNN [33], Fast R-CNN [34] and Faster R-CNN [17]. Unlike YOLOv1 [18] and SSD, which directly detect objects from single-scale feature maps or unprocessed multi-scale feature maps, the feature pyramid network (FPN) [35] introduced a top-down architecture with lateral connections to transport high-level semantic features to different scales, which effectively improves the performance of detecting multiple-scale objects. Several improved versions of FPN have been developed. PANet [36] adds an additional bottom-up path to FPN, enhancing the entire feature hierarchy with the accurate spatial information in shallow features. BiFPN [37], proposed by Google, is an effective and fast multi-scale feature fusion strategy based on a bi-directional connection structure. As the inconsistent importance of features at different scales in the feature fusion process can greatly limit the performance of single-shot detectors, Liu et al. [38] proposed an adaptively spatial feature fusion (ASFF) strategy to improve the sensitivity of the network to features at different scales. Pang et al. [39] proposed a balanced feature pyramid (BFP) to reduce the imbalance among multi-scale features, which generates balanced semantic features and transports balanced information to each level. On the one hand, the above multi-scale feature fusion methods lack attention to spatial information such as shape and orientation. On the other hand, most methods merely fuse multi-scale features directly in the channel dimension, which is not scale-sensitive.
Encouraged by the remarkable progress of CNN-based object detection methods, a series of CNN-based ship detection methods have been proposed. To address the main challenges of complex backgrounds and dense distributions in ship detection, Yang et al. [22] proposed a rotation dense feature pyramid network based on dense connections to build highly semantic features for feature maps of all scales. Although dense connections provide more semantic features, they increase the difficulty of training the network, and the unselected redundant features can even interfere with valid information. Tang et al. [40] proposed a ship detection method called N-YOLO based on YOLOv5 [41], which consists of a noise level classifier and an object potential area extraction module. Li et al. [42] proposed a complete YOLO-based method to detect ship objects in thermal infrared remote sensing images; considering the huge differences in ship size, an enlarged receptive field based on the SE [43] module and dilated convolution [44] was designed to extract semantic features. To address issues such as scattering interference, the sparsity of objects and small objects, Zhu et al. [45] redesigned the sample definition and the feature extraction to improve the feature representation ability; an improved focal loss and regression refinement with complete intersection over union were also introduced to improve the classification and regression performance. Dong et al. [46] proposed a two-stage ship detection algorithm that generates more accurate candidate regions through a set of saliency maps; moreover, HOG (histogram of oriented gradients) cells and the Fourier basis are combined, which distinguishes ship objects and obtains their main orientation. To address the challenges of color, aspect ratio, complex background and angular variability in ship detection, Dong et al. [21] developed a vector field filter with active rotation capability to encode the orientation information of ship objects, which clearly improves the detection accuracy. With the increasing demand for onboard object detection based on UAS and satellite platforms, some lightweight ship detection methods have also been proposed. Xu et al. [47] proposed a lightweight onboard SAR ship detector based on a lightweight cross stage partial module to reduce the amount of computation during feature extraction; several effective modules were also adopted to improve the detection performance, such as a histogram-based pure background classification module, a shape distance clustering module and an attention module. Liu et al. [48] proposed a lightweight ship detection method based on YOLOv4 [31] with an improved receptive field block to enhance the feature extraction ability, which improves the accuracy of multi-scale ship detection. Although previous studies have proposed effective solutions to many challenges in ship detection, such as multiple scales and different directions and aspect ratios, the performance of ship detection methods still needs to be improved. Firstly, few methods pay attention to the scale and shape differences of ship objects in remote sensing images, which may require more flexible sampling methods than standard convolution. Secondly, feature pyramids are widely adopted by many methods for multi-scale object detection due to their effectiveness.
However, the feature fusion strategy of most methods directly fuses the feature maps by element-wise addition or connection in the channel dimension, which ignores the scale-sensitive connection among feature maps and the selection of effective information for feature maps.

2.2. Deformable Convolutional Networks

Extensive research has shown that standard convolution with a regular grid sampling pattern can hardly adapt to geometric deformation. To weaken this limitation, researchers at Microsoft Research Asia designed a flexible sampling method that adds an offset variable to each sampling point of the convolution kernel, so that the convolution can adaptively sample around the current position [49]. The sampling locations of 3 × 3 standard and deformable convolutions are shown in Figure 2 [50]. By learning the offsets of the convolution kernel in the horizontal and vertical directions, the deformable convolution kernel can be adaptively and dynamically adjusted to accommodate geometric deformations such as the shape and size of different objects. Since the sampling regions of deformable convolutions may involve too much influence from irrelevant background information, DCN v2 [50] was proposed to improve the ability to concentrate on relevant image regions through a more comprehensive and flexible integration of deformable convolutions and modulation mechanisms. By integrating the deformable convolution layer and the deformable RoI (Region-of-Interest) pooling layer into the network, Deng et al. [51] proposed a multi-class object detection method that achieves more accurate and robust detection results. Ren et al. [52] modified Faster R-CNN by replacing the standard convolution with deformable convolution in the last network stage, which improves the mean average precision for partially occluded object detection. We rely on DCN v2 to extract features of various scales and shapes throughout the entire network structure to improve the detection performance for multi-scale ships in dense scenes, especially small ships.

3. Proposed Method

3.1. Fully Deformable Convolution Network

The overall framework of our method is shown in Figure 3; it is based on VFNet and deformable convolution, with improvements in several different aspects. FD-Net is mainly composed of four parts. The backbone adopts ResNet-50 [15], which we modify with DCN v2 to extract the features of various scales and shapes. On the lateral connections of each ResNet-50 stage, we design an enhanced feature pyramid network based on PANet and deformable convolution to obtain more semantic information through branches of different scales and to reduce the information loss caused by variations in scale and shape.
Then, we design an adaptive balanced feature integrated (ABFI) module in the horizontal connections to model the scale-sensitive dependence between feature maps and highlight the object feature. According to the level of the input feature maps, there are three types of ABFI modules for integrating the features of adjacent levels through adaptive weighted connections. By adaptively learning the importance of feature maps participating in feature fusion, we can obtain the optimal combination of feature maps during feature fusion. Finally, the prediction branches based on VFNet head are used to generate the bounding box and the classification score.
The specific process is as follows: the modified ResNet-50 generates the C3, C4, C5, C6 and C7 feature maps of different scales; the FPN extracts the E3, E4, E5, E6 and E7 feature maps and adds each of them to the up-sampled feature map of the same size to generate the P3, P4, P5, P6 and P7 feature maps. The index of each feature map denotes the number of down-sampling operations applied relative to the original image; for example, C3 is obtained by down-sampling the original image three times, so its size is 1/8th of the original image. Next, the ABFI module performs a weighted fusion of the feature map at each level with the feature maps of the two adjacent levels to suppress irrelevant information and highlight the object features. Lastly, we generate prediction results with the VFNet head, whose two subnetworks regress the initial bounding boxes and predict IoU-aware classification scores, respectively.
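For clarity, the overall data flow can be summarized by the following schematic sketch in PyTorch. It is not the authors' released code; the four components are assumed to be constructed separately (illustrative sketches of the deformable convolution, EFPN and ABFI parts are given in the following subsections), and this only shows how they are composed.

```python
# Schematic sketch of the FD-Net forward pass (not the released implementation).
# The four components are assumed to be built elsewhere; this only shows how
# they are composed: DCNv2-modified ResNet-50 -> EFPN -> ABFI -> VFNet head.
import torch.nn as nn

class FDNet(nn.Module):
    def __init__(self, backbone, efpn, abfi, head):
        super().__init__()
        self.backbone = backbone   # ResNet-50 with modulated deformable convolutions
        self.efpn = efpn           # enhanced feature pyramid (PANet + deformable convs)
        self.abfi = abfi           # adaptive balanced feature integrated module(s)
        self.head = head           # VFNet head: boxes + IoU-aware classification scores

    def forward(self, images):
        c_feats = self.backbone(images)     # C3 ... C7, strides 8 to 128
        p_feats = self.efpn(c_feats)        # P3 ... P7 after top-down/bottom-up fusion
        balanced = self.abfi(p_feats)       # adaptively reweighted multi-level features
        return self.head(balanced)          # per-level classification and regression maps
```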

3.2. Enhanced Feature Pyramid Network

We use ResNet-50 as the backbone and the path aggregation network (PANet) to construct the neck of the network. As the level increases, the strided convolutions and down-sampling of the residual network gradually reduce the spatial resolution of the feature maps while increasing the number of channels. FPN improves the performance of detecting multi-scale objects with a top-down architecture and lateral connections that generate and transport deep semantic information to various scales. However, the spatial information of shallow features is as important as the semantic information of deep features for object detection in HRSI, especially for dense and small objects. Therefore, we use the path aggregation network to transmit the accurate spatial information of the lower levels through an extra bottom-up path, which enhances the entire feature hierarchy. Considering that the scales and aspect ratios of ships differ considerably in complex remote sensing scenes, we introduce DCN v2 into PANet to help detect objects with variable scales, shapes and orientations. DCN v2 is defined as follows:
$$y(p) = \sum_{k=1}^{K} \omega_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k$$
$$p_k \in \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$$
where $K$ is the number of sampling locations, $\Delta p_k$ is the offset learned for the $k$-th sampling location, $\Delta m_k$ is the modulation (penalty) coefficient learned for the $k$-th sampling location, $\omega_k$ and $p_k$ denote the weight and the pre-specified offset for the $k$-th location, respectively, and $y(p)$ denotes the feature at location $p$ of the output feature map $y$. The deformable convolution kernel is obtained by augmenting the conventional convolution kernel with these parameters. For instance, if $K = 9$, $p_k$ defines a $3 \times 3$ convolutional kernel of dilation 1.
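As an illustration of how such a modulated deformable convolution can be implemented, the sketch below builds on torchvision.ops.DeformConv2d, which accepts a learned offset tensor and, in recent torchvision versions, a modulation mask. The offset/mask prediction branch and its zero initialization follow the common DCN v2 recipe; the channel sizes are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ModulatedDeformConv(nn.Module):
    """3x3 modulated deformable convolution (DCN v2 style): offsets and mask are
    predicted by a plain convolution initialized to zero, so training starts
    from the regular sampling grid."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        k = kernel_size * kernel_size
        # 2*k channels for the (x, y) offsets, k channels for the modulation mask
        self.offset_mask = nn.Conv2d(in_ch, 3 * k, kernel_size, padding=padding)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size, padding=padding)
        nn.init.zeros_(self.offset_mask.weight)
        nn.init.zeros_(self.offset_mask.bias)

    def forward(self, x):
        o1, o2, mask = torch.chunk(self.offset_mask(x), 3, dim=1)
        offset = torch.cat((o1, o2), dim=1)   # learned offsets (delta p_k)
        mask = torch.sigmoid(mask)            # modulation scalars (delta m_k) in [0, 1]
        return self.deform_conv(x, offset, mask)

# example: y = ModulatedDeformConv(256, 256)(torch.randn(1, 256, 64, 64))
```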
The specific structure of the enhanced feature pyramid is shown in Figure 4. The deformable convolutions are inserted between top-down and bottom-up structures, and after PANet to help capture object features that vary in scale, shape and orientation.
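One plausible instantiation of this structure, reusing the ModulatedDeformConv sketch above, is given below. The exact number and placement of deformable layers, as well as the channel configuration, are assumptions based on the description and Figure 4 rather than the released implementation.

```python
# One plausible instantiation of the enhanced feature pyramid (EFPN): a
# top-down FPN pass followed by a PANet-style bottom-up pass, with modulated
# deformable convolutions applied to the merged maps of both passes.
import torch.nn as nn
import torch.nn.functional as F

class EFPN(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048, 256, 256), out_ch=256):
        super().__init__()
        n = len(in_channels)
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.td_dcn = nn.ModuleList(ModulatedDeformConv(out_ch, out_ch) for _ in range(n))
        self.down = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)
                                  for _ in range(n - 1))
        self.out_dcn = nn.ModuleList(ModulatedDeformConv(out_ch, out_ch) for _ in range(n))

    def forward(self, feats):                            # feats: [C3, C4, C5, C6, C7]
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):        # top-down pass (E3 ... E7)
            laterals[i - 1] = laterals[i - 1] + F.interpolate(laterals[i], scale_factor=2)
        mids = [dcn(x) for dcn, x in zip(self.td_dcn, laterals)]
        outs = [mids[0]]
        for i in range(1, len(mids)):                    # bottom-up pass (P3 ... P7)
            outs.append(mids[i] + self.down[i - 1](outs[-1]))
        return [dcn(x) for dcn, x in zip(self.out_dcn, outs)]
```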

3.3. Adaptive Balanced Feature Integrated Module

Previous FPN-based works integrate multi-level features by using lateral connections, which effectively transmit semantic information to layers at all scales. Nonetheless, the transmitted information may contain irrelevant background information and information about other objects, resulting in a decrease in detection accuracy. To strengthen the multi-level features, Pang et al. [39] proposed a balanced feature pyramid (BFP) for reducing the imbalance among multi-scale features, which generates balanced semantic features and transports balanced information to each level. We denote the features at resolution level $l$ as $C_l$. In order to integrate features of different scales while simultaneously preserving their semantic information, BFP first resizes the multi-scale features to the same intermediate size with interpolation and down-sampling. Then, the features are scaled and refined to generate balanced semantic features, which are rescaled to enhance the original features. However, there are some hidden issues when applying BFP to dense and small object detection in HRSI. Firstly, four feature maps of different scales are used in BFP. In general, a single pixel in a high-level feature map represents a larger area of the original image, and the huge scale differences between feature maps at different levels may negatively affect object detection during feature fusion. For example, the irrelevant background information in large feature maps and the coarse spatial information in small feature maps may be re-transmitted into the fused feature maps. Secondly, the feature maps of various scales are considered to be of equal importance in BFP, whereas, in fact, feature maps of different scales may play different roles in detection.
To solve the above issues, we designed an adaptive balanced feature integrated (ABFI) module, which adaptively integrates adjacent multi-level features by learning the weight of each level during feature fusion through channel attention. To integrate multi-level features, we first resize the adjacent multi-scale feature maps $C_3$, $C_4$, $C_5$ to the same scale, which depends on the position at which the ABFI module acts. When the ABFI module acts on the feature map with the smallest scale, i.e., $C_5$ in Figure 5, the ABFI-A module rescales the larger adjacent feature maps $C_3$ and $C_4$ by down-sampling them to the same scale as $C_5$. Conversely, the ABFI-C module rescales the smaller adjacent feature maps $C_4$ and $C_5$ by up-sampling them to the same scale as $C_3$. The ABFI-B module resizes the feature maps to the same intermediate scale with interpolation and down-sampling, respectively. Then, we concatenate the current feature map with the two rescaled feature maps in the channel dimension to generate a fused feature map $X = [X_1, X_2, X_3] \in \mathbb{R}^{H \times W \times 3C}$, which is then processed by three branches. $X_1, X_2, X_3 \in \mathbb{R}^{H \times W \times C}$ denote the current feature map and the two rescaled feature maps, respectively.
The first branch is used to learn the nonlinear relationship between the feature maps of the three adjacent scales. We first compress the number of channels of the fused feature map through a convolution operation. Using $Z = [z_1, z_2, z_3]$ to denote the learned set of filter kernels, where $z_i$ refers to the parameters of the $i$-th filter, we generate the outputs $V = [v_1, v_2, v_3]$, where:
$$v_i = z_i * X = \sum_{s=1}^{3C} z_i^{s} * x^{s}, \quad i = 1, 2, 3$$
where $*$ denotes convolution, $z_i = [z_i^1, z_i^2, \ldots, z_i^{3C}]$, $X = [x^1, x^2, \ldots, x^{3C}]$ and $v_i \in \mathbb{R}^{H \times W}$.
Our purpose is to improve the scale sensitivity of the network during feature fusion. The information in deep neural networks is mainly divided into semantic information within feature maps and channel information between feature maps. As a by-product of the feature extraction process, the scale relationships among multi-level feature maps are rarely considered, especially during feature fusion. In order to establish the scale connection among multi-level feature maps in the feature fusion process, we transform the scale sensitivity into the association among channels of the fused feature map. We use global average pooling to aggregate the global spatial information of each channel, which is defined as:
$$F = \frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} f(x, y)$$
where $F$ is the pooled value of the feature map, $M$ and $N$ are the width and height of the feature map, respectively, and $f(x, y)$ refers to the value at each position.
We follow global average pooling with a convolutional layer of $1 \times 1 \times 3r$, a ReLU [53] function and a convolutional layer of $1 \times 1 \times 3$ to generate $M = [m_1, m_2, m_3] \in \mathbb{R}^{1 \times 1 \times 3}$. Considering that a small number of channels would limit the nonlinear representation ability of the network, we add the expansion factor $r$ to the above operation; in our proposed method, $r$ is set to 8. In order to learn non-mutually exclusive and nonlinear inter-channel relationships, we follow $M$ with a sigmoid activation and a softmax function to output the weight factors $(\alpha, \beta, \gamma)$ of the three branches of different scales:
$$P = (\alpha, \beta, \gamma), \qquad P_i = \frac{e^{\sigma(m_i)}}{\sum_{n=1}^{3} e^{\sigma(m_n)}}, \quad i = 1, 2, 3, \qquad P \in \mathbb{R}^{1 \times 1 \times 3}$$
where $P = (\alpha, \beta, \gamma)$ denotes the weight factors and $\sigma$ refers to the sigmoid activation function.
With the redefinition of the importance of the feature map by the weighted factor, the valuable information in the feature map can be effectively highlighted. The weighted balanced features are obtained by element-wise addition as:
$$Q = \alpha X_1 + \beta X_2 + \gamma X_3$$
where $\alpha$, $\beta$, $\gamma$ are the weight factors of the multi-level feature maps. Finally, a residual connection is adopted to fuse the original feature map and the weighted balanced features into the adaptive balanced features $\tilde{X}$.
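To make the above computation concrete, the following sketch implements the weighting branch of an ABFI module for three feature maps that have already been rescaled to a common resolution. Layer names, channel sizes and the choice of the current-level map as the residual path are assumptions rather than the authors' released code.

```python
# Minimal sketch of the ABFI weighting branch: concatenate three same-scale
# feature maps, squeeze them to three channels, pool globally, map through a
# small bottleneck (expansion factor r), and combine the inputs with the
# resulting weights (alpha, beta, gamma) plus a residual connection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ABFI(nn.Module):
    def __init__(self, channels, r=8):
        super().__init__()
        self.squeeze = nn.Conv2d(3 * channels, 3, kernel_size=1)   # produces V
        self.fc = nn.Sequential(
            nn.Conv2d(3, 3 * r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(3 * r, 3, kernel_size=1),
        )

    def forward(self, x1, x2, x3):
        # x1: current-level map; x2, x3: adjacent levels already resized to x1's scale
        fused = torch.cat((x1, x2, x3), dim=1)
        v = self.squeeze(fused)                                     # (B, 3, H, W)
        m = self.fc(F.adaptive_avg_pool2d(v, 1))                    # (B, 3, 1, 1)
        w = torch.softmax(torch.sigmoid(m), dim=1)                  # (alpha, beta, gamma)
        q = w[:, 0:1] * x1 + w[:, 1:2] * x2 + w[:, 2:3] * x3        # weighted balanced features
        return x1 + q                                               # residual connection (assumed on x1)
```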

3.4. Crop Mosaic Data Augmentation

In general, conventional object detectors are trained offline. Therefore, researchers hope to take advantage of this and develop better training methods that allow object detectors to achieve better performance without increasing the inference cost. Data augmentation has become one of the important ways to improve the accuracy of a detector because it enhances the generalization ability at the cost of only additional training time. The purpose of data augmentation is to increase the diversity of the input images so that the designed object detector achieves high robustness on images obtained in various scenes. In order to train the detector with a limited batch size, the Mosaic data augmentation method was proposed in YOLOv4. Mosaic mixes four training images so that the trained detector can detect objects outside their normal context, which improves the robustness of the detector. However, we notice that there are many extremely small and dense ships in HRSI. Given that the scale of remote sensing objects is already very small, directly mixing the images further reduces the scale of the objects and may lose a lot of information. To solve this issue, we first augment the images with various methods, such as rotation, inversion, stretching and brightness balancing, and then crop the image regions containing objects. The cropped images are mixed through mosaic data augmentation, which effectively improves the diversity of the dataset without changing the original scale of the objects. A few samples generated by the crop mosaic data augmentation are shown in Figure 6.
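The following sketch illustrates the idea; it does not reproduce the exact cropping and box-handling logic of the authors' implementation, and the window size, box format (x1, y1, x2, y2) and function names are illustrative.

```python
# Minimal sketch of crop mosaic augmentation: object-centered crops taken from
# four (already augmented) images are placed on a 2x2 canvas so that ship
# scales are preserved instead of being shrunk by resizing full images.
import random
import numpy as np

def crop_around_objects(image, boxes, size):
    """Crop a size x size window centered on a randomly chosen annotated object."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = random.choice(boxes)
    cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
    left = int(np.clip(cx - size // 2, 0, max(w - size, 0)))
    top = int(np.clip(cy - size // 2, 0, max(h - size, 0)))
    return image[top:top + size, left:left + size]

def crop_mosaic(samples, patch=500):
    """samples: list of 4 (image, boxes) pairs; returns a 2*patch x 2*patch mosaic."""
    canvas = np.zeros((2 * patch, 2 * patch, 3), dtype=np.uint8)
    for i, (img, boxes) in enumerate(samples):
        crop = crop_around_objects(img, boxes, patch)
        r, c = divmod(i, 2)
        canvas[r * patch:r * patch + crop.shape[0],
               c * patch:c * patch + crop.shape[1]] = crop
    return canvas
```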

4. Experiments

4.1. Datasets

We evaluate our proposed method on DOTA and DIOR, respectively, which are two public remote sensing datasets and are briefly introduced as follows.
DOTA: DOTA v1.5 is a massive remote sensing dataset for object detection. The images are collected from various airborne and spaceborne sensors, and the image size varies from 800 × 800 to 20,000 × 20,000 pixels, covering objects of various scales, orientations and shapes. The objects in DOTA images are manually annotated with arbitrary quadrilaterals. DOTA v1.5 contains 403,318 instances of 17 common object categories in 2806 images. The training, validation and test sets are randomly divided in proportions of 1/2, 1/6 and 1/3. We cropped the DOTA images into sub-images of 1000 × 1000 pixels with an overlap of 200 pixels. Then, we removed the images that did not contain any ship objects and kept the samples that did. We finally obtained 2163 patches for training and 541 patches for testing.
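As a hedged illustration of this preprocessing step (not the exact script used by the authors), the sketch below slides a 1000 × 1000 window with a 200-pixel overlap and keeps only tiles that contain at least one ship; the (x1, y1, x2, y2) box format and the center-based keep rule are assumptions.

```python
# Tile a large image into 1000 x 1000 patches with a 200-pixel overlap
# (stride 800) and keep only patches containing at least one ship box center.
def tile_image(image, ship_boxes, tile=1000, overlap=200):
    stride = tile - overlap
    h, w = image.shape[:2]
    patches = []
    for top in range(0, max(h - overlap, 1), stride):
        for left in range(0, max(w - overlap, 1), stride):
            window = (left, top, left + tile, top + tile)
            # keep the tile only if some ship box center falls inside it
            if any(window[0] <= (b[0] + b[2]) / 2 < window[2] and
                   window[1] <= (b[1] + b[3]) / 2 < window[3] for b in ship_boxes):
                patches.append(image[top:top + tile, left:left + tile])
    return patches
```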
DIOR: DIOR is a large-scale, publicly available optical remote sensing dataset for object detection, which contains 23,463 images and 192,472 instances of 20 common object categories. The size of each image in DIOR is 800 × 800 pixels, and the spatial resolution varies from 0.5 m to 30 m. Each object instance in DIOR is manually annotated by horizontal bounding box. The number of small-scale instances and large-scale instances in DIOR is well balanced, and the large difference in object scales among various categories makes it a challenging task to detect objects. We extracted samples from DIOR that contained ship objects and removed other annotations that did not belong to ships. We finally obtained 2161 patches for training and 541 patches for testing.

4.2. Implementation Details

The proposed method was trained and evaluated on a 64-bit Ubuntu 16.04 computer with an Intel Core i7-7700K CPU, GeForce GTX 1080 Ti GPU × 2 and 32 GB computer memory, implemented using MMDetection [54]. During the experiments, we adopted SGD as the optimizer in the training stage, with a batch size of 4 (the number of GPUs is 2 and each GPU calculates 2 images), weight decay of 0.0001 and momentum of 0.9. We trained our model for 24 epochs with an initial learning rate of 0.0025 and then employed SWA (Stochastic Weights Averaging) for an additional 12 epochs to further improve the generalization ability of our model. We adopted ResNet-50 as the backbone and used its pre-trained weights on ImageNet to initialize it. The curve of the loss function in the training stage is shown in Figure 7.
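The training schedule can be illustrated with the hedged sketch below, built on torch.optim.swa_utils; the data loader, model and loss function are placeholders, and this is not the MMDetection-based training script actually used, although the hyperparameters mirror the values stated above.

```python
# Sketch of the schedule: 24 epochs of SGD followed by 12 extra epochs whose
# weights are averaged with Stochastic Weights Averaging (SWA).
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

def train(model, loader, loss_fn, device="cuda"):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.0025,
                                momentum=0.9, weight_decay=0.0001)
    swa_model = AveragedModel(model)
    swa_scheduler = SWALR(optimizer, swa_lr=0.0025)

    for epoch in range(36):                      # 24 normal + 12 SWA epochs
        for images, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images.to(device)), targets)
            loss.backward()
            optimizer.step()
        if epoch >= 24:
            swa_model.update_parameters(model)   # accumulate the running weight average
            swa_scheduler.step()

    update_bn(loader, swa_model)                 # recompute BatchNorm statistics
    return swa_model
```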

4.3. Metrics

We used several common metrics to evaluate the performance of FD-Net, namely the precision rate, the recall rate and the average precision (AP). The precision–recall curve reflects the variation of precision with recall and indicates the overall performance of the algorithm.
The precision rate is defined as the ratio of the number of correct detections to the number of total detections, while the recall rate is the proportion of the number of correct detections to the number of total annotated instances, which can be illustrated by the following two common formulas [55,56]:
$$\mathrm{precision} = \frac{TP}{TP + FP}$$
$$\mathrm{recall} = \frac{TP}{TP + FN}$$
where precision represents the detection accuracy, recall represents the detection completeness and TP, FP and FN represent the numbers of true positive, false positive and false negative samples, respectively. Average precision (AP) is the precision averaged across all recall values between 0 and 1, namely, the area under the precision–recall curve. A higher AP indicates a better detector.
All reported results follow the standard COCO-style average precision (AP) metrics, including AP (averaged over IoU thresholds), $AP_{50}$ (AP at an IoU threshold of 50%) and $AP_{75}$ (AP at an IoU threshold of 75%). We also adopt $AP_S$, $AP_M$ and $AP_L$ to report the results on small, medium and large objects, respectively.
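For illustration, the sketch below computes precision, recall and a simplified AP for a toy set of matched detections; the full COCO evaluation additionally applies a precision envelope and averages over IoU thresholds and categories, so this is only an approximation of the reported metrics.

```python
# Simplified illustration of precision, recall and AP: detections are assumed
# to be sorted by confidence and already matched to ground truth (True means a
# correct match at the chosen IoU threshold).
import numpy as np

def average_precision(matches, num_gt):
    matches = np.asarray(matches, dtype=bool)
    tp = np.cumsum(matches)                   # cumulative true positives
    fp = np.cumsum(~matches)                  # cumulative false positives
    precision = tp / (tp + fp)
    recall = tp / num_gt
    # area under the precision-recall curve (rectangle rule over recall steps)
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))

# five detections, three correct, four annotated ships -> AP = 0.6875
print(average_precision([True, True, False, True, False], num_gt=4))
```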

4.4. Ablation Experiments

In order to compare the effectiveness of each part of FD-Net, extensive experiments were conducted on the DOTA v1.5 dataset. We chose VFNet as the baseline and added the deformable convolution, the EFPN module and the ABFI module to compare the performance changes. We also used the proposed crop mosaic data augmentation method to expand the data and used the Stochastic Weights Averaging (SWA) and large-scale jitter in the training stage to improve the generalization ability and accuracy. The experimental result is shown in Table 1, where CM and LSJ mean crop mosaic and large-scale jitter, respectively.
Our baseline was VFNet with ResNet-50 as the backbone network and FPN as the neck. As shown in Table 1, the baseline achieved an AP of 48.6% on the DOTA v1.5 dataset; although it has good detection performance for medium and large objects, it struggles with small object detection. The EFPN module achieved 48.7% and significantly improves the detection of small objects, with a gain of 3.1% in $AP_S$. The ABFI module also improved the detection of small objects, with a gain of 2.5% in $AP_S$. The crop mosaic data augmentation achieved 51%, which is 2.4% higher than the baseline, and the AP for objects of all scales was clearly improved. With large-scale jitter, the detection performance at different IoU thresholds and for multi-scale objects was significantly improved. Moreover, by training with SWA, our method reached 50.8%, a gain of 2.2%, which further improved the detection results for medium and large objects. Compared to the original baseline, the combination of the above modules and methods finally yields a 5% improvement, and our proposed method outperforms the baseline for the detection of small, medium and large objects. The $AP_{50}$ and $AP_{75}$ of our proposed method are significantly improved relative to the baseline, with gains of 7.3% and 10.6%, respectively, which indicates that the bounding boxes predicted by our proposed method are more accurate.
To further evaluate the effectiveness of each part of FD-Net, we also conducted ablation experiments on the DIOR dataset; the results are shown in Table 2. The baseline achieved an AP of 52.8% on the DIOR dataset, and its $AP_S$ was clearly lower than its $AP_M$ and $AP_L$, which indicates that the baseline's performance on small objects needs to be improved. The EFPN module achieved 53.9%, comprehensively improving the detection performance of our proposed method for objects of different scales. The ABFI module and the crop mosaic data augmentation reached 53.5% and 52.9%, respectively, significantly improving the detection of small objects and the accuracy of the predicted bounding boxes with a slight loss in $AP_M$ and $AP_L$. Large-scale jitter reached 55.1%, a gain of 2.3% over the baseline, and the improvements in $AP_S$, $AP_M$ and $AP_L$ show the effectiveness of LSJ for multi-scale object detection. By training with SWA, our method reached 53.1%, a gain of 0.3%. Compared to the original baseline, the combination of the above modules and methods finally yields a 3.3% improvement. The $AP_{50}$, $AP_{75}$, $AP_S$, $AP_M$ and $AP_L$ are comprehensively improved, with gains of 5%, 8.3%, 7.9%, 2% and 0.8%, respectively, which indicates that our improvements benefit multi-scale ship detection, especially for small objects.

4.5. Results of DOTA Dataset

We compared our proposed FD-Net with ten state-of-the-art methods, including one-stage and two-stage networks: ATSS [57], Faster R-CNN [17], FSAF [58], GFL [59], PAA [60], RetinaNet-GHM [61], RetinaNet [32], VFNet, YOLOv5-s and YOLOv5-m [41]. The comparison models were implemented in their original environments without any additions and were retrained on the DOTA v1.5 dataset and the DIOR dataset, respectively. We use the standard average precision metrics to evaluate our results.
Table 3 reports the quantitative comparison of the proposed FD-Net and several state-of-the-art methods on the DOTA v1.5 dataset. According to the detection results shown in Table 3, our FD-Net outperformed all the competing methods by achieving an AP of 53.6%. Compared with the competing methods, the AP of our method was significantly improved, with gains ranging from 0.8% to 12.1%. In addition, the $AP_S$, $AP_M$ and $AP_L$ of FD-Net were significantly higher than those of the other state-of-the-art methods. In particular, the $AP_S$ of FD-Net clearly exceeded all the competing methods, with gains ranging from 8.8% to 20.6%, and the $AP_M$ and $AP_L$ of FD-Net were also better than those of the other methods, which illustrates the effectiveness of our proposed method for multi-scale ship detection, especially for small ships. Some examples of the detection results on the DOTA v1.5 dataset in different situations are shown in Figure 8.

4.6. Results of the DIOR Dataset

To further evaluate the performance of FD-Net, we compared our proposed FD-Net with the state-of-the-art methods mentioned above on the DIOR dataset; the quantitative comparison results are shown in Table 4. According to the detection results shown in Table 4, our FD-Net outperformed all the competing methods by achieving an AP of 56.1%. Compared with the competing methods, the AP of our method was significantly improved, with gains ranging from 2.2% to 14.1%. The $AP_S$, $AP_M$ and $AP_L$ of FD-Net were significantly higher than those of the other state-of-the-art methods. In particular, the $AP_S$ of FD-Net clearly exceeded all the competing methods, with gains ranging from 7.9% to 12.7%, and the $AP_M$ and $AP_L$ of FD-Net were also better than those of the other methods, which further illustrates the effectiveness of our proposed method for multi-scale ship detection, especially for small ships. Some examples of the detection results on the DIOR dataset in different situations are shown in Figure 9.

5. Discussion

Overall, our study established a novel fully deformable convolutional network (FD-Net) optimized for dense and multiple-scale ship detection. The advantages of the proposed FD-Net are as follows: (1) Objects may appear in HRSI in any direction and with any aspect ratio, resulting in morphological differences that seriously affect detection. By integrating deformable convolution into the entire network structure, FD-Net can better extract features from ships with variable scale, orientation and shape. The experimental results in Table 1 and Table 2 indicate that FD-Net has excellent feature extraction ability and detection accuracy owing to the deformable convolutions inserted into it. (2) There are always a large number of small objects due to the long imaging distance, resulting in unclear and limited information about these objects; meanwhile, there may be a huge scale difference between the large and small objects in an image, which leads to poor detection results for small objects. We therefore conducted ablation experiments to evaluate the effectiveness of the EFPN module, which consists of deformable convolution and PANet. Extensive experiments show that the EFPN extracts features better and avoids feature loss during transmission. Moreover, the ABFI module adaptively integrates adjacent multi-level features and significantly improves the ability to detect small ship objects. (3) Given that the scale of remote sensing objects is always very small, the crop mosaic data augmentation was designed and implemented, which effectively improves the diversity of the dataset without changing the original scale of the objects by mixing cropped images that have been enhanced with various methods, such as rotation, inversion, stretching and brightness balancing. (4) According to the experimental results in Table 1 and Table 2, training methods such as large-scale jitter and SWA can effectively improve the performance of multi-scale ship detection.
Benefiting from the above effective modules and methods, FD-Net outperforms all the competing methods on the DOTA v1.5 dataset and the DIOR dataset. Figure 8 and Figure 9 show the detection results on the DOTA v1.5 dataset and DIOR dataset in different situations, illustrating the excellent performance and robustness of the proposed method for use in multi-scale ship object detection.

6. Conclusions

In this paper, we proposed a novel fully deformable convolutional network (FD-Net) optimized for dense and multiple-scale ship detection in HRSI. By integrating the deformable convolution into the entire network structure, FD-Net could better extract features from ships with variable scale, orientation and shape. During the feature fusion processing, the EFPN could boost more accurate spatial and semantic information flow in the network through deformable convolutional networks and path aggregation network. Additionally, the ABFI module could adaptively select features of appropriate scales by transforming the scale-sensitive dependence into the channel attention problem. We also proposed a crop mosaic data augmentation method to improve the diversity of the dataset without changing the original scale of the object, which brings benefits for small ship object detection. In addition, we discussed and evaluated the effectiveness of some training methods on the performance of the proposed model. We conducted extensive experiments on the public datasets DOTA v1.5 and DIOR and also verified the effectiveness of our improvements through ablation experiments. Our experiment results illustrate that FD-Net could outperform the methods involved in the comparison in this paper and can meet the needs of ship detection in different situations, such as extremely small scale, dense distribution, different directions and aspect ratio. The effective modules and data augmentation method proposed in this paper can also be applied to other object detection tasks in remote sensing images or natural images due to their generality.

Author Contributions

All authors contributed to this manuscript. Data curation, Y.Y.; Methodology, H.G.; Experimental results analysis, H.G. and Y.Y.; Writing original draft, H.G.; Writing—review and editing, H.B. and W.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC), grant number U2031138.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://captain-whu.github.io/DOTA/tasks.html (accessed on 7 March 2022), http://www.escience.cn/people/gongcheng/DIOR.html (accessed on 7 March 2022).

Acknowledgments

The authors would like to thank the anonymous reviewers for the constructive suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, D.; Zhan, J.; Tan, L.; Gao, Y.; Župan, R. Comparison of two deep learning methods for ship target recognition with optical remotely sensed data. Neural Comput. Appl. 2021, 33, 4639–4649. [Google Scholar] [CrossRef]
  2. Feng, Y.; Diao, W.; Sun, X.; Yan, M.; Gao, X. Towards automated ship detection and category recognition from high-resolution aerial images. Remote Sens. 2019, 11, 1901–1924. [Google Scholar] [CrossRef] [Green Version]
  3. Lippitt, C.D.; Zhang, S. The impact of small unmanned airborne platforms on passive optical remote sensing: A conceptual perspective. Int. J. Remote Sens. 2018, 39, 4852–4868. [Google Scholar] [CrossRef]
  4. Xu, J.; Fu, K.; Sun, X. An Invariant Generalized Hough Transform Based Method of Inshore Ships Detection. In Proceedings of the 2011 International Symposium on Image and Data Fusion (ISIDF), Tengchong, Yunnan, China, 9–11 August 2011; pp. 1–4. [Google Scholar]
  5. Weber, J.; Lefevre, S. A multivariate hit-or-miss transform for conjoint spatial and spectral template matching. In Proceedings of the International Conference on Image and Signal Processing, Cherbourg, France, 1–3 July 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 226–235. [Google Scholar]
  6. Corbane, C.; Najman, L.; Pecoudl, E.; Demagistrit, L.; Petit, M. A complete processing chain for ship detection using optical satellite imagery. Int. J. Remote Sens. 2010, 31, 5837–5854. [Google Scholar] [CrossRef]
  7. Proia, N.; Pagé, V. Characterization of a Bayesian Ship Detection Method in Optical Satellite Images. IEEE Geosci. Remote Sens. Lett. 2010, 7, 226–230. [Google Scholar] [CrossRef]
  8. Nie, T.; He, B.; Bi, G.; Zhang, Y.; Wang, W. A method of ship detection under complex background. Int. J. Geo Inf. 2017, 6, 159–177. [Google Scholar] [CrossRef] [Green Version]
  9. Qi, S.; Ma, J.; Lin, J.; Li, Y.; Tian, J. Unsupervised ship detection based on saliency and s-hog descriptor from optical satellite images. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1451–1455. [Google Scholar]
  10. Dong, C.; Liu, J.; Xu, F. Ship detection in optical remote sensing images based on saliency and a rotation-invariant descriptor. Remote Sens. 2018, 10, 400–419. [Google Scholar] [CrossRef] [Green Version]
  11. Su, X.; Yang, G.; Sang, H. Ship detection in polarimetric sar based on support vector machine. Res. J. Appl. Sci. Eng. Technol. 2012, 4, 3448–3454. [Google Scholar]
  12. Yu, Y.; Ai, H.; He, X.; Yu, S.; Zhong, X.; Lu, M. Ship Detection in Optical Satellite Images Using Haar-like Features and Periphery-Cropped Neural Networks. IEEE Access 2018, 6, 71122–71131. [Google Scholar] [CrossRef]
  13. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  14. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  16. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  17. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef] [Green Version]
  18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  19. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  20. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  21. Dong, Y.; Chen, F.; Han, S.; Liu, H. Ship Object Detection of Remote Sensing Image Based on Visual Attention. Remote Sens. 2021, 13, 3192–3210. [Google Scholar] [CrossRef]
  22. Yang, X.; Sun, H.; Fu, K.; Yang, J.; Sun, X.; Yan, M.; Guo, Z. Automatic Ship Detection in Remote Sensing Images from Google Earth of Complex Scenes Based on Multiscale Rotation Dense Feature Pyramid Networks. Remote Sens. 2018, 10, 132–146. [Google Scholar] [CrossRef] [Green Version]
  23. Liu, W.; Ma, L.; Chen, H. Arbitrary-Oriented Ship Detection Framework in Optical Remote-Sensing Images. IEEE Geosci. Remote Sens. Lett. 2018, 15, 937–941. [Google Scholar] [CrossRef]
  24. Wang, C.; Bai, X.; Wang, S.; Zhou, J.; Ren, P. Multiscale Visual Attention Networks for Object Detection in VHR Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 310–314. [Google Scholar] [CrossRef]
  25. Zhang, H.; Wang, Y.; Dayoub, F.; Sünderhauf, N. VarifocalNet: An IoU-aware Dense Object Detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 8514–8523. [Google Scholar]
  26. Zhang, H.; Wang, Y.; Dayoub, F.; Sünderhauf, N. Swa Object Detection. arXiv 2020, arXiv:2012.12645. [Google Scholar]
  27. Xia, G.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  28. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  29. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  30. Redmon, J.; Farhadi, A. Yolov3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  31. Bochkovskiy, A.; Wang, C.; Liao, H. Yolov4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  32. Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  33. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 580–587. [Google Scholar]
  34. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 1440–1448. [Google Scholar]
  35. Lin, T.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  36. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  37. Tan, M.; Pang, R.; Le, Q. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  38. Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  39. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 821–830. [Google Scholar]
  40. Tang, G.; Zhuge, Y.; Claramunt, C.; Men, S. N-YOLO: A SAR Ship Detection Using Noise-Classifying and Complete-Target Extraction. Remote Sens. 2021, 13, 871–887. [Google Scholar] [CrossRef]
  41. Ultralytics. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 1 November 2021).
  42. Li, L.; Jiang, L.; Zhang, J.; Wang, S.; Chen, F. A Complete YOLO-Based Ship Detection Method for Thermal Infrared Remote Sensing Images under Complex Backgrounds. Remote Sens. 2022, 14, 1534–1547. [Google Scholar] [CrossRef]
  43. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  44. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  45. Zhu, M.; Hu, G.; Zhou, H.; Wang, S.; Feng, Z.; Yue, S. A Ship Detection Method via Redesigned FCOS in Large-Scale SAR Images. Remote Sens. 2022, 14, 1153–1171. [Google Scholar] [CrossRef]
  46. Dong, C.; Liu, J.; Xu, F.; Liu, C. Ship Detection from Optical Remote Sensing Images Using Multi-Scale Analysis and Fourier HOG Descriptor. Remote Sens. 2019, 11, 1529–1548. [Google Scholar] [CrossRef] [Green Version]
  47. Xu, X.; Zhang, X.; Zhang, T. Lite-YOLOv5: A Lightweight Deep Learning Detector for On-Board Ship Detection in Large-Scene Sentinel-1 SAR Images. Remote Sens. 2022, 14, 1018–1045. [Google Scholar] [CrossRef]
  48. Liu, S.; Kong, W.; Chen, X.; Xu, M.; Yasir, M.; Zhao, L.; Li, J. Multi-Scale Ship Detection Algorithm Based on a Lightweight Neural Network for Spaceborne SAR Images. Remote Sens. 2022, 14, 1149–1169. [Google Scholar] [CrossRef]
  49. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  50. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9308–9316. [Google Scholar]
  51. Deng, Z.; Sun, H.; Lei, L.; Zhou, S.; Zou, H. Object detection in remote sensing imagery with multi-scale deformable convolutional networks. Acta Geod. Cartogr. Sin. 2018, 47, 1216–1227. [Google Scholar]
  52. Ren, Y.; Zhu, C.; Xiao, S. Deformable faster r-cnn with aggregating multi-layer features for partially occluded object detection in optical remote sensing images. Remote Sens. 2018, 10, 1470–1483. [Google Scholar] [CrossRef] [Green Version]
  53. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  54. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Lin, D. MMDetection: Open mmlab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
  55. Li, Y.; Huang, Q.; Pei, X.; Jiao, L.; Shang, R. RADet: Refine Feature Pyramid Network and Multi-Layer Attention Network for Arbitrary-Oriented Object Detection of Remote Sensing Images. Remote Sens. 2020, 12, 389–409. [Google Scholar] [CrossRef] [Green Version]
  56. Wang, Y.; Jia, Y.; Gu, L. EFM-Net: Feature Extraction and Filtration with Mask Improvement Network for Object Detection in Remote Sensing Images. Remote Sens. 2021, 13, 4151–4169. [Google Scholar] [CrossRef]
  57. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 9759–9768. [Google Scholar]
  58. Zhu, C.; He, Y.; Savvides, M. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 840–849. [Google Scholar]
  59. Li, X.; Wang, W.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 11632–11641. [Google Scholar]
  60. Kim, K.; Lee, H. Probabilistic anchor assignment with iou prediction for object detection. In Proceedings of the European Conference on Computer Vision, Virtual, 23–28 August 2020; pp. 355–371. [Google Scholar]
  61. Li, B.; Liu, Y.; Wang, X. Gradient Harmonized Single-Stage Detector. arXiv 2018, arXiv:1811.05181. [Google Scholar] [CrossRef]
Figure 1. Some examples of ship detection in different situations. (a) Complex background; (b) multiple scales; (c) dense distribution; (d) different orientations and aspect ratios.
Figure 2. Illustration of the sampling locations in 3 × 3 standard and deformable convolutions. (a) Regular sampling grid of standard convolution; (b–d) deformed sampling locations of deformable convolution with augmented offsets. The green areas are the sampling locations in 3 × 3 standard convolution. The gray areas and the blue areas are the initial sampling locations and final sampling locations of the deformable convolution, respectively. The yellow arrow points from the initial sampling location to the corresponding final sampling location.
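To make the offset mechanism illustrated in Figure 2 concrete, the following minimal PyTorch sketch builds a single 3 × 3 deformable convolution whose offsets are predicted by an ordinary convolution, following the general recipe of Dai et al. [49]; the channel sizes and initialisation below are illustrative assumptions, not the exact layer configuration used inside FD-Net.

```python
# Minimal sketch of a 3x3 deformable convolution (in the spirit of [49]),
# illustrating the offset mechanism shown in Figure 2.
# This is an illustrative assumption, not the exact FD-Net layer.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d


class DeformConv3x3(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # One (dy, dx) offset pair per kernel position: 2 * 3 * 3 = 18 channels.
        self.offset_pred = nn.Conv2d(in_ch, 18, kernel_size=3, padding=1)
        # Zero-initialised offsets start from the regular sampling grid.
        nn.init.zeros_(self.offset_pred.weight)
        nn.init.zeros_(self.offset_pred.bias)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset_pred(x)  # learned deformation of the sampling grid
        return deform_conv2d(x, offset, self.weight, padding=1)


if __name__ == "__main__":
    feat = torch.randn(1, 64, 32, 32)
    print(DeformConv3x3(64, 128)(feat).shape)  # torch.Size([1, 128, 32, 32])
```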
Figure 3. The framework of the proposed FD-Net. FD-Net consists of four main components: a backbone that combines ResNet-50 with deformable convolution to extract features; an enhanced feature pyramid network (EFPN), based on PANet and deformable convolution, for multi-scale feature fusion; an adaptive balanced feature integrated (ABFI) module; and the VFNet head.
Figure 4. The enhanced feature pyramid.
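The exact EFPN wiring is specified in the method section; as a rough illustration of the PANet-style top-down and bottom-up fusion it builds on, the sketch below uses plain 3 × 3 convolutions where FD-Net would use deformable ones (see the sketch after Figure 2). The channel sizes, number of levels and layer names are assumptions made only for this example.

```python
# Sketch of a PANet-style top-down + bottom-up fusion path, the general pattern
# behind the enhanced feature pyramid in Figure 4.  Channel sizes, level count
# and the plain 3x3 convolutions are simplifying assumptions; FD-Net would
# replace such 3x3 convolutions with deformable ones.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PANetStyleFusion(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        # Top-down smoothing convolutions (one per level).
        self.td_smooth = nn.ModuleList(
            [nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in in_channels])
        # Bottom-up downsampling convolutions (one per level transition).
        self.bu_down = nn.ModuleList(
            [nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1) for _ in in_channels[:-1]])

    def forward(self, feats):
        # feats: backbone maps C3..C5, highest resolution first.
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down pathway (FPN): propagate semantics to high-resolution levels.
        for i in range(len(lat) - 2, -1, -1):
            lat[i] = lat[i] + F.interpolate(lat[i + 1], size=lat[i].shape[-2:], mode="nearest")
        p = [s(x) for s, x in zip(self.td_smooth, lat)]
        # Bottom-up pathway (PANet): propagate accurate localisation upwards.
        for i in range(1, len(p)):
            p[i] = p[i] + self.bu_down[i - 1](p[i - 1])
        return p


if __name__ == "__main__":
    c3 = torch.randn(1, 256, 64, 64)
    c4 = torch.randn(1, 512, 32, 32)
    c5 = torch.randn(1, 1024, 16, 16)
    for level in PANetStyleFusion()([c3, c4, c5]):
        print(level.shape)
```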
Figure 5. The pipeline of the adaptive balanced feature integrated (ABFI) module. Depending on where they are placed in the network, ABFI modules are divided into three types, namely ABFI-A, ABFI-B and ABFI-C, as shown in (a–c). GAP, Add and Concat denote global average pooling, addition and concatenation, respectively. Features at resolution level l are denoted as C_l, and r denotes the expansion factor. The red arrow denotes the multiplication operation. (Conv, k × k/s, n) denotes a convolution operation, where k is the kernel size of the convolution layer, s is the stride and n is the number of convolution kernels.
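The precise ABFI-A/B/C structures are defined in the paper; the sketch below shows only the general pattern the caption describes, i.e., GAP descriptors gathered from several levels, a small bottleneck controlled by the expansion factor r, and per-level weights multiplied back onto the features before they are added. The sigmoid gating, the resizing strategy and all layer names are assumptions made for illustration.

```python
# Illustrative sketch of an adaptive, GAP-based feature-weighting block in the
# spirit of the ABFI module in Figure 5.  The layer layout, sigmoid gating and
# single-output design are assumptions, not the published ABFI-A/B/C variants.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveFeatureWeighting(nn.Module):
    def __init__(self, channels: int = 256, num_levels: int = 3, r: int = 4):
        super().__init__()
        # Bottleneck on the concatenated GAP descriptors; r is the expansion factor.
        hidden = (channels * num_levels) // r
        self.fc = nn.Sequential(
            nn.Conv2d(channels * num_levels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_levels, kernel_size=1),
        )

    def forward(self, feats):
        # feats: pyramid maps with identical channel counts but any resolutions.
        target = feats[0].shape[-2:]
        resized = [F.interpolate(f, size=target, mode="nearest") for f in feats]
        # GAP each level, concatenate descriptors, predict one weight per level.
        desc = torch.cat([F.adaptive_avg_pool2d(f, 1) for f in resized], dim=1)
        w = torch.sigmoid(self.fc(desc))  # shape (N, num_levels, 1, 1)
        # Multiply each level by its weight and add (balanced integration).
        return sum(w[:, i:i + 1] * resized[i] for i in range(len(resized)))


if __name__ == "__main__":
    maps = [torch.randn(1, 256, s, s) for s in (64, 32, 16)]
    print(AdaptiveFeatureWeighting()(maps).shape)  # torch.Size([1, 256, 64, 64])
```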
Figure 6. The crop mosaic data augmentation.
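As a rough illustration of the crop mosaic idea in Figure 6, the NumPy sketch below tiles random crops of four images into a single 2 × 2 mosaic; the crop-size policy and the (omitted) shifting and clipping of bounding boxes are simplifications, not the exact procedure used to train FD-Net.

```python
# Minimal NumPy sketch of a mosaic-style augmentation in the spirit of the
# "crop mosaic" of Figure 6: four random crops are tiled into one image.
# Crop sizes and the omitted box handling are assumptions for illustration.
import numpy as np


def random_crop(img: np.ndarray, h: int, w: int, rng: np.random.Generator) -> np.ndarray:
    """Take a random h x w crop from an HWC image (assumes img is large enough)."""
    y = rng.integers(0, img.shape[0] - h + 1)
    x = rng.integers(0, img.shape[1] - w + 1)
    return img[y:y + h, x:x + w]


def crop_mosaic(images, out_size: int = 512, seed: int = 0) -> np.ndarray:
    """Tile random crops of four images into a 2 x 2 mosaic of size out_size."""
    rng = np.random.default_rng(seed)
    half = out_size // 2
    mosaic = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    cells = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(images, cells):
        mosaic[y:y + half, x:x + half] = random_crop(img, half, half, rng)
    # Box coordinates would be shifted into each cell and clipped here.
    return mosaic


if __name__ == "__main__":
    imgs = [np.random.randint(0, 255, (800, 800, 3), dtype=np.uint8) for _ in range(4)]
    print(crop_mosaic(imgs).shape)  # (512, 512, 3)
```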
Figure 7. The evolution of the loss function during the training stage. (a) On the DIOR dataset; (b) on the DOTA v1.5 dataset.
Figure 8. Some examples of the detection results on the DOTA v1.5 dataset in different situations. (a) Complex background; (b) dense distribution; (c) multiple scales; (d) different orientations and aspect ratios.
Figure 9. Some examples of the detection results on the DIOR dataset in different situations. (a) Complex background; (b) dense distribution; (c) multiple scales; (d) different orientations and aspect ratios.
Table 1. Ablation experiment of the components on the DOTA v1.5 dataset.
| Method | Backbone | EFPN | ABFI | CM | LSJ | SWA | AP | AP50 | AP75 | APS | APM | APL |
|--------|----------|------|------|----|-----|-----|-------|-------|-------|-------|-------|-------|
| VFNet | ResNet-50 | - | - | - | - | - | 0.486 | 0.664 | 0.595 | 0.289 | 0.632 | 0.788 |
| Ours | ResNet-50 |  |  |  |  |  | 0.487 | 0.672 | 0.599 | 0.320 | 0.611 | 0.701 |
| Ours | ResNet-50 |  |  |  |  |  | 0.484 | 0.671 | 0.593 | 0.314 | 0.611 | 0.720 |
| Ours | ResNet-50 |  |  |  |  |  | 0.510 | 0.677 | 0.625 | 0.322 | 0.646 | 0.824 |
| Ours | ResNet-50 |  |  |  |  |  | 0.509 | 0.681 | 0.626 | 0.327 | 0.644 | 0.803 |
| Ours | ResNet-50 |  |  |  |  |  | 0.508 | 0.675 | 0.615 | 0.297 | 0.660 | 0.839 |
| Ours | ResNet-50 | ✓ | ✓ | ✓ | ✓ | ✓ | 0.536 | 0.737 | 0.701 | 0.402 | 0.706 | 0.804 |
Table 2. Ablation experiment of the components on the DIOR dataset.
| Method | Backbone | EFPN | ABFI | CM | LSJ | SWA | AP | AP50 | AP75 | APS | APM | APL |
|--------|----------|------|------|----|-----|-----|-------|-------|-------|-------|-------|-------|
| VFNet | ResNet-50 | - | - | - | - | - | 0.528 | 0.783 | 0.626 | 0.428 | 0.705 | 0.868 |
| Ours | ResNet-50 |  |  |  |  |  | 0.539 | 0.789 | 0.648 | 0.447 | 0.706 | 0.847 |
| Ours | ResNet-50 |  |  |  |  |  | 0.535 | 0.789 | 0.642 | 0.444 | 0.698 | 0.825 |
| Ours | ResNet-50 |  |  |  |  |  | 0.529 | 0.787 | 0.630 | 0.435 | 0.694 | 0.848 |
| Ours | ResNet-50 |  |  |  |  |  | 0.551 | 0.794 | 0.659 | 0.459 | 0.711 | 0.875 |
| Ours | ResNet-50 |  |  |  |  |  | 0.531 | 0.782 | 0.633 | 0.434 | 0.704 | 0.872 |
| Ours | ResNet-50 | ✓ | ✓ | ✓ | ✓ | ✓ | 0.561 | 0.833 | 0.709 | 0.507 | 0.725 | 0.876 |
Table 3. Performance evaluation of different methods on the DOTA v1.5 dataset.
| Method | Backbone | AP | AP50 | AP75 | APS | APM | APL |
|--------|----------|-------|-------|-------|-------|-------|-------|
| ATSS | ResNet-50 | 0.486 | 0.671 | 0.598 | 0.283 | 0.634 | 0.789 |
| Faster R-CNN | ResNet-50 | 0.478 | 0.669 | 0.592 | 0.314 | 0.597 | 0.784 |
| FSAF | ResNet-50 | 0.477 | 0.675 | 0.582 | 0.303 | 0.602 | 0.735 |
| GFL | ResNet-50 | 0.453 | 0.648 | 0.548 | 0.279 | 0.584 | 0.776 |
| PAA | ResNet-50 | 0.459 | 0.647 | 0.548 | 0.219 | 0.631 | 0.790 |
| RetinaNet-GHM | ResNet-50 | 0.429 | 0.628 | 0.511 | 0.196 | 0.605 | 0.725 |
| RetinaNet | ResNet-50 | 0.423 | 0.627 | 0.502 | 0.187 | 0.601 | 0.703 |
| VFNet | ResNet-50 | 0.486 | 0.664 | 0.595 | 0.289 | 0.632 | 0.788 |
| YOLOv5-s | - | 0.415 | 0.572 | - | - | - | - |
| YOLOv5-m | - | 0.528 | 0.735 | - | - | - | - |
| FD-Net | ResNet-50 | 0.536 | 0.737 | 0.701 | 0.402 | 0.706 | 0.804 |
Table 4. Performance evaluation of different methods on the DIOR dataset.
| Method | Backbone | AP | AP50 | AP75 | APS | APM | APL |
|--------|----------|-------|-------|-------|-------|-------|-------|
| ATSS | ResNet-50 | 0.515 | 0.782 | 0.609 | 0.416 | 0.689 | 0.817 |
| Faster R-CNN | ResNet-50 | 0.501 | 0.775 | 0.588 | 0.409 | 0.665 | 0.820 |
| FSAF | ResNet-50 | 0.505 | 0.784 | 0.593 | 0.416 | 0.669 | 0.787 |
| GFL | ResNet-50 | 0.522 | 0.783 | 0.621 | 0.425 | 0.694 | 0.850 |
| PAA | ResNet-50 | 0.503 | 0.766 | 0.593 | 0.395 | 0.692 | 0.849 |
| RetinaNet-GHM | ResNet-50 | 0.488 | 0.754 | 0.571 | 0.380 | 0.686 | 0.788 |
| RetinaNet | ResNet-50 | 0.487 | 0.752 | 0.569 | 0.378 | 0.688 | 0.799 |
| VFNet | ResNet-50 | 0.528 | 0.783 | 0.626 | 0.428 | 0.705 | 0.868 |
| YOLOv5-s | - | 0.420 | 0.613 | - | - | - | - |
| YOLOv5-m | - | 0.539 | 0.805 | - | - | - | - |
| FD-Net | ResNet-50 | 0.561 | 0.833 | 0.709 | 0.507 | 0.725 | 0.876 |
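The AP, AP50, AP75, APS, APM and APL columns correspond to the standard COCO-style detection metrics. Assuming COCO-format ground-truth and detection files (the file paths below are placeholders, not artefacts released with the paper), such values are typically computed with pycocotools as follows.

```python
# Hedged example: how AP / AP50 / AP75 / APS / APM / APL values such as those
# in Tables 1-4 are typically computed with the COCO evaluation protocol.
# The file paths are placeholders, not files released with the paper.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/ships_val.json")                  # COCO-format ground truth
coco_dt = coco_gt.loadRes("results/fdnet_detections.json")    # detector outputs

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, APS, APM, APL and AR metrics
```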
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
