1. Introduction
An insightful understanding in very high resolution (VHR) aerial images (20 cm–30 cm resolution) can be available under imagery analysis for geospatial areas [
1,
2]. Object detection, a critical component of automatic aerial imagery analysis, plays an important role in national defense construction, urban planning, environmental monitoring and so on [
3,
4,
5]. Although many object detection methods in VHR aerial images have been proposed before, this task is still full of great challenges due to arbitrary orientations and large scale variations with various background. Recently, with the rapid development of convolutional neural networks (CNNs), a variety of CNN-based object detection frameworks [
6,
7] have been proposed and very impressive results have been achieved over natural benchmarks including PASCAL VOC [
8] and MS COCO [
9]. The existing CNN-based object detection methods can be commonly divided into two parts: two-stage methods and single-stage methods. In the two-stage methods, the input image in the first stage contributes to the generation of category-independent region proposals, and subsequently features of these regions are extracted, then the refinement of category-specific classifiers and regressors is achieved for classification and regression in the second stage. Finally, accurate detection results are attained with the deletion of redundant bounding boxes, such as non-maximum suppression (NMS). Region-based CNN (R-CNN) [
10] is a pioneering work. Its reformative version SPP-Net [
11], Fast-RCNN [
12] makes it possible to simplify learning and runtime efficiency. Faster-RCNN [
6] integrates the proposed Region Proposal Network (RPN) and Fast R-CNN into a unified network by sharing convolution weights, which makes object detection quick and accurate in an end-to-end manner. Zhang et al. [
13] proposed a multiscale cascaded object detection network and introduce multiscale features in pyramids to obtain feature of each scale with a novel attention method, which can highlight the object features and efficiently detect objects for traffic sign with complex background. Many high-performance detection methods are also proposed until now, such as FPN [
14], R-FCN [
15], Mask-RCNN [
16], Libra RCNN [
17], Trident-Net [
18], and so forth. In addition, the single-stage methods directly consider object detection as a regression problem, without the procedure of proposal generation, which can perform nearly real-time achievement. YOLO [
19] and SSD [
20] are popular in single-stage methods, which maintained real-time speed with ensured detection accuracy. RetinaNet [
7] proposes a new focal loss function to address class imbalance issue of single-stage approaches. Inspired by the two-stage methods, RefineDet [
21] can adjust the sizes of anchors and locations with the adoption of cascade regression and the application of an Anchor Refinement Module (ARM), and then filter out easy negative anchors to improve accuracy.
Inspired by the great success of CNN-based object detection methods in natural images, a growing number of studies have been devoted recently to object detection in VHR optical aerial images [
22,
23]. Considering the arbitrary orientation, Cheng et al. [
24] proposed to learn a Rotation-Invariant CNN (RICNN) model based on R-CNN framework used for multi-class object detection. To achieve real-time object detection, Tang et al. [
25] adopted a regression method based on SSD to detect vehicle targets through applying a set of default boxes with various scales on per feature map location. To be specific, to better fit the shape of objects, the offsets for each default box are predicted. To deal with the problem of multi-scale detection with the large ratio of remote sensing objects, Guo et al. [
26] proposed a unified multi-scale framework, which is composed of multi-scale object proposal network and a multi-scale detection network. To achieve further accuracy of the localization in aerial images, Zhang et al. [
27] proposed a Double Multi-scale Feature Pyramid Network (DM-FPN), which makes the most of semantic and resolution features simultaneously and bring up some multi-scale training, inference and adaptive categorical non-maximum suppression (ACNMS) strategies. In addition, object detection based on weakly supervised deep learning method arouses more and more attentions of researchers in recent years. Except depending on the costly bounding box annotations, Li et al. [
28] proposed a weakly supervised deep learning method which combine the separate category information and mutual cues between scene-level pairs to train a deep network for multi-class geospatial object detection.
These methods have achieved very promising detection performances by using horizontal bounding boxes (HBBs) as region of interests. HBBs are appropriate for ground-level images with mainly regular and vertical objects. However, in VHR aerial images, objects can have any orientations between 0 and 360 degrees viewed from overhead. Such HBB-based methods can result in missed detection and redundancy of detection region especially for densely-distributed and strip-like objects such as ship and large vehicle as shown in
Figure 1. Therefore, employing oriented bounding boxes (OBBs) as region of interests is highly recommended, which can identify more accurate and intuitive localization with fitting regions for aerial images. Recently, OBB-based methods in VHR aerial images gradually attract researchers’ attention. The existing methods, which contribute to the oriented object detection, can be divided into three categories: OBB detection with generating rotated region proposals, OBB regression from the coarse horizontal region proposals and OBB representation by calculating minimum area rectangle from the mask shape prediction. For the first method, Yang et al. [
29] presented a Rotation Dense Feature Pyramid Networks (R-DFPN) by producing rotational proposals from the RPN to achieve rotated location regression in the Fast-RCNN stage for ship detection with a large aspect ratio. Azimi et al. [
30] proposed Rotation Region Proposal Network (R-RPN) and Rotated Region of Interest Network (R-RoI) to generate and handle Rotation-based proposals respectively. Ding et al. [
31] proposed RoI Transformer to address the misalignment problem between the horizontal region of interests and oriented objects. Moreover, the Rotated RoI Learner and Rotated Position Sensitive RoI Align layer are designed to boost rotated object classification and regression. These methods achieve advanced performance. Meanwhile, the computational burden will mount as well as a result of the possibility that each pixel may generate dozens or even hundreds of rotated proposals with using a more complicated structure. For the second method, R
CNN [
32] is an efficient and classic rotation detection method, but it is especially for scene text detection, which is not suitable for aerial scenarios. Inspired by the R
CNN, Yang et al. [
33] proposed a multi-class detection method based on Faster R-CNN named SCRDet, making it probable to estimate and regress OBBs by making good use of the coarse resolution information in horizontal regions. SCRDet can achieve precisely rotation detection especially for small and cluttered objects on VHR remote sensing images. For the third method, Li et al. [
34] proposed a rotation detector named RADet inspired by Mask RCNN [
16], which obtains the rotated bounding box by calculating the minimum aera rectangle from the correspondingly precited mask shape. This simple presentation with an efficient multi-scale network achieve the competent performance on two benchmarks DOTA [
35] and NWPU VHR-10 [
36].
To better build an accurate and oriented object detection for multi-class objects in VHR aerial images with diverse background, this paper proposes a novel Multi-scale Feature Integration Attention Rotation Network (MFIAR-Net). The proposed framework is composed of three modules: Multi-scale Feature Integration Network (MFIN), Double-Path Feature Attention Network (DPFAN) and Rotation Detection Network. Compared with advanced rotation detection methods such as RADet [
34], SCRDet [
33], RoI-Transformer [
31] and ICN [
30], our framework is more suitable for multi-class and arbitrary-oriented object detection in aerial images. The main contributions of this paper are as follows:
We propose an accurate and unified Multi-scale Feature Integration Attention Rotation Network (MFIAR-Net) for VHR aerial images, which can efficiently detect the multi-category and arbitrary-oriented objects with fitting OBBs.
We propose a Multi-scale Feature Integration Network (MFIN) by integrating semantically strong, low-spatial resolution features and semantically weak, high-spatial resolution features into a discriminative feature map to handle the scale variations of geospatial objects. The Asymmetric Convolution Block (AC Block) is a crucial design to substitute for standard square-kernel convolutional layer to extract distinguished features.
We design a Double-Path Feature Attention Network (DPFAN) supervised by the mask information of ground truth to guide the network to focus on object representations and suppress the irrelevant background information.
We present a robust Rotation Detection Network to regress OBBs with five parameters , in which Position Sensitive RoI Align (PS RoI Align) layer and a new multi-task learning loss is introduced to make the localization of deep network more sensitive and benefit the oriented regression.
Besides, the multi-scale training and inference strategy is adopted to handle multi-scale remote sensing objects. The proposed framework is evaluated on two publicly aerial datasets DOTA [
35] and HRSC2016 [
37] compared with several state-of-the-art approaches. And the effectiveness and superiority of the proposed framework is demonstrated with comprehensive experiments.
The rest of this paper is organized as follows:
Section 2 describes the proposed MFIAR-Net for oriented object detection in detail.
Section 3 illustrates the datasets, details of implementation, evaluation criteria and experiment results.
Section 4 discusses the proposed framework, conducts the careful ablation study and analyzes limitations and future research directions. Finally, the conclusions are drawn in
Section 5.