1. Introduction
In recent years, with the development of machine learning and deep learning, object detection, which can be used in navigation [1], disaster warning [2], building detection [3], and other fields, has gradually become a popular research topic in computer vision. Object detection requires identifying and locating specific objects, such as aircraft, cars, and pedestrians, in an image scene. It is a fundamental problem in computer vision, alongside tasks such as image classification [4], image segmentation [5], motion estimation [6], and object tracking [7], and it has prompted the development of a number of classical algorithms. However, it remains difficult for machines to learn to detect objects in remote sensing images [8], which are characterized by complex scenes, large scenes containing small objects, and multi-scale objects [9] across categories. As a result, remote sensing object detection suffers from poor detection of small objects and low accuracy on multi-scale objects.
Traditional object detection methods, such as the deformable parts model (DPM) [10,11], the histogram of oriented gradients [12] with support vector machines [13] (HOG-SVM), and the HOG-Cascade [14], are not ideal when applied directly to remote sensing object detection. Although these methods perform well when detecting common objects such as pedestrians and vehicles, the complex backgrounds, large-scale differences between objects, and small objects of remote sensing images render them ineffective for remote sensing object detection. With the rapid development of computer technology and deep learning, researchers have applied convolutional neural networks (CNNs) [15] to remote sensing object detection and achieved good results. J. Redmon et al. proposed YOLOv3, an incremental improvement [16] over previous detection methods. Z. Cui et al. proposed dense attention pyramid networks for multi-scale ship detection in SAR images [17]. W. Huang et al. proposed CF2PN [18], a cross-scale feature fusion pyramid network for remote sensing object detection. D. Xu et al. proposed FE-YOLO [19], a feature-enhancement network for remote sensing object detection. Compared with traditional object detection algorithms, CNN-based detectors are more accurate, allowing them to detect multi-scale objects and small objects in remote sensing images with high accuracy.
CNNs can extract spatial context information and have been widely used to detect objects in remote sensing images. At present, the most common neural networks for object detection are those based on region proposals and those based on anchor box regression. Most region proposal-based networks are two-stage networks that first determine the approximate object location with a region proposal network and then predict the object class and regress the exact bounding box. While this step-by-step learning strategy improves detection accuracy, it also increases detection time and makes efficient processing difficult, and the training time becomes excessive for remote sensing images with large input sizes. Typical examples of such networks include R-CNN [20], Fast R-CNN [21], and Faster R-CNN [22]. Most networks based on anchor box regression are one-stage networks that treat the whole prediction process as a regression problem. This simplification maintains accuracy while increasing speed; examples include the SSD [23,24,25] and YOLO [26,27,28] series and EfficientDet [29,30]. The YOLO series networks are typical anchor box regression-based networks, and several versions, such as YOLOv2 [31], YOLOv3 [16], YOLOv4 [32], and YOLOv5 [33], have been open-sourced. Among these, YOLOv3, YOLOv4, and YOLOv5 achieve a good balance between speed and accuracy for conventional object detection applications, delivering both efficient processing and good performance. However, when these methods are applied directly to remote sensing images, various problems arise, such as lower detection accuracy for objects with large-scale differences and difficulty detecting small objects in complex scenes. Therefore, the network structures of the YOLO series need further improvement to achieve better performance on remote sensing object detection.
To address the problems of complex scenes in remote sensing images, multi-scale objects across categories, and large scenes containing small objects, we propose DFPN-YOLO, a dense feature pyramid network structure based on YOLO. Since the YOLO series became more consolidated after version 3, with no major structural changes in later versions, we use YOLOv3 as a baseline to more easily compare accuracy before and after altering the network structure. First, we add a spatial groupwise enhancement (SGE) [34] attention module to the residual blocks [35] of the backbone to help it extract meaningful semantic information from complex scenes more efficiently; then, we add a large detection layer to improve the accuracy in detecting small objects in remote sensing images; and finally, we propose Dense-FPN, a dense feature pyramid network structure that combines the semantic information of the feature layers to improve the detection of objects at different scales.
The remainder of this paper is organized as follows: related work on YOLO, in particular the framework structure of YOLOv3, is discussed in Section 2. In Section 3, our methodology is described in detail. In Section 4, an experimental validation is presented, introducing the datasets used as well as the relevant evaluation metrics. Finally, the conclusions are given in Section 5.
2. Related Work
YOLO was first proposed by Joseph Redmon et al. in 2015, and the official version has been updated from YOLOv1 to YOLOv3; it is worth noting that YOLOv4 and YOLOv5 are not official versions. The YOLO series directly regresses the bounding box information of each grid cell on the final feature map, yielding three kinds of predictions for each bounding box: (1) the probability that an object is present in the grid cell; (2) the coordinates of the bounding box; and (3) the object class and its probability. For each grid cell, the predicted values include five parameters: x, y, w, h, and cf, where x and y denote the coordinates of the center point of the bounding box, w and h denote its width and height, and cf denotes the confidence of the bounding box. Therefore, the loss function of the whole network can be written as shown in Equation (1):

$$
\begin{aligned}
Loss = {} & \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2+(w_i-\hat{w}_i)^2+(h_i-\hat{h}_i)^2\right]\\
& -\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\hat{C}_i\log C_i+(1-\hat{C}_i)\log(1-C_i)\right]\\
& -\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left[\hat{C}_i\log C_i+(1-\hat{C}_i)\log(1-C_i)\right]\\
& -\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left[\hat{p}_i(c)\log p_i(c)+(1-\hat{p}_i(c))\log(1-p_i(c))\right]
\end{aligned}
\tag{1}
$$

In the equation, $S^{2}$ represents the number of grid cells, $B$ represents the number of anchors, and $\mathbb{1}_{ij}^{obj}$ represents whether the j-th anchor box of the i-th grid cell is responsible for detecting the object: if it is responsible, $\mathbb{1}_{ij}^{obj}$ is 1; otherwise, it is 0. $\hat{C}_i$ represents the ground-truth confidence, which is determined by whether the bounding box of the grid cell is responsible for predicting an object: if so, $\hat{C}_i$ is 1; otherwise, it is 0. When calculating the multi-class loss, we regard it as multiple binary classification tasks: for each category, the ground truth $\hat{p}_i(c)$ is 1 if the object belongs to that category and 0 otherwise, and the prediction $p_i(c)$ indicates the probability that the object belongs to that category. Our approach follows the loss function of YOLOv3, which will not be described further in subsequent sections.
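As a small illustration of the class term described above, the following sketch treats the multi-class prediction as independent binary classification tasks with a sigmoid and binary cross-entropy per category; it is an implementation assumption for illustration, not the authors' exact code:

```python
import torch
import torch.nn as nn

# Multi-class loss as multiple two-class tasks: one sigmoid + BCE per category.
bce = nn.BCEWithLogitsLoss()
logits = torch.randn(8, 20)        # 8 predicted boxes, 20 categories
targets = torch.zeros(8, 20)       # ground truth: 1 for the true class, else 0
targets[torch.arange(8), torch.randint(0, 20, (8,))] = 1.0
class_loss = bce(logits, targets)  # averaged binary cross-entropy
```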
The backbone of YOLOv3 is Darknet53 [36], which downsamples each input image five times, with the last three downsampled layers transmitted to the detection layers for object detection after feature fusion. The structure of YOLOv3 is shown in Figure 1. For a 416 × 416 input image, the three detection layer scales are 13 × 13, 26 × 26, and 52 × 52, which are responsible for detecting objects at different scales. The deep layers contain a large amount of semantic information, while the shallow feature maps contain a large amount of fine-grained information. Therefore, the network uses a feature pyramid to perform feature fusion, where the 32-fold downsampled feature map is first upsampled to the same size as the 16-fold downsampled feature map, and the two feature maps are then concatenated together. The same process is applied to the 16-fold and 8-fold downsampled feature maps.
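This fusion pattern can be illustrated with a minimal PyTorch sketch; the shapes assume a 416 × 416 input, and the 1 × 1 reduction convolutions that YOLOv3 applies before upsampling are omitted for brevity:

```python
import torch
import torch.nn.functional as F

c3 = torch.randn(1, 256, 52, 52)    # 8-fold downsampled feature map
c4 = torch.randn(1, 512, 26, 26)    # 16-fold downsampled feature map
c5 = torch.randn(1, 1024, 13, 13)   # 32-fold downsampled feature map

# Upsample the deeper map to the shallower map's size, then concatenate channels.
f4 = torch.cat([c4, F.interpolate(c5, scale_factor=2, mode="nearest")], dim=1)  # 26x26
f3 = torch.cat([c3, F.interpolate(f4, scale_factor=2, mode="nearest")], dim=1)  # 52x52
```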
3. Methods
Even YOLOv3 performs poorly on remote sensing object detection. Because remote sensing images are characterized by complex scenes, small objects, and multi-scale objects across categories, additional detection layers are needed to extract features more efficiently without deepening the network. For this purpose, we propose DFPN-YOLO. The structure of DFPN-YOLO is shown in Figure 2.
The specific methods are as follows: first, an attention module is added to the residual blocks of the backbone to allow the network to extract features in complex scenes more effectively. Second, a larger detection layer is added on top of the original three detection layers to allow the network to detect small objects; the four detection layers correspond to 4×, 8×, 16×, and 32× downsampling of the original image, and the feature information of small objects is fully retained on the 4×-downsampled feature map. Finally, a dense feature pyramid network structure is used to combine the scales of the four feature layers, allowing the fused feature layers to combine semantic information before and after sampling and improving object detection performance at different scales.
3.1. Attention-Based Feature Extraction Network
Darknet53, the backbone of YOLOv3, is mainly composed of residual units, and because of the way these residuals are combined, Darknet53 can be trained effectively even when stacked to 53 layers, without exploding or vanishing gradients. However, because the residual blocks are stacked very deep, training is slow, and the shortcut in each residual block causes the receptive field to capture only detail information rather than global characteristics. Thus, in complex scenes, the features in each layer are not extracted sufficiently or effectively, and simply stacking more residual units to deepen the network does not significantly improve the feature extraction ability. To address the difficulty of extracting features against the complex backgrounds of remote sensing images, we add the spatial groupwise enhancement (SGE) attention module to the residual unit. SGE builds on SE-Net and combines it with the idea of grouping, making it a lightweight attention module that improves classification and detection performance with almost no increase in the number of parameters or the computational cost. A complete feature is composed of many subfeatures, which are distributed in groups in each layer; however, these subfeatures are all processed in the same manner and are all affected by background noise, which can lead to incorrect recognition and localization results. The SGE module therefore generates an attention factor in each group, allowing the importance of each subfeature to be obtained and each group to learn and to suppress noise, as follows (a code sketch of these steps is given after the list):
The feature map is divided into G groups along the channel dimension;
Global average pooling is performed on each group to obtain the vector g;
The vector g is element-wise multiplied with the original group features to obtain an initial attention map;
The attention map is normalized and sigmoid-activated to obtain the attention factor of each group;
The attention factor is element-wise multiplied with the original group features;
Finally, the enhanced feature map is generated.
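A minimal PyTorch sketch of these steps is given below; the group count and the learnable per-group scale and shift are assumptions consistent with the SGE paper [34], not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class SGE(nn.Module):
    """Spatial groupwise enhancement: one attention factor per group."""
    def __init__(self, groups=8):
        super().__init__()
        self.groups = groups
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # learnable per-group scale and shift applied after normalization
        self.weight = nn.Parameter(torch.ones(1, groups, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, groups, 1, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.view(b * self.groups, c // self.groups, h, w)  # split into G groups
        g = self.avg_pool(x)                                 # global average pooling -> g
        attn = (x * g).sum(dim=1, keepdim=True)              # dot g with each position
        # normalize over spatial positions within each group
        attn = attn.view(b * self.groups, -1)
        attn = (attn - attn.mean(dim=1, keepdim=True)) / (attn.std(dim=1, keepdim=True) + 1e-5)
        attn = attn.view(b, self.groups, h, w) * self.weight + self.bias
        attn = attn.view(b * self.groups, 1, h, w)
        x = x * torch.sigmoid(attn)                          # gate the group features
        return x.view(b, c, h, w)
```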
A feature map is obtained from the original image after multiple successive convolutions; it is then divided into several groups along the channel dimension and processed by the SGE module. The attention factor of each group of features is obtained and mapped onto the corresponding feature map, and finally, after semantic feature enhancement, the enhanced feature map is generated. The SGE structure diagram is shown in Figure 3.
Because the SGE module is lightweight and effective for higher-order semantic features, it can be integrated seamlessly with Darknet53. We add the SGE module to the residual unit to improve the ability of the backbone network to extract features in complex scenes. Specifically, the original feature map is convolved, batch-normalized, and passed through the activation function; after a second convolution and batch normalization, feature enhancement is performed by the SGE module; and the enhanced feature map is summed with the original feature map via the shortcut connection and then passed through the activation function.
Figure 4 shows the SGE module after it has been inserted into the residual block.
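Reusing the SGE sketch above, the modified residual unit can be written as follows; the channel widths and LeakyReLU slope are assumptions matching common Darknet53 implementations rather than details stated here:

```python
import torch.nn as nn

class SGEResidual(nn.Module):
    """Residual unit with SGE enhancement before the shortcut addition (sketch)."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels // 2, 1, bias=False)  # 1x1 reduce
        self.bn1 = nn.BatchNorm2d(channels // 2)
        self.conv2 = nn.Conv2d(channels // 2, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.sge = SGE(groups)          # the module sketched above
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))  # conv + BN + activation
        out = self.bn2(self.conv2(out))          # second conv + BN
        out = self.sge(out)                      # SGE feature enhancement
        return self.act(out + x)                 # shortcut sum, then activation
```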
The residual units with attention modules are continuously stacked to form the backbone, SGEDarknet53, whose structure is shown in Table 1.
3.2. Detection Layer for Small Objects
YOLOv3 uses different detection layers to detect objects of various sizes. For a 416 × 416 input image, the sizes of the three detection layers are 13 × 13, 26 × 26, and 52 × 52, i.e., the feature maps of the three detection layers are downsampled 32, 16, and 8 times, respectively. The smaller the feature map, the larger the area of the input image that each grid cell covers; conversely, the larger the feature map, the smaller the area each grid cell covers. Thus, the 13 × 13 detection layer is suitable for detecting large objects, while the 52 × 52 detection layer is suitable for detecting small objects. However, the 52 × 52 detection layer is still downsampled 8 times relative to the original image, i.e., when an object is smaller than 8 × 8 pixels, it may occupy less than one pixel of the feature map after feature extraction, which makes such small objects difficult to detect. In general, remote sensing images contain a large number of small objects. To further improve the detection of small objects in remote sensing images, one of the most direct and effective approaches is to perform object detection directly on a higher-resolution feature map. Although this increases the computational cost to a certain extent, the high-resolution feature maps have relatively few channels in the feature fusion stage, and the increase in the number of parameters is concentrated in the prediction layer, so the overall parameter increase is limited. We therefore add a 104 × 104 detection layer to detect small objects; compared with the original image, it is downsampled only four times. In theory, even an object with a resolution of only 4 × 4 pixels retains its feature information on this detection layer, which greatly improves the detection of small objects. The added 104 × 104 × 255 small object detection layer is denoted as the P2 layer in Figure 2.
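The relationship between stride and the smallest representable object can be checked with a few lines of Python (a simple illustration, not part of the detection pipeline):

```python
# For a 416x416 input, an object smaller than a layer's stride collapses to
# less than one grid cell on that layer; the added stride-4 head keeps
# objects down to about 4x4 pixels representable.
input_size = 416
for stride in (4, 8, 16, 32):
    grid = input_size // stride
    print(f"stride {stride:2d}: {grid:3d}x{grid:<3d} grid, one cell covers {stride}x{stride} px")
```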
3.3. Multiscale Feature Fusion Based on Dense Feature Pyramids
In the feature fusion stage, YOLOv3 uses a feature pyramid network (FPN) [37] to laterally combine the semantic information of the last three sampled feature layers; the feature pyramid network structure is shown in Figure 5.
However, when a P2 detection layer is added, the FPN has four layers, and simple lateral connections no longer combine the semantic feature information well. Thus, we propose a dense feature pyramid network called Dense-FPN. Dense-FPN continuously samples and combines the feature maps of the C2, C3, C4, and C5 layers to generate the P2, P3, P4, and P5 layers. Specifically, the feature maps of the C3, C4, and C5 layers are upsampled and fused, and the fused feature maps are then upsampled and combined with the preceding layers until the top layer, C2, is reached, generating the intermediate hidden layers H2, H3, H4, and H5. After that, the feature maps of the hidden layers H2, H3, and H4 are downsampled and fused with the feature maps of the next layer, and the fused feature maps are downsampled and fused with the following layer until layer H5 is reached, generating the final layers P2, P3, P4, and P5. We also connect the input feature layers, the hidden layers, and the output layers with skip connections to achieve feature reuse. These connections are more conducive to gradient backpropagation, as they better utilize the feature information and improve the efficiency of information transfer between layers. The Dense-FPN structure is shown in Figure 6.
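The following PyTorch sketch illustrates this two-pass, densely connected fusion under stated assumptions: the channel widths, the use of nearest-neighbor upsampling and max pooling for resampling, and the concatenation-plus-3 × 3-convolution fusion are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFPN(nn.Module):
    """Sketch of the two-pass Dense-FPN fusion described above."""
    def __init__(self, channels=(128, 256, 512, 1024), out_ch=256):
        super().__init__()
        # 1x1 convs project C2..C5 to a common channel width
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in channels])
        self.fuse_h = nn.ModuleList([nn.Conv2d(out_ch * 2, out_ch, 3, padding=1)
                                     for _ in range(3)])
        self.fuse_p = nn.ModuleList([nn.Conv2d(out_ch * 3, out_ch, 3, padding=1)
                                     for _ in range(3)])

    def forward(self, c2, c3, c4, c5):
        l2, l3, l4, l5 = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        # top-down pass: upsample and fuse to produce hidden layers H5..H2
        h5 = l5
        h4 = self.fuse_h[0](torch.cat([l4, F.interpolate(h5, scale_factor=2)], 1))
        h3 = self.fuse_h[1](torch.cat([l3, F.interpolate(h4, scale_factor=2)], 1))
        h2 = self.fuse_h[2](torch.cat([l2, F.interpolate(h3, scale_factor=2)], 1))
        # bottom-up pass with dense skips from the lateral inputs and hidden layers
        p2 = h2
        p3 = self.fuse_p[0](torch.cat([l3, h3, F.max_pool2d(p2, 2)], 1))
        p4 = self.fuse_p[1](torch.cat([l4, h4, F.max_pool2d(p3, 2)], 1))
        p5 = self.fuse_p[2](torch.cat([l5, h5, F.max_pool2d(p4, 2)], 1))
        return p2, p3, p4, p5
```

In this sketch, feeding both the lateral input and the hidden layer into each output stage realizes the dense skip connections, so every P layer sees features from before and after both sampling passes.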
3.4. K-Means for Anchor Boxes
We use the k-means algorithm to generate anchors for the four detection layers. The k-means algorithm generates anchors that have large IOUs with the ground-truth boxes, which is more conducive to network convergence. The specific method is as follows (a code sketch is given after the list):
1. Randomly select k samples as the initial cluster centroids, where each centroid corresponds to the center of the samples in its cluster;
2. For each sample in the datasets, calculate the distance from its ground-truth bounding box to each cluster centroid, and assign the sample to the cluster with the smallest distance, as shown in Equations (2) and (3), where bbox denotes the bounding box and d(bbox, centroid) denotes the distance between the cluster centroid and the bbox:

$$d(bbox, centroid) = 1 - IOU(bbox, centroid) \tag{2}$$

$$IOU(bbox, centroid) = \frac{|bbox \cap centroid|}{|bbox \cup centroid|} \tag{3}$$

3. Recalculate the centroid of each cluster;
4. Repeat steps 2 and 3 until the clusters converge.
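A compact NumPy sketch of this procedure is shown below; the function and variable names are illustrative, and boxes are represented only by their widths and heights, as is usual for anchor clustering:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) pairs, assuming boxes and centroids share a center."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
            + centroids[None, :, 0] * centroids[None, :, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=12, iters=100, seed=0):
    """Cluster (N, 2) ground-truth (w, h) pairs with distance d = 1 - IOU."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]  # step 1
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)    # step 2
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])         # step 3
        if np.allclose(new, centroids):                               # step 4
            break
        centroids = new
    return centroids[np.argsort(centroids.prod(axis=1))]  # sorted by area
```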
For 416 × 416 input images, we used the k-means algorithm to generate 12 anchor boxes for the four detection layers: (21, 25), (25, 31), (33, 39), (44, 51), (59, 81), (84, 95), (104, 116), (119, 148), (161, 184), (221, 201), (246, 213), and (259, 278). The anchor boxes (21, 25), (25, 31), and (33, 39) were designed for the added 104 × 104 detection layer and can be used to detect the small objects, often only a few pixels in size, found in remote sensing images. For medium-sized objects, slightly larger anchors are used on the 52 × 52 and 26 × 26 feature maps. The anchor boxes (221, 201), (246, 213), and (259, 278) were designed for the large objects on the 13 × 13 feature map. Therefore, even if an image contains objects of different sizes, as shown in Figure 7, the hierarchically designed anchors can match the objects.
4. Experiments and Results
To verify the effectiveness of our proposed method, we conduct comparison experiments on the publicly available RSOD [38] and DIOR [39] datasets, comparing different versions of YOLO, some classical detection algorithms, and our proposed method. In this section, we present the datasets used, the evaluation metrics, the experimental procedures, and the experimental results.
4.1. Datasets
The RSOD datasets are open datasets for object detection in remote sensing images. They include aircraft, fuel tanks, sports fields, and overpasses annotated in the PASCAL VOC [40] format. The datasets are divided into four folders as follows:
4993 aircraft in 446 images;
191 playgrounds in 189 images;
180 overpasses in 176 images;
1586 oil tanks in 165 images.
Some example images from the RSOD datasets are shown in Figure 8.
We randomly divided the datasets into a training set, a validation set, and a test set in a 6:2:2 ratio, i.e., 580 images for training, 197 images for validation, and 199 images for testing, as shown in Table 2.
The DIOR datasets are large-scale benchmark datasets for object detection in optical remote sensing images. They include 23,463 images spanning different seasons and weather conditions, with 190,288 object instances, a uniform image size of 800 × 800, and resolutions ranging from 0.5 m to 30 m. We used the official DIOR split, which divides the training, validation, and test sets in a ratio of 2.5:2.5:5, as shown in Table 3 [39]. Note that one image may contain multiple object classes, so the column totals are not simply the sums of the corresponding columns: the number in each category row is the number of object instances, not images, while the "Total" in the last line is the number of images in each set.
Some example images from the DIOR datasets are shown in Figure 9.
As shown in the figure, the scenes in the DIOR datasets and RSOD datasets are relatively complex, including scenes such as mountains, lakes, grasslands, farms, docks, and airports. The scales of the different object categories vary greatly, ranging from small objects such as airplanes and cars, with sizes less than 30 × 30, to playgrounds and golf courses, with sizes larger than 500 × 500. The scales of similar objects, such as ships and airplanes, also vary greatly.
4.2. Evaluation Metrics
In this paper, we use the mean average precision (mAP) [41] as the evaluation metric. The mAP is an important metric for evaluating object detection performance. We divide the samples into true-positive (TP), false-positive (FP), true-negative (TN), and false-negative (FN) cases to calculate the precision (P) and recall (R), as shown in Equation (4):

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \tag{4}$$

Precision and recall are two mutually constraining and balancing metrics. To measure them jointly, we introduce the mAP, which is defined as the mean over categories of the area under each category's precision-recall (PR) curve at different confidence levels, as shown in Equation (5):

$$mAP = \frac{1}{N_C}\sum_{i=1}^{N_C}\int_{0}^{1} P_i(R)\,dR \tag{5}$$

where $N_C$ represents the number of categories in the datasets.
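In practice, the integral in Equation (5) is computed numerically from discrete PR points; a minimal sketch in the VOC all-points-interpolation style (an implementation assumption, since the interpolation scheme is not specified here) is:

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the PR curve from sorted (recall, precision) points."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # make the precision envelope monotonically non-increasing, right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]  # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is then the mean of the per-category AP values.
```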
4.3. Experimental Design
We trained on the RSOD and DIOR datasets using Faster R-CNN, SSD, YOLOv2, YOLOv3, YOLOv3-SPP, YOLOv4, and DFPN-YOLO in the PyTorch framework and performed data augmentation uniformly for the unbalanced categories of the original datasets. All experiments were performed on four NVIDIA RTX 2080 Ti GPUs with 11 GB of memory each, and to ensure the fairness of the comparison experiments, we used stochastic gradient descent (SGD) [42] to optimize all models with a momentum of 0.843 and a weight decay of 0.00036.
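For reference, the stated optimizer settings correspond to the following PyTorch call; `model` and the learning rate are placeholders, since the learning rate is not given here:

```python
import torch

# SGD with the momentum and weight decay reported above; lr is illustrative.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.843, weight_decay=0.00036)
```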
4.4. Results and Analysis
4.4.1. Experimental Results of DFPN-YOLO
DFPN-YOLO achieved high performance when tested on the RSOD and DIOR datasets. The results for each category are shown in Figure 10.
The above figures show the results of DFPN-YOLO on the DIOR and RSOD datasets, including the average precision (AP) for each category and the mAP over all categories. Our DFPN-YOLO model performed well on the RSOD datasets, although the slightly lower performance on overpass images was difficult to improve because of the small number of training samples. On the DIOR datasets, our model achieved AP values greater than 0.7 on 13 classes. Some of the test results are shown in Figure 11.
However, we found that some categories on the DIOR datasets, such as vehicles, bridges, and stadiums, had low detection performance. According to our analysis of the test set results, our DFPN-YOLO model produced a large number of false positives for small, dense objects, such as ships and vehicles, as shown in Figure 12.
The reason for the high number of false positives is that our model detects some small objects that exist but are not labeled, such as vehicles and ships. Since we added a detection layer for small objects, our model detects some real objects with lower confidence; these detections are nevertheless counted in the mAP calculation, lowering the final accuracy. Figure 12b shows that although the scene contains many small vehicles, none are marked in the ground-truth labels, yet our model detects some of them. Furthermore, the small number of training samples for objects such as stadiums increased the difficulty of training the model on these categories.
4.4.2. Results of the Comparison Experiment
To further validate the effectiveness of our method, we conducted comparison experiments using Faster R-CNN with a ResNet50 backbone, SSD with a VGG16 backbone, YOLOv2 with a Darknet19 backbone, YOLOv3 with a Darknet53 backbone, YOLOv3-SPP with a Darknet53 backbone and a spatial pyramid pooling (SPP) module, YOLOv4 with a CSPDarknet53 backbone, and DFPN-YOLO with an SGEDarknet53 backbone, comparing the accuracy of the algorithms in terms of mAP. The results on the DIOR datasets are shown in Table 4. Classes 1-20 represent the following categories: airplane, airport, baseball field, basketball court, bridge, chimney, dam, expressway service area, expressway toll station, golf field, ground track field, harbor, overpass, ship, stadium, storage tank, tennis court, train station, vehicle, and windmill. Similarly, we performed comparison experiments on the RSOD datasets, with Classes 1-4 representing the oil tank, playground, aircraft, and overpass categories, respectively. The results are shown in Table 5.
On the DIOR datasets, our method achieves the highest mAP, improving on the 62.41% of YOLOv3 to reach 69.33% while outperforming other advanced methods, including the 66.73% of YOLOv4, and it has the best detection performance in most categories. On the RSOD datasets, our method is also the most accurate: compared with the 83.9% of YOLOv3, DFPN-YOLO reaches 92%, which is 0.7% higher than YOLOv4. Furthermore, in the oil tank and playground categories, our detection performance is much higher than that of the other methods, with AP values reaching nearly 98%.
4.4.3. Ablation Experiments
To further validate the improvement provided by the Dense-FPN structure, we verified the contribution of each step of our method with ablation experiments on the RSOD datasets. The results are shown in Table 6.
The experimental results show that, starting from YOLOv3, adding only the SGE attention module improved the overall detection performance across the four categories owing to the enhanced feature extraction ability of the backbone. After the fourth detection feature layer was added for small object detection, the AP of category three, i.e., the aircraft category, which consists mostly of small objects, increased significantly from 88.4% to 91.4%. After the Dense-FPN structure was added, the detection accuracy improved across all four object categories, which shows that the Dense-FPN structure has a strong feature fusion capability for objects of different scales. Overall, compared with the original YOLOv3, the mAP improved from 83.9% to 92% after adding the SGE module, the fourth detection feature layer, and the Dense-FPN structure, demonstrating the effectiveness of our method.
5. Conclusions
As satellite imaging technology and deep learning technology have developed, remote sensing object detection has become a popular research topic. To address the problems of complex scenes, large scenes with small objects, and large-scale differences of objects in remote sensing object detection, a dense feature pyramid network based on YOLO known as DFPN-YOLO was proposed in this paper.
First, we added an attention module to the residual blocks of the backbone to allow the network to quickly extract key feature information in complex scenes. Then, we added a larger detection layer to address the difficulty of detecting small objects in large fields of view. Finally, we proposed a dense feature pyramid network structure named Dense-FPN, which enables all four detection layers to combine semantic information, improving object detection at different scales. Our method achieves high accuracy on the RSOD and DIOR datasets and outperforms classical algorithms, and even YOLOv4, in terms of mAP. On the DIOR datasets, our algorithm achieves an mAP of 69.33%, considerably higher than the 62.41% mAP of YOLOv3, and thanks to the Dense-FPN structure, its detection accuracy exceeds that of the other algorithms in most object categories. On the RSOD datasets, our algorithm also outperforms the other classical algorithms, reaching an mAP of 92%, 8.1 percentage points higher than the 83.9% of YOLOv3. From the comparison experiments, we found that YOLOv4, with its FPN + PAN structure, and DFPN-YOLO, with its Dense-FPN structure, significantly outperformed YOLOv3 overall, demonstrating the importance of feature fusion for detection precision; furthermore, our method performed slightly better than YOLOv4.
However, although our method achieves good performance on the RSOD and DIOR datasets, it performs poorly on some high-noise remote sensing images, and the detection of blurred and high-noise remote sensing images remains a major challenge for remote sensing object detection. We will address this in future work.