With the advent of the information age, remote sensing technology has addressed the limited coverage of traditional ground-based detection and the serious shortage of detection data by enabling rapid air-to-ground information acquisition and target detection. With its flexible maneuverability, high resolution, and selectable observation range, aerial remote sensing provides a new approach to target detection. Airplanes and ships are vital strategic resources and means of transportation in both military and civilian fields, so detecting them in remote sensing images is of great research significance. Compared with road vehicle detection, airplane and ship detection in remote sensing images is more challenging because the backgrounds are complex and the targets are diverse and small.
Before the rise of convolutional neural networks, target detection in aerial remote sensing images relied on traditional algorithms. These mainly include edge detection algorithms [1], such as the Roberts algorithm [2]; threshold segmentation methods [3], such as the Otsu threshold segmentation algorithm [4]; and visual saliency detection algorithms, such as the ITTI algorithm [5]. The first two complete the detection task by exploiting the strong contrast and the difference in gray value between the target and the image background, respectively, while the last locates the target through the difference in imaging angle. However, in complex remote sensing images, too many interfering elements lead to poor positioning accuracy.
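As an illustration of the first two families of traditional methods, the following is a minimal pure-Python sketch of the Roberts cross operator and Otsu's threshold on a toy grayscale image. The function names `roberts_edges` and `otsu_threshold`, the list-of-lists image representation, and the toy image are assumptions made for this example only; they are not taken from the cited works, and practical pipelines would normally rely on an optimized image-processing library.

```python
import math
from typing import List

Image = List[List[int]]  # grayscale image with integer values in 0..255


def roberts_edges(img: Image) -> List[List[float]]:
    """Roberts cross gradient magnitude; the output is (H-1) x (W-1)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h - 1):
        row = []
        for x in range(w - 1):
            gx = img[y][x] - img[y + 1][x + 1]   # difference along one diagonal
            gy = img[y][x + 1] - img[y + 1][x]   # difference along the other diagonal
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


def otsu_threshold(img: Image) -> int:
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the lower class
    sum0 = 0    # sum of gray values in the lower class
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue            # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break               # upper class empty from here on
        sum0 += t * hist[t]
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Toy example: a bright 2x2 "target" on a dark, uniform background.
toy = [
    [10, 10, 10, 10],
    [10, 200, 200, 10],
    [10, 200, 200, 10],
    [10, 10, 10, 10],
]
edges = roberts_edges(toy)
t = otsu_threshold(toy)
mask = [[1 if v > t else 0 for v in row] for row in toy]
```

On such a clean image both methods isolate the target easily; the difficulty described above arises when the background itself contains many strong edges and gray-level variations.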
To address these problems, convolutional neural networks (CNNs) have replaced traditional target detection algorithms for small-target detection in aerial remote sensing images with complex backgrounds. CNN-based target detection algorithms can be divided into two categories. The first comprises region-based target detection algorithms, that is, the two-stage algorithms represented by R-CNN [6,7,8]. These algorithms achieve high detection accuracy but are slow. The second comprises regression-based target detection algorithms, that is, the one-stage algorithms represented by You Only Look Once (YOLO) [9,10,11] and SSD (Single Shot multibox Detector) [12]. These algorithms convert the detection problem into a regression problem, which makes them significantly faster. CNNs have outstanding advantages in remote sensing images and can successfully complete target positioning and classification tasks. Deng et al. [13] presented a method based on an enhanced deep CNN, which followed the general process of “CNN feature extraction + region suggestion + region classification” and was successfully tested on large-scale Google Earth images. Long et al. [14] developed a CNN-based object localization framework for remote sensing images. Yu et al. [15] introduced a bilinear convolutional neural network model for scene classification, which greatly improved the performance and accuracy of remote sensing image classification tasks. Focusing on the problem of target detection in remote sensing images, Yao et al. [16] proposed an integrated model based on Faster R-CNN to detect chimneys and condensation towers in high-resolution remote sensing images. Tang et al. [17] used a Faster R-CNN-based network to monitor vehicle targets in remote sensing images in real time. For the smaller targets and more difficult tasks of airplane and ship detection, Zhang et al. [18] proposed a weakly supervised learning framework based on coupled convolutional neural networks for airplane detection. Xu et al. [19] proposed an airplane detection method for remote sensing images that used multilayer feature fusion in fully convolutional neural networks. Zou et al. [20] designed the Singular Value Decomposition network (SVDNet) for ship detection based on the convolutional neural network and the SVD algorithm. Wang et al. [21] studied a convolutional neural network-based renormalization method for ship detection in very high resolution (VHR) remote sensing images. Zhang et al. [22] designed a Deconv R-CNN model for airplane and ship detection by adding a deconvolution layer after the last convolution layer of the base network. The above methods are better suited to large, high-contrast targets in natural scenes; for small targets in complex backgrounds, their detection results are less accurate.
The YOLO series of algorithms offers significant advantages in detection accuracy and real-time performance over other algorithms, and YOLOv3 is generally considered the best of the series. This paper improves the YOLOv3 network to better meet the detection requirements for small targets. Training a convolutional neural network requires a large number of samples. To make up for insufficient data, we selected specific categories of training samples from the DOTA (Dataset of Object Detection in Aerial Images) dataset and trained the detection network for aerial remote sensing images on the resulting synthetic dataset. The resulting network model achieved relatively high detection accuracy on small targets. We chose small targets in two different complex backgrounds (i.e., airplanes and ships) to drive the optimization of the network model and to meet its accuracy requirements. To address the low detection accuracy for small targets, a detection scale was added to the deep features of the network to obtain a smaller receptive field and thereby enhance sensitivity to small targets. During training, the imbalance between positive and negative samples may lead to overfitting, so L2 regularization was added to the overall loss function to improve the anti-interference ability of the network model. The experimental results show that the improved network achieves higher accuracy than the previous detection algorithms.
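The effect of adding L2 regularization to a loss function can be sketched as follows. This is a generic illustration under stated assumptions, not the loss used in this paper: the regularization coefficient `lam`, the convention of penalizing `lam * sum(w^2)` (some formulations use `lam / 2`), and the helper names `l2_penalty`, `regularized_loss`, and `sgd_step` are all hypothetical, and the detection loss is treated as an already computed scalar.

```python
from typing import List, Sequence


def l2_penalty(weights: Sequence[float], lam: float) -> float:
    """L2 penalty lam * sum(w^2) over all (flattened) network weights."""
    return lam * sum(w * w for w in weights)


def regularized_loss(detection_loss: float,
                     weights: Sequence[float],
                     lam: float) -> float:
    """Total loss = detection loss + L2 penalty on the weights."""
    return detection_loss + l2_penalty(weights, lam)


def sgd_step(weights: Sequence[float],
             grads: Sequence[float],
             lr: float,
             lam: float) -> List[float]:
    """One gradient-descent step on the regularized loss.

    `grads` holds the gradient of the detection loss alone; the penalty
    contributes an extra 2 * lam * w, which shrinks each weight toward zero.
    """
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, grads)]


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
weights = [3.0, 4.0]
total = regularized_loss(detection_loss=1.0, weights=weights, lam=0.01)
shrunk = sgd_step(weights=[1.0], grads=[0.0], lr=0.1, lam=0.5)
```

Even when the detection-loss gradient is zero, the step still pulls the weight toward zero. This is how the penalty discourages large weights and thereby limits overfitting.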