Article

Remote Sensing Image Target Detection Method Based on Refined Feature Extraction

College of Electrical and Information Engineering, Lanzhou University of Technology, Lanzhou 730050, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(15), 8694; https://doi.org/10.3390/app13158694
Submission received: 28 June 2023 / Revised: 25 July 2023 / Accepted: 26 July 2023 / Published: 27 July 2023

Abstract

To address the challenges posed by the large scale and dense distribution of small targets in remote sensing images, as well as the issues of missed detection and false detection, this paper proposes a one-stage target detection algorithm, DCN-YOLO, based on refined feature extraction techniques. First, we introduce DCNv2 and a residual structure to reconstruct a new backbone network, which enhances the extraction of shallow feature information and improves the network’s accuracy. Then, a novel feature fusion module is employed in the neck network to adaptively adjust the fusion weights for integrating texture information from shallow features with deep semantic information. This targeted approach effectively suppresses noise caused by extracting shallow features and enhances the representation of key features. Moreover, the normalized Gaussian Wasserstein distance loss, replacing Intersection over Union (IoU), is used as the regression loss function in the model to enhance the detection capability of multi-scale targets. Finally, comparisons against recent advanced methods such as YOLOv7 and YOLOv6 demonstrate the effectiveness of the proposed approach, which achieves an average precision of 20.1% for small targets on the DOTAv1.0 dataset and 29.0% on the DIOR dataset.

1. Introduction

Remote sensing image object detection plays a crucial role in various domains [1], including large-scale scene detection [2], natural disaster monitoring [3], and resource surveying, and it is of great significance for human life and societal development [4,5]. At the same time, remote sensing images are characterized by multiple scales [6], complex backgrounds [7], and multiple perspectives, and they are susceptible to lighting conditions, occlusion, and masking [8]. Therefore, intelligently extracting the target features of interest from high-resolution remote sensing images that contain rich feature information remains a difficult research problem.
Traditional methods for target detection in remote sensing images, such as Scale-Invariant Feature Transform (SIFT) [9], Histogram of Oriented Gradient (HOG) [10], Harris corner detection, and Haar features, perform poorly when targets undergo non-rigid deformation. However, owing to the rapid development of deep learning, scholars have proposed a series of object detection algorithms based on convolutional neural networks (CNNs). These algorithms can be categorized into one-stage and two-stage methods depending on whether they require a Region of Interest (ROI) [11] extraction step. R-CNN [12] is representative of the two-stage detection methods, which achieve high detection accuracy but suffer from slow processing due to the need for ROI selection, classification, and detection in each region. One-stage detection methods, such as YOLO [13], provide a better balance between accuracy and speed. Consequently, many studies have directly applied these algorithms to remote sensing image target detection tasks. However, the diversity of scales, orientations, shapes, distributions, and illumination in remote sensing images, along with the complexity and variability of backgrounds, presents a great challenge for traditional convolutional structures in extracting refined target features [14]. Conventional feature fusion structures, such as the widely adopted Feature Pyramid Network (FPN) [15], tend to introduce noise from shallower layers, thereby submerging the feature information of small objects. To address these issues, researchers have proposed various improved convolutional network–based algorithms. For example, Yang et al. used a dense connection structure to enrich feature expression capabilities and increased the number and depth of feature fusion structures to improve detection accuracy [16]. YOLT uses upsampling for feature map transformation, splices the intermediate feature map with the output feature map, and increases the number of detection heads to improve the detection performance of small targets [17]. Chen et al. injected semantic information into the shallowest feature map and fused deeper features to enhance small target detection performance [18]. Wang et al. employed feature fusion techniques and improved the loss function to enhance the model’s focus on small targets [19]. Li et al. employed deconvolution for feature map transformation and fused low-level and high-level features to further enhance small target detection [20]. Although the aforementioned feature fusion approaches can improve detection accuracy to some extent, they also introduce background noise that obscures the features of small targets. Fu et al. designed fusion coefficients based on empirical knowledge to determine the fusion ratio between shallow and deep features [21]. However, such coefficient values can be subjective and lack generalization.
In response to the limitations of existing object detection algorithms in detecting small objects, this paper proposes a novel object detection algorithm called DCN-YOLO, based on fine-grained feature extraction. The main contributions of this research are as follows: (1) A refined feature extraction network based on DCNv2 is proposed. This network effectively captures the shape and location features of the target by introducing the SimAM attention mechanism, which adaptively adjusts feature weights in 3D space and reduces noise during contextual semantic fusion. This improves the network’s ability to extract target features in complex contexts while maintaining a reduced number of parameters. (2) Normalized Gaussian Wasserstein distance (NGWD) loss function for multi-scale target detection: To enhance the robustness and generalization ability of multi-scale target detection, the CIoU loss function [22] is replaced with the NGWD loss function as the regression loss function. This substitution helps improve detection performance for targets of different sizes. (3) Small-target YOLO detection head: an additional small-target detection head is added to the network, and the K-means++ algorithm is used to cluster anchor frames on the publicly available DOTAv1.0 dataset [23], significantly improving the accuracy of small target detection. The effectiveness of the proposed method is validated through comprehensive experiments conducted on the publicly available DOTAv1.0 dataset and DIOR dataset [24]. Ablation experiments verify the rationality of each module proposed in this paper.

2. Materials and Methods

2.1. Overall Structure of DCN-YOLO Network

This section introduces the DCN-YOLO algorithm, an improvement built upon YOLOv7. The overall structure of the DCN-YOLO model is illustrated in Figure 1. First, the original convolutional block is replaced with a new basic block that utilizes DCNv2 [25] reconstruction. Additionally, a residual structure is incorporated into the new block, which effectively improves the network’s ability to adapt to geometric changes in objects and leads to a more accurate estimation of target spatial deformation. To address the issues of detecting dense targets and reducing missed detections and false alarms, the algorithm incorporates SimAM at the lateral connections during multi-scale feature fusion. SimAM adaptively weights feature information in 3D space without increasing the model’s training parameters. This approach reduces the impact of high semantic noise during contextual semantic fusion and enables the extraction of more powerful features. Moreover, the traditional distance metric is strongly influenced by the limited number of pixels in small targets, which can adversely affect their detection. To mitigate this issue, the paper introduces the NGWD loss function, which compensates for the impact of small targets with few pixels during the calculation of the regression loss, thereby improving the detection performance of small targets. The DCN-YOLO algorithm is evaluated using various datasets, including DOTAv1.0 and DIOR, to assess its performance and effectiveness. Experimental results demonstrate its capability to effectively detect small targets in challenging scenarios.
The structure of DCN-YOLO can be divided into four main modules, namely, the input end, backbone, neck, and head. In the backbone module, we adopted the CBS structure, which includes convolution, batch normalization, and SiLU activation function, as well as the D-ELAN module and D-MP module. Detailed explanations regarding this part can be found in Section 2.2. As for the neck module, we employed the SPAN structure, and a comprehensive description of the SPAN structure can be found in Section 2.3.

2.2. Stronger Backbone Network

Through experimental comparison, it has been observed that DCNv2 offers a wider range of feature-level sampling than its predecessor DCN and standard convolution when extracting feature maps. Moreover, DCNv2 incorporates a modulation mechanism that not only learns offsets for the samples but also allows them to be modulated by the learned feature magnitude. This enables DCNv2 to control both its sampling space and the relative impact of each sample. Given these advantages, we chose DCNv2 in this paper to upgrade and improve the model. Two main basic modules are derived from DCNv2, and their internal structures are illustrated in detail in Figure 2. The D-ELAN module follows the concept of an efficient aggregation network with a multi-branch structure. Each branch maintains consistent input and output channels, and the features are concatenated together using the concat operation while ensuring that the longest gradient path remains unchanged. Additionally, a 1 × 1 convolutional layer is applied to compress the channels, enhancing the network’s robustness. In the D-MP module, the left branch employs Maxpool for spatial downsampling and incorporates a 1 × 1 convolutional layer for channel compression. The right branch begins with a 1 × 1 convolutional layer for channel compression, followed by the DCN block for downsampling. Finally, the outputs of the two branches are concatenated using the concat operation.
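To make this structure concrete, the following is a minimal PyTorch sketch of a DCNv2-style block and a D-MP-style downsampling module built on torchvision’s DeformConv2d. The module names (DCNBlock, DMP) and channel splits are illustrative assumptions, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DCNBlock(nn.Module):
    """DCNv2-style block: modulated deformable convolution + BN + SiLU."""

    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        p = k // 2
        self.k = k
        # A single conv predicts 2*k*k sampling offsets and k*k modulation scalars per position.
        self.offset_mask = nn.Conv2d(c_in, 3 * k * k, kernel_size=k, stride=s, padding=p)
        self.dcn = DeformConv2d(c_in, c_out, kernel_size=k, stride=s, padding=p)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        om = self.offset_mask(x)
        offset = om[:, : 2 * self.k * self.k]
        mask = torch.sigmoid(om[:, 2 * self.k * self.k:])  # modulation weights in [0, 1]
        return self.act(self.bn(self.dcn(x, offset, mask)))


class DMP(nn.Module):
    """Sketch of a D-MP-style module: MaxPool branch and strided-DCN branch, concatenated."""

    def __init__(self, c_in, c_out):
        super().__init__()
        half = c_out // 2
        self.branch_l = nn.Sequential(nn.MaxPool2d(2, 2), nn.Conv2d(c_in, half, 1))
        self.branch_r = nn.Sequential(nn.Conv2d(c_in, half, 1), DCNBlock(half, half, k=3, s=2))

    def forward(self, x):
        # Both branches halve the spatial resolution; concatenation restores c_out channels.
        return torch.cat([self.branch_l(x), self.branch_r(x)], dim=1)
```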

2.3. Neck Feature Fusion Optimization

In target detection algorithms, the Pyramid Attention Network (PAN) has been widely recognized as a classical feature extraction method that enhances the performance of target detection algorithms [26]. One crucial step in PAN is the lateral connection, which fuses low-level features with high-level features. This fusion process aims to extract fine-grained target features from feature maps at different scales and combine them into a pyramidal feature map, thereby improving the algorithm’s accuracy. However, direct lateral fusion also propagates background noise from the shallow layers, which can submerge the features of small targets. To address this issue, Figure 3 depicts the Spatial Pyramid Attention Network (SPAN) structure proposed in this paper. The SPAN structure extends the original PAN structure by adding a layer of feature maps. This augmentation introduces richer target feature information to the model. Furthermore, a 3D non-parametric SimAM is introduced to adjust the weights of the feature maps at the PAN lateral connections. This attention mechanism reduces the impact of high semantic noise during the fusion process, thereby preserving the crucial features of small targets.
SimAM is an attention mechanism inspired by neuroscience theory [27]. It leverages the understanding that neurons exhibit distinct firing patterns based on the abundance or scarcity of information, allowing for the assessment of neuron importance. In the context of target detection, neurons deemed more important exert stronger inhibitory effects on surrounding neurons; such important neurons carry critical information and should be assigned greater weight. In remote sensing image target detection, these crucial neurons often play a role in extracting significant features of the target. SimAM combines the energy function with the attention mechanism, as illustrated in Figure 4. This combination enables the weighting of each neuron’s input, thereby enhancing the model’s attention toward important information. The SimAM attention mechanism is mathematically represented by Equations (1)–(4). In these equations, t represents the target neuron, x represents a neighboring neuron, λ is a hyperparameter, and e_t^* denotes the energy of neuron t. A lower energy value indicates that neuron t is more distinct from its surrounding neurons, signifying greater importance. Equation (4) shows how the neurons are weighted based on their importance. Compared to mainstream attention mechanisms like SENet [28] and CBAM [29], SimAM offers greater explanatory power and does not require the introduction of learnable parameters.
$\mu_t = \dfrac{1}{M-1}\sum_{i=1}^{M-1} x_i$   (1)

$\sigma_t^2 = \dfrac{1}{M-1}\sum_{i=1}^{M-1}\left(x_i - \mu_t\right)^2$   (2)

$e_t^* = \dfrac{4\left(\hat{\sigma}^2 + \lambda\right)}{\left(t - \hat{\mu}\right)^2 + 2\hat{\sigma}^2 + 2\lambda}$   (3)

$\tilde{X} = \operatorname{sigmoid}\!\left(\dfrac{1}{E}\right) \odot X$   (4)
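As a reference point, the following is a minimal PyTorch sketch of the parameter-free SimAM weighting described by Equations (1)–(4). The default value of λ is an assumption (1e-4 is commonly used in the SimAM paper) rather than a setting reported in this article.

```python
import torch
import torch.nn as nn


class SimAM(nn.Module):
    """Parameter-free attention: weight each activation by its energy-based importance (Eqs. 1-4)."""

    def __init__(self, lam=1e-4):
        super().__init__()
        self.lam = lam  # hyperparameter lambda

    def forward(self, x):                               # x: (B, C, H, W)
        n = x.shape[2] * x.shape[3] - 1                 # M - 1 neighbouring neurons per channel
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n         # channel-wise variance estimate (Eq. 2)
        e_inv = d / (4 * (v + self.lam)) + 0.5          # 1 / e_t*: larger means more important
        return x * torch.sigmoid(e_inv)                 # Eq. 4: X~ = sigmoid(1/E) ⊙ X
```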

2.4. Optimized Regression Loss Function

In target detection, the evaluation of the similarity between bounding box predictions is a crucial task. The commonly used IoU (Intersection over Union) metric measures the similarity between two sets of samples. However, it has been observed that the IoU metric is sensitive to position offsets, particularly for objects of different scales. Figure 5 illustrates this sensitivity, wherein a small position offset can result in a significant decrease in the IoU value for a tiny object (e.g., 5 × 5 pixels), but only a relatively small change for a general object (e.g., 33 × 33 pixels). This discrepancy in sensitivity affects the accuracy of label assignments. To address this issue, a new metric called NGWD is introduced as a regression loss [30], replacing the traditional IoU metric. NGWD offers several advantages, particularly in measuring the similarity between distributions even in cases where there is no overlapping region, or the overlapping region is negligible. This capability makes NGWD more suitable for measuring the similarity between tiny objects. The NGWD is defined in Equations (5)–(7) as follows:
$W_2^2(\mathcal{N}_a, \mathcal{N}_b) = \left\| \left[ cx_a,\; cy_a,\; \tfrac{w_a}{2},\; \tfrac{h_a}{2} \right]^{\mathrm{T}} - \left[ cx_b,\; cy_b,\; \tfrac{w_b}{2},\; \tfrac{h_b}{2} \right]^{\mathrm{T}} \right\|_2^2$   (5)

$\mathrm{NGWD}(\mathcal{N}_a, \mathcal{N}_b) = \exp\!\left( -\dfrac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C} \right)$   (6)

$L_{\mathrm{NGWD}} = 1 - \mathrm{NGWD}(\mathcal{N}_p, \mathcal{N}_g)$   (7)

where $(cx_p, cy_p, w_p, h_p)$ denotes the center coordinates, width, and height of the predicted box; $(cx_g, cy_g, w_g, h_g)$ denotes the center coordinates, width, and height of the ground-truth box; $\mathcal{N}_p$ and $\mathcal{N}_g$ are the Gaussian distribution models of the predicted box and the ground-truth box, respectively; $W_2^2(\mathcal{N}_p, \mathcal{N}_g)$ is the distance metric; and $C$ is a dataset-dependent constant, set to 12.8 here.
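The following is a hedged sketch of how the NGWD regression loss of Equations (5)–(7) could be computed for boxes given as (cx, cy, w, h). It uses C = 12.8 as stated above; the small eps term is an added numerical-stability assumption.

```python
import torch


def ngwd_loss(pred, target, C=12.8, eps=1e-7):
    """Normalized Gaussian Wasserstein distance loss (Eqs. 5-7).

    pred, target: (..., 4) tensors holding (cx, cy, w, h) per box.
    """
    # Squared 2-Wasserstein distance between the two box Gaussians (Eq. 5).
    w2 = ((pred[..., 0] - target[..., 0]) ** 2 +
          (pred[..., 1] - target[..., 1]) ** 2 +
          ((pred[..., 2] - target[..., 2]) / 2) ** 2 +
          ((pred[..., 3] - target[..., 3]) / 2) ** 2)
    ngwd = torch.exp(-torch.sqrt(w2 + eps) / C)   # Eq. 6
    return 1.0 - ngwd                             # Eq. 7
```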

2.5. Optimal Design of the Anchor Frame

Selecting an appropriate prior frame is essential for enhancing the training efficiency of target detection networks. However, remote sensing image targets exhibit significant variations in scale, morphology, and distribution compared to natural image datasets like COCO [31]. To improve the model’s prior knowledge, the K-means++ algorithm was utilized to cluster the labels in the remote sensing image dataset, resulting in 12 sets of anchor frames with different sizes. The clustering outcomes are presented in Table 1. By employing the K-means++ algorithm, more consistent and stable clustering results can be obtained, which yields anchor frames that align closely with the actual size distribution of the dataset.
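As an illustration, the clustering step could be sketched as follows using scikit-learn’s k-means++ initialization. The function name and the way the 12 resulting anchors are later assigned to the four detection scales are assumptions, not the authors’ exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_anchors(wh, n_anchors=12, seed=0):
    """Cluster (width, height) label pairs into anchor sizes with k-means++ initialization.

    wh: (N, 2) array of box widths and heights in pixels.
    Returns anchors sorted by area, ready to be split across the four detection scales.
    """
    km = KMeans(n_clusters=n_anchors, init="k-means++", n_init=10, random_state=seed).fit(wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]
```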

3. Experimental Results and Analysis

3.1. Experimental Platform and Hyperparameter Settings

In this study, the target detection algorithm was implemented on a computer hardware platform consisting of an Intel(R) Core(TM) i9-12900K CPU and two NVIDIA GeForce RTX 3090 Ti GPUs. The software setup includes CUDA 11.3, the Windows 10 operating system, the PyTorch 1.11.0 deep learning framework, and Python as the programming language. During the training process, the SGD optimizer was employed for parameter updates. The training was conducted with a batch size of 32, an initial learning rate of 0.01, a weight decay coefficient of 0.0005, and a momentum of 0.937. The total number of training epochs was set to 150, and the input resolution of the model was 640 × 640 pixels.
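For reference, a minimal PyTorch sketch of the optimizer setup implied by these hyperparameters is shown below. The placeholder model and the cosine learning-rate schedule are assumptions, since the article does not specify the scheduler.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder standing in for the DCN-YOLO network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937, weight_decay=0.0005)
# 150 epochs, batch size 32, and 640x640 inputs as reported; cosine decay is assumed here.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150)
```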

3.2. Introduction to the Dataset

To evaluate the effectiveness and generalization of the proposed method, comprehensive experiments were performed on two widely used remote sensing image target detection datasets: DOTAv1.0 and DIOR. For the DOTAv1.0 dataset, preprocessing was conducted by segmenting the original images into sub-images of 1024 × 1024 pixels with an overlap of 200 pixels. From the segmented images, a total of 10,000 samples were randomly selected. These samples were then divided into a training set, a validation set, and a test set following a 7:1:2 split ratio. This preprocessing step ensures that the DOTAv1.0 dataset is appropriately prepared for the training, validation, and testing of the proposed method. It enables the algorithm to learn and generalize from diverse sub-images, enhancing its ability to detect targets in remote sensing imagery accurately.
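A simplified sketch of this preprocessing (1024 × 1024 tiles with a 200-pixel overlap, followed by a 7:1:2 random split) is shown below. The helper names are hypothetical and edge handling is reduced to the essentials.

```python
import random


def tile_positions(size, tile=1024, overlap=200):
    """Top-left coordinates of tiles covering one image dimension with the given overlap."""
    step = tile - overlap
    positions = list(range(0, max(size - tile, 0) + 1, step))
    if positions[-1] + tile < size:  # make sure the final tile reaches the image border
        positions.append(max(size - tile, 0))
    return positions


def split_samples(names, ratios=(0.7, 0.1, 0.2), seed=0):
    """Randomly divide sample names into train/val/test sets with a 7:1:2 ratio."""
    names = list(names)
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * ratios[0])
    n_val = int(len(names) * ratios[1])
    return names[:n_train], names[n_train:n_train + n_val], names[n_train + n_val:]


# Example: tile corners for a hypothetical 2048 x 1536 DOTA image.
corners = [(x, y) for y in tile_positions(1536) for x in tile_positions(2048)]
```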

3.3. Evaluation Indicators

The evaluation metrics used in this paper are the number of parameters (Params), the number of floating-point operations (FLOPs), and the mean average precision (mAP). The mAP is calculated as shown in Equations (8) and (9), with recall R as the horizontal coordinate and precision P as the vertical coordinate. The AP value of a single category is obtained by integrating the area under the resulting curve, and the average over all categories is the mAP. mAP@0.5 is the average precision at an IoU threshold of 0.5, and mAP@0.5:0.95 is the average precision over IoU thresholds from 0.5 to 0.95, which evaluates the model more comprehensively. AP_S^val, AP_M^val, and AP_L^val denote the average precision for small, medium, and large targets, respectively.
$AP = \int_0^1 P(R)\, dR$   (8)

$mAP = \dfrac{1}{c}\sum_{i=1}^{c} AP_i = \dfrac{1}{c}\sum_{i=1}^{c}\int_0^1 P(R)\, dR$   (9)
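For clarity, a minimal sketch of how Equations (8) and (9) can be evaluated from per-class precision–recall curves is given below. It uses the standard monotone-envelope convention for the P(R) curve, which is an implementation assumption rather than something specified in the paper.

```python
import numpy as np


def average_precision(recall, precision):
    """AP as the area under the P(R) curve (Eq. 8), using the monotone-envelope convention."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]          # make precision non-increasing in recall
    idx = np.where(r[1:] != r[:-1])[0]                # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))


def mean_average_precision(per_class_pr):
    """mAP as the mean of per-class APs (Eq. 9); per_class_pr maps class -> (recall, precision)."""
    return float(np.mean([average_precision(r, p) for r, p in per_class_pr.values()]))
```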

3.4. Comparison with Other Algorithms

In this study, DCN-YOLO was compared with other state-of-the-art remote sensing image target detection algorithms on the DOTAv1.0 and DIOR datasets, as shown in Table 2 and Table 3. In comparison to the single-stage object detection algorithms YOLOv3, YOLOv3-Spp, YOLOv4-Csp, YOLOR-Csp, YOLOv5l(7.0), YOLOv5-Bifpn(7.0), YOLOv5s-Transformer(7.0), YOLOv6l, and YOLOv7, the proposed method in this paper has shown significant improvements in mAP@0.5 values. Specifically, the mAP@0.5 values have increased by 4%, 3.1%, 3.1%, 3.4%, 3.5%, 5.3%, 1.6%, 3.4%, and 3.4%, respectively. The proposed model demonstrated superior performance in several performance metrics and reduced computational effort, thus verifying its effectiveness.
On the DOTAv1.0 dataset, the proposed method achieved an average precision of 63.4% at IoU threshold 0.5, AP_M^val of 36.4%, AP_L^val of 51.2%, and notably AP_S^val of 20.1%, indicating a significant improvement in small target detection performance compared to other indices. Similarly, on the DIOR dataset, at 640 resolution, the proposed method achieved an average precision of 90.1%, with AP_S^val, AP_M^val, and AP_L^val of 29.0%, 57.4%, and 79.6%, respectively. The results indicate a significant improvement in small target detection performance compared to other methods. Overall, the experimental results demonstrate the superior detection accuracy of the proposed method, especially for small targets.
Figure 6 presents some detection results of the proposed method, YOLOv7, and YOLOv6l on the DOTAv1.0 dataset. The differences in the detection results are highlighted with red boxes. As shown in the four examples in Figure 6, the proposed method performs significantly better than the other methods in detecting small and dense targets such as swimming pools, airplanes, cars, and fuel tanks.
After conducting an in-depth analysis of the results, we found that the DCN-YOLO algorithm demonstrates the ability to accurately capture and detect dense small objects, exhibiting higher localization precision. This improved performance allows it to better recognize and distinguish objects in complex scenes. It is worth noting that, consistently with the comparative analysis results in other research works, our proposed method exhibits improved detection accuracy and mAP@0.5 values on various datasets, further validating the robustness and generalization capability of our approach.

3.5. Ablation Experiments

To comprehensively analyze the effectiveness of various improvement strategies, we randomly selected 10,000 images from the cropped DOTAv1 dataset and divided them into training, validation, and test sets in a 7:1:2 ratio. Ablation experiments were conducted to compare and analyze the impacts of different improvement strategies on the model’s detection performance. The experiments were based on the original YOLOv7, and different improvements were added to it separately, with the original algorithm serving as the control group. The specific experimental results are presented in Table 4.
Through experimental comparison, improvement 1 adds a detection head to the basic model and utilizes the K-means++ algorithm to cluster anchor boxes, resulting in a 1.2% increase in mAP@0.5:0.95 compared to the initial model, with the small target detection index AP_S^val increasing by 3.3%. The anchor box matching method for remote sensing images was also redesigned to better suit the four-scale feature layers of the network framework, greatly improving the model’s detection ability for small targets and effectively reducing the occurrence of missed detections and false positives. In improvement 2, while the detection performance of the basic model remained essentially unchanged, the computational complexity (FLOPs) of the improved model was reduced by 11.8%, making it more convenient for deployment on embedded mobile terminals and allowing users to choose a lower-cost processor by sacrificing a small amount of detection performance. Improvement 3, utilizing SFPN, had a positive impact on object detection accuracy in remote sensing images, especially for the large target index AP_L^val, which increased by 2.9%. This module introduces the SimAM attention mechanism at the lateral junction of the FPN to improve the fusion of deep semantic information and shallow texture information and to further focus on the target feature information. In improvement 4, the NGWD loss function was introduced as the regression loss on the basic model, resulting in a 1.5% increase in mAP@0.5 and greatly improving the network’s detection accuracy.
To further demonstrate the effectiveness of the proposed improvements, we visualized the feature maps extracted from the network, with red indicating that the model attends more to the target region and blue indicating the opposite. Five sets of images were randomly selected for feature map visualization, as shown in Figure 7. The feature maps illustrate that the improved model pays more attention to the target regions than the original model when processing these images, with red regions concentrated on the targets and blue regions distributed mostly over the background. This indicates that the improved model can better capture the target features in the images, thereby improving the model’s accuracy and performance. Additionally, the feature maps of different images show that the target features in different images are distinct, indicating that the improved model can better adapt to various types of image data.

4. Conclusions

Object detection algorithms based on deep learning have made great progress on natural scene images. However, remote sensing images present difficulties such as complex backgrounds, many small objects, and arbitrary arrangement directions. Therefore, this paper proposes DCN-YOLO, an object detection algorithm for remote sensing image detection tasks. First, the model introduces DCNv2 to reconstruct the backbone network and enhance its feature extraction capability. Then, the model improves the multi-scale feature fusion structure by incorporating the SimAM attention mechanism to suppress background interference and enhance the model’s ability to extract target features, thus improving localization accuracy. Moreover, a detection head is added to the head network, and anchor boxes are generated using the K-means++ algorithm to increase the model’s priors and significantly improve its detection performance for small targets. Finally, the model employs the normalized Gaussian Wasserstein distance loss function for regression loss calculation, effectively enhancing the model’s multi-scale detection capability. Experimental results show that the proposed method achieves good detection results on the large-scale public remote sensing image object detection datasets DOTAv1.0 and DIOR. However, the method in this paper has the disadvantages of slow detection speed and high GPU resource consumption, so a lightweight version of the network will be investigated in subsequent work.

Author Contributions

Conceptualization, H.C. and B.T.; methodology, B.T.; software, B.T.; validation, B.T.; formal analysis, B.T.; investigation, B.T.; resources, B.T.; data curation, B.T.; writing—original draft preparation, B.T.; writing—review and editing, H.C. and B.T.; visualization, B.T.; supervision, H.C.; project administration, H.C.; funding acquisition, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62163023, 61873116, 61763029), the Industrial Support Project of the Education Department of Gansu Province (2021CYZC–02), and the Gansu Provincial Science and Technology Planning (20JR10RA184).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used for training and test set DOTAv1.0 are available at: https://captain-whu.github.io/DOTA, accessed on 8 May 2023. The data used for training and test set DIOR are available at: http://www.escience.cn/people/gongcheng/DIOR, accessed on 8 May 2023.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, Y.; Ma, L.; Liu, T.; Huang, X.; Sun, G. The Synergistic Effect between Precipitation and Temperature for the NDVI in Northern China from 2000 to 2018. Appl. Sci. 2023, 13, 8425. [Google Scholar] [CrossRef]
  2. Potić, I.; Srdić, Z.; Vakanjac, B.; Bakrač, S.; Đorđević, D.; Banković, R.; Jovanović, J.M. Improving Forest Detection Using Machine Learning and Remote Sensing: A Case Study in Southeastern Serbia. Appl. Sci. 2023, 13, 8289. [Google Scholar] [CrossRef]
  3. Alkhatib, R.; Sahwan, W.; Alkhatieb, A.; Schütt, B. A Brief Review of Machine Learning Algorithms in Forest Fires Science. Appl. Sci. 2023, 13, 8275. [Google Scholar] [CrossRef]
  4. Wang, J.A.; Zhang, A.; Zhao, X. Development and application of the multi-dimensional integrated geography curricula from the perspective of regional remote sensing. J. Geogr. High. Educ. 2020, 44, 350–369. [Google Scholar] [CrossRef]
  5. Masita, K.L.; Hasan, A.N.; Shongwe, T. Deep learning in object detection: A review. In Proceedings of the 2020 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD), Durban, South Africa, 6–7 August 2020; pp. 1–11. [Google Scholar]
  6. Cai, D.; Lu, Z.; Fan, X.; Ding, W.; Li, B. Improved YOLOv4-Tiny Target Detection Method Based on Adaptive Self-Order Piecewise Enhancement and Multiscale Feature Optimization. Appl. Sci. 2023, 13, 8177. [Google Scholar] [CrossRef]
  7. Cai, Y.; Zhou, Y.; Zhang, H.; Xia, Y.; Qiao, P.; Zhao, J. Review of Target Geo-Location Algorithms for Aerial Remote Sensing Cameras without Control Points. Appl. Sci. 2022, 12, 12689. [Google Scholar] [CrossRef]
  8. Zhang, Y.; Liu, J.; Shen, W. A Review of Ensemble Learning Algorithms Used in Remote Sensing Applications. Appl. Sci. 2022, 12, 8654. [Google Scholar] [CrossRef]
  9. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  10. Wang, X.; Han, T.X.; Yan, S. An HOG-LBP human detector with partial occlusion handling. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 32–39. [Google Scholar]
  11. Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv 2013, arXiv:1312.6229. [Google Scholar]
  12. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  14. Miao, W.; Geng, J.; Jiang, W. Multigranularity Decoupling Network with Pseudolabel Selection for Remote Sensing Image Scene Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5603813. [Google Scholar] [CrossRef]
  15. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  16. Yang, X.; Sun, H.; Sun, X.; Yan, M.; Guo, Z.; Fu, K. Position detection and direction prediction for arbitrary-oriented ships via multitask rotation region convolutional neural network. IEEE Access 2018, 6, 50839–50849. [Google Scholar] [CrossRef]
  17. Van Etten, A. You only look twice: Rapid multi-scale object detection in satellite imagery. arXiv 2018, arXiv:1805.09512. [Google Scholar]
  18. Chen, S.; Zhan, R.; Zhang, J. Geospatial object detection in remote sensing imagery based on multiscale single-shot detector with activated semantics. Remote Sens. 2018, 10, 820. [Google Scholar] [CrossRef] [Green Version]
  19. Wang, P.; Sun, X.; Diao, W.; Fu, K. FMSSD: Feature-merged single-shot detection for multiscale objects in large-scale remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2019, 58, 3377–3390. [Google Scholar] [CrossRef]
  20. Su, H.; Wei, S.; Yan, M.; Wang, C.; Shi, J.; Zhang, X. Object detection and instance segmentation in remote sensing imagery based on precise mask R-CNN. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 1454–1457. [Google Scholar]
  21. Zhu, L.; Wu, F.; Fu, K.; Hu, Y.; Wang, Y.; Tian, X.; Huang, K. An Active Service Recommendation Model for Multi-Source Remote Sensing Information Using Fusion of Attention and Multi-Perspective. Remote Sens. 2023, 15, 2564. [Google Scholar] [CrossRef]
  22. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar]
  23. Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  24. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  25. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316. [Google Scholar]
  26. Ghiasi, G.; Lin, T.-Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar]
  27. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  28. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2019, 32, 4–24. [Google Scholar] [CrossRef] [Green Version]
  29. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.-S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  30. Wang, J.; Xu, C.; Yang, W.; Yu, L. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv 2021, arXiv:2110.13389. [Google Scholar]
  31. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef] [Green Version]
  32. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  33. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  34. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  35. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  36. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
Figure 1. Overall structure visualization of DCN-YOLO model.
Figure 2. D-ELAN module and D-MP module visualization.
Figure 3. Multi-scale feature fusion structure introducing SimAM attention mechanism.
Figure 4. Diagram of 3D attention weight.
Figure 5. Sensitivity analysis of IoU for small-scale and normal-scale objects. Each grid represents one pixel, box A represents the real bounding box, and boxes B and C represent the predicted bounding boxes with 1- and 4-pixel diagonal deviations, respectively.
Figure 6. Detection results of various algorithms for different scenarios: (a) pool detection, (b) airplane detection, (c) car detection, and (d) oil tank detection.
Figure 7. Comparative visualization of feature maps. Panels (a–f) present visualizations of feature maps from six image sets using the two methods.
Table 1. Prior boxes generated by K-means++.

Feature Map | Receptive Field | Anchor Frames
20 × 20 | Large | (138,219), (273,118), (315,308)
40 × 40 | Medium | (53,31), (58,118), (315,308)
80 × 80 | Small | (29,28), (23,63), (67,24)
160 × 160 | Tiny | (10,11), (15,24), (25,13)
Table 2. Comparison results of the DOTAv1.0 dataset.

Model | Params (M) | FLOPs (G) | mAP@0.5 | mAP@0.5:0.95 | AP_S^val | AP_M^val | AP_L^val | Year
YOLOv3 [32] | 61.6 | 154.8 | 59.4% | 38.7% | 14.0% | 34.2% | 48.0% | 2018
YOLOv3-Spp | 62.6 | 156.6 | 60.3% | 39.0% | 15.6% | 34.3% | 44.1% | 2018
YOLOv4-Csp [33] | 52.5 | 119.1 | 60.3% | 39.6% | 14.6% | 34.2% | 46.5% | 2020
YOLOR-Csp [34] | 46.2 | 107.9 | 60.0% | 39.6% | 16.6% | 34.1% | 50.3% | 2021
YOLOv5l(7.0) | 46.4 | 108.7 | 59.9% | 40.0% | 14.5% | 33.8% | 50.9% | 2021
YOLOv5-Bifpn(7.0) | 7.0 | 15.7 | 58.1% | 36.4% | 12.8% | 31.4% | 44.1% | 2021
YOLOv5s-Transformer(7.0) | 52.5 | 119.1 | 61.8% | 40.4% | 15.1% | 34.7% | 50.8% | 2021
YOLOv6l [35] | 59.6 | 150.5 | 60.0% | 40.1% | 15.8% | 35.8% | 49.6% | 2022
YOLOv7 [36] | 36.5 | 103.4 | 60.0% | 40.0% | 15.9% | 34.4% | 48.2% | 2022
DCN-YOLO | 38.1 | 98.4 | 63.4% | 41.9% | 20.1% | 36.4% | 51.2% | ours
Table 3. Comparison results of the DIOR dataset.

Model | Params (M) | FLOPs (G) | mAP@0.5 | mAP@0.5:0.95 | AP_S^val | AP_M^val | AP_L^val | Year
YOLOv3 | 61.6 | 154.8 | 87.0% | 61.4% | 24.0% | 53.5% | 74.0% | 2018
YOLOv3-Spp | 62.6 | 156.6 | 88.0% | 62.8% | 25.7% | 53.6% | 75.3% | 2018
YOLOv4-Csp | 52.5 | 119.1 | 88.3% | 63.1% | 25.9% | 55.1% | 76.2% | 2020
YOLOR-Csp | 46.2 | 107.9 | 89.5% | 66.1% | 27.1% | 55.9% | 77.8% | 2021
YOLOv5l(7.0) | 46.4 | 108.7 | 88.9% | 65.8% | 26.2% | 55.6% | 77.4% | 2021
YOLOv5-Bifpn(7.0) | 7.0 | 15.7 | 85.0% | 57.0% | 21.4% | 49.6% | 67.9% | 2021
YOLOv5s-Transformer(7.0) | 52.5 | 119.1 | 88.5% | 63.8% | 25.4% | 54.9% | 77.0% | 2021
YOLOv6l | 59.6 | 150.5 | 87.4% | 65.8% | 25.4% | 54.8% | 79.3% | 2022
YOLOv7 | 36.5 | 103.4 | 88.7% | 65.0% | 25.6% | 56.1% | 78.2% | 2022
DCN-YOLO | 38.1 | 98.4 | 90.1% | 66.7% | 29.0% | 57.4% | 79.6% | ours
Table 4. Detection performance of different improvement strategies on the DOTA dataset.

Model | Params | FLOPs (G) | mAP@0.5 | mAP@0.5:0.95 | AP_S^val | AP_M^val | AP_L^val
Base | 36.5M | 103.4 | 61.5% | 40.4% | 15.9% | 34.4% | 48.2%
+Head | 37.1M | 117.3 | 61.5% | 41.2% | 19.2% | 34.6% | 47.1%
+DCNv2 | 36.9M | 91.2 | 61.1% | 39.7% | 16.0% | 33.9% | 47.6%
+SFPN | 36.5M | 103.4 | 62.1% | 40.8% | 15.8% | 34.6% | 52.2%
+NGWD | 36.6M | 103.4 | 63.0% | 40.8% | 16.8% | 34.5% | 51.1%
Ours | 38.1M | 98.4 | 63.5% | 41.9% | 20.1% | 36.4% | 51.2%
Improvement | +4.4% | −4.8% | +2.0 | +1.5 | +4.2 | +2.0 | +3.0
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
