1. Introduction
Object detection serves as a prerequisite for advanced visual tasks such as scene understanding. Compared to object detection in videos, detecting objects in static images is more challenging. Object detection in optical remote sensing images has been widely applied to diverse tasks such as monitoring, traffic surveillance, agricultural development, disaster planning, and geospatial referencing [1,2,3,4], and it has attracted considerable interest from researchers in recent years.
Traditional remote sensing image object-detection algorithms can be categorized into several types: threshold-based methods, feature engineering-based methods, template matching methods, machine learning-based methods, segmentation-based methods, and spectral information-based methods [5].
(1) Threshold-Based Methods: These methods typically use image brightness or color information to set appropriate thresholds for separating targets from the background. Pixels whose values exceed or fall below specific thresholds are treated as target or non-target pixels (a minimal sketch is given after this list). These methods are simple and easy to apply, but they are unstable under changing lighting conditions and complex backgrounds.
(2) Feature Engineering-Based Methods: These methods rely on manually designed features, such as texture, shape, and edges, to identify objects. Information is often extracted using filters, shape descriptors, and texture features. Subsequently, classifiers like support vector machines or decision trees are used to categorize the extracted features.
(3) Template Matching: Template matching identifies objects by comparing them with predefined templates or patterns. When the similarity between the target and the template exceeds a certain threshold, the target is detected as an object. This method works well when the target closely resembles the template, but it is not robust to target rotation, scaling, and other variations.
(4) Machine Learning-Based Methods: Machine learning algorithms, such as neural networks, support vector machines, and random forests, are employed to learn how to detect objects from data. Feature extraction and classifier parameters are often automatically learned from training data. This approach tends to perform well in complex object detection tasks, but it requires a significant amount of labeled data and computational resources.
(5) Segmentation-Based Methods: Object detection methods based on segmentation first divide the image into target and non-target regions and then perform further analysis and classification on each region. Segmentation methods can include region growing, watershed transform, and graph cuts. This approach works well when there are clear boundaries between objects and the background.
(6) Spectral Information-Based Methods: Remote sensing images typically contain information from multiple spectral bands, and this information can be leveraged for object detection. Methods based on spectral information often use spectral features like spectral reflectance and spectral angles to distinguish different types of objects.
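As a concrete illustration of the threshold-based approach in (1), the following minimal sketch marks pixels above a fixed brightness threshold as target pixels; the function name and threshold value are ours and purely illustrative.

```python
import numpy as np

def threshold_detect(gray_image: np.ndarray, threshold: float = 0.7) -> np.ndarray:
    """Return a boolean target mask for a grayscale image with values in [0, 1].

    Pixels brighter than the threshold are treated as target pixels; in practice,
    the threshold would be tuned to the scene and lighting conditions.
    """
    return gray_image > threshold
```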
The swift advancement of deep learning technologies, particularly the introduction of convolutional neural networks (CNNs), has ushered in new possibilities and applications for object detection in remote sensing images [6,7].
At present, the predominant frameworks for object detection in remote sensing images can be broadly classified into two major categories: single-stage and two-stage methods. Notable two-stage object-detection algorithms include Spatial Pyramid Pooling Networks (SPP-Net) [8], Region-based CNN (R-CNN) [9], and Faster R-CNN [10]. Two-stage methods often achieve high detection accuracy but, owing to their two-stage design, tend to be slower and have larger model sizes and parameter counts. Single-stage detection algorithms have effectively addressed these issues, with representative algorithms such as YOLO [11], SSD [12], RetinaNet [13], CornerNet [14], and RefineDet [15]. As single-stage algorithms such as YOLO have matured, they not only outperform two-stage methods in detection speed but also match or surpass them in accuracy.
Despite the current state of the art, there is still room for improvement in multi-scale and small-object detection. Many object categories in remote sensing images exhibit large size variations, even within a single category: ships in ports, for instance, can range from tens of meters to hundreds of meters in length. The capture altitude and the distance to the target also affect an object’s size in the image. Moreover, aerial images contain many small objects, which are often filtered out in the pooling layers of convolutional neural networks (CNNs) due to their small size, making them challenging to detect and recognize.
To tackle these challenges, this paper proposes an improved network model built upon YOLOv8. Our approach improves conventional convolution techniques, incorporates a state-of-the-art attention mechanism, and enhances the loss function. The primary contributions of this paper are as follows:
(1) A lightweight convolution SEConv was introduced to replace standard convolutions, reducing the network’s parameter count and speeding up the detection process. To address multi-scale object detection, the SEF module was proposed, based on SEConv.
(2) A novel EMA attention mechanism was introduced and integrated into the network, resulting in the SPPFE module, which enhances the network’s feature extraction capabilities and effectively addresses multi-scale object detection challenges.
(3) To improve the detection of small objects, an additional prediction head designed for tiny-object detection was added. Furthermore, the original detection head was replaced by a transformer prediction head to capture more global and contextual information.
(4) To mitigate the adverse gradients generated by low-quality examples, the Wise-IoU loss function was introduced.
(5) In the evaluation on the SIMD dataset, the AP50 reached 86.5%, marking a 2.1% improvement over YOLOv8, outperforming the state-of-the-art model YOLO-HR by 0.91%. Furthermore, we conducted validation on the NWPU VHR-10 dataset, where YOLO-SE achieved 94.9% accuracy, outperforming YOLOv8 by 2.6%.
The paper unfolds as follows: Section 2 reviews existing work on object-detection networks in remote sensing images, with a particular focus on attention mechanisms. Section 3 provides a detailed overview of both the YOLOv8 network and our proposed YOLO-SE network. In Section 4, we present a detailed account of our experiments and analyze the results on both the SIMD dataset and the NWPU VHR-10 dataset. Finally, Section 5 presents our conclusions.
3. Materials and Method
3.1. YOLOv8
YOLOv8 utilizes a backbone similar to YOLOv5, with some modifications to the CSPLayer, now referred to as the C2f module. The C2f module, a cross-stage partial bottleneck with two convolutions, combines high-level features with contextual information to enhance detection accuracy. YOLOv8 employs an anchor-free model with decoupled heads that handle the object-detection, classification, and regression tasks independently. This design allows each branch to focus on its specific task, improving overall model accuracy. In the output layer, YOLOv8 uses the sigmoid function as the activation for object scores, indicating the probability that an object is present within the bounding box, and the softmax function for class probabilities, indicating the probability that an object belongs to each possible class.
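The following minimal sketch illustrates the output activation scheme described above (sigmoid for object scores, softmax for class probabilities); it is not the actual Ultralytics YOLOv8 code, and the tensor shapes are illustrative.

```python
import torch

def decode_head_outputs(obj_logits: torch.Tensor, cls_logits: torch.Tensor):
    """obj_logits: (N, 1) raw object scores; cls_logits: (N, num_classes) raw class scores."""
    obj_scores = torch.sigmoid(obj_logits)         # probability that a box contains an object
    cls_probs = torch.softmax(cls_logits, dim=-1)  # per-class probabilities for each box
    return obj_scores, cls_probs
```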
For bounding box loss, YOLOv8 uses the CIoU [42] and DFL [43] loss functions, and for classification loss, it employs binary cross-entropy. These loss functions enhance object-detection performance, especially when dealing with smaller objects. For this paper, we selected YOLOv8 as the baseline model, which consists of three key components: the backbone network, the neck network, and the prediction output head. The backbone network is the core part of the YOLOv8 model and is responsible for extracting features from the input RGB color images. The neck network is positioned between the backbone network and the prediction output head. Its primary role is to aggregate and process the features extracted by the backbone network. In YOLOv8, the neck network plays a crucial role in integrating features of different scales. Typically, the neck network adopts a Feature Pyramid Network (FPN) structure, which effectively fuses features of various scales to construct a more comprehensive representation.
The prediction output head is the topmost part of the YOLOv8 model and is responsible for identifying and locating object categories in the images. The output head usually contains multiple detectors, with each detector responsible for predicting the position and category of objects. In YOLOv8, three sets of detectors are employed, each with a different scale, aiding the model in recognizing objects of various sizes. The network architecture of YOLOv8 is illustrated in Figure 1.
3.2. YOLO-SE
To address the issues related to detecting small objects and objects at multiple scales with the YOLOv8 network, we propose the YOLO-SE algorithm, discussed in this section. We first provide an overview of the YOLO-SE architecture. Building on this, we introduce the essential components of YOLO-SE: the Efficient Multiscale Convolution Module (SEF); the SPPFE module, which improves convolution by introducing the EMA attention mechanism; the replacement of the original YOLOv8 detection head with a transformer prediction head; and an additional detection head to handle objects at different scales. Finally, we replace the original CIoU bounding box loss function with Wise-IoU. The network structure of YOLO-SE is depicted in Figure 2.
3.3. SEF
We replaced the standard convolutions in C2f with a more lightweight and efficient multi-scale convolution module called SEF. This module introduces multiple convolutions with different kernel sizes, enabling it to capture various spatial features at multiple scales. Additionally, SEF extends the receptive field using larger convolution kernels, enhancing its ability to model long-range dependencies.
As shown in Figure 3, the SEConv2d operation partitions the input channels into four smaller groups. The first and third groups remain unchanged, while the second and fourth groups undergo 3 × 3 and 5 × 5 convolution operations, respectively. A 1 × 1 convolution then consolidates the features from the four groups. By convolving only half of the features and then integrating them with the original features, SEConv generates redundant features, reduces the parameter count and computational workload, and alleviates the influence of high-frequency noise, while preserving essential feature information. Each distinct convolutional mapping adaptively learns to focus on features of varying granularities. SEF excels in capturing local details, preserving the nuances and semantic information of target objects as the network deepens.
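A minimal PyTorch sketch of the SEConv2d channel-splitting scheme described above is given below; the class name, argument names, and layer details are our assumptions, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn

class SEConv2d(nn.Module):
    """Sketch of the SEConv idea: split the input channels into four groups,
    leave the 1st and 3rd untouched, apply 3x3 and 5x5 convolutions to the
    2nd and 4th, then fuse all groups with a 1x1 convolution."""

    def __init__(self, channels: int, out_channels: int):
        super().__init__()
        assert channels % 4 == 0, "channels must be divisible by 4 for the split"
        c = channels // 4
        self.conv3 = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(c, c, kernel_size=5, padding=2)
        self.fuse = nn.Conv2d(channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2, x3, x4 = torch.chunk(x, 4, dim=1)  # four channel groups
        x2 = self.conv3(x2)                        # 3x3 branch
        x4 = self.conv5(x4)                        # 5x5 branch
        out = torch.cat([x1, x2, x3, x4], dim=1)   # recombine identity and conv branches
        return self.fuse(out)                      # 1x1 conv consolidates the groups
```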
The structure of SEF is shown in Figure 4. In summary, the SEF module reduces the network’s parameter count, accelerates detection, and effectively captures multi-scale features and local details of the target.
3.4. SPPFE
The SPPF module in YOLOv8 has demonstrated its advantages in enhancing model performance through multi-scale feature fusion, particularly in certain scenarios. However, the SPPF module can be limited in complex backgrounds and when target scales vary, because it lacks a fine-grained mechanism for focusing on task-critical regions.
To address the limitations of the SPPF module and enhance feature extraction capabilities, we introduce the Efficient Multi-Scale Attention (EMA) mechanism [44], which adaptively adjusts the weights in the feature maps according to the importance of each region. The EMA attention mechanism retains information on each channel while reducing computational cost: a portion of the channels is restructured into the batch dimension, and the channel dimension is grouped into multiple sub-features, ensuring an even distribution of spatial semantic features within each feature group. This allows the module to focus on task-critical regions, making it more targeted in complex scenes. The structure of SPPFE is depicted in Figure 5; we incorporate the EMA attention mechanism into this module. The SPPFE module not only performs multi-scale feature fusion but also finely adjusts features at each scale, effectively capturing information at different scales. This enhancement significantly improves the model’s ability to detect small objects.
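The sketch below illustrates, under our own simplifying assumptions, the channel-grouping idea described above: channel groups are folded into the batch dimension and reweighted by adaptive per-group weights. It is a simplified stand-in for the full EMA module [44], not a faithful reimplementation.

```python
import torch
import torch.nn as nn

class GroupedChannelAttention(nn.Module):
    """Simplified sketch of the channel-grouping idea behind EMA: fold channel
    groups into the batch dimension, reweight each sub-feature with adaptive
    per-channel weights, then unfold back to the original layout."""

    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.pool = nn.AdaptiveAvgPool2d(1)  # per-group global context
        self.fc = nn.Conv2d(channels // groups, channels // groups, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        g = self.groups
        x_g = x.reshape(b * g, c // g, h, w)               # fold groups into the batch dim
        weights = torch.sigmoid(self.fc(self.pool(x_g)))   # adaptive per-channel weights
        x_g = x_g * weights                                # reweight each sub-feature
        return x_g.reshape(b, c, h, w)                     # unfold back to (B, C, H, W)
```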
3.5. TPH
Due to the significant variation in object sizes within remote sensing images, including numerous extremely small instances, experimental results have shown that YOLOv8’s original three detection heads do not adequately address the challenges presented by remote sensing imagery. As a result, we added an additional prediction head specifically designed for detecting tiny objects. When combined with the other three prediction heads, this approach enables us to capture relevant information about small targets more effectively while also detecting objects at different scales, thus improving overall detection performance.
We replaced the original detection head with a transformer prediction head to capture more global and contextual information. The structure of the Vision Transformer is depicted in Figure 6. It consists of two main blocks: a multi-head attention block and a feedforward neural network (MLP). The LayerNorm layer aids network convergence and prevents overfitting. Multi-head attention allows the current node to attend not only to the current pixel but also to the semantic context. While the additional prediction head introduces a considerable amount of computational and memory overhead, it improves the performance of tiny-object detection.
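A minimal sketch of the transformer block described above (pre-LayerNorm, multi-head self-attention, and an MLP, each with a residual connection) is shown below; the number of heads and the MLP ratio are illustrative, not the paper’s exact settings.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Sketch of the Vision-Transformer block in Figure 6: LayerNorm,
    multi-head self-attention, and an MLP, each with a residual connection."""

    def __init__(self, dim: int, num_heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -- flattened spatial positions of a feature map
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attend over all positions
        x = x + self.mlp(self.norm2(x))                    # position-wise feedforward
        return x
```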
3.6. Wise-IoU Loss
YOLOv8 uses Complete Intersection over Union (CIoU) [42] as the default loss-calculation method. CIoU builds upon Distance Intersection over Union (DIoU) by introducing the aspect ratio of the predicted and ground-truth bounding boxes, making the loss function more attentive to the shape of the bounding boxes. However, the computation of the CIoU loss is relatively complex, leading to higher computational overhead during training. Wise Intersection over Union (WIoU) [45] proposes a dynamic non-monotonic focus mechanism that uses outlyingness instead of IoU to assess the quality of anchor boxes, together with a gradient gain allocation strategy that reduces the competitiveness of high-quality anchor boxes and mitigates harmful gradients caused by low-quality anchor boxes. This allows WIoU to focus on ordinary-quality anchor boxes, ultimately improving the overall performance of the detector. WIoU comes in three versions: WIoUv1, which constructs an attention-based bounding box loss, and WIoUv2 and WIoUv3, which build upon WIoUv1 by adding a gradient gain to the focus mechanism.
The formula for calculating $\mathcal{L}_{WIoUv1}$ is as shown in Equation (2):
$$\mathcal{L}_{WIoUv1} = \mathcal{R}_{WIoU}\,\mathcal{L}_{IoU} \quad (2)$$
The calculation formula for $\mathcal{R}_{WIoU}$ is as shown in Equation (3):
$$\mathcal{R}_{WIoU} = \exp\!\left(\frac{(x - x_{gt})^{2} + (y - y_{gt})^{2}}{\left(W_{g}^{2} + H_{g}^{2}\right)^{*}}\right) \quad (3)$$
The quantities $(x, y)$, $(x_{gt}, y_{gt})$, $W_{g}$, and $H_{g}$ are illustrated in Figure 7.
To prevent $\mathcal{R}_{WIoU}$ from producing gradients that hinder convergence, $W_{g}$ and $H_{g}$ are detached from the computation graph (denoted by the superscript $*$ in Equation (3)). $\mathcal{R}_{WIoU}$ takes values in the range $[1, e)$, significantly amplifying the importance of low-quality anchor boxes. The IoU loss $\mathcal{L}_{IoU}$ takes values in the range $[0, 1]$, significantly reducing $\mathcal{R}_{WIoU}$ for high-quality anchor boxes and focusing the loss on the distance between the center points when the anchor box and the target box overlap well.
The dynamic non-monotonic focus mechanism uses “outlyingness” to assess anchor box quality instead of IoU, and it provides a wise gradient gain allocation strategy. This strategy reduces the competitiveness of high-quality anchor boxes while also mitigating harmful gradients generated by low-quality examples. This allows WIoU to focus on ordinary-quality anchor boxes and improve the overall performance of the detector.
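For illustration, a minimal PyTorch sketch of the WIoUv1 loss in Equations (2) and (3) is given below; it is our own sketch (axis-aligned, corner-format boxes assumed) rather than the authors’ implementation, and the WIoUv2/v3 focus terms are omitted.

```python
import torch

def wiou_v1_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """WIoUv1 loss for boxes in (x1, y1, x2, y2) format, per Equations (2)-(3)."""
    # Intersection and union for the plain IoU loss
    lt = torch.maximum(pred[..., :2], target[..., :2])
    rb = torch.minimum(pred[..., 2:], target[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)
    l_iou = 1.0 - iou

    # Smallest enclosing box (width W_g, height H_g); its size is detached below
    enc_lt = torch.minimum(pred[..., :2], target[..., :2])
    enc_rb = torch.maximum(pred[..., 2:], target[..., 2:])
    wg, hg = (enc_rb - enc_lt).unbind(-1)

    # R_WIoU = exp(((x - x_gt)^2 + (y - y_gt)^2) / (W_g^2 + H_g^2)*)   -- Equation (3)
    cx_p = (pred[..., 0] + pred[..., 2]) / 2
    cy_p = (pred[..., 1] + pred[..., 3]) / 2
    cx_t = (target[..., 0] + target[..., 2]) / 2
    cy_t = (target[..., 1] + target[..., 3]) / 2
    dist2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    r_wiou = torch.exp(dist2 / (wg ** 2 + hg ** 2 + eps).detach())

    return r_wiou * l_iou  # Equation (2): L_WIoUv1 = R_WIoU * L_IoU
```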