4.1. Framework Overview
YOLOv8 demonstrates outstanding performance across multiple application domains. However, in remote sensing object detection, challenges persist in accurately detecting small objects. These challenges manifest primarily in two aspects. First, when neural networks extract features from images, the features of small objects may be obscured by larger surrounding objects, causing a loss of critical information; small objects can consequently be overlooked during the learning phase, reducing detection precision. Second, in complex scenes with many interacting objects, small objects are more susceptible to false positives and omissions: compared to larger objects, they are more likely to be occluded or to overlap with other objects, making visual distinction and localization more difficult. To tackle these challenges, we introduce HP-YOLOv8, an improved version of the YOLOv8 algorithm specifically designed for detecting small objects in remote sensing imagery (as depicted in Figure 2).
As shown in Figure 3, we first designed a continuously stacking and fusing module named C2f-DM (detailed in Section 4.2). By integrating local and global information, the C2f-DM module enhances the capture of small-object features and effectively alleviates the loss of detection accuracy caused by object overlap.
Second, we introduced an attention-based feature fusion technique named BGFPN (detailed in Section 4.3). This technique utilizes an efficient feature aggregation network and reparameterization technology to optimize the interaction of information between feature maps at different scales. Additionally, by introducing the BRA mechanism, BGFPN can more effectively capture critical feature information of small objects.
Lastly, we introduced a novel IoU loss calculation method named SMPDIoU (detailed in Section 4.4). This method comprehensively considers the shape and size of detection boxes, strengthening the model's focus on the attributes of the boxes themselves. It not only adjusts the shape and position of bounding boxes more accurately but also adapts the regression strategy to objects of different sizes. Moreover, by considering the perpendicular distance between two target boxes, SMPDIoU provides a more precise bounding box regression loss.
4.2. C2f-DM Module
The YOLOv8 backbone network mainly consists of stacks of simple convolutional modules. This design can cause small-object features to be overshadowed by those of larger surrounding objects during feature extraction, leading to the loss of crucial information. To improve the network's capability to process small objects, we introduced a novel module called C2f-DM, which replaces the existing C2f module before the detection head.
As shown in Figure 4, within the C2f module we use a DM-bottleneck structure embedded with the Dual Dynamic Token Mixer (D-Mixer) [48] to replace the original bottleneck. This configuration merges the benefits of convolution and self-attention while injecting a strong inductive bias into the uniformly split feature segments it processes. It dynamically integrates local and global information, considerably extending the network's effective field of view. The module processes the input feature map in two segments: one via Input-dependent Depth-wise Convolution (IDConv) and the other through Overlapping Spatial Reduction Attention (OSRA). The outputs of the two branches are then merged.
Specifically, consider a feature map $X \in \mathbb{R}^{C \times H \times W}$. It is first split along the channel dimension into two sub-maps $X_1, X_2 \in \mathbb{R}^{\frac{C}{2} \times H \times W}$. $X_1$ is processed by OSRA, while $X_2$ is handled by IDConv, yielding new feature maps $X_1'$ and $X_2'$ of the same dimensions. These are then concatenated along the channel dimension to produce the final output feature map $X' \in \mathbb{R}^{C \times H \times W}$. Finally, the Squeezed Token Enhancer (STE) performs efficient local token aggregation. The D-Mixer thus performs the following sequence of operations:

$$X_1, X_2 = \mathrm{Split}(X), \qquad X' = \mathrm{Concat}\big(\mathrm{OSRA}(X_1),\, \mathrm{IDConv}(X_2)\big).$$
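To make the data flow concrete, the following is a minimal PyTorch sketch of the split-process-merge pattern described above; the class and argument names are ours, and the two token mixers and the STE are stand-ins for the modules detailed in the remainder of this subsection.

```python
import torch
import torch.nn as nn

class DMixer(nn.Module):
    """Sketch of the D-Mixer data flow: split channels, mix each half
    with a different operator, concatenate, then aggregate locally."""
    def __init__(self, osra: nn.Module, idconv: nn.Module, ste: nn.Module):
        super().__init__()
        self.osra, self.idconv, self.ste = osra, idconv, ste

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)              # X1, X2 in R^{C/2 x H x W}
        x1 = self.osra(x1)                      # global branch (self-attention)
        x2 = self.idconv(x2)                    # local, input-dependent branch
        out = torch.cat([x1, x2], dim=1)        # X' in R^{C x H x W}
        return self.ste(out)                    # local token aggregation
```

Any pair of shape-preserving modules can be plugged into the two branches, which also makes the structure easy to ablate.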
In the IDConv module, the input feature map $X \in \mathbb{R}^{C \times H \times W}$ first undergoes adaptive average pooling, which gathers spatial context and reduces the spatial dimensions to $K \times K$, ensuring that global information is captured. The pooled map is then passed through two consecutive $1 \times 1$ convolution layers to create an attention map $A' \in \mathbb{R}^{(G \times C) \times K \times K}$, where $G$ denotes the number of attention groups; this enables the network to dynamically focus on important regions of the input feature map, thereby integrating local information. The map $A'$ is reshaped to $\mathbb{R}^{G \times C \times K \times K}$, and a softmax applied across the $G$ dimension produces the attention weights $A \in \mathbb{R}^{G \times C \times K \times K}$. These weights are multiplied element-wise with a set of learnable parameters $P \in \mathbb{R}^{G \times C \times K \times K}$ and summed over the $G$ dimension to form the tailored depth-wise convolution kernel $W \in \mathbb{R}^{C \times K \times K}$. This process dynamically adjusts the convolutional kernel weights to the characteristics of each input feature map, integrating global and local information. The entire IDConv process can be expressed as

$$A = \mathrm{Softmax}_{G}\Big(\mathrm{Conv}_{1\times1}\big(\mathrm{Conv}_{1\times1}\big(\mathrm{AAP}_{K}(X)\big)\big)\Big), \qquad W = \sum_{g=1}^{G} A_{g} \odot P_{g}, \qquad \mathrm{IDConv}(X) = \mathrm{DWConv}(X; W),$$

where $\mathrm{AAP}_{K}(\cdot)$ denotes adaptive average pooling to $K \times K$ and $\mathrm{DWConv}(\cdot\,; W)$ is a depth-wise convolution using the generated kernel $W$.
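The kernel-generation step is the heart of IDConv, so a hedged PyTorch sketch may help; the reduction ratio, the nonlinearity between the two $1 \times 1$ convolutions, and the batch-into-groups trick for per-sample kernels are our implementation assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDConv(nn.Module):
    """Input-dependent depth-wise convolution (sketch of the process above)."""
    def __init__(self, channels: int, kernel_size: int = 3,
                 groups: int = 4, reduction: int = 4):
        super().__init__()
        self.C, self.K, self.G = channels, kernel_size, groups
        self.pool = nn.AdaptiveAvgPool2d(kernel_size)      # -> (B, C, K, K)
        self.attn = nn.Sequential(                         # two 1x1 convs -> (B, G*C, K, K)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.GELU(),                                     # nonlinearity: our assumption
            nn.Conv2d(channels // reduction, groups * channels, 1),
        )
        # learnable kernel bank P in R^{G x C x K x K}
        self.P = nn.Parameter(torch.randn(groups, channels,
                                          kernel_size, kernel_size) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        a = self.attn(self.pool(x)).view(B, self.G, C, self.K, self.K)
        a = a.softmax(dim=1)                               # softmax over the G groups
        w = (a * self.P.unsqueeze(0)).sum(dim=1)           # (B, C, K, K) dynamic kernels
        # depth-wise conv with a different kernel per sample: fold batch into groups
        x = x.reshape(1, B * C, H, W)
        w = w.reshape(B * C, 1, self.K, self.K)
        out = F.conv2d(x, w, padding=self.K // 2, groups=B * C)
        return out.reshape(B, C, H, W)
```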
In the OSRA module, a technique known as Overlapping Spatial Reduction (OSR) is employed to improve the representation of spatial structure within the self-attention mechanism. OSR uses larger, overlapping patches so that spatial information near patch boundaries is captured more effectively. This not only preserves local information and feature expression but also integrates global information through the overlapping portions of the patches. The entire OSRA process can be expressed as

$$Y = \mathrm{OSR}(X), \qquad Q = XW^{q}, \qquad (K, V) = \mathrm{Split}\big(\big(Y + \mathrm{LR}(Y)\big)W^{kv}\big), \qquad \mathrm{OSRA}(X) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B\right)V,$$

where $B$ denotes the relative position bias matrix, $d$ represents the number of channels per attention head, and $\mathrm{LR}(\cdot)$ refers to the Local Refinement module, implemented as a $3 \times 3$ depth-wise convolution.
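A simplified PyTorch sketch of OSRA follows; the overlapping reduction is implemented here as a strided depth-wise convolution whose kernel exceeds its stride, and the relative position bias $B$ is omitted for brevity, so treat this as an illustration under those assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class OSRA(nn.Module):
    """Overlapping spatial-reduction attention (simplified sketch)."""
    def __init__(self, dim: int, num_heads: int = 4, sr_ratio: int = 2):
        super().__init__()
        self.heads, self.dk = num_heads, dim // num_heads
        self.q = nn.Conv2d(dim, dim, 1)
        # overlapping reduction: kernel (stride + 3) > stride, so patches overlap
        k = sr_ratio + 3
        self.osr = nn.Conv2d(dim, dim, kernel_size=k, stride=sr_ratio,
                             padding=k // 2, groups=dim)
        self.lr = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # LR: 3x3 DW conv
        self.kv = nn.Conv2d(dim, dim * 2, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        q = self.q(x).reshape(B, self.heads, self.dk, H * W).transpose(-1, -2)
        y = self.osr(x)
        y = y + self.lr(y)                                # locally refine reduced tokens
        k, v = self.kv(y).chunk(2, dim=1)
        k = k.reshape(B, self.heads, self.dk, -1)
        v = v.reshape(B, self.heads, self.dk, -1).transpose(-1, -2)
        attn = (q @ k / self.dk ** 0.5).softmax(dim=-1)   # softmax(QK^T / sqrt(d))
        out = (attn @ v).transpose(-1, -2).reshape(B, C, H, W)
        return self.proj(out)
```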
In the STE module, a $3 \times 3$ depth-wise convolution enhances local relationships, $1 \times 1$ convolutions for channel squeezing and expansion reduce the computational cost, and a residual connection preserves representational capacity. This design integrates both local and global information. STE can be represented as

$$\mathrm{STE}(X) = \mathrm{Conv}^{\uparrow}_{1\times1}\Big(\mathrm{Conv}^{\downarrow}_{1\times1}\big(\mathrm{DWConv}_{3\times3}(X)\big)\Big) + X,$$

where $\mathrm{Conv}^{\downarrow}_{1\times1}$ and $\mathrm{Conv}^{\uparrow}_{1\times1}$ denote the channel-squeezing and channel-expanding $1 \times 1$ convolutions, respectively.
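The corresponding PyTorch sketch is short; the squeeze ratio is our assumption.

```python
import torch
import torch.nn as nn

class STE(nn.Module):
    """Squeezed Token Enhancer: 3x3 DW conv + 1x1 squeeze/expand + residual."""
    def __init__(self, dim: int, squeeze: int = 4):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local relationships
        self.down = nn.Conv2d(dim, dim // squeeze, 1)            # channel squeeze
        self.up = nn.Conv2d(dim // squeeze, dim, 1)              # channel expansion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(self.dw(x))) + x                # residual keeps capacity
```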
The integrated C2f-DM module offers significant advantages over traditional frameworks such as YOLOv8 and other detection algorithms when processing complex remote sensing imagery and detecting small objects. Many current detectors rely primarily on simple stacks of convolutional modules; while this approach is computationally efficient, it can lack the flexibility and precision required for complex scenes or small objects. In contrast, the C2f-DM module dynamically integrates local and global information, enabling more precise feature extraction in scenarios involving small objects and complex backgrounds. This capability is particularly crucial for remote sensing applications that require extensive fields of view and fine-grained feature analysis. The OSRA component significantly improves the capture of spatial features through its overlapping spatial reduction, especially around object edges, while IDConv dynamically adjusts the convolution kernels based on the input, enhancing sensitivity to small objects and effectively reducing information loss. These characteristics allow the C2f-DM module to surpass current detection methods in delivering more efficient and precise detection performance.
Although C2f-DM introduces relatively complex mechanisms, it makes economical use of computational resources through the STE, which performs channel squeezing and expansion after the depth-wise convolution. This keeps the module efficient even in resource-constrained environments. Compared to traditional methods, the design of the C2f-DM module allows more flexible adjustment of the network structure to accommodate different application needs: by tuning the parameters within STE, an optimal balance between precision and speed can be found for a specific task without redesigning the entire network architecture.
Furthermore, the design of the C2f-DM module incorporates dynamic adjustment capabilities, enabling it to flexibly handle input features of varying scales and complexities and automatically adjust processing strategies to suit different inputs and scene conditions. This trait is particularly key for remote sensing image analysis, as these images often involve extensive geographic and environmental variations, along with constantly changing lighting and weather conditions. Therefore, the high adaptability of the C2f-DM module allows it to excel in scenarios with complex backgrounds or multi-scale objects, showcasing exceptional optimization potential and robust adaptability. Compared to existing methods, the adaptive capability of the C2f-DM is more pronounced, reducing reliance on manual intervention and significantly enhancing usability and flexibility, especially under a wide range of practical application conditions.
4.3. Bi-Level Routing Attention in Gated Feature Pyramid Network
4.3.1. Improved Feature Fusion Method
FPNs achieve multi-scale feature fusion by aggregating features of different resolutions from the backbone network. This not only boosts network performance but also improves robustness, and has proven crucial and effective for object detection. Nonetheless, the current YOLOv8 model adopts only the PANet structure, in which small objects are easily drowned out by normal-sized objects, so their information may gradually attenuate or even disappear entirely; the precision of object localization also suffers. To tackle these challenges, we propose a new feature fusion method, BGFPN.
We incorporated a top-down pathway to transmit high-level semantic feature information, guiding subsequent network modules in feature fusion and generating features with enhanced discriminative capacity. Additionally, a BRA [49] mechanism was introduced to extract information from the very small object layers (as shown in Figure 5). BRA uses sparse operations to efficiently bypass the least relevant regions, producing highly discriminative object features.
BGFPN builds on the Re-parameterized Gated Feature Pyramid Network (RepGFPN) [50], using an efficient feature aggregation network and reparameterization techniques to optimize the information interaction between feature maps of different scales. This architecture improves the model's handling of multi-scale information and efficiently merges spatial details with low-level and high-level semantic information. Although numerous upsampling and downsampling operations are introduced to enhance interactions between features, the additional upsampling operations that cause significant latency are removed, improving real-time detection speed.
When fusing features across scales, the model eliminates the traditional 3 × 3 convolution modules and introduces Cross Stage Partial Stage (CSPStage) [51] modules with a reparameterization mechanism. This module uses an efficient layer aggregation network as its feature fusion block, employing Concat operations to connect inputs from different layers. The model can thus integrate shallow and deep feature maps, obtaining rich semantic and positional information at high resolution, enlarging the receptive field and improving precision. RepConv [52], a representative reparameterized convolution module, fuses its branches at inference time, which simplifies the deployed network and increases inference speed.
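The reparameterization idea is easiest to see in code. Below is a minimal sketch, assuming the common RepVGG-style two-branch case (a 3 × 3 convolution plus a parallel 1 × 1 convolution, both with biases); the function name is ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def fuse_repconv(conv3x3: nn.Conv2d, conv1x1: nn.Conv2d) -> nn.Conv2d:
    """Fold a parallel 1x1 branch into the 3x3 branch (RepVGG-style fusion)."""
    fused = nn.Conv2d(conv3x3.in_channels, conv3x3.out_channels,
                      kernel_size=3, padding=1)
    # pad the 1x1 kernel to 3x3 so it lands on the centre tap, then add
    fused.weight.copy_(conv3x3.weight + F.pad(conv1x1.weight, [1, 1, 1, 1]))
    fused.bias.copy_(conv3x3.bias + conv1x1.bias)
    return fused

# the fused conv is numerically equivalent to the two-branch sum
x = torch.randn(1, 8, 16, 16)
b3, b1 = nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 1)
assert torch.allclose(b3(x) + b1(x), fuse_repconv(b3, b1)(x), atol=1e-5)
```

At training time the parallel branches add representational diversity; at inference only the single fused convolution runs, which is where the speed gain comes from.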
Furthermore, to address the detection of small objects more precisely, we introduced dilated convolution [53]. Dilated convolution enhances feature extraction by expanding the convolution kernel's receptive field without adding extra computational burden, and it avoids pooling operations, thereby maintaining the high resolution of the feature maps. This is critical for precisely localizing and identifying small objects within images, and it greatly enhances the model's detection precision in intricate scenes, particularly those with visual noise.
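As a one-line illustration (the channel count is arbitrary), a dilated 3 × 3 convolution in PyTorch covers a 5 × 5 neighbourhood at the parameter and FLOP cost of a plain 3 × 3 kernel:

```python
import torch.nn as nn

# dilation=2 spreads the 3x3 taps over a 5x5 window; padding=dilation keeps
# H and W unchanged, so the receptive field grows with no pooling and no
# loss of feature-map resolution
dilated = nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2)
```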
4.3.2. Bi-Level Routing Attention
In remote sensing images, complex backgrounds and severe noise often obscure small objects. Incorporating an attention mechanism into the network greatly enhances the capture of essential feature information, thus improving object detection precision. However, traditional attention mechanisms impose a considerable computational load on very small object layers, especially at high resolutions. To mitigate this, we have integrated a BRA mechanism tailored for vision transformers into the neck structure of YOLOv8. As shown in Figure 6, this mechanism first filters out irrelevant large-area features at a coarse region level, then focuses at a finer token level, dynamically selecting the most pertinent key–value pairs for each query. This strategy not only saves computational and memory resources but also greatly improves the precision of small object detection.
Initially, we divide a two-dimensional input feature map $X \in \mathbb{R}^{H \times W \times C}$ into $S \times S$ non-overlapping regions, each containing $\frac{HW}{S^2}$ feature vectors, reshaping $X$ into $X^{r} \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$. Linear projections then generate the query, key, and value tensors $Q, K, V \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$:

$$Q = X^{r}W^{q}, \qquad K = X^{r}W^{k}, \qquad V = X^{r}W^{v}.$$

We then construct a directed graph that maps the attention relations between regions. Averaging $Q$ and $K$ over the tokens within each region yields region-level queries and keys $Q^{r}, K^{r} \in \mathbb{R}^{S^2 \times C}$. The adjacency matrix $A^{r} \in \mathbb{R}^{S^2 \times S^2}$ of the region-to-region affinity graph is then computed by multiplying $Q^{r}$ with the transpose of $K^{r}$:

$$A^{r} = Q^{r}\left(K^{r}\right)^{\top}.$$

From $A^{r}$, we identify row-wise the top-$k$ regions most similar to each region and record their indices in the region-to-region routing index matrix $I^{r} \in \mathbb{N}^{S^2 \times k}$, where $k$ specifies the number of regions of interest retained within BGFPN.

Using the inter-region routing index matrix $I^{r}$, we then apply fine-grained token-to-token attention. First, we gather the key and value tensors of the routed regions, denoted $K^{g}$ and $V^{g}$:

$$K^{g} = \mathrm{gather}\left(K, I^{r}\right), \qquad V^{g} = \mathrm{gather}\left(V, I^{r}\right).$$

After integrating Local Context Enhancement (LCE), attention over the gathered key–value pairs produces the output:

$$O = \mathrm{Attention}\left(Q, K^{g}, V^{g}\right) + \mathrm{LCE}(V).$$
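A single-head, dependency-free sketch of this routing pipeline is given below, assuming $H$ and $W$ are divisible by $S$ and omitting multi-head splitting and the LCE depth-wise convolution; all names are ours.

```python
import torch
import torch.nn.functional as F

def bra(x: torch.Tensor, wq: torch.Tensor, wk: torch.Tensor, wv: torch.Tensor,
        S: int = 7, k: int = 4) -> torch.Tensor:
    """Bi-level routing attention sketch. x: (H, W, C); wq/wk/wv: (C, C)."""
    H, W, C = x.shape
    n = (H // S) * (W // S)                        # tokens per region
    # partition into S*S non-overlapping regions of n tokens each
    xr = (x.reshape(S, H // S, S, W // S, C)
           .permute(0, 2, 1, 3, 4)
           .reshape(S * S, n, C))
    q, key, v = xr @ wq, xr @ wk, xr @ wv          # Q, K, V: (S^2, n, C)
    # coarse level: region-averaged queries/keys and the routing graph
    qr, kr = q.mean(dim=1), key.mean(dim=1)        # Q^r, K^r: (S^2, C)
    ar = qr @ kr.T                                 # adjacency matrix A^r
    idx = ar.topk(k, dim=-1).indices               # routing index matrix I^r: (S^2, k)
    # fine level: gather keys/values of the k routed regions for each region
    kg = key[idx].reshape(S * S, k * n, C)         # K^g
    vg = v[idx].reshape(S * S, k * n, C)           # V^g
    attn = F.softmax(q @ kg.transpose(1, 2) / C ** 0.5, dim=-1)
    out = attn @ vg                                # token-to-token attention output
    # restore the (H, W, C) layout
    return (out.reshape(S, S, H // S, W // S, C)
               .permute(0, 2, 1, 3, 4)
               .reshape(H, W, C))
```

Because each query region attends to only $k$ of the $S^2$ regions, the quadratic token-to-token cost is paid only on the routed subset, which is what makes BRA affordable on high-resolution small-object layers.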
As depicted in Figure 2, in the BGFPN structure we incorporate the BRA mechanism after each C2f module during upsampling, before downsampling, and before feature fusion. Adding the BRA module before the upsampling step allows features to be attended to earlier, so small-object information is handled more precisely and recognition and localization improve significantly. Introducing the BRA module after each C2f module in the downsampling process ensures that, even after features are simplified, the model still captures details sensitively, strengthening the recognition of key information. Placing the BRA module before feature fusion, in particular, screens key areas at the coarse region level and applies in-depth attention at the token level, ensuring that the network prioritizes the key information in the image before integrating features and further improving small object detection precision. This integrated attention mechanism effectively isolates crucial information in intricate settings while amplifying focus on fundamental features, markedly boosting the precision of small object detection.
By integrating the BGFPN feature fusion technology, we have effectively optimized the information interaction between differently scaled feature maps. This design not only enhances the model’s ability to process multi-scale information, but also effectively merges high-level semantic information with low-level spatial information. Compared to other FPNs, BGFPN more meticulously handles features from details to the whole, thus providing richer and more accurate information in complex scenarios. Unlike traditional methods, which often struggle with small object detection, especially in noisy or complex backgrounds, BGFPN demonstrates significant advantages by incorporating dilated convolution technology. This not only expands the receptive field of the convolution kernels to enhance feature extraction capabilities but also avoids additional computational burdens. This strategy maintains a high spatial resolution of the feature maps, significantly enhancing the model’s precision in detecting small objects in complex scenes.
Furthermore, unlike conventional methods that typically incorporate attention mechanisms only at the backbone and neck connections of networks, our model introduces the BRA mechanism after each C2f module during both upsampling and downsampling processes, as well as before feature fusion. This approach not only captures key features of small objects more effectively but also optimizes small object detection and localization, significantly improving overall detection precision.
Although BGFPN employs numerous upsampling and downsampling operations to enhance interaction between features, it eliminates additional upsampling operations that cause significant latency issues, effectively improving the model’s speed in real-time detection. This improvement is crucial for scenarios requiring rapid responses, such as video surveillance or autonomous driving, as it significantly reduces processing time while ensuring efficient feature handling.
In summary, BGFPN not only improves the precision and speed of detection but also exhibits stronger adaptability and performance in handling complex and variable scenes, particularly surpassing many existing frameworks in terms of small object detection requirements.
4.4. Shape Mean Perpendicular Distance Intersection over Union
The bounding box regression loss function is crucial in object detection tasks, and researchers have proposed various improved methods, such as GIoU [44], DIoU [45], and CIoU. While these approaches have enhanced the handling of small objects and bounding boxes with extreme aspect ratios, they still emphasize mainly the geometric relationship between bounding boxes, overlooking the influence of a box's own shape and scale on the regression result.
To enhance small object detection, we introduced a new method called SMPDIoU. This method combines the advantages of SIoU [46] and MPDIoU [47], comprehensively considering the shape and scale of the bounding boxes and thus addressing the deficiencies of IoU and its improved variants. Furthermore, SMPDIoU incorporates a detailed regression loss calculation centered on the perpendicular distance between two bounding boxes. This approach not only markedly enhances the precision of detecting large objects but also excels at detecting small objects, efficiently addressing prevalent issues in small object detection. The loss is calculated as

$$\mathcal{L}_{\mathrm{SMPDIoU}} = \lambda\, \mathcal{L}_{\mathrm{SIoU}} + \left(1 - \lambda\right) \mathcal{L}_{\mathrm{MPDIoU}},$$
where $\lambda \in [0, 1]$ is a weight parameter that balances the influences of SIoU and MPDIoU and can be adjusted for specific application scenarios. In this model, the distance loss ($\Delta$) and the shape loss ($\Omega$) play a key role: by measuring the spatial distance and shape discrepancy between the ground-truth and predicted boxes, SMPDIoU effectively reduces the angular difference between the anchor and ground-truth boxes in the horizontal or vertical direction, accelerating the convergence of bounding box regression. The distance loss ($\Delta$) is defined by

$$\Delta = \sum_{t \in \{x, y\}} \left(1 - e^{-\gamma \rho_t}\right),$$

where $\gamma = 2 - \Lambda$, and $\rho_t$ is the normalized distance between the centers of the ground-truth and predicted bounding boxes, calculated as

$$\rho_x = \left(\frac{b_{cx}^{gt} - b_{cx}}{c_w}\right)^{2}, \qquad \rho_y = \left(\frac{b_{cy}^{gt} - b_{cy}}{c_h}\right)^{2}.$$
As shown in Figure 7, $b_{cx}$ and $b_{cy}$ are the center coordinates of the predicted bounding box, while $b_{cx}^{gt}$ and $b_{cy}^{gt}$ are the center coordinates of the ground-truth bounding box. Additionally, $h$, $w$, $h^{gt}$, and $w^{gt}$ denote the heights and widths of the predicted and ground-truth bounding boxes, respectively, and $c_w$ and $c_h$ in $\rho_t$ denote the width and height of the smallest box enclosing both. The angle-related coefficient $\Lambda$ is calculated as

$$\Lambda = 1 - 2\sin^{2}\!\left(\arcsin\frac{c_h}{\sigma} - \frac{\pi}{4}\right),$$
where $\sigma$ denotes the Euclidean distance between the centers of the predicted and ground-truth bounding boxes, calculated as

$$\sigma = \sqrt{\left(b_{cx}^{gt} - b_{cx}\right)^{2} + \left(b_{cy}^{gt} - b_{cy}\right)^{2}},$$

and $c_h$ in the angle term represents the discrepancy between the minimum and maximum y-axis positions of the ground-truth and predicted box centers, expressed as

$$c_h = \max\!\left(b_{cy}^{gt},\, b_{cy}\right) - \min\!\left(b_{cy}^{gt},\, b_{cy}\right).$$
The shape loss ($\Omega$) is given by

$$\Omega = \sum_{t \in \{w, h\}} \left(1 - e^{-\omega_t}\right)^{\theta},$$

where $\theta$ controls the weight placed on the shape cost, and $\omega_w$ and $\omega_h$ represent the relative differences in width and height between the bounding boxes, calculated as

$$\omega_w = \frac{\left|w - w^{gt}\right|}{\max\!\left(w, w^{gt}\right)}, \qquad \omega_h = \frac{\left|h - h^{gt}\right|}{\max\!\left(h, h^{gt}\right)}.$$
To assess the spatial alignment between the ground-truth and predicted bounding boxes more precisely, the model also computes the normalized perpendicular distance between their centers:

$$d_v = \frac{\rho\!\left(\mathbf{b},\, \mathbf{b}^{gt}\right)}{d_{\max}},$$

where $d_v$ is the normalized distance, which varies between 0 and 1; it relates the Euclidean distance $\rho(\mathbf{b}, \mathbf{b}^{gt})$ between the ground-truth and predicted bounding box centers to the maximum distance $d_{\max}$ that serves as the normalization reference.
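To tie the preceding equations together, here is a hedged PyTorch sketch of the angle, distance, and shape costs as defined above; the box layout, $\theta = 4$, and the epsilon guards are our assumptions, and the resulting $\Delta$ and $\Omega$ would enter $\mathcal{L}_{\mathrm{SIoU}}$ (and then the $\lambda$-weighted SMPDIoU mixture) as described earlier.

```python
import torch

def siou_costs(pred: torch.Tensor, gt: torch.Tensor, theta: float = 4.0,
               eps: float = 1e-7):
    """Angle (Lambda), distance (Delta), and shape (Omega) costs.
    pred, gt: (..., 4) boxes in (cx, cy, w, h) format."""
    cx, cy, w, h = pred.unbind(-1)
    gcx, gcy, gw, gh = gt.unbind(-1)
    # angle cost: Lambda = 1 - 2 sin^2(arcsin(c_h / sigma) - pi/4)
    sigma = torch.sqrt((gcx - cx) ** 2 + (gcy - cy) ** 2) + eps
    ch = torch.maximum(gcy, cy) - torch.minimum(gcy, cy)
    lam = 1 - 2 * torch.sin(torch.arcsin((ch / sigma).clamp(0, 1))
                            - torch.pi / 4) ** 2
    # distance cost: Delta = sum_t (1 - exp(-gamma * rho_t)), gamma = 2 - Lambda,
    # with rho_t normalized by the smallest enclosing box (c_w, c_h)
    cw_enc = torch.maximum(cx + w / 2, gcx + gw / 2) - torch.minimum(cx - w / 2, gcx - gw / 2)
    ch_enc = torch.maximum(cy + h / 2, gcy + gh / 2) - torch.minimum(cy - h / 2, gcy - gh / 2)
    gamma = 2 - lam
    rho_x = ((gcx - cx) / (cw_enc + eps)) ** 2
    rho_y = ((gcy - cy) / (ch_enc + eps)) ** 2
    delta = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))
    # shape cost: Omega = sum_t (1 - exp(-omega_t))^theta
    omega_w = (w - gw).abs() / (torch.maximum(w, gw) + eps)
    omega_h = (h - gh).abs() / (torch.maximum(h, gh) + eps)
    omega = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta
    return lam, delta, omega
```

In the original SIoU formulation these costs combine with an IoU term as $\mathcal{L}_{\mathrm{SIoU}} = 1 - \mathrm{IoU} + \frac{\Delta + \Omega}{2}$, which is the component of SMPDIoU weighted by $\lambda$.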