To enhance the detection precision of small objects for UAVs, an improved YOLOv8s-based model is proposed. The network architecture of the proposed model is shown in
Figure 1, which includes three main improved parts, namely, the PMSE module, the SCFPN module, and the WIOU loss function.
3.1. The Proposed PMSE Module
The semantic information in the shallow feature maps of the YOLOv8s backbone network is disordered, and the receptive field is small. During bidirectional context feature fusion, feature information is easily lost, and the large amount of background information contained in the shallow feature layers interferes with the fusion process. To address these issues, a feature enhancement module, PMSE, is proposed to improve the feature fusion performance of the backbone network. The network structure and parameter settings of the proposed PMSE module are shown in
Figure 2 and
Figure 3, respectively.
As shown in
Figure 2, the feature map first passes through three parallel branches. One branch applies a convolutional kernel of size 1 × 1 to adaptively adjust the input channels, thereby reducing network complexity. The feature information is then further extracted through the extracted layer (EL): one path consists of a 3 × 3 convolution combined with a dilated convolution of rate 2, while the other path consists of a 3 × 3 convolution combined with a deformable convolution. The third branch preserves the original feature information through global average pooling, reorganizes the channels through a convolutional kernel of size 1 × 1 to reduce parameters, and restores the spatial size of the feature map through upsampling. The feature maps generated by the three branches are concatenated along the channel dimension. The concatenated features are then compressed along the spatial dimension by global average pooling into a 1 × 1 × C descriptor. Finally, through sigmoid non-linear mapping, the resulting weight of each channel is multiplied with the original feature map, adaptively emphasizing the targets of interest.
The calculation process of the proposed PMSE module is as follows:

F_c = Concat(F_1, F_2, F_3),
F_g = GAP(F_c),
F_w = σ(Conv_1×1(F_g)),
F_out = F_w ⊗ F,

where F is the original feature map; F_c is the feature map after splicing the three branches; F_1 denotes the local features output from the first branch; F_2 denotes the local features output from the second branch; F_3 denotes the local features output from the third branch; F_w is the feature map activated by the sigmoid function after compression and reorganization of the spliced feature map along the spatial dimensions; F_g is the globally average-pooled local features; and F_out is the final output feature map.
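To make the dataflow concrete, the branch-and-weighting pipeline described above can be sketched in PyTorch. This is a minimal illustrative sketch under stated assumptions, not the authors' implementation: channel widths are assumed, and the deformable convolution of the second branch is replaced by a plain 3 × 3 convolution (in practice, torchvision.ops.deform_conv2d would be used) to keep the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as fn


class PMSE(nn.Module):
    """Sketch of the three-branch PMSE channel-attention module."""

    def __init__(self, channels: int):
        super().__init__()
        reduced = channels // 2  # assumed reduction ratio
        # Branch 1: 1x1 channel adjustment -> 3x3 conv -> dilated conv (rate 2)
        self.branch1 = nn.Sequential(
            nn.Conv2d(channels, reduced, 1),
            nn.Conv2d(reduced, reduced, 3, padding=1),
            nn.Conv2d(reduced, reduced, 3, padding=2, dilation=2),
        )
        # Branch 2: 1x1 channel adjustment -> 3x3 conv -> deformable conv
        # (approximated here by a second plain 3x3 conv)
        self.branch2 = nn.Sequential(
            nn.Conv2d(channels, reduced, 1),
            nn.Conv2d(reduced, reduced, 3, padding=1),
            nn.Conv2d(reduced, reduced, 3, padding=1),
        )
        # Branch 3: global average pooling -> 1x1 conv (channel reorganization)
        self.branch3_conv = nn.Conv2d(channels, reduced, 1)
        # Maps the concatenated descriptor back to per-channel weights
        self.fuse = nn.Conv2d(3 * reduced, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        f1 = self.branch1(x)
        f2 = self.branch2(x)
        # Branch 3: pool to 1x1, transform, then restore spatial size
        f3 = fn.adaptive_avg_pool2d(x, 1)
        f3 = fn.interpolate(self.branch3_conv(f3), size=(h, w))
        fc = torch.cat([f1, f2, f3], dim=1)     # F_c: spliced branches
        fg = fn.adaptive_avg_pool2d(fc, 1)      # F_g: 1x1xC descriptor
        fw = torch.sigmoid(self.fuse(fg))       # F_w: channel weights in (0, 1)
        return x * fw                           # F_out = F_w ⊗ F
```

The sigmoid-gated multiplication at the end is what lets the module suppress background channels and amplify channels responding to small objects.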
There are two extracted layers (ELs) in the proposed PMSE module (see
Figure 2). The EL introduces a residual block structure, which helps to address the problems of exploding and vanishing gradients. It not only extracts feature information at a deeper level but also ensures the robustness of the model. Two 1 × 1 convolution layers adjust the channel numbers, reducing the computational load while keeping the number of output channels unchanged. The structure of the EL is illustrated in
Figure 4, and its calculation process is as follows:

F_EL = F_res + F_conv,

where F_EL represents the output feature map obtained after processing with the residual block, F_res represents the local features obtained from the residual branch, and F_conv represents the local features output by the convolutional branch.
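The EL's residual computation amounts to a bottleneck with an identity shortcut; a minimal PyTorch sketch (the 1 × 1 bottleneck width is an assumption) might look as follows:

```python
import torch
import torch.nn as nn


class ExtractedLayer(nn.Module):
    """Sketch of the EL residual block: two 1x1 convs plus identity shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        # Two 1x1 convolutions: squeeze then restore the channel count,
        # so the output width matches the input width.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 1),
            nn.Conv2d(channels // 2, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F_EL = F_res + F_conv: the identity branch lets gradients
        # bypass the convolutional path, easing optimization.
        return x + self.conv(x)
```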
Remark 1. By using convolutional kernels of different sizes, feature information can be extracted at different scales. Small convolutional kernels capture local detailed information, while deformable convolution captures a larger range of contextual information through an adaptive receptive field. Dilated and deformable convolutions can flexibly adjust the sampling positions of the convolutional kernel to adapt to the shapes and sizes of different objects, better capturing object deformations and geometric transformations and thus improving the model's perception of objects. The proposed PMSE module can therefore adaptively adjust the distribution of weights, enhancing useful features and suppressing irrelevant information. When feature maps of varying sizes are input, the model adapts its receptive field to small objects, which is intended to enhance their detection performance.
3.2. The Proposed SCFPN Module
In the process of deep learning computation, as the network deepens gradually, the model complexity increases. In most cases, each layer of the network will cause a certain feature loss. In a neural network model, shallow networks have higher resolution than deep networks, providing more accurate location information for object detection. On the other hand, deep networks have a larger receptive field compared to shallow networks, containing more semantic information beneficial for object classification. Therefore, adopting a multi-scale feature fusion method to integrate feature information at different scales can reduce the feature loss caused by deepening the network.
The neck structure used in the general YOLOv8s model is PA-FPN [
41], which establishes a bottom–up channel on top of the feature pyramid network (FPN). The structure of PA-FPN is shown in
Figure 5a. To further enhance the detection performance of small objects, a new scale compensation pyramid network (SCFPN) is proposed in this study. The structure of the proposed SCFPN module is shown in
Figure 5b.
In the proposed SCFPN module, firstly, nodes with only one feature input are removed to reduce the computational load further. The main reason is that when a node has only one feature for fusion, its contribution to feature extraction in the network is minimal. After considering the network’s computational complexity and detection performance, these nodes are deleted.
Subsequently, the feature information in the backbone network is input into the bottom–up channel with the same dimensions, allowing more comprehensive interaction between low-level and high-level features to capture more positional and semantic information.
Finally, the feature maps from the backbone network are input across dimensions into the top–down pyramid. The low-level feature maps contain more positional information, and they are re-inputted to extract positional information at a higher-dimensional level, enabling the entire feature pyramid structure to extract more positional information of small targets.
In addition, the general YOLOv8s model faces challenges in handling small objects because its feature maps are obtained through simple downsampling, which loses relevant information when the downsampling factor is too large. The main reason is that high-level feature maps, such as P4 and P5, undergo multiple convolution operations; their resolution is very small, so they retain mainly the semantic information of objects. In contrast, shallow feature maps, such as P1 and P2, have higher resolution and contain a large amount of positional information for small objects. To enhance the model's localization accuracy while maintaining good semantic information, high-level feature maps can be fused with the positional information of shallow networks by performing multiple upsampling operations over longer paths [
21,
42]. Thus, to address these problems and enhance the model's accuracy and robustness, the proposed SCFPN module takes the five feature layers P1, P2, P3, P4, and P5 as input and adds an ultra-small-object detection layer, P2.
After the original detection layers of the YOLOv8s model, to further extend the feature map, we incorporate convolution and upsampling operations. In the general YOLOv8s model, the P5 feature map is downsampled by a factor of 32 relative to the input image, which leads to insufficient features for small objects. Therefore, in our proposed model, P2, P3, and P4 are used to output feature maps (at 1/4, 1/8, and 1/16 of the input resolution, respectively) to achieve small object detection, and the network structure of the P5 layer is optimized. By obtaining lower spatial features and fusing them with higher-level semantic features to generate a P2 detection layer, a feature map at 1/4 of the input resolution is produced, which better improves the model's ability to detect smaller objects. The two types of prediction heads are shown in
Figure 6.
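Assuming a 640 × 640 input and the standard YOLOv8 strides (P2 at stride 4, P3 at stride 8), the extra P2 detection input can be formed roughly as sketched below. The channel counts are illustrative assumptions, not the model's actual widths:

```python
import torch
import torch.nn as nn

# Hypothetical feature maps for a 640x640 input (shapes follow the
# standard YOLOv8 strides; channel counts are illustrative).
p2_backbone = torch.randn(1, 64, 160, 160)   # shallow stride-4 backbone feature
p3_neck = torch.randn(1, 128, 80, 80)        # stride-8 neck feature

# Upsample the deeper semantic feature to P2 resolution, concatenate it
# with the shallow positional feature, and fuse with a 1x1 convolution.
up = nn.Upsample(scale_factor=2, mode="nearest")
fuse = nn.Conv2d(64 + 128, 64, kernel_size=1)

p2_head_in = fuse(torch.cat([p2_backbone, up(p3_neck)], dim=1))
# p2_head_in has shape (1, 64, 160, 160): high resolution for localizing
# small objects, enriched with semantics from the deeper layer.
```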
The conventional approach to feature fusion often entails overlaying or adding feature maps, for example through Concat or Shortcut connections, without allocating appropriate weights to the feature maps being combined. In fact, the information content of different input feature maps varies, and so do their contributions to the fused output. Merely adding or overlaying these feature maps is insufficient to fully extract features. Thus, a weighted feature fusion mechanism is employed in the proposed SCFPN module [
43]:

O = Σ_i [ w_i / (ε + Σ_j w_j) ] · I_i,

where i and j index the input layers of a feature fusion node; w_i denotes the weight of the corresponding input layer; and I_i is the input feature map of the node. The ReLU activation function is applied to each weight for non-linear mapping, ensuring w_i ≥ 0, and ε is set to 0.0001 to prevent numerical instability during training. This calculation scales the weight values to within [0, 1], leading to fast and efficient training.
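The normalization step can be demonstrated numerically. The sketch below assumes the fast normalized fusion of ref. [43] with ε = 0.0001; the function name is our own:

```python
import numpy as np


def weighted_fusion(inputs, weights, eps=1e-4):
    """Fast normalized weighted fusion of same-shape feature maps.

    Each learnable weight is passed through ReLU so that w_i >= 0, then
    divided by the sum of all weights plus a small eps, which scales the
    effective weights into [0, 1] and avoids division instability.
    """
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)  # ReLU
    w = w / (eps + w.sum())                                     # normalize
    return sum(wi * fi for wi, fi in zip(w, inputs))


# Two equal-size feature maps fused with raw weights 2.0 and 1.0: the
# output is close to (2*a + 1*b) / 3, i.e. a weighted average.
a = np.ones((4, 4))
b = 3.0 * np.ones((4, 4))
fused = weighted_fusion([a, b], [2.0, 1.0])
```

A softmax over the weights would give a similar normalization but is more expensive; the division-based form is the cheap approximation adopted in ref. [43].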
Remark 2. In the proposed SCFPN module, these improvements yield better performance and accuracy in small object detection. An extra edge is added between the original input node and the output node for location information extraction (see Figure 5b), which introduces neither new nodes nor additional parameters. Adding an ultra-small-object detection layer, P2, improves the detection performance for small targets (see Figure 6b). The weighted feature fusion mechanism allows adaptive adjustment of each input feature map's contribution to object detection based on its content, facilitating more comprehensive feature extraction.
3.3. The Loss Function
The accuracy of the model in locating objects, particularly in UAV scenarios where objects are often small, can be improved by optimizing the loss function. The original YOLOv8 uses DFL and CIOU for bounding box regression. However, CIOU does not consider the balance between high- and low-quality samples in the dataset. Therefore, in this paper, the WIOUv3 loss function is adopted instead of CIOU to ensure better bounding box regression for small objects [
44,
45]. WIOUv1 is computed using a two-layer attention mechanism, as follows:

L_WIOUv1 = R_WIOU · L_IOU,
R_WIOU = exp( ((x − x_gt)² + (y − y_gt)²) / (W_g² + H_g²)* ),

where the coordinates of the center point of the actual box are represented by x_gt and y_gt, while the coordinates of the center point of the prediction box are represented by x and y. The width and height of the minimum enclosing rectangle of the prediction box and the actual box are represented by W_g and H_g; the superscript * denotes that these terms are detached from the gradient computation.
WIOUv3 defines the anomaly degree coefficient β to measure the quality of anchor boxes, uses β to construct the non-monotonic focusing coefficient r, and applies it to WIOUv1, with α and δ as hyperparameters that regulate the quality cut-off criterion. WIOUv3 adopts a dynamic gradient gain assignment strategy to reasonably allocate high and low weights to anchor boxes of different quality. The specific computation of WIOUv3 is as follows:

β = L*_IOU / L̄_IOU,
r = β / (δ · α^(β − δ)),
L_WIOUv3 = r · L_WIOUv1,

where L̄_IOU denotes the running mean of the IOU loss over training.
Remark 3. WIOUv3 introduces a focusing mechanism by calculating the focusing coefficient, achieves dynamic-focusing bounding box regression loss, and optimizes the dynamic focusing on small objects in UAV images. Furthermore, the introduction of WIOUv3 does not add extra parameters, thereby keeping the model lightweight.