Article

A Lightweight Network for UAV Multi-Scale Feature Fusion-Based Object Detection

School of Computer Science, University of South China, Hengyang 421001, China
*
Author to whom correspondence should be addressed.
Information 2025, 16(3), 250; https://doi.org/10.3390/info16030250
Submission received: 27 February 2025 / Revised: 15 March 2025 / Accepted: 18 March 2025 / Published: 20 March 2025

Abstract

To tackle the issues of small target sizes, missed detections, and false alarms in aerial drone imagery, alongside the constraints posed by limited hardware resources during model deployment, a streamlined object detection approach is proposed to enhance the performance of YOLOv8s. This approach introduces a new module, C2f_SEPConv, which incorporates Partial Convolution (PConv) and channel attention mechanisms (Squeeze-and-Excitation, SE), effectively replacing the previous bottleneck and minimizing both the model's parameter count and computational demands. Modifications to the detection head allow it to perform more effectively in scenarios with small targets in aerial images. To capture multi-scale object information, a Multi-Scale Cross-Axis Attention (MSCA) mechanism is embedded within the backbone network. The neck network integrates a Multi-Scale Fusion Block (MSFB) to combine multi-level features, further boosting detection precision. Furthermore, the Focal-EIoU loss function supersedes the traditional CIoU loss function to address challenges related to the regression of small targets. Evaluations conducted on the VisDrone dataset reveal that the proposed method improves Precision, Recall, $mAP_{0.5}$, and $mAP_{0.5:0.95}$ by 4.4%, 5.6%, 6.4%, and 4%, respectively, compared to YOLOv8s, with a 28.3% reduction in parameters. On the DOTAv1.0 dataset, a 2.1% enhancement in $mAP_{0.5}$ is observed.


1. Introduction

The rapid advancement of drone technology has led to its extensive use in both civilian and military applications, such as traffic monitoring, disease surveillance, emergency response to disasters, and geological exploration. Notably, the need for reliable, real-time detection systems capable of identifying small targets in various environments has increased. Since drones are typically used as edge devices with limited computational resources, the challenge lies in balancing detection performance with computational efficiency. Citroni et al. [1] proposed an energy harvester based on plasmonic nano-antenna technology, which optimizes energy consumption and incorporates solar energy to extend the flight range and duration of UAVs. By utilizing a nano-antenna array, solar radiation is converted into electrical power, reducing the size and weight of traditional batteries and thereby enhancing the efficiency of UAVs. By integrating energy harvesters with advanced battery systems, the future holds the potential for achieving indefinite flight durations for UAVs.
This research focuses on enhancing the detection accuracy for small targets in drone imagery, addressing the critical need for effective detection under the constraints of limited computational resources. The difficulty in solving this problem arises from two key challenges. First, small target detection in drone imagery is inherently difficult due to the reduced size and low resolution of the targets, which often leads to missed detections and high false positive rates. Additionally, drones, being edge devices, come with limited computational power, which makes it difficult to create models that can perform both precise and efficient detection. This requires balancing computational costs, such as the number of parameters and floating-point operations (FLOPs), with the need to achieve high detection accuracy. Considering the constraints of real-time performance and limited resources, these challenges make small target detection in aerial images particularly difficult.
Detection methods can generally be divided into two main categories: single-stage and two-stage algorithms. Single-stage detection algorithms directly extract features from the raw image and subsequently classify and localize the targets. Some representative single-stage detection algorithms include SSD [2], YOLO series [3,4,5,6,7,8,9,10,11], and RetinaNet [12]. In contrast, two-stage detection algorithms first generate region proposals and then carry out classification and localization of these regions. Well-known two-stage detection algorithms include RCNN [13], Fast R-CNN [14], Faster R-CNN [15], and Mask R-CNN [16]. Single-stage algorithms are more commonly used in drone applications due to their faster processing speed and lower resource consumption.
In this paper, we propose an enhanced lightweight multi-scale feature fusion target detection algorithm, LMSF-YOLOv8s (Lightweight Multi-Scale Fusion–YOLOv8s). The new method aims to address the challenges posed by small target detection in drone imagery by improving detection accuracy, reducing false positives, and optimizing detection efficiency, all while considering the constraints of limited computational resources. Our contributions are as follows.
  • The C2f_SEPConv module substitutes the C2f module in the backbone network. By incorporating the advantages of PConv [17] and SE [18], it reduces the model’s parameters while enhancing overall performance.
  • The integration of the MSCA [19] module within the backbone network improves multi-scale feature fusion, efficiently combining features from various scales to boost detection accuracy.
  • A 160 × 160 small target detection head was introduced, replacing the 20 × 20 large target detection head, specifically designed to enhance the detection of small targets.
  • The MSFB module is constructed in the neck network to fuse shallow, mid-level, and deep features, thereby enhancing the network’s capacity to capture more complex features.
  • To optimize anchor boxes, the Focal-EIoU loss function [20] was utilized, which minimizes the influence of poor-quality anchor boxes and improves the regression accuracy by emphasizing high-quality boxes.
The effectiveness of our proposed method was validated through extensive experiments on the VisDrone dataset [21], where it was compared with popular existing methods. The results indicate a significant improvement in detection accuracy, with notable increases in $mAP_{0.5}$ and $mAP_{0.5:0.95}$, while the model's parameters and computational costs were reduced. These findings demonstrate that our approach successfully balances high detection accuracy with computational efficiency.
The paper is organized as follows: Section 1 discusses the significance of UAV object detection, outlines the current challenges, and presents the main objectives and contributions of the study. Section 2 provides an overview of YOLO-based networks and their applications in UAV object detection. Section 3 presents the LMSF-YOLOv8s model, detailing the design of each improved module. Section 4 describes the experiments in detail, including comparative and ablation studies, and analyzes the experimental results. Section 5 concludes the paper, summarizing the key findings and suggesting future research avenues.

2. Related Work

2.1. YOLO Networks

In 2016, Redmon et al. [3] introduced the fully convolutional YOLOv1 network, which revolutionized the target detection task by transforming it into a regression problem. This shift not only enhanced detection speed but also maintained a high level of accuracy. The architecture of YOLOv1 consists of three main components: the backbone, the neck, and the prediction head. It is a single-stage, end-to-end network. With the introduction of YOLOv1, single-stage detection methods have gained significant attention, with many researchers working to optimize and refine the YOLO architecture, resulting in several versions of the YOLO network.
Currently, YOLO is widely regarded as one of the most prominent target detection networks, particularly due to its exceptional real-time performance and detection accuracy, making it highly popular in both academic and industrial settings. The network has evolved from version 1 through version 12; YOLOv8, released in 2023, was the most advanced version at that time. YOLOv8 can be applied to a broad range of computer vision tasks, including object detection, instance segmentation, and tracking. The model comes in five different versions (n, s, m, l, x) that vary in size, speed, and accuracy. This flexibility enables users to select the most appropriate model based on their specific needs, balancing performance and efficiency across various use cases.
The YOLOv8 network is depicted in Figure 1. It is composed of the backbone, neck, and detection head. The backbone incorporates the Conv module, C2f module, and SPPF module. In the neck section, the features of the Feature Pyramid Network (FPN) and the Path Aggregation Network (PAN) are integrated to facilitate multi-scale feature extraction and bidirectional feature fusion, utilizing both top-down and bottom-up pathways. This fusion enhances feature representation, significantly boosting the performance of target detection. The detection head is responsible for making predictions for target detection, such as identifying bounding boxes, object categories, and confidence scores.
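To make the top-down and bottom-up fusion idea concrete, the following minimal PyTorch sketch wires three feature levels through an FPN-style top-down pass and a PAN-style bottom-up pass. The channel widths, block types, and layer counts are illustrative placeholders, not the actual YOLOv8 neck.

```python
# Minimal sketch of FPN + PAN style bidirectional fusion (illustrative only;
# layer widths and block types are placeholders, not the exact YOLOv8 neck).
import torch
import torch.nn as nn

class TinyNeck(nn.Module):
    def __init__(self, c3=128, c4=256, c5=512):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.reduce5 = nn.Conv2d(c5, c4, 1)                      # align channels before fusion
        self.reduce4 = nn.Conv2d(c4, c3, 1)
        self.td4 = nn.Conv2d(c4 + c4, c4, 3, padding=1)          # top-down fusion at P4
        self.td3 = nn.Conv2d(c3 + c3, c3, 3, padding=1)          # top-down fusion at P3
        self.down3 = nn.Conv2d(c3, c3, 3, stride=2, padding=1)   # bottom-up path
        self.bu4 = nn.Conv2d(c3 + c4, c4, 3, padding=1)

    def forward(self, p3, p4, p5):
        # top-down (FPN): propagate semantics to the higher-resolution levels
        t4 = self.td4(torch.cat([p4, self.up(self.reduce5(p5))], dim=1))
        t3 = self.td3(torch.cat([p3, self.up(self.reduce4(t4))], dim=1))
        # bottom-up (PAN): push localization detail back down
        b4 = self.bu4(torch.cat([self.down3(t3), t4], dim=1))
        return t3, b4  # features fed to the detection heads

neck = TinyNeck()
p3, p4, p5 = torch.rand(1, 128, 80, 80), torch.rand(1, 256, 40, 40), torch.rand(1, 512, 20, 20)
print([f.shape for f in neck(p3, p4, p5)])
```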

2.2. UAV Object Detection

Many studies have been conducted on target detection algorithms for drone-based aerial imagery. Li et al. [22] proposed a multi-path inverse residual module and integrated an attention mechanism to tackle interference caused by large-scale variations and complex backgrounds. This approach improved the spatial pyramid pooling framework in YOLOv5s to enhance the model's ability to detect small-scale targets. A lightweight decoupled head was also introduced to accelerate model convergence and improve detection accuracy. Zhang et al. [23] developed a new residual structure to enhance the internal feature fusion of individual layers, while also adding a triple attention module to better capture the relationships between spatial and channel features, thus preserving critical feature information. They proposed the RIOU_Loss loss function, which accounts for overlapping regions, center point distances, and diagonal lengths, improving localization accuracy and speeding up network convergence. Wang et al. [24] introduced a lightweight residual structure that integrates sparse features from aerial images, significantly reducing detection network parameters and weights. During the feature fusion process, they employed deconvolution to upscale the feature maps, preserving high-level information and enhancing detection accuracy. Cao et al. [25] introduced the GCL-YOLO network, which reduces the computational load and network parameters while maintaining detection performance. They designed a specialized feature fusion architecture and loss function, optimized for the detection of small targets in UAV operations. Sui et al. [26] proposed the BDH-YOLO model, which incorporates multi-scale feature fusion and a bidirectional pyramid network to capture richer semantic information. This model features a dynamic detection head combined with a self-attention mechanism, significantly improving detection performance across multiple dimensions of perception, such as scale, spatial, and task perception. Xiao et al. [27] introduced DSMD-LFIM, optimized for small target detection and low computational costs, making it suitable for edge deployment. Yang et al. [28] introduced the L-YOLO model, which achieves lightweight performance by replacing the original convolution operation with the GhostNet module. By refining the loss function and redesigning anchor box sizes, they enhanced small target detection and reduced model parameters and computational load. Xu et al. [29] developed the improved YOLOv8n model for small target detection, which boosts multi-scale feature fusion and enables better detection by adding additional detection layers. They also introduced conditional convolutions to improve the model's expressive power while maintaining low computational costs. Wang et al. [24] presented the MFP-YOLO model, which enhances feature representation and reduces computational costs through multi-channel parallel processing and inverse residual structures. By employing parallel modules and deconvolution operations, this model extracts target information at specific scales, thus improving multi-scale target detection capabilities. The decoupling of classification, regression, and confidence estimation tasks reduces model parameters and accelerates network convergence. Mei et al. [30] proposed the BGF-YOLOv10. By incorporating modules such as BoTNet, GhostConv, and Patch Expanding Layer, this algorithm significantly enhances the accuracy of small object detection while minimizing the complexity of the model.
Zou et al. [31] proposed SMFF-YOLO, which enhances small object detection by leveraging multi-level feature integration and an improved prediction module. To counteract the challenges posed by complex backgrounds, they introduced the AASPP module, which employs a combination of hybrid attention mechanisms and multi-scale feature extraction to refine detection accuracy. He et al. [32] proposed YOLO-ERF, a model that integrates the ERF module. This approach leverages residual and dense connections to broaden the receptive field of the convolutional kernel while maintaining the integrity of fine-grained details. The ERF module, embedded in a lightweight backbone network, minimizes the need for additional contextual modules after the backbone to increase the receptive field. Furthermore, the model features a compact detection head tailored for small target recognition in complex settings. Tahir et al. [33] proposed the PVswin-YOLOv8s model, which leverages the Swin Transformer to enhance small target detection capabilities. The introduction of the CBAM and Soft-NMS refines feature extraction and improves the handling of occlusions, addressing the limitations of traditional YOLOv8s models.

3. Methods

3.1. Overall Architecture

To address the challenges related to missed and false detections in small target identification from aerial images, along with the difficulty of optimizing detection performance while working within the computational limits of drones as edge computing units, this paper presents an improved model for small target detection.
The LMSF-YOLOv8s model architecture is illustrated in Figure 2. The C2f module is substituted by the C2f_SEPConv module, and an MSCA layer is appended to the end of the backbone network. The MSFB module is integrated into the network. Furthermore, a detection head designed for small objects is introduced, while the detection head for larger objects is removed.

3.2. C2f_SEPConv Module

Chen et al. [17] introduced PConv, a technique that performs convolution on a selected subset of input channels, leaving the other channels unchanged. This method enhances spatial feature extraction and minimizes redundant computations and memory usage, which in turn reduces FLOPs. Hu et al. [18] proposed SE, which models the relationships between channels and learns adaptive channel weights, enabling the model to prioritize more relevant features. The SEPConv module, shown in Figure 3, combines both of these techniques.
The input tensor is split, with one part undergoing convolution while the other part remains unchanged. These two parts are then recombined to form a new tensor, which effectively reduces FLOPs. The calculation of FLOPs is as follows:
$h \times w \times k^2 \times c_p^2$   (1)
h and w represent the height and width of the input tensor, respectively, $c_p$ is the number of channels selected for convolution, and k denotes the size of the convolution kernel. The typical selection ratio is $r = c_p / c = 1/4$. Based on this, it can be deduced that the FLOPs of PConv are one-sixteenth of those of a conventional convolution.
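A minimal PyTorch sketch of the PConv idea described above, together with a numerical check of the 1/16 FLOPs ratio from Equation (1). The split ratio and kernel size are the values quoted in the text; the layout may differ from the official FasterNet implementation.

```python
# Minimal sketch of Partial Convolution (PConv): convolve only the first
# c_p = c/4 channels and pass the rest through untouched.
import torch
import torch.nn as nn

class PConv(nn.Module):
    def __init__(self, channels: int, ratio: float = 0.25, k: int = 3):
        super().__init__()
        self.cp = int(channels * ratio)                        # channels that get convolved
        self.conv = nn.Conv2d(self.cp, self.cp, k, padding=k // 2, bias=False)

    def forward(self, x):
        x1, x2 = x[:, :self.cp], x[:, self.cp:]                # split along channels
        return torch.cat([self.conv(x1), x2], dim=1)           # untouched part is reused

x = torch.rand(1, 64, 40, 40)
print(PConv(64)(x).shape)                                      # torch.Size([1, 64, 40, 40])

# FLOPs check from Equation (1): with r = c_p/c = 1/4 the cost is r^2 = 1/16
# of a regular k x k convolution over all c channels.
h, w, k, c = 40, 40, 3, 64
print((h * w * k**2 * (c // 4)**2) / (h * w * k**2 * c**2))    # 0.0625 = 1/16
```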
SE Attention first performs global average pooling on the feature map of each channel, which reduces the spatial dimensions and generates a scalar for each channel. This scalar encapsulates the global information for that channel. The exact formula for this operation is presented in Equation (2).
$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$   (2)
$z_c$ represents the global statistical data for the c-th channel. H and W correspond to the height and width of the feature map, respectively, and $u_c$ is the feature map derived from operations such as convolution. The excitation operation is performed using two fully connected (FC) layers, which compute the activation values for each channel. These activation values are then normalized with the Sigmoid function to control the output strength of each channel. The exact formula is provided in Equation (3).
$s = \sigma\big(W_2\,\delta(W_1 z)\big)$   (3)
$\delta$ represents the ReLU activation, while $W_1$ and $W_2$ represent the weight matrices of the two FC layers. The activation coefficient for each channel is denoted as s. Ultimately, s is applied as a scaling factor to the input feature map $u_c$, leading to the adjusted version of the feature map. The specific formula is shown in Equation (4):
$\hat{x}_c = s_c \cdot u_c$   (4)
$s_c$ is the scaling coefficient obtained through the excitation operation, and $\hat{x}_c$ is the feature map after scaling. This method facilitates the dynamic modification of feature weights for each channel, enhancing the model's ability to concentrate on key features.
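The following is a minimal Squeeze-and-Excitation block implementing Equations (2)-(4); the reduction factor of 16 is the common default from the SE paper, not a value stated in this section.

```python
# Minimal Squeeze-and-Excitation block: global average pooling -> two FC layers
# -> sigmoid -> channel-wise rescaling, following Equations (2)-(4).
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                                   # squeeze: z_c in Eq. (2)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),  # W1, delta
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),           # W2, sigma
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        s = self.fc(self.pool(u).view(b, c)).view(b, c, 1, 1)  # excitation: Eq. (3)
        return u * s                                            # rescale: Eq. (4)

print(SEBlock(64)(torch.rand(2, 64, 40, 40)).shape)             # torch.Size([2, 64, 40, 40])
```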
GhostNet [34] and MobileNetV3 [35] are commonly used lightweight architectures that can replace the YOLOv8 backbone. Comparative experiments on the VisDrone dataset are detailed in Table 1.
As presented in Table 1, the experiments on the VisDrone dataset indicate that while GhostNet and MobileNetV3 reduce parameters and GFLOPs, they also lower $mAP_{0.5}$. Conversely, substituting the bottleneck in C2f with the SEPConv module to develop the C2f_SEPConv variant reduces parameter usage and computational cost while maintaining accuracy, ensuring a balance between detection performance and resource efficiency.
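As an illustration of how the two pieces might be combined inside the replaced bottleneck, the sketch below composes the PConv and SEBlock classes from the sketches above with a pointwise convolution and a residual shortcut. The exact ordering and any extra layers inside the authors' C2f_SEPConv are not specified in this section, so this is an assumption rather than the published module.

```python
# Plausible SEPConv composition (assumption): PConv for cheap spatial mixing,
# a pointwise conv to mix channels, and an SE block for channel reweighting.
# Assumes PConv and SEBlock from the sketches above are in scope.
import torch
import torch.nn as nn

class SEPConv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.pconv = PConv(channels)
        self.pw = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.SiLU(inplace=True),
        )
        self.se = SEBlock(channels)

    def forward(self, x):
        return x + self.se(self.pw(self.pconv(x)))   # residual keeps a C2f-style shortcut

print(SEPConv(64)(torch.rand(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```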

3.3. Detection Head Adjustment

By default, the YOLOv8 model generates three detection heads for the output layers P3, P4, and P5, as illustrated in Figure 4a. The output resolutions of these layers are 80 × 80, 40 × 40, and 20 × 20, with the detection heads being able to detect targets with minimum resolutions of 8 × 8, 16 × 16, and 32 × 32, respectively. As the output layer resolution decreases, the focus shifts towards detecting larger targets. Given that small targets are prevalent in drone-based aerial imagery and are prone to be missed or falsely detected, an additional detection head is incorporated into the P2 output layer. The output resolution for P2 is 160 × 160, with the minimum detectable target resolution being 4 × 4. Experimental results conducted using the VisDrone dataset show that adding the P2 detection head while removing the P5 detection head significantly enhances detection accuracy. These findings are presented in Table 2, and the updated architecture is shown in Figure 4b.
As shown in Table 2, the best configuration removes P5 and adds P2, improving $mAP_{0.5}$ by 3.3%. Compared to the configuration that only adds the P2 detection head, it provides a further 0.4% gain, indicating that the proposed adjustment to the detection heads yields the best performance for small target detection.
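The stride arithmetic behind this adjustment can be checked directly. The short script below prints, for a 640 × 640 input, the grid size of each head and the pixel area covered by a single cell; the P2-P5 names and the 640 input size follow the text above.

```python
# Grid sizes and smallest-object coverage for a 640 x 640 input: the stride-4 P2
# head targets objects down to roughly 4 x 4 pixels, while the removed stride-32
# P5 head only starts at about 32 x 32.
img = 640
for name, stride in [("P2", 4), ("P3", 8), ("P4", 16), ("P5", 32)]:
    print(f"{name}: grid {img // stride} x {img // stride}, "
          f"one cell covers ~{stride} x {stride} px")
```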

3.4. MSFB Module

The MSFB module’s structure is shown in Figure 5. To enhance the model’s detection capabilities, the paper proposes incorporating the MSFB module into the neck network, facilitating multi-scale feature fusion. Shallow features, due to their higher resolution, retain rich spatial information and details, which are crucial for detecting small targets. In contrast, deep features, although having a lower resolution, capture higher-level semantic information as the network depth increases, demonstrating stronger feature abstraction capabilities. These deep features are crucial for distinguishing complex categories and provide a deeper understanding of the image’s contextual information. The fusion mechanism of the MSFB module integrates these complementary features, improving the model’s capacity to handle intricate scenes while preserving detailed information.
Shallow features are downsampled using ADown [10], a lightweight downsampling module that employs a branch-based design, processing the features through two parallel branches to enhance feature representation. Figure 6 presents the architecture of the ADown module. Initially, the original feature map is downsampled using an average pooling layer. Following this, the feature map is divided along the channel dimension into $F_1$ and $F_2$. $F_1$ undergoes a 3 × 3 convolution with stride 2, while $F_2$ is processed with a 3 × 3 max pooling layer with stride 2. The two components are then merged to create the processed feature map $F_{ADown}$, as shown in Equation (5):
$F_{ADown} = Concat\big(Conv(F_1), MaxPool(F_2)\big)$   (5)
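A compact PyTorch sketch of the ADown branch structure described above and in Equation (5). The kernel-2, stride-1 average pooling matches the official YOLOv9 module, while the 1 × 1 convolution that the official module applies after the max-pool branch is omitted; channel widths are placeholders.

```python
# ADown sketch: average-pool, split channels into F1/F2, downsample F1 with a
# 3x3 stride-2 conv and F2 with a 3x3 stride-2 max-pool, then concatenate.
import torch
import torch.nn as nn

class ADown(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.avg = nn.AvgPool2d(kernel_size=2, stride=1)
        self.conv = nn.Conv2d(half, half, 3, stride=2, padding=1)     # F1 branch
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)  # F2 branch

    def forward(self, x):
        x = self.avg(x)
        f1, f2 = x.chunk(2, dim=1)                 # split along the channel dimension
        return torch.cat([self.conv(f1), self.pool(f2)], dim=1)

print(ADown(128)(torch.rand(1, 128, 80, 80)).shape)  # torch.Size([1, 128, 40, 40])
```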
Deep features are upsampled using Dysample [36], a dynamic upsampling method that offers higher efficiency. Traditional upsampling methods include nearest-neighbor interpolation and bilinear interpolation; Dysample outperforms these traditional methods in dense prediction tasks, making it particularly suitable for high-density scenes in drone-based aerial imagery, so it is chosen as the upsampling method. A pointwise convolution is then applied to reduce the number of channels so that they match those of the mid-level features, resulting in $F_{Dysample}$. The processed feature map $F_{Dysample}$ is shown in Equation (6):
$F_{Dysample} = Conv\big(Dysample(F_{deep\_feature})\big)$   (6)
After processing, the shallow, middle, and deep features are merged along the channel dimension. They are then passed through a 1 × 1 convolution, followed by a 3 × 3 convolution, reducing the number of channels and enhancing the non-linear expression capabilities. A set of parallel depth-wise convolutions (DWConv), each with a different receptive field, is used to capture multi-scale feature information. After summing the outputs, a pointwise convolution is used to generate a feature map rich in information, denoted as $F_{fused\_feature}$. The fused feature map $F_{fused\_feature}$ is shown in Equation (7):
$F_{fused\_feature} = Conv\Big(\sum_{i=0}^{2} DWConv_i\big(Concat(F_{ADown}, F_{middle\_feature}, F_{Dysample})\big)\Big)$   (7)
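A sketch of the MSFB fusion flow in Equations (6) and (7). DySample is stood in for by plain nearest-neighbor upsampling, the ADown class comes from the sketch above, and the depth-wise kernel sizes (3/5/7) and channel widths are assumptions; only the overall flow (align scales, concatenate, 1 × 1 and 3 × 3 convolutions, parallel DWConvs, sum, pointwise convolution) follows the text.

```python
# MSFB fusion sketch. Assumes the ADown class from the sketch above is in scope.
import torch
import torch.nn as nn

class MSFBSketch(nn.Module):
    def __init__(self, c_shallow, c_mid, c_deep, c_out):
        super().__init__()
        self.adown = ADown(c_shallow)                              # downsample shallow features
        self.up = nn.Sequential(                                   # DySample stand-in + channel align
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(c_deep, c_mid, 1),
        )
        c_cat = c_shallow + c_mid + c_mid
        self.pre = nn.Sequential(
            nn.Conv2d(c_cat, c_out, 1), nn.SiLU(inplace=True),     # 1x1 conv reduces channels
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.SiLU(inplace=True),
        )
        self.dw = nn.ModuleList(                                   # parallel multi-scale DWConvs
            nn.Conv2d(c_out, c_out, k, padding=k // 2, groups=c_out) for k in (3, 5, 7)
        )
        self.pw = nn.Conv2d(c_out, c_out, 1)                       # final pointwise conv

    def forward(self, shallow, mid, deep):
        x = torch.cat([self.adown(shallow), mid, self.up(deep)], dim=1)
        x = self.pre(x)
        return self.pw(sum(branch(x) for branch in self.dw))       # sum branches, then fuse

msfb = MSFBSketch(c_shallow=64, c_mid=128, c_deep=256, c_out=128)
out = msfb(torch.rand(1, 64, 80, 80), torch.rand(1, 128, 40, 40), torch.rand(1, 256, 20, 20))
print(out.shape)  # torch.Size([1, 128, 40, 40])
```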

3.5. Loss Function

The CIoU loss function [37] has limitations in small target detection due to sample imbalance, often leading to missed and false detections, particularly when small targets are occluded. To address this, the Focal-EIoU loss function combines the benefits of the EIoU loss and Focal Loss to tackle the sample imbalance issue in bounding box regression tasks. This approach prioritizes high-quality anchor boxes, thus improving the model's robustness and accuracy. The EIoU loss, denoted as $L_{EIoU}$, is defined in Equation (8):
$L_{EIoU} = 1 - IOU + L_{dis} + L_{asp}$   (8)
$L_{dis} = \frac{\rho^2(b, b^{gt})}{w_c^2 + h_c^2}$   (9)
$L_{asp} = \frac{\rho^2(w, w^{gt})}{w_c^2} + \frac{\rho^2(h, h^{gt})}{h_c^2}$   (10)
$L_{dis}$ is the center distance loss and $L_{asp}$ denotes the width and height loss. $IOU$ represents the intersection over union between the predicted and ground truth boxes, and $\rho(\cdot,\cdot)$ is the distance between two points. h, w, and b correspond to the predicted box's height, width, and center, respectively, while $h^{gt}$, $w^{gt}$, and $b^{gt}$ represent the ground truth box's height, width, and center. $h_c$ and $w_c$ denote the height and width of the minimum enclosing rectangle of the predicted and ground truth bounding boxes. The Focal-EIoU loss $L_{Focal\text{-}EIoU}$ is defined in Equation (11):
$L_{Focal\text{-}EIoU} = IOU^{\gamma} \times L_{EIoU}$   (11)
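A hedged PyTorch sketch of the EIoU and Focal-EIoU terms in Equations (8)-(11) for axis-aligned boxes in (x1, y1, x2, y2) format. The value gamma = 0.5 is the default suggested in the Focal-EIoU paper, since this section does not state the value used here.

```python
# Sketch of the EIoU / Focal-EIoU loss from Equations (8)-(11).
import torch

def focal_eiou_loss(pred, gt, gamma=0.5, eps=1e-7):
    # intersection / union
    x1 = torch.max(pred[:, 0], gt[:, 0]); y1 = torch.max(pred[:, 1], gt[:, 1])
    x2 = torch.min(pred[:, 2], gt[:, 2]); y2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    iou = inter / (wp * hp + wg * hg - inter + eps)

    # smallest enclosing box of predicted and ground truth boxes
    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])

    # L_dis: normalized center distance, Eq. (9)
    cx = (pred[:, 0] + pred[:, 2] - gt[:, 0] - gt[:, 2]) / 2
    cy = (pred[:, 1] + pred[:, 3] - gt[:, 1] - gt[:, 3]) / 2
    l_dis = (cx**2 + cy**2) / (cw**2 + ch**2 + eps)

    # L_asp: width/height differences, Eq. (10)
    l_asp = (wp - wg) ** 2 / (cw**2 + eps) + (hp - hg) ** 2 / (ch**2 + eps)

    l_eiou = 1 - iou + l_dis + l_asp                 # Eq. (8)
    return (iou.clamp(min=eps) ** gamma) * l_eiou    # Eq. (11): down-weight low-quality boxes

pred = torch.tensor([[0., 0., 10., 10.]]); gt = torch.tensor([[1., 1., 11., 11.]])
print(focal_eiou_loss(pred, gt))
```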
Experiments were carried out on the VisDrone dataset by switching between different loss functions. The results of these experiments are presented in Table 3.
The results presented in Table 3 show that $L_{Focal\text{-}EIoU}$ performs best. Replacing $L_{CIoU}$ with Focal-EIoU significantly improves small target detection quality.

3.6. MSCA

Shao et al. [19] proposed MSCA to acquire multi-scale information and enhance long-range pixel relationships. Figure 7 displays the structure of MSCA, and Figure 8 shows the structures of the multi-scale x-axis and y-axis convolutions. The feature map F is processed through these convolutions to obtain multi-scale contextual information along the x-axis and y-axis, denoted as $F_x$ and $F_y$, respectively. The mathematical expressions for $F_x$ and $F_y$ are provided in Equations (12) and (13):
$F_x = Conv\Big(\sum_{i=0}^{2} Conv1D_x^{i}\big(Norm(F)\big)\Big)$   (12)
$F_y = Conv\Big(\sum_{i=0}^{2} Conv1D_y^{i}\big(Norm(F)\big)\Big)$   (13)
$Conv1D_x$ refers to the 1D convolution along the x-axis, $Conv1D_y$ refers to the 1D convolution along the y-axis, and Norm represents the normalization layer. Then, through Multi-Head Cross-Axis Attention (MHCA), long-range dependencies between the two spatial dimensions are captured, resulting in the features $F_X$ and $F_Y$. The expressions for $F_X$ and $F_Y$ are given in Equations (14) and (15):
$F_X = MHCA_y(F_y, F_x, F_x)$   (14)
$F_Y = MHCA_x(F_x, F_y, F_y)$   (15)
$MHCA_x$ represents the MHCA along the y-axis, and $MHCA_y$ represents the MHCA along the x-axis. MHCA effectively captures the contextual information across the axes. The specific derivation of MHCA is shown in Equations (16)–(18):
$A = \frac{Q \cdot K^T}{\sqrt{d_k}}, \quad A = \mathrm{Softmax}(A)$   (16)
$O = A \cdot V$   (17)
$O_{final} = Concat(O_1, O_2, \ldots, O_h)$   (18)
V, K, and Q represent the value, key, and query matrices, A is the attention matrix, and $d_k$ is the dimension of the key vectors. The normalized attention weights A are obtained through the Softmax function. O represents the output features, and $O_{final}$ is the final result obtained by concatenating the features from multiple attention heads.
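The multi-head attention core of Equations (16)-(18) can be written compactly as below. This reproduces only the scaled dot-product MHCA step, not the 1D multi-scale convolutions or the full cross-axis wiring of Equations (12)-(15), and the head count of 4 is an arbitrary choice for the example.

```python
# Minimal multi-head cross-attention following Equations (16)-(18): scaled
# dot-product attention per head, followed by concatenation of the heads.
import math
import torch

def mhca(q, k, v, num_heads=4):
    # q, k, v: (batch, tokens, dim); dim must be divisible by num_heads
    b, n, d = q.shape
    dk = d // num_heads
    def split(x):  # -> (batch, heads, tokens, dk)
        return x.view(b, -1, num_heads, dk).transpose(1, 2)
    qh, kh, vh = split(q), split(k), split(v)
    attn = torch.softmax(qh @ kh.transpose(-2, -1) / math.sqrt(dk), dim=-1)  # Eq. (16)
    out = attn @ vh                                                          # Eq. (17)
    return out.transpose(1, 2).reshape(b, n, d)                              # Eq. (18): concat heads

Fx = torch.rand(2, 80, 64)   # multi-scale x-axis features, flattened to tokens
Fy = torch.rand(2, 80, 64)   # multi-scale y-axis features
FX = mhca(Fy, Fx, Fx)        # cross-axis query/key-value pairing as in Eq. (14)
print(FX.shape)              # torch.Size([2, 80, 64])
```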

4. Experiments

4.1. Datasets

The VisDrone 2019 dataset [21] is a drone-based aerial image dataset collected and publicly released by the AISKYEYE team. It consists of images captured by drones equipped with cameras across 14 different cities in China, covering a variety of geographic environments and weather conditions, which include both sparse and dense target scenarios. It contains 10 target classes: car, pedestrian, people, bicycle, etc. It is divided into a training set (6471 images), a validation set (548 images), and a test set (3190 images), totaling 2.6 million target samples.
The DOTAv1.0 dataset [38] is another dataset designed for object detection in aerial imagery, collected from multiple regions and scenes using different sensors and platforms. Its primary aim is to enhance the precision of small target detection. The dataset includes 15 target categories: storage tank, plane, baseball diamond, ship, tennis court, etc. The dataset contains a total of 2806 images, with approximately 180,000 target samples.

4.2. Implementation Details and Evaluation Metrics

The training process was carried out on a system running Windows 10 Professional Edition, featuring an i9-11900K 3.5 GHz CPU, a single RTX 4090 24 GB GPU, and 64 GB of RAM. The network was built using the PyTorch 2.6 deep learning framework with Python 3.10.0. The training consisted of 200 epochs, with a batch size of 8. The SGD optimizer was used with a learning rate of 0.01. Mosaic augmentation was applied during the early epochs and was disabled in the final 10 epochs. The initial image size was configured to 640 × 640.
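For reference, a training run with these hyperparameters could be launched through the Ultralytics API roughly as follows; the model and dataset YAML names are placeholders, since the modified LMSF-YOLOv8s configuration file is not published with this section.

```python
# Hedged sketch of launching the training run described above with Ultralytics.
from ultralytics import YOLO

model = YOLO("yolov8s.yaml")          # placeholder: the modified LMSF-YOLOv8s YAML would go here
model.train(
    data="VisDrone.yaml",             # dataset config (placeholder path)
    epochs=200,
    batch=8,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,                         # initial learning rate
    close_mosaic=10,                  # disable Mosaic augmentation for the final 10 epochs
    device=0,                         # single GPU
)
```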
To assess the model's performance effectively, several metrics were utilized: P, R, mAP, Params, FPS, and GFLOPs. mAP is reported as $mAP_{0.5}$ and $mAP_{0.5:0.95}$. The detailed definitions of P, R, and mAP are provided in Equations (19)–(21):
$P = \frac{TP}{TP + FP}$   (19)
$R = \frac{TP}{TP + FN}$   (20)
$mAP = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P(R)\,dR$   (21)
N represents the total number of categories in the dataset. FP corresponds to the count of false positive predictions, TP indicates the count of true positive predictions, and FN denotes the number of false negative predictions.
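A worked toy example of Equations (19)-(21): precision and recall from TP/FP/FN counts, with mAP taken as the mean of per-class average precisions. All numbers here are made up for illustration, not results from this paper.

```python
# Toy illustration of Equations (19)-(21).
def precision(tp, fp): return tp / (tp + fp)
def recall(tp, fn): return tp / (tp + fn)

tp, fp, fn = 80, 20, 40
print(precision(tp, fp), recall(tp, fn))        # 0.8 0.666...

ap_per_class = [0.52, 0.31, 0.78]               # hypothetical per-class AP (area under the P-R curve)
print(sum(ap_per_class) / len(ap_per_class))    # mAP = mean of per-class APs
```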

4.3. Experimental Results and Analysis

To assess the performance of the LMSF-YOLOv8s algorithm, a comparison was conducted with several previous object detection methods using the VisDrone dataset. The results of these experiments are presented in Table 4.
The VisDrone dataset contains numerous small targets, and the SSD algorithm performs poorly in detecting these targets. The Faster R-CNN method, while effective, is less suitable for edge devices due to its large number of parameters. The YOLOv8n method has the smallest number of parameters and GFLOPs but requires improvements in detection accuracy. Based on these observations, this study uses YOLOv8s as the baseline and proposes the LMSF-YOLOv8s method. The results show a 6.3% improvement in $mAP_{0.5}$, a 4.1% improvement in $mAP_{0.5:0.95}$, a 28.3% reduction in Params, and an increase in GFLOPs. Compared to the YOLOv8m method, LMSF-YOLOv8s achieves better detection performance with fewer parameters and GFLOPs.
Table 5 and Table 6 show the detection accuracy for each category using the baseline (YOLOv8s) and the LMSF-YOLOv8s method on the VisDrone dataset.
As shown in Table 5 and Table 6, the LMSF-YOLOv8s method improves all four metrics across the various target categories in the VisDrone dataset. The most significant improvement in the P metric is seen in the awning-tricycle class, with an increase of 8.3%. The best improvement in the R metric is observed in the people class, with an increase of 11.4%. The pedestrian class exhibits the most substantial increases in $mAP_{0.5}$ and $mAP_{0.5:0.95}$, with gains of 11.8% and 6.8%, respectively.

4.4. Ablation Study

To evaluate the contribution of each enhancement module in the LMSF-YOLOv8s method, a set of ablation experiments was conducted. The Focal-EIoU loss function is denoted as A, the MSCA module as B, the C2f_SEPConv module as C, the detection head adjustment as D, and the MSFB module as E. The results of these ablation tests are presented in Table 7.
The effect of the introduced modules on detection results is presented in Table 7. Modules A and B are designed to improve the model's robustness. Module C achieves model lightweighting through the PConv module combined with the SE mechanism, balancing detection performance. As a result, $mAP_{0.5}$ increases by 0.2%, and Params decrease by 10.81%. Module D adjusts the detection head to better handle small target scenarios, leading to a 3.3% increase in $mAP_{0.5}$ and a 33.33% reduction in Params. Module E fuses multi-level image features to enhance detection performance, although the complexity slightly increases, resulting in a 2.2% improvement in $mAP_{0.5}$ and an 11.53% increase in Params. As demonstrated in Table 7, the proposed improvements in this study effectively enhance detection accuracy while balancing computational resources. The model complexity is moderate, and detection accuracy has significantly improved, making it more suitable for small target detection tasks in aerial imagery.
To evaluate the model’s generalization capability, comparative experiments were conducted using both the baseline method and the LMSF-YOLOv8s method on the DOTA dataset. The results from these experiments are presented in Table 8:
As shown in Table 8, the proposed method achieves P and R values of 79.4% and 50.9%, respectively, both higher than those of the baseline method. Additionally, $mAP_{0.5}$ and $mAP_{0.5:0.95}$ increase by 2.1% and 1.3%, respectively. These results highlight the strong generalization of the proposed method and its suitability for datasets with large scale variations.

4.5. Visualization

Several images from different scenes in the VisDrone dataset were selected, and object detection was performed using both the baseline method and the LMSF-YOLOv8s method. The comparison results are presented in Figure 9 and Figure 10.
The left section shows the original images from the VisDrone dataset, the middle section displays the results using the YOLOv8s method, and the right section shows the results from the LMSF-YOLOv8s method proposed in this study. The LMSF-YOLOv8s performs better at detecting small targets, such as pedestrians and motorcycles in traffic scenarios. As shown, the LMSF-YOLOv8s method surpasses the baseline method in small target detection across various scenes, demonstrating its effectiveness in handling small target detection tasks in diverse environments.

5. Conclusions

This paper addresses the challenges of missed and false detections in small targets from drone-based aerial images, as well as the challenges of limited hardware resources during model deployment. The LMSF-YOLOv8s method, built upon the YOLOv8s model, incorporates enhancements to the backbone, neck, and detection head, aiming to enhance detection accuracy under resource-constrained conditions. Given the characteristics of drones as edge computing nodes, the method improves detection accuracy while working within limited resources. Through a series of comparative and ablation experiments, the effectiveness and generalization capabilities of LMSF-YOLOv8s have been validated. Future work will concentrate on further optimizing the model to improve detection rates, minimize missed detections, and better balance detection performance with computational resource constraints.

Author Contributions

Conceptualization, S.D. and Y.W.; methodology, S.D.; software, S.D.; validation, S.D.; formal analysis, S.D.; investigation, S.D.; resources, Y.W.; data curation, S.D.; writing—original draft preparation, S.D.; writing—review and editing, S.D.; visualization, S.D.; supervision, Y.W.; project administration, Y.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Project 2024JJ7428 of the Hunan Provincial Natural Science Foundation of China.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in the VisDrone dataset at [http://aiskyeye.com/iccv2019/], (accessed on 17 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UAV: Unmanned Aerial Vehicle
PConv: Partial Convolution
SE: Squeeze-and-Excitation
MSCA: Multi-Scale Cross-Axis Attention
MSFB: Multi-Scale Fusion Block
LMSF-YOLOv8s: Lightweight Multi-Scale Fusion-YOLOv8s
GFLOP: Giga Floating Point Operations per Second
mAP: Mean Average Precision
P: Precision
R: Recall
IOU: Intersection Over Union
FPS: Frame Per Second
FPN: Feature Pyramid Network
PAN: Path Aggregation Network
CBAM: Convolutional Block Attention Module
FC: Fully Connected
GAP: Global Average Pooling
DWConv: Depth-Wise Convolution

References

  1. Citroni, R.; Di Paolo, F.; Livreri, P. A novel energy harvester for powering small UAVs: Performance analysis, model validation and flight results. Sensors 2019, 19, 1771. [Google Scholar] [CrossRef] [PubMed]
  2. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  3. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the Computer Vision & Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  4. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  5. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  6. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  7. Wang, J.; Chen, Y.; Dong, Z.; Gao, M. Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Comput. Appl. 2022, 35, 7853–7865. [Google Scholar] [CrossRef]
  8. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  9. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  10. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2025. [Google Scholar]
  11. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  12. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  13. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  14. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
  16. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  17. Chen, J.; Kao, S.h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 12021–12031. [Google Scholar]
  18. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  19. Shao, H.; Zeng, Q.; Hou, Q.; Yang, J. Mcanet: Medical image segmentation with multi-scale cross-axis attention. arXiv 2023, arXiv:2312.08866. [Google Scholar]
  20. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  21. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 213–226. [Google Scholar]
  22. Li, Y.; Wang, J.; Zhang, K.; Yi, J.; Wei, M.; Zheng, L.; Xie, W. Lightweight object detection networks for uav aerial images based on yolo. Chin. J. Electron. 2024, 33, 997–1009. [Google Scholar] [CrossRef]
  23. Zhang, P.; Deng, H.; Chen, Z. RT-YOLO: A residual feature fusion triple attention network for aerial image target detection. Comput. Mater. Contin. 2023, 75, 1411–1430. [Google Scholar] [CrossRef]
  24. Wang, J.; Zhang, F.; Zhang, Y.; Liu, Y.; Cheng, T. Lightweight object detection algorithm for uav aerial imagery. Sensors 2023, 23, 5786. [Google Scholar] [CrossRef] [PubMed]
  25. Cao, J.; Bao, W.; Shang, H.; Yuan, M.; Cheng, Q. GCL-YOLO: A GhostConv-based lightweight yolo network for UAV small object detection. Remote. Sens. 2023, 15, 4932. [Google Scholar] [CrossRef]
  26. Sui, J.; Chen, D.; Zheng, X.; Wang, H. A new algorithm for small target detection from the perspective of unmanned aerial vehicles. IEEE Access 2024, 12, 29690–29697. [Google Scholar] [CrossRef]
  27. Xiao, Y.; Di, N. SOD-YOLO: A lightweight small object detection framework. Sci. Rep. 2024, 14, 25624. [Google Scholar] [CrossRef]
  28. Yang, R.; Zhang, J.; Shang, X.; Li, W. Lightweight small target detection algorithm with multi-feature fusion. Electronics 2023, 12, 2739. [Google Scholar] [CrossRef]
  29. Xu, L.; Zhao, Y.; Zhai, Y.; Huang, L.; Ruan, C. Small object detection in UAV images based on Yolov8n. Int. J. Comput. Intell. Syst. 2024, 17, 223. [Google Scholar] [CrossRef]
  30. Mei, J.; Zhu, W. BGF-YOLOv10: Small object detection algorithm from unmanned aerial vehicle perspective based on improved YOLOv10. Sensors 2024, 24, 6911. [Google Scholar] [CrossRef]
  31. Wang, Y.; Zou, H.; Yin, M.; Zhang, X. Smff-yolo: A scale-adaptive yolo algorithm with multi-level feature fusion for object detection in uav scenes. Remote. Sens. 2023, 15, 4580. [Google Scholar] [CrossRef]
  32. Wang, X.; He, N.; Hong, C.; Sun, F.; Han, W.; Wang, Q. Yolo-erf: Lightweight object detector for uav aerial images. Multimedia Systems 2023, 29, 3329–3339. [Google Scholar] [CrossRef]
  33. Tahir, N.U.A.; Long, Z.; Zhang, Z.; Asim, M.; ELAffendi, M. PVswin-YOLOv8s: UAV-based pedestrian and vehicle detection for traffic management in smart cities using improved YOLOv8. Drones 2024, 8, 84. [Google Scholar] [CrossRef]
  34. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
  35. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  36. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6027–6037. [Google Scholar]
  37. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  38. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision And Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
  39. Wang, Z.; Su, Y.; Kang, F.; Wang, L.; Lin, Y.; Wu, Q.; Li, H.; Cai, Z. Pc-yolo11s: A lightweight and effective feature extraction method for small target image detection. Sensors 2025, 25, 348. [Google Scholar] [CrossRef] [PubMed]
  40. Liu, Y.; He, M.; Hui, B. ESO-DETR: An Improved Real-Time Detection Transformer Model for Enhanced Small Object Detection in UAV Imagery. Drones 2025, 9, 143. [Google Scholar] [CrossRef]
  41. Zhu, G.; Zhu, F.; Wang, Z.; Yang, S.; Li, Z. EDANet: Efficient Dynamic Alignment of Small Target Detection Algorithm. Electronics 2025, 14, 242. [Google Scholar] [CrossRef]
Figure 1. YOLOv8 structure diagram.
Figure 2. LMSF-YOLOv8s architecture diagram.
Figure 3. Schematic diagram of the SEPConv module. * denotes element-wise multiplication.
Figure 4. Detection head structure diagram.
Figure 5. MSFB module structure diagram.
Figure 6. ADown module structure diagram.
Figure 7. MSCA structure diagram.
Figure 8. Multi-scale convolutional structure diagram on x and y axes.
Figure 9. (a) The original image set, (b) the detection results obtained using the YOLOv8s method, and (c) the detection outcomes from the LMSF-YOLOv8s method.
Figure 10. Visual comparison.
Table 1. Lightweight network comparison experiment results table.
Method           mAP_0.5/%   Params/M   GFLOPs
YOLOv8s          38          11.1       28.5
+GhostNet        36.9        9.5        24.9
+MobileNetV3     36.4        10.4       20.8
+C2f_SEPConv     38.2        9.9        25.6
Note: Bold values indicate the optimal results among the compared methods.
Table 2. Detection head adjustment experimental results table.
Combination           mAP_0.5/%   mAP_0.5:0.95/%   Params/M   GFLOPs
P3 + P4 + P5          38          22.9             11.1       28.5
P2 + P3 + P4 + P5     40.9        24.4             10.6       37.0
P2 + P3 + P4          41.3        25.1             7.4        34.1
Note: Bold values indicate the optimal results among the compared methods.
Table 3. Loss function switching experimental results.
Loss Function      mAP_0.5/%   mAP_0.5:0.95/%   P/%    R/%
CIoU [37]          38          22.9             45.9   36.9
DIoU [37]          38.2        22.8             49.4   37.7
EIoU [20]          37.8        22.9             49.6   37.4
Focal-EIoU [20]    38.6        23.1             49.6   38.1
Note: Bold values indicate the optimal results among the compared methods.
Table 4. VisDrone dataset experiment result comparison.
Method                mAP_0.5/%   mAP_0.5:0.95/%   Params/M   GFLOPs   FPS
Faster R-CNN [15]     34.7        20.6             43.15      199.2    -
SSD [2]               21.2        13.1             25.96      84.9     -
YOLOv5s               35.4        20.5             7.1        16.5     -
YOLOv5l               41.1        24.4             47.1       109.3    -
YOLOv8n               32.5        19.1             3.01       8.1      183.2
YOLOv8s               38.0        22.9             11.12      28.5     147.8
YOLOv8m               42.1        25.8             25.84      78.7     109.4
BGF-YOLOv10 [30]      39.5        -                2.0        8.6      37
SOD-YOLO [29]         37.6        -                3.0        12.5     -
BDH-YOLO [26]         42.9        26.2             9.39       -        -
PVswin-YOLOv8 [33]    43.3        26.4             21.6       161.2    -
PC-YOLO11s [39]       43.8        26.3             7.1        -        -
ESO-DETR [40]         41          24               14.9       66       120
EDANet [41]           39.1        23.1             6.1        65.4     -
Ours                  44.4        26.9             7.97       38.3     117.6
Note: Bold values indicate the optimal results among the compared methods.
Table 5. YOLOv8s detection accuracy table on VisDrone dataset.
Class              mAP_0.5/%   mAP_0.5:0.95/%   P/%    R/%
all                38.0        22.9             49.5   36.9
pedestrian         40.0        18.1             52.3   36.9
people             31.3        11.8             54.5   25.9
bicycle            11.3        5.0              25.9   15.3
car                78.8        56.3             72.9   76.3
van                44.0        31.0             49.4   44.1
truck              36.3        24.2             52.4   34.8
tricycle           26.5        14.7             39.0   27.3
awning-tricycle    15.6        9.8              29.3   17.5
bus                53.7        39.4             67.6   49.0
motor              42.4        18.9             51.5   42.4
Note: Bold values indicate the optimal results among the compared methods.
Table 6. LMSF-YOLOv8s detection accuracy table on VisDrone dataset.
Class              mAP_0.5/%   mAP_0.5:0.95/%   P/%    R/%
all                44.4        26.9             53.9   42.5
pedestrian         51.8        24.9             58.8   47.3
people             41.6        17.5             58.9   37.3
bicycle            16.8        7.3              29.9   19.9
car                84.0        60.9             74.5   81.9
van                49.1        35.5             54.7   47.2
truck              38.5        25.7             52.2   36.9
tricycle           32.2        18.4             46.9   31.0
awning-tricycle    17.9        11.3             37.6   19.9
bus                60.2        43.8             70.0   52.2
motor              51.4        24.1             55.1   51.2
Note: Bold values indicate the optimal results among the compared methods.
Table 7. Ablation experiment results.
Method              mAP_0.5/%   mAP_0.5:0.95/%   P/%    R/%    Params/M   GFLOPs
YOLOv8s             38.0        22.9             49.5   36.9   11.1       28.5
+A                  38.6        23.1             49.6   38.1   11.1       28.5
+B                  39.2        23.2             49.6   38.7   11.4       29.3
+C                  38.2        22.9             50.5   37.3   9.9        25.6
+D                  41.3        25.1             51.4   40.1   7.4        34.1
+E                  40.2        24.4             50.7   39.6   12.38      35.2
+A+B                39.7        23.4             49.6   39.4   11.4       29.3
+A+B+C              40.0        23.5             50.8   39.7   10.2       26.4
+A+B+C+D            42.6        26.1             52.9   41.8   6.5        32.1
+A+B+C+D+E          44.3        26.9             53.9   42.5   7.9        38.3
Note: Bold values indicate the optimal results among the compared methods.
Table 8. DOTA dataset experiment result comparison.
Method                  mAP_0.5/%   mAP_0.5:0.95/%   P/%    R/%
YOLOv8s                 54.5        36.5             77.8   49.1
LMSF-YOLOv8s (Ours)     56.6        37.8             79.4   50.9
Note: Bold values indicate the optimal results among the compared methods.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Deng, S.; Wan, Y. A Lightweight Network for UAV Multi-Scale Feature Fusion-Based Object Detection. Information 2025, 16, 250. https://doi.org/10.3390/info16030250
