Article

LD-YOLOv10: A Lightweight Target Detection Algorithm for Drone Scenarios Based on YOLOv10

School of Electronic Information Engineering, China West Normal University, Nanchong 637009, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3269; https://doi.org/10.3390/electronics13163269
Submission received: 23 July 2024 / Revised: 6 August 2024 / Accepted: 16 August 2024 / Published: 17 August 2024

Abstract

Due to the limited computing resources and storage capacity of edge detection devices, efficient detection algorithms are typically required to meet real-time and accuracy requirements. Existing detectors often require a large number of parameters and high computational power to improve accuracy, which reduces detection speed and performance on low-power devices. To reduce the computational load and enhance detection performance on edge devices, we propose a lightweight drone target detection algorithm, LD-YOLOv10. First, we design a novel lightweight feature extraction structure called RGELAN, which uses re-parameterized convolutions and the newly designed Conv-Tiny as its computational structure to reduce the computational burden of feature extraction. We introduce the AIFI module, whose multi-head attention mechanism enhances the expression of semantic information. We construct the DR-PAN Neck structure, which captures the weak features of small targets with minimal computational load. Wise-IoU and EIoU are combined into a new bounding box regression loss function that adjusts the competition between anchor boxes of different quality and the sensitivity to anchor box aspect ratios, providing a more intelligent gradient allocation strategy. Extensive experiments on the VisdroneDET-2021 and UAVDT datasets show that LD-YOLOv10 reduces the number of parameters by 62.4% while achieving a slight increase in accuracy, and it has a faster detection speed than other lightweight algorithms. When deployed on the low-power NVIDIA Jetson Orin Nano device, LD-YOLOv10 achieves a detection speed of 25 FPS.

1. Introduction

Drones, as important tools for development in various fields, are widely used in tasks such as agricultural management [1], smart transportation [2], and disaster relief [3] due to their real-time data transmission capabilities and high mobility. Object localization and classification in drone images have become a hot topic in recent years and are also prerequisites for many upstream studies. With the rapid development of artificial intelligence, deep-learning-based object detection algorithms have become popular. While commonly used two-stage object detection algorithms [4,5] can achieve high detection accuracy, they suffer from issues such as large model sizes and longer detection times, making deployment in performance-limited embedded environments very challenging. On the other hand, the one-stage YOLO series models are particularly well-suited for deployment in embedded systems, thanks to their ability to simultaneously perform both localization and classification, improving detection speed. However, due to drone endurance, power consumption, and performance limitations, balancing detection accuracy with payload constraints remains a very challenging task.
To address these challenges, researchers have been working on developing efficient and lightweight object detection networks. Most models are trained on the COCO dataset, and directly applying them to drone imagery, with its extreme scale variations, often results in poor accuracy. To handle the issues of small target size and high target density in drone images, there are generally two common solutions: (1) Adding small object detection Heads to the network structure [6], which enriches details by inputting large-sized feature maps into the detection layers and improves detection accuracy for small targets. However, this greatly increases the model’s computational load (GFLOPS), affecting its running speed and making it unsuitable for lightweight models. (2) Using Soft-NMS [7] in the post-processing stage to weight and decay the confidence scores of overlapping bounding boxes, reducing the number of overlapping boxes and improving detection performance for dense targets. However, using Soft-NMS to handle duplicate anchor boxes significantly increases post-processing time, making real-time object detection on embedded systems challenging.
Many aerial object detectors, such as SMFF-YOLO [8] and FFAGRNet [9], require a large number of parameters and computational resources to achieve improved detection accuracy, and they have demonstrated real-time performance on high-power devices. However, in real-world detection scenarios, most edge devices lack the computational capability of laboratory equipment, making it difficult to deploy these models on low-power devices and support real-time inference. Some lightweight detectors maintain detection accuracy by adding small-object detection layers or introducing post-processing methods such as Soft-NMS. However, they overlook the significant reduction in inference speed. The purpose of lightweight algorithms is to enable real-time detection on edge devices, but lower inference speed is not conducive to deployment on these devices. Therefore, seeking a lightweight aerial detection algorithm with significant advantages in terms of parameter count and inference speed is necessary.
The consistent dual assignment strategy proposed by YOLOv10 [10] avoids the need for NMS to eliminate duplicate prediction boxes during inference, greatly reducing post-processing time and achieving the fastest inference speed among models with similar detection accuracy. As the latest object detection model, it has achieved state-of-the-art (SOTA) results on both the COCO [11] and VOC [12] datasets. Therefore, we propose a lightweight aerial object detection model based on YOLOv10, called LD-YOLOv10. By significantly reducing the number of parameters and computational load, LD-YOLOv10 maintains detection accuracy while achieving the fastest detection speed, making it suitable for real-time object detection on drones with limited performance. The main contributions of this paper are:
  • To address the issue of excessive parameter count and computational load in the model, we proposed the RGELAN structure, which reduces the number of parameters and computational load of the feature extraction structure by over half.
  • To tackle the problem of small target misdetection, we designed the DR-PAN Neck structure. This increases small target detection accuracy by incorporating shallow feature maps using DySample [13] and the RGELAN structure to reduce computational load.
  • We optimized the bounding box regression loss to the Wise-EIoU loss function, addressing the issue of anchor box quality while providing more refined and efficient adjustments to anchor box shapes.
  • We proposed a lightweight network model for detecting aerial targets, which reduces computational load while maintaining detection accuracy. The model’s real-time detection capability was validated on the Jetson Orin Nano, demonstrating its potential for real-time drone application.

2. Related Work

As a single-stage object detector, the YOLO algorithm is the most commonly used edge-side object detector. Through continuous iterations, the YOLO series algorithms have been updated to YOLOv10. The following sections will mainly introduce the lightweight YOLO series algorithms.
Redmon et al. [14] first proposed the YOLO algorithm, laying the foundation for real-time detection by directly regressing target boxes. Ultralytics [15] introduced the YOLOv5 model, which achieved lightweight performance by adopting the C3 lightweight feature extraction structure. YOLOv5 was also the first to be offered in four versions (S, M, L, and X) to meet different needs, making it the most widely used object detection model. However, its anchor-based method of generating prediction boxes results in many redundant boxes, increasing the computational load. In 2023, Ultralytics [16] introduced the YOLOv8 model, which uses C2f to obtain richer gradient flow information and an anchor-free approach to speed up detection, reducing time and computational power requirements; however, post-processing operations are still needed. Wang et al. [10] proposed YOLOv10, which removes the need for post-processing through a consistent dual assignment strategy, reducing model detection time and achieving the fastest detection speed to date.
Lightweight design is a crucial indicator for detection models used on edge devices, aiming to reduce model size while maintaining good accuracy under limited computational resources. Standard lightweight methods include model compression and network structure optimization. Model compression methods include pruning [17], quantization [18], and knowledge distillation [19]. Pruning reduces model size by removing redundant model parameters, while quantization converts weight floating-point numbers to lower-bit representations. These methods may introduce some information loss, leading to decreased model performance. Network structure optimization involves designing and optimizing network architectures for specific tasks and scenarios, aiming to minimize model size and maintain high accuracy within the constraints of limited computational resources.
From the perspective of network structure optimization, detection models are generally divided into three parts: Backbone, Neck, and Head. The Backbone is responsible for feature extraction, the Neck is used for feature fusion, and the Head is used for object classification and localization. The following sections introduce research on lightweight models related to these three aspects.
Backbone: The Backbone of object detection extracts low-level and high-level features from training data. FFCA-YOLO [20] reconstructs the Backbone and Neck using partial convolutions, reducing frequent memory redundant accesses. GCL-YOLO [21] constructs a Backbone network based on GhostConv to generate redundant feature maps, thereby minimizing the loss of detection accuracy. MELF-YOLOv5 [22] uses MobileNetV3 [23] as the Backbone network and employs depth-wise separable convolutions to control the number of network channels, reducing both model size and computational load.
Neck: The Neck structure in object detection integrates features of different scales provided by the Backbone using FPN-PAN structures to capture feature information at various scales, which can be less effective in extreme scale situations. PP-PicoDet [24] reduces parameters by unifying the number of input channels and expands the use of larger separable convolutions to improve accuracy. PP-YOLOE [25] mitigates the computational load required for inference and maintains accuracy by introducing re-parameterization techniques. AMFLW-YOLO [26] employs a bi-directional feature pyramid network (BiFPN) structure to enhance the network’s multi-scale feature extraction capabilities.
Head: The Head of an object detection model utilizes multi-scale feature maps provided by the Neck layer to output bounding boxes and class probabilities through convolutional and fully connected layers. LWUAVDet [6] introduces the PixED Head for effective feature extraction and uses the Aux Head to enhance feature representation. YOLOv10 [10] designs a lightweight classification Head using 3 × 3 separable convolutions and 1 × 1 convolutions. YOLO-RS [27] enhances sensitivity to anchor box width and height using EIoU, accelerating model convergence. Sithmini et al. [28] introduced the Xception architecture, which combines depthwise separable convolutions and pointwise convolutions to enhance feature representation learning. This approach achieves better performance with the same computational complexity.

3. Proposed Models

3.1. Proposal Network Overview

3.1.1. Structure of YOLOv10

As the current state-of-the-art (SOTA) single-stage object detection algorithm, YOLOv10 introduces the NMS-free concept, achieving true end-to-end detection. The model structure is shown in Figure 1. The YOLOv10 detection model consists of three main parts: Backbone, Neck, and Head. The Backbone includes the CBS module, SCDown module, C2f module, C2fCIB module, and PSA self-attention mechanism, and is primarily responsible for feature extraction from the input image. The Neck employs an FPN-PAN structure to integrate feature information from the Backbone network. The Head is a lightweight decoupled Head designed to implement the consistent dual assignment strategy, addressing YOLO’s reliance on NMS in post-processing.

3.1.2. Structure of LD-YOLOv10

As shown in Figure 2, LD-YOLOv10 is a lightweight and efficient network model based on YOLOv10, still following the Backbone–Neck–Head structure. Compared to YOLOv10, we have introduced three innovations: (1) In the Backbone structure, we designed RGELAN as a new feature extraction structure, retaining the first layer C2f structure to enrich detail information and replacing the PSA structure with AIFI [29]. (2) In the Neck, we developed a simple DR-PAN multi-scale fusion structure composed of DySample and RGELAN. (3) In the Head, we employed the Wise-EIoU loss function as the new bounding box regression loss function. The following sections provide a detailed explanation of these innovations.

3.2. The RGELAN Structure

The RGELAN structure integrates the design principles of GELAN from YOLOv9 [30], replacing the C2f structure in YOLOv10. Typically, a block consists of several convolutional structures. To reduce computational complexity, RGELAN omits the blocks in GELAN and uses a newly designed lightweight convolution, Conv-Tiny, as the computation structure for the gradient branch. The goal of the fusion structure is to integrate richer features from the Backbone network, but using Conv-Tiny inevitably leads to limited feature extraction capability. Therefore, we use RepConv [31] in the gradient flow branch to enrich feature extraction information and reduce the network parameters during inference through re-parameterization. The structures of GELAN and RGELAN are shown in Figure 3.
In image classification tasks, the features of an object may be related to the background features surrounding it. Capturing correlations between feature maps allows the model to better understand the relationship between the object and the background. DWConv [32] uses one convolutional kernel per channel, with each channel convolved independently, achieving lightweight results. This approach neglects spatial correlations and is still suboptimal regarding efficiency trade-offs. To address this issue, we improved DWConv by adding spatial correlations in feature extraction and retaining redundant features to enrich object characteristics. As shown in Figure 4, we group feature maps in pairs and perform grouped convolutions with half the number of output channels. In each group convolution, we split each feature map into two parts, apply 3 × 3 convolutions to one part, concatenate the convolved feature map with the other feature map, and finally obtain an output feature map with the same number of channels as the input feature map, reducing computational complexity by convolving only half of the feature maps.
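To make the Conv-Tiny design concrete, the following PyTorch sketch reproduces our reading of Figure 4: the input channels are split in half, only one half passes through a grouped 3 × 3 convolution, and the untouched half is concatenated back. The group count, the BatchNorm/SiLU pairing, and the module name ConvTiny are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn


class ConvTiny(nn.Module):
    """Sketch of the Conv-Tiny idea: convolve only half of the feature maps.

    The first half of the channels passes through a grouped 3x3 convolution
    (with BatchNorm and SiLU, an assumed choice), the second half is forwarded
    unchanged, and the two halves are concatenated back together.
    """

    def __init__(self, channels: int, groups: int = 2):
        super().__init__()
        assert channels % 2 == 0, "this sketch expects an even channel count"
        half = channels // 2
        self.conv = nn.Conv2d(half, half, kernel_size=3, padding=1,
                              groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(half)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)              # split the feature maps into two halves
        x1 = self.act(self.bn(self.conv(x1)))   # 3x3 grouped convolution on one half only
        return torch.cat((x1, x2), dim=1)       # re-attach the untouched half


# quick shape check with a 12-channel input, as in Figure 4
if __name__ == "__main__":
    out = ConvTiny(12)(torch.randn(1, 12, 40, 40))
    print(out.shape)  # torch.Size([1, 12, 40, 40])
```

Convolving only half of the channels roughly halves the computation of the layer, while the pass-through half preserves redundant features, matching the motivation described above.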
RepConv uses three branches to capture features with different receptive fields during training. It reduces the model’s parameter and computational requirements during inference through re-parameterization, ensuring efficient inference speed in lightweight models. During training, RepConv includes an identity branch, a 1 × 1 convolution, and a 3 × 3 convolution. This multi-branch architecture allows the network to learn features from different receptive fields and enrich the extracted feature information. During inference, RepConv re-parameterizes by converting the identity branch and 1 × 1 convolution into a 3 × 3 convolution, which is then fused with the 3 × 3 convolution branch to produce a single branch, integrating features from different input branches with the parameter count of a standard 3 × 3 convolution.
The combined operation of a convolution layer followed by a BN layer can be expressed as:
$\hat{x} = \omega_{BN}(\omega_{conv} x + b_{conv}) + b_{BN} = (\omega_{BN} \cdot \omega_{conv})\,x + (\omega_{BN} \cdot b_{conv} + b_{BN})$  (1)
The merged convolution parameters are obtained as follows:
$\omega_{fuse} = \omega_{BN} \cdot \omega_{conv}, \qquad b_{fuse} = \omega_{BN} \cdot b_{conv} + b_{BN}$  (2)
The convolution after fusion in inference is:
$\hat{x}_i = (\omega_{fuse}^{3 \times 3} + \omega_{fuse}^{1 \times 1} + \omega_{fuse}^{0 \times 0}) \cdot x_i + (b_{fuse}^{3 \times 3} + b_{fuse}^{1 \times 1} + b_{fuse}^{0 \times 0})$  (3)
Equation (3) shows the combined convolution parameters used by RepConv during inference. Here, $\omega_{conv}$ and $\omega_{BN}$ represent the parameters of the convolution process and the BN layer, respectively, while $b_{conv}$ and $b_{BN}$ denote the biases of the convolution and BN operations. The fused convolution parameters are equivalent to those used in a standard convolution operation.
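The sketch below shows how Equations (1) and (2) translate into code for folding a BatchNorm layer into its preceding convolution using standard PyTorch modules. It is a minimal illustration of the re-parameterization step, not the full RepConv branch merge; the comment notes how the remaining branches would be handled per Equation (3).

```python
import torch
import torch.nn as nn


@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding convolution, as in Equations (1)-(2).

    omega_BN = gamma / sqrt(running_var + eps) and b_BN = beta - omega_BN * running_mean,
    so omega_fuse = omega_BN * omega_conv and b_fuse = omega_BN * b_conv + b_BN.
    """
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    w_bn = bn.weight / torch.sqrt(bn.running_var + bn.eps)         # omega_BN, shape [C_out]
    fused.weight.copy_(conv.weight * w_bn.reshape(-1, 1, 1, 1))    # omega_fuse
    b_conv = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_(w_bn * (b_conv - bn.running_mean) + bn.bias)  # b_fuse
    # For the full RepConv merge, the 1x1 kernel would be zero-padded to 3x3 and the
    # identity branch written as a 3x3 identity kernel before summing, per Equation (3).
    return fused
```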
Table 1 compares the performance of mainstream feature extraction structures in the YOLO series, including C3Ghost [33], C3 [15], ELAN [34], C2f [16], RepNCSPELAN4 [26], and C2fCIB [10]. MeanTime represents the average time for each forward pass. Compared to the lightest feature fusion structure, C3Ghost, RGELAN exhibits a faster inference speed and a shorter forward propagation time, providing preliminary evidence that the proposed structure extracts features more quickly.

3.3. The AIFI Structure

As shown in Figure 5, AIFI applies the self-attention mechanism to high-level features, which helps the network more effectively capture information from different positions in the image sequence, thereby assisting the subsequent network in recognizing objects in the image. Unlike the PSA structure, AIFI transforms the feature map into feature vectors that are fed into the multi-head attention mechanism and feed-forward network. This avoids the loss of important original features that can occur with repeated convolution operations. Therefore, we replace the PSA structure with the AIFI structure, which enriches high-level features with the attention mechanism while reducing the number of parameters and maintaining accuracy.
Specifically, AIFI converts the input 2D feature map into a 1D vector. This vector is processed through multiple sets of self-attention mechanisms, where the sequence information is weighted and normalized with the identity branch to obtain sequence information focused on the target. The sequence information then undergoes non-linear learning introduced by the feed-forward network, allowing it to learn the complex relationships between different elements in the sequence. Finally, the 1D vector is transformed back into a 2D feature map to serve as input for the Neck network. As shown in Equation (4), the multi-head attention mechanism allows each attention Head to focus on different subspaces, enabling the model to capture various aspects of the input features. As shown in Equation (5), the feed-forward network (FFN) performs non-linear transformations at each position, enhancing the model’s ability to express non-linear relationships.
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(A(QW_1^Q, KW_1^K, VW_1^V), \ldots, A(QW_n^Q, KW_n^K, VW_n^V))\,W^O$  (4)
$\mathrm{FFN}(x) = \max(0, xW_i + b_i)W_{i+1} + b_{i+1}$  (5)
Specifically, the query (Q), key (K), and value (V) matrices are the inputs to the attention mechanism, A denotes an attention Head, the W terms are the corresponding projection weight matrices, and $W^O$ is the output linear transformation. In Equation (5), x is the input, and $W_i$, $b_i$, $W_{i+1}$, and $b_{i+1}$ are the weights and biases of the two successive linear transformations. AIFI transforms the two-dimensional features into a one-dimensional vector and uses multi-head attention to establish global dependencies and capture long-distance context information. The FFN further extracts and processes the features at each position to enhance the representation of local features and thus improve the model’s overall performance.
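The sketch below illustrates the AIFI computation path described above: the 2D feature map is flattened into a token sequence, passed through multi-head self-attention and a position-wise FFN (Equations (4) and (5)) with residual connections and layer normalization, and reshaped back to 2D. The head count, FFN width, and the omission of positional encoding are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn


class AIFISketch(nn.Module):
    """Sketch of the AIFI path: 2D feature map -> token sequence -> attention + FFN -> 2D map.

    Head count and FFN width are illustrative; positional encoding is omitted for brevity.
    """

    def __init__(self, channels: int, num_heads: int = 8, ffn_dim: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.ffn = nn.Sequential(nn.Linear(channels, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, channels))
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        seq = x.flatten(2).permute(0, 2, 1)        # [B, C, H, W] -> [B, H*W, C] token sequence
        attn_out, _ = self.attn(seq, seq, seq)     # multi-head self-attention, Equation (4)
        seq = self.norm1(seq + attn_out)           # residual connection + normalization
        seq = self.norm2(seq + self.ffn(seq))      # position-wise feed-forward, Equation (5)
        return seq.permute(0, 2, 1).reshape(b, c, h, w)  # fold back into a 2D feature map
```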

3.4. The DR-PAN Neck Structure

To identify targets of different scales, the resolution of the network’s main-branch features is increased through upsampling, resulting in a Neck structure with a sequence of scale features. Traditional upsampling methods rely on bi-linear interpolation, which is prone to checkerboard artifacts [35], leading to the loss of semantic information for small aerial targets. Therefore, this paper introduces DySample as the upsampling structure of the Neck network. By using point-based, learned sampling, it avoids the computational overhead caused by dynamic convolutions and subnetworks and improves the model’s performance at minimal computational cost. The formulation of DySample is as follows:
$X' = \mathrm{grid\_sample}(X, S)$  (6)
$S = G + O$  (7)
$O = 0.5 \cdot \mathrm{sigmoid}(\mathrm{linear}_1(X)) \cdot \mathrm{linear}_2(X)$  (8)
Here, S represents the sampling set, G is the original sampling grid, and O is the offset. X denotes the input feature map, and X' is the re-sampled feature map. Specifically, given an input feature map X of size C × H × W, a linear projection of X is performed, and a pointwise dynamic range factor is generated using the sigmoid function with a static coefficient of 0.5. The offset is then reshaped into a size of 2 × sH × sW using the pixel shuffle method, where 2 corresponds to the x and y coordinates and s is the upsampling factor. The grid sampling function uses the generated sampling point set to re-sample the input features, producing the final upsampled feature map of size C × sH × sW.
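A minimal PyTorch sketch of this point-sampling scheme is given below, assuming 1 × 1 convolutions as the two linear projections and a normalized sampling grid for grid_sample; the offset normalization details reflect our own reading of Equations (6)-(8) rather than the official DySample implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DySampleSketch(nn.Module):
    """Sketch of DySample-style upsampling: content-aware offsets O added to a
    static grid G, followed by grid_sample (Equations (6)-(8))."""

    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # linear_1 gates the dynamic range, linear_2 predicts raw (x, y) offsets
        self.linear1 = nn.Conv2d(channels, 2 * scale * scale, 1)
        self.linear2 = nn.Conv2d(channels, 2 * scale * scale, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s = self.scale
        # O = 0.5 * sigmoid(linear1(X)) * linear2(X), then pixel-shuffled to [B, 2, sH, sW]
        offset = 0.5 * torch.sigmoid(self.linear1(x)) * self.linear2(x)
        offset = F.pixel_shuffle(offset, s)
        # static sampling grid G in normalized [-1, 1] coordinates
        ys = torch.linspace(-1, 1, s * h, device=x.device)
        xs = torch.linspace(-1, 1, s * w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).expand(b, -1, -1, -1)   # [B, sH, sW, 2]
        # S = G + O, with pixel offsets rescaled to normalized units
        offset = offset.permute(0, 2, 3, 1) * torch.tensor([2.0 / w, 2.0 / h], device=x.device)
        return F.grid_sample(x, grid + offset, mode="bilinear", align_corners=True)
```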
To address the challenge of detecting small objects, the most effective methods involve adding small object detection Heads, which inevitably increase the computational load. We designed the DR-PAN Neck structure based on DySample and RGELAN to overcome this. Compared to FPN-PAN, our approach upsamples the 80 × 80 feature map output by FPN and fuses it with the 160 × 160 large feature map from the Backbone, which contains rich detail information, using RGELAN to obtain new features for small objects. Additionally, since the number of large objects in aerial images is very low, we discarded the network structure that generates high-level features and retained a lightweight structure for generating mid-level features. This model reduces the number of parameters while enhancing its focus on recognizing small to medium-sized objects. Table 2 compares the parameter count and computational load of different Neck structures.
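The shape-level sketch below traces the DR-PAN fusion path described above, with generic stand-ins (nearest-neighbor upsampling and a 1 × 1 fusion convolution) in place of DySample and RGELAN; the channel counts are illustrative assumptions used only to show the tensor flow.

```python
import torch
import torch.nn as nn

# Shape-level sketch of the DR-PAN fusion path (Section 3.4). The real neck uses
# DySample and RGELAN; here generic stand-ins trace the tensor shapes only.
p3 = torch.randn(1, 128, 80, 80)    # 80x80 FPN output
p2 = torch.randn(1, 64, 160, 160)   # 160x160 shallow backbone feature, rich in detail

upsample = nn.Upsample(scale_factor=2, mode="nearest")   # DySample in the real neck
fuse = nn.Conv2d(128 + 64, 96, kernel_size=1)            # RGELAN in the real neck

small_obj_feat = fuse(torch.cat((upsample(p3), p2), dim=1))
print(small_obj_feat.shape)  # torch.Size([1, 96, 160, 160]) -> small-object detection Head
```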

3.5. Wise-EIoU Loss

In the YOLOv8 model, the total loss function combines weighted losses from classification and regression. The classification loss uses binary cross-entropy to measure the difference between predicted class probabilities and the true labels. The regression loss comprises the distribution focal loss (DFL) and the bounding box regression loss (BBRL). DFL focuses the network on the distribution around the target location and its neighboring areas, while BBRL aims to minimize the difference between the predicted and true bounding box coordinates. The overall loss function integrates these two aspects through weighting, as shown in Equation (12).
$f_{BCEL} = \mathrm{weight}[class]\,\big({-x[class]} + \log\big(\textstyle\sum_j \exp(x[j])\big)\big)$  (9)
$f_{DFL}(S_i, S_{i+1}) = -\big((y_{i+1} - y)\log(S_i) + (y - y_i)\log(S_{i+1})\big)$  (10)
$f_{BBRL} = f_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha\upsilon$  (11)
$f_{loss} = \lambda_1 f_{BCEL} + \lambda_2 f_{DFL} + \lambda_3 f_{BBRL}$  (12)
Here, $f_{loss}$ represents the total loss, while $f_{BCEL}$, $f_{DFL}$, and $f_{BBRL}$ denote the classification loss, distribution focal loss, and bounding box regression loss functions, respectively. $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting factors assigned to each loss, and $\upsilon$ denotes the aspect ratio term of the CIoU.
The YOLOv10 model employs a consistent dual assignment strategy, leading to noticeable loss function changes. This change occurs because YOLOv10 has two decoupled Heads: one computes a one-to-one assignment strategy, and the other computes a one-to-many assignment strategy. The loss functions for these two strategies are consistent, differing only in some hyperparameters. The final loss function is the sum of these two losses, as shown in Equation (13).
$f_{loss} = f_{loss}^{one2one} + f_{loss}^{one2many}$  (13)
CIoU [36], used as a bounding box regression loss, suffers from an ambiguous definition of the aspect ratio and has difficulty addressing sample quality imbalance. EIoU [37] decomposes the aspect ratio term of CIoU and considers the actual differences in width and height, using separate width and height losses to minimize the gap between the predicted and ground truth box dimensions, thus optimizing the box shape more effectively. Wise-IoU [38] dynamically balances the impact of samples of different quality on the model, addressing the issue of low-quality training data. Therefore, we combine EIoU and Wise-IoU to create Wise-EIoU as the bounding box regression loss function, addressing anchor quality issues while providing more precise and efficient shape adjustments. The EIoU and Wise-IoU losses are given in Equations (14) and (16), respectively; the Wise-EIoU loss and the final overall loss function are derived in Equations (18) and (19).
$L_{EIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{(\omega^c)^2 + (h^c)^2} + \frac{\rho^2(\omega, \omega^{gt})}{(\omega^c)^2} + \frac{\rho^2(h, h^{gt})}{(h^c)^2}$  (14)
$\rho^2(b, b^{gt}) = (x - x^{gt})^2 + (y - y^{gt})^2$  (15)
$L_{WIoUv1} = W_{IoU}\,L_{IoU}$  (16)
$W_{IoU} = \exp\!\left(\frac{\rho^2(b, b^{gt})}{(\omega^c)^2 + (h^c)^2}\right)$  (17)
where $\omega^c$ and $h^c$ represent the width and height of the smallest box enclosing the predicted and ground truth boxes, and $x$, $x^{gt}$, $y$, and $y^{gt}$ are the horizontal and vertical coordinates of the centers of the predicted and ground truth boxes, respectively. Since $W_{IoU}$ lies in the range $[1, e)$, it significantly amplifies the $L_{IoU}$ of ordinary-quality anchor boxes, while $L_{IoU}$, lying in the range $[0, 1]$, reduces the $W_{IoU}$ of high-quality anchor boxes. From Equations (14) and (17), we can derive the Wise-EIoU loss and the overall loss function, as shown in Equations (18) and (19):
$f_{Wise\text{-}EIoU} = W_{IoU}\,L_{EIoU}$  (18)
$f_{loss} = \lambda_1 f_{BCEL} + \lambda_2 f_{DFL} + \lambda_3 f_{Wise\text{-}EIoU}$  (19)
From the practical application of the loss function, Wise-IoU focuses on adjusting the weight of anchor box quality across the entire sample, while EIoU emphasizes handling the center distance and width–height differences of the anchor boxes to control specific anchor box regression. This approach ensures optimal training results.
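The following sketch expresses Equations (14)-(18) as a bounding box loss function, assuming (x1, y1, x2, y2) box tensors; detaching the distance terms inside the focusing factor follows our reading of Wise-IoU v1 and may differ from the exact implementation used in training.

```python
import torch


def wise_eiou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Sketch of Wise-EIoU for (x1, y1, x2, y2) boxes of shape [N, 4].

    The focusing factor W_IoU (Equation (17)) re-weights the EIoU term
    (Equation (14)); its distance terms are detached so that it only scales
    gradients, which follows our reading of Wise-IoU v1.
    """
    # widths, heights, and centers of predicted and ground truth boxes
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    cx1, cy1 = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx2, cy2 = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2

    # IoU of predicted and ground truth boxes
    iw = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    ih = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(min=0)
    inter = iw * ih
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # width/height of the smallest enclosing box and squared center distance (Equation (15))
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    rho2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2

    # EIoU loss, Equation (14)
    l_eiou = (1 - iou + rho2 / (cw ** 2 + ch ** 2 + eps)
              + (w1 - w2) ** 2 / (cw ** 2 + eps)
              + (h1 - h2) ** 2 / (ch ** 2 + eps))

    # Wise-IoU focusing factor, Equation (17), and the combined loss, Equation (18)
    w_iou = torch.exp(rho2.detach() / (cw.detach() ** 2 + ch.detach() ** 2 + eps))
    return (w_iou * l_eiou).mean()
```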

4. Discussion

4.1. Dataset Description

To verify the effectiveness and generalization of the algorithm, we conducted extensive experiments on two public datasets: VisdroneDET-2021 [39] and UAVDT [40].
The VisdroneDET-2021 dataset is the most widely used aerial dataset, containing 10 object classes and covering various scenes, weather, and lighting conditions. The training set, validation set, and test set contain 6471, 1610, and 548 images, respectively. It primarily consists of dense small objects. All models in this paper are trained, validated, and evaluated on this dataset.
The UAVDT dataset comprises 50 videos and 38,327 images, containing three object classes: car, bus, and truck. Of these, 23,258 images are used for training and 15,069 images are used for testing.
The contents of the VisdroneDET-2021 and UAVDT datasets are shown in Figure 6.

4.2. Implementation Details and Evaluation Indicators

During model training, we use Python 3.9, PyTorch 2.1.0, and CUDA 12.1 as the desktop computing software environment, with an NVIDIA RTX 4060 GPU as the hardware. The detection model is based on the THU-ME YOLOv10.1.1 version, with all hyperparameters kept consistent. To ensure a fair comparison, all networks in the experiments are trained from scratch without using official pre-trained weights. The epoch is set to 300, SGD is used as the optimizer, and the batch size is set to 4, ensuring all models converge. In the experimental results listed below, the YOLO series and LD-YOLOv10 are obtained from our training, while the results of other models are sourced from the respective cited papers. The hyperparameters of model training are shown in Table 3.
In the embedded experiments, we verify the inference speed of the model on the Jetson Orin Nano edge device. Released by NVIDIA in 2023, it is a new-generation entry-level edge computing device. It features an 8-core ARM Cortex-A78AE CPU and a 1024-core NVIDIA Ampere GPU, providing 40 trillion operations per second (TOPS) of performance. The software environment includes Ubuntu 20.04, CUDA 11.4, and cuDNN 8.6.0.
In evaluating the model metrics, we use the mean average precision (mAP50) at IOU = 0.5 and the mean average precision (mAP50-95) at IOU = 0.5:0.05:0.95 to assess detection accuracy across all object categories. The weight file size (MByte) is used to measure the model size. Frames per second (FPS) is used to evaluate the detection speed of the network. The number of giga floating-point operations per second (GFLOPS) measures the computational load of the network, and the number of parameters (Params) assesses the parameter count of the network. Precision (P) and recall (R) are defined by the following formulas:
$P = \frac{TP}{TP + FP}$  (20)
$R = \frac{TP}{TP + FN}$  (21)
where TP denotes true positives, FP denotes false positives, and FN denotes false negatives. AP is the area under the PR curve; a higher AP indicates better precision. mAP is the average of AP across all categories; a higher mAP signifies better model performance. The formulas for calculating AP and mAP are as follows:
$AP = \int_0^1 P(R)\,dR$  (22)
$mAP = \frac{1}{n}\sum_{i=1}^{n} AP(i)$  (23)
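As a concrete illustration of Equations (22) and (23), the snippet below approximates AP as the area under an interpolated precision-recall curve and averages per-class APs into mAP; the all-point interpolation is an illustrative choice and not necessarily the exact evaluator used in our experiments.

```python
import numpy as np


def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP as the area under the precision-recall curve (Equation (22)),
    using all-point interpolation of the precision envelope."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]       # monotonically decreasing precision envelope
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))


# mAP is the mean of the per-class AP values (Equation (23))
ap_per_class = [average_precision(np.array([0.2, 0.5, 0.8]),
                                  np.array([0.9, 0.7, 0.6]))]
print(sum(ap_per_class) / len(ap_per_class))
```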

4.3. Ablation Experiments

We conduct a series of ablation analyses on the LD-YOLOv10 detector using the VisdroneDET-2021 dataset to evaluate the performance of each module.

4.3.1. Effect of RGELAN Module

We replace two C2f modules in the Backbone with different lightweight feature extraction structures, and the results are shown in Table 4. The results indicate that RGELAN achieves detection accuracy comparable to RepNCSPELAN4 with a similar number of parameters and computational complexity as C3Ghost, demonstrating the effectiveness of the proposed module.

4.3.2. Effect of Wise-EIoU Loss

To explore the impact of different IoU-based loss functions on model detection accuracy, we use CIoU as the baseline loss function for YOLOv10 and train the model with new loss functions that combine Wise-IoU with CIoU, DIoU, shapeIoU [41], PIoU2 [42], and EIoU. As shown in Table 5, Wise-IoU achieves higher precision and recall than CIoU, indicating that there are many low-quality anchor boxes in the training data. Wise-DIoU addresses anchor box quality imbalance and effectively guides the predicted boxes closer to the ground truth, mitigating the gradient vanishing problem during optimization. Wise-CIoU further optimizes the bounding boxes by considering aspect ratio consistency, resulting in better detection performance than Wise-DIoU. Wise-EIoU drops the aspect ratio measure and focuses directly on width and height differences for finer adjustments, increasing the penalty on the distance metric. Consequently, Wise-EIoU achieves the highest accuracy and the best detection performance.

4.3.3. Ablation Experiments of LD-YOLOv10

We use YOLOv10-S as the baseline model to validate the effectiveness of each module and comprehensively compare the impact of various improvements on the model. As shown in Table 6, the RGELAN module effectively reduces computational load and speeds up detection, albeit with a trade-off in accuracy. The AIFI structure, which uses one-dimensional features as input, reduces the loss of important information, decreasing parameter count and slightly improving accuracy. The DR-PAN structure provides rich feature information, enhancing the detection performance of small objects. Discarding the large object detection components significantly reduces parameters. However, the increase in the size of feature maps inputted to the Head results in a rise in computational load. Wise-EIoU, as the new bounding box regression loss function, effectively addresses harmful gradients caused by low-quality data and refines the regression of anchor box widths and heights. This results in a slight improvement in accuracy without increasing the computational load.

4.3.4. The Impact of Different IoUs

To verify the impact of different IoU thresholds on LD-YOLOv10, we tested the AP and mAP values for each category of the improved model with IoU thresholds set to 0.5, 0.75, and 0.9. As shown in Table 7, the detection accuracy for each category is highest when the IoU is 0.5. As the IoU threshold increases, the detection accuracy decreases. This is because the detected objects are mostly small and densely distributed, leading to some overlap in the detection boxes. When the IoU threshold increases, some overlapping boxes will be reduced to one, resulting in missed detections. Conversely, too low of an IoU threshold can lead to false detections. Therefore, the best detection performance is achieved when the IoU is set to 0.5.

4.4. Comparisons with Lightweight Object Detection Networks

The paper compares LD-YOLOv10 with mainstream algorithms and lightweight algorithms designed for drone images in recent years, using the VisdroneDET-2021 dataset. The comparison covers various aspects, including Params, GFLOPS, mAP, and FPS, with algorithms such as the YOLO series [10,15,16,30,34,43,44,45] and lightweight algorithms specifically designed for aerial images, such as PP-PicoDet [24], YOLOX-S [46], MELF-YOLOv5-S [22], and MGFAFNET [47]. The experimental results are shown in Table 8.
The data in Table 8 show that the proposed LD-YOLOv10 outperforms the original YOLOv10, reducing the parameter count by 62.4% while maintaining a comparable computational load and achieving a 0.6% increase in mAP50. It also preserves the efficient detection speed of YOLOv10. Compared to other algorithms, LD-YOLOv10 has slightly lower GFLOPS than YOLOv8, YOLOv9, and YOLOv10, yet achieves the best results in both mAP50 and mAP50:95 with the lowest parameter count. It also achieves the highest FPS among recent models with similar detection precision; models with higher FPS suffer a significant drop in detection accuracy. These results indicate that LD-YOLOv10 strikes a balanced trade-off between parameter count and detection accuracy, with a significant advantage in inference speed, making it highly suitable for real-time detection on edge devices.

4.5. Extended Experiments

To validate the general applicability of the proposed algorithm, we tested the model’s performance on the UAVDT aerial dataset. We compared LD-YOLOv10 with mainstream models such as YOLOv4, YOLOv8, and YOLOv10, as well as with leading lightweight aerial models EfficientDet [48] and LWUAVDet [6]. As shown in Table 9, the experiments reveal that LD-YOLOv10 slightly improves detection accuracy on this dataset. Compared to the current SOTA lightweight model LWUAVDet, LD-YOLOv10 reduces the parameter count by 37.1% and increases mAP0.5 by 2.2% and mAP0.5:0.95 by 3.7%, with a trade-off of a 5.3G increase in GFLOPS. The results demonstrate that our LD-YOLOv10 model also performs well on other aerial datasets.
We evaluated YOLOv10-S and LD-YOLOv10 in an embedded environment using the VisdroneDET-2021 dataset on the Jetson Orin Nano edge AI platform. As shown in Table 10, LD-YOLOv10 achieves the best accuracy on the dataset while maintaining detection speeds consistent with YOLOv10. The model size is reduced by 60.8%, and the FPS meets the minimum requirements for real-time detection. The model’s time and space complexity are significantly reduced, demonstrating its real-time performance on edge devices.

4.6. Visualization

Figure 7 and Figure 8 visualize the results of YOLOv8-S, YOLOv10, and LD-YOLOv10 on the VisdroneDET-2021 and UAVDT datasets. The detected objects are highlighted with bounding boxes, with colors representing different categories. We evaluated daytime and nighttime scenes in the VisdroneDET-2021 dataset and compared dense and foggy scenes in the UAVDT dataset. The results show that the proposed method exhibits significant advantages in both conventional and special scenarios, successfully detecting small objects that were missed by the original methods.

5. Conclusions

In this paper, we propose a lightweight UAV object detector, LD-YOLOv10. We make comprehensive improvements across the Backbone, Neck, and Head components. In the Backbone, we design RGELAN as a feature extraction module to reduce computational redundancy and replace the PSA structure with AIFI to reduce the parameter count further. In the Neck, we develop a lightweight DR-PAN structure, enhance small object detection accuracy with a small object detection Head, and use DySample and RGELAN to minimize the impact of increased computational load from the detection Head. In the Head, we combine EIoU and Wise-IoU to create a new bounding box regression loss function, Wise-EIoU, addressing anchor box quality issues in the training data and providing a more effective gradient allocation strategy.
The performance of LD-YOLOv10 was extensively validated through experiments on the VisdroneDET-2021 dataset. In terms of category detection, our method exhibits the lowest parameter count and faster detection speed compared to other methods, with a slight improvement in detection accuracy. This is highly advantageous for deploying the model on edge devices. We also validated our model on the UAVDT dataset, a public aerial imagery dataset, to assess its generalization performance in aerial view object detection. Despite a 62.4% reduction in parameter count, the model maintains high detection accuracy and speed. Testing on the Jetson Orin Nano edge device demonstrated the model’s capability for real-time detection on low-computational-power devices, confirming its effectiveness and real-time performance.
As a lightweight model, our proposed method only reduces the computational load by 3% compared to YOLOv10, which can impact real-time detection on edge devices with stricter computational requirements. In future work, we will focus on potential paradigms for lightweight design that further reduce computational load by decreasing the number of parameters. This will aim to increase detection speed without sacrificing accuracy or with only a slight loss in accuracy, making the model suitable for even more cost-effective and lower-computational-capacity edge devices.

Author Contributions

Conceptualization: X.Q. and Y.C.; methodology: X.Q. and W.C.; software: X.Q.; validation: X.Q., W.C., J.L. and M.N.; formal analysis: X.Q., Y.C. and W.C.; investigation: X.Q. and Y.C.; data curation: X.Q. and M.N.; writing—original draft preparation: X.Q.; writing—review and editing: X.Q. and Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the China West Normal University Talent Fund (No.: 463177).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cui, B.; Liang, L.; Ji, B.; Zhang, L.; Zhao, L.; Zhang, K.; Shi, F.; Creput, J.C. Exploring the YOLO-FT Deep Learning Algorithm for UAV-Based Smart Agriculture Detection in Communication Networks. IEEE Trans. Netw. Serv. Manag. 2024. Early Access. [Google Scholar] [CrossRef]
  2. Mao, G.; Liang, H.; Yao, Y.; Wang, L.; Zhang, H. Split-and-Shuffle Detector for Real-Time Traffic Object Detection in Aerial Image. IEEE Internet Things J. 2024, 11, 13312–13326. [Google Scholar] [CrossRef]
  3. Xu, J.; Fan, X.; Jian, H.; Xu, C.; Bei, W.; Ge, Q.; Zhao, T. YoloOW: A Spatial Scale Adaptive Real-Time Object Detection Neural Network for Open Water Search and Rescue From UAV Aerial Imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5623115. [Google Scholar] [CrossRef]
  4. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  6. Min, X.; Zhou, W.; Hu, R.; Wu, Y.; Pang, Y.; Yi, J. LWUAVDet: A Lightweight UAV Object Detection Network on Edge Devices. IEEE Internet Things J. 2024, 11, 24013–24023. [Google Scholar] [CrossRef]
  7. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS–Improving Object Detection with One Line of Code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569. [Google Scholar]
  8. Wang, Y.; Zou, H.; Yin, M.; Zhang, X. SMFF-YOLO: A Scale-Adaptive YOLO Algorithm with Multi-Level Feature Fusion for Object Detection in UAV Scenes. Remote Sens. 2023, 15, 4580. [Google Scholar] [CrossRef]
  9. Zhang, Y.; Wu, C.; Zhang, T.; Zheng, Y. Full-Scale Feature Aggregation and Grouping Feature Reconstruction-Based UAV Image Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5621411. [Google Scholar] [CrossRef]
  10. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  11. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; Volume 8693, pp. 740–755. ISBN 978-3-319-10601-4. [Google Scholar]
  12. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  13. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6027–6037. [Google Scholar]
  14. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the Computer Vision & Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  15. Jocher, G.; Stoken, A.; Borovec, J.; Chaurasia, A.; Changyu, L.; Hogan, A.; Hajek, J.; Diaconu, L.; Kwon, Y.; Defretin, Y. Ultralytics/YOLOv5: v5.0 - YOLOv5-P6 1280 Models, AWS, Supervise.ly and YouTube Integrations. Zenodo 2021. [Google Scholar] [CrossRef]
  16. Jocher, G. Ultralytics YOLOv8: V6. Available online: https://Github.Com/Ultralytics/Ultralytics (accessed on 23 October 2023).
  17. Shen, L.; Su, J.; He, R.; Song, L.; Huang, R.; Fang, Y.; Song, Y.; Su, B. Real-Time Tracking and Counting of Grape Clusters in the Field Based on Channel Pruning with YOLOv5s. Comput. Electron. Agric. 2023, 206, 107662. [Google Scholar] [CrossRef]
  18. Liu, X.; Wang, T.; Yang, J.; Tang, C.; Lv, J. MPQ-YOLO: Ultra Low Mixed-Precision Quantization of YOLO for Edge Devices Deployment. Neurocomputing 2024, 574, 127210. [Google Scholar] [CrossRef]
  19. Ma, T.; Tian, W.; Xie, Y. Multi-Level Knowledge Distillation for Low-Resolution Object Detection and Facial Expression Recognition. Knowl.-Based Syst. 2022, 240, 108136. [Google Scholar] [CrossRef]
  20. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215. [Google Scholar] [CrossRef]
  21. Cao, J.; Bao, W.; Shang, H.; Yuan, M.; Cheng, Q. GCL-YOLO: A GhostConv-Based Lightweight Yolo Network for UAV Small Object Detection. Remote Sens. 2023, 15, 4932. [Google Scholar] [CrossRef]
  22. Cao, L.; Song, P.; Wang, Y.; Yang, Y.; Peng, B. An Improved Lightweight Real-Time Detection Algorithm Based on the Edge Computing Platform for UAV Images. Electronics 2023, 12, 2274. [Google Scholar] [CrossRef]
  23. Mobilenetv3: A Deep Learning Technique for Human Face Expressions Identification|International Journal of Information Technology. Available online: https://link.springer.com/article/10.1007/s41870-023-01380-x (accessed on 21 July 2024).
  24. Yu, G.; Chang, Q.; Lv, W.; Xu, C.; Cui, C.; Ji, W.; Dang, Q.; Deng, K.; Wang, G.; Du, Y.; et al. PP-PicoDet: A Better Real-Time Object Detector on Mobile Devices. arXiv 2021, arXiv:2111.00902. [Google Scholar]
  25. Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y.; et al. PP-YOLOE: An Evolved Version of YOLO. arXiv 2022, arXiv:2203.16250. [Google Scholar]
  26. Peng, G.; Yang, Z.; Wang, S.; Zhou, Y. AMFLW-YOLO: A Lightweight Network for Remote Sensing Image Detection Based on Attention Mechanism and Multiscale Feature Fusion. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4600916. [Google Scholar] [CrossRef]
  27. Xie, T.; Han, W.; Xu, S. YOLO-RS: A More Accurate and Faster Object Detection Method for Remote Sensing Images. Remote Sens. 2023, 15, 3863. [Google Scholar] [CrossRef]
  28. Gunasekara, S.; Gunarathna, D.; Dissanayake, M.B.; Aramith, S.; Muhammad, W. Deep Learning Based Autonomous Real-Time Traffic Sign Recognition System for Advanced Driver Assistance. Int. J. Image Graph. Signal Process. 2022, 14, 70–83. [Google Scholar] [CrossRef]
  29. Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic Snake Convolution Based on Topological Geometric Constraints for Tubular Structure Segmentation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6070–6079. [Google Scholar]
  30. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  31. Soudy, M.; Afify, Y.; Badr, N. RepConv: A Novel Architecture for Image Scene Classification on Intel Scenes Dataset. Int. J. Intell. Comput. Inf. Sci. 2022, 22, 63–73. [Google Scholar] [CrossRef]
  32. Chollet, F. Xception: Deep Learning With Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  33. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features From Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
  34. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  35. Odena, A.; Dumoulin, V.; Olah, C. Deconvolution and Checkerboard Artifacts. Distill 2016, 1, e3. [Google Scholar] [CrossRef]
  36. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  37. Yang, Z.; Wang, X.; Li, J. EIoU: An Improved Vehicle Detection Algorithm Based on Vehiclenet Neural Network. J. Phys. Conf. Ser. 2021, 1924, 012001. [Google Scholar] [CrossRef]
  38. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  39. Cao, Y.; He, Z.; Wang, L.; Wang, W.; Yuan, Y.; Zhang, D.; Zhang, J.; Zhu, P.; Gool, L.V.; Han, J. VisDrone-DET2021: The Vision Meets Drone Object Detection Challenge Results. In Proceedings of the International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2847–2854. [Google Scholar]
  40. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
  41. Zhang, H.; Zhang, S. Shape-IoU: More Accurate Metric Considering Bounding Box Shape and Scale. arXiv 2024, arXiv:2312.17663. [Google Scholar]
  42. Powerful-IoU: More Straightforward and Faster Bounding Box Regression Loss with a Nonmonotonic Focusing Mechanism—ScienceDirect. Available online: https://www.sciencedirect.com/science/article/abs/pii/S0893608023006640 (accessed on 21 July 2024).
  43. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  44. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  45. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  46. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  47. Sun, J.; Gao, H.; Yan, Z.; Qi, X.; Yu, J.; Ju, Z. Lightweight UAV Object-Detection Method Based on Efficient Multidimensional Global Feature Adaptive Fusion and Knowledge Distillation. Electronics 2024, 13, 1558. [Google Scholar] [CrossRef]
  48. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
Figure 1. Structure of YOLOv10.
Figure 2. Structure of LD-YOLOv10.
Figure 3. The structure of GELAN and RGELAN.
Figure 4. The structure of Conv-Tiny, taking 12-channel input features as an example.
Figure 5. The AIFI structure.
Figure 6. (a) Each category and quantity of the VisdroneDET-2021 dataset. (b) Each category and quantity of the UAVDT dataset.
Figure 7. Visualization of LD-YOLOv10 on the VisdroneDET-2021 dataset.
Figure 8. Visualization results of LD-YOLOv10 on the UAVDT dataset.
Table 1. RGELAN compared to other lightweight modules.
Module Name | MeanTime (%) | FPS | FLOPS | Params (K)
C3Ghost | 0.13 | 785 | 2.33 | 142.56
C3 | 0.14 | 716 | 4.83 | 295.68
ELAN | 0.20 | 501 | 8.05 | 492.28
C2f | 0.18 | 543 | 7.51 | 459.52
RepNCSPELAN4 | 0.14 | 707 | 3.69 | 226.17
C2fCIB | 0.18 | 531 | 3.83 | 235.39
RGELAN | 0.11 | 911 | 3.42 | 209.53
Table 2. Comparison of different Neck structures. P2 is the small-object detection layer and P5 is the large-object detection layer; Params and GFLOPS are the parameter count and computational load of the overall model with the corresponding structure added.
Structure | FPN-PAN | FPN-PAN + P2 | FPN-PAN + P2 − P5 | DR-PAN
Params (M) | 8.07 | 8.25 | 6.33 | 5.40
FLOPS (G) | 24.8 | 37.1 | 35.4 | 29.5
Table 3. Hyperparameter Settings for LD-YOLOv10.
Hyperparameter | Value
Imgsz | 640 × 640
Batch Size | 4
Epoch | 300
Weight Decay | 0.0005
Momentum | 0.937
Initial Learning Rate | 0.01
Optimizer | SGD
Table 4. Comparative experiment of different lightweight modules.
Module | mAP0.5 | mAP0.5:0.95 | GFLOPS | Params (M)
C3Ghost | 36.7 | 21.8 | 20.6 | 7.28
RepNCSPELAN4 | 37.2 | 22.2 | 22.7 | 7.84
RGELAN | 37.2 | 22.3 | 20.9 | 7.24
Table 5. Comparative experiment on the effect of combining different IoU losses with Wise-IoU.
Method | P | R | mAP0.5 | mAP0.5:0.95
CIoU | 46.6 | 37.8 | 38.1 | 22.8
Wise-IoU | 47.5 | 38.2 | 38.3 | 22.7
Wise-CIoU | 48.6 | 38.0 | 38.8 | 22.9
Wise-DIoU | 48.0 | 37.6 | 38.2 | 22.8
Wise-shapeIoU | 48.7 | 37.7 | 38.6 | 23.2
Wise-PIoU2 | 47.6 | 38.6 | 38.8 | 23.0
Wise-EIoU | 48.8 | 37.9 | 38.9 | 23.1
Table 6. Ablation experiments of different modules of LD-YOLOv10.
Module | P (%) | R (%) | mAP0.5 (%) | mAP0.5:0.95 (%) | Params (M) | GFLOPS | Latency (ms)
YOLOv10-S | 50.1 | 38.3 | 38.8 | 23.3 | 8.04 | 24.5 | 6.9
+RGELAN | 48.1 | 37.0 | 37.4 | 22.4 | 7.24 | 20.9 | 6.1
+AIFI | 48.5 | 37.4 | 37.9 | 28.8 | 6.46 | 22.9 | 5.5
+DR-PAN | 48.3 | 37.9 | 38.9 | 23.1 | 3.04 | 23.7 | 6.5
+Wise-EIoU | 48.4 | 38.9 | 39.4 | 23.5 | 3.04 | 23.7 | 6.5
Table 7. The influence of different IoUs on the detection results. The per-class columns give AP (%).
IoU | mAP (%) | Pedestrian | People | Bicycle | Car | Van | Truck | Tricycle | Awning-Tricycle | Bus | Motor
0.5 | 39.4 | 46.2 | 35.3 | 12.4 | 82.1 | 46.1 | 31.7 | 25.2 | 16.1 | 51.5 | 47.4
0.75 | 24.2 | 21.7 | 14.7 | 5.5 | 58.2 | 32.0 | 20.1 | 14.1 | 10.3 | 36.0 | 22.3
0.9 | 5.0 | 17.2 | 8.9 | 4.0 | 66.0 | 36.7 | 21.7 | 14.4 | 11.8 | 43.1 | 17.6
Table 8. Comparison experiment between LD-YOLOv10 and mainstream lightweight models.
Model | Params (M) | GFLOPS | mAP50 (%) | mAP50:95 (%) | FPS
YOLOv3-Tiny (2018) | 8.68 | 12.9 | 15.9 | 6.9 | 195
YOLOv4-Tiny (2020) | 5.90 | 16.2 | 13.5 | 24.4 | 172
YOLOv5-S (2020) | 7.04 | 15.8 | 32.7 | 16.2 | 139
PP-PicoDet-L (2021) | 3.30 | 8.90 | 34.2 | - | 150
YOLOX-S (2021) | 9.00 | 26.8 | 34.0 | 19.8 | 106
YOLOv6-N (2022) | 4.30 | 11.1 | 31.6 | 16.9 | -
YOLOv7-Tiny (2022) | 6.03 | 13.1 | 36.8 | 18.9 | 112
YOLOv8-S (2023) | 11.1 | 28.5 | 39.2 | 23.4 | 128
MELF-YOLOv5-S (2023) | 3.40 | 9.8 | 34.8 | 18.7 | -
MGFAFNET-S (2024) | 4.20 | 24.2 | 36.0 | 21.0 | -
YOLOv9-S (2024) | 9.60 | 38.8 | 38.6 | 23.2 | -
YOLOv10-S (2024) | 8.04 | 24.5 | 38.8 | 23.0 | 143
LD-YOLOv10 (ours) | 3.04 | 23.7 | 39.4 | 23.5 | 147
Table 9. LD-YOLOv10 compared with mainstream lightweight algorithms on the UAVDT dataset.
Network | Params (M) | GFLOPS | FPS | mAP50 All (%) | mAP50 Car (%) | mAP50 Truck (%) | mAP50 Bus (%)
YOLOv3-Tiny | 8.68 | 12.9 | 193 | 26.9 | 61.4 | 12.9 | 6.1
YOLOv4-Tiny | 5.89 | 7.0 | 172 | 27.7 | 63.5 | 13.4 | 6.4
YOLOv5-S | 7.04 | 15.8 | 138 | 29.8 | 70.1 | 14.4 | 5.0
YOLOX-Tiny | 5.04 | 15.3 | 155 | 29.1 | 68.4 | 13.8 | 5.3
PP-PicoDet-L | 3.30 | 8.9 | 151 | 31.1 | 71.2 | 16.5 | 5.8
YOLOv7-Tiny | 6.03 | 13.1 | 114 | 31.2 | 70.4 | 16.8 | 6.8
YOLOv8-S | 11.10 | 28.8 | 126 | 31.9 | 71.3 | 17.4 | 7.1
LWUAVDet-S | 5.20 | 19.2 | - | 34.1 | - | - | -
YOLOv10 | 8.04 | 24.5 | 147 | 36.0 | 77.0 | 11.9 | 19.0
LD-YOLOv10 | 3.04 | 23.7 | 146 | 36.3 | 74.9 | 20.4 | 13.7
Table 10. Validation of LD-YOLOv10 on the Jetson Orin Nano edge device.
Model | mAP0.5 (%) | Params (M) | Model Scale (MB) | FPS | Average Inference Time per Image (ms)
YOLOv10-S | 38.6 | 8.04 | 16.1 | 23 | 41.7
LD-YOLOv10 | 39.3 | 3.04 | 6.3 | 25 | 38.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
