Article

An Efficient and Accurate UAV Detection Method Based on YOLOv5s

Yunsong Feng, Tong Wang, Qiangfu Jiang, Chi Zhang, Shaohang Sun and Wangjiahe Qian
1 State Key Laboratory of Pulsed Power Laser Technology, National University of Defense Technology, Hefei 230037, China
2 School of Physics and Optoelectronic Engineering, Anhui University, Hefei 230601, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2024, 14(15), 6398; https://doi.org/10.3390/app14156398
Submission received: 17 June 2024 / Revised: 7 July 2024 / Accepted: 15 July 2024 / Published: 23 July 2024

Abstract

Due to the limited computational resources of portable devices, target detection models for drone detection face challenges in real-time deployment. To enhance the detection efficiency of low, slow, and small unmanned aerial vehicles (UAVs), this study introduces an efficient drone detection model based on YOLOv5s (EDU-YOLO), incorporating lightweight feature extraction and balanced feature fusion modules. The model employs the ShuffleNetV2 network and coordinate attention mechanisms to construct a lightweight backbone network, significantly reducing the number of model parameters. It also utilizes a bidirectional feature pyramid network and ghost convolutions to build a balanced neck network, enriching the model’s representational capacity. Additionally, a new loss function, EIoU, replaces CIoU to improve the model’s positioning accuracy and accelerate network convergence. Experimental results indicate that, compared to the YOLOv5s algorithm, our model only experiences a minimal decrease in mAP by 1.1%, while reducing GFLOPs from 16.0 to 2.2 and increasing FPS from 153 to 188. This provides a substantial foundation for networked optoelectronic detection of UAVs and similar slow-moving aerial targets, expanding the defensive perimeter and enabling earlier warnings.

1. Introduction

Future conflicts have shifted towards intelligent warfare, driven by emerging military requirements and advancements in technology areas such as data, algorithms, and chips. Particularly in response to new trends in drone tactics, armed forces worldwide are actively exploring innovative approaches. Accurate small object detection heavily relies on automatic identification technology. By leveraging target feature databases and information from ground sensors, intelligent information processing algorithms are deployed to detect and recognize potentially threatening small aerial vehicles [1].
The current primary methods for UAV detection include radar detection, infrared detection, visible light detection, audio detection, and wireless detection. Radar detection [2] offers advantages such as all-weather operation, long-distance detection capabilities, and strong resistance against interference but suffers from imprecise target information and susceptibility to electromagnetic interference. Infrared detection excels [3] in high-performance detection, operates day and night, and boasts strong interference resistance, yet it is vulnerable to adverse weather conditions and high costs. Visible light detection [4] is characterized by its low cost, high-resolution imaging but is sensitive to lighting conditions and faces challenges with high false alarm rates. Audio detection [5] is noted for its lightweight build and easy installation, though it falls short in detection range and is sensitive to environmental noise. Wireless detection [6,7] stands out for analyzing drone flight status and operator details from intercepted signals; however, it cannot detect drones without wireless signals and faces challenges with signal interference in real-world settings.
To meet the demands of the high-precision detection of low, slow, and small unmanned aerial vehicles such as drones and cruise missiles in modern warfare, multimodal, multi-perspective networked optoelectronic reconnaissance technology has become essential [8,9,10,11]. These aerial vehicles, characterized by minimal radar signatures, high maneuverability, and unpredictable flight paths, pose a significant threat to national and societal security. In this context, efficient target detection is crucial yet it is challenged by limited computational resources [12]. Currently, there is a relative scarcity of algorithms specifically designed for drone-based object detection. Developing specialized object detection algorithms for drones is an important and yet underexplored area of research. Hua introduces a novel lightweight network structure named Cross-Stage Partially Deformable Network (CSPDNet) [13]. The core approach of CSPDNet involves the use of a Deformable Separable Convolution Block for feature separation, significantly reducing the computational load of convolutions and enhancing the information exchange capability through adaptive sampling of the separated feature map. Additionally, a channel weighting module is introduced to compensate for the effects of point-wise convolutions and filter out more significant feature information. While CSPDNet achieves a certain balance between model parameter size and detection accuracy, shortcomings include potential real-time performance issues and the detection efficacy for small targets, which require further exploration in future studies. Zheng introduces a strategy for the precise detection and localization of unmanned aerial vehicle swarms [14]. The primary approach involves the use of a coprime array combined with the KT-Dechirp and FS-RAM techniques for super-resolution detection and localization. This strategy capitalizes on the large array aperture of the coprime array and the integration of coherent long-time integration with gridless sparse techniques to enhance the signal-to-noise ratio and resolution of detection. However, the drawbacks include sensitivity to low signal-to-noise ratios and high processing complexity, necessitating further optimization to enhance reliability and efficiency in practical applications. Liang [15] introduces a new method for small object detection in UAV images, termed Feature Fusion and Scaling-based Single Shot Detector, complemented by spatial context analysis. This method enhances feature representation through an improved feature fusion module and an added deconvolution module, thereby increasing the detection accuracy for small objects. Additionally, the paper proposes spatial context analysis by recalculating the reliability of detection results through intra-class and inter-class distances among different object instances. However, the shortcomings of this method include relatively slower processing speed and a need for further enhancement in detecting extremely small or occluded objects.
When multiple detection devices operate in coordination within a network, achieving optimal detection accuracy with constrained computational resources is imperative. This necessitates not only high precision in target recognition models but also optimization in terms of model parameter size, computational load, and processing speed. For base station processors with limited computing power, low energy consumption, and strict constraints on memory and data bandwidth, lightweight design and model compression acceleration have become particularly important [16]. This involves the adoption of efficient algorithms and techniques, such as model pruning [17], quantization [18], and knowledge distillation [19] to reduce model complexity and enhance computational efficiency. These methods enable effective deployment on low-power devices without sacrificing detection performance, ensuring the capability to detect and respond to aerial threats in real-time [12,20]. Furthermore, optimized models are adaptable to various deployment environments, enhancing the overall reliability and efficiency of the system.
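As a concrete illustration of two of the compression techniques mentioned above, the sketch below applies unstructured L1 pruning and dynamic int8 quantization using standard PyTorch utilities; the toy model, the 30% sparsity level, and the choice of layers are illustrative assumptions, not settings used in this work.

```python
# Illustrative sketch only: unstructured L1 pruning followed by dynamic int8 quantization
# with standard PyTorch utilities. The toy model, 30% sparsity, and layer choice are
# assumptions for demonstration, not settings from this paper.
import torch
import torch.nn.utils.prune as prune

model = torch.nn.Sequential(
    torch.nn.Linear(256, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
)

# Prune 30% of the smallest-magnitude weights in every Linear layer, then make it permanent.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

# Quantize the pruned Linear layers to int8 for lower-memory inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```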
As illustrated in Figure 1, to enhance the efficiency of drone detection and recognition, this paper implements a lightweight optimization design for YOLOv5s. The model employs the ShuffleNetV2 network combined with a coordinate attention mechanism to maintain efficient feature extraction while reducing computational load. Furthermore, to improve the efficiency of feature fusion, the algorithm incorporates a bidirectional feature pyramid network to optimize the feature fusion path and assigns learnable weights to each input feature, achieving more balanced feature integration. Lastly, to enhance the precision of bounding box regression tasks, we introduce the EIoU bounding box loss function, which accelerates network convergence and improves the model’s positioning accuracy.
The main contribution of this work is summarized as follows.
  • The backbone network of YOLOv5s was reconstructed using ShuffleNetV2 and the coordinate attention mechanism, simplifying the network structure and reducing the model parameter size.
  • The neck network of YOLOv5s was reconstructed using Bi-FPN and Ghost Convolution, extracting more accurate and rich features.
  • The introduction of the new loss function EIoU to replace CIoU accelerates the network convergence speed and enhances the model’s localization ability.

2. Related Work

2.1. Small Object Detection Method

Few-shot object detection algorithms have made significant progress in addressing challenges associated with detecting small objects, enhancing the performance of detection networks [21,22]. However, current research predominantly concentrates on object recognition in natural settings, with limited exploration of aerial threat scenarios, which are characterized by heightened complexity. Aerial threat environments are more intricate, increasing the likelihood of background-foreground confusion in detection models. Additionally, the non-cooperative nature of aerial threats complicates sample acquisition, making target feature extraction more challenging [23,24,25,26].
Deep learning-based methods for small object detection can be broadly classified into two categories: two-stage object detection methods that first propose candidate regions and then classify and regress them, represented by Faster R-CNN [27]; and one-stage object detection methods that directly predict object categories and coordinates from images without region proposals, exemplified by YOLO [28,29,30] and SSD [31]. However, both one-stage and two-stage methods encounter difficulties in effectively detecting small objects. Table 1 provides a comparison between classic small object detection algorithms, demonstrating that the current mainstream framework combines a backbone network with a feature pyramid structure.
Challenges in small object detection include insufficient semantic information in low-level features leading to difficulties in detecting small objects; the limited availability of training data for small objects in popular datasets like Real World and COCO (Common Objects in Context), restricting the thorough learning of small object features during model training; and discrepancies between the scales of objects in classification datasets where backbone networks are trained and those in detection datasets, posing challenges in adapting to small object detection.

2.2. Lightweight Network

To enhance the efficiency and capability of portable devices in processing image and video data while adhering to storage space and power consumption constraints, the design of lightweight deep neural network architectures is crucial. In recent years, the design of lightweight neural network architectures has garnered extensive attention from both the academic and industrial communities, resulting in several seminal approaches, categorized into three distinct directions: manual design of lightweight neural network models [15,32]; automated neural network architecture design via neural architecture search [33]; and neural network model compression [34].
Significant progress has been made in the design of lightweight neural architectures. For instance, Google introduced the MobileNet architecture, which utilizes depthwise separable convolutions instead of standard convolutions to reduce complexity [35,36]. Face++ developed ShuffleNet, employing pointwise group convolutions and channel shuffle techniques [37]. MnasNet [38] and NasNet [39] leverage reinforcement learning to automate the construction of lightweight neural networks on portable devices. Notably, NasNet employs a block-based search space that significantly accelerates the search process.
In Zhang’s IGCNets [40], a two-tier grouped convolution approach is utilized to ensure inter-group feature interactions, achieved through varying group sizes and subsequent feature rearrangement. IGCV2 advances this further by employing a combination of 3 × 3 depthwise convolutions and 1 × 1 grouped convolutions for enhanced convolution efficiency, all satisfying the complementary condition [41]. Mehta et al. proposed ESPNet [42], which decomposes standard convolutions into 1 × 1 pointwise convolutions and spatial pyramids of dilated convolutions, reducing parameters and computations while increasing receptive fields. In ESPNetv2 [43], the dilated convolutions are replaced with depthwise dilated convolutions, and the 1 × 1 pointwise convolutions are substituted with 1 × 1 grouped convolutions.
The widespread use of low, slow, and small unmanned aerial vehicles presents novel challenges to the security of high-value targets. While current deep learning-based detection models have made significant strides in improving recognition accuracy, this often leads to substantial computational resource consumption, resulting in low detection efficiency in practical applications. To address this, we propose the EDU-YOLO (Efficient Detection of UAV–YOLOv5s) algorithm, which is designed to enhance the detection efficiency of small targets through two main modules: lightweight feature extraction and balanced feature fusion.

3. Materials and Methods

3.1. Lightweight Feature Extraction Module

When deploying network models to actual devices, researchers have observed significant variations in inference speeds among models with similar floating-point operations (FLOPs). This highlights that memory access times have a substantial impact on inference speed, an effect that cannot be assessed solely by FLOPs. To address this, we have employed the ShuffleNetV2 [37] to replace the backbone of YOLOv5s, incorporating the Coordinate Attention (CA) mechanism [44] to efficiently utilize limited computational resources and achieve an ideal balance between speed and accuracy.
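Because FLOPs alone do not capture memory access costs, inference speed is best judged by direct timing. The sketch below shows one common way to estimate FPS for a PyTorch model; the warm-up count, iteration count, and 640 × 640 input resolution are arbitrary illustrative choices, not the measurement protocol of this paper.

```python
# A hedged sketch of direct latency/FPS measurement on a PyTorch model; warm-up count,
# iteration count, and input resolution are arbitrary illustrative choices.
import time
import torch

def measure_fps(model: torch.nn.Module, img_size: int = 640, iters: int = 100) -> float:
    model.eval()
    dummy = torch.randn(1, 3, img_size, img_size)
    with torch.no_grad():
        for _ in range(10):                 # warm-up passes
            model(dummy)
        start = time.perf_counter()
        for _ in range(iters):
            model(dummy)
    return iters / (time.perf_counter() - start)
```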
ShuffleNetV2 incorporates comprehensive enhancements for efficient network design. First, it avoids complex branching structures by employing a Channel Split operation, which divides the feature map into two branches along the channel dimension. One branch performs an identity mapping, while the other undergoes three convolutions: two 1 × 1 standard convolutions and a central 3 × 3 depthwise convolution. The two branches are then concatenated along the channel dimension, ensuring the number of channels in the input and output feature maps remains equal. This process avoids element-wise additions, significantly reducing memory access times. Lastly, a Channel Shuffle operation ensures thorough information exchange among channels.
ShuffleNetV2 has optimized its downsampling module. During downsampling, to increase the total output channel count, the initial Channel Split operation is omitted. Instead, the information is processed separately and then concatenated, effectively doubling the output channels. Additionally, a 3 × 3 average pooling layer in one branch is replaced with a 3 × 3 depthwise separable convolution to enhance feature extraction capabilities. The structure of ShuffleNetV2 is shown in Figure 2.
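For clarity, the following PyTorch sketch implements the stride-1 ShuffleNetV2 basic unit described above; the layer names, normalization, and activation choices are illustrative assumptions rather than the authors' exact implementation.

```python
# A minimal sketch of the ShuffleNetV2 basic unit: channel split, two 1x1 convolutions
# around a 3x3 depthwise convolution, concatenation, and a channel shuffle.
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Interleave channels so information mixes across the two branches."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class ShuffleV2Block(nn.Module):
    """Stride-1 unit: one branch is an identity mapping, the other is 1x1 -> 3x3 DW -> 1x1."""
    def __init__(self, channels: int):
        super().__init__()
        branch = channels // 2
        self.branch2 = nn.Sequential(
            nn.Conv2d(branch, branch, 1, bias=False),
            nn.BatchNorm2d(branch), nn.ReLU(inplace=True),
            nn.Conv2d(branch, branch, 3, padding=1, groups=branch, bias=False),  # depthwise
            nn.BatchNorm2d(branch),
            nn.Conv2d(branch, branch, 1, bias=False),
            nn.BatchNorm2d(branch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                        # Channel Split
        out = torch.cat((x1, self.branch2(x2)), dim=1)    # concat instead of element-wise add
        return channel_shuffle(out)                       # Channel Shuffle
```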
The CA mechanism is widely recognized as a key technique for enhancing feature expression capabilities in mobile networks [45,46]. This mechanism decomposes channel attention into a pair of one-dimensional feature encodings that aggregate features along different directions, effectively capturing cross-channel information and perceiving directional and positional information to focus attention on regions of interest.
The structure of the coordinate attention mechanism includes the following steps. As illustrated in Figure 3, upon receiving the input feature tensor x, it first encodes all channels along the horizontal and vertical directions using pooling kernels of sizes (H, 1) and (1, W), respectively, to extract direction-aware feature information. These transformations enable the network to learn long-range interactions between features together with their coordinate information, thus improving target localization accuracy.
$$ z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) $$ (1)
$$ z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) $$ (2)
The aggregated feature maps are initially concatenated and then processed through a $1 \times 1$ convolutional function $F_1$:
$$ f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right) $$ (3)
Here, $[\cdot,\cdot]$ denotes the concatenation of feature information along the spatial dimension, $\delta$ represents the activation function, and $f \in \mathbb{R}^{(C/r) \times (H+W)}$ is an intermediate feature map that encodes spatial information in both horizontal and vertical directions. The parameter $r$ controls the scale of the linear transformation. Subsequently, $f$ is split along the spatial dimension into two independent tensors, $f^h \in \mathbb{R}^{(C/r) \times H}$ and $f^w \in \mathbb{R}^{(C/r) \times W}$, which are then transformed by $1 \times 1$ convolutions $F_h$ and $F_w$ into tensors with the same number of channels as the input:
$$ g^h = \sigma\left(F_h\left(f^h\right)\right) $$ (4)
$$ g^w = \sigma\left(F_w\left(f^w\right)\right) $$ (5)
The outputs $g^h$ and $g^w$ serve as attention weights and are multiplied by the original input $x$ to produce the final weighted feature map, expressed as follows:
$$ y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) $$ (6)
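The following PyTorch sketch assembles Equations (1)–(6) into a coordinate attention block; the reduction ratio r, the activation choices, and the layer names are illustrative assumptions, not the authors' exact implementation.

```python
# A compact sketch of the coordinate attention block following Equations (1)-(6).
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, r: int = 32):
        super().__init__()
        mid = max(8, channels // r)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (H, 1) pooling, Eq. (1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (1, W) pooling, Eq. (2)
        self.f1 = nn.Sequential(nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.Hardswish())
        self.f_h = nn.Conv2d(mid, channels, 1)
        self.f_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.size()
        z_h = self.pool_h(x)                           # N x C x H x 1
        z_w = self.pool_w(x).permute(0, 1, 3, 2)       # N x C x W x 1
        f = self.f1(torch.cat([z_h, z_w], dim=2))      # Eq. (3): concatenate, then 1x1 conv
        f_h, f_w = f.split([h, w], dim=2)
        g_h = torch.sigmoid(self.f_h(f_h))                      # Eq. (4)
        g_w = torch.sigmoid(self.f_w(f_w.permute(0, 1, 3, 2)))  # Eq. (5)
        return x * g_h * g_w                           # Eq. (6): re-weight the input
```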

3.2. Multi-Scale Feature Fusion Module

The Feature Pyramid Networks method is designed to enhance the semantic information of low-level features in object detection algorithms [47,48,49,50,51]. The framework of the Feature Pyramid Networks (FPN), as shown in Figure 4, consists of two branches: a bottom-up branch for generating multi-scale features and a top-down branch for transmitting rich semantic information from high-level to low-level features.
The Bidirectional Feature Pyramid Network (Bi-FPN) is a more efficient feature fusion architecture that improves upon traditional designs by eliminating single-input nodes and employing a weighted bidirectional fusion approach [52]. This method merges both top-down and bottom-up information flows concurrently. Bi-FPN incorporates a direct connection between the highest and lowest layers, enabling the fusion of additional features without significantly increasing computational costs, thus enhancing detection performance.
Bi-FPN assigns learnable weights to each input feature, achieving a more balanced fusion of features. This weighting mechanism is akin to an attention mechanism, which aids in precisely regulating the contribution of features between different layers. Additionally, to further optimize object detection capabilities and ensure comprehensive positional and semantic information, Bi-FPN employs a lighter and more efficient structure. This includes the use of depthwise separable convolutions instead of traditional convolutions and an optimized topological structure obtained through Neural Architecture Search [40,53,54]. These improvements reduce computational complexity and memory consumption, thereby increasing detection speed.
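A minimal sketch of the learnable weighted fusion applied at a Bi-FPN node is shown below; the number of inputs, the ReLU-based normalization, and the epsilon value are assumptions made for illustration.

```python
# Sketch of learnable-weight ("fast normalized") feature fusion at a Bi-FPN node.
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fuse feature maps of identical shape with learnable, non-negative weights."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, features):
        w = torch.relu(self.weights)        # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)        # normalize so the contributions sum to ~1
        return sum(wi * fi for wi, fi in zip(w, features))

# Usage: fuse a top-down feature with a same-resolution lateral feature.
fuse = WeightedFusion(num_inputs=2)
p4_td = fuse([torch.randn(1, 128, 40, 40), torch.randn(1, 128, 40, 40)])
```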
Stacking convolutional layers captures a wealth of feature information, including redundant data, which is beneficial for a network’s comprehensive understanding of the input. Therefore, we extract rich features through conventional convolution operations while handling redundant features with cost-effective linear transformations.
As shown in Figure 5, the Ghost Convolution (Ghost Conv) module leverages the similarity between feature maps to generate additional feature maps more efficiently, thereby reducing the parameter count of the model [55]. Inspired by the concept of depthwise separable convolutions, this module splits the input feature map evenly into two parts: one part is processed by standard convolutions and the other by cheap linear convolutions. The results of both parts are then concatenated to eliminate some of the feature redundancy. This approach not only effectively reduces the computational resources required by the model but also offers a simple, easy-to-implement, plug-and-play design.
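The Ghost Conv idea can be sketched in a few lines of PyTorch, as below; the 1:1 split between "primary" and "ghost" channels, the 5 × 5 depthwise kernel for the cheap operation, and the SiLU activation are illustrative assumptions.

```python
# Sketch of Ghost Convolution: half of the output channels come from an ordinary
# convolution, the other half from a cheap depthwise operation on those features.
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 1, stride: int = 1):
        super().__init__()
        primary = out_ch // 2
        self.primary_conv = nn.Sequential(
            nn.Conv2d(in_ch, primary, k, stride, k // 2, bias=False),
            nn.BatchNorm2d(primary), nn.SiLU(),
        )
        # Cheap operation: a 5x5 depthwise convolution producing the "ghost" features.
        self.cheap_conv = nn.Sequential(
            nn.Conv2d(primary, primary, 5, 1, 2, groups=primary, bias=False),
            nn.BatchNorm2d(primary), nn.SiLU(),
        )

    def forward(self, x):
        y = self.primary_conv(x)
        return torch.cat([y, self.cheap_conv(y)], dim=1)
```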

3.3. Boundary Box EIoU Loss Function

The bounding box loss function plays a critical role in target localization and recognition [56]. However, the Complete Intersection over Union (CIoU) loss function used in the YOLOv5 algorithm struggles to effectively address the severe imbalance between positive and negative samples. Additionally, it is prone to gradient vanishing issues, slowing down the convergence speed and affecting detection accuracy [4]. To mitigate these challenges, this experiment adopts the Efficient-Intersection over Union (EIoU) loss function [57], which explicitly measures differences in three geometric factors within the bounding box—overlap area, center point, and edge length—thereby resolving issues related to sample imbalance and gradient explosion.
The CIoU loss function adds a penalty term for the aspect ratio of the bounding boxes. This addition ensures consistency in aspect ratios between the predicted and target boxes. The mathematical expression for CIoU is as follows:
$$ \mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2\left(b, b^{gt}\right)}{c^2} - \alpha v $$ (7)
$$ \alpha = \frac{v}{1 - \mathrm{IoU} + v} $$ (8)
$$ v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 $$ (9)
In Equations (7)–(9), $\alpha$ represents the balancing parameter, $b$ and $b^{gt}$ denote the center points of the predicted box and the ground truth box, $\rho(\cdot)$ is the Euclidean distance, $c$ is the diagonal length of the smallest box enclosing both boxes, and $v$ measures the consistency of the width-to-height ratio between the predicted and ground truth boxes; $w^{gt}$ and $h^{gt}$ represent the width and height of the ground truth box, while $w$ and $h$ represent the width and height of the predicted box.
To address the aspect ratio limitation of CIoU, this study introduces the EIoU loss function. EIoU improves the aspect ratio loss by splitting it into separate penalties on the differences between the widths and heights of the predicted and target boxes, normalized by the dimensions of their minimum enclosing box. This method more accurately measures the shape and positional differences between the predicted box and the target box. It not only accelerates the convergence of the predicted box but also significantly enhances localization accuracy, especially when detecting small objects and objects in complex backgrounds. The EIoU loss function is formulated as follows:
$$ L_{EIoU} = L_{IoU} + L_{dis} + L_{asp} = 1 - \mathrm{IoU} + \frac{\rho^2\left(b, b^{gt}\right)}{C^2} + \frac{\rho^2\left(w, w^{gt}\right)}{C_w^2} + \frac{\rho^2\left(h, h^{gt}\right)}{C_h^2} $$ (10)
EIoU consists primarily of three components: the overlap loss $L_{IoU}$, the center distance loss $L_{dis}$, and the width-height loss $L_{asp}$. In Equation (10), $C$, $C_w$, and $C_h$ denote the diagonal length, width, and height of the smallest enclosing box that covers both the ground truth and predicted boxes, and $\rho(\cdot)$ denotes the Euclidean distance between its two arguments. The structure of EDU-YOLO is shown in Figure 6.
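For reference, the sketch below computes the EIoU loss of Equation (10) for boxes given in (x1, y1, x2, y2) form; the epsilon term added for numerical stability is an assumption not present in the formula.

```python
# Sketch of the EIoU loss for axis-aligned boxes of shape (N, 4) in (x1, y1, x2, y2) form.
import torch

def eiou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Intersection and union -> IoU term.
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Smallest enclosing box -> diagonal C, width C_w, height C_h.
    cx1 = torch.min(pred[:, 0], target[:, 0]); cy1 = torch.min(pred[:, 1], target[:, 1])
    cx2 = torch.max(pred[:, 2], target[:, 2]); cy2 = torch.max(pred[:, 3], target[:, 3])
    cw, ch = cx2 - cx1, cy2 - cy1
    c2 = cw ** 2 + ch ** 2 + eps

    # Center distance, width difference, and height difference terms.
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (pcx - tcx) ** 2 + (pcy - tcy) ** 2
    dw = (pred[:, 2] - pred[:, 0]) - (target[:, 2] - target[:, 0])
    dh = (pred[:, 3] - pred[:, 1]) - (target[:, 3] - target[:, 1])

    return 1 - iou + rho2 / c2 + dw ** 2 / (cw ** 2 + eps) + dh ** 2 / (ch ** 2 + eps)
```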

4. Results

4.1. UAV Dataset

This dataset was created by integrating images from mainstream search engines, professional image libraries, public datasets, and manually captured images. It contains 6519 manually annotated images of drones and birds, divided into 5215 training images and 1304 validation images, for model training, optimization, and evaluation of its performance and generalization capabilities. As illustrated in Figure 7, the manually captured drone images are from the DJI drone manufacturer, including six models, FlyCart 30, Matrice 300 RTK, Phantom 4 RTK, Matrice 30T, Mavic 3T, and Mavic 2, totaling 600 images. These models vary in size, with diameters ranging from 33 cm to 140 cm, ensuring the dataset encompasses a wide range of UAV dimensions, further enhancing the robustness and diversity of our training and validation data.
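The 5215/1304 partition corresponds to an 80/20 split of the 6519 images; a minimal sketch of how such a split could be generated is shown below, with the directory layout, file extension, and random seed being hypothetical choices rather than details taken from this paper.

```python
# Hedged sketch of an 80/20 train/validation split over a hypothetical image directory.
import random
from pathlib import Path

random.seed(0)
images = sorted(Path("uav_dataset/images").glob("*.jpg"))
random.shuffle(images)
split = int(0.8 * len(images))              # 6519 images -> 5215 train / 1304 val
train, val = images[:split], images[split:]
Path("train.txt").write_text("\n".join(str(p) for p in train))
Path("val.txt").write_text("\n".join(str(p) for p in val))
```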

4.2. Evaluation Indicators for Experiments

This article evaluates the model performance in multiple dimensions, including precision (P), recall (R), mean average precision (mAP), floating-point operations (FLOPs), and frames per second (FPS). The calculation formulas for P and R are as follows:
$$ P = \frac{N_t}{N_t + N_f} $$ (11)
$$ R = \frac{N_t}{N_r} $$ (12)
where $N_t$ is the number of true positive samples detected by the algorithm, $N_f$ is the number of false positive samples detected by the algorithm, and $N_r$ is the number of true small targets actually present in the image.
mAP is the mean average precision over all categories, with mAP@0.5 selected as the evaluation metric, indicating the mAP value when the intersection over union (IoU) threshold is 0.5, calculated as follows:
$$ \mathrm{mAP} = \frac{1}{C} \sum_{i=1}^{C} AP_i $$ (13)
where $AP$ represents the average precision for a single category, $C$ is the total number of target categories, and $AP_i$ is the AP value for the $i$-th category.
Frames Per Second (FPS) is a crucial metric in object detection for measuring the detection speed of the model, thus providing a clear measure of the system’s responsiveness and real-time processing capabilities. A higher FPS value indicates that the model can recognize more objects per second, enhancing the detection performance, with the formula
$$ \mathrm{FPS} = \frac{FrameCount}{ElapsedTime} $$ (14)
where FrameCount is the total number of frames, and ElapsedTime is the total time.
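The metric definitions in Equations (11)–(14) translate directly into code; the short sketch below illustrates them with made-up counts that are not results from this paper.

```python
# Illustrative implementations of Equations (11)-(14); the example numbers are invented.
def precision(n_true_pos, n_false_pos):
    return n_true_pos / (n_true_pos + n_false_pos)      # Eq. (11)

def recall(n_true_pos, n_targets):
    return n_true_pos / n_targets                        # Eq. (12)

def mean_ap(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)         # Eq. (13)

def fps(frame_count, elapsed_seconds):
    return frame_count / elapsed_seconds                 # Eq. (14)

# Example with hypothetical counts: 88 true positives, 8 false positives, 100 targets, 2 classes.
print(precision(88, 8), recall(88, 100), mean_ap([0.93, 0.92]), fps(1304, 6.9))
```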

4.3. Experimental Environment Setup

The hardware platform configuration for the experimental training phase is shown in Table 2. The experiments in this paper use the PyTorch deep learning development framework.
The platform used for this experiment runs the Ubuntu 20.04 operating system and combines an NVIDIA GeForce RTX 4070 GPU with an Intel Core i9-14900KF CPU (32 threads) to support our computational needs. We configured a Python 3.8 environment and installed the PyTorch 2.2.0 deep learning framework, along with CUDA 12.1 for GPU acceleration. To accelerate the convergence of the model, we initialized some network weight parameters using the official pre-trained weights during model training.
In the data preprocessing stage, we standardized the pixel size of input images to 640 × 640. The model is trained for 300 epochs with a Batch Size of 32. During training, we utilized the Adam optimizer with a learning rate of 0.01, momentum of 0.937, and weight decay coefficient of 0.0005, while keeping other parameters at their default values.
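A minimal sketch of the optimizer and training hyperparameters described above is given below; the placeholder module stands in for the EDU-YOLO network, and mapping the momentum value to Adam's beta1 follows a common YOLOv5 convention rather than a detail confirmed in this paper.

```python
# Hedged sketch of the training setup: Adam with lr 0.01, momentum 0.937, weight decay 0.0005.
import torch

model = torch.nn.Conv2d(3, 16, 3)  # placeholder module standing in for EDU-YOLO
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.01,
    betas=(0.937, 0.999),   # YOLOv5 commonly maps its "momentum" hyperparameter to beta1
    weight_decay=0.0005,
)

IMG_SIZE = 640      # input images resized to 640 x 640
EPOCHS = 300
BATCH_SIZE = 32
```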

4.4. Experimental Results Analysis

Based on the experimental results in Table 3, using ShuffleNetV2 as the backbone network on our custom small-target dataset yields lower GFLOPs and higher FPS than Ghost-Net, ShuffleNetV1, and MobileNetV2: the GFLOPs and FPS of ShuffleNetV2 are 2.3 and 186, respectively. This comes at the cost of a slight decrease in precision and recall, which are 91.2% and 86.3%, respectively. Compared with MobileNetV2 in particular, the GFLOPs decrease from 2.8 to 2.3 while the detection speed increases from 183 to 186 FPS. This indicates that ShuffleNetV2 offers significant advantages in efficiency and speed when handling small-target detection tasks, making it a promising and efficient choice as the backbone network.
The experimental results in Table 4 indicate that using Bi-FPN as the neck network results in outstanding performance across metrics such as mean average precision, recall rate, precision, floating-point operations, and detection speed. In tasks involving small targets like drones and birds, Bi-FPN outperforms PANet in performance. Bi-FPN enriches the model’s feature extraction information as the neck network, effectively enhancing the neural network’s capability to detect small targets, thereby improving the model’s accuracy and efficiency.
The proposed algorithm was tested through ablation experiments on a custom-built dataset, and the results validated its effective detection performance, illustrated in Figure 8 as a graphical representation of the merged datasets. This study contributes to the real-time and efficient detection of UAVs, facilitating the establishment of early warning systems and enhancing security measures for high-value targets.
The ablation study results presented in Table 3 and Table 5 demonstrate that replacing the backbone network of YOLOv5s with ShufflenetV2 significantly reduces computational complexity. Additionally, replacing some convolutional modules with Ghost Conv modules has notably decreased GFLOPs while maintaining a high detection speed. GFLOPs dropped from 16 to 2.1, with only a 3% decrease in mAP. By implementing the Bi-FPN, the network’s feature fusion capabilities were further enhanced, leading to a 1% increase in detection accuracy. The introduction of the CA attention mechanism strengthened feature extraction capabilities, thereby improving detection precision, which only saw a 1% decrease compared to the original YOLOv5s.
Transitioning from the CIoU loss function to the EIoU loss function significantly enhanced the recall rate and mAP. Overall, compared to the baseline YOLOv5s model, the modified EDU-YOLO model incurred only a minimal reduction in mAP of 1.1%, while reducing the GFLOPs from 16.0 to 2.2, increasing the FPS from 153 to 188, and decreasing the parameter size from 44.5 M to 16.0 M, thus achieving a better balance of performance. This improved configuration demonstrates an optimized trade-off between computational demand and model efficacy, providing a promising framework for real-time object detection applications. We also report the memory requirements of our model during operation, highlighting the optimization in memory consumption that supports deployment on devices with limited memory capacity. The graphical representation of the merged dataset is shown in Figure 8.

5. Discussion

The model under discussion leverages the ShuffleNetV2 architecture, which is renowned for its efficiency in managing computational resources, particularly in mobile and embedded applications. By incorporating a coordinate attention mechanism, the model further refines its ability to focus on more informative features during the extraction process. This mechanism selectively emphasizes features that are more relevant to the task, thereby improving the representational power of the network without a significant increase in computational demands.
To enhance the integration of features at different scales, the model integrates a Bi-FPN. This approach differs from traditional unidirectional pyramids by allowing for the flow of information both upward and downward across the scales. This bidirectional flow enables more effective feature-level interactions, leading to richer and more semantically meaningful feature maps. Moreover, the algorithm assigns learnable weights to each input feature at different levels of the pyramid, optimizing the contribution of each feature during fusion. This learning-based approach to weighting allows the model to dynamically adapt to the most useful features during training, leading to a more balanced and effective integration. Lastly, the model adopts the EIoU bounding box loss function to improve the precision in bounding box regression tasks. Traditional IoU-based loss functions often face challenges such as scale sensitivity and misalignment issues, which can degrade the localization accuracy. EIoU addresses these issues by incorporating additional terms that account for the central point distances between the predicted and actual bounding boxes, enhancing the alignment accuracy. The use of EIoU facilitates faster and more stable convergence of the network during training and improves the model’s ability to precisely locate objects, which is crucial for applications requiring high localization accuracy, such as autonomous driving or precision agriculture. The visualized detection results of EDU-YOLO are shown in Figure 9.
Together, these advancements in feature extraction, feature fusion, and bounding box regression contribute to a robust model that not only performs tasks efficiently but also achieves high precision in detecting and localizing objects, making it suitable for real-time applications where both speed and accuracy are critical.
We conducted a detailed and thorough analysis of the results of the three sets of experiments. First, in the ablation experiments, we observed that replacing the backbone network with ShuffleNetV2 and introducing Ghost Conv modules led to a slight decrease in detection accuracy but significantly reduced computational complexity and greatly improved detection speed, demonstrating a favorable balance between performance and speed. Second, with the incorporation of the CA attention mechanism and the Bi-FPN module, the model excelled in feature fusion and information extraction, significantly enhancing detection accuracy. Lastly, in the experiments where the loss function was changed from CIoU to EIoU, we observed an increase in recall rate and mAP, further enhancing the overall model performance. These experimental results validate the effectiveness and superiority of the EDU-YOLO model in small target detection tasks. This study contributes to the real-time and efficient detection of UAVs, facilitating early warning systems and enhancing the security measures for high-value targets.

6. Conclusions

This article introduces an efficient and accurate method for detecting UAVs, termed EDU-YOLO, which is based on the YOLOv5s architecture. The approach utilizes lightweight feature extraction modules, balanced feature fusion modules, and the EIoU loss function to enhance the detection accuracy and convergence speed of the network. EDU-YOLO employs ShuffleNetV2 and the CA mechanism to construct a lightweight backbone network and leverages a Bi-FPN and Ghost Conv to optimize feature fusion. These innovations significantly improve processing speed and model efficiency, making the model suitable for deployment on embedded platforms.
Although ShuffleNetV2 excels in reducing computational costs and model parameters, it may underperform compared to larger or deeper network models such as ResNet or DenseNet when tasked with high-complexity images or high-accuracy requirements. CA mechanisms enhance the model’s feature extraction capabilities but may also increase the computational complexity and inference time, particularly under conditions of extreme input variability or complex backgrounds. The Bi-FPN enhances the semantic information of features, improving performance; however, its complexity might slow down network inference speeds. Furthermore, tuning and training Bi-FPN parameters could demand substantial computational resources and time. Ghost Convolution optimizes the model by reducing parameters and computational load but may sacrifice accuracy when handling complex image content requiring advanced feature representations. In summary, although EDU-YOLO has significantly improved network efficiency and detection speed, its performance in UAV detection under extreme conditions (such as very low light or highly dynamic backgrounds) still requires further validation. Moreover, while the model has made progress in detecting small targets, its ability to balance detection accuracy and false positive rates in complex environments still needs optimization. Additionally, despite high prediction accuracy, the confidence levels are moderate, which also requires improvement in future research. For the definitions of abbreviations and symbols used in this document, please refer to Supplementary Table S1.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app14156398/s1, Table S1: Acronyms Glossary; Table S2: Symbols Meaning.

Author Contributions

Methodology, T.W. and Y.F.; software, Q.J. and T.W.; validation, T.W., C.Z. and S.S.; data curation, C.Z., W.Q., and S.S.; writing—original draft preparation, T.W. and Q.J.; writing—review and editing, T.W. and Y.F.; visualization, S.S., T.W. and Y.F.; supervision, C.Z. and Y.F.; project administration, Y.F.; funding acquisition, Y.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Projects of the Foundation Strengthening Program, grant number 2023-JCJQ-JJ-0604.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Misbah, M.; Khan, M.U.; Yang, Z.; Kaleem, Z. TF-NET: Deep Learning Empowered Tiny Feature Network for Night-time UAV Detection. Int. Conf. Wirel. Satell. Syst. 2023, 509, 3–18. [Google Scholar]
  2. Jian, M.; Lu, Z.; Chen, V.C. Drone Detection and Tracking Based on Phase-interferometric Doppler Radar. In Proceedings of the IEEE Radar Conference, Oklahoma, OK, USA, 23–27 April 2018; pp. 1146–1149. [Google Scholar]
  3. Gan, W.; Wu, X.; Wu, W.; Yang, X.; Ren, C.; He, X.; Liu, K. Infrared and Visible Image Fusion With the Use of Multi-scale Edge-preserving Decomposition and Guided Image Filter. Infrared Phys. Technol. 2015, 72, 37–51. [Google Scholar] [CrossRef]
  4. Al-Qubaydhi, N.; Alenezi, A.; Alanazi, T.; Senyor, A.; Alanezi, N.; Alotaibi, B.; Alotaibi, M.; Razaque, A.; Abdelhamid, A.A.; Alotaibi, A. Detection of Unauthorized Unmanned Aerial Vehicles Using YOLOv5 and Transfer Learning. Electronics 2022, 11, 2669. [Google Scholar] [CrossRef]
  5. Kim, J.; Park, C.; Ahn, J.; Ko, Y.; Park, J.; Gallagher, J.C. Real-time UAV Sound Detection and Analysis System. In Proceedings of the IEEE Sensors Applications Symposium (SAS), Glassboro, NJ, USA, 13–15 March 2017; pp. 1–5. [Google Scholar]
  6. Lv, H.; Liu, F.; Yuan, N. Drone Presence Detection by the Drone’s RF Communication. J. Phys. Conf. Ser. 2021, 1738, 012044. [Google Scholar] [CrossRef]
  7. Nguyen, P.; Ravindranatha, M.; Nguyen, A.; Han, R.; Vu, T. Investigating Cost-effective RF-based Detection of Drones. In Proceedings of the Workshop on Micro Aerial Vehicle Networks, Systems, and Applications for Civilian Use, Singapore, 26 June 2016; pp. 17–22. [Google Scholar]
  8. Samaras, S.; Diamantidou, E.; Ataloglou, D.; Sakellariou, N.; Vafeiadis, A.; Magoulianitis, V.; Lalas, A.; Dimou, A.; Zarpalas, D.; Votis, K. Deep Learning on Multi Sensor Data for Counter UAV Applications—A Systematic Review. Sensors 2019, 19, 4837. [Google Scholar] [CrossRef] [PubMed]
  9. Xie, L.; Xiang, C.; Yu, Z.; Xu, G.; Yang, Z.; Cai, D.; He, X. PI-RCNN: An Efficient Multi-sensor 3D Object Detector With Point-based Attentive Cont-conv Fusion Module. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12460–12467. [Google Scholar]
  10. Kim, Y.; Shin, J.; Kim, S.; Lee, I.J.; Choi, J.W.; Kum, D. CRN: Camera Radar Net for Accurate, Robust, Efficient 3D Perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 17615–17626. [Google Scholar]
  11. Qi, G.; Zhang, Y.; Wang, K.; Mazur, N.; Liu, Y.; Malaviya, D. Small Object Detection Method Based on Adaptive Spatial Parallel Convolution and Fast Multi-Scale Fusion. Remote Sens. 2022, 14, 420. [Google Scholar] [CrossRef]
  12. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  13. Hua, W.; Chen, Q.; Chen, W. A new lightweight network for efficient UAV object detection. Sci. Rep. 2024, 14, 13288. [Google Scholar] [CrossRef] [PubMed]
  14. Zheng, J.; Chen, R.; Yang, T.; Liu, X.; Liu, H.; Su, T.; Wan, L. An efficient strategy for accurate detection and localization of UAV swarms. IEEE Internet Things J. 2021, 8, 15372–15381. [Google Scholar] [CrossRef]
  15. Liang, X.; Zhang, J.; Zhuo, L.; Li, Y.; Tian, Q. Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1758–1770. [Google Scholar] [CrossRef]
  16. Mehta, V.; Dadboud, F.; Bolic, M.; Mantegh, I. A Deep Learning Approach for Drone Detection and Classification Using Radar and Camera Sensor Fusion. IEEE Sens. Appl. Symp. 2023, 77, 1–6. [Google Scholar]
  17. Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; Li, H. Learning Structured Sparsity in Deep Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 2074–2082. [Google Scholar]
  18. Zagoruyko, S.; Komodakis, N. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks Via Attention Transfer. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017; pp. 1320–1334. [Google Scholar]
  19. Tung, F.; Mori, G. CLIP-Q: Deep Network Compression Learning by In-parallel Pruning-quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake, UT, USA, 18–22 June 2018; pp. 7873–7882. [Google Scholar]
  20. Wang, T.; Anwer, R.M.; Cholakkal, H.; Khan, F.S.; Pang, Y.; Shao, L. Learning Rich Features at High-Speed for Single-Shot Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1971–1980. [Google Scholar]
  21. Zhang, G.; Luo, Z.; Cui, K.; Lu, S.; Xing, E.P. Meta-DETR: Image-Level Few-Shot Object Detection With Inter-Class Correlation Exploitation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12832–12843. [Google Scholar] [CrossRef] [PubMed]
  22. Kang, B.; Liu, Z.; Wang, X.; Yu, F.; Feng, J. Few-Shot Object Detection via Feature Reweighting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8419–8428. [Google Scholar]
  23. Yan, X.; Chen, Z.; Xu, A.; Wang, X.; Liang, X.; Lin, L. Meta R-CNN: Towards General Solver for Instance-Level Low-Shot Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9576–9585. [Google Scholar]
  24. Zhang, T.; Zhang, Y.; Sun, X.; Sun, H.; Yan, M. Comparison Network for One-Shot Conditional Object Detection. arXiv 2019, arXiv:1904.02317. [Google Scholar]
  25. Karlinsky, L.; Shtok, J.; Harary, S.; Schwartz, E.; Aides, A.; Feris, R.; Giryes, R.; Bronstein, A.M. RepMet: Representative-Based Metric Learning for Classification and Few-Shot Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5192–5201. [Google Scholar]
  26. Wu, A.; Han, Y.; Zhu, L.; Yang, Y. Universal-Prototype Enhancing for Few-Shot Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9547–9556. [Google Scholar]
  27. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  28. Cao, Z.; Yu, H.; Kong, L.; Zhang, D. Multi-Scene Small Object Detection with Modified YOLOv4. J. Phys. Conf. Ser. 2022, 2253, 012027. [Google Scholar]
  29. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-Captured Scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2778–2788. [Google Scholar]
  30. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  31. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 10–16 October 2016; pp. 21–37. [Google Scholar]
  32. Shamsolmoali, P.; Zareapoor, M.; Granger, E.; Chanussot, J.; Yang, J. Enhanced Single-Shot Detector for Small Object Detection in Remote Sensing Images. IEEE Int. Geosci. Remote Sens. Symp. 2022, 22, 1716–1719. [Google Scholar]
  33. Zhang, X.; Zhao, C.; Luo, H.; Zhao, W.; Zhong, S.; Tang, L.; Peng, J.; Fan, J. Automatic Learning for Object Detection. Neurocomputing 2022, 484, 260–272. [Google Scholar] [CrossRef]
  34. He, Y.; Lin, J.; Liu, Z.; Wang, H.; Li, L.; Han, S. AMC: Automl for Model Compression and Acceleration on Mobile Devices. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 784–800. [Google Scholar]
  35. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M. Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 432–445. [Google Scholar]
  36. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  37. Ma, N.; Zhang, X.; Zheng, H.; Sun, J. Shufflenet v2: Practical Guidelines for Efficient Cnn Architecture Design. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  38. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A. Mnasnet: Platform-Aware Neural Architecture Search for Mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2820–2828. [Google Scholar]
  39. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake, UT, USA, 18–22 June 2018; pp. 8697–8710. [Google Scholar]
  40. Zhang, T.; Qi, G.J.; Xiao, B.; Wang, J. Interleaved Group Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4373–4382. [Google Scholar]
  41. Xie, G.; Wang, J.; Zhang, T.; Lai, J.; Hong, R.; Qi, G.J. Interleaved Structured Sparse Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake, UT, USA, 18–22 June 2018; pp. 8847–8856. [Google Scholar]
  42. Mehta, S.; Rastegari, M.; Caspi, A.; Shapiro, L.; Hajishirzi, H. ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 561–580. [Google Scholar]
  43. Mehta, S.; Rastegari, M.; Shapiro, L.; Hajishirzi, H. ESPNetv2: A light-weight, Power Efficient, and General Purpose Convolutional Neural Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9182–9192. [Google Scholar]
  44. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13708–13717. [Google Scholar]
  45. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
  46. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  47. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  48. Deng, C.; Wang, M.; Liu, L.; Liu, Y.; Jiang, Y. Extended Feature Pyramid Network for Small Object Detection. IEEE Trans. Multimed. 2021, 24, 1968–1979. [Google Scholar] [CrossRef]
  49. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 7036–7045. [Google Scholar]
  50. Kim, S.W.; Kook, H.K.; Sun, J.Y.; Kang, M.C.; Ko, S.J. Parallel Feature Pyramid Network for Object Detection. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 234–250. [Google Scholar]
  51. Gong, Y.; Yu, X.; Ding, Y.; Peng, X.; Zhao, J.; Han, Z. Effective Fusion Factor in FPN for Tiny Object Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 1160–1168. [Google Scholar]
  52. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  53. Chollet, F. Xception: Deep Learning With Depthwise Separable Convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  54. Huang, G.; Liu, Z.; Maaten, L.V.D.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar]
  55. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More Features From Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
  56. Sairam, R.V.C.; Keswani, M.; Sinha, U.; Shah, N.; Balasubramanian, V.N. ARUBA: An Architecture-Agnostic Balanced Loss for Aerial Object Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 3719–3728. [Google Scholar]
  57. Zhang, Y.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and Efficient IOU Loss for Accurate Bounding Box Regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
Figure 1. Lightweight Optimization Strategy for the YOLOv5s Algorithm.
Figure 2. (a) The basic unit of ShuffleNetV2; (b) The spatial down sampling unit of ShuffleNetV2.
Figure 3. The structure of coordinate attention mechanism.
Figure 4. (a) The structure of the original FPN; (b) The structure of the Bi-FPN.
Figure 5. The structure of the ghost convolution.
Figure 6. The structure of EDU-YOLO.
Figure 7. Six types of aircraft images.
Figure 8. Graphical representation of the merged dataset.
Figure 9. Visualization detection results of the EDU-YOLO.
Table 1. Comparisons of different algorithms for small object detection. The compared algorithms are SSD513, DSSD513, FPN, PANet, SNIPER, CoupleNet, DetNet, and DetectoRS; each is characterized by its small object detection method (multi-scale prediction, data augmentation, improving feature resolution, use of contextual information, or a new backbone network) and by its backbone network (ResNet-101, ResNeXt-101, or DetNet-59).
Table 2. Hardware platform configuration.
| Name | Related Parameters |
| GPU | NVIDIA GeForce RTX 4070 |
| CPU | i9-14900KF/32G |
| GPU Memory | 12 GB |
| Operating System | Ubuntu 20.04 |
| Computing Platform | CUDA 12.1 |
Table 3. Experimental performance comparison of backbone networks.
| Model | mAP/% | R/% | P/% | GFLOPs | FPS |
| YOLOv5s | 93.9 | 90.2 | 93.0 | 16.0 | 153 |
| YOLOv5s-G 1 | 91.7 | 88.7 | 92.4 | 5.8 | 178 |
| YOLOv5s-SV1 | 92.2 | 88.5 | 92.5 | 3.6 | 177 |
| YOLOv5s-MV2 | 92.5 | 87.1 | 92.6 | 2.8 | 183 |
| YOLOv5s-SV2 | 91.1 | 86.3 | 91.2 | 2.3 | 186 |
1 G = Ghost-Net; SV1 = ShuffleNetV1; MV2 = MobileNetV2; SV2 = ShuffleNetV2.
Table 4. Comparison of experimental results of neck network.
| Model | mAP/% | R/% | P/% | GFLOPs | FPS |
| YOLOv5s-SV2-P 2 | 91.1 | 86.3 | 91.2 | 2.3 | 186 |
| YOLOv5s-SV2-B | 91.4 | 87.4 | 91.8 | 2.2 | 187 |
2 SV2 = ShuffleNetV2; P = PANet; B = Bi-FPN.
Table 5. Comparison of ablation experimental results of different network models.
| Model | mAP/% | R/% | P/% | Params/M | GFLOPs | FPS |
| YOLOv5s | 93.9 | 90.2 | 93.0 | 44.5 | 16.0 | 153 |
| YOLOv5s-SG 3 | 90.9 | 85.7 | 90.6 | 22 | 2.1 | 195 |
| YOLOv5s-SGB | 91.9 | 87.0 | 91.6 | 18.3 | 2.2 | 189 |
| YOLOv5s-SGBC | 92.6 | 88.2 | 92.0 | 16.3 | 2.2 | 188 |
| EDU-YOLO | 92.8 | 88.6 | 92.1 | 16.0 | 2.2 | 188 |
3 S = ShuffleNetV2; G = Ghost Convolution; B = Bi-FPN; C = coordinate attention mechanism.
