Article

G-YOLO: A Lightweight Infrared Aerial Remote Sensing Target Detection Model for UAVs Based on YOLOv8

by Xiaofeng Zhao, Wenwen Zhang *, Yuting Xia *, Hui Zhang, Chao Zheng, Junyi Ma and Zhili Zhang
Xi’an Research Institute of High-Tech, Xi’an 710025, China
* Authors to whom correspondence should be addressed.
Drones 2024, 8(9), 495; https://doi.org/10.3390/drones8090495
Submission received: 21 August 2024 / Revised: 12 September 2024 / Accepted: 12 September 2024 / Published: 18 September 2024

Abstract

A lightweight infrared target detection model, G-YOLO, based on an unmanned aerial vehicle (UAV) is proposed to address the low accuracy of target detection in UAV aerial images of complex ground scenarios and the difficulty of deploying large network models on mobile or embedded platforms. Firstly, the YOLOv8 backbone feature extraction network is improved and redesigned based on the lightweight network GhostBottleneckV2, and the remaining part of the backbone adopts the depthwise-separable convolution DWConv to replace part of the standard convolutions, which effectively retains the detection performance of the model while greatly reducing the number of parameters and calculations. Secondly, the neck structure is improved by the ODConv module, whose adaptive convolutional structure adaptively adjusts the convolutional kernel size and stride, allowing more effective feature extraction and detection for targets at different scales. At the same time, the neck structure is further optimized with the SEAttention attention mechanism to improve the model’s ability to learn global information from the input feature maps; this is applied to each channel of each feature map to enhance the useful information in specific channels and improve detection performance. Finally, the SlideLoss loss function is introduced so that the model calculates the differences between predicted and ground-truth bounding boxes during training and adjusts its parameters based on these differences, improving the accuracy and efficiency of object detection. The experimental results show that, compared with YOLOv8n, G-YOLO reduces the missed and false detection rates for infrared small targets in complex backgrounds. The number of model parameters is reduced by 74.2%, the number of floating-point operations is reduced by 54.3%, the FPS is improved by 71, which increases the detection efficiency of the model, and the mean average precision (mAP) reaches 91.4%, which verifies the validity of the model for UAV-based infrared small target detection. Furthermore, the FPS of the model reaches 556, making it suitable for wider and more complex detection tasks involving small targets, long-distance targets, and other complex scenes.

1. Introduction

Infrared (IR) target detection remains effective under diverse weather conditions, serves multiple purposes, and offers strong anti-jamming capability [1]; it therefore plays an important role in military reconnaissance [2], fire detection [3,4], security monitoring [5,6], and other fields. In the past few years, advances in UAV technology have made it practical to integrate infrared detection onto UAV platforms, an approach that has gained significant recognition in both the civil and military sectors because of its excellent adaptability, secrecy, and effectiveness [7]. Infrared target detection technology can be broadly classified into traditional methods that rely on feature extraction and methods that leverage deep learning. Traditional methods mainly extract the texture, shape, edge, and other features of the target to achieve detection, but they rely on hand-designed features and have clear limitations [8]. Deep-learning-based methods achieve higher accuracy but typically suffer from large model sizes and long detection times, which makes them challenging to deploy in embedded environments with limited performance. Striking a balance between detection accuracy and payload constraints also remains difficult because of UAV endurance, power consumption, and performance limitations [9]. Therefore, the core of research on UAV aerial infrared target detection is to optimize model design, balance high accuracy with low complexity, improve detection efficiency, and deliver excellent detection ability for small targets, remote targets, and complex environments [10,11].
Considering that UAVs typically capture aerial images at long distances, most targets in the images exhibit small-target characteristics, with sizes typically below 32 × 32 pixels [12]. In addition, the low resolution of infrared images poses a further challenge for UAV infrared target detection [13]. Deep learning has become the mainstream approach for object detection, and the commonly used algorithms can be divided into two categories: two-stage and one-stage. Two-stage algorithms must generate candidate boxes before object classification and localization, such as R-CNN [14] and Faster R-CNN [15,16,17]. One-stage algorithms directly predict class probabilities and the corresponding bounding box positions from images; these include SSD [18], YOLO [19,20,21,22,23,24], and similar approaches. Compared with two-stage algorithms, one-stage algorithms are simpler, faster, and more efficient because they eliminate candidate region generation, and they are better suited for deployment in the embedded systems of UAV platforms. However, one-stage algorithms face limitations in high-resolution and complex UAV aerial scenes, especially in the presence of complex backgrounds, dense targets, and small-scale objects. Therefore, researchers are actively exploring lightweight and efficient detection methods to meet the requirements of high efficiency, real-time operation, and constrained resources.
Yi et al. [25] proposed an improved YOLOv8-based algorithm for remote sensing small target detection, which incorporates a dual-branch attention mechanism to enhance the backbone feature extraction network and introduces an attention-guided bidirectional feature pyramid network to improve detection performance. The UAV-based infrared target detection model ITD-YOLOv8, proposed by Zhao et al. [26], builds on the lightweight network GhostHGNetV2 to enhance the YOLOv8 backbone feature extraction network; this ensures high detection accuracy while reducing model parameters and computation, thereby effectively improving multi-scale target detection in intricate environments. However, deep-learning-based object detection methods require a large amount of computing resources to achieve high accuracy, which makes it difficult to meet the demands of real-time operation and practical deployment. Therefore, this paper is devoted to an efficient model suitable for UAV aerial infrared target detection that saves resources during UAV aerial detection while providing efficient detection performance.
To improve the ability of unmanned aerial vehicles (UAVs) to detect small targets in complex environments using infrared technology, this study introduces G-YOLO, a lightweight UAV-based infrared detection model that reduces the demand for computing resources while increasing detection efficiency. The highly efficient GhostBottleneckV2 [27] network framework is employed to enhance the backbone network of YOLOv8, reducing computational resource consumption while maintaining detection performance. To improve the model’s ability to capture and extract subtle, key features, the ODConv [28] module is used to improve the neck structure. Its adaptive convolutional structure adjusts the convolution kernel size and stride adaptively, so that more effective feature extraction and detection can be performed for targets of different scales. It also allows the model to consider information from the entire image or feature map when extracting features, which helps capture the context of infrared small targets. The SEAttention [29] attention mechanism is employed to further optimize the neck structure; it strengthens detection by leveraging the comprehensive information extracted from the input feature maps. Specifically, SEAttention is applied to each channel of each feature map to enhance the informative content within specific channels and thus improve the overall efficacy of object detection. The DWConv module is used in place of the original convolution module to minimize model parameters and computational requirements, enhancing the efficiency and resource utilization of the model. Finally, the SlideLoss [30] loss function is introduced so that, during training, the model calculates the differences between predicted and ground-truth bounding boxes and adjusts its parameters according to these differences, enhancing the precision of target identification.
The primary focus of this paper encompasses the following aspects:
  • An innovative lightweight UAV-based target detection model named G-YOLO is proposed for infrared small target detection in complex environments. The model effectively improves the detection performance of UAVs while significantly reducing model complexity and parameters.
  • The backbone network of YOLOv8 is improved and redesigned based on the GhostBottleneckV2 lightweight network, while the original Conv convolution in the remaining part of the network is replaced by the DWConv module. The backbone employs a more efficient channel separation strategy and incorporates additional feature reuse techniques to optimize performance while preserving its lightweight nature. The proposed structure significantly reduces the number of parameters and the computational requirements, enhancing detection efficiency while preserving detection ability.
  • The neck structure is enhanced by the ODConv module, which dynamically adjusts the size and position of the convolution kernel based on input data features. This enhancement increases the model’s adaptability to changes in target position and size, thereby improving its detection capabilities. The SEAttention attention mechanism is employed to dynamically allocate distinct weights to each channel, thereby facilitating the network in prioritizing salient feature information, enhancing the model’s ability to capture crucial information, and improving its detection capability.
  • The SlideLoss loss function is introduced: during training, the predicted detection results are compared with the ground-truth labels to obtain an error value, which is then back-propagated to update the model parameters and enhance the model’s suitability for the given task.

2. Related Work

The YOLO family of algorithms is an advanced deep learning technology designed for target recognition [31]; it employs a single neural network for end-to-end target detection and can quickly and accurately identify information such as the location, class, and dimensions of a target in an image [32]. The YOLO algorithm is extensively employed in UAV-based infrared target detection tasks because of its high speed, good precision, and ability to process large images in real time [33]. Specifically, UAV-based infrared target detection requires real-time and efficient detection of targets on the ground, which demands algorithms with both high operating speed and high detection accuracy [34]. The YOLO algorithm offers strong computational efficiency and precision and detects targets well against complex backgrounds. The ASFF-YOLOv5 method, proposed by Qiu et al. [35], is a UAV road detection approach that leverages multi-scale feature fusion; it integrates the ASFF detection head and the SPPF spatial pyramid pooling structure to overcome the challenge of varying feature scales and improve target detection performance. To address the limited effectiveness of current target detection algorithms on UAV aerial images, Sahin et al. [36] introduced the refined YOLODrone algorithm, which improves the accuracy of target detection by unmanned aerial vehicles. A lightweight algorithm for detecting and tracking multiple objects, M3-YOLOv5, was proposed by Li et al. [37]; by substituting an enhanced MobileNetV3 for the YOLOv5 backbone and incorporating the CA attention mechanism into the neck network, they improved the detection of small and medium-sized objects. Ma et al. [38] proposed an efficient infrared small target detection network that performs detection at multiple scales and suggested integrating a target contextual feature extraction (TCVE) module at various scales to strengthen the representation of target characteristics. The VE-YOLOv6 algorithm, proposed by Wei et al. [39], is a streamlined approach for detecting small objects; by modifying the backbone based on the VGNetG architecture and incorporating the ECA attention mechanism with extended location information, they compensate for the decrease in detection accuracy while reducing the computational requirements of the model. Traffic sign detection was addressed by Du et al. [40] through a highly efficient algorithm based on an enhanced YOLOv7: the original core network was replaced by the lightweight PP-LCNet module, the traditional upsampling method was replaced by the CARAFE upsampling operator, and the SimAM attention module was integrated into the network to minimize the number of model parameters and the computational complexity. The T-YOLO model, proposed by Padilla Carrasco et al. [41], is a compact vehicle detection model that combines YOLO with multi-scale convolutional neural networks; this enhanced YOLOv5-based framework is a highly efficient deep target detection model engineered to accurately identify small vehicle targets. Wang et al. [42] presented a methodology using UAV aerial photography to detect small targets in infrared imagery.
Their backbone feature extraction network was enhanced by incorporating a lightweight module called RPConv-Block. Furthermore, the adaptability to infrared targets of different sizes was enhanced by integrating the GSConv and VoVGSCSP modules, which reduce network computation and generate richer semantic information, improving the model’s detection ability. Liu et al. [43] proposed using GhostNet as a substitute for the conventional convolution layer; they improved the last level of the backbone network by incorporating the SepViT module and combined YOLOv5’s feature extraction network with the ECA channel attention mechanism. The LMSD-YOLO model, developed by Guo et al. [44], is a single-stage model designed to accurately detect SAR ship targets; it uses the DBA module and the improved S-MobileNet and DSASFF modules to enhance feature extraction while reducing the complexity of the model. Wang et al. [45] optimized the YOLOv7-tiny model for UAV aerial image target detection: BiFPN was fused into the neck of the model to enhance multi-scale feature fusion, and GAM was integrated to improve global information perception, so that the model can accurately focus on and identify targets in complex and variable scenes, significantly improving detection performance.

3. Proposed Method

This section is organized into three parts: first, it presents the fundamental principles and overall framework of the G-YOLO model; second, it describes the HIT-UAV dataset [46] used in this study, which has also been investigated in previous work; and finally, it introduces the evaluation criteria employed to validate the proposed method.

3.1. Lightweight Network Architecture G-YOLO

This paper introduces a novel and efficient G-YOLO model for UAV-assisted lightweight infrared small target detection, as illustrated in Figure 1. The YOLOv8 backbone feature extraction network is enhanced through the integration of the lightweight GhostBottleneckV2 network, and the remaining part of the network adopts the depthwise-separable convolution DWConv in place of part of the standard convolutions, which improves multi-scale feature extraction while keeping the computational cost small. To enhance the model’s ability to extract features of dim and small infrared targets, the neck uses the ODConv module to improve the original C2f module; its adaptive convolutional structure adaptively adjusts the convolutional kernel size and stride, allowing more effective feature extraction and detection for targets at different scales. To further optimize the model and improve object detection in complex scenes, the SEAttention attention mechanism is introduced into the neck to improve the learning of global information from the input feature maps and thereby enhance the expressive ability of the network. Finally, to address sample imbalance and enhance robustness, the SlideLoss loss function is introduced into the detection head so that the model calculates the differences between predicted and ground-truth bounding boxes during training and adjusts its parameters according to these differences, improving the accuracy of object detection.
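To make the role of DWConv concrete, the following is a minimal PyTorch sketch of a depthwise-separable convolution block, assuming the standard depthwise-plus-pointwise decomposition; the channel sizes and the SiLU activation are illustrative choices, not taken from the released G-YOLO code.

```python
import torch
import torch.nn as nn

class DWConv(nn.Module):
    """Depthwise-separable convolution: a per-channel (depthwise) k x k conv
    followed by a 1 x 1 pointwise conv that mixes channels."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, s, k // 2, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, 1, 0, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Parameter count versus a standard 3 x 3 convolution with the same channels.
standard = nn.Conv2d(64, 128, 3, padding=1, bias=False)
separable = DWConv(64, 128)
print(sum(p.numel() for p in standard.parameters()))   # 73728
print(sum(p.numel() for p in separable.parameters()))  # 9024 (incl. BatchNorm)
```

The parameter comparison illustrates why substituting DWConv for standard convolution shrinks the model: the depthwise step touches each channel independently, and only the cheap pointwise step mixes channels.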

3.1.1. Improved Backbone Network Based on GhostBottleneck-V2

GhostBottleneckV2 is a lightweight convolutional neural network architecture proposed by Han et al. [27]. As shown in Figure 2, GhostBottleneckV2 achieves efficient image target detection by employing depthwise-separable convolution, residual connections, and the Ghost Module. The pivotal component of GhostBottleneckV2 is the Ghost Module, which speeds up training and improves accuracy by reducing the number of parameters and the amount of computation. The structure is composed of three primary parts: stem, body, and head. The stem combines convolutional and pooling layers to transform the input image into a feature map. The body consists of multiple GhostBottleneckV2 blocks, each containing two Ghost Modules and a residual connection. The head is a global average pooling layer followed by a softmax layer that maps the feature map to a probability distribution over categories. The model’s performance and accuracy were enhanced by replacing the traditional residual block with the GhostBottleneckV2 module. Specifically, each GhostBottleneckV2 block is built from convolutional layers together with a BN layer and an activation layer, and during training the block’s weights are updated by backpropagation to optimize model performance. By incorporating GhostBottleneckV2 into the backbone network, the model reduces parameters while enlarging its receptive field to better capture intricate target information.
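As a reference for how the Ghost Module cuts computation, here is a simplified PyTorch sketch following the original GhostNet formulation [27]: half of the output channels come from an ordinary convolution and the remaining "ghost" channels from a cheap depthwise operation. The GhostBottleneckV2 block used in G-YOLO builds on this with further refinements (e.g., an added attention branch and stride-2 variants), which are omitted here for brevity; channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost Module: generate half of the output channels with a regular conv
    and the remaining 'ghost' channels with a cheap depthwise conv, then concat."""
    def __init__(self, c_in, c_out, k=1, dw_k=3):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, 1, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(c_half, c_half, dw_k, 1, dw_k // 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class GhostBottleneck(nn.Module):
    """Stride-1 Ghost bottleneck: expansion and projection Ghost Modules
    wrapped in a residual connection."""
    def __init__(self, c, c_mid):
        super().__init__()
        self.ghost1 = GhostModule(c, c_mid)   # expand channels
        self.ghost2 = GhostModule(c_mid, c)   # project back

    def forward(self, x):
        return x + self.ghost2(self.ghost1(x))

block = GhostBottleneck(64, 128)
print(block(torch.randn(1, 64, 80, 80)).shape)   # torch.Size([1, 64, 80, 80])
```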

3.1.2. ODConv Feature Extraction Model

ODConv [28] is a new convolutional layer structure that adopts the ideas of grouped convolution and pointwise convolution. In scenarios with many input and output channels, it keeps the number of model parameters and computations under control while enhancing recognition precision, as shown in Figure 3. Specifically, the input feature map is divided into multiple groups by the ODConv layer, a convolution operation is performed within each group, and the results are then combined. In addition, the ODConv layer uses the idea of pointwise convolution to decompose the convolution kernel within each group into two smaller kernels, which further decreases the number of parameters and computations.
As shown in Figure 3, ODConv introduces a parallel strategy that incorporates a multi-dimensional attention mechanism, enhancing the flexibility of learning along the four dimensions of the convolution kernel space. ODConv can be expressed as follows:
y = (\alpha_{w1} \odot \alpha_{f1} \odot \alpha_{c1} \odot \alpha_{s1} \odot W_1 + \cdots + \alpha_{wn} \odot \alpha_{fn} \odot \alpha_{cn} \odot \alpha_{sn} \odot W_n) * x
Here, \alpha_{wi} denotes the attention scalar of the convolutional kernel W_i, while \alpha_{si} \in \mathbb{R}^{k \times k}, \alpha_{ci} \in \mathbb{R}^{c_{in}}, and \alpha_{fi} \in \mathbb{R}^{c_{out}} denote the three newly introduced attentions along the spatial dimension, the input channel dimension, and the output channel (filter) dimension, respectively. A multi-head attention module \pi_i(x) is used to compute these four attentions. Because the four kinds of attention are complementary, the convolution kernel W_i is progressively multiplied by the different attentions along the location, channel, filter, and kernel dimensions, so the convolution operation adapts to the input in every dimension and gains a stronger capacity to capture rich contextual information.
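The following is a minimal sketch of this omni-dimensional dynamic convolution, assuming the formulation above: the four attentions are predicted from a pooled descriptor of the input and modulate a bank of n candidate kernels before a single convolution is applied. It is simplified relative to the original ODConv (single attention head, no temperature annealing), and the channel counts are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConv2d(nn.Module):
    """Simplified omni-dimensional dynamic convolution: spatial, input-channel,
    output-channel, and kernel-wise attentions modulate n candidate kernels,
    which are summed and applied per sample."""
    def __init__(self, c_in, c_out, k=3, n_kernels=4, reduction=16):
        super().__init__()
        self.c_in, self.c_out, self.k, self.n = c_in, c_out, k, n_kernels
        hidden = max(c_in // reduction, 4)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(nn.Linear(c_in, hidden), nn.ReLU(inplace=True))
        self.attn_spatial = nn.Linear(hidden, k * k)      # alpha_s
        self.attn_in = nn.Linear(hidden, c_in)            # alpha_c
        self.attn_out = nn.Linear(hidden, c_out)          # alpha_f
        self.attn_kernel = nn.Linear(hidden, n_kernels)   # alpha_w
        self.weight = nn.Parameter(torch.randn(n_kernels, c_out, c_in, k, k) * 0.02)

    def forward(self, x):
        b, _, h, w = x.shape
        z = self.fc(self.pool(x).flatten(1))               # (b, hidden)
        a_s = torch.sigmoid(self.attn_spatial(z)).view(b, 1, 1, 1, self.k, self.k)
        a_c = torch.sigmoid(self.attn_in(z)).view(b, 1, 1, self.c_in, 1, 1)
        a_f = torch.sigmoid(self.attn_out(z)).view(b, 1, self.c_out, 1, 1, 1)
        a_w = torch.softmax(self.attn_kernel(z), dim=1).view(b, self.n, 1, 1, 1, 1)
        # Weighted sum over the n candidate kernels, modulated along all dimensions.
        weight = (a_w * a_f * a_c * a_s * self.weight.unsqueeze(0)).sum(dim=1)
        weight = weight.reshape(b * self.c_out, self.c_in, self.k, self.k)
        # Grouped-conv trick: one group per sample applies its own dynamic kernel.
        out = F.conv2d(x.reshape(1, b * self.c_in, h, w), weight,
                       padding=self.k // 2, groups=b)
        return out.reshape(b, self.c_out, h, w)

layer = ODConv2d(32, 64)
print(layer(torch.randn(2, 32, 40, 40)).shape)   # torch.Size([2, 64, 40, 40])
```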

3.1.3. Channel Attention Mechanism SEAttention

The SEAttention [29] module is used to enhance the neck structure of the model; it improves the representation ability of the model by weighting the channel dimension of the feature map, thereby improving object detection performance. It is composed of three components: a Squeeze operation, an Excitation operation, and a Scale operation. In the Squeeze operation, global average pooling is applied to the input feature maps to obtain a global vector, which captures comprehensive information for each channel. In the Excitation operation, two fully connected layers with a ReLU activation in between first reduce and then restore the channel dimension, and a sigmoid function then generates the per-channel weight vector. In the Scale operation, the original input feature maps are multiplied by the channel attention weights obtained in the previous step to adjust the feature values of each channel, emphasizing the information from significant channels while suppressing that from insignificant ones.
As illustrated in Figure 4, given an input X \in \mathbb{R}^{H' \times W' \times C'}, the feature map U \in \mathbb{R}^{H \times W \times C} is obtained by a convolutional operation F_{tr}. The Squeeze operation F_{sq} then aggregates U over the spatial dimensions H \times W to produce a channel descriptor z. Next, the Excitation operation is performed to obtain s, where W_1 and W_2 denote the weights of the two fully connected layers, \delta represents the rectified linear unit (ReLU), and \sigma denotes the sigmoid function.
s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \delta(W_1 z))
Finally, a Scale operation is performed to multiply feature U and feature s to obtain x. By multiplying, the attention mechanism can adjust the weights of each channel at a finer granularity, focusing more attention on the channels that are more critical to the task. This process makes the model pay more attention to the important features on the corresponding channels in the given task.
x = F_{scale}(U, s) = U \cdot s
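The Squeeze, Excitation, and Scale steps map directly onto a few lines of PyTorch; the sketch below assumes the standard SE formulation [29], with the reduction ratio chosen purely for illustration.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-Excitation: global average pooling (squeeze), two FC layers
    with ReLU then sigmoid (excitation), and per-channel rescaling (scale)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),  # W1
            nn.ReLU(inplace=True),                                   # delta
            nn.Linear(channels // reduction, channels, bias=False),  # W2
            nn.Sigmoid())                                            # sigma

    def forward(self, x):
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))             # squeeze: (b, c)
        s = self.fc(z).view(b, c, 1, 1)    # excitation: channel weights in (0, 1)
        return x * s                       # scale: reweight each channel

feat = torch.randn(1, 256, 20, 20)
print(SEAttention(256)(feat).shape)        # torch.Size([1, 256, 20, 20])
```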

3.1.4. SlideLoss Loss Function

SlideLoss [30] is a sliding-window-based loss function for target detection, mainly used to optimize the matching between detected bounding boxes and ground-truth bounding boxes. SlideLoss first predicts the position and size of the detected bounding box by classifying and regressing the central location within each sliding window. It then matches the detected bounding box with the ground-truth bounding box and calculates the IoU between them. If the IoU is larger than a preset threshold, SlideLoss sets the loss for that detected box to 0; otherwise, it sets the loss for that box to the difference between the IoU and the threshold. Finally, SlideLoss sums the losses of all detected boxes and uses the result to update the model parameters.
f(x) = \begin{cases} 1, & x \le \mu - 0.1 \\ e^{1-\mu}, & \mu - 0.1 < x < \mu \\ e^{1-x}, & x \ge \mu \end{cases}
As shown in Figure 5, to reduce the number of hyperparameters, the average of the IoU values of all bounding boxes is taken as the threshold μ. Samples with x smaller than μ are treated as negative samples and samples with x larger than μ as positive samples. Here, 0.1 is a hyperparameter that is usually set to the weight decay coefficient when positive and negative samples are unbalanced.
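Below is a small sketch of the weighting function above, with μ taken as the mean IoU of the batch; how this weight is folded into the classification loss follows the SlideLoss formulation and is not shown here.

```python
import torch

def slide_weight(iou: torch.Tensor, mu: float) -> torch.Tensor:
    """Sliding weighting function f(x) from the equation above: emphasize
    samples whose IoU lies near the positive/negative threshold mu."""
    w = torch.ones_like(iou)                 # x <= mu - 0.1  ->  weight 1
    mid = (iou > mu - 0.1) & (iou < mu)
    high = iou >= mu
    w[mid] = torch.exp(torch.tensor(1.0 - mu))   # boundary region
    w[high] = torch.exp(1.0 - iou[high])         # positives: decay with IoU
    return w

# Example: mu taken as the mean IoU of all predicted boxes in the batch.
ious = torch.tensor([0.12, 0.38, 0.45, 0.51, 0.73, 0.90])
mu = ious.mean().item()
print(slide_weight(ious, mu))
```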

3.2. Datasets

The present study uses the HIT-UAV dataset for model validation experiments. Designed specifically for UAV infrared target detection, the dataset contains a large collection of UAV infrared images covering diverse scenes and weather conditions. Its realistic backgrounds help the model adapt to real-world situations during training and improve its generalization ability. The HIT-UAV dataset therefore provides diverse, realistic, challenging, and scalable data that help models learn and adapt to infrared target detection tasks. As shown in Figure 6, the dataset comprises infrared images taken by high-altitude UAVs from various altitudes, viewing angles, and object classifications. It contains 2898 infrared images, each with a resolution of 640 × 512 pixels, and a total of 24,899 labels divided into five categories: people, cars, bicycles, other vehicles, and DontCare. To improve training efficiency and reduce computational cost, and to make the dataset cleaner and more orderly for model training and convergence, the dataset was simplified: the DontCare category was excluded and the car and other vehicle categories were merged into a single category called vehicle. Consequently, the simplified dataset consists of three categories, person, vehicle, and bicycle, and 2866 images are available for further examination.
In this study, the HIT-UAV dataset was divided into three sections with a distribution ratio of 7:2:1. The dataset consists of 2008 training images, 571 testing images, and 287 validation images. As shown in Table 1, the HIT-UAV dataset contains a total of 17,118 instances of annotations for small objects measuring less than 32 × 32 pixels, along with 7249 annotations for medium-sized targets smaller than 96 × 96 pixels. Additionally, a total of 384 labels are provided for large targets. The significance of this dataset lies in its contribution to the research and implementation of infrared target detection and recognition for UAVs, making it highly relevant for academic studies and practical applications.
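For reference, the category merging and the 7:2:1 split can be scripted along the following lines. This is a hypothetical sketch: the numeric class ids in CLASS_MAP are assumptions and must be matched to the ids actually used in the HIT-UAV annotation files, which are assumed here to be in YOLO txt format.

```python
import random
from pathlib import Path

# Hypothetical class-id mapping: original ids assumed as 0 = Person, 1 = Car,
# 2 = Bicycle, 3 = OtherVehicle, 4 = DontCare; adjust to the real annotations.
CLASS_MAP = {0: 0, 1: 1, 2: 2, 3: 1}   # Car and OtherVehicle -> single vehicle class
DROP = {4}                              # DontCare labels are removed entirely

def remap_label_file(src: Path, dst_dir: Path) -> None:
    """Rewrite one YOLO-format label file with the simplified three-class scheme."""
    kept = []
    for line in src.read_text().splitlines():
        cls, *box = line.split()
        if int(cls) in DROP:
            continue
        kept.append(" ".join([str(CLASS_MAP[int(cls)]), *box]))
    (dst_dir / src.name).write_text("\n".join(kept))

def split_dataset(images, seed=0):
    """Shuffle and split image paths into train/test/validation at 7:2:1."""
    images = list(images)
    random.Random(seed).shuffle(images)
    n = len(images)
    n_train, n_test = int(0.7 * n), int(0.2 * n)
    return (images[:n_train],
            images[n_train:n_train + n_test],
            images[n_train + n_test:])
```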

3.3. The Evaluation Criteria

The efficiency of the proposed G-YOLO model was evaluated using several criteria: precision (P), recall (R), F1 score, average precision (AP), mean average precision (mAP), the number of parameters, floating-point operations (FLOPs), and frames per second (FPS). The F1 score is the harmonic mean of precision and recall, while AP and mAP are the primary measures of recognition accuracy. Parameters are the weights and biases of the model learned during training. FLOPs denote the number of floating-point operations required, a key metric for the computational complexity of neural network models. FPS is the number of frames the model can process per second, i.e., how many images it can detect and recognize each second. The evaluation metrics are defined by the following equations.
\text{Precision} = \frac{TP}{TP + FP}
\text{Recall} = \frac{TP}{TP + FN}
F1 = \frac{2 \times P \times R}{P + R}
where P represents precision, R represents recall, TP denotes instances that are true positives, the term FN denotes false negatives, and FP signifies false positives. The formulas for computing AP (average precision) and mAP (mean average precision) are as follows:
AP = \int_0^1 P(r)\,dr
mAP = \frac{1}{c} \sum_{j=1}^{c} AP_j
In UAV aerial target detection, a model running below 50 FPS can detect static and slow-moving targets, but its poor real-time performance may prevent it from capturing complete information or details of the target, which affects accuracy. A model running above 500 FPS, in contrast, is suitable for a wider range of more complex detection tasks, such as small targets, long-distance targets, and other complex scenes.
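As a sketch of how these criteria can be computed (not the authors' evaluation code), the snippet below implements precision, recall, F1, and all-point-interpolated AP; the per-class AP values plugged in at the end are the G-YOLO results from Table 4, whose mean reproduces the reported 91.4% mAP.

```python
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP as the area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # monotonically decreasing envelope
    idx = np.where(r[1:] != r[:-1])[0]              # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is the mean of the per-class AP values (here: G-YOLO, Table 4).
ap_per_class = {"person": 0.897, "vehicle": 0.972, "bicycle": 0.874}
print(sum(ap_per_class.values()) / len(ap_per_class))   # ~0.914
```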

4. Results of the Experiments

4.1. Experimental Platform and Parameter Settings

The experiments were developed under Windows 10; the specific platform configuration is provided in Table 2.

4.2. Ablation Experiments

To assess the enhancement G-YOLO brings over YOLOv8, a set of ablation tests was performed on the HIT-UAV dataset to measure the individual impact of each module. The input image size is set to 640 × 640, the batch size to 16, and the number of training epochs to 300. The results are presented in Table 3. Enhancing the backbone network of the model with the GhostBottleneckV2 module leads to a 48.3% reduction in model parameters and a 55-frame improvement in FPS, with only a slight decrease of 2.5% in mAP. The model parameters are reduced by 1.0 M after enhancing the neck with the ODConv module, the FPS is increased by 7, the vehicle AP is increased by 0.6%, and the mAP is reduced by only 0.7%. Additionally, replacing the conventional convolutional layer in the neck with the more streamlined DWConv module reduces the number of parameters by 0.3 M and improves the vehicle detection accuracy AP by 0.2%. The inclusion of the SEAttention module in the neck reduces the parameters by 0.2 M and increases the vehicle AP by 0.3%. Finally, introducing the SlideLoss loss function increases the mAP by 0.2%, the F1 score by 0.1%, and the FPS by 4. With all of the above improvements, the G-YOLO model significantly reduces the number of parameters and the computation compared to YOLOv8n, retaining only 25.8% of the original model's parameters. Under the same settings, the FPS is improved by 71, the FLOPs are reduced by 54.3%, and the mAP reaches 91.4%. The experiments illustrate the contribution of each improvement module.
The performance changes of the G-YOLO model across the ablation experiment were converted into a visual representation, as shown in Figure 7. With each added module, the complexity of the initial model is effectively reduced and the parameter count shrinks without compromising high detection accuracy. The performance of G-YOLO was evaluated by incrementally integrating the experimental modules into the initial model. The results confirm that G-YOLO sustains a high level of detection precision while effectively decreasing the number of parameters and simplifying the model structure. In addition, the series of improvements increases the FPS from 485 to 556, making the model more suitable for small, long-range, and fast-moving targets.

4.3. Comparative Experiments

To enhance the credibility of G-YOLO's performance assessment, other models in the YOLO series were chosen as references for experimental verification on the HIT-UAV infrared dataset. The dataset was collected at flight heights of 60–130 m, resulting in diverse and complex image backgrounds as well as notable variations in target scale. Furthermore, most of the objects are small, presenting a formidable obstacle to detection. No pretrained weights are used when training the experimental models, the input image size is 640 × 640 pixels, the batch size is set to 16, and training runs for 300 epochs.
The results of the trials comparing G-YOLO with other models are shown in Table 4 and Figure 8. Compared with YOLOv8n, G-YOLO reduces the parameters by 74.2%, reduces the model complexity by 54.3%, increases the FPS by 71, and reaches an mAP of 91.4%. Compared with YOLOv5n, which has the highest detection speed among the baselines, the FPS is improved by 16. The G-YOLO model therefore significantly reduces model complexity and improves detection speed while maintaining object detection accuracy, and it enhances detection efficiency, making it highly promising for infrared small target detection on unmanned aerial vehicle (UAV) platforms operating in complex scenes. Table 4 also compares the G-YOLO model with other models. Compared with the lightweight target detection algorithms of the YOLO series, G-YOLO shows a significant reduction in parameters and FLOPs and an improvement in FPS, giving the model good detection efficiency and response speed in UAV-based infrared target detection. The number of model parameters is reduced by 93.3%, 68.0%, 86.9%, 70.4%, and 74.2% compared to YOLOv3-tiny, YOLOv5n, YOLOv7-tiny, YOLOv10n, and YOLOv8n, respectively. The model computational complexity is reduced by 80.4%, 47.9%, 72.0%, 54.9%, and 54.3% compared to YOLOv3-tiny, YOLOv5n, YOLOv7-tiny, YOLOv10n, and YOLOv8n, respectively. Its FPS is improved by 156, 16, 306, 80, and 71 compared to YOLOv3-tiny, YOLOv5n, YOLOv7-tiny, YOLOv10n, and YOLOv8n, respectively, and the mAP reaches 91.4%, which is 4.1% higher than YOLOv3-tiny and only 3.6% lower than YOLOv8n. These results show that the proposed G-YOLO model greatly reduces the parameters and complexity of the model, improves its detection efficiency, and achieves an effective lightweight design while maintaining good detection performance for infrared small targets in complex scenes.
Figure 9 shows the visual comparison of G-YOLO and other lightweight YOLO-series models in complex scenes. Vehicles are the targets enclosed in red detection boxes, bicycles in blue boxes, and people in green boxes. Blue arrows mark the missed and false detections of vehicle targets, yellow arrows mark those of bicycle targets, and orange arrows mark missed people targets.
It can be seen from the first column that most models have the problem of missed detection of occluded vehicle targets, while the G-YOLO model has a good detection effect in occluded target detection, and the missed detection of occluded vehicle target detection is reduced compared with YOLOv8n. However, as can be seen in the second column of the figure, most of the lightweight models have missed detections for bicycle target detection and YOLOv7-tiny showed false detections, which were all improved in the G-YOLO model, with a reduction in missed and false detections of targets. It can be seen from the third column figures that G-YOLO greatly improves the missed and false detections of people. In general, compared with other algorithms, G-YOLO also has a good detection effect on infrared small targets under occlusion conditions in complex environments, effectively reducing the missed detection and false detection rates.

5. Conclusions

In this paper, an innovative lightweight UAV-based target detection model named G-YOLO is proposed for infrared small target detection in complex environments. Firstly, the GhostBottleneckV2 lightweight network was utilized to enhance the backbone feature extraction network of YOLOv8, and the remaining part of the network adopts the depthwise-separable convolution DWConv in place of part of the standard convolutions, which substantially reduces the number of model parameters and computations while effectively retaining the detection performance. Secondly, the use of the DWConv module and the SEAttention attention mechanism further reduces the number of parameters and computations while improving the detection efficiency of the model, so that it can be better applied to UAV platforms for real-time monitoring. Finally, the SlideLoss loss function is introduced to calculate the differences between the predicted and ground-truth boxes during training and to adjust the model parameters according to these differences, improving the accuracy and efficiency of target detection. The experimental results demonstrate that the G-YOLO model effectively preserves its capability to detect small infrared targets while significantly reducing model complexity and parameters. Compared to YOLOv8n, the number of model parameters is reduced by 74.2%, the number of floating-point operations is reduced by 54.3%, the FPS is improved by 71, and the mean average precision (mAP) reaches 91.4%. Furthermore, G-YOLO detects infrared small targets well in complex environments, effectively reducing the missed detection and false detection rates. The model can therefore make better use of the computing performance and storage space of a UAV platform and achieve more efficient task execution than other algorithms.

Author Contributions

Conceptualization and methodology were conducted by X.Z. and W.Z.; software development was also carried out by X.Z. and W.Z.; validation was performed by W.Z., H.Z., Y.X. and C.Z.; formal analysis was completed by W.Z. and J.M.; investigation was undertaken jointly by X.Z. and W.Z.; editing was overseen by J.M.; the original draft was written by X.Z., W.Z. and H.Z.; review and editing were handled by Z.Z.; visualization was a collaborative effort between W.Z. and H.Z.; supervision was provided by X.Z.; project administration rested with Z.Z.; and funding acquisition involved both X.Z. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was financially supported by grants from the National Natural Science Foundation of China (Grant No. 41404022) and the National Foundation for Advancing Fundamental Sciences in China (Grant No. 2021-JCJQ-JJ-0871).

Data Availability Statement

The data and algorithms used in this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

There are no conflicts of interest on the part of the authors.

References

  1. Zhang, C.; Li, D.; Qi, J.; Liu, J.; Wang, Y. Infrared Small Target Detection Method with Trajectory Correction Fuze Based on Infrared Image Sensor. Sensors 2021, 21, 4522. [Google Scholar] [CrossRef] [PubMed]
  2. Cao, S.; Deng, J.; Luo, J.; Li, Z.; Hu, J.; Peng, Z. Local Convergence Index-Based Infrared Small Target Detection against Complex Scenes. Remote Sens. 2023, 15, 1464. [Google Scholar] [CrossRef]
  3. Hayat, S.; Yanmaz, E.; Brown, T.X.; Bettstetter, C. Multi-objective UAV path planning for search and rescue. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 5569–5574. [Google Scholar]
  4. Shokouhifar, M.; Hasanvand, M.; Moharamkhani, E.; Werner, F. Ensemble Heuristic—Metaheuristic Feature Fusion Learning for Heart Disease Diagnosis Using Tabular Data. Algorithms 2024, 17, 34. [Google Scholar] [CrossRef]
  5. Choutri, K.; Mohand, L.; Dala, L. Design of search and rescue system using autonomous Multi-UAVs. Intell. Decis. Technol. 2021, 14, 553–564. [Google Scholar] [CrossRef]
  6. Qiu, Z.; Bai, H.; Chen, T. Special Vehicle Detection from UAV Perspective via YOLO-GNS Based Deep Learning Network. Drones 2023, 7, 117. [Google Scholar] [CrossRef]
  7. Liu, Y.; Li, W.; Tan, L.; Huang, X.; Zhang, H.; Jiang, X. DB-YOLOv5: A UAV Object Detection Model Based on Dual Backbone Network for Security Surveillance. Electronics 2023, 12, 3296. [Google Scholar] [CrossRef]
  8. Fang, H.; Xia, M.; Zhou, G.; Chang, Y.; Yan, L. Infrared Small UAV Target Detection Based on Residual Image Prediction via Global and Local Dilated Residual Networks. IEEE Geosci. Remote Sens. Lett. 2022, 19, 7002305. [Google Scholar] [CrossRef]
  9. Qiu, X.; Chen, Y.; Cai, W.; Niu, M.; Li, J. LD-YOLOv10: A Lightweight Target Detection Algorithm for Drone Scenarios Based on YOLOv10. Electronics 2024, 13, 3269. [Google Scholar] [CrossRef]
  10. Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A Survey of Object Detection for UAVs Based on Deep Learning. Remote Sens. 2024, 16, 149. [Google Scholar] [CrossRef]
  11. Dai, J.; Wu, L.; Wang, P. Overview of UAV Target Detection Algorithms Based on Deep Learning. In Proceedings of the 2021 IEEE 2nd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), Chongqing, China, 17–19 December 2021; pp. 736–745. [Google Scholar]
  12. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision—ECCV 2014; Springer International Publishing: Cham, Switzerland, 2014. [Google Scholar]
  13. Wang, Y.; Tian, Y.; Liu, J.; Xu, Y. Multi-Stage Multi-Scale Local Feature Fusion for Infrared Small Target Detection. Remote Sens. 2023, 15, 4506. [Google Scholar] [CrossRef]
  14. Xu, Z.; Yu, M.; Chen, F.; Wu, H.; Luo, F. Surgical Tool Detection in Open Surgery Based on Faster R-CNN, YOLO v5 and YOLOv8. In Proceedings of the 2024 IEEE 7th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 15–17 March 2024; pp. 1830–1834. [Google Scholar]
  15. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  16. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 7, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  17. Liu, M.; Wang, X.; Zhou, A.; Fu, X.; Ma, Y.; Piao, C. UAV-YOLO: Small Object Detection on Unmanned Aerial Vehicle Perspective. Sensors 2020, 20, 2238. [Google Scholar] [CrossRef] [PubMed]
  18. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Springer International Publishing: Cham, Switzerland, 2016. [Google Scholar]
  19. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision & Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  20. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  21. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  22. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-Time Flying Object Detection with YOLOv8. arXiv 2023, arXiv:2305.09972. [Google Scholar]
  23. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  24. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  25. Yi, H.; Liu, B.; Zhao, B.; Liu, E. Small Object Detection Algorithm Based on Improved YOLOv8 for Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 1734–1747. [Google Scholar] [CrossRef]
  26. Zhao, X.; Zhang, W.; Zhang, H.; Zheng, C.; Ma, J.; Zhang, Z. ITD-YOLOv8: An Infrared Target Detection Model Based on YOLOv8 for Unmanned Aerial Vehicles. Drones 2024, 8, 161. [Google Scholar] [CrossRef]
  27. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features From Cheap Operations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1577–1586. [Google Scholar]
  28. Zhao, Z.; Dong, M. Channel-Spatial Dynamic Convolution: An Exquisite Omni-dimensional Dynamic Convolution. In Proceedings of the 2023 8th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 21–23 April 2023; pp. 1707–1711. [Google Scholar]
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  30. Chen, D.; Mao, F.; Song, M.; He, Y.; Wu, X.; Wang, J.; Li, W.; Yang, Y.; Xue, H. Class Regularization: Improve Few-shot Image Classification by Reducing Meta Shift. arXiv 2019, arXiv:1912.08395. [Google Scholar]
  31. Rouhi, A.; Arezoomandan, S.; Kapoor, R.; Klohoker, J.; Patal, S.; Shah, P.; Umare, H.; Han, D. An Overview of Deep Learning in UAV Perception. In Proceedings of the 2024 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 5–8 January 2024; pp. 1–6. [Google Scholar]
  32. Dwivedi, U.; Joshi, K.; Shukla, S.K.; Rajawat, A.S. An Overview of Moving Object Detection Using YOLO Deep Learning Models. In Proceedings of the 2024 2nd International Conference on Disruptive Technologies (ICDT), Greater Noida, India, 15–16 March 2024; pp. 1014–1020. [Google Scholar]
  33. Wang, K.; Zhou, H.; Wu, H.; Yuan, G. RN-YOLO: A Small Target Detection Model for Aerial Remote-Sensing Images. Electronics 2024, 13, 2383. [Google Scholar] [CrossRef]
  34. Dong, Y.; Li, Y.; Li, Z. Research on Detection and Recognition Technology of a Visible and Infrared Dim and Small Target Based on Deep Learning. Electronics 2023, 12, 1732. [Google Scholar] [CrossRef]
  35. Qiu, M.; Huang, L.; Tang, B.H. ASFF-YOLOv5: Multielement Detection Method for Road Traffic in UAV Images Based on Multiscale Feature Fusion. Remote Sens. 2022, 14, 3498. [Google Scholar] [CrossRef]
  36. Sahin, O.; Ozer, S. YOLODrone: Improved YOLO Architecture for Object Detection in Drone Images. In Proceedings of the 2021 44th International Conference on Telecommunications and Signal Processing (TSP), Brno, Czech Republic, 26–28 July 2021; pp. 361–365. [Google Scholar]
  37. Xinxin, L.; Zuojun, L.; Chaofang, H.; Changshou, X. Light-Weight Multi-Target Detection and Tracking Algorithm Based on M3-YOLOv5. In Proceedings of the 2023 42nd Chinese Control Conference (CCC), Tianjin, China, 24–26 June 2023; pp. 8159–8164. [Google Scholar]
  38. Ma, T.; Yang, Z.; Liu, B.; Sun, S. A Lightweight Infrared Small Target Detection Network Based on Target Multiscale Context. IEEE Geosci. Remote Sens. Lett. 2023, 20, 7000305. [Google Scholar] [CrossRef]
  39. Wei, J.; Qu, Y.; Gong, M.; Ma, Y.; Zhang, X. VE-YOLOv6: A Lightweight Small Target Detection Algorithm. In Proceedings of the 2024 4th International Conference on Neural Networks, Information and Communication Engineering (NNICE), Guangzhou, China, 10–12 January 2024; pp. 873–876. [Google Scholar]
  40. Du, Q.; Wu, Y.; Tian, L.; Lin, C. A Lightweight Traffic Sign Detection Algorithm based on Improved YOLOv7. In Proceedings of the 2023 4th International Conference on Intelligent Computing and Human-Computer Interaction (ICHCI), Guangzhou, China, 4–6 August 2023; pp. 428–431. [Google Scholar]
  41. Padilla Carrasco, D.; Rashwan, H.A.; García, M.Á.; Puig, D. T-YOLO: Tiny Vehicle Detection Based on YOLO and Multi-Scale Convolutional Neural Networks. IEEE Access 2023, 11, 22430–22440. [Google Scholar] [CrossRef]
  42. Wang, S.; Jiang, H.; Li, Z.; Yang, J.; Ma, X.; Chen, J.; Tang, X. PHSI-RTDETR: A Lightweight Infrared Small Target Detection Algorithm Based on UAV Aerial Photography. Drones 2024, 8, 240. [Google Scholar] [CrossRef]
  43. Xu, L. Improved YOLOv5 for Aerial Images Object Detection with the Introduction of Attention Mechanism. In Proceedings of the 2023 2nd International Conference on Data Analytics, Computing and Artificial Intelligence (ICDACAI), Zakopane, Poland, 17–19 October 2023; pp. 817–824. [Google Scholar]
  44. Guo, Y.; Chen, S.; Zhan, R.; Wang, W.; Zhang, J. LMSD-YOLO: A Lightweight YOLO Algorithm for Multi-Scale SAR Ship Detection. Remote Sens. 2022, 14, 4801. [Google Scholar] [CrossRef]
  45. Wang, Z.; Liu, Z.; Xu, G.; Cheng, S. Object Detection in UAV Aerial Images Based on Improved YOLOv7-tiny. In Proceedings of the 2023 4th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 12–14 May 2023; pp. 370–374. [Google Scholar]
  46. Suo, J.; Wang, T.; Zhang, X.; Chen, H.; Zhou, W.; Shi, W. HIT-UAV: A High-altitude Infrared Thermal Dataset for Unmanned Aerial Vehicles. arXiv 2022, arXiv:2204.0324. [Google Scholar]
Figure 1. G-YOLO network structure diagram, where backbone refers to the backbone feature extraction network, neck refers to the feature processing stage, head refers to the object detection and classification, and SPPF refers to the fast spatial pyramid pooling layer.
Figure 2. GhostBottleneckV2 structure diagram.
Figure 3. ODConv module structure. Here, αwi denotes the attention scalar of the convolution kernel Wi. (a) For the convolution kernel Wi, αsi assigns different attention values to the convolution parameters at the k × k spatial locations. (b) αfi assigns different attention values to the convolutional filters of different output channels. (c) αwi then assigns different attention values to the n overall convolution kernels. (d) Kernel-wise multiplication along the kernel dimension of the convolutional kernel space.
Figure 4. SEAttention module structure.
Figure 5. Renderings of Equation 4, where μ represents the threshold parameters for positive and negative samples.
Figure 6. HIT-UAV dataset.
Figure 7. Comparison of visual results from ablation experiments. On the abscissa, the numbers mean: 1: YOLOv8; 2: YOLOv8 + Gb; 3: YOLOv8 + ODConv; 4: YOLOv8 + SEA; 5: YOLOv8 + DWConv; 6: YOLOv8 + Gb + ODConv; 7: YOLOv8 + Gb + ODConv + SEA; 8: YOLOv8 + Gb + ODConv + SEA + DWConv; 9: YOLOv8 + Gb + ODConv + SEA + DWConv + SlideLoss.
Figure 8. Comparison of visual effects of the contrast experiment.
Figure 9. The comparison of visualization outcomes generated by various algorithms, where red boxes represent cars, blue boxes represent bicycles, and green boxes represent people. Blue arrows represent the missed and false detections of vehicle targets, yellow arrows represent the missed and false detections of bicycle targets, and orange arrows represent missed people targets.
Table 1. Labeled object counts for small, medium, and large targets in the HIT-UAV dataset. Objects smaller than 32 × 32 pixels are small targets, those in (32 × 32, 96 × 96) are medium targets, and those in (96 × 96, 640 × 512) are large targets.

               | Small (0, 32 × 32) | Medium (32 × 32, 96 × 96) | Large (96 × 96, 640 × 512)
HIT-UAV        | 17,118             | 7249                      | 384
Train set      | 12,045             | 5205                      | 268
Test set       | 3331               | 1379                      | 70
Validation set | 1742               | 665                       | 46
Table 2. Experimental platform configuration.

Name                    | Configuration
GPU                     | NVIDIA Quadro P6000
CPU                     | Intel(R) Core(TM) i9-9900K
GPU memory              | 32 G
Operating system        | Windows 10
Computing platform      | CUDA 10.2
Deep learning framework | PyTorch
Table 3. Ablation experiments for each module in the G-YOLO model. Gb = GhostBottleneckV2, SEA = SEAttention; the configurations follow the numbering used in Figure 7.

Configuration                                        | Parameters/M | FLOPs/G | F1 (%) | AP_Vehicle (%) | mAP50 (%) | FPS (bt = 16)
1: YOLOv8n (baseline)                                | 3.1          | 8.1     | 91.4   | 98.4           | 94.9      | 485
2: YOLOv8n + Gb                                      | 1.9          | 5.6     | 90.1   | 98.2           | 92.5      | 540
3: YOLOv8n + ODConv                                  | 2.1          | 6.7     | 90.6   | 98.8           | 94.2      | 492
4: YOLOv8n + SEA                                     | 2.9          | 7.9     | 91.2   | 98.7           | 94.1      | 476
5: YOLOv8n + DWConv                                  | 2.8          | 7.8     | 91.1   | 98.6           | 93.9      | 483
6: YOLOv8n + Gb + ODConv                             | 1.1          | 4.3     | 88.6   | 97.7           | 92.1      | 552
7: YOLOv8n + Gb + ODConv + SEA                       | 1.0          | 3.9     | 87.2   | 97.1           | 91.1      | 539
8: YOLOv8n + Gb + ODConv + SEA + DWConv              | 0.8          | 3.7     | 87.0   | 97.2           | 91.2      | 552
9: YOLOv8n + Gb + ODConv + SEA + DWConv + SlideLoss  | 0.8          | 3.7     | 87.1   | 97.2           | 91.4      | 556
Table 4. Comparison experiments between the G-YOLO model and other models.

Model       | Size | Parameters | F1 (%) | AP_Person (%) | AP_Vehicle (%) | AP_Bicycle (%) | mAP50 (%) | FLOPs/G | FPS (bt = 16)
YOLOv3      | 640  | 103 M      | 91.8   | 92.3          | 98.7           | 92.6           | 94.5      | 282.2   | 45
YOLOv5s     | 640  | 9.1 M      | 92.1   | 93.9          | 98.7           | 93.7           | 95.4      | 23.8    | 251
YOLOv7      | 640  | 37.2 M     | 86.1   | 87.6          | 96.3           | 87.7           | 90.5      | 105.1   | 76
YOLOv8s     | 640  | 11.1 M     | 91.1   | 92.6          | 98.7           | 92.8           | 94.7      | 28.4    | 222
YOLOv9c     | 640  | 25.3 M     | 91.4   | 92.3          | 98.7           | 93.5           | 94.9      | 238.9   | 43
YOLOv3-tiny | 640  | 12.1 M     | 84.0   | 82.3          | 97.8           | 81.5           | 87.2      | 18.9    | 400
YOLOv5n     | 640  | 2.5 M      | 90.7   | 91.9          | 98.6           | 92.1           | 94.2      | 7.1     | 540
YOLOv7-tiny | 640  | 6.1 M      | 89.2   | 91.3          | 97.2           | 89.9           | 92.8      | 13.2    | 250
YOLOv10n    | 640  | 2.7 M      | 89.1   | 91.0          | 98.1           | 91.2           | 93.4      | 8.2     | 476
ITD-YOLOv8  | 640  | 1.8 M      | 90.3   | 91.7          | 98.2           | 90.7           | 93.5      | 6.0     | 328
YOLOv8n     | 640  | 3.1 M      | 91.4   | 93.2          | 98.4           | 92.9           | 94.9      | 8.1     | 485
G-YOLO      | 640  | 0.8 M      | 87.1   | 89.7          | 97.2           | 87.4           | 91.4      | 3.7     | 556